Paper Title

A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning

Paper Authors

Archit Sharma, Rehaan Ahmad, Chelsea Finn

Paper Abstract

While reinforcement learning (RL) provides a framework for learning through trial and error, translating RL algorithms into the real world has remained challenging. A major hurdle to real-world application arises from the development of algorithms in an episodic setting where the environment is reset after every trial, in contrast with the continual and non-episodic nature of the real world encountered by embodied agents such as humans and robots. Prior works have considered an alternating approach where a forward policy learns to solve the task and a backward policy learns to reset the environment, but what initial state distribution should the backward policy reset the agent to? Assuming access to a few demonstrations, we propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. This keeps the agent close to the task-relevant states, allowing for a mix of easy and difficult starting states for the forward policy. Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks from the EARL benchmark, with 40% gains on the hardest task, while making fewer assumptions than prior works.
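To make the alternating scheme in the abstract concrete, below is a minimal illustrative sketch (not the authors' code) of a non-episodic training loop: a forward policy collects the task reward, while a backward policy is rewarded for visiting states that a GAN-style classifier cannot distinguish from demonstration states, which is one common way to realize state-distribution matching. The `env`, `forward_agent`, and `backward_agent` interfaces, the reward form, and all helper names are assumptions made for illustration; the exact objective and update rules should be taken from the paper.

```python
# Illustrative sketch of a MEDAL-style non-episodic loop (assumptions labeled below).
# Assumed interfaces: an old-gym-style `env` (reset() -> obs, step() -> 4-tuple) that is
# only reset once, SAC-like agents exposing .act() and .update(), and demonstration
# states `demo_states` as an (N, state_dim) numpy array.
import numpy as np
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Classifies states as 'demonstration' (label 1) vs 'backward policy' (label 0)."""

    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s)  # logits


def backward_reward(disc, state):
    # Reward the backward policy for reaching states the classifier cannot tell
    # apart from demonstration states, e.g. r(s) = -log(1 - D(s)).
    with torch.no_grad():
        p = torch.sigmoid(disc(torch.as_tensor(state, dtype=torch.float32)))
    return float(-torch.log(1.0 - p + 1e-6))


def update_discriminator(disc, opt, demo_states, backward_states, batch=256):
    # Standard binary cross-entropy update: demo states vs. backward-policy states.
    bce = nn.BCEWithLogitsLoss()
    demo = torch.as_tensor(
        demo_states[np.random.randint(len(demo_states), size=batch)], dtype=torch.float32)
    back = torch.as_tensor(
        backward_states[np.random.randint(len(backward_states), size=batch)], dtype=torch.float32)
    logits = disc(torch.cat([demo, back]))
    labels = torch.cat([torch.ones(batch, 1), torch.zeros(batch, 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()


def train(env, forward_agent, backward_agent, disc, disc_opt, demo_states,
          total_steps=1_000_000, switch_every=2_000):
    """Alternate forward (task) and backward (reset) phases without resetting the env."""
    state = env.reset()          # single reset at the very start of training
    backward_buffer = []
    running_forward = True
    for step in range(total_steps):
        agent = forward_agent if running_forward else backward_agent
        action = agent.act(state)
        next_state, task_reward, _, _ = env.step(action)
        # Forward policy optimizes the task reward; backward policy optimizes the
        # distribution-matching reward supplied by the discriminator.
        reward = task_reward if running_forward else backward_reward(disc, next_state)
        agent.update(state, action, reward, next_state)
        if not running_forward:
            backward_buffer.append(next_state)
            if len(backward_buffer) >= 256:
                update_discriminator(disc, disc_opt, demo_states, np.asarray(backward_buffer))
        state = next_state
        if (step + 1) % switch_every == 0:
            running_forward = not running_forward  # hand control to the other policy
```

As a usage note, the discriminator optimizer could be constructed as, e.g., `torch.optim.Adam(disc.parameters(), lr=3e-4)`; the switching rule, buffer handling, and reward shaping here are simplifications of whatever the paper actually prescribes.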
