Paper Title
Skill-based Meta-Reinforcement Learning
Paper Authors
Paper Abstract
While deep reinforcement learning methods have shown impressive results in robot learning, their sample inefficiency makes learning complex, long-horizon behaviors on real robot systems infeasible. To mitigate this issue, meta-reinforcement learning methods aim to enable fast learning on novel tasks by learning how to learn. Yet, their application has been limited to short-horizon tasks with dense rewards. To enable learning of long-horizon behaviors, recent works have explored leveraging prior experience in the form of offline datasets without reward or task annotations. While these approaches yield improved sample efficiency, millions of interactions with environments are still required to solve complex tasks. In this work, we devise a method that enables meta-learning on long-horizon, sparse-reward tasks, allowing us to solve unseen target tasks with orders of magnitude fewer environment interactions. Our core idea is to leverage prior experience extracted from offline datasets during meta-learning. Specifically, we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task. Experimental results on continuous control tasks in navigation and manipulation demonstrate that the proposed method can efficiently solve long-horizon novel target tasks by combining the strengths of meta-learning and the use of offline datasets, while prior approaches in RL, meta-RL, and multi-task RL require substantially more environment interactions to solve the tasks.
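
To make the three-step recipe in the abstract concrete, below is a minimal structural sketch in PyTorch with synthetic data. It is not the authors' implementation: the class names (SkillVAE, SkillPrior, HighLevelPolicy), dimensions, loss weights, and training loop are illustrative assumptions. Phase (1), extracting skills and a skill prior from an offline dataset, is shown as runnable code; phases (2) and (3), which additionally require an environment and an RL algorithm, are only outlined in comments.

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, SKILL_DIM, HORIZON = 10, 4, 8, 10

class SkillVAE(nn.Module):
    """Encodes an H-step action sequence into a skill latent z and decodes it back (assumed architecture)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(HORIZON * ACTION_DIM, 128), nn.ReLU(), nn.Linear(128, 2 * SKILL_DIM))
        self.decoder = nn.Sequential(
            nn.Linear(SKILL_DIM + STATE_DIM, 128), nn.ReLU(), nn.Linear(128, HORIZON * ACTION_DIM))

    def forward(self, action_seq, state):
        mu, log_std = self.encoder(action_seq.flatten(1)).chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)        # reparameterized skill latent
        recon = self.decoder(torch.cat([z, state], dim=-1))  # reconstruct the action sequence
        return recon, mu, log_std

class SkillPrior(nn.Module):
    """Predicts a distribution over skill latents from the current state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, 2 * SKILL_DIM))

    def forward(self, state):
        return self.net(state).chunk(2, dim=-1)              # (mu, log_std) of the prior

class HighLevelPolicy(nn.Module):
    """High-level policy that outputs skill latents, conditioned on a task embedding (illustrative)."""
    def __init__(self, task_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + task_dim, 128), nn.ReLU(), nn.Linear(128, SKILL_DIM))

    def forward(self, state, task_embedding):
        return self.net(torch.cat([state, task_embedding], dim=-1))

# --- Phase (1): extract reusable skills and a skill prior from a (synthetic) offline dataset ---
vae, prior = SkillVAE(), SkillPrior()
opt = torch.optim.Adam(list(vae.parameters()) + list(prior.parameters()), lr=1e-3)
states = torch.randn(256, STATE_DIM)                         # stand-in for offline trajectory data
action_seqs = torch.randn(256, HORIZON, ACTION_DIM)
for _ in range(10):
    recon, mu, log_std = vae(action_seqs, states)
    recon_loss = ((recon - action_seqs.flatten(1)) ** 2).mean()
    kl = (-0.5 * (1 + 2 * log_std - mu ** 2 - (2 * log_std).exp())).mean()
    # Fit the state-conditioned prior to the encoder's skill distribution.
    p_mu, p_log_std = prior(states)
    prior_loss = ((p_mu - mu.detach()) ** 2 + (p_log_std - log_std.detach()) ** 2).mean()
    loss = recon_loss + 1e-2 * kl + prior_loss
    opt.zero_grad(); loss.backward(); opt.step()

# --- Phases (2) and (3), outlined: the high-level policy would be meta-trained with RL across
# training tasks, regularized toward the learned skill prior; at test time only this policy
# (and its task embedding) is adapted to the unseen, sparse-reward target task. ---
policy = HighLevelPolicy()
z = policy(states[:1], torch.zeros(1, 4))                    # choose a skill for the current state
low_level_actions = vae.decoder(torch.cat([z, states[:1]], dim=-1)).view(1, HORIZON, ACTION_DIM)
print(low_level_actions.shape)                               # the decoded H-step action sequence to execute

The sketch only illustrates how the pieces fit together: frozen low-level skills and a skill prior learned offline, and a small high-level policy that is the only component meta-trained and adapted, which is what makes adaptation to a new sparse-reward task sample-efficient in the abstract's framing.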