Paper Title
Task Phasing: Automated Curriculum Learning from Demonstrations
Paper Authors
Paper Abstract
Applying reinforcement learning (RL) to sparse reward domains is notoriously challenging due to insufficient guiding signals. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations, we define a simple initial task. Our task phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retuning the RL agent in each phasing iteration. Two approaches for phasing are considered: (1) gradually increasing the proportion of time steps an RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on three sparse reward domains demonstrate that our task phasing approaches outperform state-of-the-art approaches with respect to asymptotic performance.
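To make the two phasing mechanisms in the abstract concrete, here is a minimal sketch of one rollout at a given curriculum phase. This is our illustration, not the authors' code: the gym-style `env`, the `demo_policy`, `rl_policy`, and `guide_reward` callables, and the single annealing parameter `beta` are all hypothetical simplifications, and the paper treats the two mechanisms as separate approaches rather than combining them in one function as done here for brevity.

```python
import random

def phased_rollout(env, demo_policy, rl_policy, guide_reward, beta, horizon=1000):
    """Collect one episode at curriculum phase beta in [0, 1].

    beta = 0 approximates the easy initial task (demonstrator in control,
    full shaping reward); beta = 1 recovers the target sparse-reward task.
    """
    obs = env.reset()
    transitions = []
    for _ in range(horizon):
        # Approach (1): the RL agent controls each step with probability beta,
        # so its share of control grows as the curriculum advances.
        if random.random() < beta:
            action = rl_policy(obs)
        else:
            action = demo_policy(obs)
        next_obs, sparse_r, done, _ = env.step(action)
        # Approach (2): an informative shaping reward is annealed away,
        # leaving only the true sparse reward once beta reaches 1.
        reward = sparse_r + (1.0 - beta) * guide_reward(obs, action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions
```

An outer curriculum loop would retune the RL agent on such rollouts at each phase and advance `beta` toward 1 only once performance at the current phase stabilizes, mirroring the per-iteration retuning the abstract describes.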