Paper Title

Reward-Respecting Subtasks for Model-Based Reinforcement Learning

Paper Authors

Sutton, Richard S., Machado, Marlos C., Holland, G. Zacharias, Szepesvari, David, Timbers, Finbarr, Tanner, Brian, White, Adam

Abstract

To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward-respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
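
To make the abstract's description concrete, one plausible reading (a sketch; the symbols G, b, and x below are illustrative notation, not taken from the paper) is that an option launched at time t and stopping at time T is evaluated by the discounted rewards of the original problem accrued until it stops, plus a bonus that depends on a feature of the stopping state:

\[
G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{\,T-t-1} R_T + \gamma^{\,T-t}\, b\big(x(S_T)\big),
\]

where R_{k+1} denotes the reward of the original problem, \gamma its discount factor, x(S_T) a feature of the state at termination, and b(\cdot) the stopping bonus. This contrasts with subtasks such as eigenoptions or bottleneck-based shortest paths, whose objectives omit the original reward entirely.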
