Paper Title

Critic PI2: Master Continuous Planning via Policy Improvement with Path Integrals and Deep Actor-Critic Reinforcement Learning

Paper Authors

Jiajun Fan, He Ba, Xian Guo, Jianye Hao

Paper Abstract

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods, from AlphaGo to MuZero, have enjoyed huge success in discrete domains such as chess and Go. Unfortunately, in real-world applications like robot control and the inverted pendulum, where the action space is normally continuous, these tree-based planning techniques struggle. To address these limitations, in this paper we present a novel model-based reinforcement learning framework called Critic PI2, which combines the benefits of trajectory optimization, deep actor-critic learning, and model-based reinforcement learning. Our method is evaluated on inverted pendulum models and is applicable to many continuous control systems. Extensive experiments demonstrate that Critic PI2 achieves a new state of the art in a range of challenging continuous domains. Furthermore, we show that planning with a critic significantly improves sample efficiency and real-time performance. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.
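
The abstract describes planning with a critic on top of PI2-style (Policy Improvement with Path Integrals) trajectory optimization in a learned model. The snippet below is a minimal sketch of that general idea, assuming hypothetical `actor`, `critic`, `dynamics_model`, and `reward_fn` callables; it illustrates a PI2-style planner with a critic bootstrap at the planning horizon, not the authors' actual implementation.

```python
# Hypothetical sketch: PI2-style planning in a learned model with a critic bootstrap.
# `actor`, `critic`, `dynamics_model`, and `reward_fn` are assumed learned
# components (illustrative names, not from the paper's released code).
import numpy as np

def critic_pi2_plan(state, actor, dynamics_model, reward_fn, critic,
                    horizon=10, num_samples=64, noise_std=0.1, temperature=1.0):
    """Sample noisy action sequences around the actor's proposal, roll them out
    in the learned model, score each rollout with model rewards plus a critic
    value at the horizon, and return the exponentially weighted first action."""
    # Nominal plan: follow the actor through the learned dynamics model.
    nominal, s = [], state
    for _ in range(horizon):
        a = actor(s)
        nominal.append(a)
        s = dynamics_model(s, a)
    nominal = np.stack(nominal)                        # (horizon, action_dim)
    action_dim = nominal.shape[-1]

    # Perturb the nominal plan with Gaussian exploration noise.
    noise = noise_std * np.random.randn(num_samples, horizon, action_dim)
    candidates = nominal[None] + noise                 # (num_samples, horizon, action_dim)

    # Evaluate each candidate: accumulated model reward + critic bootstrap.
    returns = np.zeros(num_samples)
    for k in range(num_samples):
        s = state
        for t in range(horizon):
            a = candidates[k, t]
            returns[k] += reward_fn(s, a)
            s = dynamics_model(s, a)
        returns[k] += critic(s)                        # value beyond the horizon

    # PI2-style update: softmax-weighted average of the sampled first actions.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    return (weights[:, None] * candidates[:, 0]).sum(axis=0)
```

In this sketch, the critic shortens the rollout horizon needed to rank candidate trajectories, which is one way to read the abstract's claim that planning with a critic improves sample efficiency and real-time performance.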
