Paper Title

Meta-Gradient Reinforcement Learning with an Objective Discovered Online

Authors

Zhongwen Xu, Hado van Hasselt, Matteo Hessel, Junhyuk Oh, Satinder Singh, David Silver

Abstract

Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network, solely from interactive experience with its environment. Over time, this allows the agent to learn how to learn increasingly effectively. Furthermore, because the objective is discovered online, it can adapt to changes over time. We demonstrate that the algorithm discovers how to address several important issues in RL, such as bootstrapping, non-stationarity, and off-policy learning. On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency, eventually outperforming the median score of a strong actor-critic baseline.
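The abstract describes a two-level learning process: an inner update applies a learned, meta-parameterised objective to the agent's parameters, and an outer (meta-gradient) update differentiates through that inner step to improve the objective itself. The following is only a minimal sketch of that mechanism in JAX on a contrived one-state value-prediction problem; every name and constant here (predict, inner_loss, meta_loss, the mixing of a bootstrapped and a Monte-Carlo target, the step sizes) is a hypothetical stand-in and not the paper's actual objective network or agent architecture.

```python
import jax
import jax.numpy as jnp

def predict(theta):
    # Toy "agent": a single scalar value prediction for a one-state task.
    return theta

def inner_loss(theta, eta, bootstrap_target, mc_return):
    # Learned objective: meta-parameters eta mix two candidate targets
    # (illustrative stand-in for a neural-network-parameterised objective).
    target = eta[0] * bootstrap_target + eta[1] * mc_return
    return (predict(theta) - target) ** 2

def inner_update(theta, eta, bootstrap_target, mc_return, lr=0.1):
    # Inner loop: one gradient step of the agent parameters on the learned objective.
    g = jax.grad(inner_loss)(theta, eta, bootstrap_target, mc_return)
    return theta - lr * g

def meta_loss(eta, theta, bootstrap_target, mc_return, true_value):
    # Outer objective: performance of the *updated* agent on a fixed criterion.
    new_theta = inner_update(theta, eta, bootstrap_target, mc_return)
    return (predict(new_theta) - true_value) ** 2

# Meta-gradient: differentiate the outer objective through the inner update.
meta_grad = jax.grad(meta_loss)

theta = 0.0
eta = jnp.array([0.5, 0.5])
for _ in range(200):
    # Stand-in "experience", for illustration only.
    bootstrap_target, mc_return, true_value = 0.9, 1.2, 1.0
    eta = eta - 0.05 * meta_grad(eta, theta, bootstrap_target, mc_return, true_value)
    theta = inner_update(theta, eta, bootstrap_target, mc_return)

print("theta:", theta, "eta:", eta)
```

In the paper the objective is a deep neural network whose meta-parameters are adapted online from interactive experience; the sketch keeps only the differentiate-through-the-inner-update structure that defines a meta-gradient.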
