Paper Title


Munchausen Reinforcement Learning

Paper Authors

Nino Vieillard, Olivier Pietquin, Matthieu Geist

Paper Abstract


Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transiting state by their current estimate of this value. Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution stands in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games, without making use of distributional RL, n-step returns or prioritized replay. To demonstrate the versatility of this idea, we also use it together with an Implicit Quantile Network (IQN). The resulting agent outperforms Rainbow on Atari, installing a new State of the Art with very little modifications to the original algorithm. To add to this empirical study, we provide strong theoretical insights on what happens under the hood -- implicit Kullback-Leibler regularization and increase of the action-gap.
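To make the core idea concrete, below is a minimal sketch (not the authors' released code) of how a DQN-style regression target could be modified along the lines the abstract describes: the scaled (and clipped) log of the softmax policy for the taken action is added to the immediate reward, and the bootstrap uses a soft value under that same policy. The hyperparameter names and values (tau, alpha, l0) and the clipping range are illustrative assumptions.

```python
# Minimal sketch of a Munchausen-style DQN target, assuming a softmax policy
# derived from the target network's Q-values. Not the authors' implementation.
import numpy as np

def softmax_log_policy(q, tau):
    """Log of the softmax policy pi = softmax(q / tau), computed stably."""
    z = q / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def munchausen_dqn_target(q_target_s, q_target_s_next, a, r, done,
                          gamma=0.99, tau=0.03, alpha=0.9, l0=-1.0):
    """Regression target for Q(s_t, a_t).

    q_target_s      : (batch, n_actions) target-network values at s_t
    q_target_s_next : (batch, n_actions) target-network values at s_{t+1}
    a               : (batch,) actions taken
    r               : (batch,) immediate rewards
    done            : (batch,) episode-termination flags (0.0 or 1.0)
    """
    batch = np.arange(len(a))

    # Munchausen bonus: scaled, clipped log-policy of the action actually taken,
    # added to the immediate reward.
    log_pi_s = softmax_log_policy(q_target_s, tau)
    munchausen_bonus = alpha * np.clip(tau * log_pi_s[batch, a], l0, 0.0)

    # Soft bootstrap: expectation under pi of (q - tau * log pi) at s_{t+1}.
    log_pi_next = softmax_log_policy(q_target_s_next, tau)
    pi_next = np.exp(log_pi_next)
    soft_value = (pi_next * (q_target_s_next - tau * log_pi_next)).sum(axis=-1)

    return r + munchausen_bonus + gamma * (1.0 - done) * soft_value
```

In this sketch, setting alpha to zero recovers a standard soft (entropy-regularized) DQN target, which is one way to see that the modification amounts to a single extra reward term.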
