Paper Title
Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning
Paper Authors
Paper Abstract
Distributional reinforcement learning (RL) aims to learn a value network that predicts the full distribution of the returns for a given state, often modeled via a quantile-based critic. This approach has been successfully integrated into common RL methods for continuous control, giving rise to algorithms such as Distributional Soft Actor-Critic (DSAC). In this paper, we introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation, as commonly employed in current practice. The improved distributional estimates further lend themselves to UCB-based exploration. These two ideas are combined to yield our distributional RL algorithm, E2DC (Extra Exploration with Distributional Critics). We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control. We provide further insight into the method via visualization and analysis of the learned distributions and their evolution during training.
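The abstract names two ingredients: multi-sample target values for a quantile-based critic, and a UCB-style exploration signal derived from the learned return distribution. The sketch below is a minimal, hypothetical illustration of those two generic ideas, not the paper's E2DC implementation; the function names (`quantile_targets`, `policy`), the averaging over sampled next actions, and the mean-plus-standard-deviation bonus are all illustrative assumptions.

```python
import numpy as np

def multi_sample_target(reward, discount, next_state, policy, quantile_targets, n_samples=4):
    """Illustrative multi-sample target: average the target critic's quantile
    estimates over several actions sampled from the policy at the next state,
    instead of using a single sampled action (the usual single-sample target).

    `policy(next_state)` is assumed to return a sampled action and
    `quantile_targets(state, action)` a 1-D array of quantile estimates;
    both are placeholder names for this sketch.
    """
    samples = [quantile_targets(next_state, policy(next_state)) for _ in range(n_samples)]
    next_quantiles = np.mean(samples, axis=0)  # one possible way to combine the samples
    return reward + discount * next_quantiles

def ucb_action_score(quantiles, beta=1.0):
    """UCB-style score for action selection: mean return estimate plus a
    multiple of the spread of the learned return distribution."""
    return quantiles.mean() + beta * quantiles.std()
```

As a usage pattern, one could rank candidate actions by `ucb_action_score(quantile_targets(s, a))` during exploration, while the critic is regressed toward `multi_sample_target(...)`; how the paper actually combines these pieces is detailed in its method section, not recoverable from the abstract alone.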