Paper Title
Robust Losses for Learning Value Functions
Paper Authors
Paper Abstract
Most value function learning algorithms in reinforcement learning are based on the mean squared (projected) Bellman error. However, squared errors are known to be sensitive to outliers, both skewing the solution of the objective and resulting in high-magnitude and high-variance gradients. To control these high-magnitude updates, typical strategies in RL involve clipping gradients, clipping rewards, rescaling rewards, or clipping errors. While these strategies appear to be related to robust losses -- like the Huber loss -- they are built on semi-gradient update rules which do not minimize a known loss. In this work, we build on recent insights reformulating squared Bellman errors as a saddlepoint optimization problem and propose a saddlepoint reformulation for a Huber Bellman error and Absolute Bellman error. We start from a formalization of robust losses, then derive sound gradient-based approaches to minimize these losses in both the online off-policy prediction and control settings. We characterize the solutions of the robust losses, providing insight into the problem settings where the robust losses define notably better solutions than the mean squared Bellman error. Finally, we show that the resulting gradient-based algorithms are more stable, for both prediction and control, with less sensitivity to meta-parameters.
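To make the idea concrete, below is a minimal, illustrative sketch (not the authors' algorithm) of a two-timescale, gradient-TD-style update for linear value prediction with a Huber Bellman error. It uses the conjugate view of the Huber loss, in which the auxiliary (dual) estimate of the expected TD error is bounded by the Huber threshold; all names (`saddlepoint_huber_td_step`, `tau`, `alpha`, `beta`) and the step-size values are assumptions made for illustration.

```python
import numpy as np


def huber(delta, tau=1.0):
    """Huber loss: quadratic for |delta| <= tau, linear in the tails."""
    a = np.abs(delta)
    return np.where(a <= tau, 0.5 * delta ** 2, tau * (a - 0.5 * tau))


def saddlepoint_huber_td_step(w, h, x, r, x_next, gamma, alpha, beta, tau):
    """One two-timescale update for linear value prediction (illustrative sketch).

    w: primary weights, v(s) ~= w @ x(s)
    h: auxiliary weights whose prediction estimates the expected TD error;
       the prediction is clipped to [-tau, tau], reflecting the bounded dual
       variable in the conjugate form of the Huber loss.
    alpha, beta: step sizes for w and h; tau: Huber threshold (assumed names).
    """
    delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)  # TD error
    h_pred = np.clip(np.dot(h, x), -tau, tau)             # bounded dual estimate
    # Ascent on the auxiliary weights toward the observed TD error.
    h = h + beta * (delta - np.dot(h, x)) * x
    # Descent on the primary weights using the gradient-correction direction.
    w = w + alpha * h_pred * (x - gamma * x_next)
    return w, h


# Toy usage on synthetic features with occasional outlier rewards.
rng = np.random.default_rng(0)
d = 8
w, h = np.zeros(d), np.zeros(d)
x = rng.normal(size=d)
for _ in range(1000):
    x_next = rng.normal(size=d)
    r = rng.normal() + 5.0 * (rng.random() < 0.01)  # rare large reward (outlier)
    w, h = saddlepoint_huber_td_step(w, h, x, r, x_next,
                                     gamma=0.9, alpha=0.01, beta=0.05, tau=1.0)
    x = x_next
```

Because the dual prediction is clipped at the Huber threshold, an occasional outlier TD error produces a bounded update to `w`, which is the qualitative behavior the abstract attributes to the robust losses; the squared-error case corresponds to removing the clip.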