Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks

Paper title

Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks

Authors

Litian Liang, Yaosheng Xu, Stephen McAleer, Dailin Hu, Alexander Ihler, Pieter Abbeel, Roy Fox

Abstract

In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods; however, none have shown success in sample-efficient learning through addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 sufficiently reduces estimation variance to obviate the lagging target network, eliminating it as a source of bias and further gaining sample efficiency. We justify intuitively and empirically the design choices in MeanQ, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environments, MeanQ outperforms all tested baselines, including the best available baseline, SUNRISE, at 100K interaction steps in 16/26 environments, and by 68% on average. MeanQ also outperforms Rainbow DQN at 500K steps in 21/26 environments, and by 49% on average, and achieves average human-level performance using 200K ($\pm$100K) interaction steps. Our implementation is available at https://github.com/indylab/MeanQ.
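
The central idea in the abstract, estimating the target value as an ensemble mean, can be illustrated with a minimal sketch. This is not taken from the authors' repository: the function names, the list-of-lists layout of the Q-values, and the default discount are assumptions made for illustration only. It assumes each of the K ensemble members has already produced Q-values for the next state, and it contrasts the maximum of the ensemble mean with the average of per-member maxima, which is never smaller.

```python
from typing import Sequence


def ensemble_mean_q(next_q: Sequence[Sequence[float]]) -> list:
    """Average the Q-values of K ensemble members, action by action.

    next_q[k][a] is member k's estimate for action a in the next state
    (hypothetical layout, chosen for this sketch).
    """
    num_members = len(next_q)
    num_actions = len(next_q[0])
    return [
        sum(member[a] for member in next_q) / num_members
        for a in range(num_actions)
    ]


def mean_q_target(next_q, reward, discount=0.99, done=False):
    """TD target that bootstraps from the maximum of the ensemble mean."""
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    return reward + discount * max(ensemble_mean_q(next_q))


def mean_of_member_maxima(next_q):
    """Average of each member's own maximum, shown only for contrast."""
    return sum(max(member) for member in next_q) / len(next_q)


if __name__ == "__main__":
    # Five members, three actions; the numbers are made up for illustration.
    q_values = [
        [1.0, 2.5, 0.5],
        [2.0, 1.0, 1.5],
        [0.5, 2.0, 2.5],
        [1.5, 1.5, 1.0],
        [2.5, 0.5, 2.0],
    ]
    print("max of ensemble mean:  ", max(ensemble_mean_q(q_values)))
    print("mean of member maxima: ", mean_of_member_maxima(q_values))
    print("TD target:             ", mean_q_target(q_values, reward=1.0))
```

Averaging before taking the maximum is what lowers the variance of the quantity being maximized. The sketch leaves out everything else the abstract mentions, including network training and independent experience sampling for each member.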
