Paper Title
Reducing Sampling Error in Batch Temporal Difference Learning
Paper Authors
Paper Abstract
Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch -- not the true probability of the action under the given policy. To address this limitation, we introduce policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct for the mismatch between the empirical weighting and the correct weighting for updates following each action. We refine the concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more data-efficient estimator than TD(0) for a fixed batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0).
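Read concretely, the abstract describes a reweighted batch TD(0) update: each update is scaled by the ratio of the evaluation policy's probability of the observed action to the empirical action probability estimated from the batch. Below is a minimal Python sketch of that idea for a tabular MDP, assuming integer-indexed states and actions; the names (`psec_td0`, `policy_probs`, `pi_hat`) and the repeated-sweep training loop are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np
from collections import defaultdict

def psec_td0(batch, policy_probs, num_states, alpha=0.1, gamma=0.99, sweeps=200):
    """Sketch of batch PSEC-TD(0) for a tabular MDP.

    batch        : list of (s, a, r, s_next, done) transitions
    policy_probs : policy_probs[s][a] = pi(a|s) for the evaluation policy
    """
    # Step 1: estimate the empirical (maximum-likelihood) action
    # distribution pi_hat(a|s) from action counts in the batch.
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, _, _, _ in batch:
        counts[s][a] += 1
    pi_hat = {s: {a: c / sum(acts.values()) for a, c in acts.items()}
              for s, acts in counts.items()}

    # Step 2: sweep over the batch with TD(0) updates, each scaled by the
    # importance-sampling ratio rho = pi(a|s) / pi_hat(a|s), which corrects
    # the empirical action weighting toward the evaluation policy's.
    V = np.zeros(num_states)
    for _ in range(sweeps):
        for s, a, r, s_next, done in batch:
            rho = policy_probs[s][a] / pi_hat[s][a]
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * rho * (target - V[s])
    return V
```

With rho fixed at 1 this reduces to ordinary batch TD(0); the ratio is the sole difference, and it is what corrects the policy sampling error the abstract identifies.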