Paper Title

Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees

Authors

Hsin-En Su, Yen-Ju Chen, Ping-Chun Hsieh, Xi Liu

Abstract

We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective -- the expected total discounted return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and therefore significant effort is needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient methods. We establish the global convergence of CAPO with general coordinate selection and then further quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and the randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
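To make the coordinate-ascent idea concrete, below is a minimal sketch of a coordinate-style update for a tabular softmax policy, loosely inspired by the design described in the abstract: only the (state, action) coordinate visited by the behavior policy is updated, the direction depends only on the sign of a critic-estimated advantage, and no importance weights or policy-gradient terms appear. The sign-based rule, the fixed step size, and all variable names here are illustrative assumptions; the exact CAPO update, coordinate selection schemes, and step-size conditions are specified in the paper.

```python
import numpy as np

n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))  # softmax policy logits (tabular)

def policy(theta):
    """Softmax policy over actions for every state."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def coordinate_update(theta, s, a, advantage, step=1.0):
    """Update a single (state, action) coordinate of the logits.

    Only the coordinate (s, a) collected by the behavior policy is touched,
    and the update direction is the sign of the estimated advantage --
    no importance sampling and no policy-gradient term. This is a
    simplified stand-in for the CAPO update, not the paper's exact rule.
    """
    theta = theta.copy()
    theta[s, a] += step * np.sign(advantage)
    return theta

# Example: an off-policy transition (s, a) with a positive critic-estimated
# advantage increases the probability of a at s (hypothetical values).
s, a, adv_hat = 2, 1, 0.7
theta = coordinate_update(theta, s, a, adv_hat)
print(policy(theta)[s])
```

In this sketch the update is well-defined regardless of which policy generated the transition, which is the property the abstract highlights: policy improvement is decoupled from the behavior policy's state distribution.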
