Finite-Time Analysis of Single-Timescale Actor-Critic

Paper Title

Finite-time analysis of single-timescale actor-critic

Authors

Xuyang Chen, Lin Zhao

Abstract

Actor-critic methods have achieved significant success in many challenging applications. However, their finite-time convergence is still poorly understood in the most practical single-timescale form. Existing analyses of single-timescale actor-critic have been limited to i.i.d. sampling or the tabular setting for simplicity. We investigate the more practical online single-timescale actor-critic algorithm on a continuous state space, where the critic uses linear function approximation and updates with a single Markovian sample per actor step. Previous analyses have been unable to establish convergence for such a challenging scenario. We demonstrate that the online single-timescale actor-critic method provably finds an $ε$-approximate stationary point with $\widetilde{\mathcal{O}}(ε^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(ε^{-2})$ under i.i.d. sampling. Our novel framework systematically evaluates and controls the error propagation between the actor and critic. It offers a promising approach for analyzing other single-timescale reinforcement learning algorithms as well.
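
To make the setting in the abstract concrete, the following is a minimal illustrative sketch of an online single-timescale actor-critic loop: a critic with linear function approximation and an actor are each updated once per step, from the same single Markovian sample, with step sizes of the same order. Everything specific here is an assumption for illustration and is not taken from the paper: the toy one-dimensional environment (`env_step`), the polynomial features (`features`), the softmax policy, the discounted TD(0) critic, and the step sizes `alpha` and `beta`. The abstract does not specify these details.

```python
import numpy as np

def features(s):
    # Hypothetical polynomial features for a scalar state in [-1, 1].
    return np.array([1.0, s, s * s])

def policy_probs(theta, phi):
    # Softmax policy over two actions with linear logits theta @ phi.
    logits = theta @ phi
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def env_step(s, a, rng):
    # Toy dynamics (an assumption, not from the paper): action 1 moves
    # right, action 0 moves left, with small Gaussian noise.
    move = 0.1 if a == 1 else -0.1
    s_next = float(np.clip(s + move + 0.05 * rng.standard_normal(), -1.0, 1.0))
    reward = -s_next ** 2  # reward is highest near the origin
    return s_next, reward

def single_timescale_actor_critic(num_steps=5000, alpha=0.05, beta=0.05,
                                  gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(3)           # critic weights: V(s) is approximated by w @ features(s)
    theta = np.zeros((2, 3))  # actor parameters, one row per action
    s = rng.uniform(-1.0, 1.0)
    for _ in range(num_steps):
        phi = features(s)
        probs = policy_probs(theta, phi)
        a = int(rng.choice(2, p=probs))
        s_next, r = env_step(s, a, rng)  # one Markovian sample per step

        # TD error computed from the current critic.
        delta = r + gamma * (w @ features(s_next)) - w @ phi

        # Critic step: TD(0) with linear function approximation.
        w = w + beta * delta * phi

        # Actor step: gradient of log pi(a|s) for a softmax policy is
        # (onehot(a) - probs) outer phi; the TD error stands in for the advantage.
        grad_log = -np.outer(probs, phi)
        grad_log[a] += phi
        theta = theta + alpha * delta * grad_log

        s = s_next  # continue along the same trajectory (no resets)
    return theta, w

if __name__ == "__main__":
    theta, w = single_timescale_actor_critic()
    print("critic weights:", w)
    print("actor parameters:", theta)
```

The point of the sketch is the update schedule rather than the particular environment: the critic is not run to convergence between actor steps (as in a two-loop scheme), nor given an asymptotically faster step size (as in a two-timescale scheme). Both iterates move together, which is why errors in one can propagate to the other.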
