Paper Title

Dynamic Regret of Policy Optimization in Non-stationary Environments

Paper Authors

Yingjie Fei, Zhuoran Yang, Zhaoran Wang, Qiaomin Xie

Paper Abstract

We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. Compared with the classical notion of static regret, dynamic regret is a stronger notion as it explicitly accounts for the non-stationarity of environments. The dynamic regret attained by the proposed algorithms interpolates between different regimes of non-stationarity, and moreover satisfies a notion of adaptive (near-)optimality, in the sense that it matches the (near-)optimal static regret under slow-changing environments. The dynamic regret bound features two components, one arising from exploration, which deals with the uncertainty of transition kernels, and the other arising from adaptation, which deals with non-stationary environments. Specifically, we show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
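As a point of reference, the sketch below gives one standard way to formalize the two regret notions over K episodes; the notation (value function V_1^{\pi,k} under the episode-k rewards, initial state s_1^k, learner's policy \pi_k) is assumed here for illustration and may differ from the conventions used in the paper.

% Static regret: the benchmark is the single best fixed policy in hindsight.
\mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K} \Big( V_1^{\pi,k}(s_1^k) - V_1^{\pi_k,k}(s_1^k) \Big),

% Dynamic regret: the benchmark is the per-episode optimal policy, which may
% change whenever the adversarial rewards change.
\mathrm{D\text{-}Regret}(K) \;=\; \sum_{k=1}^{K} \Big( \max_{\pi} V_1^{\pi,k}(s_1^k) - V_1^{\pi_k,k}(s_1^k) \Big).

Because the maximum moves inside the sum, dynamic regret always upper-bounds static regret; this is why matching the (near-)optimal static regret under slow-changing environments is a meaningful notion of adaptive optimality.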
