Blackwell Online Learning for Markov Decision Processes

Paper Title

Blackwell Online Learning for Markov Decision Processes

Authors

Tao Li, Guanze Peng, Quanyan Zhu

Abstract

This work provides a novel interpretation of Markov decision processes (MDPs) from the online optimization viewpoint. In this online optimization context, the policy of the MDP is viewed as the decision variable, while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by the MDP, which bridges the gap among regret minimization, Blackwell approachability theory, and learning theory for MDPs. Specifically, from approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell $Q$-learning for online learning in MDPs, both of which are shown to converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.
