Paper Title

Learning Adversarial Markov Decision Processes with Delayed Feedback

Paper Authors

Tal Lancewicki, Aviv Rosenberg, Yishay Mansour

Abstract

Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) the feedback is observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_{k} d^k$ is the total delay. Under bandit feedback, we prove similar $\sqrt{K + D}$ regret assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
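To make the feedback timing concrete, below is a minimal sketch (in Python, not from the paper) of the delayed-feedback protocol described in the abstract: the cost and trajectory of episode $k$ become available only at the end of episode $k + d^k$, and $D = \sum_k d^k$ is the total delay. The random delays and the horizon used here are purely illustrative; in the paper the delays are chosen by an oblivious adversary and need not be bounded.

```python
# Illustrative sketch of the delayed-feedback protocol (not the authors' algorithm).
# Feedback for episode k is revealed only at the end of episode k + d^k,
# and the total delay is D = sum_k d^k.
import random
from collections import defaultdict

K = 10                                            # number of episodes (illustrative)
delays = [random.randint(0, 3) for _ in range(K)] # d^k for k = 0..K-1 (illustrative, adversarial in the paper)
D = sum(delays)                                   # total delay D = sum_k d^k

# arrival[t] lists the episodes whose feedback is revealed at the end of episode t
arrival = defaultdict(list)
for k, d in enumerate(delays):
    arrival[k + d].append(k)

for t in range(K + max(delays) + 1):
    # The learner plays episode t (if t < K) using only feedback that has already arrived,
    # then observes whatever delayed feedback is revealed at the end of episode t.
    revealed = arrival.get(t, [])
    if revealed:
        print(f"end of episode {t}: feedback for episodes {revealed} arrives")

print(f"total delay D = {D}; full-information regret scales as sqrt(K + D)")
```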
