Paper Title
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Paper Authors
Paper Abstract
In this paper we consider the problem of learning an $ε$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $γ \in (0,1)$, and an approximation threshold $ε > 0$, we provide a model-free algorithm to learn an $ε$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{ε^2(1-γ)^{5.5}})$ (where the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $1/(1-γ)$, and $1/ε$) and success probability $(1-p)$. For small enough $ε$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{ε^2(1-γ)^{3}})$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.
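As a rough illustration of how the two stated bounds scale, the sketch below evaluates only the dominant term $SA\ln(1/p) / (ε^2(1-γ)^c)$ of each sample complexity expression for concrete values of $S$, $A$, $γ$, $ε$, and $p$, with the exponent $c$ set to $5.5$ or $3$. The helper `sample_complexity` and the numeric values are hypothetical; constants and the polylogarithmic factors hidden by $\tilde{O}(\cdot)$ are ignored, and this is not part of the paper's algorithms.

```python
import math

def sample_complexity(S, A, gamma, eps, p, horizon_exponent):
    """Dominant term S * A * ln(1/p) / (eps^2 * (1-gamma)^horizon_exponent),
    ignoring constants and the polylogarithmic factors hidden by O-tilde."""
    return S * A * math.log(1.0 / p) / (eps ** 2 * (1.0 - gamma) ** horizon_exponent)

# Example instance: S=100 states, A=10 actions, gamma=0.99, eps=0.1, failure probability p=0.05.
first_bound = sample_complexity(100, 10, 0.99, 0.1, 0.05, horizon_exponent=5.5)   # (1-gamma)^{-5.5} bound
second_bound = sample_complexity(100, 10, 0.99, 0.1, 0.05, horizon_exponent=3.0)  # near-optimal (1-gamma)^{-3} bound

print(f"first bound  ~ {first_bound:.3e} samples")
print(f"second bound ~ {second_bound:.3e} samples")
print(f"improvement factor ~ {first_bound / second_bound:.1f}x")  # equals (1-gamma)^{-2.5}
```

For $γ = 0.99$ the effective-horizon factor $(1-γ)^{-2.5}$ alone accounts for a $10^5$-fold gap between the two dominant terms, which is why the second bound matching the $(1-γ)^{-3}$ information-theoretic rate is the paper's main improvement.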