Paper Title
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
Paper Authors
Paper Abstract
Value function approximation has demonstrated phenomenal empirical success in reinforcement learning (RL). Nevertheless, despite recent progress on developing theory for RL with linear function approximation, the understanding of general function approximation schemes largely remains missing. In this paper, we establish a provably efficient RL algorithm with general value function approximation. We show that if the value functions admit an approximation with a function class $\mathcal{F}$, our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$, where $d$ is a complexity measure of $\mathcal{F}$ that depends on the eluder dimension [Russo and Van Roy, 2013] and log-covering numbers, $H$ is the planning horizon, and $T$ is the number of interactions with the environment. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment. Moreover, our algorithm is model-free and provides a framework to justify the effectiveness of algorithms used in practice.
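For context, a minimal sketch of how the regret appearing in the bound above is typically defined in episodic RL (this standard definition is assumed here; the abstract itself does not spell it out): over $K$ episodes of horizon $H$, with $T = KH$,

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \big( V^{*}_{1}(s^k_1) - V^{\pi^k}_{1}(s^k_1) \big),$$

where $\pi^k$ is the policy executed in episode $k$, $s^k_1$ is the initial state of that episode, and $V^{*}_{1}$ and $V^{\pi^k}_{1}$ denote the optimal and realized value functions at the first step. Under this reading, a $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ bound implies that the average per-episode suboptimality vanishes as the number of episodes grows.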