Paper Title

VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

Paper Authors

Alekh Agarwal, Yujia Jin, Tong Zhang

Paper Abstract

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
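
Below is a minimal illustrative sketch (assuming NumPy; all names, weights, and the bonus coefficient `beta` are hypothetical placeholders, not the paper's exact algorithm) of the core primitive the abstract describes in the linear special case: a variance-weighted ridge regression fit together with an optimistic upper bound built from the weighted-covariance norm of the feature vector.

```python
import numpy as np

def weighted_ridge_fit(phi, targets, weights, lam=1.0):
    """Fit theta minimizing sum_i w_i * (phi_i^T theta - y_i)^2 + lam * ||theta||^2."""
    d = phi.shape[1]
    W = np.diag(weights)
    Sigma = phi.T @ W @ phi + lam * np.eye(d)            # weighted covariance matrix
    theta = np.linalg.solve(Sigma, phi.T @ W @ targets)  # weighted least-squares solution
    return theta, Sigma

def optimistic_q(phi_query, theta, Sigma, beta=1.0):
    """Upper-confidence Q estimate: point prediction plus a weighted elliptic-norm bonus."""
    Sigma_inv = np.linalg.inv(Sigma)
    mean = phi_query @ theta
    bonus = beta * np.sqrt(np.einsum("ij,jk,ik->i", phi_query, Sigma_inv, phi_query))
    return mean + bonus

# Toy usage with d = 4 features and inverse-variance weights (illustrative values only).
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 4))
y = phi @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)
w = np.full(200, 1.0 / 0.1 ** 2)   # weight each sample by 1 / (estimated variance)
theta, Sigma = weighted_ridge_fit(phi, y, w)
q_ucb = optimistic_q(phi[:5], theta, Sigma, beta=0.5)
```

The analogous lower bound would subtract the same bonus, matching the abstract's description of weighted regression-based upper and lower bounds on the optimal value function; how the per-sample variance estimates and the coefficient are actually set is specified in the paper, not here.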
