Paper Title

VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

Paper Authors

Alekh Agarwal, Yujia Jin, Tong Zhang

Paper Abstract

We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
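
Below is a minimal illustrative sketch (assuming NumPy; all names, weights, and the bonus coefficient `beta` are hypothetical placeholders, not the paper's exact algorithm) of the core primitive the abstract describes in the linear special case: a variance-weighted ridge regression fit together with an optimistic upper bound built from the weighted-covariance norm of the feature vector.

```python
import numpy as np

def weighted_ridge_fit(phi, targets, weights, lam=1.0):
    """Fit theta minimizing sum_i w_i * (phi_i^T theta - y_i)^2 + lam * ||theta||^2."""
    d = phi.shape[1]
    W = np.diag(weights)
    Sigma = phi.T @ W @ phi + lam * np.eye(d)            # weighted covariance matrix
    theta = np.linalg.solve(Sigma, phi.T @ W @ targets)  # weighted least-squares solution
    return theta, Sigma

def optimistic_q(phi_query, theta, Sigma, beta=1.0):
    """Upper-confidence Q estimate: point prediction plus a weighted elliptic-norm bonus."""
    Sigma_inv = np.linalg.inv(Sigma)
    mean = phi_query @ theta
    bonus = beta * np.sqrt(np.einsum("ij,jk,ik->i", phi_query, Sigma_inv, phi_query))
    return mean + bonus

# Toy usage with d = 4 features and inverse-variance weights (illustrative values only).
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 4))
y = phi @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)
w = np.full(200, 1.0 / 0.1 ** 2)   # weight each sample by 1 / (estimated variance)
theta, Sigma = weighted_ridge_fit(phi, y, w)
q_ucb = optimistic_q(phi[:5], theta, Sigma, beta=0.5)
```

The analogous lower bound would subtract the same bonus, matching the abstract's description of weighted regression-based upper and lower bounds on the optimal value function; how the per-sample variance estimates and the coefficient are actually set is specified in the paper, not here.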
