Paper Title
VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation
Paper Authors
Paper Abstract
We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
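The core mechanism named in the abstract is variance-weighted regression: each sample is down-weighted by an estimate of its conditional variance before fitting the value function. The sketch below illustrates only this ingredient, in the linear special case, as a minimal assumption-laden example; the names `variance_weighted_ridge`, `features`, `targets`, and `variances` are hypothetical, and the paper's full VO$Q$L algorithm additionally maintains optimistic and pessimistic value bounds not shown here.

```python
# Illustrative sketch only: variance-weighted ridge regression with
# d-dimensional features, i.e. min_w sum_k (phi_k^T w - y_k)^2 / sigma_k^2 + lam ||w||^2.
# This is not the paper's algorithm, just the weighted-regression step it builds on.
import numpy as np

def variance_weighted_ridge(features, targets, variances, lam=1.0):
    """features:  (K, d) state-action features phi_k
    targets:   (K,)  regression targets y_k (e.g. reward + next-state value)
    variances: (K,)  per-sample variance estimates sigma_k^2 (bounded away from 0)
    """
    w = 1.0 / variances                        # per-sample weights 1 / sigma_k^2
    A = features.T @ (features * w[:, None])   # variance-weighted Gram matrix
    A += lam * np.eye(features.shape[1])       # ridge regularization
    b = features.T @ (w * targets)             # variance-weighted target vector
    return np.linalg.solve(A, b)               # weighted least-squares solution

# Usage with synthetic data, just to show the call shape.
rng = np.random.default_rng(0)
K, d = 100, 5
phi = rng.normal(size=(K, d))
y = phi @ rng.normal(size=d) + 0.1 * rng.normal(size=K)
sigma2 = 0.5 + rng.random(K)
w_hat = variance_weighted_ridge(phi, y, sigma2, lam=1.0)
```

Down-weighting high-variance samples in this way is what tightens the confidence bounds enough to obtain the leading $\tilde{O}(d\sqrt{HT})$ term in the regret.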