Paper Title

Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

Paper Authors

Ming Yin, Yaqi Duan, Mengdi Wang, Yu-Xiang Wang

Paper Abstract

Offline reinforcement learning, which seeks to utilize offline/historical data to optimize sequential decision-making strategies, has gained surging prominence in recent studies. Due to the advantage that appropriate function approximators can help mitigate the sample complexity burden in modern reinforcement learning problems, existing endeavors usually enforce powerful function representation models (e.g. neural networks) to learn the optimal policies. However, a precise understanding of the statistical limits with function representations remains elusive, even when such a representation is linear. Towards this goal, we study the statistical limits of offline reinforcement learning with linear model representations. To derive a tight offline learning bound, we design the variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs). VAPVI leverages estimated variances of the value functions to reweight the Bellman residuals in the least-square pessimistic value iteration and provides improved offline learning bounds over the best-known existing results (whereas the Bellman residuals are equally weighted by design). More importantly, our learning bounds are expressed in terms of system quantities, which provide natural instance-dependent characterizations that previous results are short of. We hope our results draw a clearer picture of what offline learning should look like when linear representations are provided.
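To make the abstract's description of VAPVI concrete, below is a minimal, hedged sketch of variance-weighted least-squares pessimistic value iteration for a finite-horizon linear MDP. It is not the paper's implementation: it assumes a finite action set, a known feature map `phi`, per-transition variance estimates `sigma2` supplied with the data (the paper instead estimates these variances from data via separate regressions), and illustrative constants `beta` and `lam` for the pessimism bonus and ridge regularization.

```python
# Minimal sketch of variance-weighted pessimistic value iteration (VAPVI-style).
# Assumptions (not from the paper text): finite action set, known feature map
# phi(s, a) -> R^d, and data[h] = list of (s, a, r, s_next, sigma2) tuples where
# sigma2 is an externally supplied variance estimate for that transition.
import numpy as np

def vapvi_sketch(data, phi, n_actions, H, d, beta=1.0, lam=1.0):
    w = [np.zeros(d) for _ in range(H + 1)]           # linear Q-function weights per step
    Lam_inv = [np.eye(d) / lam for _ in range(H + 1)]  # inverse weighted covariances

    def Q_hat(h, s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lam_inv[h] @ f)     # pessimistic penalty (elliptical bonus)
        return np.clip(f @ w[h] - bonus, 0.0, H - h)   # clip to the valid value range

    def V_hat(h, s):
        if h >= H:
            return 0.0
        return max(Q_hat(h, s, a) for a in range(n_actions))

    # Backward induction over the horizon.
    for h in reversed(range(H)):
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, r, s_next, sigma2) in data[h]:
            f = phi(s, a)
            y = r + V_hat(h + 1, s_next)               # regression target from next-step value
            Lam += np.outer(f, f) / sigma2             # variance-reweighted covariance
            target += f * y / sigma2                   # variance-reweighted regression targets
        Lam_inv[h] = np.linalg.inv(Lam)
        w[h] = Lam_inv[h] @ target                     # weighted least-squares solution

    # Greedy policy w.r.t. the pessimistically penalized Q estimates.
    def policy(h, s):
        return int(np.argmax([Q_hat(h, s, a) for a in range(n_actions)]))

    return policy
```

The sketch highlights the two ideas the abstract emphasizes: Bellman residuals are reweighted by (estimated) conditional variances in the least-squares regression, and the resulting Q estimates are penalized by an uncertainty bonus before acting greedily.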
