Paper Title


Confidence-Conditioned Value Functions for Offline Reinforcement Learning

Paper Authors

Joey Hong, Aviral Kumar, Sergey Levine

Paper Abstract


Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose among them during evaluation. To this end, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling the confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function of existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
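The abstract describes, at a high level, a Q-function that takes a confidence level as an additional input and is trained with a backup whose pessimism grows with that confidence. Below is a minimal tabular sketch of this idea only, not the paper's actual algorithm: it substitutes a heuristic count-based (LCB-style) penalty for the paper's derived confidence-conditioned Bellman backup, omits the adaptive test-time selection of the confidence level, and all names, the penalty form, and the synthetic dataset are illustrative assumptions.

```python
# Tabular sketch of a confidence-conditioned conservative Q-function.
# NOT the paper's algorithm: the Hoeffding-style count penalty below is a
# stand-in for the confidence-conditioned Bellman backup derived in the paper.
import numpy as np

def confidence_conditioned_q(transitions, n_states, n_actions,
                             deltas=(0.5, 0.9, 0.99),
                             gamma=0.99, n_iters=200):
    """transitions: list of (s, a, r, s_next) tuples from a static dataset."""
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    next_counts = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in transitions:
        counts[s, a] += 1
        reward_sum[s, a] += r
        next_counts[s, a, s_next] += 1

    safe_counts = np.maximum(counts, 1)
    r_hat = reward_sum / safe_counts              # empirical mean rewards
    p_hat = next_counts / safe_counts[..., None]  # empirical transition model

    # One Q-table per confidence level: Q[k, s, a] is meant to lower-bound the
    # true value with probability roughly deltas[k]; rarely seen (s, a) pairs
    # receive a larger penalty, and higher confidence demands more pessimism.
    deltas = np.asarray(deltas, dtype=float)
    penalty = np.sqrt(np.log(1.0 / (1.0 - deltas))[:, None, None] / safe_counts)
    Q = np.zeros((len(deltas), n_states, n_actions))

    for _ in range(n_iters):
        V = Q.max(axis=2)                               # (K, S) greedy values
        backup = r_hat + gamma * np.einsum('san,kn->ksa', p_hat, V)
        Q = backup - penalty                            # conservative backup
        Q[:, counts == 0] = 0.0   # keep unseen (s, a) at zero (pessimistic for non-negative rewards)
    return Q


# Tiny synthetic usage example (2 states, 2 actions, random dataset).
rng = np.random.default_rng(0)
data = [(int(rng.integers(2)), int(rng.integers(2)), float(rng.random()),
         int(rng.integers(2))) for _ in range(500)]
Q = confidence_conditioned_q(data, n_states=2, n_actions=2)
print(Q.shape)  # (3, 2, 2): one conservative Q-table per confidence level
```

In the paper's actual method, the confidence is an input to a single learned Q-function rather than an index over separate tables, and the confidence level is chosen adaptively during online evaluation from the history of observations; this sketch covers only the training-time conditioning.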
