Paper Title
Robust Offline Reinforcement Learning with Gradient Penalty and Constraint Relaxation
Paper Authors
Paper Abstract
A promising paradigm for offline reinforcement learning (RL) is to constrain the learned policy to stay close to the dataset behaviors, known as policy-constraint offline RL. However, existing works rely heavily on the purity of the data, exhibiting performance degradation or even catastrophic failure when learning from contaminated datasets that contain impure trajectories of diverse quality levels (e.g., expert level, medium level), even though such contaminated offline data logs are common in the real world. To mitigate this, we first introduce a gradient penalty over the learned value function to tackle exploding Q-functions. We then relax the closeness constraints toward non-optimal actions with critic-weighted constraint relaxation. Experimental results show that the proposed techniques effectively tame the non-optimal trajectories for policy-constraint offline RL methods, evaluated on a set of contaminated D4RL MuJoCo and Adroit datasets.
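The abstract names two components: a gradient penalty on the learned Q-function and a critic-weighted relaxation of the behavior-closeness constraint. The PyTorch sketch below illustrates one plausible instantiation of both ideas; the network architecture, the exact penalty form, and the exponential weighting scheme are assumptions for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical Q-network; the paper's architecture is not given in the abstract.
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def gradient_penalty(q_net, states, actions, coeff=1.0):
    """Penalize the gradient norm of Q w.r.t. actions to curb exploding Q-values.

    Sketch only: the penalty target and the points at which gradients are taken
    are assumptions, not details taken from the abstract.
    """
    actions = actions.clone().requires_grad_(True)
    q_values = q_net(states, actions)
    grads = torch.autograd.grad(q_values.sum(), actions, create_graph=True)[0]
    # Squared L2 norm of the action gradient, averaged over the batch.
    return coeff * grads.pow(2).sum(dim=-1).mean()


def weighted_bc_loss(policy, q_net, states, actions, temperature=1.0):
    """Critic-weighted behavior-cloning constraint.

    Actions the critic scores poorly receive smaller weights, relaxing the
    closeness constraint on non-optimal behaviors. The exponential weighting
    is one common choice, not necessarily the paper's.
    """
    with torch.no_grad():
        q = q_net(states, actions)
        advantage = q - q.mean()            # crude baseline, for illustration only
        weights = torch.exp(advantage / temperature).clamp(max=100.0)
    predicted = policy(states)              # deterministic policy head assumed
    bc_error = F.mse_loss(predicted, actions, reduction="none").sum(dim=-1, keepdim=True)
    return (weights * bc_error).mean()
```

In a full training loop, `gradient_penalty` would be added to the critic's TD loss, and `weighted_bc_loss` would replace the uniform behavior-cloning term in the actor objective; how the paper combines these terms and sets their coefficients is not stated in the abstract.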