Paper Title

TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets

Authors

Yuanying Cai, Chuheng Zhang, Li Zhao, Wei Shen, Xuyun Zhang, Lei Song, Jiang Bian, Tao Qin, Tie-Yan Liu

Abstract

We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. There are two challenges for this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal changes across states due to the variation in action coverage induced by the different behavior policies. Previous methods, which only control this trade-off globally, fail to handle it. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using an adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer on top of the TD3 algorithm. Our method not only trades off the RL and BC signals with per-state weights (i.e., strong BC regularization on states with narrow action coverage, and vice versa) but also avoids selecting OOD actions thanks to the mode-seeking property of reverse KL. Empirically, our algorithm outperforms existing offline RL algorithms on the MuJoCo locomotion tasks with the standard D4RL datasets as well as mixed datasets that combine the standard datasets.
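The mean-seeking vs. mode-seeking distinction the abstract draws can be illustrated numerically. The sketch below (an independent illustration, not the paper's code) models the multi-modal behavior distribution as a bimodal Gaussian mixture and fits a unimodal Gaussian "policy" by minimizing either the forward KL(behavior || policy) or the reverse KL(policy || behavior) over the policy mean: the forward-KL fit lands between the two modes (exactly the OOD-action failure described above), while the reverse-KL fit snaps to one mode.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Discretized action grid on [-6, 6]; the "behavior" distribution p is a
# bimodal mixture with modes at -2 and +2 (stands in for actions from two
# different behavior policies).
xs = [i * 0.01 for i in range(-600, 601)]
p = [0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5) for x in xs]
s = sum(p)
p = [pi / s for pi in p]  # normalize to a pmf over the grid

def kl(a, b):
    """Discrete KL(a || b) for two normalized pmfs on the same grid."""
    return sum(ai * math.log(ai / (bi + 1e-300)) for ai, bi in zip(a, b) if ai > 0)

def fit_mean(divergence):
    """Grid-search the unimodal policy mean that minimizes the divergence."""
    best_mu, best_val = None, float("inf")
    for k in range(-60, 61):
        mu = k * 0.05
        q = [normal_pdf(x, mu, 1.0) for x in xs]
        sq = sum(q)
        q = [qi / sq for qi in q]
        val = divergence(q)
        if val < best_val:
            best_mu, best_val = mu, val
    return best_mu

mu_forward = fit_mean(lambda q: kl(p, q))  # forward KL: mean-seeking
mu_reverse = fit_mean(lambda q: kl(q, p))  # reverse KL: mode-seeking
print("forward-KL mean:", mu_forward, "| reverse-KL mean:", mu_reverse)
```

Running this shows the forward-KL fit centered near 0 (between the modes, where the behavior distribution has almost no mass) and the reverse-KL fit near one of the modes at ±2, which is why the paper's reverse-KL regularizer keeps the policy on in-distribution actions.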
