Paper Title
Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation
Paper Authors
Paper Abstract
Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks, starting from easy ones and gradually learning towards difficult tasks. In this work, we focus on the idea of framing CRL as interpolations between a source (auxiliary) and a target task distribution. Although existing studies have shown the great potential of this idea, it remains unclear how to formally quantify and generate the movement between task distributions. Inspired by the insights from gradual domain adaptation in semi-supervised learning, we create a natural curriculum by breaking down the potentially large task distributional shift in CRL into smaller shifts. We propose GRADIENT, which formulates CRL as an optimal transport problem with a tailored distance metric between tasks. Specifically, we generate a sequence of task distributions as a geodesic interpolation (i.e., Wasserstein barycenter) between the source and target distributions. Different from many existing methods, our algorithm considers a task-dependent contextual distance metric and is capable of handling nonparametric distributions in both continuous and discrete context settings. In addition, we theoretically show that GRADIENT enables smooth transfer between subsequent stages in the curriculum under certain conditions. We conduct extensive experiments in locomotion and manipulation tasks and show that our proposed GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance.
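The geodesic (displacement) interpolation described above can be illustrated with a minimal sketch in one dimension. For equal-size, uniformly weighted samples on the real line, the optimal transport plan is the sorted rank-to-rank matching, and the Wasserstein barycenter at interpolation weight t is the linear interpolation of matched pairs. The function and variable names below are illustrative assumptions, not taken from the paper's implementation, and the sketch ignores the paper's task-dependent distance metric and discrete-context handling.

```python
# Hedged sketch: 1-D displacement interpolation between two task
# distributions, each represented by equally weighted samples.
# In 1-D, optimal transport between equal-size sample sets reduces
# to matching samples by rank (sorted order).

def wasserstein_interpolation(source, target, num_stages):
    """Return num_stages + 1 intermediate sample sets forming a
    Wasserstein geodesic from `source` to `target` (both 1-D)."""
    assert len(source) == len(target), "sketch assumes equal sample counts"
    s = sorted(source)
    g = sorted(target)
    stages = []
    for k in range(num_stages + 1):
        t = k / num_stages  # interpolation weight in [0, 1]
        # Displacement interpolation of each rank-matched pair.
        stages.append([(1 - t) * a + t * b for a, b in zip(s, g)])
    return stages

# Example: gradually shift an easy task distribution (e.g. short goal
# distances) toward a harder one, yielding a curriculum of stages.
source_contexts = [0.1, 0.2, 0.3]   # hypothetical easy contexts
target_contexts = [0.9, 1.0, 1.1]   # hypothetical hard contexts
curriculum = wasserstein_interpolation(source_contexts, target_contexts, 4)
```

Each element of `curriculum` is one stage's task distribution; training proceeds stage by stage, so each distributional shift stays small, mirroring the gradual-domain-adaptation view in the abstract. Extending this beyond 1-D requires solving a genuine optimal transport problem (e.g. with a linear-programming or Sinkhorn solver) rather than sorting.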