Paper Title
D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning
Paper Authors
Paper Abstract
While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.
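The abstract does not spell out the shaping construction, but a minimal sketch of how reward shaping can preserve the task-reward optimum, assuming the standard potential-based form of Ng et al. (1999) with a goal-conditioned potential (an illustrative assumption, not a detail taken from this abstract):

F(s, a, s') = \gamma \, \Phi_g(s') - \Phi_g(s)
r_{\text{shaped}}(s, a, s') = r_{\text{task}}(s, a, s') + F(s, a, s')

Any shaping term of this potential-based form leaves the set of optimal policies for r_task unchanged; choosing \Phi_g to measure progress toward a goal g taken from a (possibly suboptimal) demonstration would let the demonstration guide exploration without overriding the task reward, which is consistent with the property the abstract claims for D-Shape.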