Paper Title
Inducing Structure in Reward Learning by Learning Features
Paper Authors
Paper Abstract
Reward learning enables robots to learn adaptable behaviors from human input. Traditional methods model the reward as a linear function of hand-crafted features, but this requires specifying all the relevant features a priori, which is impossible for real-world tasks. To get around this issue, recent deep Inverse Reinforcement Learning (IRL) methods learn rewards directly from the raw state, but this is challenging because the robot must implicitly learn both the features that are important and how to combine them, simultaneously. Instead, we propose a divide-and-conquer approach: focus human input specifically on learning the features separately, and only then learn how to combine them into a reward. We introduce a novel type of human input for teaching features and an algorithm that utilizes it to learn complex features from the raw state space. The robot can then learn how to combine them into a reward using demonstrations, corrections, or other reward learning frameworks. We demonstrate our method in settings where all features have to be learned from scratch, as well as where some of the features are known. By first focusing human input specifically on the feature(s), our method decreases sample complexity and improves generalization of the learned reward over a deep IRL baseline. We show this in experiments with a physical 7DOF robot manipulator, as well as in a user study conducted in a simulated environment.
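The abstract describes a two-stage pipeline: first learn each feature from feature-specific human input, then learn how to combine the features into a reward from demonstrations or corrections. Below is a minimal sketch of that structure, assuming PyTorch. The network shapes, the monotonicity-style loss standing in for the paper's feature-teaching input, and all names (`FeatureNet`, `train_feature`, `update_weights`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a divide-and-conquer reward-learning pipeline.
# Assumption: each feature is taught via ordered state sequences along
# which the feature value should increase; this is a hypothetical
# stand-in for the paper's novel human input type.
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Maps the raw robot state to a single scalar feature value."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)

def train_feature(net, traces, epochs=100, lr=1e-3, margin=0.1):
    """Stage 1: fit one feature from feature-specific human input.

    Each trace is assumed to be a (T, state_dim) tensor of states
    ordered by increasing feature value; we supervise with a simple
    ranking loss over consecutive states in the trace.
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for trace in traces:
            values = net(trace).squeeze(-1)  # (T,)
            # Encourage values[t+1] > values[t] along the trace.
            loss = torch.relu(values[:-1] - values[1:] + margin).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net

def reward(states, feature_nets, weights):
    """Stage 2: reward as a linear combination of learned features."""
    phi = torch.cat([f(states) for f in feature_nets], dim=-1)  # (N, F)
    return phi @ weights

def update_weights(weights, demo_states, sampled_states, feature_nets, lr=0.1):
    """One MaxEnt-IRL-style gradient step for the combination weights:
    push expected features of demonstrations above those of the
    robot's own sampled behavior."""
    with torch.no_grad():
        phi_demo = torch.cat([f(demo_states) for f in feature_nets], -1).mean(0)
        phi_samp = torch.cat([f(sampled_states) for f in feature_nets], -1).mean(0)
    return weights + lr * (phi_demo - phi_samp)
```

The weight update above follows the standard maximum-entropy IRL gradient (matching feature expectations between demonstrations and robot samples), which is one common realization of the "combine features into a reward" stage; the paper's framing also admits corrections or other reward learning frameworks in its place.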