Paper Title
Learning from Physical Human Feedback: An Object-Centric One-Shot Adaptation Method
Paper Authors
Paper Abstract
For robots to be effectively deployed in novel environments and tasks, they must be able to understand the feedback expressed by humans during intervention. This feedback can either correct undesirable behavior or indicate additional preferences. Existing methods either require repeated interaction episodes or assume known reward features a priori, which is data-inefficient and transfers poorly to new tasks. We relax these assumptions by describing human tasks in terms of object-centric sub-tasks and interpreting physical interventions in relation to specific objects. Our method, Object Preference Adaptation (OPA), is composed of two key stages: 1) pre-training a base policy to produce a wide variety of behaviors, and 2) online updating according to human feedback. The key to our fast yet simple adaptation is that the general interaction dynamics between agents and objects are fixed, and only object-specific preferences are updated. Our adaptation occurs online, requires only one human intervention (one-shot), and produces new behaviors never seen during training. Trained on cheap synthetic data instead of expensive human demonstrations, our policy correctly adapts to human perturbations on realistic tasks on a physical 7-DoF robot. Videos, code, and supplementary material are provided.
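To make the two-stage structure concrete, here is a minimal PyTorch sketch of the adaptation idea the abstract describes: a shared interaction-dynamics network is pre-trained and then frozen, while a small set of object-specific preference parameters is fit online from a single human intervention. All names here (OPAPolicy, object_prefs, adapt_one_shot) and the specific architecture are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class OPAPolicy(nn.Module):
    """Hypothetical policy split into fixed dynamics + learnable object preferences."""

    def __init__(self, num_objects: int, feat_dim: int = 16):
        super().__init__()
        # Stage 1: general agent-object interaction dynamics,
        # pre-trained on synthetic data and frozen afterwards.
        self.dynamics = nn.Sequential(
            nn.Linear(3 + feat_dim, 64), nn.ReLU(), nn.Linear(64, 3)
        )
        # Object-specific preference features: the only parameters
        # updated online from physical human feedback.
        self.object_prefs = nn.Parameter(torch.zeros(num_objects, feat_dim))

    def forward(self, rel_pos: torch.Tensor, obj_idx: int) -> torch.Tensor:
        # rel_pos: (N, 3) agent positions relative to the object.
        feat = self.object_prefs[obj_idx].expand(rel_pos.shape[0], -1)
        return self.dynamics(torch.cat([rel_pos, feat], dim=-1))

def adapt_one_shot(policy: OPAPolicy, rel_pos: torch.Tensor,
                   corrected_actions: torch.Tensor, obj_idx: int,
                   steps: int = 100, lr: float = 1e-2) -> OPAPolicy:
    """Fit the object's preference vector to one intervention (one-shot)."""
    policy.dynamics.requires_grad_(False)  # keep general dynamics fixed
    opt = torch.optim.Adam([policy.object_prefs], lr=lr)
    for _ in range(steps):
        loss = ((policy(rel_pos, obj_idx) - corrected_actions) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Because only the low-dimensional preference vector is optimized while the interaction dynamics stay frozen, the update is fast enough to run online and cannot destroy the diverse behaviors learned during pre-training, which is consistent with the one-shot adaptation claim in the abstract.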