Paper Title


SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

Paper Authors

Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee

Paper Abstract


Preference-based reinforcement learning (RL) has shown potential for teaching agents to perform the target tasks without a costly, pre-defined reward function by learning the reward with a supervisor's preference between the two agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult to apply this approach to various applications. This data-efficiency problem, on the other hand, has been typically addressed by using unlabeled samples or data augmentation techniques in the context of supervised learning. Motivated by the recent success of these approaches, we present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation. In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor. To further improve the label-efficiency of reward learning, we introduce a new data augmentation that temporally crops consecutive subsequences from the original behaviors. Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the state-of-the-art preference-based method on a variety of locomotion and robotic manipulation tasks.
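The abstract names two concrete techniques: confidence-based pseudo-labeling of unlabeled behavior pairs, and a temporal cropping augmentation over behavior segments. Below is a minimal sketch of both ideas under stated assumptions; it is not the authors' reference implementation, and all names (`reward_fn`, the thresholds, the segment shapes) are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_fn(segment):
    # Placeholder for a learned reward model that scores a behavior segment
    # (shape: [T, obs_dim]); here it is just a fixed linear score.
    w = np.full(segment.shape[-1], 0.1)
    return float((segment @ w).sum())

def preference_prob(seg_a, seg_b):
    # Bradley-Terry-style probability that seg_a is preferred over seg_b,
    # computed from the summed predicted rewards of the two segments.
    ra, rb = reward_fn(seg_a), reward_fn(seg_b)
    return 1.0 / (1.0 + np.exp(rb - ra))

def pseudo_label(seg_a, seg_b, threshold=0.95):
    # Confidence-based pseudo-labeling: keep an unlabeled pair only when the
    # current preference predictor is confident about which segment wins.
    p = preference_prob(seg_a, seg_b)
    if p >= threshold:
        return 0          # pseudo-label: segment A preferred
    if p <= 1.0 - threshold:
        return 1          # pseudo-label: segment B preferred
    return None           # low confidence: discard the pair

def temporal_crop(segment, min_len=40, max_len=55):
    # Temporal cropping augmentation: sample a shorter consecutive
    # subsequence from the original behavior segment.
    length = rng.integers(min_len, max_len + 1)
    start = rng.integers(0, len(segment) - length + 1)
    return segment[start:start + length]

# Example: build an augmented, pseudo-labeled pair from unlabeled segments.
seg_a = rng.normal(size=(60, 8))   # 60-step segment, 8-dim observations
seg_b = rng.normal(size=(60, 8))
label = pseudo_label(temporal_crop(seg_a), temporal_crop(seg_b))
print("pseudo-label:", label)
```

In this reading, unlabeled pairs whose predicted preference is not confident enough are simply dropped, while cropping produces additional (sub)segment pairs that reuse the same label; the exact thresholds and crop lengths are assumptions for illustration only.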
