Paper Title

Joint Goal and Strategy Inference across Heterogeneous Demonstrators via Reward Network Distillation

Paper Authors

Letian Chen, Rohan Paleja, Muyleng Ghuy, Matthew Gombolay

Paper Abstract

Reinforcement learning (RL) has achieved tremendous success as a general framework for learning how to make decisions. However, this success relies on the interactive hand-tuning of a reward function by RL experts. On the other hand, inverse reinforcement learning (IRL) seeks to learn a reward function from readily-obtained human demonstrations. Yet, IRL suffers from two major limitations: 1) reward ambiguity - there is an infinite number of possible reward functions that could explain an expert's demonstration, and 2) heterogeneity - human experts adopt varying strategies and preferences, which makes learning from multiple demonstrators difficult due to the common assumption that demonstrators seek to maximize the same reward. In this work, we propose a method to jointly infer a task goal and humans' strategic preferences via network distillation. This approach enables us to distill a robust task reward (addressing reward ambiguity) and to model each strategy's objective (handling heterogeneity). We demonstrate that our algorithm can better recover the task reward and strategy rewards and imitate the strategies in two simulated tasks and a real-world table tennis task.
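The abstract describes distilling a shared task reward alongside per-demonstrator strategy rewards. Below is a minimal sketch of that decomposition in Python/PyTorch, assuming state-action reward networks; the class name RewardNet, the additive combination of the two terms, and the squared-magnitude penalty on the strategy term are illustrative assumptions, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Small MLP mapping a state-action pair to a scalar reward (illustrative architecture)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


# Hypothetical dimensions for illustration only.
obs_dim, act_dim, n_strategies = 8, 2, 3

task_reward = RewardNet(obs_dim, act_dim)          # shared across all demonstrators
strategy_rewards = nn.ModuleList(                  # one network per demonstrated strategy
    [RewardNet(obs_dim, act_dim) for _ in range(n_strategies)]
)


def combined_reward(i, obs, act):
    """Reward attributed to demonstrator i: shared task term plus strategy-specific term."""
    return task_reward(obs, act) + strategy_rewards[i](obs, act)


def distillation_penalty(i, obs, act, coef=0.01):
    """Push the strategy-specific term toward zero so that structure common to all
    demonstrators is absorbed by the shared task reward (the distillation step)."""
    return coef * strategy_rewards[i](obs, act).pow(2).mean()
```

In a full pipeline, this decomposition would sit inside an IRL training loop (e.g., an adversarial IRL objective), where demonstrator i's trajectories train combined_reward(i, ...) against policy rollouts and the penalty is added to that loss; the outer loop is omitted here.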
