Paper Title
Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble
Paper Authors
Paper Abstract
Recovering reward functions from expert demonstrations is a fundamental problem in reinforcement learning. The recovered reward function captures the motivation of the expert. Agents can imitate the expert by optimizing this reward function in their own environment, which is known as apprenticeship learning. However, agents may face environments that differ from those in the demonstrations and therefore need transferable reward functions. Classical reward learning methods such as inverse reinforcement learning (IRL) or, equivalently, adversarial imitation learning (AIL) recover reward functions coupled with the training dynamics, which are hard to transfer. Previous dynamics-agnostic reward learning methods rely on assumptions such as the reward function being state-only, which restricts their applicability. In this work, we present a dynamics-agnostic discriminator-ensemble reward learning method (DARL) within the AIL framework, capable of learning both state-action and state-only reward functions. DARL achieves this by decoupling the reward function from the training dynamics, employing a dynamics-agnostic discriminator on a latent space derived from the original state-action space. This latent space is optimized to minimize information about the dynamics. We further identify a policy-dependency issue in the AIL framework that reduces transferability; DARL therefore represents the reward function as an ensemble of discriminators collected during training to eliminate the policy dependency. Empirical studies on MuJoCo tasks with changed dynamics show that DARL recovers the reward function more faithfully and achieves better imitation performance in transferred environments, handling both state-only and state-action reward scenarios.
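To make the abstract's description of the method concrete, below is a minimal sketch of a discriminator-ensemble reward computed on a latent encoding of state-action pairs, in the spirit of the AIL setup described above. It assumes PyTorch, and all names (Encoder, Discriminator, LATENT_DIM, N_DISCRIMINATORS) and sizes are hypothetical illustrations, not the authors' implementation; the latent-space information-minimization objective and training loop are omitted.

```python
# Illustrative sketch only: a discriminator ensemble over a latent state-action
# encoding, with the reward taken as the ensemble-average AIL-style reward.
# Module names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

LATENT_DIM = 32        # assumed size of the dynamics-agnostic latent space
N_DISCRIMINATORS = 5   # assumed ensemble size

class Encoder(nn.Module):
    """Maps (state, action) pairs to a latent code that, in the full method,
    would be trained to carry little information about the training dynamics."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Discriminator(nn.Module):
    """Binary classifier over latent codes: expert vs. policy samples."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z):
        return self.net(z)  # returns logits

def ensemble_reward(encoder, discriminators, state, action):
    """Reward as the average over the discriminator ensemble, using the
    common -log(1 - D) reward shape from adversarial imitation learning."""
    z = encoder(state, action)
    rewards = []
    for disc in discriminators:
        d = torch.sigmoid(disc(z))
        rewards.append(-torch.log(1.0 - d + 1e-8))
    return torch.stack(rewards, dim=0).mean(dim=0)

# Usage example with random tensors standing in for environment data.
state_dim, action_dim = 17, 6
encoder = Encoder(state_dim, action_dim)
discriminators = [Discriminator() for _ in range(N_DISCRIMINATORS)]
s = torch.randn(4, state_dim)
a = torch.randn(4, action_dim)
print(ensemble_reward(encoder, discriminators, s, a).shape)  # torch.Size([4, 1])
```

In this sketch the ensemble members would be snapshots of the discriminator taken at different points of adversarial training, so that averaging them reduces dependence on any single training-time policy; a state-only variant would simply drop the action input from the encoder.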