Paper Title
Robust Asymmetric Learning in POMDPs
Paper Authors
Paper Abstract
Policies for partially observed Markov decision processes can be efficiently learned by imitating policies for the corresponding fully observed Markov decision processes. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and so may encourage actions that are sub-optimal, even unsafe, under partial information. We derive an objective to instead train the expert to maximize the expected reward of the imitating agent policy, and use it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and the agent. We show that A2D produces an expert policy that the agent can safely imitate, in turn outperforming policies learned by imitating a fixed expert.
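The joint training described in the abstract can be sketched informally as a loop that alternates an RL update on the expert with an imitation update on the agent, under rollouts of a DAgger-style mixture of the two. The sketch below is a minimal illustration under toy assumptions, not the authors' implementation: the one-step environment, the linear softmax policies, and all names (LinearPolicy, observe, grad_step, the mixture schedule beta) are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM, N_ACTIONS = 4, 2, 3

def observe(state):
    # Partial observability: the agent sees only the first OBS_DIM dimensions.
    return state[:OBS_DIM]

class LinearPolicy:
    """Softmax policy, linear in its input features (illustrative only)."""
    def __init__(self, in_dim, lr=0.05):
        self.W = np.zeros((N_ACTIONS, in_dim))
        self.lr = lr

    def probs(self, x):
        z = self.W @ x
        e = np.exp(z - z.max())
        return e / e.sum()

    def act(self, x):
        return int(rng.choice(N_ACTIONS, p=self.probs(x)))

    def grad_step(self, x, action, weight):
        # REINFORCE-style update: raise log-prob of `action`, scaled by `weight`.
        g = -np.outer(self.probs(x), x)
        g[action] += x
        self.W += self.lr * weight * g

expert = LinearPolicy(STATE_DIM)  # sees the full MDP state
agent = LinearPolicy(OBS_DIM)     # sees only the partial observation

for it in range(500):
    beta = max(0.0, 1.0 - it / 250)  # mixture weight, annealed toward the agent
    state = rng.normal(size=STATE_DIM)
    obs = observe(state)
    # Roll out the beta-mixture of expert and agent, DAgger-style.
    action = expert.act(state) if rng.random() < beta else agent.act(obs)
    reward = float(action == int(state[0] > 0))  # toy one-step reward
    # RL step on the expert: maximize the mixture's expected reward, so the
    # expert adapts toward behaviour the partially observing agent can match.
    expert.grad_step(state, action, reward)
    # Imitation step on the agent: project the expert's action choice onto
    # the agent's observation space (the asymmetric-imitation step).
    agent.grad_step(obs, expert.act(state), 1.0)

The structural point this sketch tries to capture is the one the abstract makes: the expert's RL update is driven by rollouts of the mixture policy rather than by its own optimal behaviour, which is what distinguishes jointly training the expert (A2D) from imitating a fixed, fully informed expert.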