It Takes Four to Tango: Multiagent Selfplay for Automatic Curriculum Generation

Paper Title

It Takes Four to Tango: Multiagent Selfplay for Automatic Curriculum Generation

Authors

Du, Yuqing, Abbeel, Pieter, Grover, Aditya

Abstract

We are interested in training general-purpose reinforcement learning agents that can solve a wide variety of goals. Training such agents efficiently requires automatic generation of a goal curriculum. This is challenging as it requires (a) exploring goals of increasing difficulty, while ensuring that the agent (b) is exposed to a diverse set of goals in a sample-efficient manner and (c) does not catastrophically forget previously solved goals. We propose Curriculum Self Play (CuSP), an automated goal generation framework that seeks to satisfy these desiderata by virtue of a multi-player game with four agents. We extend the asymmetric curriculum learning in PAIRED (Dennis et al., 2020) to a symmetrized game that carefully balances cooperation and competition between two off-policy student learners and two regret-maximizing teachers. CuSP additionally introduces entropic goal coverage and accounts for the non-stationary nature of the students, allowing us to automatically induce a curriculum that balances progressive exploration with anti-catastrophic exploitation. We demonstrate that our method succeeds at generating effective curricula of goals for a range of control tasks, outperforming other methods at zero-shot test-time generalization to novel out-of-distribution goals.
