Paper Title

On-the-fly Strategy Adaptation for ad-hoc Agent Coordination

Paper Authors

Jaleh Zand, Jack Parker-Holder, Stephen J. Roberts

Paper Abstract

Training agents in cooperative settings offers the promise of AI agents able to interact effectively with humans (and other agents) in the real world. Multi-agent reinforcement learning (MARL) has the potential to achieve this goal, demonstrating success in a series of challenging problems. However, whilst these advances are significant, the vast majority of focus has been on the self-play paradigm. This often results in a coordination problem, caused by agents learning to make use of arbitrary conventions when playing with themselves. This means that even the strongest self-play agents may have very low cross-play with other agents, including other initializations of the same algorithm. In this paper we propose to solve this problem by adapting agent strategies on the fly, using a posterior belief over the other agents' strategies. Concretely, we consider the problem of selecting a strategy from a finite set of previously trained agents, to play with an unknown partner. We propose an extension of the classic statistical technique of Gibbs sampling to update beliefs about other agents and obtain close-to-optimal ad-hoc performance. Despite its simplicity, our method achieves strong cross-play with unseen partners in the challenging card game of Hanabi, demonstrating successful ad-hoc coordination without a priori knowledge of the partner's strategy.
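
To make the belief-tracking idea concrete: the paper's method is an extension of Gibbs sampling over partner strategies, which is not reproduced here. The following is only a minimal sketch, under the simplifying assumption that beliefs are updated with a plain sequential Bayes rule and that each candidate policy in the pool exposes an action-probability function. All names (`BeliefTracker`, `policy_pool`, etc.) are hypothetical and not from the paper.

```python
import numpy as np

class BeliefTracker:
    """Track a posterior over which pool policy an unknown partner plays.

    Simplified illustration: exact sequential Bayesian updating over a
    finite pool, not the Gibbs-sampling extension the paper proposes.
    """

    def __init__(self, policy_pool):
        # policy_pool: list of callables pi(action, observation) -> probability
        self.pool = policy_pool
        self.log_belief = np.zeros(len(policy_pool))  # uniform prior

    def update(self, observation, partner_action):
        # Bayes rule: P(pi | a) proportional to P(a | pi) * P(pi)
        for i, pi in enumerate(self.pool):
            likelihood = max(pi(partner_action, observation), 1e-12)
            self.log_belief[i] += np.log(likelihood)

    def posterior(self):
        # Normalise in log space for numerical stability.
        b = np.exp(self.log_belief - self.log_belief.max())
        return b / b.sum()

    def best_guess(self):
        # Index of the pool policy the partner most likely plays; the
        # agent would then adopt the strategy trained to pair with it.
        return int(np.argmax(self.log_belief))


# Toy usage: two candidate partner policies over three discrete actions.
pool = [
    lambda a, obs: [0.7, 0.2, 0.1][a],  # policy 0 favours action 0
    lambda a, obs: [0.1, 0.2, 0.7][a],  # policy 1 favours action 2
]
tracker = BeliefTracker(pool)
for action in [2, 2, 1, 2]:             # observed partner actions
    tracker.update(observation=None, partner_action=action)
print(tracker.posterior())              # mass concentrates on policy 1
print(tracker.best_guess())             # -> 1
```

In this toy run the posterior concentrates on policy 1 after a handful of observed actions, which is the essence of ad-hoc adaptation from a finite pool; the paper's Gibbs-sampling extension addresses the harder setting where such exact likelihood evaluation is not straightforward.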
