Paper Title
Heterogeneous-Agent Mirror Learning: A Continuum of Solutions to Cooperative MARL
Paper Authors
Paper Abstract
The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavors have been focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, thereby lacking theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper, we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL algorithmic designs. We prove that algorithms derived from the HAML template satisfy the desired properties of the monotonic improvement of the joint reward and the convergence to Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraftII and Multi-Agent MuJoCo tasks.
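The abstract describes HAML as a template in which agents improve their policies one after another rather than all at once. Below is a minimal, hypothetical Python sketch of such a sequential per-agent update loop, offered only as an illustration of the general scheme; the helpers `estimate_advantage` and `improve_policy` are assumed placeholders (e.g. an advantage estimator and a clipped or trust-region single-agent improvement step), not the paper's actual API.

```python
# Hypothetical sketch of a HAML-style sequential update: agents are visited in a
# random order, and each agent improves its policy against the joint advantage
# conditioned on the agents that have already been updated this iteration.
import random
from typing import Callable, Dict, List


def haml_iteration(
    policies: Dict[str, object],
    estimate_advantage: Callable[[Dict[str, object], List[str], str], object],
    improve_policy: Callable[[object, object], object],
) -> Dict[str, object]:
    """One illustrative HAML-style iteration over all agents."""
    agents = list(policies.keys())
    random.shuffle(agents)            # random permutation of the update order
    new_policies = dict(policies)     # start from the current joint policy
    for m, agent in enumerate(agents):
        # Advantage for `agent`, computed with the already-updated policies of
        # agents[:m] and the old policies of the remaining agents.
        advantage = estimate_advantage(new_policies, agents[:m], agent)
        # Single-agent improvement step; any drift penalty or neighbourhood
        # constraint is assumed to live inside `improve_policy`.
        new_policies[agent] = improve_policy(policies[agent], advantage)
    return new_policies
```

Under this reading, instantiating `improve_policy` with different single-agent update rules is what yields the different HAML instances mentioned in the abstract (e.g. HAPPO, HATRPO, HAA2C, HADDPG), though the precise objectives and constraints are defined in the paper itself.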