Paper Title

Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning

Paper Authors

Qi Tian, Kun Kuang, Furui Liu, Baoxiang Wang

Paper Abstract

Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, which is an important step toward the deployment of multi-agent systems in real-world applications. However, in practice, each individual behavior policy that generates multi-agent joint trajectories usually has a different level of performance, e.g., one agent may follow a random policy while the other agents follow medium policies. In cooperative games with a global reward, an agent learned by existing offline MARL methods often inherits this random policy, jeopardizing the performance of the entire team. In this paper, we investigate offline MARL with explicit consideration of the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline dataset into a prioritized experience replay of individual trajectories, after which agents can share their good trajectories and conservatively train their policies with a graph attention network (GAT) based critic. We evaluate our method in both discrete control (i.e., StarCraft II and the multi-agent particle environment) and continuous control (i.e., Multi-Agent MuJoCo). The results indicate that our method achieves significantly better results on complex and mixed offline multi-agent datasets, especially when the difference in data quality between individual trajectories is large.
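To make the credit-assignment step in the abstract more concrete, the sketch below shows one way an attention-based reward decomposition network reading a differentiable key-value memory might be written. This is not the authors' implementation: the class name, layer sizes, memory-slot count, and the additive-decomposition training loss are all assumptions chosen for illustration.

```python
# Illustrative sketch only (not the paper's code): each agent's observation-action
# embedding queries a learned key-value memory; the read-out is mapped to a scalar
# per-agent credit. All names and sizes below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCreditAssigner(nn.Module):
    def __init__(self, obs_dim, act_dim, emb_dim=64, n_slots=16):
        super().__init__()
        self.query = nn.Linear(obs_dim + act_dim, emb_dim)         # per-agent query
        self.keys = nn.Parameter(torch.randn(n_slots, emb_dim))    # memory keys
        self.values = nn.Parameter(torch.randn(n_slots, emb_dim))  # memory values
        self.score = nn.Linear(emb_dim, 1)                         # read-out -> scalar credit

    def forward(self, obs, act):
        # obs: [batch, n_agents, obs_dim], act: [batch, n_agents, act_dim]
        q = self.query(torch.cat([obs, act], dim=-1))                          # [B, N, E]
        attn = torch.softmax(q @ self.keys.t() / q.shape[-1] ** 0.5, dim=-1)   # [B, N, S]
        read = attn @ self.values                                              # [B, N, E]
        return self.score(read).squeeze(-1)                                    # [B, N] credits

def decomposition_loss(credits, global_reward):
    # One plausible offline training signal: per-agent credits should sum to the
    # observed team reward at each transition (an additive decomposition).
    return F.mse_loss(credits.sum(dim=-1), global_reward)
```

Credits produced this way could then be used to rank individual trajectories when rebuilding the joint dataset into a per-agent prioritized replay, which is the role the abstract assigns to the decomposed credits.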
