Paper Title

PAC: Assisted Value Factorisation with Counterfactual Predictions in Multi-Agent Reinforcement Learning

Paper Authors

Hanhan Zhou, Tian Lan, Vaneet Aggarwal

Paper Abstract

Multi-agent reinforcement learning (MARL) has witnessed significant progress with the development of value function factorization methods. Thanks to monotonicity, these methods allow a joint action-value function to be optimized through the maximization of factorized per-agent utilities. In this paper, we show that in partially observable MARL problems, an agent's ordering over its own actions can impose concurrent constraints (across different states) on the representable function class, causing significant estimation error during training. We tackle this limitation and propose PAC, a new framework leveraging Assistive information generated from Counterfactual Predictions of optimal joint action selection, which enables explicit assistance to value function factorization through a novel counterfactual loss. A variational inference-based information encoding method is developed to collect and encode the counterfactual predictions from an estimated baseline. To enable decentralized execution, we also derive factorized per-agent policies inspired by a maximum-entropy MARL framework. We evaluate the proposed PAC on multi-agent predator-prey and a set of StarCraft II micromanagement tasks. Empirical results demonstrate the improved performance of PAC over state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms on all benchmarks.
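For context, the monotonic value factorization the abstract refers to follows the standard QMIX-style setup; the sketch below is general background, not material from the paper itself. The joint action-value function Q_tot is represented as a monotonic mixing of per-agent utilities,

\[
Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = f\big(Q_1(\tau_1, u_1), \ldots, Q_n(\tau_n, u_n)\big),
\qquad \frac{\partial Q_{tot}}{\partial Q_i} \ge 0 \;\; \forall i,
\]

which guarantees the Individual-Global-Max (IGM) property, i.e. the joint greedy action decomposes into per-agent greedy actions:

\[
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u})
= \Big(\arg\max_{u_1} Q_1(\tau_1, u_1), \ldots, \arg\max_{u_n} Q_n(\tau_n, u_n)\Big).
\]

Under partial observability, each utility Q_i must commit to a single ordering over agent i's actions for a given observation history \tau_i, even when that history arises in different underlying states; this is the kind of concurrent constraint on the representable function class that the abstract argues PAC's counterfactual predictions help relieve.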
