Paper Title
Resolving Implicit Coordination in Multi-Agent Deep Reinforcement Learning with Deep Q-Networks & Game Theory
Paper Authors
Paper Abstract
We address two major challenges of implicit coordination in multi-agent deep reinforcement learning, non-stationarity and the exponential growth of the state-action space, by combining Deep Q-Networks for policy learning with Nash equilibria for action selection. Q-values act as proxies for payoffs in the Nash setting, and mutual best responses define joint action selection. Coordination is implicit because cases with multiple or no Nash equilibria are resolved deterministically. We demonstrate that knowledge of the game type permits an assumption of mirrored best responses and yields faster convergence than Nash-Q. Specifically, the Friend-or-Foe algorithm shows signs of convergence toward a Set Controller that jointly chooses actions for both agents. This is encouraging given the highly unstable nature of decentralized coordination over joint actions. Inspired by the dueling network architecture, which decouples the Q-function into state and advantage streams, as well as by residual networks, we learn both a single-agent and a joint-agent representation and merge them via element-wise addition. This simplifies coordination by recasting it as learning a residual function. We also draw high-level comparative insights on key MADRL and game-theoretic variables: competitive vs. cooperative settings, asynchronous vs. parallel learning, greedy vs. socially optimal tie-breaking among Nash equilibria, and strategies for the case with no Nash equilibrium. We evaluate on three custom environments written in Python using OpenAI Gym: a Predator-Prey environment, an alternating Warehouse environment, and a Synchronization environment. Each environment requires successively more coordination to achieve positive rewards.
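To illustrate the action-selection mechanism described in the abstract, here is a minimal sketch, under assumed names (`select_joint_action`, `q1`, `q2` are hypothetical and not from the paper), of treating the two agents' Q-values as payoffs of a bimatrix game, selecting a mutual best response, breaking ties toward the socially optimal equilibrium, and falling back to a jointly greedy choice when no pure Nash equilibrium exists. It is not the authors' exact implementation.

```python
import itertools
import numpy as np


def select_joint_action(q1, q2):
    """Pick a joint action for two agents from their Q-values.

    q1, q2: arrays of shape (n_actions_1, n_actions_2) holding each agent's
    Q-values for every joint action in the current state. The Q-values are
    treated as the payoff matrices of a bimatrix game.
    """
    n1, n2 = q1.shape
    equilibria = []
    for a1, a2 in itertools.product(range(n1), range(n2)):
        # (a1, a2) is a pure-strategy Nash equilibrium if each agent's action
        # is a best response to the other agent's action.
        if q1[a1, a2] >= q1[:, a2].max() and q2[a1, a2] >= q2[a1, :].max():
            equilibria.append((a1, a2))
    if not equilibria:
        # No pure Nash equilibrium: fall back to the jointly greedy action.
        return np.unravel_index((q1 + q2).argmax(), q1.shape)
    # Multiple equilibria: deterministic tie-break toward the socially
    # optimal one (highest summed payoff), keeping coordination implicit.
    return max(equilibria, key=lambda a: q1[a] + q2[a])
```

Because both agents apply the same deterministic rule to the same (or mirrored) payoff estimates, they select the same joint action without any explicit communication, which is what makes the coordination implicit.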