Paper Title
A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games
Paper Authors
Paper Abstract
This paper proposes new, end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Unlike prior efforts that train agents to beat a fixed set of opponents, our objective is to find Nash equilibrium policies that cannot be exploited even by adversarial opponents. We propose (a) the Nash-DQN algorithm, which integrates deep learning techniques from single-agent DQN into the classic Nash Q-learning algorithm for solving tabular Markov games; and (b) the Nash-DQN-Exploiter algorithm, which additionally trains an exploiter to guide the exploration of the main agent. We conduct experimental evaluations on tabular examples as well as various two-player Atari games. Our empirical results demonstrate that (i) the policies found by many existing methods, including Neural Fictitious Self-Play and Policy Space Response Oracles, can be prone to exploitation by adversarial opponents; and (ii) the output policies of our algorithms are robust to exploitation, and thus outperform existing methods.
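To make the core idea of Nash-DQN concrete, the following is a minimal sketch (not the authors' code) of the bootstrap target it uses: instead of DQN's max over the agent's own actions, the target bootstraps with the Nash value of the zero-sum matrix game induced by the Q-values of the joint actions at the next state. The function names, the assumption that the Q-network outputs an |A1| x |A2| payoff matrix per state, and the use of linear programming to compute the matrix-game value are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a Nash-DQN-style bootstrap target for a two-player zero-sum Markov game.
# Assumes the Q-network produces a payoff matrix Q(s, a1, a2) for the row (maximizing) player.
import numpy as np
from scipy.optimize import linprog

def zero_sum_game_value(payoff):
    """Value of a zero-sum matrix game (row player maximizes) via linear programming."""
    n_rows, n_cols = payoff.shape
    # Variables: row player's mixed strategy x (n_rows entries) and game value v; minimize -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i payoff[i, j] * x_i <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one (the value variable is excluded from this constraint).
    A_eq = np.ones((1, n_rows + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1]

def nash_dqn_target(reward, done, next_q_matrix, gamma=0.99):
    """Target r + gamma * NashValue(Q(s', ., .)), replacing DQN's max over own actions."""
    if done:
        return reward
    return reward + gamma * zero_sum_game_value(np.asarray(next_q_matrix))

# Usage example on a toy 2x2 matching-pennies-like stage game:
# nash_dqn_target(0.0, False, [[1.0, -1.0], [-1.0, 1.0]]) -> 0.0 (the game's Nash value is 0).
```

This target is what distinguishes the approach from best-response training against a fixed opponent: the bootstrap value already accounts for a worst-case (equilibrium) opponent at the next state.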