Paper Title
Imitating Opponent to Win: Adversarial Policy Imitation Learning in Two-player Competitive Games
Paper Authors
Paper Abstract
Recent research on vulnerabilities of deep reinforcement learning (RL) has shown that adversarial policies adopted by an adversary agent can influence a target RL agent (victim agent) to perform poorly in a multi-agent environment. In existing studies, adversarial policies are trained directly on experiences of interacting with the victim agent. This approach has a key shortcoming: knowledge derived from historical interactions may not generalize properly to unexplored policy regions of the victim agent, making the trained adversarial policy significantly less effective. In this work, we design a new, effective adversarial policy learning algorithm that overcomes this shortcoming. The core idea of our algorithm is to create an imitator that imitates the victim agent's policy, while the adversarial policy is trained based not only on interactions with the victim agent but also on feedback from the imitator that forecasts the victim's intention. By doing so, we can leverage the capability of imitation learning to capture the underlying characteristics of the victim policy based solely on the victim's sample trajectories. Our victim imitation learning model differs from prior models in that the environment's dynamics are driven by the adversary's policy and keep changing during adversarial policy training. We provide a provable bound that guarantees a desired imitating policy when the adversary's policy becomes stable. We further strengthen our adversarial policy learning by making the imitator a stronger version of the victim. Finally, our extensive experiments on four competitive MuJoCo game environments show that our proposed adversarial policy learning algorithm outperforms state-of-the-art algorithms.
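Below is a minimal conceptual sketch of the core idea described in the abstract, not the authors' implementation. It assumes the imitator is a behavioral-cloning network trained on the victim's observed (observation, action) pairs, and that the imitator's feedback is used by appending its forecast of the victim's next action to the adversary's observation; this particular use of the feedback, along with the names ImitatorNet, imitator_bc_step, adversary_input, and all dimensions, are illustrative assumptions.

```python
# Hypothetical sketch of victim-imitation feedback for adversarial policy learning.
# Not the paper's implementation; names and the feedback mechanism are assumptions.

import torch
import torch.nn as nn

class ImitatorNet(nn.Module):
    """Forecasts the victim's action from the victim's current observation."""
    def __init__(self, victim_obs_dim, victim_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(victim_obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, victim_act_dim),
        )

    def forward(self, victim_obs):
        return self.net(victim_obs)

def imitator_bc_step(imitator, optimizer, victim_obs, victim_act):
    """One behavioral-cloning step on victim trajectories gathered so far.
    Because the adversary's policy keeps changing during training, the visited
    region of the victim's policy also shifts, so this step is repeated."""
    loss = ((imitator(victim_obs) - victim_act) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def adversary_input(adv_obs, victim_obs, imitator):
    """Feedback from the imitator: append its forecast of the victim's next
    action to the adversary's observation before querying the adversary policy."""
    with torch.no_grad():
        forecast = imitator(victim_obs)
    return torch.cat([adv_obs, forecast], dim=-1)

if __name__ == "__main__":
    # Illustrative dimensions only; a real MuJoCo task defines these.
    imitator = ImitatorNet(victim_obs_dim=17, victim_act_dim=6)
    opt = torch.optim.Adam(imitator.parameters(), lr=1e-3)
    victim_obs = torch.randn(128, 17)
    victim_act = torch.randn(128, 6)
    print("BC loss:", imitator_bc_step(imitator, opt, victim_obs, victim_act))
    adv_obs = torch.randn(1, 17)
    print("Adversary input shape:", adversary_input(adv_obs, victim_obs[:1], imitator).shape)
```

The adversarial policy itself would then be trained with a standard RL algorithm (e.g., PPO) on these augmented inputs, with the imitator updated continually as new victim trajectories are collected.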