Paper Title


ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret

Authors

McAleer, Stephen, Farina, Gabriele, Lanctot, Marc, Sandholm, Tuomas

Abstract


Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of the estimated regret of ESCHER is orders of magnitude lower than DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- on a number of games and the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER is able to beat DREAM and NFSP in a head-to-head competition over $90\%$ of the time.
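The abstract's central point, that MCCFR-style importance sampling can make regret targets extremely high-variance while a value-function estimate avoids the problem, can be illustrated with a toy sketch. This is not the paper's algorithm: the action values, sampling policy, and the oracle "value function" below are all hypothetical stand-ins, chosen only to show how the $1/q(a)$ importance weight inflates variance for rarely sampled actions.

```python
import random
import statistics

# Hypothetical counterfactual values for 3 actions at one decision point,
# and a skewed sampling policy that rarely visits actions 1 and 2.
true_values = [1.0, -0.5, 0.3]
sample_probs = [0.9, 0.05, 0.05]

def mccfr_style_estimate():
    """Outcome-sampling-style estimate: only the sampled action gets a
    nonzero value, reweighted by 1/prob. Unbiased, but the importance
    weight 1/q(a) can be huge for rarely sampled actions."""
    a = random.choices(range(3), weights=sample_probs)[0]
    est = [0.0, 0.0, 0.0]
    est[a] = true_values[a] / sample_probs[a]
    return est

def value_fn_style_estimate():
    """ESCHER-flavored estimate: query a history value function for every
    action, so no importance weight appears. Here we use an oracle; in the
    paper a learned neural value function plays this role."""
    return list(true_values)

def empirical_variance(estimator, action, n=20_000):
    samples = [estimator()[action] for _ in range(n)]
    return statistics.pvariance(samples)

random.seed(0)
for a in range(3):
    print(f"action {a}: IS variance = {empirical_variance(mccfr_style_estimate, a):8.2f}, "
          f"value-fn variance = {empirical_variance(value_fn_style_estimate, a):.2f}")
```

For the rare actions the importance-sampled estimate is $v(a)/q(a)$ with probability $q(a)$ and $0$ otherwise, so its variance scales like $v(a)^2/q(a)$, while the value-function estimate here has zero variance. In the real algorithms the value function is learned and therefore noisy and biased, but its error does not blow up with $1/q(a)$.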
