Paper Title
Q-Learning in enormous action spaces via amortized approximate maximization
Paper Authors
Paper Abstract
Applying Q-learning to high-dimensional or continuous action spaces can be difficult due to the required maximization over the set of possible actions. Motivated by techniques from amortized inference, we replace the expensive maximization over all actions with a maximization over a small subset of possible actions sampled from a learned proposal distribution. The resulting approach, which we dub Amortized Q-learning (AQL), is able to handle discrete, continuous, or hybrid action spaces while maintaining the benefits of Q-learning. Our experiments on continuous control tasks with up to 21-dimensional actions show that AQL outperforms D3PG (Barth-Maron et al., 2018) and QT-Opt (Kalashnikov et al., 2018). Experiments on structured discrete action spaces demonstrate that AQL can efficiently learn good policies in spaces with thousands of discrete actions.
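The maximization step the abstract describes is straightforward to sketch: draw a small set of candidate actions from the proposal distribution, score them with the Q-function, and keep the best one. The following is a minimal illustration of that idea, not the authors' implementation: `q_network` and `sample_proposal` are hypothetical stand-ins for a trained Q-function and a trained proposal distribution, and `num_samples` is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_network(state, actions):
    # Hypothetical learned Q-function: returns one scalar per candidate action.
    # A fixed nonlinear stub stands in for a trained network here.
    return actions @ np.sin(state)

def sample_proposal(state, num_samples, action_dim):
    # Hypothetical learned proposal distribution over continuous actions.
    # A trained version would condition its parameters on the state;
    # this stub just samples from a state-independent Gaussian.
    return rng.normal(size=(num_samples, action_dim))

def amortized_argmax(state, num_samples=100, action_dim=21):
    # The key substitution: instead of maximizing Q over the full action
    # space, maximize over a small sampled subset of candidate actions.
    candidates = sample_proposal(state, num_samples, action_dim)
    q_values = q_network(state, candidates)
    return candidates[np.argmax(q_values)]

# Toy usage with a 21-dimensional state and action space.
state = rng.normal(size=21)
best_action = amortized_argmax(state)
print(best_action.shape)  # (21,)
```

Because the argmax only touches `num_samples` candidates, the per-step cost is independent of the size of the action space, which is what lets the approach scale to thousands of discrete actions or high-dimensional continuous ones.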