Paper Title

Amortized Variational Deep Q Network

Paper Authors

Haotian Zhang, Yuhao Wang, Jianyong Sun, Zongben Xu

Paper Abstract

Efficient exploration is one of the most important issues in deep reinforcement learning. To address this issue, recent methods treat the value function parameters as random variables and resort to variational inference to approximate the posterior of the parameters. In this paper, we propose an amortized variational inference framework to approximate the posterior distribution of the action value function in Deep Q Network. We establish the equivalence between the loss of the new model and the amortized variational inference loss. We balance exploration and exploitation by assuming the posterior to be Cauchy and Gaussian, respectively, in a two-stage training process. We show that the amortized framework results in significantly fewer learnable parameters than existing state-of-the-art methods. Experimental results on classical control tasks in OpenAI Gym and chain Markov Decision Process tasks show that the proposed method performs significantly better than state-of-the-art methods and requires much less training time.
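To make the idea in the abstract concrete, the following is a minimal, hypothetical sketch of a Q-network that outputs a distribution over action values (an approximate posterior) and selects actions by sampling from it, using a heavy-tailed Cauchy posterior in an exploration stage and a Gaussian posterior in an exploitation stage. All class, function, and parameter names here (e.g. VariationalQNetwork, select_action) are illustrative assumptions, not the authors' implementation or the paper's amortized inference objective.

```python
# Illustrative sketch only: a Q-network whose head outputs a location and scale
# per action, defining an assumed posterior over Q-values; actions are chosen
# by acting greedily on a posterior sample (Thompson-style exploration).
import torch
import torch.nn as nn
from torch.distributions import Cauchy, Normal


class VariationalQNetwork(nn.Module):
    """Outputs a location and a scale per action, parameterizing a Q-value posterior."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.loc_head = nn.Linear(hidden, num_actions)    # posterior location
        self.scale_head = nn.Linear(hidden, num_actions)  # posterior scale (pre-softplus)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        loc = self.loc_head(h)
        # softplus keeps the scale strictly positive
        scale = torch.nn.functional.softplus(self.scale_head(h)) + 1e-4
        return loc, scale


def select_action(net: VariationalQNetwork, obs: torch.Tensor, stage: str) -> int:
    """Sample Q-values from the assumed posterior and act greedily on the sample.

    A heavy-tailed Cauchy posterior ("explore" stage) yields broader exploration;
    a Gaussian posterior ("exploit" stage) concentrates on exploitation.
    """
    loc, scale = net(obs)
    dist = Cauchy(loc, scale) if stage == "explore" else Normal(loc, scale)
    q_sample = dist.sample()
    return int(q_sample.argmax(dim=-1).item())


if __name__ == "__main__":
    net = VariationalQNetwork(obs_dim=4, num_actions=2)  # e.g. CartPole-sized problem
    obs = torch.zeros(1, 4)
    print(select_action(net, obs, stage="explore"))
```

The two-stage switch between Cauchy and Gaussian sampling above is only meant to convey the exploration/exploitation split described in the abstract; the paper's amortized training loss and its equivalence argument are not reproduced here.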
