Paper Title
Deep Q-learning: a robust control approach
Paper Authors
Paper Abstract
In this paper, we place deep Q-learning into a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe learning. We show the instability of learning and analyze the agent's behavior in the frequency domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: a gain-scheduled state-feedback H2 controller, a dynamic H∞ controller, and a constant-gain H∞ controller. Setting up the learning agent with a control-oriented tuning methodology is more transparent and rests on a better-established literature than the heuristics common in reinforcement learning. In addition, our approach uses neither a target network nor a randomized replay memory. The role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that H∞-controlled learning performs slightly better than double deep Q-learning.
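A minimal sketch of the central idea as described in the abstract: a deep Q-learning loss that uses no target network, with a controller output entering the TD target as an additive, dynamical reward. The network architecture, the function controlled_td_loss, and the variable u are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network (illustrative architecture)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def controlled_td_loss(q_net, batch, u, gamma=0.99):
    """TD loss without a target network: the bootstrap target is computed
    with the online network itself, and the controller output `u` enters
    the target as an additive, dynamical reward term (hedged sketch)."""
    obs, act, rew, next_obs, done = batch  # tensors from a temporally ordered mini-batch
    q_pred = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(next_obs).max(dim=1).values
        target = rew + u + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q_pred, target)

# Illustrative usage: transitions are consumed in temporal order instead of
# being drawn from a randomized replay buffer.
# q_net = QNet(obs_dim=4, n_actions=2)
# loss = controlled_td_loss(q_net, batch, u=controller_output)
```

Here u would be produced by one of the robust controllers (H2 or H∞) synthesized from the neural-tangent-kernel-based linear time-invariant model of the learning dynamics; how that controller is synthesized is the subject of the paper itself.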