Paper Title


Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning

Paper Authors

Szpruch, Lukasz, Treetanthiploet, Tanut, Zhang, Yufei

Paper Abstract


This work uses the entropy-regularised relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning but, on the other hand, introduce bias by assigning a positive probability to non-optimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularisation. We study algorithms resulting from two entropy regularisation formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalises policy divergence between consecutive episodes. We focus on the finite-horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularisation, we prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
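
The sketch below illustrates, under simplifying assumptions, the kind of policy execution the abstract describes: on a discretised scalar LQ problem, controls are drawn from a Gaussian relaxed policy whose variance plays the role of the entropy-regularisation strength, with a fresh (time-independent) noise draw at every sampling time, and the variance is decayed across episodes. All coefficients, the feedback gain `K`, and the decay schedule are illustrative placeholders and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar LQ coefficients (hypothetical values, not from the paper):
# dynamics dX_t = (A*X_t + B*u_t) dt + sigma dW_t,
# cost     int_0^T (Q*X_t^2 + R*u_t^2) dt + G*X_T^2.
A, B, sigma = 0.5, 1.0, 0.3
Q, R, G = 1.0, 1.0, 1.0
T = 1.0


def run_episode(K, tau, n_steps=200, x0=1.0):
    """Roll out one episode with controls u_t ~ N(-K(t)*x_t, tau).

    tau is the variance of the Gaussian relaxed policy, i.e. the strength of the
    entropy regularisation; a fresh noise draw is used at every sampling time,
    so the execution noise is independent across time.
    """
    dt = T / n_steps
    x, cost = x0, 0.0
    for i in range(n_steps):
        # Independent execution noise at each sampling time.
        u = -K(i * dt) * x + np.sqrt(tau) * rng.standard_normal()
        cost += (Q * x**2 + R * u**2) * dt
        x += (A * x + B * u) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return cost + G * x**2


# Placeholder constant gain; in the paper's algorithms the gain is obtained from
# a Riccati equation solved with the current estimate of the unknown drift.
K = lambda t: 1.0

# Decay the entropy strength across episodes (the schedule shown is illustrative).
for n in range(1, 6):
    tau_n = 1.0 / np.sqrt(n)
    print(f"episode {n}: tau={tau_n:.3f}, realised cost={run_episode(K, tau_n):.3f}")
```

Taking the policy variance to zero recovers the deterministic LQ feedback control, while larger variance explores more at the price of evaluation bias, which is the exploration-exploitation trade-off the abstract refers to.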
