Paper Title

Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning

Paper Authors

Harley Wiltzer, David Meger, Marc G. Bellemare

Paper Abstract

Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time, stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multiagent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for Itô diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly-weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
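
To make the abstract's particle representation concrete, below is a minimal, illustrative sketch, not the paper's algorithm: it maintains an $N$-particle, uniformly-weighted return-distribution estimate for a toy Itô diffusion under an Euler time discretization, updated with a standard quantile-regression step. All names (ou_step, particle_td_step, reward) and parameter values are assumptions made for illustration; the paper itself derives a distributional HJB, with additional statistical-diffusivity terms, and solves it approximately via a JKO scheme rather than the update shown here.

```python
# Illustrative sketch only (assumed, simplified setup; not the paper's method).
import numpy as np

N = 32           # number of uniformly-weighted particles per state
dt = 1e-2        # Euler time step standing in for continuous time
gamma = 0.99     # per-unit-time discount, applied as gamma**dt per step
lr = 0.05        # step size for the quantile-regression update
taus = (np.arange(N) + 0.5) / N   # quantile midpoints for N uniform atoms

def ou_step(x, rng, theta=1.0, sigma=0.3):
    """One Euler-Maruyama step of a toy Ornstein-Uhlenbeck state process."""
    return x - theta * x * dt + sigma * np.sqrt(dt) * rng.normal()

def reward(x):
    """Toy running reward accrued at rate -x^2 over the step dt."""
    return -x * x * dt

def particle_td_step(z, z_next, r):
    """Quantile-regression update of particles z toward the bootstrapped
    target r + gamma**dt * z_next, a discrete-time stand-in for the
    continuous-time return dynamics discussed in the abstract."""
    target = r + gamma ** dt * z_next          # (N,) bootstrapped samples
    diff = target[None, :] - z[:, None]        # pairwise TD errors
    grad = np.where(diff > 0, taus[:, None], taus[:, None] - 1.0)
    return z + lr * grad.mean(axis=1)          # average over target samples

# Tiny usage example on a two-bin state discretization of the OU process.
rng = np.random.default_rng(0)
bins = np.array([-np.inf, 0.0, np.inf])       # discrete "state": sign of x
Z = np.zeros((2, N))                          # particles per discrete state
x = 0.5
for _ in range(20000):
    s = np.digitize(x, bins) - 1
    x_next = ou_step(x, rng)
    s_next = np.digitize(x_next, bins) - 1
    Z[s] = particle_td_step(Z[s], Z[s_next], reward(x))
    x = x_next

print("estimated mean return for x<0 and x>=0:")
print(np.round(Z.mean(axis=1), 3))
```

Heuristically, the gamma ** dt factor plays the role of continuous-time discounting over a small step dt; as dt shrinks, updates of this form approach the continuous-time limit that the distributional HJB characterizes, which is where the statistical-diffusivity terms identified in the paper become relevant.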
