Paper Title

Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

Paper Authors

Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel

Paper Abstract

We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDP) whose state and action spaces are countably finite. Prior efforts are predominantly afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. The caution measures the distributional risk of a policy, which is a function of the policy's long-term state occupancy distribution. To solve this problem in an online, model-free manner, we propose a stochastic variant of the primal-dual method that uses Kullback-Leibler (KL) divergence as its proximal term. We establish that the number of iterations/samples required to attain approximately optimal solutions of this scheme matches tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burdens.
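
For readers less familiar with the dual (occupancy-measure) view mentioned in the abstract, the sketch below gives a rough sense of how a caution penalty can enter the dual LP of a discounted MDP. The notation (occupancy measure λ, penalty weight c, risk functional ρ, initial distribution μ₀, discount factor γ) is illustrative and is not taken from the paper.

```latex
% Illustrative sketch only: a convex risk penalty added to the dual LP of a
% discounted MDP. Symbols \lambda, c, \rho, \mu_0, \gamma are assumed notation,
% not necessarily the paper's.
\begin{aligned}
\max_{\lambda \ge 0} \quad & \sum_{s,a} \lambda(s,a)\, r(s,a)
    \;-\; c\, \rho\!\left( \sum_{a} \lambda(\cdot, a) \right) \\
\text{s.t.} \quad & \sum_{a} \lambda(s',a)
    \;=\; (1-\gamma)\, \mu_0(s') \;+\; \gamma \sum_{s,a} P(s' \mid s, a)\, \lambda(s,a)
    \qquad \forall\, s'.
\end{aligned}
```

The constraint is the standard Bellman-flow condition on occupancy measures (under one common normalization), and the penalty ρ acts on the induced long-term state occupancy distribution, which is how the abstract characterizes caution. With a KL divergence as the proximal term, a stochastic primal-dual step on λ typically takes a multiplicative (exponentiated-gradient) form, which keeps the iterates nonnegative.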
