Paper Title

Responsive Safety in Reinforcement Learning by PID Lagrangian Methods

Paper Authors

Adam Stooke, Joshua Achiam, Pieter Abbeel

Paper Abstract

Lagrangian methods are widely used algorithms for constrained optimization problems, but their learning dynamics exhibit oscillations and overshoot which, when applied to safe reinforcement learning, leads to constraint-violating behavior during agent training. We address this shortcoming by proposing a novel Lagrange multiplier update method that utilizes derivatives of the constraint function. We take a controls perspective, wherein the traditional Lagrange multiplier update behaves as integral control; our terms introduce proportional and derivative control, achieving favorable learning dynamics through damping and predictive measures. We apply our PID Lagrangian methods in deep RL, setting a new state of the art in Safety Gym, a safe RL benchmark. Lastly, we introduce a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost. Our extensive experiments demonstrate improved performance and hyperparameter robustness, while our algorithms remain nearly as simple to derive and implement as the traditional Lagrangian approach.
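For readers who want a concrete picture of the update described in the abstract, below is a minimal Python sketch of a PID-style Lagrange multiplier controller. The class name, gain values, and cost limit are illustrative assumptions rather than the paper's exact implementation; the traditional Lagrangian update corresponds to the integral term alone, while the proportional and derivative terms add damping and prediction.

```python
class PIDLagrangianUpdater:
    """Hedged sketch of a PID-style Lagrange multiplier update for safe RL.

    Pure integral control (k_p = k_d = 0) recovers the traditional
    Lagrangian multiplier update; the extra terms are the paper's idea,
    but gains and thresholds here are illustrative placeholders.
    """

    def __init__(self, k_p=0.1, k_i=0.01, k_d=0.1, cost_limit=25.0):
        # cost_limit is the constraint threshold d on episodic cost
        # (an assumed example value, not a tuned setting).
        self.k_p, self.k_i, self.k_d = k_p, k_i, k_d
        self.cost_limit = cost_limit
        self.integral = 0.0   # accumulated constraint violation
        self.prev_cost = 0.0  # previous measured cost, for the derivative term

    def update(self, episodic_cost):
        """Return a non-negative Lagrange multiplier from the measured cost."""
        error = episodic_cost - self.cost_limit               # proportional term
        self.integral = max(0.0, self.integral + error)       # integral term, kept >= 0
        derivative = max(0.0, episodic_cost - self.prev_cost) # penalize rising cost only
        self.prev_cost = episodic_cost
        return max(
            0.0,
            self.k_p * error + self.k_i * self.integral + self.k_d * derivative,
        )
```

In use, the returned multiplier would weight the cost term of the policy objective at each training iteration. The paper's additional technique for invariance to the relative scales of reward and cost (normalizing the combined objective) is not shown in this sketch.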
