Paper Title
Balancing Constraints and Rewards with Meta-Gradient D4PG
Paper Authors
Paper Abstract
Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where the task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable but not catastrophic, motivating the need for soft-constrained RL approaches. We present a soft-constrained RL approach that utilizes meta-gradients to find a good trade-off between expected return and minimizing constraint violations. We demonstrate the effectiveness of this approach by showing that it consistently outperforms the baselines across four different MuJoCo domains.
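To make the soft-constrained, meta-gradient idea in the abstract concrete, the sketch below adapts the penalty weight on constraint violations by differentiating a meta objective through one inner policy-update step. This is a minimal toy illustration under assumptions of my own (a 1-D quadratic problem, hand-picked learning rates, and a violation-based meta objective), not the paper's D4PG agent or its exact meta-gradient formulation.

```python
# Minimal sketch (assumptions throughout): a 1-D toy problem showing how a
# soft-constraint weight `lam` can be adapted by meta-gradients, i.e. by
# differentiating an outer objective through one inner update step.
import jax
import jax.numpy as jnp

threshold = 1.0   # constraint threshold (possibly mis-specified in practice)
inner_lr = 0.05   # step size of the inner (policy) update
meta_lr = 0.1     # step size of the outer (meta) update

def reward(theta):            # task return: maximized at theta = 2
    return -(theta - 2.0) ** 2

def cost(theta):              # constraint signal: want cost(theta) <= threshold
    return theta ** 2

def inner_update(theta, lam):
    """One ascent step on the soft-constrained (penalized) return."""
    penalized = lambda th: reward(th) - lam * cost(th)
    return theta + inner_lr * jax.grad(penalized)(theta)

def meta_objective(lam, theta):
    """Constraint violation after the inner update; differentiated w.r.t. lam."""
    theta_new = inner_update(theta, lam)
    return jnp.maximum(cost(theta_new) - threshold, 0.0)

theta, lam = jnp.array(3.0), jnp.array(0.1)
for _ in range(200):
    # Outer step: adjust the trade-off weight via the meta-gradient,
    # keeping it non-negative.
    lam = jnp.maximum(lam - meta_lr * jax.grad(meta_objective)(lam, theta), 0.0)
    # Inner step: improve the penalized return for the current weight.
    theta = inner_update(theta, lam)

print(f"theta={float(theta):.3f}  cost={float(cost(theta)):.3f}  lam={float(lam):.3f}")
```

The inner step only maximizes the penalized return for a fixed weight, while the outer step changes the weight solely through its effect on post-update constraint violation; that separation is one simple way to realize the return-versus-violation trade-off the abstract refers to, rather than hand-tuning the penalty coefficient.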