Paper Title
Bridging the gap between QP-based and MPC-based RL
Paper Authors
Paper Abstract
Reinforcement learning methods typically use Deep Neural Networks (DNNs) to approximate the value functions and policies underlying a Markov Decision Process. Unfortunately, DNN-based RL suffers from a lack of explainability of the resulting policy. In this paper, we instead approximate the policy and value functions using optimization problems taking the form of Quadratic Programs (QPs). We propose simple tools to promote structure in the QP, pushing it to resemble a linear MPC scheme. A generic, unstructured QP offers high flexibility for learning, whereas a QP having the structure of an MPC scheme promotes the explainability of the resulting policy and additionally provides means for its analysis. The tools we propose allow the trade-off between the former and the latter to be adjusted continuously during learning. We illustrate the workings of our proposed method and the resulting structure using a point-mass task.
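As a rough, illustrative sketch of the two endpoints of this trade-off (the notation below is assumed for illustration and is not taken from the paper), the policy can be read as the solution of a parameterized QP, and promoting MPC structure amounts to pushing that QP toward the familiar linear MPC form:

% Illustrative sketch only; all symbols below are placeholder assumptions (requires amsmath).
% Generic parameterized QP policy: theta collects the learnable QP data.
\[
  \pi_\theta(s) \;\in\; \arg\min_{z}\; \tfrac{1}{2}\, z^\top H_\theta\, z + h_\theta(s)^\top z
  \quad \text{s.t.} \quad G_\theta\, z \le g_\theta(s).
\]
% MPC-structured special case: the decision variable stacks a state/input trajectory,
% and the QP matrices inherit the block structure of a linear MPC scheme.
\[
  \min_{x_{0:N},\, u_{0:N-1}} \;
  \sum_{k=0}^{N-1} \tfrac{1}{2}
  \begin{bmatrix} x_k \\ u_k \end{bmatrix}^{\!\top}\! W
  \begin{bmatrix} x_k \\ u_k \end{bmatrix}
  + \tfrac{1}{2}\, x_N^\top P\, x_N
  \quad \text{s.t.} \quad
  x_0 = s, \quad x_{k+1} = A x_k + B u_k, \quad C x_k + D u_k \le e,
\]
% with the first input u_0 of the optimal trajectory applied as the policy action.

Here H_theta, h_theta, G_theta, g_theta and W, P, A, B, C, D, e are placeholder symbols: in the unstructured case the learnable QP data are dense and unconstrained in form, while in the MPC-structured case they are forced into the sparse, block-banded pattern induced by the dynamics and stage costs, which is what makes the resulting policy amenable to standard MPC-style analysis. How the paper interpolates between these two endpoints during learning is not specified in the abstract.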