Paper Title

How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Paper Authors

Pierluca D'Oro, Wojciech Jaśkowski

Paper Abstract

Deterministic-policy actor-critic algorithms for continuous control improve the actor by plugging its actions into the critic and ascending the action-value gradient, which is obtained by chaining the actor's Jacobian matrix with the gradient of the critic with respect to input actions. However, instead of gradients, the critic is, typically, only trained to accurately predict expected returns, which, on their own, are useless for policy optimization. In this paper, we propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients, which explicitly learns the action-value gradient. MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement. On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
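The mechanism described in the abstract can be illustrated with a short sketch. This is a minimal illustration under assumed toy components, not the authors' implementation: the actor is improved by backpropagating the critic's value through the action it produces, while the critic is trained so that the action-gradient of the TD error is small, with the TD target made differentiable with respect to the action by backpropagating through a learned dynamics model. The `dynamics_model` and `reward_model` networks below are hypothetical stand-ins for the learned model.

```python
# Minimal sketch of MAGE-style critic and actor objectives (illustrative only).
# Toy MLPs stand in for the actor, critic, and learned dynamics/reward models.
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 2, 0.99

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
# Hypothetical learned model components (placeholders for the paper's learned dynamics).
dynamics_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))
reward_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def mage_critic_loss(state):
    # The action comes from the current policy; we need gradients with respect to it.
    action = actor(state).detach().requires_grad_(True)
    # Backpropagate through the learned dynamics/reward so the TD target
    # is differentiable with respect to the action.
    next_state = dynamics_model(torch.cat([state, action], dim=-1))
    reward = reward_model(torch.cat([state, action], dim=-1))
    next_action = actor(next_state)
    target = reward + gamma * critic(torch.cat([next_state, next_action], dim=-1))
    td_error = target - critic(torch.cat([state, action], dim=-1))
    # Key idea: penalize the action-gradient of the TD error, so the critic's
    # gradient (not just its predicted value) is accurate for policy improvement.
    grad_a = torch.autograd.grad(td_error.sum(), action, create_graph=True)[0]
    return grad_a.norm(dim=-1).mean()

def actor_loss(state):
    # Deterministic policy gradient: autograd chains the actor's Jacobian with
    # the critic's gradient with respect to the input action.
    action = actor(state)
    return -critic(torch.cat([state, action], dim=-1)).mean()

states = torch.randn(32, state_dim)
print(mage_critic_loss(states).item(), actor_loss(states).item())
```

In practice the gradient objective is paired with a value-matching regularizer so the critic's values stay anchored, and standard machinery such as target networks and model ensembles is used; those details are omitted from this sketch.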
