Paper Title


OPAC: Opportunistic Actor-Critic

Paper Authors

Srinjoy Roy, Saptam Bakshi, Tamal Maharaj

Paper Abstract


Actor-critic methods, a family of model-free reinforcement learning (RL) algorithms, have achieved state-of-the-art performance in many real-world continuous-control domains. Despite their success, wide-scale deployment of these models remains a long way off. The main problems in these actor-critic methods are inefficient exploration and sub-optimal policies. Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3), two such cutting-edge algorithms, suffer from these issues. SAC effectively addressed the problems of sample complexity and brittleness of convergence to hyper-parameters, and thus outperformed all state-of-the-art algorithms, including TD3, in harder tasks, whereas TD3 produced moderate results in all environments. SAC suffers from inefficient exploration owing to the Gaussian nature of its policy, which causes borderline performance in simpler tasks. In this paper, we introduce Opportunistic Actor-Critic (OPAC), a novel model-free deep RL algorithm that employs a better exploration policy with lower variance. OPAC combines some of the most powerful features of TD3 and SAC and aims to optimize a stochastic policy in an off-policy way. For calculating the target Q-values, OPAC uses three critics instead of two and, based on the environment's complexity, opportunistically chooses how the target Q-value is computed from the critics' evaluations. We have systematically evaluated the algorithm on MuJoCo environments, where it achieves state-of-the-art performance and outperforms, or at least equals, the performance of TD3 and SAC.
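The core idea described above — three critics whose estimates are combined differently depending on task difficulty — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name and the specific opportunistic rule (mean of the three critics for harder tasks to reduce variance, minimum for simpler tasks to counter overestimation as in TD3) are assumptions made for illustration.

```python
def opportunistic_target_q(q1: float, q2: float, q3: float,
                           harder_task: bool) -> float:
    """Combine three critic estimates into a single target Q-value.

    Hypothetical rule for illustration only: the abstract does not
    specify the exact criterion, so we assume averaging on harder
    tasks and a TD3-style minimum on simpler ones.
    """
    if harder_task:
        # Averaging the three critics lowers the variance of the target.
        return (q1 + q2 + q3) / 3.0
    # Taking the minimum counters Q-value overestimation bias (as in TD3).
    return min(q1, q2, q3)
```

In a full agent, this scalar combination would be applied element-wise to the critics' evaluations of the next state-action pair before forming the Bellman target.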
