Paper title
Band-limited Soft Actor Critic Model
Paper authors
Paper abstract
Soft Actor-Critic (SAC) algorithms show remarkable performance in complex simulated environments. A key element of SAC networks is entropy regularization, which prevents the SAC actor from optimizing against fine-grained, often transient features of the state-action value function. This results in better sample efficiency during early training. We take this idea one step further by artificially bandlimiting the spatial resolution of the target critic through the addition of a convolutional filter. We derive the closed-form solution in the linear case and show that bandlimiting reduces the interdependency between the low- and high-frequency components of the state-action value approximation, allowing the critic to learn faster. In experiments, the bandlimited SAC outperformed the classic twin-critic SAC in a number of Gym environments and displayed more stability in returns. We derive novel insights about SAC by adding a stochastic noise disturbance, a technique that is increasingly being used to learn robust policies that transfer well to their real-world counterparts.
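The bandlimiting idea in the abstract can be illustrated with a small sketch. This is not the paper's filter or network: it assumes a hypothetical one-dimensional action grid, a made-up target value curve, and a Gaussian low-pass kernel chosen purely for illustration. It only shows how convolving target values with a smoothing kernel suppresses fine-grained, high-frequency structure while keeping the broad shape.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian low-pass kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def bandlimit(q_values, sigma=3.0):
    """Smooth a 1-D array of target values with a Gaussian kernel.

    Assumption: the kernel shape and width are illustrative choices,
    not the filter used in the paper.
    """
    radius = int(4 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    # Edge padding keeps the output the same length as the input.
    padded = np.pad(q_values, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# Hypothetical target values over a 1-D action grid:
# a smooth low-frequency trend plus a fine-grained oscillation.
actions = np.linspace(-1.0, 1.0, 201)
q_low = -(actions - 0.3) ** 2
q_high = 0.2 * np.sin(2 * np.pi * 25 * actions)
q_target = q_low + q_high

q_smooth = bandlimit(q_target)
```

Away from the grid edges, `q_smooth` stays close to `q_low`, so an actor maximizing against it would follow the broad trend rather than the oscillation. In a real SAC setup the filter would act on the target critic over the state-action space; how that is done is specified in the paper, not here.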