Paper Title
Normality-Guided Distributional Reinforcement Learning for Continuous Control
Paper Authors
Paper Abstract
Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) has been shown to improve performance by modeling the value distribution, not just the mean. We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal. We design a method that exploits this property, using variances predicted by a variance network, together with observed returns, to analytically compute the target quantile bars representing a normal distribution for our distributional value function. In addition, we propose a policy update strategy based on correctness as measured by structural characteristics of the value distribution that are not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds. Our method yields statistically significant improvements in 10 out of 16 continuous task settings, while using fewer weights and achieving faster training than an ensemble-based method for quantifying value distribution uncertainty.
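The core computation described in the abstract can be sketched as follows: under a normality assumption, the target quantile bars of the value distribution can be computed in closed form as mu + sigma * Phi^{-1}(tau_i), where mu is the observed return and sigma comes from the variance network. The snippet below is a minimal illustrative sketch, not the authors' implementation; the function name, the use of scipy, and the midpoint quantile fractions are assumptions introduced here.

```python
# Illustrative sketch (assumptions, not the paper's code): analytic quantile
# targets of a normal distribution N(ret, predicted_variance).
import numpy as np
from scipy.stats import norm


def normal_quantile_targets(ret, predicted_variance, n_quantiles=32):
    """Compute quantile targets of N(ret, predicted_variance).

    ret: empirical return, used as the mean of the target normal.
    predicted_variance: output of a variance network (assumed positive).
    n_quantiles: number of quantile bars in the distributional critic.
    """
    # Midpoint quantile fractions tau_i = (2i + 1) / (2N), a common choice in
    # quantile-based DRL (an assumption; the abstract does not specify them).
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    sigma = np.sqrt(predicted_variance)
    # Inverse CDF of the standard normal maps each tau to a z-score.
    return ret + sigma * norm.ppf(taus)


# Example: 8 quantile targets for a return of 1.2 with predicted variance 0.25.
targets = normal_quantile_targets(1.2, 0.25, n_quantiles=8)
print(targets)
```

In this sketch, the resulting targets would replace the sampled-return targets used to train a quantile-based distributional value function, which is one way to read the abstract's "analytically compute target quantile bars."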