Paper Title
Rethinking Value Function Learning for Generalization in Reinforcement Learning
Paper Authors
Paper Abstract
Our work focuses on training RL agents on multiple visually diverse environments to improve observational generalization performance. In prior methods, the policy and value networks are optimized separately using disjoint network architectures to avoid interference and obtain a more accurate value function. We identify that a value network in the multi-environment setting is more challenging to optimize and more prone to memorizing the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization of the value network is necessary to improve both training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently but with more training data than the policy network. This can be implemented using a single unified network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
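The abstract does not give implementation details, so the following is a minimal sketch of the delayed-critic schedule it describes: within a single shared network, the policy head is updated every rollout phase on the freshest data, while the value head is updated only every few phases but on all rollouts accumulated since its last update. The MLP encoder, the delay constant VALUE_DELAY_K, the PPO-style clipped objective, and the placeholder rollout data are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical dimensions; the paper uses Procgen image observations with a CNN
# encoder, simplified here to vector observations and an MLP.
OBS_DIM, N_ACTIONS, HIDDEN = 64, 15, 256
VALUE_DELAY_K = 8                      # assumed delay: value head trained every K phases
POLICY_EPOCHS, VALUE_EPOCHS = 3, 3

# Single unified network: shared encoder with separate policy and value heads.
encoder = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU())
policy_head = nn.Linear(HIDDEN, N_ACTIONS)
value_head = nn.Linear(HIDDEN, 1)

policy_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(policy_head.parameters()), lr=5e-4)
value_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(value_head.parameters()), lr=5e-4)

replay = []  # rollouts accumulated between (delayed) value updates


def collect_rollout(batch=512):
    """Stand-in for environment interaction; returns fake (obs, ret, adv, act, logp)."""
    obs = torch.randn(batch, OBS_DIM)
    with torch.no_grad():
        dist = Categorical(logits=policy_head(encoder(obs)))
        act = dist.sample()
        logp = dist.log_prob(act)
    ret = torch.randn(batch)           # placeholder returns
    adv = torch.randn(batch)           # placeholder advantages
    return obs, ret, adv, act, logp


for phase in range(1, 101):
    rollout = collect_rollout()
    replay.append(rollout)

    # Policy head: updated every phase on the freshest rollout (PPO-style clipping).
    obs, _, adv, act, old_logp = rollout
    for _ in range(POLICY_EPOCHS):
        dist = Categorical(logits=policy_head(encoder(obs)))
        ratio = torch.exp(dist.log_prob(act) - old_logp)
        clipped = torch.clamp(ratio, 0.8, 1.2)
        policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # Value head: updated only every VALUE_DELAY_K phases, but on all rollouts
    # accumulated since the last value update (less frequent, more data).
    if phase % VALUE_DELAY_K == 0:
        for _ in range(VALUE_EPOCHS):
            for obs, ret, *_ in replay:
                value_loss = (value_head(encoder(obs)).squeeze(-1) - ret).pow(2).mean()
                value_opt.zero_grad(); value_loss.backward(); value_opt.step()
        replay.clear()
```

The key design choice this sketch highlights is that delaying value updates acts as an implicit regularizer on the value estimates, which the abstract argues is what improves generalization; the self-supervised dynamics task mentioned in the abstract is omitted here.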