Paper Title
Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning
Paper Authors
Paper Abstract
Exploration is critical for deep reinforcement learning in complex environments with high-dimensional observations and sparse rewards. To address this problem, recently proposed approaches leverage intrinsic rewards to improve exploration, such as novelty-based exploration and prediction-based exploration. However, many intrinsic reward modules require sophisticated structures and representation learning, resulting in prohibitive computational complexity and unstable performance. In this paper, we propose Rewarding Episodic Visitation Discrepancy (REVD), a computation-efficient and quantified exploration method. More specifically, REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes. For efficient divergence estimation, a k-nearest neighbor estimator is used together with a randomly-initialized state encoder. Finally, REVD is evaluated on Atari games and PyBullet Robotics Environments. Extensive experiments demonstrate that REVD significantly improves the sample efficiency of reinforcement learning algorithms and outperforms the benchmark methods.
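To make the abstract's mechanism concrete, the sketch below illustrates the general idea of a k-nearest-neighbor intrinsic reward computed with a fixed, randomly-initialized state encoder. It is a minimal illustration, not the authors' implementation: the network sizes, the `k` and `alpha` values, and the exact ratio-based reward term are assumptions chosen for clarity rather than the paper's REVD formula.

```python
# Minimal sketch (not the authors' code) of a k-NN based episodic intrinsic
# reward with a frozen, randomly-initialized state encoder. All dimensions
# and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class RandomEncoder(nn.Module):
    """Fixed random projection of raw observations into a low-dim latent space."""

    def __init__(self, obs_dim: int, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        for p in self.parameters():  # never trained: random features only
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def knn_intrinsic_rewards(cur_episode: torch.Tensor,
                          prev_episode: torch.Tensor,
                          encoder: RandomEncoder,
                          k: int = 3,
                          alpha: float = 0.5,
                          eps: float = 1e-8) -> torch.Tensor:
    """Per-state bonus from the ratio of k-NN distances across episodes.

    Intuition: if a state's k-th nearest neighbor in the *previous* episode is
    far away relative to its neighbors in the *current* episode, the state is
    novel with respect to the last episode and receives a larger bonus.
    """
    with torch.no_grad():
        z_cur = encoder(cur_episode)    # (T, d)
        z_prev = encoder(prev_episode)  # (T', d)
        # k-th NN distance within the current episode (skip the self-distance 0).
        d_intra = torch.cdist(z_cur, z_cur).topk(k + 1, largest=False).values[:, -1]
        # k-th NN distance to the states of the previous episode.
        d_inter = torch.cdist(z_cur, z_prev).topk(k, largest=False).values[:, -1]
        # Ratio term in the spirit of k-NN estimators of Rényi divergence of order alpha.
        return (d_inter / (d_intra + eps)).pow(1.0 - alpha)


# Usage: obs_dim=8 is a placeholder; episodes are stacks of raw observations.
enc = RandomEncoder(obs_dim=8)
r_int = knn_intrinsic_rewards(torch.randn(100, 8), torch.randn(100, 8), enc)
```

Because the encoder is never trained and the distances are computed only within and between two episodes, the bonus requires no auxiliary representation learning, which is the computational advantage the abstract emphasizes.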