Paper Title

Leveraging the Variance of Return Sequences for Exploration Policy

Paper Authors

Zerong Xi, Gita Sukthankar

Paper Abstract

This paper introduces a method for constructing an upper bound for exploration policy using either the weighted variance of return sequences or the weighted temporal difference (TD) error. We demonstrate that the variance of the return sequence for a specific state-action pair is an important information source that can be leveraged to guide exploration in reinforcement learning. The intuition is that fluctuation in the return sequence indicates greater uncertainty in the near future returns. This divergence occurs because of the cyclic nature of value-based reinforcement learning; the evolving value function begets policy improvements which in turn modify the value function. Although both variance and TD errors capture different aspects of this uncertainty, our analysis shows that both can be valuable to guide exploration. We propose a two-stream network architecture to estimate weighted variance/TD errors within DQN agents for our exploration method and show that it outperforms the baseline on a wide range of Atari games.
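To make the idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a two-stream value network that outputs both Q-values and a per-action uncertainty estimate, together with a UCB-style action selection that treats Q(s, a) plus a scaled square root of that estimate as the exploration upper bound described in the abstract. The fully connected layers, the Softplus output, the bonus coefficient `beta`, and the input/action dimensions are illustrative assumptions; the paper's agents operate on Atari frames with convolutional features and a specific weighting scheme for the variance/TD-error targets.

```python
# Minimal sketch (assumptions noted above): a two-stream DQN-style head that
# predicts Q-values and a per-action uncertainty estimate (e.g., a weighted
# variance of returns or a weighted TD error), plus an action selection rule
# that is greedy with respect to an exploration upper bound.
import torch
import torch.nn as nn


class TwoStreamDQN(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        # Shared feature extractor (fully connected here for simplicity;
        # an Atari agent would use convolutional layers instead).
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Stream 1: standard action-value estimates Q(s, a).
        self.q_stream = nn.Linear(hidden, n_actions)
        # Stream 2: per-action uncertainty estimate, intended as a stand-in
        # for the weighted variance / weighted TD error of (s, a).
        # Softplus keeps the prediction non-negative.
        self.u_stream = nn.Sequential(nn.Linear(hidden, n_actions), nn.Softplus())

    def forward(self, obs: torch.Tensor):
        h = self.features(obs)
        return self.q_stream(h), self.u_stream(h)


def select_action(net: TwoStreamDQN, obs: torch.Tensor, beta: float = 1.0) -> int:
    """Act greedily w.r.t. the upper bound Q(s, a) + beta * sqrt(uncertainty)."""
    with torch.no_grad():
        q, u = net(obs.unsqueeze(0))
        upper_bound = q + beta * torch.sqrt(u + 1e-8)
        return int(upper_bound.argmax(dim=1).item())


if __name__ == "__main__":
    # Toy usage with random inputs; shapes are placeholders.
    net = TwoStreamDQN(obs_dim=8, n_actions=4)
    action = select_action(net, torch.randn(8))
    print("chosen action:", action)
```

In this sketch the uncertainty stream is trained against whatever target one chooses (weighted return variance or weighted TD error); the abstract indicates both are useful signals, and the bonus term simply biases action selection toward state-action pairs whose return estimates are still fluctuating.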
