Paper Title
SEREN: Knowing When to Explore and When to Exploit
Paper Authors
Paper Abstract
Efficient reinforcement learning (RL) involves a trade-off between "exploitative" actions that maximise expected reward and "explorative" ones that sample unvisited states. To encourage exploration, recent approaches have proposed adding stochasticity to actions, separating exploration and exploitation phases, or equating reduction in uncertainty with reward. However, these techniques do not necessarily offer entirely systematic approaches to making this trade-off. Here we introduce the SElective Reinforcement Exploration Network (SEREN), which poses the exploration-exploitation trade-off as a game between two RL agents: Exploiter, which purely exploits known rewards, and Switcher, which chooses the states at which to activate a pure exploration policy that is trained to minimise system uncertainty and thereby override Exploiter. Using a form of policies known as impulse control, Switcher is able to determine the best set of states at which to switch to the exploration policy, while Exploiter is free to execute its actions everywhere else. We prove that SEREN converges quickly and induces a natural schedule towards pure exploitation. Through extensive empirical studies on both discrete (MiniGrid) and continuous (MuJoCo) control benchmarks, we show that SEREN can be readily combined with existing RL algorithms to yield significant improvements in performance relative to state-of-the-art algorithms.
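To make the Switcher/Exploiter interaction described above concrete, the following is a minimal, illustrative sketch, not the authors' implementation. The toy chain environment, the visit-count rule standing in for the learned impulse-control policy, and the random-action stand-in for the uncertainty-minimising exploration policy are all assumptions introduced purely for illustration.

```python
# Illustrative sketch of the SEREN-style switching loop (assumptions throughout):
# Switcher decides per state whether to override Exploiter with a pure exploration policy.
import random


class ToyEnv:
    """Tiny chain environment, included only to make the sketch runnable."""
    def __init__(self, n_states=10):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action in {-1, +1}; reward only at the far end of the chain
        self.state = max(0, min(self.n_states - 1, self.state + action))
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward, reward > 0


class Exploiter:
    """Greedy policy over a learned Q-table (pure exploitation)."""
    def __init__(self, n_states):
        self.q = {(s, a): 0.0 for s in range(n_states) for a in (-1, 1)}

    def act(self, state):
        return max((-1, 1), key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next, lr=0.1, gamma=0.99):
        target = r + gamma * max(self.q[(s_next, b)] for b in (-1, 1))
        self.q[(s, a)] += lr * (target - self.q[(s, a)])


class Switcher:
    """Selects the states at which to activate the exploration policy.

    A simple visit-count threshold stands in here for the learned
    impulse-control policy described in the paper (an assumption).
    """
    def __init__(self, threshold=5):
        self.visit_counts = {}
        self.threshold = threshold

    def should_explore(self, state):
        return self.visit_counts.get(state, 0) < self.threshold

    def observe(self, state):
        self.visit_counts[state] = self.visit_counts.get(state, 0) + 1


def explore_policy(state):
    """Stand-in for the uncertainty-minimising exploration policy: uniform random."""
    return random.choice((-1, 1))


env = ToyEnv()
exploiter, switcher = Exploiter(env.n_states), Switcher()
for episode in range(50):
    s = env.reset()
    for t in range(200):  # step cap keeps the toy loop bounded
        switcher.observe(s)
        # Switcher overrides Exploiter only in the states it selects for exploration.
        a = explore_policy(s) if switcher.should_explore(s) else exploiter.act(s)
        s_next, r, done = env.step(a)
        exploiter.update(s, a, r, s_next)
        s = s_next
        if done:
            break
```

As visit counts grow, the stand-in Switcher intervenes in fewer states, mirroring the natural schedule towards pure exploitation that the abstract attributes to SEREN.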