Paper Title

Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing

Paper Authors

Ge Liu, Rui Wu, Heng-Tze Cheng, Jing Wang, Jayden Ooi, Lihong Li, Ang Li, Wai Lok Sibon Li, Craig Boutilier, Ed Chi

Paper Abstract

Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments. However, training deep RL models is challenging in real-world applications, such as production-scale health-care or recommender systems, because interactions are expensive and deployment budgets are limited. One source of data inefficiency is the expensive hyper-parameter tuning required when optimizing deep neural networks. We propose Adaptive Behavior Policy Sharing (ABPS), a data-efficient training algorithm that shares the experience collected by a behavior policy adaptively selected from a pool of agents trained with an ensemble of hyper-parameters. We further extend ABPS to evolve hyper-parameters during training by hybridizing ABPS with an adapted version of Population Based Training (ABPS-PBT). We conduct experiments on multiple Atari games with up to 16 hyper-parameter/architecture setups. ABPS achieves superior overall performance, reduced variance among the top 25% of agents, and performance on par with the best agent from conventional hyper-parameter tuning with independent training, even though ABPS requires only the same number of environment interactions as training a single agent. We also show that ABPS-PBT further improves convergence speed and reduces variance.
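
To make the idea in the abstract concrete, below is a minimal, hypothetical Python sketch of shared-experience training with an adaptively selected behavior policy. The agent interface (`select_action`, `train_step`), the Gym-style environment API, the replay-buffer size, and the epsilon-greedy re-selection over rolling episode returns are illustrative assumptions, not the authors' implementation.

```python
import random

# Illustrative sketch of the shared-experience loop described in the abstract.
# Assumptions (not from the paper): agents expose `select_action(obs)` and
# `train_step(batch)`, the environment follows the classic Gym step API, and
# the behavior policy is re-picked epsilon-greedily from rolling episode
# returns as a stand-in for the paper's adaptive selection mechanism.

def train_abps(env, agents, total_steps, buffer_size=100_000,
               batch_size=32, switch_every=1_000, eps=0.1):
    replay, write_idx = [], 0                 # shared ring-buffer replay memory
    scores = [0.0] * len(agents)              # rolling return estimate per agent
    behavior = random.randrange(len(agents))  # index of current behavior policy
    obs, episode_return = env.reset(), 0.0

    for step in range(total_steps):
        # Only the selected behavior policy interacts with the environment.
        action = agents[behavior].select_action(obs)
        next_obs, reward, done, _info = env.step(action)
        transition = (obs, action, reward, next_obs, done)
        if len(replay) < buffer_size:
            replay.append(transition)
        else:
            replay[write_idx] = transition
            write_idx = (write_idx + 1) % buffer_size
        episode_return += reward
        obs = next_obs

        # Every agent in the pool learns off-policy from the shared experience,
        # so tuning many hyper-parameters costs no extra environment interactions.
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            for agent in agents:
                agent.train_step(batch)

        if done:
            # Credit the finished episode to the agent that was acting.
            scores[behavior] = 0.9 * scores[behavior] + 0.1 * episode_return
            obs, episode_return = env.reset(), 0.0

        # Periodically re-select which agent serves as the behavior policy.
        if (step + 1) % switch_every == 0:
            if random.random() < eps:
                behavior = random.randrange(len(agents))      # explore an agent
            else:
                behavior = max(range(len(agents)), key=scores.__getitem__)

    return agents
```

The key property this sketch illustrates is that the whole pool of differently configured agents trains from one stream of interactions, which is why the abstract can claim the same interaction budget as training a single agent; ABPS-PBT would additionally mutate the hyper-parameters of poorly scoring agents during training.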
