Paper Title

Effective Diversity in Population Based Reinforcement Learning

Authors

Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, Stephen Roberts

Abstract

Exploration is a key problem in reinforcement learning, since agents can only learn from data they acquire in the environment. With that in mind, maintaining a population of agents is an attractive method, as it allows data to be collected with a diverse set of behaviors. This behavioral diversity is often boosted via multi-objective loss functions. However, those approaches typically leverage mean field updates based on pairwise distances, which makes them susceptible to cycling behaviors and increased redundancy. In addition, explicitly boosting diversity often has a detrimental impact on optimizing already fruitful behaviors for rewards. As such, the reward-diversity trade-off typically relies on heuristics. Finally, such methods require behavioral representations, which are often handcrafted and domain specific. In this paper, we introduce an approach to optimize all members of a population simultaneously. Rather than using pairwise distance, we measure the volume of the entire population in a behavioral manifold, defined by task-agnostic behavioral embeddings. In addition, our algorithm, Diversity via Determinants (DvD), adapts the degree of diversity during training using online learning techniques. We introduce both evolutionary and gradient-based instantiations of DvD and show that they can effectively improve exploration without reducing performance when better exploration is not required.
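The population-level diversity term the abstract describes can be pictured concretely. Below is a minimal Python sketch: each agent is mapped to a task-agnostic behavioral embedding, a kernel matrix is built over the embeddings, and its determinant measures the volume the population spans in behavior space. The RBF kernel, the probe-state embedding, and all names here (behavioral_embedding, population_diversity, length_scale, lam) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def behavioral_embedding(policy, probe_states):
    """Embed a policy by concatenating its actions on a fixed set of probe
    states (one simple, task-agnostic choice of embedding; an assumption)."""
    return np.concatenate([np.atleast_1d(policy(s)) for s in probe_states])

def population_diversity(embeddings, length_scale=1.0):
    """Determinant of an RBF kernel matrix over behavioral embeddings.

    The determinant equals the squared volume spanned by the embeddings in
    the kernel's feature space: it is large when behaviors differ, and it
    collapses to zero as soon as any two agents behave identically."""
    E = np.stack(embeddings)                               # (n_agents, dim)
    sq_dists = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * length_scale ** 2))      # (n_agents, n_agents)
    return np.linalg.det(K)

# Toy usage: five agents with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=8) for _ in range(5)]
mean_return = 1.0   # placeholder for the population's average reward
lam = 0.5           # reward-diversity weight; DvD adapts this online
objective = (1.0 - lam) * mean_return + lam * population_diversity(embeddings)
```

Because the determinant vanishes whenever two rows of the kernel matrix coincide, this measure penalizes redundancy across the whole population directly, whereas a sum of pairwise distances can remain high while agents cycle through one another's behaviors.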
