Paper Title
Shaping Advice in Deep Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement learning involves agents interacting with an environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of the actions that they take, thereby affecting the learning of policies. In this paper, we propose methods to augment the reward signal from the environment with an additional reward termed shaping advice in both single- and multi-agent reinforcement learning. The shaping advice is specified as a difference of potential functions at consecutive time-steps. Each potential function is a function of observations and actions of the agents. The use of potential functions is underpinned by the insight that the total potential accumulated when starting from any state and returning to the same state is always equal to zero. We show through theoretical analyses and experimental validation that the shaping advice does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that the convergence of policy gradients and value functions when using shaping advice implies the convergence of these quantities in the absence of shaping advice. We design two algorithms: Shaping Advice in Single-agent reinforcement learning (SAS) and Shaping Advice in Multi-agent reinforcement learning (SAM). Shaping advice in SAS and SAM needs to be specified only once at the start of training, and can easily be provided by non-experts. Experimentally, we evaluate SAS and SAM on two tasks in single-agent environments and three tasks in multi-agent environments that have sparse rewards. We observe that using shaping advice results in agents learning policies that complete tasks faster and obtain higher rewards than algorithms that do not use shaping advice.
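
The shaping mechanism described in the abstract (a difference of potential functions at consecutive time-steps, added to the sparse environment reward) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's SAS/SAM code: the discount factor gamma, the user-supplied potential function phi over observations and actions, and all names below are hypothetical.

    # Minimal sketch of potential-based shaping advice (illustrative only).
    # Assumes a discount factor `gamma` and a user-supplied potential `phi(obs, action)`.

    def shaping_advice(phi, gamma, obs, action, next_obs, next_action):
        """Difference of potentials at consecutive time-steps:
        gamma * phi(s', a') - phi(s, a)."""
        return gamma * phi(next_obs, next_action) - phi(obs, action)

    def shaped_reward(env_reward, phi, gamma, obs, action, next_obs, next_action):
        """Augment the (possibly sparse) environment reward with the shaping advice."""
        return env_reward + shaping_advice(phi, gamma, obs, action, next_obs, next_action)

Because consecutive differences of the potential telescope along a trajectory, the total added potential over any loop that starts and ends in the same state is zero, which is the insight the abstract cites for why the advice does not distract agents from the task specified by the environment reward.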