Paper Title
Modeling Survival in model-based Reinforcement Learning
Paper Authors
Paper Abstract
Although recent model-free reinforcement learning algorithms have proven capable of mastering complicated decision-making tasks, their sample complexity remains a hurdle to deploying them in many real-world applications. In this regard, model-based reinforcement learning offers some remedies. Yet model-based methods are inherently more computationally expensive and susceptible to sub-optimality. One reason is that model-generated data are always less accurate than real data, which often leads to inaccurate transition and reward function models. To mitigate this problem, this work presents the notion of survival by discussing cases in which the agent's goal is to survive, and its analogy to maximizing the expected reward. To that end, a substitute model for the reward function approximator is introduced that learns to avoid terminal states rather than to maximize accumulated rewards from safe states. Focusing on terminal states, which constitute only a small fraction of the state space, drastically reduces the training effort. Next, a model-based reinforcement learning method, Survive, is proposed to train an agent to avoid dangerous states through a safety map model built on temporal credit assignment in the vicinity of terminal states. Finally, the performance of the proposed algorithm is investigated, along with a comparison between the proposed and current methods.
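To make the idea of a safety map trained by temporal credit assignment near terminal states more concrete, here is a minimal, hypothetical Python sketch. The names (`SafetyMap`, `choose_action`), the tabular representation, and the discounted danger-labeling scheme are illustrative assumptions only, not the paper's implementation; they simply show how states visited shortly before a failure can be assigned decaying danger values and then avoided during model-based action selection.

```python
# Hypothetical sketch (not the paper's reference implementation): a small
# "safety map" that assigns discounted danger values to states observed
# shortly before a terminal state, plus a model-based action filter that
# avoids states the map flags as dangerous.

import numpy as np


class SafetyMap:
    """Tabular danger estimator trained by credit assignment backwards from terminal states."""

    def __init__(self, n_states, gamma=0.9, lr=0.5):
        self.danger = np.zeros(n_states)  # 1.0 = certainly terminal
        self.gamma = gamma                # how far danger propagates backwards in time
        self.lr = lr

    def update_from_episode(self, states, terminated):
        # Only episodes that actually end in a terminal (failure) state
        # carry danger labels; safe episodes leave the map untouched.
        if not terminated:
            return
        target = 1.0
        for s in reversed(states):
            self.danger[s] += self.lr * (target - self.danger[s])
            target *= self.gamma  # states further from the terminal state are less dangerous

    def is_safe(self, state, threshold=0.3):
        return self.danger[state] < threshold


def choose_action(state, transition_model, safety_map, n_actions, rng):
    """Prefer actions whose model-predicted next state the safety map marks as safe."""
    candidates = rng.permutation(n_actions)
    for a in candidates:
        next_state = transition_model(state, int(a))  # learned or known dynamics model
        if safety_map.is_safe(next_state):
            return int(a)
    return int(candidates[0])  # no safe option predicted; fall back


if __name__ == "__main__":
    # Toy 1-D corridor: state 0 is terminal (failure), the agent starts at 5.
    rng = np.random.default_rng(0)
    n_states, n_actions = 10, 2
    model = lambda s, a: max(0, min(n_states - 1, s + (1 if a == 1 else -1)))

    smap = SafetyMap(n_states)
    for _ in range(50):                       # collect random-exploration experience
        s, trajectory = 5, []
        for _ in range(20):
            s = model(s, int(rng.integers(n_actions)))
            trajectory.append(s)
            if s == 0:
                break
        smap.update_from_episode(trajectory, terminated=(s == 0))

    # After training, the filter steers away from the terminal end of the corridor.
    print("danger map:", np.round(smap.danger, 2))
    print("action from state 2:", choose_action(2, model, smap, n_actions, rng))
```

Because the danger labels are only generated around terminal states, the map needs far fewer samples than a full reward model would; this is the intuition behind the abstract's claim that focusing on terminal states drastically reduces the training effort.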