Paper Title


Self-Punishment and Reward Backfill for Deep Q-Learning

Authors

Mohammad Reza Bonyadi, Rui Wang, Maryam Ziaei

Abstract


Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment. In many environments, however, the reward is provided after a series of actions rather than after each single action, leading the agent to experience ambiguity about whether those actions were effective, an issue known as the credit assignment problem. In this paper, we propose two strategies inspired by behavioural psychology that enable the agent to intrinsically estimate more informative reward values for actions with no reward. The first strategy, called self-punishment (SP), discourages the agent from making mistakes that lead to undesirable terminal states. The second strategy, called reward backfill (RB), backpropagates the rewards between two rewarded actions. We prove that, under certain assumptions and regardless of the reinforcement learning algorithm used, these two strategies maintain the order of policies in the space of all possible policies in terms of their total reward, and, by extension, preserve the optimal policy. Hence, our proposed strategies integrate with any reinforcement learning algorithm that learns a value or action-value function through experience. We incorporated these two strategies into three popular deep reinforcement learning approaches and evaluated the results on thirty Atari games. After parameter tuning, our results indicate that the proposed strategies improve the tested methods in over 65 percent of the tested games, with performance gains of up to more than 25-fold.
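To make the two strategies concrete, here is a minimal sketch of how SP and RB could reshape the reward sequence of a single trajectory. The function name, the fixed punishment value, and the geometric-decay backfill rule are illustrative assumptions; the paper does not specify this exact formulation, only that SP penalizes trajectories ending in undesirable terminal states and that RB propagates rewards backwards across unrewarded actions.

```python
def apply_sp_rb(rewards, terminal_failure, punishment=-1.0, decay=0.5):
    """Reshape a trajectory's rewards with self-punishment (SP) and
    reward backfill (RB). Illustrative sketch, not the paper's exact rule.

    rewards          -- list of per-step environment rewards
    terminal_failure -- True if the episode ended in an undesirable state
    punishment       -- SP penalty added at the failing terminal step (assumed)
    decay            -- RB geometric decay per backward step (assumed)
    """
    shaped = list(rewards)

    # SP: discourage mistakes by penalizing the failing terminal action.
    if terminal_failure:
        shaped[-1] += punishment

    # RB: walk backwards; each zero-reward step receives a decayed share of
    # the next rewarded action's reward, stopping at earlier rewarded steps.
    backfilled = list(shaped)
    carry = 0.0
    for t in range(len(shaped) - 1, -1, -1):
        if shaped[t] != 0:
            carry = shaped[t]          # restart backfill from this reward
        else:
            carry *= decay
            backfilled[t] = carry      # intrinsic estimate for unrewarded step

    return backfilled


# A reward at step 2 is backfilled to the two unrewarded steps before it:
print(apply_sp_rb([0, 0, 1.0, 0, 0], terminal_failure=False))
# A failed episode with no environment reward becomes uniformly discouraging:
print(apply_sp_rb([0, 0, 0], terminal_failure=True))
```

Because the reshaped rewards can be computed from stored transitions alone, a wrapper like this slots in front of any experience-based value or action-value learner, which is the integration property the abstract claims.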
