Paper Title
Challenging Common Assumptions in Convex Reinforcement Learning
Paper Authors
Paper Abstract
The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound, since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration, among others.
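
To make the distinction between the two objectives concrete, here is a minimal numerical sketch. It is not taken from the paper: the two-state toy setup, the negative-entropy objective, and the names (neg_entropy, d_pi, T) are illustrative assumptions. It shows that, for a nonlinear objective F, the finite trials value E[F(d_T)] computed from empirical visitation frequencies of a T-step trajectory differs from the infinite trials value F(d_pi) computed from the induced state distribution, as Jensen's inequality predicts.

# Minimal sketch (assumptions, not from the paper): two states, the policy visits
# state 0 with probability p at each of T i.i.d. steps, and the convex objective is
# the negative entropy of the state distribution. The infinite trials objective
# evaluates F on the induced distribution d_pi; the finite trials objective is the
# expectation of F over the empirical frequencies of a single T-step trajectory.
import numpy as np

rng = np.random.default_rng(0)

def neg_entropy(d, eps=1e-12):
    # Convex objective F(d) = sum_s d(s) * log d(s).
    d = np.clip(d, eps, 1.0)
    return float(np.sum(d * np.log(d)))

p, T, n_rollouts = 0.5, 10, 100_000
d_pi = np.array([p, 1.0 - p])

# Infinite trials objective: F applied to the induced state distribution.
infinite_trials = neg_entropy(d_pi)

# Finite trials objective: E[F(d_T)] over empirical frequencies of T-step trajectories,
# estimated by Monte Carlo over n_rollouts simulated trajectories.
counts = rng.binomial(T, p, size=n_rollouts)
empirical = np.stack([counts / T, 1.0 - counts / T], axis=1)
finite_trials = np.mean([neg_entropy(d) for d in empirical])

print(f"F(d_pi)       = {infinite_trials:.4f}")  # about -0.6931 (maximum entropy)
print(f"E[F(d_T)]     = {finite_trials:.4f}")    # strictly larger: the Jensen gap
print(f"approx. error = {finite_trials - infinite_trials:.4f}")

Under these assumptions, a linear F would make the two quantities coincide (the classic scalar-reward case), while for this convex F the gap shrinks as T grows and vanishes only in the limit of infinitely long or infinitely many trajectories, mirroring the approximation error discussed in the abstract.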