Paper Title
Towards Robust Off-policy Learning for Runtime Uncertainty
Paper Authors
Paper Abstract
Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to online deployment. However, during real-time serving, we observe a variety of interventions and constraints that cause inconsistency between the online and offline settings, which we summarize and term runtime uncertainty. Such uncertainty cannot be learned from the logged data due to its abnormal and rare nature. To ensure a certain level of robustness, we perturb the off-policy estimators along an adversarial direction in view of the runtime uncertainty. This allows the resulting estimators to be robust not only to observed but also to unexpected runtime uncertainties. Leveraging this idea, we bring runtime-uncertainty robustness to three major off-policy learning methods: the inverse propensity score method, the reward-model method, and the doubly robust method. We theoretically justify the robustness of our methods to runtime uncertainty, and demonstrate their effectiveness using both simulations and real-world online experiments.
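To make the adversarial-perturbation idea concrete, below is a minimal, hypothetical Python sketch. It is not the paper's actual estimator: it only contrasts a standard inverse propensity score (IPS) estimate with a worst-case variant in which each sample's importance-weighted contribution may be scaled within a small budget `epsilon`, and the adversary picks the per-sample scaling that lowers the estimated value. All function names, the perturbation set, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Standard inverse-propensity-score (IPS) off-policy value estimate."""
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def robust_ips_estimate(rewards, logging_probs, target_probs, epsilon=0.05):
    """Worst-case IPS estimate under a small multiplicative perturbation
    of each sample's importance-weighted contribution (a toy stand-in
    for runtime uncertainty, not the method from the paper).
    """
    weights = target_probs / logging_probs
    contrib = weights * rewards
    # Adversarial direction: shrink positive contributions, inflate negative ones,
    # so the reported value is a lower bound within the perturbation budget.
    perturbed = np.where(contrib >= 0, contrib * (1 - epsilon), contrib * (1 + epsilon))
    return np.mean(perturbed)

# Hypothetical logged bandit data: binary rewards plus logging/target propensities.
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.3, size=1000).astype(float)
logging_probs = rng.uniform(0.2, 0.8, size=1000)
target_probs = rng.uniform(0.2, 0.8, size=1000)

print("IPS estimate:        ", ips_estimate(rewards, logging_probs, target_probs))
print("Worst-case estimate: ", robust_ips_estimate(rewards, logging_probs, target_probs))
```

The same pattern could, in principle, be applied to the reward-model and doubly robust estimators by perturbing their respective sample-level contributions, which is the spirit of the robustness the abstract describes.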