Paper Title

Multi-Task Off-Policy Learning from Bandit Feedback

Authors

Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh

Abstract

Many practical applications, such as recommender systems and learning to rank, involve solving multiple similar tasks. One example is learning of recommendation policies for users with similar movie preferences, where the users may still rank the individual movies slightly differently. Such tasks can be organized in a hierarchy, where similar tasks are related through a shared structure. In this work, we formulate this problem as a contextual off-policy optimization in a hierarchical graphical model from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them. We instantiate HierOPO in linear Gaussian models, for which we also provide an efficient implementation and analysis. We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model. We also evaluate the policies empirically. Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
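To make the pessimistic step concrete, below is a minimal sketch (not the authors' implementation) of HierOPO-style off-policy learning in a linear Gaussian model: a per-task posterior is computed around a shared mean, and the learned policy picks the action with the highest lower confidence bound. The priors, the synthetic logged data, and the point estimate of the shared hyper-parameter (in place of its full hierarchical posterior) are simplifying assumptions for illustration only.

```python
# A minimal sketch of pessimistic off-policy learning in a hierarchical
# linear-Gaussian model. All names, priors, and the synthetic data are
# illustrative assumptions; the shared hyper-parameter is point-estimated
# rather than integrated over as in the paper.
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 3, 0.25          # feature dimension, reward noise variance
Sigma_0 = 0.5 * np.eye(d)    # covariance of task parameters around the shared mean

def task_posterior(X, r, mu_hat):
    """Posterior of one task's parameter theta_s given logged (X, r)
    and a point-estimated shared mean mu_hat."""
    G = np.linalg.inv(Sigma_0) + X.T @ X / sigma2
    Sigma_s = np.linalg.inv(G)
    theta_s = Sigma_s @ (np.linalg.inv(Sigma_0) @ mu_hat + X.T @ r / sigma2)
    return theta_s, Sigma_s

def pessimistic_action(actions, theta_s, Sigma_s, c=1.0):
    """Pick the action with the highest lower confidence bound (pessimism)."""
    lcb = [a @ theta_s - c * np.sqrt(a @ Sigma_s @ a) for a in actions]
    return int(np.argmax(lcb))

# Synthetic logged bandit feedback for 2 related tasks (illustrative only).
mu_true = rng.normal(size=d)
tasks = []
for _ in range(2):
    theta = mu_true + rng.multivariate_normal(np.zeros(d), Sigma_0)
    X = rng.normal(size=(50, d))                    # logged action features
    r = X @ theta + rng.normal(0, np.sqrt(sigma2), 50)
    tasks.append((X, r))

# Point estimate of the shared mean from per-task least squares.
mu_hat = np.mean([np.linalg.lstsq(X, r, rcond=None)[0] for X, r in tasks], axis=0)

# Learn a pessimistic policy per task and act on a new action set.
actions = rng.normal(size=(5, d))
for s, (X, r) in enumerate(tasks):
    theta_s, Sigma_s = task_posterior(X, r, mu_hat)
    print(f"task {s}: pessimistic action index =",
          pessimistic_action(actions, theta_s, Sigma_s))
```

The pooling through `mu_hat` is what ties the tasks together: tasks with little logged data are shrunk toward the shared mean, which is the intuition behind the per-task improvement the paper reports over solving each task independently.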
