Paper Title
Fast Offline Policy Optimization for Large Scale Recommendation
Paper Authors
Paper Abstract
Personalised interactive systems such as recommender systems require selecting relevant items from massive catalogues depending on context. Reward-driven offline optimisation of these systems can be achieved by a relaxation of the discrete problem, resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of evaluating the gradient (and hence of each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world settings, such as large-catalogue recommender systems, severely limiting the usefulness of the method in practice. In this paper, we derive an approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
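The abstract contrasts an exact policy gradient, which sums over the full catalogue, with a sampled approximation built on self-normalised importance sampling. Below is a minimal sketch of that contrast, assuming a softmax policy over user-item inner products; the function names, the uniform proposal, and the `num_samples` parameter are illustrative assumptions, not the paper's implementation, and the sketch omits the fast maximum inner product search component used at training time.

```python
import numpy as np

def exact_grad_log_policy(user_vec, item_embeddings, action):
    """Exact gradient of log pi(action | user) w.r.t. the user representation,
    for a softmax policy over inner products. Costs O(P) because the softmax
    normalisation sums over the full catalogue of P items."""
    logits = item_embeddings @ user_vec                # shape (P,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # grad log pi(a|x) = beta_a - E_{p ~ pi}[beta_p]
    return item_embeddings[action] - probs @ item_embeddings

def approx_grad_log_policy(user_vec, item_embeddings, action,
                           num_samples=64, rng=None):
    """Approximate gradient: replace the expectation over the catalogue with a
    self-normalised importance sampling estimate built from num_samples items
    drawn from a uniform proposal. Costs O(num_samples) per gradient."""
    rng = np.random.default_rng(0) if rng is None else rng
    P = item_embeddings.shape[0]
    idx = rng.integers(0, P, size=num_samples)         # uniform proposal q
    logits = item_embeddings[idx] @ user_vec           # shape (num_samples,)
    # importance weights proportional to pi/q, known only up to the softmax
    # normalising constant; self-normalising makes that constant cancel
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return item_embeddings[action] - w @ item_embeddings[idx]

# toy usage: compare exact and approximate gradients on random data
P, d = 100_000, 32
rng = np.random.default_rng(1)
items = rng.normal(size=(P, d)).astype(np.float32)
user = rng.normal(size=d).astype(np.float32)
g_exact = exact_grad_log_policy(user, items, action=42)
g_approx = approx_grad_log_policy(user, items, action=42, num_samples=512, rng=rng)
```

In this sketch the exact version touches every item embedding, while the approximate version only touches the sampled rows, which is what makes per-iteration cost independent of (and, with a search-backed proposal, logarithmic in) the catalogue size.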