Paper Title
Reinforcement Re-ranking with 2D Grid-based Recommendation Panels
Paper Authors
Paper Abstract
Modern recommender systems usually present items as a streaming, one-dimensional ranking list. Recently, there has been a trend in e-commerce of organizing the recommended items into grid-based panels with two dimensions, where users can view the items in both vertical and horizontal directions. Presenting items in grid-based result panels poses new challenges to recommender systems because existing models are all designed to output sequential lists, whereas the slots in a grid-based panel have no explicit order. Directly converting item rankings into grids (e.g., by pre-defining an order on the slots) overlooks users' specific behavioral patterns on grid-based panels and inevitably hurts the user experience. To address this issue, we propose a novel Markov decision process (MDP) to place items in 2D grid-based result panels at the final re-ranking stage of a recommender system. The model, referred to as Panel-MDP, takes an initial item ranking from the earlier stages as input. It defines \emph{the discrete time steps of the MDP as the ranks in the initial ranking list, and the actions as the prediction of the user-item preference and the selection of the slots}. At each time step, Panel-MDP sequentially executes two sub-actions: it first decides whether the current item in the initial ranking list is preferred by the user; it then selects a slot for placing the item if it is preferred, or skips the item otherwise. The process continues until all of the panel slots are filled. The PPO reinforcement learning algorithm is employed to implement the Panel-MDP and learn its parameters. Simulations and experiments on a dataset collected from a widely used e-commerce app demonstrate the superiority of Panel-MDP in recommending 2D grid-based result panels.
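The placement loop described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the two sub-actions (preference prediction, then slot selection) iterating over the initial ranking; the `prefer` and `choose_slot` callables here are stand-in heuristics, not the paper's learned PPO policy, and the function name `place_items` is an assumption for illustration.

```python
def place_items(initial_ranking, prefer, choose_slot, rows, cols):
    """Fill a rows x cols panel from an initial ranking list.

    Each rank in the initial list is one discrete MDP time step.
    """
    panel = [[None] * cols for _ in range(rows)]
    empty = [(r, c) for r in range(rows) for c in range(cols)]
    for item in initial_ranking:
        if not empty:                        # stop once all slots are filled
            break
        if prefer(item):                     # sub-action 1: preference prediction
            slot = choose_slot(item, empty)  # sub-action 2: slot selection
            empty.remove(slot)
            panel[slot[0]][slot[1]] = item
        # otherwise skip the item and move to the next rank
    return panel

# Toy usage: keep even-numbered items, fill slots in row-major order.
panel = place_items(
    initial_ranking=list(range(10)),
    prefer=lambda item: item % 2 == 0,
    choose_slot=lambda item, empty: empty[0],
    rows=2, cols=2,
)
# panel is [[0, 2], [4, 6]]: items 1, 3, 5 were skipped as non-preferred.
```

In the actual Panel-MDP, both sub-actions would be produced by a policy network trained with PPO against user feedback, rather than the fixed heuristics above.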