Paper Title

Rethinking Reinforcement Learning for Recommendation: A Prompt Perspective

Paper Authors

Xin Xin, Tiago Pimentel, Alexandros Karatzoglou, Pengjie Ren, Konstantina Christakopoulou, Zhaochun Ren

Paper Abstract

Modern recommender systems aim to improve user experience. As reinforcement learning (RL) naturally fits this objective -- maximizing a user's reward per session -- it has become an emerging topic in recommender systems. Developing RL-based recommendation methods, however, is not trivial due to the offline training challenge. Specifically, the keystone of traditional RL is to train an agent through large amounts of online exploration, making lots of 'errors' in the process. In the recommendation setting, though, we cannot afford the price of making 'errors' online. As a result, the agent needs to be trained on offline historical implicit feedback, collected under different recommendation policies; traditional RL algorithms may lead to sub-optimal policies under these offline training settings. Here we propose a new learning paradigm -- namely Prompt-Based Reinforcement Learning (PRL) -- for the offline training of RL-based recommendation agents. While traditional RL algorithms attempt to map state-action input pairs to their expected rewards (e.g., Q-values), PRL directly infers actions (i.e., recommended items) from state-reward inputs. In short, the agents are trained to predict a recommended item given the prior interactions and an observed reward value -- with simple supervised learning. At deployment time, this historical (training) data acts as a knowledge base, while the state-reward pairs are used as a prompt. The agents are thus used to answer the question: which item should be recommended given the prior interactions and the prompted reward value? We implement PRL with four notable recommendation models and conduct experiments on two real-world e-commerce datasets. Experimental results demonstrate the superior performance of our proposed methods.
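
To make the PRL paradigm described above concrete, here is a minimal, hypothetical PyTorch sketch of the training and deployment flow: a model maps (prior interactions, reward) inputs to a distribution over items and is fit with plain supervised learning; at deployment the reward input is replaced by a prompted "desired" reward. The `PRLAgent` class, the GRU sequence encoder, the reward-projection layer, and all sizes and values are illustrative assumptions, not the authors' implementation (the paper instantiates PRL on top of four existing recommendation models).

```python
import torch
import torch.nn as nn

class PRLAgent(nn.Module):
    """Toy sketch of Prompt-Based RL for recommendation:
    map a (state, prompted reward) pair to logits over candidate items."""

    def __init__(self, num_items, hidden_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, hidden_dim, padding_idx=0)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # stand-in sequence model
        self.reward_proj = nn.Linear(1, hidden_dim)                      # embeds the prompted reward value
        self.head = nn.Linear(2 * hidden_dim, num_items)

    def forward(self, item_seq, reward):
        # item_seq: (batch, seq_len) prior interactions; reward: (batch, 1) observed or prompted reward
        _, h = self.encoder(self.item_emb(item_seq))
        state = h[-1]                                                    # (batch, hidden_dim) state representation
        prompt = torch.cat([state, self.reward_proj(reward)], dim=-1)
        return self.head(prompt)                                         # logits over items

# --- offline training: supervised prediction of the logged next item ---
model = PRLAgent(num_items=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

item_seq = torch.randint(1, 1000, (32, 10))   # toy batch of logged interaction histories
reward = torch.rand(32, 1)                    # reward observed for each logged sequence
next_item = torch.randint(1, 1000, (32,))     # the item that was actually consumed next

loss = loss_fn(model(item_seq, reward), next_item)
opt.zero_grad()
loss.backward()
opt.step()

# --- deployment: prompt with a high desired reward to elicit a recommendation ---
with torch.no_grad():
    desired_reward = torch.full((32, 1), 5.0)  # hypothetical "high" reward prompt
    recommended = model(item_seq, desired_reward).argmax(dim=-1)
```

The point the sketch tries to reflect is that training reduces to supervised learning on logged (state, reward, next item) triples, while the reward fed in at inference time acts as a prompt that steers which item gets recommended.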
