Paper Title
Diversified Recommendations for Agents with Adaptive Preferences
Paper Authors
Paper Abstract
When an Agent visits a platform recommending a menu of content to select from, their choice of item depends not only on fixed preferences, but also on their prior engagements with the platform. The Recommender's primary objective is typically to encourage content consumption which optimizes some reward, such as ad revenue, but they often also aim to ensure that a wide variety of content is consumed by the Agent over time. We formalize this problem as an adversarial bandit task. At each step, the Recommender presents a menu of $k$ (out of $n$) items to the Agent, who selects one item in the menu according to their unknown preference model, which maps their history of past items to relative selection probabilities. The Recommender then observes the Agent's chosen item and receives bandit feedback of the item's reward. In addition to optimizing reward from selected items, the Recommender must also ensure that the total distribution of chosen items has sufficiently high entropy. We define a class of preference models which are locally learnable, i.e. behavior over the entire domain can be estimated by only observing behavior in a small region; this includes models representable by bounded-degree polynomials as well as functions with a sparse Fourier basis. For this class, we give an algorithm for the Recommender which obtains $\tilde{O}(T^{3/4})$ regret against all item distributions satisfying two conditions: they are sufficiently diversified, and they are instantaneously realizable at any history by some distribution over menus. We show that these conditions are closely connected: all sufficiently high-entropy distributions are instantaneously realizable at any item history. We also give a set of negative results justifying our assumptions, in the form of a runtime lower bound for non-local learning and linear regret lower bounds for alternate benchmarks.
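The sketch below is a minimal, hypothetical simulation of the Recommender-Agent interaction protocol described in the abstract: at each step the Recommender shows a menu of $k$ out of $n$ items, the Agent picks one item with probabilities that depend on its history of past items, and the Recommender receives bandit feedback of that item's reward while tracking the entropy of the empirical distribution of chosen items. The specific preference model (a softmax over history-weighted scores) and the uniformly random menu policy are placeholder assumptions for illustration only; they are not the paper's locally learnable model classes or its $\tilde{O}(T^{3/4})$-regret algorithm.

import numpy as np

def simulate_interaction(n=10, k=3, T=1000, seed=0):
    """Illustrative simulation of the Recommender-Agent protocol.

    The Agent's preference model and the Recommender's menu rule are
    placeholders, not the algorithm from the paper.
    """
    rng = np.random.default_rng(seed)
    rewards = rng.uniform(size=n)        # fixed per-item rewards, unknown to the Recommender
    pref_weights = rng.normal(size=n)    # hypothetical preference parameters
    history_counts = np.zeros(n)         # Agent's history of previously chosen items
    chosen_counts = np.zeros(n)
    total_reward = 0.0

    for t in range(1, T + 1):
        # Recommender presents a menu of k out of n items (here: uniformly at random).
        menu = rng.choice(n, size=k, replace=False)
        # Agent maps its item history to relative selection probabilities over the menu
        # (here: a softmax of history-dependent scores, purely for illustration).
        scores = pref_weights[menu] * (1.0 + history_counts[menu] / t)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        item = rng.choice(menu, p=probs)
        # Recommender observes the chosen item and bandit feedback of its reward.
        total_reward += rewards[item]
        history_counts[item] += 1
        chosen_counts[item] += 1

    # Diversity objective: entropy of the empirical distribution of chosen items.
    p = chosen_counts / chosen_counts.sum()
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return total_reward, entropy

if __name__ == "__main__":
    reward, entropy = simulate_interaction()
    print(f"total reward: {reward:.1f}, entropy of chosen items: {entropy:.3f}")

Running the sketch prints the cumulative reward together with the entropy of the chosen-item distribution, the two quantities the Recommender must trade off in the paper's setting.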