Paper Title


Rebounding Bandits for Modeling Satiation Effects

Paper Authors

Liu Leqi, Fatma Kilinc-Karzan, Zachary C. Lipton, Alan L. Montgomery

Paper Abstract


Psychological research shows that enjoyment of many goods is subject to satiation, with short-term satisfaction declining after repeated exposures to the same item. Nevertheless, proposed algorithms for powering recommender systems seldom model these dynamics, instead proceeding as though user preferences were fixed in time. In this work, we introduce rebounding bandits, a multi-armed bandit setting in which satiation dynamics are modeled as time-invariant linear dynamical systems. Expected rewards for each arm decline monotonically with consecutive exposures to it and rebound towards the initial reward whenever that arm is not pulled. Unlike classical bandit settings, methods for tackling rebounding bandits must plan ahead, and model-based methods rely on estimating the parameters of the satiation dynamics. We characterize the planning problem, showing that the greedy policy is optimal when the arms exhibit identical deterministic dynamics. To address stochastic satiation dynamics with unknown parameters, we propose Explore-Estimate-Plan (EEP), an algorithm that pulls arms methodically, estimates the system dynamics, and then plans accordingly.
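The qualitative behavior described above can be sketched in a few lines. The update rule, the parameter names (`gamma`, `base_reward`), and the unit increment per pull below are illustrative assumptions, not the paper's exact model: a single arm's satiation level evolves as a time-invariant linear system that grows with each pull and decays while the arm rests, so expected reward declines under consecutive exposures and rebounds toward its initial value otherwise.

```python
def simulate(pull_schedule, gamma=0.7, base_reward=1.0):
    """Expected rewards for one arm under a hypothetical satiation model.

    pull_schedule: sequence of 0/1 flags, 1 = the arm is pulled that step.
    Satiation decays by the factor gamma every step and increases by 1.0
    on each pull; the expected reward of a pull is base_reward minus the
    current satiation level.
    """
    satiation = 0.0
    rewards = []
    for pulled in pull_schedule:
        if pulled:
            satiation = gamma * satiation + 1.0   # exposure builds satiation
            rewards.append(base_reward - satiation)
        else:
            satiation = gamma * satiation         # rest: satiation decays
    return rewards
```

With these toy parameters, three consecutive pulls yield monotonically declining rewards, while inserting rest steps before the next pull lets the reward rebound partway toward `base_reward`.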
