Paper Title
Non-Stationary Bandit Learning via Predictive Sampling
Paper Authors
Paper Abstract
Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
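To make the failure mode concrete, below is a minimal, hypothetical sketch (not taken from the paper) of the kind of non-stationary setting the abstract describes: Thompson sampling on a Gaussian bandit whose latent arm means drift under an AR(1) process, with exact per-arm Kalman-filter posterior updates. The arm count K, drift factor gamma, and noise variances are illustrative assumptions; predictive sampling itself is specified in the paper and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(1) Gaussian bandit: each arm's latent mean drifts as
# theta[a] <- gamma * theta[a] + w,  w ~ N(0, 1 - gamma^2),
# so the stationary variance of each latent mean stays 1.
K = 3            # number of arms (assumed for illustration)
gamma = 0.9      # AR(1) retention factor (smaller => faster drift)
obs_var = 1.0    # reward observation noise variance
T = 2000

theta = rng.normal(size=K)   # true (hidden) mean rewards
mu = np.zeros(K)             # posterior means, one Kalman filter per arm
var = np.ones(K)             # posterior variances

total_reward = 0.0
for t in range(T):
    # Thompson sampling: draw one sample per arm from its posterior
    # and play the argmax.
    sample = rng.normal(mu, np.sqrt(var))
    a = int(np.argmax(sample))

    reward = theta[a] + rng.normal(scale=np.sqrt(obs_var))
    total_reward += reward

    # Kalman measurement update for the played arm.
    gain = var[a] / (var[a] + obs_var)
    mu[a] += gain * (reward - mu[a])
    var[a] *= 1.0 - gain

    # Kalman time update for every arm: the AR(1) drift shrinks the
    # posterior mean toward 0 and injects process noise, so unplayed
    # arms' posteriors widen back toward the prior. This is where
    # acquired information "loses its usefulness" over time, the effect
    # Thompson sampling ignores when choosing what to explore.
    mu *= gamma
    var = gamma**2 * var + (1.0 - gamma**2)

    # The environment itself drifts.
    theta = gamma * theta + rng.normal(scale=np.sqrt(1.0 - gamma**2), size=K)

print(f"average reward over {T} steps: {total_reward / T:.3f}")
```

In this sketch, gamma controls how quickly information decays: as gamma decreases, a reward observation tells the agent less and less about an arm's future mean, yet Thompson sampling keeps exploring as if the information were permanent. Predictive sampling, per the abstract, instead deprioritizes acquiring such quickly-expiring information.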