Paper Title

TS-UCB: Improving on Thompson Sampling With Little to No Additional Computation

Paper Authors

Jackie Baek, Vivek F. Farias

Paper Abstract

Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule, which we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from the deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.
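The abstract says each arm is scored from two ingredients: posterior sample(s) and an upper confidence bound. As a rough illustration only, here is a minimal sketch of one plausible reading of such a rule for a Beta-Bernoulli bandit: average a few posterior draws per arm, take their maximum as a target value, and rank arms by how cheaply (relative to their confidence gap) they could reach that target. The exact score, the choice of confidence bound, and the function name `ts_ucb_select` are assumptions for this sketch, not the paper's definition.

```python
import numpy as np

def ts_ucb_select(successes, failures, t, n_samples=1, rng=None):
    """Illustrative TS-UCB-style arm selection for a Beta-Bernoulli bandit.

    Sketch only: scores each arm by (theta_star - mu_k) / (ucb_k - mu_k)
    and picks the arm with the smallest score. The precise score used by
    TS-UCB is defined in the paper; this is one plausible instantiation.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = successes + 1.0  # Beta(1, 1) prior
    b = failures + 1.0
    mu = a / (a + b)  # posterior means

    # Posterior sample(s): average n_samples draws per arm, then take the
    # best sampled value across arms as the target theta_star.
    theta = rng.beta(a, b, size=(n_samples, len(a))).mean(axis=0)
    theta_star = theta.max()

    # Upper confidence bound: a simple UCB1-style bound (illustrative choice).
    n = a + b - 2.0  # pulls per arm
    ucb = mu + np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(n, 1.0))

    # Arms whose confidence gap covers the shortfall to theta_star score low.
    score = (theta_star - mu) / np.maximum(ucb - mu, 1e-12)
    return int(np.argmin(score))
```

Note that, consistent with the abstract, the rule degrades gracefully in `n_samples`: with many samples the target concentrates near the posterior-mean optimum, while a single sample keeps Thompson-style randomization.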
