Paper Title

TS-UCB: Improving on Thompson Sampling With Little to No Additional Computation

Paper Authors

Jackie Baek, Vivek F. Farias

Paper Abstract

Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule, which we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from the deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.
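The abstract says each arm is scored from two ingredients: posterior sample(s) and an upper confidence bound. As a rough illustration only, here is a minimal sketch of one plausible reading of such a rule for a Beta-Bernoulli bandit: average a few posterior draws per arm, take their maximum as a target value, and rank arms by how cheaply (relative to their confidence gap) they could reach that target. The exact score, the choice of confidence bound, and the function name `ts_ucb_select` are assumptions for this sketch, not the paper's definition.

```python
import numpy as np

def ts_ucb_select(successes, failures, t, n_samples=1, rng=None):
    """Illustrative TS-UCB-style arm selection for a Beta-Bernoulli bandit.

    Sketch only: scores each arm by (theta_star - mu_k) / (ucb_k - mu_k)
    and picks the arm with the smallest score. The precise score used by
    TS-UCB is defined in the paper; this is one plausible instantiation.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = successes + 1.0  # Beta(1, 1) prior
    b = failures + 1.0
    mu = a / (a + b)  # posterior means

    # Posterior sample(s): average n_samples draws per arm, then take the
    # best sampled value across arms as the target theta_star.
    theta = rng.beta(a, b, size=(n_samples, len(a))).mean(axis=0)
    theta_star = theta.max()

    # Upper confidence bound: a simple UCB1-style bound (illustrative choice).
    n = a + b - 2.0  # pulls per arm
    ucb = mu + np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(n, 1.0))

    # Arms whose confidence gap covers the shortfall to theta_star score low.
    score = (theta_star - mu) / np.maximum(ucb - mu, 1e-12)
    return int(np.argmin(score))
```

Note that, consistent with the abstract, the rule degrades gracefully in `n_samples`: with many samples the target concentrates near the posterior-mean optimum, while a single sample keeps Thompson-style randomization.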
