Paper Title


On Frequentist Regret of Linear Thompson Sampling

Paper Authors

Nima Hamidi, Mohsen Bayati

Paper Abstract


This paper studies the stochastic linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards. The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions. Linear Thompson Sampling (LinTS) is a popular Bayesian heuristic, supported by theoretical analysis that shows its Bayesian regret is bounded by $\widetilde{\mathcal{O}}(d\sqrt{T})$, matching minimax lower bounds. However, previous studies demonstrate that the frequentist regret bound for LinTS is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, which requires posterior variance inflation and is by a factor of $\sqrt{d}$ worse than the best optimism-based algorithms. We prove that this inflation is fundamental and that the frequentist bound of $\widetilde{\mathcal{O}}(d\sqrt{dT})$ is the best possible, by demonstrating a randomization bias phenomenon in LinTS that can cause linear regret without inflation. We propose a data-driven version of LinTS that adjusts posterior inflation using observed data, which can achieve minimax optimal frequentist regret, under additional conditions. Our analysis provides new insights into LinTS and settles an open problem in the field.
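To make the setting concrete, the following is a minimal sketch of one round of Linear Thompson Sampling with an explicit posterior-inflation factor, the quantity the paper shows is fundamental to frequentist guarantees. This is an illustrative implementation under standard Gaussian-prior assumptions, not the authors' data-driven variant; the function name `lints_step` and its parameters are hypothetical.

```python
import numpy as np

def lints_step(V, b, actions, sigma=1.0, inflation=1.0, rng=None):
    """One round of Linear Thompson Sampling (illustrative sketch).

    V        : d x d regularized design matrix, lambda*I + sum_s x_s x_s^T
    b        : d-vector, sum_s r_s x_s
    actions  : n x d array of candidate action vectors (may vary over time)
    inflation: posterior variance inflation factor (1.0 = no inflation,
               which the paper shows can cause linear frequentist regret)
    Returns the index of the chosen action.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.linalg.solve(V, b)                  # posterior mean of theta
    cov = sigma**2 * np.linalg.inv(V)           # posterior covariance
    # Sample a parameter from the (inflated) posterior.
    theta = rng.multivariate_normal(mu, inflation**2 * cov)
    # Act greedily with respect to the sampled parameter.
    return int(np.argmax(actions @ theta))
```

With a sharply concentrated posterior the sampled parameter is close to the mean, so the selected arm matches the posterior-greedy choice; with a diffuse or inflated posterior the same code explores more aggressively.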
