Paper Title
Neural Thompson Sampling
Paper Authors
Paper Abstract
Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of the total number of rounds $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
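
To make the posterior described in the abstract concrete: one natural reading is that at each round the sampled reward for arm $k$ is drawn as $\tilde{r}_k \sim \mathcal{N}\big(f(x_k;\theta),\ \nu^2\sigma_k^2\big)$ with $\sigma_k^2 = \lambda\, g(x_k;\theta)^\top U^{-1} g(x_k;\theta)/m$, where $g = \nabla_\theta f$ is the neural tangent feature, $m$ the network width, and $U$ a regularized Gram matrix of past tangent features. Below is a minimal PyTorch sketch of that arm-selection step under these assumptions; RewardNet, tangent_features, select_arm, and the hyperparameters lam and nu are illustrative names and values, not the authors' released implementation, and the network-training step between rounds is omitted.

import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): Gaussian posterior sampling whose
# mean is the network output f(x; theta) and whose variance is built from the
# neural tangent feature g(x) = grad_theta f(x; theta), as the abstract describes.

class RewardNet(nn.Module):  # illustrative reward network
    def __init__(self, d: int, m: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def tangent_features(net: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the scalar output w.r.t. all network parameters."""
    out = net(x.unsqueeze(0)).sum()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def select_arm(net, contexts, U, lam=1.0, nu=0.1, m=64):
    """Draw r_k ~ N(f(x_k), nu^2 * sigma_k^2) for each arm and play the argmax."""
    sampled = []
    for x in contexts:
        g = tangent_features(net, x)
        sigma2 = lam * g @ torch.linalg.solve(U, g) / m  # assumed variance form
        mean = net(x.unsqueeze(0)).item()
        sampled.append(mean + nu * sigma2.clamp_min(0.0).sqrt().item()
                       * torch.randn(()).item())
    return max(range(len(contexts)), key=lambda k: sampled[k])

# Toy usage (all sizes and hyperparameters are illustrative):
torch.manual_seed(0)
d, m = 5, 64
net = RewardNet(d, m)
p = sum(w.numel() for w in net.parameters())
U = torch.eye(p)                        # U_0 = lam * I with lam = 1
contexts = [torch.randn(d) for _ in range(3)]
k = select_arm(net, contexts, U, m=m)
g = tangent_features(net, contexts[k])
U += torch.outer(g, g) / m              # rank-one update after playing arm k

Note that $U$ is a $p \times p$ matrix over all network parameters; a common practical shortcut for wide networks is to keep only its diagonal so the update and solve stay cheap.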