Paper Title
Differentiable Linear Bandit Algorithm
Paper Authors
Paper Abstract
Upper Confidence Bound (UCB) is arguably the most commonly used method for linear multi-armed bandit problems. While conceptually and computationally simple, this method relies heavily on its confidence bounds, failing to strike the optimal exploration-exploitation trade-off if these bounds are not properly set. In the literature, confidence bounds are typically derived from concentration inequalities based on assumptions on the reward distribution, e.g., sub-Gaussianity. The validity of these assumptions, however, is unknown in practice. In this work, we aim to learn the confidence bound in a data-driven fashion, making it adaptive to the actual problem structure. Specifically, noting that existing UCB-type algorithms are not differentiable with respect to the confidence bound, we first propose a novel differentiable linear bandit algorithm. Then, we introduce a gradient estimator, which allows the confidence bound to be learned via gradient ascent. Theoretically, we show that the proposed algorithm achieves a $\tilde{\mathcal{O}}(\hat{\beta}\sqrt{dT})$ upper bound on the $T$-round regret, where $d$ is the dimension of the arm features and $\hat{\beta}$ is the learned size of the confidence bound. Empirical results show that $\hat{\beta}$ is significantly smaller than its theoretical upper bound, and that the proposed algorithm outperforms baselines on both simulated and real-world datasets.
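For intuition, below is a minimal sketch of the setup the abstract describes: a LinUCB-style selection rule whose confidence width `beta` is treated as a learnable parameter rather than fixed by a concentration inequality. The paper's differentiable algorithm and its gradient estimator are not detailed in the abstract, so the finite-difference update here is an illustrative stand-in, not the authors' method; all function names, parameter values, and the simulated environment are assumptions.

```python
import numpy as np

d, n_arms, T = 5, 20, 2000
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown reward parameter
arms = rng.normal(size=(n_arms, d))            # fixed arm feature vectors

def run_linucb(beta, seed):
    """Run LinUCB with confidence width `beta`; return cumulative reward."""
    noise_rng = np.random.default_rng(seed)
    A = np.eye(d)            # ridge-regularized Gram matrix
    b = np.zeros(d)          # running sum of reward-weighted features
    total = 0.0
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        # UCB score: estimated reward + beta * exploration bonus
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", arms, A_inv, arms))
        a = int(np.argmax(arms @ theta_hat + beta * bonus))
        x = arms[a]
        r = x @ theta_star + 0.1 * noise_rng.normal()  # noisy linear reward
        A += np.outer(x, x)
        b += r * x
        total += r
    return total

# Illustrative stand-in for learning beta: ascend the cumulative reward
# using a crude finite-difference gradient (the paper instead makes the
# bandit algorithm itself differentiable and uses a gradient estimator).
beta, lr, eps = 1.0, 0.05, 0.1
for step in range(10):
    grad = (run_linucb(beta + eps, step) - run_linucb(beta - eps, step)) / (2 * eps)
    beta = max(0.0, beta + lr * grad)
print(f"learned beta ~= {beta:.3f}")
```

In this toy environment, the learned `beta` typically settles well below the widths prescribed by worst-case concentration bounds, which mirrors the abstract's empirical claim that $\hat{\beta}$ is significantly smaller than its theoretical upper bound.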