Paper Title

Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits

Paper Authors

Wonyoung Kim, Myunghee Cho Paik, Min-hwan Oh

Paper Abstract

We propose a linear contextual bandit algorithm with an $O(\sqrt{dT\log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contributions either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into \textit{additive} dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. Numerical experiments support the theoretical guarantees and show that our proposed method outperforms existing linear bandit algorithms.
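
To make the setting concrete, below is a minimal simulation sketch of a linear contextual bandit in which exploration enters through explicit randomization, paired with a ridge regression estimate. It is a generic illustration under simple assumptions (an epsilon-greedy-style randomization with a hypothetical probability p_explore, Gaussian contexts and noise), not the paper's proposed estimator; in particular, the update below uses only the selected context, whereas the paper's estimator also draws contributions from all arms' contexts depending on the randomization.

    import numpy as np

    # Minimal linear contextual bandit simulation with explicit randomized
    # exploration (epsilon-greedy style) and a ridge regression estimator.
    # All quantities here (d, K, T, p_explore, noise scale) are hypothetical.
    rng = np.random.default_rng(0)
    d, K, T = 5, 10, 2000            # context dimension, arms per round, horizon
    theta_star = rng.normal(size=d)  # unknown reward parameter (simulation only)
    theta_star /= np.linalg.norm(theta_star)

    lam = 1.0                  # ridge regularization
    V = lam * np.eye(d)        # regularized Gram matrix of selected contexts
    b = np.zeros(d)            # reward-weighted sum of selected contexts
    p_explore = 0.1            # hypothetical randomization probability

    cum_regret = 0.0
    for t in range(T):
        X = rng.normal(size=(K, d))            # contexts of all K arms this round
        theta_hat = np.linalg.solve(V, b)      # ridge estimate of theta_star
        if rng.random() < p_explore:
            a = int(rng.integers(K))           # explicit randomization: random arm
        else:
            a = int(np.argmax(X @ theta_hat))  # greedy w.r.t. current estimate
        x = X[a]
        r = x @ theta_star + 0.1 * rng.normal()  # noisy linear reward
        cum_regret += np.max(X @ theta_star) - x @ theta_star
        # Standard selected-context update; the paper's estimator instead mixes
        # contributions from all arms' contexts or the selected context,
        # depending on the randomization.
        V += np.outer(x, x)
        b += r * x

    print(f"cumulative regret after T={T} rounds: {cum_regret:.1f}")

Because the randomized arm choice supplies exploration directly, this style of algorithm needs no optimism-based confidence widening at selection time; the paper's contribution is an estimator and self-normalized bound under which such randomized exploration achieves the $O(\sqrt{dT\log T})$ rate.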
