Paper Title
A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback
Paper Authors
Paper Abstract
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multi-armed bandits with delayed feedback, which, in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin, simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}\log K}\right) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which relies on oracle knowledge of the maximal delay $d_{\max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{\max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{\max}}{\Delta_{i}\log K}\right) + d_{\max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{\max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
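To get a feel for the fixed-delay adversarial bound, the following minimal sketch evaluates its two leading terms with all constants dropped. It is an illustration only, not part of the paper: the function names `adversarial_bound_terms` and `delay_dominates` are hypothetical, and the values say nothing about the hidden constants in the $\mathcal{O}(\cdot)$ notation.

```python
import math


def adversarial_bound_terms(T, K, d):
    """Leading terms of O(sqrt(TK) + sqrt(dT log K)), constants dropped.

    T: time horizon, K: number of arms, d: fixed delay.
    Returns (no_delay_term, delay_term).
    """
    no_delay_term = math.sqrt(T * K)
    delay_term = math.sqrt(d * T * math.log(K))
    return no_delay_term, delay_term


def delay_dominates(K, d):
    """True when sqrt(dT log K) > sqrt(TK), i.e. when d * log K > K.

    The horizon T cancels, so the comparison depends only on K and d.
    """
    return d * math.log(K) > K


if __name__ == "__main__":
    for d in (1, 5, 50):
        base, delayed = adversarial_bound_terms(T=10_000, K=10, d=d)
        print(f"d={d}: sqrt(TK)={base:.1f}, sqrt(dT log K)={delayed:.1f}, "
              f"delay dominates: {delay_dominates(10, d)}")
```

Ignoring constants, the delay term overtakes the standard $\sqrt{TK}$ term once $d \log K > K$, independently of $T$.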