A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Paper title

A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Authors

Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

Abstract

We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multi-armed bandits with delayed feedback, which, in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin, simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_i \log K}\right) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on oracle knowledge of the maximal delay $d_{\max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{\max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{\max}}{\Delta_i \log K}\right) + d_{\max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{\max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
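To make the fixed-delay adversarial bound concrete, the following sketch evaluates its two terms, $\sqrt{TK}$ and $\sqrt{dT\log K}$, for given values of $T$, $K$ and $d$. It is an illustration only: the function names are hypothetical, the hidden constants of the $\mathcal{O}$ notation are ignored, and nothing here is taken from the authors' code. Comparing the two terms directly shows that the delay term is the larger one exactly when $d \log K > K$.

```python
import math


def adversarial_bound_terms(T, K, d):
    """Return (sqrt(T*K), sqrt(d*T*log K)), the two terms of the
    fixed-delay adversarial bound with constants ignored.

    Hypothetical helper for illustration; assumes K >= 2 and d >= 0.
    """
    no_delay_term = math.sqrt(T * K)
    delay_term = math.sqrt(d * T * math.log(K))
    return no_delay_term, delay_term


def delay_dominates(K, d):
    """True when sqrt(d*T*log K) > sqrt(T*K), i.e. when d*log K > K.

    T cancels out of the comparison, so it is not an argument.
    """
    return d * math.log(K) > K


if __name__ == "__main__":
    for d in (1, 5, 50):
        base, delayed = adversarial_bound_terms(T=10_000, K=10, d=d)
        print(f"d={d}: sqrt(TK)={base:.1f}, sqrt(dT log K)={delayed:.1f}, "
              f"delay dominates: {delay_dominates(10, d)}")
```

With $K = 10$ arms, for example, the delay term overtakes $\sqrt{TK}$ once $d$ exceeds $K / \log K \approx 4.3$.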
