Title

Bandit algorithms: Letting go of logarithmic regret for statistical robustness

Authors

Kumar Ashutosh, Jayakrishnan Nair, Anmol Kagrecha, Krishna Jagannathan

Abstract

We study regret minimization in a stochastic multi-armed bandit setting and establish a fundamental trade-off between the regret suffered under an algorithm and its statistical robustness. Considering broad classes of underlying arms' distributions, we show that bandit learning algorithms with logarithmic regret are always inconsistent and that consistent learning algorithms always suffer a super-logarithmic regret. This result highlights the inevitable statistical fragility of all `logarithmic regret' bandit algorithms available in the literature---for instance, if a UCB algorithm designed for $σ$-subGaussian distributions is used in a subGaussian setting with a mismatched variance parameter, the learning performance could be inconsistent. Next, we show a positive result: statistically robust and consistent learning performance is attainable if we allow the regret to be slightly worse than logarithmic. Specifically, we propose three classes of distribution-oblivious algorithms that achieve an asymptotic regret that is arbitrarily close to logarithmic.
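To make the variance-parameter dependence concrete, the following is a minimal sketch of a standard UCB index for $σ$-subGaussian rewards, where the confidence width scales with the assumed $σ$. This is an illustrative textbook construction, not code from the paper; the two-arm Gaussian example below it is likewise hypothetical. If the true rewards are subGaussian with a larger variance parameter than the $σ$ supplied here, the confidence intervals are too narrow, which is the kind of mismatch the abstract warns can make learning inconsistent.

```python
import math
import random


def run_ucb(arms, horizon, sigma, rng):
    """Run UCB with index  mean + sqrt(2 * sigma^2 * ln(t) / n)  on a list of
    reward-sampling callables. Returns the pull count of each arm."""
    k = len(arms)
    pulls = [0] * k
    sums = [0.0] * k

    # Initialization: pull each arm once.
    for i in range(k):
        sums[i] += arms[i](rng)
        pulls[i] += 1

    for t in range(k + 1, horizon + 1):
        # Pick the arm with the largest upper confidence bound.
        idx = max(
            range(k),
            key=lambda i: sums[i] / pulls[i]
            + math.sqrt(2 * sigma**2 * math.log(t) / pulls[i]),
        )
        sums[idx] += arms[idx](rng)
        pulls[idx] += 1

    return pulls


# Hypothetical example: two Gaussian arms with means 0.9 and 0.1, unit
# variance, and the matched parameter sigma = 1. With a correct sigma the
# better arm should dominate the pull counts over a moderate horizon.
rng = random.Random(0)
arms = [lambda r: r.gauss(0.9, 1.0), lambda r: r.gauss(0.1, 1.0)]
pulls = run_ucb(arms, horizon=2000, sigma=1.0, rng=rng)
```

Supplying a smaller `sigma` than the true subGaussian parameter shrinks the exploration bonus, so a suboptimal arm can be prematurely abandoned; this mismatch sensitivity is what the distribution-oblivious algorithms in the paper are designed to avoid.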
