Paper Title

Complete Policy Regret Bounds for Tallying Bandits

Paper Authors

Dhruv Malik, Yuanzhi Li, Aarti Singh

Paper Abstract

Policy regret is a well-established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi-armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $K$ actions and time horizon $T$, we provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors. We additionally prove an $\tilde{\Omega}(\sqrt{mKT})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near-optimality of our method.
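
To make the setting concrete, below is a minimal sketch of a tallying bandit environment in Python. This is not the authors' construction: the class name, the Gaussian noise model, and the specific loss functions are assumptions chosen for illustration. The only structural property taken from the abstract is that the mean loss of an action depends solely on its tally over the last $m$ timesteps.

```python
import random
from collections import deque

class TallyingBanditEnv:
    """Illustrative sketch of an m-memory bounded tallying bandit environment:
    the mean loss of the action played at time t depends only on the number
    of times (tally) that this action was played over the last m timesteps."""

    def __init__(self, loss_fns, m, noise=0.05):
        # loss_fns[k](tally) is the (unknown to the learner) mean loss of action k
        # as a function of its tally; these particular functions and the Gaussian
        # noise are hypothetical, only the dependence on the tally is from the paper.
        self.loss_fns = loss_fns
        self.m = m
        self.noise = noise
        self.history = deque(maxlen=m)  # actions played in the last m timesteps

    def step(self, action):
        # Assumption for this sketch: the current play counts toward the tally.
        self.history.append(action)
        tally = sum(1 for a in self.history if a == action)
        mean_loss = self.loss_fns[action](tally)
        return mean_loss + random.gauss(0.0, self.noise)  # noisy observed loss

# Example with K = 2 hypothetical actions and memory m = 5: action 0 degrades
# with repeated play, while action 1 improves the more it is "practiced".
env = TallyingBanditEnv(
    loss_fns=[lambda n: 0.2 + 0.1 * n, lambda n: 0.9 - 0.15 * n],
    m=5,
)
observed_losses = [env.step(action=1) for _ in range(10)]
```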
