Paper Title

Simultaneously Learning Stochastic and Adversarial Bandits under the Position-Based Model

Paper Authors

Cheng Chen, Canzhe Zhao, Shuai Li

Paper Abstract

Online learning to rank (OLTR) interactively learns to choose lists of items from a large collection based on certain click models that describe users' click behaviors. Most recent works for this problem focus on the stochastic environment where the item attractiveness is assumed to be invariant during the learning process. In many real-world scenarios, however, the environment could be dynamic or even arbitrarily changing. This work studies the OLTR problem in both stochastic and adversarial environments under the position-based model (PBM). We propose a method based on the follow-the-regularized-leader (FTRL) framework with Tsallis entropy and develop a new self-bounding constraint especially designed for PBM. We prove the proposed algorithm simultaneously achieves $O(\log{T})$ regret in the stochastic environment and $O(m\sqrt{nT})$ regret in the adversarial environment, where $T$ is the number of rounds, $n$ is the number of items and $m$ is the number of positions. We also provide a lower bound of order $\Omega(m\sqrt{nT})$ for adversarial PBM, which matches our upper bound and improves over the state-of-the-art lower bound. The experiments show that our algorithm could simultaneously learn in both stochastic and adversarial environments and is competitive compared to existing methods that are designed for a single environment.
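
To make the FTRL-with-Tsallis-entropy idea in the abstract concrete, below is a minimal, self-contained Python sketch of a single FTRL step under the 1/2-Tsallis-entropy regularizer, run inside a toy PBM click simulation. This is not the paper's algorithm: the simulation constants (`n`, `m`, `T`, the attractiveness vector `alpha`, the examination probabilities `kappa`) are made up for illustration, the sketch runs an independent learner per position and allows repeated items within a list, and it omits the paper's new self-bounding constraint and regret analysis.

```python
import numpy as np

rng = np.random.default_rng(0)


def tsallis_weights(cum_loss, eta, iters=60):
    """One FTRL step with the 1/2-Tsallis-entropy regularizer.

    Finds lam < min(cum_loss) by binary search so that
    p_i = 4 / (eta * (cum_loss_i - lam))**2 sums to 1 over the items.
    """
    n = len(cum_loss)
    lo = cum_loss.min() - 2.0 * np.sqrt(n) / eta  # here sum(p) <= 1
    hi = cum_loss.min()                           # sum(p) -> inf as lam -> hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = 4.0 / (eta * (cum_loss - lam)) ** 2
        if p.sum() > 1.0:
            hi = lam  # too much probability mass: push lam further down
        else:
            lo = lam
    return p / p.sum()


# ---- toy PBM simulation (all numbers below are made up for illustration) ----
n, m, T = 8, 3, 20000                      # items, positions, rounds
alpha = rng.uniform(0.1, 0.9, size=n)      # unknown item attractiveness
kappa = np.array([1.0, 0.6, 0.3])          # position examination probabilities

# One independent learner per position -- a simplification; the paper's
# algorithm picks m distinct items per round and couples the positions
# through a self-bounding constraint tailored to PBM.
cum_loss = np.zeros((m, n))                # cumulative loss estimates

for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)                 # usual Tsallis-INF learning rate
    for k in range(m):
        p = tsallis_weights(cum_loss[k], eta)
        item = rng.choice(n, p=p)          # duplicates across positions are
                                           # not excluded in this toy sketch
        click = rng.random() < kappa[k] * alpha[item]   # PBM click model
        loss = 1.0 - float(click)
        cum_loss[k, item] += loss / p[item]  # importance-weighted estimate

for k in range(m):
    print(f"position {k}: item with lowest estimated loss = {np.argmin(cum_loss[k])}")
```

The binary search exploits the closed form of the FTRL minimizer under the 1/2-Tsallis regularizer, $p_i \propto (\eta_t(\hat{L}_i - \lambda))^{-2}$ with $\lambda$ a normalization multiplier; the schedule $\eta_t = 1/\sqrt{t}$ is the standard Tsallis-INF choice and is only an assumption here, not necessarily the schedule analyzed in the paper.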
