Paper Title

Multitask Bandit Learning Through Heterogeneous Feedback Aggregation

Paper Authors

Zhi Wang, Chicheng Zhang, Manish Kumar Singh, Laurel D. Riek, Kamalika Chaudhuri

Paper Abstract

In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $ε$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg$(ε)$, that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise similarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise similarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.
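
To make the aggregation idea concrete, below is a minimal, hypothetical sketch of an ε-tolerant UCB-style index for a single player and a single arm. It is not the paper's RobustAgg$(ε)$ algorithm; it only illustrates the general principle described in the abstract: samples collected by other players can be folded into an arm's estimate provided the confidence width absorbs an extra ε term, since their reward means may differ from this player's by up to ε. The function name and signature are assumptions for illustration.

import math

def ucb_index(own_rewards, other_rewards, epsilon, t):
    """UCB-style index for one arm at round t (t >= 1)."""
    n_own = len(own_rewards)
    if n_own == 0:
        return float("inf")  # force initial exploration of the arm

    # Index built from the player's own samples only.
    mean_own = sum(own_rewards) / n_own
    ucb_own = mean_own + math.sqrt(2.0 * math.log(t) / n_own)

    # Index that also aggregates other players' samples; the added epsilon
    # accounts for the bounded dissimilarity of their reward distributions.
    n_all = n_own + len(other_rewards)
    mean_all = (sum(own_rewards) + sum(other_rewards)) / n_all
    ucb_agg = mean_all + math.sqrt(2.0 * math.log(t) / n_all) + epsilon

    # Aggregation helps only when the variance reduction from the extra
    # samples outweighs the epsilon bias penalty, so keep the tighter index.
    return min(ucb_own, ucb_agg)

In a full multi-player simulation, each player would compute such an index for every arm in each round and pull the arm with the largest index; whether aggregation pays off depends on ε and on how many samples the other players contribute, which is the "amenability of information sharing" the regret bounds refer to.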
