Paper Title

The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity

Paper Authors

Mark Sellke, Aleksandrs Slivkins

Paper Abstract

We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the information asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors. We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the "strength of beliefs".
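The abstract's central observation is that standard Thompson Sampling becomes incentive-compatible once its posterior is warm-started with enough data points per arm. As a minimal sketch (not the paper's construction), the snippet below runs Beta-Bernoulli Thompson Sampling after an initial data-collection phase; the means, horizon, and `init_samples` count are illustrative assumptions, not values from the paper:

```python
import random

def thompson_sampling(true_means, horizon, init_samples, rng):
    """Beta-Bernoulli Thompson Sampling, warm-started by pulling
    every arm init_samples times before the main phase (a stand-in
    for the initial rounds in which the required data is collected)."""
    k = len(true_means)
    # Beta(alpha, beta) posterior per arm, starting from a uniform prior.
    alpha = [1] * k
    beta = [1] * k

    # Initial sample-collection phase: the abstract's "sample complexity"
    # question asks how many such rounds are needed.
    for arm in range(k):
        for _ in range(init_samples):
            reward = rng.random() < true_means[arm]  # Bernoulli reward
            alpha[arm] += reward
            beta[arm] += 1 - reward

    # Main phase: recommend the arm with the largest posterior sample.
    pulls = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.random() < true_means[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

rng = random.Random(0)
pulls = thompson_sampling([0.3, 0.7, 0.5], horizon=200, init_samples=20, rng=rng)
print(pulls)
```

The point of the warm start is that, with enough initial data, the posterior samples the algorithm acts on are informative enough that following its recommendation is in each self-interested agent's own interest; the performance loss due to incentives is then confined to that initial phase.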
