论文标题
在非参数多臂强盗中,最佳武器识别和固定预算
On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits
论文作者
论文摘要
我们奠定了具有固定预算t的多臂匪徒中最佳武器识别的非参数理论的基础。我们考虑一般的(可能是非参数)模型D用于武器上的分布。一个总体示例是[0,1]上所有概率分布的模型d = p(0,1)。我们提出了基于信息理论数量误识别最佳臂的平均对数概率的上限,这些数量与INTIMA相对应,而不是ISTIMA与kullback-leibler差异相对于D和给定分布中的某些分布之间的差异。通过对Audibert,Bubeck和Munos(2010年)的连续拒绝策略进行完善的分析,这是可能的。我们最终在相同的新信息理论数量方面也提供了相同的平均对数概率的下限;当(自然)对所考虑策略的假设更强时,这些下限会更大。所有这些新的上限和下限都将基于现有的界限(例如分布之间的差距)推广。
We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities that correspond to infima over Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). We finally provide lower bounds on the same average log-probability, also in terms of the same new information-theoretic quantities; these lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, e.g., on gaps between distributions.