Paper Title

Best Arm Identification in Restless Markov Multi-Armed Bandits

Authors

Karthik, P. N., Reddy, Kota Srinivas, Tan, Vincent Y. F.

Abstract

We study the problem of identifying the best arm in a multi-armed bandit environment when each arm is a time-homogeneous and ergodic discrete-time Markov process on a common, finite state space. The state evolution on each arm is governed by the arm's transition probability matrix (TPM). A decision entity that knows the set of arm TPMs but not the exact mapping of the TPMs to the arms wishes to find the index of the best arm as quickly as possible, subject to an upper bound on the error probability. The decision entity selects one arm at a time sequentially, and all the unselected arms continue to undergo state evolution ({\em restless} arms). For this problem, we derive the first-known problem instance-dependent asymptotic lower bound on the growth rate of the expected time required to find the index of the best arm, where the asymptotics is as the error probability vanishes. Further, we propose a sequential policy that, for an input parameter $R$, forcibly selects an arm that has not been selected for $R$ consecutive time instants. We show that this policy achieves an upper bound that depends on $R$ and is monotonically non-increasing as $R\to\infty$. The question of whether, in general, the limiting value of the upper bound as $R\to\infty$ matches the lower bound remains open. We identify a special case in which the upper and the lower bounds match. Prior works on best arm identification have dealt with (a) independent and identically distributed observations from the arms, and (b) rested Markov arms, whereas our work deals with the more difficult setting of restless Markov arms.
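The forced-selection mechanism described in the abstract is easy to illustrate in simulation. The following is a minimal sketch, not the paper's actual policy, stopping rule, or bound: it assumes a fixed reward attached to each observed state, uses a placeholder greedy choice whenever no arm is overdue, and encodes only the restless dynamics plus the rule that any arm left unselected for $R$ consecutive time instants is selected forcibly.

```python
import numpy as np

# Illustrative sketch (not the paper's exact policy): restless Markov arms on a
# common finite state space, with the forced-selection rule from the abstract.
# The greedy tie-break below is a placeholder choice, purely for illustration.

rng = np.random.default_rng(0)

def step(state, tpm):
    """Advance one Markov step according to the transition probability matrix."""
    return rng.choice(len(tpm), p=tpm[state])

def run(tpms, rewards, R, horizon):
    K = len(tpms)                      # number of arms
    states = [0] * K                   # arm states; all arms evolve at every instant (restless)
    last_selected = [0] * K            # time of each arm's most recent selection
    emp_mean = np.zeros(K)             # empirical mean reward of observed states
    counts = np.zeros(K)

    for t in range(1, horizon + 1):
        # Forced selection: pick an arm left unselected for R consecutive instants, if any.
        stale = [a for a in range(K) if t - last_selected[a] >= R]
        arm = stale[0] if stale else int(np.argmax(emp_mean))

        # Restless arms: every arm's state evolves, selected or not.
        states = [step(s, tpm) for s, tpm in zip(states, tpms)]

        # Only the selected arm's state is observed.
        obs = rewards[states[arm]]
        counts[arm] += 1
        emp_mean[arm] += (obs - emp_mean[arm]) / counts[arm]
        last_selected[arm] = t
    return emp_mean

# Two arms, two states; reward is a fixed function of the observed state.
tpms = [np.array([[0.9, 0.1], [0.2, 0.8]]),
        np.array([[0.5, 0.5], [0.5, 0.5]])]
print(run(tpms, rewards=np.array([0.0, 1.0]), R=3, horizon=5000))
```

In this toy setup, increasing $R$ lets the sampling concentrate on the empirically better arm for longer stretches, at the cost of less frequent forced visits to the others, which mirrors the role the parameter $R$ plays in the upper bound discussed above.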
