用颤抖的手发现一个奇怪的马尔可夫手臂

论文标题

用颤抖的手发现一个奇怪的马尔可夫手臂

Detecting an Odd Restless Markov Arm with a Trembling Hand

论文作者

Karthik, P. N., Sundaresan, Rajesh

论文摘要

在本文中，我们考虑了一个多武器强盗，其中每个手臂都是马尔可夫的过程，该过程在有限的状态空间上演变。状态空间在整个手臂上很常见，并且武器彼此独立。一个臂（奇数臂）的过渡概率矩阵与所有其他臂的常见过渡概率矩阵不同。知道这些过渡概率矩阵的决策者希望尽快识别奇数臂，同时保持决策错误的可能性较小。为此，决策者通过以顺序拉动手臂来收集手臂的观察，每次离散时间瞬间。但是，决策者的手有一只颤抖的手，实际上在任何给定时间拉动的手臂都与他打算拉的胳膊不同，概率很小。在任何给定时间的观察结果都是实际拉动的手臂及其当前状态。未观察到的武器的马尔可夫进程继续发展。这使手臂不安。在上述设置中，我们在识别奇数臂所需的预期时间上得出了第一个已知的渐近下限，渐近臂是消失的误差概率。每个手臂的持续发展为问题增加了一个新的维度，从而导致马尔可夫决策问题（MDP）在可数的状态空间上。然后，我们将某些参数化解决方案拼接到这些MDP上，并获得一系列策略的序列，这些策略的预期时间识别奇数臂在消失的误差概率方面任意接近下限。先前的作品处理独立和相同分布（跨时间）的武器和休息的马尔可夫武器，而我们的工作与躁动不安的马尔可夫武器交往。

In this paper, we consider a multi-armed bandit in which each arm is a Markov process evolving on a finite state space. The state space is common across the arms, and the arms are independent of each other. The transition probability matrix of one of the arms (the odd arm) is different from the common transition probability matrix of all the other arms. A decision maker, who knows these transition probability matrices, wishes to identify the odd arm as quickly as possible, while keeping the probability of decision error small. To do so, the decision maker collects observations from the arms by pulling the arms in a sequential manner, one at each discrete time instant. However, the decision maker has a trembling hand, and the arm that is actually pulled at any given time differs, with a small probability, from the one he intended to pull. The observation at any given time is the arm that is actually pulled and its current state. The Markov processes of the unobserved arms continue to evolve. This makes the arms restless. For the above setting, we derive the first known asymptotic lower bound on the expected time required to identify the odd arm, where the asymptotics is of vanishing error probability. The continued evolution of each arm adds a new dimension to the problem, leading to a family of Markov decision problems (MDPs) on a countable state space. We then stitch together certain parameterised solutions to these MDPs and obtain a sequence of strategies whose expected times to identify the odd arm come arbitrarily close to the lower bound in the regime of vanishing error probability. Prior works dealt with independent and identically distributed (across time) arms and rested Markov arms, whereas our work deals with restless Markov arms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题