Paper Title

Tractable Optimality in Episodic Latent MABs

Paper Authors

Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

Paper Abstract

We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to accurately estimate the latent context. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural assumptions, existing techniques to tackle partially observed settings imply the decision maker can learn a near-optimal policy with $O(A)^H$ episodes, but do not promise more. In this work, we show that learning with {\em polynomial} samples in $A$ is possible. We achieve this by using techniques from experiment design. Then, through a method-of-moments approach, we design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions. In practice, we show that we can formulate the moment-matching via maximum likelihood estimation. In our experiments, this significantly outperforms the worst-case guarantees, as well as existing practical methods.
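
To make the setting concrete, below is a minimal, self-contained Python sketch of the episodic latent bandit model and of the maximum likelihood view of moment matching mentioned in the abstract. This is a toy illustration, not the authors' procedure: the uniform-random exploration merely stands in for the paper's experiment-design step, the mixture is fit with plain EM rather than the paper's method-of-moments estimator, and all names (`run_episode`, `em_fit`, the parameter values) are hypothetical.

```python
# A minimal toy sketch, NOT the paper's algorithm: an episodic latent
# Bernoulli bandit with M latent contexts, plus an EM-style maximum
# likelihood fit of the mixture parameters from uniformly explored
# episodes. All names and constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

M, A, H = 2, 3, 10          # latent contexts, arms, episode horizon
mix = np.array([0.5, 0.5])  # mixing weights over latent contexts
# True per-context Bernoulli reward means, shape (M, A).
theta = np.array([[0.9, 0.2, 0.5],
                  [0.1, 0.8, 0.5]])

def run_episode():
    """One episode: a latent context is drawn once and held fixed for all
    H steps. Arms are pulled uniformly at random (a crude stand-in for
    the paper's experiment-design exploration)."""
    m = rng.choice(M, p=mix)
    arms = rng.integers(A, size=H)
    rewards = (rng.random(H) < theta[m, arms]).astype(float)
    return arms, rewards

def em_fit(episodes, n_iter=100):
    """EM ascent on the mixture log-likelihood of the observed episodes,
    i.e. the maximum likelihood view of moment matching, in toy form.
    Returns (weights, means) up to a relabeling of the latent contexts."""
    w = np.full(M, 1.0 / M)
    est = rng.uniform(0.2, 0.8, size=(M, A))
    for _ in range(n_iter):
        # E-step: posterior over each episode's latent context.
        post = np.zeros((len(episodes), M))
        for i, (arms, rew) in enumerate(episodes):
            ll = np.log(w)
            for m in range(M):
                p = est[m, arms]
                ll[m] += np.sum(rew * np.log(p) + (1 - rew) * np.log(1 - p))
            post[i] = np.exp(ll - ll.max())
        post /= post.sum(axis=1, keepdims=True)
        # M-step: reweighted counts give the new weights and means.
        w = post.mean(axis=0)
        num = np.zeros((M, A))
        den = np.zeros((M, A))
        for i, (arms, rew) in enumerate(episodes):
            for m in range(M):
                np.add.at(num[m], arms, post[i, m] * rew)
                np.add.at(den[m], arms, post[i, m])
        est = np.clip(num / np.maximum(den, 1e-12), 1e-6, 1 - 1e-6)
    return w, est

episodes = [run_episode() for _ in range(1000)]
w_hat, theta_hat = em_fit(episodes)
print("estimated weights:", np.round(w_hat, 2))
print("estimated means:\n", np.round(theta_hat, 2))
```

Note that EM only finds a local maximum of the likelihood; per the abstract, the worst-case guarantee comes from the method-of-moments estimator, with the MLE formulation serving as the practical counterpart evaluated in the experiments.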
