Paper Title
Learning in Observable POMDPs, without Computationally Intractable Oracles
Paper Authors
Paper Abstract
Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.
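For concreteness, one standard way to formalize the observability assumption described in the abstract is as follows; the precise notation and constant are an assumption on our part, not quoted from the paper. Let $\mathbb{O} \in \mathbb{R}^{S \times O}$ denote the observation matrix with entries $\mathbb{O}_{s,o} = \Pr(o \mid s)$. The POMDP is called $\gamma$-observable if
$$\big\| \mathbb{O}^{\top} b - \mathbb{O}^{\top} b' \big\|_1 \;\geq\; \gamma \, \big\| b - b' \big\|_1 \qquad \text{for all distributions } b, b' \text{ over states},$$
i.e., state distributions that are well separated in total variation distance induce observation distributions that remain separated, up to a factor of $\gamma$.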