Paper Title

Exponential Lower Bounds for Planning in MDPs With Linearly-Realizable Optimal Action-Value Functions

Authors

Gellért Weisz, Philip Amortila, Csaba Szepesvári

Abstract

We consider the problem of local planning in fixed-horizon and discounted Markov Decision Processes (MDPs) with linear function approximation and a generative model, under the assumption that the optimal action-value function lies in the span of a feature map that is available to the planner. Previous work has left open the question of whether there exist sound planners that need only poly(H,d) queries regardless of the MDP, where H is the horizon and d is the dimensionality of the features. We answer this question in the negative: we show that any sound planner must query at least $\min(\exp(\Omega(d)), \Omega(2^H))$ samples in the fixed-horizon setting and $\exp(\Omega(d))$ samples in the discounted setting. We also show that for any $\delta > 0$, the least-squares value iteration algorithm with $O(H^5 d^{H+1}/\delta^2)$ queries can compute a $\delta$-optimal policy in the fixed-horizon setting. We discuss implications and remaining open questions.
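The abstract's positive result refers to least-squares value iteration (LSVI) run by backward induction over the horizon, fitting linear action-value estimates from generative-model queries. Below is a minimal illustrative sketch of that scheme. The toy MDP, the feature map `phi`, and the sample count `n_samples` are hypothetical assumptions for the sake of a runnable example; they are not the paper's hard-instance construction, nor the parameter choices behind the $O(H^5 d^{H+1}/\delta^2)$ query bound.

```python
"""A minimal sketch of least-squares value iteration (LSVI) with linear
features and a generative model in the fixed-horizon setting. The toy
MDP, feature map, and sample sizes are placeholders, not the paper's
construction."""
import numpy as np

rng = np.random.default_rng(0)
H, d = 4, 3                                   # horizon and feature dim (toy)
states, actions = range(5), range(2)
P = rng.dirichlet(np.ones(5), size=(5, 2))    # toy transition kernel P[s, a]
R = rng.uniform(size=(5, 2))                  # toy reward table r(s, a)

def phi(s, a):
    # Hypothetical feature map; linear realizability of Q* is assumed here,
    # not checked.
    return np.array([1.0, s / 4.0, float(a)])

def lsvi(n_samples=50):
    # theta[h] parameterizes the stage-h action-value estimate; theta[H] = 0.
    theta = [np.zeros(d) for _ in range(H + 1)]
    for h in reversed(range(H)):              # backward induction over stages
        X, y = [], []
        for s in states:
            for a in actions:
                for _ in range(n_samples):    # generative-model queries
                    s2 = rng.choice(5, p=P[s, a])
                    v2 = max(phi(s2, b) @ theta[h + 1] for b in actions)
                    X.append(phi(s, a))
                    y.append(R[s, a] + v2)
        # Least-squares fit: phi(s, a) @ theta[h] ~ r(s, a) + V_{h+1}(s').
        theta[h], *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return theta

theta = lsvi()
greedy = [max(actions, key=lambda a: phi(s, a) @ theta[0]) for s in states]
print("greedy first-stage actions:", greedy)
```

Note how the per-stage regression targets depend on the next stage's fitted parameters, which is the source of the $d^{H}$-type blow-up in the query bound: errors compound multiplicatively across the H backward-induction steps.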
