Paper Title
Lower Bounds for Policy Iteration on Multi-action MDPs
Paper Authors
Paper Abstract
Policy Iteration (PI) is a classical family of algorithms to compute an optimal policy for any given Markov Decision Problem (MDP). The basic idea in PI is to begin with some initial policy and to repeatedly update the policy to one from an improving set, until an optimal policy is reached. Different variants of PI result from the (switching) rule used for improvement. An important theoretical question is how many iterations a specified PI variant will take to terminate as a function of the number of states $n$ and the number of actions $k$ in the input MDP. While there has been considerable progress towards upper-bounding this number, there are fewer results on lower bounds. In particular, existing lower bounds primarily focus on the special case of $k = 2$ actions. We devise lower bounds for $k \geq 3$. Our main result is that a particular variant of PI can take $\Omega(k^{n/2})$ iterations to terminate. We also generalise existing constructions on $2$-action MDPs to scale lower bounds by a factor of $k$ for some common deterministic variants of PI, and by $\log(k)$ for corresponding randomised variants.
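To make the "improve, then re-evaluate" loop described in the abstract concrete, below is a minimal sketch of one common deterministic variant, greedy (Howard's) PI, which switches to a better action at every improvable state simultaneously. The random MDP it runs on (transition tensor `P`, reward matrix `R`, discount `gamma`) is an illustrative stand-in, not the lower-bound construction studied in the paper, and the iteration count it reports is just the number of improvement steps until no state has an improving action.

```python
# Minimal greedy Policy Iteration on a small random MDP (illustrative only).
import numpy as np

def policy_evaluation(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for the value of a deterministic policy."""
    n = P.shape[1]
    P_pi = P[policy, np.arange(n), :]   # n x n transition matrix under the policy
    R_pi = R[np.arange(n), policy]      # one-step reward under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def policy_iteration(P, R, gamma=0.9):
    """P: (k, n, n) transition tensor, R: (n, k) rewards. Returns an optimal policy."""
    k, n, _ = P.shape
    policy = np.zeros(n, dtype=int)      # arbitrary initial policy
    iterations = 0
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(a, s, s') * V(s')
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        improved = np.argmax(Q, axis=1)  # greedy switch at every improvable state
        iterations += 1
        if np.array_equal(improved, policy):   # no improving action anywhere: optimal
            return policy, V, iterations
        policy = improved

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 6, 3                          # 6 states, 3 actions
    P = rng.random((k, n, n))
    P /= P.sum(axis=2, keepdims=True)    # normalise rows into probability distributions
    R = rng.random((n, k))
    pi, V, iters = policy_iteration(P, R)
    print("optimal policy:", pi, "found in", iters, "iterations")
```

Other PI variants covered by the paper's bounds differ only in the switching rule (e.g. switching a single state per iteration, or choosing among improving actions at random) rather than in the evaluate/improve structure shown here.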