Paper Title
Towards Painless Policy Optimization for Constrained MDPs
Paper Authors
Paper Abstract
We study policy optimization in an infinite horizon, $γ$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework to use coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm. Assuming that the action-value functions are $\varepsilon_b$-close to the span of the $d$-dimensional state-action features and no sampling errors, we prove that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - γ)^3 \sqrt{T}} + \frac{\varepsilon_b\sqrt{d}}{(1 - γ)^2} \right)$ reward sub-optimality and an $O\left(\frac{1}{(1 - γ)^2 \sqrt{T}} + \frac{\varepsilon_b \sqrt{d}}{1 - γ} \right)$ constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
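The abstract does not spell out the update rules of CBP. As a rough, illustrative sketch (not the authors' algorithm), the snippet below shows a parameter-free one-dimensional Krichevsky-Trofimov coin-betting learner, the kind of online learner a coin-betting primal-dual scheme could use to update a Lagrange multiplier without tuning a learning rate. The class name, the initial wealth, and the toy violation sequence are assumptions made purely for illustration.

# Minimal sketch of a 1-D Krichevsky-Trofimov (KT) coin-betting learner.
# Hypothetical names and values; NOT the paper's CBP algorithm.
class KTCoinBetting:
    """Parameter-free coin-betting learner for 1-D online linear optimization."""

    def __init__(self, initial_wealth=1.0):
        self.wealth = initial_wealth  # accumulated wealth (initial endowment)
        self.coin_sum = 0.0           # running sum of observed coin outcomes
        self.t = 0                    # number of rounds played so far

    def predict(self):
        # KT betting fraction: average of past coin outcomes; the iterate is
        # that fraction of the current wealth. No step size is required.
        beta = self.coin_sum / (self.t + 1)
        return beta * self.wealth

    def update(self, grad):
        # Interpret the (assumed bounded, |grad| <= 1) negative loss gradient
        # as the coin outcome, settle the bet, and advance the round counter.
        coin = -grad
        self.wealth += coin * self.predict()
        self.coin_sum += coin
        self.t += 1


# Toy usage: treat the clipped prediction as a Lagrange multiplier for a
# single constraint and feed back synthetic violation signals.
learner = KTCoinBetting()
for violation in [0.5, 0.2, -0.1, 0.3]:   # hypothetical constraint violations
    lam = max(0.0, learner.predict())      # projected dual variable
    learner.update(-violation)             # dual player maximizes lam * violation

Because the betting fraction adapts to the observed gradients, no learning rate has to be tuned, which is the property the abstract highlights when contrasting CBP with gradient descent-ascent.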