Paper Title
A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP
Paper Authors
Paper Abstract
As an important framework for safe Reinforcement Learning, the Constrained Markov Decision Process (CMDP) has been extensively studied in the recent literature. However, despite the rich results under various on-policy learning settings, a fundamental understanding of the offline CMDP problem is still lacking, in terms of both algorithm design and the information-theoretic sample complexity lower bound. In this paper, we focus on solving the CMDP problem when only offline data are available. By adopting the concept of the single-policy concentrability coefficient $C^*$, we establish an $\Omega\left(\frac{\min\left\{|\mathcal{S}||\mathcal{A}|,\,|\mathcal{S}|+I\right\} C^*}{(1-\gamma)^3\varepsilon^2}\right)$ sample complexity lower bound for the offline CMDP problem, where $I$ stands for the number of constraints. By introducing a simple yet novel deviation control mechanism, we propose a near-optimal primal-dual learning algorithm called DPDL. This algorithm provably guarantees zero constraint violation, and its sample complexity matches the above lower bound up to an $\tilde{\mathcal{O}}((1-\gamma)^{-1})$ factor. A comprehensive discussion of how to handle the unknown constant $C^*$ and the potential asynchronous structure of the offline dataset is also included.
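For reference, the abstract's two quantitative claims can be combined into a single order-of-magnitude statement (a paraphrase of the claimed matching, not the paper's exact theorem): since DPDL matches the $\Omega\left(\frac{\min\left\{|\mathcal{S}||\mathcal{A}|,\,|\mathcal{S}|+I\right\} C^*}{(1-\gamma)^3\varepsilon^2}\right)$ lower bound up to an $\tilde{\mathcal{O}}((1-\gamma)^{-1})$ factor, its sample complexity is of order
$$\tilde{\mathcal{O}}\!\left(\frac{\min\left\{|\mathcal{S}||\mathcal{A}|,\,|\mathcal{S}|+I\right\}\, C^*}{(1-\gamma)^4\,\varepsilon^2}\right).$$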