论文标题
使非策略控制(几乎)像随机的那样容易
Making Non-Stochastic Control (Almost) as Easy as Stochastic
论文作者
论文摘要
最近的文献在理解\ emph {在线LQR}方面取得了很大的进步:对经典控制问题的现代学习理论,学习者试图最佳地控制I.I.D的未知线性动力学系统,并受到I.I.D的影响。高斯噪音。现在可以理解,最佳时光范围$ t $针对最佳控制法律缩放为$ \widetildeθ(\ sqrt {t})$。在本文中,我们表明,即使在相当多的一般的非传统控制模型中,也可以实现相同的遗憾率(相对于合适的基准),在该模型中,该系统是由\ emph {任意对抗性}噪声驱动的(Agarwal等人,2019年)。换句话说,\ emph {随机性在线LQR}几乎没有收益。 我们获得了最佳$ \ wideTilde {\ Mathcal {o}}(\ sqrt {t})$遗憾当学习者未知时,并且$ \ mathrm {poly}(\ log t)$遗憾(如果知道成本功能强劲)(如lqr)。我们的算法基于在线牛顿步骤的新型变体(Hazan等人,2007年),该变体适应了可能是对抗性干扰引起的几何形状,我们的分析取决于对仿制药的“政策遗憾”的界限,以限制OCO-With-With-Memory框架中某些结构性损失(Anava等人2015年)。此外,我们的结果可容纳非策略控制设置的全部一般性:对抗(可能是非二次)成本,部分状态观察以及完全对抗性过程和观察噪声。
Recent literature has made much progress in understanding \emph{online LQR}: a modern learning-theoretic take on the classical control problem in which a learner attempts to optimally control an unknown linear dynamical system with fully observed state, perturbed by i.i.d. Gaussian noise. It is now understood that the optimal regret on time horizon $T$ against the optimal control law scales as $\widetildeΘ(\sqrt{T})$. In this paper, we show that the same regret rate (against a suitable benchmark) is attainable even in the considerably more general non-stochastic control model, where the system is driven by \emph{arbitrary adversarial} noise (Agarwal et al. 2019). In other words, \emph{stochasticity confers little benefit in online LQR}. We attain the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the dynamics are unknown to the learner, and $\mathrm{poly}(\log T)$ regret when known, provided that the cost functions are strongly convex (as in LQR). Our algorithm is based on a novel variant of online Newton step (Hazan et al. 2007), which adapts to the geometry induced by possibly adversarial disturbances, and our analysis hinges on generic "policy regret" bounds for certain structured losses in the OCO-with-memory framework (Anava et al. 2015). Moreover, our results accomodate the full generality of the non-stochastic control setting: adversarially chosen (possibly non-quadratic) costs, partial state observation, and fully adversarial process and observation noise.