Paper Title
Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition
Paper Authors
Paper Abstract
We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting, respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the existing work of Rosenberg and Mansour (2020), which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias, respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.
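As a rough illustration of the Online Mirror Descent framework mentioned above (the notation here is a generic sketch of OMD over occupancy measures, not the paper's exact algorithm), the learner maintains an occupancy measure $q_k$ over state-action pairs and, in the bandit setting, updates it against an importance-weighted cost estimator:

$$q_{k+1} = \mathop{\mathrm{argmin}}_{q \in \Omega} \; \eta \langle q, \widehat{c}_k \rangle + D_\psi(q, q_k), \qquad \widehat{c}_k(s,a) = \frac{c_k(s,a)\,\mathbb{I}\{(s,a)\text{ visited in episode }k\}}{q_k(s,a)},$$

where $\Omega$ is the set of valid occupancy measures, $\eta$ is a learning rate, and $D_\psi$ is the Bregman divergence induced by a regularizer $\psi$. The skewed occupancy measure space and the correction term described in the abstract can then be understood as modifications to $\Omega$ and $\widehat{c}_k$, respectively, that introduce controlled bias in exchange for lower variance.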