Paper Title

Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality

Authors

Ming Yin, Wenjing Chen, Mengdi Wang, Yu-Xiang Wang

Abstract

Goal-oriented reinforcement learning, where the agent needs to reach a goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, the stochastic shortest path (SSP) problem, has been intensively studied in the online setting. Nevertheless, it remains understudied when such online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value-iteration-based algorithms for tackling both the offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds, which in turn imply worst-case bounds that are near-minimax optimal. We hope our study can help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration.
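The algorithms studied in the paper are value-iteration-based. As a point of reference, below is a minimal sketch of plain value iteration, together with its policy-evaluation analogue, on a known finite SSP instance. The tabular model, function names, and goal-state convention here are illustrative assumptions; the paper's offline algorithms instead work from quantities estimated out of historical data, which this sketch does not attempt to reproduce.

```python
import numpy as np

def ssp_value_iteration(P, c, goal, n_iters=10_000, tol=1e-8):
    """Plain value iteration on a known, finite SSP instance (illustrative sketch).

    P    : (S, A, S) array, P[s, a, s'] = transition probability.
    c    : (S, A) array of nonnegative per-step costs.
    goal : index of the absorbing, zero-cost goal state.
    Returns the optimal cost-to-go V* and a greedy deterministic policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Bellman backup: Q(s, a) = c(s, a) + E_{s' ~ P(.|s,a)}[V(s')]
        Q = c + P @ V            # shape (S, A)
        V_new = Q.min(axis=1)
        V_new[goal] = 0.0        # cost-to-go at the goal is zero by convention
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, (c + P @ V).argmin(axis=1)

def ssp_policy_evaluation(P, c, pi, goal, n_iters=10_000, tol=1e-8):
    """Evaluation analogue: cost-to-go V^pi of a fixed deterministic policy pi."""
    S = P.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)
    for _ in range(n_iters):
        # Fix the action to pi(s) instead of minimizing over actions.
        V_new = c[idx, pi] + P[idx, pi] @ V
        V_new[goal] = 0.0
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

Convergence of these iterations relies on the standard SSP conditions (a proper policy exists and every improper policy incurs unbounded expected cost); unlike the discounted setting, there is no discount factor contracting the backup.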
