Paper Title

Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation

Paper Authors

Xiaoteng Ma, Zhipeng Liang, Jose Blanchet, Mingwen Liu, Li Xia, Jiheng Zhang, Qianchuan Zhao, Zhengyuan Zhou

Paper Abstract

Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms -- one for each of the two settings -- that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.
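To make the setting in the abstract concrete, the sketch below is a loose illustration (not the paper's actual algorithm) of a one-step distributionally robust value backup with $d$-dimensional linear features: next-state values are shifted toward a pessimistic worst case before being regressed onto the features. The feature matrix `phi`, the robustness radius `rho`, and the crude worst-case adjustment are all assumptions introduced here for illustration only.

```python
import numpy as np

def robust_backup_weights(phi, next_values, rho, reg=1e-3):
    """Illustrative sketch: fit linear weights w so that phi @ w approximates
    a pessimistic (worst-case-adjusted) backup of the next-state values.

    phi:         (N, d) feature matrix for observed (state, action) pairs
    next_values: (N,) estimated values of the observed next states
    rho:         robustness radius; rho = 0 recovers the non-robust backup
    reg:         ridge regularization for the least-squares fit
    """
    # Crude worst-case adjustment (hypothetical, for illustration): shift each
    # sampled next-state value toward the pessimistic end by rho times its
    # distance from the minimum, mimicking optimization against a perturbed
    # transition kernel without the paper's exact dual formulation.
    pessimistic = next_values - rho * (next_values - next_values.min())
    # Ridge regression onto the d-dimensional linear features, as in standard
    # least-squares value iteration.
    d = phi.shape[1]
    gram = phi.T @ phi + reg * np.eye(d)
    return np.linalg.solve(gram, phi.T @ pessimistic)

# Toy usage: d = 4 features, N = 500 offline transitions (synthetic data).
rng = np.random.default_rng(0)
phi = rng.normal(size=(500, 4))
next_values = rng.uniform(0.0, 1.0, size=500)
w_robust = robust_backup_weights(phi, next_values, rho=0.2)
w_nonrobust = robust_backup_weights(phi, next_values, rho=0.0)
```

Setting `rho = 0` collapses the sketch to an ordinary least-squares backup, which loosely mirrors the robust-versus-non-robust comparison the abstract's experiments describe.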
