Paper Title

Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning

Authors

Deyao Zhu, Li Erran Li, Mohamed Elhoseiny

Abstract

Reinforcement Learning (RL) methods are typically applied directly in environments to learn policies. In some complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environments can be difficult. Focusing on the offline RL setting, we aim to build a simple and discrete world model that abstracts the original environment. RL methods are applied to our world model instead of the environment data for simplified policy learning. Our world model, dubbed Value Memory Graph (VMG), is designed as a directed-graph-based Markov decision process (MDP) of which vertices and directed edges represent graph states and graph actions, respectively. As state-action spaces of VMG are finite and relatively small compared to the original environment, we can directly apply the value iteration algorithm on VMG to estimate graph state values and figure out the best graph actions. VMG is trained from and built on the offline RL dataset. Together with an action translator that converts the abstract graph actions in VMG to real actions in the original environment, VMG controls agents to maximize episode returns. Our experiments on the D4RL benchmark show that VMG can outperform state-of-the-art offline RL methods in several goal-oriented tasks, especially when environments have sparse rewards and long temporal horizons. Code is available at https://github.com/TsuTikgiau/ValueMemoryGraph
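
To make the value-iteration step on the graph MDP concrete, below is a minimal Python sketch of value iteration on a small graph-structured MDP in the spirit of VMG: vertices stand in for graph states and each directed edge is a discrete graph action with a deterministic transition and an edge reward. The toy graph, reward values, and helper names (value_iteration, best_graph_action) are illustrative assumptions for exposition, not the authors' implementation.

    # Minimal sketch: value iteration on a small directed-graph MDP.
    # graph[s] is a list of (next_state, reward) pairs; each outgoing edge
    # is treated as one discrete graph action with a deterministic transition.
    # The graph below is a toy example, not data from the paper.
    graph = {
        "s0": [("s1", 0.0), ("s2", 0.0)],
        "s1": [("s3", 0.0)],
        "s2": [("s3", 1.0)],   # sparse reward on the edge reaching the goal
        "s3": [],              # goal / terminal vertex: no outgoing edges
    }

    def value_iteration(graph, gamma=0.99, tol=1e-6, max_iters=1000):
        """Estimate graph state values V(s) = max over outgoing edges of r + gamma * V(s')."""
        values = {s: 0.0 for s in graph}
        for _ in range(max_iters):
            delta = 0.0
            for s, edges in graph.items():
                if not edges:          # terminal vertex keeps its value
                    continue
                best = max(r + gamma * values[s_next] for s_next, r in edges)
                delta = max(delta, abs(best - values[s]))
                values[s] = best
            if delta < tol:            # stop once the largest update is tiny
                break
        return values

    def best_graph_action(graph, values, s, gamma=0.99):
        """Pick the outgoing edge (graph action) with the highest one-step backup."""
        return max(graph[s], key=lambda edge: edge[1] + gamma * values[edge[0]])

    values = value_iteration(graph)
    print(values)                                  # s2 gets value 1.0, s0 about 0.99
    print(best_graph_action(graph, values, "s0"))  # prefers the edge toward "s2"

Because the graph's state-action space is finite, this backup converges in a handful of sweeps; in the full method, the selected graph action would then be handed to the action translator to produce a real action in the original environment.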
