Paper Title

Grounding Language with Visual Affordances over Unstructured Data

Paper Authors

Oier Mees, Jessica Borja-Diaz, Wolfram Burgard

Paper Abstract

Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires large-scale data collection and frequent human intervention to reset the environment or help correct the current policies. In this work, we propose a novel approach to efficiently learn general-purpose language-conditioned robot skills from unstructured, offline and reset-free data in the real world by exploiting a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language. We evaluate our method in extensive experiments on both simulated and real-world robotic tasks, achieving state-of-the-art performance on the challenging CALVIN benchmark and learning over 25 distinct visuomotor manipulation tasks with a single policy in the real world. We find that when paired with LLMs to break down abstract natural language instructions into subgoals via few-shot prompting, our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches. Code and videos are available at http://hulc2.cs.uni-freiburg.de.
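
The abstract describes pairing the learned policy with an LLM that decomposes an abstract natural language instruction into subgoals via few-shot prompting. The Python sketch below only illustrates that general idea and is not the authors' code: the prompt text, the example decomposition, and the query_llm callable are all assumptions standing in for whatever LLM backend and prompt the paper actually uses.

# Minimal sketch (assumed, not from the paper): few-shot subgoal decomposition.
# `query_llm` is a hypothetical callable that sends a prompt to an LLM and
# returns its text completion.

FEW_SHOT_PROMPT = """\
Decompose the instruction into short, executable robot subgoals.

Instruction: tidy up the workspace and turn off the light
Subgoals:
1. pick up the block and place it in the drawer
2. push the button to turn off the light

Instruction: {instruction}
Subgoals:
"""


def decompose_instruction(instruction, query_llm):
    """Return the list of subgoal strings parsed from the LLM completion."""
    completion = query_llm(FEW_SHOT_PROMPT.format(instruction=instruction))
    subgoals = []
    for line in completion.splitlines():
        line = line.strip()
        # Keep only numbered lines such as "1. pick up the block ..."
        if line and line[0].isdigit() and "." in line:
            subgoals.append(line.split(".", 1)[1].strip())
    return subgoals

In such a setup, each returned subgoal would be fed, one at a time, to the language-conditioned policy, with the affordance model guiding the robot toward the relevant region of the scene.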
