Paper Title
Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions

Authors

Chilam Cheang, Haitao Lin, Yanwei Fu, Xiangyang Xue

Abstract

This paper studies the task of grasping any object from known categories by free-form language instructions. This task demands techniques from computer vision, natural language processing, and robotics. We bring these disciplines together on this open challenge, which is essential to human-robot interaction. Critically, the key challenge lies in inferring the category of objects from linguistic instructions and accurately estimating the 6-DoF pose of unseen objects from the known classes. In contrast, previous works focus on inferring the pose of object candidates at the instance level, which significantly limits their applications in real-world scenarios. In this paper, we propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention. To this end, we propose a novel two-stage method. In particular, the first stage grounds the target in the RGB image through a language description of the names, attributes, and spatial relations of objects. The second stage extracts and segments point clouds from the cropped depth image and estimates the full 6-DoF object pose at the category level. In this manner, our approach can locate the specific object by following human instructions and estimate the full 6-DoF pose of a category-known but unseen instance that was not used to train the model. Extensive experimental results show that our method is competitive with state-of-the-art language-conditioned grasping methods. Importantly, we deploy our approach on a physical robot to validate the usability of our framework in real-world applications. Please refer to the supplementary material for the demo videos of our robot experiments.
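The two-stage pipeline described in the abstract can be sketched as below. This is only an illustrative outline of the data flow (language grounding in RGB, then category-level pose estimation from the cropped depth); every function name and all stub logic here are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def ground_target(rgb, instruction):
    """Stage 1 (hypothetical stub): ground the object referred to by the
    language instruction (name, attributes, spatial relations) in the RGB
    image and return a bounding box (x, y, w, h)."""
    h, w = rgb.shape[:2]
    # Placeholder: a real grounding model would predict this box.
    return (w // 4, h // 4, w // 2, h // 2)

def estimate_category_pose(depth, box):
    """Stage 2 (hypothetical stub): crop the depth image with the grounded
    box, lift and segment the object point cloud, then estimate a
    category-level 6-DoF pose as a 4x4 rigid transform."""
    x, y, w, h = box
    crop = depth[y:y + h, x:x + w]
    _ = crop  # a real model would back-project `crop` and run a pose network
    return np.eye(4)  # placeholder: identity pose

def language_guided_pose(rgb, depth, instruction):
    """Full pipeline: language grounding followed by pose estimation."""
    box = ground_target(rgb, instruction)
    pose = estimate_category_pose(depth, box)
    return box, pose

# Usage on dummy inputs:
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.zeros((480, 640), dtype=np.float32)
box, pose = language_guided_pose(rgb, depth, "the red mug left of the bowl")
```

The grasp itself would then be executed from the estimated pose; that part is outside this sketch.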
