Paper Title

3D Concept Grounding on Neural Fields

Paper Authors

Yining Hong, Yilun Du, Chunru Lin, Joshua B. Tenenbaum, Chuang Gan

Abstract

In this paper, we address the challenging problem of 3D concept grounding (i.e., segmenting and learning visual concepts) by looking at RGBD images and reasoning about paired questions and answers. Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded. In contrast, humans are capable of grounding concepts on the underlying 3D representation of images. However, traditionally inferred 3D representations (e.g., point clouds, voxel grids, and meshes) cannot capture continuous 3D features flexibly, thus making it challenging to ground concepts to 3D regions based on the language description of the object being referred to. To address both issues, we propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts. Specifically, each 3D coordinate in a scene is represented as a high-dimensional descriptor. Concept grounding can then be performed by computing the similarity between the descriptor vector of a 3D coordinate and the vector embedding of a language concept, which enables segmentation and concept learning to be performed jointly on neural fields in a differentiable fashion. As a result, both 3D semantic and instance segmentations can emerge directly from question-answering supervision, using a set of defined neural operators on top of neural fields (e.g., filtering and counting). Experimental results show that our proposed framework outperforms unsupervised/language-mediated segmentation models on semantic and instance segmentation tasks, and outperforms existing models on challenging 3D-aware visual reasoning tasks. Furthermore, our framework generalizes well to unseen shape categories and real scans.
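The core grounding step described in the abstract, computing similarity between per-coordinate descriptors and a concept embedding to produce a soft segmentation, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the cosine-similarity choice, and the sigmoid temperature are all assumptions made for illustration.

```python
import numpy as np

def ground_concept(descriptors, concept_embedding, temperature=0.1):
    """Soft-assign each 3D point to a language concept.

    descriptors: (N, D) array, one high-dimensional descriptor per 3D coordinate
        (as the abstract describes, each coordinate maps to a descriptor vector).
    concept_embedding: (D,) embedding of a language concept (e.g., "red", "cube").
    temperature: illustrative sharpness parameter, not from the paper.

    Returns an (N,) array of values in (0, 1): a soft, differentiable
    segmentation mask over the 3D coordinates.
    """
    # Normalize so the dot product is cosine similarity in [-1, 1].
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    c = concept_embedding / np.linalg.norm(concept_embedding)
    sim = d @ c
    # Squash similarity into a per-point membership probability.
    return 1.0 / (1.0 + np.exp(-sim / temperature))
```

Because the mask is produced by differentiable operations, downstream neural operators such as filtering (multiplying features by the mask) or counting (summing the mask) can backpropagate question-answering loss into both the descriptors and the concept embeddings.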
