Paper Title

Position-Aware Contrastive Alignment for Referring Image Segmentation

Paper Authors

Bo Chen, Zhiwei Hu, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo

Paper Abstract

Referring image segmentation aims to segment the target object described by a given natural language expression. Typically, referring expressions contain complex relationships between the target and its surrounding objects. The main challenge of this task is to understand the visual and linguistic content simultaneously and to accurately locate the referred object among all instances in the image. Currently, the most effective way to solve the above problem is to obtain aligned multi-modal features by computing the correlation between visual and linguistic feature modalities under the supervision of the ground-truth mask. However, existing paradigms have difficulty thoroughly understanding visual and linguistic content because they cannot directly perceive information about the objects surrounding the referred target. This prevents them from learning aligned multi-modal features, which leads to inaccurate segmentation. To address this issue, we present a position-aware contrastive alignment network (PCAN) that enhances the alignment of multi-modal features by guiding the interaction between vision and language with prior position information. Our PCAN consists of two modules: 1) the Position Aware Module (PAM), which provides position information for all objects related to the natural language description, and 2) the Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment by comparing the features of the referred object with those of related objects. Extensive experiments on three benchmarks demonstrate that our PCAN performs favorably against state-of-the-art methods. Our code will be made publicly available.
