Paper Title

SeqTR: A Simple yet Universal Network for Visual Grounding

Authors

Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji

Abstract

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling. In addition, SeqTR also shares the same optimization objective for all tasks with a simple cross-entropy loss, further reducing the complexity of deploying hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) the existing state-of-the-art methods, proving that a simple yet universal approach for visual grounding is indeed feasible. Source code is available at https://github.com/sean-zhuh/SeqTR.
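
To make the point-prediction formulation concrete, below is a minimal sketch of how a bounding box can be turned into a sequence of discrete coordinate tokens and trained with plain cross-entropy. The helper name quantize_box and the bin count of 1000 are illustrative assumptions for this sketch, not details taken from the paper or the released code.

```python
import torch
import torch.nn.functional as F

def quantize_box(box, img_w, img_h, num_bins=1000):
    """Map a box (x1, y1, x2, y2) in pixels to 4 discrete coordinate tokens.

    Each coordinate is normalized to [0, 1] and bucketed into one of
    num_bins bins; the bin index is the token id. (Illustrative values;
    the paper's actual quantization granularity may differ.)
    """
    x1, y1, x2, y2 = box
    norm = torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])
    return (norm * (num_bins - 1)).round().long().clamp(0, num_bins - 1)

# Training is next-token classification: one cross-entropy loss whether the
# target sequence encodes a box (REC) or sampled mask contour points (RES).
target = quantize_box((48, 32, 320, 280), img_w=640, img_h=480)  # shape (4,)
logits = torch.randn(4, 1000)  # stand-in model outputs, one distribution per token
loss = F.cross_entropy(logits, target)
```

Under this view, RES differs only in the target sequence: points sampled from the mask contour are quantized the same way, which is why no task-specific mask decoder or hand-crafted loss is needed.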
