Paper Title

GLIPv2: Unifying Localization and Vision-Language Understanding

Paper Authors

Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

Paper Abstract

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
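
The abstract describes detection being recast as phrase grounding: image regions are scored against the words of a text prompt rather than a fixed label set, and a region-word contrastive objective supervises those scores. The sketch below is a minimal illustration of that idea only, not the GLIPv2 implementation; all names (alignment_scores, grounding_loss, the feature dimensions, and the toy targets) are hypothetical.

```python
# Minimal, illustrative sketch of region-word alignment for grounded detection.
# This is NOT the authors' code; shapes and names are assumptions for illustration.
import torch
import torch.nn.functional as F


def alignment_scores(region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """Dot-product alignment between image regions and prompt tokens.

    region_feats: (num_regions, dim) region/box features from an image encoder
    word_feats:   (num_words, dim)   token features from a text encoder
    returns:      (num_regions, num_words) alignment logits
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    return region_feats @ word_feats.t()


def grounding_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Grounding-as-detection loss: each region is classified against the prompt's
    words instead of a fixed category list. `targets` is a binary
    (num_regions, num_words) matrix marking which words describe each region."""
    return F.binary_cross_entropy_with_logits(scores, targets)


if __name__ == "__main__":
    # Toy usage with random features; dimensions are arbitrary.
    regions = torch.randn(5, 256)   # 5 candidate regions
    words = torch.randn(8, 256)     # 8 prompt tokens
    scores = alignment_scores(regions, words)
    targets = torch.zeros(5, 8)
    targets[0, 2] = 1.0             # pretend region 0 is described by word 2
    print(grounding_loss(scores, targets))
```

Because the same region-word scores can be supervised either by grounding annotations or contrastively across an image-text batch, one scoring head can serve both the localization and the understanding objectives, which is the unification the abstract refers to.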
