Paper Title

GLIPv2: Unifying Localization and Vision-Language Understanding

Paper Authors

Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

Paper Abstract

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
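
The abstract describes detection being recast as phrase grounding: image regions are scored against the words of a text prompt rather than a fixed label set, and a region-word contrastive objective supervises those scores. The sketch below is a minimal illustration of that idea only, not the GLIPv2 implementation; all names (alignment_scores, grounding_loss, the feature dimensions, and the toy targets) are hypothetical.

```python
# Minimal, illustrative sketch of region-word alignment for grounded detection.
# This is NOT the authors' code; shapes and names are assumptions for illustration.
import torch
import torch.nn.functional as F


def alignment_scores(region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """Dot-product alignment between image regions and prompt tokens.

    region_feats: (num_regions, dim) region/box features from an image encoder
    word_feats:   (num_words, dim)   token features from a text encoder
    returns:      (num_regions, num_words) alignment logits
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    return region_feats @ word_feats.t()


def grounding_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Grounding-as-detection loss: each region is classified against the prompt's
    words instead of a fixed category list. `targets` is a binary
    (num_regions, num_words) matrix marking which words describe each region."""
    return F.binary_cross_entropy_with_logits(scores, targets)


if __name__ == "__main__":
    # Toy usage with random features; dimensions are arbitrary.
    regions = torch.randn(5, 256)   # 5 candidate regions
    words = torch.randn(8, 256)     # 8 prompt tokens
    scores = alignment_scores(regions, words)
    targets = torch.zeros(5, 8)
    targets[0, 2] = 1.0             # pretend region 0 is described by word 2
    print(grounding_loss(scores, targets))
```

Because the same region-word scores can be supervised either by grounding annotations or contrastively across an image-text batch, one scoring head can serve both the localization and the understanding objectives, which is the unification the abstract refers to.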
