Paper Title

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Paper Authors

Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

Abstract

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, most existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We identify two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input, and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves state-of-the-art performance on all these tasks under the unsupervised setting.
