Paper Title

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Paper Authors

Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

Abstract

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, most existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We identify two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input, and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves state-of-the-art performance on all these tasks under the unsupervised setting.
