VL-BEiT: Generative Vision-Language Pretraining

Paper Title

VL-BEiT: Generative Vision-Language Pretraining

Authors

Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei

Abstract

We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
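
The unified masked-prediction idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `[MASK]` placeholder, the 0.4 mask ratio, and the use of string stand-ins for image tokens are all assumptions made for illustration. It only shows how one masking routine could serve text-only, image-only, and paired inputs alike.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns (corrupted, targets), where targets maps each masked
    position to the original token a model would be trained to recover.
    """
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, targets


def make_example(text_tokens=None, image_tokens=None, mask_ratio=0.4, seed=0):
    """Build one masked-prediction example from text, an image, or a pair.

    Assumption: a pair is represented as the text tokens followed by the
    image tokens, and masking is applied over the joint sequence.
    """
    rng = random.Random(seed)
    sequence = list(text_tokens or []) + list(image_tokens or [])
    return mask_tokens(sequence, mask_ratio, rng)


if __name__ == "__main__":
    text = ["a", "dog", "runs"]
    image = ["v12", "v7", "v99", "v3"]  # stand-ins for discrete image tokens
    for name, kwargs in [
        ("text only", {"text_tokens": text}),
        ("image only", {"image_tokens": image}),
        ("image-text pair", {"text_tokens": text, "image_tokens": image}),
    ]:
        corrupted, targets = make_example(**kwargs)
        print(name, corrupted, targets)
```

In this sketch the same routine produces training examples for all three data types, which mirrors the "one unified pretraining task, one shared backbone" framing; a real system would feed the corrupted sequence to the shared Transformer and compute a prediction loss at the masked positions only.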
