Paper Title

VLMAE: Vision-Language Masked Autoencoder

Paper Authors

Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren

Paper Abstract

Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data. However, we observe that most existing VLP methods focus on modeling the interactions between image and text features while neglecting the information disparity between image and text, and thus suffer from focal bias. To address this problem, we propose a vision-language masked autoencoder framework (VLMAE). VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features. Unlike previous works, VLMAE attends to almost all critical patches in an image, providing a more comprehensive understanding. Extensive experiments demonstrate that VLMAE achieves better performance on various vision-language downstream tasks, including visual question answering, image-text retrieval, and visual grounding, even with up to 20% pre-training speedup.
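
The abstract does not spell out the exact architecture or losses, but the core idea it describes is MAE-style visual generative learning: mask a large fraction of image patches, encode only the visible ones, and reconstruct the masked content so the vision branch learns fine-grained, unbiased patch features. Below is a minimal, heavily simplified sketch of that masking-and-reconstruction step in PyTorch; all names and hyperparameters (ToyVisualMAE, mask_ratio=0.75, the tiny two-layer encoder, the linear decoder) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ToyVisualMAE(nn.Module):
    """Illustrative MAE-style visual branch: encode visible patches, reconstruct masked ones.

    Hypothetical sketch, not VLMAE's actual architecture; the text branch and
    image-text alignment objectives are omitted entirely.
    """

    def __init__(self, patch_dim: int = 768, mask_ratio: float = 0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.Linear(patch_dim, patch_dim)  # stand-in for a lightweight decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) pre-extracted patch embeddings
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Randomly choose which patches stay visible (MAE-style random masking).
        noise = torch.rand(B, N, device=patches.device)
        ids_keep = noise.argsort(dim=1)[:, :num_keep]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # Encode only the visible patches.
        encoded = self.encoder(visible)

        # Scatter encoded tokens back; masked slots get a learnable mask token.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, D), encoded)
        recon = self.decoder(full)

        # Reconstruction loss is computed only on the masked positions.
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, ids_keep, 0.0)
        per_patch_loss = ((recon - patches) ** 2).mean(dim=-1)
        return (per_patch_loss * mask).sum() / mask.sum()


if __name__ == "__main__":
    model = ToyVisualMAE()
    dummy_patches = torch.randn(2, 196, 768)  # e.g. a 14x14 grid of ViT patch embeddings
    print(model(dummy_patches))  # scalar reconstruction loss over masked patches
```

In the actual method this reconstruction objective would be trained jointly with the usual image-text alignment losses; the sketch only isolates the visual generative part described in the abstract.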
