Paper Title

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

Paper Authors

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang

Paper Abstract

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on aligned image-caption data as well as unaligned image-only and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual representations, and the semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts and align the visual and textual semantic spaces across different types of corpora. The experiments show that our grounded learning method improves textual and visual semantic alignment, which in turn improves performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are publicly available at the UNIMO project page: https://unimo-ptm.github.io/.
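
As a rough illustration of the "shared grounded space" idea described in the abstract, below is a minimal, hypothetical PyTorch sketch: a learnable dictionary of grounded tokens that both image and text features are matched against, so that a single Transformer encoder can process aligned image-caption pairs as well as unaligned text-only or image-only batches. All class names, dimensions, and the top-k matching scheme here are illustrative assumptions for exposition, not the authors' released implementation (see the project page above for that).

```python
# Minimal, hypothetical sketch of a "shared grounded space" (assumption: not the
# official UNIMO-2 code; names, sizes, and top-k matching are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroundedSpace(nn.Module):
    """A learnable dictionary of grounded tokens shared by both modalities.

    Image or text features are compared (by cosine similarity) with the
    dictionary entries, and the best-matching entries are returned as a short
    sequence of grounded tokens that the unified Transformer can attend to.
    """

    def __init__(self, num_tokens: int = 512, dim: int = 256, top_k: int = 8):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(num_tokens, dim))
        self.top_k = top_k

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, dim) from either the image or the text encoder
        sim = F.normalize(features, dim=-1) @ F.normalize(self.dictionary, dim=-1).T
        scores = sim.mean(dim=1)                            # (batch, num_tokens)
        top_idx = scores.topk(self.top_k, dim=-1).indices   # (batch, top_k)
        return self.dictionary[top_idx]                     # (batch, top_k, dim)


class UnifiedEncoder(nn.Module):
    """One Transformer encoder over [text tokens; grounded tokens; image patches]."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.grounded = GroundedSpace(dim=dim)

    def forward(self, text_feats, image_feats=None):
        # For unaligned text-only batches, the grounded tokens derived from the
        # text itself stand in for the missing visual context (and vice versa).
        source = image_feats if image_feats is not None else text_feats
        grounded = self.grounded(source)
        parts = [text_feats, grounded] if image_feats is None \
            else [text_feats, grounded, image_feats]
        return self.encoder(torch.cat(parts, dim=1))


# Toy usage with random features standing in for real text/image encoders.
model = UnifiedEncoder()
text = torch.randn(2, 16, 256)    # e.g. token embeddings from a text encoder
image = torch.randn(2, 49, 256)   # e.g. patch embeddings from a visual encoder
out_paired = model(text, image)   # aligned image-caption batch
out_text_only = model(text)       # unaligned text-only batch
print(out_paired.shape, out_text_only.shape)
```

The point of the sketch is only that the grounded tokens live in a single dictionary shared by both modalities, so text-only and image-only batches are still projected into a common semantic space that the same Transformer consumes; the paper's actual grounding and training objectives should be taken from the publication and released code.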
