Paper Title
Unifying Vision-Language Representation Space with Single-tower Transformer
Paper Authors
Paper Abstract
Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can simply be regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify the difficulties in learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from previous works that learn modality-specific representation spaces, such as zero-shot object localization, text-guided visual reasoning, and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified, modality-agnostic VLP framework.
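The contrastive objective the abstract describes, learning invariant features from two views of the same underlying content, is commonly formulated as a symmetric InfoNCE loss over paired embeddings. The sketch below is illustrative only: the function names, NumPy formulation, and temperature value are assumptions, not the paper's actual implementation.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings.

    z_a, z_b: (N, D) arrays; row i of each side forms one positive pair
    (e.g. an image and its caption). All other rows in the batch act as
    negatives. Names and defaults are illustrative, not from the paper.
    """
    # Normalize so the dot products below are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) pairwise similarity matrix
    idx = np.arange(len(logits))
    # Cross-entropy with the diagonal (matched pairs) as targets,
    # averaged over the a->b and b->a directions.
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_ab + loss_ba) / 2
```

In a modality-agnostic, one-tower setup such as the one the abstract motivates, both `z_a` and `z_b` would come from the same shared encoder applied to images and captions alike, rather than from two modality-specific towers.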