Paper Title
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Paper Authors
Paper Abstract
Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it suffers from low inference efficiency due to heavy attention layers. Recently, two-stream methods like CLIP and ALIGN with high inference efficiency have also shown promising performance; however, they only consider instance-level alignment between the two streams (thus there is still room for improvement). To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pre-training model termed COTS for image-text retrieval by enhancing cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, we leverage two extra levels of cross-modal interactions in our COTS: (1) Token-level interaction - a masked vision-language modeling (MVLM) learning objective is devised without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. (2) Task-level interaction - a KL-alignment learning objective is devised between text-to-image and image-to-text retrieval tasks, where the probability distribution per task is computed with the negative queues in momentum contrastive learning. Under a fair comparison setting, our COTS achieves the highest performance among all two-stream methods and comparable performance (but 10,800X faster in inference) w.r.t. the latest single-stream methods. Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art on the widely-used MSR-VTT dataset.
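To make the instance-level and task-level interactions described above more concrete, here is a minimal mathematical sketch consistent with the abstract; the notation ($v_i$, $t_i$, $\hat{t}_j$, $\hat{v}_j$, $\tau$) is assumed for illustration and is not taken from the paper, and the exact form of the objectives may differ from the authors' formulation. Given an L2-normalized image feature $v_i$ and text feature $t_i$ of a matched pair, and index-aligned momentum queues of text features $\{\hat{t}_j\}$ and image features $\{\hat{v}_j\}$ (the positive pair plus the stored negatives), the per-task retrieval distributions can be written as

\[
p^{v2t}_j(v_i) = \frac{\exp(v_i^\top \hat{t}_j / \tau)}{\sum_{k} \exp(v_i^\top \hat{t}_k / \tau)}, \qquad
p^{t2v}_j(t_i) = \frac{\exp(t_i^\top \hat{v}_j / \tau)}{\sum_{k} \exp(t_i^\top \hat{v}_k / \tau)},
\]

where $\tau$ is a temperature. The instance-level alignment corresponds to a momentum contrastive (InfoNCE-style) loss, i.e., the cross-entropy of these distributions against the index of the positive pair, while the task-level interaction can be sketched as a symmetric KL divergence that aligns the two tasks' distributions for the same pair:

\[
\mathcal{L}_{\mathrm{task}} = \tfrac{1}{2}\Big(\mathrm{KL}\big(p^{v2t}(v_i)\,\|\,p^{t2v}(t_i)\big) + \mathrm{KL}\big(p^{t2v}(t_i)\,\|\,p^{v2t}(v_i)\big)\Big).
\]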