Paper Title
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Paper Authors
Paper Abstract
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. While these datasets reach the order of 10 million samples, the labor cost is prohibitive for further scaling. Conversely, unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions. As a result, unimodal encoders have achieved state-of-the-art (SOTA) results on many downstream tasks. However, challenges remain when applying them to VL tasks. The pretraining data is not optimal for cross-modal architectures and requires heavy computational resources. In addition, unimodal architectures lack the cross-modal interactions that have demonstrated significant benefits for VL tasks. Therefore, how to best leverage pretrained unimodal encoders for VL tasks is still an area of active research. In this work, we propose a method to leverage unimodal vision and text encoders for VL tasks that augments existing VL approaches while conserving computational complexity. Specifically, we propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained unimodal encoders to cross-modal VL encoders. Second, to better capture nuanced impacts on VL task performance, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data constraints and conditions of domain shift. Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data. Finally, MAD outperforms concurrent works utilizing the pretrained vision encoder from CLIP. Code will be made available.
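For readers who want a concrete picture of the kind of objective the abstract describes, below is a minimal PyTorch sketch of distilling features from a frozen unimodal teacher into a cross-modal student with per-sample adaptive weighting. This is not the authors' implementation: the cosine feature-matching loss, the linear projection head, the weighting signal, and the 0.5 loss coefficient are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: adaptive feature distillation from a frozen unimodal
# teacher encoder (e.g. a CLIP-style image encoder) into a cross-modal VL student.
# All design choices here (projection head, cosine matching, weighting) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveDistillationLoss(nn.Module):
    """Matches student features to teacher features, rescaling each sample's
    contribution by an externally supplied adaptive weight (hypothetical)."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Project student features into the teacher's feature space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats, teacher_feats, sample_weights):
        # Cosine-similarity matching between aligned student and teacher features.
        aligned = F.normalize(self.proj(student_feats), dim=-1)
        target = F.normalize(teacher_feats, dim=-1)
        per_sample = 1.0 - (aligned * target).sum(dim=-1)  # shape [B]
        # "Adaptive" part: per-sample weights modulate how much each example is distilled.
        return (sample_weights * per_sample).mean()


if __name__ == "__main__":
    B, d_student, d_teacher = 8, 768, 512
    student_feats = torch.randn(B, d_student, requires_grad=True)  # cross-modal VL encoder output
    teacher_feats = torch.randn(B, d_teacher)                      # frozen unimodal teacher features
    weights = torch.rand(B)                                        # stand-in for an adaptive weighting signal

    distill = AdaptiveDistillationLoss(d_student, d_teacher)
    task_loss = torch.tensor(0.0)                                  # placeholder for the VL task loss (VQA/VCR/SNLI-VE)
    loss = task_loss + 0.5 * distill(student_feats, teacher_feats, weights)
    loss.backward()
    print(float(loss))
```

In practice the weighting signal would come from the method itself (for example, a measure of how useful the teacher's features are for a given sample), and the distillation term would be added to the downstream task loss as shown above.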