Paper Title
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Paper Authors
Paper Abstract
CLIP has shown remarkable zero-shot capability on a wide range of vision tasks. Previously, CLIP was regarded only as a powerful visual encoder. However, after being pre-trained with language supervision on a large number of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. We first evaluate CLIP's zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task. We then propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task. We achieve competitive zero/few-shot results on the visual question answering and visual entailment tasks without introducing any additional pre-training procedure.
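The abstract's zero-shot setting relies on casting a vision-language task as CLIP's native image-text matching. Below is a minimal sketch of that general idea for answer selection, assuming the HuggingFace "openai/clip-vit-base-patch32" checkpoint; the prompt template and candidate answers are illustrative placeholders, not the templates or fine-tuning strategy used in the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
question = "What color is the cat?"
candidate_answers = ["black", "white", "orange"]

# Turn each (question, answer) pair into a caption-like text prompt,
# so the task reduces to CLIP's image-text similarity scoring.
prompts = [f"question: {question} answer: {a}" for a in candidate_answers]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_prompts); the highest-scoring prompt
# corresponds to the selected answer.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
print(candidate_answers[scores.argmax().item()])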