Paper Title
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Paper Authors
Paper Abstract
CLIP has shown remarkable zero-shot capability on a wide range of vision tasks. Previously, CLIP was regarded only as a powerful visual encoder. However, after being pre-trained with language supervision on a large number of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. We first evaluate CLIP's zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task. We then propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task. We achieve competitive zero/few-shot results on the visual question answering and visual entailment tasks without introducing any additional pre-training procedure.
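The abstract's zero-shot setting relies on casting a vision-language task as CLIP's native image-text matching. Below is a minimal sketch of that general idea for answer selection, assuming the HuggingFace "openai/clip-vit-base-patch32" checkpoint; the prompt template and candidate answers are illustrative placeholders, not the templates or fine-tuning strategy used in the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
question = "What color is the cat?"
candidate_answers = ["black", "white", "orange"]

# Turn each (question, answer) pair into a caption-like text prompt,
# so the task reduces to CLIP's image-text similarity scoring.
prompts = [f"question: {question} answer: {a}" for a in candidate_answers]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, num_prompts); the highest-scoring prompt
# corresponds to the selected answer.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
print(candidate_answers[scores.argmax().item()])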