Paper Title
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Paper Authors
Paper Abstract
Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models at large scale in both model architecture and amount of data. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, but they leave the usage of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revise the role of the linear classifier and replace the classifier with different knowledge from the pre-trained model. We utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning. Our empirical study shows that our method improves both the performance and the training speed of video classification, with a negligible change to the model. Our simple yet effective tuning paradigm achieves state-of-the-art performance and efficient training on various video recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In particular, our paradigm achieves a state-of-the-art accuracy of 87.8% on Kinetics-400, and also surpasses previous methods by 20–50% absolute top-1 accuracy under zero-shot and few-shot settings on five popular video datasets. Code and models can be found at https://github.com/whwu95/Text4Vis .
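The core idea of replacing a randomly initialized linear head with frozen semantic targets from a pretrained text encoder can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text embeddings here are random stand-ins for what would, in practice, come from a pretrained language/text encoder (e.g. CLIP's) applied to the class names, and all shapes and names are hypothetical.

```python
import numpy as np

def normalize(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_classes, embed_dim = 400, 512  # e.g. Kinetics-400 classes

# Stand-in for text-encoder embeddings of the class names.
# In the paradigm described above, these are precomputed once from a
# pretrained text encoder and then frozen (no gradients, no new parameters).
class_embeds = normalize(rng.standard_normal((num_classes, embed_dim)))

def classify(video_feats, class_embeds):
    # Cosine-similarity logits between video features and the frozen
    # semantic targets, replacing a randomly initialized linear classifier.
    return normalize(video_feats) @ class_embeds.T

video_feats = rng.standard_normal((8, embed_dim))  # a batch of 8 clip features
logits = classify(video_feats, class_embeds)
print(logits.shape)  # (8, 400)
```

Because the classifier weights are fixed semantic embeddings rather than learned parameters, the same matrix can be rebuilt from new class names, which is what makes zero-shot transfer to unseen label sets possible.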