Expanding Language-Image Pretrained Models for General Video Recognition

Paper Title

Expanding Language-Image Pretrained Models for General Video Recognition

Authors

Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling

Abstract

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP.
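
The abstract describes a cross-frame attention mechanism that exchanges information across frames. As a rough illustration of that general idea only (not the authors' implementation), the sketch below applies plain scaled dot-product self-attention over one feature vector per frame, with a residual connection. The function name `cross_frame_attention`, the single-token-per-frame input, and the use of identity query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_tokens):
    """Let each frame attend to every frame in the clip (hypothetical sketch).

    frame_tokens: list of T per-frame feature vectors, each a list of d floats.
    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real model would learn these projections.
    """
    d = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for query in frame_tokens:
        # Similarity of this frame to every frame along the temporal axis.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in frame_tokens]
        weights = softmax(scores)
        # Attention-weighted mixture of all frames' features.
        mixed = [sum(w * value[i] for w, value in zip(weights, frame_tokens))
                 for i in range(d)]
        # Residual connection keeps the original per-frame information.
        updated.append([q + m for q, m in zip(query, mixed)])
    return updated


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = cross_frame_attention(frames)
```

Each output vector has the same shape as its input, which is what would allow a module of this kind to be inserted between existing per-frame layers without changing the rest of the network.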
