Paper Title

VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval

Authors

Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang

Abstract

Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings a huge computational burden with many more parameters, but also leads to knowledge forgetting from the upstream model. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve performance at different scales of trainable parameters. The basic idea of the VoP enhancements is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that, compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
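
To make the prompt-tuning idea concrete, below is a minimal, self-contained PyTorch sketch of layer-wise ("deep") prompt tuning on a frozen transformer encoder, in the spirit of the video and text prompts described above. It is an illustration under assumptions, not the authors' implementation: the class name PromptedEncoder, the toy nn.TransformerEncoderLayer backbone, and hyperparameters such as n_prompts are hypothetical stand-ins for the paper's CLIP encoders.

```python
# Minimal sketch of layer-wise ("deep") prompt tuning on a frozen
# transformer encoder. Illustrative only: PromptedEncoder and its
# hyperparameters are hypothetical, not the VoP authors' API.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, d_model=512, n_layers=12, n_prompts=8):
        super().__init__()
        # Frozen backbone layers, standing in for a pre-trained CLIP encoder.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        for p in self.layers.parameters():
            p.requires_grad = False  # only the prompts below are trained
        # A small set of trainable prompt tokens for each layer.
        self.prompts = nn.ParameterList(
            nn.Parameter(0.02 * torch.randn(n_prompts, d_model))
            for _ in range(n_layers)
        )
        self.n_prompts = n_prompts

    def forward(self, x):
        # x: (batch, seq_len, d_model) token (or frame) embeddings.
        for layer, prompt in zip(self.layers, self.prompts):
            # Prepend this layer's prompts, run the frozen layer,
            # then drop the prompt positions before the next layer.
            p = prompt.unsqueeze(0).expand(x.size(0), -1, -1)
            x = layer(torch.cat([p, x], dim=1))[:, self.n_prompts:]
        return x

enc = PromptedEncoder()
trainable = sum(p.numel() for p in enc.parameters() if p.requires_grad)
total = sum(p.numel() for p in enc.parameters())
print(f"trainable fraction: {trainable / total:.4%}")
```

Freezing the backbone and training only the per-layer prompts is what keeps the trainable fraction small; with the toy sizes above, the printed fraction comes out on the order of 0.1%, comparable to the figure quoted in the abstract.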
