Paper Title
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Paper Authors
Paper Abstract
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP, effectively hitchhiking on the image-text representation for video tasks. However, there has been limited success in learning temporal aggregation that outperforms mean-pooling the image-level representations extracted per frame by CLIP. We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and over mean-pooling. In doing so, we provide an improved baseline for others to compare against, and demonstrate state-of-the-art performance of this simple baseline on a suite of long video retrieval benchmarks.
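To make the query-scoring baseline concrete, below is a minimal PyTorch sketch of the aggregation step it describes: per-frame CLIP embeddings are scored against the text query, the scores are softmax-normalised over frames, and the video embedding is the resulting weighted mean. The function name, the `temperature` parameter, and the exact normalisation choices are our own illustrative assumptions, not the paper's verbatim implementation.

```python
import torch
import torch.nn.functional as F

def query_scored_video_embedding(frame_embeds: torch.Tensor,
                                 query_embed: torch.Tensor,
                                 temperature: float = 0.01) -> torch.Tensor:
    """Aggregate per-frame CLIP embeddings into one video embedding,
    weighting each frame by its softmaxed similarity to the query.

    frame_embeds: (num_frames, dim) image-level embeddings, one per frame.
    query_embed:  (dim,) text embedding of the query.
    """
    # L2-normalise so dot products are cosine similarities, as in CLIP.
    frame_embeds = F.normalize(frame_embeds, dim=-1)
    query_embed = F.normalize(query_embed, dim=-1)

    # Per-frame relevance scores against the query.
    scores = frame_embeds @ query_embed                       # (num_frames,)

    # Softmax over frames; `temperature` is a hypothetical knob here --
    # the paper's precise scoring details may differ.
    weights = torch.softmax(scores / temperature, dim=0)      # (num_frames,)

    # Query-conditioned weighted mean of frame embeddings.
    return (weights.unsqueeze(-1) * frame_embeds).sum(dim=0)  # (dim,)
```

For comparison, the mean-pooling baseline the abstract refers to is simply `frame_embeds.mean(dim=0)`; the key difference is that the weighted mean above is conditioned on the query, so frames irrelevant to the text contribute less to the video representation.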