Paper Title
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
Paper Authors
Paper Abstract
For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching, and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we vastly improve the performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised models by over 74%. Further, we show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics, and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.
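The three-step pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-clip query-clip similarity scores would come from an off-the-shelf pretrained vision-language model, but here they are a hypothetical stand-in input, and the threshold/merging heuristics are assumptions for demonstration only.

```python
# Hedged sketch of a three-step zero-shot VMR pipeline:
# (1) moment proposal, (2) moment-query matching, (3) postprocessing.
# In the actual method each step uses an off-the-shelf model; here the
# similarity scores between the text query and each video clip are given
# as plain numbers (a stand-in for pretrained-model outputs).

def propose_moments(scores, threshold=0.5):
    """Step 1: propose candidate moments as maximal runs of clips whose
    query similarity exceeds a threshold (illustrative heuristic)."""
    moments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            moments.append((start, i - 1))
            start = None
    if start is not None:
        moments.append((start, len(scores) - 1))
    return moments

def match_moments(moments, scores):
    """Step 2: score each proposal by its mean query-clip similarity
    and rank proposals from best to worst."""
    return sorted(
        ((sum(scores[a:b + 1]) / (b - a + 1), (a, b)) for a, b in moments),
        reverse=True,
    )

def postprocess(ranked, gap=1):
    """Step 3: merge proposals separated by at most `gap` clips,
    keeping the higher score of the merged pair."""
    merged = []
    for score, (a, b) in sorted(ranked, key=lambda x: x[1]):
        if merged and a - merged[-1][1][1] - 1 <= gap:
            prev_score, (pa, _) = merged[-1]
            merged[-1] = (max(prev_score, score), (pa, b))
        else:
            merged.append((score, (a, b)))
    return sorted(merged, reverse=True)

# Example: 8 clips; the query matches clips 2-3 and clip 5 strongly.
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
ranked = match_moments(propose_moments(scores), scores)
result = postprocess(ranked)  # one merged moment spanning clips 2-5
```

The split into proposal, matching, and postprocessing mirrors the abstract's description; real off-the-shelf components (e.g. a pretrained image-text encoder for the similarity scores) would slot into steps 1-2 without changing this overall structure.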