Paper Title
Temporal Alignment Networks for Long-term Video
Paper Authors
Paper Abstract
The objective of this paper is a temporal alignment network that ingests long-term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment. The challenge is to train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise, and are only weakly aligned when relevant. Apart from proposing the alignment network, we also make four contributions: (i) we describe a novel co-training method that enables denoising and training on raw instructional videos without using manual annotation, despite the considerable noise; (ii) to benchmark the alignment performance, we manually curate a 10-hour subset of HowTo100M, totalling 80 videos, with sparse temporal descriptions. Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin; (iii) we apply the trained model in a zero-shot setting to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2, and weakly supervised video action segmentation on Breakfast-Action; (iv) we use the automatically aligned HowTo100M annotations for end-to-end finetuning of the backbone model, and obtain improved performance on downstream action recognition tasks.
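A minimal sketch of the alignment decision described in the abstract, assuming per-segment video features and a sentence embedding in a shared space; the function name, feature dimensions, and threshold are illustrative assumptions, not the paper's actual architecture or values.

```python
# Illustrative sketch only: score a sentence against each temporal segment of a
# video, then decide (1) alignability by thresholding the best match and
# (2) the alignment as the best-matching segment. All names and values are
# assumptions for illustration, not taken from the paper.
import torch
import torch.nn.functional as F

def align_sentence(video_feats: torch.Tensor,   # (T, D): one feature per video segment
                   sent_feat: torch.Tensor,     # (D,):   sentence embedding
                   threshold: float = 0.5):
    """Return (alignable, best_segment_index, per_segment_scores) for one sentence."""
    # Cosine similarity between the sentence and every temporal segment.
    scores = F.cosine_similarity(video_feats, sent_feat.unsqueeze(0), dim=-1)  # (T,)
    best_score, best_segment = scores.max(dim=0)
    # The sentence is treated as "alignable" only if its best match is confident enough.
    alignable = bool(best_score.item() > threshold)
    return alignable, int(best_segment.item()), scores

# Usage example with random features: 100 segments of 512-d video features.
video_feats = torch.randn(100, 512)
sent_feat = torch.randn(512)
alignable, segment, _ = align_sentence(video_feats, sent_feat)
print(alignable, segment)
```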