Paper Title
Revealing Single Frame Bias for Video-and-Language Learning
Paper Authors
Paper Abstract
Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if yes, whether the performance gain is worth the drastically-increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
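The abstract does not spell out what the frame ensemble strategy at inference time looks like, so the following is only a minimal sketch of one generic possibility, not the authors' implementation: sample a few frames uniformly from the video, score each frame against the text independently with a single-frame model, and aggregate the per-frame scores. The function names (`sample_frame_indices`, `ensemble_video_text_score`), the use of cosine similarity, and the mean/max reductions are all assumptions made for illustration. The embeddings are plain lists of floats standing in for the outputs of a hypothetical single-frame encoder.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sample_frame_indices(num_frames_total: int, num_samples: int) -> List[int]:
    """Pick frame indices at the midpoints of equal-length segments (uniform sampling)."""
    if num_frames_total <= 0 or num_samples <= 0:
        raise ValueError("num_frames_total and num_samples must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames_total)
    segment = num_frames_total / num_samples
    return [int((i + 0.5) * segment) for i in range(num_samples)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_video_text_score(
    frame_embeddings: Sequence[Vector],
    text_embedding: Vector,
    reduce: str = "mean",
) -> float:
    """Score each frame against the text independently, then aggregate the scores.

    Each frame is treated in isolation, so no temporal order is modeled;
    the ensemble only pools evidence from several static views of the video.
    """
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    scores = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    if reduce == "mean":
        return sum(scores) / len(scores)
    if reduce == "max":
        return max(scores)
    raise ValueError(f"unknown reduce mode: {reduce!r}")


if __name__ == "__main__":
    # Toy embeddings standing in for a hypothetical single-frame encoder's outputs.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    text = [1.0, 0.0]
    print(sample_frame_indices(num_frames_total=8, num_samples=4))
    print(ensemble_video_text_score(frames, text, reduce="mean"))
    print(ensemble_video_text_score(frames, text, reduce="max"))
```

Because every frame is scored on its own, shuffling the frames leaves the result unchanged. That makes a sketch like this a simple way to see why strong results from single-frame models point to a static appearance bias in a dataset rather than to temporal understanding.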