Paper Title
Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube
Paper Authors
Paper Abstract
Pretraining from unlabelled web videos has quickly become the de facto means of achieving high performance on many video understanding tasks. Features are learned via prediction of grounded relationships between visual content and automatic speech recognition (ASR) tokens. However, prior pretraining work has been limited to only instructional videos; a priori, we expect this domain to be relatively "easy": speakers in instructional videos will often reference the literal objects/actions being depicted. We ask: can similar models be trained on more diverse video corpora? And, if so, what types of videos are "grounded" and what types are not? We fit a representative pretraining model to the diverse YouTube8M dataset, and study its success and failure cases. We find that visual-textual grounding is indeed possible across previously unexplored video categories, and that pretraining on a more diverse set results in representations that generalize to both non-instructional and instructional domains.
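The grounding objective described in the abstract, predicting whether visual content and ASR tokens belong together, can be illustrated with a simple alignment model. The sketch below is a minimal, hedged example only: the pre-extracted clip features, mean-pooled ASR token embeddings, dimensions, and symmetric contrastive loss are all illustrative assumptions, not the actual pretraining model evaluated in the paper.

```python
# Minimal sketch of a visual-ASR grounding objective (illustrative assumptions,
# not the paper's exact model): score clip/ASR pairs and train contrastively.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroundingModel(nn.Module):
    """Scores whether a video clip and its ASR tokens are aligned ("grounded")."""

    def __init__(self, visual_dim=1024, vocab_size=30000, embed_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)        # project clip features
        self.token_embed = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pool ASR tokens

    def forward(self, clip_feats, asr_tokens):
        v = F.normalize(self.visual_proj(clip_feats), dim=-1)   # (B, D) visual embeddings
        t = F.normalize(self.token_embed(asr_tokens), dim=-1)   # (B, D) ASR embeddings
        return v @ t.T  # (B, B) similarity matrix; diagonal entries are true pairs


def alignment_loss(sim):
    # Symmetric contrastive loss: each clip should match its own ASR segment and vice versa.
    targets = torch.arange(sim.size(0))
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)) / 2


if __name__ == "__main__":
    model = GroundingModel()
    clip_feats = torch.randn(8, 1024)                # e.g. pooled per-clip frame features
    asr_tokens = torch.randint(0, 30000, (8, 20))    # ASR token ids for each clip
    loss = alignment_loss(model(clip_feats, asr_tokens))
    print(loss.item())
```

Under this kind of formulation, per-category grounding can be probed by checking how reliably the model separates true clip/ASR pairs from mismatched ones within each video category.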