Paper Title

Look Before you Speak: Visually Contextualized Utterances

Paper Authors

Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Paper Abstract

While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations. Unfortunately, a major challenge for incorporating visual context into conversational dialogue is the lack of large-scale labeled datasets. We provide a solution in the form of a new visually conditioned Future Utterance Prediction task. Our task involves predicting the next utterance in a video, using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone. Further, we demonstrate that our model trained for this task on unlabelled videos achieves state-of-the-art performance on a number of downstream VideoQA benchmarks such as MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
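To make the architecture description above more concrete, the code below is a minimal, illustrative sketch (in PyTorch) of one co-attentional block in the spirit of the abstract: transcribed-speech token features and visual-frame features each cross-attend to the other modality, then pass through a feed-forward refinement. This is not the authors' released model; names such as CoAttentionBlock, dim, and num_heads are assumptions made purely for illustration.

# Hypothetical sketch of a co-attentional multimodal block; not the paper's official code.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attention layer: each modality cross-attends to the other,
    then refines itself with a feed-forward sub-layer."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: text queries attend to visual keys/values, and vice versa.
        self.txt_cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.vis_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, txt: torch.Tensor, vis: torch.Tensor):
        # txt: (batch, num_text_tokens, dim); vis: (batch, num_frames, dim)
        t, _ = self.txt_cross(query=txt, key=vis, value=vis)
        v, _ = self.vis_cross(query=vis, key=txt, value=txt)
        txt = self.norms[0](txt + t)
        vis = self.norms[1](vis + v)
        txt = self.norms[2](txt + self.txt_ffn(txt))
        vis = self.norms[3](vis + self.vis_ffn(vis))
        return txt, vis

# Toy usage: 32 transcript tokens and 16 frame features as context for a batch of 2.
block = CoAttentionBlock()
txt_ctx = torch.randn(2, 32, 512)   # embedded transcribed speech (context utterances)
vis_ctx = torch.randn(2, 16, 512)   # per-frame visual features
txt_out, vis_out = block(txt_ctx, vis_ctx)
print(txt_out.shape, vis_out.shape)  # torch.Size([2, 32, 512]) torch.Size([2, 16, 512])

In a future-utterance-prediction setup, the fused outputs of a stack of such blocks would feed a decoder or classifier that scores candidate next utterances; the exact fusion and prediction heads here are assumptions, not the paper's specification.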
