Paper Title

Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

Paper Authors

Yujie Zhong, Linhai Xie, Sen Wang, Lucia Specia, Yishu Miao

Paper Abstract

In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. First, we define a self-supervised learning framework that captures the cross-modal information. A novel adversarial learning module is then introduced to explicitly handle the noise in natural videos, where subtitle sentences are not guaranteed to correspond closely to the video snippets. For training and evaluation, we contribute a new dataset, 'ApartmenTour', which contains a large number of online videos and subtitles. We carry out experiments on the bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks, exceeding several strong baselines. The dataset can be downloaded at https://github.com/zyj-13/WAL.
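The abstract does not spell out the training objective. As a rough illustration of what bidirectional sentence-video retrieval training commonly looks like, below is a minimal PyTorch sketch of a standard cross-modal max-margin ranking loss. The function name, margin value, and embedding dimension are assumptions for illustration only; this is not the paper's actual formulation, which additionally includes the adversarial noise-handling module described above.

```python
# Minimal sketch (assumed, not the paper's method): bidirectional
# max-margin ranking loss for sentence-video retrieval.
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(video_emb, sent_emb, margin=0.2):
    """video_emb, sent_emb: (batch, dim) L2-normalized embeddings of
    paired video snippets and subtitle sentences (row i matches row i)."""
    # Cosine similarity matrix; the diagonal holds the matched pairs.
    sim = video_emb @ sent_emb.t()                    # (batch, batch)
    pos = sim.diag().view(-1, 1)                      # matched-pair scores
    # Hinge costs for both retrieval directions, diagonal excluded.
    cost_s = (margin + sim - pos).clamp(min=0)        # video -> sentence
    cost_v = (margin + sim - pos.t()).clamp(min=0)    # sentence -> video
    eye = torch.eye(sim.size(0), device=sim.device).bool()
    cost_s = cost_s.masked_fill(eye, 0)
    cost_v = cost_v.masked_fill(eye, 0)
    return cost_s.sum() + cost_v.sum()

# Usage: embeddings from any video/text encoders, L2-normalized.
v = F.normalize(torch.randn(8, 512), dim=1)
s = F.normalize(torch.randn(8, 512), dim=1)
loss = bidirectional_ranking_loss(v, s)
```

Penalizing both directions in one loss is what makes the learned embedding space usable for the two retrieval tasks the abstract evaluates: sentence-to-video and video-to-sentence.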
