论文标题
新闻故事:用视觉摘要说明文章
NewsStories: Illustrating articles with visual summaries
论文作者
论文摘要
最近的自我监督方法使用了大规模的图像文本数据集来学习强大的表示,这些表示无需填补即可将其转移到许多任务。这些方法通常假定其图像与其(短)字幕之间存在一对一的对应关系。但是,许多任务需要关于多个图像和长文本叙述的推理,例如描述带有视觉摘要的新闻文章。因此,我们探索了一个新颖的环境,其目标是学习一个自我监督的视觉语言表示,该表示对改变文本长度和图像的数量是可靠的。此外,与假定字幕的先前工作不同,我们假设图像仅包含与文本的宽松说明对应关系。为了探索这个问题,我们介绍了一个大规模的多模式数据集,其中包含31m文章,22m图像和1M视频。我们表明,最先进的图像文本对齐方法对于具有多个图像的更长叙述并不强大。最后,我们介绍了一个直观的基线,该基线在GoodNews数据集上在零拍图像集检索上胜过10%。
Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume that there is one-to-one correspondence between its images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visual summaries. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and the number of images. In addition, unlike prior work which assumed captions have a literal relation to the image, we assume images only contain loose illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images. Finally, we introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.