Paper Title

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

Paper Authors

Wang, Tzu-Jui Julius, Laaksonen, Jorma, Langer, Tomas, Arponen, Heikki, Bishop, Tom E.

Paper Abstract

Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, achieve performance comparable to some VLP models trained with aligned pairs on various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts prior W-VLP works, e.g. U-VisualBERT (U-VB), over a variety of V-L tasks, such as XMR and Visual Question Answering. Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on image-to-text and text-to-image retrieval on two popular datasets, Flickr30K and MSCOCO. Meanwhile, it gains at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, in the other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate greater generalization of the proposed W-VLP model with WFH.
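The abstract's core idea, hallucinating pseudo-visual features from unpaired text via a visual vocabulary, can be illustrated with a toy sketch. To be clear, this is not the authors' implementation: the dimensions, the codebook, and the softmax-mixture design below are all illustrative assumptions, included only to show the general shape of a text-to-visual-feature hallucinator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- not taken from the paper; chosen for illustration only.
TEXT_DIM, VISUAL_DIM, VOCAB_SIZE = 8, 6, 5

# A "visual vocabulary": a small codebook of prototype visual features.
# In a real system this would be learned from images.
visual_vocab = rng.normal(size=(VOCAB_SIZE, VISUAL_DIM))

# Hallucinator parameters (here random; normally trained with weak
# supervision): project a text embedding to logits over the vocabulary.
W = rng.normal(size=(TEXT_DIM, VOCAB_SIZE))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hallucinate(text_embeddings):
    """Map text-token embeddings (n, TEXT_DIM) to pseudo-visual
    features (n, VISUAL_DIM) as softmax-weighted mixtures of the
    visual-vocabulary prototypes."""
    logits = text_embeddings @ W      # (n, VOCAB_SIZE)
    weights = softmax(logits)         # attention over prototypes
    return weights @ visual_vocab     # (n, VISUAL_DIM)

# A caption of 3 tokens with no paired image: the hallucinated features
# can now stand in for the missing visual modality during pre-training.
tokens = rng.normal(size=(3, TEXT_DIM))
pseudo_visual = hallucinate(tokens)
print(pseudo_visual.shape)  # (3, 6)
```

The (text, hallucinated-feature) pairs produced this way are what would be fed to the cross-modal encoder in place of real image-caption pairs.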
