Title

Evaluating Multimodal Representations on Visual Semantic Textual Similarity

Authors

Oier Lopez de Lacalle, Ander Salaberria, Aitor Soroa, Gorka Azkune, Eneko Agirre

Abstract

The combination of visual and textual representations has produced excellent results in tasks such as image captioning and visual question answering, but the inference capabilities of multimodal representations are largely untested. In the case of textual representations, inference tasks such as Textual Entailment and Semantic Textual Similarity have often been used to benchmark the quality of textual representations. The long-term goal of our research is to devise multimodal representation techniques that improve current inference capabilities. We thus present a novel task, Visual Semantic Textual Similarity (vSTS), where such inference ability can be tested directly. Given two items, each comprising an image and its accompanying caption, vSTS systems need to assess the degree to which the captions in context are semantically equivalent to each other. Our experiments using simple multimodal representations show that the addition of image representations produces better inference, compared to text-only representations. The improvement is observed both when directly computing the similarity between the representations of the two items, and when learning a siamese network based on vSTS training data. Our work shows, for the first time, the successful contribution of visual information to textual inference, with ample room for benchmarking more complex multimodal representation options.
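The abstract describes two evaluation settings: directly scoring the similarity of the two items' multimodal representations, and training a siamese network on vSTS data. The sketch below illustrates the second setting in PyTorch; it is a minimal, hypothetical example rather than the authors' actual model, and the embedding sizes (768-dimensional caption embeddings, 2048-dimensional image features), layer widths, and pair features are illustrative assumptions.

```python
# Minimal sketch of a siamese regressor for vSTS (illustrative, not the paper's model).
# Each item = (caption embedding, image embedding); both items share one encoder.
import torch
import torch.nn as nn

class SiameseVSTS(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=300):
        super().__init__()
        # Shared encoder applied to both items (this weight sharing is the "siamese" part).
        self.encoder = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.Tanh(),
        )
        # Regressor maps the combined pair representation to a single similarity score.
        self.regressor = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def encode(self, text_emb, image_emb):
        # Simple multimodal fusion by concatenation of caption and image features.
        return self.encoder(torch.cat([text_emb, image_emb], dim=-1))

    def forward(self, text1, img1, text2, img2):
        h1 = self.encode(text1, img1)
        h2 = self.encode(text2, img2)
        # Common STS-style pair features: both projections, their element-wise
        # product, and their absolute difference.
        pair = torch.cat([h1, h2, h1 * h2, (h1 - h2).abs()], dim=-1)
        return self.regressor(pair).squeeze(-1)  # predicted similarity score

# Usage with random stand-in embeddings (batch of 4 caption-image pairs):
model = SiameseVSTS()
t1, t2 = torch.randn(4, 768), torch.randn(4, 768)
v1, v2 = torch.randn(4, 2048), torch.randn(4, 2048)
scores = model(t1, v1, t2, v2)  # shape: (4,)
```

The shared encoder is what makes the network siamese, and the element-wise product and absolute difference are standard pair features for similarity regression; the first setting in the abstract corresponds to skipping the learned regressor and computing, for example, cosine similarity between the two fused representations directly.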
