Paper Title

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Paper Authors

Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen

Abstract

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents scene texts better and clearly improves performance on two VQA tasks that require reading scene texts.
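To make the aggregation idea concrete, below is a minimal sketch of one cross-modal aggregator: target-modality nodes (here, scene-text tokens) attend over source-modality nodes (here, visual objects) and use the aggregated context to refine their features. The class name `CrossModalAggregator`, the attention form, and all feature dimensions are illustrative assumptions made for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAggregator(nn.Module):
    """Refine target-modality nodes (e.g. scene-text tokens) by attending over
    source-modality nodes (e.g. visual objects), roughly in the spirit of the
    cross-sub-graph aggregators described in the abstract (sketch only)."""
    def __init__(self, src_dim, tgt_dim, hidden_dim=256):
        super().__init__()
        self.src_proj = nn.Linear(src_dim, hidden_dim)   # project source nodes into a shared space
        self.tgt_proj = nn.Linear(tgt_dim, hidden_dim)   # project target nodes into the same space
        self.update = nn.Linear(tgt_dim + hidden_dim, tgt_dim)  # fuse context back into target features

    def forward(self, src_nodes, tgt_nodes):
        # src_nodes: (N_src, src_dim); tgt_nodes: (N_tgt, tgt_dim)
        s = self.src_proj(src_nodes)                     # (N_src, H)
        t = self.tgt_proj(tgt_nodes)                     # (N_tgt, H)
        attn = F.softmax(t @ s.t(), dim=-1)              # (N_tgt, N_src): attention over the other sub-graph
        context = attn @ s                               # (N_tgt, H): aggregated cross-modal context
        return torch.relu(self.update(torch.cat([tgt_nodes, context], dim=-1)))

# Illustrative usage: refine scene-text node features with visual context,
# e.g. so that prominent text on a detected bottle leans toward "brand name".
visual_nodes = torch.randn(5, 2048)    # hypothetical detected-object features
text_nodes = torch.randn(7, 300)       # hypothetical word embeddings of OCR tokens
agg = CrossModalAggregator(src_dim=2048, tgt_dim=300)
refined_text = agg(visual_nodes, text_nodes)             # shape (7, 300)
```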
