Title
Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering
Authors
Abstract
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing KVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning toward the correct answer. How to capture question-oriented and information-complementary evidence remains a key challenge. Inspired by human cognition theory, in this paper we depict an image by multiple knowledge graphs from the visual, semantic, and factual views, where the visual graph and semantic graph are regarded as image-conditioned instantiations of the factual graph. On top of these new representations, we re-formulate Knowledge-based Visual Question Answering as a recurrent reasoning process for obtaining complementary evidence from multimodal information. To this end, we decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module that conducts parallel reasoning over both visual and semantic information. By stacking the module multiple times, our model performs transitive reasoning and obtains question-oriented concept representations under the constraints of different modalities. Finally, we apply graph neural networks to infer the global-optimal answer by jointly considering all the concepts. We achieve new state-of-the-art performance on three popular benchmark datasets, FVQA, Visual7W-KB, and OK-VQA, and demonstrate the effectiveness and interpretability of our model with extensive experiments.
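To make the recurrent Read-Update-Control reasoning concrete, below is a minimal PyTorch sketch of one possible stacked GRUC-style loop over visual and semantic node features. Everything here is an illustrative assumption rather than the authors' exact architecture: the class names (`GRUCStep`, `RecurrentReasoner`), the choice of `nn.GRUCell` for the gated update, the additive attention scoring, the way a single control state drives both modality branches, and all dimensions are hypothetical.

```python
# Minimal sketch of a GRUC-style recurrent reasoning loop (assumptions,
# not the paper's exact model): read = attention over graph nodes,
# update = gated memory step, control = question-conditioned transition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUCStep(nn.Module):
    """One Read-Update-Control step over one modality's graph nodes."""
    def __init__(self, dim):
        super().__init__()
        self.read_attn = nn.Linear(2 * dim, 1)  # scores nodes against the control state
        self.update = nn.GRUCell(dim, dim)      # gated memory update (assumed choice)
        self.control = nn.Linear(2 * dim, dim)  # control-state transition (assumed choice)

    def forward(self, memory, control, nodes):
        # Read: attend over graph nodes (B, N, dim) with the control state (B, dim).
        scores = self.read_attn(torch.cat(
            [nodes, control.unsqueeze(1).expand_as(nodes)], dim=-1)).squeeze(-1)
        read = (F.softmax(scores, dim=-1).unsqueeze(-1) * nodes).sum(dim=1)
        # Update: fold the retrieved evidence into the memory state.
        memory = self.update(read, memory)
        # Control: shift the reasoning focus for the next step.
        control = torch.tanh(self.control(torch.cat([control, memory], dim=-1)))
        return memory, control

class RecurrentReasoner(nn.Module):
    """Stacked GRUC steps reasoning in parallel over visual and semantic graphs."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.vis_steps = nn.ModuleList(GRUCStep(dim) for _ in range(steps))
        self.sem_steps = nn.ModuleList(GRUCStep(dim) for _ in range(steps))

    def forward(self, question, vis_nodes, sem_nodes):
        mem_v = mem_s = question  # initialize memories from the question embedding
        ctrl = question
        for sv, ss in zip(self.vis_steps, self.sem_steps):
            mem_v, ctrl = sv(mem_v, ctrl, vis_nodes)
            mem_s, _ = ss(mem_s, ctrl, sem_nodes)  # shared control across branches
        return mem_v, mem_s  # complementary evidence from the two modalities

# Toy usage: batch of 2 questions, 5 visual nodes, 7 semantic nodes, dim 64.
q = torch.randn(2, 64)
out_v, out_s = RecurrentReasoner(64)(q, torch.randn(2, 5, 64), torch.randn(2, 7, 64))
print(out_v.shape, out_s.shape)  # torch.Size([2, 64]) torch.Size([2, 64])
```

In this sketch, stacking the steps is what yields transitive reasoning: each step's control state is conditioned on the memory produced by the previous step, so later reads can attend to concepts reachable only through earlier evidence. The two memory outputs would then feed a downstream graph neural network over the factual graph to score candidate answers.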