REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Paper Title

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Authors

Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, Lu Yuan

Abstract

This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that making better use of regional information can significantly improve performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA, even though the two tasks share a common spirit, i.e., relying on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding-window manner for retrieving knowledge, and the important relationships within and among object regions are neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent. Based on these observations, we propose a new knowledge-based VQA method, REVIVE, which tries to utilize the explicit information of object regions not only in the knowledge retrieval stage but also in the answering model. The key motivation is that object regions and their inherent relationships are important for knowledge-based VQA. We perform extensive experiments on the standard OK-VQA dataset and achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin (+3.6%). We also conduct a detailed analysis and show the necessity of regional information in different framework components for knowledge-based VQA. Code is publicly available at https://github.com/yzleroy/REVIVE.
