Paper Title

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

Authors

Gouthaman KV, Anurag Mittal

Abstract

Recent studies have shown that current VQA models are heavily biased towards the language priors in the train set and answer the question irrespective of the image; for example, they overwhelmingly answer "what sport is" with "tennis" and "what color banana" with "yellow." This behavior limits their usefulness in real-world application scenarios. In this work, we propose a novel model-agnostic question encoder, the Visually-Grounded Question Encoder (VGQE), for VQA that reduces this effect. VGQE utilizes both the visual and language modalities equally while encoding the question. Hence the question representation itself gets sufficient visual grounding, which reduces the model's dependency on the language priors. We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results on the bias-sensitive split of the VQAv2 dataset, VQA-CPv2. Further, unlike existing bias-reduction techniques, our approach does not drop accuracy on the standard VQAv2 benchmark; instead, it improves performance.
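To make the core idea concrete, below is a minimal PyTorch sketch of a visually-grounded question encoder: each word embedding is fused with a word-conditioned attended visual feature before the recurrent update, so both modalities shape the question representation at every step. The layer sizes, the attention form, and the element-wise-product fusion here are illustrative assumptions, not the authors' exact VGQE cell.

```python
import torch
import torch.nn as nn


class VisuallyGroundedQuestionEncoder(nn.Module):
    """Sketch of the VGQE idea: ground every word in the image before
    the recurrent encoding, instead of encoding the question from
    language alone. Architecture details are assumptions."""

    def __init__(self, vocab_size, word_dim=300, vis_dim=2048, hid_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        # Project word and visual features into a common space.
        self.w_proj = nn.Linear(word_dim, hid_dim)
        self.v_proj = nn.Linear(vis_dim, hid_dim)
        # Word-conditioned attention over image regions (hypothetical form).
        self.att = nn.Linear(hid_dim, 1)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, question, image_feats):
        # question:    (B, T) token ids
        # image_feats: (B, R, vis_dim) region features (e.g., from Faster R-CNN)
        w = self.w_proj(self.embed(question))  # (B, T, H)
        v = self.v_proj(image_feats)           # (B, R, H)
        # Attend over regions conditioned on each word.
        scores = self.att(torch.tanh(w.unsqueeze(2) + v.unsqueeze(1)))  # (B, T, R, 1)
        alpha = torch.softmax(scores, dim=2)
        v_t = (alpha * v.unsqueeze(1)).sum(dim=2)  # (B, T, H) per-word visual feature
        # Fuse the two modalities equally before the recurrent encoder.
        fused = w * v_t                            # element-wise product fusion
        _, h = self.gru(fused)
        return h.squeeze(0)                        # (B, H) visually-grounded question vector


if __name__ == "__main__":
    enc = VisuallyGroundedQuestionEncoder(vocab_size=10000)
    q = torch.randint(1, 10000, (2, 14))  # 2 questions, 14 tokens each
    img = torch.randn(2, 36, 2048)        # 36 region features per image
    print(enc(q, img).shape)              # torch.Size([2, 1024])
```

Because the encoder is a drop-in replacement for a standard language-only question encoder, it is model-agnostic in the sense the abstract describes: the downstream VQA model consumes the resulting question vector unchanged.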
