Paper Title

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Paper Authors

Wang, Zhecan, You, Haoxuan, He, Yicheng, Li, Wenhao, Chang, Kai-Wei, Chang, Shih-Fu

Paper Abstract

Visual commonsense understanding requires Vision Language (VL) models to not only understand images and text but also cross-reference between the two in order to fully integrate and comprehend the described visual scene. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, due to limited evaluation data resources, it is unclear whether these models truly understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, the text, and related knowledge. We then take a step further and show that training with the ME data boosts the model's performance on standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not vice versa; (2) visual information is generally underutilized compared with text.
