Paper Title
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Paper Authors
Paper Abstract
Vision-Language Pretraining (VLP) models have recently facilitated many cross-modal downstream tasks. Most existing works evaluate their systems by comparing fine-tuned downstream-task performance. However, average downstream-task accuracy alone reveals little about the strengths and weaknesses of each VLP method, let alone how the community can improve these systems in the future. Inspired by CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework for understanding the capabilities of VLP models. The proposed method divides the image-text matching ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down these three aspects. We conduct comprehensive studies analyzing seven recent, popular VLP models with the proposed framework. The results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that are not visible in downstream-task-only evaluation. Further results point to promising research directions for building better VLP models. Our data and code are available at: https://github.com/om-ai-lab/VL-CheckList.
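To make the evaluation idea concrete, below is a minimal sketch of a CheckList-style probe along the three axes (object, attribute, relation): each test case pairs a true caption with a minimally perturbed negative, and the model passes if it scores the true caption higher. The `score(image, text)` matching function and the sample cases here are hypothetical placeholders; the actual probing pipeline and negative-sample generation are in the linked repository.

```python
from typing import Callable, List, Tuple

def probe_capability(
    score: Callable[[str, str], float],      # hypothetical VLP image-text matching head
    cases: List[Tuple[str, str, str]],       # (image_path, positive_text, negative_text)
) -> float:
    """Fraction of cases where the true caption outscores its perturbed negative."""
    hits = sum(
        1
        for image, positive, negative in cases
        if score(image, positive) > score(image, negative)
    )
    return hits / len(cases)

# Illustrative cases, one per axis; image paths and captions are made up.
cases = [
    ("img_001.jpg", "a dog lying on the sofa", "a cat lying on the sofa"),    # object swap
    ("img_002.jpg", "a red car on the street", "a blue car on the street"),   # attribute swap
    ("img_003.jpg", "a cup on the table", "a table on the cup"),              # relation swap
]
# accuracy = probe_capability(my_vlp_score, cases)
```

Reporting such per-axis accuracies separately, rather than a single averaged downstream score, is what exposes the fine-grained differences between models that the abstract describes.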