Paper Title
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Paper Authors
Paper Abstract
Vision-language (VL) pretrained models have achieved impressive performance on multimodal reasoning and zero-shot recognition tasks. Many of these VL models are pretrained on unlabeled image-caption pairs from the internet. In this paper, we study whether representations of primitive concepts (such as colors, shapes, or the attributes of object parts) emerge automatically within these pretrained VL models. We propose a two-step framework, Compositional Concept Mapping (CompMap), to investigate this question. CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g., the likelihood of a black tail or a red wing) to composite concepts (e.g., a red-winged blackbird). We show that a composition model can be reliably learned from ground truth primitive concepts. We thus hypothesize that if primitive concepts indeed emerge in a VL pretrained model, its primitive concept activations can be used to learn a composition model similar to one designed by experts. We propose a quantitative metric to measure this degree of similarity, which we refer to as the interpretability metric. We also measure the classification accuracy when the primitive concept activations and the learned composition model are used to predict composite concepts, and refer to it as the usefulness metric. Our study reveals that state-of-the-art VL pretrained models learn primitive concepts that are highly useful for fine-grained visual recognition on the CUB dataset and for compositional generalization tasks on the MIT-States dataset. However, we observe in our qualitative analyses that the learned composition models have low interpretability. Our results reveal the limitations of existing VL models and the necessity of pretraining objectives that encourage the acquisition of primitive concepts.
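
The following is a minimal sketch of the two-step CompMap procedure described in the abstract, under the assumption that a CLIP-style VL model exposes image embeddings and embeddings of primitive-concept text prompts as tensors; the embedding dimension, concept counts, and random stand-in data below are illustrative and are not the paper's actual setup. Primitive concept activations are computed as image-prompt cosine similarities, and a simple linear composition model is trained to map them to composite classes.

```python
# Sketch of CompMap's two steps, with random stand-ins for the VL embeddings.
# Assumption: a CLIP-style model provides `image_emb` and `prompt_embs`.
import torch
import torch.nn as nn

def primitive_concept_activations(image_emb, prompt_embs):
    """Step 1: score each primitive-concept text prompt (e.g. 'a photo of a
    bird with a black tail') against the image via cosine similarity."""
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    prompt_embs = prompt_embs / prompt_embs.norm(dim=-1, keepdim=True)
    return image_emb @ prompt_embs.T  # shape: (batch, num_primitive_concepts)

class CompositionModel(nn.Module):
    """Step 2: a linear map from primitive concept activations to composite
    concept logits (e.g. bird species)."""
    def __init__(self, num_primitives, num_composites):
        super().__init__()
        self.linear = nn.Linear(num_primitives, num_composites)

    def forward(self, activations):
        return self.linear(activations)

# Illustrative training step (dimensions and data are hypothetical).
torch.manual_seed(0)
image_emb = torch.randn(32, 512)       # stand-in for VL image embeddings
prompt_embs = torch.randn(112, 512)    # stand-in for primitive-concept prompt embeddings
labels = torch.randint(0, 200, (32,))  # stand-in composite-class labels

acts = primitive_concept_activations(image_emb, prompt_embs)
comp = CompositionModel(num_primitives=112, num_composites=200)
opt = torch.optim.Adam(comp.parameters(), lr=1e-3)

opt.zero_grad()
loss = nn.functional.cross_entropy(comp(acts), labels)
loss.backward()
opt.step()
```

In the paper's terms, the usefulness metric corresponds to the classification accuracy of this activation-plus-composition pipeline, while the interpretability metric compares the learned composition model against one designed by experts; this sketch only shows the learned linear map, not how either metric is computed.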