Paper Title

Visual Classification via Description from Large Language Models

Paper Authors

Sachit Menon, Carl Vondrick

Abstract


Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
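The scoring scheme the abstract describes, averaging image–descriptor similarities per category instead of using a single category-name similarity, can be sketched as follows. This is a minimal illustration with hand-made toy vectors, not the authors' implementation: the descriptor lists and the tiny "embeddings" stand in for LLM-generated descriptors and real VLM (e.g. CLIP) image/text embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_by_description(image_emb, descriptors_per_class):
    """Score each class as the mean similarity between the image embedding
    and that class's descriptor embeddings; return (best class, all scores)."""
    scores = {
        cls: float(np.mean([cosine(image_emb, d) for d in desc_embs]))
        for cls, desc_embs in descriptors_per_class.items()
    }
    return max(scores, key=scores.get), scores

# Toy 3-d "embeddings" (stand-ins for VLM text embeddings of descriptors
# such as "a tiger, which has stripes" / "a tiger, which has claws").
descriptors = {
    "tiger": [[1.0, 0.9, 0.0], [1.0, 0.8, 0.1]],
    "zebra": [[0.0, 1.0, 1.0], [0.1, 0.9, 1.0]],
}
image = [1.0, 1.0, 0.0]  # stand-in for the query image's embedding

pred, scores = classify_by_description(image, descriptors)
```

Because the decision decomposes into per-descriptor similarities, inspecting `scores` (or the individual descriptor similarities) shows which cues drove the prediction, which is the interpretability property the abstract highlights; editing the descriptor lists changes the criteria without retraining.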
