Paper Title

Improving Zero-shot Generalization and Robustness of Multi-modal Models

Authors

Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao

Abstract

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets, four other datasets, and other model architectures such as LiT. The proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures. Code is available at https://github.com/gyhandy/Hierarchy-CLIP.
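The abstract describes two steps: flagging images whose top-1 prediction is inconsistent across prompts and image transformations, and then augmenting the class names of those uncertain images with their WordNet parents and children. Below is a minimal, hypothetical Python sketch of these two ideas, assuming OpenAI's `clip` package and NLTK's WordNet corpus; the prompt templates, the augmentation wording, and the `is_uncertain`/`wordnet_augment` helpers are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import clip
import nltk
import torch
from nltk.corpus import wordnet as wn
from PIL import Image

nltk.download("wordnet", quiet=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A handful of prompt templates; the paper's actual template set may differ.
PROMPTS = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]


def class_scores(image: Image.Image, class_names, template):
    """Cosine similarities between one image and every class under one prompt template."""
    texts = clip.tokenize([template.format(c) for c in class_names], truncate=True).to(device)
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(texts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)


def is_uncertain(image, class_names):
    """Flag an image whose top-1 prediction is inconsistent across prompt templates
    (the paper additionally checks consistency under image transformations)."""
    preds = {int(class_scores(image, class_names, t).argmax()) for t in PROMPTS}
    return len(preds) > 1


def wordnet_augment(name):
    """Append the WordNet parent and a few children to a class name; the exact wording
    ('a kind of ...', 'such as ...') is an illustrative choice, not the paper's."""
    synsets = wn.synsets(name.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return name
    parent = next((l for h in synsets[0].hypernyms() for l in h.lemma_names()), None)
    children = [l.replace("_", " ") for h in synsets[0].hyponyms() for l in h.lemma_names()][:3]
    augmented = name
    if parent:
        augmented += f", a kind of {parent.replace('_', ' ')}"
    if children:
        augmented += f", such as {', '.join(children)}"
    return augmented


def predict(image, class_names):
    """Use plain class names when predictions are consistent; fall back to
    WordNet-augmented names only for uncertain images."""
    names = class_names
    if is_uncertain(image, class_names):
        names = [wordnet_augment(c) for c in class_names]
    scores = class_scores(image, names, PROMPTS[0])
    return class_names[int(scores.argmax())]
```

Restricting the WordNet fallback to the uncertain subset mirrors the abstract's claim that the approach is hyperparameter-free and needs no additional training: everything above is post-hoc prompting at inference time.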
