Paper Title
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning
Paper Authors
Paper Abstract
Automatically discovering failures in vision models under real-world settings remains an open challenge. This work demonstrates how off-the-shelf, large-scale, image-to-text and text-to-image models, trained on vast amounts of data, can be leveraged to automatically find such failures. In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs given a ground-truth label. Misclassified inputs are clustered and a captioning model is used to describe each cluster. Each cluster's description is used in turn to generate more inputs and assess whether specific clusters induce more failures than expected. We use this pipeline to demonstrate that we can effectively interrogate classifiers trained on ImageNet to find specific failure cases and discover spurious correlations. We also show that we can scale the approach to generate adversarial datasets targeting specific classifier architectures. This work serves as a proof-of-concept demonstrating the utility of large-scale generative models to automatically discover bugs in vision models in an open-ended manner. We also describe a number of limitations and pitfalls related to this approach.
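The loop the abstract describes — generate labeled synthetic inputs, collect misclassifications, caption the failure cluster, then re-generate from that caption and compare failure rates — can be sketched as follows. This is a minimal toy simulation, not the authors' implementation: `generate_images`, `classify`, and `caption_cluster` are hypothetical stand-ins for the off-the-shelf text-to-image model, the vision model under test, and the image-to-text captioner, with a spurious correlation planted so the loop has something to find.

```python
import random
from collections import Counter

# --- Hypothetical stand-ins (assumptions, not real model APIs) ----------

def generate_images(prompt, n, rng):
    # Simulate images as (prompt, attribute) pairs; a real pipeline would
    # call a conditional text-to-image model here.
    attrs = ["plain", "snowy", "blurry"]
    return [(prompt, rng.choice(attrs)) for _ in range(n)]

def classify(image):
    # Toy classifier with a planted spurious correlation: it fails on
    # "snowy" scenes regardless of the ground-truth label.
    prompt, attr = image
    return "wrong" if attr == "snowy" else prompt

def caption_cluster(images):
    # A captioning model would summarise the cluster; here we just
    # return the attribute the failures share.
    attrs = Counter(attr for _, attr in images)
    return attrs.most_common(1)[0][0]

# --- One round of the failure-discovery loop ----------------------------

def find_failure_modes(label, n=300, seed=0):
    """Generate inputs for a label, cluster the misclassified ones,
    caption the cluster, then re-generate from that caption and check
    whether it induces more failures than the baseline rate."""
    rng = random.Random(seed)
    images = generate_images(label, n, rng)
    failures = [im for im in images if classify(im) != label]
    baseline = len(failures) / n

    # "Cluster" the failures (trivially, by their shared attribute).
    caption = caption_cluster(failures)

    # Generate targeted inputs matching the caption and re-measure.
    targeted = [(label, caption) for _ in range(n)]
    targeted_rate = sum(classify(im) != label for im in targeted) / n
    return caption, baseline, targeted_rate
```

With the planted bias, a call like `find_failure_modes("dog")` surfaces the "snowy" attribute and shows a targeted failure rate well above the baseline, which is the paper's criterion for flagging a cluster as a genuine failure mode rather than noise.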