Paper Title
What do Vision Transformers Learn? A Visual Exploration
Paper Authors
Paper Abstract
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.
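As a concrete illustration of the kind of feature visualization the abstract refers to, below is a minimal sketch of activation maximization on a single ViT unit. It assumes a PyTorch/timm ViT-B/16; the hooked layer (the hidden layer of block 6's MLP), the unit index, and the optimization settings are illustrative assumptions, not the paper's exact procedure.

import torch
import timm

# Load a standard ViT; any timm ViT variant exposing .blocks[*].mlp.fc1 works here.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image is optimized

# Capture the activations of one MLP hidden layer via a forward hook.
activations = {}
def hook(module, inputs, output):
    activations["feat"] = output  # shape: (batch, tokens, hidden_dim)

# Block 6 is an arbitrary, illustrative choice of a mid-depth layer.
handle = model.blocks[6].mlp.fc1.register_forward_hook(hook)

# Gradient ascent on the input image to excite one hidden unit,
# averaged over the patch tokens (token 0 is the class token).
img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)
unit = 123  # hypothetical feature index

for step in range(256):
    optimizer.zero_grad()
    model(img)
    loss = -activations["feat"][:, 1:, unit].mean()
    loss.backward()
    optimizer.step()

handle.remove()
# `img` now approximates an input that maximally excites this unit; the paper
# reports that for language-supervised ViTs (e.g., CLIP) such visualizations
# tend to show semantic concepts rather than low-level visual features.

In practice, pixel-space ascent like this usually needs regularization (input jitter, a total-variation penalty, or parameterizing the image in a smoother basis) to produce clean visualizations; the sketch omits those details for brevity.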