Paper Title
Locality Guidance for Improving Vision Transformers on Tiny Datasets
Paper Authors
Paper Abstract
While the Vision Transformer (VT) architecture is becoming popular in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes locality guidance for improving the performance of VTs on tiny datasets. We first analyze that local information, which is of great importance for understanding images, is hard to learn with limited data due to the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate the learning of local information, we realize locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNNs. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is sufficient to accelerate convergence and substantially improve the performance of VTs. Our locality guidance approach is therefore very simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method significantly improves VTs trained from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, the proposed method boosts the performance of various VTs on tiny datasets (e.g., by 13.07% for DeiT, 8.98% for T2T, and 7.85% for PVT), and improves the even stronger baseline PVTv2 by 1.86%, to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.
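To make the dual-task learning paradigm concrete, below is a minimal PyTorch sketch of the combined objective: a standard classification loss on the VT's predictions plus a feature-imitation loss that guides intermediate VT features toward those of a frozen, pre-trained CNN. The names (`dual_task_loss`, `proj`, `lambda_guidance`), the use of MSE for imitation, and the single-layer feature matching are illustrative assumptions, not the paper's exact formulation; consult the linked repository for the precise loss and layer-matching scheme.

```python
# Sketch of dual-task learning with locality guidance, assuming PyTorch.
# vt_feats are assumed to be VT token features reshaped to an NCHW map;
# cnn_feats come from a frozen CNN trained on low-resolution images.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dual_task_loss(vt_logits, vt_feats, cnn_feats, labels,
                   proj: nn.Module, lambda_guidance: float = 1.0):
    """Classification loss plus locality guidance via CNN feature imitation."""
    # Task 1: standard supervised classification for the Vision Transformer.
    cls_loss = F.cross_entropy(vt_logits, labels)

    # Task 2: imitate the frozen CNN's intermediate features. `proj` maps
    # VT features into the CNN feature space (hypothetical 1x1-conv adapter).
    vt_proj = proj(vt_feats)
    if vt_proj.shape[-2:] != cnn_feats.shape[-2:]:
        # Align spatial sizes, since the CNN runs on low-resolution inputs.
        vt_proj = F.interpolate(vt_proj, size=cnn_feats.shape[-2:],
                                mode='bilinear', align_corners=False)
    # detach(): the pre-trained CNN provides targets only, no gradients.
    guidance_loss = F.mse_loss(vt_proj, cnn_feats.detach())

    return cls_loss + lambda_guidance * guidance_loss
```

In this sketch the CNN acts purely as a fixed teacher for intermediate features, so the only trainable additions are the VT itself and the lightweight projection head, which keeps the guidance cheap relative to full knowledge distillation.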