Paper Title

Convolutional Xformers for Vision

Paper Authors

Jeevan, Pranav; Sethi, Amit

Paper Abstract

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture, Convolutional X-formers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. The inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. We also propose a new training method where we use two different optimizers during different phases of training, and show that it improves the top-1 image classification accuracy across different architectures. CXV outperforms other architectures, token mixers (e.g., ConvMixer, FNet and MLP Mixer), transformer models (e.g., ViT, CCT, CvT and hybrid Xformers), and ResNets for image classification in scenarios with limited data and GPU resources (cores, RAM, power).
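The abstract's key efficiency claim is the replacement of quadratic self-attention with linear attention mechanisms such as the Linear Transformer. As a rough illustration of why that lowers the cost from O(N²) to O(N) in the sequence length, here is a minimal PyTorch sketch of kernelized linear attention using the elu(x)+1 feature map from the Linear Transformer paper; this is an illustrative sketch, not the authors' CXV implementation, and the (batch, tokens, dim) tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear (kernelized) attention: phi(Q) (phi(K)^T V) instead of
    softmax(Q K^T) V. Shapes: q, k are (B, N, D); v is (B, N, E)."""
    # Positive feature map phi(x) = elu(x) + 1 (Katharopoulos et al., 2020).
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    # By associativity, compute phi(K)^T V first: O(N * D * E), not O(N^2).
    kv = torch.einsum('bnd,bne->bde', k, v)                 # (B, D, E)
    # Row-wise normalizer: phi(q_i) . sum_j phi(k_j).
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)        # (B, N, E)

# Toy check: batch of 2, 8 tokens, 16-dim heads.
q = torch.randn(2, 8, 16)
k = torch.randn(2, 8, 16)
v = torch.randn(2, 8, 16)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 8, 16])
```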
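The abstract also describes a training method that uses two different optimizers during different phases of training. The abstract does not name the optimizers or the switch point, so the AdamW-then-SGD pairing and the epoch counts below are purely illustrative assumptions; this is a minimal sketch of the two-phase idea, not the authors' recipe.

```python
from torch import optim

def train_two_phase(model, loader, loss_fn, epochs_a=20, epochs_b=10):
    """Hypothetical two-phase schedule: one optimizer for the first
    epochs_a epochs, then a second optimizer for the remaining epochs_b.
    Optimizer choices and hyperparameters here are assumptions."""
    opt = optim.AdamW(model.parameters(), lr=1e-3)
    for epoch in range(epochs_a + epochs_b):
        if epoch == epochs_a:
            # Switch to the second optimizer for the final phase.
            opt = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
```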
