Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs

Paper title

Fast-ParC: Capturing Position Aware Global Feature for ConvNets and ViTs

Authors

Yang, Tao; Zhang, Haokui; Hu, Wenze; Chen, Changwen; Wang, Xiaoyu

Abstract

Transformer models have made tremendous progress in various fields in recent years. In the field of computer vision, vision transformers (ViTs) have also become strong alternatives to convolutional neural networks (ConvNets), yet they have not been able to replace ConvNets, since both have their own merits. For instance, ViTs are good at extracting global features with attention mechanisms, while ConvNets are more efficient at modeling local relationships due to their strong inductive bias. A natural idea is to combine the strengths of ConvNets and ViTs to design new structures. In this paper, we propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version, Fast-ParC. The ParC operator can capture global features by using a global kernel and circular convolution while keeping location sensitivity by employing position embeddings. Our Fast-ParC further reduces the O(n^2) time complexity of ParC to O(n log n) using the Fast Fourier Transform. This acceleration makes it possible to use global convolution in the early stages of models with large feature maps, while keeping the overall computational cost comparable to that of 3x3 or 7x7 kernels. The proposed operation can be used in a plug-and-play manner to 1) convert ViTs to pure-ConvNet architectures to enjoy wider hardware support and achieve higher inference speed; 2) replace traditional convolutions in the deep stages of ConvNets to improve accuracy by enlarging the effective receptive field. Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets, and adopting the proposed op benefits both ViTs and ConvNet models on all three popular vision tasks, image classification, object
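The two ideas named in the abstract, a global circular convolution made position-sensitive by an embedding and its FFT speed-up from O(n^2) to O(n log n), can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single-channel 1-D signal, a kernel as long as the input, and a position embedding that is simply added to the input before the convolution. The function names (`circular_conv_direct`, `circular_conv_fft`, `parc_1d`) are hypothetical and chosen only for this illustration.

```python
import numpy as np

def circular_conv_direct(x, w):
    """Circular convolution by direct summation: O(n^2) multiply-adds."""
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        for k in range(n):
            # Indices wrap around, so every output sees the whole input.
            y[i] += w[k] * x[(i - k) % n]
    return y

def circular_conv_fft(x, w):
    """Same result via the convolution theorem: O(n log n)."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)).real

def parc_1d(x, w, pos_emb, conv=circular_conv_fft):
    """Assumed form: add a position embedding, then convolve globally."""
    return conv(x + pos_emb, w)

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)    # input signal
w = rng.standard_normal(n)    # global kernel, same length as the input
pe = rng.standard_normal(n)   # position embedding (random here, learned in practice)

slow = parc_1d(x, w, pe, conv=circular_conv_direct)
fast = parc_1d(x, w, pe, conv=circular_conv_fft)
print(np.allclose(slow, fast))  # the two code paths agree numerically

# Plain circular convolution is shift-equivariant; the embedding breaks that.
shifted_plain = circular_conv_fft(np.roll(x, 3), w)
print(np.allclose(shifted_plain, np.roll(circular_conv_fft(x, w), 3)))
print(np.allclose(parc_1d(np.roll(x, 3), w, pe), np.roll(fast, 3)))
```

The last two checks show why an embedding is needed at all: a circular convolution on its own gives the same response wherever a pattern sits, up to a shift, so it carries no information about absolute position.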
