Paper Title
Neighborhood Attention Transformer
Paper Authors
Paper Abstract
We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation, localizing self-attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K, which are improvements of 1.9% in ImageNet accuracy, 1.0% in COCO mAP, and 2.6% in ADE20K mIoU over a Swin model of similar size. To support more research based on sliding-window attention, we open-source our project and release our checkpoints at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
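To make the mechanism described above concrete, the following is a minimal, unoptimized Python/PyTorch sketch of the neighborhood attention computation: each query pixel attends only to a k x k window of its nearest neighbors, with the window clamped at the borders so every pixel still sees exactly k x k keys. It omits learned Q/K/V projections, multi-head splitting, and relative positional biases, and it is not the NATTEN C++/CUDA implementation; the function name and its arguments are illustrative assumptions.

import torch

def neighborhood_attention(x: torch.Tensor, kernel_size: int = 7) -> torch.Tensor:
    """Naive single-head neighborhood attention over a (B, H, W, C) feature map."""
    B, H, W, C = x.shape
    k = kernel_size
    assert H >= k and W >= k, "feature map must be at least kernel_size in each dimension"
    scale = C ** -0.5
    out = torch.empty_like(x)
    for i in range(H):
        # Clamp the window start so a full k x k window stays inside the feature map.
        i0 = min(max(i - k // 2, 0), H - k)
        for j in range(W):
            j0 = min(max(j - k // 2, 0), W - k)
            q = x[:, i, j, :]                          # (B, C) query pixel
            kv = x[:, i0:i0 + k, j0:j0 + k, :]         # (B, k, k, C) neighborhood
            kv = kv.reshape(B, k * k, C)
            # Scaled dot-product attention restricted to the k*k neighbors.
            attn = torch.softmax((q.unsqueeze(1) * kv).sum(-1) * scale, dim=-1)  # (B, k*k)
            out[:, i, j, :] = (attn.unsqueeze(-1) * kv).sum(dim=1)               # (B, C)
    return out

# Example: a 14x14 feature map with 64 channels and a 7x7 neighborhood.
x = torch.randn(2, 14, 14, 64)
y = neighborhood_attention(x, kernel_size=7)
print(y.shape)  # torch.Size([2, 14, 14, 64])

Because each of the H*W query pixels only interacts with k*k neighbors, cost grows linearly in the number of pixels rather than quadratically as in global self-attention; the per-pixel loops here are exactly what the NATTEN kernels replace with efficient C++/CUDA code.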