Paper Title

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Authors

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

Abstract

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
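To make the two-step idea in the abstract concrete, below is a minimal PyTorch sketch of (a) a tokenizer that pools the HW pixels of a convolutional feature map into a small number of semantic tokens via spatial attention, and (b) a standard transformer run over those tokens. This is an illustrative sketch based only on the abstract, not the authors' released implementation; the class names, the token count L = 16, and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class FilterTokenizer(nn.Module):
    """Hypothetical tokenizer sketch: pools HW pixels into L semantic tokens
    using learned spatial-attention maps (one map per token)."""
    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        self.attn = nn.Linear(channels, num_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, HW, C) -- a flattened feature map from a conv backbone
        a = self.attn(x).softmax(dim=1)   # (B, HW, L): each token's weights sum to 1 over pixels
        return a.transpose(1, 2) @ x      # (B, L, C): attention-weighted pooling into tokens

class VisualTransformerBlock(nn.Module):
    """Sketch of a VT block: tokenize, then run a transformer over the L tokens
    instead of over all HW pixels."""
    def __init__(self, channels: int, num_tokens: int = 16, heads: int = 4):
        super().__init__()
        self.tokenizer = FilterTokenizer(channels, num_tokens)
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.transformer(self.tokenizer(x))  # (B, L, C) refined semantic tokens

# Usage with made-up shapes (e.g. a 56x56x256 ResNet stage output, flattened):
feats = torch.randn(2, 56 * 56, 256)
tokens = VisualTransformerBlock(channels=256)(feats)
print(tokens.shape)  # torch.Size([2, 16, 256])
```

Because self-attention here runs over L = 16 tokens rather than HW = 3136 pixels, its cost scales with L^2 instead of (HW)^2, which is the orders-of-magnitude compute gap over pixel-space transformers that the abstract refers to.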
