Paper Title
QuadTree Attention for Vision Transformers
Paper Authors
Paper Abstract
Transformers have been successful in many vision tasks, thanks to their capability of capturing long-range dependencies. However, their quadratic computational complexity poses a major obstacle to applying them to vision tasks that require dense predictions, such as object detection, feature matching, and stereo matching. We introduce QuadTree Attention, which reduces the computational complexity from quadratic to linear. Our quadtree transformer builds token pyramids and computes attention in a coarse-to-fine manner. At each level, the top K patches with the highest attention scores are selected, so that at the next level, attention is only evaluated within the relevant regions corresponding to these top K patches. We demonstrate that QuadTree Attention achieves state-of-the-art performance in various vision tasks, e.g., a 4.0% improvement in feature matching on ScanNet, a roughly 50% reduction in FLOPs for stereo matching, a 0.4-1.5% improvement in top-1 accuracy on ImageNet classification, a 1.2-1.8% improvement on COCO object detection, and a 0.7-2.4% improvement on semantic segmentation over previous state-of-the-art transformers. Code is available at https://github.com/Tangshitao/QuadtreeAttention.
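To make the coarse-to-fine computation concrete, below is a minimal, illustrative PyTorch sketch of a single two-level QuadTree attention step. This is not the authors' implementation (see the repository above for that): the function name quadtree_attention_2level, the fixed two-level pyramid, the use of average pooling to build the coarse level, and the hard top-K selection are simplifying assumptions; the paper's method extends to deeper pyramids and aggregates messages across levels.

```python
# A minimal two-level sketch of coarse-to-fine top-K attention in PyTorch.
# Assumption: names and the two-level structure are illustrative only; the
# official code lives at https://github.com/Tangshitao/QuadtreeAttention.
import torch
import torch.nn.functional as F


def quadtree_attention_2level(q, k, v, top_k=4):
    """q, k, v: (B, C, H, W) feature maps; H and W must be even.

    Level 1 (coarse): full attention between 2x-downsampled tokens.
    Level 2 (fine): each fine query attends only to the fine tokens
    inside the top_k coarse key patches selected at level 1, so the
    fine-level cost is linear in the number of tokens.
    """
    B, C, H, W = q.shape
    scale = C ** -0.5

    # Coarse level: 2x average pooling builds the token pyramid.
    qc = F.avg_pool2d(q, 2).flatten(2).transpose(1, 2)  # (B, Nc, C), Nc = H*W/4
    kc = F.avg_pool2d(k, 2).flatten(2).transpose(1, 2)
    attn_c = (qc @ kc.transpose(1, 2)) * scale          # (B, Nc, Nc)

    # For every coarse query patch, keep the top_k coarse key patches
    # (top_k must not exceed Nc).
    _, idx = attn_c.topk(top_k, dim=-1)                 # (B, Nc, top_k)

    # Fine level: group fine tokens into 2x2 blocks matching coarse patches.
    def to_blocks(x):  # (B, C, H, W) -> (B, Nc, 4, C)
        x = x.unfold(2, 2, 2).unfold(3, 2, 2)           # (B, C, H/2, W/2, 2, 2)
        return x.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, 4, C)

    qf, kf, vf = to_blocks(q), to_blocks(k), to_blocks(v)  # (B, Nc, 4, C)

    # Gather the fine tokens of the selected coarse patches per query patch.
    Nc = qf.shape[1]
    batch = torch.arange(B, device=q.device).view(B, 1, 1)
    k_sel = kf[batch, idx].reshape(B, Nc, top_k * 4, C)    # (B, Nc, top_k*4, C)
    v_sel = vf[batch, idx].reshape(B, Nc, top_k * 4, C)

    # Fine attention restricted to the selected regions.
    attn_f = (qf @ k_sel.transpose(-1, -2)) * scale     # (B, Nc, 4, top_k*4)
    out = attn_f.softmax(dim=-1) @ v_sel                # (B, Nc, 4, C)

    # Fold the 2x2 blocks back into a (B, C, H, W) feature map.
    out = out.reshape(B, H // 2, W // 2, 2, 2, C)
    out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
    return out
```

For example, calling quadtree_attention_2level on (1, 64, 32, 32) tensors with top_k=4 makes each 2x2 query block attend to only 4*4 = 16 fine keys instead of all 1024 tokens, which is the source of the linear rather than quadratic cost.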