ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

Paper title

ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

Authors

Yutong Xie, Jianpeng Zhang, Yong Xia, Anton van den Hengel, Qi Wu

Abstract

Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method, as an alternative to dense self-attention, aiming to reduce the computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. We further extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR, and demonstrate that it achieves state-of-the-art performance on various vision tasks at lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.
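
The idea of clustering and then aggregating key and value tokens can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which clustering algorithm or aggregation rule ClusTR uses, so plain k-means over the keys and mean pooling within each cluster are assumptions made here, and all function names are hypothetical. With N query tokens and M clusters, each query attends to at most M aggregated tokens rather than N, so the score computation scales with N x M instead of N x N.

```python
import math
import random


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans_assign(tokens, num_clusters, iters=10, seed=0):
    """Assumed clustering step: plain k-means, returns a cluster id per token."""
    rng = random.Random(seed)
    centers = [list(t) for t in rng.sample(tokens, num_clusters)]
    assign = [0] * len(tokens)
    for _ in range(iters):
        # Assign every token to its nearest center (squared Euclidean distance).
        for i, t in enumerate(tokens):
            assign[i] = min(
                range(num_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(t, centers[c])),
            )
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(num_clusters):
            members = [t for t, a in zip(tokens, assign) if a == c]
            if members:
                centers[c] = mean_vec(members)
    return assign


def aggregate(vectors, assign, num_clusters):
    """Assumed aggregation step: mean-pool vectors per cluster, skipping empty ones."""
    pooled = []
    for c in range(num_clusters):
        members = [v for v, a in zip(vectors, assign) if a == c]
        if members:
            pooled.append(mean_vec(members))
    return pooled


def attention(queries, keys, values):
    """Standard scaled dot-product attention with a numerically stable softmax."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total
             for j in range(len(values[0]))]
        )
    return out


def clustered_attention(queries, keys, values, num_clusters):
    """Queries attend to cluster-aggregated keys/values instead of all tokens."""
    assign = kmeans_assign(keys, num_clusters)
    pooled_keys = aggregate(keys, assign, num_clusters)
    pooled_values = aggregate(values, assign, num_clusters)  # same grouping as keys
    return attention(queries, pooled_keys, pooled_values)


if __name__ == "__main__":
    random.seed(0)
    tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
    out = clustered_attention(tokens, tokens, tokens, num_clusters=4)
    print(len(out), len(out[0]))  # one output vector per query token
```

In this sketch the queries are left unclustered, so the output keeps one vector per input token, which matters for dense prediction. The multi-scale extension mentioned in the abstract is not modelled here.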

Paper title