Paper Title
SaiT: Sparse Vision Transformers through Adaptive Token Pruning
Paper Authors
Paper Abstract
While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performance. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities. Thus one model offers a range of accuracy and throughput tradeoffs for different applications. We also introduce adaptive token pruning to optimize the patch token sparsity based on the input image. In addition, we investigate knowledge distillation to enhance token selection capability in early transformer modules. The Sparse Adaptive Image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces the computation complexity (FLOPs) by 39% - 43% and increases the throughput by 67% - 91% with less than 0.5% accuracy loss for various vision transformer models. Meanwhile, the same model also provides a zero-accuracy-drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models.
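To make the idea of per-image adaptive token pruning concrete, below is a minimal, hypothetical PyTorch sketch. It is based only on the abstract's description, not on the paper's actual implementation: the function name, the use of CLS-token attention as the importance score, and the cumulative-mass threshold `keep_mass` are all assumptions for illustration. The key point it demonstrates is that the number of retained patch tokens varies with the input image rather than being fixed.

```python
# Hypothetical sketch of per-image adaptive token pruning for a ViT.
# Scoring rule (CLS attention + cumulative-mass threshold) is an assumption,
# not necessarily SaiT's actual selection mechanism.
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_mass: float = 0.7):
    """Keep the smallest set of patch tokens whose CLS-attention mass reaches keep_mass.

    tokens:   (B, 1 + N, D)  -- CLS token followed by N patch tokens
    cls_attn: (B, N)         -- attention of CLS over patch tokens (rows sum to 1)
    Returns a list of B tensors, since each image may keep a different token count.
    """
    kept = []
    for b in range(tokens.shape[0]):
        # Rank patch tokens by importance and accumulate their attention mass.
        order = torch.argsort(cls_attn[b], descending=True)
        cum = torch.cumsum(cls_attn[b][order], dim=0)
        # Smallest k such that the top-k tokens cover keep_mass of the attention.
        k = int(torch.searchsorted(cum, torch.tensor(keep_mass)).item()) + 1
        idx = order[:k]
        # Re-attach the CLS token (index 0) in front of the kept patch tokens.
        kept.append(torch.cat([tokens[b, :1], tokens[b, 1 + idx]], dim=0))
    return kept

if __name__ == "__main__":
    B, N, D = 2, 196, 384
    tokens = torch.randn(B, 1 + N, D)
    cls_attn = torch.softmax(torch.randn(B, N), dim=-1)
    pruned = prune_tokens(tokens, cls_attn)
    print([p.shape for p in pruned])  # per-image token counts generally differ
```

In a full model, the kept tokens would be fed to the remaining transformer blocks, and skipping the pruning step entirely recovers the dense (zero-accuracy-drop) path described in the abstract.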