Paper Title
Token Merging: Your ViT But Faster
Paper Authors
Paper Abstract
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving training speed in practice by up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes the accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
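The matching algorithm the abstract refers to is bipartite soft matching: tokens are split into two alternating sets, each token in one set is paired with its most similar token in the other, and the r highest-scoring pairs are merged by averaging. The following is a minimal PyTorch sketch of that idea; the function name `bipartite_soft_matching_merge` is hypothetical, and for brevity it matches on raw token features and omits details of the paper's implementation such as matching on attention keys, protecting the class token, and tracking merged-token sizes for proportional attention.

```python
import torch

def bipartite_soft_matching_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r similar tokens per call, in the spirit of ToMe.

    x: token features of shape (batch, tokens, dim).
    Returns features of shape (batch, tokens - r, dim).
    """
    B, N, D = x.shape
    # Alternately assign tokens to two sets A and B.
    a, b = x[:, ::2, :], x[:, 1::2, :]
    # Cosine similarity between every token in A and every token in B.
    a_norm = a / a.norm(dim=-1, keepdim=True)
    b_norm = b / b.norm(dim=-1, keepdim=True)
    scores = a_norm @ b_norm.transpose(-1, -2)          # (B, |A|, |B|)
    # Each A-token proposes an edge to its most similar B-token.
    best_val, best_idx = scores.max(dim=-1)             # (B, |A|)
    # Keep only the r highest-scoring edges; those A-tokens get merged.
    merge_rank = best_val.argsort(dim=-1, descending=True)
    merged_src = merge_rank[:, :r]                      # A-tokens merged away
    kept_src = merge_rank[:, r:]                        # A-tokens that survive
    dst_idx = best_idx.gather(-1, merged_src)           # their targets in B

    # Average each merged A-token into its matched B-token.
    out_b = b.clone()
    src_feats = a.gather(1, merged_src.unsqueeze(-1).expand(-1, -1, D))
    out_b.scatter_reduce_(1, dst_idx.unsqueeze(-1).expand(-1, -1, D),
                          src_feats, reduce="mean", include_self=True)
    kept_a = a.gather(1, kept_src.unsqueeze(-1).expand(-1, -1, D))
    # Token order changes, which is harmless once positions are encoded.
    return torch.cat([kept_a, out_b], dim=1)            # N - r tokens remain
```

For example, `bipartite_soft_matching_merge(torch.randn(2, 197, 768), r=16)` returns a `(2, 181, 768)` tensor; applying such a reduction in every transformer block is what gradually shrinks the token count and raises throughput.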