Paper Title

SMYRF: Efficient Attention using Asymmetric Clustering

Paper Authors

Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis

Paper Abstract

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$, where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. On the contrary, prior fast attention methods impose constraints (e.g. queries and keys share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using $50\%$ less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens on BigGAN on CelebA-HQ.
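
The abstract describes the core recipe: asymmetric query/key transforms, LSH hashing, and equal-size cluster splitting followed by within-cluster attention. Below is a minimal PyTorch sketch of that general idea, not the authors' released implementation. It assumes a single hash round with one random projection, a sequence length divisible by the cluster size, and simplified asymmetric transforms; the function name `lsh_clustered_attention` and its arguments are illustrative.

```python
import torch


def lsh_clustered_attention(q, k, v, cluster_size=64):
    """Sketch of balanced-LSH attention: q, k, v have shape [batch, seq_len, dim]."""
    b, n, d = q.shape

    # 1) Simplified asymmetric transforms: after padding with norm-dependent
    #    components, a smaller Euclidean distance between a transformed query
    #    and key corresponds to a larger dot product q . k.
    q_norm = q.norm(dim=-1, keepdim=True)
    k_norm = k.norm(dim=-1, keepdim=True)
    m = q_norm.max() ** 2 + k_norm.max() ** 2
    q_ext = torch.cat([q, torch.zeros_like(q_norm),
                       (m - q_norm ** 2).clamp(min=0).sqrt()], dim=-1)
    k_ext = torch.cat([k, (m - k_norm ** 2).clamp(min=0).sqrt(),
                       torch.zeros_like(k_norm)], dim=-1)

    # 2) Hash queries and keys with one shared random projection and sort by
    #    hash value; sorting plus equal-size splitting yields balanced clusters.
    proj = torch.randn(d + 2, 1, device=q.device, dtype=q.dtype)
    q_idx = (q_ext @ proj).squeeze(-1).argsort(dim=-1)
    k_idx = (k_ext @ proj).squeeze(-1).argsort(dim=-1)

    def gather(x, idx):
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

    # 3) Attend only within each balanced cluster (block-diagonal attention),
    #    which is what drops the quadratic cost of dense attention.
    c = cluster_size
    q_s = gather(q, q_idx).reshape(b, n // c, c, d)
    k_s = gather(k, k_idx).reshape(b, n // c, c, d)
    v_s = gather(v, k_idx).reshape(b, n // c, c, d)
    scores = torch.einsum("bgqd,bgkd->bgqk", q_s, k_s) / d ** 0.5
    out = torch.einsum("bgqk,bgkd->bgqd", scores.softmax(dim=-1), v_s)
    out = out.reshape(b, n, d)

    # 4) Undo the query sort so outputs line up with the original token order.
    inv = torch.empty_like(q_idx)
    inv.scatter_(1, q_idx, torch.arange(n, device=q.device).expand(b, n))
    return gather(out, inv)


# Usage: approximate attention over 1024 tokens with clusters of 64 tokens.
q, k, v = (torch.randn(2, 1024, 32) for _ in range(3))
approx = lsh_clustered_attention(q, k, v, cluster_size=64)
```

In the full method, multiple hash rounds and the adaptive balanced-clustering scheme tighten the approximation; the sketch only illustrates why the per-layer cost falls from quadratic in the sequence length to roughly linear in the number of clusters times the squared cluster size.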
