Paper Title

Multi Resolution Analysis (MRA) for Approximate Self-Attention

Paper Authors

Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M. Fung, Vikas Singh

Paper Abstract

Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges eventually yield an MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most efficient self-attention proposals and is favorable for both short and long sequences. Code is available at \url{https://github.com/mlpen/mra-attention}.
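The abstract does not spell out the mechanism, but the general flavor of a multi-resolution approximation of self-attention can be sketched: keep coarse, block-averaged scores over most of the attention matrix and refine only the highest-scoring regions with exact, fine-resolution entries. The NumPy sketch below is a toy illustration of that idea under stated assumptions, not the authors' algorithm; the function toy_mra_attention and its block and keep_frac parameters are hypothetical, and a real implementation would avoid materializing the full score matrix.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def toy_mra_attention(Q, K, V, block=4, keep_frac=0.25):
    """Toy two-level multi-resolution approximation of softmax attention.

    Coarse level: scores are averaged within block x block tiles.
    Fine level: only the tiles with the largest coarse scores are replaced
    by their exact entries; all other tiles keep the averaged value.
    Illustrative sketch only -- not the MRA-attention algorithm itself,
    and it still forms the full score matrix for clarity.
    """
    n, d = Q.shape
    nb = n // block
    S = Q @ K.T / np.sqrt(d)                     # exact score matrix (reference)

    # Coarse level: mean score of each block x block tile, broadcast back.
    S_blocks = S.reshape(nb, block, nb, block).mean(axis=(1, 3))   # (nb, nb)
    approx = np.repeat(np.repeat(S_blocks, block, axis=0), block, axis=1)

    # Fine level: refine the top-scoring tiles with exact entries.
    k = max(1, int(keep_frac * nb * nb))
    top = np.argsort(S_blocks.ravel())[-k:]
    for idx in top:
        bi, bj = divmod(int(idx), nb)
        rows = slice(bi * block, (bi + 1) * block)
        cols = slice(bj * block, (bj + 1) * block)
        approx[rows, cols] = S[rows, cols]

    A = softmax(approx, axis=-1)                 # approximate attention weights
    return A @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    Q, K, V = rng.normal(size=(3, n, d))
    out_exact = softmax(Q @ K.T / np.sqrt(d)) @ V
    out_approx = toy_mra_attention(Q, K, V, block=4, keep_frac=0.25)
    print("mean abs error:", np.abs(out_exact - out_approx).mean())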
