Paper Title

ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention

Paper Authors

Yang Liu, Jiaxiang Liu, Li Chen, Yuxiang Lu, Shikun Feng, Zhida Feng, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

Paper Abstract

The Sparse Transformer has recently attracted a lot of attention due to its ability to reduce the quadratic dependency on sequence length. We argue that two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer. This paper proposes a well-designed model named ERNIE-Sparse. It consists of two distinctive parts: (i) a Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) a Self-Attention Regularization (SAR) method, a novel regularization designed to minimize the distance between transformers with different attention topologies. To evaluate the effectiveness of ERNIE-Sparse, we perform extensive evaluations. First, we run experiments on a multi-modal long-sequence modeling benchmark, Long Range Arena (LRA). The results demonstrate that ERNIE-Sparse significantly outperforms a variety of strong baselines, including dense attention and other efficient sparse attention methods, with an improvement of 2.77% (57.78% vs. 55.01%). Second, to further show the effectiveness of our method, we pretrain ERNIE-Sparse and verify it on 3 text classification and 2 QA downstream tasks, achieving improvements of 0.83% on the classification benchmark (92.46% vs. 91.63%) and 3.24% on the QA benchmark (74.67% vs. 71.43%). These results further demonstrate its superior performance.
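
To make the SAR idea from the abstract concrete, below is a minimal, hypothetical PyTorch-style sketch (not the authors' implementation): it runs the same model under a sparse and a dense attention topology and adds a symmetric-KL distance between the two predictions to the task loss. The names `model`, `sparse_mask`, `dense_mask`, the weight `alpha`, and the choice of KL over output logits are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sar_style_loss(model, inputs, labels, sparse_mask, dense_mask, alpha=0.1):
    """Illustrative sketch of a self-attention-regularization-style objective:
    penalize the distance between predictions made under two different
    attention topologies, in addition to the usual task loss."""
    # Forward pass with the sparse attention topology (the one used at inference).
    logits_sparse = model(inputs, attention_mask=sparse_mask)
    # Forward pass with the dense attention topology.
    logits_dense = model(inputs, attention_mask=dense_mask)

    # Supervised task loss on the sparse branch.
    task_loss = F.cross_entropy(logits_sparse, labels)

    # Symmetric KL divergence between the two topologies' predictive distributions.
    p = F.log_softmax(logits_sparse, dim=-1)
    q = F.log_softmax(logits_dense, dim=-1)
    reg = 0.5 * (F.kl_div(p, q, reduction="batchmean", log_target=True)
                 + F.kl_div(q, p, reduction="batchmean", log_target=True))

    return task_loss + alpha * reg
```

The sketch only assumes the model accepts an `attention_mask` argument that switches the attention topology; the exact distance function and regularization weight would be hyperparameters in practice.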
