DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Paper title

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Authors

Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

Abstract

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper, we propose a new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).

Paper title