Paper Title
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Paper Authors
Paper Abstract
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
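The replaced token detection setup described above can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the authors' implementation: the `generator` callable (a small masked LM returning per-position vocabulary logits), `mask_token_id`, and `mask_prob` are hypothetical names and parameters chosen for the example.

```python
# A minimal sketch (not the ELECTRA authors' code) of constructing
# replaced-token-detection training targets, assuming a masked-LM
# "generator" that maps token ids to per-position vocabulary logits.
import torch

def make_rtd_example(input_ids, generator, mask_token_id, mask_prob=0.15):
    """Corrupt input_ids with generator samples and build discriminator labels.

    input_ids: LongTensor [seq_len] of token ids.
    generator: callable mapping a [1, seq_len] id tensor to logits of shape
               [1, seq_len, vocab_size] (assumed interface).
    Returns (corrupted_ids, labels), where labels[i] = 1 iff token i was replaced.
    """
    seq_len = input_ids.size(0)

    # Choose a small subset of positions to mask (the MLM step on the generator side).
    mask = torch.rand(seq_len) < mask_prob
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id

    # Sample plausible replacements from the generator's output distribution.
    with torch.no_grad():
        logits = generator(masked_ids.unsqueeze(0))[0]            # [seq_len, vocab]
        samples = torch.distributions.Categorical(logits=logits).sample()

    corrupted_ids = input_ids.clone()
    corrupted_ids[mask] = samples[mask]

    # Discriminator label for every token: 1 where the corrupted token differs
    # from the original. If the generator happens to sample the original token,
    # the label stays 0, consistent with the task described in the abstract.
    labels = (corrupted_ids != input_ids).long()
    return corrupted_ids, labels
```

Note that, unlike MLM, the resulting binary classification loss is defined over all positions of `corrupted_ids`, not only the masked subset, which is the sample-efficiency argument the abstract makes.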