Paper Title
Downstream Datasets Make Surprisingly Good Pretraining Corpora
Paper Authors
Paper Abstract
For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around $10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$ datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Besides classification tasks, self-pretraining also provides benefits on structured output prediction tasks such as span-based question answering and commonsense inference, often recovering more than $50\%$ of the performance boost provided by pretraining on the BookWiki corpus. Our results hint that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of external pretraining data in massive amounts. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.
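The abstract describes a two-stage recipe: pretrain a transformer from scratch on the downstream training text alone, then finetune that same model on the labeled task. Below is a minimal sketch of that recipe using the Hugging Face `transformers` and `datasets` libraries; this framework choice, the SST-2 placeholder dataset, the reuse of the `roberta-base` tokenizer, and all hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of self-pretraining: masked-LM pretraining on the downstream
# training split only, followed by finetuning on the same labeled data.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Assumption: reuse the roberta-base tokenizer; dataset is a placeholder.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
raw = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True)

# Stage 1: pretrain a randomly initialized encoder with MLM on the
# *downstream* training text (self-pretraining), no external corpus.
mlm_model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))
mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="self-pretrained", num_train_epochs=10),
    train_dataset=tokenized["train"].remove_columns(["sentence", "label", "idx"]),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("self-pretrained")

# Stage 2: finetune the self-pretrained encoder on the same labeled examples.
clf_model = RobertaForSequenceClassification.from_pretrained(
    "self-pretrained", num_labels=2
)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
clf_trainer.train()
```

The point of the sketch is the data flow, not the settings: the only text the model ever sees is the task's own training split, so any gain over training the classifier from scratch comes from the pretraining objective rather than from external data.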