Paper Title
Pre-Training a Language Model Without Human Language
Paper Authors
Paper Abstract
In this paper, we study how the intrinsic nature of pre-training data contributes to fine-tuned downstream performance. To this end, we pre-train different transformer-based masked language models on several corpora with certain features, and we fine-tune those language models on the GLUE benchmark. We find that models pre-trained on unstructured data outperform those trained directly from scratch on downstream tasks. Our results also show that pre-training on structured data does not always give the model abilities that transfer to natural language downstream tasks. Surprisingly, we find that pre-training on certain non-human-language data yields GLUE performance close to that obtained by pre-training on another non-English natural language.
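The pipeline the abstract describes (pre-training a masked language model from scratch on an arbitrary corpus, then fine-tuning it on a GLUE task) can be sketched as below. This is a minimal illustration using the Hugging Face transformers and datasets libraries, not the authors' implementation: the model size, the corpus file pretrain_corpus.txt, the choice of SST-2, and all hyperparameters are assumptions made for the example.

# Sketch of the two-stage setup: (1) masked-LM pre-training from scratch on a
# plain-text corpus, (2) fine-tuning the resulting encoder on a GLUE task.
# All sizes, paths, and hyperparameters are illustrative, not the paper's.
from datasets import load_dataset
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification,
    RobertaTokenizerFast, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Stage 1: masked-LM pre-training on an arbitrary (possibly non-human-language) corpus.
corpus = load_dataset("text", data_files={"train": "pretrain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

config = RobertaConfig(vocab_size=tokenizer.vocab_size, num_hidden_layers=6,
                       hidden_size=512, num_attention_heads=8, intermediate_size=2048)
mlm_model = RobertaForMaskedLM(config)  # weights initialized from scratch

mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm_ckpt",
                           per_device_train_batch_size=32, num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("mlm_ckpt")

# Stage 2: fine-tune the pre-trained encoder on a GLUE task (SST-2 as an example).
sst2 = load_dataset("glue", "sst2")
sst2 = sst2.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

clf_model = RobertaForSequenceClassification.from_pretrained("mlm_ckpt", num_labels=2)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="glue_ckpt",
                           per_device_train_batch_size=32, num_train_epochs=3),
    train_dataset=sst2["train"],
    eval_dataset=sst2["validation"],
    tokenizer=tokenizer,  # enables dynamic padding during fine-tuning
)
clf_trainer.train()
print(clf_trainer.evaluate())

A "from scratch" baseline corresponds to skipping Stage 1 and constructing RobertaForSequenceClassification directly from the config, so that the downstream task is trained without any pre-trained weights.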