Paper Title

When Do You Need Billions of Words of Pretraining Data?

Authors

Yian Zhang, Alex Warstadt, Haau-Sing Li, Samuel R. Bowman

Abstract

NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods---classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks---and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features we test. A much larger quantity of data is needed in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
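
As a rough illustration of the unsupervised relative acceptability judgment method named in the abstract, the sketch below scores a BLiMP-style minimal pair with a masked-LM pseudo-log-likelihood: each token is masked in turn and the model's log-probabilities for the original tokens are summed, and the model "prefers" whichever sentence of the pair scores higher. The public roberta-base checkpoint and the example sentence pair are stand-ins chosen for illustration (a MiniBERTa checkpoint and BLiMP items would be substituted in practice); this is not the paper's exact evaluation code.

```python
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

# roberta-base is a stand-in; a MiniBERTa checkpoint would be loaded here instead.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | rest of sentence), masking one position at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, ids.size(0) - 1):  # skip the <s> and </s> special tokens
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Illustrative minimal pair (subject-verb agreement); the acceptable sentence
# should receive the higher pseudo-log-likelihood.
good = "The cats sleep on the sofa."
bad = "The cats sleeps on the sofa."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))
```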
