Paper Title
Coloring the Blank Slate: Pre-training Imparts a Hierarchical Inductive Bias to Sequence-to-sequence Models
Paper Authors
Paper Abstract
Relations between words are governed by hierarchical structure rather than linear ordering. Sequence-to-sequence (seq2seq) models, despite their success in downstream NLP applications, often fail to generalize in a hierarchy-sensitive manner when performing syntactic transformations - for example, transforming declarative sentences into questions. However, syntactic evaluations of seq2seq models have only observed models that were not pre-trained on natural language data before being trained to perform syntactic transformations, in spite of the fact that pre-training has been found to induce hierarchical linguistic generalizations in language models; in other words, the syntactic capabilities of seq2seq models may have been greatly understated. We address this gap using the pre-trained seq2seq models T5 and BART, as well as their multilingual variants mT5 and mBART. We evaluate whether they generalize hierarchically on two transformations in two languages: question formation and passivization in English and German. We find that pre-trained seq2seq models generalize hierarchically when performing syntactic transformations, whereas models trained from scratch on syntactic transformations do not. This result presents evidence for the learnability of hierarchical syntactic information from non-annotated natural language text while also demonstrating that seq2seq models are capable of syntactic generalization, though only after exposure to much more language data than human learners receive.
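To make the evaluation setup concrete, the sketch below shows how a pre-trained seq2seq model such as T5 could be prompted with a declarative sentence whose question form distinguishes hierarchical from linear generalization. It is a minimal illustration using the Hugging Face transformers API; the t5-base checkpoint, the "transform to question:" prefix, and the example sentence are assumptions for demonstration, not the paper's training data or configuration. As the abstract notes, in the actual experiments the pre-trained models are first trained on the syntactic transformation before their generalization is tested.

```python
# Minimal sketch (assumptions: the "t5-base" checkpoint, the task prefix, and the
# example sentence are illustrative, not taken from the paper). In practice the
# model would first be fine-tuned on declarative/question pairs before testing
# whether it generalizes hierarchically on held-out sentences.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"  # any pre-trained seq2seq checkpoint (BART, mT5, mBART) could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Test sentence with two auxiliaries: a hierarchical rule fronts the main-clause
# auxiliary ("is"), while a linear rule fronts the first auxiliary ("has").
source = "transform to question: the cat that has eaten is sleeping ."

inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The diagnostic rests on which auxiliary appears at the front of the generated question: fronting the main-clause auxiliary reflects a hierarchical generalization, while fronting the linearly first auxiliary reflects a linear one, mirroring the question-formation contrast described in the abstract.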