Paper Title
The EOS Decision and Length Extrapolation
Paper Authors
Paper Abstract
Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket-closing task, as well as achieving a 40% improvement over +EOS on the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
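To make the oracle setting concrete, here is a minimal sketch (not the authors' code) contrasting the two decoding regimes the abstract describes: standard EOS-based stopping (+EOS) versus oracle decoding, where generation is forced to a known target length by masking out the EOS token. All names (`toy_step`, `decode`, `EOS_ID`) are hypothetical stand-ins.

```python
# Sketch of EOS-based stopping vs. oracle-length decoding.
# The model here is a toy stand-in; the decoding logic is the point.
import torch

VOCAB_SIZE, EOS_ID = 16, 0

def toy_step(prefix: torch.Tensor) -> torch.Tensor:
    """Stand-in for a trained decoder step: returns logits over the vocabulary."""
    torch.manual_seed(int(prefix.sum()))  # deterministic toy behavior
    return torch.randn(VOCAB_SIZE)

def decode(max_len: int, oracle_len: int | None = None) -> list[int]:
    """If oracle_len is given, EOS is masked until that many tokens are emitted
    (the oracle setting); otherwise decoding stops when EOS wins (+EOS)."""
    out: list[int] = []
    prefix = torch.zeros(1, dtype=torch.long)
    for t in range(max_len):
        logits = toy_step(prefix)
        if oracle_len is not None and t < oracle_len:
            logits[EOS_ID] = float("-inf")  # forbid premature termination
        tok = int(logits.argmax())
        if tok == EOS_ID:
            break  # +EOS regime: the model decides when to stop
        out.append(tok)
        prefix = torch.cat([prefix, torch.tensor([tok])])
    return out

print(decode(max_len=20))                 # +EOS: stops at first predicted EOS
print(decode(max_len=20, oracle_len=12))  # oracle: forced to length 12
```

A -EOS model would simply have no EOS entry in its output vocabulary, so only the oracle-length variant of this loop applies to it.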