Paper Title
The EOS Decision and Length Extrapolation
Paper Authors
Paper Abstract
Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket-closing task, as well as achieving a 40% improvement over +EOS on the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
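To make the oracle setting concrete, here is a minimal sketch (not the authors' code) contrasting the two decoding regimes the abstract describes: standard EOS-based stopping (+EOS) versus oracle decoding, where generation is forced to a known target length by masking out the EOS token. All names (`toy_step`, `decode`, `EOS_ID`) are hypothetical stand-ins.

```python
# Sketch of EOS-based stopping vs. oracle-length decoding.
# The model here is a toy stand-in; the decoding logic is the point.
import torch

VOCAB_SIZE, EOS_ID = 16, 0

def toy_step(prefix: torch.Tensor) -> torch.Tensor:
    """Stand-in for a trained decoder step: returns logits over the vocabulary."""
    torch.manual_seed(int(prefix.sum()))  # deterministic toy behavior
    return torch.randn(VOCAB_SIZE)

def decode(max_len: int, oracle_len: int | None = None) -> list[int]:
    """If oracle_len is given, EOS is masked until that many tokens are emitted
    (the oracle setting); otherwise decoding stops when EOS wins (+EOS)."""
    out: list[int] = []
    prefix = torch.zeros(1, dtype=torch.long)
    for t in range(max_len):
        logits = toy_step(prefix)
        if oracle_len is not None and t < oracle_len:
            logits[EOS_ID] = float("-inf")  # forbid premature termination
        tok = int(logits.argmax())
        if tok == EOS_ID:
            break  # +EOS regime: the model decides when to stop
        out.append(tok)
        prefix = torch.cat([prefix, torch.tensor([tok])])
    return out

print(decode(max_len=20))                 # +EOS: stops at first predicted EOS
print(decode(max_len=20, oracle_len=12))  # oracle: forced to length 12
```

A -EOS model would simply have no EOS entry in its output vocabulary, so only the oracle-length variant of this loop applies to it.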