Paper Title
Lexical Generalization Improves with Larger Models and Longer Training
Paper Authors
Paper Abstract
While fine-tuned language models perform well on many tasks, they have also been shown to rely on superficial surface features such as lexical overlap. Excessive reliance on such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and find that larger models are much less susceptible to adopting lexical overlap heuristics. We also find that longer training leads models to abandon lexical overlap heuristics. Finally, we provide evidence that the disparity between model sizes has its source in the pre-trained model.
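To make the heuristic under analysis concrete, here is a minimal illustrative sketch (not code from the paper) of how lexical overlap between a premise and a hypothesis can be quantified, together with a sentence pair where high overlap does not imply entailment. The function name and tokenization are assumptions for illustration only; the paper's contrastive datasets and evaluation protocol are not reproduced here.

```python
# Minimal sketch (assumption, not the paper's method): token-level lexical
# overlap between a premise and a hypothesis.

def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    shared = sum(1 for tok in hypothesis_tokens if tok in premise_tokens)
    return shared / len(hypothesis_tokens)


if __name__ == "__main__":
    # Complete word overlap, yet the correct NLI label is non-entailment:
    # a model relying on the lexical overlap heuristic would be misled here.
    premise = "The doctor paid the actor"
    hypothesis = "The actor paid the doctor"
    print(f"overlap = {lexical_overlap(premise, hypothesis):.2f}")  # 1.00
```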