Paper Title
Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions
Paper Authors
Paper Abstract
Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) random Transformers are relatively more biased towards functions of low sensitivity; (ii) when trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity; (iii) on sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near-perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
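The central quantity in the abstract is the sensitivity of a Boolean function: at an input x, it counts how many single-bit flips change the function's output, and averaging over all inputs gives the function's average sensitivity. A minimal sketch of this definition (the function names here are illustrative, not from the paper; parity and a constant function serve as the maximally and minimally sensitive extremes):

```python
from itertools import product

def sensitivity_at(f, x):
    """Number of single-bit flips of x that change f's output."""
    return sum(
        f(x[:i] + (1 - x[i],) + x[i + 1:]) != f(x)
        for i in range(len(x))
    )

def average_sensitivity(f, n):
    """Average of sensitivity_at(f, x) over all 2^n inputs x."""
    inputs = list(product((0, 1), repeat=n))
    return sum(sensitivity_at(f, x) for x in inputs) / len(inputs)

parity = lambda x: sum(x) % 2  # every bit flip changes the output
constant = lambda x: 0         # no bit flip ever changes the output

print(average_sensitivity(parity, 4))    # 4.0 (maximal for n = 4)
print(average_sensitivity(constant, 4))  # 0.0 (minimal)
```

Under this measure, a low-sensitivity bias means a model preferentially represents functions closer to the constant end of this scale than to parity.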