Title

Transformers learn in-context by gradient descent

Authors

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

Abstract

At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction. Thus we show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of in-context learning in optimized Transformers. Building on this insight, we furthermore identify how Transformers surpass the performance of plain gradient descent by learning an iterative curvature correction and learn linear models on deep data representations to solve non-linear regression tasks. Finally, we discuss intriguing parallels to a mechanism identified to be crucial for in-context learning termed induction-head (Olsson et al., 2022) and show how it could be understood as a specific case of in-context learning by gradient descent learning within Transformers. Code to reproduce the experiments can be found at https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd .
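The core construction can be illustrated numerically. Below is a minimal NumPy sketch, not the authors' code: variable names such as W_star and eta are illustrative, and it only checks the prediction-level identity behind the construction, namely that one gradient-descent step on a linear regression loss (starting from zero weights) yields the same query prediction as the corresponding linear self-attention readout over the context tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context linear regression task: N context pairs (x_i, y_i) with y_i = W* x_i,
# plus one query input x_q whose target we want to predict.
N, d_in, d_out = 10, 3, 1
W_star = rng.normal(size=(d_out, d_in))   # ground-truth task weights (illustrative)
X = rng.normal(size=(N, d_in))            # context inputs, one per row
Y = X @ W_star.T                          # context targets
x_q = rng.normal(size=(d_in,))            # query input
eta = 0.1                                 # learning rate of the single GD step

# One step of gradient descent on L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2, from W_0 = 0.
W0 = np.zeros((d_out, d_in))
grad = (W0 @ X.T - Y.T) @ X / N           # dL/dW at W_0
W1 = W0 - eta * grad
pred_gd = W1 @ x_q                        # prediction after the GD step

# Linear self-attention view (no softmax): with context tokens e_i = (x_i, y_i) and a
# query token (x_q, 0), keys/queries that read the x-part and values that read the
# y-part (scaled by eta/N) give the readout sum_i (eta/N) * y_i * <x_i, x_q>.
pred_attn = (eta / N) * Y.T @ (X @ x_q)

print(np.allclose(pred_gd, pred_attn))    # True: the two predictions coincide
```

With W_0 = 0 the GD-step prediction reduces to (eta/N) * sum_i y_i <x_i, x_q>, which is an unnormalized linear-attention readout over the context; the weight construction in the paper realizes this computation inside the key, query, and value matrices of a single linear self-attention layer.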
