Paper Title

Memory-Efficient Backpropagation through Large Linear Layers

Authors

Daniel Bershatsky, Aleksandr Mikhalev, Alexandr Katrutsa, Julia Gusak, Daniil Merkulov, Ivan Oseledets

Abstract

In modern neural networks such as Transformers, linear layers require significant memory to store activations during the backward pass. This study proposes a memory-reduction approach to perform backpropagation through linear layers. Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplication and demonstrate that they require less memory with a moderate decrease in test accuracy. We also investigate the variance of the gradient estimate induced by the randomized matrix multiplication and compare it with the variance coming from gradient estimation based on the batch of samples. We demonstrate the benefits of the proposed method by fine-tuning the pre-trained RoBERTa model on GLUE tasks.
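The core idea can be illustrated with a short sketch. For a linear layer Y = XW, the weight gradient is Xᵀ(∂L/∂Y), so one can store only a random subset of the activation rows and form an unbiased, sub-sampled estimate of that matrix product in the backward pass, saving activation memory. The PyTorch snippet below is a minimal sketch of this idea under a simple uniform row-sampling scheme; the class name, the sampling rule, and the parameter `k` are illustrative assumptions, not the paper's exact estimator.

```python
import torch

class RandMatMulLinear(torch.autograd.Function):
    """Minimal sketch (illustrative, not the paper's exact method):
    a linear layer whose weight gradient is estimated with a randomized,
    sub-sampled matrix product to reduce stored activations."""

    @staticmethod
    def forward(ctx, x, weight, k):
        # x: (batch, in_features), weight: (in_features, out_features)
        batch = x.shape[0]
        # Sample k batch rows uniformly with replacement and keep only those
        # activation rows for the backward pass instead of the full matrix.
        idx = torch.randint(0, batch, (k,), device=x.device)
        ctx.save_for_backward(x[idx], weight, idx)
        ctx.batch, ctx.k = batch, k
        return x @ weight

    @staticmethod
    def backward(ctx, grad_out):
        x_sub, weight, idx = ctx.saved_tensors
        # Scaling by batch/k makes the sub-sampled product an unbiased
        # estimate of x^T grad_out.
        scale = ctx.batch / ctx.k
        grad_weight = scale * x_sub.t() @ grad_out[idx]
        # The input gradient uses only grad_out and the weight, so it is
        # exact and requires no stored activations.
        grad_x = grad_out @ weight.t()
        return grad_x, grad_weight, None

# Hypothetical usage: keep only 32 sampled activation rows per layer.
x = torch.randn(256, 512, requires_grad=True)
w = torch.randn(512, 128, requires_grad=True)
y = RandMatMulLinear.apply(x, w, 32)
y.sum().backward()
```

The scaling factor batch/k keeps the estimator unbiased, while the choice of k trades memory savings against the extra gradient variance that the abstract discusses.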
