Paper Title

SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network

Paper Authors

Tomer Galanti, Zachary S. Siegel, Aparna Gupte, Tomaso Poggio

Paper Abstract

We investigate the inherent bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices during the training of deep neural networks. Our results demonstrate that training with mini-batch SGD and weight decay induces a bias toward rank minimization in the weight matrices. Specifically, we show both theoretically and empirically that this bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Additionally, we predict and empirically confirm that weight decay is essential for this bias to occur. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices, making it applicable to a wide range of neural network architectures of any width or depth. Finally, we empirically explore the connection between this bias and generalization, finding that it has a marginal effect on the test performance.
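The bias described in the abstract can be probed directly by inspecting the singular-value spectrum of trained weight matrices. Below is a minimal sketch, not the authors' experimental code: it trains a small MLP with mini-batch SGD and weight decay on synthetic data, then reports an effective-rank estimate for each weight matrix. The architecture, dataset, hyperparameters, and rank threshold are illustrative assumptions chosen only to demonstrate the measurement.

```python
# A minimal sketch, not the authors' experimental code: the model, data,
# hyperparameters, and the effective-rank threshold below are illustrative
# assumptions, chosen only to demonstrate the measurement.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (assumed linear teacher plus noise).
X = torch.randn(1024, 32)
w_true = torch.randn(32, 1) / 32 ** 0.5
y = X @ w_true + 0.1 * torch.randn(1024, 1)

model = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# Small batch, relatively high learning rate, nonzero weight decay:
# the regime in which the abstract predicts the low-rank bias is strongest.
opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=5e-3)
loss_fn = nn.MSELoss()

batch_size = 8
for epoch in range(100):
    perm = torch.randperm(X.size(0))
    for i in range(0, X.size(0), batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()

def effective_rank(W, tol=1e-2):
    """Count singular values above tol times the largest singular value."""
    s = torch.linalg.svdvals(W)
    return int((s > tol * s[0]).sum())

for name, p in model.named_parameters():
    if p.dim() == 2:  # weight matrices only (skip biases)
        print(f"{name}: shape {tuple(p.shape)}, effective rank {effective_rank(p.detach())}")
```

Rerunning the same sketch with weight_decay=0.0, a larger batch_size, or a lower learning rate should, if the abstract's predictions hold, leave the effective ranks of the hidden-layer weight matrices noticeably higher.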
