Paper Title

SGD with Large Step Sizes Learns Sparse Features

Authors

Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion

Abstract

We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics orthogonal to the bouncing directions that biases it implicitly toward sparse predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used so that the regularization effect comes solely from the SGD training dynamics influenced by the step size schedule. Therefore, these observations unveil how, through the step size schedules, both gradient and noise drive together the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. Finally, this analysis allows us to shed new light on some common practices and observed phenomena when training neural networks. The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
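The schedule effect described in the abstract can be illustrated with a minimal sketch (this is not the authors' code; their actual experiments are in the linked repository). Assumptions made for the sketch: a small two-layer ReLU network, a synthetic regression target that depends on only two of the input coordinates, and a hypothetical large-then-small step-size schedule; the final probe simply counts first-layer weight rows with near-zero norm as a crude proxy for sparse features.

```python
# Minimal sketch of the large-then-small step-size schedule (hypothetical
# setup, not the paper's experiments; see the linked repository for those).
import torch

torch.manual_seed(0)
d, n, width = 30, 256, 100
X = torch.randn(n, d)
y = (X[:, 0] * X[:, 1]).unsqueeze(1)  # target depends on only 2 of 30 inputs

model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
)
loss_fn = torch.nn.MSELoss()

def sgd_epochs(lr, epochs, batch=32):
    """Plain SGD with a fixed step size for a given number of epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(n)
        for i in range(0, n, batch):
            idx = perm[i:i + batch]
            opt.zero_grad()
            loss_fn(model(X[idx]), y[idx]).backward()
            opt.step()

# Phase 1: large step size (value chosen arbitrarily for illustration) --
# the loss stabilizes while the iterates bounce across the valley.
sgd_epochs(lr=0.05, epochs=300)
# Phase 2: decayed step size -- descend into the valley from the region
# reached during the large-step phase.
sgd_epochs(lr=0.005, epochs=300)

# Crude sparsity probe: fraction of first-layer neurons whose weight row
# has (near-)zero norm after training.
W = model[0].weight.detach()
print("near-zero rows:", (W.norm(dim=1) < 1e-2).float().mean().item())
```

The only intent of the sketch is to mirror the two training phases the abstract describes: a large-step phase in which the loss stabilizes while the iterates bounce, followed by a small-step phase that converges from the sparser region found during the first phase.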
