Paper title
The large learning rate phase of deep learning: the catapult mechanism
Paper authors
Aharon Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari
Paper abstract
The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. These networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates, the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings, and we observe that the best performance in such settings is often attained in the large learning rate phase. We believe our results shed light on the characteristics of models trained at different learning rates. In particular, they fill a gap between the existing theory of wide neural networks and the nonlinear, large-learning-rate training dynamics relevant to practice.
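The abstract refers to a class of networks with solvable training dynamics but does not reproduce the model here. As a minimal sketch of the phase behavior it describes, the Python snippet below runs plain gradient descent on a hypothetical two-layer linear network f(x) = v·(Wx)/sqrt(m) with squared loss on a single example; the model choice, the name train_two_layer_linear, and all parameter values are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def train_two_layer_linear(lr, width=4096, steps=300, seed=0):
    """Gradient descent on f(x) = v @ (W @ x) / sqrt(m) with squared loss
    on one training example; returns the loss and kernel traces."""
    rng = np.random.default_rng(seed)
    m, d = width, 16
    x = np.ones(d) / np.sqrt(d)   # unit-norm input
    y = 0.0                       # target
    W = rng.standard_normal((m, d))
    v = rng.standard_normal(m)
    losses, kernels = [], []
    for _ in range(steps):
        h = W @ x                 # hidden layer (linear activations)
        f = v @ h / np.sqrt(m)    # network output on the example
        loss = 0.5 * (f - y) ** 2
        # Kernel (curvature) on this example; starts near 2 here.
        lam = (h @ h + (v @ v) * (x @ x)) / m
        losses.append(loss)
        kernels.append(lam)
        if not np.isfinite(loss) or loss > 1e12:
            break                 # training has diverged
        g = f - y                 # dloss/df
        dW = g * np.outer(v, x) / np.sqrt(m)  # grad w.r.t. W (old v)
        dv = g * h / np.sqrt(m)               # grad w.r.t. v (old h)
        W -= lr * dW
        v -= lr * dv
    return losses, kernels

# With the kernel near 2 at initialization, 2/lam ~ 1 separates the two
# phases in this sketch and roughly 4/lam ~ 2 is the largest stable rate.
for lr in (0.5, 1.5, 2.5):
    losses, kernels = train_two_layer_linear(lr)
    print(f"lr={lr}: steps={len(losses)}, final loss={losses[-1]:.3e}, "
          f"kernel {kernels[0]:.2f} -> {kernels[-1]:.2f}")
```

In this sketch, a learning rate below 2/lam gives a monotonically decreasing loss with a nearly frozen kernel (the small learning rate phase); between roughly 2/lam and 4/lam the loss first grows and then converges while the kernel shrinks, so the dynamics settle into a flatter region; above roughly 4/lam training diverges. This illustrates, under the stated assumptions, the narrow band of large, stable learning rates mentioned in the abstract.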