Paper Title
On the Maximum Hessian Eigenvalue and Generalization
Paper Authors
Paper Abstract
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $λ_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $λ_{max}$ and generalization. In this paper, we present findings that call the influence of $λ_{max}$ on generalization further into question. We show that: (1) while larger learning rates reduce $λ_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $λ_{max}$ without affecting generalization; (3) while SAM produces smaller $λ_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $λ_{max}$; and (5) while batch normalization does not consistently produce smaller $λ_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $λ_{max}$'s ability to explain generalization in neural networks.
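For context, below is a minimal sketch (not from the paper) of how $λ_{max}$, the largest Hessian eigenvalue of the loss, is commonly estimated in practice: power iteration on Hessian-vector products computed with automatic differentiation. The `model`, `loss_fn`, `inputs`, and `targets` arguments are placeholder assumptions, and the specific convergence tolerance is illustrative only.

```python
# Sketch: estimate lambda_max of the loss Hessian via power iteration on
# Hessian-vector products (HVPs). Assumes a PyTorch model and loss function.
import torch


def estimate_lambda_max(model, loss_fn, inputs, targets, iters=50, tol=1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)

    # First-order gradient with create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    # Random unit vector to start power iteration.
    v = torch.randn_like(flat_grad)
    v /= v.norm()

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product: d/dtheta (g . v) = H v
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])

        # Rayleigh quotient v^T H v estimates the dominant eigenvalue.
        new_eigenvalue = (v @ hv).item()
        v = hv / (hv.norm() + 1e-12)

        if abs(new_eigenvalue - eigenvalue) < tol * max(abs(eigenvalue), 1.0):
            return new_eigenvalue
        eigenvalue = new_eigenvalue
    return eigenvalue
```

This kind of HVP-based power iteration avoids forming the full Hessian, which is intractable for deep networks; note that it converges to the eigenvalue of largest magnitude, which coincides with $λ_{max}$ when the dominant curvature direction is positive.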