Paper Title
On the Maximum Hessian Eigenvalue and Generalization
Paper Authors
Paper Abstract
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $λ_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $λ_{max}$ and generalization. In this paper, we present findings that call the influence of $λ_{max}$ on generalization further into question. We show that: (1) while larger learning rates reduce $λ_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $λ_{max}$ without affecting generalization; (3) while SAM produces smaller $λ_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $λ_{max}$; and (5) while batch normalization does not consistently produce smaller $λ_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $λ_{max}$'s ability to explain generalization in neural networks.
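For context, below is a minimal sketch (not from the paper) of how $λ_{max}$, the largest Hessian eigenvalue of the loss, is commonly estimated in practice: power iteration on Hessian-vector products computed with automatic differentiation. The `model`, `loss_fn`, `inputs`, and `targets` arguments are placeholder assumptions, and the specific convergence tolerance is illustrative only.

```python
# Sketch: estimate lambda_max of the loss Hessian via power iteration on
# Hessian-vector products (HVPs). Assumes a PyTorch model and loss function.
import torch


def estimate_lambda_max(model, loss_fn, inputs, targets, iters=50, tol=1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)

    # First-order gradient with create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])

    # Random unit vector to start power iteration.
    v = torch.randn_like(flat_grad)
    v /= v.norm()

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product: d/dtheta (g . v) = H v
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])

        # Rayleigh quotient v^T H v estimates the dominant eigenvalue.
        new_eigenvalue = (v @ hv).item()
        v = hv / (hv.norm() + 1e-12)

        if abs(new_eigenvalue - eigenvalue) < tol * max(abs(eigenvalue), 1.0):
            return new_eigenvalue
        eigenvalue = new_eigenvalue
    return eigenvalue
```

This kind of HVP-based power iteration avoids forming the full Hessian, which is intractable for deep networks; note that it converges to the eigenvalue of largest magnitude, which coincides with $λ_{max}$ when the dominant curvature direction is positive.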