Paper Title
High-dimensional limit theorems for SGD: Effective dynamics and critical scaling
Paper Authors
Paper Abstract
We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss, but at which a new correction term appears that changes the phase diagram. Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second layer width grows.
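The ballistic (ODE) picture can be illustrated on a small toy instance. The sketch below is a hypothetical example, not the paper's exact setup: online SGD in dimension n for a spiked-covariance model y ~ N(0, I + λ v vᵀ) with per-sample loss L(x; y) = −(y·x)²/2 on the unit sphere, step-size δ = c/n, run for O(n) steps so the rescaled time s = tδ is order one. The tracked summary statistic is the overlap m_t = ⟨x_t, v⟩ with the planted spike; all parameters (λ = 4, n = 400, c = 1) are illustrative assumptions.

```python
import numpy as np

def online_sgd_overlap(n=400, lam=4.0, c=1.0, horizon=12.0, seed=0):
    """Online SGD for a spiked-covariance toy model y ~ N(0, I + lam*v*v^T),
    per-sample loss L(x; y) = -(y.x)^2 / 2, with x retracted to the unit
    sphere after each step.  Tracks the summary statistic m_t = <x_t, v>.
    Step size is c/n; `horizon` is the rescaled time s = t * (c/n)."""
    rng = np.random.default_rng(seed)
    delta = c / n
    steps = int(horizon * n / c)
    v = np.zeros(n)
    v[0] = 1.0                                  # planted spike direction
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                      # random init: m_0 ~ n^{-1/2}
    overlaps = [x @ v]
    for _ in range(steps):
        # one fresh sample with covariance I + lam * v v^T
        y = rng.standard_normal(n) + np.sqrt(lam) * rng.standard_normal() * v
        x = x + delta * (y @ x) * y             # SGD step on -(y.x)^2 / 2
        x /= np.linalg.norm(x)                  # retract to the sphere
        overlaps.append(x @ v)
    return np.array(overlaps)

m = online_sgd_overlap()
print(f"initial overlap {m[0]:+.3f}, final overlap {m[-1]:+.3f}")
```

A back-of-envelope average of the update over y suggests an effective drift dm/ds ≈ m(λ(1 − m²) − (c/2)(1 + λm²)) for this toy model: the first term is population gradient flow, while the second, proportional to the step-size constant c, comes from the sphere normalization and pulls the limiting overlap strictly below 1. This heuristic computation echoes, but should not be confused with, the paper's critical-scaling correction term.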