Paper Title
Kernel and Rich Regimes in Overparametrized Models
Paper Authors
Paper Abstract
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We also highlight an interesting role for the width of a model in the case that the predictor is not identically zero at initialization. We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
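The kernel-to-rich transition described in the abstract can be illustrated in a few lines. The sketch below is not the authors' code: it assumes a depth-2 "diagonal" linear model f(x) = ⟨u∘u − v∘v, x⟩ trained by plain gradient descent on a synthetic sparse-regression problem, and the variable names (u, v, alpha), the helper train, and all hyperparameters are illustrative choices rather than details taken from the paper. With a large initialization scale alpha the learned coefficient vector stays close to a minimum ℓ2-norm (kernel/RKHS-style) interpolator, while a small alpha pushes the dynamics into the rich regime and yields a much sparser, smaller-ℓ1 solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 40, 3                        # fewer samples than features (overparametrized)
w_star = np.zeros(d)
w_star[:k] = 1.0                           # sparse "ground truth" predictor
X = rng.standard_normal((n, d))
y = X @ w_star

def train(alpha, lr=1e-3, steps=200_000, tol=1e-8):
    """Gradient descent on the squared loss of f(x) = <u*u - v*v, x>,
    starting from u = v = alpha * 1 (so the predictor is zero at init)."""
    u = alpha * np.ones(d)
    v = alpha * np.ones(d)
    for _ in range(steps):
        w = u * u - v * v                  # effective linear predictor
        r = X @ w - y                      # residuals on the training data
        g = X.T @ r / n                    # gradient of the loss w.r.t. w
        u -= lr * 2.0 * u * g              # chain rule through u*u
        v += lr * 2.0 * v * g              # chain rule through -v*v
        if np.mean(r * r) < tol:
            break
    return u * u - v * v

# Large alpha ~ "kernel" (lazy) regime: solution resembles the minimum l2-norm
# interpolator.  Small alpha ~ "rich" (active) regime: solution has much
# smaller l1 norm and is close to the sparse w_star.
for alpha in (4.0, 0.01):
    w = train(alpha)
    print(f"alpha={alpha:5.2f}  ||w||_1={np.linalg.norm(w, 1):6.2f}  "
          f"||w||_2={np.linalg.norm(w, 2):6.2f}  "
          f"dist to sparse w*={np.linalg.norm(w - w_star):6.3f}")
```

Running the two settings side by side makes the role of the initialization scale concrete: only alpha changes between the runs, yet the implicit bias of gradient descent, and hence the recovered predictor, changes qualitatively.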