Paper Title
Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance
Paper Authors
Paper Abstract
The ability of neural networks to provide `best in class' approximation across a wide range of applications is well-documented. Nevertheless, the powerful expressivity of neural networks comes to naught if one is unable to effectively train (choose) the parameters defining the network. In general, neural networks are trained by gradient descent type optimization methods, or a stochastic variant thereof. In practice, with such methods the loss function decreases rapidly at the beginning of training but then, after a relatively small number of steps, its decrease slows down significantly. The loss may even appear to stagnate over a large number of epochs, only to then suddenly start to decrease rapidly again for no apparent reason. This so-called plateau phenomenon manifests itself in many learning tasks. The present work aims to identify and quantify the root causes of the plateau phenomenon. No assumptions are made on the number of neurons relative to the number of training data, and our results hold for both the lazy and adaptive regimes. The main findings are: plateaux correspond to periods during which activation patterns remain constant, where the activation pattern refers to the number of data points that activate a given neuron; quantification of the convergence of the gradient flow dynamics; and a characterization of stationary points in terms of solutions of local least squares regression lines over subsets of the training data. Based on these conclusions, we propose a new iterative training method, the Active Neuron Least Squares (ANLS), characterized by an explicit adjustment of the activation pattern at each step, which is designed to enable a quick exit from a plateau. Illustrative numerical examples are included throughout.
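To make the notion of an "activation pattern" concrete, the following is a minimal sketch (not the authors' code or the ANLS algorithm): a shallow 1-D ReLU network is trained by plain gradient descent on a toy regression task, and at each step we record which data points activate which neuron. The toy target, network width, learning rate, and step count are all assumptions chosen purely for illustration; the point is that stretches of iterations during which the recorded pattern stays frozen are exactly the periods the abstract associates with loss plateaux.

```python
# Illustrative sketch only: shallow ReLU network, gradient descent, and
# tracking of the activation pattern (which data points activate each neuron).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points on [-1, 1], smooth target (an assumption for illustration).
n = 64
x = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * x)

# Shallow ReLU network: f(x) = sum_k c_k * relu(w_k * x + b_k)
m = 8                                    # number of neurons (assumed width)
w = rng.normal(size=m)
b = rng.normal(size=m)
c = rng.normal(size=m) / m

def forward(x, w, b, c):
    pre = np.outer(x, w) + b             # (n, m) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU
    return act @ c, pre

def activation_pattern(pre):
    # Boolean (n, m) matrix: which data points activate which neuron.
    return pre > 0.0

lr = 1e-2
prev_pattern = None
plateau_len = 0

for step in range(20001):
    f, pre = forward(x, w, b, c)
    r = f - y                            # residual
    loss = 0.5 * np.mean(r ** 2)

    pattern = activation_pattern(pre)
    if prev_pattern is not None and np.array_equal(pattern, prev_pattern):
        plateau_len += 1                 # pattern unchanged since last step
    else:
        plateau_len = 0
    prev_pattern = pattern

    # Gradients of the mean squared loss (sub-gradient 0 taken at the kink).
    act = np.maximum(pre, 0.0)
    grad_c = act.T @ r / n
    grad_pre = (pattern * np.outer(r, c)) / n   # d loss / d pre-activation
    grad_w = grad_pre.T @ x
    grad_b = grad_pre.sum(axis=0)

    c -= lr * grad_c
    w -= lr * grad_w
    b -= lr * grad_b

    if step % 2000 == 0:
        print(f"step {step:6d}  loss {loss:.5f}  pattern frozen for {plateau_len} steps")
```

Running this typically shows long runs of steps with the pattern frozen while the loss barely moves, followed by a pattern change and a renewed drop in the loss, which is the qualitative behaviour the paper explains and quantifies.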