Paper Title
The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks
Paper Authors
Paper Abstract
Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers have explained this by investigating the implicit regularization effect of optimization algorithms. A remarkable step forward is the work of Lyu & Li (2019), which proves that gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Beyond GD, adaptive algorithms such as AdaGrad, RMSProp, and Adam are popular owing to their rapid training; however, theoretical guarantees for the generalization of adaptive optimization algorithms are still lacking. In this paper, we study the implicit regularization of adaptive optimization algorithms when they optimize the logistic loss on homogeneous deep neural networks. We prove that adaptive algorithms whose conditioner adopts an exponential moving average strategy (such as Adam and RMSProp) can maximize the margin of the neural network, while AdaGrad, whose conditioner directly sums historical squared gradients, cannot. This indicates the superiority, in terms of generalization, of the exponential moving average strategy in conditioner design. Technically, we provide a unified framework for analyzing the convergent direction of adaptive optimization algorithms by constructing a novel adaptive gradient flow and a surrogate margin. Our experiments support the theoretical findings on the convergent direction of adaptive optimization algorithms.
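For concreteness, the following minimal sketch contrasts the two conditioner strategies the abstract distinguishes: AdaGrad's cumulative sum of squared gradients versus the exponential moving average used by RMSProp (and, with an additional first-moment estimate, by Adam). The function names and hyperparameter defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adagrad_step(w, v, grad, lr=1e-2, eps=1e-8):
    """AdaGrad: the conditioner v is the running sum of squared gradients."""
    v = v + grad ** 2                       # sum never decays
    w = w - lr * grad / (np.sqrt(v) + eps)  # elementwise preconditioned update
    return w, v

def rmsprop_step(w, v, grad, lr=1e-2, beta=0.99, eps=1e-8):
    """RMSProp: the conditioner v is an exponential moving average of squared gradients."""
    v = beta * v + (1.0 - beta) * grad ** 2  # old gradients are forgotten geometrically
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```

Roughly, the exponentially averaged conditioner forgets stale gradient information, whereas AdaGrad's ever-growing sum does not; this difference in the conditioner is what the paper's analysis ties to whether the training direction converges to a margin-maximizing one.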