Paper Title

Theoretical analysis of Adam using hyperparameters close to one without Lipschitz smoothness

Authors

Iiduka, Hideaki

Abstract

Convergence and convergence rate analyses of adaptive methods, such as Adaptive Moment Estimation (Adam) and its variants, have been widely studied for nonconvex optimization. The analyses are based on assumptions that the expected or empirical average loss function is Lipschitz smooth (i.e., its gradient is Lipschitz continuous) and the learning rates depend on the Lipschitz constant of the Lipschitz continuous gradient. Meanwhile, numerical evaluations of Adam and its variants have clarified that using small constant learning rates without depending on the Lipschitz constant and hyperparameters ($\beta_1$ and $\beta_2$) close to one is advantageous for training deep neural networks. Since computing the Lipschitz constant is NP-hard, the Lipschitz smoothness condition would be unrealistic. This paper provides theoretical analyses of Adam without assuming the Lipschitz smoothness condition in order to bridge the gap between theory and practice. The main contribution is to show theoretical evidence that Adam using small learning rates and hyperparameters close to one performs well, whereas the previous theoretical results were all for hyperparameters close to zero. Our analysis also leads to the finding that Adam performs well with large batch sizes. Moreover, we show that Adam performs well when it uses diminishing learning rates and hyperparameters close to one.
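
For context, the hyperparameters $\beta_1$, $\beta_2$ and the learning rate discussed in the abstract are those of the standard Adam recursion (Kingma & Ba, 2015), sketched below in textbook notation; the symbols $m_t$, $v_t$, $\theta_t$, $g_t$, $\alpha$, and $\epsilon$ are assumed here and the paper's exact indexing or bias-correction conventions may differ.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, & \hat{m}_t &= \frac{m_t}{1-\beta_1^{\,t}},\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, & \hat{v}_t &= \frac{v_t}{1-\beta_2^{\,t}},\\
\theta_{t+1} &= \theta_t - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\end{aligned}
$$

where $g_t$ is the stochastic (mini-batch) gradient at step $t$ and the square and square root act elementwise. In this notation, "hyperparameters close to one" means choosing $\beta_1$ and $\beta_2$ near $1$, and "small learning rates" means a small constant (or diminishing) $\alpha$ chosen without reference to any Lipschitz constant.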
