Paper Title
Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared to SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units (GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results have shown that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD (MSGD), which is one of the most widely used variants of SGD, to converge to an $\epsilon$-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods.
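As a rough illustration only (the abstract does not spell out the update rule), the following is a minimal PyTorch-style sketch of a normalized momentum update in the spirit of SNGM: a momentum buffer accumulates stochastic gradients, and the parameter step is taken along the normalized buffer rather than the raw buffer. The class name NormalizedMomentumSGD, the default hyperparameter values, and the per-tensor normalization are assumptions made for illustration; the paper's exact formulation and analysis should be consulted for the definitive algorithm.

    # Hypothetical sketch of a normalized-momentum update in the spirit of SNGM.
    # Names, defaults, and per-tensor normalization are illustrative assumptions.
    import torch

    class NormalizedMomentumSGD(torch.optim.Optimizer):
        def __init__(self, params, lr=0.1, momentum=0.9, eps=1e-8):
            defaults = dict(lr=lr, momentum=momentum, eps=eps)
            super().__init__(params, defaults)

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    state = self.state[p]
                    if "buf" not in state:
                        state["buf"] = torch.zeros_like(p)
                    buf = state["buf"]
                    # Momentum buffer: buf <- momentum * buf + grad
                    buf.mul_(group["momentum"]).add_(p.grad)
                    # Normalized step: p <- p - lr * buf / ||buf||
                    step_size = group["lr"] / (buf.norm().item() + group["eps"])
                    p.add_(buf, alpha=-step_size)

Because the step length depends only on the learning rate and not on the gradient magnitude, such a normalized update is less sensitive to the gradient scaling issues that the abstract associates with large-batch training; this is a design intuition, not a claim from the abstract itself.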