Paper Title
Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning. Compared to SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units (GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results have shown that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD (MSGD), which is one of the most widely used variants of SGD, to converge to an $\epsilon$-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods.
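As a rough illustration only (the abstract does not spell out the update rule), the following is a minimal PyTorch-style sketch of a normalized momentum update in the spirit of SNGM: a momentum buffer accumulates stochastic gradients, and the parameter step is taken along the normalized buffer rather than the raw buffer. The class name NormalizedMomentumSGD, the default hyperparameter values, and the per-tensor normalization are assumptions made for illustration; the paper's exact formulation and analysis should be consulted for the definitive algorithm.

    # Hypothetical sketch of a normalized-momentum update in the spirit of SNGM.
    # Names, defaults, and per-tensor normalization are illustrative assumptions.
    import torch

    class NormalizedMomentumSGD(torch.optim.Optimizer):
        def __init__(self, params, lr=0.1, momentum=0.9, eps=1e-8):
            defaults = dict(lr=lr, momentum=momentum, eps=eps)
            super().__init__(params, defaults)

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    state = self.state[p]
                    if "buf" not in state:
                        state["buf"] = torch.zeros_like(p)
                    buf = state["buf"]
                    # Momentum buffer: buf <- momentum * buf + grad
                    buf.mul_(group["momentum"]).add_(p.grad)
                    # Normalized step: p <- p - lr * buf / ||buf||
                    step_size = group["lr"] / (buf.norm().item() + group["eps"])
                    p.add_(buf, alpha=-step_size)

Because the step length depends only on the learning rate and not on the gradient magnitude, such a normalized update is less sensitive to the gradient scaling issues that the abstract associates with large-batch training; this is a design intuition, not a claim from the abstract itself.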