Paper Title

AdaScale SGD: A User-Friendly Algorithm for Distributed Training

Paper Authors

Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin

Paper Abstract

When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient's variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale's convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular "linear learning rate scaling" rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale's qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.
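
For intuition, below is a minimal NumPy sketch of the gain-ratio idea the abstract describes: the learning rate is scaled according to how much averaging gradients over S workers reduces the gradient's variance. This is a sketch under assumptions, not the authors' implementation; the class name, the moving-average smoothing constant, and the epsilon term are illustrative, and it assumes per-worker gradients from at least two workers are available at each step.

```python
# Sketch of a variance-based learning-rate gain for large-batch SGD.
# Assumptions (not from the paper's code): per-worker gradients arrive as a
# list of arrays, and exponential moving averages smooth the noisy estimates.
import numpy as np

class AdaScaleGainSketch:
    """Estimates a learning-rate gain from gradient variance across S workers."""

    def __init__(self, num_workers, smoothing=0.999):
        assert num_workers >= 2, "variance estimate needs at least two workers"
        self.S = num_workers
        self.smoothing = smoothing   # EMA factor (illustrative choice)
        self.var_avg = None          # running estimate of per-worker gradient variance
        self.sqnorm_avg = None       # running estimate of ||expected gradient||^2

    def gain(self, worker_grads):
        """worker_grads: list of S flattened gradient vectors (np.ndarray) from one step."""
        S = self.S
        grads = np.stack(worker_grads)        # shape (S, dim)
        mean_grad = grads.mean(axis=0)        # the aggregated (large-batch) gradient

        # Unbiased estimates of the per-worker gradient variance and of the
        # squared norm of the expected gradient.
        mean_sqnorm = (grads ** 2).sum(axis=1).mean()   # average of ||g_i||^2
        agg_sqnorm = (mean_grad ** 2).sum()             # ||averaged gradient||^2
        var_est = (S / (S - 1)) * (mean_sqnorm - agg_sqnorm)
        sqnorm_est = max(agg_sqnorm - var_est / S, 0.0)

        # Smooth the noisy per-step estimates with an EMA.
        if self.var_avg is None:
            self.var_avg, self.sqnorm_avg = var_est, sqnorm_est
        else:
            c = self.smoothing
            self.var_avg = c * self.var_avg + (1 - c) * var_est
            self.sqnorm_avg = c * self.sqnorm_avg + (1 - c) * sqnorm_est

        # Gain ratio in [1, S]: close to S when gradient noise dominates,
        # close to 1 when the gradient is nearly deterministic.
        eps = 1e-12
        return (self.var_avg + self.sqnorm_avg) / (self.var_avg / S + self.sqnorm_avg + eps)
```

In use, the returned gain would roughly multiply the single-worker learning rate at each step; early in training, when gradient noise is large relative to the gradient itself, the gain stays small and grows as training proceeds, which is consistent with the warm-up-like behavior the abstract mentions emerging from a principled mechanism.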
