Paper Title

Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

Paper Authors

Daegun Yoon, Sangyoon Oh

Paper Abstract


To train deep learning models faster, distributed training on multiple GPUs has become a very popular scheme in recent years. However, communication bandwidth is still a major bottleneck for training performance. To improve overall training performance, recent works have proposed gradient sparsification methods that significantly reduce communication traffic. Most of them, such as Top-k gradient sparsification (Top-k SGD), require gradient sorting to select meaningful gradients. However, Top-k SGD is limited in how much it can speed up overall training performance, because gradient sorting is significantly inefficient on GPUs. In this paper, we conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance. Based on the observations from our empirical analysis, we plan to develop a high-performance gradient sparsification method as future work.
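For readers unfamiliar with the method named in the abstract, below is a minimal, illustrative sketch (not the authors' implementation) of what Top-k gradient sparsification does: each worker keeps only the k largest-magnitude entries of its gradient and communicates them as (index, value) pairs. The function names `sparsify_topk` and `desparsify` and the `density` parameter are hypothetical, chosen for this sketch; the `torch.topk` selection step stands in for the GPU sorting/selection work whose cost the paper analyzes.

```python
# Illustrative sketch of Top-k gradient sparsification (Top-k SGD), assuming PyTorch.
# Only the k largest-magnitude gradient entries are kept and communicated.
import torch

def sparsify_topk(grad: torch.Tensor, density: float = 0.01):
    """Return indices and values of the top-k gradient entries by magnitude.

    `density` is the fraction of entries kept (e.g. 0.01 keeps 1%).
    Name and signature are hypothetical, for illustration only.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    # Selection of the k largest |g_i|; this is the sorting/selection step
    # that the paper identifies as inefficient on GPUs.
    _, idx = torch.topk(flat.abs(), k, sorted=False)
    values = flat[idx]
    return idx, values

def desparsify(idx: torch.Tensor, values: torch.Tensor, numel: int, shape):
    """Rebuild a dense gradient tensor from the communicated (index, value) pairs."""
    dense = torch.zeros(numel, device=values.device, dtype=values.dtype)
    dense[idx] = values
    return dense.view(shape)

if __name__ == "__main__":
    g = torch.randn(1_000_000)                  # stand-in for one layer's gradient
    idx, vals = sparsify_topk(g, density=0.01)  # keep 1% of the entries
    g_hat = desparsify(idx, vals, g.numel(), g.shape)
    print(idx.numel(), "of", g.numel(), "entries communicated")
```

In a real distributed setting, the (index, value) pairs would be exchanged (e.g. via all-gather) instead of the full dense gradient, which is how the communication traffic reduction described in the abstract is obtained.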
