Paper Title
Sign Bit is Enough: A Learning Synchronization Framework for Multi-hop All-reduce with Ultimate Compression
Paper Authors
Paper Abstract
Traditional one-bit compressed stochastic gradient descent cannot be directly employed in multi-hop all-reduce, a widely adopted distributed training paradigm in network-intensive high-performance computing systems such as public clouds. Our theoretical findings show that, due to cascading compression, the training process suffers considerable degradation in convergence performance. To overcome this limitation, we implement Marsit, a sign-bit compression-based learning synchronization framework. It prevents cascading compression via an elaborate bit-wise operation for unbiased sign aggregation, together with a dedicated global compensation mechanism that mitigates compression deviation. The proposed framework retains the same theoretical convergence rate as non-compression mechanisms. Experimental results demonstrate that Marsit reduces training time by up to 35% while preserving the same accuracy as uncompressed training.
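To make the idea of sign-bit compression with a compensation mechanism concrete, here is a minimal sketch of generic sign compression with error feedback, the family of techniques the abstract builds on. This is an illustrative assumption, not Marsit's actual bit-wise aggregation or global compensation scheme; the function name and scaling choice are hypothetical.

```python
import numpy as np

def sign_compress_with_feedback(grad, residual):
    """Generic one-bit (sign) compression with error-feedback compensation.

    Illustrative sketch only -- NOT Marsit's algorithm. The residual
    accumulates the compression deviation and is re-injected before the
    next compression step, which is the standard way to keep sign-based
    updates unbiased over time.
    """
    corrected = grad + residual            # re-apply accumulated deviation
    scale = np.mean(np.abs(corrected))     # one scalar kept beside the sign bits
    compressed = scale * np.sign(corrected)
    new_residual = corrected - compressed  # deviation carried to next step
    return compressed, new_residual
```

Note that `compressed + new_residual` exactly recovers the compensated gradient, so no information is permanently lost; only its release is delayed, which is why such schemes can match the convergence rate of uncompressed training.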