Paper Title
HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System
Paper Authors
Paper Abstract
As giant dense models advance quality but require large GPU budgets for training, the sparsely gated Mixture-of-Experts (MoE), a kind of conditional computation architecture, has been proposed to scale models while keeping their computation constant. Specifically, the input tokens are routed by the gate network and activate only part of the expert network. Existing MoE training systems support only some of the mainstream MoE models (e.g., Top-K gating), and only on expensive high-bandwidth GPU clusters. In this paper, we present HetuMoE, a high-performance large-scale sparse MoE training system built on Hetu. HetuMoE provides multiple gating strategies and efficient GPU kernel implementations. To further improve training efficiency on commodity GPU clusters (e.g., with only 1 NIC per node), we introduce a hierarchical AllToAll communication scheme that combines hierarchical networking with message aggregation. Compared with existing state-of-the-art MoE systems, HetuMoE achieves at least a 15% speedup. In particular, HetuMoE outperforms DeepSpeed-MoE by up to 8.1x under the switch gate with a batch size of 32. Our code is available at: https://github.com/PKU-DAIR/Hetu.
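For readers unfamiliar with MoE routing, the sketch below illustrates the Top-1 ("switch") gating the abstract refers to: a gate network scores each token and only the selected expert is activated. This is a minimal PyTorch illustration under assumed names (`SwitchGate`, `hidden_dim`, `num_experts`), not HetuMoE's optimized GPU kernel implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchGate(nn.Module):
    """Top-1 ("switch") gating: each token is routed to a single expert."""
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        # Hypothetical gate projection; real systems add load-balancing losses,
        # expert capacity limits, and fused GPU kernels for these steps.
        self.w_gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, tokens: torch.Tensor):
        # tokens: (num_tokens, hidden_dim)
        logits = self.w_gate(tokens)                # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        gate_score, expert_idx = probs.max(dim=-1)  # pick one expert per token
        return gate_score, expert_idx

if __name__ == "__main__":
    gate = SwitchGate(hidden_dim=512, num_experts=8)
    x = torch.randn(16, 512)                        # 16 tokens
    score, idx = gate(x)
    print(idx.tolist())                             # expert chosen for each token
```

In a multi-node run, tokens routed to remote experts are exchanged via AllToAll communication; the hierarchical AllToAll described in the abstract aggregates messages within a node before they cross the single NIC, which is how HetuMoE targets training efficiency on commodity GPU clusters.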