Paper Title

AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness

Paper Authors

Dacheng Li, Hongyi Wang, Eric Xing, Hao Zhang

Paper Abstract


Scaling up model sizes can lead to fundamentally new capabilities in many machine learning (ML) tasks. However, training big models requires strong distributed system expertise to carefully design model-parallel execution strategies that suit the model architectures and cluster setups. In this paper, we develop AMP, a framework that automatically derives such strategies. AMP identifies a valid space of model parallelism strategies and efficiently searches the space for high-performing strategies, by leveraging a cost model designed to capture the heterogeneity of the model and cluster specifications. Unlike existing methods, AMP is specifically tailored to support complex models composed of uneven layers and cluster setups with more heterogeneous accelerators and bandwidth. We evaluate AMP on popular models and cluster setups from public clouds and show that AMP returns parallel strategies that match the expert-tuned strategies on typical cluster setups. On heterogeneous clusters or models with heterogeneous architectures, AMP finds strategies with 1.54x and 1.77x higher throughput than state-of-the-art model-parallel systems, respectively.
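To make the abstract's idea concrete, the sketch below illustrates the general pattern of cost-model-guided strategy search: enumerate valid (data, tensor, pipeline) parallel degrees for a fixed GPU count, score each candidate with a cost model that accounts for uneven layer sizes and cluster bandwidth, and pick the cheapest. This is a minimal, hypothetical toy, not AMP's actual cost model or search algorithm; all function names and the cost formula are illustrative assumptions.

```python
from itertools import product

def candidate_strategies(num_gpus):
    """Enumerate (dp, tp, pp) degree triples whose product equals the
    total GPU count -- a simple notion of the valid strategy space."""
    degs = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    return [(dp, tp, pp) for dp, tp, pp in product(degs, repeat=3)
            if dp * tp * pp == num_gpus]

def estimate_cost(strategy, layer_flops, gpu_speed, bandwidth):
    """Toy cost model (illustrative only): compute time for layers split
    evenly across pipeline stages, sped up by tensor parallelism, plus
    communication penalties that grow with the tensor- and data-parallel
    degrees and shrink with interconnect bandwidth."""
    dp, tp, pp = strategy
    per_stage = sum(layer_flops) / pp            # balanced stage load
    compute = per_stage / (gpu_speed * tp)       # tensor-parallel speedup
    comm = (tp - 1) * sum(layer_flops) * 1e-3 / bandwidth  # activation comm
    comm += (dp - 1) * 0.01 / bandwidth          # gradient-sync cost
    return compute + comm

def search(num_gpus, layer_flops, gpu_speed, bandwidth):
    """Return the lowest-cost strategy under the toy cost model."""
    return min(candidate_strategies(num_gpus),
               key=lambda s: estimate_cost(s, layer_flops,
                                           gpu_speed, bandwidth))

# Uneven layers (heterogeneous model) on 8 GPUs of one speed/bandwidth.
best = search(num_gpus=8, layer_flops=[4.0, 4.0, 2.0, 2.0],
              gpu_speed=1.0, bandwidth=10.0)
print(best)
```

In a real system, the cost model would be profiled against the actual accelerators and links (the heterogeneity the paper targets), and the search would also decide layer-to-stage assignment rather than assume balanced stages.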
