Paper Title
Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
Paper Authors
Paper Abstract
DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales. This paper aims to significantly reduce training costs through effective use of preemptible instances, i.e., instances that can be obtained at a much cheaper price while idle, but may be preempted whenever requested by priority users. Doing so, however, requires new forms of resiliency and efficiency to cope with the possibility of frequent preemptions, a failure model that is drastically different from the occasional failures in normal cluster settings that existing checkpointing techniques target. We present Bamboo, a distributed system that tackles these challenges by introducing redundant computations into the training pipeline: each node performs computations not only over its own layers but also over some layers of its neighbor. Our key insight is that training large models often requires pipeline parallelism, in which "pipeline bubbles" naturally exist. Bamboo carefully fills redundant computations into these bubbles, providing resilience at a low cost. Across a variety of widely used DNN models, Bamboo outperforms traditional checkpointing by 3.7x in training throughput, and reduces costs by 2.4x compared to a setting where on-demand instances are used.
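To make the mechanism in the abstract concrete, here is a minimal toy sketch of the idea of redundant computation scheduled into pipeline bubbles. This is not Bamboo's actual implementation: the class and method names (Stage, fill_bubble, recover) are hypothetical, and the pipeline "layers" are stand-in pure functions rather than real model partitions.

```python
# Toy sketch (not Bamboo's code): in pipeline parallelism, a node also holds
# a replica of its successor's layers and uses otherwise-idle "bubble" slots
# to run the successor's forward pass redundantly. If the successor is
# preempted, its output is already cached locally, so training can continue
# without restarting from a checkpoint.

from dataclasses import dataclass, field


@dataclass
class Stage:
    """One pipeline stage: a node's own layers plus a replica of its successor's."""
    own_layers: callable                  # forward fn for this node's own layers
    successor_layers: callable = None     # replica of the neighbor's layers
    shadow_outputs: dict = field(default_factory=dict)  # micro-batch id -> redundant result

    def forward(self, mb_id, x):
        # Normal forward pass over this node's own layers.
        return self.own_layers(x)

    def fill_bubble(self, mb_id, y):
        # Redundant computation: during a pipeline bubble, also run the
        # successor's layers on our own output and cache the result.
        if self.successor_layers is not None:
            self.shadow_outputs[mb_id] = self.successor_layers(y)

    def recover(self, mb_id):
        # On a preemption of the successor, return its output for this
        # micro-batch from the local cache instead of recomputing.
        return self.shadow_outputs.get(mb_id)


# Toy usage: two "layer partitions" represented as pure functions.
layer_a = lambda x: x + 1   # node 0's own layers
layer_b = lambda x: x * 2   # node 1's layers, replicated on node 0

node0 = Stage(own_layers=layer_a, successor_layers=layer_b)

y = node0.forward(mb_id=0, x=3)   # normal forward pass -> 4
node0.fill_bubble(mb_id=0, y=y)   # redundant forward in a bubble slot -> caches 8

# Suppose node 1 is preempted before producing its output for micro-batch 0:
print(node0.recover(mb_id=0))     # -> 8, recovered without re-running the pipeline
```

The point of the sketch is the cost model: the redundant forward pass occupies time the node would have spent idle in a bubble anyway, which is why the abstract can claim resilience at low cost.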