Paper Title
A Study of Checkpointing in Large Scale Training of Deep Neural Networks
Paper Authors
Paper Abstract
Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
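To illustrate the checkpoint-restart technique the abstract refers to, below is a minimal sketch using PyTorch, one of the evaluated frameworks. The checkpoint file name, toy model, and saved fields are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal checkpoint-restart sketch in PyTorch (illustrative only).
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical checkpoint file

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Restart: resume from the last checkpoint if one exists.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 5):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).sum()
    loss.backward()
    optimizer.step()

    # Checkpoint: persist model and optimizer state after each epoch.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```

In this pattern, the time spent serializing and writing the state dictionaries is the checkpointing cost the paper measures, and the saved file's format and size are framework-dependent, which is one of the differences the evaluation examines.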