Paper Title

Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation

Paper Authors

Cody Blakeney, Jessica Zosa Forde, Jonathan Frankle, Ziliang Zong, Matthew L. Leavitt

Paper Abstract

Methods for improving the efficiency of deep network training (i.e., the resources required to achieve a given level of model quality) are of immediate benefit to deep learning practitioners. Distillation is typically used to compress models or improve model quality, but it is unclear whether distillation actually improves training efficiency. Can the quality improvements of distillation be converted into training speed-ups, or do they simply increase final model quality with no resource savings? We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training, using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE, on common enterprise hardware (8x NVIDIA A100). We found that distillation can speed up training by up to 1.96x for ResNet-50 trained on ImageNet and up to 1.42x for BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal results when it is performed for only the first 20-50% of training. We also observed that training with distillation is almost always more efficient than training without distillation, even when using the poorest-quality model as a teacher, for both ResNet-50 and BERT. Finally, we found that it is possible to gain the benefit of distilling from an ensemble of teacher models, which has an O(n) runtime cost, by randomly sampling a single teacher from the pool of teacher models at each step, which has only an O(1) runtime cost. Taken together, these results show that distillation can substantially improve training efficiency in both image classification and language modeling, and that a few simple optimizations to distillation protocols can further enhance these efficiency improvements.
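The abstract highlights two protocol choices: distilling only during an initial fraction of training, and sampling a single teacher per step from a pool rather than querying the full ensemble. Below is a minimal PyTorch-style sketch of how these could fit into a training step, assuming the standard softened-softmax distillation loss; the temperature, loss weighting, 40% cutoff, and all function names are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: hyperparameters and structure are assumptions,
# not the paper's actual training recipe.
import random
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Standard KD objective: cross-entropy on hard labels combined with
    KL divergence to the teacher's temperature-softened distribution."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft

def training_step(student, teachers, batch, labels, step, total_steps,
                  distill_fraction=0.4):
    """One step of the sketched protocol: distill only for the first
    `distill_fraction` of training, and on each distilled step sample a
    single teacher from the pool (O(1) per step) instead of running the
    whole ensemble (O(n) per step)."""
    student_logits = student(batch)
    if step < distill_fraction * total_steps:
        teacher = random.choice(teachers)  # one teacher forward pass per step
        with torch.no_grad():
            teacher_logits = teacher(batch)
        return distillation_loss(student_logits, teacher_logits, labels)
    # After the distillation phase, fall back to plain supervised training.
    return F.cross_entropy(student_logits, labels)
```

The point of the per-step sampling is that each training step pays for a single teacher forward pass regardless of pool size, which is where the O(1) versus O(n) runtime difference described in the abstract comes from.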
