Paper Title
Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers
Paper Authors
Paper Abstract
With the growth of computing power, neural machine translation (NMT) models also grow accordingly and become better. However, they also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large and accurate teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable enough for deep NMT engines, so we propose a novel alternative. In our model, besides matching T and S predictions, we have a combinatorial mechanism to inject layer-level supervision from T to S. In this paper, we target low-resource settings and evaluate our translation engines for the Portuguese--English, Turkish--English, and English--German directions. Students trained using our technique have 50% fewer parameters and can still deliver comparable results to those of 12-layer teachers.
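The combinatorial idea in the abstract can be illustrated with a short sketch: rather than mapping each student layer to a single teacher layer and skipping the rest, groups of teacher layers are combined and each student layer is trained to match the combined representation, alongside the usual prediction-level KD loss. The PyTorch code below is a minimal illustration under stated assumptions, not the paper's implementation; the mean-based combination, the projection `proj`, the loss weight `alpha`, and all tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_logit_loss(student_logits, teacher_logits, temperature=2.0):
    """Prediction-level KD: match softened teacher and student output distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def combine_and_match(student_hiddens, teacher_hiddens, proj):
    """
    Layer-level supervision by combination rather than skipping: each student layer
    is matched against a combination (here, the mean) of a contiguous group of
    teacher layers, so no teacher layer is discarded. `proj` maps the student
    hidden size to the teacher hidden size when they differ.
    """
    group = len(teacher_hiddens) // len(student_hiddens)  # e.g. 12 // 6 = 2
    loss = 0.0
    for i, s_h in enumerate(student_hiddens):
        t_group = teacher_hiddens[i * group:(i + 1) * group]
        t_combined = torch.stack(t_group, dim=0).mean(dim=0)  # combine, don't skip
        loss = loss + F.mse_loss(proj(s_h), t_combined)
    return loss / len(student_hiddens)

# Hypothetical setup: 12-layer teacher, 6-layer student, batch=8, seq_len=16,
# hidden size 512 for both networks, vocabulary of 32,000 tokens.
student_hiddens = [torch.randn(8, 16, 512) for _ in range(6)]
teacher_hiddens = [torch.randn(8, 16, 512) for _ in range(12)]
student_logits = torch.randn(8, 16, 32000)
teacher_logits = torch.randn(8, 16, 32000)

proj = nn.Linear(512, 512)  # same size here; needed when the dimensions differ
alpha = 0.5  # hypothetical weight balancing the two distillation terms
total_kd = alpha * kd_logit_loss(student_logits, teacher_logits) \
         + (1 - alpha) * combine_and_match(student_hiddens, teacher_hiddens, proj)
```

In practice this combined loss would be added to the student's usual translation (cross-entropy) loss during training; averaging is just one possible way to merge a group of teacher layers, and other combination operators such as concatenation followed by a projection would fit the same scheme.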