Paper Title

Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?

Paper Authors

Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, Ngai-Man Cheung

Paper Abstract

This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this question take dichotomous standpoints: Muller et al. (2019) and Shen et al. (2021b). Critically, there has been no effort to understand and resolve these contradictory findings, leaving the primal question -- to smooth or not to smooth a teacher network? -- unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept, which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies, including image classification, neural machine translation and compact student distillation tasks spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/
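
For readers unfamiliar with the two ingredients discussed in the abstract: label smoothing trains the teacher against targets mixed with a uniform distribution, and knowledge distillation matches the student to the teacher's temperature-softened outputs. The PyTorch sketch below is only a minimal illustration of these standard losses; the function names and hyperparameter values (alpha, T, lam) are illustrative assumptions, not the authors' released implementation. Following the paper's recommendation, an LS-trained teacher would be paired with a low transfer temperature such as T = 1.

    # Minimal sketch (assumed, not the authors' code): label smoothing for the
    # teacher and temperature-scaled knowledge distillation for the student.
    import torch
    import torch.nn.functional as F

    def label_smoothing_ce(logits, targets, alpha=0.1):
        """Cross-entropy against targets mixed with the uniform distribution."""
        n_classes = logits.size(-1)
        log_probs = F.log_softmax(logits, dim=-1)
        smooth = torch.full_like(log_probs, alpha / n_classes)
        smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - alpha + alpha / n_classes)
        return -(smooth * log_probs).sum(dim=-1).mean()

    def kd_loss(student_logits, teacher_logits, targets, T=1.0, lam=0.5):
        """Hard-label CE plus KL to the teacher's temperature-softened outputs.

        A low T (e.g. 1) follows the paper's suggestion for LS-trained teachers.
        """
        ce = F.cross_entropy(student_logits, targets)
        kl = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # standard T^2 scaling keeps gradient magnitudes comparable
        return (1.0 - lam) * ce + lam * kl

    # Toy usage with random logits
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    targets = torch.randint(0, 10, (8,))
    print(label_smoothing_ce(teacher_logits, targets).item())
    print(kd_loss(student_logits, teacher_logits, targets, T=1.0).item())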
