Paper Title

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Paper Authors

Ruifei He, Shuyang Sun, Jihan Yang, Song Bai, Xiaojuan Qi

Paper Abstract

Large-scale pre-training has proven crucial for various computer vision tasks. However, with the growing amount of pre-training data, the growing number of model architectures, and the prevalence of private or inaccessible data, it is neither efficient nor always feasible to pre-train every model architecture on large-scale datasets. In this work, we investigate an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), aiming to efficiently transfer the learned feature representations from existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training, since they typically distill logits that are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature dimension alignment. Notably, our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time. Code is available at https://github.com/CVMI-Lab/KDEP.
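
The key technical point in the abstract is distilling intermediate features rather than logits, with the teacher's feature dimension aligned to the student's without introducing learnable parameters. Below is a minimal PyTorch sketch of that idea. It is not the authors' implementation: the group-wise channel averaging used here as the non-parametric alignment, the feature normalization, and the toy tensor shapes are all illustrative assumptions; see https://github.com/CVMI-Lab/KDEP for the reference code.

```python
# Illustrative sketch of feature-based KD for pre-training (not the authors'
# implementation). Assumptions: the teacher channel count is a multiple of the
# student's, and dimensions are aligned non-parametrically by averaging teacher
# channels in groups. The actual KDEP alignment may differ.
import torch
import torch.nn.functional as F


def align_teacher_channels(f_t: torch.Tensor, c_s: int) -> torch.Tensor:
    """Non-parametric channel alignment: average teacher channels group-wise.

    f_t: teacher feature map of shape (N, C_t, H, W), with C_t divisible by c_s.
    Returns a tensor of shape (N, c_s, H, W) using no learnable parameters.
    """
    n, c_t, h, w = f_t.shape
    assert c_t % c_s == 0, "teacher channels must be a multiple of student channels"
    return f_t.reshape(n, c_s, c_t // c_s, h, w).mean(dim=2)


def feature_kd_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """L2 distance between normalized student features and aligned teacher features."""
    f_t_aligned = align_teacher_channels(f_t, f_s.shape[1]).detach()
    return F.mse_loss(F.normalize(f_s, dim=1), F.normalize(f_t_aligned, dim=1))


if __name__ == "__main__":
    # Toy shapes: a ResNet-50-like teacher (2048 channels) and a smaller student (512).
    f_teacher = torch.randn(4, 2048, 7, 7)
    f_student = torch.randn(4, 512, 7, 7, requires_grad=True)
    loss = feature_kd_loss(f_student, f_teacher)
    loss.backward()
    print(f"feature KD loss: {loss.item():.4f}")
```

In an actual pre-training run, f_student would come from the student backbone being trained and f_teacher from a frozen pre-trained teacher; after distillation, the student backbone (without any distillation head) is fine-tuned on downstream tasks.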
