Paper Title

Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

Paper Authors

Alina Kolesnikova, Yuri Kuratov, Vasily Konovalov, Mikhail Burtsev

Paper Abstract

Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative option is to reduce the number of tokens in the vocabulary, and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models. As a result, KL-based knowledge distillation cannot be applied directly. We propose two simple yet effective alignment techniques that enable knowledge distillation to students with a reduced vocabulary. Evaluation of the distilled models on a number of common benchmarks for Russian, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrated that our techniques achieve compression from $17\times$ to $49\times$ while maintaining the quality of a $1.7\times$ compressed student with the full-sized vocabulary but only a reduced number of Transformer layers. We make our code and distilled models available.
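
To illustrate the output-side mismatch described in the abstract, the following is a minimal PyTorch sketch of a KL-based distillation loss in which the teacher's predicted distribution is restricted to the student's reduced vocabulary before comparison. This is not the paper's exact alignment technique: the function name, the `student_to_teacher_ids` mapping, and the `temperature` default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_kl_loss(teacher_logits: torch.Tensor,
                         student_logits: torch.Tensor,
                         student_to_teacher_ids: torch.Tensor,
                         temperature: float = 2.0) -> torch.Tensor:
    """KL-divergence distillation loss over the student's reduced vocabulary.

    teacher_logits: (batch, teacher_vocab_size)
    student_logits: (batch, student_vocab_size)
    student_to_teacher_ids: (student_vocab_size,) long tensor mapping each
        student token id to the matching teacher token id (assumed to exist).
    """
    # Keep only the teacher logits for tokens present in the student vocabulary,
    # so both distributions are defined over the same support after softmax.
    teacher_subset = teacher_logits[:, student_to_teacher_ids]
    teacher_probs = F.softmax(teacher_subset / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2, as is conventional in temperature-based distillation.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Such a direct mapping only covers student tokens that also exist in the teacher's vocabulary; tokens segmented differently by the two tokenizers lead to the input-sequence mismatch mentioned in the abstract, which this sketch does not handle.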
