Title

Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models

Authors

Prakhar Swarup, Debmalya Chakrabarty, Ashtosh Sapru, Hitesh Tulsiani, Harish Arsikere, Sri Garimella

Abstract


Semi-supervised learning (SSL) is an active area of research which aims to utilize unlabelled data in order to improve the accuracy of speech recognition systems. The current study proposes a methodology for integrating two key ideas: 1) SSL using the connectionist temporal classification (CTC) objective and teacher-student based learning; 2) designing effective data-selection mechanisms for leveraging unlabelled data to boost the performance of student models. Our aim is to establish the importance of good criteria for selecting samples from a large pool of unlabelled data based on attributes like confidence measure, speaker, and content variability. The question we try to answer is: is it possible to design a data-selection mechanism which reduces dependence on a large set of randomly selected unlabelled samples without compromising on Word Error Rate (WER)? We perform empirical investigations of different data-selection methods to answer this question and quantify the effect of different sampling strategies. On a semi-supervised ASR setting with 40,000 hours of carefully selected unlabelled data, our CTC-SSL approach gives 17% relative WER improvement over a baseline CTC system trained with labelled data. It also achieves on-par performance with a CTC-SSL system trained on an order-of-magnitude larger set of randomly sampled unlabelled data.
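The data-selection idea described in the abstract can be sketched as a confidence-gated filter over teacher-decoded utterances. This is a minimal illustration, not the paper's implementation: the `Utterance` fields, threshold, and hour budget are assumed for the example, and the paper's criteria also include speaker and content variability, which this sketch omits.

```python
from dataclasses import dataclass

# Hypothetical record for a teacher-decoded unlabelled utterance; field names
# are illustrative, not taken from the paper.
@dataclass
class Utterance:
    utt_id: str
    hypothesis: str    # teacher model's decoded transcript (pseudo-label)
    confidence: float  # utterance-level confidence in [0, 1]
    hours: float       # duration this utterance contributes to training

def select_by_confidence(pool, min_conf=0.9, budget_hours=40000.0):
    """Greedily keep the most confident utterances until the hour budget fills."""
    selected, total = [], 0.0
    for utt in sorted(pool, key=lambda u: u.confidence, reverse=True):
        if utt.confidence < min_conf:
            break  # remaining utterances are all below the cutoff
        if total + utt.hours > budget_hours:
            continue  # skip items that would overshoot the budget
        selected.append(utt)
        total += utt.hours
    return selected
```

The selected pseudo-labelled utterances would then be mixed with labelled data to train the student CTC model; the threshold and budget trade off label quality against coverage.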
