Paper Title

Contextual-Utterance Training for Automatic Speech Recognition

Authors

Gomez-Alanis, Alejandro, Drude, Lukas, Schwarz, Andreas, Swaminathan, Rupak Vignesh, Wiesler, Simon

Abstract

Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to perform an implicit adaptation to the speaker, topic, and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which can see both past and future contextual utterances, into a student which can only see the current and past contextual utterances. The experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
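The dual-mode "in-place" distillation described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the use of a KL-divergence term, and the way the teacher and student losses are combined are all assumptions; in the actual system the teacher and student share one set of weights and differ only in whether the encoder may attend to future contextual utterances.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q), summed over the vocabulary axis, averaged over frames.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def dual_mode_distillation_loss(student_logits, teacher_logits,
                                rnnt_loss_student, rnnt_loss_teacher,
                                distill_weight=1.0):
    """Hypothetical combined loss for dual-mode contextual-utterance training.

    student_logits: outputs of the streaming pass (past context only).
    teacher_logits: outputs of the full-context pass (past + future utterances),
        treated as a constant target ("in-place": same weights, different
        attention mask, no separate teacher network).
    """
    p_teacher = softmax(teacher_logits)
    q_student = softmax(student_logits)
    distill = kl_divergence(p_teacher, q_student)
    return rnnt_loss_student + rnnt_loss_teacher + distill_weight * distill
```

When the two modes agree exactly, the KL term vanishes and the loss reduces to the sum of the two transducer losses; the distillation term only penalizes the student where its streaming predictions diverge from the full-context teacher.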
