Paper Title

Contextual-Utterance Training for Automatic Speech Recognition

Authors

Gomez-Alanis, Alejandro, Drude, Lukas, Schwarz, Andreas, Swaminathan, Rupak Vignesh, Wiesler, Simon

Abstract

Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to perform an implicit adaptation to the speaker, topic, and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which can see both past and future contextual utterances, into a student which can only see the current and past contextual utterances. The experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
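The dual-mode "in-place" distillation described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the use of a KL-divergence term, and the way the teacher and student losses are combined are all assumptions; in the actual system the teacher and student share one set of weights and differ only in whether the encoder may attend to future contextual utterances.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q), summed over the vocabulary axis, averaged over frames.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def dual_mode_distillation_loss(student_logits, teacher_logits,
                                rnnt_loss_student, rnnt_loss_teacher,
                                distill_weight=1.0):
    """Hypothetical combined loss for dual-mode contextual-utterance training.

    student_logits: outputs of the streaming pass (past context only).
    teacher_logits: outputs of the full-context pass (past + future utterances),
        treated as a constant target ("in-place": same weights, different
        attention mask, no separate teacher network).
    """
    p_teacher = softmax(teacher_logits)
    q_student = softmax(student_logits)
    distill = kl_divergence(p_teacher, q_student)
    return rnnt_loss_student + rnnt_loss_teacher + distill_weight * distill
```

When the two modes agree exactly, the KL term vanishes and the loss reduces to the sum of the two transducer losses; the distillation term only penalizes the student where its streaming predictions diverge from the full-context teacher.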
