Paper Title

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Paper Authors

Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan

Paper Abstract

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
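To make the relationship among these methods concrete, the following is a minimal sketch of the decoding scores as they are commonly written in the LM-integration literature. The interpolation weights $\lambda_1, \lambda_2$ and the distribution names are assumed notation for illustration, not taken verbatim from the paper.

```latex
% Shallow fusion (SF): add the external LM (ELM) score directly.
\mathrm{score}_{\mathrm{SF}}(y \mid x) =
  \log P_{\mathrm{RNN\text{-}T}}(y \mid x) + \lambda_1 \log P_{\mathrm{ELM}}(y)

% Density ratio (DR): additionally subtract a well-trained source-domain
% neural LM as an estimate of the internal LM (ILM).
\mathrm{score}_{\mathrm{DR}}(y \mid x) =
  \log P_{\mathrm{RNN\text{-}T}}(y \mid x) + \lambda_1 \log P_{\mathrm{ELM}}(y)
  - \lambda_2 \log P_{\mathrm{NNLM}}(y)

% ILME: subtract an ILM estimated directly from the RNN-T model itself.
\mathrm{score}_{\mathrm{ILME}}(y \mid x) =
  \log P_{\mathrm{RNN\text{-}T}}(y \mid x) + \lambda_1 \log P_{\mathrm{ELM}}(y)
  - \lambda_2 \log P_{\mathrm{ILM}}(y)

% LODR (this paper): replace the subtracted LM with a low-order weak LM
% (e.g., a bi-gram trained on the transcripts), reflecting the observation
% that the RNN-T only learns low-order LM information.
\mathrm{score}_{\mathrm{LODR}}(y \mid x) =
  \log P_{\mathrm{RNN\text{-}T}}(y \mid x) + \lambda_1 \log P_{\mathrm{ELM}}(y)
  - \lambda_2 \log P_{\mathrm{LO}}(y)
```

Under this view, the four methods differ only in which prior is subtracted from the fused score: none (SF), a full-context neural LM (DR), an estimate extracted from the RNN-T (ILME), or a deliberately weak low-order LM (LODR).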
