Paper Title

Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition

Paper Authors

Xulong Zhang, Jianzong Wang, Ning Cheng, Mengyuan Zhao, Zhiyong Zhang, Jing Xiao

Paper Abstract

The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR). The improvement largely lies in the decoder's modeling of linguistic information. The decoder, jointly optimized with an acoustic encoder, learns a language model from ground-truth sequences in an auto-regressive manner during training. However, the training corpus of the decoder is limited to the speech transcriptions, which is far smaller than the corpus needed to train an acceptable language model. This leads to poor robustness of the decoder. To alleviate this problem, we propose the linguistic-enhanced transformer, which introduces refined CTC information to the decoder during training so that the decoder can be more robust. Our experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is relatively reduced by up to 7%. We also find that in the joint CTC-Attention ASR model, the decoder is more sensitive to linguistic information than to acoustic information.
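For context, the joint CTC-Attention objective mentioned in the abstract is typically an interpolation of a CTC loss on the encoder outputs and a cross-entropy loss on the auto-regressive decoder trained with teacher forcing. Below is a minimal sketch of such a training loss in PyTorch; the module names (encoder, decoder, ctc_head), the teacher-forcing construction, and the interpolation weight are illustrative assumptions rather than the authors' released implementation, and a comment marks where the proposed "refined CTC information" would enter the decoder.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(encoder, decoder, ctc_head,
                             feats, enc_lens, ys_pad, ys_lens,
                             sos_id, eos_id, pad_id,
                             blank_id=0, ctc_weight=0.3):
    """Interpolated loss: ctc_weight * CTC + (1 - ctc_weight) * attention CE.

    feats:    (B, T, F) acoustic features
    enc_lens: (B,) valid encoder-output lengths (assumed no extra subsampling)
    ys_pad:   (B, L) transcripts padded with pad_id
    ys_lens:  (B,) transcript lengths
    """
    B = ys_pad.size(0)
    enc_out = encoder(feats)                                  # (B, T, D)

    # CTC branch: alignment-free loss over encoder frames.
    ctc_log_probs = ctc_head(enc_out).log_softmax(-1)         # (B, T, V)
    targets = torch.cat([ys_pad[b, :ys_lens[b]] for b in range(B)])
    ctc_loss = F.ctc_loss(ctc_log_probs.transpose(0, 1),      # (T, B, V)
                          targets, enc_lens, ys_lens,
                          blank=blank_id, zero_infinity=True)

    # Attention branch: auto-regressive decoder with teacher forcing.
    # Pad positions in the decoder input are mapped to <eos> so that the
    # embedding lookup stays valid.
    sos = ys_pad.new_full((B, 1), sos_id)
    ys_in = torch.cat([sos, ys_pad.masked_fill(ys_pad == pad_id, eos_id)], 1)
    ys_out = torch.cat([ys_pad, ys_pad.new_full((B, 1), pad_id)], 1)
    ys_out[torch.arange(B), ys_lens] = eos_id                 # append <eos>

    # The linguistic-enhanced variant described in the abstract would replace
    # or mix ys_in with refined CTC hypotheses here, so the decoder is not
    # trained on ground-truth prefixes alone (details are an assumption of
    # this sketch, not the paper's exact procedure).
    dec_logits = decoder(ys_in, enc_out)                      # (B, L+1, V)
    att_loss = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                               ys_out.reshape(-1), ignore_index=pad_id)

    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```

In setups like this, the decoder normally only ever sees ground-truth prefixes during training, which is exactly the exposure and robustness problem the abstract targets by feeding CTC-derived linguistic inputs to the decoder instead.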
