Paper Title


Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

Paper Authors

Yuting Yang, Binbin Du, Yuke Li

Paper Abstract


The choice of modeling units is crucial for automatic speech recognition (ASR) tasks. In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation. Thus, considering only the written form of Chinese characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method involving multi-level modeling units, which integrates multi-level information for Mandarin speech recognition. Specifically, the encoder block treats syllables as modeling units, while the decoder block deals with character-level modeling units. To facilitate the incremental conversion from syllable features to character features, we design an auxiliary task that applies a cross-entropy (CE) loss to intermediate decoder layers. During inference, the input feature sequences are converted into syllable sequences by the encoder block and then into Chinese characters by the decoder block. Experiments on the widely used AISHELL-1 corpus demonstrate that our method achieves promising results, with CERs of 4.1%/4.6% and 4.6%/5.2% using the Conformer and Transformer backbones, respectively.
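The auxiliary task above attaches a CE loss to intermediate decoder layers in addition to the final layer. A minimal sketch of how such a combined loss could be computed is shown below; the weighting factor `lam`, the function names, and the single-intermediate-layer simplification are illustrative assumptions, not details from the paper.

```python
import math


def cross_entropy(logits, target):
    # Softmax cross-entropy for one output position (numerically stabilized).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def multi_level_loss(final_logits, inter_logits, char_targets, lam=0.3):
    # Combine the main CE loss on the final decoder layer with an
    # auxiliary CE loss on an intermediate decoder layer.
    # `lam` is a hypothetical interpolation weight for the auxiliary term.
    main = sum(cross_entropy(l, t)
               for l, t in zip(final_logits, char_targets)) / len(char_targets)
    aux = sum(cross_entropy(l, t)
              for l, t in zip(inter_logits, char_targets)) / len(char_targets)
    return (1.0 - lam) * main + lam * aux
```

In practice each decoder layer chosen for supervision would get its own character-level projection and CE term; the sketch collapses this to one intermediate layer for clarity.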
