Paper Title
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
Paper Authors
Paper Abstract
We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of the connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually \textit{autoregressive}: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can generate tokens simultaneously within a constant number of iterations, which results in a significant reduction in inference time and better suits end-to-end ASR models for real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs, and low-confidence tokens are masked based on the CTC probabilities. Exploiting the conditional dependence between output tokens, these masked low-confidence tokens are then predicted by conditioning on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% -> 12.1% WER on WSJ) and approaches the autoregressive model, while requiring much less inference time on CPUs (0.07 RTF in a Python implementation). All of our code will be publicly available.
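To make the described inference procedure concrete, below is a minimal sketch in Python/PyTorch, assuming a greedy-CTC front end and a conditional masked-LM style decoder exposed as a callable `decoder(tokens)` that returns per-position logits. The names `blank_id`, `mask_id`, the confidence threshold, and the single-pass mask filling are illustrative assumptions, not the paper's exact implementation (which, for example, can refill masked positions over several iterations).

```python
import torch


def mask_ctc_decode(ctc_log_probs, decoder, blank_id=0, mask_id=1,
                    threshold=0.999, num_iterations=1):
    """Sketch of Mask CTC inference (assumed interfaces, simplified confidence).

    ctc_log_probs: (T, V) frame-level CTC log-probabilities from the encoder.
    decoder:       callable taking (1, L) token ids and returning (1, L, V) logits.
    """
    # 1) Greedy CTC decoding: take the best token per frame, then collapse
    #    repeated tokens and remove blanks.
    frame_best = ctc_log_probs.argmax(dim=-1)            # (T,)
    frame_conf = ctc_log_probs.max(dim=-1).values.exp()  # per-frame confidence

    tokens, confs = [], []
    prev = None
    for t, tok in enumerate(frame_best.tolist()):
        if tok != blank_id and tok != prev:
            tokens.append(tok)
            # Simplified stand-in for the token-level CTC posterior:
            # use the probability of the first frame emitting this token.
            confs.append(frame_conf[t].item())
        prev = tok

    y = torch.tensor(tokens, dtype=torch.long)
    c = torch.tensor(confs)

    # 2) Mask low-confidence tokens based on the CTC probabilities.
    low_conf = c < threshold
    y_masked = y.clone()
    y_masked[low_conf] = mask_id

    # 3) Predict the masked tokens conditioned on the high-confidence tokens.
    for _ in range(num_iterations):
        logits = decoder(y_masked.unsqueeze(0))          # (1, L, V), assumed call
        pred = logits.argmax(dim=-1).squeeze(0)          # (L,)
        y_masked[low_conf] = pred[low_conf]

    return y_masked
```

Because the number of decoder passes is a small constant rather than the output length, this refinement loop is what gives the non-autoregressive speedup reported in the abstract.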