Paper Title
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
Paper Authors
Paper Abstract
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).
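Below is a minimal, illustrative sketch of the "switch" connection idea described in the abstract: an endpointer (EP) head that can consume either raw audio frames (for low-cost frame filtering) or low-level latent representations from the ASR encoder (for higher-quality end-of-query prediction). This is not the paper's implementation; all module names, dimensions, and the single-projection design are assumptions made for clarity.

```python
# Toy sketch (assumptions, not the paper's architecture) of a "switch" connection:
# the EP head runs on either raw frames or low-level ASR encoder latents.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not taken from the paper).
FRAME_DIM = 80    # e.g. log-mel features per frame
LATENT_DIM = 256  # low-level ASR encoder output size
EP_HIDDEN = 64    # EP head hidden size

def linear(x, w, b):
    """Simple affine layer."""
    return x @ w + b

# Randomly initialized toy parameters.
w_enc = rng.normal(scale=0.02, size=(FRAME_DIM, LATENT_DIM))   # ASR encoder (first layers)
b_enc = np.zeros(LATENT_DIM)
w_proj = rng.normal(scale=0.02, size=(FRAME_DIM, LATENT_DIM))  # frame-input projection for EP
b_proj = np.zeros(LATENT_DIM)
w_ep = rng.normal(scale=0.02, size=(LATENT_DIM, EP_HIDDEN))    # shared EP head
b_ep = np.zeros(EP_HIDDEN)
w_out = rng.normal(scale=0.02, size=(EP_HIDDEN, 2))            # speech / non-speech logits
b_out = np.zeros(2)

def asr_encoder_lowlevel(frames):
    """Stand-in for the low-level ASR encoder layers producing latent features."""
    return np.tanh(linear(frames, w_enc, b_enc))

def ep_head(features):
    """Endpointer head shared by both switch paths."""
    h = np.tanh(linear(features, w_ep, b_ep))
    return linear(h, w_out, b_out)

def ep_forward(frames, use_asr_latents):
    """The 'switch': route the EP head's input from raw frames (cheap frame
    filtering) or from ASR encoder latents (EOQ prediction during recognition)."""
    if use_asr_latents:
        features = asr_encoder_lowlevel(frames)
    else:
        features = np.tanh(linear(frames, w_proj, b_proj))
    return ep_head(features)

# Example: 100 frames of audio features, run through both switch settings.
frames = rng.normal(size=(100, FRAME_DIM))
print(ep_forward(frames, use_asr_latents=False).shape)  # (100, 2) -- frame-filtering path
print(ep_forward(frames, use_asr_latents=True).shape)   # (100, 2) -- ASR-latent path
```

In training, the switch setting would be varied so the EP head learns to work from either input; at inference, the cheap frame path can gate audio before the ASR runs, while the latent path reuses ongoing ASR computation for endpoint decisions.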