Paper Title
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
Paper Authors
Paper Abstract
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).
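Below is a minimal, illustrative sketch of the "switch" connection idea described in the abstract: an endpointer (EP) head that can consume either raw audio frames (for low-cost frame filtering) or low-level latent representations from the ASR encoder (for higher-quality end-of-query prediction). This is not the paper's implementation; all module names, dimensions, and the single-projection design are assumptions made for clarity.

```python
# Toy sketch (assumptions, not the paper's architecture) of a "switch" connection:
# the EP head runs on either raw frames or low-level ASR encoder latents.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not taken from the paper).
FRAME_DIM = 80    # e.g. log-mel features per frame
LATENT_DIM = 256  # low-level ASR encoder output size
EP_HIDDEN = 64    # EP head hidden size

def linear(x, w, b):
    """Simple affine layer."""
    return x @ w + b

# Randomly initialized toy parameters.
w_enc = rng.normal(scale=0.02, size=(FRAME_DIM, LATENT_DIM))   # ASR encoder (first layers)
b_enc = np.zeros(LATENT_DIM)
w_proj = rng.normal(scale=0.02, size=(FRAME_DIM, LATENT_DIM))  # frame-input projection for EP
b_proj = np.zeros(LATENT_DIM)
w_ep = rng.normal(scale=0.02, size=(LATENT_DIM, EP_HIDDEN))    # shared EP head
b_ep = np.zeros(EP_HIDDEN)
w_out = rng.normal(scale=0.02, size=(EP_HIDDEN, 2))            # speech / non-speech logits
b_out = np.zeros(2)

def asr_encoder_lowlevel(frames):
    """Stand-in for the low-level ASR encoder layers producing latent features."""
    return np.tanh(linear(frames, w_enc, b_enc))

def ep_head(features):
    """Endpointer head shared by both switch paths."""
    h = np.tanh(linear(features, w_ep, b_ep))
    return linear(h, w_out, b_out)

def ep_forward(frames, use_asr_latents):
    """The 'switch': route the EP head's input from raw frames (cheap frame
    filtering) or from ASR encoder latents (EOQ prediction during recognition)."""
    if use_asr_latents:
        features = asr_encoder_lowlevel(frames)
    else:
        features = np.tanh(linear(frames, w_proj, b_proj))
    return ep_head(features)

# Example: 100 frames of audio features, run through both switch settings.
frames = rng.normal(size=(100, FRAME_DIM))
print(ep_forward(frames, use_asr_latents=False).shape)  # (100, 2) -- frame-filtering path
print(ep_forward(frames, use_asr_latents=True).shape)   # (100, 2) -- ASR-latent path
```

In training, the switch setting would be varied so the EP head learns to work from either input; at inference, the cheap frame path can gate audio before the ASR runs, while the latent path reuses ongoing ASR computation for endpoint decisions.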