Paper Title
Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications
Paper Authors
Paper Abstract
In this paper, we summarize the application of the Transformer and its streamable variant, Emformer, as acoustic models for large-scale speech recognition applications. We compare Transformer-based acoustic models with their LSTM counterparts on industrial-scale tasks. Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium-latency tasks and with LSTM on low-latency tasks. On a low-latency voice assistant task, Emformer achieves 24% to 26% relative word error rate reductions (WERRs). For medium-latency scenarios, compared with an LCBLSTM of similar model size and latency, Emformer achieves significant WERRs across four languages on video captioning datasets, with a 2-3x reduction in inference real-time factor.
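The two metrics quoted in the abstract, relative WERR and inference real-time factor, follow standard definitions. A minimal sketch of how they are computed; the input numbers below are illustrative only and are not results from the paper:

```python
def relative_werr(wer_baseline: float, wer_new: float) -> float:
    """Relative word error rate reduction (WERR), in percent:
    100 * (baseline WER - new WER) / baseline WER."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Inference real-time factor (RTF): time spent decoding divided by
    audio duration. RTF < 1 means faster than real time."""
    return processing_seconds / audio_seconds


# Hypothetical example values (not taken from the paper):
print(relative_werr(8.0, 6.0))        # → 25.0 (a 25% relative WERR)
print(real_time_factor(30.0, 60.0))   # → 0.5
```

So, for instance, a "2-3x real-time factor reduction" means the same audio is decoded in one half to one third of the previous processing time.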