Paper Title

Contextual RNN-T For Open Domain ASR

Authors

Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

Abstract

End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - into a single neural network. While this has some nice advantages, it limits the system to being trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.
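The abstract does not specify the model architecture, but the general mechanism it names - an attention model over contextual metadata, with the resulting context vector biasing the recognizer - can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; all dimension sizes, weight matrices, and the fusion point (predictor output before the joint network) are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_attention(dec_state, meta_embs, W_q, W_k):
    """Attend over metadata word embeddings, using the predictor state as query.

    dec_state: (d_dec,)        RNN-T predictor (decoder) state
    meta_embs: (n_meta, d_emb) embeddings of metadata tokens (e.g. video title words)
    Returns a context vector (weighted sum of metadata embeddings) and the weights.
    """
    q = dec_state @ W_q                   # project query:  (d_att,)
    k = meta_embs @ W_k                   # project keys:   (n_meta, d_att)
    scores = k @ q / np.sqrt(len(q))      # scaled dot-product scores: (n_meta,)
    attn = softmax(scores)                # attention weights over metadata tokens
    context = attn @ meta_embs            # (d_emb,) context summary of the metadata
    return context, attn

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
d_dec, d_emb, d_att, n_meta = 8, 6, 4, 5
dec_state = rng.normal(size=d_dec)
meta_embs = rng.normal(size=(n_meta, d_emb))
W_q = rng.normal(size=(d_dec, d_att))
W_k = rng.normal(size=(d_emb, d_att))

context, attn = contextual_attention(dec_state, meta_embs, W_q, W_k)
# In a full model, `context` would be combined with the predictor output
# (e.g. concatenated) before the joint network, biasing predictions toward
# entity names that appear in the video's metadata.
```

The design intuition is that when the audio is ambiguous on a rare entity name, attention weight concentrates on the matching metadata token, nudging the joint network toward that spelling.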
