Paper Title
JOIST: A Joint Speech and Text Streaming Model For ASR
Paper Authors
Paper Abstract
We present JOIST, an algorithm for training a streaming, cascaded-encoder end-to-end (E2E) model on both paired speech-text inputs and unpaired text-only inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which is also novel compared to previous work. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user-level experience.
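The abstract describes two ingredients: a joint objective over paired and text-only data, and a way to model the length of unpaired text so it resembles a speech-frame sequence. A minimal sketch of these two ideas follows; the function names, the token-repetition length model, and the loss weight are illustrative assumptions, not the paper's actual implementation.

```python
import random


def upsample_text(tokens, repeat_min=1, repeat_max=3, seed=0):
    """Toy length model for unpaired text (an assumption): repeat each
    sub-word token a random number of times so the text sequence length
    is closer to a typical speech-frame sequence length."""
    rng = random.Random(seed)
    upsampled = []
    for tok in tokens:
        upsampled.extend([tok] * rng.randint(repeat_min, repeat_max))
    return upsampled


def joint_loss(paired_loss, text_only_loss, text_weight=0.25):
    """Joint training objective (sketch): a weighted sum of the ASR loss
    on paired speech-text batches and the loss on text-only batches.
    The weight 0.25 is a placeholder, not a value from the paper."""
    return paired_loss + text_weight * text_only_loss


# Example: upsampled text is at least as long as the original token
# sequence, and both losses contribute to the joint objective.
tokens = ["_jo", "ist", "_mo", "del"]
longer = upsample_text(tokens)
total = joint_loss(paired_loss=1.0, text_only_loss=2.0)
```

In joint training, paired and text-only batches would be interleaved within each training step, whereas the pre-train/fine-tune alternative the abstract contrasts against would consume them in separate phases.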