Paper Title
JOIST: A Joint Speech and Text Streaming Model For ASR
Paper Authors
Paper Abstract
We present JOIST, an algorithm for training a streaming, cascaded-encoder end-to-end (E2E) model on both paired speech-text inputs and unpaired text-only inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which is also novel compared to previous work. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user-level experience.
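The abstract describes two ingredients: a joint objective over paired and text-only data, and a way to model the length of unpaired text so it resembles a speech-frame sequence. A minimal sketch of these two ideas follows; the function names, the token-repetition length model, and the loss weight are illustrative assumptions, not the paper's actual implementation.

```python
import random


def upsample_text(tokens, repeat_min=1, repeat_max=3, seed=0):
    """Toy length model for unpaired text (an assumption): repeat each
    sub-word token a random number of times so the text sequence length
    is closer to a typical speech-frame sequence length."""
    rng = random.Random(seed)
    upsampled = []
    for tok in tokens:
        upsampled.extend([tok] * rng.randint(repeat_min, repeat_max))
    return upsampled


def joint_loss(paired_loss, text_only_loss, text_weight=0.25):
    """Joint training objective (sketch): a weighted sum of the ASR loss
    on paired speech-text batches and the loss on text-only batches.
    The weight 0.25 is a placeholder, not a value from the paper."""
    return paired_loss + text_weight * text_only_loss


# Example: upsampled text is at least as long as the original token
# sequence, and both losses contribute to the joint objective.
tokens = ["_jo", "ist", "_mo", "del"]
longer = upsample_text(tokens)
total = joint_loss(paired_loss=1.0, text_only_loss=2.0)
```

In joint training, paired and text-only batches would be interleaved within each training step, whereas the pre-train/fine-tune alternative the abstract contrasts against would consume them in separate phases.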