Paper title
Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech
Paper authors
Paper abstract
Streaming recognition and segmentation of multi-party conversations with overlapping speech is crucial for the next generation of voice assistant applications. In this work, we address the challenges identified in previous work on the multi-turn recurrent neural network transducer (MT-RNN-T) with a novel approach, separator-transducer-segmenter (STS), that enables tighter integration of speech separation, recognition, and segmentation in a single model. First, we propose a new segmentation modeling strategy based on start-of-turn and end-of-turn tokens that improves segmentation without degrading recognition accuracy. Second, we further improve both speech recognition and segmentation accuracy through an emission regularization method, FastEmit, and multi-task training with speech activity information as an additional training signal. Third, we experiment with an end-of-turn emission latency penalty to improve end-point detection for each speaker turn. Finally, we establish a novel framework for segmentation analysis of multi-party conversations through emission latency metrics. With our best model, we report a 4.6% absolute improvement in turn counting accuracy and a 17% relative word error rate (WER) improvement on the LibriCSS dataset compared to previously published work.
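To make the idea of turn tokens and emission latency more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single output channel that emits time-stamped tokens; the token strings `<sot>` and `<eot>`, the `Emission` record, and both helper functions are hypothetical names introduced here for illustration. It groups emitted tokens into turns and measures how late each end-of-turn token arrives relative to a reference end time.

```python
from dataclasses import dataclass

SOT = "<sot>"  # hypothetical start-of-turn token (name is an assumption)
EOT = "<eot>"  # hypothetical end-of-turn token (name is an assumption)


@dataclass
class Emission:
    token: str
    time: float  # time in seconds at which the token was emitted


def split_turns(emissions):
    """Group a stream of emissions into turns delimited by SOT and EOT."""
    turns = []
    current = None
    for e in emissions:
        if e.token == SOT:
            # Open a new turn; an unterminated previous turn is discarded.
            current = {"start": e.time, "words": [], "end": None}
        elif e.token == EOT:
            if current is not None:
                current["end"] = e.time
                turns.append(current)
                current = None
        elif current is not None:
            current["words"].append(e.token)
    return turns


def eot_latencies(turns, reference_ends):
    """Latency of each end-of-turn emission relative to a reference end time."""
    return [t["end"] - ref for t, ref in zip(turns, reference_ends)]


# Illustrative stream with made-up timestamps.
stream = [
    Emission(SOT, 0.1), Emission("hello", 0.4), Emission("world", 0.8),
    Emission(EOT, 1.3),
    Emission(SOT, 1.5), Emission("hi", 1.9), Emission(EOT, 2.4),
]
turns = split_turns(stream)
print(len(turns))  # number of detected turns
print(eot_latencies(turns, reference_ends=[1.0, 2.2]))
```

In this sketch, the number of recovered turns can be compared against the reference turn count, and the latency list gives a per-turn view of end-point detection delay. How the paper itself defines these metrics is not specified in the abstract.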