Paper Title
The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge
Paper Authors
Paper Abstract
This paper describes our submission to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge. For Track 1, we propose several approaches to empower the clustering-based speaker diarization system to handle overlapped speech. Front-end dereverberation and direction-of-arrival (DOA) estimation are used to improve the accuracy of speaker diarization. Multi-channel combination and overlap detection are applied to reduce the missed speaker error. A modified DOVER-Lap is also proposed to fuse the results of different systems. We achieve a final DER of 5.79% on the Eval set and 7.23% on the Test set. For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture. Serialized output training is adopted for multi-speaker overlapped speech recognition. We propose a neural front-end module to model multi-channel audio and train the model end-to-end. Various data augmentation methods are utilized to mitigate over-fitting in the multi-channel multi-speaker E2E system. Transformer language model fusion is applied to achieve better performance. The final CER is 19.2% on the Eval set and 20.8% on the Test set.
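As a point of reference for the clustering-based diarization in Track 1, the minimal sketch below clusters pre-computed speaker embeddings with spectral clustering on a cosine-similarity affinity matrix. It is an illustration under assumptions only: the embedding extractor, segment granularity, and known speaker count are assumed, and the dereverberation, DOA estimation, multi-channel combination, and overlap-detection stages of the actual system are not shown.

# Minimal sketch: cluster pre-computed speaker embeddings (e.g. x-vectors)
# into speakers using spectral clustering on a cosine-similarity affinity.
# The real pipeline also handles dereverberation, DOA, channel combination,
# and overlapped speech, none of which appears here.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """Return one speaker label per sub-segment embedding.

    embeddings: (num_segments, embed_dim) array of speaker embeddings.
    n_speakers: assumed known here; a real system would estimate it.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, None)  # non-negative cosine similarity
    clusterer = SpectralClustering(n_clusters=n_speakers, affinity="precomputed")
    return clusterer.fit_predict(affinity)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(200, 256))  # stand-in for real embeddings
    print(cluster_segments(fake_embeddings, n_speakers=4)[:20])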
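For the joint CTC-attention architecture mentioned for Track 2, training commonly optimizes a weighted sum of the CTC and attention-decoder losses, L = lambda * L_CTC + (1 - lambda) * L_att. The PyTorch sketch below shows only that combination on placeholder tensors; the 0.3 weight, tensor shapes, and vocabulary size are illustrative assumptions, not the paper's configuration.

# Minimal sketch of a joint CTC-attention training objective with random
# placeholder tensors standing in for Conformer encoder/decoder outputs.
import torch
import torch.nn as nn

ctc_weight = 0.3                      # assumed interpolation weight
vocab_size, blank_id, pad_id = 5000, 0, -100

ctc_loss_fn = nn.CTCLoss(blank=blank_id, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)

# Fake outputs: encoder (time, batch, vocab) log-probs, decoder (batch, len, vocab) logits.
T, B, U = 120, 4, 20
enc_log_probs = torch.randn(T, B, vocab_size).log_softmax(dim=-1)
dec_logits = torch.randn(B, U, vocab_size)

targets = torch.randint(1, vocab_size, (B, U))        # non-blank token ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

loss_ctc = ctc_loss_fn(enc_log_probs, targets, input_lengths, target_lengths)
loss_att = att_loss_fn(dec_logits.reshape(-1, vocab_size), targets.reshape(-1))
loss = ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att
print(float(loss))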
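Serialized output training for overlapped speech concatenates the references of all speakers in a segment, ordered by start time and separated by a speaker-change token, into one target sequence. The sketch below builds such a target; the "<sc>" token name and the Utterance structure are illustrative assumptions rather than the paper's exact format.

# Minimal sketch of building a serialized output training (SOT) target by
# joining per-speaker references in first-in-first-out order of start time.
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    speaker: str
    start: float   # start time in seconds
    text: str

SPEAKER_CHANGE = "<sc>"  # assumed speaker-change token

def build_sot_target(utterances: List[Utterance]) -> str:
    """Serialize all speakers' references of one overlapped segment."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {SPEAKER_CHANGE} ".join(u.text for u in ordered)

if __name__ == "__main__":
    segment = [
        Utterance("spk2", 1.3, "we should move the deadline"),
        Utterance("spk1", 0.4, "let me share the agenda first"),
    ]
    # -> "let me share the agenda first <sc> we should move the deadline"
    print(build_sot_target(segment))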