Target-Speaker Voice Activity Detection via Sequence-to-Sequence Prediction

Paper Title

Target-Speaker Voice Activity Detection via Sequence-to-Sequence Prediction

Authors

Ming Cheng, Weiqing Wang, Yucong Zhang, Xiaoyi Qin, Ming Li

Abstract

Target-speaker voice activity detection is currently a promising approach for speaker diarization in complex acoustic environments. This paper presents a novel Sequence-to-Sequence Target-Speaker Voice Activity Detection (Seq2Seq-TSVAD) method that can efficiently address the joint modeling of large-scale speakers and predict high-resolution voice activities. Experimental results show that larger speaker capacity and higher output resolution can significantly reduce the diarization error rate (DER). The method achieves new state-of-the-art DERs of 4.55% on the VoxConverse test set and 10.77% on Track 1 of the DIHARD-III evaluation set under the widely used evaluation metrics.
