Robust Speaker Extraction Network Based on Iterative Refined Adaptation

Paper Title

Robust Speaker Extraction Network Based on Iterative Refined Adaptation

Authors

Deng, Chengyun; Ma, Shiqian; Zhang, Yi; Sha, Yongtao; Zhang, Hui; Song, Hui; Li, Xiangang

Abstract

Speaker extraction aims to extract the target speech signal from a multi-talker environment with interfering speakers and surrounding noise, given the target speaker's reference information. Most speaker extraction systems achieve satisfactory performance on the premise that the test speakers have been encountered during training. Such systems suffer from performance degradation given unseen target speakers and/or mismatched reference voiceprint information. In this paper, we propose a novel strategy named Iterative Refined Adaptation (IRA) to improve the robustness and generalization capability of speaker extraction systems in the aforementioned scenarios. Given an initial speaker embedding encoded by an auxiliary network, the extraction network can obtain a latent representation of the target speaker, which is fed back to the auxiliary network to get a refined embedding that provides more accurate guidance for the extraction network. Experiments on the WSJ0-2mix-extr and WHAM! datasets confirm the superior performance of the proposed method over the network without IRA in terms of SI-SDR and PESQ improvement.
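
The feedback loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function name `iterative_refined_adaptation`, the two callables `auxiliary_net` and `extraction_net`, the `num_iterations` parameter, and the final extraction pass are all assumptions made for this sketch. In particular, it assumes the auxiliary network can encode both the reference utterance and the latent representation returned by the extraction network.

```python
from typing import Callable, Tuple, TypeVar

Signal = TypeVar("Signal")        # e.g. a waveform or feature sequence
Embedding = TypeVar("Embedding")  # e.g. a speaker embedding vector


def iterative_refined_adaptation(
    mixture: Signal,
    reference: Signal,
    auxiliary_net: Callable[[Signal], Embedding],
    extraction_net: Callable[[Signal, Embedding], Tuple[Signal, Signal]],
    num_iterations: int = 2,
) -> Tuple[Signal, Embedding]:
    """Hypothetical sketch of the IRA loop.

    auxiliary_net(signal) -> embedding                       (assumed interface)
    extraction_net(mixture, embedding) -> (estimate, latent) (assumed interface)
    """
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")

    # Initial speaker embedding, encoded from the reference utterance.
    embedding = auxiliary_net(reference)

    for _ in range(num_iterations):
        # The extraction network yields a latent representation of the
        # target speaker under the current embedding.
        _, latent = extraction_net(mixture, embedding)
        # Feed the latent representation back to the auxiliary network
        # to obtain a refined embedding.
        embedding = auxiliary_net(latent)

    # Final extraction guided by the refined embedding (an assumption of
    # this sketch; the source does not spell out this step).
    estimate, _ = extraction_net(mixture, embedding)
    return estimate, embedding
```

With real models, the two callables would wrap the trained auxiliary and extraction networks. How many refinement passes are useful is not stated in the abstract, so `num_iterations` is left as a free parameter here.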
