Paper Title
Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations
Authors
Abstract
Target speech extraction is a technique for extracting a target speaker's voice from a mixture signal using a pre-recorded enrollment utterance that characterizes the target speaker's voice. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., the mismatch in characteristics between the target speech and an enrollment utterance. While most conventional approaches focus on improving average performance over a set of enrollment utterances, here we propose to guarantee worst-case performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure robustness to enrollment variations. We also introduce a novel training scheme that aims to directly optimize worst-case performance by focusing training on difficult enrollment cases where extraction does not perform well. In addition, we investigate the effectiveness of an auxiliary speaker identification loss (SI-loss) as another way to improve robustness across enrollments. Experimental validation shows that both worst-enrollment target training and SI-loss training improve robustness against enrollment variations by increasing speaker discriminability.
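The contrast between average and worst-enrollment evaluation can be illustrated with a small sketch. The abstract does not give the exact definition of worst-enrollment SDR, so this assumes it is the minimum SDR obtained for one mixture across extractions that each use a different enrollment utterance of the same target speaker. The helper names `sdr_db` and `enrollment_sdr_summary` are hypothetical, and the SDR here is a plain energy ratio, not the BSS Eval variant the paper may use.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-8):
    """Simplified SDR in dB: reference energy over residual energy.

    Assumption: a plain energy-ratio SDR, not the BSS Eval definition.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    residual = reference - estimate
    ratio = (np.sum(reference ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)


def enrollment_sdr_summary(reference, estimates_per_enrollment):
    """Return (average SDR, worst SDR) for one mixture.

    Each entry of `estimates_per_enrollment` is the extracted signal
    obtained with a different enrollment utterance (assumed setup).
    """
    scores = np.array([sdr_db(reference, est) for est in estimates_per_enrollment])
    return float(scores.mean()), float(scores.min())


# Toy data: two "extractions" of differing quality for the same target.
t = np.linspace(0.0, 1.0, 16000)
target = np.sin(2.0 * np.pi * 220.0 * t)
estimates = [0.9 * target, 0.5 * target]  # a good and a poor enrollment case
average_sdr, worst_sdr = enrollment_sdr_summary(target, estimates)
```

In this toy case the average hides the poor enrollment: the worst value is determined entirely by the weaker extraction, which is the behavior a worst-case metric is meant to expose.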