Paper title
Listen only to me! How well can target speech extraction handle false alarms?
Paper authors
Paper abstract
Target speech extraction (TSE) extracts the speech of a target speaker from a mixture, given auxiliary clues that characterize the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. Extraction performance has progressed considerably following the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures in which the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, which we refer to as the inactive speaker (IS) case. A typical TSE system tends to output a signal in IS cases, causing false alarms. This is a severe problem for the practical deployment of TSE systems. This paper aims to better understand how well TSE systems can handle IS cases. We consider two approaches to dealing with IS: (1) training a system to directly output zero signals, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset, and reveal their pros and cons.
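The two IS-handling schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the energy-ratio test, the cosine-similarity score, and both threshold values are assumptions chosen for illustration only. For scheme (1), the sketch assumes IS is declared when the extracted signal is strongly attenuated relative to the mixture; for scheme (2), it assumes speaker embeddings have already been computed by some speaker verification model that is not shown here.

```python
import math
from typing import Sequence


def power_db(signal: Sequence[float], eps: float = 1e-10) -> float:
    """Mean power of a signal in dB; eps avoids taking the log of zero."""
    mean_power = sum(x * x for x in signal) / max(len(signal), 1)
    return 10.0 * math.log10(mean_power + eps)


def is_inactive_by_energy(
    extracted: Sequence[float],
    mixture: Sequence[float],
    threshold_db: float = -30.0,  # hypothetical threshold, not from the paper
) -> bool:
    """Scheme (1), assumed decision rule: a system trained to output zeros
    for IS should produce an output far weaker than the input mixture."""
    return power_db(extracted) - power_db(mixture) < threshold_db


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def is_inactive_by_verification(
    enrollment_emb: Sequence[float],
    extracted_emb: Sequence[float],
    threshold: float = 0.5,  # hypothetical threshold, not from the paper
) -> bool:
    """Scheme (2), assumed decision rule: declare IS when the extracted
    signal's embedding does not match the enrollment embedding."""
    return cosine_similarity(enrollment_emb, extracted_emb) < threshold
```

In such a setup, both thresholds would need to be tuned on held-out data, since they trade false alarms (outputting speech when the target is silent) against missed detections of an active target.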