Paper Title
Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording
Paper Authors
Paper Abstract
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all these systems ideally assume that pre-enrolled speaker signals are available and are only evaluated on simple data configurations. In realistic multi-talker conversations, the speech signal contains a large proportion of non-overlapped regions, from which robust speaker embeddings of individual talkers can be derived. In this work, we apply the SSUSI model to long recordings and propose a self-informed, clustering-based inventory forming scheme, where the speaker inventory is built entirely from the input signal without the need for external speaker signals. Experimental results on simulated noisy reverberant long-recording datasets show that the proposed method significantly improves separation performance across various conditions.
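As a rough illustration of the self-informed inventory idea described in the abstract, the sketch below clusters speaker embeddings extracted from non-overlapped segments of a recording to form a speaker inventory without external enrollment. This is a minimal sketch under stated assumptions, not the paper's implementation: `embed_segment` is a hypothetical placeholder for a pretrained speaker encoder, the non-overlapped segments are assumed to be given, and scikit-learn's `AgglomerativeClustering` (parameter names from scikit-learn >= 1.2) is used as one possible clustering choice.

```python
# Minimal sketch of self-informed, clustering-based speaker inventory formation.
# Assumptions (not from the paper): `embed_segment` stands in for a pretrained
# speaker-embedding extractor, and non-overlapped regions are already segmented.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def embed_segment(segment: np.ndarray) -> np.ndarray:
    """Placeholder speaker-embedding extractor returning a unit-norm vector.

    In practice this would be a pretrained speaker encoder applied to the
    waveform segment; here only the interface is fixed, for illustration.
    """
    rng = np.random.default_rng(abs(hash(segment.tobytes())) % (2**32))
    vec = rng.standard_normal(128)
    return vec / np.linalg.norm(vec)


def build_speaker_inventory(non_overlapped_segments, distance_threshold=0.7):
    """Cluster embeddings of non-overlapped segments into a speaker inventory.

    Returns one averaged, re-normalized embedding per estimated speaker,
    built entirely from the input recording (no external enrollment signals).
    """
    embeddings = np.stack([embed_segment(s) for s in non_overlapped_segments])
    clustering = AgglomerativeClustering(
        n_clusters=None,                      # number of speakers is unknown
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    ).fit(embeddings)
    inventory = []
    for label in np.unique(clustering.labels_):
        centroid = embeddings[clustering.labels_ == label].mean(axis=0)
        inventory.append(centroid / np.linalg.norm(centroid))
    return inventory


if __name__ == "__main__":
    # Toy usage: ten fake 1-second segments at 16 kHz.
    segments = [np.random.randn(16000).astype(np.float32) for _ in range(10)]
    inventory = build_speaker_inventory(segments)
    print(f"Estimated {len(inventory)} speaker profiles in the inventory.")
```

In a SSUSI-style pipeline, the resulting inventory embeddings would then condition the separation network; the example above stops at inventory formation, which is the part the abstract's self-informed scheme concerns.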