Paper Title
Self-supervised Speaker Recognition Training Using Human-Machine Dialogues
Paper Authors
Paper Abstract
Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication. Learning speaker representations, in the context of supervised learning, heavily depends on clean and sufficient labeled data, which is often difficult to acquire. Noisy unlabeled data, on the other hand, also provides valuable information that can be exploited using self-supervised training methods. In this work, we investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices. However, the supervisory information in such dialogues is inherently noisy, as multiple speakers may speak to a device in the course of the same dialogue. To address this issue, we propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity. Both reconstruction-based and contrastive-learning-based self-supervised methods are compared. Experiments demonstrate that the proposed method provides significant performance improvements, outperforming earlier work. Dialogue pretraining, when combined with the rejection mechanism, yields a 27.10% equal error rate (EER) reduction in speaker recognition, compared to a model without self-supervised pretraining.
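The abstract does not specify how acoustic homogeneity is measured, but a minimal sketch of such a rejection mechanism might score each dialogue by the pairwise similarity of its utterance embeddings and discard dialogues whose utterances are too dissimilar (suggesting multiple speakers). The function names, the use of cosine similarity, and the threshold value below are illustrative assumptions, not the paper's actual method.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dialogue_is_homogeneous(utterance_embeddings, threshold=0.7):
    """Hypothetical rejection rule: accept a dialogue for self-supervised
    pretraining only if every pair of utterance embeddings is at least
    `threshold`-similar, i.e. the dialogue is acoustically homogeneous
    (likely a single speaker). Otherwise reject it as noisy supervision."""
    n = len(utterance_embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(utterance_embeddings[i],
                                    utterance_embeddings[j])
            if sim < threshold:
                return False
    return True
```

For example, two near-parallel embeddings would pass the check, while two orthogonal ones (plausibly different speakers) would be rejected; in practice the threshold would be tuned on held-out data.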