Paper title
Multimodal Clustering with Role Induced Constraints for Speaker Diarization
Paper authors
Paper abstract
Speaker clustering is an essential step in conventional speaker diarization systems and is typically addressed as an audio-only speech processing task. The language used by the participants in a conversation, however, carries additional information that can help improve the clustering performance. This is especially true in conversational interactions, such as business meetings, interviews, and lectures, where the specific roles assumed by interlocutors (manager, client, teacher, etc.) are often associated with distinguishable linguistic patterns. In this paper, we propose to employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method is applied to two different domains, namely medical interactions and podcast episodes, and is shown to yield improved results compared to the audio-only approach.
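
The idea of steering spectral clustering with must-link and cannot-link constraints can be illustrated with a small sketch. The abstract does not say how the constraints enter the clustering step, so the code below assumes one simple option: overwriting entries of the audio affinity matrix (1 for must-link pairs, 0 for cannot-link pairs) before a two-way spectral split. The function names, the toy affinity values, and the hard-coded role labels (standing in for the output of a text-based role classifier) are all hypothetical, and treating "same role" as "same speaker" is itself an assumption that only holds when each role is played by a single speaker.

```python
from itertools import combinations

import numpy as np


def constraints_from_roles(roles):
    """Turn per-segment role labels into pairwise constraints.

    Segments with role None (no confident role prediction) are left
    unconstrained. Assumes one speaker per role.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(roles)), 2):
        if roles[i] is None or roles[j] is None:
            continue
        if roles[i] == roles[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


def apply_constraints(affinity, must_link, cannot_link):
    """Return a copy of the affinity matrix with constrained pairs overwritten."""
    constrained = np.array(affinity, dtype=float)
    for i, j in must_link:
        constrained[i, j] = constrained[j, i] = 1.0
    for i, j in cannot_link:
        constrained[i, j] = constrained[j, i] = 0.0
    return constrained


def two_way_spectral_split(affinity):
    """Split segments into two clusters by the sign of the Fiedler vector.

    Uses the symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}.
    Assumes every segment has a positive degree (row sum).
    """
    affinity = np.asarray(affinity, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return (eigenvectors[:, 1] > 0).astype(int)


# Toy audio-only affinities between four segments (made-up values).
audio_affinity = np.array([
    [1.0, 0.9, 0.5, 0.2],
    [0.9, 1.0, 0.6, 0.3],
    [0.5, 0.6, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
])
# Hypothetical role predictions; the last segment has no confident role.
roles = ["doctor", "doctor", "patient", None]

must_link, cannot_link = constraints_from_roles(roles)
labels = two_way_spectral_split(apply_constraints(audio_affinity, must_link, cannot_link))
print(labels)
```

With these toy values, segments 0 and 1 should fall in one cluster and segments 2 and 3 in the other; which cluster receives label 0 or 1 depends on the eigenvector sign and is arbitrary. Segment 3 has no role label, so it is placed purely by its audio affinities, which is where the audio and text signals complement each other.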