Paper title
Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model
Paper authors
Paper abstract
Speaker diarization has been investigated extensively as an important central task for meeting analysis. A recent trend shows that the integration of end-to-end neural (EEND) and clustering-based diarization is a promising approach to handle realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and it has achieved state-of-the-art results on various tasks. However, the approaches proposed so far have not realized {\it tight} integration yet, because the clustering employed therein was not optimal in any sense for clustering the speaker embeddings estimated by the EEND module. To address this problem, this paper introduces a {\it trainable} clustering algorithm into the integration framework, by deep-unfolding a non-parametric Bayesian model called the infinite Gaussian mixture model (iGMM). Specifically, the speaker embeddings are optimized during training such that they better fit iGMM clustering, using a novel clustering loss based on the Adjusted Rand Index (ARI). Experimental results on CALLHOME data show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which indeed reflects the effectiveness of the proposed iGMM integration.
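As background on the clustering loss mentioned in the abstract: the Adjusted Rand Index measures agreement between two partitions of the same items, corrected for chance, and is invariant to cluster-label permutations. The sketch below is a plain pair-counting implementation of ARI itself, not the paper's differentiable training loss (which presumably relaxes this metric); function names here are illustrative.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index via the pair-counting formula.

    Returns 1.0 for identical partitions (up to label permutation)
    and roughly 0.0 for agreement at chance level; can be negative.
    """
    n = len(labels_true)
    # Contingency counts between the two partitions.
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # cluster sizes in the reference partition
    b = Counter(labels_pred)   # cluster sizes in the predicted partition

    sum_ij = sum(comb(c, 2) for c in contingency.values())  # agreeing pairs
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because ARI ignores how clusters are labeled, `[0, 0, 1, 1]` versus `[1, 1, 0, 0]` scores a perfect 1.0, which is exactly the property a diarization clustering objective needs: speaker identities are anonymous, only the grouping matters.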