Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization

Paper Title

Bi-LSTM Scoring Based Similarity Measurement with Agglomerative Hierarchical Clustering (AHC) for Speaker Diarization

Authors

Nijhawan, Siddharth S.; Beigi, Homayoon

Abstract

Across different scenarios, the majority of speech signals are not available as well-defined audio segments containing only a single speaker. A typical conversation between two speakers contains segments where their voices overlap, where they interrupt each other, or where they pause between sentences. Recent advances in diarization technology use neural network-based approaches to improve multiple subsystems of a speaker diarization system, including the extraction of segment-wise embedding features and the detection of speaker changes during a conversation. However, to identify speakers through clustering, models depend on methods such as Probabilistic Linear Discriminant Analysis (PLDA) to generate a similarity measure between two segments extracted from a given conversational audio recording. Since these algorithms ignore the temporal structure of conversations, they tend to yield a higher Diarization Error Rate (DER), leading to misdetections in both speaker identification and speaker change identification. Therefore, to compare the similarity of two speech segments both independently and sequentially, we propose a Bi-directional Long Short-Term Memory (Bi-LSTM) network for estimating the elements of the similarity matrix. Once the similarity matrix is generated, Agglomerative Hierarchical Clustering (AHC) is applied to identify speaker segments based on thresholding. Performance is evaluated with the Diarization Error Rate (DER%) metric. The proposed model achieves a DER of 34.80% on a test set of audio samples derived from the ICSI Meeting Corpus, compared with a DER of 39.90% for a traditional PLDA-based similarity measurement mechanism.
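The clustering step described in the abstract (AHC applied to a similarity matrix, with merging stopped by a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the linkage criterion or the threshold value, so average linkage and a threshold of 0.5 are assumptions here, the function name `ahc_from_similarity` is hypothetical, and the toy similarity matrix is hand-written rather than produced by a Bi-LSTM scorer.

```python
from itertools import combinations


def ahc_from_similarity(sim, threshold):
    """Average-linkage AHC over a symmetric similarity matrix (assumed linkage).

    Merging stops once the most similar pair of clusters has an average
    similarity below `threshold`. Returns one cluster label per segment.
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (x, y), best = max(
            (((a, b), avg_sim(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        # combinations() guarantees x < y, so deleting y leaves x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(sim)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy similarity matrix for five segments (illustrative values only):
# segments 0, 1, 3 resemble each other, as do segments 2 and 4.
sim = [
    [1.00, 0.90, 0.20, 0.80, 0.10],
    [0.90, 1.00, 0.30, 0.85, 0.20],
    [0.20, 0.30, 1.00, 0.25, 0.90],
    [0.80, 0.85, 0.25, 1.00, 0.15],
    [0.10, 0.20, 0.90, 0.15, 1.00],
]

labels = ahc_from_similarity(sim, threshold=0.5)
print(labels)  # [0, 0, 1, 0, 1]
```

In this sketch the threshold alone determines the number of speakers: raising it leaves more segments unmerged, while lowering it collapses segments into fewer clusters.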
