Paper Title

Deep Representation Decomposition for Rate-Invariant Speaker Verification

Paper Authors

Fuchuan Tong, Siqi Zheng, Haodong Zhou, Xingjia Xie, Qingyang Hong, Lin Li

Paper Abstract

While promising performance for speaker verification has been achieved by deep speaker embeddings, the advantage would reduce in the case of speaking-style variability. Speaking rate mismatch is often observed in practical speaker verification systems, which may actually degrade the system performance. To reduce intra-class discrepancy caused by speaking rate, we propose a deep representation decomposition approach with adversarial learning to learn speaking rate-invariant speaker embeddings. Specifically, adopting an attention block, we decompose the original embedding into an identity-related component and a rate-related component through multi-task training. Additionally, to reduce the latent relationship between the two decomposed components, we further propose a cosine mapping block to train the parameters adversarially to minimize the cosine similarity between the two decomposed components. As a result, identity-related features become robust to speaking rate and then are used for verification. Experiments are conducted on VoxCeleb1 data and HI-MIA data to demonstrate the effectiveness of our proposed approach.
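To make the described decomposition concrete, below is a minimal PyTorch sketch, not the authors' implementation: the attention block is approximated here by a learned sigmoid gate that splits an embedding into identity-related and rate-related parts, each trained with its own classification head (multi-task), and the adversarially trained cosine mapping block is stood in for by a simple joint loss term that minimizes the cosine similarity between the two parts. All names (AttentionDecomposer, speaker_head, rate_head) and the rate-class setup are illustrative assumptions.

# Illustrative sketch only; module names and the rate-label setup are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionDecomposer(nn.Module):
    """Split a speaker embedding into identity- and rate-related components
    with a learned attention (gating) mask, following the abstract's idea."""

    def __init__(self, dim: int, n_speakers: int, n_rate_classes: int = 3):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.speaker_head = nn.Linear(dim, n_speakers)   # identity classification
        self.rate_head = nn.Linear(dim, n_rate_classes)  # speaking-rate classification

    def forward(self, emb: torch.Tensor):
        mask = self.attention(emb)        # element-wise attention weights in (0, 1)
        identity_emb = mask * emb         # identity-related component
        rate_emb = (1.0 - mask) * emb     # rate-related component
        return identity_emb, rate_emb

    def losses(self, emb, speaker_labels, rate_labels):
        identity_emb, rate_emb = self.forward(emb)
        # Multi-task objectives: each component predicts its own label.
        loss_spk = F.cross_entropy(self.speaker_head(identity_emb), speaker_labels)
        loss_rate = F.cross_entropy(self.rate_head(rate_emb), rate_labels)
        # Decorrelation term: drive the two components apart by minimizing their
        # cosine similarity (a simple stand-in for the paper's adversarially
        # trained cosine mapping block).
        loss_cos = F.cosine_similarity(identity_emb, rate_emb, dim=-1).abs().mean()
        return loss_spk + loss_rate + loss_cos


# Toy usage: 192-dim embeddings, 10 speakers, fast/normal/slow rate labels.
model = AttentionDecomposer(dim=192, n_speakers=10, n_rate_classes=3)
emb = torch.randn(8, 192)
spk = torch.randint(0, 10, (8,))
rate = torch.randint(0, 3, (8,))
loss = model.losses(emb, spk, rate)
loss.backward()

At verification time only the identity-related component would be scored, mirroring the paper's use of the rate-invariant embedding.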
