Paper Title

Deep Representation Decomposition for Rate-Invariant Speaker Verification

Paper Authors

Fuchuan Tong, Siqi Zheng, Haodong Zhou, Xingjia Xie, Qingyang Hong, Lin Li

Paper Abstract

While promising performance for speaker verification has been achieved by deep speaker embeddings, the advantage would reduce in the case of speaking-style variability. Speaking rate mismatch is often observed in practical speaker verification systems, which may actually degrade the system performance. To reduce intra-class discrepancy caused by speaking rate, we propose a deep representation decomposition approach with adversarial learning to learn speaking rate-invariant speaker embeddings. Specifically, adopting an attention block, we decompose the original embedding into an identity-related component and a rate-related component through multi-task training. Additionally, to reduce the latent relationship between the two decomposed components, we further propose a cosine mapping block to train the parameters adversarially to minimize the cosine similarity between the two decomposed components. As a result, identity-related features become robust to speaking rate and then are used for verification. Experiments are conducted on VoxCeleb1 data and HI-MIA data to demonstrate the effectiveness of our proposed approach.
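To make the described decomposition concrete, below is a minimal PyTorch sketch, not the authors' implementation: the attention block is approximated here by a learned sigmoid gate that splits an embedding into identity-related and rate-related parts, each trained with its own classification head (multi-task), and the adversarially trained cosine mapping block is stood in for by a simple joint loss term that minimizes the cosine similarity between the two parts. All names (AttentionDecomposer, speaker_head, rate_head) and the rate-class setup are illustrative assumptions.

# Illustrative sketch only; module names and the rate-label setup are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionDecomposer(nn.Module):
    """Split a speaker embedding into identity- and rate-related components
    with a learned attention (gating) mask, following the abstract's idea."""

    def __init__(self, dim: int, n_speakers: int, n_rate_classes: int = 3):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.speaker_head = nn.Linear(dim, n_speakers)   # identity classification
        self.rate_head = nn.Linear(dim, n_rate_classes)  # speaking-rate classification

    def forward(self, emb: torch.Tensor):
        mask = self.attention(emb)        # element-wise attention weights in (0, 1)
        identity_emb = mask * emb         # identity-related component
        rate_emb = (1.0 - mask) * emb     # rate-related component
        return identity_emb, rate_emb

    def losses(self, emb, speaker_labels, rate_labels):
        identity_emb, rate_emb = self.forward(emb)
        # Multi-task objectives: each component predicts its own label.
        loss_spk = F.cross_entropy(self.speaker_head(identity_emb), speaker_labels)
        loss_rate = F.cross_entropy(self.rate_head(rate_emb), rate_labels)
        # Decorrelation term: drive the two components apart by minimizing their
        # cosine similarity (a simple stand-in for the paper's adversarially
        # trained cosine mapping block).
        loss_cos = F.cosine_similarity(identity_emb, rate_emb, dim=-1).abs().mean()
        return loss_spk + loss_rate + loss_cos


# Toy usage: 192-dim embeddings, 10 speakers, fast/normal/slow rate labels.
model = AttentionDecomposer(dim=192, n_speakers=10, n_rate_classes=3)
emb = torch.randn(8, 192)
spk = torch.randint(0, 10, (8,))
rate = torch.randint(0, 3, (8,))
loss = model.losses(emb, spk, rate)
loss.backward()

At verification time only the identity-related component would be scored, mirroring the paper's use of the rate-invariant embedding.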
