Paper title

Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

Authors

Stafylakis, Themos; Mošner, Ladislav; Plchot, Oldřich; Rohdin, Johan; Silnova, Anna; Burget, Lukáš; Černocký, Jan "Honza"

Abstract

In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we use the full VoxCeleb recordings and the names of the celebrities appearing in each video, without knowledge of the time intervals in which the celebrities appear. We show that by combining a baseline speaker diarization algorithm that requires no training or parameter tuning, a modified loss with aggregation over segments, and a two-stage training approach, we are able to train a competitive ResNet-based embedding extractor. Finally, we experiment with two different aggregation functions and analyze their behaviour in terms of their gradients.
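The idea of a loss with aggregation over segments can be illustrated with a small sketch. The abstract does not name the two aggregation functions, so the choice of max and log-sum-exp below is an assumption made purely for illustration, as are all function names and the toy numbers. The sketch collapses per-segment speaker logits (for example, one vector per diarized segment) into a single recording-level vector, applies cross-entropy against the recording-level label, and shows how the gradient reaching the target logit is shared among segments under each aggregation.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def aggregate(segment_logits, how="max"):
    """Collapse per-segment logit vectors into one vector per recording.

    segment_logits: list of segments, each a list of per-speaker logits.
    how: "max" or "lse" (log-sum-exp); both are illustrative assumptions.
    """
    n_classes = len(segment_logits[0])
    columns = [[seg[c] for seg in segment_logits] for c in range(n_classes)]
    if how == "max":
        return [max(col) for col in columns]
    if how == "lse":
        return [logsumexp(col) for col in columns]
    raise ValueError(f"unknown aggregation: {how}")

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single recording-level label."""
    return logsumexp(logits) - logits[target]

def aggregation_weights(segment_logits, target, how):
    """Derivative of the aggregated target logit w.r.t. each segment's target logit.

    For max this is one-hot (only the best-scoring segment receives gradient);
    for log-sum-exp it is a softmax over segments.
    """
    col = [seg[target] for seg in segment_logits]
    if how == "max":
        best = col.index(max(col))
        return [1.0 if i == best else 0.0 for i in range(len(col))]
    if how == "lse":
        z = logsumexp(col)
        return [math.exp(v - z) for v in col]
    raise ValueError(f"unknown aggregation: {how}")

# Toy recording: three segments, two candidate speakers (hypothetical numbers).
segments = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
for how in ("max", "lse"):
    loss = cross_entropy(aggregate(segments, how), target=0)
    weights = aggregation_weights(segments, target=0, how=how)
    print(how, round(loss, 4), [round(w, 3) for w in weights])
```

Under these assumptions, max routes the whole gradient for the target speaker through a single segment, while log-sum-exp spreads it across segments in proportion to their scores, which is one way such aggregation functions can differ in their gradient behaviour.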
