Paper Title
Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization for End-to-End Speaker Verification System
Authors
Abstract
Speaker embedding generation is one of the most important parts of an end-to-end speaker verification system. In our previous paper, we reported that multi-layer aggregation based on shortcut connections improves the representational power of the speaker embedding. However, the number of model parameters is relatively large, and unspecified variations increase in the multi-layer aggregation. Therefore, we propose a self-attentive multi-layer aggregation with feature recalibration and normalization for end-to-end speaker verification systems. To reduce the number of model parameters, a ResNet with scaled channel width and layer depth is used as the baseline. To control variability during training, a self-attention mechanism performs the multi-layer aggregation, together with dropout regularization and batch normalization. Then, a feature recalibration layer, built from fully connected layers and nonlinear activation functions, is applied to the aggregated feature. Deep length normalization is also applied to the recalibrated feature in the end-to-end training process. Experimental results on the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rates of 4.95% and 2.86% when trained on the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
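The three stages named in the abstract (self-attentive aggregation, feature recalibration, length normalization) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes that each layer's output has already been pooled to a vector of the same dimension, that attention is a single learned scoring vector followed by a softmax over layers, that recalibration is a squeeze-and-excitation style gate (fully connected, ReLU, fully connected, sigmoid), and that length normalization is L2 normalization times a constant scale. All names (`self_attentive_aggregate`, `recalibrate`, `length_normalize`, `w_att`, `W1`, `W2`) and the random weights are hypothetical; dropout and batch normalization are omitted.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attentive_aggregate(layer_feats, w_att):
    # Score each layer's pooled feature, then take the attention-weighted sum.
    scores = [sum(w * f for w, f in zip(w_att, feat)) for feat in layer_feats]
    alphas = softmax(scores)
    dim = len(layer_feats[0])
    agg = [sum(a * feat[d] for a, feat in zip(alphas, layer_feats))
           for d in range(dim)]
    return agg, alphas

def recalibrate(x, W1, W2):
    # Assumed gate: FC -> ReLU -> FC -> sigmoid, applied elementwise to x.
    h = [max(0.0, v) for v in matvec(W1, x)]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in matvec(W2, h)]
    return [g * xi for g, xi in zip(gate, x)]

def length_normalize(x, scale=1.0, eps=1e-12):
    # L2-normalize, then multiply by a constant scale (assumed form).
    norm = math.sqrt(sum(v * v for v in x)) + eps
    return [scale * v / norm for v in x]

# Toy usage with random stand-ins for layer features and learned weights.
random.seed(0)
dim, hidden, n_layers = 8, 4, 3
layer_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_layers)]
w_att = [random.gauss(0, 0.1) for _ in range(dim)]
W1 = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
W2 = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

agg, alphas = self_attentive_aggregate(layer_feats, w_att)
embedding = length_normalize(recalibrate(agg, W1, W2), scale=10.0)
```

In this sketch the attention weights sum to one across layers, and the final embedding has a fixed norm equal to `scale`, so a cosine-style comparison between two embeddings depends only on their direction.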