Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model

Paper title

Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model

Authors

Tarun Gupta, Duc-Tuan Truong, Tran The Anh, Chng Eng Siong

Abstract

The estimation of speaker characteristics such as age and height is a challenging task with numerous applications in voice forensic analysis. In this work, we propose a bi-encoder transformer mixture model for speaker age and height estimation. Given the wide differences between male and female voice characteristics, such as differences in formant and fundamental frequencies, we propose using two separate transformer encoders to extract gender-specific voice features for male and female speakers, with wav2vec 2.0 as a common-level feature extractor. This architecture reduces interference effects during backpropagation and improves the generalizability of the model. We perform our experiments on the TIMIT dataset and significantly outperform the current state-of-the-art results on age estimation. Specifically, we achieve a root mean squared error (RMSE) of 5.54 years and 6.49 years for male and female age estimation, respectively. Further experiments evaluating the relative importance of different phonetic types for our task demonstrate that vowel sounds are the most distinguishing for age estimation.
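As a rough illustration of the quantities mentioned in the abstract, the sketch below shows one way two gender-specific estimates could be blended and how RMSE could be reported separately for male and female speakers. The abstract does not say how the two encoder outputs are combined, so the soft gate `p_male` and the function names `mixture_prediction` and `rmse_by_group` are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def mixture_prediction(p_male, male_estimate, female_estimate):
    """Blend two gender-specific estimates with a soft gate.

    ASSUMPTION: a probability-weighted average is used here only as an
    illustration; the source does not specify the combination rule.
    """
    if not 0.0 <= p_male <= 1.0:
        raise ValueError("p_male must be a probability in [0, 1]")
    return p_male * male_estimate + (1.0 - p_male) * female_estimate


def rmse_by_group(records):
    """Compute root mean squared error separately for each group.

    records: iterable of (group, true_value, predicted_value) tuples,
    for example ("M", 31.0, 28.5) for a male speaker's age in years.
    """
    squared_errors = defaultdict(list)
    for group, truth, prediction in records:
        squared_errors[group].append((truth - prediction) ** 2)
    # RMSE = sqrt(mean of squared errors) within each group
    return {
        group: math.sqrt(sum(errors) / len(errors))
        for group, errors in squared_errors.items()
    }
```

Calling `rmse_by_group` on a list of `(gender, true_age, predicted_age)` tuples returns one RMSE per gender, which is the form in which the abstract reports its age-estimation results.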

Paper title