Paper Title
Fine-grained Early Frequency Attention for Deep Speaker Representation Learning
Paper Authors
Abstract
Deep learning techniques have considerably improved speech processing in recent years. Speaker representations extracted by deep learning models are used in a wide range of tasks such as speaker recognition and speech emotion recognition. Attention mechanisms have started to play an important role in improving deep learning models in the field of speech processing. Nonetheless, although important speaker-related information can be embedded in individual frequency bins of the input spectral representations, current attention models are unable to attend to such fine-grained information items. In this paper, we propose Fine-grained Early Frequency Attention (FEFA) for speaker representation learning. FEFA is a simple and lightweight module that can be integrated into various CNN pipelines and is capable of focusing on information items as small as frequency bins. We evaluate the proposed model on three tasks: speaker recognition, speech emotion recognition, and spoken digit recognition. We use three widely used public datasets, namely VoxCeleb, IEMOCAP, and the Free Spoken Digit Dataset, for our experiments. We attach FEFA to several prominent deep learning models and evaluate its impact on the final performance, and we compare our work with other related work in the area. Our experiments show that adding FEFA to different CNN architectures consistently improves performance by substantial margins, and that the models equipped with FEFA outperform all the other attentive models. We also test our model against different levels of added noise, showing improved robustness and lower sensitivity compared to the backbone networks.
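For illustration, below is a minimal PyTorch sketch of how a FEFA-style module might attend to individual frequency bins early in a CNN pipeline, i.e., before the backbone. This is an assumption-laden sketch, not the paper's exact implementation: the module name `FrequencyBinAttention`, the bottleneck width, and the time/channel averaging used to summarize each bin are all hypothetical choices made here for clarity.

```python
import torch
import torch.nn as nn

class FrequencyBinAttention(nn.Module):
    """Hypothetical FEFA-style module: learns one attention weight per
    frequency bin of the input spectrogram and rescales the bins before
    the spectrogram is passed to a CNN backbone."""

    def __init__(self, n_freq_bins: int, hidden: int = 64):
        super().__init__()
        # Small bottleneck MLP mapping the per-bin energy profile
        # to one attention score per frequency bin.
        self.scorer = nn.Sequential(
            nn.Linear(n_freq_bins, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_freq_bins),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, channels, freq_bins, time_frames)
        # Summarize each frequency bin by averaging over channels and time.
        profile = spec.mean(dim=(1, 3))  # (batch, freq_bins)
        # Softmax over bins yields a probability-like attention map.
        weights = torch.softmax(self.scorer(profile), dim=-1)
        # Broadcast per-bin weights over channels and time; scale by the
        # number of bins so the input's overall magnitude is preserved.
        weights = weights.unsqueeze(1).unsqueeze(-1) * spec.size(2)
        return spec * weights

# Usage sketch: apply to the spectrogram before an existing CNN backbone.
fefa = FrequencyBinAttention(n_freq_bins=257)
spectrogram_batch = torch.randn(8, 1, 257, 300)
attended = fefa(spectrogram_batch)  # same shape as the input
```

Because the module only reweights the input and preserves its shape, it can be prepended to an arbitrary CNN pipeline without changing the backbone, which is consistent with the abstract's description of FEFA as simple, lightweight, and architecture-agnostic.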