Paper Title
Towards Transferable Speech Emotion Representation: On loss functions for cross-lingual latent representations
Paper Authors
Abstract
In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal-processing approaches, methods for SER now also use deep learning techniques, which offer transfer-learning possibilities. However, generalizing over languages, corpora, and recording conditions is still an open challenge. In this work we address this gap by exploring loss functions that aid transferability, specifically to non-tonal languages. We propose a variational autoencoder (VAE) with KL annealing and a semi-supervised VAE to obtain more consistent latent embedding distributions across data sets. To ensure transferability, the distribution of the latent embedding should be similar across non-tonal languages (data sets). We start by presenting a low-complexity SER based on a denoising autoencoder (DAE), which achieves an unweighted classification accuracy of over 52.09% for four-class emotion classification. This performance is comparable to that of similar baseline methods. Following this, we employ a VAE, the semi-supervised VAE, and the VAE with KL annealing to obtain a more regularized latent space. We show that while the DAE has the highest classification accuracy among the methods, the semi-supervised VAE has a comparable classification accuracy and a more consistent latent embedding distribution over data sets.
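To make the KL-annealed VAE objective mentioned in the abstract concrete, the following is a minimal, generic sketch, not the authors' implementation: it assumes a diagonal-Gaussian posterior, a standard-normal prior, and a linear warmup schedule; the function names and the `warmup_steps` parameter are illustrative assumptions.

```python
import math

def kl_diag_gaussian(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal
    Gaussian posterior, summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def kl_annealing_weight(step, warmup_steps):
    """Linear KL annealing: ramp the KL weight (beta) from 0 to 1
    over the first warmup_steps training steps."""
    return min(1.0, step / warmup_steps)

def annealed_vae_loss(recon_loss, mu, logvar, step, warmup_steps=10_000):
    """ELBO-style training loss: reconstruction term plus the
    annealed KL term. Early in training beta is near 0, so the
    model focuses on reconstruction before the latent space is
    pulled toward the prior."""
    beta = kl_annealing_weight(step, warmup_steps)
    return recon_loss + beta * kl_diag_gaussian(mu, logvar)
```

Annealing the KL weight in this way is a common remedy for posterior collapse: with a full-strength KL term from step 0, the encoder can be pushed to ignore the input, whereas a gradual ramp lets the latent embedding first become informative and then become regularized, which is consistent with the paper's goal of a latent distribution that stays similar across data sets.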