Paper Title

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Authors

Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

Abstract

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional style to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers and languages.
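The key idea in the abstract is swapping the discrete condition (a one-hot emotion label, limited to emotions seen in training) for a continuous style embedding produced by a pre-trained SER model, which can describe unseen emotional styles at run time. The following is a minimal toy sketch of that conditioning difference, not the authors' VAW-GAN implementation; the SER encoder is stood in for by a hypothetical random projection, and all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(emotion_id, num_emotions=4):
    """Discrete conditioning: the decoder can only reproduce
    the fixed set of emotion categories seen in training."""
    v = np.zeros(num_emotions)
    v[emotion_id] = 1.0
    return v

def ser_style_embedding(utterance_features, W):
    """Continuous conditioning: a (hypothetical) pre-trained SER
    encoder maps any reference utterance, seen or unseen emotion,
    to a normalized style vector of the same size."""
    h = np.tanh(utterance_features @ W)
    return h / np.linalg.norm(h)

def decode(content_code, condition):
    """Stand-in decoder: consumes the disentangled content code
    concatenated with the emotion condition."""
    return np.concatenate([content_code, condition])

# Toy dimensions.
content = rng.standard_normal(8)          # linguistic content code
W = rng.standard_normal((16, 4))          # stand-in SER weights
ref_utterance = rng.standard_normal(16)   # reference with target emotion

seen = decode(content, one_hot(2))                             # seen style
unseen = decode(content, ser_style_embedding(ref_utterance, W))  # unseen style
print(seen.shape, unseen.shape)
```

Because both conditions have the same dimensionality, the same decoder accepts either one; the SER embedding simply spans a continuous style space instead of four fixed corners.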
