Paper Title
ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis
Paper Authors
Abstract
In recent years, neural network based methods for multi-speaker text-to-speech (TTS) synthesis have made significant progress. However, the speaker encoder models currently used in these methods still cannot capture sufficient speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that generates high-quality speech with better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, originally developed for the speaker verification task; a FastSpeech2 based synthesizer; and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method achieves better naturalness and similarity. To evaluate our synthesized speech efficiently, we are the first to adopt deep learning based automatic MOS evaluation methods to assess our results, and these methods show great potential in automatic speech quality assessment.
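The abstract describes a three-stage pipeline: a speaker encoder produces a fixed-size embedding from a reference utterance, a synthesizer conditions on that embedding to produce a mel spectrogram from text, and a vocoder converts the mel spectrogram to a waveform. The following is a minimal sketch of that data flow only; all three functions are toy stand-ins (the real models are neural networks), and the dimensions (192-dim embedding, 80 mel bins, hop size 256) are common choices assumed here, not taken from the paper.

```python
import numpy as np

EMB_DIM = 192    # ECAPA-TDNN implementations commonly emit 192-dim embeddings
N_MELS = 80      # typical mel-spectrogram bin count
HOP_SIZE = 256   # typical vocoder hop size (samples per mel frame)

def speaker_encoder(ref_wave: np.ndarray) -> np.ndarray:
    """Stand-in for ECAPA-TDNN: reference waveform -> L2-normalized embedding."""
    # The real model uses TDNN blocks with attentive statistics pooling;
    # here a fixed random projection of a summary statistic, for shape only.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((EMB_DIM, 1))
    emb = (proj @ np.array([[ref_wave.mean()]])).ravel()
    return emb / np.linalg.norm(emb)

def synthesizer(phonemes: list, spk_emb: np.ndarray) -> np.ndarray:
    """Stand-in for FastSpeech2: phonemes + speaker embedding -> mel frames."""
    n_frames = 4 * len(phonemes)  # toy duration model: 4 frames per phoneme
    # The real model injects the embedding into the encoder/decoder states;
    # here the embedding's first N_MELS values are broadcast across frames.
    return np.tile(spk_emb[:N_MELS], (n_frames, 1))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for HiFi-GAN: mel spectrogram -> waveform, HOP_SIZE per frame."""
    return np.repeat(mel.mean(axis=1), HOP_SIZE)

# Wire the three separately trained components together.
ref = np.ones(16000)                               # 1 s reference at 16 kHz
emb = speaker_encoder(ref)                         # (192,)
mel = synthesizer(["HH", "AH", "L", "OW"], emb)    # (16, 80)
wav = vocoder(mel)                                 # (16 * 256,)
```

The point of the sketch is the interface: because the components are trained separately, the synthesizer only sees the speaker through the fixed-size embedding, which is why a more discriminative encoder such as ECAPA-TDNN can improve similarity without retraining the vocoder.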