Paper Title
WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
Paper Authors
Paper Abstract
Tacotron-based text-to-speech (TTS) systems synthesize speech directly from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from those features. Because the loss function is usually computed only on the frequency-domain acoustic features, the quality of the generated time-domain waveform is not directly controlled. To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, with two loss functions: 1) a time-domain loss, denoted as the waveform loss, that measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features. WaveTTS ensures the quality of both the acoustic features and the resulting speech waveform. To the best of our knowledge, this is the first implementation of Tacotron with a joint time-frequency domain loss. Experimental results show that the proposed framework outperforms the baselines and achieves high-quality synthesized speech.
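The joint objective described above can be sketched as a weighted sum of a time-domain waveform loss and a frequency-domain Mel-feature loss. The snippet below is a minimal illustration, not the paper's implementation: the choice of L1 distance for the waveform term, L2 distance for the Mel term, and the balancing weight `alpha` are all assumptions for the sketch.

```python
import numpy as np

def waveform_loss(gen_wave, nat_wave):
    # Time-domain term: mean absolute error between the generated and
    # natural waveforms (L1 is a hypothetical choice for this sketch).
    return np.mean(np.abs(gen_wave - nat_wave))

def mel_feature_loss(gen_mel, nat_mel):
    # Frequency-domain term: mean squared error between the generated and
    # natural Mel-scale acoustic features (L2 is likewise an assumption).
    return np.mean((gen_mel - nat_mel) ** 2)

def joint_tf_loss(gen_wave, nat_wave, gen_mel, nat_mel, alpha=0.5):
    # Hypothetical weight alpha balances the two domains; the total loss
    # supervises both the predicted features and the output waveform.
    return (alpha * waveform_loss(gen_wave, nat_wave)
            + (1.0 - alpha) * mel_feature_loss(gen_mel, nat_mel))
```

In a real Tacotron-style system both terms would be differentiable and backpropagated jointly through the vocoder and the feature prediction network; here plain NumPy is used only to make the combined objective concrete.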