Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling

Paper title

Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling

Authors

Tuomo Raitio, Javier Latorre, Andrea Davis, Tuuli Morrill, Ladan Golipour

Abstract

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the overall TTS quality, 2) the proposed MSMS approach outperforms the pre-training and fine-tuning approach when utilizing additional multi-speaker data, and 3) the long-form speaking style is highly rated regardless of the target text domain.
