Paper Title
Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
Authors
Abstract
Recent state-of-the-art neural text-to-speech (TTS) synthesis models have dramatically improved the intelligibility and naturalness of speech generated from text. However, building a good bilingual or code-switched TTS system for a particular voice is still a challenge, mainly because it is not easy to obtain a bilingual corpus from a speaker who has native-level fluency in both languages. In this paper, we explore the use of Mandarin speech recordings from a Mandarin speaker and English speech recordings from a different, English speaker to build high-quality bilingual and code-switched TTS for both speakers. A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech, both of which show good naturalness and speaker similarity. The resulting bilingual data are then augmented with code-switched utterances synthesized using a Transformer model. With these data, three neural TTS models (Tacotron2, Transformer, and FastSpeech) are applied to build bilingual and code-switched TTS. Subjective evaluation results show that all three systems can produce (near-)native-level speech in both languages for each of the speakers.