Paper Title
End-to-End Voice Conversion with Information Perturbation
Authors
Abstract
The ideal goal of voice conversion is to make the source speaker's speech sound natural and like the target speaker while preserving the linguistic content and prosody of the source speech. However, current approaches fall short of achieving both comprehensive transfer of the source prosody and preservation of the target speaker's timbre in the converted speech, and the quality of the converted speech is also unsatisfactory because of the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first apply information perturbation to remove speaker-related information from the source speech, which disentangles speaker timbre from linguistic content; the linguistic information is then modeled by a content encoder. To better transfer the prosody of the source speech to the target, we introduce a speaker-related pitch encoder that maintains the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is enabled through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.