CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Paper title

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Authors

Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

Abstract

Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of $47\%$ in the quality of prosody transfer and $14\%$ in preserving the target speaker identity, while still maintaining the same naturalness.
