Controllable Accented Text-to-Speech Synthesis

Paper Title

Controllable Accented Text-to-Speech Synthesis

Authors

Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

Abstract

Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging because L2 differs from L1 in terms of both phonetic rendering and prosody patterns. Furthermore, there is no easy solution for controlling the accent intensity in an utterance. In this work, we propose a neural TTS architecture that allows us to control the accent and its intensity during inference. This is achieved through three novel mechanisms: 1) an accent variance adaptor to model the complex accent variance with three prosody-controlling factors, namely pitch, energy, and duration; 2) an accent intensity modeling strategy to quantify the accent intensity; 3) a consistency constraint module to encourage the TTS system to render the expected accent intensity at a fine level. Experiments show that the proposed system attains superior performance to the baseline models in terms of accent rendering and intensity control. To the best of our knowledge, this is the first study of accented TTS synthesis with explicit intensity control.
