Paper Title
Controllable Sequence-to-Sequence Neural TTS with LPCNet Backend for Real-time Speech Synthesis on CPU
Paper Authors
Paper Abstract
State-of-the-art sequence-to-sequence acoustic networks, which convert a phonetic sequence to a sequence of spectral features with no explicit prosody prediction, generate speech of close-to-natural quality when cascaded with neural vocoders such as WaveNet. However, the combined system is typically too heavy for real-time speech synthesis on a CPU. In this work we present a sequence-to-sequence acoustic network combined with a lightweight LPCNet neural vocoder, designed for real-time speech synthesis on a CPU. In addition, the system allows sentence-level pace and expressivity control at inference time. We demonstrate that the proposed system can synthesize high-quality 22 kHz speech in real time on a general-purpose CPU. In terms of MOS score degradation relative to PCM, the system attained as low as 6.1-6.5% for quality and 6.3-7.0% for expressiveness, reaching equivalent or better quality compared to a similar system with a WaveNet vocoder backend.
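The sketch below is not the authors' implementation; it is a minimal illustration of the pipeline the abstract describes: a sequence-to-sequence acoustic model maps a phonetic sequence to acoustic frames intended for an LPCNet-style vocoder, with a sentence-level control vector (pace, expressivity) injected at inference time. All module names, dimensions, and the conditioning scheme are assumptions made purely for illustration.

```python
# Illustrative sketch only: a seq2seq acoustic model with sentence-level
# pace/expressivity conditioning, producing frames for an LPCNet-style vocoder.
# Architecture details are hypothetical and not taken from the paper.
import torch
import torch.nn as nn


class Seq2SeqAcousticModel(nn.Module):
    def __init__(self, n_phones=100, d_model=256, n_feats=20):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        # Sentence-level control vector [pace, expressivity], broadcast to all frames.
        self.ctrl_proj = nn.Linear(2, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * d_model, d_model, batch_first=True)
        # Frame-level acoustic features (e.g. cepstra + pitch parameters) for the vocoder.
        self.to_feats = nn.Linear(d_model, n_feats)

    def forward(self, phones, ctrl):
        # phones: (B, T) int64 phone IDs; ctrl: (B, 2) float control values.
        x = self.phone_emb(phones) + self.ctrl_proj(ctrl).unsqueeze(1)
        enc, _ = self.encoder(x)
        # Attention / duration modelling is omitted; output length simply mirrors
        # the input length here, which is enough to show the data flow.
        dec, _ = self.decoder(enc)
        return self.to_feats(dec)  # (B, T, n_feats) acoustic frames


if __name__ == "__main__":
    model = Seq2SeqAcousticModel()
    phones = torch.randint(0, 100, (1, 12))
    ctrl = torch.tensor([[1.0, 0.5]])  # pace = 1.0 (neutral), expressivity = 0.5
    frames = model(phones, ctrl)
    print(frames.shape)  # torch.Size([1, 12, 20])
    # In the actual system, such frames would be consumed by LPCNet, which then
    # synthesizes 22 kHz speech in real time on a general-purpose CPU.
```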