Paper Title


Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Authors

Yusuke Yasuda, Xin Wang, Junichi Yamagishi

Abstract


Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated, complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper, we investigate under what conditions neural sequence-to-sequence TTS can work well in Japanese and English, along with comparisons with deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems also use autoregressive probabilistic modeling and a neural vocoder. We investigated systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see if our findings can be generalized across languages. Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
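To give a rough sense of what "model parameter size" means for the encoder architectures compared here, below is an illustrative back-of-envelope parameter count for a Tacotron2-style encoder (three 1-D convolution layers followed by a bidirectional LSTM). The layer sizes (512-dim embedding, 512 filters with kernel width 5, 256 LSTM units per direction) follow the original Tacotron2 paper and are assumptions for illustration, not the configurations evaluated in this study.

```python
# Back-of-envelope parameter count for a Tacotron2-style encoder.
# Layer sizes are assumptions taken from the Tacotron2 paper, not
# from the systems compared in this study.

def conv1d_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel + out_ch

def lstm_params(input_size: int, hidden: int, bidirectional: bool = True) -> int:
    """4 gates, each with input weights, recurrent weights, and a bias."""
    per_direction = 4 * ((input_size + hidden) * hidden + hidden)
    return per_direction * (2 if bidirectional else 1)

emb = 512                      # character/phoneme embedding size
convs = sum(conv1d_params(emb, emb, 5) for _ in range(3))
rnn = lstm_params(emb, 256)    # 256 units per direction
total = convs + rnn
print(total)                   # ~5.5M parameters in the encoder alone
```

Even this encoder alone holds millions of parameters, which is consistent with the finding that a sufficient parameter budget is needed before a character-input system can learn useful linguistic features implicitly.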
