Interactive Text-to-Speech System via Joint Style Analysis

Paper Title

Interactive Text-to-Speech System via Joint Style Analysis

Authors

Gao, Yang; Zheng, Weiyi; Yang, Zhaojun; Kohler, Thilo; Fuegen, Christian; He, Qing

Abstract

While modern TTS technologies have made significant advances in audio quality, they still lack behavioral naturalness compared to conversing with people. We propose a style-embedded TTS system that generates styled responses based on the style of the speech query. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which the TTS then uses to produce a matching response. We faced two main challenges: 1) Only a small portion of the TTS training dataset has style labels, which are needed to train a multi-style TTS that respects different style embeddings during inference. 2) The TTS system and the style extraction model have disjoint training datasets, so we need consistent style labels across the two datasets for the TTS to learn to respect the labels produced by the style extraction model during inference. To solve these problems, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset, and applied transfer learning to learn the style embedding jointly. Our experimental results show user preference for the styled TTS responses and demonstrate the style-embedded TTS system's capability of mimicking the speech query style.
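
The semi-supervised labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the data layout, the toy extractor, and the style names "excited" and "calm" are all assumptions made for illustration. The idea shown is only that utterances without a style label receive one from the style extraction model, so both models end up sharing one label space.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical data layout: (acoustic features, optional style label).
Utterance = Tuple[Sequence[float], Optional[str]]


def pseudo_label(
    dataset: List[Utterance],
    style_extractor: Callable[[Sequence[float]], str],
) -> List[Tuple[Sequence[float], str]]:
    """Keep existing style labels; fill missing ones with the extractor's prediction."""
    labeled = []
    for features, label in dataset:
        if label is None:
            # Unlabeled TTS utterance: borrow the label from the style model.
            label = style_extractor(features)
        labeled.append((features, label))
    return labeled


def toy_style_extractor(features: Sequence[float]) -> str:
    """Placeholder for a trained style model (assumption): thresholds mean energy."""
    mean = sum(features) / len(features)
    return "excited" if mean > 0.5 else "calm"


if __name__ == "__main__":
    tts_data: List[Utterance] = [
        ([0.9, 0.8], None),    # unlabeled
        ([0.1, 0.2], "calm"),  # already labeled, kept as-is
        ([0.2, 0.1], None),    # unlabeled
    ]
    for features, label in pseudo_label(tts_data, toy_style_extractor):
        print(features, label)
```

In a real system the extractor would be the trained style extraction model and the output would more likely be an embedding than a discrete label; the sketch uses discrete labels only to keep the example short.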
