Paper Title

Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

Authors

Makishima, Naoki, Suzuki, Satoshi, Ando, Atsushi, Masumura, Ryo

Abstract

In this paper, we investigate the semi-supervised joint training of text-to-speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, where a multispeaker TTS model synthesizes speech from text with a reference speech and the ASR model reconstructs the text from the synthesized speech, after which both models are trained with a cycle-consistency loss. However, the synthesized speech does not reflect the speaker characteristics of the reference speech, and the synthesized speech becomes overly easy for the ASR model to recognize after training. This not only decreases the TTS model quality but also limits the ASR model improvement. To solve this problem, we propose improving the cycle-consistency-based training with a speaker consistency loss and step-wise optimization. The speaker consistency loss brings the speaker characteristics of the synthesized speech closer to those of the reference speech. In the step-wise optimization, we first freeze the parameters of the TTS model before both models are trained, to avoid over-adaptation of the TTS model to the ASR model. Experimental results demonstrate the efficacy of the proposed method.
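The abstract does not give the exact formulation of the two proposed components, so the following is only a minimal sketch under common assumptions: the speaker consistency loss is modeled here as one minus the cosine similarity between speaker embeddings of the synthesized and reference speech, and the step-wise optimization as a schedule that keeps the TTS parameters frozen for an initial number of updates. Function names and the embedding-based loss form are illustrative, not taken from the paper.

```python
import numpy as np

def speaker_consistency_loss(emb_synth: np.ndarray, emb_ref: np.ndarray) -> float:
    """Hypothetical speaker consistency loss: 1 - cosine similarity between
    the speaker embedding of the synthesized speech and that of the
    reference speech. Zero when the embeddings point the same way."""
    cos = float(np.dot(emb_synth, emb_ref) /
                (np.linalg.norm(emb_synth) * np.linalg.norm(emb_ref)))
    return 1.0 - cos

def training_schedule(total_steps: int, freeze_tts_steps: int) -> list:
    """Step-wise optimization skeleton: the TTS model's parameters stay
    frozen (only the ASR model is updated) for the first
    `freeze_tts_steps` updates, after which both models are trained."""
    schedule = []
    for step in range(total_steps):
        if step < freeze_tts_steps:
            schedule.append("asr-only")   # TTS frozen to avoid over-adaptation
        else:
            schedule.append("tts+asr")    # joint cycle-consistency training
    return schedule
```

In practice the loss above would be added to the cycle-consistency term when updating the TTS model, and the freeze would be implemented by excluding TTS parameters from the optimizer during the first phase.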
