Title

Textless Direct Speech-to-Speech Translation with Discrete Speech Representation

Authors

Xinjian Li, Ye Jia, Chung-Cheng Chiu

Abstract


Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are often composed of recognition, translation and synthesis sub-systems. However, most of the end-to-end systems still rely on intermediate textual supervision during training, which makes it infeasible to work for languages without written forms. In this work, we propose a novel model, Textless Translatotron, which is based on Translatotron 2, for training an end-to-end direct S2ST model without any textual supervision. Instead of jointly training with an auxiliary task predicting target phonemes as in Translatotron 2, the proposed model uses an auxiliary task predicting discrete speech representations which are obtained from learned or random speech quantizers. When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on-par with Translatotron 2 on the multilingual CVSS-C corpus as well as the bilingual Fisher Spanish-English corpus. On the latter, it outperforms the prior state-of-the-art textless model by +18.5 BLEU.
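The core textless mechanism described above, replacing phoneme targets with discrete units from a speech quantizer and training an auxiliary prediction task on them, can be illustrated with a minimal sketch. This is not the authors' implementation; the k-means-style nearest-codebook assignment and the cross-entropy auxiliary loss below are illustrative assumptions, with NumPy standing in for a real training framework.

```python
import numpy as np

def quantize(features, codebook):
    # Assign each speech frame to its nearest codebook entry (Euclidean
    # distance), yielding discrete unit IDs -- the textless training targets.
    # features: (T, D) frames; codebook: (K, D) entries -> (T,) unit IDs.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def auxiliary_unit_loss(logits, target_units):
    # Cross-entropy between the auxiliary head's predicted unit distribution
    # and the quantized target units, averaged over frames.
    # logits: (T, K) unnormalized scores; target_units: (T,) unit IDs.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_units)), target_units].mean()

rng = np.random.default_rng(0)
T, D, K = 50, 16, 8                               # frames, feature dim, codebook size
target_feats = rng.normal(size=(T, D))            # stand-in for target-speech encoder features
codebook = rng.normal(size=(K, D))                # learned or random quantizer codebook
units = quantize(target_feats, codebook)          # discrete supervision, no text needed
logits = rng.normal(size=(T, K))                  # stand-in auxiliary-head predictions
loss = auxiliary_unit_loss(logits, units)
```

In a full model this loss would be added to the main speech-synthesis objective, so the encoder learns linguistic structure from the discrete units without any transcript.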
