Paper Title

GigaST: A 10,000-hour Pseudo Speech Translation Corpus

Authors

Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao

Abstract

This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system, and the test set is translated by humans. ST models trained with the addition of our corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set. We provide a detailed description of the translation process and verify its quality. We make the translated text data public and hope to facilitate research in speech translation. We also release the training scripts on NeurST to make it easy to replicate our systems. The GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST.
