Paper Title
Facebook AI's WMT20 News Translation Task Submission
Paper Authors
Paper Abstract
This paper describes Facebook AI's submission to the WMT20 shared news translation task. We focus on the low-resource setting and participate in two language pairs, Tamil <-> English and Inuktitut <-> English, where out-of-domain bitext and monolingual data are limited. We approach the low-resource problem with two main strategies: leveraging all available data and adapting the system to the target news domain. We explore techniques that leverage bitext and monolingual data from all languages, such as self-supervised model pretraining, multilingual models, data augmentation, and reranking. To better adapt the translation system to the test domain, we explore dataset tagging and fine-tuning on in-domain data. We observe that different techniques provide varied improvements depending on the data available for each language pair. Based on these findings, we integrate the techniques into a single training pipeline. For En->Ta, we additionally explore an unconstrained setup with extra Tamil bitext and monolingual data and show that further improvement can be obtained. On the test set, our best submitted systems achieve 21.5 and 13.7 BLEU for Ta->En and En->Ta respectively, and 27.9 and 13.0 for Iu->En and En->Iu.
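As a rough illustration of the dataset tagging technique named in the abstract, the minimal Python sketch below prepends a provenance tag token to each source sentence so a translation model can distinguish data sources (e.g. original bitext vs. back-translated data) during training. The tag names and toy sentence pairs here are hypothetical, chosen for illustration only; they are not the paper's actual tag vocabulary or data.

def tag_source(sentence: str, tag: str) -> str:
    """Prepend a reserved tag token to the source-side sentence."""
    return f"<{tag}> {sentence}"

# Hypothetical corpora of (source, target) pairs; "news" marks original
# bitext and "bt" marks back-translated data. Both labels are illustrative.
bitext = [("Hello world", "வணக்கம் உலகம்")]
back_translated = [("Good morning", "காலை வணக்கம்")]

# Tag each source sentence with its provenance before mixing the corpora,
# so the model can condition on (and be steered by) the tag at test time.
tagged = (
    [(tag_source(src, "news"), tgt) for src, tgt in bitext]
    + [(tag_source(src, "bt"), tgt) for src, tgt in back_translated]
)

for src, tgt in tagged:
    print(src, "=>", tgt)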