Title
Low Resource Neural Machine Translation: A Benchmark for Five African Languages
Authors
Abstract
Recent advances in Neural Machine Translation (NMT) have shown improvements in low-resource language (LRL) translation tasks. In this work, we benchmark NMT between English and five African LRLs: Swahili, Amharic, Tigrigna, Oromo, and Somali (SATOS). We collected the available resources on the SATOS languages to evaluate the current state of NMT for LRLs. Our evaluation, comparing a baseline single-language-pair NMT model against semi-supervised learning, transfer learning, and multilingual modeling, shows significant performance improvements in both the En-LRL and LRL-En directions. In terms of averaged BLEU score, the multilingual approach shows the largest gains, up to +5 points, in six out of ten translation directions. To demonstrate the generalization capability of each model, we also report results on multi-domain test sets. We release the standardized experimental data and the test sets for future work addressing the challenges of NMT in under-resourced settings, in particular for the SATOS languages.