Paper Title

Improving Simultaneous Machine Translation with Monolingual Data

Authors

Hexuan Deng, Liang Ding, Xuebo Liu, Meishan Zhang, Dacheng Tao, Min Zhang

Abstract

Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, training a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news-domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT that considers both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms random sampling (and other conventional NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination -- and has better scalability. We achieve +0.72 BLEU improvement on average over random sampling on En-Zh and En-Ja. Data and code can be found at https://github.com/hexuandeng/Mono4SiMT.
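The abstract describes scoring Seq-KD-distilled monolingual pairs by chunk length and monotonicity before selecting them for SiMT training. The sketch below is a rough illustration of that idea, not the paper's exact formulation: it assumes word alignments are available (e.g., from an external aligner), and the chunking rule, the scoring formula, and the mixing weight `alpha` are all hypothetical stand-ins.

```python
# Illustrative sketch (not the paper's exact method) of sampling
# Seq-KD-distilled pairs by alignment monotonicity and chunk length.
# Assumptions: word alignments come from an external aligner; the
# chunking rule and scoring weights below are made up for illustration.

from typing import List, Tuple

Alignment = List[Tuple[int, int]]  # (source_pos, target_pos) word links


def monotonicity(align: Alignment) -> float:
    """Fraction of consecutive links (in source order) whose target
    positions are non-decreasing; 1.0 means fully monotonic."""
    links = sorted(align)
    if len(links) < 2:
        return 1.0
    in_order = sum(t1 <= t2 for (_, t1), (_, t2) in zip(links, links[1:]))
    return in_order / (len(links) - 1)


def avg_chunk_length(align: Alignment) -> float:
    """Average length of maximal monotonic runs of links; shorter chunks
    suggest the pair can be translated with less lookahead."""
    links = sorted(align)
    if not links:
        return 0.0
    chunks, cur = [], 1
    for (_, t1), (_, t2) in zip(links, links[1:]):
        if t2 >= t1:
            cur += 1
        else:
            chunks.append(cur)
            cur = 1
    chunks.append(cur)
    return sum(chunks) / len(chunks)


def simt_score(align: Alignment, alpha: float = 0.5) -> float:
    """Blend monotonicity with an inverse chunk-length term; alpha is a
    hypothetical mixing weight."""
    avg_len = avg_chunk_length(align)
    length_term = 1.0 / avg_len if avg_len > 0 else 0.0
    return alpha * monotonicity(align) + (1 - alpha) * length_term


def select(pairs, keep_ratio=0.5):
    """Keep the top-scoring (source, distilled_target, alignment) triples
    for SiMT student training; keep_ratio is an illustrative cutoff."""
    ranked = sorted(pairs, key=lambda p: simt_score(p[2]), reverse=True)
    return ranked[: int(keep_ratio * len(ranked))]
```

The selected pairs would then be mixed with the bilingual data to train the SiMT student, per the pipeline the abstract outlines; the actual scoring and selection criteria are defined in the paper and code repository.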
