Paper Title

A Bilingual Parallel Corpus with Discourse Annotations

Authors

Yuchen Eleanor Jiang, Tianyu Liu, Shuming Ma, Dongdong Zhang, Mrinmaya Sachan, Ryan Cotterell

Abstract

Machine translation (MT) has almost achieved human parity at sentence-level translation. In response, the MT community has, in part, shifted its focus to document-level translation. However, the development of document-level MT systems is hampered by the lack of parallel document corpora. This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set. The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena. Our resource is freely available, and we hope it will serve as a guide and inspiration for more work in document-level machine translation.
