Paper Title

Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords

Authors

Tom Kocmi, Martin Popel, Ondřej Bojar

Abstract

We present a new release of the Czech-English parallel corpus CzEng 2.0 consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise. In addition to the data in the previous version of CzEng, it contains new authentic and also high-quality synthetic parallel data. CzEng is freely available for research and educational purposes.
