Paper Title

KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi

Authors

Rubungo Andre Niyongabo, Hong Qu, Julia Kreutzer, Li Huang

Abstract

Recent progress in text classification has focused on high-resource languages such as English and Chinese. For low-resource languages, amongst them most African languages, the lack of well-annotated data and effective preprocessing is hindering progress and the transfer of successful methods. In this paper, we introduce two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible, but while Kinyarwanda has been studied in Natural Language Processing (NLP) to some extent, this work constitutes the first study on Kirundi. Along with the datasets, we provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models. Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi. In addition, the design of the created datasets allows for wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a base for new annotations for tasks such as parsing, POS tagging, and NER. The datasets, stopwords, and pre-trained embeddings are publicly available at https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus.
