Paper Title
Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations
Authors
Abstract
This article presents Lisan, a collection of morphologically annotated corpora of the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Lisan contains around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube posts and comments. Thirty-five (35) annotators who are native speakers of the target dialects carried out the annotations. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes and labeled each with morphological features such as part of speech, lemma, and an English gloss. An Arabic Dialect Annotation Toolkit (ADAT) was developed for the purpose of the annotation. The annotators were trained on a set of guidelines and on how to use ADAT. We developed ADAT to assist the annotators and to ensure compatibility with the SAMA and Curras tagsets. The tool is open source, and the four corpora are also available online.