Multi-Dialect Arabic BERT for Country-Level Dialect Identification

Paper Title

Multi-Dialect Arabic BERT for Country-Level Dialect Identification

Authors

Bashar Talafha, Mohammad Ali, Muhy Eddin Za'ter, Haitham Seelawi, Ibraheem Tuffaha, Mostafa Samir, Wael Farhan, Hussein T. Al-Natsheh

Abstract

Arabic dialect identification is a complex problem because of a number of inherent properties of the language itself. In this paper, we present the experiments conducted and the models developed by our competing team, Mawdoo3 AI, on the way to our winning solution to subtask 1 of the Nuanced Arabic Dialect Identification (NADI) shared task. The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries. The competition organizers also provide an unlabeled corpus of 10M tweets from the same domain for optional use. Our winning solution is an ensemble of different training iterations of our pre-trained BERT model, which achieved a micro-averaged F1-score of 26.78% on the subtask. We publicly release the pre-trained language model component of our winning solution under the name Multi-dialect-Arabic-BERT for any interested researcher.
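
The abstract does not state how the training iterations are combined into an ensemble. As an illustration only, the sketch below assumes soft voting: class probabilities from each checkpoint are averaged and the highest-scoring class is chosen. It then computes a micro-averaged F1-score by pooling true positives, false positives, and false negatives over all classes. The function names `ensemble_predict` and `micro_f1` and the toy probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ensemble_predict(prob_sets: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Soft voting (an assumption): average per-class probabilities
    across checkpoints, then pick the argmax for each example."""
    n_models = len(prob_sets)
    n_examples = len(prob_sets[0])
    predictions = []
    for i in range(n_examples):
        n_classes = len(prob_sets[0][i])
        avg = [
            sum(model[i][c] for model in prob_sets) / n_models
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=avg.__getitem__))
    return predictions


def micro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Micro-averaged F1: pool TP, FP and FN counts over all classes."""
    tp = fp = fn = 0
    for c in range(n_classes):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: two hypothetical checkpoints, three tweets, three country labels.
checkpoint_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]
checkpoint_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6], [0.2, 0.7, 0.1]]
gold_labels = [0, 1, 1]

predictions = ensemble_predict([checkpoint_a, checkpoint_b])
score = micro_f1(gold_labels, predictions, n_classes=3)
print(predictions, round(score, 4))
```

In a single-label setting such as this one, where each tweet receives exactly one country label, the micro-averaged F1-score equals plain accuracy, because every misclassification counts once as a false positive and once as a false negative.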
