Paper Title
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
Paper Authors
Paper Abstract
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pretraining these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning has received little consideration in NLP, and especially in multilingual language model pretraining. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than those of existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various downstream NLP tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM generalizes well across various domains. We release our source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
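To make the self-active learning loop described in the abstract more concrete, below is a minimal, runnable Python sketch of one plausible form of such a loop. It assumes a loss-based acquisition criterion (sentences with higher masked-language-modeling loss are treated as more informative); the names (mlm_loss, acquisition_round, self_active_pretraining) and the stand-in scoring are illustrative assumptions, not AfroLM's actual implementation, which is available at the repository linked above.

```python
# Minimal sketch of a self-active learning pretraining loop: the model
# repeatedly scores unlabeled sentences and moves the most informative
# ones into its own training pool. All function names and the scoring
# logic are hypothetical stand-ins, not the paper's implementation.
import random

def mlm_loss(model, sentence):
    # Hypothetical hook: return the model's masked-language-modeling loss
    # on `sentence`. A real implementation would mask tokens and query the
    # LM; here we return a deterministic pseudo-random score so the sketch
    # runs end to end without a trained model.
    return random.Random(sentence).random()

def acquisition_round(model, train_pool, unlabeled_pool, budget):
    # Rank unlabeled sentences by current loss (high loss ~ most
    # informative) and move the top-`budget` into the training pool.
    ranked = sorted(unlabeled_pool, key=lambda s: mlm_loss(model, s), reverse=True)
    return train_pool + ranked[:budget], ranked[budget:]

def self_active_pretraining(model, seed_data, unlabeled_pool, rounds=3, budget=2):
    train_pool = list(seed_data)
    for r in range(rounds):
        # 1) (Re)train the model on the current pool -- omitted in this sketch.
        # 2) Let the model select its own next training samples.
        train_pool, unlabeled_pool = acquisition_round(
            model, train_pool, unlabeled_pool, budget
        )
        print(f"round {r}: {len(train_pool)} training sentences")
    return train_pool

if __name__ == "__main__":
    seed = ["sentence a", "sentence b"]
    pool = [f"candidate sentence {i}" for i in range(10)]
    self_active_pretraining(model=object(), seed_data=seed, unlabeled_pool=pool)
```

In this sketch the acquisition budget per round, the number of rounds, and the choice of MLM loss as the informativeness score are all free design parameters; the key property illustrated is that the model itself, rather than an external annotator, drives which samples enter the next round of pretraining.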