Paper title
The birth of Romanian BERT
Paper authors
Paper abstract
Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.
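Since the abstract highlights that the released repository shows how to fine-tune and use the model in production, a minimal sketch of loading it with the Hugging Face transformers library is given below. The checkpoint identifier dumitrescustefan/bert-base-romanian-cased-v1 is the commonly published name for this model, but verify it against the authors' repository before relying on it.

    # Minimal usage sketch (assumes transformers and PyTorch are installed,
    # and that the checkpoint name below matches the released model).
    from transformers import AutoTokenizer, AutoModel

    model_name = "dumitrescustefan/bert-base-romanian-cased-v1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Encode a Romanian sentence and run a forward pass.
    inputs = tokenizer("Acesta este un test.", return_tensors="pt")
    outputs = model(**inputs)

    # Contextual token embeddings: shape (batch, sequence_length, hidden_size).
    print(outputs.last_hidden_state.shape)

For downstream tasks such as the Romanian datasets evaluated in the paper, the same checkpoint name would typically be passed to a task-specific head (e.g. AutoModelForTokenClassification) before fine-tuning.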