The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification

Paper Title

The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification

Authors

Mihaela Găman, Radu Tudor Ionescu

Abstract

Motivated by the seemingly high accuracy levels of machine learning models in Moldavian versus Romanian dialect identification and the increasing research interest in this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task included two sub-task types: one that consisted of discriminating between the Moldavian and Romanian dialects and one that consisted of classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g. the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates than machine learning (ML) models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, e.g. when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles versus tweets). We also analyze the most discriminative features of the best performing models, providing some explanations for the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the machine learning performance on the MRC shared task can be improved through an ensemble based on stacking.
