Paper Title

ThamizhiUDp: A Dependency Parser for Tamil

Authors

Kengatharaiyer Sarveswaran, Gihan Dias

Abstract

This paper describes how we developed ThamizhiUDp, a neural dependency parser that provides a complete pipeline for dependency parsing of Tamil text using the Universal Dependencies formalism. We considered each phase of the dependency parsing pipeline and identified tools and resources for each phase to improve accuracy and tackle data scarcity. ThamizhiUDp uses Stanza for tokenisation and lemmatisation, ThamizhiPOSt and ThamizhiMorph for generating Part of Speech (POS) and morphological annotations, and uuparser with multilingual training for dependency parsing. ThamizhiPOSt is our POS tagger, which is based on Stanza and trained on the Amrita POS-tagged corpus. It is the current state of the art in Tamil POS tagging, with an F1 score of 93.27. Our morphological analyser, ThamizhiMorph, is a rule-based system with very good coverage of Tamil. Our dependency parser, ThamizhiUDp, was trained using multilingual data. It achieves a Labelled Attachment Score (LAS) of 62.39, 4 points higher than the best previously reported for Tamil dependency parsing. We therefore show that breaking up the dependency parsing pipeline to accommodate existing tools and resources is a viable approach for low-resource languages.
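The LAS figure quoted in the abstract is the percentage of tokens whose predicted head and dependency label both match the gold annotation. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `attachment_scores`, the `Arc` type, and the toy sentence are all hypothetical and chosen only for illustration. It also returns the Unlabelled Attachment Score (UAS), which checks the head alone.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
# Head index 0 denotes the artificial root, as in CoNLL-U files.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold/predicted tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # UAS counts tokens whose head is correct; LAS also requires the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labelled_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * labelled_hits / n


# Hypothetical four-token sentence (illustrative data, not from the paper).
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obl")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 75.00, LAS = 50.00
```

In the toy example the third token has the wrong head and the fourth has the right head but the wrong label, so three of four heads are correct (UAS 75) while only two tokens are fully correct (LAS 50).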
