Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Paper Title

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Authors

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning

Abstract

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza.
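
The pipeline stages listed in the abstract can be sketched in Python as follows. This is a minimal illustration and is not taken from the paper: it assumes the `stanza` package is installed (`pip install stanza`) and that the English models can be downloaded. The helper names `processor_string` and `analyze` are hypothetical, and multi-word token expansion (`mwt`) is left out on the assumption that the English model used here does not require it.

```python
# Processors to run, in pipeline order. The names follow Stanza's documented
# processor identifiers; "mwt" is omitted here (assumption: not needed for English).
PROCESSORS = ("tokenize", "pos", "lemma", "depparse", "ner")


def processor_string(processors=PROCESSORS):
    """Join processor names into the comma-separated form a Pipeline accepts."""
    return ",".join(processors)


def analyze(text, lang="en"):
    """Run a neural pipeline on `text` and return (word rows, entities).

    Hypothetical helper: downloads models for `lang` on first use.
    """
    import stanza  # third-party dependency, imported lazily

    stanza.download(lang)
    nlp = stanza.Pipeline(lang, processors=processor_string())
    doc = nlp(text)

    # One row per syntactic word: surface form, lemma, universal POS tag,
    # morphological features, index of the head word, dependency relation.
    rows = [
        (w.text, w.lemma, w.upos, w.feats, w.head, w.deprel)
        for sentence in doc.sentences
        for w in sentence.words
    ]
    # Named entities as (text, type) pairs.
    entities = [(e.text, e.type) for e in doc.ents]
    return rows, entities


if __name__ == "__main__":
    rows, entities = analyze("Chris Manning teaches at Stanford University.")
    for row in rows:
        print(row)
    print(entities)
```

For another language, passing a different language code to `analyze` (for example `"fr"`) should be enough, though `mwt` would probably need to be added to `PROCESSORS` for languages with multi-word tokens.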
