Paper Title
PySBD: Pragmatic Sentence Boundary Disambiguation
Paper Authors
Paper Abstract
In this paper, we present a rule-based sentence boundary disambiguation Python package that works out of the box for 22 languages. We aim to provide a realistic segmenter that produces logical sentences even when the format and domain of the input text are unknown. We adapt the Golden Rules Set (a language-specific set of sentence boundary exemplars), originally implemented in the Ruby gem pragmatic_segmenter, and port it to Python with additional improvements and functionality. PySBD passes 97.92% of the Golden Rules Set exemplars for English, an improvement of 25% over the next best open-source Python tool.
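To make the idea of rule-based segmentation scored against exemplars concrete, here is a minimal sketch using only the Python standard library. It is not PySBD's rule set or code: the abbreviation list, the `toy_segment` and `pass_rate` names, and the four exemplars are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, no trailing dot).
# A real rule set is far larger and language-specific.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "p", "e.g", "i.e", "vs"}

# A candidate boundary is '.', '!' or '?' followed by whitespace.
CANDIDATE = re.compile(r"([.!?])\s+")


def toy_segment(text):
    """Split text into sentences with two toy rules (illustrative only)."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        words = text[start:match.start(1)].split()
        last = words[-1] if words else ""
        if match.group(1) == ".":
            # Rule 1: a period after a known abbreviation is not a boundary.
            # Rule 2: a period after a single capital letter (an initial) is not a boundary.
            if last.lower() in ABBREVIATIONS or (len(last) == 1 and last.isupper()):
                continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def pass_rate(exemplars):
    """Percentage of (text, expected_sentences) exemplars segmented exactly."""
    passed = sum(1 for text, expected in exemplars if toy_segment(text) == expected)
    return 100.0 * passed / len(exemplars)


# Hypothetical exemplars in the spirit of a golden rule set.
EXEMPLARS = [
    ("Hello World. My name is Jonas.",
     ["Hello World.", "My name is Jonas."]),
    ("My name is Jonas E. Smith. Please turn to p. 55.",
     ["My name is Jonas E. Smith.", "Please turn to p. 55."]),
    ("She has $100.00 in her bag.",
     ["She has $100.00 in her bag."]),
    # The toy rules wrongly split after "a.m.", so this exemplar fails.
    ("At 5 a.m. Mr. Smith went to the bank.",
     ["At 5 a.m. Mr. Smith went to the bank."]),
]
```

With these four exemplars, `pass_rate(EXEMPLARS)` evaluates to 75.0, because the toy rules have no handling for "a.m." before a capitalized word. For the package itself, the published interface is `pysbd.Segmenter(language="en", clean=False).segment(text)`, which returns a list of sentence strings.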