Paper Title

Dealing with Abbreviations in the Slovenian Biographical Lexicon

Authors

Angel Daza, Antske Fokkens, Tomaž Erjavec

Abstract

Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose a method for expanding the identified abbreviations in context and present its results.
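To make the tokenization problem concrete, here is a minimal sketch of how abbreviation periods can break sentence splitting, and why a fixed abbreviation list (one common ad-hoc fix) fails on abbreviations it has not seen. This is not the authors' method. The function names, the sample sentence, and the abbreviations `prof.` and `univ.` are invented for illustration and are not taken from the lexicon.

```python
import re


def naive_sentence_split(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def lexicon_sentence_split(text, abbreviations):
    """Split on period-final tokens unless the token is a known abbreviation.

    `abbreviations` is a hypothetical hand-made list of lowercase
    abbreviations, each including its trailing period.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final period
        sentences.append(" ".join(current))
    return sentences


# Illustrative input: "prof." is in the list, "univ." is an unseen abbreviation.
KNOWN = {"prof.", "dr."}
text = "He studied under prof. Novak in Ljubljana. Later he worked as a univ. lecturer."

naive = naive_sentence_split(text)
with_list = lexicon_sentence_split(text, KNOWN)

print(naive)      # breaks after both "prof." and "univ."
print(with_list)  # "prof." is handled, but "univ." still causes a wrong break
```

The fixed list repairs the break after `prof.` but still splits after the unseen `univ.`, which is the kind of failure the abstract highlights for ad-hoc solutions.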
