Paper Title

Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus

Authors

Sriram Krishnan, Amba Kulkarni, Gérard Huet

Abstract

The Digital Corpus of Sanskrit (DCS) records around 650,000 sentences along with their morphological and lexical tagging. However, inconsistencies in its morphological analysis, and in the provision of crucial information such as the segmented word, call for standardization and validation of this corpus. Automating the validation process requires efficient analyzers that also supply the missing information. The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses. Aligning these two systems would help record the linguistic differences between them, which can be used to update both systems to produce standardized results, and would also provide a Gold corpus tagged with complete morphological and lexical information along with the segmented words. Krishna et al. (2017) aligned 115,000 sentences, considering some of the linguistic differences. As both systems have evolved significantly, the alignment is done again, considering all the remaining linguistic differences between them. This paper describes the modified alignment process in detail and records the additional linguistic differences observed.

Reference: Amrith Krishna, Pavankumar Satuluri, and Pawan Goyal. 2017. A dataset for Sanskrit word segmentation. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 105-114. Association for Computational Linguistics, August.
