Paper Title

UVA Resources for the Biomedical Vocabulary Alignment at Scale in the UMLS Metathesaurus

Authors

Vinh Nguyen, Olivier Bodenreider

Abstract

The construction and maintenance process of the UMLS (Unified Medical Language System) Metathesaurus is time-consuming, costly, and error-prone as it relies on (1) the lexical and semantic processing for suggesting synonymous terms, and (2) the expertise of UMLS editors for curating the suggestions. For improving the UMLS Metathesaurus construction process, our research group has defined a new task called UVA (UMLS Vocabulary Alignment) and generated a dataset for evaluating the task. Our group has also developed different baselines for this task using logical rules (RBA), and neural networks (LexLM and ConLM). In this paper, we present a set of reusable and reproducible resources including (1) a dataset generator, (2) three datasets generated by using the generator, and (3) three baseline approaches. We describe the UVA dataset generator and its implementation generalized for any given UMLS release. We demonstrate the use of the dataset generator by generating datasets corresponding to three UMLS releases, 2020AA, 2021AA, and 2021AB. We provide three UVA baselines using the three existing approaches (LexLM, ConLM, and RBA). The code, the datasets, and the experiments are publicly available, reusable, and reproducible with any UMLS release (a no-cost license agreement is required for downloading the UMLS).
