Paper Title

Token Classification for Disambiguating Medical Abbreviations

Authors

Mucahit Cevik, Sanaz Mohammad Jafari, Mitchell Myers, Savas Yildirim

Abstract

Abbreviations are an unavoidable yet critical part of medical text. Using abbreviations, especially in clinical patient notes, can save time and space, protect sensitive information, and help avoid repetition. However, most abbreviations can have multiple senses, and the lack of a standardized mapping system makes disambiguating them a difficult and time-consuming task. The main objective of this study is to examine the feasibility of token classification methods for medical abbreviation disambiguation. Specifically, we explore the capability of token classification methods to deal with multiple unique abbreviations in a single text. We use two public datasets to compare and contrast the performance of several transformer models pre-trained on different scientific and medical corpora. Our proposed token classification approach outperforms the more commonly used text classification models on the abbreviation disambiguation task. In particular, the SciBERT model shows strong performance on both the token and text classification tasks across the two datasets considered. Furthermore, we find that the abbreviation disambiguation performance of the text classification models becomes comparable to that of token classification only when postprocessing is applied to their predictions, which involves filtering the possible labels for an abbreviation based on the training data.
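The postprocessing step mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the function names `build_sense_inventory` and `filter_prediction`, the dictionary of per-label scores, and the fallback behavior for abbreviations not seen in training are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_sense_inventory(training_pairs):
    """Map each abbreviation to the set of expansions seen with it in training.

    `training_pairs` is an iterable of (abbreviation, expansion) tuples.
    (Hypothetical helper; the paper's data format may differ.)
    """
    inventory = defaultdict(set)
    for abbreviation, expansion in training_pairs:
        inventory[abbreviation].add(expansion)
    return inventory


def filter_prediction(abbreviation, label_scores, inventory):
    """Pick the best-scoring label among those seen with this abbreviation.

    `label_scores` maps every candidate label to a classifier score.
    If the abbreviation is unseen, or none of its known expansions appear
    in `label_scores`, fall back to the unrestricted best label
    (an assumption of this sketch).
    """
    allowed = inventory.get(abbreviation)
    candidates = {
        label: score
        for label, score in label_scores.items()
        if not allowed or label in allowed
    }
    if not candidates:
        candidates = label_scores
    return max(candidates, key=candidates.get)


# Toy training data: "RA" has two senses, "MS" has one.
inventory = build_sense_inventory([
    ("RA", "rheumatoid arthritis"),
    ("RA", "right atrium"),
    ("MS", "multiple sclerosis"),
])

# Unfiltered, the classifier would pick "multiple sclerosis" for "RA".
scores = {
    "multiple sclerosis": 0.5,
    "right atrium": 0.3,
    "rheumatoid arthritis": 0.2,
}
prediction = filter_prediction("RA", scores, inventory)
```

In this toy case the filter discards "multiple sclerosis" because it never occurs with "RA" in the training pairs, so the prediction becomes the best remaining candidate. A token classifier that labels the abbreviation token directly may need this kind of correction less often, which is consistent with the comparison the abstract reports.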
