MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

Paper Title

MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

Authors

Wangjie Jiang, Zhihao Ye, Zijing Ou, Ruihui Zhao, Jianguang Zheng, Yi Liu, Siheng Li, Bang Liu, Yujiu Yang, Yefeng Zheng

Abstract

Chinese Spelling Correction (CSC) is gaining increasing attention due to its promise of automatically detecting and correcting spelling errors in Chinese texts. Despite its extensive use in many applications, such as search engines and optical character recognition systems, it has been little explored in medical scenarios, in which complex and uncommon medical entities are easily misspelled. Correcting misspelled medical entities is arguably more difficult than correcting open-domain misspellings because it requires domain-specific knowledge. In this work, we define the task of Medical-domain Chinese Spelling Correction and propose MCSCSet, a large-scale specialist-annotated dataset that contains about 200k samples. In contrast to existing open-domain CSC datasets, MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists. To ensure automated dataset curation, MCSCSet further offers a medical confusion set consisting of the commonly misspelled characters of given Chinese medical terms. This enables one to create medical misspelling datasets automatically. Extensive empirical studies show significant performance gaps between open-domain and medical-domain spelling correction, highlighting the need for high-quality datasets that support Chinese spelling correction in specific domains. Moreover, our work benchmarks several representative Chinese spelling correction models, establishing baselines for future work.
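
The abstract notes that a confusion set makes it possible to create misspelled data automatically. A minimal sketch of that general idea follows, assuming the confusion set can be represented as a mapping from a correct character to its commonly confused characters. The entries in `TOY_CONFUSION_SET`, the `corrupt` function, and the `rate` parameter are hypothetical illustrations, not the MCSCSet data or the authors' procedure.

```python
import random

# Hypothetical toy confusion set (NOT taken from MCSCSet): each correct
# character maps to characters it might plausibly be mistyped as.
TOY_CONFUSION_SET = {
    "炎": ["严", "言"],
    "症": ["证", "正"],
}


def corrupt(sentence, confusion_set, rate, rng):
    """Substitute confusable characters in place.

    Returns the noisy sentence and the indices that were changed.
    Substitutions are one-to-one, so the sentence length is preserved.
    """
    chars = list(sentence)
    positions = []
    for i, ch in enumerate(chars):
        candidates = confusion_set.get(ch)
        # Only characters listed in the confusion set can be corrupted.
        if candidates and rng.random() < rate:
            chars[i] = rng.choice(candidates)
            positions.append(i)
    return "".join(chars), positions


rng = random.Random(0)  # seeded so repeated runs give the same pairs
clean = "胃炎的症状"
noisy, positions = corrupt(clean, TOY_CONFUSION_SET, rate=1.0, rng=rng)
print(clean, noisy, positions)
```

Each `(noisy, clean)` pair produced this way could serve as one training sample for a correction model. Keeping the substitution one-to-one means error positions align character by character between the two sentences.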
