Classifying Cyber-Risky Clinical Notes by Employing Natural Language Processing

Paper Title

Classifying Cyber-Risky Clinical Notes by Employing Natural Language Processing

Authors

Schmeelk, Suzanna; Dogo, Martins Samuel; Peng, Yifan; Patra, Braja Gopal

Abstract

Clinical notes, which can be embedded into electronic medical records, document patient care delivery and summarize interactions between healthcare providers and patients. These clinical notes directly inform patient care and can also indirectly inform research and quality/safety metrics, among other uses. Recently, some states within the United States of America have required that patients have open access to their clinical notes to improve the exchange of patient information for patient care. Thus, developing methods to assess the cyber risks of clinical notes before sharing and exchanging data is critical. While existing natural language processing techniques are geared toward de-identifying clinical notes, to the best of our knowledge, few have focused on classifying sensitive-information risk, which is a fundamental step toward developing effective, widespread protection of patient health information. To bridge this gap, this research investigates methods for identifying security/privacy risks within clinical notes. The classification can be used either upstream, to identify areas within notes that likely contain sensitive information, or downstream, to improve the identification of clinical notes that have not been entirely de-identified. We develop several models using unigram and word2vec features with different classifiers to categorize sentence risk. Experiments on the i2b2 de-identification dataset show that the SVM classifier using word2vec features obtained a maximum F1-score of 0.792. Future research involves articulation and differentiation of risk in terms of different global regulatory requirements.
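
The sentence-level setup described in the abstract (unigram features, a linear classifier, F1 evaluation) can be illustrated with a minimal sketch. This is not the authors' implementation. The four sentences are synthetic stand-ins for i2b2 data, the function names and hyperparameters are hypothetical, and a small hinge-loss linear SVM over unigram counts is used in place of the word2vec-based SVM that produced the reported 0.792.

```python
from collections import Counter


def unigram_features(sentence):
    """Return bag-of-words counts over lowercased whitespace tokens."""
    return Counter(sentence.lower().split())


def decision_score(weights, bias, features):
    """Linear decision function w . x + b over sparse unigram counts."""
    return bias + sum(weights.get(tok, 0.0) * n for tok, n in features.items())


def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    `examples` is a list of (features, label) pairs with labels in {+1, -1}.
    The hyperparameters are illustrative assumptions, not values from the paper.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            margin = label * decision_score(weights, bias, features)
            # Regularization step: shrink every weight toward zero.
            for tok in weights:
                weights[tok] *= 1.0 - lr * lam
            # Hinge-loss step: update only when the margin is violated.
            if margin < 1.0:
                for tok, n in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * n
                bias += lr * label
    return weights, bias


def predict(weights, bias, sentence):
    """Label a sentence +1 (risky) or -1 (not risky)."""
    score = decision_score(weights, bias, unigram_features(sentence))
    return 1 if score >= 0 else -1


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic sentences (assumption): +1 marks sensitive content, -1 does not.
sentences = [
    "patient john doe phone 555-0100",
    "mrn 12345 admitted to mercy hospital",
    "patient reports mild fever and cough",
    "blood pressure stable after medication",
]
labels = [1, 1, -1, -1]

train = [(unigram_features(s), y) for s, y in zip(sentences, labels)]
weights, bias = train_linear_svm(train)
predictions = [predict(weights, bias, s) for s in sentences]
print("training F1:", f1_score(labels, predictions))
```

The score printed here is measured on the training sentences only, so it says nothing about generalization. A realistic evaluation would use held-out data, as in the i2b2 experiments the abstract reports.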
