Paper title

Which anonymization technique is best for which NLP task? -- It depends. A Systematic Study on Clinical Text Processing

Authors

Iyadh Ben Cheikh Larbi, Aljoscha Burchardt, Roland Roller

Abstract

Clinical text processing has gained increasing attention in recent years. Access to sensitive patient data, on the other hand, remains a major challenge, because text cannot be shared without facing legal hurdles and without removing personal information. There are many techniques to modify or remove patient-related information, each with different strengths. This paper investigates the influence of different anonymization techniques on the performance of ML models, using multiple datasets corresponding to five different NLP tasks. Several lessons learned and recommendations are presented. This work confirms that stronger anonymization techniques in particular lead to a significant drop in performance. In addition, most of the presented techniques are not secure against a re-identification attack based on similarity search.
