Paper Title
A Privacy-Preserving Unsupervised Domain Adaptation Framework for Clinical Text Analysis
Paper Authors
Abstract
Unsupervised domain adaptation (UDA) generally aligns the unlabeled target domain data to the distribution of the source domain to mitigate the distribution shift problem. Standard UDA requires sharing the source data with the target, which poses potential data privacy leakage risks. To protect the privacy of the source data, we first propose sharing the source feature distribution instead of the source data. However, sharing only the source feature distribution may still be vulnerable to membership inference attacks, which can infer an individual's membership through black-box access to the source model. To resolve this privacy issue, we further study the under-explored problem of privacy-preserving domain adaptation and propose a method with a novel differentially private training strategy to protect the source data privacy. We model the source feature distribution with Gaussian Mixture Models (GMMs) under the differential privacy setting and send it to the target client for adaptation. The target client resamples differentially private source features from the GMMs and adapts on target data with several state-of-the-art UDA backbones. With our proposed method, the source data provider can avoid leaking source data privacy during domain adaptation while preserving utility. To evaluate our proposed method's utility and privacy loss, we apply it to a medical report disease label classification task using two noisy, challenging clinical text datasets. The results show that our proposed method can preserve the source data's privacy with only a minor performance impact on the text classification task.
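The share-and-resample step described above can be illustrated with a minimal sketch: the source side fits a GMM to its feature vectors and shares only the GMM parameters, and the target side draws surrogate features from that GMM for adaptation. This sketch uses scikit-learn's `GaussianMixture` on synthetic data; the paper's differentially private training strategy and UDA backbones are not reproduced here, so the feature matrix and component count are purely illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical stand-in for source-domain feature vectors produced by a
# text encoder (the paper's DP training of the GMM is omitted here).
source_features = rng.normal(size=(500, 16))

# Source side: fit a GMM to the feature distribution and share only its
# parameters (weights, means, covariances) rather than the raw features.
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(source_features)

# Target side: resample surrogate "source" features from the shared GMM
# and use them in place of real source data during UDA training.
synthetic_source, _ = gmm.sample(n_samples=500)
print(synthetic_source.shape)  # (500, 16)
```

In the paper's setting, noise calibrated to a differential privacy budget is injected while estimating the GMM parameters, so that even the shared distribution resists membership inference; the resampling step on the target side is unchanged.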