Generating Privacy-Preserving Process Data with Deep Generative Models

Paper Title

Generating Privacy-Preserving Process Data with Deep Generative Models

Authors

Keyi Li, Sen Yang, Travis M. Sullivan, Randall S. Burd, Ivan Marsic

Abstract

Process data containing confidential information cannot be shared publicly, which hinders research in process data mining and analytics. Data encryption methods have been studied as a way to protect such data, but encrypted data may still be decrypted, which can lead to the identification of individuals. We experimented with different representation learning models and used the learned model to generate synthetic process data. We introduced a generative adversarial network for process data generation (ProcessGAN) that uses two Transformer networks, one as the generator and one as the discriminator. We evaluated ProcessGAN and traditional models on six real-world datasets, of which two are public and four were collected in medical domains. We used statistical metrics and supervised learning scores to evaluate the synthetic data. We also used process mining to discover workflows from the authentic and synthetic datasets and had medical experts evaluate the clinical applicability of the synthetic workflows. We found that ProcessGAN outperformed traditional sequential models when trained on small authentic datasets of complex processes. ProcessGAN better represented the long-range dependencies between activities, which is important for complicated processes such as medical processes. Traditional sequential models performed better when trained on large datasets of simple processes. We conclude that ProcessGAN can generate a large amount of sharable synthetic process data that is indistinguishable from authentic data.
