Paper Title
MC-GEN: Multi-level Clustering for Private Synthetic Data Generation
Authors
Abstract
With the development of machine learning and data science, data sharing between companies and research institutes has become common as a way to mitigate data scarcity. However, sharing original datasets that contain private information can lead to privacy leakage. A reliable solution is to use private synthetic datasets that preserve the statistical information of the original datasets. In this paper, we propose MC-GEN, a privacy-preserving synthetic data generation method with a differential privacy guarantee for machine learning classification tasks. MC-GEN applies multi-level clustering and a differentially private generative model to improve the utility of synthetic data. In the experimental evaluation, we assess the effects of its parameters and the effectiveness of MC-GEN. The results show that MC-GEN can be highly effective under certain privacy guarantees on multiple classification tasks. Moreover, we compare MC-GEN with three existing methods, and the results show that MC-GEN outperforms them in terms of utility.