Paper Title
Privacy for Free: How does Dataset Condensation Help Privacy?
Paper Authors
Paper Abstract
To prevent unintentional data leakage, the research community has resorted to data generators that produce differentially private data for model training. However, for the sake of data privacy, existing solutions suffer from either expensive training costs or poor generalization performance. We therefore ask whether training efficiency and privacy can be achieved simultaneously. In this work, we identify for the first time that dataset condensation (DC), originally designed to improve training efficiency, is also a better replacement for traditional data generators in private data generation, thus providing privacy for free. To demonstrate the privacy benefit of DC, we build a connection between DC and differential privacy, and prove theoretically for linear feature extractors (later extended to non-linear feature extractors) that the existence of one sample has limited impact ($O(m/n)$) on the parameter distribution of networks trained on $m$ samples synthesized by DC from $n$ ($n \gg m$) raw samples. We also empirically validate the visual privacy and membership privacy of DC-synthesized data by launching both loss-based and state-of-the-art likelihood-based membership inference attacks. We envision this work as a milestone for data-efficient and privacy-preserving machine learning.
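The core theoretical claim can be read as a differential-privacy-style stability bound. The LaTeX sketch below shows one way the $O(m/n)$ statement maps onto the standard $\varepsilon$-DP definition; the notation ($D$, $D'$, $S(\cdot)$, $\theta$) is illustrative, and the precise theorem statement and constants are in the paper itself.

% Illustrative mapping of the abstract's O(m/n) claim onto epsilon-DP.
% D and D' are neighboring raw datasets differing in one sample,
% S(D) is the m-sample synthetic set produced by DC from the n raw
% samples, and theta denotes the parameters of a network trained on S(D).
\[
  \left| \log \frac{\Pr[\theta \mid S(D)]}{\Pr[\theta \mid S(D')]} \right|
  \;\le\; \varepsilon,
  \qquad \varepsilon = O\!\left(\frac{m}{n}\right), \quad n \gg m .
\]
% Adding or removing one raw sample thus perturbs the distribution over
% trained parameters by at most O(m/n), which vanishes as the raw
% dataset grows relative to the condensed set.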
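On the empirical side, a loss-based membership inference attack (in the style of Yeom et al., 2018) simply thresholds the per-sample loss: training members tend to have lower loss than non-members. The Python sketch below uses hypothetical loss arrays and an illustrative threshold heuristic; it is a minimal stand-in, not the paper's exact attack implementation.

# Minimal sketch of a loss-based membership inference attack.
# All distributions and the threshold heuristic here are illustrative.
import numpy as np

def loss_based_mia(member_losses: np.ndarray,
                   nonmember_losses: np.ndarray,
                   threshold: float) -> float:
    """Predict 'member' when the per-sample loss falls below `threshold`;
    return the attack's balanced accuracy (0.5 = no membership leakage)."""
    tpr = np.mean(member_losses < threshold)      # members correctly flagged
    tnr = np.mean(nonmember_losses >= threshold)  # non-members correctly rejected
    return 0.5 * (tpr + tnr)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical per-sample losses: members (seen in training) tend
    # to have lower loss than non-members.
    member = rng.gamma(shape=1.0, scale=0.5, size=1000)
    nonmember = rng.gamma(shape=2.0, scale=0.5, size=1000)
    # A common heuristic: calibrate the threshold to the mean training loss.
    acc = loss_based_mia(member, nonmember, threshold=member.mean())
    print(f"attack balanced accuracy: {acc:.3f}")  # well above 0.5 here

An accuracy near 0.5 indicates the attacker cannot distinguish members from non-members, which is the behavior the paper reports for models trained on DC-synthesized data.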