Paper title
Enabling Synthetic Data adoption in regulated domains
Paper authors
Paper abstract
The shift from a Model-Centric to a Data-Centric mindset puts the emphasis on data and its quality rather than on algorithms, and brings new challenges with it. In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for. Specific approaches to the privacy issue have been developed, known as Privacy Enhancing Technologies. However, they frequently cause a loss of information, which creates a crucial trade-off between data quality and privacy. A clever way to bypass this conundrum relies on Synthetic Data: data obtained from a generative process that learns the properties of the real data. Both Academia and Industry have realized the importance of evaluating synthetic data quality: without all-round, reliable metrics, the innovative task of data generation has no proper objective function to maximize. Despite that, the topic remains under-explored. For this reason, we systematically catalog the important traits of synthetic data quality and privacy, and devise a specific methodology to test them. The result is DAISYnt (aDoption of Artificial Intelligence SYnthesis): a comprehensive suite of advanced tests, which sets a de facto standard for synthetic data evaluation. As a practical use case, a variety of generative algorithms have been trained on real-world Credit Bureau Data. The best model has been assessed by running DAISYnt on the different synthetic replicas. Further potential uses include, among others, auditing and fine-tuning generative models and ensuring the high quality of a given synthetic dataset. From a prescriptive viewpoint, DAISYnt may eventually pave the way to synthetic data adoption in highly regulated domains, from Finance and Insurance to Healthcare and Education.
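The abstract does not say which tests DAISYnt contains, so the following is only an illustrative sketch of the two kinds of checks it alludes to: one for quality and one for privacy. It is not DAISYnt's actual method. The function names `ks_statistic` and `min_distance_to_real` are hypothetical, and the choice of a two-sample Kolmogorov-Smirnov statistic and a distance-to-closest-record check is an assumption made for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synth):
    """Quality check (assumed, not from the paper): two-sample
    Kolmogorov-Smirnov statistic, i.e. the largest gap between the
    empirical CDFs of a real and a synthetic numeric column.
    0.0 means identical empirical distributions, 1.0 means disjoint."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # The gap between two step functions can only change at a data point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def min_distance_to_real(real_rows, synth_rows):
    """Privacy check (assumed, not from the paper): for each synthetic
    row, the Euclidean distance to its closest real row. A distance of
    0.0 means the generator reproduced a real record exactly."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    return [min(dist(s, r) for r in real_rows) for s in synth_rows]


# Toy data standing in for one numeric column (hypothetical values).
rng = random.Random(0)
real_column = [rng.gauss(0, 1) for _ in range(500)]
faithful_column = [rng.gauss(0, 1) for _ in range(500)]  # same distribution
shifted_column = [rng.gauss(2, 1) for _ in range(500)]   # wrong mean

print("KS, faithful generator:", ks_statistic(real_column, faithful_column))
print("KS, shifted generator: ", ks_statistic(real_column, shifted_column))
```

In this sketch a low KS statistic suggests the synthetic column preserves the real distribution, while very small nearest-record distances suggest the generator may be leaking real records. The two checks pull in opposite directions, which is the quality versus privacy trade-off the abstract describes.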