Paper Title
Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems
Paper Authors
Paper Abstract
The reliability and proper function of data-driven applications hinge on the data's continued conformance to the applications' initial design. When data deviates from this initial profile, system behavior becomes unpredictable. Data profiling techniques such as functional dependencies and denial constraints encode patterns in the data that can be used to detect deviations. But traditional methods typically focus on exact constraints and categorical attributes, and are ill-suited for tasks such as determining whether the prediction of a machine learning system can be trusted or quantifying data drift. In this paper, we introduce data invariants, a new data-profiling primitive that models arithmetic relationships involving multiple numerical attributes within a (noisy) dataset and complements existing data-profiling techniques. We propose a quantitative semantics to measure the degree of violation of a data invariant, and establish that strong data invariants can be constructed from observations with low variance on the given dataset. A concrete instance of this principle gives the surprising result that low-variance components of a principal component analysis (PCA), which are usually discarded, generate better invariants than the high-variance components. We demonstrate the value of data invariants on two applications: trusted machine learning and data drift. We empirically show that data invariants can (1) reliably detect tuples on which the prediction of a machine-learned model should not be trusted, and (2) quantify data drift more accurately than state-of-the-art methods. Additionally, we present four case studies in which an intervention-centric explanation tool uses data invariants to explain the causes of tuple non-conformance.
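To make the PCA-based idea from the abstract concrete, the following is a minimal, illustrative sketch, not the paper's actual algorithm: it treats the low-variance principal components of a reference dataset as approximate arithmetic invariants and scores how strongly a new tuple violates them. It assumes NumPy and scikit-learn; the class name `LowVarianceInvariant`, its parameters, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

class LowVarianceInvariant:
    """Sketch: derive approximate linear invariants from the LOW-variance
    principal components of a reference dataset, and compute a per-tuple
    violation score for new data."""

    def __init__(self, n_components=1):
        self.n_components = n_components  # how many low-variance components to keep

    def fit(self, X):
        # Fit PCA on the numerical attributes of the reference data.
        self.pca = PCA().fit(X)
        # Keep the components with the SMALLEST explained variance: projections
        # onto them are nearly constant on conforming data, so they act as
        # approximate arithmetic invariants over the attributes.
        order = np.argsort(self.pca.explained_variance_)
        self.directions = self.pca.components_[order[:self.n_components]]
        proj = (X - self.pca.mean_) @ self.directions.T
        self.center = proj.mean(axis=0)
        self.scale = proj.std(axis=0) + 1e-12  # guard against division by zero
        return self

    def violation(self, X_new):
        # Degree of violation: standardized deviation of the projections from
        # their near-constant values on the reference data.
        proj = (X_new - self.pca.mean_) @ self.directions.T
        z = np.abs(proj - self.center) / self.scale
        return z.mean(axis=1)  # larger score = less conforming tuple

# Hypothetical usage: column 4 of the reference data hides a linear relationship.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(1000, 5))
X_ref[:, 4] = X_ref[:, 0] + X_ref[:, 1] + 0.01 * rng.normal(size=1000)

inv = LowVarianceInvariant(n_components=1).fit(X_ref)

X_drifted = rng.normal(size=(5, 5))       # the hidden relationship no longer holds
print(inv.violation(X_ref[:5]))           # small scores: tuples conform
print(inv.violation(X_drifted))           # large scores: tuples violate the invariant
```

In this toy setting, tuples that preserve the hidden relationship among attributes receive small violation scores, while tuples that break it score orders of magnitude higher, which illustrates the abstract's claim that low-variance PCA components, normally discarded, can yield strong invariants usable for trusted machine learning and drift quantification.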