On Explaining Confounding Bias

Paper title

On Explaining Confounding Bias

Authors

Brit Youngmann, Michael Cafarella, Yuval Moskovitch, Babak Salimi

Abstract

When analyzing large datasets, analysts are often interested in the explanations for surprising or unexpected results produced by their queries. In this work, we focus on aggregate SQL queries that expose correlations in the data. A major challenge that hinders the interpretation of such queries is confounding bias, which can lead to an unexpected correlation. We generate explanations in terms of a set of confounding variables that explain the unexpected correlation observed in a query. We propose to mine candidate confounding variables from external sources since, in many real-life scenarios, the explanations are not solely contained in the input data. We present an efficient algorithm that finds the optimal subset of attributes (mined from external sources and the input dataset) that explain the unexpected correlation. This algorithm is embodied in a system called MESA. We demonstrate experimentally over multiple real-life datasets and through a user study that our approach generates insightful explanations, outperforming existing methods that search for explanations only in the input data. We further demonstrate the robustness of our system to missing data and the ability of MESA to handle input datasets containing millions of tuples and an extensive search space of candidate confounding attributes.
