Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

Title

Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

Authors

Sérgio Jesus, José Pombal, Duarte Alves, André Cruz, Pedro Saleiro, Rita P. Ribeiro, João Gama, Pedro Bizarro

Abstract

Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available, privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both the performance and the fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
