Paper Title
RealPatch: A Statistical Matching Framework for Model Patching with Real Samples
Paper Authors
Paper Abstract
Machine learning classifiers are typically trained to minimise the average error across a dataset. Unfortunately, in practice, this process often exploits spurious correlations caused by subgroup imbalance within the training data, resulting in high average performance but highly variable performance across subgroups. Recent work to address this problem proposes model patching with CAMEL. This previous approach uses generative adversarial networks to perform intra-class inter-subgroup data augmentations, requiring (a) the training of a number of computationally expensive models and (b) sufficient quality of the models' synthetic outputs for the given domain. In this work, we propose RealPatch, a framework for simpler, faster, and more data-efficient data augmentation based on statistical matching. Our framework performs model patching by augmenting a dataset with real samples, mitigating the need to train generative models for the target task. We demonstrate the effectiveness of RealPatch on three benchmark datasets, CelebA, Waterbirds and a subset of iWildCam, showing improvements in worst-case subgroup performance and in the subgroup performance gap in binary classification. Furthermore, we conduct experiments with the imSitu dataset with 211 classes, a setting where generative model-based patching such as CAMEL is impractical. We show that RealPatch can successfully eliminate dataset leakage while reducing model leakage and maintaining high utility. The code for RealPatch can be found at https://github.com/wearepal/RealPatch.
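To make the idea of augmenting with real samples through statistical matching more concrete, the following is a minimal sketch and not the RealPatch implementation. It assumes a binary subgroup attribute, precomputed feature vectors, and plain nearest-neighbour matching in Euclidean space with an optional distance threshold (a caliper). The function name `match_across_subgroups` and all of its parameters are hypothetical and chosen only for illustration; the actual matching procedure is the one in the linked repository.

```python
import numpy as np


def match_across_subgroups(features, labels, groups, caliper=None):
    """Pair samples across two subgroups within each class.

    Hypothetical helper for illustration only (not the RealPatch code).
    For every sample in subgroup 0, find the nearest real sample of the
    same class in subgroup 1 (matching with replacement).
    Returns a list of (index_in_group_0, index_in_group_1) pairs.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    groups = np.asarray(groups)

    pairs = []
    for y in np.unique(labels):
        idx_a = np.where((labels == y) & (groups == 0))[0]
        idx_b = np.where((labels == y) & (groups == 1))[0]
        if len(idx_a) == 0 or len(idx_b) == 0:
            continue  # this class has only one subgroup, nothing to match

        # Pairwise Euclidean distances between the two subgroups.
        diff = features[idx_a][:, None, :] - features[idx_b][None, :, :]
        dist = np.linalg.norm(diff, axis=2)

        nearest = dist.argmin(axis=1)
        for row, col in enumerate(nearest):
            # Assumed caliper rule: drop pairs that are too far apart.
            if caliper is None or dist[row, col] <= caliper:
                pairs.append((int(idx_a[row]), int(idx_b[col])))
    return pairs


# Toy data: 5 samples, 2 classes, binary subgroup attribute.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [1.0, 1.0], [1.1, 1.0]]
labels = [0, 0, 0, 1, 1]
groups = [0, 1, 1, 0, 1]

pairs = match_across_subgroups(features, labels, groups)
print(pairs)
```

In this sketch, the matched pairs are real samples that share a class but differ in subgroup, so they could play the role that synthetic GAN-generated counterparts play in generative patching. How the pairs are then used during training is left out here, since the abstract does not describe it.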