Paper Title
Coverage-centric Coreset Selection for High Pruning Rates
Authors
Abstract
One-shot coreset selection aims to select a representative subset of the training data, given a pruning rate, that can later be used to train future models while retaining high accuracy. State-of-the-art coreset selection methods pick the highest importance examples based on an importance metric and are found to perform well at low pruning rates. However, at high pruning rates, they suffer from a catastrophic accuracy drop, performing worse than even random sampling. This paper explores the reasons behind this accuracy drop both theoretically and empirically. We first propose a novel metric to measure the coverage of a dataset on a specific distribution by extending the classical geometric set cover problem to a distribution cover problem. This metric helps explain why coresets selected by SOTA methods at high pruning rates perform poorly compared to random sampling because of worse data coverage. We then propose a novel one-shot coreset selection method, Coverage-centric Coreset Selection (CCS), that jointly considers overall data coverage upon a distribution as well as the importance of each example. We evaluate CCS on five datasets and show that, at high pruning rates (e.g., 90%), it achieves significantly better accuracy than previous SOTA methods (e.g., at least 19.56% higher on CIFAR10) as well as random selection (e.g., 7.04% higher on CIFAR10) and comparable accuracy at low pruning rates. We make our code publicly available at https://github.com/haizhongzheng/Coverage-centric-coreset-selection.
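The contrast the abstract draws, between keeping only the highest-importance examples and also preserving coverage of the data distribution, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). The function names, the use of equal-width score bins as a stand-in for "coverage", and the `num_bins` and `seed` parameters are all assumptions made for illustration.

```python
import random


def top_importance_select(scores, pruning_rate):
    """Baseline from the abstract: keep only the highest-importance examples."""
    budget = round(len(scores) * (1 - pruning_rate))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


def stratified_select(scores, pruning_rate, num_bins=10, seed=0):
    """Hypothetical coverage-aware selection: spread the budget over score bins.

    Assumption: coverage is approximated by sampling from every
    equal-width bin of the importance score, not only from the top.
    """
    rng = random.Random(seed)
    budget = round(len(scores) * (1 - pruning_rate))
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all scores are equal

    # Group example indices by score bin.
    bins = [[] for _ in range(num_bins)]
    for i, s in enumerate(scores):
        bins[min(int((s - lo) / width), num_bins - 1)].append(i)

    selected = []
    # Visit the smallest bins first so their unused budget flows to larger bins.
    remaining = sorted((b for b in bins if b), key=len)
    while remaining:
        share = (budget - len(selected)) // len(remaining)
        current = remaining.pop(0)
        selected.extend(rng.sample(current, min(share, len(current))))
    return sorted(selected)


if __name__ == "__main__":
    scores = [i / 100 for i in range(100)]  # toy importance scores
    print(top_importance_select(scores, 0.9))  # only the top-scored indices
    print(stratified_select(scores, 0.9))      # indices spread over the score range
```

At a 90% pruning rate on this toy input, the baseline keeps only the ten highest-scored examples, while the binned variant keeps ten examples drawn from across the whole score range. The sketch shows only the selection step and makes no claim about downstream accuracy.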