Margin-based sampling in high dimensions: When being active is less efficient than staying passive

Paper title

Margin-based sampling in high dimensions: When being active is less efficient than staying passive

Authors

Alexandru Tifrea, Jacob Clarysse, Fanny Yang

Abstract

It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
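
To make the two sampling strategies compared in the abstract concrete, here is a minimal sketch of how each one picks the next point to label when the classifier is linear. It is an illustration only, not the authors' code: the function names `margin_query` and `passive_query`, the toy pool, and the weight vector `w` are all assumptions introduced here.

```python
import random


def margin_query(w, pool):
    """Margin-based sampling: index of the pool point closest to the
    hyperplane {x : <w, x> = 0}. Dividing by ||w|| would give the true
    distance, but it does not change which point attains the minimum."""
    def abs_margin(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)))
    return min(range(len(pool)), key=lambda i: abs_margin(pool[i]))


def passive_query(pool, rng):
    """Passive learning baseline: a uniformly random pool index."""
    return rng.randrange(len(pool))


# Toy unlabeled pool in 2 dimensions (made-up numbers, for illustration only).
pool = [(2.0, 1.0), (0.1, -0.2), (-1.5, 0.5), (0.6, 0.7)]
w = (1.0, 1.0)  # hypothetical linear classifier; sign(<w, x>) is the prediction

active_idx = margin_query(w, pool)                    # point nearest the boundary
passive_idx = passive_query(pool, random.Random(0))   # uniform draw from the pool
```

In a full active learning loop, the queried point would be labeled, added to the training set, and the classifier refit before the next query. The abstract's theoretical result concerns the case where `w` is the Bayes optimal decision boundary rather than an estimate.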
