Title
On Non-Random Missing Labels in Semi-Supervised Learning
Authors
Abstract
Semi-Supervised Learning (SSL) is fundamentally a missing-label problem, in which the label Missing Not At Random (MNAR) setting is more realistic and challenging than the widely adopted yet naive Missing Completely At Random assumption, under which labeled and unlabeled data share the same class distribution. Different from existing SSL solutions that overlook the role of "class" in causing the non-randomness, e.g., users are more likely to label popular classes, we explicitly incorporate "class" into SSL. Our method is three-fold: 1) We propose Class-Aware Propensity (CAP), which exploits the unlabeled data to train an improved classifier from the biased labeled data. 2) To encourage training on rare classes, whose classifiers tend to be low-recall but high-precision and thus discard too many pseudo-labeled data, we propose Class-Aware Imputation (CAI), which dynamically decreases (or increases) the pseudo-label assignment threshold for rare (or frequent) classes. 3) Overall, we integrate CAP and CAI into a Class-Aware Doubly Robust (CADR) estimator for training an unbiased SSL model. Under various MNAR settings and ablations, our method not only significantly outperforms existing baselines but also surpasses other label-bias-removal SSL methods. Please check our code at: https://github.com/JoyHuYY1412/CADR-FixMatch.
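To make the CAI idea in point 2) concrete, here is a minimal, hypothetical sketch of class-aware pseudo-label thresholding: the confidence threshold is scaled per class by an estimated class frequency, so rare classes get a lower bar and frequent classes a higher one. The function names, the use of the average predicted class probability as the frequency estimate, and the linear scaling rule are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def class_aware_thresholds(class_freq, base_tau=0.95):
    """Sketch of CAI-style thresholding (assumed form, not the paper's
    exact rule): scale the base confidence threshold per class by the
    class frequency relative to the most frequent class, so rare
    classes receive lower pseudo-label assignment thresholds.

    class_freq: shape (num_classes,), e.g. the model's average
    predicted probability per class on unlabeled data.
    """
    rel_freq = class_freq / class_freq.max()
    # Rare classes (small rel_freq) get thresholds below base_tau;
    # the most frequent class keeps base_tau itself.
    return base_tau * rel_freq

def select_pseudo_labels(logits, thresholds):
    """Keep unlabeled samples whose max softmax probability exceeds
    the class-specific threshold of their predicted class; return
    (kept_indices, pseudo_labels)."""
    # Numerically stable softmax over the class dimension.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    mask = conf >= thresholds[preds]
    return np.nonzero(mask)[0], preds[mask]
```

A confidently predicted sample of a frequent class must clear a high threshold, while a moderately confident prediction for a rare class can still be imputed, counteracting the rare classes' low recall described above.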