Paper Title
Self-Training: A Survey
Paper Authors
Paper Abstract
Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, these methods have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary in low-density regions without making additional assumptions about the data distribution, and they use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the unlabeled training samples whose margin is greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data, and a new classifier is trained on this augmented labeled set. In this paper, we present self-training methods for binary and multi-class classification, as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.
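
To make the iterative pseudo-labelling loop described in the abstract concrete, here is a minimal sketch in Python. It assumes a scikit-learn-style probabilistic base classifier and a fixed confidence threshold; the function name `self_training`, the choice of `LogisticRegression`, and the parameters `theta` and `max_iter` are illustrative assumptions, not the survey's reference implementation, and the survey discusses many variants (e.g. class-dependent or dynamic thresholds) that this sketch does not reproduce.

```python
# Minimal self-training sketch (illustrative; names and threshold are assumptions).
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def self_training(X_l, y_l, X_u, theta=0.9, max_iter=10):
    """Iteratively pseudo-label high-confidence unlabeled samples and retrain."""
    base = LogisticRegression(max_iter=1000)   # any probabilistic classifier works
    clf = clone(base).fit(X_l, y_l)
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)               # confidence score per unlabeled sample
        mask = conf >= theta                   # keep only samples above the threshold
        if not mask.any():                     # nothing confident enough: stop
            break
        # Enrich the labeled set with the pseudo-labeled examples...
        X_l = np.vstack([X_l, X_u[mask]])
        y_l = np.concatenate([y_l, clf.predict(X_u[mask])])
        X_u = X_u[~mask]
        # ...and train a new classifier on the augmented labeled set.
        clf = clone(base).fit(X_l, y_l)
    return clf
```

A higher threshold generally trades coverage of the unlabeled set for pseudo-label precision; this threshold is one of the self-training features whose impact the survey examines empirically.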