Paper Title
Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media
Authors
Abstract
Recent advances in natural language processing (NLP) in online social media are evidently owed to large-scale datasets. However, labeling, storing, and processing large numbers of textual data points, e.g., tweets, remains challenging. On top of that, in applications such as hate speech detection, labeling a sufficiently large dataset containing offensive content can be mentally and emotionally taxing for human annotators. Thus, NLP methods that make the best use of significantly fewer labeled data points are of great interest. In this paper, we present a novel pool-based active learning method that can train models on large unlabeled corpora with minimum annotation cost. To this end, we propose to find the dominant sets of local clusters in the feature space. These sets represent maximally cohesive structures in the data. Samples that do not belong to any dominant set are then selected for annotation and used to train the model, as they lie on the boundaries of the local clusters and are more challenging to classify. Our proposed method has no parameters to tune, making it dataset-independent, and it can achieve approximately the same classification accuracy as training on the full data while using significantly fewer data points. Additionally, our method outperforms state-of-the-art active learning strategies. Furthermore, the proposed algorithm can incorporate conventional active learning scores, such as uncertainty-based scores, into its selection criterion. We demonstrate the effectiveness of our method on different datasets and with different neural network architectures.
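The selection strategy the abstract describes — extracting dominant sets of local clusters and querying the leftover boundary samples — can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a Gaussian-style affinity over raw features and the standard replicator-dynamics machinery for dominant sets (Pavan and Pelillo); the function names, the affinity choice, and the membership threshold are all illustrative.

```python
import numpy as np

def dominant_set(similarity, tol=1e-6, max_iter=1000):
    """Extract one dominant set via replicator dynamics.

    similarity: (n, n) symmetric non-negative affinity matrix, zero diagonal.
    Returns a boolean membership mask (threshold is a heuristic choice).
    """
    n = similarity.shape[0]
    x = np.full(n, 1.0 / n)  # start from the barycenter of the simplex
    for _ in range(max_iter):
        x_new = x * (similarity @ x)  # replicator update
        s = x_new.sum()
        if s == 0:
            break
        x_new /= s
        if np.abs(x_new - x).sum() < tol:
            x = x_new
            break
        x = x_new
    return x > 1e-4  # weights of non-members decay toward zero

def select_boundary_samples(features):
    """Iteratively peel off dominant sets; whatever remains lies on
    cluster boundaries and becomes the active-learning query set."""
    diffs = features[:, None, :] - features[None, :, :]
    sim = np.exp(-np.linalg.norm(diffs, axis=-1))  # Gaussian-like affinity
    np.fill_diagonal(sim, 0.0)
    remaining = np.arange(len(features))
    selected = np.zeros(len(features), dtype=bool)
    while len(remaining) > 1:
        members = dominant_set(sim[np.ix_(remaining, remaining)])
        if members.sum() <= 1:  # no cohesive set left to peel
            break
        remaining = remaining[~members]
    selected[remaining] = True  # leftovers: boundary samples to annotate
    return np.where(selected)[0]
```

On a toy feature set with two tight clusters and one point between them, the in-between point survives the peeling and is returned as the sample to annotate, which mirrors the paper's intuition that points outside every dominant set are the hardest to classify.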