Paper Title

Text Classification Using Label Names Only: A Language Model Self-Training Approach

Paper Authors

Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang, Jiawei Han

Paper Abstract

Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification without using any labeled documents but learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.
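To make step (1) concrete, below is a minimal sketch of using a pre-trained masked language model to collect words semantically related to a label name: occurrences of the label name in unlabeled text are masked, and the MLM's top replacement predictions serve as candidate category-vocabulary terms. The choice of Hugging Face `transformers` with `bert-base-uncased`, and the `related_words` helper itself, are illustrative assumptions, not the authors' exact pipeline.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Illustrative sketch of step (1); assumes Hugging Face `transformers`
# with `bert-base-uncased`, not necessarily the paper's exact setup.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def related_words(sentence: str, label_name: str, top_k: int = 10) -> list[str]:
    """Mask one occurrence of `label_name` in `sentence` and return the
    MLM's top-k replacement words -- candidate category-vocabulary terms."""
    masked = sentence.replace(label_name, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    top_ids = logits[0, mask_idx].topk(top_k).indices.squeeze(0)
    return [tokenizer.decode([tid]).strip() for tid in top_ids]

# Contextualized substitutes for the label name "sports":
print(related_words("The sports page covered the championship game.", "sports"))
```

For step (3), one common way to form self-training targets is to square and renormalize the model's current class predictions so that confident predictions are emphasized, as in DEC-style self-training; treating this as the paper's exact target distribution is an assumption made here for illustration.

```python
import torch

def soft_targets(p: torch.Tensor) -> torch.Tensor:
    """Sharpen predictions p (shape: n_docs x n_classes) into self-training
    targets: square, reweight by soft class frequency, renormalize per doc.
    DEC-style target; assumed here as one plausible instantiation."""
    weight = p ** 2 / p.sum(dim=0)               # emphasize confident predictions
    return weight / weight.sum(dim=1, keepdim=True)
```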
