Paper Title
A Comparison of Approaches for Imbalanced Classification Problems in the Context of Retrieving Relevant Documents for an Analysis
Paper Authors
Paper Abstract
One of the first steps in many text-based social science studies is to retrieve documents that are relevant for the analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to this retrieval task is to apply a set of keywords and to consider as relevant those documents that contain at least one of the keywords. But applying an incomplete keyword list risks drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning have the potential to separate relevant from irrelevant documents more accurately and thereby reduce the potential magnitude of bias. Yet whether applying these more expensive approaches increases retrieval performance over keyword lists at all, and if so by how much, is unclear, as a comparison of these approaches has been lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder, 2017), the Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997). Results show that, in most of the studied settings, query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied to a not-too-small set of labeled training instances (e.g., 1,000 documents), reaches substantially higher retrieval performance than keyword lists.
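To make the conventional baseline concrete, the following is a minimal sketch (illustrative, not taken from the paper) of keyword-list retrieval as described in the abstract: a document is treated as relevant if it contains at least one keyword. The corpus, the keyword list, and the function name keyword_retrieve are assumptions for illustration.

    def keyword_retrieve(documents, keywords):
        # A document counts as relevant if it contains at least one
        # keyword; case-insensitive substring matching, so "refugee"
        # also matches "Refugees".
        lowered = [kw.lower() for kw in keywords]
        return [doc for doc in documents
                if any(kw in doc.lower() for kw in lowered)]

    # Hypothetical usage with an illustrative corpus and keyword list.
    docs = ["Refugees arrived at the border today.",
            "The stock market closed higher.",
            "New asylum applications were filed."]
    print(keyword_retrieve(docs, ["refugee", "asylum"]))
    # -> the first and third documents

The abstract's best-performing method, active supervised learning, is sketched below in one common variant: pool-based uncertainty sampling with a TF-IDF logistic regression classifier. The paper may use different classifiers and query strategies; the toy corpus, labels, and hyperparameters here are hypothetical.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy corpus with hypothetical relevance labels (1 = relevant).
    texts = ["refugee camp opened", "football match results",
             "asylum seekers housed", "stock prices rose",
             "migrants crossed the sea", "the weather was sunny"] * 20
    labels = np.array([1, 0, 1, 0, 1, 0] * 20)

    X = TfidfVectorizer().fit_transform(texts)

    labeled = [0, 1]  # small seed set containing both classes
    pool = [i for i in range(len(texts)) if i not in labeled]

    clf = LogisticRegression(max_iter=1000)
    for _ in range(5):  # five labeling rounds
        clf.fit(X[labeled], labels[labeled])
        proba = clf.predict_proba(X[pool])[:, 1]
        # Uncertainty sampling: request human labels for the documents
        # whose predicted probability of relevance is closest to 0.5.
        queried = np.argsort(np.abs(proba - 0.5))[:3]
        for idx in sorted(queried, reverse=True):
            labeled.append(pool.pop(idx))

    # Retrieve: classify the remaining unlabeled pool.
    preds = clf.predict(X[pool])
    relevant = [texts[i] for i, p in zip(pool, preds) if p == 1]

Per the abstract, it is this kind of actively trained classifier that, given a sufficiently large labeled set (e.g., 1,000 documents), substantially outperforms keyword lists, whereas the cheaper query expansion techniques and topic model-based rules mostly do not.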