Paper Title

Enhanced Offensive Language Detection Through Data Augmentation

Paper Authors

Ruibo Liu, Guangxuan Xu, Soroush Vosoughi

Paper Abstract

Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class makes up only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves classification performance on imbalanced and low-resource data such as the offensive language dataset. Dager extracts the lexical features of a given class and uses these features to guide the generation of a conditional generator built on GPT-2. The generated text can then be added to the training set as augmentation data. We show that applying Dager can increase the F1 score on the data challenge by 11% when we use 1% of the whole dataset for training (using BERT for classification); moreover, the generated data also preserves the original labels very well. We test Dager on four different classifiers (BERT, CNN, Bi-LSTM with attention, and Transformer), observing universal improvement in detection, indicating that our method is effective and classifier-agnostic.
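The abstract only sketches the pipeline at a high level, so the following is a minimal, hypothetical sketch of generation-based augmentation in the same spirit: TF-IDF keyword extraction and prompt seeding stand in for Dager's lexical-feature extraction and guided conditional generation, and the function names (top_class_keywords, augment_class) are illustrative, not the authors' implementation.

# Hypothetical sketch of generation-based augmentation in the spirit of
# Dager; the keyword extraction and prompting scheme are illustrative
# assumptions, not the paper's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def top_class_keywords(texts, k=10):
    """Extract salient lexical features of one class via TF-IDF."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(texts)
    scores = tfidf.sum(axis=0).A1  # aggregate term weight over the class
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda p: -p[1])
    return [t for t, _ in ranked[:k]]

def augment_class(texts, label, n_new=100, device="cpu"):
    """Generate synthetic training examples for one (minority) class."""
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    # Condition generation on the class's lexical features by seeding the
    # prompt with them (a simple stand-in for Dager's guided decoding).
    prompt = " ".join(top_class_keywords(texts))
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    out = model.generate(
        ids,
        do_sample=True,
        top_p=0.9,
        max_length=64,
        num_return_sequences=n_new,
        pad_token_id=tokenizer.eos_token_id,
    )
    synthetic = [tokenizer.decode(o, skip_special_tokens=True) for o in out]
    # Pair each generated text with the class label and append to training data.
    return [(s, label) for s in synthetic]

For example, augment_class(hateful_tweets, "hateful") would yield synthetic hateful-class examples to rebalance the training set before fitting any of the downstream classifiers.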
