Paper Title

Enhanced Offensive Language Detection Through Data Augmentation

Paper Authors

Ruibo Liu, Guangxuan Xu, Soroush Vosoughi

Paper Abstract

Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class makes up only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves classification performance on imbalanced and low-resource data such as the offensive language dataset. Dager extracts the lexical features of a given class and uses these features to guide the generation of a conditional generator built on GPT-2. The generated text can then be added to the training set as augmentation data. We show that applying Dager can increase the F1 score on the data challenge by 11% when we use 1% of the whole dataset for training (using BERT for classification); moreover, the generated data also preserves the original labels very well. We test Dager on four different classifiers (BERT, CNN, Bi-LSTM with attention, and Transformer), observing universal improvement in detection, indicating that our method is effective and classifier-agnostic.
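The abstract only sketches the pipeline at a high level, so the following is a minimal, hypothetical sketch of generation-based augmentation in the same spirit: TF-IDF keyword extraction and prompt seeding stand in for Dager's lexical-feature extraction and guided conditional generation, and the function names (top_class_keywords, augment_class) are illustrative, not the authors' implementation.

# Hypothetical sketch of generation-based augmentation in the spirit of
# Dager; the keyword extraction and prompting scheme are illustrative
# assumptions, not the paper's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def top_class_keywords(texts, k=10):
    """Extract salient lexical features of one class via TF-IDF."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(texts)
    scores = tfidf.sum(axis=0).A1  # aggregate term weight over the class
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda p: -p[1])
    return [t for t, _ in ranked[:k]]

def augment_class(texts, label, n_new=100, device="cpu"):
    """Generate synthetic training examples for one (minority) class."""
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    # Condition generation on the class's lexical features by seeding the
    # prompt with them (a simple stand-in for Dager's guided decoding).
    prompt = " ".join(top_class_keywords(texts))
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    out = model.generate(
        ids,
        do_sample=True,
        top_p=0.9,
        max_length=64,
        num_return_sequences=n_new,
        pad_token_id=tokenizer.eos_token_id,
    )
    synthetic = [tokenizer.decode(o, skip_special_tokens=True) for o in out]
    # Pair each generated text with the class label and append to training data.
    return [(s, label) for s in synthetic]

For example, augment_class(hateful_tweets, "hateful") would yield synthetic hateful-class examples to rebalance the training set before fitting any of the downstream classifiers.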
