Paper Title

L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models

Authors

Velankar, Abhishek, Patil, Hrushikesh, Gore, Amol, Salunke, Shubham, Joshi, Raviraj

Abstract

Social media platforms are used prominently by a large number of people to express their thoughts and opinions. However, these platforms have also contributed to a substantial amount of hateful and abusive content. Therefore, it is important to curb the spread of hate speech on these platforms. In India, Marathi is one of the most popular languages, used by a wide audience. In this work, we present L3Cube-MahaHate, the first major hate speech dataset in Marathi. The dataset is curated from Twitter and annotated manually. Our dataset consists of over 25,000 distinct tweets labeled into four major classes, i.e., hate, offensive, profane, and not. We present the approaches used for collecting and annotating the data and the challenges faced during the process. Finally, we present baseline classification results using deep learning models based on CNN, LSTM, and Transformers. We explore monolingual and multilingual variants of BERT, such as MahaBERT, IndicBERT, mBERT, and XLM-RoBERTa, and show that monolingual models perform better than their multilingual counterparts. The MahaBERT model provides the best results on the L3Cube-MahaHate corpus. The data and models are available at https://github.com/l3cube-pune/MarathiNLP.
