Paper Title
Toxicity Detection for Indic Multilingual Social Media Content
Paper Authors
Paper Abstract
Toxic content is one of the most critical issues for social media platforms today. India alone had 518 million social media users in 2020. In order to provide a good experience to content creators and their audience, it is crucial to flag toxic comments and the users who post them. However, the biggest challenge is identifying toxicity in low-resource Indic languages because of the presence of multiple representations of the same text. Moreover, posts/comments on social media do not adhere to a particular format, grammar, or sentence structure; this makes the task of abuse detection even more challenging for multilingual social media platforms. This paper describes the system proposed by team 'Moj Masti' using the data provided by ShareChat/Moj in the \emph{IIIT-D Multilingual Abusive Comment Identification} challenge. We focus on how multilingual transformer-based pre-trained and fine-tuned models can be leveraged to approach code-mixed/code-switched classification tasks. Our best-performing system was an ensemble of XLM-RoBERTa and MuRIL, which achieved a mean F-1 score of 0.9 on the test data/leaderboard. We also observed an increase in performance from adding transliterated data. Furthermore, using weak metadata, ensembling, and some post-processing techniques boosted the performance of our system, placing us 1st on the leaderboard.
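The abstract describes an ensemble of XLM-RoBERTa and MuRIL for the abusive-comment classification task. The following is a minimal sketch, not the authors' released code, of how such an ensemble could be formed with the Hugging Face Transformers library by averaging the two models' class probabilities; the checkpoint names (xlm-roberta-base, google/muril-base-cased), the binary abusive/non-abusive label layout, and the helper ensemble_predict are assumptions for illustration, and in practice the fine-tuned challenge weights would be loaded rather than the raw pre-trained checkpoints.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoints; the actual system fine-tunes such models on the challenge data.
MODEL_NAMES = ["xlm-roberta-base", "google/muril-base-cased"]

def ensemble_predict(text: str) -> float:
    """Return the averaged probability that `text` is abusive (label index 1 assumed)."""
    probs = []
    for name in MODEL_NAMES:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        model.eval()
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Probability assigned to the assumed "abusive" class (index 1).
        probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    # Simple mean ensemble over the two models' abusive-class probabilities.
    return sum(probs) / len(probs)

print(ensemble_predict("example code-mixed comment"))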