Paper title
MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset
Paper authors
Abstract
Misinformation is becoming increasingly prevalent on social media and in news articles. It has become so widespread that we require algorithmic assistance utilising machine learning to detect such content. Training these machine learning models requires datasets of sufficient scale, diversity and quality. However, datasets in the field of automatic misinformation detection are predominantly monolingual, include a limited number of modalities and are not of sufficient scale and quality. Addressing this, we develop a data collection and linking system (MuMiN-trawl) to build a public misinformation graph dataset (MuMiN), containing rich social media data (tweets, replies, users, images, articles, hashtags) spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade. The dataset is made available as a heterogeneous graph via a Python package (mumin). We provide baseline results for two node classification tasks related to the veracity of a claim involving social media, and demonstrate that these are challenging tasks, with the highest macro-average F1-scores being 62.55% and 61.45% for the two tasks, respectively. The MuMiN ecosystem is available at https://mumin-dataset.github.io/, including the data, documentation, tutorials and leaderboards.
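The baseline results above are reported as macro-average F1-scores. As a minimal sketch of what that metric measures, the following pure-Python function averages the per-class F1 scores without weighting them by class frequency. The function name `macro_f1`, the label strings and the toy predictions are illustrative assumptions, not taken from the paper or from the mumin package.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Illustrative helper (not part of the mumin package).
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one label")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # Per-class F1 = 2*TP / (2*TP + FP + FN); defined as 0 when undefined.
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels: an imbalanced set where a model always predicts
# the majority class.
y_true = ["misinformation"] * 8 + ["factual"] * 2
y_pred = ["misinformation"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro-F1 = {score:.4f}")
```

In this toy case the majority-class predictor reaches 80% accuracy but a macro-average F1 of only about 0.44, because the minority class contributes an F1 of zero. This is one reason the metric suits imbalanced veracity labels.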