Paper Title

CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing

Paper Authors

Alam, Firoj, Sajjad, Hassan, Imran, Muhammad, Ofli, Ferda

Paper Abstract

Time-critical analysis of social media streams is important for humanitarian organizations for planning rapid response during disasters. The \textit{crisis informatics} research community has developed several techniques and systems for processing and classifying big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature (e.g., for training models), it is not possible to compare the results and measure the progress made towards building better models for crisis informatics tasks. In this work, we attempt to bridge this gap by combining various existing crisis-related datasets. We consolidate eight human-annotated datasets and provide 166.1k and 141.5k tweets for \textit{informativeness} and \textit{humanitarian} classification tasks, respectively. We believe that the consolidated dataset will help train more sophisticated models. Moreover, we provide benchmarks for both binary and multiclass classification tasks using several deep learning architectures, including CNN, fastText, and transformers. We make the dataset and scripts available at: https://crisisnlp.qcri.org/crisis_datasets_benchmarks.html
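To make the binary \textit{informativeness} task concrete, here is a minimal illustrative sketch of a tweet classifier. Note the assumptions: the paper's benchmarks use CNN, fastText, and transformer models, whereas this sketch substitutes a simple TF-IDF + logistic regression baseline for brevity, and the example tweets and labels below are invented, not drawn from CrisisBench.

```python
# Hypothetical baseline for binary informativeness classification
# (informative vs. not_informative). NOT the paper's benchmarked models;
# the training examples here are made up for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "Bridge collapsed on Main St, several people reported trapped",
    "Donate blood at the community center to help flood victims",
    "Just had the best coffee this morning",
    "Can't wait for the weekend!",
]
labels = ["informative", "informative", "not_informative", "not_informative"]

# Word and bigram TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)

# Classify an unseen tweet as informative or not.
pred = clf.predict(["Road blocked by debris after the earthquake"])[0]
print(pred)
```

In the benchmark setting, this baseline would be replaced by a CNN, fastText, or fine-tuned transformer trained on the consolidated 166.1k-tweet informativeness split; the multiclass humanitarian task follows the same pattern with category labels instead of a binary one.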
