Paper Title

CORD19STS: COVID-19 Semantic Textual Similarity Dataset

Authors

Xiao Guo, Hengameh Mirzaalian, Ekraam Sabir, Ayush Jaiswal, Wael Abd-Almageed

Abstract

To combat the COVID-19 pandemic, society can benefit from various natural language processing applications, such as dialog medical diagnosis systems and information retrieval engines calibrated specifically for COVID-19. These applications rely on the ability to measure semantic textual similarity (STS), making STS a fundamental task that can benefit several downstream applications. However, existing STS datasets and models fail to translate their performance to a domain-specific environment such as COVID-19. To overcome this gap, we introduce the CORD19STS dataset, which includes 13,710 annotated sentence pairs collected from the COVID-19 Open Research Dataset (CORD-19) challenge. Specifically, we generated one million sentence pairs using different sampling strategies. We then used a fine-tuned BERT-like language model, which we call Sen-SCI-CORD19-BERT, to calculate similarity scores between sentence pairs and to provide a dataset balanced across the different semantic similarity levels, which gives us a total of 32K sentence pairs. Each sentence pair was annotated by five Amazon Mechanical Turk (AMT) crowd workers, where the labels represent different semantic similarity levels between the sentences in a pair (i.e., related, somewhat-related, and not-related). After employing a rigorous qualification task to verify the collected annotations, our final CORD19STS dataset includes 13,710 sentence pairs.
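The abstract does not say how the balancing or the aggregation of the five annotations was implemented, so the following is only a minimal sketch under stated assumptions. It assumes model scores lie in [0, 1], uses equal-width score bins to balance the sample, and uses a strict majority vote to combine worker labels. The function names `balance_by_score` and `majority_label`, the bin count, and the voting rule are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict

# Label names as listed in the abstract.
LABELS = ("related", "somewhat-related", "not-related")


def balance_by_score(scored_pairs, n_bins=3, per_bin=2, seed=0):
    """Bucket (pair, score) items into equal-width bins over [0, 1]
    and draw up to `per_bin` items from each bin (assumed strategy)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for pair, score in scored_pairs:
        # A score of exactly 1.0 is clamped into the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((pair, score))
    sample = []
    for idx in range(n_bins):
        items = bins[idx]
        rng.shuffle(items)
        sample.extend(items[:per_bin])
    return sample


def majority_label(votes):
    """Return the label chosen by a strict majority of workers,
    or None when no label has more than half of the votes."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None
```

For example, `majority_label(["related", "related", "related", "not-related", "somewhat-related"])` returns `"related"`, while a 2-2-1 split returns `None`. In practice such a pair might be discarded or re-annotated, which is again an assumption rather than the authors' documented procedure.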
