COVID-19 Literature Mining and Retrieval using Text Mining Approaches

Paper Title

COVID-19 Literature Mining and Retrieval using Text Mining Approaches

Authors

Uday, Sanku Satya; Pavani, Satti Thanuja; Lakshmi, T. Jaya; Chivukula, Rohit

Abstract

The novel coronavirus disease (COVID-19) emerged in Wuhan, China, in late 2019 and to date has infected over 148 million people worldwide, resulting in 3.12 million deaths. On March 11, 2020, the World Health Organization (WHO) declared it a global pandemic. Many academics and researchers began publishing papers describing the latest findings on COVID-19. The resulting influx of publications made it hard for other researchers to sift through the volume of data and find the work relevant to their own research. The proposed model therefore attempts to extract relevant titles from the large corpus of research publications, which eases that task for researchers. The Allen Institute for AI released the CORD-19 dataset, which consists of 200,000 journal articles on coronavirus-related research drawn from PubMed's PMC, WHO, bioRxiv, and medRxiv pre-prints. Along with this document corpus, they also provided a topics dataset named topics-rnd3, consisting of a list of topics. Each topic has three representations: query, question, and narrative. These datasets are open for research, and they also released a TREC-COVID competition on Kaggle. Using these topics as queries, our goal is to find the relevant documents in the CORD-19 dataset for each topic posed in topics-rnd3. The proposed model uses Natural Language Processing (NLP) techniques, namely Bag-of-Words, average Word2Vec, an average BERT Base model, and TF-IDF weighted Word2Vec, to build vectors for the query, the question, the narrative, and combinations of them. Vectors are built in the same way for the titles in the CORD-19 dataset. Cosine similarity is then computed between pairs of vectors, which helps us find the relevant documents for a given topic.
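
The retrieval idea described in the abstract (vectorize a topic query and each title, then compare them by cosine similarity) can be illustrated with a minimal sketch. It covers only the Bag-of-Words variant and uses only the Python standard library; the function names, the tokenizer, the sample titles, and the sample query are all assumptions made for illustration and are not taken from the paper or from CORD-19.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Map each token to its raw count (a sparse bag-of-words vector)."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Counter returns 0 for tokens it has not seen, so missing tokens add nothing.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_titles(query, titles):
    """Return (score, title) pairs sorted from most to least similar."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(query_vec, bag_of_words(t)), t) for t in titles]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical titles and query, invented for illustration only.
titles = [
    "Origin of the novel coronavirus in bats",
    "Deep learning for image segmentation",
    "Coronavirus response to weather changes",
]
ranked = rank_titles("coronavirus origin", titles)
for score, title in ranked:
    print(f"{score:.3f}  {title}")
```

The other representations named in the abstract (average Word2Vec, average BERT, TF-IDF weighted Word2Vec) would replace only `bag_of_words` with a dense embedding step; the cosine ranking would stay the same in this sketch.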
