Paper Title
Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets
Paper Authors
Paper Abstract
Objective: We aim to learn potential novel cures for diseases from unstructured text sources. More specifically, we seek to extract drug-disease pairs that represent potential cures through simple reasoning over the structure of spoken text. Materials and Methods: We use Google Cloud to transcribe podcast episodes of an NPR radio show. We then build a pipeline that systematically pre-processes the text to ensure quality input to the core classification model, whose output feeds into a series of post-processing steps that produce filtered results. The classification model itself uses a language model pre-trained on PubMed text. The modular nature of our pipeline eases future development in this area, since higher-quality components can be substituted at each stage. As a validation measure, we use ROBOKOP, an engine over a medical knowledge graph with only validated pathways, as a ground-truth source for checking the existence of the proposed pairs. For proposed pairs not found in ROBOKOP, we provide further verification using Chemotext. Results: We found 30.4% of our proposed pairs in the ROBOKOP database. For example, our model successfully identified that Omeprazole can help treat heartburn. We discuss the significance of this result and show some examples of the proposed pairs. Discussion and Conclusion: The agreement of our results with the existing knowledge source indicates a step in the right direction. Given the plug-and-play nature of our framework, it is easy to add, remove, or modify parts to improve the model as necessary. We note that this is a potentially new line of research with further scope for exploration. Although our approach was originally designed for radio podcast transcripts, it is input-agnostic and could be applied to any source of textual data and to any problem of interest.
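The plug-and-play structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name here is hypothetical, the classifier is a toy regular expression standing in for the PubMed-pretrained model, and the `known_pairs` set is a hand-written stand-in for a ROBOKOP lookup (no real ROBOKOP or Chemotext API is called). It only shows how swappable stages compose and how the share of proposed pairs found in a reference source could be computed.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (drug, disease), both lowercased


def split_sentences(transcript: str) -> List[str]:
    """Pre-processing stage (assumed): naive sentence split plus whitespace cleanup."""
    parts = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [" ".join(p.split()) for p in parts if p.strip()]


# Toy stand-in for the classification model: one hand-written pattern.
# The paper uses a language model pre-trained on PubMed text instead.
_TREATS = re.compile(r"(\w+) (?:can help treat|treats) (\w+)", re.IGNORECASE)


def toy_classifier(sentences: Iterable[str]) -> List[Pair]:
    """Classification stage (assumed): emit (drug, disease) candidates."""
    pairs: List[Pair] = []
    for sentence in sentences:
        for drug, disease in _TREATS.findall(sentence):
            pairs.append((drug.lower(), disease.lower()))
    return pairs


def deduplicate(pairs: Iterable[Pair]) -> List[Pair]:
    """Post-processing stage (assumed): drop repeats, keep first-seen order."""
    return list(dict.fromkeys(pairs))


def run_pipeline(
    transcript: str,
    preprocess: Callable[[str], List[str]] = split_sentences,
    classify: Callable[[Iterable[str]], List[Pair]] = toy_classifier,
    postprocess: Callable[[Iterable[Pair]], List[Pair]] = deduplicate,
) -> List[Pair]:
    """Chain the three stages; any stage can be replaced by a better component."""
    return postprocess(classify(preprocess(transcript)))


def validation_rate(proposed: List[Pair], known: Set[Pair]) -> float:
    """Fraction of proposed pairs that appear in the reference set."""
    if not proposed:
        return 0.0
    return sum(1 for pair in proposed if pair in known) / len(proposed)


# Hypothetical transcript and reference set, for illustration only.
transcript = (
    "Omeprazole can help treat heartburn. "
    "Some say garlic treats colds! "
    "Omeprazole can help treat heartburn."
)
known_pairs = {("omeprazole", "heartburn")}  # stand-in for a ROBOKOP lookup

proposed = run_pipeline(transcript)
rate = validation_rate(proposed, known_pairs)
print(proposed)        # [('omeprazole', 'heartburn'), ('garlic', 'colds')]
print(f"{rate:.1%}")   # 50.0%
```

Because each stage is passed in as a plain callable, replacing the toy classifier with a stronger model only requires supplying a different `classify` argument. This mirrors the substitution of higher-quality components that the abstract describes.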