Paper title
Semi-Automating Knowledge Base Construction for Cancer Genetics
Paper authors
Paper abstract
In this work, we consider the exponentially growing subarea of genetics in cancer. The need to synthesize and centralize this evidence for dissemination has motivated a team of physicians to manually construct and maintain a knowledge base that distills key results reported in the literature. This is a laborious process that entails reading through full-text articles to understand the study design, assess study quality, and extract the reported cancer risk estimates associated with particular hereditary cancer genes (i.e., penetrance). We propose models to automatically surface key elements from full-text cancer genetics articles, with the ultimate aim of expediting the manual workflow currently in place. We propose two challenging tasks that are critical for characterizing the findings reported in cancer genetics studies: (i) extracting snippets of text that describe ascertainment mechanisms, which in turn inform whether the population studied may introduce bias owing to deviations from the target population; (ii) extracting reported risk estimates (e.g., odds or hazard ratios) associated with specific germline mutations. The latter task may be viewed as a joint entity tagging and relation extraction problem. To train models for these tasks, we induce distant supervision over tokens and snippets in full-text articles using the manually constructed knowledge base. We propose and evaluate several model variants, including a transformer-based joint entity and relation extraction model to extract <germline mutation, risk-estimate> pairs. We observe strong empirical performance, highlighting the practical potential for such models to aid KB construction in this space. We ablate components of our model, observing, e.g., that a joint model for <germline mutation, risk-estimate> fares substantially better than a pipelined approach.
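
To make the idea of distant supervision over tokens concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical knowledge-base record with a `gene` name and a numeric `risk_estimate`, and weakly tags any token in a sentence that matches either field. The record layout, the tag names (`MUTATION`, `RISK`, `O`), and the simple regex tokenizer are all illustrative assumptions; the abstract does not specify how the actual labels are induced.

```python
import re


def tokenize(text):
    """Split text into decimal numbers, word tokens, and single punctuation marks."""
    return re.findall(r"\d+\.\d+|\w+|[^\w\s]", text)


def as_number(token):
    """Return the token as a float, or None if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return None


def distant_labels(tokens, kb_entry):
    """Weakly tag tokens by matching them against one (hypothetical) KB record.

    kb_entry is assumed to be a dict with a 'gene' string and a numeric
    'risk_estimate'; both field names are illustrative, not from the paper.
    """
    labels = []
    for tok in tokens:
        value = as_number(tok)
        if tok.upper() == kb_entry["gene"].upper():
            labels.append("MUTATION")
        elif value is not None and abs(value - kb_entry["risk_estimate"]) < 1e-6:
            labels.append("RISK")
        else:
            labels.append("O")
    return labels


# Hypothetical KB record and sentence, invented purely for illustration.
kb_entry = {"gene": "BRCA1", "risk_estimate": 2.3}
sentence = "BRCA1 carriers had an odds ratio of 2.3 for breast cancer."
tokens = tokenize(sentence)
labels = distant_labels(tokens, kb_entry)
```

Labels produced this way are noisy: a number that happens to equal a KB value is tagged regardless of context. A learned model, such as the joint entity and relation extractor described in the abstract, would therefore be needed to decide which tagged mutation and risk estimate actually belong together.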