Paper Title
Enriched Annotations for Tumor Attribute Classification from Pathology Reports with Limited Labeled Data
Paper Authors
Paper Abstract
Precision medicine has the potential to revolutionize healthcare, but much patient data is locked away in unstructured free text, limiting research and the delivery of effective personalized treatments. Generating large annotated datasets for information extraction from clinical notes is often challenging and expensive because high-quality annotations require a high level of expertise. To enable natural language processing with small datasets, we develop a novel enriched hierarchical annotation scheme and an algorithm, Supervised Line Attention (SLA), and apply the algorithm to predicting categorical tumor attributes from kidney and colon cancer pathology reports from the University of California, San Francisco (UCSF). Whereas previous work annotated only document-level labels, we additionally ask annotators to enrich the traditional label by highlighting the line or lines relevant to the final label, which increases the annotation time required per document by 20%. With the enriched annotations, we develop a simple and interpretable machine learning algorithm that first predicts the relevant lines in the document and then predicts the tumor attribute. Our results show that, across small dataset sizes of 32, 64, 128, and 186 labeled documents per cancer, SLA requires only half as many labeled documents as state-of-the-art methods to achieve similar or better micro-F1 and macro-F1 scores in the vast majority of the comparisons we made. Accounting for the increased annotation time, this leads to a 40% reduction in total annotation time relative to the state of the art.
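The 40% figure follows from the two numbers stated in the abstract: half as many labeled documents, each taking 20% longer to annotate. The short check below makes that arithmetic explicit. The function name `relative_annotation_time` and its parameters are illustrative assumptions and do not come from the paper.

```python
def relative_annotation_time(doc_fraction: float, per_doc_overhead: float) -> float:
    """Total annotation time relative to a baseline that labels every
    document at the document level only (baseline = 1.0).

    doc_fraction:     fraction of the baseline's documents that must be labeled
    per_doc_overhead: extra time per document for enriched (line-level) labels
    """
    return doc_fraction * (1.0 + per_doc_overhead)


# Values taken from the abstract: half the documents, 20% more time per document.
ratio = relative_annotation_time(doc_fraction=0.5, per_doc_overhead=0.20)
reduction = 1.0 - ratio

print(f"relative time: {ratio:.2f}, reduction: {reduction:.0%}")
# prints "relative time: 0.60, reduction: 40%"
```

In other words, 0.5 × 1.2 = 0.6 of the baseline annotation time, which is the 40% reduction reported.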