Paper Title
Validation and Optimization of Multi-Organ Segmentation on Clinical Imaging Archives
Authors
Abstract
Segmentation of abdominal computed tomography (CT) provides spatial context, morphological properties, and a framework for tissue-specific radiomics to guide quantitative radiological assessment. A 2015 MICCAI challenge spurred substantial innovation in multi-organ abdominal CT segmentation with both traditional and deep learning methods. Recent innovations in deep methods have driven performance toward levels at which clinical translation is appealing. However, continued cross-validation on open datasets presents the risk of indirect knowledge contamination and could result in circular reasoning. Moreover, 'real-world' segmentations can be challenging due to the wide variability of abdominal physiology within patients. Herein, we perform two data retrievals to capture clinically acquired, deidentified abdominal CT cohorts for evaluation of a recently published variation on 3D U-Net (the baseline algorithm). First, we retrieved 2004 deidentified studies on 476 patients with diagnosis codes involving spleen abnormalities (cohort A). Second, we retrieved 4313 deidentified studies on 1754 patients without diagnosis codes involving spleen abnormalities (cohort B). We performed a prospective evaluation of the existing algorithm on both cohorts, yielding failure rates of 13% and 8%, respectively. We then identified 51 subjects in cohort A with segmentation failures and manually corrected their liver and gallbladder labels. We retrained the model with the manual labels added, improving performance to failure rates of 9% and 6% for cohorts A and B, respectively. In summary, the performance of the baseline on the prospective cohorts was similar to that on previously published datasets. Moreover, adding data from the first cohort substantively improved performance when evaluated on the second, withheld validation cohort.
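The abstract reports cohort-level failure rates but does not state how an individual segmentation is flagged as a failure. A common convention is to threshold a per-case overlap score such as Dice; the sketch below illustrates that idea only, with the `dice`, `failure_rate` helpers, the 0.7 threshold, and the toy masks all being assumptions rather than the authors' actual criteria.

```python
import numpy as np

def dice(pred, truth):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * inter / denom if denom else 1.0

def failure_rate(cases, threshold=0.7):
    """Fraction of (pred, truth) pairs whose Dice falls below the threshold.

    The 0.7 cutoff is illustrative, not the threshold used in the paper.
    """
    scores = [dice(p, t) for p, t in cases]
    return sum(s < threshold for s in scores) / len(scores)

# Toy example: one well-segmented 'organ' and one failed segmentation.
truth = np.zeros((8, 8), dtype=bool)
truth[2:6, 2:6] = True
good = truth.copy()                      # perfect overlap, Dice = 1.0
bad = np.zeros_like(truth)
bad[0:2, 0:2] = True                     # no overlap, Dice = 0.0
print(failure_rate([(good, truth), (bad, truth)]))  # → 0.5
```

In a real archive-scale evaluation, `cases` would iterate over per-organ masks for each study, and the threshold would be tuned per organ; the structure of the computation is the same.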