Disentangling Human Error from the Ground Truth in Segmentation of Medical Images

Paper title

Disentangling Human Error from the Ground Truth in Segmentation of Medical Images

Authors

Zhang, Le; Tanno, Ryutaro; Xu, Mou-Cheng; Jin, Chen; Jacob, Joseph; Ciccarelli, Olga; Barkhof, Frederik; Alexander, Daniel C.

Abstract

Recent years have seen increasing use of supervised learning methods for segmentation tasks. However, the predictive performance of these algorithms depends on the quality of labels. This problem is particularly pertinent in the medical image domain, where both the annotation cost and inter-observer variability are high. In a typical label acquisition process, different human experts provide their estimates of the "true" segmentation labels under the influence of their own biases and competence levels. Treating these noisy labels blindly as the ground truth limits the performance that automatic segmentation algorithms can achieve. In this work, we present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions, using two coupled CNNs. The separation of the two is achieved by encouraging the estimated annotators to be maximally unreliable while achieving high fidelity with the noisy training data. We first define a toy segmentation dataset based on MNIST and study the properties of the proposed algorithm. We then demonstrate the utility of the method on three public medical imaging segmentation datasets with simulated (when necessary) and real diverse annotations: 1) MSLSC (multiple-sclerosis lesions); 2) BraTS (brain tumours); 3) LIDC-IDRI (lung abnormalities). In all cases, our method outperforms competing methods and relevant baselines particularly in cases where the number of annotations is small and the amount of disagreement is large. The experiments also show strong ability to capture the complex spatial characteristics of annotators' mistakes.
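
The abstract does not spell out how annotator reliability and the true label distribution are combined, so the following is only a minimal sketch under stated assumptions. It assumes each annotator is described by a per-pixel confusion matrix, that the noisy label distribution is this matrix applied to the estimated true distribution, and that "maximally unreliable" is encouraged by penalising the trace of that matrix. The function names, the `trace_weight` parameter, and all numeric values are hypothetical and chosen for illustration.

```python
import math

def noisy_label_distribution(confusion, true_probs):
    """Push the estimated true-label distribution through one annotator's
    confusion matrix. confusion[i][j] is the assumed probability that the
    annotator writes label i when the true label is j (columns sum to 1)."""
    n = len(true_probs)
    return [sum(confusion[i][j] * true_probs[j] for j in range(n))
            for i in range(n)]

def pixel_loss(confusion, true_probs, observed_label, trace_weight=0.1):
    """Cross-entropy against the annotator's noisy label plus a trace penalty.
    Minimising the trace pushes the estimated annotator towards being
    unreliable; the cross-entropy term keeps fidelity to the noisy data.
    trace_weight is a hypothetical hyperparameter."""
    noisy = noisy_label_distribution(confusion, true_probs)
    cross_entropy = -math.log(noisy[observed_label])
    trace = sum(confusion[k][k] for k in range(len(confusion)))
    return cross_entropy + trace_weight * trace

# Hypothetical values for one pixel with two classes (background, lesion).
true_probs = [0.2, 0.8]                      # stand-in for the segmentation CNN output
reliable = [[1.0, 0.0], [0.0, 1.0]]          # identity: annotator never errs
under_segmenter = [[1.0, 0.4], [0.0, 0.6]]   # labels 40% of lesion pixels as background

noisy = noisy_label_distribution(under_segmenter, true_probs)
loss = pixel_loss(under_segmenter, true_probs, observed_label=0)
```

In this sketch the two lists `true_probs` and `under_segmenter` stand in for the outputs of the two coupled CNNs; in a trained model both would be predicted from the image and optimised jointly over all pixels and annotators.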
