Critical Evaluation of Deep Neural Networks for Wrist Fracture Detection

Paper Title

Critical Evaluation of Deep Neural Networks for Wrist Fracture Detection

Authors

Raisuddin, Abu Mohammed; Vaattovaara, Elias; Nevalainen, Mika; Nikki, Marko; Järvenpää, Elina; Makkonen, Kaisa; Pinola, Pekka; Palsio, Tuula; Niemensivu, Arttu; Tervonen, Osmo; Tiulpin, Aleksei

Abstract

Wrist fracture is the most common type of fracture, with a high incidence rate. Conventional radiography (i.e., X-ray imaging) is routinely used for wrist fracture detection, but fracture delineation is occasionally difficult, and an additional confirmation by computed tomography (CT) is then needed for diagnosis. Recent advances in Deep Learning (DL), a subfield of Artificial Intelligence (AI), have shown that wrist fracture detection can be automated using Convolutional Neural Networks. However, previous studies did not pay close attention to the difficult cases that can only be confirmed via CT imaging. In this study, we developed and analyzed a state-of-the-art DL-based pipeline for wrist (distal radius) fracture detection, DeepWrist, and evaluated it on one general population test set and on one challenging test set comprising only cases that required confirmation by CT. Our results reveal that a typical state-of-the-art approach such as DeepWrist, while having near-perfect performance on the general independent test set, has substantially lower performance on the challenging test set: average precision of 0.99 (0.99-0.99) vs 0.64 (0.46-0.83), respectively. Similarly, the area under the ROC curve was 0.99 (0.98-0.99) vs 0.84 (0.72-0.93), respectively. Our findings highlight the importance of a meticulous analysis of DL-based models before clinical use, and reveal the need for more challenging settings for testing medical AI systems.
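
The abstract reports average precision and area under the ROC curve, each with an interval in parentheses. As a rough illustration of how such figures can be obtained, the sketch below computes both metrics on toy data and estimates a percentile bootstrap interval. The abstract does not say how the intervals were computed, so the bootstrap, the function names, and the toy labels and scores are all assumptions made for illustration, not the authors' method or data.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive case.

    Tied scores are ordered arbitrarily in this simplified version.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def bootstrap_ci(metric, labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method, not from the paper)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both classes for the metrics to be defined
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: 1 = fracture, 0 = no fracture; scores are model outputs.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print("AUROC:", roc_auc(labels, scores))
print("AP:", average_precision(labels, scores))
print("AUROC 95% interval:", bootstrap_ci(roc_auc, labels, scores))
```

On a small test set such as the challenging one described above, an interval estimated this way tends to be wide, which is consistent with the broad ranges reported for that set.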
