An Extensive Data Processing Pipeline for MIMIC-IV

Title

An Extensive Data Processing Pipeline for MIMIC-IV

Authors

Gupta, Mehak; Gallamoza, Brennan; Cutrona, Nicolas; Dhakal, Pranjal; Poulain, Raphael; Beheshti, Rahmatollah

Abstract

A growing body of research applies machine learning methods to electronic health record (EHR) data for a variety of clinical purposes. This growing area of research has exposed the challenges of EHR accessibility. MIMIC is a popular, public, and free EHR dataset, distributed in a raw format, that has been used in numerous studies. However, the absence of standardized pre-processing steps can be a significant barrier to wider adoption of this rare resource. This absence can also reduce the reproducibility of the tools developed and limit the ability to compare results across similar studies. In this work, we provide a highly customizable pipeline to extract, clean, and pre-process the data available in the fourth version of the MIMIC dataset (MIMIC-IV). The pipeline also provides an end-to-end, wizard-like package that supports predictive model creation and evaluation. The pipeline covers a range of clinical prediction tasks that can be broadly classified into four categories: readmission, length of stay, mortality, and phenotype prediction. The tool is publicly available at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.
