Paper title
Comparison of Missing Data Imputation Methods using the Framingham Heart Study dataset
Paper authors
Abstract
Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood vessels and, according to the World Health Organization, is the leading cause of death worldwide. Electronic health record (EHR) data on CVD, like medical data in general, very frequently contain missing values. The percentage of missing values varies and is linked to instrument errors, manual data entry procedures, and similar causes. Even though the missing rate is usually significant, missing value imputation is often handled poorly, either by case deletion or by simple statistical approaches such as mode and median imputation. These methods are known to introduce significant bias because they do not account for the relationships between the dataset's variables. In the medical setting, many datasets consist of lab tests or patient medical tests, where these relationships are present and strong. To address these limitations, in this paper we test and modify state-of-the-art missing value imputation methods based on Generative Adversarial Networks (GANs) and Autoencoders. We evaluate the methods on two tasks: data imputation and post-imputation prediction. On the imputation task, we achieve improvements of 0.20 in normalised Root Mean Squared Error (RMSE) and 7.00% in Area Under the Receiver Operating Characteristic Curve (AUROC). On the post-imputation prediction task, our models outperform the standard approaches by 2.50% in F1-score.
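To make the baseline and the imputation metric concrete, the following minimal sketch applies median imputation to synthetic data and scores it with a normalised RMSE. It is not the paper's code: the two synthetic variables, the 20% missing-completely-at-random rate, the helper names `median_impute` and `normalised_rmse`, and the choice of min-max scaling as the normalisation are all assumptions made for illustration.

```python
import numpy as np


def median_impute(x, mask):
    """Fill missing entries (mask == False) with each column's observed median."""
    filled = x.copy()
    for j in range(x.shape[1]):
        observed = x[mask[:, j], j]
        filled[~mask[:, j], j] = np.median(observed)
    return filled


def normalised_rmse(truth, imputed, mask):
    """RMSE over the missing entries only, after min-max scaling each column.

    Assumption: "normalised" is taken to mean min-max scaling with the
    ground-truth column ranges; the paper may define it differently.
    """
    lo = truth.min(axis=0)
    span = truth.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant columns
    t = (truth - lo) / span
    p = (imputed - lo) / span
    missing = ~mask
    return float(np.sqrt(np.mean((t[missing] - p[missing]) ** 2)))


# Synthetic, correlated stand-ins for two clinical variables (hypothetical).
rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 10, n)
sbp = 90 + 0.8 * age + rng.normal(0, 8, n)
truth = np.column_stack([age, sbp])

# True = observed, False = missing; roughly 20% of entries are removed at random.
mask = rng.random(truth.shape) > 0.2
observed = np.where(mask, truth, np.nan)

imputed = median_impute(observed, mask)
score = normalised_rmse(truth, imputed, mask)

# Median imputation ignores the age/sbp relationship, so the correlation
# measured on the imputed data can differ from the one in the complete data.
corr_true = np.corrcoef(truth.T)[0, 1]
corr_imputed = np.corrcoef(imputed.T)[0, 1]
print(f"normalised RMSE: {score:.3f}")
print(f"correlation, complete data: {corr_true:.3f}, after imputation: {corr_imputed:.3f}")
```

Because every missing value in a column receives the same constant, comparing the two printed correlations gives a quick check of how much of the relationship between variables the baseline preserves, which is the limitation the abstract attributes to simple statistical imputation.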