Explainable Multi-class Classification of Medical Data

Paper Title

Explainable Multi-class Classification of Medical Data

Authors

YuanZheng Hu, Marina Sokolova

Abstract

Machine Learning applications have brought new insights into the secondary analysis of medical data. Machine Learning helps to develop new drugs, define populations susceptible to certain illnesses, and identify predictors of many common diseases. At the same time, Machine Learning results depend on a combination of many factors, including feature selection, class (im)balance, algorithm preference, and performance metrics. In this paper, we present explainable multi-class classification of a large medical data set. We discuss in detail knowledge-based feature engineering, data set balancing, best model selection, and parameter tuning. Six algorithms are used in this study: Support Vector Machine (SVM), Naïve Bayes, Gradient Boosting, Decision Trees, Random Forest, and Logistic Regression. Our empirical evaluation is done on the UCI "Diabetes 130-US hospitals for years 1999-2008" dataset, with the task of classifying patient hospital re-admission into three classes: 0 days, <30 days, or >30 days. Our results show that using 23 medication features in learning experiments improves the Recall of five out of the six applied learning algorithms. This is a new result that expands on previous studies conducted on the same data. Gradient Boosting and Random Forest outperformed the other algorithms in terms of three-class classification Accuracy.
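
The abstract reports both Recall and Accuracy for a three-class task and notes that results depend on class (im)balance and the choice of metric. The following is a minimal sketch of how the two metrics can diverge on imbalanced labels. The label strings, the toy predictions, and the use of macro averaging are assumptions made for illustration; the abstract does not state how the authors aggregate Recall across classes.

```python
from collections import Counter

# Hypothetical label names for the three re-admission classes named in the abstract.
CLASSES = ["0 days", "<30 days", ">30 days"]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred, classes=CLASSES):
    """Recall per class: correctly predicted members / all true members."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: (hits[c] / support[c] if support[c] else 0.0) for c in classes}

def macro_recall(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class recall (an assumed averaging choice)."""
    recalls = per_class_recall(y_true, y_pred, classes)
    return sum(recalls.values()) / len(classes)

# Made-up toy labels: the majority class dominates, as in an imbalanced data set.
y_true = ["0 days"] * 6 + ["<30 days"] * 2 + [">30 days"] * 2
y_pred = ["0 days"] * 6 + ["0 days", "<30 days"] + ["0 days", ">30 days"]

print("accuracy:", accuracy(y_true, y_pred))
print("per-class recall:", per_class_recall(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
```

In this toy case the classifier favors the majority class, so Accuracy stays high while Recall on the two minority classes is only one half. That gap is one reason to report both metrics on an imbalanced three-class problem.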
