Paper Title

Missing Data Infill with Automunge

Author

Teague, Nicholas J.

Abstract

Missing data is a fundamental obstacle in the practice of data science. This paper surveys a few conventions for imputation available in the Automunge open-source Python library platform for tabular data preprocessing, including "ML infill", in which auto ML models are trained for target features from partitioned extracts of a training set. A series of validation experiments was performed to benchmark imputation scenarios against downstream model performance. For the given benchmark sets, ML infill in many cases outperformed the other imputation scenarios for both numeric and categoric target features, and was otherwise at minimum within their noise distributions. Evidence also suggested that supplementing ML infill with support columns of boolean integer markers signaling the presence of infill was usually beneficial to downstream model performance. We consider these results sufficient to recommend defaulting to ML infill for tabular learning, and further recommend supplementing imputations with support columns signaling presence of infill, each of which can be prepared with push-button operation in the Automunge library. Our contributions include an auto ML derived missing data imputation library for tabular learning in the Python ecosystem, fully integrated into a preprocessing platform with an extensive library of feature transformations, with a novel production-friendly implementation that bases imputation models on a designated train set for a consistent basis towards additional data.
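As a rough illustration of the two ideas in the abstract (an imputation model fit only on the train set, plus a boolean integer support column marking infilled entries), the following standard-library sketch may help. It is not the Automunge implementation or API: the function names, the `_infill_marker` column suffix, and the single-predictor least-squares model standing in for an auto ML model are all assumptions made for illustration.

```python
from statistics import mean


def fit_infill_model(train_rows, target, predictor):
    """Fit a one-variable least-squares line using only the train rows
    in which the target feature is present (hypothetical helper)."""
    known = [(r[predictor], r[target]) for r in train_rows if r[target] is not None]
    xs = [x for x, _ in known]
    ys = [y for _, y in known]
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    # Fall back to a flat line (mean imputation) if the predictor is constant.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in known) / var if var else 0.0
    return {
        "target": target,
        "predictor": predictor,
        "slope": slope,
        "intercept": y_bar - slope * x_bar,
    }


def apply_infill(rows, model):
    """Fill missing target entries with model predictions and add a
    0/1 support column recording which rows received infill."""
    target, predictor = model["target"], model["predictor"]
    marker = target + "_infill_marker"  # assumed column name, for illustration
    filled = []
    for r in rows:
        new_row = dict(r)
        missing = r[target] is None
        new_row[marker] = int(missing)
        if missing:
            new_row[target] = model["slope"] * r[predictor] + model["intercept"]
        filled.append(new_row)
    return filled


# Toy data: None marks a missing entry in feature "y".
train = [
    {"x": 1.0, "y": 2.0},
    {"x": 2.0, "y": 4.0},
    {"x": 3.0, "y": None},
    {"x": 4.0, "y": 8.0},
]
test = [{"x": 5.0, "y": None}, {"x": 6.0, "y": 12.0}]

# The model is fit once on the train set and reused unchanged for later data.
model = fit_infill_model(train, target="y", predictor="x")
train_filled = apply_infill(train, model)
test_filled = apply_infill(test, model)
```

Fitting once on the designated train set and reusing the stored parameters for subsequent data is the design choice the abstract describes as a consistent basis towards additional data. The marker column lets a downstream model tell imputed values apart from observed ones.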
