Paper Title
SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions
Paper Authors
Abstract
Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML) by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique, SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy, realized as a three-stage program synthesis approach that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this set is refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. In the third stage, dynamically evaluating these few pipelines provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and then uses the learned models to synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks, while the second-best tool fails to even produce a pipeline on 9 of the instances.
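The three-stage strategy described in the abstract can be illustrated with a small, self-contained toy sketch. It is not the SapientML implementation. The meta-features, the rule-based `predict_components` function (a stand-in for the machine-learned model), the hard-coded `ORDER` constraint (a stand-in for constraints derived from the corpus), and the two toy models are all assumptions made purely for illustration.

```python
# Toy dataset: numeric feature rows (None marks a missing value) and binary labels.
X = [[1.0, 200.0], [1.2, None], [0.9, 210.0], [3.0, 800.0],
     [3.2, 790.0], [2.9, None], [1.1, 205.0], [3.1, 805.0]]
y = [0, 0, 0, 1, 1, 1, 0, 1]

def meta_features(X):
    """Summarize the dataset (these two meta-features are illustrative assumptions)."""
    cells = [v for row in X for v in row]
    present = [v for v in cells if v is not None]
    return {"has_missing": len(present) < len(cells),
            "wide_range": max(present) - min(present) > 100}

def predict_components(meta):
    """Stage 1: hand-written rules standing in for a learned component predictor."""
    preprocess = []
    if meta["wide_range"]:
        preprocess.append("scale_minmax")
    if meta["has_missing"]:
        preprocess.append("impute_mean")
    return preprocess, ["majority_class", "nearest_centroid"]

# Stage 2: ordering constraint (hypothetical; hard-coded here rather than mined).
ORDER = ["impute_mean", "scale_minmax"]

def concrete_pipelines(preprocess, models):
    """Stage 2: order the predicted components and pair them with each model."""
    steps = tuple(sorted(preprocess, key=ORDER.index))
    return [(steps, model) for model in models]

def fit_impute_mean(X):
    means = []
    for col in zip(*X):
        vals = [v for v in col if v is not None]
        means.append(sum(vals) / len(vals))
    return lambda rows: [[means[j] if v is None else v
                          for j, v in enumerate(r)] for r in rows]

def fit_scale_minmax(X):
    lows = [min(c) for c in zip(*X)]
    spans = [(max(c) - min(c)) or 1.0 for c in zip(*X)]
    return lambda rows: [[(v - lows[j]) / spans[j]
                          for j, v in enumerate(r)] for r in rows]

def fit_majority_class(X, y):
    label = max(set(y), key=y.count)
    return lambda rows: [label for _ in rows]

def fit_nearest_centroid(X, y):
    centroids = {}
    for label in set(y):
        rows = [r for r, l in zip(X, y) if l == label]
        centroids[label] = [sum(c) / len(c) for c in zip(*rows)]
    def predict(rows):
        return [min(centroids, key=lambda l: sum((a - b) ** 2
                    for a, b in zip(r, centroids[l]))) for r in rows]
    return predict

PREPROCESSORS = {"impute_mean": fit_impute_mean, "scale_minmax": fit_scale_minmax}
MODELS = {"majority_class": fit_majority_class,
          "nearest_centroid": fit_nearest_centroid}

def evaluate(pipeline, X_train, y_train, X_test, y_test):
    """Stage 3: run one concrete pipeline and return its hold-out accuracy."""
    steps, model_name = pipeline
    for name in steps:
        transform = PREPROCESSORS[name](X_train)  # fit on training rows only
        X_train, X_test = transform(X_train), transform(X_test)
    predict = MODELS[model_name](X_train, y_train)
    hits = sum(p == t for p, t in zip(predict(X_test), y_test))
    return hits / len(y_test)

def synthesize(X, y, n_train=6):
    """Chain the three stages and return (accuracy, pipeline) for the best candidate."""
    preprocess, models = predict_components(meta_features(X))
    candidates = concrete_pipelines(preprocess, models)
    scored = [(evaluate(p, X[:n_train], y[:n_train], X[n_train:], y[n_train:]), p)
              for p in candidates]
    return max(scored, key=lambda s: s[0])

best_score, best_pipeline = synthesize(X, y)
print(best_score, best_pipeline)
```

The point of the sketch is the shrinking search space: stage 1 narrows all known components to a handful, stage 2 turns that handful into only two ordered candidates, and stage 3 executes just those two. In this toy, stage 1 deliberately emits the components in the wrong order so that the stage 2 constraint has something to fix.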