ARDA: Automatic Relational Data Augmentation for Machine Learning

Paper title

ARDA: Automatic Relational Data Augmentation for Machine Learning

Authors

Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

Abstract

Automatic machine learning (AutoML) is a family of techniques that automate the process of training predictive models, aiming both to improve performance and to make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline such as model selection, hyperparameter tuning, and feature selection, relatively few have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
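The two components named in the abstract (joining candidate data onto the input, then pruning irrelevant features) can be illustrated with a minimal sketch. This is not ARDA's algorithm: the abstract does not describe its join search or its feature selection method, so a plain key-based left join and a simple absolute-correlation filter are swapped in as stand-ins. The function names (`left_join`, `abs_correlation`, `augment`), the `threshold` parameter, and the toy data are all hypothetical.

```python
from statistics import mean

def left_join(base_rows, candidate_rows, key):
    """Left-join candidate columns onto base rows by `key`.

    Assumption: keys are unique in candidate_rows. Columns that already
    exist in the base table are not overwritten. Unmatched rows get None.
    """
    base_cols = {c for row in base_rows for c in row}
    lookup = {row[key]: row for row in candidate_rows}
    new_cols = sorted({c for row in candidate_rows for c in row} - base_cols)
    joined = []
    for row in base_rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{c: match.get(c) for c in new_cols}})
    return joined, new_cols

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either side is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5

def augment(base_rows, candidate_rows, key, target, threshold=0.5):
    """Join, then keep only new numeric columns correlated with the target.

    The correlation filter is a stand-in for a real feature selector.
    """
    joined, new_cols = left_join(base_rows, candidate_rows, key)
    kept = []
    for col in new_cols:
        pairs = [(r[col], r[target]) for r in joined if r[col] is not None]
        if len(pairs) < 2:
            continue  # too few matched rows to judge relevance
        xs, ys = zip(*pairs)
        if abs_correlation(xs, ys) >= threshold:
            kept.append(col)
    dropped = set(new_cols) - set(kept)
    rows = [{k: v for k, v in r.items() if k not in dropped} for r in joined]
    return rows, kept

# Toy data (made up for illustration): "area" tracks the target, "noise" does not.
base = [{"id": i, "price": 10.0 * i} for i in range(1, 5)]
candidate = [
    {"id": 1, "area": 1.0, "noise": 5.0},
    {"id": 2, "area": 2.1, "noise": -5.0},
    {"id": 3, "area": 2.9, "noise": -5.0},
    {"id": 4, "area": 4.0, "noise": 5.0},
]
augmented, kept = augment(base, candidate, key="id", target="price")
print(kept)
```

A production system would also need to handle non-numeric columns, one-to-many joins, and a selector that accounts for feature interactions, none of which this sketch attempts.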
