Paper Title
A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees
Paper Authors
Paper Abstract
Several recent publications report advances in training optimal decision trees (ODT) using mixed-integer programs (MIP), driven by algorithmic advances in integer programming and a growing interest in addressing the inherent suboptimality of heuristic approaches such as CART. In this paper, we propose a novel MIP formulation, based on a 1-norm support vector machine model, to train a multivariate ODT for classification problems. We provide cutting-plane techniques that tighten the linear relaxation of the MIP formulation in order to improve the run time required to reach optimality. Using 36 data-sets from the University of California Irvine Machine Learning Repository, we demonstrate that our formulation outperforms its counterparts in the literature by an average of about 10% in terms of mean out-of-sample testing accuracy across the data-sets. We provide a scalable framework to train multivariate ODT on large data-sets by introducing a novel linear programming (LP) based data-selection method that chooses a subset of the data for training. Our method routinely handles large data-sets with more than 7,000 sample points and outperforms heuristic methods and other MIP-based techniques. We present results on data-sets containing up to 245,000 samples. Existing MIP-based methods do not scale well on training data-sets beyond 5,500 samples.
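To make the building block concrete, below is a minimal sketch of the 1-norm (L1-regularized) soft-margin SVM that the abstract names as the base model for the MIP formulation. It only illustrates the standard linear-programming form of that SVM; it is not the authors' decision-tree formulation, and the function name `one_norm_svm`, the penalty parameter `C`, and the use of `scipy.optimize.linprog` are illustrative assumptions.

```python
# Sketch: the 1-norm soft-margin SVM written as a linear program.
# This is the classic LP reformulation (not the paper's MIP for decision trees):
#   min ||w||_1 + C * sum(xi)   s.t.   y_i (w . x_i + b) >= 1 - xi_i,  xi >= 0,
# with the L1 norm linearized by splitting w = w_pos - w_neg, w_pos, w_neg >= 0.
import numpy as np
from scipy.optimize import linprog


def one_norm_svm(X, y, C=1.0):
    """X: (n, d) feature matrix; y: (n,) labels in {-1, +1}. Returns (w, b)."""
    n, d = X.shape
    # Decision vector layout: [w_pos (d), w_neg (d), b (1), xi (n)]
    c = np.concatenate([np.ones(d), np.ones(d), [0.0], C * np.ones(n)])
    # Margin constraints rewritten in A_ub @ z <= b_ub form:
    #   -y_i * x_i . (w_pos - w_neg) - y_i * b - xi_i <= -1
    Yx = y[:, None] * X
    A_ub = np.hstack([-Yx, Yx, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    b = res.x[2 * d]
    return w, b


if __name__ == "__main__":
    # Toy usage on a linearly separable problem.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    w, b = one_norm_svm(X, y, C=10.0)
    print("separating hyperplane:", w, b)
```

In the paper's setting, a hyperplane of this kind plays the role of a multivariate split at a tree node; embedding such splits across a tree and choosing them jointly is what turns the problem from an LP into a MIP.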