Paper Title
An Approximation Method for Fitted Random Forests
Authors
Abstract
Random Forests (RF) is a popular machine learning method for classification and regression problems. It applies bagging (bootstrap aggregation) to decision tree models. One of the primary advantages of the Random Forests model is the reduced variance of its predictions. In large-scale applications with millions of data points and hundreds of features, the fitted objects can become very large, depending on the number and depth of the trees, and can reach the limits of the available space in production setups. This can be especially challenging when trained models need to be downloaded on demand to small devices with limited memory. There is therefore a need to approximate trained RF models so as to significantly reduce model size without losing too much prediction accuracy. In this project we study methods that approximate each fitted tree in the Random Forests model using the multinomial allocation of the data points to the leaves. Specifically, we begin by studying whether fitting a multinomial logistic regression (and subsequently a generalized additive model (GAM) extension) to the output of each tree helps reduce the size while preserving the prediction quality.
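To make the idea of approximating a single fitted tree more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical hand-written depth-2 tree (`tree_leaf`) with made-up leaf predictions (`LEAF_VALUES`), synthetic two-feature data, and a plain gradient-descent softmax fit (`fit_multinomial`). All of these names and values are illustrative assumptions. The multinomial logistic regression is trained to predict which leaf a point is allocated to, and the approximate prediction is the value stored at the most probable leaf.

```python
import math
import random

def tree_leaf(x):
    """Hypothetical fitted depth-2 tree: maps a 2-feature point to a leaf index."""
    if x[0] <= 0.0:
        return 0 if x[1] <= 0.0 else 1
    return 2

LEAF_VALUES = [-1.0, 0.5, 2.0]  # assumed per-leaf predictions, for illustration only

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def leaf_probs(W, x):
    """Multinomial probabilities of each leaf given x (first weight is the intercept)."""
    feats = [1.0] + list(x)
    return softmax([sum(w * f for w, f in zip(row, feats)) for row in W])

def fit_multinomial(X, leaves, n_leaves, lr=1.0, epochs=300):
    """Batch gradient descent on the multinomial logistic (softmax) log-loss."""
    d = len(X[0]) + 1
    W = [[0.0] * d for _ in range(n_leaves)]
    n = len(X)
    for _ in range(epochs):
        grad = [[0.0] * d for _ in range(n_leaves)]
        for x, leaf in zip(X, leaves):
            p = leaf_probs(W, x)
            feats = [1.0] + list(x)
            for k in range(n_leaves):
                err = p[k] - (1.0 if k == leaf else 0.0)
                for j in range(d):
                    grad[k][j] += err * feats[j]
        for k in range(n_leaves):
            for j in range(d):
                W[k][j] -= lr * grad[k][j] / n
    return W

def approx_predict(W, x):
    """Predict with the value of the most probable leaf under the fitted model."""
    p = leaf_probs(W, x)
    return LEAF_VALUES[max(range(len(p)), key=p.__getitem__)]

random.seed(0)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(400)]
leaves = [tree_leaf(x) for x in X]  # the tree's allocation of data points to leaves
W = fit_multinomial(X, leaves, n_leaves=3)

# Fraction of training points where the approximation reproduces the tree's output.
agreement = sum(approx_predict(W, x) == LEAF_VALUES[tree_leaf(x)] for x in X) / len(X)
print(f"agreement with the tree: {agreement:.3f}")
```

In this sketch the tree is replaced by a weight matrix with one row per leaf plus the table of leaf values. For a toy tree this saves nothing; whether it pays off for deep trees, and how much accuracy is lost because a linear model cannot reproduce axis-aligned splits exactly, is the kind of trade-off the abstract sets out to study.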