Paper Title

The best way to select features?

Authors

Xin Man, Ernest Chan

Abstract

Feature selection in machine learning is subject to the intrinsic randomness of the feature selection algorithms (for example, random permutations during MDA). Stability of the selected features with respect to such randomness is essential to the human interpretability of a machine learning algorithm. We propose a rank-based stability metric called the instability index to compare the stabilities of three feature selection algorithms, MDA (Mean Decrease in Accuracy), LIME, and SHAP, as applied to random forests. Typically, features are selected by averaging many random iterations of a selection algorithm. Though we find that the variability of the selected features does decrease as the number of iterations increases, it does not go to zero, and the features selected by the three algorithms do not necessarily converge to the same set. We find LIME and SHAP to be more stable than MDA, and LIME to be at least as stable as SHAP for the top-ranked features. Hence, overall, LIME is best suited for human interpretability. However, the selected sets of features from all three algorithms significantly improve various out-of-sample predictive metrics, and their predictive performances do not differ significantly. Experiments were conducted on synthetic datasets, two public benchmark datasets, and proprietary data from an active investment strategy.
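The abstract does not spell out how the instability index is computed, but the idea of a rank-based stability measure over repeated stochastic runs can be sketched as follows. This is a hypothetical formulation (average pairwise Jaccard distance between top-k feature sets across runs), not the paper's exact metric; the feature indices and the simulated rankings are illustrative only.

```python
import random

def instability_index(rankings, top_k):
    """Average pairwise Jaccard distance between the top-k feature
    sets produced by repeated runs of a stochastic selection
    algorithm. 0 means perfectly stable; values near 1 mean the
    selected sets barely overlap across runs.
    (Hypothetical formulation; the paper's metric may differ.)"""
    tops = [set(r[:top_k]) for r in rankings]
    n = len(tops)
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            inter = len(tops[i] & tops[j])
            union = len(tops[i] | tops[j])
            dists.append(1.0 - inter / union)
    return sum(dists) / len(dists)

# Simulate 20 runs of a stochastic selector: features 0-2 are
# consistently ranked on top, the remaining ranks shuffle randomly.
random.seed(0)
rankings = []
for _ in range(20):
    tail = list(range(3, 10))
    random.shuffle(tail)
    rankings.append([0, 1, 2] + tail)

print(instability_index(rankings, top_k=3))      # stable top 3 -> 0.0
print(instability_index(rankings, top_k=5) > 0)  # noisy tail -> unstable
```

Averaging such a metric over many iterations mirrors the paper's observation: variability shrinks as iterations grow, but for features whose ranks depend on the algorithm's random component it does not vanish.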
