Group Probability-Weighted Tree Sums for Interpretable Modeling of Heterogeneous Data

Paper Title

Group Probability-Weighted Tree Sums for Interpretable Modeling of Heterogeneous Data

Authors

Keyan Nasseri, Chandan Singh, James Duncan, Aaron Kornblith, Bin Yu

Abstract

Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of instances in a dataset (e.g., medical patients grouped by age or treatment site), our method first estimates group membership probabilities for each instance. Then, it uses these estimates as instance weights in FIGS (Tan et al. 2022) to grow a set of decision trees whose values sum to the final prediction. We call this new method Group Probability-Weighted Tree Sums (G-FIGS). G-FIGS achieves state-of-the-art prediction performance on important clinical datasets; e.g., holding the level of sensitivity fixed at 92%, G-FIGS increases specificity for identifying cervical spine injury by up to 10% over CART and up to 3% over FIGS alone, with larger gains at higher sensitivity levels. By keeping the total number of rules below 16 in FIGS, the final models remain interpretable, and we find that their rules match medical domain expertise. All code, data, and models are released on GitHub.
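
The two-step idea in the abstract (estimate group membership probabilities, then use them as instance weights when fitting trees) can be illustrated with a minimal sketch. This is not the authors' released code. It assumes a logistic regression as the group-membership model (the abstract does not say which model is used), substitutes a single weighted decision stump for FIGS, and runs on synthetic data. Every function and variable name below is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_group_model(X, g, lr=0.5, epochs=300):
    """Logistic regression by gradient descent: estimates P(group = 1 | x).
    Assumption: any probabilistic classifier could play this role."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, gi in zip(X, g):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - gi
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gw / n for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def group_probability(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def weighted_mean(pairs):
    total = sum(wi for _, wi in pairs)
    return sum(yi * wi for yi, wi in pairs) / total

def fit_weighted_stump(X, y, weights):
    """Depth-1 tree minimizing weighted squared error.
    A simplified stand-in for FIGS, which grows a sum of several trees."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [(yi, wi) for x, yi, wi in zip(X, y, weights) if x[j] <= t]
            right = [(yi, wi) for x, yi, wi in zip(X, y, weights) if x[j] > t]
            if not left or not right:
                continue
            ml, mr = weighted_mean(left), weighted_mean(right)
            loss = sum(wi * (yi - ml) ** 2 for yi, wi in left)
            loss += sum(wi * (yi - mr) ** 2 for yi, wi in right)
            if best is None or loss < best[0]:
                best = (loss, j, t, ml, mr)
    return best[1:]  # (feature index, threshold, left value, right value)

# Synthetic data: two groups split by "age"; the outcome depends on a
# "marker" feature with a different threshold in each group.
random.seed(0)
X, g, y = [], [], []
for _ in range(200):
    age, marker = random.random(), random.random()
    group = 1 if age < 0.5 else 0
    threshold = 0.3 if group == 1 else 0.7
    X.append([age, marker])
    g.append(group)
    y.append(1 if marker > threshold else 0)

# Step 1: estimate each instance's probability of belonging to group 1.
group_model = fit_group_model(X, g)
weights = [group_probability(group_model, x) for x in X]

# Step 2: fit the group-1 model on ALL instances, weighted by that probability.
stump = fit_weighted_stump(X, y, weights)
print("feature, threshold, left value, right value:", stump)
```

Because every instance keeps a nonzero weight, the group-specific model still draws on data from the other group, which is the pooling effect the abstract describes; instances that look more like the target group simply count for more.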
