Paper Title
flexBART: Flexible Bayesian regression trees with categorical predictors
Paper Authors
Paper Abstract
Most implementations of Bayesian additive regression trees (BART) one-hot encode categorical predictors, replacing each one with several binary indicators, one for every level or category. Regression trees built with these indicators partition the discrete set of categorical levels by repeatedly removing one level at a time. Unfortunately, the vast majority of partitions cannot be built with this strategy, severely limiting BART's ability to partially pool data across groups of levels. Motivated by analyses of baseball data and neighborhood-level crime dynamics, we overcame this limitation by re-implementing BART with regression trees that can assign multiple levels to both branches of a decision tree node. To model spatial data aggregated into small regions, we further proposed a new decision rule prior that creates spatially contiguous regions by deleting a random edge from a random spanning tree of a suitably defined network. Our re-implementation, which is available in the flexBART package, often yields improved out-of-sample predictive performance and scales better to larger datasets than existing implementations of BART.
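The spanning-tree decision rule described in the abstract can be illustrated with a short sketch. The snippet below is only an illustration of the idea, not the flexBART implementation itself (the package is an R/C++ re-implementation of BART and places a prior over such rules): it grows a random spanning tree of the regions' adjacency network, deletes one random edge, and returns the two resulting connected components, each of which is spatially contiguous. The helper names `random_spanning_tree` and `contiguous_split` and the dict-of-neighbor-sets input format are assumptions made for this example.

```python
import random
from collections import defaultdict


def random_spanning_tree(adjacency):
    """Return the edges of a spanning tree of a connected, undirected graph,
    built by a randomized depth-first traversal.

    `adjacency` maps each region to the set of its neighboring regions.
    (This traversal is an illustrative stand-in; a uniform spanning tree,
    e.g. via Wilson's algorithm, could be substituted.)"""
    root = random.choice(list(adjacency))
    visited = {root}
    stack = [root]
    tree_edges = []
    while stack:
        v = stack.pop()
        nbrs = [u for u in adjacency[v] if u not in visited]
        random.shuffle(nbrs)
        for u in nbrs:
            visited.add(u)
            tree_edges.append((v, u))
            stack.append(u)
    return tree_edges


def contiguous_split(adjacency):
    """Delete one random edge of a random spanning tree and return the two
    connected components it leaves behind.

    Both halves are contiguous in the original network, so one can be sent
    to the left branch of a decision node and the other to the right."""
    tree_edges = random_spanning_tree(adjacency)
    cut = random.choice(tree_edges)
    # Adjacency of the spanning tree with the cut edge removed.
    remaining = defaultdict(set)
    for a, b in tree_edges:
        if (a, b) != cut:
            remaining[a].add(b)
            remaining[b].add(a)
    # Collect the component containing one endpoint of the cut edge.
    left, frontier = {cut[0]}, [cut[0]]
    while frontier:
        v = frontier.pop()
        for u in remaining[v]:
            if u not in left:
                left.add(u)
                frontier.append(u)
    right = set(adjacency) - left
    return left, right


# Example: a small 2x2 grid of regions; every returned split is contiguous.
grid = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
print(contiguous_split(grid))
```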