Paper title
Variable selection in social-environmental data: Sparse regression and tree ensemble machine learning approaches
Paper authors
Paper abstract
Objective: Social-environmental data obtained from the U.S. Census is an important resource for understanding health disparities, but the full dataset is rarely used for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.
Materials and Methods: We compared several popular machine learning methods, including penalized regressions (e.g., lasso, elastic net) and tree ensemble methods. Via simulation (10 true associations among 1,000 total variables), we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false-positive results. We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
Results: In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation, followed by sparse group lasso regression, performed best overall. Bayesian Additive Regression Trees outperformed the other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
Discussion: This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods.
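The simulation design summarized in the Materials and Methods can be illustrated with a small sketch. It is not the authors' code, and it makes several simplifying assumptions: independent Gaussian predictors (real census variables are strongly correlated), 300 observations and 200 variables rather than 1,000, a continuous outcome only, a unit effect size, a fixed penalty `alpha = 0.2` instead of a cross-validated one, and a hand-written coordinate descent solver. All function and variable names are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in lasso / elastic net updates."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def elastic_net_cd(X, y, alpha, l1_ratio=1.0, n_iter=200, tol=1e-6):
    """Coordinate descent for the objective
        (1/2n)||y - Xb||^2 + alpha*l1_ratio*||b||_1
                           + 0.5*alpha*(1 - l1_ratio)*||b||_2^2.
    l1_ratio=1.0 gives the lasso. X columns and y are assumed centered."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                     # current residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n    # mean square of each column
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # covariance of column j with the partial residual (j excluded)
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio)
            )
            if new != old:
                resid -= X[:, j] * (new - old)
                max_change = max(max_change, abs(new - old))
                beta[j] = new
        if max_change < tol:
            break
    return beta


def selection_counts(beta, true_idx):
    """Count selected variables that are truly associated (TP) or not (FP)."""
    selected = set(np.flatnonzero(beta).tolist())
    truth = set(true_idx)
    return len(selected & truth), len(selected - truth)


# Assumed toy design: 10 truly associated variables out of 200.
rng = np.random.default_rng(0)
n, p, n_true = 300, 200, 10
X = rng.standard_normal((n, p))
true_idx = list(range(n_true))
beta_true = np.zeros(p)
beta_true[true_idx] = 1.0
y = X @ beta_true + rng.standard_normal(n)

# Center so that no intercept term is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

lasso_beta = elastic_net_cd(X, y, alpha=0.2, l1_ratio=1.0)
enet_beta = elastic_net_cd(X, y, alpha=0.2, l1_ratio=0.5)

lasso_tp, lasso_fp = selection_counts(lasso_beta, true_idx)
en_tp, en_fp = selection_counts(enet_beta, true_idx)

print(f"lasso:       {lasso_tp} true positives, {lasso_fp} false positives")
print(f"elastic net: {en_tp} true positives, {en_fp} false positives")
```

Because the two fits share the same `alpha`, the elastic net here applies a smaller L1 threshold than the lasso, so it tends to retain more variables. This toy comparison only shows how true-positive and false-positive counts can be tallied per method; it does not reproduce the paper's results.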