Paper title

Solar: $L_0$ solution path averaging for fast and accurate variable selection in high-dimensional data

Authors

Ning Xu, Timothy C. G. Fisher

Abstract

We propose a new variable selection algorithm, subsample-ordered least-angle regression (solar), and its coordinate descent generalization, solar-cd. Solar reconstructs lasso paths using the $L_0$ norm and averages the resulting solution paths across subsamples. Path averaging retains the ranking information of the informative variables while averaging out sensitivity to high dimensionality, improving the stability, efficiency, and accuracy of variable selection. We prove that: (i) with high probability, path averaging perfectly separates informative variables from redundant variables on the average $L_0$ path; (ii) solar variable selection is consistent and accurate; and (iii) the probability that solar omits weak signals is controllable for finite sample sizes. We also demonstrate that: (i) with less than $1/3$ of the lasso computation load, solar yields substantial improvements over lasso in the sparsity (64-84\% reduction in redundant variable selection) and accuracy of variable selection; (ii) compared with the lasso safe/strong rule and variable screening, solar largely avoids selecting redundant variables and rejecting informative variables in the presence of complicated dependence structures; (iii) the sparsity and stability of solar conserve residual degrees of freedom for data-splitting hypothesis testing, improving the accuracy of post-selection inference on weak signals with limited $n$; (iv) replacing lasso with solar in bootstrap selection (e.g., bolasso or stability selection) produces a multi-layer variable ranking scheme that improves selection sparsity and ranking accuracy with the computation load of only one lasso realization; and (v) for given computation resources, solar bootstrap selection is substantially faster (98\% lower computation time) than parallelized bootstrap lasso running at its theoretical maximum speedup, as given by Amdahl's law.
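
The path-averaging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `entry_order` and `average_path_scores`, the subsample settings, and the scoring rule `(p - stage) / p` are assumptions made for illustration, and a greedy forward selection (orthogonal-matching-pursuit style) stands in for least-angle regression. The sketch records the order in which variables enter the path on each subsample, converts that order into a score, and averages the scores across subsamples.

```python
import numpy as np


def entry_order(X, y, max_steps):
    """Order in which variables enter a greedy forward path.

    Assumption: greedy forward selection is used here as a simple
    stand-in for a least-angle regression path.
    """
    Xc = X - X.mean(axis=0)          # center columns on this subsample
    yc = y - y.mean()
    residual = yc.copy()
    active = []
    for _ in range(min(max_steps, X.shape[1])):
        corr = np.abs(Xc.T @ residual)
        if active:
            corr[active] = -np.inf   # never pick a variable twice
        j = int(np.argmax(corr))
        active.append(j)
        # Refit least squares on the active set and update the residual.
        coef, *_ = np.linalg.lstsq(Xc[:, active], yc, rcond=None)
        residual = yc - Xc[:, active] @ coef
    return active


def average_path_scores(X, y, n_subsamples=20, subsample_frac=0.8, seed=0):
    """Average entry-order scores across random subsamples.

    Hypothetical scoring rule: the first variable to enter gets 1,
    each later entrant gets 1/p less, and variables that never enter get 0.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(subsample_frac * n)
    max_steps = min(m - 1, p)
    scores = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)
        q = np.zeros(p)
        for stage, j in enumerate(entry_order(X[idx], y[idx], max_steps)):
            q[j] = (p - stage) / p
        scores += q
    return scores / n_subsamples


# Synthetic example: only variables 0 and 1 are informative.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(100)
scores = average_path_scores(X, y)
ranking = np.argsort(scores)[::-1]   # variables sorted by average score
```

In this synthetic setting the informative variables should sit at the top of `ranking`, since they enter early on every subsample while the entry order of the redundant variables varies from one subsample to the next. How the averaged ranking is turned into a final selection is left out here, because the abstract does not specify it.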
