Paper title
A Conditional Randomization Test for Sparse Logistic Regression in High-Dimension
Paper authors
Paper abstract
Identifying the relevant variables for a classification model with correct confidence levels is a central but difficult task in high dimension. Despite the core role of sparse logistic regression in statistics and machine learning, it still lacks a good solution for accurate inference in the regime where the number of features $p$ is as large as or larger than the number of samples $n$. Here, we tackle this problem by improving the Conditional Randomization Test (CRT). The original CRT algorithm shows promise as a way to output p-values while making few assumptions on the distribution of the test statistics. As it comes with a prohibitive computational cost even in mildly high-dimensional problems, faster solutions based on distillation have been proposed. Yet, they rely on unrealistic hypotheses and result in low-power solutions. To improve on this, we propose \emph{CRT-logit}, an algorithm that combines a variable-distillation step and a decorrelation step that takes into account the geometry of the $\ell_1$-penalized logistic regression problem. We provide a theoretical analysis of this procedure and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets.
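As a rough illustration of the original CRT that the abstract builds on (not of CRT-logit itself), the sketch below computes a Monte Carlo p-value for one variable by redrawing it from its conditional distribution given the other variables and recomputing a test statistic each time. Everything specific here is an assumption made for the example: the two-feature Gaussian toy data with known correlation `rho`, the helper names `crt_pvalue`, `make_gaussian_sampler`, `abs_cov_statistic`, and `demo`, and the simple covariance statistic, which stands in for the penalized-regression statistics used in practice.

```python
import math
import random


def crt_pvalue(x_j, x_rest, y, sample_conditional, statistic, n_draws=200, rng=None):
    """Monte Carlo CRT p-value for H0: x_j is independent of y given x_rest."""
    if rng is None:
        rng = random.Random(0)
    t_obs = statistic(x_j, x_rest, y)
    exceed = 0
    for _ in range(n_draws):
        # Redraw x_j from its conditional law given the other variables;
        # y is left untouched, so the redraw is null by construction.
        x_tilde = sample_conditional(x_rest, rng)
        if statistic(x_tilde, x_rest, y) >= t_obs:
            exceed += 1
    # The +1 terms keep the p-value strictly positive and valid in finite samples.
    return (1 + exceed) / (1 + n_draws)


def abs_cov_statistic(x_j, x_rest, y):
    """Hypothetical statistic: |sum_i x_j[i] * (y[i] - mean(y))|; x_rest is unused."""
    y_bar = sum(y) / len(y)
    return abs(sum(x * (t - y_bar) for x, t in zip(x_j, y)))


def make_gaussian_sampler(rho):
    """Assumed model: two unit-variance Gaussian features with correlation rho,
    so each feature given the other is N(rho * other, 1 - rho**2)."""
    sd = math.sqrt(1.0 - rho ** 2)

    def sample(x_other, rng):
        return [rng.gauss(rho * v, sd) for v in x_other]

    return sample


def demo(n=300, rho=0.5, beta=2.0, seed=0):
    """Toy data: y depends on x1 only, so x2 is null given x1."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [rng.gauss(rho * v, sd) for v in x1]
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-beta * v)) else 0 for v in x1]
    sampler = make_gaussian_sampler(rho)
    p_x1 = crt_pvalue(x1, x2, y, sampler, abs_cov_statistic, rng=rng)
    p_x2 = crt_pvalue(x2, x1, y, sampler, abs_cov_statistic, rng=rng)
    return p_x1, p_x2


if __name__ == "__main__":
    print(demo())
```

In this sketch the statistic is recomputed once per draw and per variable. With a cheap statistic that is harmless, but if the statistic required fitting a sparse model on all $p$ features, the cost would grow with both the number of draws and $p$, which is presumably the computational burden the abstract refers to.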