Paper title
Provably Auditing Ordinary Least Squares in Low Dimensions
Paper authors
Paper abstract
Measuring the stability of conclusions derived from Ordinary Least Squares linear regression is critically important, but most metrics either measure only local stability (i.e., against infinitesimal changes in the data) or are interpretable only under statistical assumptions. Recent work proposes a simple, global, finite-sample stability metric: the minimum number of samples that need to be removed so that rerunning the analysis overturns the conclusion, specifically meaning that the sign of a particular coefficient of the estimated regressor changes. However, besides the trivial exponential-time algorithm, the only approach for computing this metric is a greedy heuristic that lacks provable guarantees under reasonable, verifiable assumptions; the heuristic provides a loose upper bound on the stability and cannot certify lower bounds on it. We show that in the low-dimensional regime, where the number of covariates is a constant but the number of samples is large, there are efficient algorithms for provably estimating (a fractional version of) this metric. Applying our algorithms to the Boston Housing dataset, we exhibit regression analyses where we can estimate the stability up to a factor of $3$ better than the greedy heuristic, and analyses where we can certify stability to dropping even a majority of the samples.
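To make the stability metric concrete, the following is a minimal sketch of the trivial exponential-time baseline mentioned in the abstract, not the paper's efficient algorithm. It tries every subset of samples in order of increasing size and returns the first one whose removal strictly flips the sign of the chosen coefficient. The function names `ols_coef` and `brute_force_stability`, the strict-flip convention (a coefficient of exactly zero does not count as overturned), and the rule of keeping at least as many samples as covariates are assumptions made for illustration. The sketch computes the integral metric, whereas the paper's guarantees concern a fractional version.

```python
from itertools import combinations

import numpy as np


def ols_coef(X, y):
    """Return the least-squares coefficient vector for y ~ X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def brute_force_stability(X, y, coef_index):
    """Smallest number of samples whose removal flips the sign of one coefficient.

    Returns (k, dropped_indices), or None if no subset flips the sign while
    leaving at least as many samples as covariates (an illustrative choice).
    Runtime is exponential in the number of samples.
    """
    n, d = X.shape
    base = ols_coef(X, y)[coef_index]
    for k in range(1, n - d + 1):
        for drop in combinations(range(n), k):
            keep = np.ones(n, dtype=bool)
            keep[list(drop)] = False
            beta = ols_coef(X[keep], y[keep])[coef_index]
            # Strict sign flip: the product of old and new coefficient is negative.
            if base * beta < 0:
                return k, drop
    return None
```

For example, with a single all-ones covariate (so the coefficient is the sample mean) and responses `[2, 2, 2, -1, -1.5]`, no single removal makes the mean negative, but removing two of the positive samples does, so the sketch reports a stability of 2. This is feasible only for tiny datasets, which is why efficient estimation with provable guarantees matters.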