Paper title
Provable Training Set Debugging for Linear Regression
Paper authors
Paper abstract
We investigate problems in penalized $M$-estimation, inspired by applications in machine learning debugging. Data are collected from two pools, one containing data with possibly contaminated labels, and the other known to contain only cleanly labeled points. We first formulate a general statistical algorithm for identifying buggy points and provide rigorous theoretical guarantees under the assumption that the data follow a linear model. We then present two case studies to illustrate the results of our general theory and the dependence of our estimator on clean versus buggy points. We further propose an algorithm for tuning parameter selection of our Lasso-based algorithm and provide corresponding theoretical guarantees. Finally, we consider a two-person "game" played between a bug generator and a debugger, where the debugger can augment the contaminated data set with cleanly labeled versions of points in the original data pool. We establish a theoretical result showing a sufficient condition under which the bug generator can always fool the debugger. Nonetheless, we provide empirical results showing that such a situation may not occur in practice, making it possible for natural augmentation strategies combined with our Lasso debugging algorithm to succeed.
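The abstract does not spell out the estimator, so the following is only a minimal sketch of how a Lasso-based debugger of this kind could look, not the authors' implementation. It assumes a mean-shift formulation: the possibly contaminated pool follows y = Xβ + γ + noise with a sparse bug vector γ, the clean pool follows ỹ = X̃β + noise, and one minimizes (1/2n)‖y − Xβ − γ‖² + (η/2m)‖ỹ − X̃β‖² + λ‖γ‖₁. The function name `lasso_debug`, the weight `eta`, the penalty `lam`, the alternating solver, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def soft_threshold(r, t):
    """Elementwise soft-thresholding, the proximal map of t * |.|."""
    return np.sign(r) * np.maximum(np.abs(r) - t, 0.0)


def lasso_debug(X, y, X_clean, y_clean, lam, eta=1.0, n_iter=200):
    """Alternating minimization of the assumed objective
        (1/2n)||y - X b - g||^2 + (eta/2m)||y_clean - X_clean b||^2 + lam*||g||_1.
    Returns (beta, gamma); nonzero entries of gamma flag suspected bugs.
    """
    n, m = len(y), len(y_clean)
    gamma = np.zeros(n)
    # The normal-equation matrix of the beta step does not depend on gamma.
    A = X.T @ X / n + eta * X_clean.T @ X_clean / m
    for _ in range(n_iter):
        # beta step: weighted least squares on the bug-corrected labels.
        b = X.T @ (y - gamma) / n + eta * X_clean.T @ y_clean / m
        beta = np.linalg.solve(A, b)
        # gamma step: soft-threshold the residuals at level n * lam.
        gamma = soft_threshold(y - X @ beta, n * lam)
    return beta, gamma


# Synthetic illustration (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n, m, p = 100, 20, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
X_clean = rng.normal(size=(m, p))
y = X @ beta_true + 0.05 * rng.normal(size=n)
y_clean = X_clean @ beta_true + 0.05 * rng.normal(size=m)
y[:5] += 5.0  # contaminate the labels of the first five points

beta_hat, gamma_hat = lasso_debug(X, y, X_clean, y_clean, lam=2.0 / n)
flagged = np.flatnonzero(gamma_hat).tolist()
print("flagged indices:", flagged)
print("estimated beta:", beta_hat)
```

In this sketch the threshold `n * lam` plays the role of the tuning parameter: it should sit above the noise level of clean residuals and below the size of the label corruptions, which is the kind of trade-off a tuning-parameter selection procedure has to resolve.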