学习的可重复性

论文标题

学习的可重复性

Reproducibility in Learning

论文作者

Impagliazzo, Russell, Lei, Rex, Pitassi, Toniann, Sorrell, Jessica

论文摘要

我们在学习背景下介绍了可再现算法的概念。可再现的学习算法对于其样品的变化具有弹性 - 具有很高的概率，当从同一基础分布的两个样品上运行时，它返回完全相同的输出。我们首先要解开定义，阐明随机性如何在平衡准确性和可重复性方面发挥作用。我们启动了可再现算法的理论，显示了可重复性如何意味着所需的特性，例如数据重用和有效的可检验性。尽管对可重复性的需求非常强，但对于统计和学习中的几个基本问题还是有效的可重复算法。首先，我们表明，任何统计查询算法都可以使样品复杂性适度增加，并使用它来构建可再现的算法，以查找近似重型热门和中位数。使用这些想法，我们通过可再现的弱学习者和可再现的增强算法给出了第一种可再现算法，用于学习半空间。最后，我们启动了可再现算法的下限和固有的权衡方面的研究，从而为可再现的SQ算法提供了几乎紧密的样本复杂性上限和下限。

We introduce the notion of a reproducible algorithm in the context of learning. A reproducible learning algorithm is resilient to variations in its samples -- with high probability, it returns the exact same output when run on two samples from the same underlying distribution. We begin by unpacking the definition, clarifying how randomness is instrumental in balancing accuracy and reproducibility. We initiate a theory of reproducible algorithms, showing how reproducibility implies desirable properties such as data reuse and efficient testability. Despite the exceedingly strong demand of reproducibility, there are efficient reproducible algorithms for several fundamental problems in statistics and learning. First, we show that any statistical query algorithm can be made reproducible with a modest increase in sample complexity, and we use this to construct reproducible algorithms for finding approximate heavy-hitters and medians. Using these ideas, we give the first reproducible algorithm for learning halfspaces via a reproducible weak learner and a reproducible boosting algorithm. Finally, we initiate the study of lower bounds and inherent tradeoffs for reproducible algorithms, giving nearly tight sample complexity upper and lower bounds for reproducible versus nonreproducible SQ algorithms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题