Paper Title
TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm
Paper Authors
Abstract
Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than that of its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$, and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying numbers of pieces. We provide a method that estimates this number near-optimally, and hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.
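To make the $t$-piece degree-$d$ framework concrete, the following minimal sketch fits a piecewise polynomial to a discretized density and measures the resulting $\ell_1$ error. This is not the paper's TURF estimator; it is a toy baseline that assumes equal-width pieces and per-piece least-squares fits (both simplifications), purely to illustrate what "closest $t$-piece degree-$d$ polynomial" means.

```python
import numpy as np

def piecewise_poly_l1(hist, t, d):
    """Fit a t-piece degree-d polynomial to the discretized density `hist`
    (equal-width pieces, ordinary least squares per piece) and return the
    l1 distance between `hist` and the fit. Illustrative only: the actual
    TURF estimator chooses pieces adaptively, not at fixed equal widths."""
    n = len(hist)
    edges = np.linspace(0, n, t + 1, dtype=int)  # equal-width piece boundaries
    approx = np.empty(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        x = np.arange(lo, hi)
        # Degree capped so the fit is well-posed on short pieces.
        coeffs = np.polyfit(x, hist[lo:hi], min(d, hi - lo - 1))
        approx[lo:hi] = np.polyval(coeffs, x)
    return float(np.abs(hist - approx).sum())

# A triangular density on 100 bins is exactly a 2-piece linear function,
# so a (t=2, d=1) fit should drive the l1 error to (numerically) zero,
# while a single constant piece (t=1, d=0) cannot.
h = np.concatenate([np.linspace(0, 1, 50), np.linspace(1, 0, 50)])
h /= h.sum()
err_21 = piecewise_poly_l1(h, t=2, d=1)
err_10 = piecewise_poly_l1(h, t=1, d=0)
```

The triangular example also hints at why the right number of pieces matters, echoing the abstract's point: the same density that is exact under $(t,d)=(2,1)$ incurs an unavoidable error under $(t,d)=(1,0)$.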