论文标题

聚集的草图用于聚类

Polylogarithmic Sketches for Clustering

论文作者

Charikar, Moses, Waingarten, Erik

论文摘要

给定$ n $点的$ \ ell_p^d $中的$ n $点,我们认为将点分配给具有关联中心的$ k $簇的问题。聚类的成本是$ p^{\ text {th}} $距离群集中心的距离的总和。对于$ p \在[1,2] $中,我们设计了大小poly $(\ log(nd),k,1/ε)$的草图,使最佳聚类的成本估计在因子$ 1+ε$之内,尽管压缩表示不足以包含足够的信息来恢复群集中心或分区中的群集或群集中。这导致了用于用空格poly $(\ log(nd),k,1/ε)$估算聚类成本的流算法。我们还获得了分布式内存算法,其中$ n $点在$ m $机器中任意分区,每个计算机都会向中央方发送信息,然后将信息发送到集群成本的近似值。在这项工作之前,没有任何此类流或分布式内存算法以sublinear依赖于$ d $ in [1,2)$。

Given $n$ points in $\ell_p^d$, we consider the problem of partitioning points into $k$ clusters with associated centers. The cost of a clustering is the sum of $p^{\text{th}}$ powers of distances of points to their cluster centers. For $p \in [1,2]$, we design sketches of size poly$(\log(nd),k,1/ε)$ such that the cost of the optimal clustering can be estimated to within factor $1+ε$, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly$(\log(nd),k,1/ε)$. We also obtain a distributed memory algorithm, where the $n$ points are arbitrarily partitioned amongst $m$ machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on $d$ for $p \in [1,2)$.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源