论文标题

大数据的更快的投射聚类近似

Faster Projective Clustering Approximation of Big Data

论文作者

Statman, Adiel, Rozenberg, Liat, Feldman, Dan

论文摘要

在投影聚类中,我们在$ r^d $中给出了一组n个点,并希望根据给定的距离功能将它们聚集到$ k $线性子空间中的$ k $线性子空间。对于此问题的$ \ eps $ -coreset是输入点的加权(缩放)子集,因此对于每一个可能的$ s $,这些距离的总和近似于$(1+ \ eps)$。我们建议与现有的$ \ \ exp(m)$解决方案相比,建议$ m $ $ o(ndm)$时间的第一个$ o(\ log(m))$近似值来减少现有核心的大小。然后,我们在这些线上投射要点,并证明对于足够大的$ m $,我们获得了用于投影聚类的核心。我们的算法也概括以处理异常值。还提供了实验结果和开放代码。

In projective clustering we are given a set of n points in $R^d$ and wish to cluster them to a set $S$ of $k$ linear subspaces in $R^d$ according to some given distance function. An $\eps$-coreset for this problem is a weighted (scaled) subset of the input points such that for every such possible $S$ the sum of these distances is approximated up to a factor of $(1+\eps)$. We suggest to reduce the size of existing coresets by suggesting the first $O(\log(m))$ approximation for the case of $m$ lines clustering in $O(ndm)$ time, compared to the existing $\exp(m)$ solution. We then project the points on these lines and prove that for a sufficiently large $m$ we obtain a coreset for projective clustering. Our algorithm also generalize to handle outliers. Experimental results and open code are also provided.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源