Clustering small datasets in high-dimension by random projection

Paper title

Clustering small datasets in high-dimension by random projection

Authors

Bradford, Alden; Yellamraju, Tarun; Boutin, Mireille

Abstract

Datasets in high-dimension do not typically form clusters in their original space; the issue is worse when the number of points in the dataset is small. We propose a low-computation method to find statistically significant clustering structures in a small dataset. The method proceeds by projecting the data on a random line and seeking binary clusterings in the resulting one-dimensional data. Non-linear separations are obtained by extending the feature space using monomials of higher degrees in the original features. The statistical validity of the clustering structures obtained is tested in the projected one-dimensional space, thus bypassing the challenge of statistical validation in high-dimension. Projecting on a random line is an extreme dimension reduction technique that has previously been used successfully as part of a hierarchical clustering method for high-dimensional data. Our experiments show that with this simplified framework, statistically significant clustering structures can be found with as few as 100-200 points, depending on the dataset. The different structures uncovered are found to persist as more points are added to the dataset.
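The pipeline described in the abstract (monomial feature expansion, projection onto a random line, a binary split of the one-dimensional data, and a significance check carried out in one dimension) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the standardization of the expanded features, the use of normalized within-cluster scatter as the split criterion, and the Monte Carlo comparison against Gaussian samples of the same size, which stands in for whatever statistical test the authors actually use.

```python
import itertools
import numpy as np


def monomial_features(X, degree):
    """Return all monomials of total degree 1..degree in the columns of X."""
    n, d = X.shape
    cols = []
    for k in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(d), k):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)


def split_score(z):
    """Best two-cluster split of 1D data.

    Returns (within-cluster scatter / total scatter, threshold).
    A small ratio means the two groups are tight relative to the spread.
    """
    z = np.sort(z)
    n = z.size
    total = np.sum((z - z.mean()) ** 2)
    csum = np.cumsum(z)
    csq = np.cumsum(z ** 2)
    k = np.arange(1, n)  # candidate sizes of the left cluster
    left = csq[k - 1] - csum[k - 1] ** 2 / k
    right = (csq[-1] - csq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
    within = left + right
    j = np.argmin(within)
    threshold = 0.5 * (z[j] + z[j + 1])
    return within[j] / total, threshold


def random_line_clustering(X, degree=1, n_null=200, seed=0):
    """Project expanded features on one random line and split in two.

    The p-value is a Monte Carlo estimate under an ASSUMED Gaussian null
    (hypothetical stand-in, not the paper's test).
    """
    rng = np.random.default_rng(seed)
    F = monomial_features(X, degree)
    F = (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)  # assumed scaling
    w = rng.standard_normal(F.shape[1])
    w /= np.linalg.norm(w)  # random unit direction
    z = F @ w  # one-dimensional projection
    score, threshold = split_score(z)
    null = np.array(
        [split_score(rng.standard_normal(z.size))[0] for _ in range(n_null)]
    )
    p_value = (1 + np.sum(null <= score)) / (1 + n_null)
    labels = (z > threshold).astype(int)
    return labels, score, p_value
```

In practice one would likely repeat `random_line_clustering` with several seeds, since a single random line may or may not reveal a separation, and keep the projections whose one-dimensional test rejects the null. Because the check is run on the projected values only, its cost does not grow with the dimension of the expanded feature space.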
