论文标题

相距不超过6英尺:强劲的K均值通过半径上限

No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds

论文作者

Humayun, Ahmed Imtiaz, Balestriero, Randall, Kyrillidis, Anastasios, Baraniuk, Richard

论文摘要

基于Centroid的聚类方法,例如K-均值,K-Medoids和K-Centers在探索性数据分析中被大量应用作为首选工具。在许多情况下,这些方法用于获得数据歧管的代表性质心,以可视化或摘要数据集。现实世界的数据集通常包含固有的异常,例如重复的样本和采样偏见,表现出不平衡的聚类。我们建议通过在质心形成的群集上引入最大半径约束$ r $来纠正这种情况,即,从同一集群中的样本则不应以$ \ ell_2 $距离的价格分开超过$ 2R $。我们通过求解半明确程序来实现此约束,然后是二次约束的线性分配问题。通过定性结果,我们表明我们提出的方法对数据集的不平衡和采样伪像是可靠的。据我们所知,我们的第一个受约束的K-均值聚类方法具有硬半径约束。在https://bit.ly/kmeans限制的代码

Centroid based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis. In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset. Real world datasets often contain inherent abnormalities, e.g., repeated samples and sampling bias, that manifest imbalanced clustering. We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by the centroids, i.e., samples from the same cluster should not be more than $2r$ apart in terms of $\ell_2$ distance. We achieve this constraint by solving a semi-definite program, followed by a linear assignment problem with quadratic constraints. Through qualitative results, we show that our proposed method is robust towards dataset imbalances and sampling artifacts. To the best of our knowledge, ours is the first constrained k-means clustering method with hard radius constraints. Codes at https://bit.ly/kmeans-constrained

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源