论文标题

Wasserstein $ k $ -Means用于聚类概率分布

Wasserstein $K$-means for clustering probability distributions

论文作者

Zhuang, Yubo, Chen, Xiaohui, Yang, Yun

论文摘要

聚类是基于它们的相似性对组对象的重要探索性数据分析技术。广泛使用的$ k $ -MEANS聚类方法依靠一些距离概念将数据分为较少的组。在欧几里得空间中,$ k $ -Means的基于质心和基于距离的配方相同。在现代机器学习应用中,数据通常是作为概率分布而出现的,并且可以使用最佳传输指标来处理测量值数据。由于瓦斯坦斯坦空间的非负亚历山德罗夫曲率,巴里中心遭受了规律性和非舒适性问题。 Wasserstein Barycenters的特殊行为可能使基于质心的配方无法代表集群内的数据点,而基于距离的$ K $ -MEANS方法及其半际计划(SDP)放松能够恢复真正的集群标签。在聚集高斯分布的特殊情况下,我们表明SDP放松的Wasserstein $ k $ - 金钱可以实现精确的恢复,因为这些集群按照$ 2 $ - WASSERSTEIN MERTRIC进行了良好的分离。我们的仿真和真实数据示例还表明,基于距离的$ K $ -Means可以比基于标准的基于Centroid的$ K $ -Means获得更好的分类性能,用于聚类概率分布和图像。

Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a fewer number of groups. In the Euclidean space, centroid-based and distance-based formulations of the $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions and a natural generalization to handle measure-valued data is to use the optimal transport metric. Due to non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may make the centroid-based formulation fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP relaxed Wasserstein $K$-means can achieve exact recovery given the clusters are well-separated under the $2$-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based $K$-means can achieve better classification performance over the standard centroid-based $K$-means for clustering probability distributions and images.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源