Hierarchical Agglomerative Graph Clustering in Poly-Logarithmic Depth

Paper Title

Hierarchical Agglomerative Graph Clustering in Poly-Logarithmic Depth

Authors

Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

Abstract

Obtaining scalable algorithms for hierarchical agglomerative clustering (HAC) is of significant interest due to the massive size of real-world datasets. At the same time, efficiently parallelizing HAC is difficult due to the seemingly sequential nature of the algorithm. In this paper, we address this issue and present ParHAC, the first efficient parallel HAC algorithm with sublinear depth for the widely-used average-linkage function. In particular, we provide a $(1+ε)$-approximation algorithm for this problem on $m$ edge graphs using $\tilde{O}(m)$ work and poly-logarithmic depth. Moreover, we show that obtaining similar bounds for exact average-linkage HAC is not possible under standard complexity-theoretic assumptions. We complement our theoretical results with a comprehensive study of the ParHAC algorithm in terms of its scalability, performance, and quality, and compare with several state-of-the-art sequential and parallel baselines. On a broad set of large publicly-available real-world datasets, we find that ParHAC obtains a 50.1x speedup on average over the best sequential baseline, while achieving quality similar to the exact HAC algorithm. We also show that ParHAC can cluster one of the largest publicly available graph datasets with 124 billion edges in a little over three hours using a commodity multicore machine.
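To make the problem concrete, the following is a minimal sequential sketch of exact average-linkage HAC on a weighted graph. It is not ParHAC and does not reflect the paper's implementation; it only illustrates the greedy merge loop whose step-by-step dependence makes HAC hard to parallelize. It assumes the usual graph-based definition of average linkage (total weight of edges between two clusters divided by the product of their sizes). The function name `average_linkage_hac` and the edge-list input format are hypothetical choices made for this illustration.

```python
def average_linkage_hac(n, edges):
    """Sequential exact average-linkage HAC on an undirected weighted graph.

    n     -- number of vertices, labeled 0..n-1
    edges -- iterable of (u, v, weight) tuples
    Returns a list of merges (a, b, similarity); each merge creates a new
    cluster whose id is n, n+1, ... in merge order.
    Illustrative baseline only: every round rescans all cluster pairs.
    """
    # cut[a][b] = total weight of edges between clusters a and b
    cut = {v: {} for v in range(n)}
    for u, v, w in edges:
        if u != v:
            cut[u][v] = cut[u].get(v, 0.0) + w
            cut[v][u] = cut[v].get(u, 0.0) + w
    size = {v: 1 for v in range(n)}
    merges = []
    next_id = n
    while True:
        # Find the adjacent cluster pair with the highest average linkage.
        best = None
        for a, nbrs in cut.items():
            for b, w in nbrs.items():
                if a < b:
                    sim = w / (size[a] * size[b])
                    if best is None or sim > best[0]:
                        best = (sim, a, b)
        if best is None:
            break  # no edges remain between clusters
        sim, a, b = best
        # Merge a and b: combine their cut weights to every other neighbor.
        merged = {}
        for old in (a, b):
            for x, w in cut.pop(old).items():
                if x != a and x != b:
                    merged[x] = merged.get(x, 0.0) + w
                    del cut[x][old]
        for x, w in merged.items():
            cut[x][next_id] = w
        cut[next_id] = merged
        size[next_id] = size[a] + size[b]
        merges.append((a, b, sim))
        next_id += 1
    return merges
```

Each iteration depends on the cluster structure left by the previous merge, which is the sequential bottleneck the abstract refers to. Relaxing the requirement that every merge be the exact maximum, as a $(1+ε)$-approximation does, is what leaves room for performing many merges at once.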
