Paper Title

Stars: Tera-Scale Graph Building for Clustering and Graph Learning

Paper Authors

CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong

Paper Abstract

A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present $\textit{Stars}$: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning-based models, where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars on multiple datasets, allowing for graph building at the $\textit{Tera-Scale}$, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2- to 10-fold improvements in running time without quality loss.
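To make the two-hop-spanner idea concrete, below is a minimal sketch assuming simple uniform random leader sampling. The function name `build_star_spanner` and its parameters are illustrative, not from the paper; the actual Stars algorithm selects leaders and buckets points far more carefully, which is what yields its nearest-neighbor and near-linear-time guarantees. This toy version only shows why star-shaped graphs give two-hop connectivity with O(n) edges and O(n * k) comparisons instead of O(n^2).

```python
# Minimal sketch of a star-based two-hop spanner (assumed leader sampling;
# not the paper's algorithm). Each point attaches to its nearest leader,
# so any two points sharing a leader are joined by a path of length at
# most two (point -> leader -> point) while only O(n) edges are stored.
import numpy as np


def build_star_spanner(points: np.ndarray, num_leaders: int, seed: int = 0):
    """Return edges of a star graph: every point linked to its nearest leader."""
    rng = np.random.default_rng(seed)
    n = len(points)
    leaders = rng.choice(n, size=num_leaders, replace=False)

    edges = []
    for i in range(n):
        # One similarity comparison per (point, leader) pair: O(n * k)
        # total comparisons, versus O(n^2) for an all-pairs graph.
        dists = np.linalg.norm(points[leaders] - points[i], axis=1)
        nearest_leader = int(leaders[int(np.argmin(dists))])
        if i != nearest_leader:
            edges.append((i, nearest_leader))
    return edges


# Toy usage: 1000 random points in the plane, 32 leaders.
pts = np.random.default_rng(1).normal(size=(1000, 2))
spanner_edges = build_star_spanner(pts, num_leaders=32)
print(f"{len(spanner_edges)} edges for {len(pts)} points")
```

With random leaders, nearby points often land in different stars; the refinements in the paper are precisely about choosing and combining stars so that approximate nearest neighbors provably fall within two-hop neighborhoods.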
