Paper Title


From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Paper Authors

Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Ré

Paper Abstract


Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning problem that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function that measures the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue of the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We show that, after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1 + epsilon)-factor approximation of Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent yield superior clustering quality compared to agglomerative heuristics or other gradient-based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.
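To make the discrete objective concrete, here is a minimal sketch (not the paper's implementation) of Dasgupta's cost for a binary tree: for every pair of leaves, it adds their similarity weight multiplied by the number of leaves under their lowest common ancestor, so a good tree separates dissimilar points near the root and merges similar points near the leaves. The tuple-based tree encoding and the toy similarity weights `w` below are illustrative assumptions.

```python
def leaves(tree):
    # A tree is either a leaf label (int) or a pair of subtrees.
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

def dasgupta_cost(tree, w):
    # For each internal node, pairs split by that node have it as their LCA,
    # so they contribute w_ij * |leaves of the subtree rooted here|.
    if not isinstance(tree, tuple):
        return 0.0
    left, right = tree
    n_here = len(leaves(tree))
    cross = sum(w[min(i, j), max(i, j)] * n_here
                for i in leaves(left) for j in leaves(right))
    return cross + dasgupta_cost(left, w) + dasgupta_cost(right, w)

# Toy similarities on 4 points: {0,1} and {2,3} are the natural clusters.
w = {(0, 1): 1.0, (2, 3): 1.0,
     (0, 2): 0.1, (0, 3): 0.1, (1, 2): 0.1, (1, 3): 0.1}

good = ((0, 1), (2, 3))      # respects the clusters
bad = (((0, 2), 1), 3)       # splits similar points early
print(dasgupta_cost(good, w), dasgupta_cost(bad, w))
```

Minimizing this cost over all binary trees is the discrete problem that HypHC relaxes into a differentiable objective over hyperbolic leaf embeddings.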
