论文标题

订单保留层次集聚集聚类

Order preserving hierarchical agglomerative clustering

论文作者

Bakkelund, Daniel

论文摘要

部分订单和定向的无环图通常是在众多域和应用中自然出现的重复数据结构,用于表示域中实体之间的有序关系。示例是项目计划中的任务依赖性,分布式分类帐中的交易顺序以及计算机程序中任务的执行顺序,仅提及一些。我们研究了保留这种有序数据的分层聚类的秩序问题。也就是说,如果我们的原始数据中有$ a <b $,并用$ [a] $和$ [b] $表示它们的群集,那么我们将在生产的聚类中有$ [a] <[b] $。聚类是基于相似性的,并且使用标准的链接功能,例如单链和完整的链接,并且是经典层次聚类的扩展。 为了实现这一目标,我们从严格订购的数据上运行经典分层聚类来定义输出,以作为部分树状图;具有几个连接组件的经典树状图的子树。然后,我们将部分树状图的嵌入到同一组的超详细信息中的一组中。最佳层次聚类定义为与最接近原始差异度度量的超量学相对应的部分树状图,该测量值在P-norm中测量。因此,该方法是经典分层聚类和超级拟合的组合。 从机器零件数据库中使用参考实现来实验合成随机数据和现实世界数据。与现有方法相比,实验表明,我们的方法在集群质量和秩序保存方面都表现出色。

Partial orders and directed acyclic graphs are commonly recurring data structures that arise naturally in numerous domains and applications and are used to represent ordered relations between entities in the domains. Examples are task dependencies in a project plan, transaction order in distributed ledgers and execution sequences of tasks in computer programs, just to mention a few. We study the problem of order preserving hierarchical clustering of this kind of ordered data. That is, if we have $a < b$ in the original data and denote their respective clusters by $[a]$ and $[b]$, then we shall have $[a] < [b]$ in the produced clustering. The clustering is similarity based and uses standard linkage functions, such as single- and complete linkage, and is an extension of classical hierarchical clustering. To achieve this, we define the output from running classical hierarchical clustering on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is defined as the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the p-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting. A reference implementation is employed for experiments on both synthetic random data and real world data from a database of machine parts. When compared to existing methods, the experiments show that our method excels both in cluster quality and order preservation.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源