论文标题

数据流中的双簇和布尔矩阵分解

Biclustering and Boolean Matrix Factorization in Data Streams

论文作者

Neumann, Stefan, Miettinen, Pauli

论文摘要

我们研究了数据流中两分图和布尔矩阵分解的聚类。我们考虑了一个流设置,其中图的左侧顶点与所有事件边缘一起一个一个。我们提供了一种算法,该算法是,在流过流后,使用sublinear空间恢复了图右侧的簇集;据我们所知,这是该属性的第一种算法。我们还表明,在第二次通过之后,可以恢复两分图的左簇,并显示如何扩展算法以解决布尔矩阵分解问题(通过利用布尔矩阵和双部分图的对应关系)。我们评估了综合数据和现实数据的算法实现。在现实世界数据集上,算法是比静态基线算法更快的大小订单,同时在基线算法的一个因子2中提供质量结果。我们的算法在图中的边数中线性缩放。最后,我们从理论上分析算法,并提供足够的条件,在该算法下,算法在标准随机图模型下恢复了一组种植的簇。

We study the clustering of bipartite graphs and Boolean matrix factorization in data streams. We consider a streaming setting in which the vertices from the left side of the graph arrive one by one together with all of their incident edges. We provide an algorithm that, after one pass over the stream, recovers the set of clusters on the right side of the graph using sublinear space; to the best of our knowledge, this is the first algorithm with this property. We also show that after a second pass over the stream, the left clusters of the bipartite graph can be recovered and we show how to extend our algorithm to solve the Boolean matrix factorization problem (by exploiting the correspondence of Boolean matrices and bipartite graphs). We evaluate an implementation of the algorithm on synthetic data and on real-world data. On real-world datasets the algorithm is orders of magnitudes faster than a static baseline algorithm while providing quality results within a factor 2 of the baseline algorithm. Our algorithm scales linearly in the number of edges in the graph. Finally, we analyze the algorithm theoretically and provide sufficient conditions under which the algorithm recovers a set of planted clusters under a standard random graph model.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源