论文标题

主动采样计数草图(ASC)用于在线稀疏估计数万亿个协方差矩阵

Active Sampling Count Sketch (ASCS) for Online Sparse Estimation of a Trillion Scale Covariance Matrix

论文作者

Dai, Zhenwei, Desai, Aditya, Heckel, Reinhard, Shrivastava, Anshumali

论文摘要

估计和存储高维数据的协方差(或相关)矩阵在计算上具有挑战性,因为记忆和计算需求均与维度相互四次缩放。幸运的是,在文本,点击,元基因组学数据集等文本中观察到的高维协方差矩阵通常很少。在本文中,我们考虑了有效稀疏估计的协方差矩阵的问题。我们针对的数据集的大小要求算法在线,因为多个通过数据的通行证是过于刺激的。在本文中,我们提出了一种主动采样计数草图(ASC),一种在线和一通素描算法,可以准确恢复协方差矩阵的大参数。计数草图(CS)和其他亚线性压缩传感算法为理论上的问题提供了自然解决方案。但是,由于信噪比较低(SNR),在实践中的香草CS在实践中无法正常工作。我们方法的核心是一种新型的主动抽样策略,可增加经典CS的SNR。我们通过合成数据和现实世界高维数据集证明了算法的实用性。 ASC对香草CS有了显着改善,证明了我们主动采样策略的优点。

Estimating and storing the covariance (or correlation) matrix of high-dimensional data is computationally challenging because both memory and computational requirements scale quadratically with the dimension. Fortunately, high-dimensional covariance matrices as observed in text, click-through, meta-genomics datasets, etc are often sparse. In this paper, we consider the problem of efficient sparse estimation of covariance matrices with possibly trillions of entries. The size of the datasets we target requires the algorithm to be online, as more than one pass over the data is prohibitive. In this paper, we propose Active Sampling Count Sketch (ASCS), an online and one-pass sketching algorithm, that recovers the large entries of the covariance matrix accurately. Count Sketch (CS), and other sub-linear compressed sensing algorithms, offer a natural solution to the problem in theory. However, vanilla CS does not work well in practice due to a low signal-to-noise ratio (SNR). At the heart of our approach is a novel active sampling strategy that increases the SNR of classical CS. We demonstrate the practicality of our algorithm with synthetic data and real-world high dimensional datasets. ASCS significantly improves over vanilla CS, demonstrating the merit of our active sampling strategy.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源