Active Sampling Count Sketch (ASCS) for Online Sparse Estimation of a Trillion Scale Covariance Matrix

10/29/2020
by Zhenwei Dai, et al.

Estimating and storing the covariance (or correlation) matrix of high-dimensional data is computationally challenging: both memory and computational requirements scale quadratically with the dimension. Fortunately, high-dimensional covariance matrices observed in text, click-through, and metagenomics datasets are often sparse. In this paper, we consider the problem of efficiently estimating a sparse covariance matrix at a scale of trillions of entries. At this scale the algorithm must be online, as any second pass over the data is prohibitive. We propose Active Sampling Count Sketch (ASCS), an online, one-pass sketching algorithm that accurately recovers the large entries of the covariance matrix. Count Sketch (CS) and other sub-linear compressed sensing algorithms offer a natural solution to the problem in theory. However, vanilla CS does not work well in practice due to a low signal-to-noise ratio (SNR). At the heart of our approach is a novel active sampling strategy that increases the SNR of classical count sketches. We demonstrate the practicality of our algorithm on synthetic data and real-world high-dimensional datasets. ASCS significantly improves over vanilla CS, demonstrating the merit of our active sampling strategy.
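To make the baseline concrete, the following is a minimal sketch of vanilla Count Sketch applied to pairwise products in a single pass. It is not the ASCS algorithm: the abstract does not specify the active sampling rule, so none is implemented here. The names `CountSketch` and `sketch_second_moments`, the table sizes, and the salted use of Python's built-in `hash` are illustrative assumptions, and the features are assumed to be zero-mean so that the average of x_i * x_j estimates the covariance entry (i, j).

```python
import random
import statistics
from itertools import combinations


class CountSketch:
    """Vanilla Count Sketch: several hash tables of signed counters.

    Hypothetical illustration only; sizes and hashing are assumptions.
    """

    def __init__(self, num_tables=5, num_buckets=1024, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        # One (bucket salt, sign salt) pair per table.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64))
                      for _ in range(num_tables)]
        self.tables = [[0.0] * num_buckets for _ in range(num_tables)]

    def _slot(self, key, salts):
        bucket_salt, sign_salt = salts
        bucket = hash((bucket_salt, key)) % self.num_buckets
        sign = 1.0 if hash((sign_salt, key)) & 1 else -1.0
        return bucket, sign

    def update(self, key, value):
        # Add the signed value to one counter in every table.
        for table, salts in zip(self.tables, self.salts):
            bucket, sign = self._slot(key, salts)
            table[bucket] += sign * value

    def query(self, key):
        # Median of the sign-corrected counters across tables.
        estimates = []
        for table, salts in zip(self.tables, self.salts):
            bucket, sign = self._slot(key, salts)
            estimates.append(sign * table[bucket])
        return statistics.median(estimates)


def sketch_second_moments(samples, sketch):
    """Stream samples once, adding x_i * x_j for each nonzero pair i < j.

    Returns the number of samples seen, so that sketch.query((i, j)) / n
    estimates the (i, j) covariance under the zero-mean assumption.
    """
    n = 0
    for x in samples:
        n += 1
        nonzero = [(i, v) for i, v in enumerate(x) if v != 0.0]
        for (i, vi), (j, vj) in combinations(nonzero, 2):
            sketch.update((i, j), vi * vj)
    return n
```

For a stream `samples`, one would call `n = sketch_second_moments(samples, sketch)` and then read an entry with `sketch.query((i, j)) / n`. Every pair's product lands in shared counters, so small entries act as noise on the large ones; this is the low-SNR behavior the abstract attributes to vanilla CS.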

research
10/10/2020

Effective Data-aware Covariance Estimator from Compressed Data

Estimating covariance matrix from massive high-dimensional and distribut...
research
04/04/2018

Active covariance estimation by random sub-sampling of variables

We study covariance matrix estimation for the case of partially observed...
research
11/11/2020

Sketch and Scale: Geo-distributed tSNE and UMAP

Running machine learning analytics over geographically distributed datas...
research
02/25/2022

High-Dimensional Sparse Bayesian Learning without Covariance Matrices

Sparse Bayesian learning (SBL) is a powerful framework for tackling the ...
research
11/19/2013

Near-Optimal Entrywise Sampling for Data Matrices

We consider the problem of selecting non-zero entries of a matrix A in o...
research
11/24/2020

Effective and Sparse Count-Sketch via k-means clustering

Count-sketch is a popular matrix sketching algorithm that can produce a ...
research
05/21/2021

Covariance-Free Sparse Bayesian Learning

Sparse Bayesian learning (SBL) is a powerful framework for tackling the ...
