# Active covariance estimation by random sub-sampling of variables

We study covariance matrix estimation for the case of partially observed random vectors, where different samples contain different subsets of vector coordinates. Each observation is the product of the variable of interest with a 0-1 Bernoulli random variable. We analyze an unbiased covariance estimator under this model, and derive an error bound that reveals relations between the sub-sampling probabilities and the entries of the covariance matrix. We apply our analysis in an active learning framework, where the expected number of observed variables is small compared to the dimension of the vector of interest, and propose a design of optimal sub-sampling probabilities and an active covariance matrix estimation algorithm.

## Authors

• 9 publications
• 26 publications
• ### Detecting the large entries of a sparse covariance matrix in sub-quadratic time

The covariance matrix of a p-dimensional random variable is a fundamenta...
05/12/2015 ∙ by Ofer Shwartz, et al. ∙ 0

• ### Robust covariance estimation under L_4-L_2 norm equivalence

Let X be a centered random vector taking values in R^d and let Σ= E(X⊗ X...
09/27/2018 ∙ by Shahar Mendelson, et al. ∙ 0

• ### Graph quilting: graphical model selection from partially observed covariances

We investigate the problem of conditional dependence graph estimation wh...
12/11/2019 ∙ by Giuseppe Vinci, et al. ∙ 0

• ### Sparse sketches with small inversion bias

For a tall n× d matrix A and a random m× n sketching matrix S, the sketc...
11/21/2020 ∙ by Michał Dereziński, et al. ∙ 0

• ### Hierarchical Regularizers for Mixed-Frequency Vector Autoregressions

Mixed-frequency Vector AutoRegressions (MF-VAR) model the dynamics betwe...
02/23/2021 ∙ by Alain Hecq, et al. ∙ 0

• ### Active Sampling Count Sketch (ASCS) for Online Sparse Estimation of a Trillion Scale Covariance Matrix

Estimating and storing the covariance (or correlation) matrix of high-di...
10/29/2020 ∙ by Zhenwei Dai, et al. ∙ 0

• ### Natural Steganography in JPEG Domain with a Linear Development Pipeline

In order to achieve high practical security, Natural Steganography (NS) ...
01/08/2020 ∙ by Taburet Théo, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We study estimation of the covariance matrix of a random vector from observations of the form , where each component is multiplied by a Bernoulli random variable . We assume the probabilities are known, independent of each other, and of . Partial observations consistent with this model may arise when there are some physical limitations to the observation process (and thus there is missing data) or when the cost of observation needs to be reduced by selecting what to sample (i.e., active learning).

Applications where probabilistic models need to be constructed from observations with missing data include transportation networks [1] and sensor networks [2]. For example, different sensors in the network might have have different capabilities, e.g. different reliability or number of samples per hour that can be acquired, and these can be captured by associating different sub-sampling probabilities to each sensor.

For covariance estimation, or more generally graphical model estimation, active learning approaches have been considered where variables are observed in a sequential and adaptive manner to identify the graphical model with minimal number of observations. Active learning approaches for covariance estimation can be useful in distributed computation environments like sensor networks, where there are acquisition, processing or communication constraints, and there is a need for optimal resource allocation.

One critical difference between missing data and active covariance estimation scenarios is the control over the observation model. In the missing data case, the sub-sampling probabilities are a problem feature that is given or has to be estimated from data, while in active learning problems, the sampling probabilities are a design parameter. The results from this paper can be useful for analysis in the missing data case, as well as for designing active learning algorithms as we will discuss in more detail in Section 5.

In this work we analyze an unbiased covariance matrix estimator under sub-Gaussian assumptions on . Our main result is an error bound on the Frobenius norm that reveals the relation between number of observations, sub-sampling probabilities and entries of the true covariance matrix. We apply this error bound to the design of sub-sampling probabilities in an active covariance estimation scenario. An interesting conclusion from this work is that when the covariance matrix is approximately low rank, an active covariance estimation approach can perform almost as well as an estimator with complete observations. The paper is organized as follows, in Section 2 we review related work. Sections 3 and 4 introduce the problem and the main results respectively. In Section 5 we show an application of our bounds for optimal design of sampling probabilities for a batch based active covariance estimation algorithm. Proofs are presented in Section 6 and conclusion in Sections 7.

## 2 Related work

Lounici [3] studies covariance matrix estimation from missing data when all variables can be observed with the same probability. The focus of [3] is on estimation of approximately low rank covariance matrices with a matrix version of the lasso algorithm. In [4] the empirical covariance matrix under missing data is shown to be indefinite. Most recently Cai and Zhang [5] study the missing completely at random model (MCAR), which assumes arbitrary and unknown missing data probabilities. They show that a modified sample covariance matrix estimator, similar to ours but using the empirical estimation of probabilities , achieves optimal minimax rates in spectral norm for bandable and sparse covariance matrices. Compared to [3, 5, 4], we allow all probabilities to be different and known, and we only study the unbiased sample covariance estimator. Moreover, all of the aforementioned papers consider estimation errors in spectral norm, while we consider errors in Frobenius norm, allowing us to derive error bounds with precise dependences between the entries of the true covariance matrix and the sampling probabilities . These bounds can then be applied to design the sampling distribution in an adaptive manner.

For distributed computing applications, the work of [6] studies covariance estimation from random subspace projections using dense matrices, which are generalized in [7] to sparse projection matrices. The same approach from [7] is used in [8] for memory and complexity reduced PCA. In all these works [6, 8, 7], each measurement is a (possibly sparse) linear combination of a few variables, and even though the designs are random, they have a fixed distribution for all observations.

The work of [9] considers the case when all are unknown, and uses an unbiased covariance matrix estimator as an input to the graphical lasso algorithm [10] for inverse covariance matrix estimation. Other interesting active learning approaches for graphical model selection are [11], [12] and [13]. Vats et. al. [12] uses a method that combines sampling marginals with conditional independence testing to learn the graphical model structure. [11, 13] consider a more general family of algorithms based on sampling high degree vertices, and prove upper and lower performance bounds.

Random sub-sampling and reconstruction of signals has been studied within graph signal processing [14, 15] and statistics [16]. The Bernoulli observation model we use has been studied by [14, 15] as a sampling strategy for graph signals, where sampling probability designs are proposed for reconstruction of deterministic band-limited signals. Also, Romero et. al. [17] derives covariance matrix estimators assuming the target covariance matrix is a linear combination of known covariance matrices, which leads to algorithms and theoretical analysis that are fundamentally different from ours.

## 3 Problem Formulation

### 3.1 Notation

We denote scalars using regular font, while we use bold for vectors and matrices, e.g., , and . The Hadamard product between matrices is defined as . We use for entry-wise matrix norms, with corresponding to the Frobenius norm. denotes norm or spectral norm when applied to vectors or matrices respectively.

### 3.2 Unbiased estimation

Consider a random vector taking values in . We observe

 y=δ⊙x, (1)

where is a vector of Bernoulli random variables. The probability of observing the -th variable is given by . The vector of probabilities is denoted by , and is a diagonal matrix. The average number of samples corresponds to . Given , i.i.d. realizations of , the -th variable will be sampled in average times, if all , then , and we have perfect observation of . We are interested in studying covariance estimation for when and for all . Let and be the mean and covariance of , then

 E(y)=Pμ, and Cov(y)=Σ⊙Ξ+(P−P2)diag(μμ⊤),

where is defined as and when .

For the rest of the paper we will assume . Given a set of of i.i.d. samples of the random vector , define

 ˆΣ=1TT∑k=1y(k)y(k)⊤⊙Ξ†, (2)

where is the Hadamard (entry-wise) inverse of . A simple calculation shows that

is an unbiased estimator for

. Indeed, . Because , the matrix might not be positive semi-definite (conditions for to be positive semi-definite are given in [4]).

## 4 Estimation error

In this section we present an error analysis of the covariance matrix estimator from (2) when has sub-Gaussian entries. Sub-Gaussian random variables include Gaussian, Bernoulli and Bounded random variables. For more information see [18] and references therein.

###### Definition 1 ([18]).

If holds for some , we say is sub-Gaussian. If holds for some , we say is sub-exponential.

###### Definition 2 ([18]).

The sub-Gaussian and sub-exponential norms are defined as

 ∥z∥ψα=inf{u>0:E[exp(|z|α/uα)]≤2}.

for and respectively.

Sub-Gaussian and sub-exponential random variables, and their norms, are related as follows.

###### Proposition 1 ([18]).

If and are sub-Gaussian, then and are sub-exponential with norms satisfying , and .

We have that the following characterization of the product of sub-Gaussian and Bernoulli random variables.

###### Lemma 1.

Let and be a product of Bernoulli and sub-Gaussian random variables with Bernoulli probabilities and respectively, the only dependent variables are and , then

1. is sub-Gaussian and .

2. , , and are sub-exponential with norms satisfying for .

The proof of Lemma 1 can be easily obtained from the definition of sub-Gaussian and sub-exponential norms. We omit it for space considerations. We also define the matrix , with entries given by

 hij=⎧⎪ ⎪⎨⎪ ⎪⎩∥xixj∥ψ1pipji≠j∥x2i∥ψ1pii=j.

Now we state our main result, whose proof appears in Section 6.

###### Theorem 1.

Let be zero mean random vector in with sub-Gaussian entries and norm . Let , where each is a Bernoulli random variable with parameter , independent of each other and of . Given i.i.d. realizations , the estimator from (2) satisfies

 ∥^Σ−Σ∥q≤∥H∥q⎧⎨⎩√γ2log(n)+log(η)T∨γ2log(n)+log(η)T⎫⎬⎭

with probability at least , where is an universal constant. Moreover if and , then

 ∥H∥q≤2σ2^p2r(Σ)∥Σ∥, (3)

where , and is the effective rank.

Theorem 1 shows that the estimation error in probability as the number of samples increases. More importantly, our result reveals that the sampling probabilities are closely related to the sub-exponential norms of the variables through the matrix . The bound from (3 ) suggests that distributions with smaller effective rank [3, 18] can tolerate a more aggressive sub-sampling factor (smaller ). This is not surprising since the effective rank is upper bounded by the actual rank, and can be significantly smaller for distributions whose energy concentrates in few principal components. We also note that the ratio appears in the bounds of [3] as well.

## 5 active covariance estimation

In this section we consider a scenario where we cannot observe (on average) more than

variables at a time, but we have the freedom to choose the sub-sampling probability distribution.

### 5.1 Sub-sampling distribution for true covariance matrix

Based on the error bound from Theorem 1, we propose designing the sub-sampling distribution by approximately minimizing . Theorem 1 suggests that should be larger whenever the sub-Gaussian norm of is large, but also the product should be large when the sub-exponential norm of is large. We assume that . The bounds for and from (6) and (7) respectively suggest the approximation . Given a sampling budget , we estimate the sub-sampling probability vector by solving the following scaled projection problem

 minp,ρ12∥p−ρdiag(Σ)12∥22, s.t. 1⊤p=m,0≤p≤1. (4)

### 5.2 Sub-sampling distribution for empirical covariance matrix

Since the true covariance matrix is unknown, (4) does not lead to a practical estimator. Instead, we consider a batch based algorithm that for a given budget , and a starting sub-sampling distribution, it iteratively refines the sub-sampling probability distribution as a function of previous observations. We show the pseudo code for such procedure in Algorithm 1, which can be summarized in the following steps: observation with variable sub-sampling, covariance estimation, and sub-sampling distribution update. At the -th iteration, i.i.d. realizations are observed according to (1) with sub-sampling probabilities . The covariance estimator is a convex combination of the estimator at the previous iteration, and the estimator for the current batch. Finally, the new covariance matrix estimator is used to update the sub-sampling probabilities as

 p(t)= argminp,ρ12∥p−ρdiag(ˆΣ(t))12∥22, (5) s.t. 1⊤p=m,0≤p≤1.
###### Proposition 2.

Algorithm 1 produces an unbiased estimator for .

###### Proof.

We will proceed by induction. For the first iteration we have used uniform sampling, thus we have that . Now assume , then at the -th iteration, the covariance of the new data satisfies , which implies . ∎

### 5.3 Numerical evaluation

In this section we evaluate our proposed method using the MNIST [19] dataset, which consists of images of scanned digits from to . In our experiments we consider images of the digit . The vectorized, and mean removed images are denoted by with covariance matrix . We consider estimation of the the covariance matrix of , where

is zero mean Gaussian noise with unit variance, therefore

.

To draw i.i.d. realizations of , we sample images without replacement and add Gaussian noise with variance . The effective rank of is controlled by the parameter and satisfies

 r(Σ)=r(C)+nθ1+θ.

We first compare uniform sub-sampling with non uniform sub-sampling for estimation of with . We designed the non uniform sub-sampling distribution using (4) with the true covariance matrix. We report relative errors in Frobenius norm as a function of in Figure 0(a). Each point in the plot is an average over independent trials. We observe that when the non uniform sampling distribution matches closely the performance of the estimator with full data. It is clear that when and performance decreases (for uniform and non uniform sampling) as decreases, and non uniform sampling always outperforms uniform sampling.

In Figure 0(b) we evaluate our active method from Algorithm 1 with batch size . We consider the same scenario as in Figure 0(a) for and . The proposed active covariance estimation method quickly learns the optimal sub-sampling distribution, is always better than uniform sampling, and matches the performance of the non uniform sub-sampling method obtained from the true covariance matrix. Finally we show in Figure 0(c) the same experiment shown in Figure 0(b), but now with covariance matrix with parameter , which changes the effective rank from with to . We observe that the problem becomes more difficult since the estimation errors are larger. There is no major difference between uniform, and non uniform sampling, thus the advantages of the proposed active covariance estimation method are limited. This might be due to various effects including, relative magnitudes of diagonal entries of , sampling budget , effective rank, and probability update algorithm. Moreover, since the effective rank did not change much (compared with ), this experiment suggests the effective rank does not quantify effectively the problem difficulty. Also, a more precise method to update the probabilities might help improving the performance of the active covariance estimation algorithm.

## 6 Proof of Theorem 1

We first need the following concentration bounds

###### Lemma 2.

Under the same assumptions of Theorem 1, for any we have

 P(|ˆΣij−Σij|>ν)≤2exp{−c1Tmin(ν2/h2ij,ν/hij)}, P(|ˆΣii−Σii|>ν)≤2exp{−c2Tmin(ν2/h2ii,ν/hii)},

for off-diagonal and diagonal entries respectively.

###### Proof.

The error events for off-diagonal entries satisfy

 |ˆΣij−Σij|>ν ⇔|T∑k=1(y(k)iy(k)j−pipjΣij)|>νTpipj.

We apply Bernstein’s inequality [18] for sums of independent zero mean sub-exponential random variables, which combined with the bound from Lemma 1 leads to the desired bound. The proof for diagonal terms follows almost the same procedure. ∎

We also need the following geometric result which we state without proof.

###### Lemma 3.

Let , and , then for all , and that satisfy , we have that .

The proof of Theorem 1 starts by bounding the probability of the event . Pick a set of such that , and apply Lemma 3 and the union bound to get

 P(∥^Σ−Σ∥q>ϵ∥Σ∥q)=P(∑i,j|^Σij−Σij|q>ϵq∥Σ∥qq) ≤∑ijP(|^Σij−Σij|q>αijϵq∥Σ∥qq) =∑ijP(|^Σij−Σij|>α1/qijϵ∥Σ∥q) ≤n∑i=12exp⎧⎨⎩−c2Tmin⎛⎝α2/qiiϵ2∥Σ∥2qh2ii,α1/qiiϵ∥Σ∥qhii⎞⎠⎫⎬⎭+ n∑i≠j=12exp⎧⎪⎨⎪⎩−c1Tmin⎛⎜⎝α2/qijϵ2∥Σ∥2qh2ij,α1/qijϵ∥Σ∥qhij⎞⎟⎠⎫⎪⎬⎪⎭.

The last inequality follows from Lemma 2 with constants , appropiately chosen so the inequalities hold for all pairs . We can further simplify by choosing

 P(∥^Σ−Σ∥q>ϵ∥Σ∥q) ≤2nexp{−c2Tmin(ϵ2∥Σ∥2q∥H∥2q,ϵ∥Σ∥q∥H∥q)} +2(n2−n)exp{−c1Tmin(ϵ2∥Σ∥2q∥H∥2q,ϵ∥Σ∥q∥H∥q)} ≤2n2exp{−Tγmin(ϵ2∥Σ∥2q∥H∥2q,ϵ∥Σ∥q∥H∥q)},

where . The proof can be finished by equating to , solving for , and doing some manipulations. To derive the bound from (3) we bound the entries of obtaining

 hii =σ2Σiipi≤σ2Σii^p, (6) hij ≤σ2√ΣiiΣjjpipj≤σ2√ΣiiΣjj^p2. (7)

Then, applying (6) and (7) followed by triangle inequality of the norm we have

 ∥H∥q ≤σ2^p⎡⎣(1−1^pq)n∑i=1Σqii+1^pq(n∑i=1Σq/2ii)2⎤⎦1q ≤σ2^p⎡⎣(1^pq−1)1q∥diag(Σ)∥q+1^p∥diag(Σ)∥q2⎤⎦ ≤σ2tr(Σ)^p2[(1−^pq)1q+1]≤2σ2^p2r(Σ)∥Σ∥.

The last step uses the fact that for all , and the definition of effective rank.

## 7 Conclusion

We studied covariance matrix estimation when the variables are sub-sampled by a product with Bernoulli variables. Variations of this model have been traditionally considered in the analysis of missing data. We study an unbiased estimator for the covariance matrix and derive a novel estimation error bound in entry-wise norm. Our bound illustrates the subtle relations between covariance matrix parameters and sub-sampling distribution. Using this bound, we propose an active covariance matrix estimation algorithm that also produces an unbiased estimator. We show with numerical experiments that the proposed active covariance estimation algorithm outperforms uniform sub-sampling, and closely matches non-uniform sub-sampling with complete knowledge of the true covariance matrix.

## References

• [1] Muhammad Tayyab Asif, Nikola Mitrovic, Justin Dauwels, and Patrick Jaillet,

“Matrix and tensor based methods for missing data estimation in large traffic networks,”

IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 1816–1825, 2016.
• [2] Amol Deshpande, Carlos Guestrin, Samuel R Madden, Joseph M Hellerstein, and Wei Hong, “Model-driven data acquisition in sensor networks,” in Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 2004, pp. 588–599.
• [3] Karim Lounici, “High-dimensional covariance matrix estimation with missing observations,” Bernoulli, vol. 20, no. 3, pp. 1029–1058, 2014.
• [4] Kamil Jurczak and Angelika Rohde, “Spectral analysis of high-dimensional sample covariance matrices with missing observations,” Bernoulli, vol. 23, no. 4A, pp. 2466–2532, 2017.
• [5] T Tony Cai and Anru Zhang, “Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data,”

Journal of multivariate analysis

, vol. 150, pp. 55–74, 2016.
• [6] Martin Azizyan, Akshay Krishnamurthy, and Aarti Singh, “Extreme compressive sampling for covariance estimation,” arXiv preprint arXiv:1506.00898, 2015.
• [7] Farhad Pourkamali-Anaraki, “Estimation of the sample covariance matrix from compressive measurements,” IET Signal Processing, vol. 10, no. 9, pp. 1089–1095, 2016.
• [8] Farhad P Anaraki and Shannon Hughes, “Memory and computation efficient pca via very sparse random projections,” in

Proceedings of the 31st International Conference on Machine Learning (ICML)

, 2014, pp. 1341–1349.
• [9] Mladen Kolar and Eric P Xing, “Consistent covariance selection from data with missing values,” in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, pp. 551–558.
• [10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
• [11] Gautamd Dasarathy, Aarti Singh, Maria-Florina Balcan, and Jong H Park, “Active learning algorithms for graphical model selection,” in Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1356–1364.
• [12] Divyanshu Vats, Robert Nowak, and Richard Baraniuk, “Active learning for undirected graphical model selection,” in Artificial Intelligence and Statistics (AISTATS), 2014, pp. 958–967.
• [13] Jonathan Scarlett and Volkan Cevher, “Lower bounds on active learning for graphical model selection,” in The 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
• [14] Siheng Chen, Rohan Varma, Aarti Singh, and Jelena Kovačević, “Signal recovery on graphs: Fundamental limits of sampling strategies,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 539–554, 2016.
• [15] Gilles Puy, Nicolas Tremblay, Rémi Gribonval, and Pierre Vandergheynst, “Random sampling of bandlimited signals on graphs,” Applied and Computational Harmonic Analysis, 2016.
• [16] Po-Ling Loh and Martin J Wainwright, “High dimensional regression with noisy and missing data: provable guarantees with non-convexity,” The Annals of Statistics, vol. 40, no. 3, pp. 1637–1664, 2012.
• [17] Daniel Romero, Dyonisius Dony Ariananda, Zhi Tian, and Geert Leus, “Compressive covariance sensing: Structure-based compressive sensing beyond sparsity,” IEEE signal processing magazine, vol. 33, no. 1, pp. 78–93, 2016.
• [18] Roman Vershynin,

High Dimensional Probability: An Introduction with Applications in Data Science

,
Cambridge University Press, 2018.
• [19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 – 2324, 1998.