1 Introduction
We study estimation of the covariance matrix of a random vector $\mathbf{x}$ from observations of the form $\mathbf{y} = \boldsymbol{\delta} \circ \mathbf{x}$, where each component $x_i$ is multiplied by a Bernoulli random variable $\delta_i$. We assume the probabilities $p_i = \mathbb{P}(\delta_i = 1)$ are known, and that the $\delta_i$ are independent of each other and of $\mathbf{x}$. Partial observations consistent with this model may arise when there are physical limitations to the observation process (and thus there is missing data), or when the cost of observation needs to be reduced by selecting what to sample (i.e., active learning).
Applications where probabilistic models need to be constructed from observations with missing data include transportation networks [1] and sensor networks [2]. For example, different sensors in the network might have different capabilities, e.g., different reliability or number of samples per hour that can be acquired, and these can be captured by associating a different subsampling probability with each sensor.
For covariance estimation, or more generally graphical model estimation, active learning approaches have been considered in which variables are observed in a sequential and adaptive manner to identify the graphical model with a minimal number of observations. Active learning approaches for covariance estimation can be useful in distributed computation environments like sensor networks, where there are acquisition, processing or communication constraints, and there is a need for optimal resource allocation.
One critical difference between missing data and active covariance estimation scenarios is the control over the observation model. In the missing data case, the subsampling probabilities are a problem feature that is given or has to be estimated from data, while in active learning problems, the sampling probabilities are a design parameter. The results from this paper can be useful for analysis in the missing data case, as well as for designing active learning algorithms as we will discuss in more detail in Section 5.
In this work we analyze an unbiased covariance matrix estimator under subGaussian assumptions on $\mathbf{x}$. Our main result is an error bound on the Frobenius norm that reveals the relation between the number of observations, the subsampling probabilities, and the entries of the true covariance matrix. We apply this error bound to the design of subsampling probabilities in an active covariance estimation scenario. An interesting conclusion from this work is that when the covariance matrix is approximately low rank, an active covariance estimation approach can perform almost as well as an estimator with complete observations. The paper is organized as follows. In Section 2 we review related work. Sections 3 and 4 introduce the problem and the main results respectively. In Section 5 we show an application of our bounds to the optimal design of sampling probabilities for a batch based active covariance estimation algorithm. Proofs are presented in Section 6, and conclusions in Section 7.
2 Related work
Lounici [3] studies covariance matrix estimation from missing data when all variables are observed with the same probability. The focus of [3] is on estimation of approximately low rank covariance matrices with a matrix version of the lasso algorithm. In [4], the empirical covariance matrix under missing data is shown to be indefinite. Most recently, Cai and Zhang [5] study the missing completely at random (MCAR) model, which assumes arbitrary and unknown missing data probabilities. They show that a modified sample covariance matrix estimator, similar to ours but using empirical estimates of the probabilities $p_i$, achieves optimal minimax rates in spectral norm for bandable and sparse covariance matrices. Compared to [3, 4, 5], we allow all probabilities $p_i$ to be different and known, and we only study the unbiased sample covariance estimator. Moreover, all of the aforementioned papers consider estimation errors in spectral norm, while we consider errors in Frobenius norm, allowing us to derive error bounds with precise dependences between the entries of the true covariance matrix and the sampling probabilities $p_i$. These bounds can then be applied to design the sampling distribution in an adaptive manner.
For distributed computing applications, the work of [6] studies covariance estimation from random subspace projections using dense matrices, which are generalized in [7] to sparse projection matrices. The same approach from [7] is used in [8] for memory and complexity reduced PCA. In all of these works [6, 7, 8], each measurement is a (possibly sparse) linear combination of a few variables, and even though the designs are random, they have a fixed distribution for all observations.
The work of [9] considers the case where all $p_i$ are unknown, and uses an unbiased covariance matrix estimator as an input to the graphical lasso algorithm [10] for inverse covariance matrix estimation. Other interesting active learning approaches for graphical model selection are [11], [12] and [13]. Vats et al. [12] use a method that combines sampling marginals with conditional independence testing to learn the graphical model structure. [11, 13] consider a more general family of algorithms based on sampling high degree vertices, and prove upper and lower performance bounds.
Random subsampling and reconstruction of signals have been studied within graph signal processing [14, 15] and statistics [16]. The Bernoulli observation model we use has been studied in [14, 15] as a sampling strategy for graph signals, where sampling probability designs are proposed for reconstruction of deterministic bandlimited signals. Also, Romero et al. [17] derive covariance matrix estimators assuming the target covariance matrix is a linear combination of known covariance matrices, which leads to algorithms and theoretical analysis that are fundamentally different from ours.
3 Problem Formulation
3.1 Notation
We denote scalars using regular font, while we use bold for vectors and matrices, e.g., $x$, $\mathbf{x}$, and $\mathbf{X}$. The Hadamard product between matrices is defined entrywise as $(\mathbf{A} \circ \mathbf{B})_{ij} = A_{ij} B_{ij}$. We use $\|\cdot\|_q$ for entrywise matrix norms, with $q = 2$ corresponding to the Frobenius norm $\|\cdot\|_F$. $\|\cdot\|_2$ denotes the $\ell_2$ norm or the spectral norm when applied to vectors or matrices respectively.
3.2 Unbiased estimation
Consider a random vector $\mathbf{x}$ taking values in $\mathbb{R}^n$. We observe
(1) $\mathbf{y} = \boldsymbol{\delta} \circ \mathbf{x},$
where $\boldsymbol{\delta}$ is a vector of Bernoulli random variables. The probability of observing the $i$th variable is given by $p_i = \mathbb{P}(\delta_i = 1)$. The vector of probabilities is denoted by $\mathbf{p}$, and $\mathbf{P} = \mathrm{diag}(\mathbf{p})$ is a diagonal matrix. The average number of sampled variables corresponds to $\mathbb{E}[\sum_{i=1}^n \delta_i] = \sum_{i=1}^n p_i$. Given $N$ i.i.d. realizations of $\mathbf{y}$, the $i$th variable will be sampled on average $N p_i$ times; if all $p_i = 1$, then $\mathbf{y} = \mathbf{x}$ and we have perfect observation of $\mathbf{x}$. We are interested in studying covariance estimation for $\mathbf{x}$ when $N$ is finite and $p_i > 0$ for all $i$. Let $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ be the mean and covariance of $\mathbf{x}$; then
$\mathbb{E}[\mathbf{y}\mathbf{y}^\top] = \bar{\mathbf{P}} \circ \mathbb{E}[\mathbf{x}\mathbf{x}^\top],$
where $\bar{\mathbf{P}}$ is defined as $\bar{P}_{ij} = p_i p_j$ when $i \neq j$, and $\bar{P}_{ii} = p_i$.
For the rest of the paper we will assume $\boldsymbol{\mu} = \mathbf{0}$. Given a set of $N$ i.i.d. samples $\{\mathbf{y}^{(k)}\}_{k=1}^N$ of the random vector $\mathbf{y}$, define
(2) $\hat{\boldsymbol{\Sigma}} = \left(\frac{1}{N}\sum_{k=1}^N \mathbf{y}^{(k)} \mathbf{y}^{(k)\top}\right) \circ \bar{\mathbf{P}}^{\circ -1},$
where $\bar{\mathbf{P}}^{\circ -1}$ is the Hadamard (entrywise) inverse of $\bar{\mathbf{P}}$, which exists because all $p_i > 0$. A simple calculation shows that $\hat{\boldsymbol{\Sigma}}$ is an unbiased estimator for $\boldsymbol{\Sigma}$. Indeed, $\mathbb{E}[\hat{\boldsymbol{\Sigma}}] = \bar{\mathbf{P}}^{\circ -1} \circ \mathbb{E}[\mathbf{y}\mathbf{y}^\top] = \bar{\mathbf{P}}^{\circ -1} \circ \bar{\mathbf{P}} \circ \boldsymbol{\Sigma} = \boldsymbol{\Sigma}$. Because of the entrywise reweighting, the matrix $\hat{\boldsymbol{\Sigma}}$ might not be positive semidefinite (conditions for $\hat{\boldsymbol{\Sigma}}$ to be positive semidefinite are given in [4]).
4 Estimation error
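As a sanity check before the analysis, the estimator in (2) is easy to simulate. The sketch below (our own illustration; the variable names and dimensions are ours, not from the paper) verifies empirically that the entrywise reweighting by the Hadamard inverse of $\bar{\mathbf{P}}$ removes the subsampling bias, while the raw second moment matrix of the masked data remains biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000

# Ground-truth covariance: a random positive semidefinite matrix.
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n

# Known, non-uniform observation probabilities p_i.
p = np.array([0.3, 0.5, 0.7, 0.9, 1.0])

# Pbar has entries p_i * p_j off the diagonal and p_i on the diagonal.
Pbar = np.outer(p, p)
np.fill_diagonal(Pbar, p)

# Draw N samples x ~ N(0, Sigma) and mask each entry with Bernoulli(p_i).
x = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
delta = rng.random((N, n)) < p   # Bernoulli masks, one per entry
y = x * delta                    # observed vectors y = delta ∘ x

S_raw = y.T @ y / N              # biased: entries shrunk by Pbar
S_hat = S_raw / Pbar             # unbiased estimator from (2)

print(np.abs(S_hat - Sigma).max())   # small
print(np.abs(S_raw - Sigma).max())   # noticeably larger
```

The same Hadamard reweighting extends directly to any masking probabilities, as long as all $p_i$ are strictly positive.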
In this section we present an error analysis of the covariance matrix estimator $\hat{\boldsymbol{\Sigma}}$ from (2) when $\mathbf{x}$ has subGaussian entries. SubGaussian random variables include Gaussian, Bernoulli, and bounded random variables. For more information see [18] and references therein.
Definition 1 ([18]).
If $\mathbb{E}[\exp(x^2/t^2)] \leq 2$ holds for some $t > 0$, we say $x$ is subGaussian. If $\mathbb{E}[\exp(|x|/t)] \leq 2$ holds for some $t > 0$, we say $x$ is subexponential.
Definition 2 ([18]).
The subGaussian and subexponential norms are defined as
$\|x\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[\exp(x^2/t^2)] \leq 2\} \quad \text{and} \quad \|x\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[\exp(|x|/t)] \leq 2\},$
for subGaussian $x$ and subexponential $x$ respectively.
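To make Definition 2 concrete: for a standard Gaussian $x$, $\mathbb{E}[\exp(x^2/t^2)] = (1 - 2/t^2)^{-1/2}$ whenever $t^2 > 2$, so $\|x\|_{\psi_2}$ solves $(1 - 2/t^2)^{-1/2} = 2$, giving $\|x\|_{\psi_2} = \sqrt{8/3}$. The short sketch below (ours, not from the paper) recovers this by bisection on $t$:

```python
import math

def psi2_norm_std_gaussian():
    """Solve E[exp(x^2/t^2)] = 2 for x ~ N(0, 1) by bisection.

    For t^2 > 2 the expectation has the closed form (1 - 2/t^2)**(-1/2),
    which is strictly decreasing in t."""
    def m(t):
        return (1.0 - 2.0 / t**2) ** -0.5

    lo, hi = math.sqrt(2.0) + 1e-9, 10.0   # m(lo) -> inf, m(hi) < 2
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if m(mid) > 2.0:
            lo = mid    # norm is above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(psi2_norm_std_gaussian())   # sqrt(8/3) ≈ 1.633
```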
SubGaussian and subexponential random variables, and their norms, are related as follows.
Proposition 1 ([18]).
If $x$ and $y$ are subGaussian, then $xy$ and $x^2$ are subexponential, with norms satisfying $\|xy\|_{\psi_1} \leq \|x\|_{\psi_2}\|y\|_{\psi_2}$ and $\|x^2\|_{\psi_1} = \|x\|_{\psi_2}^2$.
We have the following characterization of products of subGaussian and Bernoulli random variables.
Lemma 1.
Let $u = \delta x$ and $v = \epsilon y$ be products of Bernoulli and subGaussian random variables, with Bernoulli parameters $p$ and $q$ respectively, where the only dependent variables are $x$ and $y$. Then:

$u$ is subGaussian and $\|u\|_{\psi_2} \leq \|x\|_{\psi_2}$.

$uv$, $u^2$, $uy$ and $xv$ are subexponential, with norms satisfying $\|s\|_{\psi_1} \leq \|x\|_{\psi_2}\|y\|_{\psi_2}$ for $s \in \{uv, uy, xv\}$, and $\|u^2\|_{\psi_1} \leq \|x\|_{\psi_2}^2$.
The proof of Lemma 1 can be easily obtained from the definitions of the subGaussian and subexponential norms. We omit it for space considerations. We also define the matrix $\mathbf{R}$, with entries given by
$R_{ij} = \frac{\|x_i x_j\|_{\psi_1}}{p_i p_j} \ \text{for } i \neq j, \qquad R_{ii} = \frac{\|x_i\|_{\psi_2}^2}{p_i}.$
Now we state our main result, whose proof appears in Section 6.
Theorem 1.
Let $\mathbf{x}$ be a zero mean random vector in $\mathbb{R}^n$ with subGaussian entries of norm $K_i = \|x_i\|_{\psi_2}$. Let $\mathbf{y} = \boldsymbol{\delta} \circ \mathbf{x}$, where each $\delta_i$ is a Bernoulli random variable with parameter $p_i$, independent of each other and of $\mathbf{x}$. Given $N$ i.i.d. realizations $\{\mathbf{y}^{(k)}\}_{k=1}^N$, the estimator $\hat{\boldsymbol{\Sigma}}$ from (2) satisfies
$\|\hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma}\|_F \leq C \|\mathbf{R}\|_F \sqrt{\frac{\log(2n^2/\delta)}{N}}$
with probability at least $1 - \delta$, where $C$ is a universal constant. Moreover, if $K_i^2 \leq \kappa\, \Sigma_{ii}$ for all $i$ and $N \geq \log(2n^2/\delta)$, then
(3) $\|\hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma}\|_F \leq \frac{C \kappa}{\alpha^2}\, r(\boldsymbol{\Sigma})\, \|\boldsymbol{\Sigma}\|_2 \sqrt{\frac{\log(2n^2/\delta)}{N}},$
where $\alpha = \min_i p_i$, and $r(\boldsymbol{\Sigma}) = \mathrm{tr}(\boldsymbol{\Sigma})/\|\boldsymbol{\Sigma}\|_2$ is the effective rank.
Theorem 1 shows that the estimation error vanishes in probability as the number of samples $N$ increases. More importantly, our result reveals that the sampling probabilities are closely related to the subGaussian and subexponential norms of the variables through the matrix $\mathbf{R}$. The bound from (3) suggests that distributions with smaller effective rank [3, 18] can tolerate a more aggressive subsampling factor (smaller $\alpha$). This is not surprising, since the effective rank is upper bounded by the actual rank, and can be significantly smaller for distributions whose energy concentrates in a few principal components. We also note that the ratio $\mathrm{tr}(\boldsymbol{\Sigma})/\|\boldsymbol{\Sigma}\|_2$ appears in the bounds of [3] as well.
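The effective rank appearing in (3) is cheap to compute and can be far below the ambient dimension. A small sketch (ours, with an illustrative toy spectrum) for a covariance with geometrically decaying eigenvalues:

```python
import numpy as np

def effective_rank(Sigma):
    """Effective rank r(Sigma) = trace(Sigma) / spectral norm of Sigma."""
    eigvals = np.linalg.eigvalsh(Sigma)
    return eigvals.sum() / eigvals.max()

n = 100
# Diagonal covariance with geometrically decaying eigenvalues.
Sigma = np.diag(0.5 ** np.arange(n))

# trace -> sum of 0.5^k, which approaches 2; top eigenvalue is 1.
print(effective_rank(Sigma))   # ≈ 2, even though rank(Sigma) = 100
```

Energy concentrated in a few principal components keeps the trace close to the top eigenvalue, which is exactly the regime where (3) tolerates small $\alpha$.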
5 Active covariance estimation
In this section we consider a scenario where we cannot observe (on average) more than a budget of $n_0$ variables at a time, but we have the freedom to choose the subsampling probability distribution.
5.1 Subsampling distribution for true covariance matrix
Based on the error bound from Theorem 1, we propose designing the subsampling distribution by approximately minimizing $\|\mathbf{R}\|_F$. Theorem 1 suggests that $p_i$ should be larger whenever the subGaussian norm of $x_i$ is large, but also that the product $p_i p_j$ should be large when the subexponential norm of $x_i x_j$ is large. We assume that $\|x_i\|_{\psi_2}^2 \propto \Sigma_{ii}$. The bounds for $R_{ij}$ and $R_{ii}$ from (6) and (7) respectively suggest the approximation $\|\mathbf{R}\|_F^2 \approx \left(\sum_i \Sigma_{ii}/p_i^2\right)^2$. Given a sampling budget $n_0$, we estimate the subsampling probability vector by solving the following scaled projection problem
(4) $\hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \sum_{i=1}^n \frac{\Sigma_{ii}}{p_i^2} \quad \text{s.t.} \quad \sum_{i=1}^n p_i = n_0, \quad 0 < p_i \leq 1.$
Ignoring the constraint $p_i \leq 1$, the solution is $p_i \propto \Sigma_{ii}^{1/3}$, scaled to meet the budget.
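A sketch of how such a design could be implemented (our illustration of the idea; the exact objective and solver may differ from the paper's): scale the closed form $p_i \propto \Sigma_{ii}^{1/3}$ to meet the budget, handling the cap $p_i \leq 1$ by bisection on the scale factor.

```python
import numpy as np

def design_probabilities(diag_sigma, budget, iters=100):
    """Approximately solve: min sum_i diag_sigma[i] / p_i**2
    subject to sum(p) = budget and 0 < p_i <= 1.

    Assumes a strictly positive diagonal. Without the cap p_i <= 1,
    the Lagrangian solution is p_i proportional to diag_sigma[i]**(1/3);
    bisection finds the scale s with sum(min(s * w, 1)) = budget."""
    w = np.asarray(diag_sigma, dtype=float) ** (1.0 / 3.0)
    lo, hi = 0.0, 1.0 / w.min()   # at s = hi, every p_i is capped at 1
    for _ in range(iters):
        s = 0.5 * (lo + hi)
        if np.minimum(s * w, 1.0).sum() < budget:
            lo = s
        else:
            hi = s
    return np.minimum(0.5 * (lo + hi) * w, 1.0)

# Higher-variance coordinates get sampled more often, capped at 1.
p = design_probabilities([1.0, 8.0, 27.0], budget=2.0)
print(p)
```

With diagonal entries $(1, 8, 27)$ the cube-root weights are $(1, 2, 3)$, so the budget of 2 splits as $(1/3, 2/3, 1)$: the largest-variance coordinate is always observed.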
5.2 Subsampling distribution for empirical covariance matrix
Since the true covariance matrix is unknown, (4) does not lead to a practical estimator. Instead, we consider a batch based algorithm that, given a budget $n_0$ and a starting subsampling distribution, iteratively refines the subsampling probability distribution as a function of previous observations. We show the pseudo code for this procedure in Algorithm 1, which can be summarized in the following steps: observation with variable subsampling, covariance estimation, and subsampling distribution update. At the $t$th iteration, a batch of i.i.d. realizations is observed according to (1) with subsampling probabilities $\mathbf{p}^{(t)}$. The covariance estimator $\hat{\boldsymbol{\Sigma}}^{(t)}$ is a convex combination of the estimator at the previous iteration and the estimator for the current batch. Finally, the new covariance matrix estimator is used to update the subsampling probabilities as
(5) $\mathbf{p}^{(t+1)} = \arg\min_{\sum_i p_i = n_0,\ 0 < p_i \leq 1} \sum_{i=1}^n \frac{\hat{\Sigma}_{ii}^{(t)}}{p_i^2}.$
Proposition 2.
Algorithm 1 produces an unbiased estimator for $\boldsymbol{\Sigma}$.
Proof.
We will proceed by induction. For the first iteration we use uniform sampling, with probabilities independent of the data, thus $\mathbb{E}[\hat{\boldsymbol{\Sigma}}^{(1)}] = \boldsymbol{\Sigma}$. Now assume $\mathbb{E}[\hat{\boldsymbol{\Sigma}}^{(t-1)}] = \boldsymbol{\Sigma}$. At the $t$th iteration, the probabilities $\mathbf{p}^{(t)}$ depend only on past batches and are independent of the new data, so the estimator for the current batch satisfies $\mathbb{E}[\hat{\boldsymbol{\Sigma}}^{(t)}_{\mathrm{batch}} \mid \mathbf{p}^{(t)}] = \boldsymbol{\Sigma}$, which implies that the convex combination satisfies $\mathbb{E}[\hat{\boldsymbol{\Sigma}}^{(t)}] = \boldsymbol{\Sigma}$. ∎
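The batch procedure can be sketched as follows (our reconstruction for illustration only: the batch size, the uniform-average convex combination, and the cube-root probability update with clipping are our assumptions, not necessarily the paper's exact Algorithm 1):

```python
import numpy as np

def batch_estimate(y, p):
    """Unbiased covariance estimate (2) from one batch of masked samples."""
    Pbar = np.outer(p, p)
    np.fill_diagonal(Pbar, p)
    return (y.T @ y / y.shape[0]) / Pbar

def active_covariance(sample_batch, n, budget, num_batches, batch_size, rng):
    """Observe with variable subsampling, estimate, update probabilities."""
    p = np.full(n, budget / n)                # start from uniform sampling
    S_hat = np.zeros((n, n))
    for t in range(1, num_batches + 1):
        x = sample_batch(batch_size)          # new i.i.d. data
        delta = rng.random((batch_size, n)) < p
        S_t = batch_estimate(x * delta, p)
        S_hat = ((t - 1) * S_hat + S_t) / t   # convex combination: stays unbiased
        # Update p from the current diagonal estimate: p_i ~ diag^(1/3),
        # rescaled to the budget and clipped away from 0 and above 1.
        w = np.maximum(np.diag(S_hat), 1e-12) ** (1.0 / 3.0)
        p = np.clip(budget * w / w.sum(), 0.05, 1.0)
    return S_hat, p

rng = np.random.default_rng(1)
Sigma = np.diag([0.5, 1.0, 2.0, 4.0])
draw = lambda m: rng.multivariate_normal(np.zeros(4), Sigma, size=m)
S_hat, p = active_covariance(draw, 4, budget=2.0, num_batches=20,
                             batch_size=2000, rng=rng)
print(np.round(np.diag(S_hat), 2))   # close to [0.5, 1.0, 2.0, 4.0]
print(p)                             # higher-variance coordinates sampled more
```

Because $\mathbf{p}^{(t)}$ is computed before the new batch is drawn, each batch estimate is conditionally unbiased, and the running average inherits unbiasedness as in Proposition 2.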
5.3 Numerical evaluation
In this section we evaluate our proposed method using the MNIST [19] dataset, which consists of images of scanned digits from 0 to 9. In our experiments we consider images of a single digit class. The vectorized, mean removed images are denoted by $\mathbf{w}$, with covariance matrix $\boldsymbol{\Sigma}_w$. We consider estimation of the covariance matrix of $\mathbf{x} = \mathbf{w} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is zero mean Gaussian noise with variance $\sigma^2$ (equal to one unless otherwise stated), therefore $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_w + \sigma^2 \mathbf{I}$. To draw i.i.d. realizations of $\mathbf{x}$, we sample images without replacement and add Gaussian noise with variance $\sigma^2$. The effective rank of $\boldsymbol{\Sigma}$ is controlled by the parameter $\sigma^2$ and satisfies
$r(\boldsymbol{\Sigma}) = \frac{\mathrm{tr}(\boldsymbol{\Sigma}_w) + n\sigma^2}{\|\boldsymbol{\Sigma}_w\|_2 + \sigma^2}.$
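The dependence of the effective rank on the noise variance can be checked directly; the sketch below (ours, using a toy rank-one signal covariance rather than MNIST) shows the effective rank growing as isotropic noise is added:

```python
import numpy as np

def effective_rank(Sigma):
    e = np.linalg.eigvalsh(Sigma)
    return e.sum() / e.max()

n = 50
u = np.ones((n, 1)) / np.sqrt(n)
Sigma_w = 10.0 * (u @ u.T)      # rank one: trace = spectral norm = 10

for sigma2 in (0.1, 1.0, 4.0):
    Sigma = Sigma_w + sigma2 * np.eye(n)
    # Closed form: (tr(Sigma_w) + n * sigma2) / (||Sigma_w||_2 + sigma2).
    print(sigma2, effective_rank(Sigma))
```

The numerator grows by $n\sigma^2$ while the denominator grows only by $\sigma^2$, so larger noise flattens the spectrum and raises the effective rank.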
We first compare uniform subsampling with non uniform subsampling for estimation of $\boldsymbol{\Sigma}$ with unit noise variance. We designed the non uniform subsampling distribution using (4) with the true covariance matrix. We report relative errors in Frobenius norm as a function of the number of samples in Figure 1(a). Each point in the plot is an average over independent trials. We observe that for the largest sampling budget the non uniform sampling distribution closely matches the performance of the estimator with full data. Performance decreases, for both uniform and non uniform sampling, as the budget decreases, and non uniform sampling always outperforms uniform sampling.
In Figure 1(b) we evaluate our active method from Algorithm 1 with a fixed batch size. We consider the same scenario as in Figure 1(a) for two of the sampling budgets. The proposed active covariance estimation method quickly learns the optimal subsampling distribution, is always better than uniform sampling, and matches the performance of the non uniform subsampling method obtained from the true covariance matrix. Finally, in Figure 1(c) we show the same experiment as in Figure 1(b), but now with a larger noise variance $\sigma^2$, which changes the effective rank. We observe that the problem becomes more difficult, since the estimation errors are larger. There is no major difference between uniform and non uniform sampling, thus the advantages of the proposed active covariance estimation method are limited. This might be due to various effects, including the relative magnitudes of the diagonal entries of $\boldsymbol{\Sigma}$, the sampling budget, the effective rank, and the probability update algorithm. Moreover, since the effective rank did not change much, this experiment suggests that the effective rank does not fully quantify the difficulty of the problem. A more precise method to update the probabilities might also help improve the performance of the active covariance estimation algorithm.
6 Proof of Theorem 1
We first need the following concentration bounds.
Lemma 2.
Under the same assumptions of Theorem 1, for any $\epsilon > 0$ we have
$\mathbb{P}\left(|\hat{\Sigma}_{ij} - \Sigma_{ij}| \geq \epsilon R_{ij}\right) \leq 2\exp\left(-cN\min(\epsilon^2, \epsilon)\right) \quad \text{and} \quad \mathbb{P}\left(|\hat{\Sigma}_{ii} - \Sigma_{ii}| \geq \epsilon R_{ii}\right) \leq 2\exp\left(-cN\min(\epsilon^2, \epsilon)\right),$
for offdiagonal and diagonal entries respectively.
We also need the following geometric result which we state without proof.
Lemma 3.
Let $t > 0$, and let $\mathbf{B}$ be entrywise nonnegative. Then, for all $\mathbf{A}$ and $\mathbf{B}$ that satisfy $|A_{ij}| \leq t B_{ij}$ for all $i, j$, we have that $\|\mathbf{A}\|_F \leq t \|\mathbf{B}\|_F$.
The proof of Theorem 1 starts by bounding the probability of the event $\{\|\hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma}\|_F \geq \epsilon \|\mathbf{R}\|_F\}$. Apply Lemma 3 with the entries of $\mathbf{R}$ and the union bound to get
$\mathbb{P}\left(\|\hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma}\|_F \geq \epsilon \|\mathbf{R}\|_F\right) \leq \sum_{i,j} \mathbb{P}\left(|\hat{\Sigma}_{ij} - \Sigma_{ij}| \geq \epsilon R_{ij}\right) \leq 2n^2 \exp\left(-cN\min(\epsilon^2, \epsilon)\right).$
The last inequality follows from Lemma 2, with constants appropriately chosen so the inequalities hold for all pairs $(i, j)$. The proof can be finished by equating the right hand side to $\delta$, solving for $\epsilon$, and doing some manipulations. To derive the bound from (3) we bound the entries of $\mathbf{R}$, obtaining
(6) $R_{ij} \leq \frac{K_i K_j}{p_i p_j} \leq \frac{\kappa \sqrt{\Sigma_{ii}\Sigma_{jj}}}{p_i p_j},$
(7) $R_{ii} \leq \frac{\kappa\, \Sigma_{ii}}{p_i},$
where $K_i = \|x_i\|_{\psi_2}$ and $\kappa$ is a constant such that $K_i^2 \leq \kappa\, \Sigma_{ii}$ for all $i$. Then, applying (6) and (7) followed by the triangle inequality for the Frobenius norm, we have
$\|\mathbf{R}\|_F \leq \frac{\kappa}{\alpha^2} \sqrt{\sum_{i,j} \Sigma_{ii}\Sigma_{jj}} = \frac{\kappa}{\alpha^2}\, \mathrm{tr}(\boldsymbol{\Sigma}) = \frac{\kappa}{\alpha^2}\, r(\boldsymbol{\Sigma})\, \|\boldsymbol{\Sigma}\|_2.$
The last two equalities use the identity $\sum_{i,j} \Sigma_{ii}\Sigma_{jj} = \mathrm{tr}(\boldsymbol{\Sigma})^2$ and the definition of effective rank.
7 Conclusion
We studied covariance matrix estimation when the variables are subsampled through multiplication by Bernoulli variables. Variations of this model have traditionally been considered in the analysis of missing data. We analyzed an unbiased estimator for the covariance matrix and derived a novel estimation error bound in Frobenius norm. Our bound illustrates the subtle relations between the covariance matrix parameters and the subsampling distribution. Using this bound, we proposed an active covariance matrix estimation algorithm that also produces an unbiased estimator. We showed with numerical experiments that the proposed active covariance estimation algorithm outperforms uniform subsampling, and closely matches non uniform subsampling with complete knowledge of the true covariance matrix.
References

[1] Muhammad Tayyab Asif, Nikola Mitrovic, Justin Dauwels, and Patrick Jaillet, "Matrix and tensor based methods for missing data estimation in large traffic networks," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 1816–1825, 2016.
[2] Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, and Wei Hong, "Model-driven data acquisition in sensor networks," in Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. VLDB Endowment, 2004, pp. 588–599.
[3] Karim Lounici, "High-dimensional covariance matrix estimation with missing observations," Bernoulli, vol. 20, no. 3, pp. 1029–1058, 2014.
[4] Kamil Jurczak and Angelika Rohde, "Spectral analysis of high-dimensional sample covariance matrices with missing observations," Bernoulli, vol. 23, no. 4A, pp. 2466–2532, 2017.
[5] T. Tony Cai and Anru Zhang, "Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data," Journal of Multivariate Analysis, vol. 150, pp. 55–74, 2016.
[6] Martin Azizyan, Akshay Krishnamurthy, and Aarti Singh, "Extreme compressive sampling for covariance estimation," arXiv preprint arXiv:1506.00898, 2015.
[7] Farhad Pourkamali-Anaraki, "Estimation of the sample covariance matrix from compressive measurements," IET Signal Processing, vol. 10, no. 9, pp. 1089–1095, 2016.
[8] Farhad P. Anaraki and Shannon Hughes, "Memory and computation efficient PCA via very sparse random projections," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1341–1349.
[9] Mladen Kolar and Eric P. Xing, "Consistent covariance selection from data with missing values," in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, pp. 551–558.
[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[11] Gautam Dasarathy, Aarti Singh, Maria-Florina Balcan, and Jong H. Park, "Active learning algorithms for graphical model selection," in Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1356–1364.
[12] Divyanshu Vats, Robert Nowak, and Richard Baraniuk, "Active learning for undirected graphical model selection," in Artificial Intelligence and Statistics (AISTATS), 2014, pp. 958–967.
[13] Jonathan Scarlett and Volkan Cevher, "Lower bounds on active learning for graphical model selection," in The 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[14] Siheng Chen, Rohan Varma, Aarti Singh, and Jelena Kovačević, "Signal recovery on graphs: Fundamental limits of sampling strategies," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 539–554, 2016.
[15] Gilles Puy, Nicolas Tremblay, Rémi Gribonval, and Pierre Vandergheynst, "Random sampling of bandlimited signals on graphs," Applied and Computational Harmonic Analysis, 2016.
[16] Po-Ling Loh and Martin J. Wainwright, "High dimensional regression with noisy and missing data: provable guarantees with nonconvexity," The Annals of Statistics, vol. 40, no. 3, pp. 1637–1664, 2012.
[17] Daniel Romero, Dyonisius Dony Ariananda, Zhi Tian, and Geert Leus, "Compressive covariance sensing: Structure-based compressive sensing beyond sparsity," IEEE Signal Processing Magazine, vol. 33, no. 1, pp. 78–93, 2016.
[18] Roman Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge University Press, 2018.
[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.