We study estimation of the covariance matrix of a random vector $\mathbf{x}$ from observations of the form $\mathbf{y} = \boldsymbol{\delta} \circ \mathbf{x}$, where each component $x_i$ is multiplied by a Bernoulli random variable $\delta_i$. We assume the probabilities $p_i = \mathbb{P}(\delta_i = 1)$ are known, and that the $\delta_i$ are independent of each other and of $\mathbf{x}$. Partial observations consistent with this model may arise when physical limitations affect the observation process (so that data is missing) or when the cost of observation must be reduced by selecting what to sample (i.e., active learning).
Applications where probabilistic models need to be constructed from observations with missing data include transportation networks [1] and sensor networks [2]. For example, different sensors in the network might have different capabilities, e.g., different reliability or number of samples per hour that can be acquired, and these can be captured by associating a different sub-sampling probability with each sensor.
For covariance estimation, or more generally graphical model estimation, active learning approaches have been considered where variables are observed in a sequential and adaptive manner to identify the graphical model with a minimal number of observations. Active learning approaches for covariance estimation can be useful in distributed computation environments such as sensor networks, where there are acquisition, processing or communication constraints, and a need for optimal resource allocation.
One critical difference between missing data and active covariance estimation scenarios is the control over the observation model. In the missing data case, the sub-sampling probabilities are a problem feature that is given or has to be estimated from data, while in active learning problems, the sampling probabilities are a design parameter. The results from this paper can be useful for analysis in the missing data case, as well as for designing active learning algorithms, as we will discuss in more detail in Section 5.
In this work we analyze an unbiased covariance matrix estimator under sub-Gaussian assumptions on $\mathbf{x}$. Our main result is an error bound in Frobenius norm that reveals the relation between the number of observations, the sub-sampling probabilities and the entries of the true covariance matrix. We apply this error bound to the design of sub-sampling probabilities in an active covariance estimation scenario. An interesting conclusion from this work is that when the covariance matrix is approximately low rank, an active covariance estimation approach can perform almost as well as an estimator with complete observations. The paper is organized as follows: in Section 2 we review related work; Sections 3 and 4 introduce the problem and the main results, respectively; in Section 5 we show an application of our bounds to the optimal design of sampling probabilities for a batch-based active covariance estimation algorithm; proofs are presented in Section 6 and conclusions in Section 7.
2 Related work
Lounici [3] studies covariance matrix estimation from missing data when all variables are observed with the same probability. The focus of [3] is on estimation of approximately low rank covariance matrices with a matrix version of the lasso algorithm. In [4] the empirical covariance matrix under missing data is shown to be indefinite. Most recently, Cai and Zhang [5] study the missing completely at random (MCAR) model, which assumes arbitrary and unknown missing data probabilities. They show that a modified sample covariance matrix estimator, similar to ours but using empirical estimates of the probabilities $p_i$, achieves optimal minimax rates in spectral norm for bandable and sparse covariance matrices. Compared to [3, 4, 5], we allow all probabilities to be different and known, and we only study the unbiased sample covariance estimator. Moreover, all of the aforementioned papers consider estimation errors in spectral norm, while we consider errors in Frobenius norm, allowing us to derive error bounds with precise dependences between the entries of the true covariance matrix and the sampling probabilities $p_i$. These bounds can then be applied to design the sampling distribution in an adaptive manner.
For distributed computing applications, the work of [6] studies covariance estimation from random subspace projections using dense matrices, which are generalized in [7] to sparse projection matrices. The same approach is used in [8] for memory- and complexity-reduced PCA. In all these works [6, 7, 8], each measurement is a (possibly sparse) linear combination of a few variables, and even though the designs are random, they have a fixed distribution for all observations.
The work of [9] considers the case when all $p_i$ are unknown, and uses an unbiased covariance matrix estimator as an input to the graphical lasso algorithm [10] for inverse covariance matrix estimation. Other interesting active learning approaches for graphical model selection are [11], [12] and [13]. Vats et al. [12] use a method that combines sampling marginals with conditional independence testing to learn the graphical model structure. [11, 13] consider a more general family of algorithms based on sampling high degree vertices, and prove upper and lower performance bounds.
Random sub-sampling and reconstruction of signals have been studied within graph signal processing [14, 15] and statistics [16]. The Bernoulli observation model we use has been studied in [14, 15] as a sampling strategy for graph signals, where sampling probability designs are proposed for the reconstruction of deterministic band-limited signals. Also, Romero et al. [17] derive covariance matrix estimators assuming the target covariance matrix is a linear combination of known covariance matrices, which leads to algorithms and theoretical analysis that are fundamentally different from ours.
3 Problem Formulation
3.1 Notation

We denote scalars using regular font, and use bold for vectors and matrices, e.g., $x$, $\mathbf{x}$ and $\mathbf{X}$. The Hadamard (entry-wise) product between matrices is defined as $(\mathbf{A} \circ \mathbf{B})_{ij} = A_{ij} B_{ij}$. We use $\|\cdot\|_p$ for entry-wise matrix norms, with $p = 2$ corresponding to the Frobenius norm. $\|\cdot\|$ denotes the $\ell_2$ norm or the spectral norm when applied to vectors or matrices, respectively.
3.2 Unbiased estimation
Consider a random vector $\mathbf{x}$ taking values in $\mathbb{R}^n$. We observe

$\mathbf{y} = \boldsymbol{\delta} \circ \mathbf{x}$,   (1)

where $\boldsymbol{\delta} \in \{0,1\}^n$ is a vector of Bernoulli random variables. The probability of observing the $i$-th variable is given by $p_i = \mathbb{P}(\delta_i = 1)$. The vector of probabilities is denoted by $\mathbf{p}$, and $\mathbf{P} = \operatorname{diag}(\mathbf{p})$ is a diagonal matrix. The average number of samples per observation corresponds to $\sum_{i=1}^n p_i$. Given $m$ i.i.d. realizations of $\mathbf{y}$, the $i$-th variable will be sampled on average $m p_i$ times; if all $p_i = 1$, then $\mathbf{y} = \mathbf{x}$, and we have perfect observation of $\mathbf{x}$. We are interested in studying covariance estimation for $\mathbf{x}$ when $p_i < 1$ for some or all $i$. Let $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ be the mean and covariance of $\mathbf{x}$; then

$\mathbb{E}[\mathbf{y}\mathbf{y}^\top] = \mathbf{S} \circ \mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \mathbf{S} \circ (\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^\top),$

where $\mathbf{S}$ is defined as $S_{ij} = p_i p_j$ for $i \neq j$ and $S_{ii} = p_i$.
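To make the observation model concrete, the following sketch (with hypothetical dimensions, probabilities and a Gaussian $\mathbf{x}$ of our own choosing) simulates (1) and checks numerically that the second moment of $\mathbf{y}$ equals $\mathbf{S}$ Hadamard-multiplied with the second moment of $\mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 200_000

# Hypothetical sampling probabilities and a zero-mean Gaussian x.
p = np.array([0.9, 0.7, 0.5, 0.3])
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n                      # true covariance of x
L = np.linalg.cholesky(Sigma)

x = rng.standard_normal((m, n)) @ L.T    # i.i.d. samples of x
delta = rng.random((m, n)) < p           # Bernoulli masks, P(delta_i = 1) = p_i
y = delta * x                            # observed vectors y = delta ∘ x

# S_ij = p_i p_j off the diagonal, S_ii = p_i on the diagonal.
S = np.outer(p, p)
np.fill_diagonal(S, p)

emp = y.T @ y / m                        # empirical E[y y^T]
assert np.allclose(emp, S * Sigma, atol=5e-2)
```

The check confirms that masking scales off-diagonal second moments by $p_i p_j$ but diagonal ones only by $p_i$, which is why $\mathbf{S}$ is not simply a rank-one matrix.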
For the rest of the paper we will assume $\boldsymbol{\mu} = \mathbf{0}$. Given a set $\{\mathbf{y}_k\}_{k=1}^m$ of i.i.d. samples of the random vector $\mathbf{y}$, define

$\hat{\boldsymbol{\Sigma}} = \mathbf{S}^{\circ -1} \circ \left( \frac{1}{m} \sum_{k=1}^{m} \mathbf{y}_k \mathbf{y}_k^\top \right),$   (2)

where $\mathbf{S}^{\circ -1}$ is the Hadamard (entry-wise) inverse of $\mathbf{S}$. A simple calculation shows that $\hat{\boldsymbol{\Sigma}}$ is an unbiased estimator for $\boldsymbol{\Sigma}$. Indeed, $\mathbb{E}[\hat{\boldsymbol{\Sigma}}] = \mathbf{S}^{\circ -1} \circ \mathbb{E}[\mathbf{y}\mathbf{y}^\top] = \mathbf{S}^{\circ -1} \circ \mathbf{S} \circ \boldsymbol{\Sigma} = \boldsymbol{\Sigma}$. Because some $p_i < 1$, the matrix $\hat{\boldsymbol{\Sigma}}$ might not be positive semi-definite (conditions for $\hat{\boldsymbol{\Sigma}}$ to be positive semi-definite are given in [4]).
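A minimal sketch of the estimator in (2), assuming zero-mean data and known probabilities; the helper name `masked_cov_estimator` and the test covariance are our own illustrative choices:

```python
import numpy as np

def masked_cov_estimator(Y, p):
    """Unbiased covariance estimator from Bernoulli-masked samples.

    Y : (m, n) array of observations y_k = delta_k ∘ x_k (zero-mean x).
    p : (n,) known sampling probabilities.
    Returns the Hadamard inverse of S applied to (1/m) Σ_k y_k y_k^T.
    """
    m = Y.shape[0]
    S = np.outer(p, p)
    np.fill_diagonal(S, p)              # S_ii = p_i, S_ij = p_i p_j
    return (Y.T @ Y / m) / S            # entry-wise division = Hadamard inverse

rng = np.random.default_rng(1)
n, m = 3, 500_000
p = np.array([0.8, 0.5, 0.4])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 0.7]])
X = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
Y = (rng.random((m, n)) < p) * X        # Bernoulli-masked observations
Sigma_hat = masked_cov_estimator(Y, p)
assert np.linalg.norm(Sigma_hat - Sigma) < 0.05
```

Note that the entry-wise division by $\mathbf{S}$ inflates the variance of entries with small $p_i p_j$, which is exactly the effect quantified by the bounds in Section 4.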
4 Estimation error
In this section we present an error analysis of the covariance matrix estimator $\hat{\boldsymbol{\Sigma}}$ from (2) when $\mathbf{x}$ has sub-Gaussian entries. Sub-Gaussian random variables include Gaussian, Bernoulli and bounded random variables. For more information see [18] and the references therein.
Definition 1 ([18]).
If $\mathbb{E}[\exp(x^2/K_1^2)] \leq 2$ holds for some $K_1 > 0$, we say $x$ is sub-Gaussian. If $\mathbb{E}[\exp(|x|/K_2)] \leq 2$ holds for some $K_2 > 0$, we say $x$ is sub-exponential.
Definition 2 ([18]).
The sub-Gaussian and sub-exponential norms are defined as

$\|x\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[\exp(x^2/t^2)] \leq 2\}$ and $\|x\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[\exp(|x|/t)] \leq 2\}$

for sub-Gaussian and sub-exponential random variables, respectively.
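The norm in Definition 2 can be approximated from samples by bisecting on the defining inequality; the following Monte Carlo sketch (function name and tolerances are our own) illustrates this for the sub-Gaussian norm:

```python
import numpy as np

def empirical_psi2_norm(samples, tol=1e-4):
    """Monte Carlo estimate of the sub-Gaussian norm: the smallest t with
    E[exp(x^2 / t^2)] <= 2, found by bisection on the sample mean."""
    lo, hi = tol, 10.0 * np.std(samples) + tol
    while np.mean(np.exp(samples**2 / hi**2)) > 2.0:
        hi *= 2.0                      # grow until the inequality holds
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(np.exp(samples**2 / mid**2)) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(2)
x = rng.standard_normal(200_000)
K = empirical_psi2_norm(x)
# For a standard Gaussian, E[exp(x^2/t^2)] = (1 - 2/t^2)^{-1/2} = 2 gives
# t^2 = 8/3, so the psi_2 norm is sqrt(8/3) ≈ 1.63 under this definition.
assert 1.5 < K < 1.8
```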
Sub-Gaussian and sub-exponential random variables, and their norms, are related as follows.
Proposition 1 ([18]).
If $x$ and $y$ are sub-Gaussian, then $xy$ and $x^2$ are sub-exponential with norms satisfying $\|xy\|_{\psi_1} \leq \|x\|_{\psi_2}\|y\|_{\psi_2}$, and $\|x^2\|_{\psi_1} = \|x\|_{\psi_2}^2$.
We have the following characterization of products of sub-Gaussian and Bernoulli random variables.

Lemma 1. Let $u = \delta x$ and $v = \epsilon y$ be products of Bernoulli and sub-Gaussian random variables, with Bernoulli parameters $p$ and $q$ respectively, where the only dependent variables are $x$ and $y$. Then:

1. $u$ is sub-Gaussian and $\|u\|_{\psi_2} \leq \|x\|_{\psi_2}$.

2. $uv$, $uy$, $xv$ and $xy$ are sub-exponential with norms satisfying $\|w\|_{\psi_1} \leq \|x\|_{\psi_2}\|y\|_{\psi_2}$ for $w \in \{uv, uy, xv, xy\}$.
The proof of Lemma 1 follows easily from the definitions of the sub-Gaussian and sub-exponential norms; we omit it for space considerations. We also define the matrix $\mathbf{R}$, with entries given by
Now we state our main result, whose proof appears in Section 6.
Let $\mathbf{x}$ be a zero mean random vector in $\mathbb{R}^n$ with sub-Gaussian entries of finite $\psi_2$ norm. Let $\mathbf{y} = \boldsymbol{\delta} \circ \mathbf{x}$, where each $\delta_i$ is a Bernoulli random variable with parameter $p_i$, independent of the other entries of $\boldsymbol{\delta}$ and of $\mathbf{x}$. Given $m$ i.i.d. realizations $\{\mathbf{y}_k\}_{k=1}^m$, the estimator $\hat{\boldsymbol{\Sigma}}$ from (2) satisfies
with probability at least $1 - \epsilon$, where $c$ is a universal constant. Moreover, if $m$ is large enough and the probabilities $p_i$ are bounded away from zero, then
where $r(\boldsymbol{\Sigma}) = \operatorname{tr}(\boldsymbol{\Sigma}) / \|\boldsymbol{\Sigma}\|$ is the effective rank.
Theorem 1 shows that the estimation error vanishes in probability as the number of samples increases. More importantly, our result reveals that the sampling probabilities are closely related to the sub-exponential norms of the variables through the matrix $\mathbf{R}$. The bound from (3) suggests that distributions with smaller effective rank [3, 18] can tolerate a more aggressive sub-sampling factor (smaller $p_i$). This is not surprising, since the effective rank is upper bounded by the actual rank, and can be significantly smaller for distributions whose energy concentrates in a few principal components. We also note that the same ratio appears in the bounds of [3] as well.
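As an illustration of the effective rank (the example spectrum below is our own, not from the paper), a spiked covariance has $r(\boldsymbol{\Sigma})$ far below its actual rank:

```python
import numpy as np

def effective_rank(Sigma):
    """r(Sigma) = trace(Sigma) / ||Sigma||, with ||.|| the spectral norm."""
    return np.trace(Sigma) / np.linalg.norm(Sigma, 2)

# A covariance whose energy concentrates in one principal component has
# effective rank close to 1 even though its actual rank is n.
n = 50
U = np.linalg.qr(np.random.default_rng(3).standard_normal((n, n)))[0]
evals = np.array([10.0] + [0.1] * (n - 1))       # one dominant eigenvalue
Sigma = U @ np.diag(evals) @ U.T
r = effective_rank(Sigma)
assert r < np.linalg.matrix_rank(Sigma)          # r(Sigma) <= rank(Sigma)
assert abs(r - (10.0 + 0.1 * (n - 1)) / 10.0) < 1e-8   # r = tr/lambda_max = 1.49
```

Here the actual rank is 50 while the effective rank is 1.49, the regime where aggressive sub-sampling is tolerated.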
5 Active covariance estimation
In this section we consider a scenario where we cannot observe, on average, more than a fixed budget of variables at a time, but we have the freedom to choose the sub-sampling probability distribution.
5.1 Sub-sampling distribution for true covariance matrix
Based on the error bound from Theorem 1, we propose designing the sub-sampling distribution by approximately minimizing the bound. Theorem 1 suggests that $p_i$ should be larger whenever the sub-Gaussian norm of $x_i$ is large, and also that the product $p_i p_j$ should be large when the sub-exponential norm of $x_i x_j$ is large. The bounds for the off-diagonal and diagonal terms from (6) and (7), respectively, suggest approximating these norms by the corresponding entries of the covariance matrix. Given a sampling budget, we estimate the sub-sampling probability vector by solving the following scaled projection problem
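A budget-constrained design of this kind can be sketched as follows; the per-variable scores `w` and the bisection-based projection are our own illustrative choices, not necessarily the exact problem (4):

```python
import numpy as np

def budget_projection(w, b, tol=1e-10):
    """Scale nonnegative scores w to probabilities p_i = min(1, t * w_i) in
    (0, 1] with sum(p) = b, finding the multiplier t by bisection."""
    w = np.asarray(w, dtype=float)
    assert 0 < b <= len(w) and np.all(w > 0)
    lo, hi = 0.0, b / np.min(w) + 1.0   # at hi, sum(min(1, hi*w)) >= b
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        if np.minimum(1.0, t * w).sum() < b:
            lo = t                       # too few expected samples
        else:
            hi = t                       # budget met or exceeded
    return np.minimum(1.0, hi * w)

w = np.array([4.0, 1.0, 1.0, 0.25])      # hypothetical per-variable scores
p = budget_projection(w, b=2.0)          # expect p = [1, 4/9, 4/9, 1/9]
assert abs(p.sum() - 2.0) < 1e-6 and np.all(p <= 1.0)
```

Variables with large scores saturate at $p_i = 1$ and the remaining budget is spread proportionally, which mirrors the intuition that high-variance variables should be observed more often.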
5.2 Sub-sampling distribution for empirical covariance matrix
Since the true covariance matrix is unknown, (4) does not lead to a practical estimator. Instead, we consider a batch-based algorithm that, given a budget and a starting sub-sampling distribution, iteratively refines the sub-sampling probability distribution as a function of previous observations. We show the pseudo-code for this procedure in Algorithm 1, which can be summarized in the following steps: observation with variable sub-sampling, covariance estimation, and sub-sampling distribution update. At the $t$-th iteration, i.i.d. realizations are observed according to (1) with the current sub-sampling probabilities. The covariance estimator is a convex combination of the estimator at the previous iteration and the estimator for the current batch. Finally, the new covariance matrix estimator is used to update the sub-sampling probabilities as
Algorithm 1 produces an unbiased estimator for $\boldsymbol{\Sigma}$.
We proceed by induction. For the first iteration we use uniform sampling, so the first batch estimator is unbiased. Now assume the estimator at iteration $t-1$ is unbiased; then at the $t$-th iteration the batch estimator for the new data is also unbiased (the probabilities used to observe the batch are known when forming the estimator), which implies that their convex combination is unbiased as well. ∎
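The batch procedure described above can be sketched as follows, under simplifying assumptions of our own: the probability update is a crude stand-in for the paper's rule (scores from the diagonal of the current estimate), and the data are synthetic Gaussian rather than real observations:

```python
import numpy as np

rng = np.random.default_rng(4)
n, batch, iters, budget = 5, 20_000, 5, 3.0
Sigma = np.diag([4.0, 2.0, 1.0, 0.5, 0.25])     # hypothetical true covariance
L = np.linalg.cholesky(Sigma)

p = np.full(n, budget / n)                      # start from uniform sampling
Sigma_hat = np.zeros((n, n))
for t in range(1, iters + 1):
    X = rng.standard_normal((batch, n)) @ L.T
    Y = (rng.random((batch, n)) < p) * X        # observe with current p
    S = np.outer(p, p)
    np.fill_diagonal(S, p)
    batch_est = (Y.T @ Y / batch) / S           # unbiased batch estimator (2)
    Sigma_hat += (batch_est - Sigma_hat) / t    # running mean: a convex
                                                # combination, so still unbiased
    w = np.sqrt(np.abs(np.diag(Sigma_hat))) + 1e-12
    p = np.clip(budget * w / w.sum(), 1e-3, 1.0)  # crude probability update
assert np.linalg.norm(Sigma_hat - Sigma) < 0.5
```

After a few batches the probabilities concentrate on the high-variance coordinates while the running-mean estimator stays close to the true covariance.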
5.3 Numerical evaluation
In this section we evaluate our proposed method on the MNIST [19] dataset, which consists of images of scanned digits from 0 to 9. In our experiments we consider images of a single digit. The vectorized, mean-removed images are denoted by $\mathbf{x}_0$, with covariance matrix $\boldsymbol{\Sigma}_0$. We consider estimation of the covariance matrix of $\mathbf{x} = \mathbf{x}_0 + \sigma \mathbf{w}$, where $\mathbf{w}$ is zero mean Gaussian noise with unit variance; therefore $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_0 + \sigma^2 \mathbf{I}$.
To draw i.i.d. realizations of $\mathbf{x}$, we sample images without replacement and add Gaussian noise with variance $\sigma^2$. The effective rank of $\boldsymbol{\Sigma}$ is controlled by the parameter $\sigma$ and satisfies $r(\boldsymbol{\Sigma}) = (\operatorname{tr}(\boldsymbol{\Sigma}_0) + n\sigma^2) / (\|\boldsymbol{\Sigma}_0\| + \sigma^2)$.
We first compare uniform sub-sampling with non-uniform sub-sampling for estimation of $\boldsymbol{\Sigma}$. We design the non-uniform sub-sampling distribution using (4) with the true covariance matrix. We report relative errors in Frobenius norm as a function of the sampling budget in Figure 1(a). Each point in the plot is an average over independent trials. We observe that the non-uniform sampling distribution can closely match the performance of the estimator with full data. Performance decreases (for both uniform and non-uniform sampling) as the budget decreases, and non-uniform sampling always outperforms uniform sampling.
In Figure 1(b) we evaluate our active method from Algorithm 1. We consider the same scenario as in Figure 1(a). The proposed active covariance estimation method quickly learns the optimal sub-sampling distribution, is always better than uniform sampling, and matches the performance of the non-uniform sub-sampling method obtained from the true covariance matrix. Finally, we show in Figure 1(c) the same experiment as in Figure 1(b), but now with a covariance matrix obtained with a different noise parameter $\sigma$, which changes the effective rank. We observe that the problem becomes more difficult, since the estimation errors are larger. There is no major difference between uniform and non-uniform sampling, so the advantages of the proposed active covariance estimation method are limited in this regime. This might be due to various effects, including the relative magnitudes of the diagonal entries of $\boldsymbol{\Sigma}$, the sampling budget, the effective rank, and the probability update algorithm. Moreover, since the effective rank did not change much between the two experiments, this comparison suggests that the effective rank does not fully quantify the problem difficulty. A more precise method to update the probabilities might also help improve the performance of the active covariance estimation algorithm.
6 Proof of Theorem 1
We first need the following concentration bounds.
Under the same assumptions as in Theorem 1, for any $t > 0$ we have
for off-diagonal and diagonal entries respectively.
We also need the following geometric result, which we state without proof.
Let , and , then for all , and that satisfy , we have that .
The last inequality follows from Lemma 2 with the constants appropriately chosen so that the inequalities hold for all pairs $(i,j)$. We can further simplify by choosing
where the remaining constant is as above. The proof can be finished by equating the failure probability to the desired confidence level, solving for the error, and doing some manipulations. To derive the bound from (3) we bound the entries of $\mathbf{R}$, obtaining
The last step uses the fact that $\Sigma_{ii} \leq \|\boldsymbol{\Sigma}\|$ for all $i$, and the definition of the effective rank.
7 Conclusion

We studied covariance matrix estimation when the variables are sub-sampled via products with Bernoulli variables. Variations of this model have traditionally been considered in the analysis of missing data. We studied an unbiased estimator for the covariance matrix and derived a novel estimation error bound in entry-wise Frobenius norm. Our bound illustrates the subtle relations between the covariance matrix parameters and the sub-sampling distribution. Using this bound, we proposed an active covariance matrix estimation algorithm that also produces an unbiased estimator. Numerical experiments show that the proposed active covariance estimation algorithm outperforms uniform sub-sampling, and closely matches non-uniform sub-sampling with complete knowledge of the true covariance matrix.
References

[1] Muhammad Tayyab Asif, Nikola Mitrovic, Justin Dauwels, and Patrick Jaillet, "Matrix and tensor based methods for missing data estimation in large traffic networks," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 1816–1825, 2016.
[2] Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, and Wei Hong, "Model-driven data acquisition in sensor networks," in Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), 2004, pp. 588–599.
[3] Karim Lounici, "High-dimensional covariance matrix estimation with missing observations," Bernoulli, vol. 20, no. 3, pp. 1029–1058, 2014.
[4] Kamil Jurczak and Angelika Rohde, "Spectral analysis of high-dimensional sample covariance matrices with missing observations," Bernoulli, vol. 23, no. 4A, pp. 2466–2532, 2017.
[5] T. Tony Cai and Anru Zhang, "Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data," Journal of Multivariate Analysis, vol. 150, pp. 55–74, 2016.
[6] Martin Azizyan, Akshay Krishnamurthy, and Aarti Singh, "Extreme compressive sampling for covariance estimation," arXiv preprint arXiv:1506.00898, 2015.
[7] Farhad Pourkamali-Anaraki, "Estimation of the sample covariance matrix from compressive measurements," IET Signal Processing, vol. 10, no. 9, pp. 1089–1095, 2016.
[8] Farhad P. Anaraki and Shannon Hughes, "Memory and computation efficient PCA via very sparse random projections," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1341–1349.
[9] Mladen Kolar and Eric P. Xing, "Consistent covariance selection from data with missing values," in Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, pp. 551–558.
[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[11] Gautam Dasarathy, Aarti Singh, Maria-Florina Balcan, and Jong H. Park, "Active learning algorithms for graphical model selection," in Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1356–1364.
[12] Divyanshu Vats, Robert Nowak, and Richard Baraniuk, "Active learning for undirected graphical model selection," in Artificial Intelligence and Statistics (AISTATS), 2014, pp. 958–967.
[13] Jonathan Scarlett and Volkan Cevher, "Lower bounds on active learning for graphical model selection," in The 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[14] Siheng Chen, Rohan Varma, Aarti Singh, and Jelena Kovačević, "Signal recovery on graphs: Fundamental limits of sampling strategies," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 539–554, 2016.
[15] Gilles Puy, Nicolas Tremblay, Rémi Gribonval, and Pierre Vandergheynst, "Random sampling of bandlimited signals on graphs," Applied and Computational Harmonic Analysis, 2016.
[16] Po-Ling Loh and Martin J. Wainwright, "High dimensional regression with noisy and missing data: provable guarantees with non-convexity," The Annals of Statistics, vol. 40, no. 3, pp. 1637–1664, 2012.
[17] Daniel Romero, Dyonisius Dony Ariananda, Zhi Tian, and Geert Leus, "Compressive covariance sensing: Structure-based compressive sensing beyond sparsity," IEEE Signal Processing Magazine, vol. 33, no. 1, pp. 78–93, 2016.
[18] Roman Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge University Press, 2018.
[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.