1 Introduction
This paper studies online active learning for estimation of linear models. Active learning is motivated by the premise that in many sequential data collection scenarios, labeling or obtaining output from observations is costly. Thus ongoing decisions must be made about whether to collect data on a particular unit of observation. Active learning has a rich history; see, e.g., [8, 7, 6, 17, 3].
As a motivating example, suppose that an online marketing organization plans to send display advertising promotions to a new target market. Their goal is to estimate the revenue that can be expected for an individual with a given covariate vector. Unfortunately, providing the promotion and collecting data on each individual is costly. Thus the goal of the marketing organization is to acquire first the most “informative” observations. They must do this in an online fashion: opportunities to display the promotion to individuals arrive sequentially over time. In online active learning, this is achieved by selecting those observational units (target individuals in this case) that provide the most information to the model fitting procedure.
Linear models are ubiquitous in both theory and practice (and are often used even in settings where the data may exhibit strong nonlinearity), in large part because of their interpretability, flexibility, and simplicity. As a consequence, in practice, people tend to add a large number of features and interactions to the model, hoping to capture the right signal at the expense of introducing some noise. Moreover, the input space can be updated and extended iteratively after data collection if the decision maker feels predictions on a held-out set are not good enough. As a result, the number of covariates often exceeds the number of available observations. In those cases, selecting the subsequent most informative data is even more critical. Accordingly, our focus is on actively choosing observations for optimal prediction of the resulting high-dimensional linear models.
Our main contributions are as follows. We initially focus on standard linear models, and build the theory that we later extend to high-dimensional settings. First, we develop an algorithm that sequentially selects observations if they have sufficiently large norm, in an appropriate space (dependent on the data-generating distribution). Second, we provide a comprehensive theoretical analysis of our algorithm, including upper and lower bounds. We focus on minimizing mean squared prediction error (MSE), and show a high-probability upper bound on the MSE of our approach (cf. Theorem 3.1). In addition, we provide a lower bound on the best possible achievable performance in high probability and expectation (cf. Section 4). In some distributional settings of interest we show that this lower bound structurally matches our upper bound, suggesting our algorithm is near-optimal.
The results above show that the improvement of active learning progressively weakens as the dimension of the data grows, and a new approach is needed. To tackle our original goal and address this degradation, under standard sparsity assumptions, we design an adaptive extension of the thresholding algorithm that initially devotes some budget to learn the sparsity pattern of the model, in order to subsequently apply active learning to the relevant lower-dimensional subspace. We find that in this setting, the active learning algorithm provides significant benefit over passive random sampling. Theoretical guarantees are given in Theorem 3.3.
Finally, we empirically evaluate our algorithm’s performance. Our tests on real-world data show our approach is remarkably robust: the gain of active learning remains significant even in settings that fall outside our theory. Our results suggest that the threshold-based rule may be a valuable tool to leverage in observation-limited environments, even when the assumptions of our theory may not exactly hold.
Active learning has mainly been studied for classification; see, e.g., [1, 10, 2, 28, 9]. For regression, see, e.g., [18, 24, 5] and the references within. A closely related work to our setting is [23]: they study online or stream-based active learning for linear regression, with random design. They propose a theoretical algorithm that partitions the space by stratification based on Monte Carlo methods, where a recently proposed algorithm for linear regression [14] is used as a black box. It converges to the globally optimal oracle risk under possibly misspecified models (with suitable assumptions). Due to the relatively weak model assumptions, they achieve a constant gain over passive learning. As we adopt stronger assumptions (well-specified model), we are able to achieve larger than constant gains, with a computationally simpler algorithm. Suppose covariate vectors are Gaussian with dimension ; the total number of observations is ; and the algorithm is allowed to label at most of them. Then, we beat the standard MSE to obtain when , so active learning truly improves performance when or . While [23] does not tackle high-dimensional settings, we overcome the exponential data requirements via regularization.
2 Problem Definition
The online active learning problem for regression is defined as follows. We sequentially observe covariate vectors in a dimensional space , which are i.i.d. When presented with the th observation, we must choose whether we want to label it or not, i.e., choose to observe the outcome. If we decide to label the observation, then we obtain . Otherwise, we do not see its label, and the outcome remains unknown. We can label at most out of the observations.
We assume covariates are distributed according to some known distribution , with zero mean , and covariance matrix . We relax this assumption later. In addition, we assume that follows a linear model: , where and i.i.d. We denote observations by , components by , and sets in boldface: .
After selecting observations, , we output an estimate , with no intercept (we assume covariates and outcome are centered). Our goal is to minimize the expected MSE of in norm, i.e. , under random design; that is, when the ’s are random and the algorithm may be randomized. This is related to the A-optimality criterion [22]. We use the experimentation budget to minimize the variance of by sampling from a different thresholded distribution. Minimizing expected MSE is equivalent to minimizing the trace of the normalized inverse of the Fisher information matrix ,
where expectations are over all sources of randomness. In this setting, the OLS estimator is the best linear unbiased estimator by the Gauss–Markov Theorem. Also, for any set of i.i.d. observations, has sampling distribution , [13]. In Section 3.3, we tackle high dimensionality, where , via Lasso estimators within a two-stage algorithm.
3 Algorithm and Main Results
In this section we motivate the algorithm, state the main result quantifying its performance for general distributions, and provide a high-level overview of the proof. A corollary for the Gaussian distribution is presented, and we also extend the algorithm by making the threshold adaptive. Finally, we show how to generalize the results to sparse linear regression. In Appendix E, we derive a CLT approximation with guarantees that is useful in complex or unknown distributional settings.
Without loss of generality, we assume that each observation is white, that is, is the identity matrix. For correlated observations , we apply to whiten them (see Appendix A). Note that .
We bound the whitened trace as
(1) 
To minimize the expected MSE, we need to maximize the minimum eigenvalue of with high probability. The thresholding procedure in Algorithm 1 maximizes the minimum eigenvalue of through two observations. First, since the sum of eigenvalues of is the trace of , which is in turn the sum of the squared norms of the observations, the algorithm chooses observations of large (weighted) norm. Second, the eigenvalues of should be balanced, that is, have similar magnitudes. This is achieved by selecting the appropriate weights for the norm.
Let be a vector of weights defining the norm . Let be a threshold. Algorithm 1 simply selects the observations with weighted norm larger than . The selected observations can be thought of as i.i.d. samples from an induced distribution : the original distribution conditional on . Suppose observations are chosen and denoted by . Then , where is the covariance matrix with respect to . This covariance matrix is diagonal under density symmetry assumptions, as thresholding preserves uncorrelatedness; its diagonal terms are
(2) 
Hence, and . The main technical result in Theorem 3.1 is to link the eigenvalues of the random matrix to its deterministic counterpart . From the above calculations, the goal is to find such that , and both are as large as possible. The first objective is achieved when there exists some such that
(3)
We note that if has independent components with the same marginal distribution (after whitening), then it suffices to choose for all . It is necessary to choose unequal weights when the marginal distributions of the components are different, e.g., some are Gaussian and some are uniform, or when components are dependent. For jointly Gaussian data, whitening removes dependencies, so we set .
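The effect of thresholding on the induced covariance can be checked by simulation. The following is a minimal sketch, not the authors' code: it assumes white Gaussian covariates, a weighted squared norm of the form `sum_j weights[j] * x[j]**2`, and hypothetical names (`conditional_second_moments`, `gamma`, `n_draws`). It estimates the per-coordinate second moments of the observations that pass the threshold.

```python
import random


def conditional_second_moments(d, gamma, weights, n_draws, seed=0):
    """Monte Carlo estimate of E[X_j^2 | weighted norm of X > gamma].

    Assumes X has i.i.d. standard Gaussian components (white data) and that
    the weighted squared norm is sum_j weights[j] * X_j^2 (an assumed form).
    Returns the per-coordinate estimates and the fraction of draws kept.
    """
    rng = random.Random(seed)
    sums = [0.0] * d
    kept = 0
    for _ in range(n_draws):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sq_norm = sum(w * v * v for w, v in zip(weights, x))
        if sq_norm > gamma * gamma:
            kept += 1
            for j, v in enumerate(x):
                sums[j] += v * v
    if kept == 0:
        raise ValueError("no draw exceeded the threshold; lower gamma")
    return [s / kept for s in sums], kept / n_draws


if __name__ == "__main__":
    moments, fraction = conditional_second_moments(
        d=3, gamma=2.0, weights=[1.0, 1.0, 1.0], n_draws=20000, seed=1
    )
    print("fraction kept:", fraction)
    print("conditional second moments:", moments)
```

With equal weights and identically distributed components, the estimated moments should be roughly equal to each other and larger than the unconditional value of one, which is the balance the weights are meant to achieve.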
3.1 Thresholding Algorithm
The algorithm is simple. For each incoming observation we compute its weighted norm (possibly after whitening if necessary). If the norm is above the threshold , then we select the observation, otherwise we ignore it. We stop when we have collected observations. Note that random sampling is equivalent to setting .
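The selection rule just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the paper's Algorithm 1 verbatim: observations are assumed to be already whitened, the weighted norm is assumed to be the square root of `sum_j weights[j] * x[j]**2`, and the names `weighted_norm`, `threshold_select`, `gamma` and `k` are hypothetical.

```python
import math


def weighted_norm(x, weights):
    """Weighted Euclidean norm, sqrt(sum_j weights[j] * x[j]^2) (assumed form)."""
    return math.sqrt(sum(w * v * v for w, v in zip(weights, x)))


def threshold_select(stream, weights, gamma, k):
    """Return the indices of the observations chosen for labeling.

    Scans the stream in arrival order, selects an observation when its
    weighted norm is strictly above gamma, and stops once k are selected.
    """
    selected = []
    for index, x in enumerate(stream):
        if len(selected) >= k:
            break
        if weighted_norm(x, weights) > gamma:
            selected.append(index)
    return selected


if __name__ == "__main__":
    stream = [[0.1, 0.2], [3.0, 4.0], [1.0, 0.0], [0.0, 6.0], [5.0, 5.0]]
    print(threshold_select(stream, weights=[1.0, 1.0], gamma=2.0, k=2))
```

Because the threshold is fixed in advance, each decision depends only on the current observation, which is what keeps the selected observations i.i.d. in the analysis.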
We want to catch the largest observations given our budget, therefore we require that satisfies
(4) 
If we apply this rule to independent observations coming from , on average we select of them: the largest. If is a solution to (3) and (4), then is also a solution for any . So we require .
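When no closed-form quantile is at hand, a threshold with the desired selection rate can be approximated from simulated draws. The sketch below is an assumption-laden illustration, not the paper's procedure: it takes a sample of weighted norms from the (known) covariate distribution and returns the value that leaves roughly a `k / n` fraction of that sample strictly above it. The helper names `simulate_gaussian_norms` and `calibrate_threshold` are hypothetical.

```python
import math
import random


def simulate_gaussian_norms(d, m, seed=0):
    """Draw m Euclidean norms of d-dimensional standard Gaussian vectors."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(m)
    ]


def calibrate_threshold(sample_norms, k, n):
    """Empirical threshold that keeps about a k/n fraction of the sample.

    Uses integer arithmetic so that, absent ties, exactly
    (len(sample_norms) * k) // n sample values lie strictly above the result.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    ordered = sorted(sample_norms)
    m = len(ordered)
    keep = (m * k) // n
    if keep == 0:
        return ordered[-1]  # nothing in the sample exceeds the threshold
    if keep >= m:
        return 0.0  # every strictly positive norm passes
    return ordered[m - keep - 1]


if __name__ == "__main__":
    norms = simulate_gaussian_norms(d=3, m=5000, seed=2)
    gamma = calibrate_threshold(norms, k=100, n=1000)
    print("threshold:", gamma)
    print("fraction above:", sum(v > gamma for v in norms) / len(norms))
```

The empirical quantile stands in for the exact tail probability in (4); for specific distributions an analytic quantile would replace the simulation.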
Algorithm 1 can be seen as a regularizing process similar to ridge regression, where the amount of regularization depends on the distribution and the budget ratio ; it improves the conditioning of the problem.
Guarantees when is unknown can be derived as follows: we allocate an initial sequence of points to estimation of the inverse of the covariance matrix, and the remainder to labeling (where we no longer update our estimate). In this manner observations remain independent. Note that observations are required for accurate recovery when is subgaussian, and if subexponential, [26]. Errors from using the estimate to whiten and make decisions are bounded and small with high probability (via Cauchy–Schwarz), and the result is equivalent to using a slightly worse threshold.
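The split between an estimation phase and a labeling phase can be illustrated as follows. This is only a sketch under assumptions: covariates are taken to be centered, the covariance is estimated by the plain sample second-moment matrix, and the names `split_stream`, `sample_covariance` and `m` (the length of the initial segment) are hypothetical. Inverting the estimate and whitening with it are left out.

```python
def split_stream(stream, m):
    """Split the stream into an estimation prefix of length m and the rest."""
    return stream[:m], stream[m:]


def sample_covariance(rows):
    """Sample second-moment matrix (1/m) * sum_i x_i x_i^T for centered data."""
    m = len(rows)
    d = len(rows[0])
    cov = [[0.0] * d for _ in range(d)]
    for x in rows:
        for i in range(d):
            for j in range(d):
                cov[i][j] += x[i] * x[j] / m
    return cov


if __name__ == "__main__":
    stream = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0], [3.0, 1.0]]
    prefix, rest = split_stream(stream, 4)
    # The estimate is computed once from the prefix and then frozen, so the
    # later selection decisions do not depend on each other.
    print(sample_covariance(prefix))
    print(len(rest), "observations left for thresholding")
```

No labels are needed for the prefix, since only covariates enter the estimate.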
Algorithm 1 keeps the threshold fixed from the beginning, leading to a mathematically convenient analysis, as it generates i.i.d. observations. However, Algorithm 1b, which is adaptive and updates its parameters after each observation, produces slightly better results, as we empirically show in Appendix K. Before making a decision on , Algorithm 1b finds satisfying (3) and
(5) 
where is the number of observations already labeled. The idea is identical: set the threshold to capture, on average, the number of observations still to be labeled, that is , out of the number still to be observed, .
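The adaptive rule can be sketched as below. It is a simplified illustration, not Algorithm 1b itself: the stream is represented directly by its weighted norms, the quantile is taken from a hypothetical reference sample of norms (`reference_norms`) instead of a known distribution, and the function name `adaptive_threshold_select` is an assumption. Before each decision the threshold is reset so that the expected fraction passing equals the remaining budget divided by the remaining stream length.

```python
def adaptive_threshold_select(norms, k, reference_norms):
    """Select up to k indices from a stream of weighted norms.

    At each step the threshold is the empirical quantile of reference_norms
    that leaves about (remaining budget) / (remaining stream) of the
    reference values strictly above it.
    """
    n = len(norms)
    ref = sorted(reference_norms)
    m = len(ref)
    selected = []
    for seen, value in enumerate(norms):
        remaining_budget = k - len(selected)
        if remaining_budget <= 0:
            break
        remaining_stream = n - seen
        if remaining_budget >= remaining_stream:
            # Everything still to come must be labeled to use the budget.
            selected.append(seen)
            continue
        keep = (m * remaining_budget) // remaining_stream
        gamma = ref[m - keep - 1] if keep > 0 else ref[-1]
        if value > gamma:
            selected.append(seen)
    return selected


if __name__ == "__main__":
    reference = [float(v) for v in range(1, 11)]
    stream_norms = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0, 5.0, 10.0]
    print(adaptive_threshold_select(stream_norms, k=3, reference_norms=reference))
```

Unlike the fixed-threshold rule, this variant always spends the full budget when the stream is at least as long as the budget, because the threshold relaxes as the stream runs out.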
Importantly, active learning not only decreases the expected MSE, but also its variance. Since the variance of the MSE for fixed depends on [13], it is also minimized by selecting observations that lead to large eigenvalues of .
3.2 Main Theorem
Theorem 3.1 states that by sampling observations from where satisfy (3), the estimation performance is significantly improved, compared to randomly sampling observations from the original distribution. Section 4 shows the gain in Theorem 3.1 essentially cannot be improved and Algorithm 1 is optimal. A sketch of the proof is provided at the end of this section (see Appendix B).
Theorem 3.1
Let . Assume observations are distributed according to subgaussian with covariance matrix . Also, assume marginal densities are symmetric around zero after whitening. Let be a matrix with observations sampled from the distribution induced by the thresholding rule with parameters satisfying (3). Let , so that , then, with probability at least
(6) 
where constants depend on the subgaussian norm of .
While Theorem 3.1 is stated in fairly general terms, we can apply the result to specific settings. We first present the Gaussian case where white components are independent. The proof is in Appendix D.
Corollary 3.2
If the observations in Theorem 3.1 are jointly Gaussian with covariance matrix , for all , and , for some constant , then with probability at least we have that
(7) 
The MSE of random sampling for white Gaussian data is proportional to , by the inverse Wishart distribution. Active learning provides a gain factor of order with high probability (a very similar term shows up for random sampling). Note that our algorithm may select fewer than observations. Then, when the number of observations yet to be seen equals the remaining labeling budget, we should select all of them (equivalent to random sampling). The number of observations with has a binomial distribution and is highly concentrated around its mean , with variance . By Chernoff bounds, the probability that the algorithm selects fewer than decreases exponentially fast in . Thus, these deviations are dominated in the bound of Theorem 3.1 by the leading term. In practice, one may set the threshold in (4) by choosing observations for some small , or use the adaptive threshold in Algorithm 1b.
3.3 Sparsity and Regularization
The gain provided by active learning in our setting suffers from the curse of dimensionality, as it diminishes very fast when increases, and Section 4 shows the gain cannot be improved in general. For high-dimensional settings (where ) we assume sparsity in , that is, we assume the support of contains at most nonzero components, for some . In Appendix J, we also provide related results for Ridge regression.
We state the two-stage Sparse Thresholding Algorithm (see Algorithm 2) and show this algorithm effectively overcomes the curse of dimensionality. For simplicity, we assume the data is Gaussian, . Based, for example, on the results of [25] and Theorem 1 in [16], we could extend our results to subgaussian data via the Orthogonal Matching Pursuit algorithm for recovery. The two-stage algorithm works as follows. First, we focus on recovering the true support, , by selecting the very first observations (without thresholding), and computing the Lasso estimator . Second, we assign the weights : for , we set , otherwise we set . Then, we apply the thresholding rule to select the remaining observations. While observations are collected in all dimensions, our final estimate is the OLS estimator computed only including the observations selected in the second stage, and exclusively in those dimensions in .
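The second stage of the two-stage procedure can be sketched as follows. This is a hedged illustration, not Algorithm 2 itself: the Lasso step is omitted and its output is represented by a given set of indices (`support`), the weights are assumed to be one on the estimated support and zero elsewhere, and the names `support_weights`, `sparse_threshold_select`, `gamma` and `budget` are hypothetical.

```python
import math


def support_weights(d, support):
    """Weights that restrict the norm to the estimated support (assumed 0/1)."""
    return [1.0 if j in support else 0.0 for j in range(d)]


def sparse_threshold_select(stream, support, gamma, budget):
    """Second-stage selection on the coordinates in the estimated support.

    Returns the selected observations restricted to the support, which is
    the design that a final OLS fit on those dimensions would use.
    """
    if not stream:
        return []
    support = set(support)
    columns = sorted(support)
    weights = support_weights(len(stream[0]), support)
    selected = []
    for x in stream:
        if len(selected) >= budget:
            break
        norm = math.sqrt(sum(w * v * v for w, v in zip(weights, x)))
        if norm > gamma:
            selected.append([x[j] for j in columns])
    return selected


if __name__ == "__main__":
    stream = [[10.0, 0.0, 0.1], [0.5, 9.0, 3.0], [0.0, 0.0, 4.0], [1.0, 1.0, 1.0]]
    print(sparse_threshold_select(stream, support={0, 2}, gamma=2.0, budget=2))
```

Coordinates outside the support do not influence the decision, so a large value in an irrelevant dimension (the second coordinate of the second observation above) does not by itself trigger a selection.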
Note that, in general, the points that end up being selected by our algorithm are informational outliers, while not necessarily geometric outliers in the original space. After applying the whitening transformation, ignoring some dimensions based on the Lasso results, and then thresholding based on a weighted norm possibly learnt from data (say, if components are not independent, and we recover the covariance matrix in an online fashion), the algorithm is able to identify good points for the underlying data distribution and .
Theorem 3.3 summarizes the performance of Algorithm 2; it requires the standard assumptions on and for support recovery (see Theorem 3 in [27]).
Theorem 3.3
Let . Assume and satisfy the standard conditions given in Theorem 3 of [27]. Assume we run the Sparse Thresholding algorithm with observations to recover the support of , for an appropriate . Let be observations sampled via thresholding on . It follows that for such that , there exist some universal constants , and that depend on the subgaussian norm of , such that with probability at least
it holds that
Performance for random sampling with the Lasso estimator is . A regime of interest is , and , for large enough , and . In that case, Algorithm 2 leads to a bound of order smaller than , as opposed to a weaker constant guarantee for random sampling. The gain is at least a factor with high probability. The proof is in Appendix H. In practice, the performance of the algorithm is improved by using all the observations to fit the final estimate , as shown in simulations. However, in that case, observations are no longer i.i.d. Also, using thresholding to select the initial observations decreases the probability of making a mistake in support recovery. In Section 5 we provide simulations comparing different methods.
3.4 Proof of Theorem 3.1
The complete proof of Theorem 3.1 is in Appendix B. We only provide a sketch here. The proof is a direct application of spectral results in [26], which are derived via a covering argument using a discrete net on the unit Euclidean sphere , together with a Bernstein-type concentration inequality that controls deviations of for each element in the net. Finally, a union bound is taken over the net. Importantly, the proof shows that if our algorithm uses which are approximate solutions to (3), then (10) still holds with in the denominator of the RHS, instead of . This fact can be quite useful in practice, when is unknown. We can devote some initial budget to recover , and then find approximately solving (3) and (4) under . Note that no labeling is required.
Also, the result can be extended to subexponential distributions. In this case, the probabilistic bound will be weaker (including a term in front of the exponential). More generally, our probabilistic bounds are strongest when for some constant , a common situation in active learning [23], where super-linear requirements in seem unavoidable in noisy settings. A simple bound for the parameter can be calculated as follows. Assume there exists such that and consider the weighted squared norm . Then and which implies that . For specific distributions, can be easily computed. The last inequality is close to equality in cases where the conditional density decays extremely fast for values of above . Heavy-tailed distributions allocate mass to significantly higher values, and could be much larger than .
4 Lower Bound
In this section we derive a lower bound for the setting. Suppose all the data are given. Again choose the observations with largest norms, denoted by . To minimize the prediction error, the best possible is diagonal, with identical entries, and trace equal to the sum of the norms. No selection algorithm, online or offline, can do better. Algorithm 1 achieves this by selecting observations with large norms and uncorrelated entries (through whitening if necessary). Theorem 4.1 captures this intuition.
Theorem 4.1
Let be an algorithm for the problem we described in Section 2. Then,
(8)  
where is the white observation with the th largest norm. Moreover, fix . Let be the cdf of . Then, with probability at least .
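The offline benchmark behind this bound can be made concrete with a short computation. The sketch below is an illustration under assumptions, not a restatement of (8): it takes already whitened observations, keeps the `k` with largest squared Euclidean norm, and evaluates the inverse trace of the idealized design described above, a diagonal matrix with identical entries whose trace equals the total squared norm `T`. Each diagonal entry is then `T / d`, so the inverse has trace `d * d / T`. Any noise-variance factor is left out, and the function name `offline_trace_benchmark` is hypothetical.

```python
import heapq


def offline_trace_benchmark(observations, k):
    """Inverse trace of the idealized design built from the k largest norms.

    The idealized d x d matrix is diagonal with equal entries and trace T,
    where T is the sum of the k largest squared norms; its inverse has
    trace d^2 / T.
    """
    d = len(observations[0])
    squared_norms = [sum(v * v for v in x) for x in observations]
    total = sum(heapq.nlargest(k, squared_norms))
    if total <= 0.0:
        raise ValueError("selected observations have zero total squared norm")
    return d * d / total


if __name__ == "__main__":
    data = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
    print(offline_trace_benchmark(data, k=2))
```

A real selection of `k` observations generally cannot realize this perfectly balanced design, which is why the quantity serves as a benchmark and not as an achievable value.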
The proof is in Appendix E. The upper bound in Theorem 3.1 has a similar structure, with denominator equal to . By Theorem 3.1, for every component . Hence, summing over all components: . The latter expectation is taken with respect to , which only captures the expected largest observations out of , as opposed to in (8). The weights simply account for the fact that, in reality, we cannot make all components have equal norm, something we implicitly assumed in our lower bound.
We specialize the lower bound to the Gaussian setting, for which we computed the upper bound of Theorem 3.1. The proofs are based on the Fisher–Tippett Theorem and the Gumbel distribution; see Appendix F.
Corollary 4.2
For Gaussian observations and large , for any algorithm
Moreover, let . Then, for any with probability at least and ,
The results from Corollary 3.2 have the same structure as the lower bound; hence in this setting our algorithm is near optimal. Similar results and conclusions are derived for the CLT approximation in Appendix I.
5 Simulations
We conducted experiments in various settings: regularized estimators in high dimensions, and the basic thresholding approach on real-world data to explore its performance in strongly nonlinear environments.
Regularized Estimators. We compare the performance in high-dimensional settings of random sampling and Algorithm 1 (both with an appropriately adjusted Lasso estimator) against Algorithm 2, which takes into account the structure of the problem (). For completeness, we also show the performance of Algorithm 2 when all observations are included in the final OLS estimate, and that of random sampling (RS) and Algorithm 1 (Thr) when the true support is known in advance, and the OLS computed on . In Figure 1 (a), we see that Algorithm 2 dramatically reduces the MSE, while in Figure 1 (b) we zoom in to see that, quite remarkably, Algorithm 2 using all observations for the final estimate outperforms random sampling that knows the sparsity pattern in hindsight. We used for recovery. More experiments are provided in Appendix K.
Real-World Data. We show the results of Algorithm 1b (online estimation) with the simplest distributional assumption (Gaussian threshold, ) versus random sampling on publicly available real-world datasets (UCI, [20]), measuring test squared prediction error. We fix a sequence of values of , together with , and for each pair we run a number of iterations. In each one, we randomly split the dataset in training ( observations, random order), and test (rest of them). Finally, is computed on selected observations, and the prediction error estimated on the test set. All datasets are initially centered to have zero means (covariates and response). Confidence intervals are provided.
We first analyze the Physicochemical Properties of Protein Tertiary Structure dataset (45730 observations), where we predict the size of the residue, based on variables, including the total surface area of the protein and its molecular mass. Figure 2 (a) shows the results; Algorithm 1b outperforms random sampling for all values of . The reduction in variance is substantial. In the Bike Sharing dataset [12] we predict the number of hourly users of the service, given weather conditions, including temperature, wind speed, humidity, and temporal covariates. There are 17379 observations, and we use covariates. Our estimator has lower mean, median and variance of the MSE than random sampling; see Figure 2 (b). Finally, for the YearPredictionMSD dataset [4], we predict the year a song was released based on covariates, mainly metadata and audio features. There are 99799 observations. Both the MSE and its variance improve strongly; see Figure 2 (c).
In the examples we see that, while active learning leads to strong improvements in MSE and variance reduction for moderate values of with respect to , the gain vanishes when grows large. This was expected; the reason might be that by sampling so many outliers, we end up learning about parts of the space where heavy nonlinearities arise, which may not be important to the test distribution. However, active learning is motivated by situations of limited labeling budget, and hybrid approaches combining random sampling and thresholding could be easily implemented if needed.
6 Conclusion
Our paper provides a comprehensive analysis of thresholding algorithms for online active learning of linear regression models, which are shown to perform well both theoretically and empirically. Several natural open directions suggest themselves. Additional robustness could be guaranteed in other settings by combining our algorithm as a “black box” with other approaches: for example, some addition of random sampling or stratified sampling could be used to determine if significant nonlinearity is present, and to determine the fraction of observations that are collected via thresholding.
7 Acknowledgments
The authors would like to thank Sven Schmit for his excellent comments and suggestions, Mohammad Ghavamzadeh for fruitful discussions, and the anonymous reviewers for their valuable feedback. We gratefully acknowledge support from the National Science Foundation under grants CMMI1234955, CNS1343253, and CNS1544548.
References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pages 65–72. ACM, 2006.
[2] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Learning Theory, pages 35–50. Springer, 2007.
[3] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning. Machine learning, 80(2-3):111–139, 2010.
[4] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. 2011.
[5] W. Cai, Y. Zhang, and J. Zhou. Maximizing expected model change for active learning in regression. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 51–60. IEEE, 2013.
[6] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. pages 5–19, 2007.
[7] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.
[8] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of artificial intelligence research, 1996.
[9] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pages 208–215. ACM, 2008.
[10] S. Dasgupta, C. Monteleoni, and D. J. Hsu. A general agnostic active learning algorithm. In Advances in neural information processing systems, pages 353–360, 2007.
[11] P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling extremal events, volume 33. Springer Science & Business Media, 1997.
[12] H. Fanaee-T and J. Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 1–15, 2013.
[13] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[14] D. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 37–45, 2014.
[15] T. Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2):339–351, 2010.
[16] A. Joseph. Variable selection in high-dimension with random designs and orthogonal matching pursuit. Journal of Machine Learning Research, 14(1):1771–1800, 2013.
[17] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.
[18] A. Krause and C. Guestrin. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proceedings of the 24th international conference on Machine learning, pages 449–456. ACM, 2007.
[19] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
[20] M. Lichman. UCI machine learning repository. 2013.
[21] K. B. Petersen et al. The matrix cookbook.
[22] F. Pukelsheim. Optimal design of experiments, volume 50. SIAM, 1993.
[23] S. Sabato and R. Munos. Active regression by stratification. In Advances in Neural Information Processing Systems, pages 469–477, 2014.
[24] M. Sugiyama and S. Nakajima. Pool-based active learning in approximate linear regression. Machine Learning, 75(3):249–274, 2009.
[25] J. Tropp and A. C. Gilbert. Signal recovery from partial information via orthogonal matching pursuit, 2005.
[26] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[27] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.
[28] Y. Wang and A. Singh. Noise-adaptive margin-based active learning and lower bounds under Tsybakov noise condition. arXiv preprint arXiv:1406.5383, 2014.
A Whitening
Before thresholding the norm of incoming observations, it is useful to decorrelate and standardize their components, i.e., to whiten the data. Then, we apply the algorithm to uncorrelated covariates, with zero mean and unit variance (not necessarily independent). The covariance matrix can be decomposed as , where is orthogonal, and diagonal with . We whiten each observation to (while for , ), so that . We denote whitened observations by and in the appendix. After some algebra we see that,
(9) 
We focus on algorithms that maximize the minimum eigenvalue of with high probability, or, in general, leading to large and even eigenvalues of .
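For reference, the whitening step can be written out explicitly. The symbols below ($\Sigma$, $U$, $D$, $x$, $\bar{x}$) are assumed notation, since the original symbols are not recoverable here, and the transform shown is one standard choice consistent with the eigendecomposition described above, not necessarily the exact form used by the authors.

```latex
% Assumed notation: \Sigma is the covariance of x, with eigendecomposition
% \Sigma = U D U^\top, U orthogonal, D diagonal with positive entries.
\begin{align*}
  \Sigma &= U D U^{\top}, \\
  \bar{x} &= D^{-1/2} U^{\top} x, \\
  \operatorname{Cov}(\bar{x})
    &= D^{-1/2} U^{\top} \Sigma \, U D^{-1/2}
     = D^{-1/2} U^{\top} U D U^{\top} U D^{-1/2}
     = I .
\end{align*}
```

The last line uses only $U^{\top} U = I$ and the fact that diagonal matrices commute, so the whitened components are uncorrelated with unit variance, though not necessarily independent.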
B Proof of Theorem 3.1
Theorem B.1
Let . Assume observations are distributed according to subgaussian with covariance matrix . Also, assume marginal densities are symmetric around zero after whitening. Let be a matrix with observations sampled from the distribution induced by the thresholding rule with parameters satisfying (3). Let , so that , then, with probability at least
(10) 
where constants depend on the subgaussian norm of .

We would like to choose out of observations i.i.d. Assume our sampling induces a new distribution . The loss we want to minimize for our OLS estimate is
(11)
where we assumed Gaussian noise with variance .
Let us see how we construct . We sample , we whiten the observation , and then we select it or not according to a fixed thresholding rule. If , then we keep .
We choose and so that there exists , such that for all ,
(12)
where denotes the th component of . Note that .
Moreover, the covariance matrix of is . If is a general subgaussian distribution, thresholding could change the mean away from zero.
Assume after running our algorithm, we end up with . We denote by the observations after whitening; note that by design every passed our test: . In other words, . We see that or, alternatively, .
Now, we can derive
(13) (14) (15) (16) (17) (18) (19) (20)
where is actually white data. Thus, note that as .
Assume that is subgaussian such that if , then has full rank with probability one. Thresholding will not change the shape of the tails of the distribution, so will also be subgaussian.
At this point, we need to measure how fast goes to 1. We can use Theorem 5.39 in [26] which guarantees that, for such that , with probability at least we have
(21)
as is white subgaussian. It follows that for , with probability at least
(22)
Note that .
C Proof of
In order to justify that we want to be as close as possible to diagonal, we show the following lemma. Under our assumptions is symmetric positive definite with probability 1.
Lemma C.1
Let be a symmetric positive definite matrix. Then,
(23) 
where returns a diagonal matrix with the same diagonal as the argument.
In other words, we show that for all positive definite matrices with the same diagonal elements, the diagonal matrix (matrix with all off diagonal elements being 0) has the least trace after the inverse operation.

We show this by induction. Consider a matrix
(24)
and
(25)
since ( is positive definite), the above expression is minimized when , that is, is diagonal.
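The base case can be spelled out as a short worked computation. The entry names $a$, $b$, $c$ are assumed here for illustration; the calculation itself is the standard closed form for the inverse of a symmetric $2 \times 2$ matrix.

```latex
% Assumed notation: A is symmetric positive definite with entries a, b, c,
% so a > 0, c > 0 and ac - b^2 > 0.
\begin{align*}
  A &= \begin{pmatrix} a & b \\ b & c \end{pmatrix},
  \qquad
  A^{-1} = \frac{1}{ac - b^{2}}
           \begin{pmatrix} c & -b \\ -b & a \end{pmatrix}, \\
  \operatorname{tr}\!\left(A^{-1}\right)
    &= \frac{a + c}{ac - b^{2}}
     \;\ge\; \frac{a + c}{ac}
     = \frac{1}{a} + \frac{1}{c}
     = \operatorname{tr}\!\left(\operatorname{Diag}(A)^{-1}\right).
\end{align*}
```

Since $0 < ac - b^{2} \le ac$, the inequality holds with equality exactly when $b = 0$, which is the diagonal case.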
Assume the statement is true for all matrices. Let be a positive definite matrix. Decompose it as
(26)
By the block inverse formula (see, for example, [21]),
(27)
where . Note by Schur’s complement for positive definite matrices. Using the induction hypothesis, . By the positive definiteness of , , therefore .
Also, . Thus,
(28)
and the result follows.
D Proof of Corollary 3.2
Corollary D.1
If the observations in Theorem 3.1 are jointly Gaussian with covariance matrix , for all , and , for some constant , then with probability at least we have that
(29) 

We have to show that for all , and satisfy the equations
(30) (31)
and . The components of are independent, as observations are jointly Gaussian. It immediately follows that , for all . Thus,
(32)
The value of is strongly concentrated around its mean, . We now use two tail approximations to obtain our desired result.
By [19], we have that
(33)
If we take , then . In this case, we conclude that
(34)
Note that . Therefore, by definition
(35)
On the other hand, we would like to show that
(36)
as that would directly imply that .