Joint mean and covariance estimation with unreplicated matrix-variate data

by   Michael Hornstein, et al.

It has been proposed that complex populations, such as those that arise in genomics studies, may exhibit dependencies among the observations as well as among the variables. This gives rise to the challenging problem of analyzing unreplicated high-dimensional data with unknown mean and dependence structures. Matrix-variate approaches that impose various forms of (inverse) covariance sparsity allow flexible dependence structures to be estimated, but cannot directly be applied when the mean and dependences are estimated jointly. We present a practical method utilizing penalized (inverse) covariance estimation and generalized least squares to address this challenge. The advantages of our approaches are: (i) dependence graphs and covariance structures can be estimated in the presence of unknown mean structure, (ii) the mean structure becomes more efficiently estimated when accounting for the dependence structure(s); and (iii) inferences about the mean parameters become correctly calibrated. We establish consistency and obtain rates of convergence for estimating the mean parameters and covariance matrices. We use simulation studies and analysis of genomic data from a twin study of ulcerative colitis to illustrate the statistical convergence and the performance of the procedures in practical settings. Furthermore, several lines of evidence show that the test statistics for differential gene expression produced by our method are correctly calibrated.


Gemini: Graph estimation with matrix variate normal instances

Undirected graphs can be used to describe matrix variate distributions. ...

Slicing: Nonsingular Estimation of High Dimensional Covariance Matrices Using Multiway Kronecker Delta Covariance Structures

Nonsingular estimation of high dimensional covariance matrices is an imp...

Estimating The Proportion of Signal Variables Under Arbitrary Covariance Dependence

Estimating the proportion of signals hidden in a large amount of noise v...

Joint Estimation of Precision Matrices in Heterogeneous Populations

We introduce a general framework for estimation of inverse covariance, o...

Simultaneous and Temporal Autoregressive Network Models

While logistic regression models are easily accessible to researchers, w...

Bayesian Optimal Two-sample Tests in High-dimension

We propose optimal Bayesian two-sample tests for testing equality of hig...

Probabilistic Detection and Estimation of Conic Sections from Noisy Data

Inferring unknown conic sections on the basis of noisy data is a challen...

1 Introduction

Understanding how changes in gene expression are related to changes in biological state is one of the fundamental tasks in genomics research, and is a prototypical example of “large scale inference” (efron2010large). While some genomics datasets have within-subject replicates or other known clustering factors that could lead to dependence among observations, most are viewed as population cross-sections or convenience samples, and are usually analyzed by taking observations (biological samples) to be statistically independent of each other. Countering this conventional view, Efr09 proposed that there may be unanticipated correlations between samples even when the study design would not suggest it. To identify and adjust for unanticipated sample-wise correlations, Efr09

proposed an empirical Bayes approach utilizing the sample moments of the data. In particular, sample-wise correlations may lead to inflated evidence for mean differences, and could be one explanation for the claimed lack of reproducibility in genomics research  

(leek2010tackling; allen2012inference; sugden2013assessing).

A persistent problem in genomics research is that test statistics for mean parameters (e.g. t-statistics for two-group comparisons) often appear to be incorrectly calibrated (efron:05; allen2012inference). When this happens, for example when test statistics are uniformly overdispersed relative to their intended reference distribution, this is usually taken to be an indication of miscalibration, rather than reflecting a nearly global pattern of differential effects  (efron2007correlation). Adjustments such as genomic control  (devlin1999genomic) can be used to account for this; a related approach is that of allen2012inference. In this work we address unanticipated sample-wise dependence, which can exhibit a strong effect on statistical inference. We propose a new method to jointly estimate the mean and covariance with a single instance of the data matrix, as is common in genetics. The basic idea of our approach is to alternate for a fixed number of steps between mean and covariance estimation. We exploit recent developments in two-way covariance estimation for matrix-variate data (Zhou14a). We crucially combine the classical idea of generalized least squares (GLS)  (aitken1936iv)

with thresholding for model selection and estimation of the mean parameter vector. Finally, we use Wald-type statistics to conduct inference. We motivate this approach using differential expression analysis in a genomics context, but the method is broadly applicable to matrix-variate data having unknown mean and covariance structures, with or without replications. We illustrate, using theory and data examples, including a genomic study of ulcerative colitis, that estimating and accounting for the sample-wise dependence can systematically improve the calibration of test statistics, therefore reducing or eliminating the need for certain post-hoc adjustments.

With regard to variable selection, another major challenge we face is that variables (e.g. genes or mRNA transcripts) have a complex dependency structure that exists together with any dependencies among observations. As pointed out by Efr09 and others, the presence of correlations among the samples makes it more difficult to estimate correlations among variables, and vice versa. A second major challenge is that due to dependence among both observations and variables, there is no independent replication in the data, that is, we have a single matrix to conduct covariance estimation along both axes. This challenge is addressed in Zhou14a when the mean structure is taken to be zero. A third major challenge that is unique to our framework is that covariance structures can only be estimated after removing the mean structure, a fact that is generally not considered in most work on high dimensional covariance and graph estimation, where the population mean is taken to be zero. We elaborate on this challenge next.

1.1 Our approach and contributions

Two obvious approaches for removing the mean structure in our setting are to globally center each column of the data matrix (containing the data for one variable), or to center each column separately within each group of sample points to be compared (subsequently referred to as “group centering”). Globally centering each column, by ignoring the mean structure, may result in an estimated covariance matrix that is not consistent. Group centering all genes, by contrast, leads to consistent covariance estimation, as shown in Theorem 3 with regard to Algorithm 1. However, group centering all genes introduces extraneous noise when the true vector of mean differences is sparse. We find that there is a complex interplay between the mean and covariance estimation tasks, such that overly flexible modeling of the mean structure can introduce large systematic errors in the mean structure estimation. To mitigate this effect, we aim to center the data using a model selection strategy. More specifically, we adopt a model selection centering approach in which only mean parameters having a sufficiently large effect size (relative to the dimension of the data) are targeted for removal. This refined approach has theoretical guarantees and performs well in simulations. The estimated covariance matrix can be used in uncertainty assessment and formal testing of mean parameters, thereby improving calibration of the inference.

In Section 2, we define the two group mean model, which is commonly used in the genomics literature, and introduce the GLS algorithm in this context. We bound the statistical error for estimating each column of the mean matrix using the GLS procedure so long as each column of shares the same covariance matrix , for which we have a close approximation. It is commonly known that genes are correlated, so correlations exist across columns as well as rows of the data matrix. In particular, in Theorem 1 in Section 3.1, we establish consistency for the GLS estimator given a deterministic which is close to in the operator norm, and present the rate of convergence for mean estimation for data generated according to a subgaussian model to be defined in Definition 2.1. Moreover, we do not impose a separable covariance model in the sense of (1).

What distinguishes our model from those commonly used in the genomics literature is that we do not require that individuals are independent. Our approach to covariance modeling builds on the Gemini method (Zhou14a), which is designed to estimate a separable covariance matrix for data with two-way dependencies. For matrices and , the Kronecker product is the block matrix for which the th block is , for . We say that an random matrix follows a matrix variate distribution with mean and a separable covariance matrix


if has mean and covariance . Here is formed by stacking the columns of into a vector in . For the mean matrix , we focus on the two-group setting to be defined in (4). Intuitively, describes the covariance between columns while describes the covariance between rows of . Even with perfect knowledge of , we can only estimate and up to a scaling factor, as for any , and hence this will be our goal and precisely what we mean when we say we are interested in estimating covariances and . However, this lack of identifiability does not affect the GLS estimate, because the GLS estimate is invariant to rescaling the estimate of .

1.2 Related work

Efr09 introduced an approach for inference on mean differences in data with two-way dependence. His approach uses empirical Bayes ideas and tools from large scale inference, and also explores how challenging the problem of conducting inference on mean parameters is when there are uncharacterized dependences among samples. We combine GLS and variable selection with matrix-variate techniques. allen2012inference

also consider this question and develop a different approach that uses ordinary least squares (OLS) through the iterations, first decorrelating the residuals and then using OLS techniques again on this adjusted dataset. The confounder adjustment literature in genomics, including

sun2012multiple and wang2015confounder, can also be used to perform large-scale mean comparisons in similar settings that include similarity structures among observations. These methods use the same general matrix decomposition framework, where the mean and noise are separated. They exploit low-rank structure in the mean matrix, as well as using sparse approximation of OLS estimates, for example where thresholding. Our model introduces row-wise dependence through matrix-variate noise, while the confounder adjustment literature instead assumes that a small number of latent factors also affect the mean expression, resulting in additional low-rank structure in the mean matrix. Web Supplement Section J contains detailed comparisons between our approach and these alternative methods.

Our inference procedures are based on Z-scores and associated FDR values for mean comparisons of individual variables. While we account for sample-wise correlations, gene-gene correlations remain, which we regard as a nuisance parameter. Our estimated correlation matrix among the genes can be used in future work in combination with the line of work that addresses FDR in the presence of gene correlations. This relies on earlier work for false discovery rate estimation using correlated data, including

owen2005variance; benjamini2001control; tony2011optimal; li2014rate; benjamini1995controlling; storey2003positive. Taking a different approach, hall2010innovated develop the innovated higher criticism test statistics to detect differences in means in the presence of correlations between genes. Our estimated gene-gene correlation matrix can be used in combination with this approach; we leave this as future work. Another line of relevant research has focused on hypothesis testing of high-dimensional means, exploiting assumed sparsity of effects, and developing theoretical results using techniques from high dimensional estimation theory. Work of this type includes caihigh; chen2014twoSampleThresh; bai1996effect; chen2010two. Hoff11

adopts a Bayesian approach, using a model that is a generalization of the matrix-variate normal distribution.

Our method builds on the Gemini estimator introduced by Zhou14a, which estimates covariance matrices when both rows and columns of the data matrix are dependent. In the setting where correlations exist along only one axis of the array, researchers have proposed various covariance estimators and studied their theoretical and numerical properties (BGA08; FFW09; FHT07; LF07; MB06; PWZZ09; RWRY08; RBLZ08; YL07; ZLW08). Although we focus on the setting of Kronecker products, or separable covariance structures, cai2015joint proposed a covariance estimator for a model with several populations, each of which may have a different variable-wise covariance matrix. Our methods can be generalized to this setting. tan2014sparse use a similar matrix-variate data setting as in (1), but perform biclustering instead of considering a regression problem with a known design matrix.

1.3 Notation and organization

Before we leave this section, we introduce the notation needed for the technical sections. Let be the canonical basis of . For a matrix , let denote the determinant and be the trace of . Let denote the entry-wise max norm. Let denote the matrix norm. The Frobenius norm is given by . Let denote the

th largest eigenvalue of

, with and denoting the largest and smallest eigenvalues, respectively. Let be the condition number for matrix . Let denote the sum of the absolute values of the off-diagonal entries and let denote the number of non-zero off-diagonal entries. Let . Denote by the stable rank . We write for a diagonal matrix with the same diagonal as . Let

be the identity matrix. We let

be positive constants which may change from line to line. For two numbers , and . Let . For sequences , we write if for some positive absolute constant which is independent of and or sparsity parameters, and write if . We write if for some positive absolute constant which is independent of and or sparsity parameters. We write if

. For random variables

and , let denote that and follow the same distribution.

The remainder of the paper is organized as follows. In Section 2, we present our matrix-variate modeling framework and methods on joint mean and covariance estimation. In particular, we propose two algorithms for testing mean differences based on two centering strategies. In Section 3, we present convergence rates for these methods. In Theorems 3 and 4, we provide joint rates of convergence for mean and covariance estimation using Algorithms 1 and 2, respectively. We also emphasize the importance of the design effect (c.f. equation (15)) in testing and present theoretical results for estimating this quantity in Corollary 2 and Corollary 5. In Section 4, we demonstrate through simulations that our algorithms can outperform OLS estimators in terms of accuracy and variable selection consistency. In Section 5, we analyze a gene expression dataset, and show that our method corrects test statistic overdispersion that is clearly present when using sample mean based methods (c.f. Section 4.2). We conclude in Section 6. We place all technical proofs and additional simulation and data analysis results in the Web Supplement, which is organized as follows. Sections A and  B contain additional simulation and data analysis results. Section C contains some preliminary results and notation. In Section D, we prove Theorem 1. In Sections E and F we prove Theorem 3. In Section G, we derive entry-wise rates of convergence for the sample covariance matrices. In Sections H and I we prove Theorem 4 and its auxiliary results. In Section J we provide additional comparisons between our method and some related methods on both simulated and real data.

2 Models and methods

In this section we present our model and method for joint mean and covariance estimation. Our results apply to subgaussian data. Before we present the model, we define subgaussian random vectors and the norm. The condition on a scalar random variable is equivalent to the subgaussian tail decay of , which means For a vector , denote by .

Definition 2.1.

Let be a random vector in . (a) is called isotropic if for every , . (b) is with a constant if for every ,

Our goal is to estimate the group mean vectors , the vector of mean differences between two groups , the row-wise covariance matrix , and the column-wise covariance matrix . In our motivating genomics applications, the people by people covariance matrix is often incorrectly anticipated to have a simple known structure, for example, is taken to be diagonal if observations are assumed to be uncorrelated. However, we show by example in Section 5 that departures from the anticipated diagonal structure may occur, corroborating earlier claims of this type by Efr09 and others. Motivated by this example, we define the two-group mean model and the GLS algorithm, which takes advantage of the covariance matrix .

The model. Our model for the matrix-variate data can be expressed as a mean matrix plus a noise term,


where columns (and rows) of are subgaussian. Let be defined as


Let denote a vector of ones. For the two-group model, we take the mean matrix to have the form


is the design matrix and is a matrix of group means. Let denote the vector of mean differences. Let denote the size of the support of . To estimate the group means, we use a GLS estimator,


where is an estimate of the observation-wise inverse covariance matrix. Throughout the paper, we denote by the oracle GLS estimator, since it depends on the unknown true covariance . Also, we denote the estimated vector of mean differences as , where .

2.1 Matrix-variate covariance modeling

In the previous section, we have not yet explicitly constructed an estimator of . To address this need, we model the data matrix with a matrix-variate distribution having a separable covariance matrix, namely, the covariance of follows a Kronecker product covariance model. When (2) follows a matrix-variate normal distribution , as considered in  Zhou14a, the support of encodes conditional independence relationships between samples, and likewise, the support of encodes conditional independence relationships among genes. The inverse covariance matrices and have the same supports as their respective correlation matrices, so edges of the dependence graphs are identifiable under the model . When the data is subgaussian, the method is still valid for obtaining consistent estimators of , , and their inverses, but the interpretation in terms of conditional independence does not hold in general.

Our results do not assume normally distributed data; we analyze the subgaussian correspondent of the matrix variate normal model instead. In the Kronecker product covariance model we consider in the present work, the noise term has the form for a mean-zero random matrix with independent subgaussian entries satisfying . Clearly, . Here, the matrix represents the shared covariance among variables for each sample, while represents the covariance among observations which in turn is shared by all genes.

For identifiability, and convenience, we define


where the scaling factor is chosen so that has trace . For the rest of the paper and refer to and , as defined in (6). Let and denote sample covariance matrices to be specified. Let the corresponding sample correlation matrices be defined as


The baseline Gemini estimators  (Zhou14a) are defined as follows, using a pair of penalized estimators for the correlation matrices and ,


where the input are a pair of sample correlation matrices as defined in (7).

Let denote the estimator of the mean matrix in (1). Denote the centered data matrix and the sample covariance matrices as


Define the diagonal matrices of sample standard deviations as


2.2 Group based centering method

We now discuss our first method for estimation and inference with respect to the vector of mean differences , for and as in (4). Our approach in Algorithm 1 is to remove all possible mean effects by centering each variable within every group.

Algorithm 1: GLS-Global group centering
Input: ; and : indices of group one and two, respectively.
Output: , , , , , for all

1. Group center the data.

Let denote the th row of the data matrix. To estimate the group mean vectors : Compute sample mean vectors

Center the data by
2. Obtain regularized correlation estimates.
  1. The centered data matrix used to calculate and for Algorithm 1 is , where is the projection matrix that performs within-group centering,


    with and as defined in (3). Compute sample covariance matrices based on group-centered data:   and

  2. Compute (7) to obtain penalized correlation matrices and using the Gemini estimators as defined in (8a) and (8b) with tuning parameters to be defined in (23).

3. Rescale the estimated correlation matrices to obtain penalized covariance
4. Estimate the group mean matrix

using the GLS estimator as defined in (5).

5. Obtain test statistics.

The th test statistic is defined as


and , for . Note that as defined in (15

) is essentially a Wald test and the denominator is a plug-in standard error of


2.3 Model selection centering method

In this section we present Algorithm 2, which aims to remove mean effects that are strong enough to have an impact on covariance estimation. The strategy here is to use a model selection step to identify variables with strong mean effects.

Algorithm 2: GLS-Model selection centering
Input: , and : indices of group one and two, respectively.
Output: , , , , , for all

1. Run Algorithm 1.

Use the group centering method to obtain initial estimates for all . Let and be as obtained in (14).

2. Select genes with large estimated differences in means.

Let denote the set of genes which we consider as having strong mean effects, where


with , , and .

3. Calculate Gram matrices based on model selection centering.

Global centering can be expressed in terms of the projection matrix . Compute the centered data matrix

where denotes the th column of the centered data matrix . Compute the sample covariance and correlation matrices with following (9) and (7).

4. Estimate covariances and means.
  1. Obtain the penalized correlation matrices and using Gemini estimators as defined in (8a) and (8b) with tuning parameters of the same order as those in (23).

  2. Obtain inverse covariance estimates , using (14).

  3. Calculate the GLS estimator as in (5), as well as the vector of mean differences , for .

5. Obtain test statistics.

Calculate test statistics as in (15), now using as estimated in Step 4.

Remarks. In the case that is sparse, we show that this approach can perform better than the approach in Section 2.2, in particular when the sample size is small. We now consider the expression in (16) as an upper bound on the threshold in the sense that it is chosen to tightly control false positives. In Section 4.2 we show in simulations that with this plug-in estimate , Algorithm 2 can nearly reach the performance of GLS with the true . Since this choice of acts as an order on the threshold we need, the plug-in method can also be applied with a multiplier between and . When we set at its lower bound, namely,

we anticipate many false positives. In Figure 3, we show that the performance of Algorithm 2 is stable in the setting of small and sparse for different values of , demonstrating robustness of our methods to the multiplier; there we observe that the performance can degrade if the threshold is set to be too small, eventually reaching the performance of Algorithm 1.

Second, if an upper bound on the number of differentially expressed genes is known a priori, one can select a set of genes to group center such that the cardinality is understood to be chosen as an upper bound on based on prior knowledge. We select the set by ranking the components of the estimated vector of mean differences . In the data analysis in Section 5 we adopt this strategy in an iterative manner by successively halving the number of selected genes, choosing at each step the genes with largest estimated mean differences from the previous step. We show in this data example and through simulation that the proposed method is robust to the choice of .

Finally, it is worth noting that these algorithms readily generalize to settings with more than two groups, in which case we simply group center within each group. This is equivalent to applying the method with a different design matrix

. In fact, we can move beyond group-wise mean comparisons to a regression analysis with a fixed design matrix

, which includes the -group mean analysis as a special case.

3 Theoretical results

We first state Theorem 1, which provides the rate of convergence of the GLS estimator (5) when we use a fixed approximation of the covariance matrix . We then provide in Theorems 3 and 4 the convergence rates for estimating the group mean matrix for Algorithms 1 and 2 respectively. In Theorem 3 we state rates of convergence for the Gemini estimators of and when the input sample covariance matrices use the group centering approach as defined in Algorithm 1, while in Theorem 4, we state only the rate of convergence for estimating , anticipating that the rate for can be similarly obtained, using the model selection centering approach as defined in Algorithm 2.

3.1 GLS under fixed covariance approximation

We now state a theorem on the rate of convergence of the GLS estimator (5), where we use a fixed approximation to , where the operator norm of is small in the sense of (17). We will specialize Theorem 1 to the case where is estimated using the baseline method in Zhou14a when follows subgaussian matrix-variate distribution as in (1). We prove Theorem 1 in Web Supplement Section D.

Theorem 1.

Let be an random matrix with independent entries satisfying , . Let be the columns of . Suppose the th column of the data matrix satisfies . Suppose is a positive definite symmetric matrix. Let . Suppose


Then with probability at least

, for some absolute constants , ,

and (20)

Remarks. If the operator norm of is bounded, that is , then condition (17) is equivalent to . The term in (19) reflects the error due to approximating with , whereas reflects the error in estimating the mean matrix (5) using GLS with the true for the random design . The term is , whereas is . The dominating term in (19) can be replaced by the tighter bound, namely, , with . This bound correctly drops the factor of present in (19) and (20), while revealing that variation aligned with the column space of is especially important in mean estimation.

Note that the condition (17) is not stringent, and that the estimates used in Algorithms 1 and 2 have much lower errors than this. When is known, and can be the usual Gram matrices, and the theory in Zhou14a guarantees that as defined in (19) has rate , with . However in our setting, in general is nonzero. In Sections 2.2 and 2.3 we provide two constructions for and , which differ in how the data are centered. These constructions have a different bound , as we will discuss in Theorems 3 and 4.

In Section 4, we present simulation results that demonstrate the advantage of the oracle GLS and GLS with estimated (5) over the sample mean based (OLS) method (c.f. (12) and (32)) for mean estimation as well as the related variable selection problem with respect to . There, we scrutinize this quantity and its estimation procedure in detail.

Design effect.

The “design effect” is the variance of the “oracle” GLS estimator (

5) of using the true , that is,


The design effect reflects the potential improvement of GLS over OLS. It appears as a factor above in , so it contributes to the rate of mean parameter estimation as characterized in Theorem 1. Lower variance in the GLS estimator of the mean difference contributes to greater power of the test statistics relative to OLS. The design effect also appears as a scale factor in the test statistics for (15), and therefore it is particularly important that the design effect is accurately estimated in order for the test statistics to be properly calibrated. In a study focusing on mean differences, it may be desirable to assess the sample size needed to detect a given effect size using our methodology. Given the design effect, our tests for differential expression are essentially Z-tests based on the GLS fits, followed by some form of multiple comparisons adjustment.

Corollary 2.

Let , , and . Under the conditions of Theorem 1, the relative error in estimating the design effect is bounded as


with probability , for some absolute constants .

We prove Corollary 2 in Web Supplement Section D.2. Corollary 2 implies that given an accurate estimator of , the design effect is accurately estimated and therefore suggests that traditional techniques can be used to gain an approximate understanding of the power of our methods. We show that can be accurately estimated under conditions in Theorems 3 and 4. If pilot data are available that are believed to have similar between-sample correlations to the data planned for collection in a future study, Corollary 2 also justifies using this pilot data to estimate the design effect. If no pilot data are available, it is possible to conduct power analyses based on various plausible specifications for the matrix.

3.2 Rates of convergence for Algorithms 1 and 2

We state the following assumptions.
(A1) The number of nonzero off-diagonal entries of and satisfy

(A2) The eigenvalues of and are bounded away from 0 and . We assume that the stable ranks satisfy , where .

Theorem 3.

Suppose that (A1) and (A2) hold. Consider the data as generated from model (2) with , where and are positive definite matrices, and is an random matrix as defined in Theorem 1. Let be some absolute constants. Let and . (I) Let and denote the penalty parameters for (8b) and (8a) respectively. Suppose


Then with probability at least , for as define in (11),


Furthermore, with probability at least ,

where (25)

The same conclusions hold for the inverse estimate, with being bounded in the same order as in (25). (II) Let be defined as in (5) with being defined as in (14) and as in (4). Then, with probability at least the following holds for all ,


We prove Theorem 3 part I in Web Supplement Section E; this relies on rates of convergence of and in the operator and the Frobenius norm, which are established in Lemma S7. We prove part II in Web Supplement Section E.2.

Remarks. We find that the additional complexity of estimating the mean matrix leads to an additional additive term of order appearing in the convergence rates for covariance estimation for and . In part I of Theorem 3, is decomposed into two terms, one term reflecting the variance of , and one term reflecting the bias due to group centering. The variance term goes to zero as increases, and the bias term goes to zero as increases. To analyze the error in the GLS estimator based on , we decompose as

where the first term is the error due to not knowing , and the second term is the error due to not knowing . The rate of convergence given in (26) reflects this decomposition. For Algorithm 2, we have analogous rates of convergence for both mean and covariance estimation. Simulations suggest that the constants in the rates for Algorithm 2 are smaller than those in (26).

We state the following assumptions for Theorem 4 to hold on Algorithm 2.

(A2’) Suppose (A2) holds, and .

(A3) Let . Let denote the sparsity of . Assume that .

Remarks. Condition (A2’) is mild, because the condition on the stable rank of already implies that .

Theorem 4.

Suppose that (A1), (A2’), and (A3) hold. Consider the data as generated from model (4) with , where and are positive definite matrices, and is an random matrix as defined in Theorem 3. Let denote the penalty parameter for estimating . Suppose is as defined in (23). Let


Then with probability at least , for output of Algorithm 2,