1 Introduction
Linear regression, which dates back to the beginning of the 19th century, is one of the most important statistical tools for practitioners. Theoretical research addressing the various issues arising from big-data analyses has gained much attention in the last decade. One important problem is to determine which covariates (aka “predictors”) are “useful” or “important” in a linear regression. In the early days (before the 1970s), people often relied on the t-test to assess the importance of each individual predictor in a regression model, although this method is known to be problematic in the presence of high multicollinearity. A greedy stepwise regression method was later proposed in
Efroymson (1960) to alleviate some of these flaws. Good criteria for directing its operation, such as the Akaike information criterion (AIC; Akaike, 1998) and the Bayesian information criterion (BIC; Schwarz, 1978), were developed later. In recent years, owing to advances in data-generating technologies in both science and industry, researchers have found that various regularization-based regression methods, such as the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), and many others, are quite effective at handling high-dimensional data and selecting relevant covariates with certain guaranteed theoretical properties, such as
selection consistency and the oracle property.

For the linear regression model $y = X\beta + \epsilon$, we are interested in testing the hypotheses $H_{0j}: \beta_j = 0$, $j = 1, \ldots, p$, simultaneously, and in finding a statistical rule to decide which hypotheses should be rejected. Let $S_0 = \{j : \beta_j = 0\}$ be the set of variables not in the true model and let $S_1 = S_0^c$. Let
$\hat{S}$ be the set of variables selected by a statistical rule. The FDR, quantifying the type I error of this rule, is defined as
$$\mathrm{FDR} = \mathbb{E}\!\left[\frac{|\hat{S} \cap S_0|}{|\hat{S}| \vee 1}\right].$$
The task of controlling the FDR at a designated level, say
$q$, is challenging in two respects: (i) the test statistics and their distributions under the null hypotheses are not easily available for high-dimensional problems; and (ii) the dependence structure of the covariates induces a complex dependence structure among the estimated regression coefficients. A heuristic method based on permutation tests was introduced by
Chung and Romano (2016), but it fails to yield correct FDR control unless the set of relevant variables is empty. Starting with the p-values from marginal regressions, Fan et al. (2012) proposed a novel way to estimate the false discovery proportion (FDP) when the test statistic follows a multivariate normal distribution. However, the reliance on marginal regression deviates from the original purpose of the study in many cases.
Barber and Candès (2015) introduced an innovative approach, the knockoff filter, which provides provable FDR control for arbitrary design matrices when $n \geq 2p$. When $n < 2p$, they kept some predictors unchanged and constructed knockoffs only for the remaining variables, so that the FDR is still controlled, at a sacrifice of power. By introducing knockoffs, they obtained test statistics that satisfy (i) the symmetry property under the nulls, and (ii) independent signs under the nulls. The key to this method is to construct knockoff variables that preserve the correlation structure of the original predictors. To gain power, however, one wants the knockoffs to be as dissimilar from the original variables as possible. When the multicollinearity among the predictors is high and dense, there is little room to choose a good knockoff, resulting in a significant decrease in power. As shown in Section 6, when the correlations between the predictors are high and dense, the knockoffs still control the FDR, but at a substantial cost in power.
The knockoff method was later extended to high-dimensional settings (screening + knockoffs; Barber and Candès, 2019) and to the model-X knockoff approach (Candès et al., 2018). In the model-X framework, Candès et al. (2018)
dispensed with the linearity assumption by considering the joint distribution of the response and the predictors. The goal is to find the Markov blanket, a minimal set of predictors such that the response is independent of all the others conditioning on this set. They proposed model-X knockoffs that can control the FDR. However, this construction relies heavily on the assumption that the distribution of the predictors is completely known, which is generally unrealistic except in some special scenarios. Barber et al. (2018) constructed a robust version of model-X knockoffs based on the estimated joint distribution of the predictors. However, estimating this joint distribution in a high-dimensional setting is not only very challenging, even in the multivariate Gaussian case, but can also lead to inaccurate FDR control. For high-dimensional data, another line of work is post-selection inference, which aims at conducting inference conditional on the selection step (Berk et al., 2013; Lee et al., 2016). In Lee et al. (2016), the selection event of the Lasso is shown to be a union of polyhedra. Tibshirani et al. (2016) provide analogous results for forward stepwise regression and least angle regression, and Taylor and Tibshirani (2018) extend these results to penalized likelihood models, including generalized regression models.

In this paper, we propose a method called Gaussian Mirror (GM), which constructs the test statistic locally or marginally. Namely, for each variable $x_j$, we create a pair of “mirror variables,” $x_j^+ = x_j + c_j z_j$ and $x_j^- = x_j - c_j z_j$, where $c_j$ is a scalar and $z_j$ is a Gaussian random vector. The pair can be viewed as a special and quantifiable way of perturbing the variable. The perturbation is carefully chosen according to an explicit formula for $c_j$ such that the test statistic we introduce has a symmetric distribution around zero for the null variables. In the high-dimensional case, we prove the symmetry property based on post-selection theory. Based on this property, we can quantitatively estimate the number of false positives and the FDP. To control the FDR, a data-driven threshold is chosen such that the estimated FDP is no larger than the designated level.
By assuming weak dependence conditions on the test statistics, we show that the proposed method controls the FDR at the designated level asymptotically. Adding perturbations not only helps weaken the dependence among the test statistics, but also leads to conclusions that are stable under perturbations, as advocated in the stability framework of Yu (2013).
The GM procedure computes the mirror statistics by perturbing only one variable at a time. This simple perturbation enjoys two advantages. First, compared with the knockoffs and model-X knockoffs, which double the size of the design matrix, we introduce the smallest perturbation to the original design matrix, which can improve the power, as shown in our numerical studies. Second, the computations of the mirror statistics for different variables are completely separable, easily parallelizable, and memory efficient. All of our numerical studies have been implemented on a parallel computing architecture.
The construction of the GM statistics and the selection procedure do not require any distributional assumption on the design matrix. In contrast to methods based on model-X knockoffs, the GM design bypasses the difficulty of estimating the joint distribution of the predictors and is thus more flexible in real applications such as genome-wide association studies (GWAS), where the covariates are single nucleotide polymorphisms (SNPs) taking values in $\{0, 1, 2\}$.
In addition, practitioners often have a limited budget for verifying the discoveries. For example, if the budget only allows the scientist to run 100 validation tests, a natural question is how many false discoveries (FDs) to expect in a top-100 list, and whether statisticians can provide an uncertainty measure for the estimated number of FDs. To address this practical problem, we provide an estimate of the number of FDs in a given list based on the proposed mirror statistics, and use the nonparametric bootstrap to construct a confidence interval for this estimate.
The paper is organized as follows. In Section 2, we present the general GM idea and a GM algorithm for ordinary least squares (OLS) estimation in low-dimensional cases ($n > p$). In Section 3, we employ the post-selection adjustment strategy and extend the GM construction to the Lasso, which handles cases with $p > n$ with ease. In Section 4, we develop theoretical properties of the GM procedure. In Section 5, we introduce an estimator of the number of false discoveries and build its confidence interval using the nonparametric bootstrap. In Section 6, we provide numerical evidence showing the advantages of GM over its competitors via simulations and real-data analysis. The code is publicly available at https://github.com/BioAlgs/GM.
Notation: For readability, we summarize the notation used frequently in this paper. Let $X$ be the design matrix. Without loss of generality, we assume that $X$ is normalized so that each column has unit $\ell_2$ norm. Let $x_j$ be the $j$th column of $X$ and let $X_{-j}$ be the submatrix of $X$ with the $j$th column removed. Denote by $X^{(j)}$ the design matrix in which $x_j$ is replaced by the mirror pair $(x_j + c_j z_j, x_j - c_j z_j)$, where $z_j$ is a Gaussian random vector and $c_j$ is a scalar. Let $\beta$ be the coefficient vector and let $\beta_{-j}$ be the subvector with the $j$th entry removed. Let $\beta_j^+$ and $\beta_j^-$ denote the coefficients of the mirror variable pair, and let $(\hat\beta_j^+, \hat\beta_j^-)$ be the corresponding estimates. Let $q$ be the designated FDR level to be controlled.

2 The Gaussian Mirror Methodology for OLS
2.1 The Gaussian mirror
Suppose the data are generated from the following linear regression model
$y = X\beta + \epsilon,$  (1)
where $y \in \mathbb{R}^n$ is the response vector, $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ is the design matrix, and $\epsilon$ is a vector of i.i.d. Gaussian white noise with variance $\sigma^2$. We begin with the low-dimensional case with $n > p$ and focus on the OLS estimator. The high-dimensional setting with $p > n$ is deferred to the next section. As shown in Figure 1, we construct the $j$th Gaussian mirror by replacing $x_j$ with a pair of variables $(x_j^+, x_j^-)$, with $x_j^+ = x_j + c_j z_j$ and $x_j^- = x_j - c_j z_j$, where $z_j$ is an independently simulated Gaussian random vector with mean $0$ and covariance $I_n$, and $c_j$ is a scalar depending on $x_j$ and $z_j$.

Clearly, at the population level we have $\beta_j^+ = \beta_j^- = 0$ if $\beta_j = 0$, i.e., for variables in the null set; and $\beta_j^+ = \beta_j^- = \beta_j/2 \neq 0$ if $\beta_j \neq 0$. Let $X^{(j)}$ be the design matrix with the $j$th Gaussian mirror, i.e.,
$X^{(j)} = (x_1, \ldots, x_{j-1}, x_j^+, x_j^-, x_{j+1}, \ldots, x_p).$  (2)
Then we can rewrite the model in (1) as $y = X^{(j)} \beta^{(j)} + \epsilon$, where $\beta^{(j)}$ replaces $\beta_j$ by the pair $(\beta_j^+, \beta_j^-)$. We estimate the regression coefficients by least squares,
$\hat\beta^{(j)} = \arg\min_{\beta^{(j)}} \; \| y - X^{(j)} \beta^{(j)} \|_2^2.$  (3)
The resulting $\hat\beta_j^+$ and $\hat\beta_j^-$ are both unbiased estimates. We construct the mirror statistic for the $j$th variable as

$M_j = |\hat\beta_j^+ + \hat\beta_j^-| - |\hat\beta_j^+ - \hat\beta_j^-|.$  (4)
The mirror statistic has two parts: the first part, $|\hat\beta_j^+ + \hat\beta_j^-|$, reflects the importance of the $j$th predictor; the second part, $|\hat\beta_j^+ - \hat\beta_j^-|$, captures the noise cancellation effect. Intuitively, for a relevant variable $x_j$, the coefficients $\hat\beta_j^+$ and $\hat\beta_j^-$ tend to be close to each other, so that the noise effect cancels out, which results in a relatively large value of $M_j$. Based on this observation, we regard variable $x_j$ as “significant” when $M_j > t$ for some threshold $t > 0$. If $\beta_j = 0$, we choose $c_j$ appropriately such that $M_j$ follows a symmetric distribution with respect to zero, i.e., $P(M_j > t) = P(M_j < -t)$ for any $t > 0$. Consequently, the number of null variables with $M_j > t$ approximately equals the number with $M_j < -t$.
In practice, we do not know the underlying set of null variables. Thus, we use $\#\{j : M_j < -t\}$ as an estimate of the number of false positives, which can be shown to be a slight overestimate under certain conditions, and use
$\widehat{\mathrm{FDP}}(t) = \dfrac{\#\{j : M_j < -t\}}{\#\{j : M_j > t\} \vee 1}$  (5)
as an estimate of the FDP. For any designated FDR level $q$, this leads to a natural choice of the cutoff $\tau_q$ such that
$\tau_q = \min\{ t > 0 : \widehat{\mathrm{FDP}}(t) \leq q \}.$  (6)
Based on the data-driven threshold $\tau_q$ in (6), we select the set of variables $\hat{S} = \{ j : M_j > \tau_q \}$.
The key to achieving a reasonable estimate of the FDP is to construct the Gaussian mirror so as to guarantee the symmetry of $M_j$ when $\beta_j = 0$, which, as shown in the next subsection, can be achieved by carefully choosing the scalar $c_j$. We first discuss GM constructions for the OLS estimates (requiring $n > p$), and then extend the results to the Lasso estimates for high-dimensional cases.
2.2 Gaussian mirrors for the OLS estimator
The $j$th GM design ($j = 1, \ldots, p$) for the OLS estimates leads to the following quantity:
(7) 
which has an explicit expression. It is known that, conditional on $X$ and $z_j$, the pair $(\hat\beta_j^+, \hat\beta_j^-)$ in (3) follows a bivariate normal distribution with mean zero under the null. The following result provides a sufficient condition for the mirror statistic to be symmetrically distributed when $\beta_j = 0$.
Proposition 1.
Suppose $u$ and $v$ are two random variables following a bivariate normal distribution with mean zero. If the correlation between $u$ and $v$ is zero, then $M = |u + v| - |u - v|$ follows a symmetric distribution about zero, i.e., $P(M > t) = P(M < -t)$ for any $t > 0$.
Proof.
Since the correlation between $u$ and $v$ is zero, $u$ and $v$ are independent. Furthermore, combining this with the fact that $v$ and $-v$ have the same distribution, the joint distribution of $(u, v)$ is identical to that of $(u, -v)$. Thus, $|u + v| - |u - v|$ and $|u - v| - |u + v|$ follow the same distribution, i.e., $P(M > t) = P(M < -t)$ for any $t > 0$. ∎
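Proposition 1 is easy to check empirically. The following small simulation (our own sketch, not from the paper) draws uncorrelated mean-zero normal pairs with unequal variances and verifies that the statistic $|u + v| - |u - v|$ has matching upper and lower tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# u, v: uncorrelated mean-zero bivariate normal (here independent, unequal variances).
u = rng.normal(0.0, 1.0, size=200_000)
v = rng.normal(0.0, 2.0, size=200_000)
M = np.abs(u + v) - np.abs(u - v)

# Symmetry about zero: P(M > t) should match P(M < -t) for any t > 0.
for t in (0.5, 1.0, 2.0):
    p_right = np.mean(M > t)
    p_left = np.mean(M < -t)
    assert abs(p_right - p_left) < 0.01  # agree up to Monte Carlo error
```

Note that $u + v$ and $u - v$ are themselves correlated when the variances differ; the symmetry of $M$ nonetheless holds, exactly as the proof of Proposition 1 argues.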
The following lemma provides an explicit construction of $c_j$ such that the correlation between $\hat\beta_j^+$ and $\hat\beta_j^-$ is zero, resulting in a symmetric distribution of $M_j$ for null variables.
Lemma 1.
For the design in (2), we can choose
$c_j = \dfrac{\|(I_n - P_{-j}) x_j\|_2}{\|(I_n - P_{-j}) z_j\|_2},$  (8)

where $P_{-j}$ denotes the projection onto the column space of $X_{-j}$, so that the correlation between $\hat\beta_j^+$ and $\hat\beta_j^-$ given $X$ and $z_j$ is zero.
Proof.
Conditioning on $X$ and $z_j$, the covariance matrix of $\hat\beta^{(j)}$ is
(9) 
where the blocks are formed from the Gram matrix of $X^{(j)}$. By inverting the block matrix in (9), we obtain the covariance of $(\hat\beta_j^+, \hat\beta_j^-)$ given $X$ and $z_j$ as
(10) 
where
A simple calculation shows that if we choose $c_j$ such that the off-diagonal entry vanishes, which is equivalent to (8),
then the covariance between $\hat\beta_j^+$ and $\hat\beta_j^-$ is zero. We note that the numerator and denominator of $c_j$ in (8) are the lengths of the projections of $x_j$ and $z_j$, respectively, onto the orthogonal complement of the column space of $X_{-j}$. ∎
Consequently, to construct the GM design for the $j$th variable, we first generate the random vector $z_j$ from $N(0, I_n)$, and then choose $c_j$ as in (8) so that the covariance between $\hat\beta_j^+$ and $\hat\beta_j^-$ is zero. We summarize the construction of the Gaussian mirror as follows.
Definition 1.
(Gaussian Mirror for OLS) For $j = 1, \ldots, p$, one first generates an $n$-dimensional Gaussian random vector $z_j$ from $N(0, I_n)$, and then computes $c_j$ based on equation (8). The $j$th GM design is $X^{(j)} = (x_1, \ldots, x_{j-1}, x_j + c_j z_j, x_j - c_j z_j, x_{j+1}, \ldots, x_p)$.
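The construction of Definition 1 can be sketched in a few lines of numpy, assuming $c_j$ is the ratio of the projection lengths of $x_j$ and $z_j$ onto the orthogonal complement of the remaining columns, as described after (8). The function name `gaussian_mirror_design` is our illustrative choice.

```python
import numpy as np

def gaussian_mirror_design(X, j, rng):
    """Replace column j of X by the mirror pair (x_j + c_j z_j, x_j - c_j z_j).

    c_j is the ratio of the lengths of x_j and z_j after projecting both onto
    the orthogonal complement of the span of the remaining columns.
    """
    n, p = X.shape
    xj = X[:, j]
    X_rest = np.delete(X, j, axis=1)
    zj = rng.normal(size=n)

    # Residual-maker (I - P_{-j}), applied via least squares for stability.
    def residual(v):
        coef, *_ = np.linalg.lstsq(X_rest, v, rcond=None)
        return v - X_rest @ coef

    cj = np.linalg.norm(residual(xj)) / np.linalg.norm(residual(zj))
    Xj = np.column_stack([X_rest[:, :j], xj + cj * zj, xj - cj * zj, X_rest[:, j:]])
    return Xj, cj
```

One can check numerically that with this choice of $c_j$ the entry of $(X^{(j)\top} X^{(j)})^{-1}$ linking the two mirror columns vanishes, which is exactly the zero-covariance property of Lemma 1.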
As argued above, the $c_j$ defined in Definition 1 guarantees, by Lemma 1, that the covariance between $\hat\beta_j^+$ and $\hat\beta_j^-$ is zero. We then construct the mirror statistic $M_j$ as in (4). The following theorem is a direct consequence of Proposition 1.

Theorem 1. If $\beta_j = 0$, then conditional on $X$ and $z_j$, the mirror statistic $M_j$ constructed from the GM design in Definition 1 follows a symmetric distribution about zero, i.e., $P(M_j > t) = P(M_j < -t)$ for any $t > 0$.

Algorithm 1 (Gaussian Mirror for OLS).
Parallel FOR $j = 1, \ldots, p$:
  Construct the $j$th GM design $X^{(j)}$ as in Definition 1.
  Obtain the ordinary least squares estimates of $\beta_j^+$ and $\beta_j^-$:
  $\hat\beta^{(j)} = \arg\min_{\beta^{(j)}} \|y - X^{(j)}\beta^{(j)}\|_2^2.$
  Calculate the mirror statistic $M_j = |\hat\beta_j^+ + \hat\beta_j^-| - |\hat\beta_j^+ - \hat\beta_j^-|$.
END parallel FOR loop
For a designated FDR level $q$, calculate the cutoff $\tau_q$ as in (6).
Output the indices of the selected variables: $\hat{S} = \{j : M_j > \tau_q\}$.
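Putting the pieces together, a self-contained numpy sketch of the full procedure for the OLS case might look as follows. It assumes the mirror statistic $M_j = |\hat\beta_j^+ + \hat\beta_j^-| - |\hat\beta_j^+ - \hat\beta_j^-|$ and the threshold rule of (6); in practice the loop over $j$ would be run in parallel.

```python
import numpy as np

def gm_ols_select(X, y, q, rng):
    """One pass of the GM procedure for OLS (n > p).

    The loop over j is embarrassingly parallel: each mirror design,
    refit, and mirror statistic is computed independently.
    """
    n, p = X.shape
    M = np.empty(p)
    for j in range(p):
        X_rest = np.delete(X, j, axis=1)
        z = rng.normal(size=n)
        # Residuals of x_j and z after projecting off the other columns.
        r_x = X[:, j] - X_rest @ np.linalg.lstsq(X_rest, X[:, j], rcond=None)[0]
        r_z = z - X_rest @ np.linalg.lstsq(X_rest, z, rcond=None)[0]
        c = np.linalg.norm(r_x) / np.linalg.norm(r_z)
        # Mirror design: (x_j + c z, x_j - c z) replacing x_j (column order is immaterial for OLS).
        Xj = np.column_stack([X[:, j] + c * z, X[:, j] - c * z, X_rest])
        beta = np.linalg.lstsq(Xj, y, rcond=None)[0]
        bp, bm = beta[0], beta[1]
        M[j] = abs(bp + bm) - abs(bp - bm)  # assumed mirror-statistic form
    # Data-driven cutoff: smallest t with estimated FDP <= q.
    tau = np.inf
    for t in np.sort(np.abs(M)):
        if np.sum(M < -t) / max(np.sum(M > t), 1) <= q:
            tau = t
            break
    return np.where(M > tau)[0], M
```

On simulated data with a few strong signals, the signal variables receive large positive mirror statistics while the null statistics scatter symmetrically around zero.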
The GM design is a special type of data perturbation method. Perturbation has been widely used in statistics to ensure stability and quantify uncertainty (Yu and Kumbier, 2019; Yu, 2013). For example, the jackknife and the bootstrap (Efron, 1992) are efficient data perturbation methods with wide applications in statistical inference. In this work, we aim to control the FDR based on a new type of data perturbation. The perturbation in the GM design can reduce the correlations between the mirrored predictor and the other predictors, likely improving the robustness and power of variable selection. Theorem 1 guarantees that the distribution of $M_j$ is symmetric with respect to zero for the null variables. Therefore, for any threshold $t > 0$, if we select the variables with mirror statistics $M_j > t$, a natural estimate of the FDP is given by (5). For a designated FDR level $q$, a data-driven threshold can be obtained as in (6). Note that the GM construction enables independent calculation of the $M_j$'s for $j = 1, \ldots, p$, and is easily parallelizable. Algorithm 1 can thus be implemented on a parallel computing architecture, which significantly reduces the computational time. In Section 4, we will show that the data-driven choice of the threshold in Algorithm 1 guarantees that the FDR is controlled asymptotically under some weak dependence assumptions on the mirror statistics.
3 Gaussian Mirrors for High Dimensional Regression
In high-dimensional settings with the number of features $p$ greater than the number of observations $n$, one can still create mirror variables $x_j^+$ and $x_j^-$ for each $j$, and construct the mirror statistics naturally using the Lasso estimator in (11). Although this simple extension appears to work quite well empirically, its theoretical justification is challenging for the following reasons: (i) the specific design of $c_j$ as in the OLS case is no longer available; (ii) the Lasso estimator is biased because of the shrinkage induced by the $\ell_1$ penalty, which implies that $\hat\beta_j^+$ and $\hat\beta_j^-$ need not be close for variables in the falsely discovered set; and (iii) the selection also forces a linear constraint on $y$, and since the Lasso estimator is a nonlinear function of $y$, the corresponding constraint on the Lasso estimator is nonlinear, leading to a complicated distribution. We therefore propose a post-selection strategy to design the GM and construct the mirror statistics such that each $M_j$ satisfies the symmetry property when $\beta_j = 0$. A consistent estimate of the FDR can then be derived via (5).
3.1 Literature on highdimensional inference
To enable proper statistical inference in high-dimensional settings involving variable selection, Zhang and Zhang (2014) and Van de Geer et al. (2014) propose the desparsified Lasso; Wasserman and Roeder (2009) advocate data splitting (Cox, 1975) to avoid dealing with the complex constraints induced by variable selection; and Berk et al. (2013) suggest the post-selection adjustment framework, which aims to provide valid statistical inference conditional on the data-driven variable selection result. In data splitting, the data are divided into two parts, with one half used to select the model and the other half used to conduct inference. For post-selection adjustments (Berk et al., 2013), Lee et al. (2016) propose a procedure that first selects variables using the Lasso and then obtains the OLS estimates for the selected model. By characterizing the selection event as a series of linear constraints on the post-selection OLS estimates, they provide valid post-selection inference for certain linear combinations of the coefficients of the selected variables.
Our goal is to control the number of falsely selected variables based on the mirror statistics, which requires that the mirror statistics be symmetrically distributed for null variables. To characterize the distribution of the mirror statistics on the null variable set, we consider the post-selection procedure based on the Lasso. More specifically, similar to Lee et al. (2016), we first use the Lasso to select the model, then refit the model with OLS and construct the mirror statistics in the same way as in the low-dimensional case. In the following, we explain how to adjust the so-constructed mirror statistics to account for the fact that the fitted model is conditional on the selection event.
3.2 Postselection adjustment for Gaussian mirrors
Recall that the Lasso solves the minimization problem
$\hat\beta^{\lambda} = \arg\min_{\beta} \; \tfrac{1}{2} \| y - X\beta \|_2^2 + \lambda \|\beta\|_1.$  (11)
Denote by $\hat{A}$ and $\hat{s}$ the active set and the corresponding signs of the Lasso estimator. By Lemma 4.1 in Lee et al. (2016), the event $\{\hat{A} = A, \hat{s} = s\}$ can be rewritten as a set of constraints on $y$ as follows:
(12) 
where the “inactive” constraints determine the selected set, and the “active” constraints determine the signs of the nonzero coefficients. The expressions for these matrices are
(13)  
where $P_A$ is the projection matrix onto the column space of $X_A$.
We now develop the mirror statistics conditional on the event $\{\hat{A} = A, \hat{s} = s\}$, and focus on the design matrix $X_A$. We first define the Gaussian mirror for the $j$th variable in $A$. Denote by $X_{A \setminus j}$ the submatrix of $X_A$ with its $j$th column removed. Then we choose $c_j$ as
(14) 
where the perturbation vector is obtained by first projecting $z_j$, generated from $N(0, I_n)$, onto the orthogonal complement of the column space of $X_{A \setminus j}$; this is the main difference from (8) in the OLS case. Based on this $c_j$, we define the Gaussian mirror design as follows.
Definition 2.
(Post-selection Gaussian Mirror) Given the post-selection set $\hat{A}$ and the corresponding design matrix $X_{\hat{A}}$, for each $j \in \hat{A}$, we generate an $n$-dimensional random vector $z_j$ from $N(0, I_n)$ and then compute $c_j$ based on equation (14). The $j$th GM design replaces $x_j$ in $X_{\hat{A}}$ by the mirror pair $(x_j + c_j z_j^{\perp}, x_j - c_j z_j^{\perp})$, where $z_j^{\perp}$ is the projection of $z_j$ onto the orthogonal complement of the column space of $X_{\hat{A} \setminus j}$.
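A numpy sketch of the post-selection mirror pair, assuming (as described above) that the perturbation is a fresh Gaussian vector projected onto the orthogonal complement of the other selected columns, with $c_j$ again a ratio of projection lengths; the helper name is ours.

```python
import numpy as np

def postselection_mirror_pair(X_A, j, rng):
    """Mirror pair for the j-th selected column of X_A.

    The perturbation z_perp is a fresh Gaussian vector projected onto the
    orthogonal complement of the span of the other selected columns
    (assumed construction); c_j is again a ratio of projection lengths.
    """
    n = X_A.shape[0]
    xj = X_A[:, j]
    X_rest = np.delete(X_A, j, axis=1)

    def residual(v):
        # v minus its projection onto the column space of X_rest
        coef, *_ = np.linalg.lstsq(X_rest, v, rcond=None)
        return v - X_rest @ coef

    z_perp = residual(rng.normal(size=n))
    cj = np.linalg.norm(residual(xj)) / np.linalg.norm(z_perp)
    return xj + cj * z_perp, xj - cj * z_perp
```

Because the perturbation lies in the orthogonal complement of the other selected columns, adding it leaves the fit of those columns untouched, which is what makes the post-selection analysis tractable.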
Let $\hat\beta_j^+ = e_1^\top \hat\beta^{(j)}$ and $\hat\beta_j^- = e_2^\top \hat\beta^{(j)}$, where $e_k$ denotes the standard basis vector with the $k$th entry equal to one and the others zero. That is, $\hat\beta_j^+$ and $\hat\beta_j^-$ are the first two coordinates of the OLS estimate obtained by regressing $y$ on the post-selection GM design:
(15) 
Before deriving the joint post-selection distribution of $\hat\beta_j^+$ and $\hat\beta_j^-$, we first characterize the linear constraints on them resulting from the post-selection event.
Lemma 2.
In equation (17) of Lemma 2, we show that the two constraint matrices are both of rank one. It is easy to verify that the component of the estimate along the “inactive” direction is uncorrelated with, and hence independent of, the component along the “active” direction; the constraint in (13) is thus equivalent to applying the “inactive” constraints along one direction and the “active” constraints along the other. As shown in Figure 2(a), the constraint regions for $\hat\beta_j^+$ and $\hat\beta_j^-$ lie along two lines. By rotating the coordinate system by $45^\circ$, we obtain the joint distribution shown in Figure 2(b), where the constraint regions are parallel to the two axes. We characterize the constraints imposed by the selection event in the following Lemma 3.
Lemma 3.
By the normality of the estimates and (16), the two rotated coordinates are uncorrelated and hence independent. Since each is independent of the truncation limits induced by the selection event, those limits behave as “fixed” quantities in the conditional distribution. Thus $\hat\beta_j^+ + \hat\beta_j^-$ and $\hat\beta_j^+ - \hat\beta_j^-$ behave like two independent truncated normal random variables. We next use the probability integral transform to construct uniformly distributed quantities from them.

Theorem 2.
Let $F_{[a, b]}(x)$ denote the CDF of a random variable truncated to the interval $[a, b]$, that is,
$F_{[a, b]}(x) = \dfrac{F(x) - F(a)}{F(b) - F(a)}, \qquad x \in [a, b],$  (19)
where $F$ is the CDF of the untruncated random variable. Then, for the truncated normal quantities above, we have
(20)  
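The truncated CDF in (19) is straightforward to implement. The sketch below uses the standard normal as the base distribution and illustrates the probability integral transform: applying $F_{[a,b]}$ to draws conditioned to lie in $[a, b]$ yields approximately Uniform(0, 1) values.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_cdf(x, a, b, cdf=norm_cdf):
    """CDF of a variable truncated to [a, b]: (F(x) - F(a)) / (F(b) - F(a))."""
    return (cdf(x) - cdf(a)) / (cdf(b) - cdf(a))

# Probability integral transform: for draws conditioned to lie in [a, b],
# F_{[a,b]}(draw) is approximately Uniform(0, 1).
rng = np.random.default_rng(0)
a, b = -0.5, 2.0
draws = rng.normal(size=50_000)
trunc = draws[(draws >= a) & (draws <= b)]
u = np.array([truncated_cdf(x, a, b) for x in trunc])
```

This uniformity is exactly what makes the transformed quantities convenient test statistics in the post-selection setting.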