# Controlling False Discovery Rate Using Gaussian Mirrors

Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem with a long history. We propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian random variable, and then proceeds with a chosen regression method, such as ordinary least squares or the Lasso. The mirror variables naturally lead to a test statistic that is highly effective for controlling the FDR. Under a weak dependence assumption, we show that the FDR can be controlled asymptotically at a user-specified level. The GM method is shown to be more powerful than many existing methods in selecting important variables subject to FDR control, especially when the covariates are highly correlated.


## 1 Introduction

Linear regression, which dates back to the beginning of the 19th century, is one of the most important statistical tools for practitioners. Theoretical research addressing various issues arising from big-data analyses has gained much attention in the last decade. One important problem is to determine which covariates (aka "predictors") are "useful" or "important" in a linear regression. In the early days (before the 1970s), practitioners often relied on the t-test to assess the importance of each individual predictor in a regression model, although the method is known to be problematic in the presence of high multicollinearity. A greedy stepwise regression method was later proposed in Efroymson (1960) to alleviate some of these flaws. Good criteria for directing its operation, such as the Akaike information criterion (AIC, Akaike (1998)) and the Bayesian information criterion (BIC, Schwarz (1978)), were developed later. In recent years, owing to advances in data-generating technologies in both science and industry, researchers discovered that various regularization-based regression methods, such as the Lasso (Tibshirani (1996)), SCAD (Fan and Li (2001)), the elastic net (Zou and Hastie (2005)), and many others, are quite effective in dealing with high-dimensional data and selecting relevant covariates with certain guaranteed theoretical properties, such as selection consistency and the oracle property.

For the linear regression model $y = X\beta + \epsilon$, we are interested in testing the hypotheses $H_{0j}: \beta_j = 0$, $j = 1, \dots, p$, simultaneously, and in finding a statistical rule to decide which hypotheses should be rejected. Let $S_0$ be the set of variables with $\beta_j = 0$, i.e., not in the true model, and let $S_1 = S_0^c$. Let $\hat{S}_1$ be the set of variables selected by a statistical rule. The FDR, quantifying the type I error of this rule, is defined as

$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}], \quad \text{where } \ \mathrm{FDP} = \frac{\#\{i \mid i \in S_0,\ i \in \hat{S}_1\}}{\#\{i \mid i \in \hat{S}_1\} \vee 1}.$$
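The FDP of any selected set follows directly from this definition; a minimal sketch (the sets in the toy example are hypothetical, not taken from the paper's experiments):

```python
def false_discovery_proportion(selected, null_set):
    """FDP = #(selected nulls) / max(#selected, 1)."""
    selected, null_set = set(selected), set(null_set)
    return len(selected & null_set) / max(len(selected), 1)

# Toy example: variables 1-4 are selected while the true nulls are 3-6,
# so two of the four selections are false discoveries.
fdp = false_discovery_proportion({1, 2, 3, 4}, {3, 4, 5, 6})
```

The FDR is then the expectation of this quantity over repeated experiments; in practice $S_0$ is unknown, which is exactly what the mirror construction below addresses.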

The task of controlling the FDR at a designated level, say $q$, is challenging in two respects: (i) the test statistic and its distribution under the null hypothesis are not easily available for high-dimensional problems; and (ii) the dependence structure of the covariates induces a complex dependence structure among the estimated regression coefficients. A heuristic method based on permutation tests was introduced by Chung and Romano (2016), but it fails to yield correct FDR control unless $S_1$ is empty. Starting with the p-values based on marginal regression, Fan et al. (2012) proposed a novel way to estimate the false discovery proportion (FDP) when the test statistic follows a multivariate normal distribution. However, the use of marginal regression deviates from the original purpose of the study in many cases.

Barber and Candès (2015) introduced an innovative approach, the knockoff filter, which provides provable FDR control for arbitrary design matrices when $n \ge 2p$. When $n < 2p$, they kept some predictors unchanged and constructed knockoffs only for the remaining variables. It is guaranteed that the FDR can be controlled, with some sacrifice of power. By introducing knockoffs, they obtained a test statistic that satisfies (i) the symmetry property under the nulls, and (ii) independent signs under the nulls. The key of this method is to construct knockoff variables $\tilde{X}$ such that $(X, \tilde{X})$ preserves the correlation structure of the original $X$. To gain power, however, one wants to construct knockoffs such that $\tilde{x}_j$ and $x_j$ are as dissimilar as possible. When the multicollinearity among the predictors is high and dense, there is little room to choose a good knockoff, resulting in a significant decrease in power. As shown in Section 6, when the correlations among the predictors are high and dense, the knockoffs can still control the FDR, but at the expense of a substantial power loss.

Later, the knockoff method was extended to high dimensions (screening + knockoffs, Barber and Candès (2019)) and to the model-X knockoff approach (Candes et al. (2018)). In the model-X framework, Candes et al. (2018) discarded the linearity assumption by considering the joint distribution of the response $y$ and the predictors $X$. The goal is to find the Markov blanket, a minimal set $S$ of variables such that $y$ is independent of all the other variables conditioning on $X_S$. They proposed model-X knockoffs that can control the FDR. However, this construction relies heavily on the assumption that the distribution of $X$ is completely known, which is generally unrealistic except in some special scenarios. Barber et al. (2018) constructed a robust version of model-X knockoffs based on the estimated joint distribution of $X$. However, estimating this joint distribution in a high-dimensional setting is not only very challenging even in the multivariate Gaussian case, but also leads to inaccurate FDR control. For high-dimensional data, another line of work is post-selection inference, which aims at conducting inference conditional on the selection step (Berk et al. (2013); Lee et al. (2016)). In Lee et al. (2016), the selection event of the Lasso is shown to be a union of polyhedra. Tibshirani et al. (2016) provide analogous results for forward stepwise regression and least angle regression, and Taylor and Tibshirani (2018) extend these results to penalized likelihood models including generalized regression models.

In this paper, we propose a method called Gaussian Mirror (GM), which constructs the test statistic locally, or marginally. Namely, for each variable $x_j$, we create a pair of "mirror variables," $x_j^+ = x_j + c_j z_j$ and $x_j^- = x_j - c_j z_j$, where $c_j$ is a scalar and $z_j \sim N(0, I_n)$. The pair can be viewed as a special and quantifiable way of perturbing the variable $x_j$. The perturbation is carefully chosen according to an explicit formula for $c_j$ such that the test statistic we introduce has a symmetric distribution around zero for the null variables. In the high-dimensional case, we prove the symmetry property based on post-selection theory. Based on this property, we can quantitatively estimate the number of false positives and the FDP. To control the FDR, a data-driven threshold is chosen such that the estimated FDP is no larger than a designated level $q$. By assuming weak dependence conditions on the test statistics, we show that the proposed method controls the FDR at level $q$ asymptotically. Adding perturbations not only helps weaken the dependence among the test statistics, but also leads to conclusions that are stable under perturbations, as advocated in the stability framework of Yu (2013).

The GM procedure calculates the mirror statistics by perturbing only one variable at a time. This simple perturbation enjoys two advantages. First, compared with the knockoffs and model-X knockoffs, which double the size of the design matrix, we introduce the smallest perturbation to the original design matrix, which can improve the power, as shown in our numerical studies. Second, the algorithms for calculating the mirror statistics $M_j$, $j = 1, \dots, p$, are completely separable, easily parallelizable, and memory efficient. All of our numerical studies have been implemented using a parallel computation architecture.

The construction of the GM statistics and the selection procedure do not require any distributional assumption on the design matrix $X$. In contrast to the model-X knockoff-based methods, the GM design bypasses the difficulty of estimating the joint distribution of $X$, and is thus more flexible in real applications such as GWAS studies, where the covariates are single nucleotide polymorphisms (SNPs) taking values in $\{0, 1, 2\}$.

In addition, practitioners often have a limited budget to verify the discoveries. For example, if the budget only allows the scientist to conduct 100 validation tests, a natural question is how many false discoveries (FDs) they should expect in a top-100 list, and whether statisticians can provide an uncertainty measure for the estimated number of FDs. To address this practical problem, we provide an estimate of the number of FDs in a given list based on the proposed mirror statistics, and use the nonparametric bootstrap to construct a confidence interval for the proposed estimate.

The paper is organized as follows. In Section 2, we propose the general GM idea and a GM algorithm for ordinary least squares (OLS) estimation in low-dimensional cases ($n > p$). In Section 3, we employ the post-selection adjustment strategy and extend the GM construction to the Lasso, which handles cases with $p \gg n$ with ease. In Section 4, we develop theoretical properties of the GM procedure. In Section 5, we introduce an estimator of the number of false discoveries and build its confidence interval using the nonparametric bootstrap. In Section 6, we provide numerical evidence showing the advantages of GM over its competitors via simulations and real-data analysis. The code is publicly available at https://github.com/BioAlgs/GM.

Notation: To ease readability, we summarize notations used frequently in this paper. Let $X = (x_1, \dots, x_p) \in \mathbb{R}^{n \times p}$ be the design matrix. Without loss of generality, we assume that $X$ is normalized so that $\|x_j\|_2 = 1$ for $j = 1, \dots, p$. Let $x_j$ be the $j$-th column of $X$ and let $X_{-j}$ be the submatrix of $X$ with the $j$-th column removed. Denote $X^j = (x_j^+, x_j^-, X_{-j})$, where $x_j^+ = x_j + c_j z_j$, $x_j^- = x_j - c_j z_j$, $z_j \sim N(0, I_n)$, and $c_j$ is a scalar. Let $\beta = (\beta_1, \dots, \beta_p)^\top$ be the coefficient vector and let $\beta_{-j}$ be the subvector with the $j$-th entry removed. Let $\beta^j = (\beta_j^+, \beta_j^-, \beta_{-j}^\top)^\top$, where $\beta_j^+$ and $\beta_j^-$ denote the coefficients of the mirror variable pair, and let $\hat{\beta}^j$ be the corresponding estimator. Let $S_1 = \{j : \beta_j \neq 0\}$ and $S_0 = S_1^c$. Let $q$ be the designated FDR level to be controlled.

## 2 The Gaussian Mirror Methodology for OLS

### 2.1 The Gaussian mirror

Suppose the data are generated from the linear regression model

$$y = X\beta + \epsilon, \tag{1}$$

where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix, and $\epsilon$ is a vector of i.i.d. Gaussian white noise with variance $\sigma^2$. We begin with the low-dimensional case with $n > p$ and focus on the OLS estimator. The high-dimensional setting with $p \gg n$ is deferred to the next section. As shown in Figure 1, we construct the $j$-th Gaussian Mirror by replacing $x_j$ with a pair of variables $(x_j^+, x_j^-)$, with $x_j^+ = x_j + c_j z_j$ and $x_j^- = x_j - c_j z_j$, where $z_j$ is an independently simulated Gaussian random vector with mean $0$ and covariance $I_n$, and $c_j$ is a scalar that depends on $x_j$, $z_j$, and $X_{-j}$.

Clearly, at the population level we have $\beta_j^+ = \beta_j^- = 0$ if $j \in S_0$, i.e., for variables in the null set, and $\beta_j^+ = \beta_j^- = \beta_j / 2$ if $j \in S_1$. Let $X^j$ be the design matrix with the $j$-th Gaussian mirror, i.e.,

$$\mathrm{GM}_j := X^j = (x_j^+, x_j^-, X_{-j}) = (x_j + c_j z_j,\ x_j - c_j z_j,\ X_{-j}). \tag{2}$$

Then we rewrite the model in (1) as $y = X^j \beta^j + \epsilon$. We estimate the regression coefficients by least squares:

$$\hat{\beta}^j = \mathop{\arg\min}_{\beta^j = (\beta_j^+,\, \beta_j^-,\, \beta_{-j})} \|y - X^j \beta^j\|_2^2. \tag{3}$$

Both $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ are unbiased estimates. We construct the mirror statistic for the $j$-th variable as

$$M_j = |\hat{\beta}_j^+ + \hat{\beta}_j^-| - |\hat{\beta}_j^+ - \hat{\beta}_j^-|. \tag{4}$$

The mirror statistic has two parts: the first part reflects the importance of the $j$-th predictor, and the second part captures the noise cancellation effect. Intuitively, for a relevant variable $j$, the coefficients $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ tend to be close to each other so as to cancel out the noise effect, which results in a relatively large value of $M_j$. Based on this observation, we regard variable $j$ as "significant" when $M_j > t$ for some threshold $t$. If $j \in S_0$, we choose $c_j$ appropriately such that $M_j$ follows a symmetric distribution with respect to zero, i.e., $P(M_j < -t) = P(M_j > t)$, $\forall t > 0$. Consequently, the number of false positives $\#\{j \in S_0 \mid M_j \ge t\}$ approximately equals $\#\{j \in S_0 \mid M_j \le -t\}$.

In practice, we do not know the underlying null set $S_0$. Thus, we use $\#\{j \mid M_j \le -t\}$ as an estimate of the number of false positives, which can be shown to be a slight over-estimate under certain conditions, and use

$$\widehat{\mathrm{FDP}}(t) \triangleq \frac{\#\{j \mid M_j \le -t\}}{\#\{j \mid M_j \ge t\} \vee 1} \tag{5}$$

as an estimate of the FDP. For any designated FDR level $q$, this leads to a natural choice of the cutoff $\tau_q$ such that

$$\tau_q = \min_t \{t : \widehat{\mathrm{FDP}}(t) \le q\}. \tag{6}$$

Based on the data-driven threshold $\tau_q$ in (6), we select the set of variables $\hat{S}_1 = \{j : M_j \ge \tau_q\}$.
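The estimate (5) and the cutoff rule (6) can be sketched in a few lines, assuming the mirror statistics `M` have already been computed; the candidate thresholds are taken to be the absolute values of the statistics themselves:

```python
import numpy as np

def fdp_hat(M, t):
    """Estimated FDP at threshold t, as in equation (5)."""
    return np.sum(M <= -t) / max(np.sum(M >= t), 1)

def select(M, q):
    """Smallest cutoff tau_q with estimated FDP <= q, as in equation (6);
    returns (tau_q, indices of the selected variables)."""
    M = np.asarray(M, dtype=float)
    candidates = np.sort(np.abs(M[M != 0]))
    for t in candidates:
        if fdp_hat(M, t) <= q:
            return t, np.where(M >= t)[0]
    return np.inf, np.array([], dtype=int)

tau, selected = select(np.array([3.0, 2.5, -0.2, 0.1, 1.8, -1.9]), q=0.2)
```

Scanning only the observed $|M_j|$ values suffices because $\widehat{\mathrm{FDP}}(t)$ is piecewise constant between them.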

The key to achieving a reasonable estimate of the FDP is to construct the Gaussian mirror so as to guarantee the symmetry property of $M_j$ when $j \in S_0$, which, as shown in the next subsection, can be achieved by carefully choosing the scalar $c_j$. We first discuss GM constructions for the OLS estimates (requiring $n > p$), and then extend the results to the Lasso estimates for high-dimensional cases.

### 2.2 Gaussian mirrors for the OLS estimator

The $j$-th GM design ($j = 1, \dots, p$) for the OLS estimates leads to the following quantity:

$$\hat{\beta}^j := (\hat{\beta}_j^+, \hat{\beta}_j^-, \hat{\beta}_{-j}) = \mathop{\arg\min}_{\beta^j = (\beta_j^+,\, \beta_j^-,\, \beta_{-j})} \|y - X^j \beta^j\|_2^2, \tag{7}$$

which has the explicit expression $\hat{\beta}^j = ((X^j)^\top X^j)^{-1} (X^j)^\top y$. It is known that, for $j \in S_0$, $(\hat{\beta}_j^+, \hat{\beta}_j^-)$ follows a bivariate normal distribution with mean zero conditional on $X^j$. The following result provides a sufficient condition for the mirror statistic $M_j$ to be symmetrically distributed when $j \in S_0$.

###### Proposition 1.

Suppose $u$ and $v$ are two random variables following a bivariate normal distribution with mean zero. If the correlation between $u$ and $v$ is zero, then $M = |u + v| - |u - v|$ follows a symmetric distribution about zero, i.e., $P(M < -t) = P(M > t)$, $\forall t > 0$.

###### Proof.

Since the correlation between $u$ and $v$ is zero, $u$ and $v$ are independent. Furthermore, combining this with the fact that $v \overset{d}{=} -v$, the joint distribution of $(u, v)$ is identical to that of $(u, -v)$. Thus, $M = |u + v| - |u - v|$ and $-M = |u - v| - |u + v|$ follow the same distribution, i.e., $P(M < -t) = P(M > t)$, $\forall t > 0$. ∎
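Proposition 1 is easy to check numerically: for independent mean-zero normals $u$ and $v$ (equal variances are not required), the statistic $|u+v| - |u-v|$ should be symmetric about zero. A quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, 100_000)   # mean-zero, independent of v
v = rng.normal(0.0, 2.0, 100_000)   # a different variance is fine
M = np.abs(u + v) - np.abs(u - v)

# Symmetry: P(M < -t) should match P(M > t) for any t > 0.
for t in (0.5, 1.0, 2.0):
    assert abs(np.mean(M < -t) - np.mean(M > t)) < 0.01
```

The identity $|u+v| - |u-v| = 2\,\mathrm{sign}(uv)\min(|u|, |v|)$ makes the symmetry transparent: the sign factor is $\pm 1$ with equal probability, independently of the magnitude.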

The following lemma provides an explicit construction of $c_j$ such that the correlation between $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ is zero, resulting in a symmetric distribution of $M_j$ for $j \in S_0$.

###### Lemma 1.

Let

$$c_j = \sqrt{\frac{x_j^\top \left(I_n - X_{-j} (X_{-j}^\top X_{-j})^{-1} X_{-j}^\top\right) x_j}{z_j^\top \left(I_n - X_{-j} (X_{-j}^\top X_{-j})^{-1} X_{-j}^\top\right) z_j}}, \tag{8}$$

so that the correlation between $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ given $z_j$ and $X$ is zero.

###### Proof.

Let $P_{-j} = X_{-j} (X_{-j}^\top X_{-j})^{-1} X_{-j}^\top$ denote the projection onto the column space of $X_{-j}$, and write $\tilde{x}_j = (I_n - P_{-j}) x_j$ and $\tilde{z}_j = (I_n - P_{-j}) z_j$ for the residuals of $x_j$ and $z_j$ after regressing out $X_{-j}$. Conditional on $X^j$, the covariance matrix of the OLS estimate is

$$\mathrm{Cov}(\hat{\beta}^j \mid X^j) = \sigma^2 \left((X^j)^\top X^j\right)^{-1}. \tag{9}$$

By the block-matrix inverse of (9) (equivalently, the Frisch–Waugh–Lovell theorem), the covariance between $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ is proportional to the negative inner product of the residualized mirror variables:

$$\mathrm{Cov}(\hat{\beta}_j^+, \hat{\beta}_j^- \mid X^j) \;\propto\; -(\tilde{x}_j + c_j \tilde{z}_j)^\top (\tilde{x}_j - c_j \tilde{z}_j) = -\left(\tilde{x}_j^\top \tilde{x}_j - c_j^2\, \tilde{z}_j^\top \tilde{z}_j\right), \tag{10}$$

since the cross terms $\pm c_j \tilde{x}_j^\top \tilde{z}_j$ cancel. The covariance is therefore zero if and only if

$$c_j^2\, z_j^\top \left(I_n - X_{-j} (X_{-j}^\top X_{-j})^{-1} X_{-j}^\top\right) z_j = x_j^\top \left(I_n - X_{-j} (X_{-j}^\top X_{-j})^{-1} X_{-j}^\top\right) x_j,$$

which is satisfied by the choice of $c_j$ in (8). We note that the numerator and denominator of $c_j^2$ in (8) are the squared lengths of the projections of $x_j$ and $z_j$ onto the orthogonal complement of the column space of $X_{-j}$, respectively. ∎

Consequently, to construct $\mathrm{GM}_j$ for $j = 1, \dots, p$, we first generate the random vector $z_j$ from $N(0, I_n)$, and then choose $c_j$ as in (8) such that the covariance between $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ is zero. We summarize the construction of the Gaussian mirror as follows.

###### Definition 1.

(Gaussian Mirror for OLS) For $j = 1, \dots, p$, one first generates an $n$-dimensional Gaussian random vector $z_j$ from $N(0, I_n)$, and then computes $c_j$ based on equation (8). The $j$-th GM is designed as $X^j = (x_j + c_j z_j,\ x_j - c_j z_j,\ X_{-j})$.
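A sketch of the full construction in Definition 1 for one variable, assuming $n > p$: generate $z_j$, compute $c_j$ by projecting $x_j$ and $z_j$ onto the orthogonal complement of the span of $X_{-j}$ as in (8), then read the mirror statistic (4) off the OLS fit on the mirrored design:

```python
import numpy as np

def mirror_stat_ols(X, y, j, rng):
    """Mirror statistic M_j for the OLS Gaussian mirror (Definition 1)."""
    n, p = X.shape
    X_mj = np.delete(X, j, axis=1)          # X_{-j}
    z = rng.standard_normal(n)              # z_j ~ N(0, I_n)
    # Residuals after projecting out col(X_{-j}), via a QR factorization:
    Q, _ = np.linalg.qr(X_mj)
    rx = X[:, j] - Q @ (Q.T @ X[:, j])      # (I - P_{-j}) x_j
    rz = z - Q @ (Q.T @ z)                  # (I - P_{-j}) z_j
    c = np.sqrt((X[:, j] @ rx) / (z @ rz))  # equation (8)
    Xj = np.column_stack([X[:, j] + c * z, X[:, j] - c * z, X_mj])
    beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    return np.abs(beta[0] + beta[1]) - np.abs(beta[0] - beta[1])
```

Applying `mirror_stat_ols` to each $j$ yields the vector of $M_j$'s, which is then fed to the thresholding rule (5)-(6).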

As we have argued, the $c_j$ defined in Definition 1 guarantees, by Lemma 1, that the covariance between $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ is zero. We further construct the mirror statistic $M_j$ as in (4). The following theorem is a direct consequence of Proposition 1.

###### Theorem 1.

Let $M_j$ be the test statistic defined in (4) based on $X^j$ in Definition 1. Then

$$P(M_j < -t \mid z_j) = P(M_j > t \mid z_j), \quad \forall t > 0,$$

for $j \in S_0$.

The GM design is a special type of data perturbation method. Perturbation has been widely used in statistics to ensure stability and quantify uncertainty (Yu and Kumbier, 2019; Yu, 2013). For example, the jackknife and the bootstrap (Efron, 1992) are efficient data perturbation methods with wide applications in statistical inference. In this work, we aim to control the FDR based on a new type of data perturbation. The perturbation in the GM design can reduce correlations between the mirrored predictor and the other predictors, likely improving the robustness and power of variable selection. Theorem 1 guarantees that the distribution of $M_j$ is symmetric with respect to zero for the null variables. Therefore, for any threshold $t$, if we select the variables with mirror statistics $M_j \ge t$, a natural estimate of the FDP is given by (5). For a designated FDR level $q$, a data-driven threshold $\tau_q$ can be obtained as in (6). Note that the GM construction enables independent calculations of the $M_j$'s for $j = 1, \dots, p$, and is easily parallelizable. Algorithm 1 can thus be implemented on a parallel computing architecture, which significantly reduces the computational time. In Section 4, we show that the data-driven choice of $\tau_q$ in Algorithm 1 guarantees that the FDR is controlled asymptotically under some weak dependence assumptions on the mirror statistics.
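Since each $M_j$ is computed from its own design $X^j$ and its own draw $z_j$, the per-variable computations share no state and can be dispatched to a worker pool. A minimal sketch using the standard library; the worker below is a hypothetical stand-in for any routine that returns one mirror statistic:

```python
from concurrent.futures import ThreadPoolExecutor

def one_mirror(job):
    """Hypothetical stand-in: replace the body with the real M_j computation."""
    j, seed = job
    return j, float(seed % 10)  # placeholder value, deterministic per job

def all_mirrors(p, base_seed=2024):
    """Compute all p mirror statistics concurrently, one RNG seed per variable."""
    jobs = [(j, base_seed + j) for j in range(p)]
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(one_mirror, jobs))
```

Assigning each variable its own seed keeps the $z_j$ streams independent and the run reproducible; a process-based pool works the same way when the per-variable linear algebra dominates.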

## 3 Gaussian Mirrors for High Dimensional Regression

In high-dimensional settings, with the number of features $p$ greater than the number of observations $n$, one can still create mirror variables $x_j^+$ and $x_j^-$ for $j = 1, \dots, p$, and construct the mirror statistics naturally using the Lasso estimator in (11). Although this simple extension appears to work quite well empirically, its theoretical justification is challenging for the following reasons: (i) the specific design of $c_j$ as in the OLS case is no longer available; (ii) the Lasso estimator is biased because of the regularization resulting from the $\ell_1$ penalty, which implies that $\hat{\beta}_j^+ \neq \beta_j^+$ and $\hat{\beta}_j^- \neq \beta_j^-$ for $j$ in the falsely discovered set; (iii) the penalization also forces a constraint on $y$ in the selection procedure, and since the Lasso estimator is a nonlinear function of $y$, the constraint on the Lasso estimator is nonlinear, leading to a complicated distribution. We therefore propose a post-selection strategy to design the GM and construct the mirror statistics such that each $M_j$ satisfies the symmetry property when $j \in S_0$. A consistent estimate of the FDP can then be derived via (5).

### 3.1 Literature on high-dimensional inference

To enable proper statistical inference in high-dimensional settings involving variable selection, Zhang and Zhang (2014) and Van de Geer et al. (2014) propose the de-sparsified Lasso; Wasserman and Roeder (2009) advocate using data splitting (Cox, 1975) to avoid dealing with the complex constraints induced by variable selection; and Berk et al. (2013) suggest the post-selection adjustment framework, which aims to provide valid statistical inference conditional on the data-driven variable selection result. In data splitting, the data is divided into two parts, with one half used to select the model and the other half used to conduct inference. Within the post-selection adjustment framework (Berk et al., 2013), Lee et al. (2016) propose a procedure that first selects variables using the Lasso and then obtains the OLS estimates for the selected model. By characterizing the selection event as a series of linear constraints on the post-selection OLS estimates, they provide valid post-selection inference on certain linear combinations of the coefficients of the selected variables.

Our goal is to control the number of falsely selected variables based on mirror statistics, which requires that the mirror statistics be symmetrically distributed for null variables. To characterize the distribution of the mirror statistics on the null variable set, we consider a post-selection procedure based on the Lasso. More specifically, similar to Lee et al. (2016), we first use the Lasso to select the model and then re-fit the model with OLS and construct the mirror statistics in the same way as in the low-dimensional case. In the following, we explain how to adjust the mirror statistics so constructed to accommodate the fact that the fitted model is conditional on the selection event.

### 3.2 Post-selection adjustment for Gaussian mirrors

Recall that the Lasso solves the minimization problem

$$\tilde{\beta} = \mathop{\arg\min}_{\beta} \|y - X\beta\|_2^2 + \lambda_n \|\beta\|_1. \tag{11}$$

Denote by $\hat{S}$ and $\hat{s}$ the active set and the corresponding signs based on the Lasso estimator. By Lemma 4.1 in Lee et al. (2016), the event $\{\hat{S} = S, \hat{s} = s\}$ can be rewritten as a series of constraints on $y$ as follows:

$$\{\hat{S} = S,\ \hat{s} = s\} = \left\{ \begin{pmatrix} A_0(S, s) \\ A_1(S, s) \end{pmatrix} y \le \begin{pmatrix} b_0(S, s) \\ b_1(S, s) \end{pmatrix} \right\}, \tag{12}$$

where $(A_0(S,s), b_0(S,s))$ encodes the "inactive" constraints, which determine the selection set, and $(A_1(S,s), b_1(S,s))$ encodes the "active" constraints, which determine the signs of the nonzero coefficients. The expressions of these matrices are

$$A_0(S, s) = \frac{1}{\lambda} \begin{pmatrix} X_{-S}^\top (I - P_S) \\ -X_{-S}^\top (I - P_S) \end{pmatrix}, \qquad b_0(S, s) = \begin{pmatrix} \mathbf{1} - X_{-S}^\top (X_S^\top)^{+} s \\ \mathbf{1} + X_{-S}^\top (X_S^\top)^{+} s \end{pmatrix}, \tag{13}$$

$$A_1(S, s) = -\mathrm{diag}(s)\, (X_S^\top X_S)^{-1} X_S^\top, \qquad b_1(S, s) = -\lambda\, \mathrm{diag}(s)\, (X_S^\top X_S)^{-1} s,$$

where $P_S$ is the projection matrix onto the column space of $X_S$.
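The constraint matrices in (13) translate directly into code; a sketch assuming `S` indexes the Lasso active set and `s` holds its signs, following Lemma 4.1 of Lee et al. (2016) (the function name is ours, not from the paper):

```python
import numpy as np

def lasso_polyhedron(X, S, s, lam):
    """Matrices (A0, b0, A1, b1) of (13) encoding {S_hat = S, s_hat = s}."""
    XS, XmS = X[:, S], np.delete(X, S, axis=1)
    G_inv = np.linalg.inv(XS.T @ XS)
    P_S = XS @ G_inv @ XS.T                  # projection onto col(X_S)
    R = XmS.T @ (np.eye(X.shape[0]) - P_S)
    A0 = np.vstack([R, -R]) / lam            # "inactive" constraints
    w = XmS.T @ XS @ G_inv @ s               # X_{-S}^T (X_S^T)^+ s
    b0 = np.concatenate([1.0 - w, 1.0 + w])
    A1 = -np.diag(s) @ G_inv @ XS.T          # "active" constraints
    b1 = -lam * np.diag(s) @ G_inv @ s
    return A0, b0, A1, b1
```

A useful sanity check is that the inactive rows annihilate the selected columns, since $(I - P_S) X_S = 0$.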

We now develop the mirror statistics conditional on the event $\{\hat{S} = S, \hat{s} = s\}$, and focus on the design matrix $X_S$. We first define the Gaussian mirrors for the $j$-th variable in $S$. Denote by $X_{S \setminus j}$ the submatrix of $X_S$ obtained by removing its $j$-th column. Then we choose $c_j$ as

$$c_j = \sqrt{\frac{x_j^\top \left(I_n - P_{S \setminus j}\right) x_j}{(z_j^{o})^\top \left(I_n - P_{S \setminus j}\right) z_j^{o}}}, \tag{14}$$

where $P_{S \setminus j}$ is the projection matrix onto the column space of $X_{S \setminus j}$, and $z_j^{o} = (I_n - P_{S \setminus j}) z_j$ with $z_j$ generated from $N(0, I_n)$. Compared with (8) in the OLS case, $z_j^{o}$ is constructed by first projecting $z_j$ onto the orthogonal space of $X_{S \setminus j}$. Based on $c_j$, we define the Gaussian mirror design as follows.

###### Definition 2.

(Post-selection Gaussian Mirror) Given the post-selection set $S$ and the corresponding design matrix $X_S$, for $j \in S$, we generate an $n$-dimensional random vector $z_j$ from $N(0, I_n)$ and then compute $c_j$ based on equation (14). The $j$-th GM is designed as $X_S^j = (x_j + c_j z_j^{o},\ x_j - c_j z_j^{o},\ X_{S \setminus j})$.

Let $\eta_1 = X_S^j ((X_S^j)^\top X_S^j)^{-1} e_1$ and $\eta_2 = X_S^j ((X_S^j)^\top X_S^j)^{-1} e_2$, where $e_i$ is the standard basis vector with the $i$-th entry equal to one and the others zero, for $i = 1, 2$. Then $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ are the first two coordinates of the OLS estimate obtained by regressing $y$ on $X_S^j$. That is,

$$\hat{\beta}_j^+ = \eta_1^\top y \quad \text{and} \quad \hat{\beta}_j^- = \eta_2^\top y. \tag{15}$$
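The identity (15) is easy to verify numerically: hitting $y$ with $\eta_1$ and $\eta_2$ reproduces the first two coordinates of the OLS fit on the mirrored design. A sketch with simulated data (the design here is a generic stand-in, not a real post-selection matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
Xj = rng.standard_normal((n, 4))   # stand-in for (x_j^+, x_j^-, X_{S\j})
y = rng.standard_normal(n)

G_inv = np.linalg.inv(Xj.T @ Xj)
eta1 = Xj @ G_inv[:, 0]            # eta_1 = X_S^j ((X_S^j)^T X_S^j)^{-1} e_1
eta2 = Xj @ G_inv[:, 1]
beta = G_inv @ Xj.T @ y            # full OLS estimate on the mirrored design

assert np.isclose(eta1 @ y, beta[0])   # beta_j^+ = eta_1^T y, as in (15)
assert np.isclose(eta2 @ y, beta[1])   # beta_j^- = eta_2^T y
```

This works because $\eta_i^\top y = e_i^\top ((X_S^j)^\top X_S^j)^{-1} (X_S^j)^\top y$, i.e., the $i$-th coordinate of the OLS solution.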

Before deriving the joint post-selection distribution of $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$, we first characterize the linear constraints on $y$ resulting from the post-selection event $\{\hat{S} = S, \hat{s} = s\}$.

###### Lemma 2.

Let $\eta = (\eta_1, \eta_2)$. We have

$$(\eta_1 + \eta_2)^\top (\eta_1 - \eta_2) = 0, \qquad \eta_1^\top \eta_1 = \eta_2^\top \eta_2 = \alpha. \tag{16}$$

Let $A_0(S, s)$ and $A_1(S, s)$ be the matrices defined in (13). Then there exist vectors $a_0$ and $a_1$ such that

$$A_0(S, s)\, \eta = a_0 (1, -1) \quad \text{and} \quad A_1(S, s)\, \eta = a_1 (1, 1). \tag{17}$$

By equation (17) in Lemma 2, $A_0(S,s)\eta$ and $A_1(S,s)\eta$ are both of rank one. Write $P_\eta = \eta (\eta^\top \eta)^{-1} \eta^\top$, and let $r = (I_n - P_\eta) y$; it is easy to verify that $r$ is uncorrelated with $(\hat{\beta}_j^+, \hat{\beta}_j^-)$, and hence independent of it. Then the constraints in (12) involving $A_0$ are equivalent to

$$A_0(S,s)\, \eta (\eta^\top \eta)^{-1} \eta^\top y + A_0(S,s)\, r = \alpha\, a_0\, (\hat{\beta}_j^+ - \hat{\beta}_j^-) + A_0(S,s)\, r \le b_0(S,s),$$

i.e., the "inactive" constraints on $y$ act along the direction of $\hat{\beta}_j^+ - \hat{\beta}_j^-$. Similarly, $A_1(S,s)\, P_\eta\, y = \alpha\, a_1\, (\hat{\beta}_j^+ + \hat{\beta}_j^-)$; that is, the "active" constraints act along the direction of $\hat{\beta}_j^+ + \hat{\beta}_j^-$. As shown in Figure 2 (a), the constraint regions for $\hat{\beta}_j^+$ and $\hat{\beta}_j^-$ lie along lines with slopes $1$ and $-1$. By rotating the coordinate system by $45°$, we obtain the joint distribution of $\hat{\beta}_j^+ + \hat{\beta}_j^-$ and $\hat{\beta}_j^+ - \hat{\beta}_j^-$ shown in Figure 2 (b), where the constraint regions are parallel to the coordinate axes. We characterize the constraints provided by the selection event in the following Lemma 3.

###### Lemma 3.

The selection event can be written as follows:

$$\{\hat{S} = S,\ \hat{s} = s\} = \left\{ V^{L_1}(r) \le \hat{\beta}_j^+ + \hat{\beta}_j^- \le V^{U_1}(r),\ V^{N_1}(r) > 0 \right\} \cap \left\{ V^{L_0}(r) \le \hat{\beta}_j^+ - \hat{\beta}_j^- \le V^{U_0}(r),\ V^{N_0}(r) > 0 \right\},$$

where

$$V^{L_0}(r) := \max_{j:\, a_{0j} < 0} \frac{b_{0j} - (A_0 r)_j}{\alpha\, a_{0j}}, \quad V^{U_0}(r) := \min_{j:\, a_{0j} > 0} \frac{b_{0j} - (A_0 r)_j}{\alpha\, a_{0j}}, \quad V^{N_0}(r) := \min_{j:\, a_{0j} = 0} \left(b_{0j} - (A_0 r)_j\right), \tag{18}$$

$$V^{L_1}(r) := \max_{j:\, a_{1j} < 0} \frac{b_{1j} - (A_1 r)_j}{\alpha\, a_{1j}}, \quad V^{U_1}(r) := \min_{j:\, a_{1j} > 0} \frac{b_{1j} - (A_1 r)_j}{\alpha\, a_{1j}}, \quad V^{N_1}(r) := \min_{j:\, a_{1j} = 0} \left(b_{1j} - (A_1 r)_j\right),$$

where $a_{0j}$ and $a_{1j}$ are the $j$-th elements of $a_0$ and $a_1$ in (17), and $b_{0j}$ and $b_{1j}$ are the $j$-th elements of $b_0$ and $b_1$ in (13), respectively.

By the normality of $y$ and (16), $\hat{\beta}_j^+ + \hat{\beta}_j^-$ and $\hat{\beta}_j^+ - \hat{\beta}_j^-$ are uncorrelated and hence independent. Since $r$ is independent of $(\hat{\beta}_j^+, \hat{\beta}_j^-)$, it behaves as a "fixed" quantity in the distribution of $\hat{\beta}_j^+ + \hat{\beta}_j^-$ and $\hat{\beta}_j^+ - \hat{\beta}_j^-$ conditioning on $\{\hat{S} = S, \hat{s} = s\}$. Thus $\hat{\beta}_j^+ + \hat{\beta}_j^-$ and $\hat{\beta}_j^+ - \hat{\beta}_j^-$ behave like two independent truncated normal random variables. We next use the probability integral transformation to construct uniform random variables related to $\hat{\beta}_j^+ + \hat{\beta}_j^-$ and $\hat{\beta}_j^+ - \hat{\beta}_j^-$.

###### Theorem 2.

Let $F_{\mu, \sigma}^{[a, b]}$ denote the CDF of a $N(\mu, \sigma^2)$ random variable truncated to the interval $[a, b]$, that is,

$$F_{\mu, \sigma}^{[a, b]}(x) = \frac{\Phi((x - \mu)/\sigma) - \Phi((a - \mu)/\sigma)}{\Phi((b - \mu)/\sigma) - \Phi((a - \mu)/\sigma)}, \tag{19}$$

where $\Phi$ is the CDF of a standard normal random variable. Then for $j \in S \cap S_0$, we have

$$F_{0, \sigma^2}^{[V^{L_1}(r),\, V^{U_1}(r)]}(\hat{\beta}_j^+ + \hat{\beta}_j^-) \mid \{\hat{S} = S, \hat{s} = s\} \sim \mathrm{Unif}(0, 1), \tag{20}$$

$$F_{0, \sigma^2}^{[V^{L_0}(r),\, V^{U_0}(r)]}(\hat{\beta}_j^+ - \hat{\beta}_j^-) \mid \{\hat{S} = S, \hat{s} = s\} \sim \mathrm{Unif}(0, 1).$$