1 Introduction
Variable selection is a cornerstone of modern high-dimensional statistics and, more generally, of data-driven scientific discovery. Examples include selecting a few genes correlated with the incidence of a certain disease, or discovering a number of demographic attributes correlated with crime rates.
A fruitful theoretical framework to study this question is the linear regression model in which we observe
independent copies of the pair such that where
is an unknown vector of coefficients, and
is a noise random variable. Throughout this work we assume that
for some known covariance matrix . Note that for notational simplicity our linear regression model is multiplied by compared to standard scaling in high-dimensional linear regression [BicRitTsy09]. Clearly, this scaling, also employed in [javanmard2014hypothesis], has no effect on our results. In this work, we consider asymptotics where is fixed. In this model, a variable selection procedure is a sequence of test statistics
for each of the hypothesis testing problem(1) 
When
is large, a simultaneous control of all the type I errors leads to overly conservative procedures that impede the detection of statistically significant variables and, ultimately, scientific discovery. The False Discovery Rate (FDR) is a less conservative alternative to global type I error. The FDR of a procedure
is the expected proportion of erroneously rejected tests. Formally, (2) 
Since its introduction more than two decades ago, various procedures have been developed to provably control this quantity under various assumptions. Central among these is the Benjamini-Hochberg procedure, which is guaranteed to lead to a desired FDR control under the assumption that the design matrix formed by the concatenation of the column vectors is deterministic and orthogonal [benjamini1995controlling, storey2004strong].
In the presence of correlation between the variables, that is, when the design matrix fails to be orthogonal, the problem becomes much more difficult. Indeed, if the variables and are highly correlated, any standard procedure will tend to output a similar coefficient for both or, in the case of the Lasso for example, simply choose one of the two variables rather than both.
Recently, the knockoff filter of Barber and Candès [barber2015controlling, candes2018panning] has emerged as a competitive alternative to the Benjamini-Hochberg procedure for FDR control in the presence of correlated variables, and has demonstrated great empirical success [katsevich2017, sesia2019]. The terminology “knockoffs” refers to a vector that is easy to mistake for the original vector but is crucially independent of given . Formally, is a knockoff of if (i) is independent of given and (ii) for any , it holds that
(3) 
where denotes equality in distribution and is the vector with th coordinate given by
In words, for any vector , the operator swaps each coordinate in with the coordinate and leaves the other coordinates unchanged. We call a knockoff mechanism
any family of probability distributions
over such that is a knockoff of . Since the knockoff is constructed independently of , it serves as a benchmark to evaluate how much of the coefficient of a certain variable is due to its correlation with and how much of it is due to its correlation with the other variables. With this idea in mind, the knockoff filter is then constructed from the following four steps:

Generate knockoffs. For , given , generate knockoff and form the design matrix where is obtained by concatenating the knockoff vectors.

Collect scores for each variable. Define the dimensional vector (Footnote 1: Regression problems with knockoffs are dimensional rather than dimensional. To keep track of this fact, we use to denote a dimensional vector.)
as the Lasso estimator
(4) where is the response vector, and collect the differences of absolute coefficients between variables and knockoffs into a set , where the 's are any constructed statistics satisfying certain symmetry conditions [barber2015controlling]. A frequent choice is
(5) In this work we replace by the debiased version (see (10) ahead) in the above definition.

Threshold. Given a desired FDR bound , define the threshold
(6) 
Test. For all , answer the hypothesis testing problem (1) with test
This procedure is guaranteed to satisfy [barber2015controlling, Theorem 1] no matter the choice of knockoffs. Clearly, is a valid choice for knockoffs but it will inevitably lead to no discoveries. The ability of a variable selection procedure to discover true positives is captured by its power (or true positive proportion), defined as
Intuitively, to maximize power, knockoffs should be as uncorrelated with as possible while satisfying the exchangeability property (3). Following this principle, various knockoff mechanisms have been proposed in different settings, which typically involve solving an optimization problem to minimize a heuristic notion of correlation [barber2015controlling, candes2018panning, deep2018]. Because of this optimization problem, knockoff mechanisms with analytical expressions are rare, with the exception of the equi-knockoff [barber2015controlling] and metropolized knockoff sampling [bates2019metropolized]. Partly due to this, the theoretical analysis of the power of the knockoff filter has been very limited, even in the Gaussian setting. In the special case where for some diagonal matrix, i.e. when the variables are independent, one can simply take independent of . In this case, the power of the knockoff filter tends to 1 as the signal-to-noise ratio tends to infinity [weinstein2017power]. When predictors are correlated, [fan2019rank] proved a lower bound on the power, where the limiting power as is bounded below in terms of the number
of predictors and extremal eigenvalues of the covariance matrix of the true and knockoff variables. While this lower bound provides a sufficient condition for situations when the power tends to 1, it is loose in certain scenarios. For example, if all predictors are independent except that two of them are almost surely equal, then the minimum eigenvalue of the covariance matrix is zero and yet, experimental results indicate that the FDR and the power of the knockoff filter are almost unchanged.
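The thresholding and testing steps of the knockoff filter, together with the empirical counterparts of FDR and power, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the frequent choice of statistic given by the difference of absolute coefficients, and it assumes that the threshold (6) is the standard "knockoff+" rule of [barber2015controlling], whose exact form is not legible in this copy. The function names and the toy coefficient vector are hypothetical.

```python
import numpy as np

def knockoff_statistics(beta_ext):
    """W_j = |beta_j| - |beta_{j+p}| for a 2p-dimensional coefficient vector."""
    beta_ext = np.asarray(beta_ext, dtype=float)
    p = beta_ext.size // 2
    return np.abs(beta_ext[:p]) - np.abs(beta_ext[p:])

def knockoff_threshold(W, q, offset=1):
    """Assumed knockoff+ rule: smallest t among the |W_j| such that
    (offset + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return float(t)
    return float("inf")  # no threshold qualifies: nothing is selected

def fdp_and_tpp(selected, beta):
    """Empirical false discovery proportion and true positive proportion."""
    nonnull = np.asarray(beta) != 0
    selected = np.asarray(selected, dtype=int)
    fdp = np.sum(~nonnull[selected]) / max(selected.size, 1)
    tpp = np.sum(nonnull[selected]) / max(np.sum(nonnull), 1)
    return float(fdp), float(tpp)

# Toy input: first 6 entries are original variables, last 6 their knockoffs.
beta_ext = np.array([5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
W = knockoff_statistics(beta_ext)          # [5, 4, 3, 3, 2, -1]
T = knockoff_threshold(W, q=0.25)
selected = np.flatnonzero(W >= T)
beta_true = np.array([1, 1, 1, 1, 0, 0])   # hypothetical ground truth
fdp, tpp = fdp_and_tpp(selected, beta_true)
```

In practice the coefficient vector would come from the (debiased) Lasso fit on the concatenated design; it is passed in directly here to keep the sketch free of any particular solver.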
Our contribution. In this paper, we revisit the statistical performance of the knockoff filter and characterize the situations in which the knockoff filter is consistent, that is, when its FDR tends to 0 and its power tends to 1 simultaneously. More specifically, under suitable limit assumptions, we show that the knockoff filter is consistent if and only if the empirical distribution of the diagonal elements of the precision matrix , where denotes the covariance matrix of , converges to a point mass at 0. In turn, we propose an explicit criterion, called effective signal deficiency and defined formally in (11), to practically evaluate consistency or lack thereof. Here the term “signal” refers to the covariance structure of , and the effective signal deficiency essentially quantifies how weak such a signal can be for a knockoff mechanism to remain consistent.
A second contribution is to propose a new knockoff mechanism, called Conditionally Independent Knockoffs (CIK), which possesses both simple analytic expressions and excellent experimental performance. CIK does not exist for all , but we show its existence for tree graphical models or other sufficiently sparse graphs. Note that in practice, the so-called model-X knockoff filter requires the knowledge of , an estimation of which is often prohibitive except when the graph has sparse or tree structures. CIK has simple explicit expressions of the effective signal deficiency for tree models, since the empirical distribution of the diagonals of is the same as that of . We remark that CIK is different from metropolized knockoff sampling studied in [bates2019metropolized] (which originally appeared in [candes2018panning, Section 3.4.1]), even in the case of Gaussian Markov chains. The latter exists for generic distributions and is computationally efficient for Markov chains.
Notation. We write and to denote the all-ones vector. For any vector , let and denote its and norms. Given a vector , we denote by the diagonal matrix whose diagonal elements are given by the entries of and for a matrix , we denote by the vector whose entries are given by the diagonal entries of . For a standard Gaussian random variable and any real number , we denote by , the Gaussian tail probability. Finally, we use the notation to indicate the Loewner order: is positive semidefinite.
2 Existing work
We focus this discussion on the case of Gaussian design . In this case, the exchangeability condition (3) implies that has a covariance matrix of the form
(7) 
As observed in [barber2015controlling], positive semidefiniteness of this matrix is equivalent to
(8) 
for some . As a result, finding a knockoff mechanism consists in finding .
The seminal works [barber2015controlling, candes2018panning] introduce the following knockoff mechanisms:
Equi-knockoffs: The vector is chosen of the form for some . In light of (8) the smallest value possible for is . Assuming the normalization , [candes2018panning] recommend choosing
(9) 
with the goal of minimizing the correlation between and .
SDP-knockoffs: The vector is chosen to solve the following semidefinite program:
ASDP-knockoffs: Assume the normalization . Choose an approximation of (see [candes2018panning]) and solve:
and then solve:
and put .
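To make the Gaussian construction concrete, the sketch below builds the joint covariance of the original and knockoff variables from a vector of diagonal entries, checks positive semidefiniteness numerically, and samples knockoffs from the conditional Gaussian law. Because (7) to (9) are not legible in this copy, the block form of the joint covariance, the condition that twice the covariance dominates the diagonal matrix, and the equicorrelated choice `min(2 * lambda_min, 1)` are taken from [barber2015controlling, candes2018panning] and should be read as assumptions. All function names and the AR(1) test covariance are hypothetical.

```python
import numpy as np

def equi_s(Sigma):
    """Equicorrelated choice s_j = min(2 * lambda_min(Sigma), 1) for all j
    (assumes unit diagonal of Sigma; taken from the knockoff literature)."""
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(2.0 * lam_min, 1.0))

def joint_covariance(Sigma, s):
    """Assumed form of (7): [[Sigma, Sigma - D], [Sigma - D, Sigma]], D = diag(s)."""
    D = np.diag(s)
    return np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])

def sample_gaussian_knockoffs(X, Sigma, s, rng):
    """Draw each knockoff row from the Gaussian conditional law given the row of X:
    mean X - X Sigma^{-1} D, covariance 2D - D Sigma^{-1} D."""
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    cond_mean = X - X @ Sigma_inv_D
    cond_cov = 2.0 * D - D @ Sigma_inv_D
    cond_cov = (cond_cov + cond_cov.T) / 2.0       # symmetrize against round-off
    w, V = np.linalg.eigh(cond_cov)
    root = V * np.sqrt(np.clip(w, 0.0, None))      # root @ root.T == cond_cov (PSD part)
    return cond_mean + rng.standard_normal(X.shape) @ root.T

# Hypothetical example: AR(1) covariance with correlation 0.5 and p = 4.
p, rho = 4, 0.5
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
s = equi_s(Sigma)
G = joint_covariance(Sigma, s)
min_eig_G = np.linalg.eigvalsh(G).min()            # should be >= 0 up to round-off
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=200)
X_knock = sample_gaussian_knockoffs(X, Sigma, s, rng)
```

The eigendecomposition-based square root is used instead of a Cholesky factor because the conditional covariance can be singular at the boundary of the feasible set, which is exactly where the equicorrelated choice may sit.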
We do not discuss other knockoff constructions, such as the exact construction [candes2018panning, Section 3.4.1] and deep knockoffs [deep2018], which mostly target general non-Gaussian distributions.
As alluded to previously, [weinstein2017power] performed a power analysis in the linear (fixed ) regime for , in which case all the above knockoff mechanisms give the same answer of . For a general , [fan2019rank] derived lower bounds on the power in terms of the minimum eigenvalue of the extended covariance matrix (no specific knockoff mechanism is assumed).
3 Overview of the main results
In this paper, we focus on the so-called linear regime where the sampling rate converges to a constant . We allow for general and, for simplicity, rather than using the Lasso estimator defined in (4), we employ a debiased version [ZhaZha14, vandegeer2014, javanmard2014hypothesis]
(10) 
where . To allow for asymptotic results, we consider a sequence where are covariance matrices of size and are vectors of coefficients. Note that we will only consider the cases where or , depending on whether we consider predictors with or without knockoffs.
At first glance, it is unclear whether, for such general sequences, any meaningful statement can be made about the debiased Lasso estimator defined in (10). To overcome this obvious limitation, we consider the asymptotic setting where a standard distributional limit exists in the sense of [javanmard2014hypothesis, Definition 4.1].
Definition 1 (Standard distributional limit).
Assume constant sampling rate . A sequence is said to have a standard distributional limit
with sparsity ,
if
(i) there exist deterministic and , possibly random, such that the empirical measure
converges almost surely weakly to a probability measure on as . Here, is the probability distribution of , where , and and are some random variables independent of . Moreover, we ask that
(ii) as , it holds almost surely that
Note that (i) implies that , and , almost surely. We further impose that equalities are achieved in (ii).
As mentioned in [javanmard2014hypothesis], characterizing instances having a standard distributional limit is highly nontrivial. Yet, at least, the definition is nonempty since it contains the case of standard Gaussian design. Moreover, a nonrigorous replica argument indicates that the standard distributional limit exists as long as a certain functional defined on has a differentiable limit [javanmard2014hypothesis, Replica Method Claim 4.6], which is always satisfied for block diagonal where the empirical distribution of the blocks converges.
We remark that in the sparse regime where , rigorous results, that do not appeal to the replica method, show that the weak convergence of the distribution of is essentially sufficient for the existence of a standard distributional limit ([javanmard2014hypothesis, Theorem 4.5]), although the present paper does not concern that regime.
We now introduce the key criterion to characterize consistency of a knockoff mechanism and more generally of a variable selection procedure.
Definition 2 (Effective signal deficiency).
For a given variable selection procedure, is a function of with the following property: for the class of sequences satisfying suitable distributional limit conditions, vanishing ESD is equivalent to consistency of the test:
When we consider knockoff filters, ESD is frequently expressed in terms of the extended covariance matrix , which is in turn a function of for a given knockoff mechanism. In that setting, the “suitable distributional limit conditions” in the above definition require that the sequence of extended instances has a standard distributional limit.
Note that by definition, ESD is not unique, and our goal is to find simple representations of its equivalence class. ESD is a potentially useful concept in comparing or evaluating different ways of generating knockoff matrices. As an analogy, think of the various notions of convergence of probability measures. A sequence of probability measures may converge in one topology but not in another. Similarly, one may cook up different functionals of the covariance matrix, such as and , which both intuitively characterize some sort of signal deficiency since they tend to be small when the signal gets stronger. However, they are not equivalent, and the second convergence to is stronger in the sense that the first must vanish when the second vanishes. ESD is intended to be the correct notion of “convergence” that characterizes FDR tending to and power tending to .
Of course, by definition it is not obvious that a succinct expression of such an effective signal deficiency exists. Remarkably, we find that the effective signal deficiency can be characterized by the convergence of a certain empirical distribution derived from . The effective signal deficiency for various (old and new) variable selection procedures is as follows:
Lasso: The debiased Lasso [javanmard2014hypothesis] is a popular method for high-dimensional statistical inference. It is implemented by first computing a Lasso estimator
where can be chosen as any fixed positive number independent of . Instead of a direct threshold test on , we first compute an “unbiased version” defined in (10), as in [javanmard2014hypothesis], and apply a threshold to select non-nulls. We show in Theorem 3 that we may choose
where denotes the Lévy-Prokhorov distance, defined for any two measures and over a metric space as
where denotes the neighborhood of . In particular, we have
(11) 
The assumption of the standard distributional limit ensures the weak convergence of the empirical distribution of , and hence the convergence of (11). Hereafter, for any vector , we use the shorthand (abusive) notation
This characterization of ESD is, in fact, tight: is a necessary and sufficient condition for consistency of thresholded Lasso as a variable selection procedure (see Proposition 4).
General knockoff: for a general knockoff construction, including variational formulations such as SDP-knockoffs, it seems hopeless to find simple expressions of ESD in terms of . Nevertheless, if has a standard distributional limit, we can choose where we recall that is the extended precision matrix of .
Equi-knockoff: Specializing the above result to the equi-knockoff case, we see that we can choose , achieved when for any . Note that this is slightly different from the choice (9) prescribed in [barber2015controlling, candes2018panning] where .
CI-knockoff: We introduce a new method for generating the knockoff matrix, called conditional independence knockoff, or CI-knockoff for short. If the Gaussian graphical model associated to is a tree, i.e. if the sparsity pattern of corresponds to the adjacency matrix of a tree, then the conditional independence knockoff always exists and . For example, in the independent case where is diagonal, we get which readily yields consistency.
The last knockoff construction, conditional independence knockoff, appears to be new. It is both analytically simple and empirically competitive. Comparing equi- and CI-knockoffs, the latter is more robust, since having a small fraction of with large does not increase its ESD much. For example, if two predictors are identical, then the ESD for the conditional independence knockoff is almost unchanged, whereas the equi-knockoff fails completely. Compared to previous knockoff constructions, we find that the CI-knockoff usually shows similar or improved performance empirically, while being easier to compute and to manipulate.
4 Baseline: Lasso with oracle threshold
Consider a variable selection algorithm in which the Lasso parameters with absolute values above a threshold are selected, and suppose that the threshold which controls the FDR is given by an oracle. Note that the knockoff filter is based on the Lasso estimator but it must choose the threshold in a data-driven fashion. As a result, the Lasso with oracle threshold provides a strong baseline against which the performance of a given knockoff filter should be compared. Not surprisingly, and also as noted in [fan2019rank], although the knockoff filter has the advantage of controlling FDR, it usually has lower power than the Lasso with oracle threshold. This fact will become more transparent as we determine their ESD.
Theorem 3.
Let be arbitrary and let admit a standard distributional limit, and denote the distributional limit by , where , and and are some random variables independent of . Assume further that where the limit exists almost surely by the standard distributional limit assumption. Consider the algorithm which selects for which , where is defined in (10). Then with the choice of ,
where for any with and as in the definition of the standard distributional limit. In particular, if , then can be bounded in terms of , , and only (independent of ), and hence in the above inequality can be replaced by where .
The above theorem implies that is a sufficient condition for consistency; this is in fact also necessary, as indicated by the following complementary lower bound.
Proposition 4.
(Lower bound). In the previous theorem, assume further that is independent of . Then for any ,
(12) 
where is increasing in , strictly positive as long as .
Combining the above two results, we get the following interpretation. Suppose that the distribution of and the values of are fixed, and suppose that the parameters and in the algorithm are optimally tuned (i.e. minimizing for any given distributions). If , then, remarkably, the variable selection procedure is consistent if and only if is small, as long as is independent of ; other characteristics of the law of need not be known. In other words, we proved that . If , small may not be sufficient for consistency since also depends on through .
5 Results for general knockoff mechanisms
Given , let be the extended covariance matrix for the true predictors and their knockoffs. Let . Consider the procedure of the knockoff filter described in Section 1, with a slight tweak: define , where
and is defined in (4). This modification still fulfills the sufficiency and antisymmetry condition in [barber2015controlling, Section 2.2], so its FDR can still be controlled. This change allows us to perform analysis using results in [javanmard2014hypothesis]. We also assume that the Lasso parameter is an arbitrary number independent of .
Theorem 5.
Let admit a standard distributional limit for a given , and denote the distributional limit by , where , and and are some random variables independent of . Assume further that where the limit exists almost surely under the standard distributional limit assumption. Then the knockoff filter with FDR budget satisfies:
where for any given , , . Further if , then in the above inequality can be replaced by .
Taking in the above theorem implies that is sufficient for consistency; the following result shows the necessity in a representative setting:
Proposition 6.
In the previous theorem, further assume that where () is selected uniformly at random. Then, under a suitable distributional limit assumption, the knockoff filter with FDR budget satisfies:
The “suitable distributional limit assumption” in Proposition 6 postulates a Gaussian limit for the empirical distribution of the pair , which is stronger than the marginal Gaussian limit assumption in Definition 1, but nevertheless supported by the replica heuristics. Moreover, this condition can be rigorously shown for the case of , (least squares) and block diagonal . The assumption that under in Proposition 6 facilitates the proof but we expect that a similar inconsistency result holds for general . The assumption that is selected uniformly at random is a counterpart of the independence of and in Proposition 4.
6 Conditional independence knockoff and ESD
We introduce the conditional independence knockoff, where and are independent conditionally on , for each . This condition implies that
Therefore recalling that are as defined in (7), we get
(13) 
However, such an may violate the positive semidefinite assumption for the joint covariance matrix (examples exist already in the case ). Yet, interestingly, we find that in the case of tree graphical models, this construction always exists. In many practical scenarios, the predictors come from a tree graphical model, and we can estimate the underlying graph using the Chow-Liu algorithm [chow1968approximating].
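The existence question for the conditional independence knockoff can be explored numerically. The sketch below assumes that (13), which is not legible in this copy, reads s_j = 1 / (Sigma^{-1})_{jj}, i.e. the conditional variance of a Gaussian coordinate given all the others; this follows from the conditional independence requirement combined with exchangeability, but it is a reconstruction. Validity is assumed to mean that twice the covariance minus the diagonal matrix of s is positive semidefinite. The function names and both test matrices are hypothetical.

```python
import numpy as np

def ci_knockoff_s(Sigma):
    """Assumed form of (13): s_j = 1 / (Sigma^{-1})_{jj}."""
    Theta = np.linalg.inv(Sigma)
    return 1.0 / np.diag(Theta)

def is_valid_s(Sigma, s, tol=1e-10):
    """Check s >= 0 and 2 * Sigma - diag(s) positive semidefinite (up to tol)."""
    min_eig = np.linalg.eigvalsh(2.0 * Sigma - np.diag(s)).min()
    return bool(np.all(s >= -tol) and min_eig >= -tol)

# Tree example: a Gaussian Markov chain (path graph) with correlation 0.5.
p, rho = 4, 0.5
idx = np.arange(p)
Sigma_chain = rho ** np.abs(idx[:, None] - idx[None, :])
s_chain = ci_knockoff_s(Sigma_chain)          # end nodes 0.75, interior nodes 0.6
chain_ok = is_valid_s(Sigma_chain, s_chain)

# Non-tree example with p = 3: a dense precision matrix (complete graph).
Theta_dense = 0.2 * np.eye(3) + 0.8 * np.ones((3, 3))
Sigma_dense = np.linalg.inv(Theta_dense)
s_dense = ci_knockoff_s(Sigma_dense)
dense_ok = is_valid_s(Sigma_dense, s_dense)   # construction fails here
```

The chain passes because, for a tree, flipping the sign of the off-diagonal entries of the precision matrix is a similarity transformation and so preserves positive definiteness; the dense three-variable example contains an odd cycle and the same argument breaks down.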
Theorem 7.
Either condition in the theorem intuitively imposes that the graph is sparse. In practice, needs to be estimated, which is generally only feasible with some sparse structure (e.g. via graphical lasso).
Assuming the existence of a standard distributional limit and , we have the following results:
Theorem 8.
For tree graphical models, for CI-knockoff.
Theorem 9.
for equi-knockoff if , , .
7 Experimental results
First consider the setting where and the conditional independence graph forms a binary tree. The correlations between adjacent nodes are all equal to . Choose out of indices uniformly at random as the support of , and set for in the support. Generate independent copies of in where .
Figure 1 (left) shows the box plots of the power and FDR for equi-knockoff, ASDP-knockoff, and CI-knockoff, where is defined as in (9) for CI-knockoff. The FDR is controlled at the target in all three cases. The powers are not statistically significantly different, but the rough trend is . We then compare the effective signal deficiency. Note that in the current setting, , and hence , for each , and we always have by definition (11), which cannot reveal any useful information for comparison. To resolve this, we can scale down by a common factor before computing the LP distances, noting that this still yields a valid effective signal deficiency. Lacking a systematic way of choosing such a scaling factor, we choose it heuristically as so that the LP distances for the three algorithms are all “bounded away from and ”. We find that and and their ordering matches the ordering of the powers.
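The effect of the scaling step can be seen in a small computation of the Lévy-Prokhorov distance between an empirical distribution and a point mass at 0. The sketch relies on the standard fact that, for a point mass at 0, this distance equals the smallest epsilon such that the fraction of points with absolute value at least epsilon does not exceed epsilon. Which vector is plugged in (here, a generic vector of nonnegative values standing in for the diagonal of a precision matrix) and the scaling factor of 10 are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

def lp_distance_to_zero(values):
    """Levy-Prokhorov distance between the empirical distribution of `values`
    and a point mass at 0: inf{eps > 0 : fraction(|v_i| >= eps) <= eps}."""
    a = np.sort(np.abs(np.asarray(values, dtype=float)))[::-1]  # descending
    n = a.size
    # Allowing at most k points with |v_i| >= eps requires eps > (k+1)-th largest
    # value and eps >= k / n; minimize the binding constraint over k = 0..n.
    a_ext = np.append(a, 0.0)
    k = np.arange(n + 1)
    return float(np.min(np.maximum(a_ext, k / n)))

# When every entry is at least 1, the distance saturates at 1 and is uninformative.
diag_vals = np.array([1.5, 2.0, 4.0, 1.2])       # hypothetical diagonal entries
d_raw = lp_distance_to_zero(diag_vals)            # equals 1.0
d_scaled = lp_distance_to_zero(diag_vals / 10.0)  # informative after scaling down
```

Since the function only needs the sorted absolute values, it costs one sort per evaluation, so comparing several knockoff constructions on the same covariance is cheap.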
In the previous example, the simplest equi-knockoff is highly competitive. However, this is an artifact of the fact that the data covariance is highly structured (i.e., correlations are all the same). If the correlations fluctuate strongly and, in particular, a small number of node pairs are highly correlated, then the equi-knockoff performs much worse. This is demonstrated in the next example. Consider the setting where forms a Markov chain, in which . In other words, the Gaussian graphical model is a path graph. The correlation between and is , where , are chosen independently. Choose out of indices uniformly at random as the support of , and set for in the support. Generate independent copies of in where .
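A small numerical sketch of this kind of design helps to see why one highly correlated pair hurts the equi-knockoff far more than the CI-knockoff. It builds the covariance of a unit-variance Gaussian Markov chain from its adjacent correlations and compares the two choices of the diagonal vector. The adjacent correlations below are fixed by hand rather than drawn at random, and both formulas (the equicorrelated value `min(2 * lambda_min, 1)` and the conditional-variance value `1 / (Sigma^{-1})_{jj}`) are assumptions about (9) and (13), which are not legible in this copy.

```python
import numpy as np

def chain_covariance(rhos):
    """Covariance of a unit-variance Gaussian Markov chain: the correlation
    between nodes i < j is the product of adjacent correlations rhos[i..j-1]."""
    rhos = np.asarray(rhos, dtype=float)
    p = rhos.size + 1
    Sigma = np.eye(p)
    for i in range(p):
        prod = 1.0
        for j in range(i + 1, p):
            prod *= rhos[j - 1]
            Sigma[i, j] = Sigma[j, i] = prod
    return Sigma

# Hypothetical chain with one nearly collinear pair (nodes 1 and 2).
rhos = [0.3, 0.99, 0.3, 0.3, 0.3]
Sigma = chain_covariance(rhos)
Theta = np.linalg.inv(Sigma)                     # tridiagonal for a path graph

s_equi = min(2.0 * np.linalg.eigvalsh(Sigma).min(), 1.0)  # one value for all j
s_ci = 1.0 / np.diag(Theta)                                # one value per node

print("equi s (all coordinates):", round(s_equi, 4))
print("CI s per coordinate:     ", np.round(s_ci, 4))
```

The equicorrelated value is driven down for every coordinate by the smallest eigenvalue, whereas the conditional-variance value is small only at the two nearly collinear nodes and stays moderate elsewhere.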
Figure 1 (right) shows the box plots of the power and FDR for the knockoff filter with three different knockoff constructions. The target FDR is . Since the correlations are now chosen randomly, with high probability there exist highly correlated nodes, and hence can be very small, in which case the equi-knockoff performs poorly. However, is similar to , with the median of the former slightly higher. To compare the ESD, first scale down by a heuristically chosen factor 100. We find , , and and their ordering matches the ordering of the powers of the three knockoff constructions.
Appendix A Proof of Theorem 3
Suppose that the algorithm selects such that as nulls, for some threshold to be specified later. Let and . For any (independent of ), the asymptotic proportion of false negatives is bounded by
(14) 
almost surely, where the last step follows since the assumption implies that . Taking in (14) yields
almost surely. Note that the above inequality is also reversible: by similar lines of argument we obtain
(15) 
almost surely. Thus if is not a point of mass for , we have
(16) 
almost surely.
The analysis of false positives is similar: assuming that is not a point of mass for , and hence , we also have
(17) 
almost surely. Limiting expressions for the proportions of total positives and total negatives can also be obtained similarly.
We thus obtain the following expressions of the FDR and POWER:
(18)  
(19) 
almost surely.
To see that these quantities tend to 0 and to 1 respectively, note that the probability terms in the above can be further bounded as follows:
In particular, taking shows the desired bounds on and with
Bound on for . Recall from [javanmard2014hypothesis, (37)] that satisfies the equation
(20) 
where , and . Moreover, the proximal operator is defined by
(21) 
Write . By optimality of , we have
which implies that