Suppose that for each of $n$ subjects we have measured a set of $p$ features which we want to use in a linear model to explain a response, and plan to use the Lasso to select relevant features. A desirable procedure controls the (expected) proportion of false variables entering the explanatory model; that is, the ratio between the number of null coefficients estimated as nonzero by the Lasso procedure and the total number of coefficients estimated as nonzero. Conceptually, then, we imagine proceeding along the Lasso path by decreasing the penalty $\lambda$, and need to decide when to stop and reject the hypotheses corresponding to nonzero estimates, in such a way that the false discovery rate (FDR) does not exceed a preset level $q$.
The setup above differs from many classical situations, where we know that the “null statistics”—the statistics for which the associated null hypothesis is true—are (approximately) independent draws from a distribution that is (approximately) known; in such situations, it is relatively easy to estimate the proportion of null hypotheses that are rejected. In stark contrast, in our setting we do not know the sampling distribution of the Lasso estimates, and, to further complicate matters, this distribution generally depends on the entire vector $\beta$ of regression coefficients (this is unlike the distribution of the least-squares estimator, where the estimate of the $j$-th coefficient depends only on $\beta_j$).
Barber and Candès (2015) have proposed a method that utilizes artificial “knockoff” features, and used it to circumvent the difficulties described above. The general idea behind the method is to introduce a set of “fake” variables (i.e., variables that are known to not belong in the model) in such a way that there is some form of exchangeability under swapping a true null variable and its knockoff counterpart. This exchangeability essentially implies that a true null variable and a corresponding knockoff variable are indistinguishable (in the sense that the sufficient statistic is insensitive to swapping the two); hence, given the event that one of the two was selected, each is equally likely to be the selected one. We can then use the number of knockoff selections to estimate the number of true nulls selected.
In the original paper, Barber and Candès (2015) considered a Gaussian linear regression model with a fixed number $p$ of fixed (nonrandom) predictors, and $n \geq p$. Since then, extensions and variations on knockoffs have been proposed, making the method applicable to a wide range of problems. Barber and Candès (2016) considered the high-dimensional ($p > n$) setting, where control over type III errors (sign errors) is also studied. Candès et al. (2016) constructed “probabilistic” model-X knockoffs for a nonparametric setting with random covariates. A group knockoff filter has been proposed by Dai and Barber (2016) for detecting relevant groups of variables in a linear model. The focus in these works is on constructing knockoff variables with the properties necessary to enable control over the false discovery rate. The type II error rate, on the other hand, has been examined less carefully.
Power calculations are in general hard, especially when sophisticated test statistics, such as the Lasso estimates, are used. There is, however, a specific setting that enables us to exactly characterize (in some particular sense) the asymptotic distribution of the Lasso estimator, relying on the theory of Approximate Message Passing (AMP). More specifically, we work in the setup of Bayati and Montanari (2012), which entails an $n \times p$ Gaussian design matrix with $n, p \to \infty$ and $n/p \to \delta$; the regression coefficients are i.i.d. copies from a mixture distribution $\Pi = (1-\epsilon)\delta_0 + \epsilon \Pi^*$, where $\Pi^*$ is the distribution of the nonzero components (i.e., $\Pi^*$ has no point mass at zero; with some abuse of notation, we use $\Pi$ and $\Pi^*$ to refer to either a distribution or a random variable that has the corresponding distribution). We consider a procedure that enters variables into the model according to their order of (first) appearance on the Lasso path. Appealing to the results in Bayati and Montanari (2012), we identify an oracle procedure that, with knowledge of the distribution $\Pi$ (and the noise level $\sigma$), is able to asymptotically attain the maximum true positive rate subject to controlling the FDR. We show that a procedure which uses knockoffs for calibration, and hence a different solution path, is able to strictly control the FDR and at the same time asymptotically achieve power very close to that of the oracle, adaptively in the unknown $\Pi$.
Figure 1 shows the power of (essentially) the level-$q$ knockoff procedure proposed in Section 3 versus that of the level-$q$ oracle procedure, as $q$ varies and for two different distributions $\Pi^*$. The top two plots are theoretical asymptotic predictions, and the bottom plots show simulation counterparts. As implied above, the oracle gets to use the knowledge of $\Pi$ and $\sigma$ in choosing the value of the Lasso penalty parameter $\lambda$, so that the asymptotic false discovery proportion is exactly $q$; by contrast, the knockoff procedure is applied in exactly the same way in both situations, that is, it does not utilize any knowledge of $\Pi$ or $\sigma$. Yet the top two plots in Figure 1 show that the knockoff procedure adapts well to the unknown true distribution of the coefficients, at least for FDR levels in the range that would typically be of interest, in the sense that it blindly achieves close to optimal asymptotic power. Figures 4 and 5 of Section 4 give another representation of these findings, which are the main results of the paper.
While the main focus is on the testing problem, we also address the issue of model selection and prediction. Motivated by the work of Abramovich et al. (2006), which formally connected FDR control and estimation in the sequence model, our aim is to investigate whether favorable properties associated with FDR control carry over to the linear model. As Abramovich et al. (2006) remark at the outset, in the special case of the sequence model, a full decision-theoretic analysis of the predictive risk is possible. For the linear model, Bogdan et al. (2015) proposed a penalized least squares estimator, SLOPE, motivated from the perspective of multiple testing and FDR control, and Su et al. (2016) proved that this estimator is asymptotically minimax under the same design as that considered in the current paper, although in a different sparsity regime. In truth, the linear model is substantially more difficult to analyze than the sequence model. Fortunately, however, the specific working assumptions we adopt allow us to analyze the (asymptotic) predictive risk, again relying on the elegant results of Bayati and Montanari (2012) for the Lasso estimates. While our results lack the rigor of those in, e.g., Abramovich et al. (2006) and Su et al. (2016), we think that the findings reported in Section 5 are valuable in investigating a problem for which conclusive answers are currently not available.
The rest of the paper is organized as follows. Section 2 reviews some relevant results from Su et al. (2017) and presents the oracle false discovery rate-power tradeoff diagram for the Lasso. In Section 3 we present a simple testing procedure which uses knockoffs for FDR calibration, and prove some theoretical guarantees. A power analysis, which compares the knockoff procedure to the oracle procedure, is carried out in Section 4. Section 5 is concerned with using knockoffs for prediction, and Section 6 concludes with a short discussion.
2 An oracle tradeoff diagram
Adopting the setup from Su et al. (2017), we consider the linear model
$$y = X\beta + z, \qquad (1)$$
in which the design matrix $X \in \mathbb{R}^{n \times p}$ has i.i.d. $N(0, 1/n)$ entries and the errors $z_i$ are i.i.d. $N(0, \sigma^2)$ with $\sigma \geq 0$ fixed and arbitrary. We assume that the regression coefficients $\beta_j$ are i.i.d. copies of a random variable $\Pi$ with $P(\Pi \neq 0) = \epsilon$ for a fixed constant $\epsilon \in (0,1)$. In this section we will be interested in the case where $n, p \to \infty$ with $n/p \to \delta$ for a positive constant $\delta$. Note that $\epsilon$ is assumed to not depend on $p$, and so the expected number of nonzero elements of $\beta$ is equal to $\epsilon p$.
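As an illustration of the working model, the following snippet simulates data from (1) with the scalings above; the particular values of $n$, $p$, $\epsilon$, $\sigma$ and the point mass chosen for the nonzero coefficients are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 1000                    # aspect ratio n/p = delta = 0.5
eps, sigma = 0.1, 1.0               # P(Pi != 0) and the noise level
X = rng.standard_normal((n, p)) / np.sqrt(n)    # i.i.d. N(0, 1/n) entries
beta = np.where(rng.random(p) < eps, 3.0, 0.0)  # Pi puts mass eps at 3.0
z = sigma * rng.standard_normal(n)
y = X @ beta + z
print(np.sum(beta != 0))   # close to eps * p = 100 on average
```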
Let $\hat\beta(\lambda)$ be the Lasso solution,
$$\hat\beta(\lambda) = \operatorname*{argmin}_{b \in \mathbb{R}^p} \ \tfrac{1}{2}\|y - Xb\|_2^2 + \lambda \|b\|_1, \qquad (2)$$
and denote $V(\lambda) = \#\{j : \beta_j = 0,\ \hat\beta_j(\lambda) \neq 0\}$, $T(\lambda) = \#\{j : \beta_j \neq 0,\ \hat\beta_j(\lambda) \neq 0\}$, $R(\lambda) = V(\lambda) + T(\lambda)$, and $k = \#\{j : \beta_j \neq 0\}$. Hence $V(\lambda)$ is regarded as the number of false ‘discoveries’ made by the Lasso; $T(\lambda)$ is the number of true discoveries; $R(\lambda)$ is the total number of discoveries; and $k$ is the number of true signals. We would like to remark here that $\beta_j = 0$ implies that $X_j$ (the $j$-th variable) is independent of $y$ marginally and conditionally on any subset of the remaining variables; hence the interpretation of rejecting the hypothesis $\beta_j = 0$ as a false discovery is clear and unambiguous. The false discovery proportion (FDP) is defined as usual as
$$\mathrm{FDP}(\lambda) = \frac{V(\lambda)}{\max(R(\lambda), 1)},$$
and the true positive proportion (TPP) is defined as
$$\mathrm{TPP}(\lambda) = \frac{T(\lambda)}{\max(k, 1)}.$$
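The two proportions can be computed directly from a coefficient vector and an estimate; here is a minimal numpy transcription of the definitions (function and variable names are ours).

```python
import numpy as np

def fdp_tpp(beta, beta_hat):
    """False discovery proportion and true positive proportion of a selection;
    a 'discovery' is a coordinate estimated as nonzero."""
    selected = beta_hat != 0
    nonnull = beta != 0
    V = np.sum(selected & ~nonnull)   # false discoveries
    T = np.sum(selected & nonnull)    # true discoveries
    R = V + T                         # total discoveries
    k = np.sum(nonnull)               # number of true signals
    return V / max(R, 1), T / max(k, 1)

beta     = np.array([0.0, 0.0, 3.0, -2.0, 0.0])
beta_hat = np.array([0.5, 0.0, 2.8,  0.0, 0.0])
print(fdp_tpp(beta, beta_hat))  # (0.5, 0.5): one false and one true discovery
```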
Su et al. (2017) build on the results in Bayati and Montanari (2012) to devise a tradeoff diagram between TPP and FDP in the linear sparsity regime. Lemma A.1 in Su et al. (2017), which is adapted from Bogdan et al. (2013) and is a consequence of Theorem 1.5 in Bayati and Montanari (2012), predicts the limits of FDP and TPP at a fixed value of $\lambda$. Throughout, let $\eta_t$ be the soft-thresholding operator, given by $\eta_t(x) = \operatorname{sgn}(x)(|x| - t)_+$. Also, let $\Pi^*$ denote a random variable distributed according to the conditional distribution of $\Pi$ given $\Pi \neq 0$; that is, $P(\Pi^* \leq t) = P(\Pi \leq t \mid \Pi \neq 0)$.
Finally, denote by the unique root of the equation .
Lemma 2.1 (Lemma A.1 in Su et al., 2017; Theorem 1 in Bogdan et al., 2013; see also Theorem 1.5 in Bayati and Montanari, 2012).
The Lasso solution with a fixed $\lambda > 0$ obeys
$$\frac{V(\lambda)}{p} \overset{P}{\longrightarrow} 2(1-\epsilon)\Phi(-\alpha), \qquad \frac{T(\lambda)}{p} \overset{P}{\longrightarrow} \epsilon \, P(|\Pi^* + \tau Z| > \alpha\tau),$$
where $Z \sim N(0,1)$ independently of $\Pi^*$, and $(\alpha, \tau)$ is the unique solution to
$$\tau^2 = \sigma^2 + \frac{1}{\delta} \, E\big[(\eta_{\alpha\tau}(\Pi + \tau Z) - \Pi)^2\big], \qquad \lambda = \alpha\tau \Big(1 - \frac{1}{\delta} P(|\Pi + \tau Z| > \alpha\tau)\Big).$$
As explained in Su et al. (2017), Lemma 2.1 is a consequence of the fact that, under the working assumptions, $\hat\beta(\lambda)$ is in some (limited) sense asymptotically distributed as $\eta_{\alpha\tau}(\beta + \tau Z)$, where $Z$ is a vector of i.i.d. $N(0,1)$ variables independent of $\beta$; here, the soft-thresholding operation acts on each component of a vector.
It follows immediately from Lemma 2.1 that for a fixed $\lambda$, the limits of FDP and TPP are
$$\mathrm{fdp}^\infty(\lambda) = \frac{2(1-\epsilon)\Phi(-\alpha)}{2(1-\epsilon)\Phi(-\alpha) + \epsilon \, P(|\Pi^* + \tau Z| > \alpha\tau)}, \qquad \mathrm{tpp}^\infty(\lambda) = P(|\Pi^* + \tau Z| > \alpha\tau).$$
The parametric curve $\lambda \mapsto (\mathrm{tpp}^\infty(\lambda), \mathrm{fdp}^\infty(\lambda))$ implicitly defines the FDP level attainable at a specific TPP level for the Lasso solution with a fixed $\lambda$. Interpreted in the opposite direction, for known $\Pi$ and $\sigma$, we have an analytical expression for the asymptotic power of a procedure that, for a given $q$, chooses $\lambda$ in (2) in such a way that $\mathrm{fdp}^\infty(\lambda) = q$. We describe how to compute the tradeoff curve for a given $\Pi$ and $\sigma$ in the Appendix.
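To make the computation of a point on the tradeoff curve concrete, here is a numerical sketch for a two-point prior (all nonzero coefficients equal to a constant $M$). It parametrizes the path by $\alpha$ rather than $\lambda$, which is equivalent along the path, and approximates the state-evolution expectation by Monte Carlo; the function names, the two-point prior, and the fixed-point iteration scheme are our illustrative choices.

```python
import math
import numpy as np

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def soft(x, t):
    """Soft-thresholding operator eta_t(x) = sgn(x)(|x| - t)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def state_evolution_tau(alpha, eps, M, sigma, delta, n_mc=200_000, seed=0):
    """Iterate tau^2 = sigma^2 + (1/delta) E[(eta_{alpha*tau}(Pi + tau*Z) - Pi)^2]
    to a fixed point, for the prior Pi = (1 - eps) delta_0 + eps delta_M."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_mc)
    Pi = np.where(rng.random(n_mc) < eps, M, 0.0)
    tau = sigma + 1.0                       # crude initialization
    for _ in range(200):
        mse = np.mean((soft(Pi + tau * Z, alpha * tau) - Pi) ** 2)
        tau = math.sqrt(sigma ** 2 + mse / delta)
    return tau

def fdp_tpp_limits(alpha, eps, M, sigma, delta):
    """Asymptotic FDP and TPP of the Lasso at threshold parameter alpha."""
    tau = state_evolution_tau(alpha, eps, M, sigma, delta)
    m = M / tau
    tpp = Phi(-alpha + m) + Phi(-alpha - m)    # P(|Pi* + tau Z| > alpha*tau)
    null_mass = 2.0 * (1.0 - eps) * Phi(-alpha)
    return null_mass / (null_mass + eps * tpp), tpp

fdp, tpp = fdp_tpp_limits(alpha=2.0, eps=0.2, M=4.0, sigma=0.5, delta=0.8)
print(round(fdp, 3), round(tpp, 3))
```

Sweeping $\alpha$ over a grid traces out the full curve; the fixed-point iteration converges for the parameter values shown here, though convergence for arbitrary parameters is not guaranteed by this naive scheme.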
Because in practice $\Pi$ and $\sigma$ would usually be unknown, we refer to this procedure as the oracle procedure. The oracle procedure is not a “legal” multiple testing procedure, as it requires knowledge of $\Pi$ and $\sigma$, which is of course realistically unavailable. Still, it serves as a reference for evaluating the performance of valid testing procedures. Competing with this oracle is the subject of the current paper, and the particular working assumptions above were chosen simply because, under these assumptions, it is possible to obtain a tractable expression for the empirical distribution of the Lasso solution.
Before proceeding, we would like to mention another oracle procedure. Thus, consider the procedure which, for a target level $q$, stops at the smallest value of $\lambda$ for which the FDP does not exceed $q$.
This can be thought of as an oracle that is allowed to see the actual FDP but is still committed to the Lasso path, and stops the last “time” the FDP is smaller than $q$. From now on, we think of $\lambda$ as decreasing in time, and regard one variable as entering before another if it enters the path at a larger value of $\lambda$. Although this may appear to be a stronger oracle in some sense, it is asymptotically equivalent to the “fixed-$\lambda$” oracle discussed earlier: by Theorem 3 in Su et al. (2017), the event that the asymptotic tradeoff curve lower-bounds the realized FDP simultaneously over the entire path has probability tending to one, hence the lower bound on FDP holds uniformly in $\lambda$.
3 Calibration using knockoffs
If we limit ourselves to procedures that use the Lasso with a fixed (or even data-dependent) value of $\lambda$, from the previous section we see that the achievable TPP can never asymptotically exceed that of the oracle procedure. Hence the asymptotic power of the oracle procedure serves as an asymptotic benchmark. In this section we propose a competitor to the oracle, which utilizes knockoffs for FDR calibration. While the procedure presented below is sequential and formally does not belong to the class of Lasso estimates with a data-dependent choice of $\lambda$, it will be almost equivalent to the corresponding procedure belonging to that class. We discuss this further later on.
3.1 A knockoff filter for the i.i.d. design
Let the model be given by (1) as before. In this section, unless otherwise specified, it suffices to assume that the entries of $X$ are i.i.d. draws from an arbitrary known continuous distribution (not necessarily normal). Furthermore, unless otherwise indicated, in this section we treat $\beta$ as fixed and condition throughout on $\beta$. The definitions in the current section override the previous definitions, if there is any conflict.
Let $\widetilde{X}$ be a matrix with i.i.d. entries drawn from the same distribution as the entries of $X$, independently of $X$ and of all other variables, and with a fixed positive number of columns. Next, let $[X\ \widetilde{X}]$ denote the augmented design matrix, and consider the Lasso solution for the augmented design. Let the index sets corresponding to the original variables, the null variables and the knockoff variables be defined accordingly. Note that, because we are conditioning on $\beta$, the set of null indices is nonrandom. Now define the statistics $Z_j = \sup\{\lambda : \hat\beta_j(\lambda) \neq 0\}$, computed on the augmented design.
Informally, $Z_j$ measures how early the $j$-th variable enters the Lasso path (the larger $Z_j$, the earlier).
Let $V(\lambda)$ and $\widetilde{V}(\lambda)$ be the number of true null and fake null variables, respectively, which enter the Lasso path “before” time $\lambda$. Also, let $R(\lambda)$ be the total number of original variables entering the Lasso path before “time” $\lambda$. A visual representation of the quantities defined above is given in Figure 2.
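The entry times and the three counts can be computed numerically by solving the Lasso on a decreasing grid of penalties. The sketch below uses a simple coordinate-descent solver of our own (a grid approximation to the exact path; the data dimensions, signal strengths, and grid are arbitrary choices for illustration).

```python
import numpy as np

def lasso_cd(X, y, lam, b0=None, n_sweeps=100):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p) if b0 is None else b0.copy()
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ b
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r + col_sq[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - new)
            b[j] = new
    return b

def entry_times(X, y, lambdas):
    """Z_j = largest lambda on the (decreasing) grid at which variable j is active."""
    p = X.shape[1]
    Z = np.zeros(p)
    b = np.zeros(p)
    for lam in lambdas:
        b = lasso_cd(X, y, lam, b0=b)      # warm start along the path
        Z[(Z == 0) & (b != 0)] = lam       # record first appearance only
    return Z

rng = np.random.default_rng(1)
n, p, n_fake = 200, 40, 40
X = rng.standard_normal((n, p)) / np.sqrt(n)
X_fake = rng.standard_normal((n, n_fake)) / np.sqrt(n)   # knockoff columns
beta = np.concatenate([np.full(10, 4.0), np.zeros(p - 10)])
y = X @ beta + 0.5 * rng.standard_normal(n)

Z = entry_times(np.hstack([X, X_fake]), y, np.linspace(1.0, 0.05, 20))
lam0 = 0.5
V_null = np.sum(Z[10:p] >= lam0)   # true nulls entering before "time" lam0
V_fake = np.sum(Z[p:] >= lam0)     # knockoffs entering before lam0
R_orig = np.sum(Z[:p] >= lam0)     # original variables entering before lam0
print(V_null, V_fake, R_orig)
```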
The class of procedures we propose (henceforth, knockoff procedures) rejects the null hypotheses corresponding to the variables entering the Lasso path before a data-dependent time, for an estimate of the proportion of nulls and for an associated rejection set. Note that, conditionally on $\beta$, the proportion of nulls is a (nonrandom) parameter, and in this section we speak of estimating it rather than testing it. Hence the tests in this class are indexed by the estimate used. Denote by $x_j$ the columns of the augmented matrix $[X\ \widetilde{X}]$. To see why the procedure makes sense, observe that, conditionally on $\beta$, the joint distribution of the data is invariant under permutations of the variables in the extended null set (just by symmetry). Therefore, conditionally on the number of extended null variables selected, any subset of the extended nulls of that size is equally probable to be the selected set. It follows that
and hence (12) can be interpreted as the smallest $\lambda$ such that an estimate of the FDP at time $\lambda$ is below $q$. The fact that the procedure defined above, with the trivial estimate $\hat\pi_0 = 1$, controls the FDR, is a consequence of the more general result below.
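To make the calibration concrete, the following sketch scans candidate thresholds and returns the smallest one at which a knockoff-based FDP estimate falls below $q$. The “$+1$” finite-sample correction and the $p/r$ normalization are our illustrative choices and need not coincide with the exact definition in (12).

```python
import numpy as np

def knockoff_threshold(Z_orig, Z_fake, q, pi0_hat=1.0):
    """Smallest candidate threshold t at which the knockoff-based estimate of
    FDP(t) is below q; returns None if no threshold qualifies. Knockoffs above
    t serve as a proxy count for null selections among the originals."""
    p, r = len(Z_orig), len(Z_fake)
    qualifying = []
    for t in np.concatenate([Z_orig, Z_fake]):
        R = np.sum(Z_orig >= t)              # original variables selected at t
        V_fake = np.sum(Z_fake >= t)         # knockoffs selected at t
        fdp_hat = pi0_hat * (p / r) * (1 + V_fake) / max(R, 1)
        if fdp_hat <= q:
            qualifying.append(t)
    return min(qualifying) if qualifying else None

Z_orig = np.array([9, 8, 7, 6, 5, 0.5, 0.4, 0.3, 0.2, 0.1])
Z_fake = np.array([0.45, 0.35, 0.25, 0.15, 0.05] + [0.01] * 5)
t_hat = knockoff_threshold(Z_orig, Z_fake, q=0.3)
print(t_hat, int(np.sum(Z_orig >= t_hat)))  # 0.4 and 7 rejections
```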
Theorem 3.1 (FDR control for knockoff procedure with $\hat\pi_0 = 1$).
Let test statistics be given whose joint distribution is invariant to permutations in the extended null set (that is, reordering the statistics with indices in the extended null set leaves the joint distribution unchanged). Then the procedure that rejects a hypothesis if its statistic exceeds the threshold $\hat\lambda$, where
controls the FDR at level $q$. In fact,
We will use the following lemma, which is proved in the Appendix.
Lemma 3.2. Let $H$ be a hypergeometric random variable, i.e.,
with parameters as above. Then (here and below we adopt the obvious conventions for ratios whose numerator or denominator vanishes)
for any value of the parameters.
We proceed to prove the theorem. Recall that we are conditioning throughout on $\beta$ (so that, in particular, the set of nulls is nonrandom). We will show that FDR control holds conditionally on $\beta$; it therefore also holds unconditionally.
With the notation above,
We will show that
Consider the filtration
That is, at all times , we have knowledge about the number of true null, fake null and nonnull variables with statistics larger than . We claim that
is a supermartingale with respect to this filtration, and that the threshold is a stopping time. Indeed, we need to show that
The crucial observation is that, conditional on , and ,
This follows from the exchangeability of the test statistics for the true and fake nulls (in other words, conditional on everything else, the order of the red markers in Figure 2 is random).
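The exchangeability-to-hypergeometric step can be checked by simulation: if the extended null statistics are exchangeable and a fixed number of extended nulls end up selected, the number of true nulls among them is hypergeometric. A quick Monte Carlo sanity check (sample sizes are our arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_true, n_fake = 30, 20        # true null and fake (knockoff) null variables
n_sel = 10                     # extended nulls selected in each trial
counts = []
for _ in range(10_000):
    stats = rng.standard_normal(n_true + n_fake)  # exchangeable null statistics
    top = np.argsort(stats)[-n_sel:]              # indices of the selected nulls
    counts.append(np.sum(top < n_true))           # how many are true nulls
mean_true = np.mean(counts)
# exchangeability predicts a hypergeometric count with mean n_sel * 30/50 = 6
print(mean_true)
```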
Now, Lemma 3.2 assures us that
regardless of the value of the conditioning variables. Hence, the supermartingale property (15) holds (and it is also clear that the threshold is a stopping time).
By the optional stopping theorem for supermartingales,
But conditional on we have that , and so, invoking Lemma 3.2 once again, we have
no matter the value of . ∎
3.2 Estimating the proportion of nulls
From our analysis, we see that FDR control would continue to hold if we replaced the number of hypotheses with the number of nulls in the definition of the threshold. Of course, this quantity is unobservable; equivalently, the proportion of nulls $\pi_0$ is unobservable. We can, however, estimate $\pi_0$ using an approach similar to that of Storey (2002), except that we replace calculations under the theoretical null with the counting of knockoff variables at the “bottom” of the Lasso path: specifically, for any $\lambda_0$ we have
again due to the exchangeability of null features. It follows that
For small $\lambda_0$, we expect only a few non-nulls among the variables remaining at the bottom of the path, so that the approximation above is accurate. Hence, we propose to estimate $\pi_0$ by
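As an illustration of this idea only, here is one plausible Storey-type estimate based on knockoff counts at the bottom of the path; the normalization shown is our own and need not match the exact estimator in (18).

```python
import numpy as np

def pi0_knockoff(Z_orig, Z_fake, lam0):
    """Estimate the proportion of nulls: originals that have not entered the
    path before lam0, scaled by the corresponding knockoff count (the knockoff
    statistics behave like a pure-null sample)."""
    p, r = len(Z_orig), len(Z_fake)
    below_orig = np.sum(Z_orig < lam0)   # originals still out at "time" lam0
    below_fake = np.sum(Z_fake < lam0)   # knockoffs still out at lam0
    if below_fake == 0:
        return 1.0
    return min(1.0, (r * below_orig) / (p * below_fake))

# toy example: 20 strong signals, 80 nulls; knockoff statistics mimic the nulls
Z_orig = np.concatenate([np.full(20, 5.0), np.tile([0.2, 0.8], 40)])
Z_fake = np.tile([0.2, 0.8], 50)
print(pi0_knockoff(Z_orig, Z_fake, lam0=0.5))  # 0.8, the true proportion of nulls
```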
We again state our result more generally.
Theorem 3.3 (FDR control for knockoff procedure with an estimate of $\pi_0$).
( is set to if the denominator vanishes). In the setting of Theorem 3.1, the procedure that rejects if , where
We use the same notation as in the proof of Theorem 3.1. As before, we have that
Again, the process starting at , is a supermartingale w.r.t. and is a stopping time. Then the optional stopping theorem gives
As before, conditional on . In the Appendix we prove that if a random variable follows this distribution, then
In particular, for all values of the parameters the bound is no more than $q$, thereby completing the proof of the theorem.
In practice, we of course suggest using the minimum between one and (18) as an estimator for $\pi_0$. While we cannot obtain a result similar to Theorem 3.3 when the truncated estimate of $\pi_0$ is used instead of (18), we can appeal to results from approximate message passing, in the spirit of the calculations of Section 4, to obtain an approximate asymptotic result. Thus, let
and adopt the working assumptions stated at the beginning of Section 4. As in the proof of the theorem above,
and by the same reasoning as above,
Now, as in equation (26) below, we have
Treating the approximations above as equalities (in Section 4 we argue that these should be good approximations), we have
We conclude that, up to the approximations above,
4 Power Analysis
The main observation in this paper is that we can use the consequences of Theorem 1.5 in Bayati and Montanari (2012) to obtain an asymptotic TPP–FDP tradeoff diagram for the Lasso solution associated with the knockoff setup; that is, the Lasso solution corresponding to the augmented design $[X\ \widetilde{X}]$. Indeed, the working assumptions required for applying Lemma 2.1 continue to hold, only with a slightly different set of parameters. Recall that $\widetilde{X}$ has i.i.d. entries. Taking the number of knockoff variables to be a constant multiple of $p$, the aspect ratio of the augmented design converges as well. Furthermore, because the augmented coefficient vector is related to $\beta$ through
where the appended knockoff coefficients are all zero, we have that the empirical distribution of the coefficients corresponding to the augmented design converges exactly to the mixture displayed in (24). Fortunately, it is only the empirical distribution of the coefficient vector that needs to have a limit of the form (24) for Lemma 2.1 to follow from Theorem 1.5 of Bayati and Montanari (2012); see also the remarks in Section 2 of Su et al. (2017). The working hypothesis of Section 2 therefore holds with $\epsilon$ and $\delta$ replaced by their counterparts for the augmented design. Recalling the definitions of $V(\lambda)$ and $\widetilde{V}(\lambda)$ from Section 3, a counterpart of Lemma 2.1 asserts that
Above, we allowed ourselves to approximate the number of variables ever entering the path before time $\lambda$ with the number of variables active at $\lambda$, and likewise for the knockoffs. If least-angle regression (Efron et al., 2004) were used instead of the Lasso, the two quantities would be equal; but because the Lasso solution path is considered here, we cannot completely rule out the event in which a variable that entered the path drops out, and the former quantity is in general larger. Still, we expect the difference to be very small, at least for large enough values of $\lambda$ (corresponding to small FDP). This is verified by the simulation of Figure 3: the green circles, computed using the entry-time statistics, are very close to the red circles, computed when rejections correspond to variables active at $\lambda$. All three simulation examples were computed for a single realization of the original design (not the augmented design). Similar results were obtained when we repeated the experiment (i.e., for a different realization of the data).