# On the Construction of Knockoffs in Case-Control Studies

Consider a case-control study in which we have a random sample, constructed in such a way that the proportion of cases in our sample is different from that in the general population; for instance, the sample is constructed to achieve a fixed ratio of cases to controls. Imagine that we wish to determine which of the potentially many covariates under study truly influence the response by applying the new model-X knockoffs approach. This paper demonstrates that it suffices to design knockoff variables using data that may have a different ratio of cases to controls. For example, the knockoff variables can be constructed using the distribution of the original variables under any of the following scenarios: (1) a population of controls only; (2) a population of cases only; (3) a population of cases and controls mixed in an arbitrary proportion (irrespective of the fraction of cases in the sample at hand). The consequence is that knockoff variables may be constructed using unlabeled data, which is often available more easily than labeled data, while maintaining Type-I error guarantees.

## 1 Conditional Testing

In many scientific applications, researchers want to understand which of the potentially many explanatory variables truly influence a response variable of interest. For example, geneticists seek to understand the causes of a biologically complex disease using single nucleotide polymorphisms (SNPs) as covariates. A goal in such studies is to determine whether or not a given genetic mutation influences the risk of the disease. Moving away from a specific application, the general statistical problem is this: given covariates $X = (X_1, \dots, X_p)$ and a response variable $Y$, which may be discrete or continuous, for each variable $X_j$ we would like to know whether the distribution of the response $Y \mid X$ depends on $X_j$ or not; or equivalently, whether the $j$th variable has any predictive power or not. Under mild conditions [4, 3], this conditional null hypothesis is equivalent to

$$
H_j : X_j \perp\!\!\!\perp Y \mid X_{-j}; \tag{1}
$$

under $H_j$, $X_j$ is independent of $Y$ once we have information about all the other features $X_{-j}$.

It is intuitively clear that the null hypothesis of conditional independence (1) does not depend on the marginal distribution of $X$. Specifically, (1) can be verified by simply checking that the conditional distribution of $Y \mid X$ depends on $X_{-j}$ and not on $X_j$, and therefore knowing the conditional distribution of $Y \mid X$ is sufficient for testing this property. Somewhat less intuitively, it is also the case that (1) can be verified through the conditional distribution of $X \mid Y$, regardless of the marginal distribution of $Y$.

###### Proposition 1.

Consider any two distributions $P$ and $Q$ on the pair $(X, Y)$. Then:

• Assume $P$ and $Q$ have the same likelihood of the response $Y$, i.e. $P(Y \mid X) = Q(Y \mid X)$,¹ and that $P(X)$ is absolutely continuous with respect to $Q(X)$.² Then if $H_j$ is true under $Q$, it is also true under $P$.

• Assume $P$ and $Q$ have the same conditional distribution of the covariates, i.e. $P(X \mid Y) = Q(X \mid Y)$, and that $P(Y)$ is absolutely continuous with respect to $Q(Y)$. Then if $H_j$ is true under $Q$, it is also true under $P$. Furthermore, in this case we have

$$
P(X_j \mid X_{-j}) = Q(X_j \mid X_{-j}). \tag{2}
$$

¹ In this paper, we write joint distributions as $P(X, Y)$, marginals as $P(X)$ and $P(Y)$, and conditionals as $P(Y \mid X)$ and $P(X \mid Y)$.

² The absolute continuity is here to avoid trivial situations of the following kind: suppose $X_j$ is constant under $Q$ but not under $P$. Since $X_j$ is constant under $Q$, $H_j$ holds trivially under $Q$. It may however not hold under $P$.

###### Proof of Proposition 1.

We prove the proposition in the case where all the variables are discrete; the case where some of the variables may be continuous is proved analogously. The first part of the proposition is nearly a tautology. Assume that $H_j$ is true under $Q$. Then we have³

$$
P(Y, X_j \mid X_{-j}) = P(Y \mid X)\, P(X_j \mid X_{-j}) = Q(Y \mid X)\, P(X_j \mid X_{-j}) = Q(Y \mid X_{-j})\, P(X_j \mid X_{-j}).
$$

The second equality comes from the assumption that the likelihoods are identical, and the third from our assumption that $H_j$ holds under $Q$. Hence, $Y$ and $X_j$ are conditionally independent given $X_{-j}$ under $P$, and so $H_j$ holds under $P$.

³ To emphasize the role of absolute continuity, the equality should be interpreted as holding at all points in the support of $P$.

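For the second part, suppose that $H_j$ holds under $Q$. Then³

$$
P(X_j \mid Y, X_{-j}) = Q(X_j \mid Y, X_{-j}) = Q(X_j \mid X_{-j}),
$$

where the first step holds because $P$ and $Q$ have the same conditional distribution of $X \mid Y$, while the second step uses the assumption that $H_j$ holds under $Q$, i.e. $X_j \perp\!\!\!\perp Y \mid X_{-j}$ under $Q$. This immediately implies that $X_j \perp\!\!\!\perp Y \mid X_{-j}$ under $P$, and so $H_j$ holds under $P$. This gives $P(X_j \mid X_{-j}) = Q(X_j \mid X_{-j})$. ∎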
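
The second part of Proposition 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the probability tables, the helper names (`null_holds`, `reweight_response`, `cond_xj`) and the choice of two binary covariates are assumptions made here, not part of the paper. It builds a distribution $Q$ in which $X_1$ is null, changes the marginal of $Y$ while keeping $X \mid Y$ fixed to obtain $P$, and then compares the two.

```python
import numpy as np

def null_holds(joint, j):
    """Check X_j indep. of Y given X_{-j} for a joint pmf whose last axis is Y."""
    y = joint.ndim - 1
    cond = joint / joint.sum(axis=(j, y), keepdims=True)  # law of (X_j, Y) given X_{-j}
    px = cond.sum(axis=y, keepdims=True)                   # law of X_j given X_{-j}
    py = cond.sum(axis=j, keepdims=True)                   # law of Y given X_{-j}
    return bool(np.allclose(cond, px * py))

def reweight_response(joint, new_py):
    """Keep the conditional of X given Y, replace the marginal of Y by new_py."""
    x_axes = tuple(range(joint.ndim - 1))
    x_given_y = joint / joint.sum(axis=x_axes, keepdims=True)
    return x_given_y * np.asarray(new_py)

def cond_xj(joint, j):
    """Conditional pmf of X_j given X_{-j}, after summing out Y (the last axis)."""
    px = joint.sum(axis=-1)
    return px / px.sum(axis=j, keepdims=True)

# Hypothetical Q on (X1, X2, Y): Q(x1, x2, y) = Q(x2) Q(x1 | x2) Q(y | x2),
# so X1 is null and X2 is not.
q_x2 = np.array([0.3, 0.7])
q_x1_given_x2 = np.array([[0.8, 0.2], [0.4, 0.6]])  # rows: x2, columns: x1
q_y_given_x2 = np.array([[0.9, 0.1], [0.5, 0.5]])   # rows: x2, columns: y
Q = np.einsum("b,ba,bc->abc", q_x2, q_x1_given_x2, q_y_given_x2)  # axes (x1, x2, y)

# P shares Q's conditional of X given Y but has a rare positive response.
P = reweight_response(Q, [0.999, 0.001])

print("H_1 under Q, P:", null_holds(Q, 0), null_holds(P, 0))
print("H_2 under Q, P:", null_holds(Q, 1), null_holds(P, 1))
print("same X1 | X2 conditional:", np.allclose(cond_xj(P, 0), cond_xj(Q, 0)))
print("same marginal of X:", np.allclose(P.sum(axis=-1), Q.sum(axis=-1)))
```

In this toy example the null status of each variable and the conditional of the null variable $X_1 \mid X_2$ agree under $P$ and $Q$, while the marginal of $X$ does not.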

## 2 Case-Control Studies

Prospective and case-control studies in which the response $Y \in \{0, 1\}$ takes on two values (e.g. indicating whether an individual suffers from a disease or not) offer well-known examples of distributions satisfying the second condition, where the conditional distribution of $X \mid Y$ is the same but the marginal distribution of $Y$ is not.

Prospective study

In a prospective study, we may be interested in a specific population—all adults living in the UK, all males, all pregnant women, and so on.

Retrospective (case-control) study

In a retrospective study, individuals are typically recruited from the population based on the value of their response $Y$. In a case-control study, for instance, we may recruit individuals at random in such a way that the ratio of cases to controls is fixed. Typically, cases are more prevalent in a retrospective sample than they are in a prospective sample.

A prospective distribution $P$ and a retrospective distribution $Q$ have equal conditional distributions of $X \mid Y$,

$$
P(X \mid Y) = Q(X \mid Y).
$$

This is because, conditioning on $Y = 1$ (the individual has the disease), both $P$ and $Q$ sample individuals uniformly at random from the population of all individuals with the disease; the same holds for $Y = 0$. That is, conditioned on the value of $Y$, the two types of studies both sample from the same distribution. On the other hand, $P$ and $Q$ will in general have different marginal distributions,

$$
P(X) \neq Q(X) \quad \text{and} \quad P(Y) \neq Q(Y).
$$

For instance, while the incidence of a disease may be low (say, less than 0.1%) in the population, it may be high in the retrospective sample (say, equal to 50%). This trivially implies that $P(Y) \neq Q(Y)$. In general we would also have $P(X) \neq Q(X)$ since, under $Q$, values of $X$ associated with a high risk of the disease would be overrepresented relative to $P$. Since $P(X \mid Y) = Q(X \mid Y)$, however, it then follows from the second part of Proposition 1 that in a case-control study, if conditional independence holds w.r.t. the retrospective distribution $Q$, it holds w.r.t. the prospective distribution $P$. (This is because the retrospective distribution includes both cases ($Y = 1$) and controls ($Y = 0$) with positive probability and, therefore, $P(Y)$ is absolutely continuous w.r.t. $Q(Y)$.)
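
To make the difference in marginals concrete, here is a small worked example with hypothetical numbers (they are illustrative assumptions, not taken from any study). Take a single binary risk factor $X$ with $P(X = 1 \mid Y = 1) = Q(X = 1 \mid Y = 1) = 0.6$ and $P(X = 1 \mid Y = 0) = Q(X = 1 \mid Y = 0) = 0.2$, a population prevalence $P(Y = 1) = 0.001$, and a balanced case-control sample with $Q(Y = 1) = 0.5$. The law of total probability gives

$$
\begin{aligned}
P(X = 1) &= 0.001 \times 0.6 + 0.999 \times 0.2 = 0.2004, \\
Q(X = 1) &= 0.5 \times 0.6 + 0.5 \times 0.2 = 0.4,
\end{aligned}
$$

so the risk factor is about twice as frequent in the case-control sample as in the population, even though the two distributions share the same conditional of $X \mid Y$ by construction.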

## 3 Knockoffs in Case-Control Studies

We now turn to the main subject of this paper. Model-X knockoffs is a new framework for testing conditional hypotheses (1) in complex models. While most of the literature relies on a specification of the model that links together the response and the covariates, the originality of the knockoffs approach is that it does not make any assumption about the distribution of $Y \mid X$. The price to pay for this generality is that we need to know the marginal distribution of the covariates. Assume we get independent samples from a distribution $Q$ (as in a retrospective study, for example). Model-X knockoffs are fake variables $\tilde{X} = (\tilde{X}_1, \dots, \tilde{X}_p)$ obeying the following pairwise exchangeability property:

$$
X \sim Q(X) \implies (X_j, \tilde{X}_j, X_{-j}, \tilde{X}_{-j}) \overset{d}{=} (\tilde{X}_j, X_j, X_{-j}, \tilde{X}_{-j}) \quad \text{for any } j \in \mathcal{H}_0. \tag{3}
$$

Here, $\mathcal{H}_0$ is the subset of null hypotheses that are true, i.e. covariates $j$ for which $X_j \perp\!\!\!\perp Y \mid X_{-j}$ under $Q$ (and which therefore also hold under any other distribution with the same conditional $X \mid Y$). Having achieved (3), a general selection procedure effectively using knockoff variables as negative controls can be invoked to select promising variables while rigorously controlling the false discovery rate. In other words, (3) implies that a variable selection procedure that is likely to mistakenly select an irrelevant variable $X_j$ is equally likely to select the constructed knockoff feature $\tilde{X}_j$, which then alerts us to the fact that our variable selection procedure is selecting false positives. We refer the reader to the already extensive literature on the subject, e.g. [1, 3], for further information.
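
The selection step is not spelled out in this paper; the sketch below illustrates it using the knockoff+ threshold from the knockoff filter literature [1], not a construction introduced here. It assumes that each variable already comes with a statistic $W_j$ whose sign flips when $X_j$ and $\tilde{X}_j$ are swapped, so that large positive values favour the original variable. The function names and the example statistics are hypothetical.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q; inf if none exists."""
    W = np.asarray(W, dtype=float)
    for t in np.unique(np.abs(W[W != 0])):  # np.unique returns sorted values
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return float(t)
    return float("inf")

def select(W, q):
    """Indices of the variables whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))

# Hypothetical statistics: six strong positives, four small values of mixed sign.
W = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]
print(knockoff_plus_threshold(W, q=0.2))  # 2.5
print(select(W, q=0.2))                   # [0 1 2 3 4 5]
```

Negative statistics play the role of the negative controls: each one signals that a knockoff beat its original, and the numerator counts them to estimate the number of false selections above the threshold.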

In the literature, one often encounters the claim that this shift in the burden of knowledge (i.e. knowledge about the distribution of $X$ versus that of $Y \mid X$) is appropriate in situations where we may have ample unlabeled data available to ‘learn’ the distribution of the covariates $X$. After all, while the geneticist may have observed only a few instances of a rare disease, she may have at her disposal several hundreds of thousands of unlabeled genotypes. This means that we have very limited access to labeled data, i.e. pairs $(X, Y)$ where $Y$ is known and where the sample is balanced to have a non-vanishing proportion of cases (i.e. $Y = 1$); this is the retrospective distribution $Q$. In contrast, unlabeled data ($X$ only) is easy to obtain, but will be drawn from the general population, in which $Y = 1$ is extremely rare; that is, it is drawn from the prospective distribution $P$. Imagine using this unlabeled data to learn the prospective distribution $P(X)$, i.e. the distribution of $X$ in the general population, and then using this knowledge for variable selection using our labeled case-control data, i.e. draws from the retrospective distribution $Q$. Using the distribution $P(X)$ learned on the unlabeled data, we would in principle be able to construct exchangeable features for $X \sim P(X)$, i.e. knockoff variables constructed to satisfy the exchangeability property

$$
X \sim P(X) \implies (X_j, \tilde{X}_j, X_{-j}, \tilde{X}_{-j}) \overset{d}{=} (\tilde{X}_j, X_j, X_{-j}, \tilde{X}_{-j}) \quad \text{for any } j \in \mathcal{H}_0. \tag{4}
$$

Now contrast (3) and (4): we want exchangeability w.r.t. the retrospective distribution $Q$, but since we have constructed our knockoffs using the unlabeled data, we have perhaps only achieved exchangeability w.r.t. the prospective distribution $P$. The good news is that this mismatch does not affect the validity of our inference. By Proposition 1, exchangeability of the null features and their knockoffs under the prospective distribution implies exchangeability under the retrospective distribution. A more general statement is this:

###### Theorem 1.

Consider two distributions $P$ and $Q$ such that $P(X_j \mid X_{-j}) = Q(X_j \mid X_{-j})$ for every null variable, i.e. for all $j \in \mathcal{H}_0$. Then any knockoff sampling scheme obeying exchangeability w.r.t. $P$, as in (4), obeys the same property w.r.t. $Q$, as in (3).

By (2) of Proposition 1, this conclusion applies to any situation where $P$ and $Q$ have the same conditionals, i.e. $P(X \mid Y) = Q(X \mid Y)$ (with the proviso that $P(Y)$ is absolutely continuous w.r.t. $Q(Y)$). In particular, it applies to case-control studies in which $Q$ is a retrospective distribution and $P$ is either a population of controls only, or a population of cases only, or a population of cases and controls mixed in an arbitrary proportion (irrespective of the fraction of cases in the sample drawn from $Q$).

This result allows considerable flexibility in the way we can construct knockoff variables since we can use lots of unlabeled data to estimate the conditional distributions $P(X_j \mid X_{-j})$. For example, by constructing our knockoffs from a data set consisting of controls only, which does not match the population in a case-control study, we are nonetheless using the correct conditionals for every null variable and can be assured that we are constructing valid knockoffs.

###### Proof of Theorem 1.

Once again, we prove the result in the case where all the variables are discrete. To prove our claim, we need to show the following: when $X \sim Q(X)$, the distribution of $(X_j, \tilde{X}) \mid X_{-j}$ is symmetric in the variables $X_j$ and $\tilde{X}_j$. This distribution is given by

$$
Q(X_j \mid X_{-j})\, P(\tilde{X} \mid X) = P(X_j \mid X_{-j})\, P(\tilde{X} \mid X),
$$

where $P(\tilde{X} \mid X)$ denotes the conditional distribution of $\tilde{X} \mid X$, and the equality holds since $P(X_j \mid X_{-j}) = Q(X_j \mid X_{-j})$ by assumption. Our claim now follows from (4), the exchangeability of knockoffs and null variables under $P$, which implies that the right-hand side is symmetric in $X_j$ and $\tilde{X}_j$. Therefore, $X_j$ and $\tilde{X}_j$ are also exchangeable under $Q$, proving the theorem. ∎
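
The argument can be illustrated on a toy example with exact probability tables. The sketch below makes several simplifying assumptions that are not in the paper: two binary covariates, a hypothetical retrospective distribution in which $X_1$ is null, and a knockoff for that single variable obtained by drawing $\tilde{X}_1$ from $P(X_1 \mid X_2)$ independently of $X_1$ given $X_2$, where $P$ is the controls-only population. A full knockoff scheme would treat all variables jointly; here only the swap of the null pair $(X_1, \tilde{X}_1)$ is examined.

```python
import numpy as np

# Hypothetical retrospective law Q on (X1, X2, Y), all binary, built so that
# X1 is null: Q(x1, x2, y) = Q(y) Q(x2 | y) Q(x1 | x2).
q_y = np.array([0.5, 0.5])
q_x2_given_y = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: y, columns: x2
q_x1_given_x2 = np.array([[0.9, 0.1], [0.4, 0.6]])  # rows: x2, columns: x1
Q = np.einsum("c,cb,ba->abc", q_y, q_x2_given_y, q_x1_given_x2)  # axes (x1, x2, y)

Q_x = Q.sum(axis=2)                  # covariate law in the case-control sample
P_x = Q[:, :, 0] / Q[:, :, 0].sum()  # covariate law among controls only

# Knockoff sampler learned from controls: K[t, b] = P(X1 = t | X2 = b).
K = P_x / P_x.sum(axis=0, keepdims=True)

# Joint pmf of (X1, X1_tilde, X2) when X is drawn from Q, axes (x1, x1_tilde, x2).
J = np.einsum("ab,tb->atb", Q_x, K)
swap_ok = np.allclose(J, J.transpose(1, 0, 2))

# Contrast: a sampler that ignores X2 and draws from the marginal P(X1) instead.
K_bad = np.repeat(P_x.sum(axis=1, keepdims=True), 2, axis=1)
J_bad = np.einsum("ab,tb->atb", Q_x, K_bad)
swap_bad = np.allclose(J_bad, J_bad.transpose(1, 0, 2))

print("marginals of X agree:", np.allclose(P_x, Q_x))
print("swap symmetry, correct conditional:", swap_ok)
print("swap symmetry, wrong conditional:", swap_bad)
```

In this example the covariate marginals of the controls and of the case-control sample differ, yet the knockoff built from the controls' conditional remains exchangeable with $X_1$ under $Q$; the sampler that uses the wrong conditional does not.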

## 4 Discussion

Our main result shows that if we use the right conditionals for each null variable, then the model-X framework applies and, ultimately, inference is valid, even when we construct knockoffs with reference to a distribution with the wrong marginals $P(X)$ and $P(Y)$. Mathematically, this result can be deduced from the arguments in . Our contribution here is to link this phenomenon with the situation in case-control studies, as specialists have openly wondered about the validity of knockoffs methods in such settings . Not only is the approach valid, but we can further leverage the shift in the burden of knowledge, using the ample availability of unlabeled data to construct valid knockoffs.

We have not discussed the question of power in this brief paper. However, we pose an interesting question for further investigation: now that we know that we can use either a population of controls to construct knockoffs, or a population of cases, or a population in which cases and controls are in an arbitrary proportion, which population should we use to maximize power? We hope to report on this in a future paper.

### Acknowledgements

R. F. B. was partially supported by the National Science Foundation via grant DMS 1654076, and by an Alfred P. Sloan fellowship. E. C. was partially supported by the Office of Naval Research under grant N00014-16-1-2712, by the National Science Foundation via DMS 1712800, by the Math + X Award from the Simons Foundation and by a generous gift from TwoSigma. E. C. would like to thank Chiara Sabatti and Eugene Katsevich for useful conversations related to this project.

## References

• [1] R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085, 2015.
• [2] R. F. Barber, E. J. Candès, and R. J. Samworth. Robust inference with knockoffs. arXiv:1801.03896, 2018.
• [3] E. J. Candès, Y. Fan, L. Janson, and J. Lv. Panning for gold: “model-X” knockoffs for high dimensional controlled variable selection. J. R. Statist. Soc. B, 80(3):551–577, 2018.
• [4] D. Edwards. Introduction to Graphical Modelling. Springer New York, New York, NY, 2000.
• [5] J. L. Marchini. Discussion of “Gene hunting with knockoffs for hidden Markov models”. Biometrika, page asy033, 2019.
• [6] R. L. Prentice and R. Pyke. Logistic disease incidence models and case-control studies. Biometrika, 66(3):403–411, 1979.