1 Introduction
Identifying important features is a ubiquitous problem in machine learning and statistics. In relatively simple settings, the importance of a given feature is measured via the fitted parameters of some model. Consider generalized linear models (GLMs), for example: users often add a LASSO penalty (Tibshirani, 1996) to promote sparsity in the coefficients, and subsequently select those features with nonzero coefficients. Stepwise procedures, where we sequentially modify the model, are another way of doing feature selection (Mallows, 1973; Akaike, 1974; Raftery, 1986). These standard methods are all plagued by correlations between the features: a feature that is not really relevant for the outcome, in a precise sense we will define in Section 2, can be selected by LASSO or a stepwise procedure because it is correlated with relevant features. The feature selection problem becomes even more difficult in more complex settings where we may not have a clean parametric model relating input features to the labels. Moreover, in these settings we usually lack statistical guarantees on the validity of the selected features. Finally, even procedures with statistical guarantees usually depend on having valid $p$-values, which are based on a correct modeling of $Y \mid X$ and (sometimes) assume some asymptotic regime. However, there are many common settings where these assumptions fail and we cannot perform inference based on those $p$-values (Sur and Candès, 2018).
A powerful new approach, the Model-X knockoff procedure (Candès et al., 2018), has recently emerged to deal with these issues. This method introduces a new paradigm: we no longer assume any model for the conditional distribution of $Y$ given $X$ in order to do feature selection (and therefore do not compute $p$-values), but we assume that we have full knowledge of the feature distribution $P_X$, or at least that we can accurately model it. This knowledge of the ground truth allows us to sample new knockoff variables satisfying some precise distributional conditions. Although we make no assumption on $Y \mid X$, we can run any procedure on the data set extended with the knockoffs and perform feature selection while controlling the False Discovery Rate (FDR). Statistical guarantees no longer rely on valid $p$-values, but exclusively on having valid knockoffs, obtained whenever we know $P_X$ (or whenever our model for $P_X$ is correct).
There are two main obstacles to using knockoffs in practice (i.e. “for the mass”): 1) tractably generating valid knockoffs and 2) once we have generated knockoffs, computing powerful test statistics. Current tractable methods for generating knockoffs are restricted to settings where $X$ is modeled as a multivariate Gaussian (Candès et al., 2018) or as the set of observed nodes in a Hidden Markov Model (HMM) (Sesia et al., 2017). Even though knockoff methods enjoy some robustness properties when approximating $P_X$ (Barber et al., 2018), the guarantee on FDR control breaks down even in very simple settings if these approximations turn out to be too crude, as we demonstrate in this paper. The Sequential Conditional Independent Pairs algorithm (SCIP) (Candès et al., 2018) is a universal knockoff sampling scheme but is computationally intractable as soon as we model $X$ by more complex distributions. Constructing tractable knockoff sampling procedures for more flexible classes of distributions used to model $X$ is essential to improve the validity of knockoffs, which is always subject to the quality of the approximation. Regarding test statistics, current methods (Barber et al., 2015; Candès et al., 2018; Sesia et al., 2017; Barber et al., 2018) focus on LASSO-based statistics to obtain proxies for the importance of any given feature: such methods need to fit a GLM to the data, which is too restrictive in many supervised learning tasks.
Our Contributions.
First, we formulate a novel tractable procedure for sampling knockoffs in settings where features can be modeled as the observed variables in a Bayesian network. This allows for great flexibility when modeling the feature distribution $P_X$. We show that this procedure is different from SCIP, which is the current state-of-the-art method to sample knockoffs. We construct valid knockoffs in settings where previous knockoff sampling procedures assumed a very restrictive model for $X$ and therefore failed to control FDR, and we provide a unified framework for several different sampling procedures. In addition, we systematically evaluate and compare several nonlinear knockoff test statistics which can be applied in general supervised learning problems. We develop a new statistic, Swap Integral, which has significantly better power than other methods. These two advances enable us to perform feature selection using black-box classifiers for $Y \mid X$ while still retaining statistical guarantees under correct modeling of $X$.
2 Knockoff Background
We begin by introducing the usual setting of feature selection procedures. We consider the data as a sequence of $n$ i.i.d. samples $(X^{(i)}, Y^{(i)})$, $i = 1, \dots, n$, from some unknown joint distribution $P_{XY}$, with $X \in \mathbb{R}^d$. We then define the set of null features $\mathcal{H}_0 \subset \{1, \dots, d\}$ by $j \in \mathcal{H}_0$ if and only if $X_j \perp Y \mid X_{-j}$ (where the subscript $-j$ indicates all variables except the $j$th). The non-null features, also called alternatives, are important because they capture the truly influential effects: each non-null feature is correlated with the label even conditioned on the rest of the features. Our goal is to identify these non-null features. Running the knockoff procedure gives us a selected set $\hat{S}$ while controlling the False Discovery Rate (FDR), the expected rate of false discoveries: $\mathrm{FDR} = \mathbb{E}\big[ |\hat{S} \cap \mathcal{H}_0| / \max(|\hat{S}|, 1) \big]$. Assuming we know the ground truth for the feature distribution $P_X$ (we will come back to the realistic setting where we only have samples of $X$), the first step of the knockoff procedure is to obtain a knockoff sample $\tilde{X}$ that satisfies the following conditions:
Definition 2.1 (Knockoff sample).
A knockoff sample $\tilde{X}$ of a $d$-dimensional random variable $X$ is a $d$-dimensional random variable satisfying two properties:
1. Conditional independence: $\tilde{X} \perp Y \mid X$.
2. Exchangeability: $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$ for every subset $S \subset \{1, \dots, d\}$,
where $\overset{d}{=}$ stands for equality in distribution and the notation $(X, \tilde{X})_{\mathrm{swap}(S)}$ refers to the vector where the original $j$th feature and the $j$th knockoff feature have been transposed whenever $j \in S$. The first condition is immediately satisfied as long as knockoffs are sampled conditioned on the sample $X$ without considering any information about $Y$, which will always be the case in our sampling methods. More generally, we say that any pair $(Z, \tilde{Z})$ such that $(Z, \tilde{Z})_{\mathrm{swap}(S)} \overset{d}{=} (Z, \tilde{Z})$ for every $S$ satisfies the exchangeability property.
Once we have the knockoff sample $\tilde{X}$, the next step of the procedure constructs feature statistics $W \in \mathbb{R}^d$ such that a high value of $W_j$ is evidence that the $j$th feature is non-null. Feature statistics described by Candès et al. (2018) depend only on $([X, \tilde{X}], Y)$, such that for each $j$ we can write $W_j = w_j([X, \tilde{X}], Y)$ for some function $w_j$. The only restriction these statistics must satisfy is the flip-sign property: swapping the $j$th feature and its corresponding knockoff feature should flip the sign of the statistic while leaving other feature statistics unchanged. More formally, for a subset $S$ of features, denoting $[X, \tilde{X}]_{\mathrm{swap}(S)}$ the data matrix where the original $j$th variable and its corresponding knockoff have been transposed whenever $j \in S$, we require that $w_j$ satisfies:
$$w_j\big([X, \tilde{X}]_{\mathrm{swap}(S)}, Y\big) = \begin{cases} -\,w_j\big([X, \tilde{X}], Y\big) & \text{if } j \in S, \\ \phantom{-}\,w_j\big([X, \tilde{X}], Y\big) & \text{otherwise.} \end{cases}$$
As suggested by Candès et al. (2018), the choice of feature statistics can be done in two steps: first, find a statistic $Z \in \mathbb{R}^{2d}$ where each coordinate corresponds to the “importance” (hence we will call them importance scores) of the corresponding feature (either original or knockoff). For example, $Z_j$ could be the absolute value of the regression coefficient of the $j$th feature.
After obtaining the importance scores $Z_j$ and $\tilde{Z}_j$ for the original and knockoff features, we compute the feature statistic $W_j = Z_j - \tilde{Z}_j$. The intuition is that importance scores of knockoffs serve as a control, such that a larger importance score of the $j$th feature compared to that of its knockoff implies a larger $W_j$ (and therefore is evidence against the null). Given that feature statistics of the null features behave “symmetrically” around 0 by the flip-sign property, the final selection procedure compares, at a given threshold $t > 0$, the number of features such that $W_j \le -t$ vs. the number of those such that $W_j \ge t$ (Candès et al., 2018). By keeping this ratio “under control” for a given target FDR that we fix in advance, we obtain an adaptive procedure that selects the features whose $W_j$ is above some adaptive threshold. FDR control is guaranteed with this procedure, as long as the knockoffs are sampled correctly (i.e., satisfying the two conditions), regardless of the choices made to construct the feature statistics (Candès et al., 2018).
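The adaptive threshold just described can be sketched in a few lines. This is a minimal NumPy illustration of the knockoff+ selection rule of Candès et al. (2018); the variable names and the toy statistics are our own:

```python
import numpy as np

def knockoff_threshold(W, q):
    """Adaptive threshold of the knockoff+ filter: the smallest t such that
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q."""
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target: select nothing

# toy feature statistics: a large positive W_j is evidence against the null
W = np.array([3.0, 2.5, 2.0, 1.5, -0.5, 0.3, -0.2, 0.1])
t = knockoff_threshold(W, q=0.25)
selected = np.flatnonzero(W >= t)  # here t = 1.5, selecting features 0-3
```

The "+1" in the numerator is what makes the estimate of the false discovery proportion conservative enough for finite-sample FDR control.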
The power of the knockoff procedure is that it makes no assumptions on the relationship between $X$ and $Y$; however, it requires full knowledge of the feature distribution $P_X$, which is often not known and needs to be modeled. It is also important to notice that for any model we use for $X$, we are required to know how to sample knockoffs from that model distribution. Current classes of distributions with a tractable knockoff sampling method have limited expressivity, and we can find simple settings where using models limited to these classes fails to provide valid knockoffs, and consequently the FDR is not controlled. We provide an example of such a setting in Section 5. However, the knockoff procedure is robust to small errors in the estimation of $P_X$ (Barber et al., 2018). Our method for sampling knockoffs from a Bayesian network allows us to fit a better distribution to the data, resulting in better control of FDR up to some error term with respect to the FDR we target (under the assumption that our model distribution of $X$ is perfect, this procedure controls FDR at the target FDR). We discuss in Appendix B the reasons why we expect our method to be robust to model misspecification.
Note that setting $\tilde{X} = X$ satisfies the knockoff properties, but no discovery would be made since all of the $W_j$ would be zero. Therefore, in addition to generating valid knockoffs in order to control FDR, it is also important to consider the statistical power of the procedure, which is the probability that a true alternative is selected. This depends not only on the knockoff sampling procedure, but also on the choice of the importance scores. We discuss power in detail in Sections 4 and 5.
3 Generating Knockoffs From Latent Variable Models
We first show how to generate knockoff samples from a Gaussian mixture model, which is already a new contribution, as a warm-up to our general procedure for generating knockoffs from latent variable models. This section relates only to the first step of the Model-X knockoff procedure: we consider different classes of distributions used to model $X$, and we do not consider the response $Y$ for this step.
Gaussian Mixture Knockoffs
The setup:
- $K$ is a categorical latent variable taking values in $\{1, \dots, m\}$, where $m$ is the number of mixture components.
- We observe an $\mathbb{R}^d$-valued random variable $X$ with distribution $X \mid K = k \sim \mathcal{N}(\mu_k, \Sigma_k)$.
For a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, we can explicitly write a joint density on $(X, \tilde{X})$ that satisfies the exchangeability property:
$$(X, \tilde{X}) \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix}, \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}(s) \\ \Sigma - \mathrm{diag}(s) & \Sigma \end{pmatrix} \right),$$
where $\mathrm{diag}(s)$ is a diagonal matrix with the constraint that the joint covariance matrix is positive definite (Barber et al., 2015). The Gaussian mixture knockoff sampling procedure consists of first sampling the mixture assignment variable $k$ from the posterior distribution $P(K \mid X)$. The knockoff is then sampled from the conditional distribution of the knockoff given the true variable and the sampled mixture assignment, defined as $P(\tilde{X} \mid X, K = k)$. More explicitly, $\tilde{X} \mid X, K = k$ is Gaussian with mean $X - \mathrm{diag}(s_k)\Sigma_k^{-1}(X - \mu_k)$ and covariance $2\,\mathrm{diag}(s_k) - \mathrm{diag}(s_k)\Sigma_k^{-1}\mathrm{diag}(s_k)$.
Proposition 3.1.
The Gaussian mixture knockoff sampling procedure stated above is such that $\tilde{X}$ is a valid knockoff of $X$.
The proof is deferred to Appendix A.2. Once we compute the necessary parameters of these conditional distributions, we can sample valid knockoffs for a Gaussian mixture model. Whenever we no longer know the ground truth for $P_X$, Gaussian mixtures can approximate arbitrarily well any continuous distribution (Bacharoglou, 2010), so we now have a very flexible model for the feature distribution from which we know how to tractably sample knockoffs. In a real setting, the validity of our knockoffs will be limited by the quality of our model's approximation of $P_X$, in the same way that, in a parametric model of $Y \mid X$, the validity of the $p$-values depends on the validity of the model and distributional assumptions (on the eventual random noise, on the asymptotic regime, etc.). Next we show how to efficiently generate valid knockoffs for general Bayesian networks, which can be used to accurately model $X$ in many settings.
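The two-step sampler can be sketched as follows. This is a minimal illustration with our own variable names; in particular, the simple shrunken choice of $s$ below is an assumption made to keep the conditional covariance positive definite, not the paper's tuned construction:

```python
import numpy as np

def gaussian_knockoff(x, mu, Sigma, rng):
    """Knockoff of x for X ~ N(mu, Sigma): sample from the conditional
    N(x - D S^{-1}(x - mu), 2D - D S^{-1} D) with D = diag(s)."""
    d = len(mu)
    # crude choice of s, shrunk so that 2*Sigma - D stays positive definite
    s = np.full(d, 0.9 * min(1.0, 2 * np.linalg.eigvalsh(Sigma).min()))
    D = np.diag(s)
    mean = x - D @ np.linalg.solve(Sigma, x - mu)
    cov = 2 * D - D @ np.linalg.solve(Sigma, D)
    return rng.multivariate_normal(mean, cov)

def log_gauss(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at x."""
    d = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = (x - mu) @ np.linalg.solve(Sigma, x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def gmm_knockoff(x, weights, mus, Sigmas, rng):
    """Step 1: sample k ~ P(K | X = x); step 2: Gaussian knockoff under k."""
    logp = np.array([np.log(w) + log_gauss(x, m, S)
                     for w, m, S in zip(weights, mus, Sigmas)])
    post = np.exp(logp - logp.max())
    post /= post.sum()
    k = rng.choice(len(weights), p=post)
    return gaussian_knockoff(x, mus[k], Sigmas[k], rng)

# toy two-component mixture in dimension 3
rng = np.random.default_rng(0)
mus = [np.zeros(3), 4.0 * np.ones(3)]
Sigmas = [np.eye(3), np.full((3, 3), 0.5) + 0.5 * np.eye(3)]
x = rng.multivariate_normal(mus[1], Sigmas[1])
x_tilde = gmm_knockoff(x, np.array([0.5, 0.5]), mus, Sigmas, rng)
```

The posterior step is exactly the probabilistic inference mentioned above: once the mixture assignment is drawn, the problem reduces to the known single-Gaussian case.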
Exchangeable Conjugate
Our knockoff sampling procedures will be based on the following idea. Instead of keeping long-range dependencies across all features as SCIP does, we want to exploit the structure given by the local conditional distributions of a Bayesian network: sampling knockoffs then amounts to doing probabilistic inference on the Bayesian network and sampling “local” knockoffs based on the local structure.
Definition 3.1.
Let $Z$ and $H$ be arbitrary multidimensional random variables. For any given conditional distribution $P_{Z \mid H}$, we say that a probability kernel $Q_{\tilde{Z} \mid Z, H}$ is an exchangeable conjugate conditional if, for any fixed value $h$ of $H$, the pair $(Z, \tilde{Z})$ satisfies the exchangeability property conditionally on $H = h$.
In other words, if $Z$ is sampled conditionally on $H$, and $\tilde{Z}$ is then sampled from the probability kernel $Q_{\tilde{Z} \mid Z, H}$, then $\tilde{Z}$ is a knockoff of $Z$ (conditionally on $H$). In some cases it is easy to find examples of such a conjugate conditional distribution. If $Z$ is univariate, then sampling $\tilde{Z}$ as an independent copy from $P_{Z \mid H}$ is a valid conjugate conditional. More generally, this choice is valid if the coordinates of $Z$ are independent conditionally on $H$. We can also construct valid conjugates when the distribution of $Z$ conditionally on $H$ is Gaussian (Candès et al., 2018) or a Markov chain (Sesia et al., 2017).
Sampling Knockoffs for Bayesian Networks
We define a procedure for sampling knockoffs when we assume a generative model for $X$: $X$ corresponds to the observed variables in a latent variable model which we assume can be represented as a Bayesian network (BN). Gaussian mixture models, Hidden Markov Models (HMM) (Poritz, 1988), Latent Dirichlet Allocation (LDA) (Blei et al., 2003), naive Bayes (Friedman et al., 1997) and many other Bayesian models fit this description. Consider a BN defined by a directed acyclic graph (DAG) whose nodes are indexed in a topological ordering and partitioned into observed variables and hidden ones: we have restated our initial set of features as the observed nodes, and we want to construct knockoffs by leveraging our knowledge about the DAG. The only restriction is that the observed variables must correspond to nodes in the graph without descendants. It is important to notice that here, the variables associated to a node of the DAG can be multidimensional, which sometimes simplifies the sampling procedure (an example is knockoffs for HMMs, detailed in Appendix A.3). We assume that we can sample from the conditional probability of the hidden variables given the observed ones (for simplicity we refer to the random variables by their indices). Assume also that, for every node, we can compute the local conditional probability of that node given its Markov blanket, and that we can compute an exchangeable conjugate of this conditional and sample from it, which yields a “local” knockoff. We describe the knockoff generating procedure in Algorithm 1 and further discuss the assumptions on the inputs in Appendix A.3.
Theorem 3.2.
The concatenation of the local knockoffs of the observed variables produced by Algorithm 1 satisfies the exchangeability property. That is, $\tilde{X}$ is a valid knockoff of $X$.
The proof is in Appendix A.4. This sampling method is not equivalent to the Sequential Conditional Independent Pairs (SCIP) algorithm introduced in (Candès et al., 2018). Consider the simple case with one observed multidimensional variable $X$ and one hidden variable $H$. SCIP would sample each coordinate of $\tilde{X}$ sequentially by computing (for each knockoff sample) conditional distributions of the joint probability distribution of $X$, without leveraging our knowledge of the generative process. That is not computationally feasible in an arbitrary generative model. Another difference with respect to SCIP is that our sampling procedure does not create a knockoff of the full set of hidden and observed variables: not all “local” knockoffs are retained in the final output, and the concatenation of all local knockoffs would not be a valid knockoff of the full set of variables. We refer to Appendix A.3 for more details on these differences between methods.
The HMM knockoff sampling procedure (Sesia et al., 2017) and the Gaussian mixture sampling above are special cases of Algorithm 1 (detailed in Appendix A.3). Naive Bayes would have a straightforward knockoff sampling scheme (as the observed nodes are independent conditionally on the hidden node): given the observed nodes, generate a sample for the hidden node (based on the posterior distribution), then sample independent knockoffs based on the sampled hidden node. We provide in Appendix A.3 an additional example of a DAG knockoff generating procedure based on the LDA model.
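The naive Bayes case just described can be written in a few lines. The sketch below uses a hypothetical Bernoulli naive Bayes model with parameter names of our own choosing; the posterior draw for the hidden node followed by conditionally independent knockoff draws mirrors the two phases of Algorithm 1:

```python
import numpy as np

def naive_bayes_knockoff(x, prior, cond_p, rng):
    """Knockoff sampler for a Bernoulli naive Bayes model.
    x: binary features, shape (d,); prior: P(Z = k), shape (m,);
    cond_p[k, j] = P(X_j = 1 | Z = k), shape (m, d).
    Step 1: sample the hidden node Z from its posterior given x.
    Step 2: sample each knockoff coordinate independently given Z."""
    like = np.prod(np.where(x == 1, cond_p, 1.0 - cond_p), axis=1)
    post = prior * like
    post = post / post.sum()
    z = rng.choice(len(prior), p=post)
    return rng.binomial(1, cond_p[z])

# sanity check of exchangeability at the level of marginal frequencies
rng = np.random.default_rng(1)
prior = np.array([0.4, 0.6])
cond_p = np.array([[0.1, 0.8, 0.5],
                   [0.7, 0.2, 0.5]])
Z = rng.choice(2, size=5000, p=prior)
X = rng.binomial(1, cond_p[Z])  # draws from the model itself
X_tilde = np.array([naive_bayes_knockoff(x, prior, cond_p, rng) for x in X])
max_gap = np.abs(X.mean(axis=0) - X_tilde.mean(axis=0)).max()
```

Since each knockoff coordinate is drawn from the same local conditional as the corresponding original given the sampled hidden node, the pair $(X, \tilde{X})$ is exchangeable conditionally on the hidden node, hence unconditionally; the empirical marginals of `X` and `X_tilde` should match up to sampling noise.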
4 Power Analysis: Evaluating Importance Score Methods
The previous section shows how to generate valid knockoffs for complex latent variable distributions. The next challenge is to compute importance scores for each original and knockoff feature; these were denoted $Z_j$ and $\tilde{Z}_j$ in Section 2 (recall that the feature statistic is $W_j = Z_j - \tilde{Z}_j$). While the knockoff procedure guarantees that the FDR is controlled for any importance score, the power of the procedure (i.e. how many true non-null features are discovered) is highly dependent on the method for computing the importance score. In this section we discuss several importance score methods applicable to a large family of predictive models and also define new ones.
LASSO Coefficient-Difference (LCD)
LASSO coefficient differences were used as feature statistics by Candès et al. (2018). However, for a nonlinear relationship between $X$ and $Y$, using importance scores based on a linear model assumption will result in low power, as LASSO is not able to fit the real input-output relationship (Figure 1(a)). A complex, unknown, nonlinear input-output relationship is an important obstacle if we want to use a parametric model to obtain importance scores through interpretable parameters.
Saliency Methods
If $Y \mid X$ is fitted using a neural network, then one can compute saliency scores, a popular metric for feature importance (Lipton, 2016; Ghorbani et al., 2017). Saliency is computed for each feature of each example, then averaged across all the examples to derive a global importance score. In our experiments, we use the Gradient (Baehrens et al., 2010; Simonyan et al., 2013) and Integrated Gradients (Sundararajan et al., 2017) saliency methods.
Importance Scores Based on Accuracy Drops
After a model is trained on the augmented dataset $([X, \tilde{X}], Y)$, for each column of $[X, \tilde{X}]$ we create a new “fake” feature column, for example by permuting that column across samples. Replacing each feature with its fake version one at a time, the resulting accuracy drop can be used as the importance score for that feature.
Note that achieving a high performance of the classifier on the initial dataset is not necessary to generate these importance scores, as our analysis is based on the accuracy drops. Classifiers with low predictive power can still be useful at determining non-null features, and although they may not make many discoveries, FDR is still controlled.
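A generic, model-agnostic sketch of this score (the fixed toy "classifier" and the permutation fake are our own illustrative choices):

```python
import numpy as np

def accuracy_drop_scores(acc_fn, X_aug, y, rng):
    """Importance score Z_j = accuracy on the original data minus accuracy
    after replacing column j with a permuted ("fake") copy."""
    base = acc_fn(X_aug, y)
    scores = np.empty(X_aug.shape[1])
    for j in range(X_aug.shape[1]):
        X_fake = X_aug.copy()
        X_fake[:, j] = rng.permutation(X_aug[:, j])  # fake column j
        scores[j] = base - acc_fn(X_fake, y)
    return scores

# toy example: a fixed classifier that only looks at column 0
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] > 0).astype(int)
acc_fn = lambda X_, y_: np.mean((X_[:, 0] > 0).astype(int) == y_)
scores = accuracy_drop_scores(acc_fn, X, y, rng)
# column 0 suffers a large accuracy drop; column 1 none
```

In a real run `acc_fn` would evaluate the trained model on the augmented matrix $[X, \tilde{X}]$, and the loop would cover all $2d$ columns.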
Permutation and Swapping
Originally introduced for random forests (Breiman, 2001), the Permutation method creates each fake column by simply shuffling the entries of the corresponding column of $[X, \tilde{X}]$. Although the permutation method may be effective at creating a new “fake” feature independent of $Y$, it comes with an important issue: when evaluating the classifier on the dataset where just one of the features has been shuffled, for general distributions of the features, we may end up with fake data points that are completely off the original data distribution and, as a result, outside of the learned manifold of the predictive model. The classifier's predictions on these off-distribution regions of the input space, where it hasn't been trained, may be arbitrary. As a result, the performance can potentially drop across all features and their knockoffs, regardless of whether they actually had an impact on $Y$. A simple illustration of this phenomenon is provided in Appendix E.
The main problem occurs with the non-null features, as an unexpectedly large drop in accuracy for the control knockoff feature can mask the non-null importance score, limiting the possibility that the non-null feature is selected. When the original distribution of the features is a mixture of Gaussians, permutation across any of the features will create data points that do not lie on any of the mixture components of $P_X$, and as the number of mixture components increases, the chance for a fake data point to be off-distribution increases. We confirm this phenomenon with synthetic data experiments (Figure 3(a-c)): power for the Permutation method strictly decreases as the number of mixture components grows.
To tackle this problem, we introduce the Swap method for importance scores. Instead of shuffling, for each original feature column we use the corresponding knockoff column as its fake replacement, and vice versa: $X_j^{\mathrm{fake}} = \tilde{X}_j$ and $\tilde{X}_j^{\mathrm{fake}} = X_j$. Keeping in mind the exchangeability condition satisfied by the knockoffs, this method guarantees that replacing each feature with its fake will result in data points that still belong to the original data distribution, so the problem of fake data points moving to regions of the space unknown to the predictive model is mitigated.
Path Integration Generates Importance Scores with Better Power
For a given parameter $\lambda \ge 0$, we compute for each original feature the fake replacement $X_j^{\mathrm{fake}}(\lambda) = X_j + \lambda (\tilde{X}_j - X_j)$ (and reciprocally for each knockoff feature). The basic Swap (or Permutation) method corresponds to $\lambda = 1$, and we plot the feature statistics as a function of the parameter $\lambda$, a curve we call the lambda path. We give an example of one such path in Figure 1(c) for a simulation where the classifier is a neural network with 3 fully connected layers, $Y$ is categorical with 10 classes such that $Y$ corresponds to the level sets of a random polynomial of degree 3 of the non-null features, and $X$ comes from a mixture of 10 Gaussians. Similarly, we can also take the lambda path for the Permutation method, where the fake replacement moves each feature toward a shuffled copy of itself. However, this lambda path quickly becomes unstable, as already at $\lambda = 1$ (the shuffled point) we are out of the data manifold.
Basic Swap and Permutation (which correspond to $\lambda = 1$) do not always give the best separation between the null features (grey) and the alternate features (red). This motivates taking the integral of the lambda path and using the area under the curve as the feature statistic; we call this Swap Integral and, similarly, Permutation Integral. For all of our experiments, we integrate from $\lambda = 0$ up to a maximum value $\lambda_{\max}$. Figure 1(b) shows that, for the same data as used in Figure 1(c), the Swap Integral and Permutation Integral methods are respectively much more powerful than the basic Swap and Permutation.
Our goal is to identify the classifier's decision boundaries within the data manifold (and thus likely to identify relevant features). When considering a lambda path, everything comes down to choosing an appropriate direction when translating each feature, and keeping track of the decision boundaries crossed. We may temporarily step out of the data manifold (indeed there is no guarantee that the manifold is convex; our Gaussian mixture setting is a good example of this), and for large $\lambda$ we eventually will be out of it. The integral (up to $\lambda_{\max}$) assigns a higher importance to boundaries crossed early on. As the Swap Integral method “stays longer” in the data manifold early on (the point at $\lambda = 1$ is still in it), it performs better than choosing an arbitrary direction (such as one given by a shuffled vector, i.e. the Permutation Integral). Knockoffs are thus not only useful for FDR control; they also contain information about the data manifold. The outcome of the following experiments using Swap Integral does not substantially change as long as $\lambda_{\max}$ stays in the range 5 to 15.
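The lambda-path integral can be sketched as a Riemann sum. This is our own minimal implementation: the grid size, the fixed toy classifier, and the random stand-in "knockoffs" are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def swap_integral_scores(acc_fn, X, X_tilde, y, lam_max=10.0, n_grid=20):
    """Importance scores for d originals and d knockoffs: integrate the
    accuracy drop over lambda in (0, lam_max], replacing feature j by
    X_j + lam * (X_tilde_j - X_j) (and symmetrically for knockoffs)."""
    d = X.shape[1]
    aug = np.hstack([X, X_tilde])
    base = acc_fn(aug, y)
    scores = np.zeros(2 * d)
    step = lam_max / n_grid
    for lam in np.arange(1, n_grid + 1) * step:
        for j in range(d):
            A = aug.copy()  # move original j toward (and past) its knockoff
            A[:, j] = X[:, j] + lam * (X_tilde[:, j] - X[:, j])
            scores[j] += (base - acc_fn(A, y)) * step
            A = aug.copy()  # and symmetrically for knockoff column j
            A[:, d + j] = X_tilde[:, j] + lam * (X[:, j] - X_tilde[:, j])
            scores[d + j] += (base - acc_fn(A, y)) * step
    return scores

# toy check: a fixed classifier that uses only the first original feature
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
X_tilde = rng.normal(size=(1000, 2))  # stand-in "knockoffs" for illustration
y = (X[:, 0] > 0).astype(int)
acc_fn = lambda A, y_: np.mean((A[:, 0] > 0).astype(int) == y_)
scores = swap_integral_scores(acc_fn, X, X_tilde, y)
```

Only the column the toy classifier actually uses accumulates a positive integral; the other three columns leave the accuracy untouched for every $\lambda$ and get a score of exactly zero.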
5 Experiments
We begin by noting that we will not be able to compare our DAG knockoff procedure (for the Gaussian mixture model) with the more general SCIP procedure, because the latter has no tractable formulation even for models as simple as a mixture of multivariate Gaussians.
The Need for Mixture Models
We give a representative example of a simple setting where simply using the empirical covariance matrix to generate the knockoffs (i.e. assuming a single Gaussian model) yields invalid knockoffs for which FDR is not controlled.
We generate data from a mixture of three correlated Gaussian random vectors in dimension 30. Approximating $P_X$ by a single Gaussian vector yields an empirical covariance matrix which is dominated by the diagonal (and therefore treats $X$ as having nearly independent coordinates). We generate a binary outcome $Y$ in a logistic regression setting (such that the log odds of $Y$ is a linear combination of the features of $X$). The linear combination is such that just the first 10 features of $X$ are non-null (i.e. have nonzero coefficients; red), the next 10 (null, black) features are correlated with the non-nulls, and the last 10 are independent nulls. During the knockoff procedure, the importance scores are the absolute values of the logistic regression coefficients of each feature, and the feature statistics are the differences of importance scores.
We run our knockoff procedure, modeling $X$ as a mixture of three multivariate Gaussian variables fitted via EM (Figure 2(a)), and as a single Gaussian (Figure 2(b)). Importance scores of the non-null original features are high in both cases, and those of the correlated null original features are also quite high. Importance scores of valid knockoffs cancel out the correlated null original features' importance scores, and therefore the adaptively chosen threshold avoids selecting nulls (Figure 2(a)). However, when modeling $X$ by a single Gaussian, the knockoffs corresponding to the null features (10-20) have importance scores close to 0 (Figure 2(b)). As a result these correlated null features are selected by the procedure and FDR control is broken. After 40 repetitions of the procedure (randomly generating each run the parameters of $P_X$ and resampling the data), we get FDR control at the target level when we use our mixture model knockoffs, and fail to control FDR whenever we fit the data with one single multivariate Gaussian (the empirical FDR exceeds the target).
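The data-generating process of this example can be sketched as follows. The parameters below (component means, shared covariance, coefficient scale) are illustrative choices of our own, not the exact ones behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_components = 30, 2000, 3

# three correlated Gaussian components sharing one covariance (for simplicity)
means = rng.normal(scale=3.0, size=(n_components, d))
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + np.eye(d)  # correlated, positive definite

z = rng.integers(0, n_components, size=n)  # mixture assignments
X = np.array([rng.multivariate_normal(means[k], Sigma) for k in z])

# logistic model: only the first 10 features are non-null
beta = np.zeros(d)
beta[:10] = rng.normal(size=10)
logits = np.clip(X @ beta, -30, 30)  # clip for numerical stability
p = 1.0 / (1.0 + np.exp(-logits))
y = rng.binomial(1, p)
```

Fitting a single Gaussian to `X` averages away the between-component structure, which is what produces the near-diagonal empirical covariance and, in turn, the invalid knockoffs described above.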
Feature Selection in Synthetic Data
We generate $X$ as mixtures of 5, 10 and 20 multivariate Gaussians, and $Y$ as polynomial functions of a subset of the dimensions of $X$ (these dimensions are the non-null features). From each such distribution, we draw samples of $X$ and $Y$, fit a mixture of Gaussians, and run our knockoff procedure. Figure 3(a-c) shows the power of five importance score methods; Swap Integral consistently achieves the best power. In all of our experiments, the empirical FDR is controlled at the target level (Appendix Figure 5).
Feature Selection in Real Data
Our real-world simulations are based on three datasets: the UK Biobank data, and the Bank Marketing and Polish Bankruptcy datasets from the UCI repository (Dheeru and Karra Taniskidou, 2017). We provide more information in Appendix D. As our first experiment, in order to have control over the non-null features, we created synthetic labels for each dataset by randomly selecting a subset of the features to be alternatives (the rest are nulls) and using noisy polynomials on these features to generate labels. Details of the data generation are described in Appendix C. We fit each dataset with a Gaussian mixture model, selecting the number of components via AIC (Akaike, 1992). AIC automatically selected 9, 15, and 25 clusters for the UK Biobank, Bank Marketing and Bankruptcy datasets, respectively. Then we apply our Gaussian mixture knockoff sampler to generate knockoffs for each dataset. Average results for repeating this procedure 100 times on each dataset are reported in Figure 3. Across the range of target FDRs, Swap Integral has at least as much power as (and sometimes much more than) the other feature statistic methods. It is important to notice that, as we work with real datasets that we cannot resample, our experiments cannot provide us with empirical FDR values in order to confirm FDR control. We expect that our model for $X$ is accurate enough so that the robustness of the procedure accounts for model inaccuracies.
As a discovery experiment, we run the knockoff procedure for feature selection on each of the three datasets with the real labels: cancer in UK Biobank, telemarketing success in Bank Marketing, and bankruptcy in Polish Bankruptcy. In all cases we train a neural network binary classifier with 4 hidden layers of size 100. All selected features are described in Appendix D. With real data, we do not have ground truth on which are the true alternative features. However, the knockoff-selected features make sense, and some of them agree with previously reported literature. With a target FDR of 0.3 (it can be shown that for small target FDR values the knockoff procedure requires a minimum number of rejections, which is why we selected a fairly large target FDR), in UK Biobank cancer prediction the knockoff procedure selected 14 out of the total 284 features. These 14 include “Number of smoked cigarettes daily” and “Wine intake”. Eight out of ten features in the Bank Marketing dataset were selected. For bankruptcy prediction, the knockoff procedure selected features including gross profit and short-term liabilities.
6 Discussion
The knockoff procedure is a powerful framework for selecting important features while statistically controlling false discoveries. The two main challenges of applying knockoffs in practice are the difficulty of generating valid knockoffs and the choice of importance scores. This paper takes a step toward addressing both challenges. We develop a framework to efficiently sample valid knockoffs from Bayesian networks. We show that even a simple application of this framework, generating knockoffs from Gaussian mixture models, already enables us to apply the procedure to several real datasets with good reason to believe that FDR is controlled. We also systematically evaluate and compare the statistical power of several importance scores. We propose a new score, Swap Integral, which is stable and substantially more powerful. Swap Integral can be applied on top of any classification model, including neural networks and random forests. Note that Swap Integral is a post-processing step on top of a single trained classifier and hence is computationally efficient. Our results enable knockoffs to be more practically useful and open many doors for future investigation.
Acknowledgments
J.R.G. and A.G. were supported by a Stanford Graduate Fellowship. J.Z. is supported by a Chan Zuckerberg Biohub Investigator grant and National Science Foundation (NSF) Grant CRII 1657155.
References
 Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Mallows (1973) Colin L Mallows. Some comments on $C_p$. Technometrics, 15(4):661–675, 1973.
 Akaike (1974) Hirotugu Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
 Raftery (1986) Adrian E Raftery. Choosing models for cross-classifications. American Sociological Review, 51(1):145–146, 1986.
 Sur and Candès (2018) Pragya Sur and Emmanuel J Candès. A modern maximumlikelihood theory for highdimensional logistic regression. arXiv preprint arXiv:1803.06964, 2018.
 Candès et al. (2018) Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2018.
 Sesia et al. (2017) Matteo Sesia, Chiara Sabatti, and Emmanuel J Candès. Gene hunting with knockoffs for hidden markov models. arXiv preprint arXiv:1706.04677, 2017.
 Barber et al. (2018) Rina Foygel Barber, Emmanuel J Candès, and Richard J Samworth. Robust inference with knockoffs. arXiv preprint arXiv:1801.03896, 2018.
 Barber et al. (2015) Rina Foygel Barber, Emmanuel J Candès, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
 Bacharoglou (2010) Athanassia Bacharoglou. Approximation of probability distributions by convex mixtures of gaussian measures. Proceedings of the American Mathematical Society, 138(7):2619–2628, 2010.
 Poritz (1988) Alan B Poritz. Hidden Markov models: A guided tour. In 1988 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-88), pages 7–13. IEEE, 1988.
 Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
 Friedman et al. (1997) Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2–3):131–163, 1997.
 Lipton (2016) Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
 Ghorbani et al. (2017) Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.
 Baehrens et al. (2010) David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
 Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
 Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Dheeru and Karra Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Akaike (1992) Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Breakthroughs in statistics, pages 610–624. Springer, 1992.
 Sudlow et al. (2015) Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.
 Moro et al. (2014) Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.
 Zięba et al. (2016) Maciej Zięba, Sebastian K Tomczak, and Jakub M Tomczak. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications, 58:93–101, 2016.
Appendix
Appendix A Proofs
A.1 Validity of Importance Scores with a Random Component
Following the notation of Candès et al. (2018) for Lemma 3.3, denoting by $W_{\text{swap}(S)}$ the full vector of feature statistics obtained after swapping the features in a subset $S$ with their knockoffs, the flip-sign property can be summarized as $W_{\text{swap}(S)} = W \odot \epsilon_S$, where $\odot$ is the elementwise vector multiplication and $\epsilon_{S,j} = -1$ if $j \in S$ and $+1$ otherwise. As discussed by Candès et al. (2018), it should be highlighted that the final selection procedure controls FDR precisely because of this property. Now, inspecting the proof of Lemma 3.3 by Candès et al. (2018), we observe that it relies on the flip-sign property only as an equality in distribution. Therefore, with the exact same proof, the result still holds when the feature statistics satisfy the previous equality only in distribution. This allows us to construct valid feature statistics based on random components that are not limited to the randomness in the data itself. We can therefore also construct randomized statistics, and we prove that the constraint mentioned earlier only needs to hold in distribution for the flip-sign condition to hold in distribution.
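To make concrete how flip-sign statistics translate into a selection rule, here is a minimal NumPy sketch of the knockoff+ threshold of Barber et al. (2015); the function name and interface are our own illustration, not code from the paper.

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Knockoff+ selection: given flip-sign feature statistics W
    (one entry per feature), find the smallest threshold t whose
    estimated false discovery proportion
        (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t})
    is at most the target level q, and select features with W_j >= t."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: absolute values of the nonzero statistics.
    ts = np.sort(np.unique(np.abs(W[W != 0])))
    for t in ts:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]  # selected feature indices
    return np.array([], dtype=int)      # no threshold passes: select nothing
```

A large positive $W_j$ is evidence that feature $j$ beats its knockoff; the sign-flip symmetry of the nulls is what makes the numerator a valid (conservative) estimate of false discoveries.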
Proposition A.1.
Assume that the following equality holds in distribution for any subset $S$, conditionally on the data:
$$w([X, \tilde{X}]_{\text{swap}(S)}, y) \stackrel{d}{=} w([X, \tilde{X}], y) \odot \epsilon_S.$$
Then we have the following equality in distribution, for any subset $S$ of null features:
$$W \stackrel{d}{=} W \odot \epsilon_S. \qquad (1)$$
Proof.
It suffices to show the result for a single swapped coordinate, $S = \{j\}$, as the general case can be decomposed as a concatenation of swaps of one coordinate at a time.
∎
A.2 Proof of Proposition 3.1: GMM Knockoff Sampling Procedure
We prove Proposition 3.1; the exact same proof applies in the more general setting of Algorithm 2 in the next section.
Proof.
We consider the marginal distribution over $(X, \tilde{X})$ by summing the full joint distribution over all possible values of the latent mixture assignment. We then decompose the joint distribution along the sampling steps.
This proves exchangeability, as the last line satisfies exchangeability in $(X, \tilde{X})$. ∎
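As a concrete illustration of the two sampling steps analyzed in this proof, here is a NumPy sketch of drawing one knockoff from a fitted Gaussian mixture: resample the latent component from its posterior, then draw a within-component knockoff using the standard Gaussian knockoff conditional of Candès et al. (2018). The conservative choice $D = \lambda_{\min}(\Sigma) I$ for the diagonal matrix is our own simplification for the sketch.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x (small-dimension helper)."""
    d = x - mu
    p = len(x)
    quad = d @ np.linalg.solve(Sigma, d)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_knockoff(x, weights, mus, Sigmas, rng):
    """Draw one knockoff for sample x under a fitted Gaussian mixture:
    (1) resample the latent component from its posterior given x;
    (2) draw a within-component Gaussian knockoff. Taking D equal to
    the smallest eigenvalue of Sigma times the identity keeps the
    conditional covariance 2D - D Sigma^{-1} D positive definite."""
    K, p = len(weights), len(x)
    post = np.array([weights[k] * gauss_pdf(x, mus[k], Sigmas[k])
                     for k in range(K)])
    post /= post.sum()
    k = rng.choice(K, p=post)          # local knockoff of the latent class
    mu, Sigma = mus[k], Sigmas[k]
    lam_min = np.linalg.eigvalsh(Sigma)[0]
    D = lam_min * np.eye(p)
    Sinv = np.linalg.inv(Sigma)
    cond_mean = x - D @ Sinv @ (x - mu)
    cond_cov = 2 * D - D @ Sinv @ D
    return rng.multivariate_normal(cond_mean, cond_cov)
```

With a single component this reduces to the usual Gaussian knockoff sampler; with several components, the latent assignment is marginalized out exactly as in the proof above.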
A.3 Comparison of Algorithm 1 and SCIP
The main motivation for our Bayesian network knockoff sampling method is the intractability of SCIP for a general feature distribution, as mentioned in Section 3.
Indeed, SCIP sequentially samples the knockoff $\tilde{X}_j$ of the $j$th feature from the conditional distribution of $X_j$ given $(X_{-j}, \tilde{X}_{1:j-1})$. That means that, at each step of the sampling process and for each sample, one needs to compute the conditional distribution of the current feature given all the other features and the knockoffs sampled so far, which is not computationally feasible if we assume a complex model for the feature distribution.
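To illustrate the per-step cost, in the special case of jointly Gaussian features each SCIP step reduces to computing a conditional Gaussian with the knockoffs sampled so far appended to the conditioning block. The following NumPy helper (our own sketch, not code from the paper) computes such a conditional:

```python
import numpy as np

def conditional_gaussian(mu, Sigma, obs_idx, obs_val):
    """Mean and covariance of the remaining coordinates of a
    N(mu, Sigma) vector, conditioned on the coordinates in obs_idx
    taking the values obs_val. One such computation is needed at
    every SCIP step, for every sample."""
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    free = np.setdiff1d(np.arange(len(mu)), obs_idx)
    S_ff = Sigma[np.ix_(free, free)]
    S_fo = Sigma[np.ix_(free, obs_idx)]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    K = S_fo @ np.linalg.inv(S_oo)
    cond_mu = mu[free] + K @ (np.asarray(obs_val, float) - mu[obs_idx])
    cond_Sigma = S_ff - K @ S_fo.T
    return cond_mu, cond_Sigma
```

Outside the Gaussian case there is no such closed form, which is exactly the intractability discussed above.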
Another difference is shown by the following: suppose that we observe a variable $X$ for which we want to sample a knockoff, and that its distribution is conditioned on a latent variable $Z$. If we assume that we can easily construct the conjugate conditionals for $Z$ and $X$, then Algorithm 1 simplifies into Algorithm 2.
We show in this simple setting that $(\tilde{Z}, \tilde{X})$ is not a knockoff of $(Z, X)$. We write the joint distribution and decompose it along the sampling steps.
To prove exchangeability of $(X, \tilde{X})$, we have to marginalize over the hidden states.
Only once we marginalize out the hidden states do we reach an expression where exchangeability is satisfied for $(X, \tilde{X})$. Before marginalization it is not, and therefore $(\tilde{Z}, \tilde{X})$ is not a knockoff of $(Z, X)$. In SCIP, by contrast, all the random variables sampled are part of the final knockoff sample.
Our procedure also differs from SCIP insofar as it is "modular" in each local conjugate conditional. The choice of each conjugate conditional is not unique, and poor choices yield local knockoffs that are too "close" to the initial sample, decreasing the power of the procedure. The worst option, using a feature as its own knockoff (i.e., $\tilde{X}_j = X_j$), still gives valid knockoffs, though it discards any possibility for that feature to be selected. But this is also why the procedure is flexible: in cases where a conditional has no closed-form expression because of complex dependencies, we can locally choose poor conjugates and continue the procedure, so that we still obtain valid knockoff samples. This is not possible when running SCIP directly, as one must sample from a complex conditional distribution that is predetermined.
We can analyze how the previous examples make use of this freedom in the choice of the conjugate conditional. In the Gaussian mixture case, we could choose $\tilde{X} = X$, in which case our knockoff for $X$ would be $X$ itself, and the knockoff procedure would be powerless. In the HMM setting, after sampling the hidden nodes from the posterior given the observed ones, we could choose to keep those same nodes as local knockoffs instead of sampling different ones; in that case the final knockoff we obtain is still different from $X$. However, one can expect this choice to produce knockoffs with lower power, as the knockoff samples stay "closer" to the true samples. The intuition is that by leveraging the knowledge we have about the generative process, we can sample more powerful knockoffs.
For the Hidden Markov Model, we sample as follows:

1. First, we sample the hidden states from their posterior given the observed states. Conditionally on the observations, the distribution of the hidden states is that of a Markov chain.

2. Next, we sample a new knockoff Markov chain of hidden states via SCIP.

3. Finally, we sample the knockoff observations, which are constructed based on the distribution of the observed states given the hidden states. Because of the structure of the HMM, the observed states are independent conditionally on the hidden states. If the observed states are univariate, we can further simplify the conjugate conditional and sample each knockoff observation from the emission distribution of its knockoff hidden state (see comments after Definition 3.1).
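Steps 1 and 3 above can be sketched for a discrete HMM as follows; the notation (initial distribution `pi0`, transition matrix `A`, emission matrix `B`) is our own, and the SCIP step producing the knockoff hidden chain (Sesia et al., 2017) is assumed given.

```python
import numpy as np

def sample_hidden_given_obs(x, pi0, A, B, rng):
    """Step 1: draw hidden states from their posterior given the
    observations x, via forward filtering / backward sampling.
    pi0: (S,) initial distribution, A: (S, S) transition matrix,
    B: (S, M) discrete emission probabilities, x: length-T sequence."""
    T, S = len(x), len(pi0)
    alpha = np.zeros((T, S))
    alpha[0] = pi0 * B[:, x[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # forward filtering
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        alpha[t] /= alpha[t].sum()
    h = np.zeros(T, dtype=int)
    h[T - 1] = rng.choice(S, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):             # backward sampling
        w = alpha[t] * A[:, h[t + 1]]
        h[t] = rng.choice(S, p=w / w.sum())
    return h

def resample_emissions(h_knock, B, rng):
    """Step 3: given a knockoff hidden chain (produced by SCIP),
    draw each observation independently from its emission law."""
    return np.array([rng.choice(B.shape[1], p=B[s]) for s in h_knock])
```

Because the observations are conditionally independent given the hidden chain, step 3 factorizes over time, which is what makes this construction tractable.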
Example: Sampling Knockoffs for LDA
As an additional example, we describe how Algorithm 1 generates knockoffs for Latent Dirichlet Allocation (LDA). We use the same notation for LDA as Blei et al. (2003): the $n$th word $w_{dn}$ in document $d$ is sampled from a multinomial distribution parametrized by its topic, where $z_{dn}$ corresponds to the topic assignment for $w_{dn}$. The topic assignment is sampled from a multinomial distribution parametrized by $\theta_d$, the distribution of topics in document $d$. Finally, $\theta_d$ for each document is sampled from a Dirichlet distribution with hyperparameter $\alpha$.

The first step in building knockoffs is to learn the parameters of the model, which can be done by variational EM (Blei et al., 2003). Then, we need to sample the hidden variables $\theta$ and $z$ given the observed words $w$: this is an inference problem for which direct computation is intractable, but we can approximate the posterior distribution via standard variational Bayes methods.

Next, we sample local knockoffs. This is exactly analogous to one pass of Gibbs sampling over the whole DAG following a topological ordering, except that instead of sampling from the conditional distribution of each node given its Markov blanket, we sample from the conjugate conditional distribution, conditioning on the appropriate variables as explained in Algorithm 1. We sample each $\tilde{\theta}_d$ from its conjugate conditional distribution. However, as $\theta_d$ is Dirichlet-distributed, any given coordinate is determined by the others, so the only possible choice is to set $\tilde{\theta}_d = \theta_d$. Then, as each topic assignment is univariate, its conjugate conditional simplifies too, so that we just sample from the local conditional probability, and similarly for the knockoff words.
A.4 Proof of Theorem 3.2: DAG Knockoff Sampling Procedure
Proof.
The joint probability distribution can be decomposed as follows by following the sampling steps:
(2) 
In order to show that $(X, \tilde{X})$ is exchangeable, we want to show that if we marginalize this joint distribution over the hidden states, we get an exchangeable distribution.
We first show that, iterating recursively over all the nodes and summing over all values of the hidden states, we obtain the following expression.
(3) 
For simplicity, here we consider discrete random variables, so that marginalizing the joint distribution over a variable means summing over all its possible values. Everything stays valid for continuous random variables, replacing sums by integrals. Starting from equation (2), which corresponds to the first step, we proceed sequentially to get to (3). Suppose that at a given step we have an equality where the left-hand term is the product of the right-hand terms. The key element is that, by following the topological order, at each step the current variable only appears in the joint probability and in its conjugate conditional term. (Notice that if the current variable corresponds to an observed node, it has no descendants; therefore the Markov blanket of such a node is a subset of the nodes with smaller index in the topological ordering.) We isolate these two terms and start by writing the joint probability as a conditional probability. By definition of the Markov blanket, we can simplify the expression of the conditional probability. We then obtain two terms that are conjugate in the exchangeable sense.
We swap these two terms in the previous expression and obtain the following product:
If the current node still has descendants, we sum both sides of the equality over its values; the variable now only appears in the last term, and summing over it gives 1. If we reach a node with no descendants, i.e., an observed node, we do not marginalize it out. However, we have the following simplification:
In both cases, we get to the next step of the recursion. After completing the last step, we arrive at equation (3). Notice that this expression is exchangeable in $(X, \tilde{X})$, for every assignment of the hidden states. Indeed, an observed node and its knockoff do not appear in the last product term, as two observed nodes cannot be in the Markov blanket of each other; they appear only in the conjugate probabilities, so exchangeability in that pair holds. Again, since two observed nodes cannot appear in the Markov blanket of each other, this swap can be repeated for different observed indices, hence the exchangeability of the expression. This symmetry holds at fixed values of the hidden states, and therefore still holds when we sum over them. Hence $(X, \tilde{X})$ satisfies exchangeability. ∎
Appendix B Robustness of the Procedure to Model Misspecification
As explained when the Model-X knockoff procedure was first introduced in Candès et al. (2018), instead of considering a model for the conditional distribution of the response given the features, all the assumptions relate to modeling the feature distribution: the burden of knowledge shifts from $Y \mid X$ to $X$. In the same way that valid p-values rely on assumptions about $Y \mid X$ (parametric model, noise distribution, asymptotic regime, etc.), valid knockoffs rely on assumptions on the distribution of $X$: mainly, that we can approximate it very well.
When we generate knockoffs based on a Gaussian mixture model, and more generally a Bayesian network, we assume that these probabilistic models are good approximations of the feature distribution and that they can be properly fitted. This is a very strong assumption: not only may the model we use be incorrect, but the estimated parameters depend on the fitting procedure, which sometimes provides only an approximation to the actual distribution encoded by the Bayesian network (as when using variational inference). Commonly used optimization methods such as Expectation-Maximization (EM) can also get stuck in local optima. However, the knockoff procedure is remarkably robust to these issues. Existing theoretical robustness bounds (Barber et al., 2018) are based on controlling the KL divergence of the model with respect to the true feature distribution. This can help explain why EM in our method, and hopefully other fitting procedures, yield knockoffs that are approximately valid: these methods minimize the KL divergence of the model with respect to the distribution of the data.
Appendix C Synthetic Data Generation and FDR Control
As an example, we provide simulations showing the empirical FDR for a mixture of t-distributions in Figure 4 (at a fixed target FDR level), as a function of the number of Gaussian mixture components we use to model the distribution. The conclusion is that, with enough components, the knockoff procedure is able to control FDR, even though a mixture of Gaussians cannot exactly represent a mixture of t-distributions.
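The empirical FDR in such an experiment is the average, over repetitions, of the false discovery proportion of each run; a minimal helper (our own sketch, with hypothetical argument names) is:

```python
def empirical_fdp(selected, nonnull):
    """False discovery proportion of one run: the fraction of the
    selected features that are in fact null (not in the nonnull set).
    Averaging this quantity over many repetitions estimates the FDR."""
    selected = set(selected)
    if not selected:
        return 0.0  # no selections means no false discoveries
    false = sum(1 for j in selected if j not in set(nonnull))
    return false / len(selected)
```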
To generate a synthetic data set with a given number of samples, features, mixture components, and classes, we implemented the following steps:


We generate random values for the means and covariance matrices (using scikit-learn