Knockoffs for the mass: new feature importance statistics with false discovery guarantees

07/17/2018 ∙ by Jaime Roquero Gimenez, et al. ∙ Stanford University

An important problem in machine learning and statistics is to identify features that causally affect the outcome. This is often impossible to do from purely observational data, and a natural relaxation is to identify features that are correlated with the outcome even conditioned on all other observed features. For example, we want to identify that smoking really is correlated with cancer conditioned on demographics. The knockoff procedure is a recent breakthrough in statistics that, in theory, can identify truly correlated features while guaranteeing that the false discovery rate is controlled. The idea is to create synthetic data - knockoffs - that capture correlations amongst the features. However, there are substantial computational and practical challenges to generating and using knockoffs. This paper makes several key advances that make the application of knockoffs more efficient and powerful. We develop an efficient algorithm to generate valid knockoffs from Bayesian networks. Then we systematically evaluate knockoff test statistics and develop new statistics with improved power. The paper combines new mathematical guarantees with systematic experiments on real and synthetic data.


1 Introduction

Identifying important features is a ubiquitous problem in machine learning and statistics. In relatively simple settings, the importance of a given feature is measured via the fitted parameters of some model. In Generalized Linear Models (GLMs), for example, users often add a LASSO penalty (Tibshirani, 1996) to promote sparsity in the coefficients, and subsequently select those features with non-zero coefficients. Stepwise procedures, where we sequentially modify the model, are another way of doing feature selection (Mallows, 1973; Akaike, 1974; Raftery, 1986).

These standard methods are all plagued by correlations between the features: a feature that is not really relevant for the outcome, in a precise sense we will define in Section 2, can be selected by LASSO or a stepwise procedure because it is correlated with relevant features. The feature selection problem becomes even more difficult in complex settings where we may not have a clean parametric model relating input features to the labels. Moreover, in these settings we usually lack statistical guarantees on the validity of the selected features. Finally, even procedures with statistical guarantees usually depend on having valid p-values, which rely on a correct model for the conditional distribution of Y given X and (sometimes) on some asymptotic regime. However, there are many common settings where these assumptions fail and we cannot perform inference based on those p-values (Sur and Candès, 2018).

A powerful new approach called the Model-X knockoff procedure (Candès et al., 2018) has recently emerged to deal with these issues. This method introduces a new paradigm: we no longer assume any model for the conditional distribution of Y given X in order to do feature selection (and therefore do not compute p-values), but we assume that we have full knowledge of the feature distribution P_X – or at least that we can accurately model it. This knowledge of the ground truth allows us to sample new knockoff variables satisfying some precise distributional conditions. Although we make no assumption on Y | X, we can run any procedure on the data set extended with the knockoffs and perform feature selection while controlling the False Discovery Rate (FDR). Statistical guarantees no longer rely on valid p-values, but exclusively on having valid knockoffs, which we obtain whenever we know P_X (or our model for P_X is correct).

There are two main obstacles to using knockoffs in practice (i.e. “for the mass”): 1) tractably generating valid knockoffs and 2) once we have generated knockoffs, computing powerful test statistics. Current tractable methods for generating knockoffs are restricted to settings where X is modeled as a multivariate Gaussian (Candès et al., 2018) or as the set of observed nodes in a Hidden Markov Model (HMM) (Sesia et al., 2017). Even though knockoff methods enjoy some robustness properties when approximating P_X (Barber et al., 2018), the guarantee on FDR control breaks down even in very simple settings if these approximations turn out to be too crude, as we demonstrate in this paper. The Sequential Conditional Independent Pairs (SCIP) algorithm (Candès et al., 2018) is a universal knockoff sampling scheme but is computationally intractable as soon as we model P_X with more complex distributions. Constructing tractable knockoff sampling procedures for more flexible classes of distributions is essential to improve the validity of knockoffs, which is always subject to the quality of the approximation. Regarding test statistics, current methods (Barber et al., 2015; Candès et al., 2018; Sesia et al., 2017; Barber et al., 2018) focus on LASSO-based statistics to obtain proxies for the importance of any given feature: such methods need to fit a GLM to the data, which is too restrictive in many supervised learning tasks.

Our Contributions.

First, we formulate a novel tractable procedure for sampling knockoffs in settings where the features can be modeled as the observed variables in a Bayesian network. This allows for great flexibility when modeling the feature distribution P_X. We show that this procedure is different from SCIP, the current state-of-the-art method for sampling knockoffs. We construct valid knockoffs in settings where previous knockoff sampling procedures assumed a very restrictive model for P_X and therefore failed to control FDR, and we provide a unified framework for several different sampling procedures. In addition, we systematically evaluate and compare several nonlinear knockoff test statistics that can be applied in general supervised learning problems. We develop a new statistic, Swap Integral, which has significantly better power than other methods. These two advances enable us to perform feature selection using black-box classifiers to represent Y | X while still retaining statistical guarantees under correct modeling of P_X.

2 Knockoff Background

We begin by introducing the usual setting of feature selection procedures. We consider the data as a sequence of n i.i.d. samples (X^(i), Y^(i)) from some unknown joint distribution P_{XY}, where X^(i) = (X_1^(i), ..., X_d^(i)) is the feature vector and Y^(i) the label. We then define the set of null features H_0 by j ∈ H_0 if and only if X_j ⊥ Y | X_{-j} (where the subscript -j indicates all variables except the jth). The non-null features, also called alternatives, are important because they capture the truly influential effects: each non-null feature is correlated with the label even conditioned on the rest of the features. Our goal is to identify these non-null features. Running the knockoff procedure gives us a selected set Ŝ while controlling the False Discovery Rate (FDR), the expected proportion of false discoveries: FDR = E[ |Ŝ ∩ H_0| / max(|Ŝ|, 1) ].

Assuming we know the ground truth for the distribution P_X (we will come back to the realistic setting where we only have samples from P_X), the first step of the knockoff procedure is to obtain a knockoff sample X̃ that satisfies the following conditions:

Definition 2.1 (Knockoff sample).

A knockoff sample X̃ of a d-dimensional random variable X is a d-dimensional random variable such that two properties are satisfied:

  • Conditional independence: X̃ ⊥ Y | X

  • Exchangeability: (X, X̃)_{swap(S)} =_d (X, X̃) for every subset S ⊆ {1, ..., d}

where =_d stands for equality in distribution and the notation (X, X̃)_{swap(S)} refers to the vector where the original jth feature and the jth knockoff feature have been transposed whenever j ∈ S.

The first condition is immediately satisfied as long as knockoffs are sampled conditioned on the sample X without considering any information about Y, which will always be the case in our sampling methods. More generally, we say that any X̃ such that (X, X̃)_{swap(S)} =_d (X, X̃) for every S satisfies the exchangeability property.

Once we have the knockoff sample X̃, the next step of the procedure constructs feature statistics W = (W_1, ..., W_d), such that a high value of W_j is evidence that the jth feature is non-null. Feature statistics described by Candès et al. (2018) depend only on ([X, X̃], y), so that for each j we can write W_j = w_j([X, X̃], y) for some function w_j. The only restriction these statistics must satisfy is the flip-sign property: swapping the jth feature and its corresponding knockoff feature should flip the sign of the statistic while leaving the other feature statistics unchanged. More formally, for a subset S of features, denoting [X, X̃]_{swap(S)} the data matrix where the original jth variable and its corresponding knockoff have been transposed whenever j ∈ S, we require that w_j satisfies:

w_j([X, X̃]_{swap(S)}, y) = −w_j([X, X̃], y) if j ∈ S, and w_j([X, X̃]_{swap(S)}, y) = w_j([X, X̃], y) otherwise.

As suggested by Candès et al. (2018), the choice of feature statistics can be done in two steps: first, find a statistic (Z, Z̃) = (Z_1, ..., Z_d, Z̃_1, ..., Z̃_d) where each coordinate corresponds to the “importance” — hence we will call them importance scores — of the corresponding feature (either original or knockoff). For example, Z_j could be the absolute value of the regression coefficient of the jth feature.

After obtaining the importance scores Z_j and Z̃_j for the original and knockoff feature, we compute the feature statistic W_j = Z_j − Z̃_j. The intuition is that the importance scores of knockoffs serve as a control, such that a larger importance score for the jth feature compared to that of its knockoff implies a larger W_j (and is therefore evidence against the null). Given that feature statistics of the null features behave “symmetrically” around 0 by the flip-sign property, the final selection procedure compares, at a given threshold t, the number of features such that W_j ≥ t vs. the number of those such that W_j ≤ −t (Candès et al., 2018). By keeping this ratio “under control” for a given target FDR that we fix in advance, we obtain an adaptive procedure that selects the features whose W_j value is above some adaptive threshold. FDR control is guaranteed with this procedure, as long as the knockoffs are sampled correctly (i.e., satisfying the two conditions), regardless of the choices made to construct the feature statistics (Candès et al., 2018).
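To make the final selection step concrete, the following is a minimal Python sketch of the adaptive threshold rule (the knockoff+ variant described by Candès et al., 2018); the function name and array interface are our own illustration.

import numpy as np

def knockoff_select(W, q=0.1):
    # Knockoff+ rule: smallest threshold t whose estimated false
    # discovery proportion (1 + #{W_j <= -t}) / #{W_j >= t} is <= q.
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.where(W >= t)[0]      # selected feature indices
    return np.array([], dtype=int)          # no threshold meets the target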

The power of the knockoff procedure is that it makes no assumptions on the relationship between X and Y; however, it requires full knowledge of the feature distribution P_X, which is often not known and needs to be modeled. It is also important to notice that, for any model we use for P_X, we are required to know how to sample knockoffs from that model distribution. Current classes of distributions with a tractable knockoff sampling method have limited expressivity, and we can find simple settings where using models limited to these classes fails to provide valid knockoffs, and consequently the FDR is not controlled. We provide an example of such a setting in Section 5. However, the knockoff procedure is robust to small errors in the estimation of P_X (Barber et al., 2018). Our method for sampling knockoffs from a Bayesian network allows us to fit a better distribution to the data, resulting in better control of FDR up to some error term with respect to the FDR we target (under the assumption that our model distribution for P_X is perfect, this procedure controls FDR at the target level). We discuss in Appendix B the reasons why we expect our method to be robust to model misspecification.

Note that setting X̃ = X satisfies the knockoff properties, but no discovery would be made since all of the W_j would be zero. Therefore, in addition to generating valid knockoffs in order to control FDR, it is also important to consider the statistical power of the procedure, which is the probability that a true alternative is selected. This depends not only on the knockoff sampling procedure, but also on the choice of the importance scores. We discuss power in detail in Sections 4 and 5.

3 Generating Knockoffs From Latent Variable Models

We first show how to generate knockoff samples from a Gaussian mixture model, which is already a new contribution, as a warm-up to our general procedure for generating knockoffs from latent variable models. This section relates only to the first step of the Model-X knockoff procedure. We consider different classes of distributions used to model P_X; we do not consider the response Y in this step.

Gaussian Mixture Knockoffs

The setup:

  • K is a categorical latent variable taking values in {1, ..., M}, where M is the number of mixture components.

  • μ_k and Σ_k are the mean and covariance for component k, which are known or estimated.

  • We observe an R^d-valued random variable X with distribution X | K = k ~ N(μ_k, Σ_k).

For a multivariate Gaussian distribution N(μ, Σ), we can explicitly write a joint density on (X, X̃) that satisfies the exchangeability property: (X, X̃) is jointly Gaussian with mean (μ, μ) and covariance [[Σ, Σ − D], [Σ − D, Σ]], where D is a diagonal matrix with the constraint that the joint covariance matrix is positive definite (Barber et al., 2015). The Gaussian mixture knockoff sampling procedure consists of first sampling the mixture assignment variable K from the posterior distribution P(K | X). The knockoff is then sampled from the conditional distribution of the knockoff given the true variable and the sampled mixture assignment, defined as P(X̃ | X, K = k). More explicitly, X̃ | X, K = k ~ N(μ_k + (Σ_k − D_k)Σ_k^{-1}(X − μ_k), 2D_k − D_kΣ_k^{-1}D_k).
Proposition 3.1.

The Gaussian mixture knockoff sampling procedure stated above is such that X̃ is a valid knockoff of X.

The proof is deferred to Appendix A.2. Once we compute the necessary parameters of these conditional distributions, we can sample valid knockoffs for a Gaussian mixture model. Whenever we no longer know the ground truth for P_X, Gaussian mixtures can approximate any continuous distribution arbitrarily well (Bacharoglou, 2010), so we now have a very flexible model for the feature distribution from which we know how to tractably sample knockoffs. In a real setting, the validity of our knockoffs will be limited by the quality of our model's approximation of P_X, in the same way that, in a parametric model of Y | X, the validity of the p-values depends on the validity of the model and distributional assumptions (on the eventual random noise, on the asymptotic regime, etc.). Next we show how to efficiently generate valid knockoffs for general Bayesian networks, which can be used to accurately model P_X in many settings.
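As a concrete illustration, here is a minimal Python sketch of the two-step sampler described above for a fitted Gaussian mixture. The helper name and the simple choice D_k = s·I are our own; any diagonal D_k keeping the joint covariance positive semidefinite would do.

import numpy as np
from scipy.stats import multivariate_normal

def sample_gmm_knockoff(x, weights, mus, Sigmas, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = len(x)
    # Step 1: sample the mixture assignment from its posterior P(K | X = x)
    log_post = np.array([np.log(w) + multivariate_normal.logpdf(x, m, S)
                         for w, m, S in zip(weights, mus, Sigmas)])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    k = rng.choice(len(weights), p=post)
    # Step 2: sample from the Gaussian knockoff conditional of component k,
    # with D = s*I and s < 2 * lambda_min(Sigma_k) so that the joint
    # covariance matrix stays positive semidefinite
    mu, Sigma = mus[k], Sigmas[k]
    s = 0.9 * 2.0 * np.linalg.eigvalsh(Sigma).min()
    D = s * np.eye(d)
    Sinv = np.linalg.inv(Sigma)
    cond_mean = mu + (Sigma - D) @ Sinv @ (x - mu)
    cond_cov = 2 * D - D @ Sinv @ D
    return rng.multivariate_normal(cond_mean, cond_cov)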

Exchangeable Conjugate

Our knockoff sampling procedures are based on the following idea: instead of keeping long-range dependencies across all features as SCIP does, we exploit the structure given by the local conditional distributions of a Bayesian network. Sampling knockoffs then becomes as hard as doing probabilistic inference on the Bayesian network, plus sampling “local” knockoffs based on the local structure.

Definition 3.1.

Let V and H be arbitrary multi-dimensional random variables. For any given conditional distribution P_{V|H}, we say that P_{Ṽ|V,H} is an exchangeable conjugate conditional if it is a probability kernel such that, for any fixed value h of H, the pair (V, Ṽ) satisfies the exchangeability property conditionally on H = h.

In other words, if V is sampled conditionally on H, and Ṽ is then sampled from the probability kernel P_{Ṽ|V,H}, then Ṽ is a knockoff of V (conditionally on H). In some cases it is easy to find examples of such a conjugate conditional distribution. If V is univariate, then taking P_{Ṽ|V,H}(· | v, h) = P_{V|H}(· | h), i.e. resampling Ṽ independently from the same conditional distribution, is a valid conjugate conditional. More generally, this choice is valid if the coordinates of V are independent conditionally on H. We can also construct valid conjugates when the distribution of V conditionally on H is Gaussian (Candès et al., 2018) or a Markov chain (Sesia et al., 2017).

Sampling Knockoffs for Bayesian Networks

We define a procedure for sampling knockoffs when we assume a generative model for X: X corresponds to the observed variables in a latent variable model which we assume can be represented as a Bayesian network (BN). Gaussian mixture models, Hidden Markov Models (HMMs) (Poritz, 1988), Latent Dirichlet Allocation (LDA) (Blei et al., 2003), Naive Bayes (Friedman et al., 1997) and many other Bayesian models fit this description. Consider a BN defined by a directed acyclic graph (DAG) whose N nodes Z_1, ..., Z_N are indexed in a topological ordering. O and H refer respectively to the observed variables and the hidden ones: we have restated our initial set of features as X = (Z_j)_{j ∈ O}, and we want to construct knockoffs by leveraging our knowledge of the DAG. The only restriction is that the observed variables must correspond to nodes in the graph without descendants. It is important to notice that here the variables associated with a node of the DAG can be multi-dimensional, which sometimes simplifies the sampling procedure (an example is the knockoffs for HMMs, detailed in Appendix A.3).

We assume that we can sample from P_{H|O} (for simplicity we refer to the random variables by their indices), the conditional probability of the hidden variables given the observed ones. Assume also that, for every node j, we can compute the local conditional probability of Z_j given its Markov blanket, and that we can compute a knockoff conjugate distribution of this conditional and sample from it, which yields a “local” knockoff. We describe the knockoff generating procedure in Algorithm 1 and further discuss the assumptions on the inputs in Appendix A.3.

Input : Initial sample X, conditional distribution P_{H|O}, conjugate conditionals for every node j.
Output : Knockoff sample X̃
1 Sample H ~ P_{H|O}(· | X);
2 for j = 1 to N do
3       Sample a local knockoff Z̃_j from the conjugate conditional of Z_j given its Markov blanket;
4 end for
return X̃ = (Z̃_j)_{j ∈ O}
Algorithm 1 Knockoff Sampling Procedure in BN
Theorem 3.2.

The distribution of the concatenated random variables (X, X̃) satisfies the exchangeability property. That is, X̃ is a valid knockoff of X.

The proof is in Appendix A.4. This sampling method is not equivalent to the Sequential Conditional Independent Pairs (SCIP) algorithm introduced by Candès et al. (2018). Consider the simple case with one observed multi-dimensional variable V and one hidden variable H. SCIP would sample each coordinate of the knockoff Ṽ sequentially by computing (for each knockoff sample) conditional distributions of the joint probability distribution of V, without leveraging our knowledge of the generative process. That is not computationally feasible in an arbitrary generative model. Another difference with respect to SCIP is that our sampling procedure does not create a knockoff of the full set of hidden and observed variables: not all “local” knockoffs are retained in the final output. That is, (H̃, Ṽ) is not a valid knockoff for (H, V). We refer to Appendix A.3 for more details on these differences between methods.

The HMM knockoff sampling procedure (Sesia et al., 2017) and the Gaussian mixture sampling above are special cases of Algorithm 1 (detailed in Appendix A.3). Naive Bayes admits a straightforward knockoff sampling scheme (as the observed nodes are independent conditionally on the hidden node): given the observed nodes, generate a sample for the hidden node (based on the posterior distribution), then sample independent knockoffs based on the sampled hidden node, as sketched below. We provide in Appendix A.3 an additional example of a DAG knockoff generating procedure based on the LDA model.
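For illustration, here is a minimal Python sketch of this Naive Bayes scheme as an instance of Algorithm 1; the callables cond_logpdf and cond_sample are hypothetical stand-ins for a fitted class-conditional model.

import numpy as np

def naive_bayes_knockoff(x, prior, cond_logpdf, cond_sample, rng=None):
    # prior[k] = P(K = k); cond_logpdf(j, v, k) = log p(X_j = v | K = k);
    # cond_sample(j, k, rng) draws a fresh X_j given class k.
    rng = np.random.default_rng() if rng is None else rng
    d, m = len(x), len(prior)
    # Posterior over the hidden class given all observed features
    log_post = np.log(prior) + np.array(
        [sum(cond_logpdf(j, x[j], k) for j in range(d)) for k in range(m)])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    k = rng.choice(m, p=post)
    # Local knockoffs: features are independent given the class, so each
    # knockoff coordinate is an independent redraw from its conditional
    return np.array([cond_sample(j, k, rng) for j in range(d)])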

4 Power Analysis: Evaluating Importance Score Methods

Figure 1: Comparison Between Importance Score Methods. (a) Synthetic data where Y | X is a polynomial: using a linear LASSO regression model to generate importance scores results in near-zero power; our proposed Swap Integral statistic achieves higher power across all target FDRs. (b) Swap Integral and Permutation Integral achieve higher power than basic Swap and Permutation, respectively. (c) Example of a lambda path for the Swap method.

The previous section shows how to generate valid knockoffs for complex latent variable distributions. The next challenge is to compute importance scores for each original and knockoff feature; these were denoted Z_j and Z̃_j in Section 2 (recall that the feature statistic is W_j = Z_j − Z̃_j). While the knockoff procedure guarantees that the FDR is controlled for any importance score, the power of the procedure—i.e. how many true non-null features are discovered—is highly dependent on the method for computing the importance score. In this section we discuss several importance score methods applicable to a large family of predictive models and also define new ones.

LASSO Coefficient-Difference (LCD)

LCD statistics were used as feature statistics by Candès et al. (2018). However, for a non-linear Y | X, importance scores based on a linear model assumption result in low power, as LASSO is not able to fit the real input-output relationship (Figure 1(a)). A complex, unknown non-linear input-output relationship is an important restriction if we want to use a parametric model to obtain importance scores through interpretable parameters.

Saliency Methods

If Y | X is fitted using a neural network, one can compute saliency scores, a popular metric for feature importance (Lipton, 2016; Ghorbani et al., 2017). Saliency is computed for each feature of each example, and averaged across all examples to derive a global importance score. In our experiments, we use the Gradient (Baehrens et al., 2010; Simonyan et al., 2013) and Integrated Gradients (Sundararajan et al., 2017) saliency methods.
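As a framework-agnostic sketch, gradient saliency can be approximated with central finite differences (autodiff would be used in practice); predict_proba is a hypothetical callable returning the model's predicted class probability per sample.

import numpy as np

def gradient_saliency(predict_proba, X, eps=1e-4):
    n, d = X.shape
    sal = np.zeros(d)
    for j in range(d):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[:, j] += eps
        X_minus[:, j] -= eps
        # Finite-difference gradient of the prediction w.r.t. feature j
        grad_j = (predict_proba(X_plus) - predict_proba(X_minus)) / (2 * eps)
        sal[j] = np.abs(grad_j).mean()   # average saliency over examples
    return sal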

Importance Scores Based on Accuracy Drops

After a model is trained on the dataset, for each column X_j of the data matrix we create a new “fake” feature column, resulting in a “fake” set of features, for example by permuting the column across samples. Replacing each feature with its fake version one at a time, the resulting drop in accuracy can be used as the importance score for that feature.

Note that achieving high classifier performance on the initial dataset is not necessary to generate these importance scores, as our analysis is based on the accuracy drops. Classifiers with low predictive power can still be useful for identifying non-null features, and although they may not make many discoveries, FDR is still controlled.
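A minimal sketch of the accuracy-drop score, here with the permutation replacement discussed next and assuming a scikit-learn-style model.score(X, y) accuracy interface:

import numpy as np

def accuracy_drop_scores(model, X, y, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    base = model.score(X, y)                 # accuracy on the intact data
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_mod = X.copy()
        X_mod[:, j] = rng.permutation(X_mod[:, j])   # "fake" column j
        scores[j] = base - model.score(X_mod, y)     # drop = importance
    return scores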

Permutation and Swapping

Originally introduced for Random Forests (Breiman, 2001), the Permutation method creates each fake column by simply shuffling the entries of the corresponding column of the data matrix. Although the permutation method may be effective at creating a new “fake” feature that is independent of Y, it comes with an important issue: when evaluating the classifier on a dataset where just one of the features has been shuffled, for general feature distributions we may end up with fake data points that are completely off the original data distribution and, as a result, outside the learned manifold of the predictive model. The classifier's predictions in these off-distribution regions of the input space, where it has not been trained, may be arbitrary. As a result, performance can potentially drop across all features and their knockoffs, regardless of whether they actually had an impact on Y. A simple illustration of this phenomenon is provided in Appendix E.

The main problem occurs with the non-null features: an unexpectedly large drop in accuracy for the control knockoff feature can mask the non-null importance score, limiting the chance that the feature is selected. When the original distribution of the features is a mixture of Gaussians, permuting any of the features creates data points that do not lie on any of the mixture components of P_X, and as the number of mixture components increases, the chance that a fake data point is off-distribution increases. We confirm this phenomenon with the synthetic data experiments of Figure 3(a-c): power for the Permutation method strictly decreases as the number of mixture components grows.

To tackle this problem, we introduce the Swap method for importance scores. Instead of shuffling, for each original feature column we use its corresponding knockoff column as its fake replacement, and the other way around: X_j is replaced by X̃_j, and X̃_j by X_j. Keeping in mind the exchangeability condition satisfied by the knockoffs, this method guarantees that replacing each feature with its fake results in data points that still belong to the original data distribution, mitigating the problem of fake data points moving to regions of the space unknown to the predictive model.

Path Integration Generates Importance Scores with Better Power

For a given parameter λ, we compute for each original feature the fake replacement X_j^λ = (1 − λ)X_j + λX̃_j (and reciprocally for each knockoff feature). The basic Swap (or Permutation) method corresponds to λ = 1, and we plot the feature statistics as a function of the parameter λ, which we call the lambda path. We give an example of one such path in Figure 1(c) for a simulation where the classifier is a neural network with 3 fully connected layers, Y is categorical with 10 classes corresponding to the level sets of a random polynomial of degree 3 of the non-null features, and X comes from a mixture of 10 Gaussians. Similarly, we can take a lambda path for the Permutation method, where we compute the replacement by interpolating toward the shuffled feature. However, this lambda path quickly becomes unstable, as even for moderate values of λ we already leave the data manifold.

Basic Swap and Permutation (which correspond to λ = 1) do not always give the best separation between the null features (grey) and the alternate features (red). This motivates taking the integral of the lambda path and using the area under the curve as the feature statistic—we call this Swap Integral and, similarly, Permutation Integral. For all of our experiments, we integrate the lambda path from λ = 0 up to a fixed upper limit Λ. Figure 1(b) shows that, for the same data as used in Figure 1(c), the Swap Integral and Permutation Integral methods are respectively much more powerful than basic Swap and Permutation.

Our goal is to identify the classifier's decision boundaries within the data manifold (and thus to identify the relevant features). When considering a lambda path, everything comes down to choosing an appropriate direction in which to translate each feature, and keeping track of the decision boundaries crossed. We may temporarily step out of the data manifold (indeed there is no guarantee that the manifold is convex; our Gaussian mixture model is especially well-suited to illustrate this), and eventually we will leave it. The integral (up to Λ) assigns a higher importance to boundaries crossed early on. As the Swap Integral method “stays longer” in the data manifold early on (λ = 1 is still in it), it performs better than choosing an arbitrary direction (such as one given by a shuffled vector, i.e. the Permutation Integral). Knockoffs are not only useful for FDR control; they also contain information about the data manifold. The outcome of the following experiments using Swap Integral does not change substantially as long as Λ stays in the range 5 to 15.
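The following Python sketch summarizes the Swap Integral computation under two stated assumptions: the classifier was trained on the augmented data [X, X̃], and the interpolated replacement of a column x toward its partner x_fake is (1 − λ)x + λx_fake; model.score is again a scikit-learn-style accuracy interface, and basic Swap is recovered by evaluating only λ = 1.

import numpy as np

def swap_integral_scores(model, X, X_tilde, y, lam_max=10.0, n_grid=41):
    n, d = X.shape
    XX = np.hstack([X, X_tilde])            # layout the model was trained on
    base = model.score(XX, y)
    lambdas = np.linspace(0.0, lam_max, n_grid)
    Z = np.zeros(2 * d)                     # first d originals, last d knockoffs
    for j in range(2 * d):
        fake = XX[:, (j + d) % (2 * d)]     # knockoff partner column
        drops = []
        for lam in lambdas:
            X_mod = XX.copy()
            X_mod[:, j] = (1 - lam) * XX[:, j] + lam * fake
            drops.append(base - model.score(X_mod, y))
        Z[j] = np.trapz(drops, lambdas)     # area under the lambda path
    return Z[:d], Z[d:]                     # (Z_j, Z_tilde_j); W = Z - Z_tilde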

5 Experiments

We start by noting that we will not be able to compare our DAG knockoff procedure (for the Gaussian mixture model) with the more general SCIP procedure, because the latter has no tractable formulation even for very simple models such as a single multivariate Gaussian.

The Need for Mixture Models

We give a representative example of a simple setting where simply using the empirical covariance matrix to generate the knockoffs—i.e. assuming a single Gaussian model—yields invalid knockoffs for which FDR is not controlled.

We generate data from a mixture of three correlated Gaussian random vectors in dimension 30. Approximating P_X by a single Gaussian vector yields an empirical covariance matrix which is dominated by the diagonal (and therefore treats X as having independent coordinates). We generate a binary outcome Y in a logistic regression setting (such that the log odds of Y is a linear combination of the features of X). The linear combination is such that just the first 10 features of X are non-null (i.e. have non-zero coefficients; red), the next 10 features (null, black) are correlated with the non-nulls, and the last 10 are independent nulls. During the knockoff procedure, the importance scores are the absolute values of the logistic regression coefficients of each feature, and the feature statistics are the differences of importance scores.
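A minimal sketch of this example's statistics, using scikit-learn's logistic regression on the augmented design (the function name is our own):

import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_feature_statistics(X, X_tilde, y):
    d = X.shape[1]
    clf = LogisticRegression(max_iter=5000).fit(np.hstack([X, X_tilde]), y)
    coefs = np.abs(clf.coef_.ravel())       # importance scores (Z, Z_tilde)
    return coefs[:d] - coefs[d:]            # feature statistics W_j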

Figure 2: FDR Control Fails for Invalid Knockoffs: We plot side by side the importance scores for each feature (original and knockoff, 60 in total) for one run of the simulation, and the feature statistics computed as the difference between the importance scores of original vs. knockoff features (plotted at the first 30 indices). Selected features are those whose feature statistic is above the threshold.
Figure 3: Power Comparison Between Importance Score Methods. (a-c) As the number of mixture components increases, the power of the Swap Integral method remains high while that of other methods decreases. (d-f) Power for the Swap Integral method is as good as, and sometimes better than, that of other methods on real-world data with synthetic labels.

We run our knockoff procedure, modeling P_X as a mixture of three multivariate Gaussians fitted via EM (Figure 2(a)), and as a single Gaussian (Figure 2(b)). Importance scores of the non-null original features are high in both cases, and those of the correlated null original features are also quite high. The importance scores of valid knockoffs cancel out the correlated null original features' importance scores, and therefore the adaptively chosen threshold avoids selecting nulls (Figure 2(a)). However, when modeling P_X by a single Gaussian, the knockoffs corresponding to the null features (10-20) have importance scores close to 0 (Figure 2(b)). As a result these correlated null features are selected by the procedure and FDR control is broken. After 40 repetitions of the procedure (randomly generating in each run the parameters of the model and resampling the data), at the target FDR we obtain FDR control when we use our mixture model knockoffs (empirical FDR below the target) and fail to control FDR whenever we fit the data with a single multivariate Gaussian (empirical FDR above the target).

Feature Selection in Synthetic Data

We generate X as mixtures of 5, 10 and 20 multivariate Gaussians, and Y as polynomial functions of a subset of the dimensions of X—i.e. these are the non-null features. From each distribution, we draw samples of X and Y, fit a mixture of Gaussians and run our knockoff procedure. Figure 3(a-c) shows the power of five importance score methods; Swap Integral consistently achieves the best power. In all of our experiments, the empirical FDR is controlled at the target level (Appendix Figure 5).

Feature Selection in Real Data

Our real-world experiments are based on three datasets: the UK Biobank data, and the Bank Marketing and Polish Bankruptcy datasets from the UCI repository (Dheeru and Karra Taniskidou, 2017). We provide more information in Appendix D. As our first experiment, in order to have control over the non-null features, we created synthetic labels for each dataset by randomly selecting a subset of the features to be alternatives (the rest are nulls) and using noisy polynomials of these features to generate labels. Details of the data generation are described in Appendix C. We fit each dataset with a Gaussian mixture model, selecting the number of components by AIC (Akaike, 1992); AIC selected 9, 15, and 25 clusters for the UK Biobank, Bank Marketing and Bankruptcy datasets, respectively. Then we apply our Gaussian mixture knockoff sampler to generate knockoffs for each dataset. Average results over 100 repetitions of this procedure on each dataset are reported in Figure 3. Across the range of target FDRs, Swap Integral has at least as much power as (and sometimes much more than) the other feature statistic methods. It is important to note that, as we work with real datasets that we cannot resample, our experiments cannot provide empirical FDR values to confirm FDR control. We expect that our model for P_X is accurate enough that the robustness of the procedure accounts for model inaccuracies.
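A sketch of the model-selection step with scikit-learn's EM fitter (the search range is our own choice):

from sklearn.mixture import GaussianMixture

def fit_gmm_aic(X, max_components=30):
    best, best_aic = None, float("inf")
    for k in range(1, max_components + 1):
        gm = GaussianMixture(n_components=k, covariance_type="full").fit(X)
        aic = gm.aic(X)                     # lower AIC = better fit/complexity
        if aic < best_aic:
            best, best_aic = gm, aic
    return best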

As a discovery experiment, we run the knockoff procedure for feature selection on each of the three datasets with the real labels: cancer in UK Biobank, telemarketing success in Bank Marketing, and bankruptcy in Polish Bankruptcy. In all cases we train a neural network binary classifier with 4 hidden layers of size 100. All selected features are described in Appendix D. With the real data, we do not have ground truth on which features are the true alternatives. However, the features selected by the knockoff procedure make sense, and some of them agree with previously reported literature. With a target FDR of 0.3 (it can be shown that for small target FDR values the knockoff procedure requires a minimum number of rejections, which is why we selected a fairly large target FDR), the knockoff procedure selected 14 out of the total 284 features for UK Biobank cancer prediction, including Number of smoked cigarettes daily and Wine intake. Eight out of ten features in the Bank Marketing dataset were selected. For bankruptcy prediction, the knockoff procedure selected features including gross profit and short-term liabilities.

Discussion

The knockoff procedure is a powerful framework for selecting important features while statistically controlling false discoveries. The two main challenges in applying knockoffs in practice are the difficulty of generating valid knockoffs and the choice of importance scores. This paper takes a step toward addressing both challenges. We develop a framework to efficiently sample valid knockoffs from Bayesian networks. We show that even a simple application of this framework, generating knockoffs from Gaussian mixture models, already enables us to apply our procedure to several real datasets, while being optimistic that FDR is controlled. We also systematically evaluate and compare the statistical power of several importance scores. We propose a new score, Swap Integral, which is stable and substantially more powerful. Swap Integral can be applied on top of any classification model, including neural networks and random forests. Note that Swap Integral is a post-processing step on top of a single trained classifier and hence is computationally efficient. Our results make knockoffs more practically useful and open many doors for future investigation.

Acknowledgments

J.R.G. and A.G. were supported by a Stanford Graduate Fellowship. J.Z. is supported by a Chan–Zuckerberg Biohub Investigator grant and National Science Foundation (NSF) Grant CRII 1657155.

References

Appendix

Appendix A Proofs

A.1 Validity of Importance Scores with a Random Component

Following the notation of Candès et al. (2018) for Lemma 3.3, and denoting W_{swap(S)} the full vector of feature statistics after swapping the features in S, the flip-sign property can be summarized as W_{swap(S)} = ε ⊙ W, where ⊙ is the element-wise vector multiplication and ε_j = −1 if j ∈ S and +1 otherwise. As discussed by Candès et al. (2018), it should be highlighted that the final selection procedure controls FDR precisely because of this property. Now, by referring directly to the proof of Lemma 3.3 by Candès et al. (2018), we observe that it relies on the flip-sign property only as an equality in distribution. Therefore, with the exact same proof, the result still holds when the feature statistics satisfy the previous equality only in distribution. This allows us to construct valid feature statistics based on random components that are not limited to the randomness in the data itself. We can therefore also construct randomized statistics, and we prove that the constraint mentioned earlier only needs to hold in distribution for W to satisfy the flip-sign condition in distribution.

Proposition A.1.

Assume that the following equality holds in distribution for any subset S: the importance scores computed on the swapped data matrix [X, X̃]_{swap(S)} equal, in distribution, the correspondingly swapped importance scores computed on [X, X̃]. Then we have the following equality in distribution:

W_{swap(S)} =_d ε ⊙ W,   with ε_j = −1 if j ∈ S and +1 otherwise.    (1)
Proof.

It suffices to show the result for |S| = 1, as the general case can be decomposed as a concatenation of swaps of just one coordinate.

A.2 Proof of Proposition 3.1: GMM Knockoff Sampling Procedure

We prove Proposition 3.1, although the exact same proof applies in the more general setting of Algorithm 2 in the next section.

Proof.

We consider the marginal distribution over (X, X̃) by summing the full joint distribution over all possible values of the mixture assignment K, and we decompose the joint distribution along the sampling steps: P(x, x̃) = Σ_k P(K = k) P(x | K = k) P(x̃ | x, K = k). For each component k, the product P(x | K = k) P(x̃ | x, K = k) is the joint Gaussian knockoff density for component k, which is exchangeable in (x, x̃).

This proves exchangeability, as the resulting sum satisfies exchangeability in (x, x̃). ∎

A.3 Comparison of Algorithm 1 and SCIP

The main motivation for our Bayesian network knockoff sampling method is the intractability of SCIP for a general feature distribution, as mentioned in Section 3. Indeed, SCIP sequentially samples the knockoff of the jth feature from the conditional distribution of X_j given (X_{-j}, X̃_{1:j−1}). That means that, at each step of the sampling process, for each sample, one needs to compute the joint distribution of the original features and the already-sampled knockoffs, and then the corresponding conditional distribution, which is not computationally feasible if we assume a complex model for P_X.

Another difference is shown by the following: suppose that we observe a variable V for which we want to sample a knockoff, and that its distribution is conditioned on a latent variable H. If we assume that we can easily construct the conjugate conditionals for H and for V, then Algorithm 1 simplifies into Algorithm 2.

1 Sample H ~ P_{H|V}(· | V);
2 Sample H̃ from the exchangeable conjugate conditional for H;
3 Sample Ṽ from the exchangeable conjugate conditional for V, conditioning on the sampled H̃;
Algorithm 2 Knockoff Sampling Procedure for a simple Latent Variable Model

We show in this simple setting that (H̃, Ṽ) is not a knockoff of (H, V). We write the joint distribution and decompose it along the sampling steps. To prove the exchangeability of (V, Ṽ), we have to marginalize over the hidden states.

Only once we marginalize out the hidden states do we reach an expression where exchangeability is satisfied for (V, Ṽ). Otherwise we do not, and therefore (H̃, Ṽ) is not a knockoff of (H, V). In SCIP, by contrast, all the random variables sampled are part of the final knockoff sample.

Our procedure also differs from SCIP insofar as it is “modular” in each local conjugate conditional. The choice of each conjugate conditional is not unique, and poor choices yield local knockoffs that are too “close” to the initial sample and decrease the power of the procedure. The worst option, using the feature as its own knockoff (i.e. X̃_j = X_j), still gives valid knockoffs, though it discards any possibility of that feature being selected. But this is why the procedure is flexible: in cases where a conditional has no closed-form expression because of complex dependencies, we can locally choose poor conjugates and continue the procedure so that we still obtain valid knockoff samples. This is not possible when running SCIP directly, as one has to sample from a complex conditional distribution that is predetermined.

We can analyze how the previous examples make use of this freedom in the choice of the conjugate conditional. In the Gaussian mixture case we could choose X̃ = X, in which case our knockoff for X would be X itself, and the knockoff procedure would be powerless. In the HMM setting, after sampling the hidden nodes from the posterior given the observed ones, we could choose to keep those same hidden nodes as local knockoffs instead of sampling different local knockoffs. In this case, the final knockoff we obtain is still different from X. However, one can expect this choice to produce knockoffs with lower power, as the knockoff samples will stay “closer” to the true samples. The intuition is that by leveraging the knowledge we have about the generative process, we can sample more powerful knockoffs.

For the Hidden Markov Model, we sample:

  • the hidden states H from their posterior given the observed states; conditionally on the observations, the distribution of H is that of a Markov chain.

  • a new knockoff Markov chain H̃ via SCIP.

  • the knockoff X̃, constructed based on the conditional distribution of the observations given the hidden states. Because of the structure of the HMM, the observed states are independent conditionally on the hidden states, so if the observed states are univariate we can simplify the conjugate conditional and sample each X̃_j from the emission distribution given H̃_j (see the comments after Definition 3.1).

Example: Sampling Knockoffs for LDA

As an additional example, we describe how Algorithm 1 works to generate knockoffs from Latent Dirichlet Allocation (LDA). We use the same notation for LDA as defined by Blei et al. (2003): the nth word w_n in a document is sampled from a multinomial distribution parametrized by its topic assignment z_n and the topic-word parameters. The topic assignment z_n is sampled from a multinomial distribution parametrized by θ, the distribution of topics in the document. Finally, θ for each document is sampled from a Dirichlet distribution with hyperparameter α.

  1. The first step to build knockoffs is to learn the parameters of the model, which can be done by variational EM (Blei et al., 2003). Then we need to sample the hidden variables (θ, z) given the observed words: this is an inference problem for which direct computation is intractable, but we can approximate the posterior distribution via standard variational Bayes methods.

  2. Next, we sample local knockoffs. This is exactly analogous to one pass of Gibbs sampling over the whole DAG following a topological ordering, except that instead of sampling with respect to the conditional distribution of each node given its Markov blanket, we sample from the conjugate conditional distribution, conditioning on the appropriate variables as explained in Algorithm 1. We sample each variable from its conjugate conditional distribution. However, as θ is Dirichlet, any given coordinate is determined by the others, so the only possible choice is to set θ̃ = θ. Then, as each topic assignment is univariate, its conjugate conditional simplifies too, so that we just sample from the local conditional probability, and so on for the words.

A.4 Proof of Theorem 3.2: DAG Knockoff Sampling Procedure

Proof.

The joint probability distribution can be decomposed as follows by following the sampling steps, giving equation (2).

In order to show that (X, X̃) is exchangeable, we want to show that marginalizing this joint distribution with respect to the hidden states yields an exchangeable distribution.

We first show that, iterating recursively over all the nodes and summing over all values of the hidden variables, we obtain the expression in equation (3).

For simplicity, we consider here discrete random variables, so that marginalizing the joint distribution over the hidden variables means summing over all their possible values. Everything stays valid for continuous random variables, replacing sums by integrals. Starting from equation (2), which corresponds to step 1, we proceed sequentially through the nodes to reach (3). Suppose that at step j we have the following equality, where the left-hand term is the product of the right-hand terms:

The key element is that, by following the topological order, at step j the variable Z_j only appears in the joint probability and in its conjugate conditional term. (Notice that if a node is observed, it has no descendants; therefore the Markov blanket of such a node is a subset of the nodes with smaller index in the topological ordering.) We isolate these two terms and start by rewriting the joint probability as a conditional probability. By the definition of the Markov blanket, we can simplify the expression of the conditional probability. We then obtain two terms that are conjugate in the exchangeable sense.

We swap these two terms in the previous expression and obtain the following product. If the node is hidden, we sum both sides of the equality over its values; since it now only appears in the last term, summing over it gives 1. If we reach a node with no descendants, i.e. an observed node, then we do not marginalize it out, but we have the following simplification:

In both cases, we get to the next step in our recursion. After completing the last step, we arrive at equation (3). Notice that this expression is exchangeable in (X_j, X̃_j) for every assignment of the hidden variables. Indeed, X_j and X̃_j do not appear in the last product term, as two observed nodes cannot be in the Markov blanket of each other; they appear only in the conjugate probabilities, so the exchangeability in (X_j, X̃_j) holds. Again, as two observed nodes cannot appear in the Markov blanket of each other, this step can be repeated for different indices j, hence the exchangeability of the expression. This symmetry holds at fixed values of the hidden variables; therefore, it still holds when we sum over them. Hence (X, X̃) satisfies exchangeability. ∎

Appendix B Robustness of the Procedure to Model Misspecification

Figure 4: Fitting a t-distribution with a Mixture of Gaussians

We evaluate the empirical FDR when running the knockoff procedure with a misspecified model: we generate knockoffs by fitting a Gaussian mixture in settings where the features come from a mixture of t-distributions, for different degrees of freedom (DOF).

Figure 5: Empirical FDR for Experiments With Synthetic Features

As explained when the Model-X knockoff procedure was first introduced by Candès et al. (2018), instead of considering a model for the conditional distribution of Y given X, all the assumptions are related to modeling X: the burden of knowledge shifts from Y | X to P_X. In the same way that valid p-values rely on assumptions on Y | X (parametric model, noise distribution, asymptotic regime, ...), valid knockoffs rely on assumptions on the distribution of X: mainly, that we can approximate it very well.

When we generate knockoffs based on a Gaussian mixture model, and more generally a Bayesian network, we assume that these probabilistic models are good approximations of P_X and that they can be properly fitted. This is a very strong assumption: not only may the model we use to represent P_X be incorrect, but the estimated parameters of the model depend on the fitting procedure, which sometimes provides only an approximation of the actual distribution encoded by the Bayesian network (as when using variational inference). Commonly used optimization methods such as Expectation-Maximization (EM) can also get stuck in local minima. However, the knockoff procedure is remarkably robust to these issues. Existing theoretical robustness bounds (Barber et al., 2018) are based on controlling the KL-divergence of the model with respect to the true P_X. This can help explain why EM in our method, and hopefully other fitting procedures, yield knockoffs that are approximately valid: these methods minimize the KL-divergence of the model with respect to the distribution of the data.

Appendix C Synthetic Data Generation and FDR Control

As an example, we provide simulations showing the empirical FDR for a mixture of t-distributions in Figure 4 (at a fixed target FDR), as a function of the number of Gaussian mixture components we use to model the distribution. The conclusion is that, with enough mixture components, the knockoff procedure is able to control FDR, even though we cannot expect to represent a t-distribution exactly through a mixture of Gaussians.

To generate a synthetic data set with samples, features,