Deep Direct Likelihood Knockoffs

Predictive modeling often uses black box machine learning methods, such as deep neural networks, to achieve state-of-the-art performance. In scientific domains, the scientist often wishes to discover which features are actually important for making the predictions. These discoveries may lead to costly follow-up experiments and as such it is important that the error rate on discoveries is not too high. Model-X knockoffs enable important features to be discovered with control of the FDR. However, knockoffs require rich generative models capable of accurately modeling the knockoff features while ensuring they obey the so-called "swap" property. We develop Deep Direct Likelihood Knockoffs (DDLK), which directly minimizes the KL divergence implied by the knockoff swap property. DDLK consists of two stages: it first maximizes the explicit likelihood of the features, then minimizes the KL divergence between the joint distribution of features and knockoffs and any swap between them. To ensure that the generated knockoffs are valid under any possible swap, DDLK uses the Gumbel-Softmax trick to optimize the knockoff generator under the worst-case swap. We find DDLK has higher power than baselines while controlling the false discovery rate on a variety of synthetic and real benchmarks including a task involving a large dataset from one of the epicenters of COVID-19.



There are no comments yet.


page 1

page 2

page 3

page 4


Black Box FDR

Analyzing large-scale, multi-experiment studies requires scientists to t...

Interpreting Black Box Models with Statistical Guarantees

While many methods for interpreting machine learning models have been pr...

Training Deep Energy-Based Models with f-Divergence Minimization

Deep energy-based models (EBMs) are very flexible in distribution parame...

Reliable Estimation of Kullback-Leibler Divergence by Controlling Discriminator Complexity in the Reproducing Kernel Hilbert Space

Several scalable methods to compute the Kullback Leibler (KL) divergence...

Normalizing Flows for Knockoff-free Controlled Feature Selection

The goal of controlled feature selection is to discover the features a r...

A Brief Introduction to Generative Models

We introduce and motivate generative modeling as a central task for mach...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

High dimensional multivariate datasets pervade many disciplines including biology, neuroscience, and medicine. In these disciplines, a core question is which variables are important to predict the phenomenon being observed. Finite data and noisy observations make finding informative variables impossible without some error rate. Scientists therefore seek to find informative variables subject to a specific tolerance on an error rate such as the false discovery rate (fdr).

Traditional methods to control to the fdr rely on assumptions on how the covariates of interest may be related to the response . Model-X knockoffs (Candes et al., 2016) provide an alternative framework that controls the fdr by constructing synthetic variables , called knockoffs, that look like the original covariates, but have no relationship with the response given the original covariates. Variables of this form can be used to test the conditional independence of each covariate in a collection, with the response given the rest of the covariates by comparing the association the original covariate has with the response with the association the knockoff has with the response.

The focus on modeling the covariates shifts the testing burden to building good generative models of the covariates. Knockoffs need to satisfy two properties: (i) they need to be independent of the response given the real covariates, (ii) they need to be equal in distribution when any subset of variables is swapped between knockoffs and real data. Satisfying the first property is trivial by generating the knockoffs without looking using the response. The second property requires building conditional generative models that are able to match the distribution of the covariates.

Related work.

Existing knockoff generation methods can be broadly classified as either model-specific or flexible. Model-specific methods such as hmm

(Sesia et al., 2017) or Auto-Encoding Knockoffs (Liu and Zheng, 2018)

make assumptions about the covariate distribution, which can be problematic if the data does not satisfy these assumptions. hmm assume the joint distribution of the covariates can be factorized into a markov chain. Auto-Encoding Knockoffs use vae to model

and sample knockoffs . vae assume lies near a low dimensional manifold, whose dimension is controlled by a latent variable. Covariates that violate this low-dimensional assumption can be better modeled by increasing the dimension of the latent variable, but risk retaining more information about , which can reduce the power to select important variables.

Flexible methods for generating knockoffs such as KnockoffGAN (Jordon et al., 2019) or Deep Knockoffs (Romano et al., 2018) focus on likelihood-free generative models. KnockoffGAN uses gan-based generative models, which can be difficult to estimate (Mescheder et al., 2018)

and sensitive to hyperparameters

(Salimans et al., 2016; Gulrajani et al., 2017; Mescheder et al., 2017). Deep Knockoffs employ mmd, the effectiveness of which often depends on the choice of a kernel which can involve selecting a bandwidth hyperparameter. Ramdas et al. (2015) show that in several cases, across many choices of bandwidth, mmd approaches 0 as dimensionality increases while

divergence remains non-zero, suggesting mmd may not reliably generate high-dimensional knockoffs. Deep Knockoffs also prevent the knockoff generator from memorizing the covariates by explicitly controlling the correlation between the knockoffs and covariates. This is specific to second order moments, and may ignore higher order ones present in the data.

We propose ddlk, a likelihood-based method for generating knockoffs without the use of latent variables. ddlk is a two stage algorithm. The first stage uses maximum likelihood to estimate the distribution of the covariates from observed data. The second stage estimates the knockoff distribution with likelihoods by minimizing the kld between the joint distribution of the real covariates with knockoffs and the joint distribution of any swap of coordinates between covariates and knockoffs. ddlk expresses the likelihoods for swaps in terms of the original joint distribution with real covariates swapped with knockoffs. Through the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016)

, we optimize the knockoff distribution under the worst swaps. By ensuring that the knockoffs are valid in the worst cases, ddlk learns valid knockoffs in all cases. To prevent ddlk from memorizing covariates, we introduce a regularizer to encourage high conditional entropy for the knockoffs given the covariates. We study ddlk on synthetic, semi-synthetic, and real datasets. Across each study, ddlk controls the fdr while achieving higher power than competing gan-based, mmd-based, and autoencoder-based methods.

2 Knockoff filters

Model-X knockoffs (Candes et al., 2016) is a tool used to build variable selection methods. Specifically, it facilitates the control of the fdr, which is the proportion of selected variables that are not important. In this section, we review the requirements to build variable selection methods using Model-X knockoffs.

Consider a data generating distribution where variables , response only depends on , , and is a subset of the variables. Let be a set of indices identified by a variable selection algorithm. The goal of such algorithms is to return an that maximizes the number of indices in while maintaining the fdr at some nominal level:

To control the fdr at the nominal level, Model-X knockoffs requires (a) knockoffs , and (b) a knockoff statistic to assess the importance of each feature .


are random vectors that satisfy the following properties for any set of indices



The swap property eq. 1 ensures that the joint distribution of is invariant under any swap. A swap operation at position is defined as exchanging the entry of and . For example, when , and , . Equation 2 ensures that the response is independent of the knockoff given the original features .

A knockoff statistic must satisfy the flip-sign property. This means if , must be positive. Otherwise, the sign of

must be positive or negative with equal probability.

Given knockoff and knockoff statistics , exact control of the fdr at level can be obtained by selecting variables in such that . The threshold is given by:


While knockoffs are a powerful tool to ensure that the fdr is controlled at the nominal level, the choice of method to generate knockoffs is left to the practitioner.

Existing methods for knockoffs include model-specific approaches that make specific assumptions about the covariate distribution, and flexible likelihood-free methods. If the joint distribution of cannot be factorized into a markov chain (Sesia et al., 2017) or if does not lie near a low-dimensional manifold (Liu and Zheng, 2018), model-specific generators will yield knockoffs that are not guaranteed to control the fdr. Likelihood-free generation methods that use gan (Jordon et al., 2019) or mmd (Romano et al., 2018) make fewer assumptions about , but can be difficult to estimate (Mescheder et al., 2018), sensitive to hyperparameters (Salimans et al., 2016; Gulrajani et al., 2017; Mescheder et al., 2017), or suffer from low power in high dimensions (Ramdas et al., 2015). In realistic datasets, where can come from an arbitrary distribution and dimensionality is high, it remains to be seen how to reliably generate knockoffs that satisfy eqs. 2 and 1.

3 ddlk

We motivate ddlk with the following observation. The swap property in eq. 1 is satisfied if the divergence between the original and swapped distributions is zero. Formally, let be a set of indices to swap, and , . Then under any such :


A natural algorithm for generating valid knockoffs might be to parameterize each distribution above and solve for the parameters by minimizing the LHS of eq. 4. However, modeling for every possible swap is difficult and computationally infeasible in high dimensions. Theorem 3.1 provides a useful solution to this problem.

Theorem 3.1.

Let be a probability measure defined on a measurable space. Let be a swap function using indices . If is a sample from , the probability law of is .

As an example, in the continuous case, where and are the densities of and respectively, evaluated at a sample is simply evaluated at the swap of . We show the direct proof of this example and theorem 3.1 in appendix A. A useful consequence of theorem 3.1 is that ddlk needs to only model , instead of and every possible swap distribution . To derive the ddlk algorithm, we first expand eq. 4:


where . ddlk models the RHS by parameterizing and with and respectively. The parameters and can be optimized separately in two stages.

Stage 1: Covariate distribution estimation.

We model the distribution of using . The parameters of the model are learned by maximizing over a dataset of samples.

Stage 2: Knockoff generation.

For any fixed swap , minimizing the KL divergence between the following distributions ensures the swap property eq. 1 required of knockoffs:


Fitting the knockoff generator involves minimizing this KL divergence for all possible swaps . To make this problem tractable, we use several building blocks that help us (a) sample swaps with the highest values of this KL and (b) prevent from memorizing to trivially satisfy the swap property in eq. 1.

3.1 Fitting ddlk

Knockoffs must satisfy the swap property eq. 1 for all potential sets of swap indices . While this seems to imply that the KL objective in eq. 6 must be minimized under an exponential number of swaps, swapping every coordinate suffices (Candes et al., 2018). More generally, showing the swap property for a collection of sets where every coordinate can be represented as the symmetric difference of members of the collection is sufficient. See section A.4 more more details.

Sampling swaps.

Swapping coordinates can be expensive in high dimensions, so existing methods resort to randomly sampling swaps (Romano et al., 2018; Jordon et al., 2019) during optimization. Rather than sample each coordinate uniformly at random, we propose parameterizing the sampling process for swap indices so that swaps sampled from this process yields large values of the KL objective in eq. 6. We do so because of the following property of swaps, which we prove in appendix A.

Lemma 3.2.

Worst case swaps: Let be the worst case swap distribution. That is, the distribution over swap indices that maximizes


with respect to . If eq. 7 is minimized with respect to , knockoffs sampled from will satisfy the swap property in eq. 1 for any swap in the power set of .

Randomly sampling swaps can be thought of as sampling from

Bernoulli random variables

with parameters respectively, where each indicates whether the th coordinate is to be swapped. A set of indices can be generated by letting . To learn a sampling process that helps maximize eq. 7, we optimize the values of

. However, since score function gradients for the parameters of Bernoulli random variables can have high variance, ddlk uses a continuous relaxation instead. For each coordinate

, ddlk learns the parameters for a Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) distribution .

Input: , dataset of covariates; , regularization parameter; , learning rate for ; , learning rate for

Output: , parameter for , , parameter for

while  not converged do

       Sample , where   Sample swap   Create , where   Let   Let
end while
return , ,
Algorithm 1 ddlk

Entropy regularization.

Minimizing the the KL objective in eq. 6 over the worst case swap distribution will generate knockoffs that satisfy the swap property eq. 1. However, a potential solution in the optimization of is to memorize the covariates , which reduces the power to select important variables.

To solve this problem, ddlk introduces a regularizer based on the conditional entropy, to push to not be a copy of . This regularizer takes the form , where is a hyperparameter.

Including the regularizer on conditional entropy, and Gumbel-Softmax sampling of swap indices, the final optimization objective for ddlk is:


where . We show the full ddlk algorithm in algorithm 1. ddlk fits by maximizing the likelihood of the data. It then fits by optimizing eq. 8 with noisy gradients. To do this, ddlk first samples knockoffs conditioned on the covariates and a set of swap coordinates, then computes Monte-Carlo gradients of the ddlk objective in eq. 8 with respect to parameters and . In practice ddlk can use stochastic gradient estimates like the score function or reparameterization gradients for this step. The and models can be implemented with flexible models like MADE (Germain et al., 2015) or mixture density networks (Bishop, 1994).

4 Experiments

We study the performance of ddlk on several synthetic, semi-synthetic, and real-world datasets. We compare ddlk with several non-Gaussian knockoff generation methods: Auto-Encoding Knockoffs (AEK) (Liu and Zheng, 2018), KnockoffGAN (Jordon et al., 2019), and Deep Knockoffs (Romano et al., 2018).111For all comparison methods, we downloaded the publicly available implementations of the code (if available) and used the appropriate configurations and hyperparameters recommended by the authors. See appendix D for code and model hyperparameters used.

Figure 1: ddlk controls fdr at the nominal rate and achieves highest power on a variety of benchmarks. For each benchmark, we show the fdp and power of each knockoff method.

Each experiment involves three stages. First, we fit a knockoff generator using a dataset of covariates . Next, we fit a response model , and use its performance on a held-out set to create a knockoff statistic for each feature

. We detail the construction of these test statistics in

appendix E. Finally, we apply a knockoff filter to the statistics to select features at a nominal fdr level , and measure the ability of each knockoff method to select relevant features while maintaining fdr at . For the synthetic and semi-synthetic tasks, we repeat these three stages 30 times to obtain interval estimates of each performance metric.

In our experiments, we assume to be real-valued with continuous support and decompose the models and

via the chain rule:

For each conditional distribution, we fit a mixture density network Bishop (1994) where the number of mixture components is a hyperparameter. In principle, any model that produces an explicit likelihood value can be used to model each conditional distribution.

Fitting involves using samples from a dataset, but fitting involves sampling from . This is a potential issue because the gradient of the ddlk objective eq. 8 with respect to is difficult to compute as it involves integrating , which depends on . To solve this, we use an implicit reparameterization (Figurnov et al., 2018) of mixture densities. Further details of this formulation are presented in appendix B.

Across each benchmark involving ddlk, we vary only the entropy regularization parameter based on the amount of dependence among covariates. The number of parameters, learning rate, and all other hyperparameters are kept constant. To sample swaps , we sample using a straight-through Gumbel-Softmax estimator (Jang et al., 2016). This allows us to sample binary values for each swap, but use gradients of a continuous approximation during optimization. For brevity, we present the exact hyperparameter details for ddlk in appendix C.

We run each experiment on a single CPU with 4GB of memory. ddlk takes roughly 40 minutes in total to fit both and on a 100-dimensional dataset.

Synthetic benchmarks.

Our tests on synthetic data seek to highlight differences in power and fdr between each knockoff generation method. Each dataset in this section consists of samples, 100 features, 20 of which are used to generate the response . We split the data into a training set () to fit each knockoff method, a validation set () used to tune the hyperparameters of each method, and a test set () for evaluating knockoff statistics.

[gaussian]: We first replicate the multivariate normal benchmark of Romano et al. (2018). We sample , where is a -dimensional covariance matrix whose entries . This autoregressive Gaussian data exhibits strong correlations between adjacent features, and lower correlations between features that are further apart. We generate , where coefficients for the important features are drawn as . In our experiments, we set . We let the ddlk entropy regularization parameter . Our model for is a -layer neural network with parameters.

Figure 2: ddlk closely models the modes, covariances, and mixture proportions of the mixture dataset. Auto-Encoding Knockoffs also capture every mode, but does so by overfitting to the covariates. Deep Knockoffs are able to match the first two moments, but fail to capture every mode. KnockoffGAN suffers from mode collapse and fails to capture every mode.

[mixture]: To compare each method on its ability to generate non-Gaussian knockoffs, we use a mixture of autoregressive Gaussians. This is a more challenging benchmark as each covariate is multi-modal, and highly correlated with others. We sample , where each is a -dimensional covariance matrix whose th entry is . We generate , where coefficients for the important features are drawn as . In our experiments, we set , and . Cluster centers are set to , and mixture proportions are set to . We let the ddlk entropy regularization parameter . Figure 2 visualizes two randomly selected dimensions of this data.


Figure 1

compares the average fdp and power (percentage of important features selected) of each knockoff generating method. The average fdp is an empirical estimate of the fdr. In the case of the

gaussian dataset, all methods control fdr at or below the the nominal level, while achieving 100% power to select important features. The main difference between each method is in the calibration of null statistics. Recall that a knockoff filter assumes a null statistic to be positive or negative with equal probability, and features with negative statistics below a threshold are used to control the number of false discoveries when features with positive statistics above the same threshold are selected. ddlk produces the most well calibrated null statistics as evidenced by the closeness of its fdp curve to the dotted diagonal line.

Figure 1 also demonstrates the effectiveness of ddlk in modeling non-Gaussian covariates. In the case of the mixture dataset, ddlk achieves significantly higher power than the baseline methods, while controlling the fdr at nominal levels. To understand why this may be the case, we plot the joint distribution of two randomly selected features in fig. 2. ddlk and Auto-Encoding Knockoffs both seem to capture all three modes in the data. However, Auto-Encoding Knockoffs tend to produce knockoffs that are very similar to the original features, and yield lower power when selecting variables, shown in fig. 1. Deep Knockoffs manage to capture the first two moments of the data – likely due to an explicit second-order term in the objective function – but tend to over-smooth and fail to properly estimate the knockoff distribution. KnockoffGAN suffers from mode collapse, and fails to capture even the first two moments of the data. This yields knockoffs that not only have low power, but also fail to control fdr at nominal levels.

Robustness of ddlk to entropy regularization.

To provide guidance on how to set the entropy regularization parameter, we explore the effect of on both fdr control and power. Intuitively, lower values of will yield solutions of that may satisfy eq. 1 and control fdr well, but may also memorize the covariates and yield low power. Higher values of may help improve power, but at the cost of fdr control. In this experiment, we again use the gaussian dataset, but vary and the correlation parameter . Figure 3 highlights the performance of ddlk over various settings of and . We show a heatmap where each cell represents the RMSE between the nominal fdr and mean fdp curves over 30 simulations. In each of these settings ddlk achieves a power of 1, so we only visualize fdp. We observe that the fdp of ddlk is very close to its expected value for most settings where . This is true over a wide range of explored, demonstrating that ddlk is not very sensitive to the choice of this hyperparameter. We also notice that data with weaker correlations see a smaller increase in fdp with larger values of . In general, checking the fdp on synthetic responses generated conditional on real covariates can aid in selecting .

Figure 3: ddlk is robust to choices of entropy regularization parameter . For most choices of , ddlk achieves fdp very close to that of the nominal fdr rate. This figure shows the RMSE between the expected and actual fdp curves.

Semi-synthetic benchmark.

[gene]: In order to evaluate the fdr and power of each knockoff method using covariates found in a genomics context, we create a semi-synthetic dataset. We use RNA expression data of 963 cancer cell lines from the Genomics of Drug Sensitivity in Cancer study (Yang et al., 2012). Each cell line has expression levels for 20K genes, of which we sample 100 such that every feature is highly correlated with at least one other feature. We create 30 independent replications of this experiment by repeating the following process. We first sample a gene uniformly at random, adding it to the set . For , , we sample uniformly at random from and compute the set of genes not in with the highest correlation with . From this set of , we uniformly sample a gene and add it to the feature set. We repeat this process for , yielding genes in total.

We generate using a nonlinear response function adapted from a study on feature selection in neural network models of gene-drug interactions (Liang et al., 2018). The response consists of a nonlinear term, a second-order term, and a linear term. For brevity, appendix F contains the full simulation details. We let ddlk entropy regularization parameter .

Figure 4: ddlk learns the marginals of COVID-19 data better than competing baselines. We plot the marginal distribution of a feature in a COVID-19 dataset, and the corresponding marginal of samples from each knockoff method.


Figure 1 (right) highlights the empirical fdp and power of each knockoff generating method in the context of gene. All methods control the fdr below the nominal level, but the average fdp of ddlk at fdr thresholds below 0.3 is closer to its expected value. This range of thresholds is especially important as nominal levels of fdr below 0.3 are most used in practice. In this range, ddlk achieves power on par with Deep Knockoffs at levels below 0.1, and higher power everywhere else. Auto-Encoding Knockoffs and KnockoffGAN achieve noticeably lower power across all thresholds. Deep Knockoffs perform well here likely due to a lack of strong third or higher moments of dependence between features. We attribute the success of ddlk and Deep Knockoffs to their ability to model highly correlated data.

COVID-19 adverse events.

[covid-19]: The widespread impact of COVID-19 has led to the deployment of machine learning models to guide triage. Data for COVID-19 is messy because of both the large volume of patients and the changing practice for patient care. Establishing trust in models for COVID-19 involves vetting the training data to ensure it does not contain artifacts that models can exploit. Conditional independence tests help achieve this goal in two ways: (a) they highlight which features are most important to the response, and (b) they prune the feature set for a deployed model, reducing the risk of overfitting to processes in a particular hospital. We apply each knockoff method to a large dataset from one of the epicenters of COVID-19 to understand the features most predictive of adverse events.

We use electronic health record data on COVID-positive patients from a large metropolitan health network. Our covariates include demographics, vitals, and lab test results from every cbc taken for each patient. The response is a binary label indicating whether or not a patient had an adverse event (intubation, mortality, ICU transfer, hospice discharge, emergency department representation, O2 support in excess of nasal cannula at 6 L/min) within 96 hours of their cbc. There are 17K samples of 37 covariates in the training dataset, 5K in a validation set, and 6K in a held-out test set. We let the ddlk entropy regularization parameter

. In this experiment, we use gradient boosted regression trees

(Friedman, 2002; Ke et al., 2017) as our model, and expected log-likelihood as a knockoff statistic. We also standardize the data in the case of Deep Knockoffs since mmd that use the rbf kernel with a single bandwidth parameter work better when features are on the same scale.

Feature DDLK Deep Knockoffs AEK KnockoffGAN Validated
Eosinophils count
Eosinophils percent
Blood urea nitrogen
O2 Saturation
Heart rate
Respiratory rate
O2 Rate
On room air
High O2 support
Table 1: ddlk selects 10/37 features, 8 of which were found to be meaningful by doctors at a large metropolitan hospital. Here we show the union of covid-19 features selected by each knockoff method at a nominal fdr of 0.2. Deep Knockoffs, Auto-Encoding Knockoffs, and KnockoffGAN exhibit lower power to select important features.


As COVID-19 is a recently identified disease, there is no ground truth set of important features for this dataset. We therefore use each knockoff method to help discover a set of features at a nominal fdr threshold of , and validate each feature by manual review with doctors at a large metropolitan hospital. Table 1 shows a list of features returned by each knockoff method, and indicates whether or not a team of doctors thought the feature should have clinical relevance.

We note that ddlk achieves highest power to select features, identifying 10 features, compared to 5 by DeepKnockoffs, and 4 each by Auto-Encoding Knockoffs and KnockoffGAN. To understand why, we visualize the marginal distributions of each covariate in, and the respective marginal distribution of samples from each knockoff method, in fig. 4.

We notice two main differences between ddlk and the baselines. First, ddlk is able to fit asymmetric distributions better than the baselines. Second, despite the fact that the implementation of ddlk using mixture density networks is misspecified for discrete variables, ddlk is able to model them better than existing baselines. This implementation uses continuous models for , but is still able approximate discrete distributions well. The components of each mixture appear centered around a discrete value, and have very low variance as shown in fig. 5. This yields a close approximation to the true discrete marginal. We show the marginals of every feature for each knockoff method in appendix G.

5 Conclusion

ddlk is a generative model for sampling knockoffs that directly minimizes a divergence implied by the knockoff swap property. The optimization for ddlk involves first maximizing the explicit likelihood of the covariates, then minimizing the divergence between the joint distribution of covariates and knockoffs and any swap between them. To ensure ddlk satisfies the swap property under any swap indices, we use the Gumbel-Softmax trick to learn swaps that maximize the divergence. To generate knockoffs that satisfy the swap property while maintaining high power to select variables, ddlk includes a regularization term that encourages high conditional entropy of the knockoffs given the covariates. We find ddlk to outperform various baselines on several synthetic and real benchmarks including a task involving a large dataset from one of the epicenters of COVID-19.

Broader Impact

There are several benefits to using flexible methods for identifying conditional independence like ddlk. Practitioners that care about transparency have traditionally eschewed deep learning, but methods like ddlk can present a solution. By performing conditional independence tests with deep networks and by providing guarantees on the false discovery rate, scientists and practitioners can develop more trust in their models. This can lead to greater adoption of flexible models in basic sciences, resulting in new discoveries and better outcomes for the beneficiaries of a deployed model. Conditional independence can also help detect bias from data, by checking if an outcome like length of stay in a hospital was related to a sensitive variable like race or insurance type, even when conditioning on many other factors.

While we believe greater transparency can only be better for society, we note that interpretation methods for machine learning may not exactly provide transparency. These methods visualize only a narrow part of a model’s behavior, and may lead to gaps in understanding. Relying solely on these domain-agnostic methods could yield unexpected behavior from the model. As users of machine learning, we must be wary of too quickly dismissing the knowledge of domain experts. No interpretation technique is suitable for all scenarios, and different notions of transparency may be desired in different domains.



Appendix A Proofs

a.1 Proof of theorem 3.1

Let be a probability measure defined on a measurable space. Let be a swap function using indices . If is a sample from , the probability law of is .


The swap operation on swaps coordinates in the following manner: for each , the th and th coordinates are swapped. Let be a measurable space, where elements of are -dimensional vectors, and is a -algebra on . Let also be a measurable space where each element of is an element of but with the th coordinate swapped with the th coordinate for each . Similarly, let be constructed by applying the same swap transformations to each element of . is a -algebra as swaps are one-to-one transformations, and is a -algebra.

We first show that is a measurable function with respect to and . This is true by construction of the measurable space . For every element , .

We can now construct a mapping for all . This is the pushforward measure of under transformation , and is well defined because is measurable.

Using the fact that a swap applied twice is the identity, we get . With this, we see that the probability measure on is . ∎

a.2 Alternative derivation of theorem 3.1 for continuous random variables

In this section we derive theorem 3.1

for continuous random variables in an alternative manner. Let

be a set of covariates, and be a set of knockoffs. Let , and where is a set of coordinates in which we swap and . Recall that a swap operation on is an affine transformation , where is a permutation matrix. Using this property, we get:

The first step is achieved by using a change of variables, noting that is invertible, and . The determinant of the Jacobian here is just the determinant of . is a permutation matrix whose parity is even, meaning its determinant is , and that . I.e. the density of the swapped variables evaluated at is equal to the original density evaluated at .

a.3 Proof of lemma 3.2

Recall lemma 3.2:

Worst case swaps: Let be the worst case swap distribution. That is, the distribution over swap indices that maximizes

with respect to . If this quantity is minimized with respect to , knockoffs sampled from will satisfy the swap property in eq. 1 for any swap in the power set of .




is minimized with respect to but maximized with respect to , then for any other distribution , eq. 9 will be lesser. Minimizing eq. 9, which is non-negative, with respect to implies that for any swap sampled from and for any knockoff sampled from ,

As eq. 9 is also maximized with respect to , swaps drawn from all other distributions will only result in lower values of eq. 9. Therefore, the joint distribution will be invariant under any swap in the power set of . ∎

a.4 Sufficient sets for swap condition expectation

Recall the swap property required of knockoffs highlighted in eq. 1:

where is a set of coordinates under which we swap the covariates and knockoffs. For valid knockoffs, this equality in distribution must hold for any such . One approach to check if knockoffs are valid is to verify this equality in distribution for all singleton sets [Romano et al., 2018, Jordon et al., 2019]. To check if the swap property eq. 1 holds under any , it suffices to check if eq. 1 holds under each of .

We can generalize this approach to check the validity of knockoffs under other collections of indices besides singleton sets using the following property. Let and


where is the symmetric difference of and . Swapping the indices in is equivalent to swapping the indices in , then the indices in . If , swapping twice will negate the effect of the swap.

We can extend this property to sets and define sufficient conditions to check if the swap property holds. Let be a sequence of sets where each . Let

Checking the swap property eq. 1 under a sequence of swaps is equivalent to checking eq. 1 under the singleton set . Therefore, the swap property must also hold under the singleton set .

If collection of sets of swap indices contains a sub-sequence such that their sequential symmetric difference is the singleton for each , then a set of knockoffs that satisfies the swap property under each , will also satisfy the swap property under each singleton set, which is sufficient to generate valid knockoffs.

Appendix B Implicit reparameterization of mixture density networks

In our experiments, we decompose both and via the chain rule:

We model each conditional using mixture density networks Bishop [1994] which take the form

where functions , , and characterize a univariate gaussian mixture. These parameters of these functions are .

Fitting .

Let , the parameters of contain parameters for every conditional . The optimization of is straightforward:

only requires taking the derivative of .

Fitting .

Let , the parameters of contain parameters for every conditional

. Recall the loss function

The optimization of requires , which involves the derivative of an expectation with respect to to . We use implicit reparameterization [Figurnov et al., 2018]. The advantage of implicit reparameterization over explicit reparameterization [Kingma et al., 2015] is that an inverse standardization function – which transforms random noise into samples from a distribution parameterized by – is not needed. Using implicit reparameterization, gradients of some objective can be rewritten as

We use this useful property to reparameterize gaussian mixture models. Let

be a gaussian mixture model:

where . Let the standardization function be the CDF of :

where is the standard normal gaussian CDF. We use this to compute the gradient of with respect to each parameter:

Putting it all together, we use the implicit reparameterization trick to implement each conditional distribution in and .

Appendix C Implementation details and hyperparameter settings for ddlk experiments

We decompose and into the product of univariate conditional distributions using the product rule. We use mixture density networks [Bishop, 1994] to parameterize each conditional distribution. Each mixture density network is a 3-layer neural network with 50 parameters in each layer and a residual skip connection from the input to the last layer. Each network outputs the parameters for a univariate gaussian mixture with 5 components. We initialize the network such that the modes are evenly spaced within the support of training data.

Using , we sample binary swap matrices of the same dimension as the data. As we require discrete samples from the Gumbel-Softmax distribution, we implement a straight-through estimator [Jang et al., 2016]

. The straight-through estimator facilitates sampling discrete indices, but uses a continuous approximation during backpropagation.

The model is optimized using Adam [Kingma and Ba, 2014], with a learning rate of

for a maximum of 50 epochs. The

model is optimized using Adam, with a learning rate of for and for for a maximum of 250 epochs. We also implement early stopping using validation loss using the PyTorch Lightning framework [Falcon, 2019].

Our code can be found online by installing:

pip install -i ddlk==0.2

Compute resources.

We run each experiment on a single CPU core using 4GB of memory. Fitting for a 100-dimensional dataset with 2000 samples requires fitting 100 conditional models, and takes roughly 10 minutes. Fitting for the same data takes roughly 30 minutes.

Fitting ddlk to our covid-19 dataset takes roughly 15 minutes in total.

Appendix D Baseline model settings

For Deep Knockoffs and KnockoffGAN, we use code from each respective repository:

and use the recommended hyperparameter settings.

At the time of writing this paper, there was no publicly available implementation for Auto-Encoding Knockoffs. We implemented Auto-Encoding Knockoffs with a vae with a gaussian posterior and likelihood . Each of

is a 2-layer neural network with 400 units in the first hidden layer, 500 units in the second, and ReLU activations. The outputs of networks

are exponentiated to ensure variances are non-negative. The outputs of network and are of dimension , and the outputs of and are of dimension , the covariate dimension. For each dataset, we choose the dimension of latent variable that maximizes the estimate of the ELBO on a validation dataset. In our experiments, we search for over the set . For each dataset, we use the following :

  1. gaussian: 20

  2. mixture: 140

  3. gene: 30

  4. covid-19: 60.

The neural networks are trained using Adam [Kingma and Ba, 2014], with a learning rate of

for a maximum of 150 epochs. To avoid very large gradients, we standardize the data using the mean and standard deviation of the training set. To generate knockoffs

, we use the same approach prescribed by Liu and Zheng [2018]. We first sample the latent variable conditioned on the covariates using the posterior distribution:

This sample of is then used to sample a knockoff using the likelihood distribution:

Since these are standardized, we re-scale them by the the training mean and standard deviation.

Appendix E Robust model-based statistics for fdr-control

The goal of any knockoff method is to help compute test statistics for a conditional independence test. We employ a variant of hrt [Tansey et al., 2018] to compute test statistics for each feature . We split dataset into train and test sets , and respectively, then sample knockoff datasets and conditioned on each. Next, a model is fit with .

To compute knockoff statistics with , we use a measure of performance on the test set. For real-valued , is the mean squared-error, and for categorical , is expected log-probability of . A knockoff statistic is recorded for each feature , where is but with the th feature swapped with .

In practice, we use use flexible models like neural networks or boosted trees for . While the model-based statistic above will satisfy the properties detailed in section 2 and control fdr, its ability to do so is hindered by imperfect knockoffs. In such cases, we observe that knockoff statistics for null features are centered around some , violating a condition required for empirical fdr control. This happens because if the covariates and knockoffs are not equal in distribution, models trained on the covariates will fit the covariates better than the knockoffs and inflate the value of test statistic . This can lead to an increase in the false discovery rate as conditionally independent features may be selected if their statistic is larger than the selection threshold. To combat this, we propose a mixture statistic that trades off power for fdr-control.

The mixture statistic involves fitting a model for each feature using an equal mixture of data in and , then computing as above. Such a achieves lower performance on , but higher performance on , yielding values of with modes closer to , enabling finite sample fdr control. However, this fdr-control comes at the cost of power as the method’s ability to identify conditionally dependent features is reduced.

Appendix F Nonlinear response for gene experiments

We simulate the response for the gene experiment using a nonlinear response function designed for genomics settings [Liang et al., 2018]. The response consists of two first-order terms, a second-order term, and an additional nonlinearity in the form of a :

where is the number of important features. In our experiments, we set . This means that the first features are important, while the remaining are unimportant.

Appendix G Generating knockoffs for COVID-19 data

In this section, we visualize the marginals of each feature in our covid-19 dataset, and the marginals of knockoffs sampled from each method. Figures 8, 7, 6 and 5 plot the marginals of samples from ddlk, KnockoffGAN, Deep Knockoffs, and Auto-Encoding Knockoffs respectively. These provide insight into why ddlk is able to select more features at the same nominal fdr threshold of 0.2. We first notice that knockoff samples from ddlk match the marginals of the data very well. ddlk is the only method that models asymmetric distributions well.

Despite our experiment using mixture density networks to implement ddlk, discrete data is also modeled well. For example the values of O2_support_above_NC – a binary feature indicating whether a patient required oxygen support greater than nasal cannula – are also the modes of a mixture density learned by ddlk. Samples from Auto-Encoding Knockoffs, KnockoffGAN, and Deep Knockoffs tend to place mass spread across these values.

Figure 5: The marginal distributions of knockoff samples from ddlk look very similar to those from the data. Despite this implementation of ddlk using mixture density networks, the modes of each marginal line up with discrete values in the data.

Figure 6: The marginals distributions of samples from KnockoffGAN match the data only when the feature is univariate, and has roughly equal mass on either side of the mode.

Figure 7: The marginals distributions of samples from Deep Knockoffs match the data only when the feature is univariate and has fat tails.

Figure 8: Auto-Encoding Knockoffs tend to learn underdispersed distributions for the covariates. Further, all of the marginal distributions learned are univariate and exhibit variance much smaller than that of the data.