1 Introduction
High dimensional multivariate datasets pervade many disciplines including biology, neuroscience, and medicine. In these disciplines, a core question is which variables are important to predict the phenomenon being observed. Finite data and noisy observations make finding informative variables impossible without some error rate. Scientists therefore seek to find informative variables subject to a specific tolerance on an error rate such as the false discovery rate (fdr).
Traditional methods to control the fdr rely on assumptions about how the covariates of interest may be related to the response. Model-X knockoffs (Candes et al., 2016) provide an alternative framework that controls the fdr by constructing synthetic variables, called knockoffs, that look like the original covariates but have no relationship with the response given the original covariates. Variables of this form can be used to test the conditional independence of each covariate in a collection with the response, given the rest of the covariates, by comparing the association the original covariate has with the response to the association the knockoff has with the response.
The focus on modeling the covariates shifts the testing burden to building good generative models of the covariates. Knockoffs must satisfy two properties: (i) they must be independent of the response given the real covariates, and (ii) the joint distribution of covariates and knockoffs must be unchanged when any subset of variables is swapped between the knockoffs and the real data. Satisfying the first property is trivial: generate the knockoffs without using the response. The second property requires building conditional generative models that can match the distribution of the covariates.
Related work.
Existing knockoff generation methods can be broadly classified as either model-specific or flexible. Model-specific methods such as hmm knockoffs (Sesia et al., 2017) or AutoEncoding Knockoffs (Liu and Zheng, 2018) make assumptions about the covariate distribution, which can be problematic if the data does not satisfy these assumptions. hmm knockoffs assume the joint distribution of the covariates factorizes into a Markov chain. AutoEncoding Knockoffs use a vae to model the covariates and sample knockoffs. A vae assumes the data lies near a low-dimensional manifold, whose dimension is controlled by a latent variable. Covariates that violate this low-dimensional assumption can be better modeled by increasing the dimension of the latent variable, but at the risk of retaining more information about the covariates, which can reduce the power to select important variables. Flexible methods for generating knockoffs such as KnockoffGAN (Jordon et al., 2019) or Deep Knockoffs (Romano et al., 2018) focus on likelihood-free generative models. KnockoffGAN uses gan-based generative models, which can be difficult to estimate (Mescheder et al., 2018) and sensitive to hyperparameters (Salimans et al., 2016; Gulrajani et al., 2017; Mescheder et al., 2017). Deep Knockoffs employ the mmd, whose effectiveness often depends on the choice of a kernel, which can involve selecting a bandwidth hyperparameter. Ramdas et al. (2015) show that in several cases, across many choices of bandwidth, the mmd approaches 0 as dimensionality increases while the kl divergence remains nonzero, suggesting the mmd may not reliably generate high-dimensional knockoffs. Deep Knockoffs also prevent the knockoff generator from memorizing the covariates by explicitly controlling the correlation between the knockoffs and covariates. This constraint is specific to second-order moments, and may ignore higher-order moments present in the data.
We propose ddlk, a likelihood-based method for generating knockoffs without the use of latent variables. ddlk is a two-stage algorithm. The first stage uses maximum likelihood to estimate the distribution of the covariates from observed data. The second stage estimates the knockoff distribution by minimizing the kl divergence between the joint distribution of the real covariates and knockoffs and the joint distribution of any swap of coordinates between covariates and knockoffs. ddlk expresses the likelihoods for swaps in terms of the original joint distribution with real covariates swapped with knockoffs. Through the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016), we optimize the knockoff distribution under the worst swaps. By ensuring that the knockoffs are valid in the worst cases, ddlk learns valid knockoffs in all cases. To prevent ddlk from memorizing the covariates, we introduce a regularizer that encourages high conditional entropy of the knockoffs given the covariates. We study ddlk on synthetic, semi-synthetic, and real datasets. Across each study, ddlk controls the fdr while achieving higher power than competing gan-based, mmd-based, and autoencoder-based methods.
2 Knockoff filters
Model-X knockoffs (Candes et al., 2016) are a tool for building variable selection methods. Specifically, they facilitate control of the fdr, which is the expected proportion of selected variables that are not important. In this section, we review the requirements for building variable selection methods using Model-X knockoffs.
Consider a data generating distribution over covariates $x = (x_1, \dots, x_d)$ and a response $y$ that depends only on a subset $S^* \subseteq \{1, \dots, d\}$ of the variables. Let $\hat{S}$ be a set of indices identified by a variable selection algorithm. The goal of such algorithms is to return an $\hat{S}$ that maximizes the number of true discoveries $|\hat{S} \cap S^*|$ while maintaining the fdr at some nominal level $\alpha$:

$\mathrm{fdr} := \mathbb{E}\left[ \frac{|\hat{S} \setminus S^*|}{|\hat{S}| \vee 1} \right] \le \alpha.$
To control the fdr at the nominal level, Model-X knockoffs requires (a) knockoffs $\tilde{x}$, and (b) a knockoff statistic $w_j$ to assess the importance of each feature $j$.
Knockoffs $\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_d)$ are random vectors that satisfy the following properties for any set of indices $S \subseteq \{1, \dots, d\}$:

(1) $(x, \tilde{x})_{\mathrm{swap}(S)} \stackrel{d}{=} (x, \tilde{x})$

(2) $\tilde{x} \perp y \mid x$
The swap property eq. 1 ensures that the joint distribution of $(x, \tilde{x})$ is invariant under any swap. A swap operation at position $j$ is defined as exchanging the $j$th entries of $x$ and $\tilde{x}$. For example, when $d = 3$ and $S = \{1, 3\}$, $(x, \tilde{x})_{\mathrm{swap}(S)} = (\tilde{x}_1, x_2, \tilde{x}_3, x_1, \tilde{x}_2, x_3)$. Equation 2 ensures that the response is independent of the knockoffs given the original features $x$.
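In code, a swap is just an index exchange between the pair of vectors. A minimal numpy sketch (0-indexed coordinates; the function name `swap` is illustrative):

```python
import numpy as np

def swap(S, x, x_tilde):
    """Exchange the coordinates in S between a covariate vector and its knockoff."""
    x_new, xt_new = x.copy(), x_tilde.copy()
    x_new[S], xt_new[S] = x_tilde[S], x[S]
    return x_new, xt_new

x = np.array([1.0, 2.0, 3.0])
xt = np.array([10.0, 20.0, 30.0])
# swap the first and third coordinates (S = {1, 3} in the paper's 1-indexing)
a, b = swap([0, 2], x, xt)
```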
A knockoff statistic $w_j$ must satisfy the flip-sign property: swapping the $j$th covariate with its knockoff flips the sign of $w_j$. As a consequence, if feature $j$ is important, $w_j$ must be positive. Otherwise, the sign of $w_j$ must be positive or negative with equal probability.
Given knockoffs $\tilde{x}$ and knockoff statistics $w = (w_1, \dots, w_d)$, exact control of the fdr at level $\alpha$ can be obtained by selecting variables $\hat{S} = \{j : w_j \ge \tau\}$. The threshold $\tau$ is given by:

(3) $\tau = \min \left\{ t > 0 \;:\; \frac{1 + \#\{j : w_j \le -t\}}{\#\{j : w_j \ge t\} \vee 1} \le \alpha \right\}$
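The threshold and selection rule can be sketched as follows, assuming knockoff statistics `w` have already been computed (the example values of `w` are illustrative):

```python
import numpy as np

def knockoff_threshold(w, alpha):
    """Smallest t such that (1 + #{w_j <= -t}) / max(#{w_j >= t}, 1) <= alpha."""
    for t in np.sort(np.abs(w[w != 0])):
        fdp_hat = (1 + np.sum(w <= -t)) / max(np.sum(w >= t), 1)
        if fdp_hat <= alpha:
            return t
    return np.inf  # no threshold achieves the target: select nothing

w = np.array([5.0, 4.0, 3.0, 2.5, -0.5, 0.4, -0.3, 2.0])
tau = knockoff_threshold(w, alpha=0.25)
selected = np.where(w >= tau)[0]  # indices of selected features
```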
While knockoffs are a powerful tool to ensure that the fdr is controlled at the nominal level, the choice of method to generate knockoffs is left to the practitioner.
Existing methods for generating knockoffs include model-specific approaches that make strong assumptions about the covariate distribution, and flexible likelihood-free methods. If the joint distribution of $x$ cannot be factorized into a Markov chain (Sesia et al., 2017), or if $x$ does not lie near a low-dimensional manifold (Liu and Zheng, 2018), model-specific generators will yield knockoffs that are not guaranteed to control the fdr. Likelihood-free generation methods that use gans (Jordon et al., 2019) or the mmd (Romano et al., 2018) make fewer assumptions about $x$, but can be difficult to estimate (Mescheder et al., 2018), sensitive to hyperparameters (Salimans et al., 2016; Gulrajani et al., 2017; Mescheder et al., 2017), or suffer from low power in high dimensions (Ramdas et al., 2015). In realistic datasets, where $x$ can come from an arbitrary distribution and dimensionality is high, it remains to be seen how to reliably generate knockoffs that satisfy eqs. 1 and 2.
3 ddlk
We motivate ddlk with the following observation. The swap property in eq. 1 is satisfied if the divergence between the original and swapped distributions is zero. Formally, let $S$ be a set of indices to swap, and let $(\check{x}, \check{\tilde{x}}) = (x, \tilde{x})_{\mathrm{swap}(S)}$. Then under any such $S$:

(4) $\mathrm{KL}\left( p(x, \tilde{x}) \;\|\; p(\check{x}, \check{\tilde{x}}) \right) = 0$
A natural algorithm for generating valid knockoffs might be to parameterize each distribution above and solve for the parameters by minimizing the lhs of eq. 4. However, modeling $p(\check{x}, \check{\tilde{x}})$ for every possible swap $S$ is computationally infeasible in high dimensions. Theorem 3.1 provides a useful solution to this problem.
Theorem 3.1.
Let $P$ be a probability measure defined on a measurable space $(\Omega, \mathcal{F})$. Let $\mathrm{swap}(S, \cdot)$ be a swap function using indices $S$. If $v$ is a sample from $P$, the probability law of $\mathrm{swap}(S, v)$ is $P(\mathrm{swap}(S, \cdot))$.
As an example, in the continuous case, where $p$ and $p_{\mathrm{swap}}$ are the densities of $(x, \tilde{x})$ and $(x, \tilde{x})_{\mathrm{swap}(S)}$ respectively, $p_{\mathrm{swap}}$ evaluated at a sample $v$ is simply $p$ evaluated at the swap of $v$: $p_{\mathrm{swap}}(v) = p(\mathrm{swap}(S, v))$. We show the direct proof of this example and theorem 3.1 in appendix A. A useful consequence of theorem 3.1 is that ddlk needs to model only $p(x, \tilde{x})$, instead of every possible swap distribution. To derive the ddlk algorithm, we first expand eq. 4:
(5) $\mathrm{KL}\left( p(x, \tilde{x}) \;\|\; p(\check{x}, \check{\tilde{x}}) \right) = \mathbb{E}_{p(x)\, p(\tilde{x} \mid x)} \left[ \log \frac{p(x)\, p(\tilde{x} \mid x)}{p(\check{x})\, p(\check{\tilde{x}} \mid \check{x})} \right]$

where $(\check{x}, \check{\tilde{x}}) = (x, \tilde{x})_{\mathrm{swap}(S)}$. ddlk models the rhs by parameterizing $p(x)$ and $p(\tilde{x} \mid x)$ with $q_\theta(x)$ and $q_\phi(\tilde{x} \mid x)$ respectively. The parameters $\theta$ and $\phi$ can be optimized separately in two stages.
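As a numerical illustration of theorem 3.1 under a toy assumption (a single covariate with a jointly Gaussian knockoff), the density of the law of the swapped pair, evaluated at a point, equals the original density evaluated at the swapped point:

```python
import numpy as np

def gauss2_pdf(v, mu, cov):
    """Density of a 2d Gaussian at point v."""
    d = v - mu
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

mu = np.array([1.0, 0.0])            # mean of the joint (x, x_tilde)
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
P = np.array([[0.0, 1.0],            # permutation matrix swapping x and x_tilde
              [1.0, 0.0]])

v = np.array([0.3, -0.7])
# law of swap(V): Gaussian with mean P mu and covariance P cov P^T
lhs = gauss2_pdf(v, P @ mu, P @ cov @ P.T)
# theorem 3.1: equals the original density evaluated at the swapped point
rhs = gauss2_pdf(P @ v, mu, cov)
assert np.isclose(lhs, rhs)
```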
Stage 1: Covariate distribution estimation.
We model the distribution of $x$ using $q_\theta(x)$. The parameters $\theta$ are learned by maximizing the log-likelihood $\sum_{i=1}^{N} \log q_\theta(x^{(i)})$ over a dataset of $N$ samples.
Stage 2: Knockoff generation.
For any fixed swap $S$, minimizing the kl divergence between the following distributions ensures the swap property eq. 1 required of knockoffs:

(6) $\mathrm{KL}\left( q_\theta(x)\, q_\phi(\tilde{x} \mid x) \;\|\; q_\theta(\check{x})\, q_\phi(\check{\tilde{x}} \mid \check{x}) \right)$
Fitting the knockoff generator involves minimizing this kl divergence for all possible swaps $S$. To make this problem tractable, we use several building blocks that help us (a) sample swaps with the highest values of this kl divergence and (b) prevent $q_\phi(\tilde{x} \mid x)$ from memorizing $x$ to trivially satisfy the swap property in eq. 1.
3.1 Fitting ddlk
Knockoffs must satisfy the swap property eq. 1 for all potential sets of swap indices $S$. While this seems to imply that the kl objective in eq. 6 must be minimized under an exponential number of swaps, swapping every coordinate suffices (Candes et al., 2018). More generally, it is sufficient to show the swap property for a collection of sets such that every coordinate can be represented as the symmetric difference of members of the collection. See section A.4 for more details.
Sampling swaps.
Swapping coordinates can be expensive in high dimensions, so existing methods resort to randomly sampling swaps during optimization (Romano et al., 2018; Jordon et al., 2019). Rather than sample each coordinate uniformly at random, we propose parameterizing the sampling process for swap indices so that swaps sampled from this process yield large values of the kl objective in eq. 6. We do so because of the following property of swaps, which we prove in appendix A.
Lemma 3.2.
Let $\nu^*$ be the worst-case swap distribution; that is, the distribution over swap indices that maximizes the expected kl objective:

(7) $\nu^* = \arg\max_{\nu} \; \mathbb{E}_{S \sim \nu} \left[ \mathrm{KL}\left( q_\theta(x)\, q_\phi(\tilde{x} \mid x) \;\|\; q_\theta(\check{x})\, q_\phi(\check{\tilde{x}} \mid \check{x}) \right) \right]$

If the expected kl divergence under $\nu^*$ is zero, then the kl objective in eq. 6 is zero for every swap $S$.

Randomly sampling swaps can be thought of as sampling from $d$ Bernoulli random variables $b_1, \dots, b_d$ with parameters $\pi_1, \dots, \pi_d$ respectively, where each $b_j$ indicates whether the $j$th coordinate is to be swapped. A set of indices can be generated by letting $S = \{j : b_j = 1\}$. To learn a sampling process that helps maximize eq. 7, we optimize the values of $\pi_1, \dots, \pi_d$. However, since score function gradients for the parameters of Bernoulli random variables can have high variance, ddlk uses a continuous relaxation instead. For each coordinate $j$, ddlk learns the parameters for a Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) distribution over $b_j$.

Entropy regularization.
Minimizing the kl objective in eq. 6 over the worst-case swap distribution will generate knockoffs that satisfy the swap property eq. 1. However, a potential solution in the optimization of $q_\phi(\tilde{x} \mid x)$ is to memorize the covariates $x$, which reduces the power to select important variables.
To solve this problem, ddlk introduces a regularizer based on conditional entropy, to push $\tilde{x}$ away from being a copy of $x$. This regularizer takes the form $-\lambda\, \mathbb{E}_{q_\theta(x)}\!\left[ H\!\left( q_\phi(\tilde{x} \mid x) \right) \right]$ added to the minimization objective, where $\lambda \ge 0$ is a hyperparameter.
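To see why the regularizer discourages memorization, consider a toy Gaussian conditional $q_\phi(\tilde{x} \mid x) = \mathcal{N}(x, \sigma^2)$ (an illustrative assumption, not the model used in the paper): as $\sigma \to 0$, the knockoff becomes a copy of $x$ and the differential entropy diverges to $-\infty$, so the penalty grows without bound:

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# A generator that memorizes x has a near-degenerate conditional with tiny
# sigma; its entropy tends to -inf, so the -lambda * H penalty blows up.
lam = 0.1
penalties = [-lam * gaussian_entropy(s) for s in (1.0, 0.1, 0.001)]
# penalties increase as sigma shrinks, discouraging copies of x
```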
Including the conditional entropy regularizer and the Gumbel-Softmax sampling of swap indices, the final optimization objective for ddlk is:
(8) $\min_{\phi} \max_{\eta} \; \mathbb{E}_{S \sim q_\eta} \left[ \mathrm{KL}\left( q_\theta(x)\, q_\phi(\tilde{x} \mid x) \;\|\; q_\theta(\check{x})\, q_\phi(\check{\tilde{x}} \mid \check{x}) \right) \right] - \lambda\, \mathbb{E}_{q_\theta(x)}\!\left[ H\!\left( q_\phi(\tilde{x} \mid x) \right) \right]$
where $q_\eta$ is the Gumbel-Softmax distribution over swap indices with parameters $\eta$. We show the full ddlk procedure in algorithm 1. ddlk fits $q_\theta$ by maximizing the likelihood of the data. It then fits $q_\phi$ by optimizing eq. 8 with noisy gradients. To do this, ddlk first samples knockoffs conditioned on the covariates and a set of swap coordinates, then computes Monte Carlo gradients of the ddlk objective in eq. 8 with respect to the parameters $\phi$ and $\eta$. In practice, ddlk can use stochastic gradient estimates like the score function or reparameterization gradients for this step. The $q_\theta$ and $q_\phi$ models can be implemented with flexible models like MADE (Germain et al., 2015) or mixture density networks (Bishop, 1994).
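The swap-sampling step can be sketched with a binary Gumbel-Softmax (relaxed Bernoulli) per coordinate. This is a simplified numpy sketch without autodiff, with illustrative names `logits` and `temp`; a real implementation would use a framework such as PyTorch so the soft sample can carry gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_swap_indicators(logits, temp=0.5):
    """Relaxed Bernoulli (binary Gumbel-Softmax) sample per coordinate,
    with a straight-through hard threshold for the forward pass."""
    u = rng.uniform(1e-8, 1 - 1e-8, size=logits.shape)
    logistic_noise = np.log(u) - np.log1p(-u)        # Logistic(0, 1) sample
    soft = 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temp))
    hard = (soft > 0.5).astype(float)                # binary swap indicators
    return hard, soft

logits = np.zeros(5)           # one learnable logit per coordinate
hard, soft = sample_swap_indicators(logits)
S = np.flatnonzero(hard)       # coordinates to swap at this step
```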
4 Experiments
We study the performance of ddlk on several synthetic, semi-synthetic, and real-world datasets. We compare ddlk with several non-Gaussian knockoff generation methods: AutoEncoding Knockoffs (aek) (Liu and Zheng, 2018), KnockoffGAN (Jordon et al., 2019), and Deep Knockoffs (Romano et al., 2018). For all comparison methods, we downloaded the publicly available implementations (where available) and used the configurations and hyperparameters recommended by the authors; see appendix D for code and model hyperparameters.
Each experiment involves three stages. First, we fit a knockoff generator using a dataset of covariates $x$. Next, we fit a response model, and use its performance on a held-out set to create a knockoff statistic $w_j$ for each feature $j$. We detail the construction of these test statistics in appendix E. Finally, we apply a knockoff filter to the statistics to select features at a nominal fdr level $\alpha$, and measure the ability of each knockoff method to select relevant features while maintaining the fdr at $\alpha$. For the synthetic and semi-synthetic tasks, we repeat these three stages 30 times to obtain interval estimates of each performance metric.

In our experiments, we assume $x$ to be real-valued with continuous support and decompose the models $q_\theta(x)$ and $q_\phi(\tilde{x} \mid x)$ via the chain rule:

$q_\theta(x) = \prod_{j=1}^{d} q_\theta(x_j \mid x_{1:j-1}), \qquad q_\phi(\tilde{x} \mid x) = \prod_{j=1}^{d} q_\phi(\tilde{x}_j \mid x, \tilde{x}_{1:j-1}).$

For each conditional distribution, we fit a mixture density network (Bishop, 1994) where the number of mixture components is a hyperparameter. In principle, any model that produces an explicit likelihood can be used for each conditional. Fitting $q_\theta$ involves using samples from a dataset, but fitting $q_\phi$ involves sampling from $q_\phi$ itself. This is a potential issue: the gradient of the ddlk objective eq. 8 with respect to $\phi$ is difficult to compute, as it involves integrating over $\tilde{x}$, whose distribution depends on $\phi$. To solve this, we use an implicit reparameterization (Figurnov et al., 2018) of mixture densities. Further details of this formulation are presented in appendix B.
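The chain-rule factorization with per-conditional Gaussian mixtures can be sketched as follows; the stand-in `params` callable plays the role of a mixture density network, and its outputs are purely illustrative:

```python
import numpy as np

def gmm_logpdf(x, weights, means, sigmas):
    """Log density of a 1d Gaussian mixture at scalar x (log-sum-exp)."""
    comps = (np.log(weights)
             - 0.5 * np.log(2 * np.pi * sigmas**2)
             - 0.5 * ((x - means) / sigmas) ** 2)
    m = comps.max()
    return m + np.log(np.sum(np.exp(comps - m)))

def autoregressive_loglik(x, conditional_params):
    """log q(x) = sum_j log q(x_j | x_{<j}); each conditional is a mixture
    whose parameters are produced by a network reading x_{<j}."""
    total = 0.0
    for j, xj in enumerate(x):
        w, mu, sigma = conditional_params(j, x[:j])
        total += gmm_logpdf(xj, w, mu, sigma)
    return total

# illustrative stand-in for a mixture density network with 2 components
def params(j, x_prev):
    mu0 = x_prev.mean() if len(x_prev) else 0.0
    return (np.array([0.5, 0.5]),
            np.array([mu0 - 1.0, mu0 + 1.0]),
            np.array([1.0, 1.0]))

ll = autoregressive_loglik(np.array([0.2, -0.4, 1.1]), params)
```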
Across each benchmark involving ddlk, we vary only the entropy regularization parameter $\lambda$, based on the amount of dependence among covariates. The number of parameters, learning rate, and all other hyperparameters are kept constant. To sample swaps $S$, we use a straight-through Gumbel-Softmax estimator (Jang et al., 2016). This allows us to sample binary values for each swap coordinate while using gradients of a continuous approximation during optimization. For brevity, we present the exact hyperparameter details for ddlk in appendix C.
We run each experiment on a single CPU with 4GB of memory. ddlk takes roughly 40 minutes in total to fit both $q_\theta$ and $q_\phi$ on a 100-dimensional dataset.
Synthetic benchmarks.
Our tests on synthetic data seek to highlight differences in power and fdr between the knockoff generation methods. Each dataset in this section consists of 100 features, 20 of which are used to generate the response $y$. We split the data into a training set to fit each knockoff method, a validation set used to tune the hyperparameters of each method, and a test set for evaluating knockoff statistics.
[gaussian]: We first replicate the multivariate normal benchmark of Romano et al. (2018). We sample $x \sim \mathcal{N}(0, \Sigma)$, where $\Sigma$ is a $100$-dimensional covariance matrix with entries $\Sigma_{ij} = \rho^{|i-j|}$. This autoregressive Gaussian data exhibits strong correlations between adjacent features, and lower correlations between features that are further apart. We generate $y$ as a linear function of the 20 important features plus Gaussian noise, with the coefficients of the important features drawn at random. The values of $\rho$ and of the ddlk entropy regularization parameter $\lambda$ are fixed across runs. Our response model is a feed-forward neural network.
[mixture]: To compare each method on its ability to generate non-Gaussian knockoffs, we use a mixture of autoregressive Gaussians. This is a more challenging benchmark, as each covariate is multimodal and highly correlated with others. We sample $x \sim \sum_{k=1}^{3} \pi_k\, \mathcal{N}(\mu_k, \Sigma_k)$, where each $\Sigma_k$ is a $100$-dimensional covariance matrix whose $(i, j)$th entry is $\rho_k^{|i-j|}$. We generate $y$ as a linear function of the 20 important features plus Gaussian noise, with the coefficients of the important features drawn at random. The cluster centers $\mu_k$, mixture proportions $\pi_k$, correlation parameters $\rho_k$, and the ddlk entropy regularization parameter $\lambda$ are fixed across runs. Figure 2 visualizes two randomly selected dimensions of this data.
Results.
Figure 1 compares the average fdp (an empirical estimate of the fdr) and power (percentage of important features selected) of each knockoff generating method. In the case of the gaussian dataset, all methods control the fdr at or below the nominal level, while achieving 100% power to select important features. The main difference between the methods is in the calibration of null statistics. Recall that a knockoff filter assumes a null statistic to be positive or negative with equal probability; features with negative statistics below a threshold are used to control the number of false discoveries when features with positive statistics above the same threshold are selected. ddlk produces the most well-calibrated null statistics, as evidenced by the closeness of its fdp curve to the dotted diagonal line.

Figure 1 also demonstrates the effectiveness of ddlk in modeling non-Gaussian covariates. In the case of the mixture dataset, ddlk achieves significantly higher power than the baseline methods, while controlling the fdr at nominal levels. To understand why this may be the case, we plot the joint distribution of two randomly selected features in fig. 2. ddlk and AutoEncoding Knockoffs both capture all three modes in the data. However, AutoEncoding Knockoffs tend to produce knockoffs that are very similar to the original features, and yield lower power when selecting variables, as shown in fig. 1. Deep Knockoffs manage to capture the first two moments of the data, likely due to an explicit second-order term in the objective function, but tend to oversmooth and fail to properly estimate the knockoff distribution. KnockoffGAN suffers from mode collapse, and fails to capture even the first two moments of the data. This yields knockoffs that not only have low power, but also fail to control the fdr at nominal levels.
Robustness of ddlk to entropy regularization.
To provide guidance on how to set the entropy regularization parameter, we explore the effect of $\lambda$ on both fdr control and power. Intuitively, lower values of $\lambda$ will yield solutions of $q_\phi(\tilde{x} \mid x)$ that satisfy eq. 1 and control the fdr well, but may also memorize the covariates and yield low power. Higher values of $\lambda$ may help improve power, but at the cost of fdr control. In this experiment, we again use the gaussian dataset, but vary $\lambda$ and the correlation parameter $\rho$. Figure 3 highlights the performance of ddlk over various settings of $\lambda$ and $\rho$. We show a heatmap where each cell represents the rmse between the nominal fdr and mean fdp curves over 30 simulations. In each of these settings ddlk achieves a power of 1, so we visualize only the fdp. We observe that the fdp of ddlk is very close to its expected value for most settings of $\lambda$. This holds over the wide range of $\lambda$ explored, demonstrating that ddlk is not very sensitive to the choice of this hyperparameter. We also notice that data with weaker correlations see a smaller increase in fdp at larger values of $\lambda$. In general, checking the fdp on synthetic responses generated conditional on real covariates can aid in selecting $\lambda$.
Semisynthetic benchmark.
[gene]: To evaluate the fdr and power of each knockoff method using covariates found in a genomics context, we create a semi-synthetic dataset. We use RNA expression data of 963 cancer cell lines from the Genomics of Drug Sensitivity in Cancer study (Yang et al., 2012). Each cell line has expression levels for roughly 20K genes, of which we sample 100 such that every feature is highly correlated with at least one other feature. We create 30 independent replications of this experiment by repeating the following process. We first sample a gene uniformly at random and add it to the feature set $G$. At each subsequent step, we sample a gene $g$ uniformly at random from $G$, compute the set of the $k$ genes not in $G$ most highly correlated with $g$, uniformly sample a gene from this set, and add it to $G$. We repeat this process until $|G| = 100$, yielding 100 genes in total.
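This sampling procedure can be sketched as follows (a simplified version with synthetic data; `k`, the size of the candidate set, is a stand-in for the value used in the paper):

```python
import numpy as np

def sample_correlated_features(expr, n_features=100, k=5, seed=0):
    """Grow a feature set where each new gene is highly correlated
    with at least one already-chosen gene."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(expr.T))         # gene-by-gene correlation
    chosen = [int(rng.integers(expr.shape[1]))]
    while len(chosen) < n_features:
        g = rng.choice(chosen)                  # anchor gene already in the set
        order = np.argsort(-corr[g])            # most correlated first
        candidates = [i for i in order if i not in chosen][:k]
        chosen.append(int(rng.choice(candidates)))
    return chosen

expr = np.random.default_rng(1).normal(size=(200, 300))  # samples x genes
genes = sample_correlated_features(expr, n_features=10, k=5)
```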
We generate $y$ using a nonlinear response function adapted from a study on feature selection in neural network models of gene-drug interactions (Liang et al., 2018). The response consists of a nonlinear term, a second-order term, and a linear term. For brevity, appendix F contains the full simulation details. We set the ddlk entropy regularization parameter $\lambda$ as described in appendix C.
Results.
Figure 1 (right) highlights the empirical fdp and power of each knockoff generating method on gene. All methods control the fdr below the nominal level, but the average fdp of ddlk at fdr thresholds below 0.3 is closest to its expected value. This range of thresholds is especially important, as nominal fdr levels below 0.3 are the most commonly used in practice. In this range, ddlk achieves power on par with Deep Knockoffs at levels below 0.1, and higher power everywhere else. AutoEncoding Knockoffs and KnockoffGAN achieve noticeably lower power across all thresholds. Deep Knockoffs likely perform well here due to a lack of strong third- or higher-order moments of dependence between features. We attribute the success of ddlk and Deep Knockoffs to their ability to model highly correlated data.
COVID-19 adverse events.
[covid19]: The widespread impact of COVID-19 has led to the deployment of machine learning models to guide triage. Data for COVID-19 is messy because of both the large volume of patients and the changing practice of patient care. Establishing trust in models for COVID-19 involves vetting the training data to ensure it does not contain artifacts that models can exploit. Conditional independence tests help achieve this goal in two ways: (a) they highlight which features are most important to the response, and (b) they prune the feature set for a deployed model, reducing the risk of overfitting to processes in a particular hospital. We apply each knockoff method to a large dataset from one of the epicenters of COVID-19 to understand the features most predictive of adverse events.
We use electronic health record data on COVID-positive patients from a large metropolitan health network. Our covariates include demographics, vitals, and lab test results from every cbc taken for each patient. The response is a binary label indicating whether or not a patient had an adverse event (intubation, mortality, ICU transfer, hospice discharge, emergency department representation, O2 support in excess of nasal cannula at 6 L/min) within 96 hours of their cbc. There are 17K samples of 37 covariates in the training dataset, 5K in a validation set, and 6K in a held-out test set. We set the ddlk entropy regularization parameter $\lambda$ as described in appendix C. In this experiment, we use gradient boosted regression trees (Friedman, 2002; Ke et al., 2017) as our response model, and the expected log-likelihood as a knockoff statistic. We also standardize the data in the case of Deep Knockoffs, since mmd objectives that use the rbf kernel with a single bandwidth parameter work better when features are on the same scale.

Feature  DDLK  Deep Knockoffs  AEK  KnockoffGAN  Validated
Eosinophils count  ✓  ✗  ✗  ✗  ✓ 
Eosinophils percent  ✓  ✓  ✗  ✗  ✗ 
Blood urea nitrogen  ✓  ✗  ✗  ✗  ✓ 
Ferritin  ✓  ✗  ✗  ✗  ✗ 
O2 Saturation  ✓  ✗  ✗  ✗  ✓ 
Heart rate  ✓  ✗  ✗  ✗  ✓ 
Respiratory rate  ✓  ✓  ✗  ✗  ✓ 
O2 Rate  ✓  ✓  ✓  ✓  ✓ 
On room air  ✓  ✓  ✓  ✓  ✓ 
High O2 support  ✓  ✓  ✓  ✓  ✓ 
Age  ✗  ✗  ✓  ✓  ✗ 
Results.
As COVID-19 is a recently identified disease, there is no ground-truth set of important features for this dataset. We therefore use each knockoff method to help discover a set of features at a nominal fdr threshold, and validate each feature by manual review with doctors at a large metropolitan hospital. Table 1 shows the list of features returned by each knockoff method, and indicates whether or not a team of doctors thought the feature should have clinical relevance.
We note that ddlk achieves the highest power to select features, identifying 10 features, compared to 5 by Deep Knockoffs, and 4 each by AutoEncoding Knockoffs and KnockoffGAN. To understand why, we visualize the marginal distribution of each covariate, and the corresponding marginal distribution of samples from each knockoff method, in fig. 4.
We notice two main differences between ddlk and the baselines. First, ddlk fits asymmetric distributions better than the baselines. Second, despite the fact that the implementation of ddlk using mixture density networks is misspecified for discrete variables, ddlk models them better than the existing baselines. This implementation uses continuous models for each conditional, but is still able to approximate discrete distributions well. The components of each mixture appear centered around a discrete value, and have very low variance, as shown in fig. 5. This yields a close approximation to the true discrete marginal. We show the marginals of every feature for each knockoff method in appendix G.
5 Conclusion
ddlk is a generative model for sampling knockoffs that directly minimizes a divergence implied by the knockoff swap property. The optimization of ddlk involves first maximizing the explicit likelihood of the covariates, then minimizing the divergence between the joint distribution of covariates and knockoffs and that of any swap between them. To ensure ddlk satisfies the swap property under any swap indices, we use the Gumbel-Softmax trick to learn swaps that maximize the divergence. To generate knockoffs that satisfy the swap property while maintaining high power to select variables, ddlk includes a regularization term that encourages high conditional entropy of the knockoffs given the covariates. We find that ddlk outperforms various baselines on several synthetic and real benchmarks, including a task involving a large dataset from one of the epicenters of COVID-19.
Broader Impact
There are several benefits to using flexible methods for identifying conditional independence like ddlk. Practitioners that care about transparency have traditionally eschewed deep learning, but methods like ddlk can present a solution. By performing conditional independence tests with deep networks and by providing guarantees on the false discovery rate, scientists and practitioners can develop more trust in their models. This can lead to greater adoption of flexible models in basic sciences, resulting in new discoveries and better outcomes for the beneficiaries of a deployed model. Conditional independence can also help detect bias from data, by checking if an outcome like length of stay in a hospital was related to a sensitive variable like race or insurance type, even when conditioning on many other factors.
While we believe greater transparency can only be better for society, we note that interpretation methods for machine learning may not exactly provide transparency. These methods visualize only a narrow part of a model’s behavior, and may lead to gaps in understanding. Relying solely on these domainagnostic methods could yield unexpected behavior from the model. As users of machine learning, we must be wary of too quickly dismissing the knowledge of domain experts. No interpretation technique is suitable for all scenarios, and different notions of transparency may be desired in different domains.
References
 Bishop [1994] Christopher M Bishop. Mixture density networks. Technical report, Aston University, Birmingham, UK, 1994.
 Candes et al. [2016] E. Candes, Y. Fan, L. Janson, and J. Lv. Panning for Gold: ModelX Knockoffs for Highdimensional Controlled Variable Selection. ArXiv eprints, October 2016.
 Candes et al. [2018] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold:‘modelx’knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
 Falcon [2019] WA Falcon. PyTorch Lightning. GitHub repository, https://github.com/williamFalcon/pytorch-lightning, 2019.
 Figurnov et al. [2018] Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441–452, 2018.
 Friedman [2002] Jerome H Friedman. Stochastic gradient boosting. Computational statistics & data analysis, 38(4):367–378, 2002.
 Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for distribution estimation. CoRR, abs/1502.03509, 2015. URL http://arxiv.org/abs/1502.03509.
 Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. CoRR, abs/1704.00028, 2017. URL http://arxiv.org/abs/1704.00028.
 Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. CoRR, abs/1611.01144, 2016. URL http://arxiv.org/abs/1611.01144.

 Jordon et al. [2019] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByeZ5jC5YQ.
 Ke et al. [2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma et al. [2015] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in neural information processing systems, pages 2575–2583, 2015.
 Liang et al. [2018] Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.
 Liu and Zheng [2018] Ying Liu and Cheng Zheng. Autoencoding knockoff generator for fdr controlled variable selection. arXiv preprint arXiv:1809.10765, 2018.
 Maddison et al. [2016] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016. URL http://arxiv.org/abs/1611.00712.
 Mescheder et al. [2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pages 1825–1835, 2017.
 Mescheder et al. [2018] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? arXiv preprint arXiv:1801.04406, 2018.

 Ramdas et al. [2015] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 Romano et al. [2018] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Deep knockoffs, 2018.
 Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
 Sesia et al. [2017] Matteo Sesia, Chiara Sabatti, and Emmanuel J Candès. Gene hunting with knockoffs for hidden markov models. arXiv preprint arXiv:1706.04677, 2017.
 Tansey et al. [2018] Wesley Tansey, Victor Veitch, Haoran Zhang, Raul Rabadan, and David M Blei. The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.
 Yang et al. [2012] Wanjuan Yang, Jorge Soares, Patricia Greninger, Elena J Edelman, Howard Lightfoot, Simon Forbes, Nidhi Bindal, Dave Beare, James A Smith, I Richard Thompson, et al. Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells. Nucleic acids research, 41(D1):D955–D961, 2012.
Appendix
Appendix A Proofs
A.1 Proof of Theorem 3.1
Let $q$ be a probability measure defined on a measurable space $(\mathcal{V}, \mathcal{S})$. Let $\sigma_B$ be a swap function using indices $B$. If $v$ is a sample from $q$, the probability law of $\sigma_B(v)$ is $q \circ \sigma_B$.
Proof.
The swap operation $\sigma_B$ on $v \in \mathcal{V}$ swaps coordinates in the following manner: for each $j \in B$, the $j$th and $(j+d)$th coordinates are swapped. Let $(\mathcal{V}, \mathcal{S})$ be a measurable space, where elements of $\mathcal{V}$ are $2d$-dimensional vectors, and $\mathcal{S}$ is a $\sigma$-algebra on $\mathcal{V}$. Let $(\mathcal{V}_B, \mathcal{S}_B)$ also be a measurable space, where each element of $\mathcal{V}_B$ is an element of $\mathcal{V}$ but with the $j$th coordinate swapped with the $(j+d)$th coordinate for each $j \in B$. Similarly, let $\mathcal{S}_B$ be constructed by applying the same swap transformations to each element of $\mathcal{S}$. $\mathcal{S}_B$ is a $\sigma$-algebra, as swaps are one-to-one transformations and $\mathcal{S}$ is a $\sigma$-algebra.
We first show that $\sigma_B$ is a measurable function with respect to $\mathcal{S}$ and $\mathcal{S}_B$. This is true by construction of the measurable space $(\mathcal{V}_B, \mathcal{S}_B)$: for every element $A \in \mathcal{S}_B$, $\sigma_B^{-1}(A) \in \mathcal{S}$.
We can now construct a mapping $q_B(A) = q(\sigma_B^{-1}(A))$ for all $A \in \mathcal{S}_B$. This is the pushforward measure of $q$ under the transformation $\sigma_B$, and is well defined because $\sigma_B$ is measurable.
Using the fact that a swap applied twice is the identity, we get $\sigma_B^{-1} = \sigma_B$. With this, we see that the probability measure on $(\mathcal{V}_B, \mathcal{S}_B)$ is $q \circ \sigma_B$. ∎
A.2 Alternative derivation of Theorem 3.1 for continuous random variables
In this section we derive Theorem 3.1 for continuous random variables in an alternative manner. Let $x$ be a set of covariates, and $\tilde{x}$ be a set of knockoffs. Let $v = [x, \tilde{x}]$, and $v_B = \sigma_B(v)$, where $B$ is a set of coordinates in which we swap $x$ and $\tilde{x}$. Recall that a swap operation on $v$ is an affine transformation $\sigma_B(v) = P_B v$, where $P_B$ is a permutation matrix. Using this property, we get:
$$q_{v_B}(u) = q_v\!\left(P_B^{-1} u\right) \left|\det P_B^{-1}\right| = q_v(P_B u) = q_v(\sigma_B(u)).$$
The first step is achieved by using a change of variables, noting that $P_B$ is invertible and $P_B^{-1} = P_B$. The determinant of the Jacobian here is just the determinant of $P_B$. $P_B$ is a permutation matrix, so its determinant is $\pm 1$ and $\left|\det P_B\right| = 1$. That is, the density of the swapped variables evaluated at $u$ is equal to the original density evaluated at $\sigma_B(u)$.
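These two facts are easy to verify numerically. The sketch below (an illustrative numpy example, not part of the paper's implementation) builds the swap permutation matrix $P_B$ and checks that applying it twice is the identity and that $|\det P_B| = 1$:

```python
import numpy as np

def swap_matrix(d, B):
    """Permutation matrix P_B on R^{2d} that swaps coordinate j with j+d for each j in B."""
    P = np.eye(2 * d)
    for j in B:
        P[[j, j + d]] = P[[j + d, j]]  # swap rows j and j+d of the identity
    return P

d, B = 3, {0, 2}
P = swap_matrix(d, B)
v = np.arange(2 * d, dtype=float)  # v = [x, x_tilde]

assert np.allclose(P @ (P @ v), v)             # a swap applied twice is the identity
assert abs(abs(np.linalg.det(P)) - 1) < 1e-12  # |det P_B| = 1
```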
A.3 Proof of Lemma 3.2
Recall Lemma 3.2:
Worst-case swaps: Let $\hat{q}_{\text{swap}}$ be the worst-case swap distribution; that is, the distribution over swap indices that maximizes
$$\mathbb{E}_{B \sim \hat{q}_{\text{swap}}}\, \mathrm{KL}\!\left( q(x, \tilde{x}) \,\big\|\, q\!\left(\sigma_B([x, \tilde{x}])\right) \right)$$
with respect to $\hat{q}_{\text{swap}}$. If this quantity is minimized with respect to the knockoff parameters $\phi$, knockoffs sampled from $q_\phi(\tilde{x} \mid x)$ will satisfy the swap property in eq. 1 for any swap in the power set of $\{1, \dots, d\}$.
Proof.
If
$$\mathbb{E}_{B \sim \hat{q}_{\text{swap}}}\, \mathrm{KL}\!\left( q(x, \tilde{x}) \,\big\|\, q\!\left(\sigma_B([x, \tilde{x}])\right) \right) \qquad (9)$$
is minimized with respect to $\phi$ but maximized with respect to $\hat{q}_{\text{swap}}$, then for any other swap distribution $q_{\text{swap}}$, eq. 9 will be no greater. Minimizing eq. 9, which is nonnegative, with respect to $\phi$ implies that for any swap $B$ sampled from $\hat{q}_{\text{swap}}$ and for any knockoff sampled from $q_\phi(\tilde{x} \mid x)$, the KL divergence in eq. 9 is zero, so the joint distribution is invariant under $\sigma_B$.
As eq. 9 is also maximized with respect to $\hat{q}_{\text{swap}}$, swaps drawn from all other distributions will only result in lower values of eq. 9. Therefore, the joint distribution will be invariant under any swap in the power set of $\{1, \dots, d\}$. ∎
A.4 Sufficient sets for the swap condition
Recall the swap property required of knockoffs, highlighted in eq. 1:
$$\sigma_B([x, \tilde{x}]) \overset{d}{=} [x, \tilde{x}],$$
where $B$ is a set of coordinates under which we swap the covariates and knockoffs. For valid knockoffs, this equality in distribution must hold for any such $B$. One approach to check whether knockoffs are valid is to verify this equality in distribution for all singleton sets [Romano et al., 2018, Jordon et al., 2019]. To check whether the swap property in eq. 1 holds under any $B$, it suffices to check that eq. 1 holds under each of $\{1\}, \dots, \{d\}$.
We can generalize this approach to check the validity of knockoffs under other collections of indices besides singleton sets using the following property. Let $B_1, B_2 \subseteq \{1, \dots, d\}$. Then,
$$\sigma_{B_2}(\sigma_{B_1}(v)) = \sigma_{B_1 \triangle B_2}(v),$$
where $B_1 \triangle B_2$ is the symmetric difference of $B_1$ and $B_2$. Swapping the indices in $B_1 \triangle B_2$ is equivalent to swapping the indices in $B_1$, then the indices in $B_2$. If $j \in B_1 \cap B_2$, swapping twice will negate the effect of the swap.
We can extend this property to $m$ sets and define sufficient conditions to check whether the swap property holds. Let $B_1, \dots, B_m$ be a sequence of sets where each $B_i \subseteq \{1, \dots, d\}$. Let
$$B_1 \triangle B_2 \triangle \cdots \triangle B_m = \{j\}.$$
Checking the swap property in eq. 1 under the sequence of swaps $B_1, \dots, B_m$ is equivalent to checking eq. 1 under the singleton set $\{j\}$. Therefore, the swap property must also hold under the singleton set $\{j\}$.
If a collection of sets of swap indices contains, for each $j \in \{1, \dots, d\}$, a subsequence whose sequential symmetric difference is the singleton $\{j\}$, then a set of knockoffs that satisfies the swap property under each set in the collection will also satisfy the swap property under each singleton set, which is sufficient to generate valid knockoffs.
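The symmetric-difference identity for composed swaps can be checked directly. The numpy sketch below (illustrative; the `swap` helper is ours, not from the paper's code) verifies that swapping under $B_1$ and then $B_2$ equals a single swap under $B_1 \triangle B_2$:

```python
import numpy as np

def swap(v, d, B):
    """Return a copy of v with coordinates j and j+d exchanged for each j in B."""
    u = v.copy()
    for j in B:
        u[j], u[j + d] = u[j + d], u[j]
    return u

d = 4
v = np.random.default_rng(0).normal(size=2 * d)
B1, B2 = {0, 1, 2}, {1, 3}

# sigma_{B2}(sigma_{B1}(v)) == sigma_{B1 symmetric-difference B2}(v)
assert np.allclose(swap(swap(v, d, B1), d, B2), swap(v, d, B1 ^ B2))
```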
Appendix B Implicit reparameterization of mixture density networks
In our experiments, we decompose both $q_\theta(x)$ and $q_\phi(\tilde{x} \mid x)$ via the chain rule:
$$q_\theta(x) = \prod_{j=1}^{d} q_{\theta_j}\!\left(x_j \mid x_{1:j-1}\right), \qquad q_\phi(\tilde{x} \mid x) = \prod_{j=1}^{d} q_{\phi_j}\!\left(\tilde{x}_j \mid \tilde{x}_{1:j-1}, x\right).$$
We model each conditional using mixture density networks [Bishop, 1994], which take the form
$$q_{\theta_j}\!\left(x_j \mid x_{1:j-1}\right) = \sum_{k=1}^{K} \pi_k\!\left(x_{1:j-1}\right) \mathcal{N}\!\left(x_j;\, \mu_k\!\left(x_{1:j-1}\right),\, \sigma_k^2\!\left(x_{1:j-1}\right)\right),$$
where the functions $\pi_k$, $\mu_k$, and $\sigma_k$ characterize a univariate Gaussian mixture. The parameters of these functions are $\theta_j$.
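As a minimal illustration of the mixture form above, the snippet below evaluates a univariate Gaussian mixture density for fixed, hypothetical network outputs $(\pi_k, \mu_k, \sigma_k)$; in an actual MDN these would be produced by a network applied to $x_{1:j-1}$:

```python
import numpy as np

def mdn_density(x_j, pi, mu, sigma):
    """Density of a univariate Gaussian mixture: sum_k pi_k * N(x_j; mu_k, sigma_k^2)."""
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    comp = np.exp(-0.5 * ((x_j - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(pi * comp))

# Hypothetical network outputs for a 2-component mixture
p = mdn_density(0.0, pi=[0.5, 0.5], mu=[-1.0, 1.0], sigma=[1.0, 1.0])
assert p > 0
```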
Fitting $q_\theta$.
Let $\theta = (\theta_1, \dots, \theta_d)$; the parameters of $q_\theta$ contain parameters for every conditional $q_{\theta_j}$. The optimization of $q_\theta$ is straightforward: maximizing the likelihood only requires taking the derivative of $\log q_\theta(x) = \sum_{j=1}^{d} \log q_{\theta_j}(x_j \mid x_{1:j-1})$.
Fitting $q_\phi$.
Let $\phi = (\phi_1, \dots, \phi_d)$; the parameters of $q_\phi$ contain parameters for every conditional $q_{\phi_j}$. Recall that the loss function for $q_\phi$ involves an expectation over knockoffs sampled from $q_\phi(\tilde{x} \mid x)$.
The optimization of $q_\phi$ therefore requires gradients of the form $\nabla_\phi \mathbb{E}_{q_\phi}[\,\cdot\,]$, which involve the derivative of an expectation with respect to $\phi$. We use implicit reparameterization [Figurnov et al., 2018]. The advantage of implicit reparameterization over explicit reparameterization [Kingma et al., 2015] is that an inverse standardization function – which transforms random noise into samples from a distribution parameterized by $\phi$ – is not needed. Using implicit reparameterization, gradients of some objective $f$ can be rewritten as
$$\nabla_\phi \mathbb{E}_{q_\phi(z)}\!\left[f(z)\right] = \mathbb{E}_{q_\phi(z)}\!\left[\nabla_z f(z)\, \nabla_\phi z\right], \qquad \nabla_\phi z = -\left(\frac{\partial F(z; \phi)}{\partial z}\right)^{-1} \nabla_\phi F(z; \phi),$$
where $F(z; \phi)$ is a standardization function.
We use this useful property to reparameterize Gaussian mixture models. Let $q_\phi(z)$ be a Gaussian mixture model:
$$q_\phi(z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(z; \mu_k, \sigma_k^2\right),$$
where $\phi = \{\pi_k, \mu_k, \sigma_k\}_{k=1}^{K}$. Let the standardization function be the CDF of $q_\phi$:
$$F(z; \phi) = \sum_{k=1}^{K} \pi_k\, \Phi\!\left(\frac{z - \mu_k}{\sigma_k}\right),$$
where $\Phi$ is the standard normal Gaussian CDF. We use this to compute the gradient of $z$ with respect to each parameter:
$$\nabla_\phi z = -\left(\frac{\partial F(z; \phi)}{\partial z}\right)^{-1} \nabla_\phi F(z; \phi) = -\frac{\nabla_\phi F(z; \phi)}{q_\phi(z)}.$$
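The gradient identity above can be checked numerically. The sketch below is illustrative: it inverts the mixture CDF by simple bisection, computes $\partial z / \partial \mu_0$ via the implicit reparameterization formula, and compares the result with a finite difference through the inverse CDF:

```python
import math

def F(z, pi, mu, sigma):
    """Mixture-of-Gaussians CDF, used as the standardization function."""
    Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return sum(p * Phi((z - m) / s) for p, m, s in zip(pi, mu, sigma))

def density(z, pi, mu, sigma):
    """Mixture density q(z) = dF/dz."""
    return sum(p * math.exp(-0.5 * ((z - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for p, m, s in zip(pi, mu, sigma))

def invert_cdf(u, pi, mu, sigma, lo=-20.0, hi=20.0):
    """Solve F(z) = u by bisection (F is monotonically increasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid, pi, mu, sigma) < u else (lo, mid)
    return 0.5 * (lo + hi)

pi, mu, sigma = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0]
u = 0.6
z = invert_cdf(u, pi, mu, sigma)

# Implicit reparameterization: dz/dmu_0 = -(dF/dz)^{-1} * dF/dmu_0
eps = 1e-5
dF_dmu0 = (F(z, pi, [mu[0] + eps, mu[1]], sigma)
           - F(z, pi, [mu[0] - eps, mu[1]], sigma)) / (2 * eps)
dz_dmu0 = -dF_dmu0 / density(z, pi, mu, sigma)

# Check against an explicit finite difference through the inverse CDF
z_plus = invert_cdf(u, pi, [mu[0] + eps, mu[1]], sigma)
z_minus = invert_cdf(u, pi, [mu[0] - eps, mu[1]], sigma)
assert dz_dmu0 > 0
assert abs(dz_dmu0 - (z_plus - z_minus) / (2 * eps)) < 1e-4
```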
Putting it all together, we use the implicit reparameterization trick to implement each conditional distribution in $q_\theta$ and $q_\phi$.
Appendix C Implementation details and hyperparameter settings for ddlk experiments
We decompose $q_\theta(x)$ and $q_\phi(\tilde{x} \mid x)$ into the product of univariate conditional distributions using the product rule. We use mixture density networks [Bishop, 1994] to parameterize each conditional distribution. Each mixture density network is a 3-layer neural network with 50 hidden units in each layer and a residual skip connection from the input to the last layer. Each network outputs the parameters for a univariate Gaussian mixture with 5 components. We initialize the network such that the modes are evenly spaced within the support of the training data.
Using $q_{\text{swap}}$, we sample binary swap matrices of the same dimension as the data. As we require discrete samples from the Gumbel-Softmax distribution, we implement a straight-through estimator [Jang et al., 2016]. The straight-through estimator facilitates sampling discrete indices, but uses a continuous approximation during backpropagation.
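A minimal sketch of the straight-through idea in numpy (the actual implementation would rely on an autodiff framework such as PyTorch; the function name and temperature below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_st(logits, tau=0.5):
    """Straight-through Gumbel-Softmax sample.
    Forward pass: a discrete one-hot sample. Backward pass (in an autodiff
    framework) would use the gradient of the continuous relaxation `soft`,
    e.g. `hard + (soft - soft.detach())` in PyTorch."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    soft = np.exp(y - y.max())
    soft = soft / soft.sum()                    # continuous relaxation
    hard = np.eye(len(logits))[soft.argmax()]   # discrete one-hot sample
    return hard, soft

hard, soft = gumbel_softmax_st(np.array([1.0, 0.0, -1.0]))
assert hard.sum() == 1.0                 # exactly one index selected
assert abs(soft.sum() - 1.0) < 1e-9      # relaxation is a valid distribution
```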
The $q_\theta$ model is optimized using Adam [Kingma and Ba, 2014] for a maximum of 50 epochs. The $q_\phi$ model is optimized using Adam for a maximum of 250 epochs. We also implement early stopping on the validation loss, using the PyTorch Lightning framework [Falcon, 2019]. Our code can be found online by installing:
pip install -i https://test.pypi.org/simple/ ddlk==0.2
Compute resources.
We run each experiment on a single CPU core using 4GB of memory. Fitting $q_\theta$ for a 100-dimensional dataset with 2000 samples requires fitting 100 conditional models, and takes roughly 10 minutes. Fitting $q_\phi$ for the same data takes roughly 30 minutes.
Fitting ddlk to our covid-19 dataset takes roughly 15 minutes in total.
Appendix D Baseline model settings
For Deep Knockoffs and KnockoffGAN, we use code from each respective repository and use the recommended hyperparameter settings.
At the time of writing this paper, there was no publicly available implementation of AutoEncoding Knockoffs. We implemented AutoEncoding Knockoffs with a vae with a Gaussian posterior $q(z \mid x)$ and likelihood $p(x \mid z)$. Each of the networks parameterizing the means and variances of the posterior and likelihood is a 2-layer neural network with 400 units in the first hidden layer, 500 units in the second, and ReLU activations. The outputs of the variance networks are exponentiated to ensure variances are nonnegative. The outputs of the posterior networks are of dimension $k$, the latent dimension, and the outputs of the likelihood networks are of dimension $d$, the covariate dimension. For each dataset, we choose the dimension $k$ of the latent variable $z$ that maximizes the estimate of the ELBO on a validation dataset. In our experiments, we search for $k$ over a set of candidate values. For each dataset, we use the following $k$:
gaussian: 20
mixture: 140
gene: 30
covid-19: 60.
The neural networks are trained using Adam [Kingma and Ba, 2014] for a maximum of 150 epochs. To avoid very large gradients, we standardize the data using the mean and standard deviation of the training set. To generate knockoffs, we use the same approach prescribed by Liu and Zheng [2018]. We first sample the latent variable conditioned on the covariates using the posterior distribution: $z \sim q(z \mid x)$. This sample of $z$ is then used to sample a knockoff using the likelihood distribution: $\tilde{x} \sim p(x \mid z)$.
Since the covariates are standardized, we rescale the knockoffs by the training mean and standard deviation.
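This two-step sampling procedure can be sketched as follows. The placeholder functions below stand in for the trained VAE posterior and likelihood networks; their shapes and outputs are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_params(x):
    """Hypothetical trained encoder: returns mu_z(x), sigma_z(x) for k = 2."""
    return 0.5 * x[:2], np.ones(2)

def likelihood_params(z):
    """Hypothetical trained decoder: returns mu_x(z), sigma_x(z) for d = 4."""
    return np.concatenate([z, z]), 0.1 * np.ones(4)

x = rng.normal(size=4)
mu_z, sigma_z = posterior_params(x)
z = mu_z + sigma_z * rng.normal(size=mu_z.shape)        # z ~ q(z | x)
mu_x, sigma_x = likelihood_params(z)
x_knock = mu_x + sigma_x * rng.normal(size=mu_x.shape)  # x_tilde ~ p(x | z)
assert x_knock.shape == x.shape
```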
Appendix E Robust model-based statistics for fdr control
The goal of any knockoff method is to help compute test statistics for a conditional independence test. We employ a variant of the hrt [Tansey et al., 2018] to compute test statistics for each feature $j$. We split the dataset $\mathcal{D}$ into train and test sets $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ respectively, then sample knockoff datasets $\widetilde{\mathcal{D}}_{\text{train}}$ and $\widetilde{\mathcal{D}}_{\text{test}}$ conditioned on each. Next, a model is fit with $\mathcal{D}_{\text{train}}$.
To compute knockoff statistics with the fitted model, we use a measure of performance $t$ on the test set. For real-valued $y$, $t$ is based on the mean squared error, and for categorical $y$, $t$ is the expected log-probability of $y$. A knockoff statistic $w_j$ is recorded for each feature $j$ by comparing $t$ on $\mathcal{D}_{\text{test}}$ with $t$ on $\widetilde{\mathcal{D}}_{\text{test}, j}$, where $\widetilde{\mathcal{D}}_{\text{test}, j}$ is $\mathcal{D}_{\text{test}}$ but with the $j$th feature swapped with its knockoff.
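The statistic computation can be sketched as follows. This toy numpy example uses a linear least-squares model and negative mean squared error as the performance measure $t$; the data, model, and sign convention are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)  # only feature 0 matters
X_knock = rng.normal(size=(n, d))                  # crude stand-in for real knockoffs

X_tr, X_te, y_tr, y_te = X[:250], X[250:], y[:250], y[250:]
Xk_te = X_knock[250:]

beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # model fit on D_train

def t(Xmat):
    """Negative mean squared error on the test set (higher is better)."""
    return -np.mean((y_te - Xmat @ beta) ** 2)

w = []
for j in range(d):
    X_swap = X_te.copy()
    X_swap[:, j] = Xk_te[:, j]       # replace feature j with its knockoff
    w.append(t(X_te) - t(X_swap))    # knockoff statistic for feature j

assert w[0] > max(w[1:])  # the important feature gets the largest statistic
```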
In practice, we use flexible models like neural networks or boosted trees. While the model-based statistic above will satisfy the properties detailed in section 2 and control the fdr, its ability to do so is hindered by imperfect knockoffs. In such cases, we observe that knockoff statistics for null features are centered around some value greater than zero, violating a condition required for empirical fdr control. This happens because, if the covariates and knockoffs are not equal in distribution, models trained on the covariates will fit the covariates better than the knockoffs and inflate the value of the test statistic. This can lead to an increase in the false discovery rate, as conditionally independent features may be selected if their statistic is larger than the selection threshold. To combat this, we propose a mixture statistic that trades off power for fdr control.
The mixture statistic involves fitting a model for each feature $j$ using an equal mixture of data in $\mathcal{D}_{\text{train}}$ and $\widetilde{\mathcal{D}}_{\text{train}}$, then computing $w_j$ as above. Such a model achieves lower performance on $\mathcal{D}_{\text{test}}$, but higher performance on $\widetilde{\mathcal{D}}_{\text{test}}$, yielding values of $w_j$ with modes closer to zero, enabling finite-sample fdr control. However, this fdr control comes at the cost of power, as the method's ability to identify conditionally dependent features is reduced.
Appendix F Nonlinear response for gene experiments
We simulate the response for the gene experiment using a nonlinear response function designed for genomics settings [Liang et al., 2018]. The response consists of two first-order terms, a second-order term, and an additional nonlinearity:
where $m$ is the number of important features. In our experiments, the first $m$ features are important, while the remaining $d - m$ are unimportant.
Appendix G Generating knockoffs for COVID-19 data
In this section, we visualize the marginals of each feature in our covid-19 dataset, along with the marginals of knockoffs sampled from each method. Figures 8, 7, 6 and 5 plot the marginals of samples from ddlk, KnockoffGAN, Deep Knockoffs, and AutoEncoding Knockoffs, respectively. These provide insight into why ddlk is able to select more features at the same nominal fdr threshold of 0.2. We first notice that knockoff samples from ddlk match the marginals of the data very well; ddlk is the only method that models asymmetric distributions well.
Although ddlk uses mixture density networks, which are continuous models, discrete data is also modeled well. For example, the values of O2_support_above_NC – a binary feature indicating whether a patient required oxygen support greater than a nasal cannula – are also the modes of the mixture density learned by ddlk. Samples from AutoEncoding Knockoffs, KnockoffGAN, and Deep Knockoffs instead tend to spread mass across these values.