In the past decades, the machine learning methods have achieved great advancement in improving prediction accuracy. However, prediction accuracy is not the sole quest of ’big data’ analysis. A universal aim across a lot of scientific disciplines is to identify the causes of certain outcome. For example, in new drug development, a personalized medicine strategy can be formed by identifying the gene markers that is biological linked to certain drug response or resistance. In electronic health record analysis, the goal is to identify ’actionable’ factors for the purpose of reducing cost and improving the quality of care. In epidemiology or sociology studies, it is of interest to identify protection or risk factors, especially those that can be changed through public policy or civil planning to improve public welfare. In these scientific endeavors, the need is to identify the important factors associated certain outcome, so that further confirmatory investigation (e.g. randomized international studies) could be conducted to expand knowledge or change future actions for a better outcome. For this purpose, we are interested in procedures that control the Type I error, which is the chance of a false discovery. With a large amount of factors to test, controlling the chance of any false discovery (family-wise error) is too stringent. Therefore the objective is usually relaxed to control the False Discovery Rate (FDR)(Benjamini & Hochberg, 1995).
A recent breakthrough in the statistical theory, i.e. the Model-X framework(Barber & Candès, 2015; Candès et al., 2018), provided a general solution. It can be incorporated with any machine learning models to select true signals associated with the outcome, with rigorous control for FDR. This line of research showed light on expanding the horizon of supervised learning methods beyond prediction. However, the implementation of Model-X requires the generation of the so called ’knockoffs’, which has very limited existing methods. The goal of our paper is to fill in the gap by proposing a model-free method for generating knockoffs that is suitable for any data type. It can also be efficiently implemented leveraging on the power of of the recent development in deep generative models. Consequently, with the creation of such a generator, we equipped every supervised learning method the ability to conduct FDR controlled variable selection.
We consider the following problem (Candès et al., 2018). There are i.i.d. samples
from a population, where the predictors are random variables. The outcome only depend on a subset of predictors , i.e. given , is independent of the other ’s. The smallest set
that satisfy this requirement is considered as true factors. The goal is to find a good estimatorfor with control of the . Barber & Candès (2015)
first proposed the method for linear regression with Gaussian errors, where the design matrixis assumed to be fixed. Candès et al. (2018) greatly generalized it to the Model-X framework. In contrast to the traditional statistical models focusing on the conditional distribution , Model-X makes no assumption for the conditional distribution, but assume has a known distribution for the purpose of generating knockoffs that satisfies Definition 1.1.
Model-X knockoffs for the family of random variables are a new family of random variables constructed with the following two properties: (1) for any subset , (2) if there is a response Y. (2) is guaranteed if is constructed without looking at Y.
Proposition 1 (Candès 2018)
The random variables are model-X knockoffs for
if and only if , for any . And .
To implement Model-X with any machine learning models, one need to define the model-specific feature statistics that has the flip-sign property,
According to Thm 3.4 in Candès et al. (2018), the following procedure (Knockoff) can estimate the controlling for modified FDR, i.e. . First, compute the threshold then the selected set is
More conservatively, the following procedure (Knockoff+) can control for the FDR, choose a threshold and select the variables in the set
An example of the feature statistics is the signed max lambda statistics in penalized regressions Barber & Candès (2015): for the concatenated design matrix , define . For each feature and its knockoff , let and . The signed lambda statistics is defined as . Various feature statistics has been proposed for common learning methods (Candès et al., 2018; Gimenez et al., 2018), however the current knockoff generation methods are still limited to make Model-X generally applicable on any real data set.
Although Candès et al. (2018) provided a general algorithm for generating Model-X knockoffs, i.e. the Sequential Conditional Independent Pairs algorithm, it needs to sample from the conditional distribution of from which becomes intractable with slightly complex distributional assumption for . Candès et al. (2018)
also proposed a second-order approach by matching first two moments ofand , which satisfies Def. 1.1 when
is from Gaussian distribution.Sesia et al. (2018) proposed an algorithm to sample knockoffs when
is from Hidden Markov Model.Gimenez et al. (2018) proposed algorithm that applicable for
In this paper, we relaxes the distributional assumptions largely from all the existing methods. Our procedure assumes there is a (multivariate) latent variable , conditional on which ’s are mutually independent. We propose a procedure to generate ’s knockoff from conditional distributions of and . And we also provide a FDR bound when these estimated conditional probabilities is compatible with the distribution of . In practice, we assume
and the conditional distributions are from parametric families, which is adaptive to different real data application, and the parameters can be approximated by deep neural networks. In this manner, the algorithms developed in the line of research on variational autoencoder, provide a flexible and practical tool for implementing our proposed Knockoff generation method. In the next session, we provide the formal derivation and justification for this approach.
2 A Model-Free Knockoff Generator With Latent Encoding Variables.
The random vector of covariatesare from (X can be continuous, discrete or mixed ). There exists a vector of continuous latent random variables from distribution . Each component of the covariates , such that ’s are mutually independent given . For example, when the predictor is continuous, ; and when the predictor is binary, . Given the observed samples from , to estimate and may not be an identifiable problem. However, we just need to find one that approximately satisfy the above condition in practice for our purpose of knockoff generation. Then we propose to generate knockoff as Algorithm 1.
Algorithm 1 can be efficiently implemented with but not limited to the existing algorithms developed in the line of research on Variational Autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014; Maddison et al., 2017; Jang et al., 2017). For example, in the common VAE implementation, is assumed tobe from its prior , the working model is assumed to be from , and is , where and and are functions to be approximated by neural networks. Theis generated as . Another example is to approximate discrete ’s by the concrete distribution (Jang et al., 2017; Maddison et al., 2017). In this case is generated from and are element-wise independent. These two methods provide efficient algorithms for implementing Algorithm 1 in specific cases. However Algorithm 1 and the following justification are more general. We do not assume ’s from independent prior distribution, and do not need to be element-wise independent either. See more discussions in Section 5 paragraph 2.
Notice that the Step 3 in Algorithm 1 is different from the deep generative model, the latter generate new image by first generating a from the predetermined prior distribution . However, in step 3 of our algorithm, the new is generated from the conditional distribution . The following Theorem 2.1 justifies why we propose procedure as so.
For any vector of random variables , such that conditional on , ’s are mutually independent. If is from , then from is model-X knockoffs.
Proof: For any and sets , , , ,
We have shown and
are exchangeable in the joint distribution, thenis the Model-X knockoff by Proposition 1.
To provide the FDR bound for the knockoffs generated from the estimated encoder and decoder, we do not assume there exists a satisfies the condition of Theorem 2.1 exactly. Instead, we assume the working models and observed distribution of are approximately compatible as in the following theorem. We then apply the results per Barber et al. (2018) to show FDR control, where they showed the FDR derived from approximate knockoffsis bounded by a function of the observed KL divergence between and as below:
Assume the observed marginal distribution and the working models (, , estimated in Algorithm 1) are approximately compatible as following: there exists a random vector from certain distribution and , such that
Where the density of is denoted as , considering as a fixed function. And denote as a random variable generated from the decoder .
Then the FDR can be controlled at .
Proof: Consider which is another independent sample from when sampling using the same . Since is element-wise independent, conditions in Theorem 2.1 holds, thus is a model-X knockoff of and
Notice that by our assumptions
So and we can bound the log likelihood ratio by
So we can bound
And FDR has the bound in the statement by applying Lemma 2 in Barber et al. (2018). Specifically, if , then FDR will be asymptotically bounded by .
In the simulation, we compare three knockoff generation schemes: a) our proposed method implemented through VAE b) the fixed knockoff generation method in Barber & Candès (2015) c) the second order matching in Candès et al. (2018). We demonstrate the performance of these methods in two simulation scenarios. In both scenarios, the sample size is 200, and number of potential predictors is 100, where the first variables are the true signals. The coefficients for the true predictors are alternating and , where is the magnitude of the signal. In Figures 12, the power and FDR are drawn as curves with respect to . Results based on replications are presented for
, Gaussian and binary outcomes separately. In the Gaussian case, the error term is standard normally distributed. In the logistic regression case, binary outcome follows Bernoulli distribution with probabilities.
Setting 1. The first setting generates continuous predictors: a) Generate independent uniform distributed by random matrix. b) Let C be the Cholesky decomposition of the correlation matrix with on the off diagonal entries and compute . c) Even column is generated as
, and odd columnis generated as , where following the python array indexing. d) Rescale ’s to be within range by subtracting the column and divided by column . This generation scheme is an arbitrary representation of a set of lightly correlated non-normal distributed continuous variables. Empirically, the pairwise correlation ranges from to and the absolute value of the correlation has mean 0.07. The results of this setting are presented in Figure 1. The signal to noise ratios of the Gaussian case approximately has the range of for and for setting with . We present the results controlled for , because of the page limit, the binary case has low power ( although the comparison of methods are the same), thus we present results of binary outcome for . Setting 2. The second setting generates categorical predictors:a) Generate independent standard Gaussian distributed by random matrix. b) Let C be the Cholesky decomposition of the correlation matrix with on the off diagonal entries. Let . c) is the categorized matrix, where . Approximately, the signal to noise ratio for the Gaussian case is for and for . We present results for FDR controlled at .
Results According to Figures 1 2, our proposed method implemented through VAE has FDR controlled below the predetermined threshold, and it demonstrates higher power than its two competitors. Since the data were not Gaussian, the second-order matching method has the lowest power. The assumptions of the Fixed knockoff generations holds for the Gaussian cases, thus the power of increases to about for setting 1 and for setting 2 with large signal and with . Cases with more true signals demonstrate better power for Gaussian outcomes, since we control for the proportion of false discoveries, more errors are allowed with increasing number of true signals. In the binary case, although the Gaussian error assumption for using the fixed knockoff generation does not hold, FDR is still controlled in all three methods. However the powers are low.
Implementation details. For both settings, we implemented our method with VAE Kingma & Welling (2014), where the latent variables are multivariate normal with dimensions. The models were trained with batch size and for epochs with the default ‘adam’ optimizer in ‘keras’Chollet et al. (2015). The architecture of VAE for setting 1 is as following, the encoder networks for and have two hidden layers with and neurons with ‘tanh’ activation and regularization with tuning parameter
. The activation for the output layer is ‘linear’. The decoder network has two hidden layers with the same specifications, followed by a batch-normalization layer and output with linear activation. The architecture of VAE for setting 2 is as following, the encoder networks has two hidden layers withand
neurons with ‘relu’ activation andregularization with tuning parameter
. The activation for the output layer is ‘linear’. The decoder is an one layer network with the ‘sigmoid’ activation function and then threshold by.
For all simulation settings and methods, we are using the signed max lambda statistics in Barber & Candès (2015) (see definition on page 2). We implemented the penalized linear and logistic regression with R package ’glmnet’ with a finer grid of tuning parameter to break ties in and . We used the more conservative Knockoff+ as in (2).
4 Real Data Experiment
In the real data experiment, we demonstrate the performance of our proposed method on generating knockoffs for sparse genetic mutation data. The dataset and scientific problem is from a study aiming at detecting mutations associated with drug resistance in patients with Human Immunodeficiency Virus Type 1 (HIV-1) Rhee et al. (2006). Due to the space limit, here we present results for the 7 protease inhibitors. There is no ground truth for which subset of mutations caused the drug resistance. Nevertheless, there is a set of treatment-selected mutations (TSMs) identified from a separate study Rhee et al. (2005), which are selected as the mutations marginally correlated with the patient treatment history with protease inhibitors. Thus to evaluate the performance of the knockoffs, we first simulates the outcome for which we know the true signal, and evaluate the FDR control and power as shown in Figure 3, and then we used the real data outcome and compared our selected mutations with the TSM. Notice that the TSM is not drug specific and can not be considered as ground truth, by showing number of selected mutations in TSM set we are not showing the FDR control but demonstrate reproducibility across studies.
The normal distributed does not fit the discrete and sparse nature of the genetic mutation data. Thus we adopted the Categorical Variation Autoencoder (CAT-VAE) Jang et al. (2017); Maddison et al. (2017), where in our implementation, the latent variables are Gumbel-Softmax distributed variable with temperature 1, each with categories; both the encoder and decoder has one hidden layer of dimension .
Figure 3 demonstrates the results for simulated outcomes using the real data . Overall speaking, our proposed method still achieves the FDR control and demonstrates the highest power among the three. The correlation is smaller than the previous simulation settings (since the X is a sparse matrix), so the other two methods has better power. Controlled for a FDR level of , all three methods achieve a high power of 80% very large signal at in the Gaussian case. For the binary case, our proposed method shows greater advantage in achieving more than 2-times the power of the other two methods.
Table 1 presents the number of selected and matched selection of mutations with the TSMs. Here we present results controlled by both Knockoff+ (as in (2)) and Knockoff (as in (1)) controlled at FDR level . Similar to the simulated settings, our proposed method select more mutations than the other methods, the average proportion of selected mutation that is in the TSM set for our proposed method is 75.3%, which demonstrate reproducibility across studies. With less conservative Knockoff procedure, Fixed and Second-Order methods are able to select more mutations, where our proposed method seems be less sensitive to choice of Knockoff or Knockoff+.
Implementation details For these analysis, the CAT-VAE knockoffs are generated for mutations with () occurrence in the data set and the number of patients is . Since each drug resistance outcome is observed in a subset of samples, when conduct analysis for each drug, we only consider mutations with occurrence. Following analysis in Rhee et al. (2006), the outcome is a continuous variable, which is the log transformed drug susceptibility. For results in Figure 3, the linear signal is simulated as following: in each replication, randomly chose 50 predictors from 186 mutations , and compute as sum of ’s with alternating signs. The Gaussian outcome is plus a standard normal noise and the binomial outcome is generated from .
5 Discussion and Future Directions
The connection between representative learning and Model-X FDR control framework has two directions. One direction is that the VAE algorithms provide a way to efficiently implement the knock generating methods proposed in this paper. The other direction is that the Model-X can also enhance the interpretability of representative learning by equipping it with the ability for finite sample FDR controlled feature selection, see example inGimenez et al. (2018). With the natural connection between our proposed algorithm and the deep generative model for images, it is a promising future direction to investigate how to construct feature statistics to interpret image features.
The proposed knockoff generating method opens a venue for the development of new algorithms. For example, the VAE and CAT-VAE algorithms both assume are independent, this can be relaxed since our method does not assume this. Some minor changes also worth further investigation, for example, instead of implementing the existing VAE algorithm with infinitesimal , we can add a small noise to for generating , with predetermined or estimated variance.
For real data applications, the existing VAE algorithms has already offered a large class of working knockoff generation models to choose from. An open question is how to evaluate the goodness of knockoffs and pick a better model. One evaluation approach is from the performance in FDR control and power in simulated settings. To this end, one can assume a Bayesian prior for the signals patterns of the data, for each knockoff model estimate the FDR and power for a class of signal patterns from several simulated settings. Another approach is to evaluate the knockoff directly. This is a largely unsolved problem. An insight from Theorem 2.2 is the FDR is controlled when and the density of are close. Therefore it is more desirable that is element-wise independent.
To sum up, the method we proposed in the paper provides a practical method for model free knockoff generation. It will stimulate broad further research and software development towards a general applicable tool for the scientific community to identify important factors with rigorous control for false discovery.
- Barber & Candès (2015) Rina Foygel Barber and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085, 10 2015.
- Barber et al. (2018) Rina Foygel Barber, Emmanuel J. Candès, and Richard J. Samworth. Robust inference with knockoffs. arXiv:1801.03896 [stat.ME], 2018.
- Benjamini & Hochberg (1995) Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.
- Candès et al. (2018) Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: ‘model-x’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
- Chollet et al. (2015) François Chollet et al. Keras. https://keras.io, 2015.
- Gimenez et al. (2018) Jaime Roquero Gimenez, Amirata Ghorbani, and James Zou. Knockoffs for the mass: new feature importance statistics with false discovery guarantees. arXiv:1807.06214 [stat.ML], 2018.
- Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. ICLR, 2017.
- Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
Maddison et al. (2017)
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh.
The concrete distribution: A continuous relaxation of discrete random variables.ICLR, 2017.
Rezende et al. (2014)
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.2014.
- Rhee et al. (2005) Soo-Yon Rhee, W. Jeffrey Fessel, Andrew R. Zolopa, Leo Hurley, Tommy Liu, Jonathan Taylor, Dong Phuong Nguyen, Sally Slome, Daniel Klein, Michael Horberg, Jason Flamm, Stephen Follansbee, Jonathan M. Schapiro, and Robert W. Shafer. Hiv-1 protease and reverse-transcriptase mutations: Correlations with antiretroviral therapy in subtype b isolates and implications for drug-resistance surveillance. The Journal of Infectious Diseases, 192(3):456–465, 2005.
- Rhee et al. (2006) Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L. Brutlag, and Robert W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.
- Sesia et al. (2018) M Sesia, C Sabatti, and E J Candès. Gene hunting with hidden markov model knockoffs. Biometrika, 2018.