1 Introduction
Feature selection for high-dimensional data is of fundamental importance across many scientific disciplines (Tang and Liu, 2014; Li et al., 2018). Grouping structure among features arises naturally in many statistical modeling problems; common examples range from multilevel categorical features in a regression model to genetic markers from the same gene in genetic association studies. Incorporating the grouping structure into feature selection takes advantage of scientifically meaningful prior knowledge, increases feature selection accuracy, and improves the interpretability of the results (Huang et al., 2012).
In this paper, we focus on group-feature selection as an approach for model interpretation and dimension reduction in both linear and nonlinear contexts. Our method achieves stable feature selection results in the high-dimensional setting where $p \gg n$, with $p$ the number of features and $n$ the number of samples, which is usually a challenging problem for existing methods.
Group-feature selection has been studied from different perspectives. The group-Lasso, a generalization of the Lasso (Tibshirani, 1996), has been proposed as a mainstream approach to conduct group-wise feature selection (Yuan and Lin, 2006). To relax the linear constraint, Meier et al. (2008)
extended the group-Lasso from linear regression to logistic regression. To speed up the computation for group-Lasso,
Yang and Zou (2015) have further developed a more computationally tractable and efficient algorithm. However, researchers have found that the feature selection results of Lasso and group-Lasso are sensitive to the choice of tuning parameters (Tibshirani, 1996; Su et al., 2016). In practice, the tuning parameter is often chosen by cross-validation (CV). But it has been reported that in high-dimensional settings the widely adopted CV typically selects a large number of false features (Bogdan et al., 2015). In order to ensure that the selected features are correct and replicable, several methods have been proposed to perform feature selection while controlling the false discovery rate (FDR)—the expected fraction of false selections among all selections.
Among them, Sorted L-One Penalized Estimation (SLOPE) (Bogdan et al., 2015) and Knockoffs (Barber et al., 2015; Candes et al., 2018) are the state-of-the-art methods and have received the most attention. SLOPE was proposed to control the FDR in the classical multiple linear regression setting. SLOPE is defined to be the solution to a penalized objective function
$$\hat{\beta} = \operatorname*{arg\,min}_{b \in \mathbb{R}^p} \; \frac{1}{2}\left\|y - Xb\right\|_2^2 + \sum_{j=1}^{p} \lambda_j |b|_{(j)},$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, and $|b|_{(1)} \ge |b|_{(2)} \ge \cdots \ge |b|_{(p)}$ is the vector of sorted absolute values of the coordinates of $b$. Brzyski et al. (2018) extended SLOPE as group-SLOPE to perform group-feature selection, but it is limited to linear regression. The notion of Knockoffs was first introduced in Barber et al. (2015) and improved as model-X Knockoffs by Candes et al. (2018). The Knockoffs variables serve as negative controls and help identify the truly important features by comparing the feature importance between the original features and their Knockoffs counterparts. Originally, the Knockoffs filter was restricted to homoscedastic linear models with $n \ge p$ (Barber et al., 2015), and it was later extended to a group-sparse linear regression setting by Dai and Barber (2016).
In the state-of-the-art directions of SLOPE and Knockoffs, group-SLOPE (Brzyski et al., 2018) and group-Knockoffs (Dai and Barber, 2016) are the only solutions for group-feature selection. However, they suffer from the following limitations: (1) group-Knockoffs can only handle linear regression and is restricted to the $n \ge p$ setting; (2) group-SLOPE can only deal with linear regression and cannot achieve robust feature selection results in the high-dimensional setting where $p \gg n$; (3) group-SLOPE does not provide end-to-end group-wise feature selection and requires groups of features to be orthogonal to each other.
To resolve all of these limitations, we propose Deep-gKnock (Deep group-feature selection using Knockoffs), which combines model-X Knockoffs and deep neural networks (DNNs) to perform model-free group-feature selection in both linear and nonlinear contexts while controlling the group-wise FDR. DNNs are a natural choice for modeling complex nonlinear relationships and performing end-to-end deep representation learning (Kingma and Welling, 2013) for high-dimensional data. However, DNNs are often treated as black boxes due to their lack of interpretability and reproducibility. Building on Chen et al. (2018)'s work on individual-level feature selection for DNNs, Deep-gKnock constructs group Knockoffs features to perform group-feature selection for DNNs.
Figure 1 provides an overview of our Deep-gKnock procedure, which includes (1) generating group Knockoffs features; (2) incorporating the original features and the group Knockoffs features into a DNN architecture to compute the Knockoffs statistics; and (3) filtering out unimportant group-features using the Knockoffs statistics. Experimental results demonstrate that our method achieves superior power and accurate FDR control compared with state-of-the-art methods.

To summarize, we make the following contributions: (1) end-to-end group-wise feature selection and deep representations for the $p \gg n$ setting; (2) a flexible modeling framework in a DNN context with enhanced interpretability and reproducibility; (3) superior performance in terms of power and controlled group-wise false discovery rate for synthetic and real data analysis in both linear and nonlinear settings.
2 Background
2.1 Problem statement
In our problem, we have $n$ independent and identically distributed (i.i.d.) observations $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. We use $X = (X_1, \ldots, X_p)$ to denote the feature vector and $Y$ to denote the scalar response variable. We assume there exists group structure within the features: the $p$ features can be partitioned into $m$ groups with group sizes $p_1, \ldots, p_m$. The index set of the features in the $g$-th group is denoted $\mathcal{G}_g \subset \{1, \ldots, p\}$ with $|\mathcal{G}_g| = p_g$. It satisfies $\mathcal{G}_g \cap \mathcal{G}_h = \emptyset$ for $g \neq h$ and $\bigcup_{g=1}^{m} \mathcal{G}_g = \{1, \ldots, p\}$. Assume that there exists a subset $S \subset \{1, \ldots, m\}$ such that, conditional on the groups of features in $S$, the response is independent of the groups of features in the complement $S^c$. Denote $\hat{S}$ as the set of all selected groups of features. Our goal is to ensure a high true positive rate (TPR), defined as
$$\mathrm{TPR} = \mathbb{E}\left[\frac{|\hat{S} \cap S|}{|S|}\right],$$
while controlling the group-wise false discovery rate (gFDR), the expected proportion of irrelevant groups among all selected groups of features, defined as
$$\mathrm{gFDR} = \mathbb{E}\left[\frac{|\hat{S} \cap S^c|}{\max(|\hat{S}|, 1)}\right].$$
2.2 Model-X Knockoffs framework review
The Knockoffs features are constructed as negative controls to help identify the truly important features by comparing the feature importance between the original features and their Knockoffs counterparts. Model-X Knockoffs features are generated to perfectly mimic the arbitrary dependence structure among the original features but are conditionally independent of the response given the original features. However, the model-X Knockoffs procedure (Candes et al., 2018) can only construct Knockoffs variables for individual feature selection. Our Deep-gKnock procedure described in Section 3 extends the model-X Knockoffs procedure to generate group Knockoffs features, which allows for group structure among features.
For better understanding, we review the model-X Knockoffs method first. Model-X Knockoffs is designed for individual feature selection and does not consider the grouping structure among features. Consequently, the sets $S$ and $\hat{S}$ in this section are defined over indices of individual features, which differs from the group-wise definitions in Section 2.1. The model-X Knockoffs method assumes that there exists a subset $S \subset \{1, \ldots, p\}$ such that, conditional on the features in $S$, the response is independent of the features in the complement $S^c$. We denote $\hat{S}$ as the set of all selected individual features.
We start this section with the model-X Knockoffs feature definition, followed by the Knockoffs feature generation process and end with the filtering process for feature selection.
Definition 1 (Candes et al. (2018)).
Suppose the family of random features is $X = (X_1, \ldots, X_p)$. Model-X Knockoffs features $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ for $X$ are a new family of random features that satisfies two properties: (1) $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$ for any subset $S \subset \{1, \ldots, p\}$, where $\mathrm{swap}(S)$ means swapping $X_j$ and $\tilde{X}_j$ for each $j \in S$ and $\overset{d}{=}$ denotes equality in distribution; and (2) $\tilde{X} \perp Y \mid X$, i.e., $\tilde{X}$ is independent of the response $Y$ given the features $X$.
From this definition, we can see that the model-X Knockoffs features $\tilde{X}_j$'s mimic the dependency structure among the original features $X_j$'s and are independent of the response given the $X_j$'s. By comparing the original features $X$ with the Knockoffs features $\tilde{X}$, the FDR can be controlled at a target level $q$. When $X \sim N(0, \Sigma)$ with covariance matrix $\Sigma$, we can construct the model-X Knockoffs features characterized in Definition 1 as
$$\tilde{X} \mid X \sim N\!\left(X - \mathrm{diag}\{s\}\,\Sigma^{-1}X,\; 2\,\mathrm{diag}\{s\} - \mathrm{diag}\{s\}\,\Sigma^{-1}\,\mathrm{diag}\{s\}\right). \tag{1}$$
Here $\mathrm{diag}\{s\}$, with all components of $s$ being positive, is a diagonal matrix, subject to the requirement that the conditional covariance matrix in Equation (1) is positive definite. Following the above Knockoffs construction, the joint distribution of the original features and the model-X Knockoffs features is
$$(X, \tilde{X}) \sim N\!\left(0, \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{pmatrix}\right). \tag{2}$$
To ensure high power in distinguishing $X$ from $\tilde{X}$, it is desirable that the constructed Knockoffs features deviate from the original features while maintaining the same correlation structure as $\Sigma$. This indicates that larger components of $s$ are preferred, since $\mathrm{Cov}(X_j, \tilde{X}_j) = \Sigma_{jj} - s_j$. In a setting where the features are normalized, i.e., $\Sigma_{jj} = 1$ for all $j$, we would like $1 - s_j$ to be as close to zero as possible. One way to choose $s$ is the equicorrelated construction (Barber and Candès, 2016), which uses
$$s_j = \min\left(2\,\lambda_{\min}(\Sigma),\; 1\right) \quad \text{for all } j.$$
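As a concrete illustration, the following minimal sketch (not part of the original procedure; it assumes standardized features, a known covariance matrix `Sigma`, and a small numerical jitter of our choosing) generates equicorrelated Gaussian model-X Knockoffs via Equation (1):

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng=None):
    """Equicorrelated Gaussian model-X knockoffs (Eq. 1), assuming
    each column of X is standardized so that diag(Sigma) = 1."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Equicorrelated choice: s_j = min(2 * lambda_min(Sigma), 1).
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, min(2.0 * lam_min, 1.0))
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))      # Sigma^{-1} diag{s}
    mu = X - X @ Sigma_inv_S                              # conditional mean
    V = 2.0 * np.diag(s) - np.diag(s) @ Sigma_inv_S       # conditional covariance
    # Small jitter keeps the Cholesky factorization numerically stable.
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))
    return mu + rng.standard_normal((n, p)) @ L.T
```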
Then we define the Knockoffs statistic $W_j$ for each feature $X_j$, $j = 1, \ldots, p$, which is used in the filtering process to perform feature selection. A large positive value of $W_j$ provides evidence that $X_j$ is important. This statistic depends on $X$, $\tilde{X}$ and $y$, i.e., $W_j = w_j([X, \tilde{X}], y)$ for some function $w_j$. This function must satisfy the following flip-sign property:
$$w_j\big([X, \tilde{X}]_{\mathrm{swap}(S)}, y\big) = \begin{cases} \phantom{-}w_j\big([X, \tilde{X}], y\big), & j \notin S, \\ -w_j\big([X, \tilde{X}], y\big), & j \in S. \end{cases} \tag{3}$$
Candes et al. (2018) construct the Knockoffs statistic by performing the Lasso on the original features augmented with the Knockoffs,
$$\hat{b}(\lambda) = \operatorname*{arg\,min}_{b \in \mathbb{R}^{2p}} \; \frac{1}{2}\big\|y - [X, \tilde{X}]\,b\big\|_2^2 + \lambda \|b\|_1,$$
which provides Lasso coefficients $\hat{b}_1(\lambda), \ldots, \hat{b}_{2p}(\lambda)$. The statistic $W_j$ is set to be the Lasso coefficient difference given by
$$W_j = |\hat{b}_j(\lambda)| - |\hat{b}_{j+p}(\lambda)|.$$
After obtaining Knockoffs statistics satisfying (3), Theorem 2 from Candes et al. (2018) provides a feature selection procedure with controlled FDR: for a target level $q$, select $\hat{S} = \{j : W_j \ge \tau\}$, where $\tau$ is the smallest $t > 0$ such that $\big(1 + \#\{j : W_j \le -t\}\big) / \max\big(\#\{j : W_j \ge t\}, 1\big) \le q$.
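A minimal sketch of this filtering step (again an illustration rather than the authors' code; the function names and the use of the knockoff+ offset are our choices) is:

```python
import numpy as np

def knockoff_threshold(W, q=0.2, offset=1):
    """Data-dependent threshold of the knockoff+ filter applied to the
    statistics W_1,...,W_p; offset=1 corresponds to Theorem 2 of
    Candes et al. (2018)."""
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # nothing can be selected at level q

def knockoff_select(W, q=0.2):
    """Indices of features selected at target FDR level q."""
    t = knockoff_threshold(W, q)
    return np.where(W >= t)[0]
```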
3 Deep group-feature selection using Knockoffs
3.1 Constructing Group Knockoffs features
The original Knockoffs construction (Candes et al., 2018) does not take the group structure among features into account and requires stronger constraints. When there exists high correlation between features $X_j$ and $X_k$, the construction of Candes et al. (2018) requires the corresponding values of $s_j$ and $s_k$ to be extremely small in order to ensure that the covariance matrix in Equation (2) is positive semi-definite. However, smaller values of $s_j$ make it harder to detect the difference between $X_j$ and $\tilde{X}_j$, which leads to a decrease in the power of detecting the truly important features. In a group-sparse setting, we relax this requirement by proposing our Group Knockoffs features in Definition 3 to increase the power.
Definition 3 (Group Knockoffs features).
Suppose the family of random features $X = (X_1, \ldots, X_p)$ has group structure, where the features are partitioned into $m$ groups $\mathcal{G}_1, \ldots, \mathcal{G}_m$ with group sizes $p_1, \ldots, p_m$ and $\sum_{g=1}^{m} p_g = p$. Group Knockoffs features $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ for $X$ are a new family of random features that satisfies two properties: (1) $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$ for any subset $S \subset \{1, \ldots, m\}$, where $\mathrm{swap}(S)$ means swapping the groups $X_{\mathcal{G}_g}$ and $\tilde{X}_{\mathcal{G}_g}$ for each $g \in S$ and $\overset{d}{=}$ denotes equality in distribution; and (2) $\tilde{X} \perp Y \mid X$, i.e., $\tilde{X}$ is independent of the response $Y$ given the features $X$.
We see from this definition that the Group Knockoffs features $\tilde{X}_j$'s mimic the group-wise dependency structure among the original features $X_j$'s and are independent of the response given the $X_j$'s. When $X \sim N(0, \Sigma)$, the joint distribution obeying Definition 3 is
$$(X, \tilde{X}) \sim N(0, G), \quad G = \begin{pmatrix} \Sigma & \Sigma - D \\ \Sigma - D & \Sigma \end{pmatrix} \succeq 0, \tag{4}$$
where $D = \mathrm{diag}\{D_1, \ldots, D_m\}$ with $D_g \in \mathbb{R}^{p_g \times p_g}$ is a group-block-diagonal matrix. Here we use $A \succ 0$ to denote that a matrix $A$ is positive definite (and $A \succeq 0$ that it is positive semi-definite).
We construct the Group Knockoffs features by sampling the Knockoffs vector $\tilde{X}$ from the conditional distribution
$$\tilde{X} \mid X \sim N\!\left(X - D\,\Sigma^{-1}X,\; 2D - D\,\Sigma^{-1}D\right). \tag{5}$$
Following Dai and Barber (2016), a group-block-diagonal matrix $D$ satisfying $G \succeq 0$ can be constructed with
$$D = \gamma \cdot \mathrm{diag}\{\Sigma_{\mathcal{G}_1 \mathcal{G}_1}, \ldots, \Sigma_{\mathcal{G}_m \mathcal{G}_m}\}, \qquad \gamma = \min\!\left(1,\; 2\,\lambda_{\min}\!\big(D_\Sigma^{-1/2}\,\Sigma\,D_\Sigma^{-1/2}\big)\right),$$
where $D_\Sigma = \mathrm{diag}\{\Sigma_{\mathcal{G}_1 \mathcal{G}_1}, \ldots, \Sigma_{\mathcal{G}_m \mathcal{G}_m}\}$ and $\Sigma_{\mathcal{G}_g \mathcal{G}_g}$ is the within-group block of $\Sigma$.
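The following sketch (illustrative only, under the assumption of a known covariance matrix; the group-index representation and the jitter constant are ours) mirrors this construction:

```python
import numpy as np

def group_knockoffs(X, Sigma, groups, rng=None):
    """Sketch of Gaussian group Knockoffs (Eq. 5) with the
    group-equivariant choice of D from Dai and Barber (2016).
    `groups` is a list of index arrays, one per group."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # D_Sigma: block-diagonal matrix of the within-group blocks of Sigma.
    D_Sigma = np.zeros((p, p))
    for g in groups:
        D_Sigma[np.ix_(g, g)] = Sigma[np.ix_(g, g)]
    # Symmetric inverse square root of D_Sigma via eigendecomposition.
    w, U = np.linalg.eigh(D_Sigma)
    D_half_inv = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
    gamma = min(1.0, 2.0 * np.linalg.eigvalsh(D_half_inv @ Sigma @ D_half_inv).min())
    D = gamma * D_Sigma
    Sigma_inv_D = np.linalg.solve(Sigma, D)        # Sigma^{-1} D
    mu = X - X @ Sigma_inv_D                        # conditional mean
    V = 2.0 * D - D @ Sigma_inv_D                   # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))
    return mu + rng.standard_normal((n, p)) @ L.T
```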
3.2 Deep neural networks for Group Knockoffs features
Once the Group Knockoffs features are constructed, following a similar idea to DeepPINK (Lu et al., 2018), we feed them into a new DNN structure to obtain the gKnock statistics. The structure of the network is shown in Figure 2.
In the first layer, we feed $(X, \tilde{X})$ into a group-feature competing layer containing $m$ filters $F_1, \ldots, F_m$. The $g$-th filter connects the group-feature $X_{\mathcal{G}_g}$ and its Knockoffs counterpart $\tilde{X}_{\mathcal{G}_g}$ through filter weights $z_g$ and $\tilde{z}_g$. We use a linear activation function in this layer to encourage competition between each group-feature and its Knockoffs counterpart. Intuitively, if the group-feature $X_{\mathcal{G}_g}$ is important, we expect the magnitude of $z_g$ to be much larger than that of $\tilde{z}_g$; if the group-feature $X_{\mathcal{G}_g}$ is not important, we expect the magnitudes of $z_g$ and $\tilde{z}_g$ to be similar. We then feed the output of the group-feature competing layer into a fully connected multilayer perceptron (MLP) to learn a nonlinear mapping to the response $Y$. We use $\mathbf{w}$ to denote the weight vector connecting the group-feature competing layer to the MLP. The MLP has two hidden layers; ReLU activations and $L_1$-regularization are used, as shown in Figure 2. We use $W^{(1)}$ to denote the weight matrix connecting the input vector to the first hidden layer. Similarly, we use $W^{(2)}$ as the weight matrix connecting the two hidden layers and $W^{(3)}$ as the weight matrix connecting the second hidden layer to the output $\hat{y}$.
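For concreteness, a minimal PyTorch sketch of such an architecture is given below. The hidden width, the initialization, and the scalar-output-per-filter choice are our assumptions rather than the paper's exact specification, and the $L_1$ penalty would be added to the training loss.

```python
import torch
import torch.nn as nn

class DeepGKnockNet(nn.Module):
    """Sketch of the Deep-gKnock architecture described above
    (layer sizes and the pairwise-competition implementation are
    assumptions, not the authors' exact configuration)."""
    def __init__(self, group_sizes, hidden=64):
        super().__init__()
        m = len(group_sizes)
        # One pair of filter weights (z_g, z_tilde_g) per group; linear activation.
        self.z = nn.ParameterList(
            [nn.Parameter(torch.randn(pg) * 0.1) for pg in group_sizes])
        self.z_tilde = nn.ParameterList(
            [nn.Parameter(torch.randn(pg) * 0.1) for pg in group_sizes])
        self.w = nn.Parameter(torch.ones(m))   # competing layer -> MLP input
        self.mlp = nn.Sequential(
            nn.Linear(m, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_groups, xk_groups):
        # x_groups[g], xk_groups[g]: (batch, p_g) slices of X and its knockoff.
        filt = [xg @ zg + xkg @ zkg                      # linear competition
                for xg, xkg, zg, zkg in zip(x_groups, xk_groups,
                                            self.z, self.z_tilde)]
        h = torch.stack(filt, dim=1) * self.w            # (batch, m)
        return self.mlp(h).squeeze(-1)
```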
3.3 gKnock statistic
After the DNN is trained, we compute the gKnock statistics based on the learned weights to evaluate the importance of each group-feature. Firstly, we use the filter weights $z_g$ and $\tilde{z}_g$ to represent the relative importance between $X_{\mathcal{G}_g}$ and $\tilde{X}_{\mathcal{G}_g}$. Secondly, we assess the relative importance of the $g$-th group-feature among all group-features by
$$w_g = \big(\mathbf{w} \odot (W^{(1)} W^{(2)} W^{(3)})\big)_g,$$
where $\odot$ denotes the Schur (entrywise) matrix product. Thirdly, the importance measures for $X_{\mathcal{G}_g}$ and $\tilde{X}_{\mathcal{G}_g}$ are provided by
$$Z_g = \|z_g\|\, w_g \quad \text{and} \quad \tilde{Z}_g = \|\tilde{z}_g\|\, w_g.$$
Finally, we define the gKnock statistic as
$$W_g = Z_g^2 - \tilde{Z}_g^2, \quad g = 1, \ldots, m,$$
and the same filtering process as in Theorem 2 is applied to the $W_g$'s to select group-features.
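Continuing the sketch above, the gKnock statistics could be read off the trained weights roughly as follows (again illustrative; it assumes the `DeepGKnockNet` layout defined earlier and uses weight norms for the magnitudes of $z_g$ and $\tilde{z}_g$):

```python
import torch

def gknock_statistics(model):
    """Compute W_g = Z_g^2 - Z_tilde_g^2 from a trained DeepGKnockNet."""
    with torch.no_grad():
        # Path weights through the MLP: w entrywise-times (W1 W2 W3), one value per group.
        W1 = model.mlp[0].weight.T        # (m, hidden)
        W2 = model.mlp[2].weight.T        # (hidden, hidden)
        W3 = model.mlp[4].weight.T        # (hidden, 1)
        w_g = model.w * (W1 @ W2 @ W3).squeeze(-1)         # (m,)
        Z = torch.stack([z.norm() for z in model.z]) * w_g
        Z_tilde = torch.stack([z.norm() for z in model.z_tilde]) * w_g
        return (Z**2 - Z_tilde**2).numpy()
```

The resulting statistics would then be passed to a filter such as the `knockoff_select` sketch from Section 2.2 to obtain $\hat{S}$.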
4 Simulation studies
We evaluate the performance of our method both in the Gaussian linear regression model (6) and in the single-index model (7):
$$y_i = \mathbf{x}_i^{\top} \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \ldots, n, \tag{6}$$
$$y_i = g\big(\mathbf{x}_i^{\top} \boldsymbol{\beta}\big) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{7}$$
where $y_i$ is the $i$-th response, $\mathbf{x}_i$ is the feature vector of the $i$-th observation, $\boldsymbol{\beta}$ is the coefficient vector, $\varepsilon_i$ is the noise of the $i$-th observation, and $g$ is some unknown link function.
To generate the synthetic data, we fix the number of features $p$, the number of groups $m$, and the number of features per group. The true regression coefficient vector $\boldsymbol{\beta}$ is group-sparse, with a small number of groups carrying nonzero signals, and the nonzero coefficients are randomly chosen. We draw $\mathbf{x}_i$ independently from a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$ with unit diagonal entries, within-group correlation $\rho_w$ for features in the same group, and between-group correlation $\rho_b$ for features in different groups. The errors $\varepsilon_i$ are i.i.d. from the standard normal distribution, and the true link function $g$ in model (7) is held fixed across settings. In our default setting, the sample size and both correlation parameters are fixed at baseline values. To study the effects of sample size, between-group correlation and within-group correlation, we vary one setting in each experiment and keep the others at their default levels; a data-generation sketch follows the list below.
- Sample size: we vary the number of observations $n$ over 500, 750, 1000, 1250, and 1500.
- Between-group correlation: we fix the within-group correlation $\rho_w$ and set the between-group correlation $\rho_b$ to 0, 0.2, 0.4, 0.6, and 0.8.
- Within-group correlation: we vary the within-group correlation $\rho_w$ over 0, 0.2, 0.4, 0.6, and 0.8 and fix the between-group correlation $\rho_b$.
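The data-generation sketch referenced above is given below; it is illustrative only, and the active-group count, coefficient values, and link function passed in are placeholders rather than the paper's exact settings.

```python
import numpy as np

def simulate_group_data(n, m, group_size, rho_w, rho_b, k_signal,
                        link=None, rng=None):
    """Group-structured Gaussian design with a group-sparse coefficient
    vector; returns linear (link=None) or single-index responses."""
    rng = np.random.default_rng(rng)
    p = m * group_size
    # Covariance: 1 on the diagonal, rho_w within groups, rho_b between groups.
    Sigma = np.full((p, p), rho_b)
    for g in range(m):
        idx = slice(g * group_size, (g + 1) * group_size)
        Sigma[idx, idx] = rho_w
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Group-sparse coefficients: k_signal active groups with +/-1 entries.
    beta = np.zeros(p)
    active = rng.choice(m, size=k_signal, replace=False)
    for g in active:
        beta[g * group_size:(g + 1) * group_size] = rng.choice([-1.0, 1.0], group_size)
    eta = X @ beta
    y = (eta if link is None else link(eta)) + rng.standard_normal(n)
    return X, y, beta, active
```

Empirical gFDR and power are then the averages over replications of $|\hat{S} \cap S^c| / \max(|\hat{S}|, 1)$ and $|\hat{S} \cap S| / |S|$, respectively.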
We compare the performance of Deep-gKnock with group-SLOPE, available in the R package grpSLOPE (Gossmann et al., 2016). For each setting, we run 100 replications with a pre-specified target gFDR level. The empirical gFDR and power are reported in Tables 1 and 2.
In the linear model setting shown in Table 1, group-SLOPE fails to control the gFDR at the target level in each of the following three situations: (1) the sample size is small; (2) the between-group correlation is large; (3) the within-group correlation is large. In contrast, Deep-gKnock controls the gFDR precisely in all settings.
In the single-index model setting shown in Table 2, Deep-gKnock achieves higher power and consistently controls the gFDR in all settings, which demonstrates the advantage of using a DNN to model the nonlinear relationship between the features and the response.
Table 1: Empirical gFDR and power for the linear model (6). D-gK = Deep-gKnock, gS = group-SLOPE. The left block varies the sample size $n$, the middle block varies the between-group correlation $\rho_b$, and the right block varies the within-group correlation $\rho_w$, with the remaining settings at their default levels.

| $n$ | gFDR (D-gK) | Power (D-gK) | gFDR (gS) | Power (gS) | $\rho_b$ | gFDR (D-gK) | Power (D-gK) | gFDR (gS) | Power (gS) | $\rho_w$ | gFDR (D-gK) | Power (D-gK) | gFDR (gS) | Power (gS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.19 | 0.98 | 0.36 | 0.73 | 0.00 | 0.18 | 0.98 | 0.20 | 1.00 | 0.00 | 0.17 | 1.00 | 0.21 | 1.00 |
| 750 | 0.21 | 0.99 | 0.30 | 0.99 | 0.20 | 0.18 | 0.99 | 0.23 | 1.00 | 0.20 | 0.19 | 1.00 | 0.22 | 1.00 |
| 1000 | 0.20 | 0.99 | 0.21 | 1.00 | 0.40 | 0.20 | 0.99 | 0.26 | 1.00 | 0.40 | 0.14 | 1.00 | 0.24 | 1.00 |
| 1250 | 0.23 | 0.99 | 0.17 | 1.00 | 0.60 | 0.17 | 0.99 | 0.30 | 1.00 | 0.60 | 0.14 | 1.00 | 0.27 | 1.00 |
| 1500 | 0.21 | 0.99 | 0.15 | 1.00 | 0.80 | 0.18 | 0.99 | 0.40 | 1.00 | 0.80 | 0.11 | 0.95 | 0.30 | 1.00 |
Table 2: Empirical gFDR and power for the single-index model (7). D-gK = Deep-gKnock, gS = group-SLOPE. The left block varies the sample size $n$, the middle block varies the between-group correlation $\rho_b$, and the right block varies the within-group correlation $\rho_w$, with the remaining settings at their default levels.

| $n$ | gFDR (D-gK) | Power (D-gK) | gFDR (gS) | Power (gS) | $\rho_b$ | gFDR (D-gK) | Power (D-gK) | gFDR (gS) | Power (gS) | $\rho_w$ | gFDR (D-gK) | Power (D-gK) | gFDR (gS) | Power (gS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.22 | 0.71 | 0.08 | 0.03 | 0.00 | 0.14 | 0.53 | 0.12 | 0.17 | 0.00 | 0.20 | 0.78 | 0.12 | 0.18 |
| 750 | 0.18 | 0.72 | 0.14 | 0.15 | 0.20 | 0.19 | 0.74 | 0.30 | 0.28 | 0.20 | 0.25 | 0.79 | 0.31 | 0.31 |
| 1000 | 0.18 | 0.72 | 0.12 | 0.21 | 0.40 | 0.20 | 0.82 | 0.46 | 0.35 | 0.40 | 0.17 | 0.83 | 0.42 | 0.34 |
| 1250 | 0.18 | 0.73 | 0.12 | 0.32 | 0.60 | 0.21 | 0.88 | 0.52 | 0.40 | 0.60 | 0.17 | 0.88 | 0.48 | 0.35 |
| 1500 | 0.19 | 0.75 | 0.14 | 0.45 | 0.80 | 0.19 | 0.86 | 0.57 | 0.43 | 0.80 | 0.17 | 0.94 | 0.53 | 0.34 |
5 Real data analysis
In addition to the simulation studies presented in Section 4, we also demonstrate the performance of Deep-gKnock on two real data sets, with a pre-specified target gFDR level.
5.1 Application to prostate cancer data
The prostate cancer data contain clinical measurements for 97 male patients who were about to receive a radical prostatectomy. They were analyzed in Hastie et al. (2013) to study the relationship between the response, the log level of prostate-specific antigen (lpsa), and eight other features. The features are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
For the categorical variable svi with two levels, we coded it by one dummy variable and treated it as one group. For each continuous variable, we used five B-spline basis functions to represent its effect and treated those five basis functions as a group. This yields eight groups with a total of 36 features. We summarize the group-feature selection results in Table 3; the grouped B-spline encoding is sketched after the table. The features selected by Deep-gKnock are the same as those selected by the Lasso in Hastie et al. (2013).

Table 3: Group-features selected on the prostate cancer data.

| Method | Group-features selected |
|---|---|
| group-SLOPE | lcavol, lweight, svi, gleason |
| Deep-gKnock | lcavol, lweight |
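For reference, the grouped encoding used above can be sketched as follows (a minimal illustration using patsy's B-spline basis; the function name, column handling, and defaults are our assumptions):

```python
import numpy as np
from patsy import dmatrix

def grouped_spline_design(df, continuous_cols, binary_col="svi", df_spline=5):
    """Expand each continuous variable into df_spline B-spline basis columns
    that form one group; the binary variable forms its own group."""
    blocks, groups, start = [], [], 0
    for col in continuous_cols:
        B = np.asarray(dmatrix(f"bs(x, df={df_spline}) - 1", {"x": df[col]}))
        blocks.append(B)
        groups.append(np.arange(start, start + B.shape[1]))
        start += B.shape[1]
    blocks.append(df[[binary_col]].to_numpy(dtype=float))
    groups.append(np.array([start]))
    return np.hstack(blocks), groups
```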
5.2 Application to yeast cell cycle data
We apply Deep-gKnock to the task of identifying the important transcription factors (TFs) related to regulation of the cell cycle. TFs belong to a class of proteins called DNA-binding proteins and control the rate at which DNA is transcribed into mRNA. We utilize a yeast cell cycle data set from Spellman et al. (1998) and Lee et al. (2002). The response is the messenger ribonucleic acid (mRNA) level of each gene, measured at 28 minutes during a cell cycle. The features are the binding measurements of 106 TFs. Out of the 106 TFs, 21 are known and experimentally confirmed cell-cycle-related TFs (Wang et al., 2007).
It has been shown that groups of TFs function in a coordinated fashion to direct cell division, growth, and death (Latchman, 1997). Following Ma et al. (2007), we use the K-means method to cluster the 106 TFs and determine the optimal number of clusters using the Gap statistic (Tibshirani et al., 1999). The Gap statistic suggests that the 106 TFs can be clustered into 20 groups. To visualize the clustering results, we use Principal Component Analysis (PCA) to reduce the dimensionality to the first two principal components, which yields a scatter plot of data points colored by their cluster labels in Figure 3. One of the clusters contains four TFs, all of which are experimentally confirmed.
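A minimal sketch of this clustering and visualization step (illustrative; the input orientation, fixed cluster count, and plotting details are our assumptions, and the Gap statistic would be computed separately):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def cluster_and_plot_tfs(B, n_clusters=20, rng=0):
    """B: (TFs x genes) binding matrix; n_clusters would in practice be
    chosen via the Gap statistic rather than fixed here."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=rng).fit_predict(B)
    pcs = PCA(n_components=2).fit_transform(B)   # first two principal components
    plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="tab20", s=20)
    plt.xlabel("PC1"); plt.ylabel("PC2")
    plt.title("K-means clusters of the 106 TFs")
    plt.show()
    return labels
```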
Group-SLOPE identified 7 groups containing 41 TFs, including 12 confirmed TFs. Deep-gKnock identified 5 groups containing 26 TFs, including 11 confirmed TFs. To assess the selection performance, following Zhu and Su (2019), we also compute, from a hypergeometric distribution, the probability that an equally sized set of randomly chosen TFs would contain at least as many confirmed TFs; the results are reported in Table 4, and a sketch of this calculation follows the table. We include the results for the Lasso in Table 4 as a benchmark. Smaller probability values suggest better feature selection performance. The small probability for Deep-gKnock suggests that the large number of confirmed TFs selected is not due to chance. Deep-gKnock also outperforms group-SLOPE.

Table 4: Hypergeometric probabilities for the selected TF sets.

| Method | | | |
|---|---|---|---|
| Lasso | | | |
| group-SLOPE | | | |
| Deep-gKnock | | | |
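For reference, this hypergeometric calculation can be sketched with scipy (the function name and defaults are ours; the counts come from the selections reported above):

```python
from scipy.stats import hypergeom

def prob_at_least(k_confirmed, n_selected, n_total=106, n_true=21):
    """P(at least k_confirmed confirmed TFs among n_selected TFs drawn at
    random from n_total TFs, of which n_true are confirmed)."""
    return hypergeom.sf(k_confirmed - 1, n_total, n_true, n_selected)

# Illustrative usage with the selections reported above:
p_deep_gknock = prob_at_least(11, 26)   # Deep-gKnock: 11 of 26 confirmed
p_group_slope = prob_at_least(12, 41)   # group-SLOPE: 12 of 41 confirmed
```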
6 Conclusion
We have introduced Deep-gKnock, a novel group-feature selection method combining Knockoffs with DNNs. It provides end-to-end group-wise feature selection with controlled gFDR for high-dimensional data. With the flexibility of DNNs, it also provides deep representations with enhanced interpretability and reproducibility. Both synthetic and real data analyses demonstrate that Deep-gKnock achieves superior power and accurate gFDR control compared with state-of-the-art methods. Moreover, Deep-gKnock achieves scientifically meaningful group-feature selection results on real data sets.
References
- Barber and Candès (2016) Barber, R. F. and Candès, E. J. (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
- Barber et al. (2015) Barber, R. F., Candès, E. J., et al. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055–2085.
- Bogdan et al. (2015) Bogdan, M., Van Den Berg, E., Sabatti, C., Su, W., and Candès, E. J. (2015). SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3), 1103.
- Brzyski et al. (2018) Brzyski, D., Gossmann, A., Su, W., and Bogdan, M. (2018). Group SLOPE—adaptive selection of groups of predictors. Journal of the American Statistical Association, pages 1–15.
- Candes et al. (2018) Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3), 551–577.
- Chen et al. (2018) Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.
- Dai and Barber (2016) Dai, R. and Barber, R. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. In International Conference on Machine Learning, pages 1851–1859.
- Gossmann et al. (2016) Gossmann, A., Brzyski, D., Su, W., and Bogdan, M. (2016). grpSLOPE: Group Sorted L1 Penalized Estimation. R package version 0.2.1.
- Hastie et al. (2013) Hastie, T., Tibshirani, R., and Friedman, J. (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York.
- Huang et al. (2012) Huang, J., Breheny, P., and Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical science: a review journal of the Institute of Mathematical Statistics, 27(4).
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Latchman (1997) Latchman, D. S. (1997). Transcription factors: an overview. The international journal of biochemistry & cell biology, 29(12), 1305–1312.
- Lee et al. (2002) Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002). Transcriptional regulatory networks in saccharomyces cerevisiae. science, 298(5594), 799–804.
- Li et al. (2018) Li, Z., Xie, W., and Liu, T. (2018). Efficient feature selection and classification for microarray data. PloS one, 13(8), e0202167.
- Lu et al. (2018) Lu, Y., Fan, Y., Lv, J., and Noble, W. S. (2018). Deeppink: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8690–8700.
- Ma et al. (2007) Ma, S., Song, X., and Huang, J. (2007). Supervised group lasso with applications to microarray data analysis. BMC bioinformatics, 8(1), 60.
- Meier et al. (2008) Meier, L., Van De Geer, S., and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 53–71.
- Spellman et al. (1998) Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle–regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell, 9(12), 3273–3297.
- Su et al. (2016) Su, Z., Zhu, G., Chen, X., and Yang, Y. (2016). Sparse envelope model: efficient estimation and response variable selection in multivariate linear regression. Biometrika, 103(3), 579–593.
- Tang and Liu (2014) Tang, J. and Liu, H. (2014). Feature selection for social media data. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4), 19.
- Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
- Tibshirani et al. (1999) Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P., et al. (1999). Clustering methods for the analysis of dna microarray data. Dept. Statist., Stanford Univ., Stanford, CA, Tech. Rep.
- Wang et al. (2007) Wang, L., Chen, G., and Li, H. (2007). Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics, 23(12), 1486–1494.
- Yang and Zou (2015) Yang, Y. and Zou, H. (2015). A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing, 25(6), 1129–1141.
- Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
- Zhu and Su (2019) Zhu, G. and Su, Z. (2019). Envelope-based sparse partial least squares. The Annals of Statistics, (in press).