1 Introduction
Feature selection for high-dimensional data is of fundamental importance for applications across many scientific disciplines (Tang and Liu, 2014; Li et al., 2018). Grouping structure among features arises naturally in many statistical modeling problems; common examples range from multilevel categorical features in a regression model to genetic markers from the same gene in genetic association studies. Incorporating grouping-structure information into feature selection exploits scientifically meaningful prior knowledge, increases feature selection accuracy, and improves the interpretability of the selection results (Huang et al., 2012).
In this paper, we focus on group-feature selection as an approach for model interpretation and dimension reduction in both linear and nonlinear contexts. Our method achieves stable feature selection results in the high-dimensional setting where $p > n$, which is usually challenging for existing methods; here $p$ is the number of features and $n$ is the number of samples.
Group-feature selection has been studied from different perspectives. The group Lasso, a generalization of the Lasso (Tibshirani, 1996), has been the mainstream approach to group-wise feature selection (Yuan and Lin, 2006). To relax the linearity constraint, Meier et al. (2008) extended the group Lasso from linear regression to logistic regression. To speed up the computation, Yang and Zou (2015) further developed a more computationally tractable and efficient algorithm. However, researchers have found that the feature selection results of the Lasso and the group Lasso are sensitive to the choice of tuning parameters (Tibshirani, 1996; Su et al., 2016). In practice, the tuning parameter is often chosen by cross-validation (CV), but it has been reported that in high-dimensional settings CV typically selects a large number of false features (Bogdan et al., 2015). To ensure that the selected features are correct and replicable, several methods have been proposed to perform feature selection while controlling the false discovery rate (FDR), the expected fraction of false selections among all selections.
Among them, Sorted L-One Penalized Estimation (SLOPE) (Bogdan et al., 2015) and Knockoffs (Barber et al., 2015; Candes et al., 2018) are the state-of-the-art methods and have received the most attention. SLOPE was proposed to control the FDR in the classical multiple linear regression setting and is defined as the solution to the penalized objective function
$$\min_{b \in \mathbb{R}^p} \frac{1}{2}\|y - Xb\|_2^2 + \sum_{i=1}^{p} \lambda_i |b|_{(i)},$$
where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$ and $|b|_{(1)} \geq |b|_{(2)} \geq \cdots \geq |b|_{(p)}$ is the vector of sorted absolute values of the coordinates of $b$. Brzyski et al. (2018) extended SLOPE as group SLOPE to perform group-feature selection, but it is limited to linear regression. The notion of Knockoffs was first introduced in Barber et al. (2015) and improved as model-X Knockoffs by Candes et al. (2018). Knockoff variables serve as negative controls and help identify the truly important features by comparing feature importance between the original features and their knockoff counterparts. Originally, the method was constrained to homoscedastic linear models with $n \geq p$ (Barber et al., 2015) and was later extended to a group-sparse linear regression setting by Dai and Barber (2016).
Within the state-of-the-art directions of SLOPE and Knockoffs, group SLOPE (Brzyski et al., 2018) and group Knockoffs (Dai and Barber, 2016) are the only solutions for group-feature selection. However, they suffer from the following limitations: (1) group Knockoffs can only handle linear regression and is restricted to the $n \geq p$ setting; (2) group SLOPE can only deal with linear regression and cannot achieve robust feature selection results in the high-dimensional setting where $p > n$; (3) group SLOPE does not provide end-to-end group-wise feature selection and requires groups of features to be orthogonal to each other.
To resolve all these limitations, we propose DeepgKnock (Deep group-feature selection using Knockoffs), which combines model-X Knockoffs and deep neural networks (DNNs) to perform model-free group-feature selection in both linear and nonlinear contexts while controlling the group-wise FDR. DNNs are a natural choice for modeling complex nonlinear relationships and performing end-to-end deep representation learning (Kingma and Welling, 2013) for high-dimensional data. However, DNNs are often treated as black boxes due to their lack of interpretability and reproducibility. Building on Chen et al. (2018)'s work on individual-level feature selection for DNNs, DeepgKnock constructs group knockoff features to perform group-feature selection for DNNs.
Figure 1 provides an overview of our DeepgKnock procedure, which (1) generates group knockoff features; (2) feeds the original features and the group knockoff features into a DNN architecture to compute the knockoff statistics; and (3) filters out the unimportant group-features using the knockoff statistics. Experimental results demonstrate that our method achieves superior power and accurate FDR control compared with state-of-the-art methods.
To summarize, we make the following contributions: (1) end-to-end group-wise feature selection and deep representations for the $p > n$ setting; (2) a flexible modeling framework in a DNN context with enhanced interpretability and reproducibility; (3) superior performance in terms of power and controlled group-wise false discovery rate on synthetic and real data analyses in both linear and nonlinear settings.
2 Background
2.1 Problem statement
In our problem, we have $n$ independent and identically distributed (i.i.d.) observations $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. We use $X = (X_1, \ldots, X_p)$ to denote the feature vector and $Y$ to denote the scalar response variable. We assume there exists a group structure within the features: the $p$ features can be partitioned into $g$ groups with group sizes $m_1, \ldots, m_g$. The index set of the features in the $j$th group is denoted $G_j \subseteq \{1, \ldots, p\}$, where $|G_j| = m_j$. It satisfies $G_j \cap G_k = \emptyset$ for $j \neq k$, and $\bigcup_{j=1}^{g} G_j = \{1, \ldots, p\}$. Assume that there exists a subset $S_0 \subseteq \{1, \ldots, g\}$ such that conditional on the groups of features in $S_0$, the response $Y$ is independent of the groups of features in the complement $S_0^c$. Denote by $\hat{S}$ the set of all selected groups of features. Our goal is to ensure a high true positive rate (TPR), defined as
$$\mathrm{TPR} = \mathbb{E}\left[\frac{|\hat{S} \cap S_0|}{|S_0|}\right],$$
while controlling the group-wise false discovery rate (gFDR), the expected proportion of irrelevant groups among all selected groups, defined as
$$\mathrm{gFDR} = \mathbb{E}\left[\frac{|\hat{S} \cap S_0^c|}{|\hat{S}| \vee 1}\right].$$
2.2 Model-X Knockoffs framework review
Knockoff features are constructed as negative controls to help identify the truly important features by comparing feature importance between the original features and their knockoff counterparts. Model-X knockoff features are generated to perfectly mimic the arbitrary dependence structure among the original features while being conditionally independent of the response given the original features. However, the model-X Knockoffs procedure (Candes et al., 2018) can only construct knockoff variables for individual feature selection. Our DeepgKnock procedure described in Section 3 extends the model-X Knockoffs procedure to generate group knockoff features, which allows for group structure among the features.
For better understanding, we first review the model-X Knockoffs method. Model-X Knockoffs is designed for individual feature selection and does not consider grouping structure among features, so here $S_0$ and $\hat{S}$ are defined over indices of individual features, which differs from the definitions in Section 2.1. The model-X Knockoffs method assumes that there exists a subset $S_0 \subseteq \{1, \ldots, p\}$ such that conditional on the features in $S_0$, the response is independent of the features in the complement $S_0^c$. We denote by $\hat{S}$ the set of all selected individual features.
We start this section with the model-X knockoff feature definition, followed by the knockoff feature generation process, and end with the filtering process for feature selection.
Definition 1 (Candes et al. (2018)).
Suppose $X = (X_1, \ldots, X_p)$ is a family of random features. Model-X knockoff features for $X$ are a new family of random features $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ that satisfies two properties: (1) $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$ for any subset $S \subseteq \{1, \ldots, p\}$, where $\mathrm{swap}(S)$ means swapping $X_j$ and $\tilde{X}_j$ for each $j \in S$ and $\overset{d}{=}$ denotes equality in distribution; and (2) $\tilde{X} \perp\!\!\!\perp Y \mid X$, i.e., $\tilde{X}$ is independent of the response $Y$ given the features $X$.
From this definition, we can see that the model-X knockoff features $\tilde{X}_j$ mimic the dependency structure among the original features $X_j$ and are independent of the response given the $X_j$'s. By comparing the original features $X$ with the knockoff features $\tilde{X}$, the FDR can be controlled at a target level $q$. When $X \sim N(0, \Sigma)$ with covariance matrix $\Sigma$, we can construct the model-X knockoff features characterized in Definition 1 as
$$\tilde{X} \mid X \sim N\!\left(X - \mathrm{diag}\{s\}\,\Sigma^{-1}X,\;\; 2\,\mathrm{diag}\{s\} - \mathrm{diag}\{s\}\,\Sigma^{-1}\,\mathrm{diag}\{s\}\right). \tag{1}$$
Here $\mathrm{diag}\{s\}$, with all components of $s$ positive, is a diagonal matrix chosen so that the conditional covariance matrix in Equation (1) is positive definite. Following the above knockoff construction, the joint distribution of the original features and the model-X knockoff features is
$$(X, \tilde{X}) \sim N\!\left(0, \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{pmatrix}\right). \tag{2}$$
To ensure high power in distinguishing $X$ from $\tilde{X}$, it is desirable that the constructed knockoff features deviate from the original features while maintaining the same correlation structure as $\Sigma$. This indicates that larger components of $s$ are preferred, since $\mathrm{Cov}(X_j, \tilde{X}_j) = \Sigma_{jj} - s_j$. In a setting where the features are normalized, i.e., $\Sigma_{jj} = 1$ for all $j$, we would like $1 - s_j$ to be as close to zero as possible. One way to choose $s$ is the equicorrelated construction (Barber and Candès, 2016), which uses
$$s_j = \min\{2\lambda_{\min}(\Sigma),\, 1\} \quad \text{for all } j.$$
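As a concrete illustration, the equicorrelated Gaussian construction above can be sketched in a few lines of numpy. This is a sketch under the assumptions of the section ($X \sim N(0, \Sigma)$, normalized features); the function name and the small jitter term added before the Cholesky factorization are our own choices, not part of the original procedure.

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng=None):
    """Sample equicorrelated model-X knockoffs for rows of X ~ N(0, Sigma).

    Implements Equation (1) with diag{s} = s*I and
    s = min(2 * lambda_min(Sigma), 1), the equicorrelated choice.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    s = min(2.0 * np.linalg.eigvalsh(Sigma).min(), 1.0)
    D = s * np.eye(p)
    Sigma_inv = np.linalg.inv(Sigma)
    mean = X - X @ Sigma_inv @ D            # conditional mean of Eq. (1), n x p
    cond_cov = 2.0 * D - D @ Sigma_inv @ D  # conditional covariance of Eq. (1)
    # tiny jitter guards against numerical indefiniteness at the boundary s = 2*lambda_min
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))
    return mean + rng.standard_normal((n, p)) @ L.T
```

The jitter is needed because the equicorrelated choice puts the conditional covariance exactly on the positive-semidefinite boundary when $s = 2\lambda_{\min}(\Sigma)$.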
We then define a knockoff statistic $W_j$ for each feature $X_j$, $j = 1, \ldots, p$, which is used in the filtering process to perform feature selection. A large positive value of $W_j$ provides evidence that $X_j$ is important. The statistic depends on $X$, $\tilde{X}$, and $y$, i.e., $W_j = w_j([X, \tilde{X}], y)$ for some function $w_j$, which must satisfy the following flip-sign property:
$$w_j\!\left([X, \tilde{X}]_{\mathrm{swap}(S)},\, y\right) = \begin{cases} \phantom{-}w_j\!\left([X, \tilde{X}],\, y\right), & j \notin S, \\ -w_j\!\left([X, \tilde{X}],\, y\right), & j \in S. \end{cases} \tag{3}$$
Candes et al. (2018) construct the knockoff statistic by performing the Lasso on the original features augmented with the knockoffs,
$$\min_{b \in \mathbb{R}^{2p}} \frac{1}{2}\left\|y - [X, \tilde{X}]\, b\right\|_2^2 + \lambda \|b\|_1,$$
which provides Lasso coefficients $\hat{b}(\lambda) = (\hat{b}_1, \ldots, \hat{b}_{2p})$. The statistic is set to be the Lasso coefficient difference
$$W_j = |\hat{b}_j(\lambda)| - |\hat{b}_{j+p}(\lambda)|.$$
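The Lasso coefficient difference statistic can be sketched with a small cyclic coordinate-descent solver. In practice one would use a tuned solver such as glmnet or scikit-learn; the fixed penalty `lam` and the iteration count below are illustrative assumptions, and the function names are ours.

```python
import numpy as np

def lasso_cd(A, y, lam, n_iter=200):
    """Plain cyclic coordinate descent for (1/2)||y - A b||^2 + lam * ||b||_1."""
    n, m = A.shape
    b = np.zeros(m)
    col_sq = (A ** 2).sum(axis=0)       # squared column norms
    r = y - A @ b                       # running residual
    for _ in range(n_iter):
        for j in range(m):
            r += A[:, j] * b[j]         # remove coordinate j's contribution
            rho = A[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= A[:, j] * b[j]         # restore residual with updated b_j
    return b

def lcd_statistic(X, X_tilde, y, lam):
    """Knockoff statistic W_j = |b_j| - |b_{j+p}| from Lasso on [X, X_tilde]."""
    p = X.shape[1]
    b = lasso_cd(np.hstack([X, X_tilde]), y, lam)
    return np.abs(b[:p]) - np.abs(b[p:])
```

By symmetry of the augmented design, an unimportant feature and its knockoff are exchangeable, so its $W_j$ is equally likely to be positive or negative, which is what the filtering step exploits.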
After obtaining knockoff statistics satisfying (3), Theorem 2 of Candes et al. (2018) provides a feature selection procedure with controlled FDR.
3 Deep groupfeature selection using Knockoffs
3.1 Constructing Group Knockoffs features
The original Knockoffs construction (Candes et al., 2018) does not take the group structure among features into account and requires stronger constraints. When there is high correlation between features $X_j$ and $X_k$, Candes et al. (2018)'s method requires the values of $s_j$ to be extremely small in order to ensure that the covariance matrix in Equation (2) is positive semidefinite. However, smaller values of $s_j$ make it harder to detect the difference between $X_j$ and $\tilde{X}_j$, which decreases the power of detecting the true positive features. In a group-sparse setting, we relax this requirement by proposing our group knockoff features in Definition 3 to increase the power.
Definition 3 (Group Knockoffs features).
Suppose the family of random features $X = (X_1, \ldots, X_p)$ has a group structure, where the features are partitioned into $g$ groups $G_1, \ldots, G_g$ with group sizes $m_1, \ldots, m_g$ and $\sum_{j=1}^{g} m_j = p$. Group knockoff features for $X$ are a new family of random features $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ that satisfies two properties: (1) $(X, \tilde{X})_{\mathrm{swap}(U)} \overset{d}{=} (X, \tilde{X})$ for any subset $U \subseteq \{1, \ldots, g\}$, where $\mathrm{swap}(U)$ means swapping the groups $X_{G_j}$ and $\tilde{X}_{G_j}$ for each $j \in U$ and $\overset{d}{=}$ denotes equality in distribution; and (2) $\tilde{X} \perp\!\!\!\perp Y \mid X$, i.e., $\tilde{X}$ is independent of the response given the features $X$.
We see from this definition that the group knockoff features $\tilde{X}_{G_j}$ mimic the group-wise dependency structure among the original features $X_{G_j}$ and are independent of the response given the $X_{G_j}$'s. When $X \sim N(0, \Sigma)$, the joint distribution obeying Definition 3 is
$$(X, \tilde{X}) \sim N\!\left(0, \begin{pmatrix} \Sigma & \Sigma - S \\ \Sigma - S & \Sigma \end{pmatrix}\right), \tag{4}$$
where $S \succ 0$ is a group-block-diagonal matrix. Here we use $S \succ 0$ to denote that $S$ is positive definite.
We construct the group knockoff features by sampling the knockoff vector $\tilde{X}$ from the conditional distribution
$$\tilde{X} \mid X \sim N\!\left(X - S\,\Sigma^{-1}X,\;\; 2S - S\,\Sigma^{-1}S\right). \tag{5}$$
Following Dai and Barber (2016), a group-block-diagonal matrix $S = \mathrm{diag}\{S_1, \ldots, S_g\}$ satisfying $2\Sigma - S \succeq 0$ can be constructed with
$$S_j = \gamma\, \Sigma_{G_j, G_j}, \qquad \gamma = \min\left\{1,\; 2\lambda_{\min}\!\left(\Sigma_{\mathrm{grp}}^{-1/2}\, \Sigma\, \Sigma_{\mathrm{grp}}^{-1/2}\right)\right\},$$
where $\Sigma_{\mathrm{grp}} = \mathrm{diag}\{\Sigma_{G_1, G_1}, \ldots, \Sigma_{G_g, G_g}\}$.
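Putting Equation (5) and the group-block-diagonal construction of Dai and Barber (2016) together, group-knockoff sampling for Gaussian features can be sketched as follows; the helper name and the jitter term are our additions, and the generalized eigenvalue computation is one convenient way to obtain $\gamma$.

```python
import numpy as np
from scipy.linalg import eigh

def group_knockoffs(X, Sigma, groups, rng=None):
    """Sample group knockoffs for rows of X ~ N(0, Sigma), with
    S = gamma * blockdiag(Sigma_{G_j,G_j}) and gamma = min(1, 2*lambda_min),
    where lambda_min is the smallest generalized eigenvalue of
    (Sigma, blockdiag)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    S_grp = np.zeros((p, p))
    for G in groups:                        # groups: list of index lists
        idx = np.ix_(G, G)
        S_grp[idx] = Sigma[idx]             # copy within-group blocks
    lam_min = eigh(Sigma, S_grp, eigvals_only=True)[0]
    S = min(1.0, 2.0 * lam_min) * S_grp
    Sigma_inv = np.linalg.inv(Sigma)
    mean = X - X @ Sigma_inv @ S            # conditional mean of Eq. (5)
    cond_cov = 2.0 * S - S @ Sigma_inv @ S  # conditional covariance of Eq. (5)
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # jitter for boundary case
    return mean + rng.standard_normal((n, p)) @ L.T
```

Because entire blocks $\Sigma_{G_j, G_j}$ are reproduced in $S$, only the between-group, not within-group, correlations constrain $\gamma$, which is exactly why the group construction gains power over Equation (1) when within-group correlation is high.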
3.2 Deep neural networks for Group Knockoffs features
Once the group knockoff features are constructed, following a similar idea to DeepPINK (Lu et al., 2018), we feed them into a new DNN structure to obtain the gKnock statistics. The structure of the network is shown in Figure 2.
In the first layer, we feed $(X, \tilde{X})$ into a group-feature competing layer containing $g$ filters, $F_1, \ldots, F_g$. The $j$th filter connects the group-feature $X_{G_j}$ and its knockoff counterpart $\tilde{X}_{G_j}$ through weight vectors $\mathbf{z}_j$ and $\tilde{\mathbf{z}}_j$. We use a linear activation function in this layer to encourage competition between each group-feature and its knockoff counterpart. Intuitively, if the group-feature $X_{G_j}$ is important, we expect the magnitude of $\mathbf{z}_j$ to be much larger than that of $\tilde{\mathbf{z}}_j$, and if the group-feature is not important, we expect the magnitudes of $\mathbf{z}_j$ and $\tilde{\mathbf{z}}_j$ to be similar. We then feed the output of the group-feature competing layer into a fully connected multilayer perceptron (MLP) to learn a nonlinear mapping to the response $y$. We use $\mathbf{w}$ to denote the weight vector connecting the group-feature competing layer to the MLP. The MLP has two hidden layers; ReLU activation and regularization are used, as shown in Figure 2. We use $W_1$ to denote the weight matrix connecting the input vector to the first hidden layer. Similarly, we use $W_2$ as the weight matrix connecting the two hidden layers and $W_3$ as the weight matrix connecting the second hidden layer to the output $y$.
3.3 gKnock statistic
After the DNN is trained, we compute the gKnock statistics based on the network weights to evaluate the importance of each group-feature. First, we use $\|\mathbf{z}_j\|_2$ and $\|\tilde{\mathbf{z}}_j\|_2$ to represent the relative importance between $X_{G_j}$ and $\tilde{X}_{G_j}$. Second, we assess the relative importance of the $j$th group-feature among all group-features by $\mathbf{v} = \mathbf{w} \odot (W_1 W_2 W_3)$, where $\odot$ denotes the Schur (entrywise) matrix product. Third, the importance measures for $X_{G_j}$ and $\tilde{X}_{G_j}$ are given by
$$\Gamma_j = \|\mathbf{z}_j\|_2\, v_j, \qquad \tilde{\Gamma}_j = \|\tilde{\mathbf{z}}_j\|_2\, v_j.$$
Finally, we define the gKnock statistic as
$$W_j = \Gamma_j^2 - \tilde{\Gamma}_j^2,$$
and the same filtering process as in Theorem 2 is applied to the $W_j$'s to select group-features.
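Given trained weights, the three steps above reduce to a few array operations. This sketch assumes the DeepPINK-style weight layout described in Section 3.2; the argument names and shapes are our assumptions, not an exact transcript of the trained network.

```python
import numpy as np

def gknock_statistics(z_norms, z_tilde_norms, w, W1, W2, W3):
    """gKnock statistics W_j = Gamma_j^2 - Gamma_tilde_j^2 from trained weights.

    z_norms, z_tilde_norms : (g,) filter-weight magnitudes ||z_j||, ||z_tilde_j||
    w  : (g,) elementwise weights from the competing layer into the MLP
    W1 : (g, h1), W2 : (h1, h2), W3 : (h2, 1) MLP weight matrices
    """
    v = w * (W1 @ W2 @ W3).ravel()        # relative importance per group, (g,)
    gamma = z_norms * v                   # importance of original group-features
    gamma_tilde = z_tilde_norms * v       # importance of knockoff counterparts
    return gamma ** 2 - gamma_tilde ** 2
```

The resulting vector is then passed to the same knockoffs filtering step used for individual features.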
4 Simulation studies
We evaluate the performance of our method under both the Gaussian linear regression model (6) and the single-index model (7):
$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \tag{6}$$
$$y_i = f(\mathbf{x}_i^\top \boldsymbol{\beta}) + \varepsilon_i, \tag{7}$$
where $y_i$ is the $i$th response, $\mathbf{x}_i$ is the feature vector of the $i$th observation, $\boldsymbol{\beta}$ is the coefficient vector, $\varepsilon_i$ is the noise of the $i$th observation, and $f$ is some unknown link function.
To generate the synthetic data, we fix the number of features $p$, the number of groups $g$, and the number of features per group $m$. The true regression coefficient vector $\boldsymbol{\beta}$ is group-sparse with a small number of groups carrying nonzero signals, and the nonzero coefficients are chosen at random. We draw $\mathbf{x}_i$ independently from a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$, with diagonal entries $\Sigma_{jj} = 1$, within-group correlations $\Sigma_{jk} = \rho_1$ for $j, k$ in the same group, and between-group correlations $\Sigma_{jk} = \rho_2$ for $j, k$ in different groups. The errors $\varepsilon_i$ are i.i.d. from a standard normal distribution, and a fixed nonlinear link function $f$ is used for the single-index model. To study the effects of sample size, between-group correlation, and within-group correlation, we vary one setting in each experiment and keep the others at their default levels.

Sample size: we vary the number of observations $n$ over 500, 750, 1000, 1250, and 1500.

Between-group correlation: we fix the within-group correlation $\rho_1$ at its default level and vary the between-group correlation $\rho_2 \in \{0, 0.2, 0.4, 0.6, 0.8\}$.

Within-group correlation: we vary the within-group correlation $\rho_1 \in \{0, 0.2, 0.4, 0.6, 0.8\}$ and fix the between-group correlation $\rho_2$ at its default level.
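The simulation design above can be sketched as follows. Since the paper's exact default values and coefficient magnitudes are not restated here, the signal values of $\pm 1$ below are illustrative assumptions, as is the function name.

```python
import numpy as np

def make_synthetic(n, g, m, rho1, rho2, n_signal, rng=None):
    """Group-sparse synthetic data as in Section 4: p = g*m features in g
    groups, within-group correlation rho1, between-group correlation rho2,
    linear model (6). Nonzero coefficients +/-1 are illustrative."""
    rng = np.random.default_rng(rng)
    p = g * m
    Sigma = np.full((p, p), rho2)
    for j in range(g):                               # within-group blocks
        Sigma[j*m:(j+1)*m, j*m:(j+1)*m] = rho1
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    signal_groups = rng.choice(g, size=n_signal, replace=False)
    for j in signal_groups:
        beta[j*m:(j+1)*m] = rng.choice([-1.0, 1.0], size=m)
    y = X @ beta + rng.standard_normal(n)            # model (6)
    return X, y, beta, signal_groups
```

Replacing the last line with `y = f(X @ beta) + noise` for a chosen link $f$ gives the single-index variant (7).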
We compare the performance of DeepgKnock with group SLOPE, available in the R package grpSLOPE (Gossmann et al., 2016). For each setting, we run 100 replications and set the target gFDR level to $q = 0.2$. The empirical gFDR and power are reported in Tables 1 and 2.
In the linear model setting shown in Table 1, group SLOPE fails to control the gFDR at the target level in each of the following three situations: (1) the sample size $n$ is small; (2) the between-group correlation is large; (3) the within-group correlation is large. In contrast, DeepgKnock controls the gFDR precisely in all settings.
In the single-index model setting shown in Table 2, DeepgKnock achieves higher power and consistently controls the gFDR in all settings, which demonstrates the advantage of using a DNN to model the nonlinear relationship between the features and the response.
Table 1: Empirical gFDR and power under the linear model (6).

Varying sample size:
  n      DeepgKnock gFDR  Power   groupSLOPE gFDR  Power
  500        0.19         0.98        0.36         0.73
  750        0.21         0.99        0.30         0.99
  1000       0.20         0.99        0.21         1.00
  1250       0.23         0.99        0.17         1.00
  1500       0.21         0.99        0.15         1.00

Varying between-group correlation:
  rho2   DeepgKnock gFDR  Power   groupSLOPE gFDR  Power
  0.00       0.18         0.98        0.20         1.00
  0.20       0.18         0.99        0.23         1.00
  0.40       0.20         0.99        0.26         1.00
  0.60       0.17         0.99        0.30         1.00
  0.80       0.18         0.99        0.40         1.00

Varying within-group correlation:
  rho1   DeepgKnock gFDR  Power   groupSLOPE gFDR  Power
  0.00       0.17         1.00        0.21         1.00
  0.20       0.19         1.00        0.22         1.00
  0.40       0.14         1.00        0.24         1.00
  0.60       0.14         1.00        0.27         1.00
  0.80       0.11         0.95        0.30         1.00
Table 2: Empirical gFDR and power under the single-index model (7).

Varying sample size:
  n      DeepgKnock gFDR  Power   groupSLOPE gFDR  Power
  500        0.22         0.71        0.08         0.03
  750        0.18         0.72        0.14         0.15
  1000       0.18         0.72        0.12         0.21
  1250       0.18         0.73        0.12         0.32
  1500       0.19         0.75        0.14         0.45

Varying between-group correlation:
  rho2   DeepgKnock gFDR  Power   groupSLOPE gFDR  Power
  0.00       0.14         0.53        0.12         0.17
  0.20       0.19         0.74        0.30         0.28
  0.40       0.20         0.82        0.46         0.35
  0.60       0.21         0.88        0.52         0.40
  0.80       0.19         0.86        0.57         0.43

Varying within-group correlation:
  rho1   DeepgKnock gFDR  Power   groupSLOPE gFDR  Power
  0.00       0.20         0.78        0.12         0.18
  0.20       0.25         0.79        0.31         0.31
  0.40       0.17         0.83        0.42         0.34
  0.60       0.17         0.88        0.48         0.35
  0.80       0.17         0.94        0.53         0.34
5 Real data analysis
In addition to the simulation studies presented in Section 4, we also demonstrate the performance of DeepgKnock on two real data sets. The target gFDR level is set to $q = 0.2$, as in the simulations.
5.1 Application to prostate cancer data
The prostate cancer data contain clinical measurements for 97 male patients who were about to receive a radical prostatectomy. They were analyzed in Hastie et al. (2013) to study the relationship between the response, the log prostate-specific antigen level (lpsa), and eight other features: log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
For the categorical variable svi with two levels, we coded it with one dummy variable and treated it as one group. For each continuous variable, we used five B-spline basis functions to represent its effect and treated those five basis functions as a group. This gives eight groups with a total of 36 features. We summarize the group-feature selection results in Table 3. The features selected by DeepgKnock are the same as those selected by the Lasso in Hastie et al. (2013).

Table 3: Group-features selected for the prostate cancer data.

  Method       Group-features selected
  groupSLOPE   lcavol, lweight, svi, gleason
  DeepgKnock   lcavol, lweight
5.2 Application to yeast cell cycle data
We apply DeepgKnock to the task of identifying the important transcription factors (TFs) related to regulation of the cell cycle. TFs belong to a class of DNA-binding proteins and control the rate at which DNA is transcribed into mRNA. We utilize a yeast cell cycle data set from Spellman et al. (1998) and Lee et al. (2002). The response is the messenger ribonucleic acid (mRNA) level of each gene, measured at 28 minutes during a cell cycle. The features are the binding-information measurements of $p = 106$ TFs. Out of the 106 TFs, 21 are known, experimentally confirmed cell-cycle-related TFs (Wang et al., 2007).
Groups of TFs have been shown to function in a coordinated fashion to direct cell division, growth, and death (Latchman, 1997). Following Ma et al. (2007), we use the K-means method to cluster the 106 TFs and determine the optimal number of clusters using the gap statistic (Tibshirani et al., 1999), which suggests that the 106 TFs can be clustered into 20 groups. To visualize the clustering results, we use principal component analysis (PCA) to reduce the dimensionality to the first two principal components, which yields a scatter plot of data points colored by their cluster labels in Figure 3. One of the clusters contains four TFs, all of which are experimentally verified. Group SLOPE identified 7 groups containing 41 TFs, including 12 confirmed TFs. DeepgKnock identified 5 groups containing 26 TFs, including 11 confirmed TFs. To assess the selection performance, following Zhu and Su (2019), we also compute the probability that a random draw of the same number of TFs contains at least as many confirmed TFs, using a hypergeometric distribution, in Table 4. We include the results for the Lasso in Table 4 as a benchmark. Smaller probability values suggest better feature selection performance. The small probability for DeepgKnock suggests that the large number of confirmed TFs selected is not due to chance. DeepgKnock also outperforms group SLOPE.
Method

Lasso  
groupSLOPE  
DeepgKnock 
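The hypergeometric benchmark above can be computed directly: with 21 confirmed TFs among 106 in total, the probability that a random draw of $d$ TFs contains at least $k$ confirmed ones is a hypergeometric tail probability. A sketch (the function name is ours; the counts 26 and 11 are DeepgKnock's selection reported above):

```python
from scipy.stats import hypergeom

def prob_at_least(k, d, n_total=106, n_confirmed=21):
    """P(at least k confirmed TFs in a random draw of d TFs) under a
    hypergeometric distribution with n_confirmed successes in n_total."""
    return hypergeom.sf(k - 1, n_total, n_confirmed, d)

# e.g., DeepgKnock selected 26 TFs of which 11 are confirmed:
# prob_at_least(11, 26) gives the chance of doing at least as well at random.
```

A draw of 26 TFs would contain about 5 confirmed TFs on average (26 x 21/106), so observing 11 is far in the upper tail, consistent with the paper's conclusion that the selections are not due to chance.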
6 Conclusion
We have introduced DeepgKnock, a novel group-feature selection method combining Knockoffs with DNNs. It provides end-to-end group-wise feature selection with controlled gFDR for high-dimensional data. With the flexibility of DNNs, it also provides deep representations with enhanced interpretability and reproducibility. Both synthetic and real data analyses demonstrate that DeepgKnock achieves superior power and accurate gFDR control compared with state-of-the-art methods. Moreover, DeepgKnock produces scientifically meaningful group-feature selection results on real data sets.
References
 Barber and Candès (2016) Barber, R. F. and Candès, E. J. (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
 Barber et al. (2015) Barber, R. F., Candès, E. J., et al. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055–2085.
 Bogdan et al. (2015) Bogdan, M., Van Den Berg, E., Sabatti, C., Su, W., and Candès, E. J. (2015). SLOPE - adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3), 1103.
 Brzyski et al. (2018) Brzyski, D., Gossmann, A., Su, W., and Bogdan, M. (2018). Group SLOPE - adaptive selection of groups of predictors. Journal of the American Statistical Association, pages 1-15.
 Candes et al. (2018) Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3), 551-577.
 Chen et al. (2018) Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An informationtheoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.

 Dai and Barber (2016) Dai, R. and Barber, R. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. In International Conference on Machine Learning, pages 1851-1859.
 Gossmann et al. (2016) Gossmann, A., Brzyski, D., Su, W., and Bogdan, M. (2016). grpSLOPE: Group Sorted L1 Penalized Estimation. R package version 0.2.1.
 Hastie et al. (2013) Hastie, T., Tibshirani, R., and Friedman, J. (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York.
 Huang et al. (2012) Huang, J., Breheny, P., and Ma, S. (2012). A selective review of group selection in highdimensional models. Statistical science: a review journal of the Institute of Mathematical Statistics, 27(4).
 Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Autoencoding variational bayes. arXiv preprint arXiv:1312.6114.
 Latchman (1997) Latchman, D. S. (1997). Transcription factors: an overview. The international journal of biochemistry & cell biology, 29(12), 1305–1312.
 Lee et al. (2002) Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298(5594), 799-804.
 Li et al. (2018) Li, Z., Xie, W., and Liu, T. (2018). Efficient feature selection and classification for microarray data. PloS one, 13(8), e0202167.
 Lu et al. (2018) Lu, Y., Fan, Y., Lv, J., and Noble, W. S. (2018). Deeppink: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8690–8700.
 Ma et al. (2007) Ma, S., Song, X., and Huang, J. (2007). Supervised group lasso with applications to microarray data analysis. BMC bioinformatics, 8(1), 60.
 Meier et al. (2008) Meier, L., Van De Geer, S., and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 53–71.
 Spellman et al. (1998) Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle–regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell, 9(12), 3273–3297.
 Su et al. (2016) Su, Z., Zhu, G., Chen, X., and Yang, Y. (2016). Sparse envelope model: efficient estimation and response variable selection in multivariate linear regression. Biometrika, 103(3), 579–593.
 Tang and Liu (2014) Tang, J. and Liu, H. (2014). Feature selection for social media data. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4), 19.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
 Tibshirani et al. (1999) Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P., et al. (1999). Clustering methods for the analysis of DNA microarray data. Dept. Statist., Stanford Univ., Stanford, CA, Tech. Rep.

 Wang et al. (2007) Wang, L., Chen, G., and Li, H. (2007). Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics, 23(12), 1486-1494.
 Yang and Zou (2015) Yang, Y. and Zou, H. (2015). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, 25(6), 1129-1141.
 Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
 Zhu and Su (2019) Zhu, G. and Su, Z. (2019). Envelopebased sparse partial least squares. The Annals of Statistics, (in press).