Deep-gKnock: nonlinear group-feature selection with deep neural network

05/24/2019
by   Guangyu Zhu, et al.

Feature selection is central to contemporary high-dimensional data analysis. Grouping structure among features arises naturally in various scientific problems. Many methods have been proposed to incorporate the grouping structure information into feature selection. However, these methods are normally restricted to a linear regression setting. To relax the linear constraint, we combine deep neural networks (DNNs) with the recent Knockoffs technique, which has been successful in an individual feature selection context. We propose Deep-gKnock (Deep group-feature selection using Knockoffs) as a methodology for model interpretation and dimension reduction. Deep-gKnock performs model-free group-feature selection by controlling the group-wise false discovery rate (gFDR). Our method improves the interpretability and reproducibility of DNNs. Experimental results on both synthetic and real data demonstrate that our method achieves superior power and accurate gFDR control compared with state-of-the-art methods.


1 Introduction

Feature selection for high-dimensional data is of fundamental importance for different applications across various scientific disciplines (Tang and Liu, 2014; Li et al., 2018). Grouping structure among features arises naturally in many statistical modeling problems. Common examples range from multilevel categorical features in a regression model to genetic markers from the same gene in genetic association studies. Incorporating the grouping structure information into the feature selection can take advantage of the scientifically meaningful prior knowledge, increase the feature selection accuracy and improve the interpretability of the feature selection results (Huang et al., 2012).

In this paper, we focus on group-feature selection as an approach for model interpretation and dimension reduction in both linear and nonlinear contexts. Our method can achieve stable feature selection results in a high-dimensional setting when $p > n$, which is usually a challenging problem for existing methods, where $p$ is the number of features and $n$ is the number of samples.

Group-feature selection has been studied from different perspectives. The group-Lasso, a generalization of the Lasso (Tibshirani, 1996), has been proposed as a mainstream approach to conduct group-wise feature selection (Yuan and Lin, 2006). To relax the linear constraint, Meier et al. (2008) extended the group-Lasso from linear regression to logistic regression. To speed up the computation for the group-Lasso, Yang and Zou (2015) further developed a more computationally tractable and efficient algorithm.

However, researchers have found that the feature selection results of Lasso and group-Lasso are sensitive to the choice of tuning parameters (Tibshirani, 1996; Su et al., 2016). In practice, the tuning parameter is often chosen by cross-validation (CV). But it has been reported that in high-dimensional settings the widely adopted CV typically tends to select a large number of false features (Bogdan et al., 2015). In order to ensure the selected features are correct and replicable, several methods have been proposed to perform feature selection while controlling the false discovery rate (FDR)—the expected fraction of false selections among all selections.

Among them, Sorted L-One Penalized Estimation (SLOPE) (Bogdan et al., 2015) and Knockoffs (Barber et al., 2015; Candes et al., 2018) are the state-of-the-art methods and have received the most attention. SLOPE was proposed to control the FDR in the classical multiple linear regression setting. SLOPE is defined to be the solution to the penalized objective function

$$\hat{\beta} = \arg\min_{b \in \mathbb{R}^p} \; \frac{1}{2}\|y - Xb\|_2^2 + \sum_{j=1}^{p} \lambda_j |b|_{(j)},$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, and $|b|_{(1)} \ge |b|_{(2)} \ge \cdots \ge |b|_{(p)}$ is the vector of sorted absolute values of the coordinates of $b$. Brzyski et al. (2018) extended SLOPE to group-SLOPE to perform group-feature selection, but it is limited to linear regression.

The notion of Knockoffs was first introduced in Barber et al. (2015) and improved as model-X Knockoffs by Candes et al. (2018). The Knockoffs features serve as negative controls and help identify the truly important features by comparing the feature importance between the original features and their Knockoffs counterparts. Originally, the method was constrained to homoscedastic linear models with $n \ge p$ (Barber et al., 2015) and was later extended to a group-sparse linear regression setting by Dai and Barber (2016).

In the state-of-the-art directions of SLOPE and Knockoffs, group-SLOPE (Brzyski et al., 2018) and group-Knockoffs (Dai and Barber, 2016) are the only solutions for group-feature selection. However, they suffer from the following limitations. (1) group-Knockoffs can only handle linear regression and is restricted to the $n \ge p$ setting. (2) group-SLOPE can only deal with linear regression and cannot achieve robust feature selection results in a high-dimensional setting when $p > n$. (3) group-SLOPE does not provide end-to-end group-wise feature selection and requires groups of features to be orthogonal to each other.

To resolve all these limitations, we propose Deep-gKnock (Deep group-feature selection using Knockoffs), which combines model-X Knockoffs and deep neural networks (DNNs) to perform model-free group-feature selection in both linear and nonlinear contexts while controlling the group-wise FDR. DNNs are a natural choice for modeling complex nonlinear relationships and performing end-to-end deep representation learning (Kingma and Welling, 2013) for high-dimensional data. However, DNNs are often treated as black boxes due to their lack of interpretability and reproducibility. Building on Chen et al. (2018)'s work on individual-level feature selection for DNNs, Deep-gKnock constructs group Knockoffs features to perform group-feature selection for DNNs.

Figure 1 provides an overview of our Deep-gKnock procedure, which includes (1) generating group Knockoffs features; (2) incorporating the original features and the group Knockoffs features into a DNN architecture to compute the Knockoffs statistics; and (3) filtering out the unimportant group-features using the Knockoffs statistics. Experimental results demonstrate that our method achieves superior power and accurate FDR control compared with state-of-the-art methods.

Figure 1: A graphical illustration of three steps of Deep-gKnock. This figure is best viewed in color.

To summarize, we make the following contributions: (1) end-to-end group-wise feature selection and deep representations for the $p > n$ setting; (2) a flexible modeling framework in a DNN context with enhanced interpretability and reproducibility; (3) superior performance in terms of power and controlled group-wise false discovery rate for synthetic and real data analysis in both linear and nonlinear settings.

2 Background

2.1 Problem statement

In our problem, we have $n$ independent and identically distributed (i.i.d.) observations $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, where $\mathbf{x}_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. We use $X = (X_1, \ldots, X_p)$ to denote the feature vector and $Y$ to denote the scalar response variable. We assume there exists group structure within the features, which can be partitioned into $g$ groups with group sizes $p_1, \ldots, p_g$. The index set of the features in the $m$th group is denoted as $I_m \subset \{1, \ldots, p\}$, where $|I_m| = p_m$. It satisfies $I_m \cap I_{m'} = \emptyset$ for $m \neq m'$, and $\bigcup_{m=1}^{g} I_m = \{1, \ldots, p\}$. Assume that there exists a subset $S_0 \subset \{1, \ldots, g\}$ such that conditional on the groups of features in $S_0$, the response is independent of the groups of features in the complement $S_0^c$. Denote $\hat{S}$ as the set of all the selected groups of features. Our goal is to ensure a high true positive rate (TPR), defined as

$$\mathrm{TPR} = \mathbb{E}\left[\frac{|\hat{S} \cap S_0|}{|S_0|}\right],$$

while controlling the group-wise false discovery rate (gFDR), which is the expected proportion of irrelevant groups among all groups of features selected and is defined as

$$\mathrm{gFDR} = \mathbb{E}\left[\frac{|\hat{S} \cap S_0^c|}{\max(|\hat{S}|, 1)}\right].$$
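For concreteness, the following is a minimal sketch of how the empirical gFDR and TPR of a single replication can be computed from a set of selected groups; the function name is illustrative and not from the paper.

```python
import numpy as np

def gfdr_and_tpr(selected_groups, true_groups):
    """Empirical group-wise FDR and TPR for one replication.

    selected_groups : iterable of group indices returned by the procedure.
    true_groups     : iterable of group indices with nonzero signal (S_0).
    """
    selected, true = set(selected_groups), set(true_groups)
    gfdr = len(selected - true) / max(len(selected), 1)
    tpr = len(selected & true) / max(len(true), 1)
    return gfdr, tpr

# Example: groups {1, 4, 7} truly relevant, procedure selects {1, 4, 9}.
print(gfdr_and_tpr({1, 4, 9}, {1, 4, 7}))  # (0.333..., 0.666...)
```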

2.2 Model-X Knockoffs framework review

The Knockoffs features are constructed as negative controls to help identify the truly important features by comparing the feature importance between the original features and their Knockoffs counterparts. Model-X Knockoffs features are generated to perfectly mimic the arbitrary dependence structure among the original features but are conditionally independent of the response given the original features. However, the model-X Knockoffs procedure (Candes et al., 2018) is only able to construct Knockoffs variables for individual feature selection. Our Deep-gKnock procedure, described in Section 3, extends the model-X Knockoffs procedure to generate group Knockoffs features, which allows group structure among features.

For better understanding, we review the model-X Knockoffs method first. Model-X Knockoffs is designed for individual feature selection and does not consider the grouping structure among features. So here $S_0$ and $\hat{S}$ are defined in terms of indices of individual features, which differs from the group-level definitions in Section 2.1. The model-X Knockoffs method assumes that there exists a subset $S_0 \subset \{1, \ldots, p\}$ such that conditional on the features in $S_0$, the response is independent of the features in the complement $S_0^c$. We denote $\hat{S}$ as the set of all the selected individual features.

We start this section with the model-X Knockoffs feature definition, followed by the Knockoffs feature generation process and end with the filtering process for feature selection.

Definition 1 (Candes et al. (2018)).

Suppose the family of random features is $X = (X_1, \ldots, X_p)$. Model-X Knockoffs features for $X$ are a new family of random features $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ that satisfy two properties: (1) $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$ for any subset $S \subset \{1, \ldots, p\}$, where $\mathrm{swap}(S)$ means swapping $X_j$ and $\tilde{X}_j$ for each $j \in S$ and $\overset{d}{=}$ denotes equality in distribution, and (2) $\tilde{X} \perp Y \mid X$, i.e., $\tilde{X}$ is independent of the response $Y$ given the features $X$.

From this definition, we can see that the model-X Knockoffs features $\tilde{X}_j$'s mimic the dependency structure among the original features $X_j$'s and are independent of the response given the $X_j$'s. By comparing the original features $X$ with the Knockoffs features $\tilde{X}$, the FDR can be controlled at a target level $q$. When $X \sim N(0, \Sigma)$ with covariance matrix $\Sigma$, we can construct the model-X Knockoffs features characterized in Definition 1 as

$$\tilde{X} \mid X \sim N\big(X - \mathrm{diag}\{s\}\,\Sigma^{-1}X,\; 2\,\mathrm{diag}\{s\} - \mathrm{diag}\{s\}\,\Sigma^{-1}\mathrm{diag}\{s\}\big). \qquad (1)$$

Here $\mathrm{diag}\{s\}$, with all components of $s$ being positive, is a diagonal matrix, with the requirement that the conditional covariance matrix in Equation (1) is positive definite. Following the above Knockoffs construction, the joint distribution of the original features and the model-X Knockoffs features is

$$(X, \tilde{X}) \sim N\left(0,\; \begin{pmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{pmatrix}\right). \qquad (2)$$

To ensure high power in distinguishing $X_j$ and $\tilde{X}_j$, it is desired that the constructed Knockoffs features deviate from the original features while maintaining the same correlation structure as $\Sigma$. This indicates that larger components of $s$ are preferred, since $\mathrm{Cov}(X_j, \tilde{X}_j) = \Sigma_{jj} - s_j$. In a setting where the features are normalized, i.e. $\Sigma_{jj} = 1$ for all $j$, we would like to have $1 - s_j$ as close to zero as possible. One way to choose $s$ is the equicorrelated construction (Barber and Candès, 2016), which uses

$$s_j = \min\{2\lambda_{\min}(\Sigma),\; 1\} \quad \text{for all } j.$$
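A minimal numpy sketch of the Gaussian construction in Equation (1) with the equicorrelated choice of $s$ is given below. It assumes a known covariance matrix $\Sigma$ and mean-zero features; the function name and the small jitter used for numerical stability are illustrative choices.

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, eps=1e-10):
    """Sample model-X Knockoffs for X ~ N(0, Sigma) via Equation (1),
    using the equicorrelated choice s_j = min(2 * lambda_min(Sigma), 1)."""
    n, p = X.shape
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, np.clip(2.0 * lam_min, 0.0, 1.0)) * (1 - eps)
    S = np.diag(s)
    Sigma_inv = np.linalg.inv(Sigma)
    mean = X - X @ Sigma_inv @ S           # conditional mean in (1), row-wise
    cov = 2.0 * S - S @ Sigma_inv @ S      # conditional covariance in (1)
    L = np.linalg.cholesky(cov + eps * np.eye(p))
    return mean + np.random.standard_normal((n, p)) @ L.T
```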

Then we define the Knockoffs statistic $W_j$ for each feature $X_j$, $j = 1, \ldots, p$, which is used in the filtering process to perform feature selection. A large positive value of $W_j$ provides evidence that $X_j$ is important. This statistic depends on $X$, $\tilde{X}$ and $y$, i.e. $W_j = w_j([X, \tilde{X}], y)$ for some function $w_j$. This function must satisfy the following flip-sign property:

$$w_j\big([X, \tilde{X}]_{\mathrm{swap}(S)},\, y\big) = \begin{cases} w_j([X, \tilde{X}], y), & j \notin S, \\ -\,w_j([X, \tilde{X}], y), & j \in S. \end{cases} \qquad (3)$$

Candes et al. (2018) construct the Knockoffs statistic by performing the Lasso on the original features augmented with the Knockoffs,

$$\min_{b \in \mathbb{R}^{2p}} \; \frac{1}{2}\big\|y - [X, \tilde{X}]\,b\big\|_2^2 + \lambda \|b\|_1,$$

which provides Lasso coefficients $\hat{b}_1, \ldots, \hat{b}_{2p}$. The statistic is set to be the Lasso coefficient difference, given by

$$W_j = |\hat{b}_j| - |\hat{b}_{j+p}|.$$
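A short sketch of the Lasso coefficient difference statistic using scikit-learn follows; choosing the regularization level by cross-validation is an illustrative choice, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_coefficient_difference(X, X_tilde, y):
    """W_j = |b_hat_j| - |b_hat_{j+p}| from the Lasso on [X, X_tilde]."""
    p = X.shape[1]
    augmented = np.hstack([X, X_tilde])
    lasso = LassoCV(cv=5).fit(augmented, y)   # lambda picked by cross-validation
    b = np.abs(lasso.coef_)
    return b[:p] - b[p:]
```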

After obtaining the Knockoffs statistics satisfying (3), Theorem 2 from Candes et al. (2018) provides a feature selection procedure with controlled FDR.

Theorem 2 (Candes et al. (2018)).

Let $q \in (0, 1)$ be the target FDR level. Given statistics $W_1, \ldots, W_p$ satisfying (3), let

$$\tau = \min\left\{ t > 0 : \frac{1 + \#\{j : W_j \le -t\}}{\max\{\#\{j : W_j \ge t\},\, 1\}} \le q \right\}.$$

Then the procedure selecting the features $\hat{S} = \{j : W_j \ge \tau\}$ controls the FDR at level $q$.
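The data-dependent threshold in Theorem 2 is straightforward to implement; a small sketch is given below (the default level q=0.2 is only a placeholder argument).

```python
import numpy as np

def knockoff_threshold(W, q=0.2):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q."""
    W = np.asarray(W)
    for t in np.sort(np.abs(W[W != 0])):      # it suffices to check the |W_j| values
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf                             # no feasible threshold: select nothing

def knockoff_select(W, q=0.2):
    """Indices selected by the filter of Theorem 2."""
    return np.flatnonzero(np.asarray(W) >= knockoff_threshold(W, q))
```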

3 Deep group-feature selection using Knockoffs

3.1 Constructing Group Knockoffs features

The original Knockoffs construction (Candes et al., 2018) does not take the group structure among different features into account and requires stronger constraints. When there exist high correlations between features $X_j$ and $X_k$, Candes et al. (2018)'s method requires the values of $s_j$ to be extremely small in order to ensure that the covariance matrix in Equation (2) is positive semi-definite. However, smaller values of $s_j$ will fail to detect the difference between $X_j$ and $\tilde{X}_j$, which will lead to a decrease in the power of detecting the true positive features. In a group-sparse setting, we relax this requirement by proposing our Group Knockoffs features in Definition 3 to increase the power.

Definition 3 (Group Knockoffs features).

Suppose the family of random features $X = (X_1, \ldots, X_p)$ has group structure, where the features are partitioned into $g$ groups $I_1, \ldots, I_g$ with group sizes $p_1, \ldots, p_g$, and $\sum_{m=1}^{g} p_m = p$. Group Knockoffs features for $X$ are a new family of random features $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ that satisfy two properties: (1) $(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})$ for any subset of groups $S \subset \{1, \ldots, g\}$, where $\mathrm{swap}(S)$ means swapping $X_{I_m}$ and $\tilde{X}_{I_m}$ for each $m \in S$ and $\overset{d}{=}$ denotes equality in distribution, and (2) $\tilde{X} \perp Y \mid X$, i.e., $\tilde{X}$ is independent of the response given the features $X$.

We see from this definition that the Group Knockoffs features $\tilde{X}_j$'s mimic the group-wise dependency structure among the original features $X_j$'s and are independent of the response given the $X_j$'s. When $X \sim N(0, \Sigma)$, the joint distribution obeying Definition 3 is

$$(X, \tilde{X}) \sim N\left(0,\; \begin{pmatrix} \Sigma & \Sigma - S \\ \Sigma - S & \Sigma \end{pmatrix}\right), \qquad (4)$$

where $S = \mathrm{diag}\{S_1, \ldots, S_g\}$, with $S_m \in \mathbb{R}^{p_m \times p_m}$, is a group-block-diagonal matrix satisfying $2\Sigma - S \succ 0$. Here we use $A \succ 0$ to denote that $A$ is positive definite.

We construct the Group Knockoffs features by sampling the Knockoffs vector $\tilde{X}$ from the conditional distribution

$$\tilde{X} \mid X \sim N\big(X - S\,\Sigma^{-1}X,\; 2S - S\,\Sigma^{-1}S\big). \qquad (5)$$

Following Dai and Barber (2016), the group-block-diagonal matrix $S$ satisfying $2\Sigma - S \succeq 0$ can be constructed with

$$S_m = \gamma\, \Sigma_{I_m, I_m}, \quad m = 1, \ldots, g, \qquad \gamma = \min\left\{1,\; 2\lambda_{\min}\big(D\Sigma D\big)\right\},$$

where $D = \mathrm{diag}\{\Sigma_{I_1, I_1}, \ldots, \Sigma_{I_g, I_g}\}^{-1/2}$.
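A numpy sketch of this group equicorrelated construction and of the conditional sampling step in Equation (5) follows. It assumes $X \sim N(0, \Sigma)$ with known $\Sigma$ and contiguous group index blocks; the scaling $\gamma$ below is the largest value (capped at 1) keeping $2\Sigma - S$ positive semi-definite, which is my reading of the Dai and Barber (2016) construction.

```python
import numpy as np
from scipy.linalg import block_diag

def _inv_sqrt(B):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(B)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def group_block_S(Sigma, groups, eps=1e-10):
    """Group-block-diagonal S with blocks gamma * Sigma[I_m, I_m]
    (groups are assumed to be contiguous index blocks, in order)."""
    blocks = [Sigma[np.ix_(I, I)] for I in groups]
    D = block_diag(*[_inv_sqrt(B) for B in blocks])
    # Largest gamma <= 1 keeping 2*Sigma - S positive semi-definite.
    gamma = min(1.0, 2.0 * np.linalg.eigvalsh(D @ Sigma @ D).min()) * (1 - eps)
    return block_diag(*[gamma * B for B in blocks])

def group_knockoffs(X, Sigma, groups):
    """Sample group Knockoffs from the conditional distribution in Equation (5)."""
    n, p = X.shape
    S = group_block_S(Sigma, groups)
    Sigma_inv = np.linalg.inv(Sigma)
    mean = X - X @ Sigma_inv @ S           # row-wise x - S Sigma^{-1} x
    cov = 2.0 * S - S @ Sigma_inv @ S
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(p))
    return mean + np.random.standard_normal((n, p)) @ L.T
```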

3.2 Deep neural networks for Group Knockoffs features

Once the Group Knockoffs features are constructed, following a similar idea as in DeepPINK (Lu et al., 2018), we feed them into a new DNN structure to obtain the gKnock statistics. The structure of the network is shown in Figure 2.

In the first layer, we feed $(X, \tilde{X})$ into a Group-feature Competing Layer containing $g$ filters, one per group. The $m$th filter connects the group-feature $X_{I_m}$ and its Knockoffs counterpart $\tilde{X}_{I_m}$ through weight vectors $\mathbf{z}_m$ and $\tilde{\mathbf{z}}_m$, respectively. We use a linear activation function in this layer to encourage the competition between each group-feature and its Knockoffs counterpart. Intuitively, if the group-feature $X_{I_m}$ is important, we expect the magnitude of $\mathbf{z}_m$ to be much larger than that of $\tilde{\mathbf{z}}_m$, and if the group-feature $X_{I_m}$ is not important, we expect the magnitudes of $\mathbf{z}_m$ and $\tilde{\mathbf{z}}_m$ to be similar.

We then feed the output of the Group-feature Competing Layer into a fully connected multilayer perceptron (MLP) to learn a nonlinear mapping to the response $y$. We use $\mathbf{w}^{(0)}$ to denote the weight vector connecting the Group-feature Competing Layer to the MLP. The MLP has two hidden layers; ReLU activations and weight regularization are used, as shown in Figure 2. We use $\mathbf{W}^{(1)}$ to denote the weight matrix connecting the input vector to the first hidden layer. Similarly, we use $\mathbf{W}^{(2)}$ as the weight matrix connecting the two hidden layers and $\mathbf{W}^{(3)}$ as the weight matrix connecting the second hidden layer to the output $\hat{y}$.

Figure 2: A graphical demonstration of the DNN structure for Deep-gKnock. This figure is best viewed in color.
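The following PyTorch sketch illustrates one way to realize the architecture described above. The hidden width, initialization scale, and the decision to fold the regularization into the optimizer (e.g. weight decay with Adam and a mean squared error loss) are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class DeepGKnockNet(nn.Module):
    """Sketch of the Deep-gKnock network in Figure 2. `groups` is a list of
    integer index arrays (one per group) into the columns of X."""

    def __init__(self, groups, hidden=64):
        super().__init__()
        self.groups = [torch.as_tensor(I, dtype=torch.long) for I in groups]
        g = len(groups)
        # Group-feature competing layer: one filter per group, with a weight
        # vector for the original group-feature and one for its knockoff copy.
        self.z = nn.ParameterList([nn.Parameter(0.01 * torch.randn(len(I))) for I in groups])
        self.z_tilde = nn.ParameterList([nn.Parameter(0.01 * torch.randn(len(I))) for I in groups])
        # Weight vector connecting the competing layer to the MLP.
        self.w0 = nn.Parameter(torch.ones(g))
        # Fully connected MLP with two hidden layers and ReLU activations.
        self.mlp = nn.Sequential(
            nn.Linear(g, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, X, X_knock):
        # Linear activation in the competing layer: one scalar output per group.
        filters = [X[:, I] @ self.z[m] + X_knock[:, I] @ self.z_tilde[m]
                   for m, I in enumerate(self.groups)]
        F = torch.stack(filters, dim=1)    # shape (n, g)
        return self.mlp(F * self.w0).squeeze(-1)
```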

3.3 gKnock statistic

After the DNN is trained, we compute the gKnock statistics based on the fitted weights to evaluate the importance of each group-feature. Firstly, we use $\mathbf{z}_m$ and $\tilde{\mathbf{z}}_m$ to represent the relative importance between $X_{I_m}$ and $\tilde{X}_{I_m}$. Secondly, we assess the relative importance of the $m$th group-feature among all group-features by the $m$th entry $w_m$ of $\mathbf{w} = \mathbf{w}^{(0)} \circ \big(\mathbf{W}^{(1)} \mathbf{W}^{(2)} \mathbf{W}^{(3)}\big)$, where $\circ$ denotes the Schur (entrywise) matrix product. Thirdly, the importance measures for $X_{I_m}$ and $\tilde{X}_{I_m}$ are provided by

$$Z_m = \|\mathbf{z}_m\|_2\, w_m \quad \text{and} \quad \tilde{Z}_m = \|\tilde{\mathbf{z}}_m\|_2\, w_m.$$

Finally, we define the gKnock statistic as

$$W_m = Z_m^2 - \tilde{Z}_m^2,$$

and the same filtering process as in Theorem 2 is applied to the $W_m$'s to select group-features.
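Below is one plausible instantiation of this weight-based statistic, reusing the sketch network from Section 3.2. The exact functional form (in particular, the choice of the $L_2$ norm to aggregate each group's filter weights) is an assumption following the DeepPINK analogy and may differ from the paper.

```python
import torch

def gknock_statistics(net):
    """Weight-based gKnock statistics W_m for a trained DeepGKnockNet (sketch)."""
    with torch.no_grad():
        # Path weight from each filter output to the prediction: w = w0 * (W1 W2 W3).
        linears = [m for m in net.mlp if isinstance(m, torch.nn.Linear)]
        W1, W2, W3 = (m.weight for m in linears)   # shapes (h, g), (h, h), (1, h)
        w = net.w0 * (W3 @ W2 @ W1).squeeze(0)     # one entry per group
        # Aggregate each group's filter weights (L2 norm is an assumed choice).
        z_norm = torch.stack([z.norm() for z in net.z])
        zt_norm = torch.stack([z.norm() for z in net.z_tilde])
        Z, Z_tilde = z_norm * w, zt_norm * w
        return (Z ** 2 - Z_tilde ** 2).numpy()
```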

4 Simulation studies

We evaluate the performance of our method in both the Gaussian linear regression model (6) and the single-index model (7):

$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad (6)$$

$$y_i = f(\mathbf{x}_i^\top \boldsymbol{\beta}) + \epsilon_i, \qquad (7)$$

where $y_i$ is the $i$th response, $\mathbf{x}_i \in \mathbb{R}^p$ is the feature vector of the $i$th observation, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the coefficient vector, $\epsilon_i$ is the noise of the $i$th observation, and $f$ is some unknown link function.

To generate the synthetic data, we fix the number of features $p$ and the number of groups $g$, with an equal number of features per group. The true regression coefficient vector $\boldsymbol{\beta}$ is group sparse, with a small number of groups of nonzero signals, and the nonzero coefficients are randomly chosen. We draw the $\mathbf{x}_i$ independently from a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$, with unit diagonal entries, within-group correlation $\rho_w$ for features in the same group, and between-group correlation $\rho_b$ for features in different groups. The errors $\epsilon_i$ are i.i.d. from the standard normal distribution. The true link function $f$ is fixed across replications.

In our default setting, we fix the sample size, the between-group correlation, and the within-group correlation at baseline values. To study the effects of sample size, between-group correlation, and within-group correlation, we vary one setting at a time and keep the others at their default levels in each experiment (a data-generation sketch follows the list below).

  • Sample size: we vary the number of observations $n$ over 500, 750, 1000, 1250, and 1500.

  • Between-group correlation: we fix the within-group correlation $\rho_w$ and vary the between-group correlation $\rho_b$ over 0, 0.2, 0.4, 0.6, and 0.8.

  • Within-group correlation: we vary the within-group correlation $\rho_w$ over 0, 0.2, 0.4, 0.6, and 0.8 and fix the between-group correlation $\rho_b$.
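A minimal sketch of the block-correlated Gaussian design described above is given here; the specific dimensions and correlation values in the example call are placeholders, not the paper's settings.

```python
import numpy as np

def make_group_design(n, n_groups, group_size, rho_within, rho_between, seed=0):
    """Draw X ~ N(0, Sigma) with unit variances, correlation rho_within inside
    a group and rho_between across groups; also return the group index sets."""
    rng = np.random.default_rng(seed)
    p = n_groups * group_size
    Sigma = np.full((p, p), rho_between)
    groups = [np.arange(m * group_size, (m + 1) * group_size) for m in range(n_groups)]
    for I in groups:
        Sigma[np.ix_(I, I)] = rho_within
    np.fill_diagonal(Sigma, 1.0)
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    return X, Sigma, groups

# Placeholder settings: 20 groups of 5 features, n = 500.
X, Sigma, groups = make_group_design(n=500, n_groups=20, group_size=5,
                                     rho_within=0.5, rho_between=0.2)
```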

We compare the performance of Deep-gKnock with group-SLOPE, available in the R package grpSLOPE (Gossmann et al., 2016). For each setting, we run the experiment for 100 replications and set the target gFDR level $q$. The empirical gFDR and power are reported in Tables 1 and 2.

In the linear model setting shown in Table 1, group-SLOPE fails to control the gFDR at the target level in each of the following three situations: (1) the sample size is small; (2) the between-group correlation is large; (3) the within-group correlation is large. In contrast, Deep-gKnock precisely controls the gFDR in all settings.

In the single-index model setting shown in Table 2, Deep-gKnock achieves higher power and consistently controls the gFDR in all settings, which demonstrates the advantage of Deep-gKnock in using a DNN to model the nonlinear relationship between the features and the response.

        Varying sample size                Varying between-group correlation        Varying within-group correlation
        Deep-gKnock      group-SLOPE       Deep-gKnock      group-SLOPE             Deep-gKnock      group-SLOPE
n       gFDR   Power     gFDR   Power   ρ_b  gFDR   Power     gFDR   Power     ρ_w  gFDR   Power     gFDR   Power
500 0.19 0.98 0.36 0.73 0.00 0.18 0.98 0.20 1.00 0.00 0.17 1.00 0.21 1.00
750 0.21 0.99 0.30 0.99 0.20 0.18 0.99 0.23 1.00 0.20 0.19 1.00 0.22 1.00
1000 0.20 0.99 0.21 1.00 0.40 0.20 0.99 0.26 1.00 0.40 0.14 1.00 0.24 1.00
1250 0.23 0.99 0.17 1.00 0.60 0.17 0.99 0.30 1.00 0.60 0.14 1.00 0.27 1.00
1500 0.21 0.99 0.15 1.00 0.80 0.18 0.99 0.40 1.00 0.80 0.11 0.95 0.30 1.00
Table 1: Simulation results for the linear model (6).
        Varying sample size                Varying between-group correlation        Varying within-group correlation
        Deep-gKnock      group-SLOPE       Deep-gKnock      group-SLOPE             Deep-gKnock      group-SLOPE
n       gFDR   Power     gFDR   Power   ρ_b  gFDR   Power     gFDR   Power     ρ_w  gFDR   Power     gFDR   Power
500 0.22 0.71 0.08 0.03 0.00 0.14 0.53 0.12 0.17 0.00 0.20 0.78 0.12 0.18
750 0.18 0.72 0.14 0.15 0.20 0.19 0.74 0.30 0.28 0.20 0.25 0.79 0.31 0.31
1000 0.18 0.72 0.12 0.21 0.40 0.20 0.82 0.46 0.35 0.40 0.17 0.83 0.42 0.34
1250 0.18 0.73 0.12 0.32 0.60 0.21 0.88 0.52 0.40 0.60 0.17 0.88 0.48 0.35
1500 0.19 0.75 0.14 0.45 0.80 0.19 0.86 0.57 0.43 0.80 0.17 0.94 0.53 0.34
Table 2: Simulation results for the single-index model (7).

5 Real data analysis

In addition to the simulation studies presented in Section 4, we also demonstrate the performance of Deep-gKnock on two real data sets. The target gFDR level is set to $q$.

5.1 Application to prostate cancer data

The prostate cancer data contains clinical measurements for 97 male patients who were about to receive a radical prostatectomy. It was analyzed in Hastie et al. (2013) to study the relationship between the response, the level of prostate-specific antigen (lpsa), and eight other features. The features are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).

For the categorical variable svi with two levels, we coded it by one dummy variable and treated it as one group. For each continuous variable, we used five B-spline basis functions to represent its effect and treated those five basis functions as a group. This gives us eight groups with a total of 36 features. We summarize the group-feature selection results in Table 3. The features selected by Deep-gKnock are the same as those selected using the Lasso in Hastie et al. (2013).

Method group-feature selected
group-SLOPE lcavol, lweight, svi, gleason
Deep-gKnock lcavol, lweight
Table 3: Group-feature selection results for prostate cancer data
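A sketch of the grouping construction used above follows: each continuous predictor is expanded into B-spline basis columns forming one group, while svi enters as a single-dummy group. The use of scikit-learn's SplineTransformer, its settings, and the file path are my choices and assumptions, not necessarily the authors'; the group index sets are computed from the transformer's output shape, and its columns are assumed to be blocked by input feature.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import SplineTransformer

# Load the prostate data (hypothetical file path; columns follow Hastie et al. (2013)).
prostate = pd.read_csv("prostate.csv")
cont_cols = ["lcavol", "lweight", "age", "lbph", "lcp", "gleason", "pgg45"]

# Expand each continuous predictor into B-spline basis columns; these settings
# should yield five basis functions per predictor.
spline = SplineTransformer(n_knots=3, degree=3)
X_cont = spline.fit_transform(prostate[cont_cols].to_numpy())

# svi is binary: one dummy column forming its own group.
X = np.hstack([X_cont, prostate[["svi"]].to_numpy()])
y = prostate["lpsa"].to_numpy()

# Group index sets: one group per continuous predictor, plus the svi group.
k = X_cont.shape[1] // len(cont_cols)          # basis functions per predictor
groups = [np.arange(j * k, (j + 1) * k) for j in range(len(cont_cols))]
groups.append(np.array([X_cont.shape[1]]))     # index of the svi column
```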

5.2 Application to yeast cell cycle data

We apply Deep-gKnock to the task of identifying the important transcription factors (TFs) related to regulation of the cell cycle. TFs belong to a class of proteins called binding proteins and control the rate at which DNA is transcribed into mRNA. We utilize a yeast cell cycle data set from Spellman et al. (1998) and Lee et al. (2002). The response is the messenger ribonucleic acid (mRNA) level of each gene, measured at 28 minutes during a cell cycle. The features are the measurements of binding information for 106 TFs. Out of the 106 TFs, 21 are known and experimentally confirmed cell-cycle-related TFs (Wang et al., 2007).

It has been shown that groups of TFs function in a coordinated fashion to direct cell division, growth, and death (Latchman, 1997). Following Ma et al. (2007), we use the K-means method to cluster the 106 TFs and determine the optimal number of clusters using the Gap statistic (Tibshirani et al., 1999). The Gap statistic suggests that the 106 TFs can be clustered into 20 groups. To visualize the clustering results, we use the Principal Component Analysis (PCA) algorithm to project the TFs onto their first two principal components, which results in a scatter plot of data points colored by their cluster labels in Figure 3. One of the clusters contains four TFs, all of which are experimentally verified.

Figure 3: Cluster plot for 106 TFs in Yeast Cell Cycle data
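A minimal sketch of the clustering and visualization step follows. The random placeholder array stands in for the real binding data (rows are genes, columns are the 106 TFs), and the Gap-statistic computation that selected 20 clusters is omitted; the cluster count is taken from the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder standing in for the real binding data: rows are genes, columns are TFs.
tf_binding = np.random.default_rng(0).standard_normal((500, 106))
tf_profiles = tf_binding.T                             # one row per TF

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(tf_profiles)
groups = [np.flatnonzero(kmeans.labels_ == c) for c in range(20)]

# Project the 106 TFs onto their first two principal components (cf. Figure 3).
coords = PCA(n_components=2).fit_transform(tf_profiles)
plt.scatter(coords[:, 0], coords[:, 1], c=kmeans.labels_, cmap="tab20")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("K-means clusters of 106 TFs")
plt.show()
```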

Group-SLOPE identified 7 groups, which contain 41 TFs, including 12 confirmed TFs. Deep-gKnock identified 5 groups, which contain 26 TFs, including 11 confirmed TFs. To demonstrate the selection performance, following Zhu and Su (2019), we also compute the probability of containing at least $k$ confirmed TFs in a random draw of $d$ TFs under a hypergeometric distribution, where $d$ is the number of TFs selected by a method and $k$ is the number of confirmed TFs among them; the results are reported in Table 4. We include the results for the Lasso in Table 4 as a benchmark. Smaller probability values suggest better feature selection performance. The small probability of Deep-gKnock suggests that the large number of confirmed TFs selected is not due to chance. Deep-gKnock also outperforms group-SLOPE.

Method
Lasso
group-SLOPE
Deep-gKnock
Table 4: Probability of containing at least $k$ confirmed TFs out of 85 unconfirmed and 21 confirmed TFs in a random draw of $d$ TFs.
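A short sketch of the hypergeometric calculation described above, using the counts reported in the text (21 confirmed and 85 unconfirmed TFs, so 106 in total; Deep-gKnock selects 26 TFs of which 11 are confirmed, group-SLOPE selects 41 of which 12 are confirmed).

```python
from scipy.stats import hypergeom

def prob_at_least(k, drawn, confirmed=21, total=106):
    """P(at least k confirmed TFs in a random draw of `drawn` TFs)."""
    return hypergeom.sf(k - 1, total, confirmed, drawn)

# Deep-gKnock: 26 TFs selected, 11 of them confirmed.
print(prob_at_least(11, 26))
# group-SLOPE: 41 TFs selected, 12 of them confirmed.
print(prob_at_least(12, 41))
```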

6 Conclusion

We have introduced a novel group-feature selection method, Deep-gKnock, which combines Knockoffs with DNNs. It provides end-to-end group-wise feature selection with controlled gFDR for high-dimensional data. With the flexibility of DNNs, we also provide deep representations with enhanced interpretability and reproducibility. Both synthetic and real data analyses demonstrate that Deep-gKnock achieves superior power and accurate gFDR control compared with state-of-the-art methods. Moreover, Deep-gKnock achieves scientifically meaningful group-feature selection results for real data sets.

References

  • Barber and Candès (2016) Barber, R. F. and Candès, E. J. (2016). A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574.
  • Barber et al. (2015) Barber, R. F., Candès, E. J., et al. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055–2085.
  • Bogdan et al. (2015) Bogdan, M., Van Den Berg, E., Sabatti, C., Su, W., and Candès, E. J. (2015). SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3), 1103.
  • Brzyski et al. (2018) Brzyski, D., Gossmann, A., Su, W., and Bogdan, M. (2018). Group SLOPE—adaptive selection of groups of predictors. Journal of the American Statistical Association, pages 1–15.
  • Candes et al. (2018) Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3), 551–577.
  • Chen et al. (2018) Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.
  • Dai and Barber (2016) Dai, R. and Barber, R. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. In International Conference on Machine Learning, pages 1851–1859.
  • Gossmann et al. (2016) Gossmann, A., Brzyski, D., Su, W., and Bogdan, M. (2016). grpSLOPE: Group Sorted L1 Penalized Estimation. R package version 0.2.1.
  • Hastie et al. (2013) Hastie, T., Tibshirani, R., and Friedman, J. (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York.
  • Huang et al. (2012) Huang, J., Breheny, P., and Ma, S. (2012). A selective review of group selection in high-dimensional models. Statistical science: a review journal of the Institute of Mathematical Statistics, 27(4).
  • Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Latchman (1997) Latchman, D. S. (1997). Transcription factors: an overview. The international journal of biochemistry & cell biology, 29(12), 1305–1312.
  • Lee et al. (2002) Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298(5594), 799–804.
  • Li et al. (2018) Li, Z., Xie, W., and Liu, T. (2018). Efficient feature selection and classification for microarray data. PloS one, 13(8), e0202167.
  • Lu et al. (2018) Lu, Y., Fan, Y., Lv, J., and Noble, W. S. (2018). DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pages 8690–8700.
  • Ma et al. (2007) Ma, S., Song, X., and Huang, J. (2007). Supervised group lasso with applications to microarray data analysis. BMC bioinformatics, 8(1), 60.
  • Meier et al. (2008) Meier, L., Van De Geer, S., and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 53–71.
  • Spellman et al. (1998) Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12), 3273–3297.
  • Su et al. (2016) Su, Z., Zhu, G., Chen, X., and Yang, Y. (2016). Sparse envelope model: efficient estimation and response variable selection in multivariate linear regression. Biometrika, 103(3), 579–593.
  • Tang and Liu (2014) Tang, J. and Liu, H. (2014). Feature selection for social media data. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4), 19.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
  • Tibshirani et al. (1999) Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P., et al. (1999). Clustering methods for the analysis of DNA microarray data. Dept. Statist., Stanford Univ., Stanford, CA, Tech. Rep.
  • Wang et al. (2007) Wang, L., Chen, G., and Li, H. (2007). Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics, 23(12), 1486–1494.
  • Yang and Zou (2015) Yang, Y. and Zou, H. (2015). A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing, 25(6), 1129–1141.
  • Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
  • Zhu and Su (2019) Zhu, G. and Su, Z. (2019). Envelope-based sparse partial least squares. The Annals of Statistics, (in press).