Deep supervised feature selection using Stochastic Gates

10/09/2018 ∙ by Yutaro Yamada, et al. ∙ Yale University

In this study, we propose a novel non-parametric embedded feature selection method based on minimizing the ℓ_0 norm of a vector of indicator variables, whose point-wise product with an input selects a subset of features. Our approach relies on the continuous relaxation of Bernoulli distributions, which allows our model to learn the parameters of the approximate Bernoulli distributions via tractable methods. Using these tools, we present a general neural network that simultaneously minimizes a loss function while selecting relevant features. We also provide an information-theoretic justification for incorporating Bernoulli distributions into our approach. Finally, we demonstrate the potential of the approach on synthetic and real-life applications.


1 Introduction

Technological advances are leading to the generation of data sets that are large both in sample size and in dimensionality, as scientists collect increasingly complex high-dimensional measurements. These data sets encapsulate both opportunities and challenges. For instance, in biology, we have access to tremendous numbers of biological markers on patients and wish to model complex interactions for prediction purposes. Unfortunately, that requires far more data than is generally available from clinical trials or from testing settings where experiments can be expensive. One way to mitigate this challenge is to identify the key set of features that influence prediction. Finding a subset of meaningful features might not only improve the analytic task but also provide new scientific findings and improve the interpretability of machine-based models [Ribeiro et al., 2016]. In bio-medical research, identifying a small set of relevant predictive bio-markers is essential [He and Yu, 2010]. There have been numerous works on non-parametric feature selection in high-dimensional settings [Friedman, 1991, Tibshirani, 1996, Lafferty and Wasserman, 2008, Ravikumar et al., 2007] and [Meinshausen and Yu, 2009, Raskutti et al., 2012]. However, the scalability of these methods to huge datasets is limited in practice.

Feature selection involves finding a subset of features that are sufficient for building a model. In the supervised setting, selected features might improve classification or regression performance. Furthermore, reducing the number of features has computational advantages and has been shown to improve model generalization on unseen data [Chandrashekar and Sahin, 2014].

Feature selection methods may be classified into three major categories: filter methods, wrapper methods, and embedded methods. Filter methods are task-independent; they attempt to filter out the non-relevant features prior to classification. Typically, a relevance score is created for each feature using some statistical comparison between each feature and the response (or class label). Filter methods have been demonstrated to be useful for various applications in [Koller and Sahami, 1996, Bekkerman et al., 2003, Davidson and Jalan, 2010]. More recent filter methods, such as [Song et al., 2007, Song et al., 2012, Chen et al., 2017], use kernels or the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005] to extract the most relevant features. Wrapper methods use the classifier's outcome to determine the relevance of each feature. As it is usually infeasible to check all subsets of features, various approaches have been proposed to select subsets of features: tree-based feature selection methods such as [Kohavi and John, 1997, Stein et al., 2005], sequential wrapper methods such as [Zhu et al., 2007, Reunanen, 2003], and an iterative kernel-based wrapper method proposed in [Allen, 2013]. The main disadvantage of all the mentioned wrapper methods is that they require recomputing the classifier for each subset of features.

Embedded methods aim at relieving this computational burden by simultaneously learning the classifier and the subset of most relevant features. Mutual Information (MI) is used by [Battiti, 1994, Peng et al., 2005, Estévez et al., 2009] to select the important features by incorporating it into the classification procedure. Decision trees and random forests are naturally used for feature selection: a feature may be ranked based on the number of improvements it achieves as a splitter node in the tree. Perhaps the most common embedded method is the least absolute shrinkage and selection operator (LASSO) [Tibshirani, 1996]. LASSO minimizes an objective function while enforcing an ℓ_1 constraint on the weights of the features, resulting in a very efficient and scalable feature selection procedure. Although LASSO has been extended in various works [Hans, 2009, Li et al., 2011, Li et al., 2006], it remains a linear method, which limits its classification capabilities. [Yamada et al., 2014] extends it to nonlinear functions by developing a feature-wise kernelized LASSO.

There have been a few attempts to use a neural network for feature selection. [Verikas and Bacauskiene, 2002] trains different networks on subsets of features and then selects features based on the performance of the network. [Kabir et al., 2010] propose a heuristic procedure for adding features based on their contribution to a partially trained network. [Roy et al., 2015] use the activation values of the first layer to rank the importance of the features. All these methods are wrapper methods and do not provide an approach to simultaneously train a network and select the features.

To the best of our knowledge, this study provides the first practical embedded feature selection method based on a neural network. The proposed learning procedure aims at simultaneously minimizing the ℓ_0 norm of a randomly selected subset of features along with a general loss function. Our method relies on recent efforts in developing a continuous differentiable approximation to the categorical distribution, called the Concrete relaxation or Gumbel-softmax trick [Jang et al., 2017, Maddison et al., 2016]. This enables us to introduce stochastic gates on the inputs, whose probability of being active is jointly learned with the model parameters via gradient descent. Our formulation naturally extends from linear models to neural networks by introducing indicator variables in the input layer of a network. We also provide an information-theoretic interpretation of the Bernoulli relaxation of best subset selection, which justifies the introduction of randomness for feature selection in our risk minimization. Finally, we apply our feature selection method to various artificial and real data sets to demonstrate its effectiveness.

2 Background

Feature selection can be considered as finding the subset of features that leads to the largest possible generalization or, equivalently, to minimal risk. Every subset of features is modeled by a vector of indicator variables s ∈ {0,1}^D, whose point-wise product with an input x ∈ ℝ^D provides a subset of features.

Given a parameterized family of regression functions f_θ, we try to find a vector of indicator variables s and a vector of parameters θ that minimize the expected risk

R(θ, s) = E_{(X,Y) ∼ P} [ L(f_θ(X ⊙ s), Y) ],

where ⊙ denotes the point-wise product, L is a loss function, and P is a probability measure on the domain of the training data. Here x is a realization of the input random variable X, and y is a realization of the response random variable (or class label) Y. In some cases we will also have the additional constraint ‖s‖_0 ≤ k, where ‖s‖_0 measures the sparsity of a given indicator variable s.
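As a concrete illustration of this formulation, the following minimal numpy sketch (our own toy example, not code from the paper) evaluates the empirical risk of a fixed linear model under two candidate indicator vectors; the data, model, and masks are assumptions chosen only to show how the point-wise product x ⊙ s selects features.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: only the first two of D = 5 features carry signal.
n, D = 200, 5
X = rng.normal(size=(n, D))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=n)

theta = np.array([2.0, -1.5, 0.0, 0.0, 0.0])   # model parameters (assumed known here)

def empirical_risk(theta, s, X, y):
    """Empirical version of E[L(f_theta(X ⊙ s), Y)] with a squared loss."""
    preds = (X * s) @ theta                    # point-wise product selects features
    return np.mean((preds - y) ** 2)

s_good = np.array([1, 1, 0, 0, 0])             # keeps the informative features
s_bad  = np.array([0, 1, 1, 1, 1])             # drops an informative feature
print(empirical_risk(theta, s_good, X, y))     # small risk
print(empirical_risk(theta, s_bad, X, y))      # much larger risk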

2.1 Feature Selection as an Optimization Problem

Most penalized linear models are expressed in terms of the following minimization:

min_θ (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i) + λ R(θ),   (1)

where L(f_θ(x_i), y_i) measures the loss on the training point (x_i, y_i) and R(θ) is a penalizing term. The hyper-parameter λ is a trade-off coefficient balancing the empirical error with this penalizing term. Examples of empirical errors are the hinge loss, the ℓ_2 loss, and the logistic loss. Common penalizing terms include the ℓ_1 norm and the ℓ_0 norm, both of which have been shown to enforce sparsity.

The minimization in Eq. (1) inspired the least absolute shrinkage and selection operator (LASSO), which enables a computationally efficient feature selection procedure. LASSO may be described as follows: let n be the sample size, D be the number of features, y ∈ ℝ^n be the response variable, X ∈ ℝ^{n×D} be the design matrix, and β ∈ ℝ^D be the vector of weight parameters. The objective function of LASSO can be written as follows:

min_β (1/2n) ‖y − Xβ‖_2^2 + λ ‖β‖_1.   (2)

The above minimization problem has been applied in numerous scientific fields and other domains.
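For reference, Eq. (2) can be solved with off-the-shelf tools; the short sketch below fits scikit-learn's Lasso to synthetic data with only a few informative coefficients (the data and the regularization value are illustrative assumptions, not the paper's experimental setup).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

n, D = 100, 20
X = rng.normal(size=(n, D))
beta_true = np.zeros(D)
beta_true[:3] = [3.0, -2.0, 1.5]               # only three non-zero coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# alpha plays the role of lambda in Eq. (2); its value here is an arbitrary choice.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)         # indices of features with non-zero weight
print(selected)                                 # typically recovers features 0, 1, 2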

3 Proposed Method

The rise of ℓ_1 norm constraints in sparse estimation was partially due to their computational efficiency: the ℓ_1 norm is the closest convex function to the ℓ_0 norm [Hastie et al., 2015]. By adding the ℓ_1 norm as a regularization term, LASSO's minimization procedure selects the most relevant features. We take a different approach to approximating the ℓ_0 norm, one that extends to non-linear models while remaining computationally efficient. In this section, we describe how to incorporate a continuous approximation of the Bernoulli distribution (presented in [Jang et al., 2017, Maddison et al., 2016]) into a general differentiable loss function. Applying backpropagation to the regularized loss function provides a natural network-based feature selection procedure. We denote this method Deep Neural Network Feature Selection (DNN-FS).

Penalizing the ℓ_0 norm in the empirical risk minimization can be understood as best subset selection. In logistic regression, for example, the objective can be written as follows:

min_β Σ_{i=1}^{n} log(1 + exp(−y_i x_i^T β)) + λ ‖β‖_0,   (3)

where λ is a tuning parameter and ‖β‖_0 is the number of non-zero entries of β.

We now introduce a vector of indicator variables s ∈ {0,1}^D, where s_d = 1 indicates that the d-th feature is included and s_d = 0 indicates that the d-th feature is absent. By taking the pointwise product of x and s, we can represent any subset of features. To view the above optimization from a probabilistic perspective, one can introduce a Bernoulli distribution with parameter π_d over each indicator variable s_d. Then, using a logistic regression loss, the expected risk from Eq. (1) becomes

R(θ, π) = E_{(X,Y)} E_{s ∼ Bern(π)} [ L(f_θ(X ⊙ s), Y) ] + λ Σ_{d=1}^{D} π_d,

where Bern(π) is a product of Bernoulli distributions and (x, y) are realizations of the input and response variables from the data distribution. The goal of empirical risk minimization is to find θ and π that minimize R(θ, π); however, due to the discrete nature of s, optimizing the above objective with respect to the distribution parameters π does not admit tractable methods such as gradient descent.

Following the recent work of [Louizos et al., 2017], we review how to take gradients of the objective with respect to π by relaxing the discrete nature of s. This allows us to compute Monte Carlo estimates of the expected risk and enables gradient descent to minimize the loss function.

Our goal is to introduce a distribution that is continuous and differentiable with respect to its parameters as a replacement for the Bernoulli distribution. To do this, we approximate a Bernoulli random variable, represented as a one-hot 2-dimensional vector, with parameters π_1 and π_2 such that π_1 + π_2 = 1. This approximation is called the Binary Concrete distribution [Jang et al., 2017, Maddison et al., 2016]; here we briefly describe its derivation.

First, a random perturbation G_k is independently drawn for each of the two categories from a Gumbel(0,1) distribution:

G_k = −log(−log U_k),  U_k ∼ Uniform(0,1),  k = 1, 2.   (4)

Then we apply a temperature-dependent sigmoid function as a surrogate for finding a one-hot vector through an argmax operator:

z_1 = σ((log π_1 − log π_2 + G_1 − G_2) / τ),  z_2 = 1 − z_1,   (5)

where σ is the sigmoid function and τ > 0 is the temperature. The resulting random 2-dimensional vector is called a Binary Concrete vector, which we denote as

z = (z_1, z_2) ∼ BinConcrete(π, τ).   (6)

Since z_1 + z_2 = 1, we set z̃ to be the first coordinate of z. We introduce such a variable for each feature, forming the vector of approximate indicator variables whose point-wise product with the input provides a subset of features.
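The sampling procedure in Eqs. (4)–(6) can be written in a few lines; the following numpy sketch is our own illustration of the Binary Concrete relaxation (the function name and the temperature value are assumptions, not the authors' code).

import numpy as np

rng = np.random.default_rng(0)

def binary_concrete_sample(pi1, tau, size=1):
    """Draw relaxed Bernoulli samples in (0, 1) following Eqs. (4)-(6).

    pi1: probability of the 'on' category; pi2 = 1 - pi1.
    tau: temperature; small tau pushes samples toward {0, 1}.
    """
    u = rng.uniform(size=(size, 2))
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) perturbations, Eq. (4)
    logits = np.log(pi1) - np.log(1.0 - pi1)      # log(pi1 / pi2)
    z1 = 1.0 / (1.0 + np.exp(-(logits + g[:, 0] - g[:, 1]) / tau))  # Eq. (5)
    return z1                                     # first coordinate of Eq. (6)

samples = binary_concrete_sample(pi1=0.8, tau=0.3, size=5)
print(samples)   # values near 1 most of the time, occasionally near 0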

The recent paper [Louizos et al., 2017] proposed the hard concrete distribution, which extends the above Binary Concrete distribution so that more probability mass can concentrate around 0 and 1; this increases sparsity in the final solution. The derivation of the hard concrete distribution appears in the Appendix of this manuscript. By replacing the Bernoulli distribution with the hard concrete distribution, we can now compute the derivative of the objective with respect to the distribution parameters π:

∇_π E_{z̃}[ L(f_θ(x ⊙ z̃), y) ] = E_G[ ∇_π L(f_θ(x ⊙ z̃(π, G)), y) ],

where the inner gradient can be computed via back-propagation since z̃ is a deterministic function of π, and the randomness only comes from G (defined in Eq. 4). The expectation over G is approximated by the empirical per-batch realizations.
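To make the whole procedure concrete, here is a minimal PyTorch sketch of a gated input layer trained end-to-end; it follows the hard concrete construction of [Louizos et al., 2017] described above, but the class names, hyperparameters (temperature, stretch interval, network width, λ), and toy data are our own assumptions rather than the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticGates(nn.Module):
    """Input gates based on a hard concrete relaxation of Bernoulli variables."""

    def __init__(self, dim, tau=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(dim))   # log(pi_1 / pi_2) per feature
        self.tau, self.gamma, self.zeta = tau, gamma, zeta

    def forward(self, x):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log(1 - u)        # Logistic(0,1) = G1 - G2
            s = torch.sigmoid((self.log_alpha + noise) / self.tau)
        else:
            s = torch.sigmoid(self.log_alpha / self.tau)   # deterministic at test time
        s_bar = s * (self.zeta - self.gamma) + self.gamma  # stretch to (gamma, zeta)
        z = s_bar.clamp(0.0, 1.0)                          # hard concrete gate in [0, 1]
        return x * z                                       # point-wise product x ⊙ z

    def expected_l0(self):
        # P(gate > 0) = sigmoid(log_alpha - tau * log(-gamma / zeta)); see Eq. (14).
        return torch.sigmoid(
            self.log_alpha - self.tau * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

class DNNFS(nn.Module):
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.gates = StochasticGates(dim)
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),                          # logistic-regression output
        )

    def forward(self, x):
        return self.net(self.gates(x)).squeeze(-1)

# Toy training loop on random data (illustrative only).
torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X[:, 0] + X[:, 1] > 0).float()                        # only two informative features
model, lam = DNNFS(dim=10), 0.05
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(X), y) + lam * model.gates.expected_l0()
    loss.backward()
    opt.step()
print(torch.sigmoid(model.gates.log_alpha / model.gates.tau))  # per-feature gate activation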

4 Related Work

The two works most related to this study are [Louizos et al., 2017] and [Chen et al., 2018]. [Louizos et al., 2017] introduces the hard concrete distribution and shows that incorporating it into a deep neural network sparsifies the learned weights; the authors demonstrate how this sparsification leads to fast convergence and improved generalization. In this study, we focus on a different goal. By applying the stochastic gates only to the input layer, we demonstrate how the hard concrete distribution is useful for parametric feature selection. This allows us to scale to thousands of features (as shown in Section 6.4); this high-dimensional regime is common in fields such as bioinformatics. In [Chen et al., 2018], the Gumbel-softmax trick is used to develop a framework for interpreting a pre-trained model. The main difference between our work and [Chen et al., 2018] is that their method focuses on finding a subset of features for a particular instance, whereas our method aims to construct a subset of features based on all the training examples. This approach is more appropriate when we want to apply feature selection to data that are known to have consistent important features, such as gene expression data.

Furthermore, their method aims to optimize over a family of distributions whose cardinality is the number of size-k subsets of the feature set. In order to reduce the number of possible candidates, they employ a subset sampling procedure based on the approximate categorical distribution for the purpose of instance-wise feature selection. This is not suitable for feature selection in general, since when the dimension of the feature space is large, the variance of the samples from the categorical distribution becomes large, whereas our method employs an approximate Bernoulli for each dimension of the features, whose sample variance is independent of the dimension of the feature space.

Some authors tackle embedded feature selection by extending LASSO and group LASSO to neural network models. Although [Scardapane et al., 2017] and [Feng and Simon, 2017] have a goal similar to ours, their empirical results do not achieve as much sparsity as practitioners would like. Our approach, which utilizes stochastic gates instead of relying on regularizing norms, has an advantage in terms of achieving a high sparsity level while maintaining good performance.

5 Connection to Mutual Information

In this section, we demonstrate that replacing the original subset selection problem with the Bernoulli probabilistic setting can be justified from a mutual information perspective in the feature selection setting. This is motivated by, but different from, the work of [Chen et al., 2018]. Recall that the mutual information between two random variables X and Y can be defined as

I(X; Y) = H(Y) − H(Y | X),   (7)

where H(Y) and H(Y | X) are the entropy of Y and the conditional entropy of Y given X, respectively [Cover and Thomas, 2006]. Next, we present our two assumptions for this section:

  • Assumption 1: There exists a subset of indices S* with cardinality equal to k such that for any set S with S* ⊄ S and any index j ∈ S* \ S we have I(X_{S ∪ {j}}; Y) > I(X_S; Y).

  • Assumption 2: I(X_{S*}; Y) = I(X; Y).

The first assumption means that if we do not include an element of S*, then we can improve our prediction accuracy by adding it. The second assumption means that we only need the variables in S* to predict Y; any additional variables are superfluous. These assumptions are quite benign. For instance, they are satisfied if X is drawn from a Gaussian distribution with a non-degenerate covariance matrix and Y = f(X_{S*}) + ε, where ε is noise independent of X and f is not degenerate. With this in place we can present our result.

Proposition 5.1.

Suppose that the above assumptions hold for the model. Then, solving the optimization

max_{S ⊆ {1, …, D}, |S| = k} I(X_S; Y)   (8)

is equivalent to solving the optimization

max_{π ∈ [0,1]^D, Σ_d π_d ≤ k} E_{s ∼ Bern(π)} [ I(X ⊙ s; Y) ],   (9)

where the coordinates of s are drawn according to a Bernoulli distribution with parameter vector π.

Due to length constraints, we leave the proof of this proposition to the appendix.

6 Experiments

In this section, we perform a variety of experiments to evaluate the potential of the proposed method (DNN-FS). In the first two experiments, we generate artificial samples by randomly sampling points from a parametric distribution and then assigning a label based on a non-linear function. Each data set is concatenated with nuisance noisy coordinates; these coordinates do not hold information regarding the class identity. We compute both the classification accuracy and the feature weights extracted by our proposed approach. To evaluate the strength of DNN-FS, we compare it to recursive feature elimination (RFE) [Gysels et al., 2005], feature ranking using support vector classification (SVC) [Chang and Lin, 2008], LASSO [Tibshirani, 1996], and two tree-based methods, [Rastogi and Shim, 2000] and [Strobl et al., 2008], which we denote Tree and RF, respectively. We also compare our method to a neural network with the same architecture but without the feature selection layer; we denote this method DNN. Each method extracts a weight for the relevance of each feature. The wrapper methods, such as RFE, SVC, Tree, and RF, are retrained based on the features with the highest extracted weights. In the following experiments, we compare classification accuracy as well as a metric which we denote FEAT, defined as the sum of weights over the informative features divided by the sum over all weights.

The architecture of the deep neural network is similar across all experiments. We use 3 layers with a hyperbolic tangent (Tanh) activation function. For the classification experiments, the final layer performs logistic regression, whereas for the Cox hazard model (Section 6.5) we use scaled exponential linear units (SELU) and the partial likelihood as the loss.

The hyperparameter λ affects the portion of selected features. In the artificial experiments (Sections 6.1, 6.2 and 6.3) we evaluate the effect of λ on the convergence of the network. For the biological experiments (Sections 6.4, 6.5) we vary λ until the network selects only a small fraction of the original features and then evaluate the network's generalization on the test set.

6.1 Two Moons classification with nuisance features

Figure 1: (a) Realizations from the "two moons" shaped binary classification classes; the first two coordinates are the relevant features, and the remaining coordinates are noisy features drawn from a zero-mean Gaussian. (b) Classification accuracy vs. the number of irrelevant noisy dimensions. (c) The portion of relevant weight attributed to the informative features; this metric is denoted FEAT.

In the first experiment, we construct a dataset based on the "two moons" shaped classes, concatenated with noisy features. The first two coordinates are generated by adding zero-mean Gaussian noise to two nested half circles. An example of one realization of the first two coordinates is presented in Fig. 1(a). Each additional nuisance feature is drawn from a zero-mean Gaussian distribution. Classification accuracy and the portion of weights assigned to the real signal (FEAT) are presented in Fig. 1(b) and 1(c). Based on the classification accuracies, it is evident that all methods correctly identify the most relevant features. As expected, DNN-FS and Random Forest are the only methods that achieve near-perfect classification. DNN-FS and LASSO are the only methods which naturally sparsify the feature space. The parameter λ was fixed to a single value, which seems to hold for a wide range of dimensions; moreover, the performance seemed stable for a wide range of λ's.
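For reproducibility, a dataset of this form can be generated as in the sketch below; the noise scale, sample size, and function name are our own illustrative choices, since the paper's exact values are not given here.

import numpy as np
from sklearn.datasets import make_moons

def two_moons_with_nuisance(n_samples=1000, n_nuisance=50, noise_std=1.0, seed=0):
    """Two informative 'moons' coordinates plus Gaussian nuisance coordinates."""
    rng = np.random.default_rng(seed)
    X_moons, y = make_moons(n_samples=n_samples, noise=0.1, random_state=seed)
    X_noise = noise_std * rng.normal(size=(n_samples, n_nuisance))
    return np.hstack([X_moons, X_noise]), y

X, y = two_moons_with_nuisance()
print(X.shape)   # (1000, 52): 2 relevant features + 50 nuisance features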

6.2 Noisy binary XOR classification

The binary XOR is a challenging classification problem. The first two coordinates are drawn from a "fair" Bernoulli distribution, and the response variable is set as the XOR of these first two coordinates, y = x_1 ⊕ x_2. As in the previous experiment, the remaining coordinates are noisy features drawn from a zero-mean Gaussian. Part of the generated points is used for training all methods. The experiment is repeated 50 times for each number of noisy dimensions; the average test classification accuracies are presented in Fig. 2(a), and the average FEAT metric is presented in Fig. 2(b). It seems that although every single feature is statistically independent of y, the proposed method manages to select the relevant features even when the dimension is high and the sample size is reasonably small.
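The noisy XOR data can be generated as follows (a minimal sketch; the sample size, noise variance, and function name are assumed values, not the paper's):

import numpy as np

def noisy_xor(n_samples=1500, n_nuisance=50, noise_std=1.0, seed=0):
    """Binary XOR of two fair Bernoulli coordinates plus Gaussian nuisance features."""
    rng = np.random.default_rng(seed)
    x12 = rng.integers(0, 2, size=(n_samples, 2))            # fair Bernoulli coordinates
    y = np.logical_xor(x12[:, 0], x12[:, 1]).astype(int)      # y = x1 XOR x2
    X_noise = noise_std * rng.normal(size=(n_samples, n_nuisance))
    return np.hstack([x12, X_noise]), y

X, y = noisy_xor()
print(X.shape, y.mean())   # (1500, 52), labels roughly balanced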

Figure 2: (a) Classification accuracy vs. the number of irrelevant noisy dimensions for the XOR problem. (b) The portion of relevant weight attributed to the informative features; this metric is denoted FEAT.

6.3 Sparse Handwritten digits classification

In the following toy example, we attempt to classify between images of handwritten digits of 3's and 8's. Image features represent spatial information; therefore, for this binary classification we expect that some of the left-side pixels would be sufficient for the separation. The experiment is performed as follows. We hold out part of the data as the test set and train on the remainder. We then apply DNN-FS and evaluate the classification accuracy and the selected features. The experiment was repeated 10 times; the extracted features and accuracies were consistent across trials. We noticed that a relatively small number of selected features, positioned to the southwest of and close to the center of the images, achieves very high classification accuracy. An example of 9 randomly selected samples overlaid with the weights of the selected features is presented in Fig. 3(a). In this experiment, we further evaluate the effect of λ on the sparsification and accuracy of the method. We apply the approach to a randomly sampled training set and vary λ over a wide range of values. In Fig. 3(b) we compare the accuracy and sparsity level across this range. It seems that within this wide range DNN-FS converges to a sparse solution with reasonable classification accuracy.

Figure 3: (a) 9 samples from MNIST (black) overlaid with the subset of features (red) selected by DNN-FS; based on these features, high classification accuracy is reached. The lighter red color indicates that a selected feature overlays the digit. For these 9 randomly selected samples, all the 8's have values within the support of the selected features, whereas for the 3's there is no intersection. Note that at the intersection the selected pixels appear white. (b) Accuracy and sparsity ratio (fraction of selected features) for a wide range of λ values.

6.4 Purified populations of peripheral blood monocytes (PBMCs)

Single-cell RNA sequencing (scRNA-seq) is a novel technology that simultaneously measures gene expression levels of hundreds of thousands of individual cells. This new tool is revolutionizing our understanding of cellular biology, as it enables, among other things, the discovery of new cell types as well as the detection of subtle differences between similar but distinct cells. [Zheng et al., 2017] subjected more than 90,000 purified populations of peripheral blood monocytes (PBMCs) to scRNA-seq analysis. Such blood cells have been thoroughly characterized and studied, and some of them are well separated based on their functionality. Here we focus on B cells and cytotoxic T cells, for which various marker genes are known to be unique. In the following experiment, we attempt to extract these known markers by applying DNN-FS to a set of labeled PBMCs. We first filter out the genes that are lowly expressed in the cells, which leaves us with a few thousand genes (features). We use only a small portion of the cells in these two classes; this again is a challenging regime for a neural network. DNN-FS achieves high classification accuracy using only a small number of selected genes. Intriguingly, when inspecting these selected genes, we find various known markers of B and T cells, or genes related to proteins which bind to them. Among the selected genes are CD3E, which is unique to T cells, CD79A and CD79B, which are unique to B cells, and CD37, which is more abundant in B cells than in cytotoxic T cells.

6.5 Cox Proportional Hazard Deep Network Models for Survival Analysis

In survival analysis, we are interested in building a predictive model for the survival time T of an individual based on the covariates x. Survival times are assumed to follow a distribution characterized by the survival function S(t) = P(T > t). The hazard function, which measures the instantaneous rate of death, is defined by

h(t) = lim_{dt → 0} P(t ≤ T < t + dt | T ≥ t) / dt.

We can relate the two functions in the following way: S(t) = exp(−∫_0^t h(u) du).

Proportional hazard models assume a multiplicative effect of the covariates on the hazard function, such that

h(t | x) = h_0(t) exp(β^T x),   (10)

where h_0(t) is a baseline hazard function, often taken from the exponential or Weibull family, and β is the parameter of interest.

One of the difficulties in estimating β in survival analysis is that a large portion of the available data is censored. However, Cox observed that in order to obtain estimates it is sufficient to maximize the partial likelihood, which is defined as follows:

L(β) = ∏_{i : E_i = 1} exp(β^T x_i) / Σ_{j ∈ R(t_i)} exp(β^T x_j),   (11)

where E_i indicates whether the i-th death was observed, t_i is the corresponding event time, and R(t_i) is the set of individuals still at risk at time t_i.

Previous work proposed DeepSurv [Katzman et al., 2018], which uses a deep neural network to replace the linear relation between the covariates and the log-risk, demonstrating improvements in survival time prediction over existing models such as the Cox proportional hazards (CPH) model and the random survival forest [Ishwaran and Kogalur, 2007, Ishwaran et al., 2008].
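The negative log of the partial likelihood in Eq. (11) is what a DeepSurv-style network minimizes; below is a small PyTorch sketch of that loss for a batch sorted by event time (our own illustrative implementation, not the DeepSurv code, and it ignores tied event times for simplicity).

import torch

def neg_log_partial_likelihood(log_risk, time, event):
    """Negative log partial likelihood of Eq. (11), ignoring tied event times.

    log_risk: network output, the estimated log-risk per individual, shape (n,)
    time:     observed times, shape (n,)
    event:    1 if death observed, 0 if censored, shape (n,)
    """
    order = torch.argsort(time, descending=True)           # risk set = everyone with a later or equal time
    log_risk, event = log_risk[order], event[order]
    log_cumsum = torch.logcumsumexp(log_risk, dim=0)        # log of the sum over the risk set
    per_event = (log_risk - log_cumsum) * event
    return -per_event.sum() / event.sum().clamp(min=1)

# Toy usage with random values.
torch.manual_seed(0)
n = 8
log_risk = torch.randn(n, requires_grad=True)
time = torch.rand(n)
event = torch.randint(0, 2, (n,)).float()
loss = neg_log_partial_likelihood(log_risk, time, event)
loss.backward()
print(loss.item())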

We apply our feature selection method within DeepSurv to see how it improves performance on the breast cancer dataset METABRIC. The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) uses gene and protein expression profiles to determine new breast cancer subgroups in order to help physicians provide better treatment recommendations. The METABRIC dataset consists of gene expression data and clinical features; a portion of the patients have an observed death due to breast cancer, with a median survival time of 116 months.

We picked 16 genes that are used in the Oncotype DX test, a genomic test that analyzes the activity of genes that could affect breast cancer behavior, and joined these genes with the patients' clinical features (hormone treatment indicator, radiotherapy indicator, chemotherapy indicator, ER-positive indicator, age at diagnosis). We then reserved a portion of the patients as the test set.

We first applied DeepSurv to this dataset without feature selection. We tested 4 different feed-forward neural network architectures and repeated each experiment 5 times. We then evaluated the predictive ability of the learned models using the concordance index (CI), which measures the quality of rankings and is a standard performance measure for model assessment in survival analysis. The best average CI is 0.639.

To see how our method performs, we randomly sample genes to construct nuisance features. In total, we obtain 221 features, of which 200 are randomly sampled. We applied DNN-FS to this data. With an appropriate λ, the model picks 3 features, which is roughly 1% of the feature size, and achieves an average CI of 0.649 over 5 runs. This demonstrates that with just three features (GRB7, age at diagnosis, chemotherapy indicator), our model is able to perform better than the original DeepSurv model.

7 Conclusion

In this paper, we propose a novel embedded feature selection method based on stochastic gates. It has an advantage over previous LASSO-based methods in terms of achieving a high level of sparsity in non-linear models such as neural networks. We relate ℓ_0 norm minimization of the input gates to mutual information maximization under a specific choice of variational family. In experiments, we observe that our method consistently outperforms existing embedded feature selection methods on both synthetic and real biological datasets.

7.1 Acknowledgements

We thank Ronen Basri for helpful discussions. O.L. and Y.K. are supported by NIH grant 1R01HG008383-01A1.

7.2 Appendix

Proof of Proposition 5.1 We now give the proof of the proposition showing the equivalence. Let S be a subset such that S* ⊄ S; that is, there exists some element of S* that is not in S. For any such set we have that I(X_S; Y) < I(X_{S*}; Y). Indeed, if we let j ∈ S* \ S, then we have

I(X_{S*}; Y) = I(X; Y) ≥ I(X_{S ∪ {j}}; Y) > I(X_S; Y),

where the final inequality follows from Assumption 1. Assumption 2 also yields that for any set S such that S* ⊆ S, we have I(X_S; Y) = I(X_{S*}; Y). Now, when we consider the Bernoulli optimization problem, we have

E_{s ∼ Bern(π)} [ I(X ⊙ s; Y) ] = Σ_{S ⊆ {1, …, D}} P_π(s = 1_S) I(X_S; Y).

The mutual information can be expanded as I(X ⊙ 1_S; Y) = I(X_S; Y), where we have used the fact that each zeroed-out coordinate is independent of everything else. The current optimization requires that the distribution of s is a product distribution. However, if we can show that the optimal unconstrained distribution is also a product distribution, then the two are equivalent. The reason for this relaxation is that the unconstrained problem becomes a linear program. Now, from the above we know that the optimal value of the optimization is attained by any set S that contains S*. Hence, any unconstrained distribution should place all of its mass on such subsets in order to maximize the objective. However, there is the additional constraint Σ_d π_d ≤ k, where k = |S*|. Thus, since the probability measure will always place probability one on selecting the coordinates in S*, the marginal sum will always be greater than k unless no other indices are selected. Hence, the optimal solution is to select the distribution so that all of the mass is placed on the subset S* and no mass elsewhere. As this is also a product distribution, this completes the proof of the claim.

The Hard Concrete distribution

[Louizos et al., 2017] introduces a modification of the Binary Concrete distribution, whose sampling procedure is as follows:

s ∼ BinConcrete(π, τ),  s̄ = s (ζ − γ) + γ,  z = min(1, max(0, s̄)),

where (γ, ζ) is an interval with γ < 0 and ζ > 1. This induces a new distribution whose support is [0, 1] instead of (0, 1). With γ < 0 and ζ > 1, the probability density concentrates its mass near the end points, since values larger than 1 are rounded to one, whereas values smaller than 0 are rounded to zero.

The CDF of s is given by

Q_s(x) = σ(τ log(x / (1 − x)) − log α),   (12)

from which the CDF of s̄ follows as

Q_s̄(x) = Q_s((x − γ) / (ζ − γ)),   (13)

where α = π_1 / π_2 and σ is the sigmoid function. Now, the probability of the gate being active, which is P(z ≠ 0), can be written as

P(z ≠ 0) = 1 − Q_s̄(0) = σ(log α − τ log(−γ / ζ)).   (14)

As shown in [Louizos et al., 2017], this function has improved characteristics compared to the concrete distribution.
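As a quick sanity check of Eqs. (12)–(14), the following sketch compares the closed-form gate-activation probability with a Monte Carlo estimate from the sampling procedure above (the function names and parameter values are arbitrary choices for illustration, not part of the paper).

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob_gate_active(log_alpha, tau, gamma=-0.1, zeta=1.1):
    """Closed-form P(z != 0) from Eq. (14)."""
    return sigmoid(log_alpha - tau * np.log(-gamma / zeta))

def sample_hard_concrete(log_alpha, tau, gamma=-0.1, zeta=1.1, size=100_000):
    """Hard concrete samples following the stretch-and-clamp procedure above."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=size)
    s = sigmoid((np.log(u) - np.log(1 - u) + log_alpha) / tau)
    s_bar = s * (zeta - gamma) + gamma
    return np.clip(s_bar, 0.0, 1.0)

log_alpha, tau = 0.5, 0.7
z = sample_hard_concrete(log_alpha, tau)
print(prob_gate_active(log_alpha, tau))   # closed form
print((z > 0).mean())                     # Monte Carlo estimate, should be close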

References

  • [Allen, 2013] Allen, G. I. (2013). Automatic feature selection via weighted kernels and regularization. Journal of Computational and Graphical Statistics, 22(2):284–299.
  • [Battiti, 1994] Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on neural networks, 5(4):537–550.
  • [Bekkerman et al., 2003] Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. (2003). Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 3(Mar):1183–1208.
  • [Chandrashekar and Sahin, 2014] Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28.
  • [Chang and Lin, 2008] Chang, Y.-W. and Lin, C.-J. (2008). Feature ranking using linear svm. In Causation and Prediction Challenge, pages 53–64.
  • [Chen et al., 2018] Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.
  • [Chen et al., 2017] Chen, J., Stern, M., Wainwright, M. J., and Jordan, M. I. (2017). Kernel feature selection via conditional covariance minimization. In Advances in Neural Information Processing Systems, pages 6946–6955.
  • [Cover and Thomas, 2006] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA.
  • [Davidson and Jalan, 2010] Davidson, J. L. and Jalan, J. (2010). Feature selection for steganalysis using the mahalanobis distance. In Media Forensics and Security Ii, volume 7541, page 754104. International Society for Optics and Photonics.
  • [Estévez et al., 2009] Estévez, P. A., Tesmer, M., Perez, C. A., and Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2):189–201.
  • [Feng and Simon, 2017] Feng, J. and Simon, N. (2017). Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification. ArXiv e-prints.
  • [Friedman, 1991] Friedman, J. H. (1991). Multivariate adaptive regression splines. The annals of statistics, pages 1–67.
  • [Gretton et al., 2005] Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer.
  • [Gysels et al., 2005] Gysels, E., Renevey, P., and Celka, P. (2005). Svm-based recursive feature elimination to compare phase synchronization computed from broadband and narrowband eeg signals in brain–computer interfaces. Signal Processing, 85(11):2178–2189.
  • [Hans, 2009] Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4):835–845.
  • [Hastie et al., 2015] Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.
  • [He and Yu, 2010] He, Z. and Yu, W. (2010). Stable feature selection for biomarker discovery. Computational biology and chemistry, 34(4):215–225.
  • [Ishwaran et al., 2008] Ishwaran, H., Kogalur, U., Blackstone, E., and Lauer, M. (2008). Random survival forests. Annals of Applied Statistics, 2(3):841–860.
  • [Ishwaran and Kogalur, 2007] Ishwaran, H. and Kogalur, U. B. (2007). Random survival forests for r.
  • [Jang et al., 2017] Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax.
  • [Kabir et al., 2010] Kabir, M. M., Islam, M. M., and Murase, K. (2010). A new wrapper feature selection approach using neural network. Neurocomputing, 73(16-18):3273–3283.
  • [Katzman et al., 2018] Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., and Kluger, Y. (2018). Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC Medical Research Methodology, 18.
  • [Kohavi and John, 1997] Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273–324.
  • [Koller and Sahami, 1996] Koller, D. and Sahami, M. (1996). Toward optimal feature selection. Technical report, Stanford InfoLab.
  • [Lafferty and Wasserman, 2008] Lafferty, J. and Wasserman, L. (2008). Rodeo: Sparse, greedy nonparametric regression. Ann. Statist., 36(1):28–63.
  • [Li et al., 2006] Li, F., Yang, Y., and Xing, E. P. (2006). From lasso regression to feature vector machine. In Advances in Neural Information Processing Systems, pages 779–786.
  • [Li et al., 2011] Li, W., Feng, J., and Jiang, T. (2011). Isolasso: a lasso regression approach to rna-seq based transcriptome assembly. In International Conference on Research in Computational Molecular Biology, pages 168–188. Springer.
  • [Louizos et al., 2017] Louizos, C., Welling, M., and Kingma, D. P. (2017). Learning sparse neural networks through l0 regularization. CoRR, abs/1712.01312.
  • [Maddison et al., 2016] Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712.
  • [Meinshausen and Yu, 2009] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist., 37(1):246–270.
  • [Peng et al., 2005] Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8):1226–1238.
  • [Raskutti et al., 2012] Raskutti, G., Wainwright, M. J., and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13:389–427.
  • [Rastogi and Shim, 2000] Rastogi, R. and Shim, K. (2000). Public: A decision tree classifier that integrates building and pruning. Data Mining and Knowledge Discovery, 4(4):315–344.
  • [Ravikumar et al., 2007] Ravikumar, P., Liu, H., Lafferty, J. D., and Wasserman, L. A. (2007). Spam: Sparse additive models. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 1201–1208.
  • [Reunanen, 2003] Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3(Mar):1371–1382.
  • [Ribeiro et al., 2016] Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”why should i trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1135–1144, New York, NY, USA. ACM.
  • [Roy et al., 2015] Roy, D., Murty, K. S. R., and Mohan, C. K. (2015). Feature selection using deep neural networks. In Neural Networks (IJCNN), 2015 International Joint Conference on, pages 1–6. IEEE.
  • [Scardapane et al., 2017] Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. (2017). Group sparse regularization for deep neural networks. Neurocomput., 241(C):81–89.
  • [Song et al., 2012] Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. (2012). Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434.
  • [Song et al., 2007] Song, L., Smola, A., Gretton, A., Borgwardt, K. M., and Bedo, J. (2007). Supervised feature selection via dependence estimation. In Proceedings of the 24th international conference on Machine learning, pages 823–830. ACM.
  • [Stein et al., 2005] Stein, G., Chen, B., Wu, A. S., and Hua, K. A. (2005). Decision tree classifier for network intrusion detection with ga-based feature selection. In Proceedings of the 43rd annual Southeast regional conference-Volume 2, pages 136–141. ACM.
  • [Strobl et al., 2008] Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1):307.
  • [Tibshirani, 1996] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
  • [Verikas and Bacauskiene, 2002] Verikas, A. and Bacauskiene, M. (2002). Feature selection with neural networks. Pattern Recognition Letters, 23(11):1323–1335.
  • [Yamada et al., 2014] Yamada, M., Jitkrittum, W., Sigal, L., Xing, E. P., and Sugiyama, M. (2014). High-dimensional feature selection by feature-wise kernelized lasso. Neural computation, 26(1):185–207.
  • [Zheng et al., 2017] Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nature communications, 8:14049.
  • [Zhu et al., 2007] Zhu, Z., Ong, Y.-S., and Dash, M. (2007). Wrapper–filter feature selection algorithm using a memetic framework. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(1):70–76.