Feature selection is a preprocessing technique that determines and ranks the significance of features in order to eliminate those that are insignificant to the task at hand. As examined by Yu and Liu (2003), it is a powerful tool for alleviating the curse of dimensionality, reducing training time, increasing the accuracy of learning algorithms, and improving data comprehensibility. For classification problems, Weston et al. (2001) divide feature selection problems into two types: (a) given a fixed $k \ll d$, where $d$ is the total number of features, find the $k$ features that lead to the least classification error, and (b) given a maximum expected classification error, find the smallest possible $k$. In this paper, we focus on problems of type (a). Weston et al. (2001) formalize this type of feature selection problem as follows. Given a set of functions $y = f(x, \alpha)$, find a mapping of data $x \mapsto (x * \sigma)$, $\sigma \in \{0, 1\}^d$, along with the parameters $\alpha$ for the function $f$ that lead to the minimization of
$$\tau(\sigma, \alpha) = \int V(y, f((x * \sigma), \alpha)) \, dP(x, y) \qquad (1)$$
subject to $\|\sigma\|_0 = k$, where the distribution $P(x, y)$, which determines how samples are generated, is unknown and can be inferred only from the training set, $x * \sigma = (x_1 \sigma_1, \dots, x_d \sigma_d)$ is an elementwise product, $V(\cdot, \cdot)$ is a loss function, and $\|\cdot\|_0$ is the $\ell_0$-norm.
Feature selection algorithms can be divided into three types, as elaborated in Chandrashekar and Sahin (2014): Filter, Wrapper, and Embedded methods. Filter methods do not make use of the underlying learning algorithm and rely fully on intrinsic characteristics of the dataset to compute feature importance, while wrapper methods iteratively compute the learning performance of a classifier to rank the importance of features. Li et al. (2017) assert that filter methods are typically more computationally efficient than wrapper methods but, because no learning algorithm supervises the selection, the features they select are often not as good as those selected by wrapper methods. Embedded methods use the intrinsic structure of a learning algorithm to embed feature selection into the underlying model, as a means of reconciling the computational efficiency of filter methods with the learning algorithm interaction of wrapper methods. As examined by Saeys et al. (2007), embedded methods perform feature selection during the training of the learning algorithm on which they are employed and are thus model dependent: the high saliency of the selected features applies only to that specific model. There is hence a need for fast feature selection methods whose selected features also apply to models other than the one employed, which motivates our work on efficient wrapper feature selection methods that are not model dependent. Weston et al. (2001) define wrapper methods as an exploration of the feature space, where the saliency of subsets of features is ranked using the estimated accuracy of a learning algorithm. Hence, $\tau(\sigma, \alpha)$ in (1) can be approximated by minimizing
$$\tau_{\mathrm{wrap}}(\sigma, \alpha) = \min_{\sigma} \tau_{\mathrm{alg}}(\sigma) \qquad (2)$$
subject to $\sigma \in \{0, 1\}^d$, with $d$ the total number of features, where $\tau_{\mathrm{alg}}$ is a classifier having estimates of $\alpha$. Wrapper methods can further be divided into three types: Exhaustive Search Wrappers, Random Search Wrappers, and Heuristic Search Wrappers. We focus on Heuristic Search Wrappers that select or eliminate one feature at each iteration because, unlike Exhaustive Search Wrappers, they are computationally efficient and, unlike Random Search Wrappers, they have deterministic guarantees on the set of selected salient features, as illustrated in Hira and Gillies (2015).
1.1 Relevance and Redundancy
The saliency of features is determined by two factors: Relevance and Redundancy. Irrelevant features are insignificant because their direct removal does not result in a drop in classification accuracy, while redundant features are insignificant because they are linearly or non-linearly dependent on other features and can be inferred - or approximated - from them as long as these other features are not removed. As shown in Fig. 3 and detailed by Guyon et al. (2008), one does not necessarily imply the other. Fig. 3 (a) shows redundant features (represented by and values) that are both relevant, as removal of any of the features will lead to an inability to classify the data and (b) shows features, where the removal of any one feature does not deter the classification ability, and thus any of them is irrelevant given the other. However, they are not redundant as the value of any of them cannot be well approximated using the other.
Filter methods are better at identifying redundant features while wrapper methods are better at identifying irrelevant features, and this highlights the power of embedded methods as they utilize aspects of both in feature selection as mentioned in Bolón-Canedo et al. (2013). Since most wrapper methods do not take advantage of filter method based identification of redundant features, there is a need to incorporate a filter based technique to identify redundant features into wrapper methods, which we address using autoencoders.
1.2 Training the Classifier multiple times
Although wrapper methods often deliver higher classification accuracies than filter methods, their computational complexity is often significantly higher because the classifier needs to be trained for every considered feature set at every iteration. For greedy backward elimination wrappers, the removal of one out of $d$ features requires removing each feature separately, training the classifier with the remaining $d - 1$ features, and testing its performance on the cross-validation set. The feature whose removal results in the highest classification accuracy is removed because its removal has the least impact on performance. This is the procedure followed by most backward feature selection algorithms, such as the Recursive Feature Elimination (RFE) method proposed by Guyon et al. (2002). For iterative greedy elimination of $k$ features from a set of $d$ features, the classifier has to be trained $\sum_{i=1}^{k} (d - i + 1)$ times, which poses a practical limitation when the number of features is large. Moreover, the saliency of the selected features is only as good as the classifier that ranks them, so state-of-the-art classifiers (CNNs for image data, etc.) are needed for ranking. These models are often complex and therefore slow to train, which implies a trade-off between speed and the saliency of the selected features. We address this issue by training the feature ranker model only once.
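To give a sense of scale for the retraining cost described above, the following minimal sketch counts the classifier trainings needed by a greedy backward elimination wrapper. It assumes that $d$ denotes the total number of features and $k$ the number of features to eliminate, and that iteration $i$ trains one classifier per remaining feature; the function name and the values $d = 784$, $k = 100$ are illustrative choices, not taken from the experiments.

```python
def greedy_backward_trainings(d: int, k: int) -> int:
    """Count classifier trainings needed to greedily eliminate k of d features."""
    if not 0 <= k <= d:
        raise ValueError("k must satisfy 0 <= k <= d")
    # Iteration i (1-based) starts with d - i + 1 features and retrains the
    # classifier once for every candidate feature that could be removed.
    return sum(d - i + 1 for i in range(1, k + 1))


d, k = 784, 100  # illustrative values only
print(greedy_backward_trainings(d, k))  # 73450
print(k * d - k * (k - 1) // 2)         # closed form of the same sum: 73450
```

A wrapper that trains its ranking model once avoids this growth entirely, at the price of having to simulate feature removal instead of retraining.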
2 State of the art
In this section, we describe the leading fast feature selection methods against which we compare our proposed method. With the exception of FQI, implementations of these methods can be found in the scikit-feature package created by Li et al. (2017).
The Fisher Score encourages the selection of features whose values are similar within the same class and distinct across different classes. Duda et al. (2012) define the Fisher Score for feature $f_i$ as
$$F(f_i) = \frac{\sum_{j=1}^{c} n_j (\mu_{ij} - \mu_i)^2}{\sum_{j=1}^{c} n_j \sigma_{ij}^2}$$
where $c$ is the number of classes, $n_j$ represents the number of training examples in class $j$, $\mu_i$ represents the mean value of feature $f_i$, $\mu_{ij}$ represents the mean value of feature $f_i$ for training examples in class $j$, and $\sigma_{ij}^2$ represents the variance of feature $f_i$ for training examples in class $j$.
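As an illustration of the Fisher Score described above, the sketch below computes the ratio of class-size-weighted between-class scatter to class-size-weighted within-class variance for a single feature. The function name, the use of the population variance, and the toy data are assumptions made for this example; it is not the scikit-feature implementation.

```python
from collections import defaultdict


def fisher_score(values, labels):
    """Fisher Score of one feature: between-class scatter over within-class scatter."""
    by_class = defaultdict(list)
    for v, y in zip(values, labels):
        by_class[y].append(v)
    mu = sum(values) / len(values)  # overall mean of the feature
    between = within = 0.0
    for vs in by_class.values():
        n_j = len(vs)
        mu_j = sum(vs) / n_j                               # class mean
        var_j = sum((v - mu_j) ** 2 for v in vs) / n_j     # population variance (assumed)
        between += n_j * (mu_j - mu) ** 2
        within += n_j * var_j
    # A feature that is constant within every class has zero within-class scatter.
    return between / within if within > 0 else float("inf")


labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]  # class means differ clearly
uninformative = [1.0, 3.0, 2.0, 1.0, 3.0, 2.0]  # identical distribution in both classes
print(fisher_score(separating, labels), fisher_score(uninformative, labels))
```

The separating feature receives a much higher score than the uninformative one, whose class means coincide with the overall mean.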
Conditional Mutual Information Maximization (CMIM) is a fast feature selection method proposed in Vidal-Naquet and Ullman (2003) and Fleuret (2004) that iteratively selects features while maximizing the Shannon mutual information between the feature being selected and the class labels, given the already selected features. Li et al. (2017) define the CMIM score for feature $f_i$ as
$$J_{\mathrm{CMIM}}(f_i) = \min_{f_j \in S} I(X_i; Y \mid X_j)$$
where $S$ is the set of currently selected features, $X_i$ is the random variable representing the value of feature $f_i$, $Y$ is the random variable representing the class label, and $I(X; Y \mid Z)$ is the conditional mutual information between discrete random variables $X$ and $Y$ given a random variable $Z$. We use empirical distributions computed from the training set to evaluate the mutual information functions.
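The CMIM criterion can be made concrete with a small sketch that estimates conditional mutual information from empirical counts and takes the minimum over the already selected features. The function names and the fallback to plain mutual information when nothing has been selected yet are assumptions for this example, and the inputs are assumed to be discrete sequences of equal length.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(x, y, z):
    """Empirical I(X; Y | Z) in bits for three equally long discrete sequences."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    total = 0.0
    for (xi, yi, zi), c in n_xyz.items():
        # p(x,y,z) * log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ] written with raw counts.
        total += (c / n) * log2(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
    return total


def cmim_score(candidate, label, selected):
    """Smallest information the candidate adds about the label given any one selected feature."""
    if not selected:
        # Assumed convention: with nothing selected, condition on a constant,
        # which reduces to the plain mutual information I(X; Y).
        return conditional_mutual_information(candidate, label, [0] * len(label))
    return min(conditional_mutual_information(candidate, label, s) for s in selected)


label = [0, 0, 1, 1]
copy_of_label = [0, 0, 1, 1]
independent = [0, 1, 0, 1]
print(cmim_score(copy_of_label, label, []))               # informative on its own
print(cmim_score(copy_of_label, label, [copy_of_label]))  # adds nothing once selected
print(cmim_score(copy_of_label, label, [independent]))    # still informative
```

Taking the minimum penalizes a candidate that is informative on its own but redundant with even one feature that has already been selected.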
Efficient and Robust Feature Selection (RFS) is an efficient feature selection method proposed by Nie et al. (2010) that exploits the noise robustness property of the joint $\ell_{2,1}$-norm loss function by applying $\ell_{2,1}$-norm minimization to both the loss function and its associated regularization function. Li et al. (2017) define RFS’s objective function as
$$\min_{W} \; \|XW - Y\|_{2,1} + \gamma \|W\|_{2,1}$$
where $X$ is the data matrix, $Y$ is the one-hot label indicator matrix, $W$ is a matrix indicating feature contributions to classes, and $\gamma$ is the regularization parameter. Features are then ranked by the $\ell_2$-norm values of the corresponding rows in the optimal matrix $W$. The value of $\gamma$ for our experiments was chosen by running RFS over a wide range of values and picking the value that led to the highest accuracy on the cross-validation set.
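The ranking step of RFS can be sketched independently of the solver. The example below assumes that a weight matrix with one row per feature and one column per class has already been obtained, and that the row norm used for ranking is the Euclidean norm; the helper names and the toy matrix are hypothetical, and the optimization itself is not shown.

```python
from math import sqrt


def row_norms(W):
    """Euclidean norm of each row of W (one row per feature)."""
    return [sqrt(sum(v * v for v in row)) for row in W]


def l21_norm(W):
    """Joint l2,1-norm: the sum of the Euclidean norms of the rows."""
    return sum(row_norms(W))


def rank_features_by_row_norm(W):
    """Feature indices ordered from most to least important."""
    norms = row_norms(W)
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)


# Hypothetical optimal weight matrix for 3 features and 2 classes.
W = [[3.0, 4.0],
     [0.0, 0.1],
     [1.0, 0.0]]
print(l21_norm(W))
print(rank_features_by_row_norm(W))  # [0, 2, 1]
```

Because the $\ell_{2,1}$ penalty sums row norms, it drives entire rows toward zero, which is what makes the row norm a usable importance score.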
Feature Quality Index (FQI) is a feature selection method proposed by De Rajat et al. (1997) that ranks features using the sensitivity of the considered model's output to changes in its input. FQI serves as the main inspiration for our proposed method and, as elaborated in Verikas and Bacauskiene (2002), the FQI of feature $f_q$ is computed as
$$\mathrm{FQI}(f_q) = \frac{1}{n} \sum_{i=1}^{n} \| ov_i - ov_i^{q} \|^2$$
where $n$ is the total number of training examples, $ov_i$ is the output of the neural network when the $i$-th training example is the input, and $ov_i^{q}$ is the output of the neural network when the $i$-th training example, with the value of the $q$-th feature set to $0$, is the input.
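The FQI computation can be illustrated with a few lines of code. The sketch below treats the trained network as an arbitrary callable that maps an input vector to an output vector and averages the squared output change caused by zeroing one feature; the function names and the toy linear model are assumptions for this example and stand in for a trained neural network.

```python
def fqi(model, X, q):
    """Mean squared change in the model output when feature q is set to 0."""
    total = 0.0
    for x in X:
        x_q = list(x)
        x_q[q] = 0.0  # simulate the removal of feature q
        out, out_q = model(x), model(x_q)
        total += sum((a - b) ** 2 for a, b in zip(out, out_q))
    return total / len(X)


def toy_model(x):
    """Stand-in for a trained network: output depends strongly on x[0], weakly on x[1]."""
    return [2.0 * x[0], 0.1 * x[1]]


X = [[1.0, 1.0], [2.0, -1.0]]
print(fqi(toy_model, X, 0))  # large: feature 0 drives the output
print(fqi(toy_model, X, 1))  # small: feature 1 barely matters
```

Note that FQI only measures how much the output moves, not whether the prediction becomes worse, since it never consults the labels.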
3.1 Simulating the Removal of a Feature
Is it possible to simulate the effects of removing a feature using a model that has already been trained on a training set consisting of all the features? During backpropagation, higher losses in the output layer tend to manifest as a result of larger changes, from the optimal, in the values of the weights in the neural network. Generally, the magnitudes of the weights connected to the input-layer neurons that correspond to more salient features tend to be larger, as extensively documented by Bauer Jr et al. (2000), Belue and Bauer Jr (1995), and Priddy et al. (1993). Similar to FQI, we simulate the removal of a feature by setting the input to the neuron corresponding to that feature to $0$. This essentially means that the input neuron is dead, because none of the weights/synapses from that neuron to the next layer will have an impact on the output of the neural network. If these weights influence the output of the neural network, then the model that has been trained with all features will experience a degradation in its ability to classify the input data. Since more salient features possess weights of higher magnitude, these weights influence the output to a greater extent, and setting the values of more salient features to $0$ in the input results in a greater degradation in the ability of the neural network to classify the input than doing the same for less salient features. This can be measured using the loss (Mean Squared Error or Cross-Entropy Error) at the output layer, where a greater loss corresponds to a greater degradation. This is the basis of the Weight Based Analysis feature selection methods outlined by Lal et al. (2006). We also observed that normalizing the training set to a mean of $0$ and a variance of $1$ helps this process, because setting the input of a feature to $0$ then effectively sets it to its mean, and lower variances in the data now manifest as lower weights in the input layer. To summarize, the pre-trained model prioritizes the removal of features that are not relevant to the classification task by simulating the removal of each feature and computing the resulting loss of the model. Features whose removal results in a lower loss are less relevant, and we refer to this loss value as the feature’s Relevance Score.
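The Relevance Score described above can be sketched as follows. The example assumes a pre-trained model exposed as a callable that returns class probabilities, a training set normalized so that 0 is each feature's mean, and cross-entropy as the loss; the names `relevance_scores` and `toy_ranker`, the fixed weights, and the data are hypothetical stand-ins for a real Ranker Model.

```python
from math import exp, log


def cross_entropy(probs, labels):
    """Mean cross-entropy of predicted class probabilities against integer labels."""
    eps = 1e-12  # guards against log(0)
    return -sum(log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def relevance_scores(predict_proba, X, y):
    """Loss of the already trained model when each feature in turn is set to 0."""
    scores = []
    for j in range(len(X[0])):
        X_masked = [[0.0 if i == j else v for i, v in enumerate(row)] for row in X]
        scores.append(cross_entropy([predict_proba(r) for r in X_masked], y))
    return scores


def toy_ranker(row, w=(4.0, 0.2)):
    """Stand-in for a trained model: logistic unit with a large weight on feature 0."""
    p1 = 1.0 / (1.0 + exp(-(w[0] * row[0] + w[1] * row[1])))
    return [1.0 - p1, p1]


X = [[-1.0, 0.5], [-0.8, -0.3], [0.9, 0.4], [1.1, -0.6]]
y = [0, 0, 1, 1]
scores = relevance_scores(toy_ranker, X, y)
print(scores)  # zeroing feature 0 hurts far more than zeroing feature 1
```

No retraining occurs inside the loop: each score costs one forward pass over the masked training set, which is where the speed advantage over retraining-based wrappers comes from.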
3.2 Autoencoders Reveal Non-Linear Correlations
In some cases, the weights connected to less salient features also possessed high magnitudes. This is because these features were redundant in the presence of other salient features, as described in Sec. 1.1. Hence, we use a filter-based technique that is independent of a learning algorithm to detect these redundant features. We experimented with methods such as PCA, as detailed by Witten et al. (2009), and correlation coefficients, as detailed by Mitra et al. (2002), but these methods revealed only linear correlations in the data. We therefore introduced autoencoders into the proposed method, because they are able to reveal non-linear correlations, as examined by Sakurada and Yairi (2014). To eliminate one feature from a set of $d$ features, we train an autoencoder with one hidden layer consisting of $d - 1$ hidden neurons using the normalized training set. This hidden layer can be dense, LSTM, or of another type, depending on the data. To evaluate a feature, we set its corresponding values in the training set to $0$ and pass the set through the autoencoder. We then take the Mean Squared Error (MSE) between the output and the original input, that is, the input before the values of the evaluated feature were set to $0$, and repeat this for each of the $d$ features separately. The feature with the lowest MSE is the least salient feature, because the other features were able to compensate for its loss in the latent space consisting of $d - 1$ neurons. We refer to this MSE as the feature’s Redundancy Score. We found that linearly correlated features possessed the lowest MSEs, followed by features with higher orders of correlation.
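The Redundancy Score can be sketched with the autoencoder treated as a black box. In the example below, `reconstruct` is a hypothetical callable standing in for a trained autoencoder; to keep the sketch self-contained it is replaced by a fixed linear projection onto a two-dimensional subspace, which is an assumption of this example and only captures linear structure, whereas a trained autoencoder can also capture non-linear correlations. The toy data are already normalized, with features 0 and 1 identical and feature 2 independent.

```python
def redundancy_scores(reconstruct, X):
    """MSE between the original inputs and the reconstructions of feature-masked inputs."""
    n, d = len(X), len(X[0])
    scores = []
    for j in range(d):
        squared_error = 0.0
        for row in X:
            masked = [0.0 if i == j else v for i, v in enumerate(row)]
            out = reconstruct(masked)
            # Compare against the input as it was before feature j was masked.
            squared_error += sum((o - v) ** 2 for o, v in zip(out, row))
        scores.append(squared_error / (n * d))
    return scores


def toy_reconstruct(v):
    """Stand-in for a trained autoencoder with 2 latent units (linear projection)."""
    shared = (v[0] + v[1]) / 2.0  # latent unit capturing the correlated pair
    return [shared, shared, v[2]]  # second latent unit passes feature 2 through


X = [[1.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, -1.0, -1.0]]
print(redundancy_scores(toy_reconstruct, X))
```

The two correlated features obtain lower scores than the independent one, since each can be partially recovered from the other after masking.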
3.3 Using Transfer Learning to prevent retraining
To eliminate $k$ out of $d$ features, we first pick a state-of-the-art model depending on the data type and train it on the training set, using part of it as the cross-validation set. We call this model the Ranker Model (RM), as it allows us to rank the saliency of the features. Next, we set the input for each of the features in all the examples of the training set to $0$, one at a time in a round-robin fashion, and evaluate the modified training sets on the RM to obtain a list of Relevance Scores. Additionally, we train the autoencoder, with one hidden layer consisting of one fewer hidden neuron than the current number of features, and pass the same modified training sets through it to obtain a Redundancy Score for each of the features. We then divide the Relevance and Redundancy Scores by their corresponding ranges so that both contribute equally to the final decision, and add the corresponding Relevance and Redundancy Scores to obtain the Saliency Score. The feature with the lowest Saliency Score is then eliminated from the training set. In the context of the RM, elimination means that the feature is permanently set to $0$ for all the examples in the training set, so the same RM can be reused in the next iteration of AMBER. In the context of the autoencoder, elimination means that the feature is permanently removed from the training set for all the examples. This process is repeated $k$ times to eliminate $k$ features. Note that AMBER reuses the same RM and only has to train a single autoencoder at every iteration, which is not computationally expensive because of the autoencoder's simple architecture. AMBER, as its name suggests, uses the RM and autoencoders to examine both relevance and redundancy relationships among features in the training data that they are already familiar with in order to iteratively eliminate features.
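The score combination and the elimination loop described above can be sketched as follows. The sketch assumes that the Relevance and Redundancy Scores are supplied by two callables that return one score per remaining feature; these callables, the function names, and the fixed toy scores are hypothetical, and in practice the scores would be recomputed from the Ranker Model and a freshly trained autoencoder at every iteration.

```python
def scale_by_range(values):
    """Divide scores by their range so both score types contribute comparably."""
    span = max(values) - min(values)
    return [v / span if span > 0 else 0.0 for v in values]


def saliency_scores(relevance, redundancy):
    """Sum of range-scaled Relevance and Redundancy Scores, one per feature."""
    return [a + b for a, b in zip(scale_by_range(relevance), scale_by_range(redundancy))]


def eliminate(features, k, relevance_fn, redundancy_fn):
    """Iteratively drop the feature with the lowest Saliency Score, k times."""
    remaining, eliminated = list(features), []
    for _ in range(k):
        saliency = saliency_scores(relevance_fn(remaining), redundancy_fn(remaining))
        lowest = remaining[saliency.index(min(saliency))]
        remaining.remove(lowest)
        eliminated.append(lowest)
    return remaining, eliminated


# Fixed toy scores standing in for model-derived ones (assumption of this example).
REL = {"a": 0.9, "b": 0.1, "c": 0.5}
RED = {"a": 0.4, "b": 0.2, "c": 0.1}
remaining, eliminated = eliminate(
    ["a", "b", "c"], 2,
    relevance_fn=lambda feats: [REL[f] for f in feats],
    redundancy_fn=lambda feats: [RED[f] for f in feats],
)
print(remaining, eliminated)  # ['a'] ['b', 'c']
```

Dividing by the range rescales the spread of each score type to one without shifting its minimum to zero, which follows the description above rather than a standard min-max normalization.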
The testing set is kept separate and is never seen by any of the components of AMBER. Once the final set of eliminated features is determined, these features are completely removed from both the training and testing sets. The final architecture is then trained on the training set consisting of the $d - k$ remaining features and tested on the testing set consisting of the same $d - k$ features.
4.1 Experimental Setup
We used a GPU server equipped with Nvidia Tesla P100 GPUs, each with GB of memory. With the exception of the RadioML2016.10b dataset for which we used alltimes and plotted the average accuracies at each feature count for the comparisons in Fig. 13. The source code for AMBER, links to the datasets considered, and the error bars for the comparison plots are available at https://github.com/sharanramjee/AMBER.
Each dataset corresponds to a different domain of data, and together they encompass both low- and high-dimensional data to demonstrate the versatility of AMBER. The state-of-the-art models used for feature selection in both AMBER and FQI are illustrated in Fig. 8. The final models that are trained on the set of selected features are common across all the compared feature selection methods and are trained until early stopping is achieved with a patience value of
to ensure that the comparisons are fair. Moreover, the final models used for the Reuters and Wisconsin Breast Cancer dataset are the same as the state-of-the-art models used by AMBER for feature selection. For all the datasets, the softmax activation function is applied to the output layer with the cross-entropy loss function. The test split used for the Reuters and the Wisconsin Breast Cancer dataset iswhile the test split used for the RadioML2016.10b dataset is . The comparison plots in Fig. 13 were not smooth when feature counts for all features in decrements of one were plotted and thus, in the interest of readability, we plotted them in larger feature count decrements as specified for each dataset. Finally, to demonstrate that the final model does not necessarily have to be the same as the RM used by AMBER, we used different models as the final model and the RM for the MNIST and RadioML2016.10b datasets.
MNIST
This is an image dataset created by LeCun et al. (1998) that consists of x grayscale images with classes, each belonging to one of the digits, along with a test set that contains images of the same dimensions. The total number of features here is . The final model used is an MLP model consisting of fully connected layers with , , and
(output layer) neurons respectively. ReLU is applied to each of the layers withneurons and these layers are followed by dropout layers with a dropout rate of .
Reuters
This is a text dataset from the Keras built-in datasets that consists of newswires from Reuters with classes, each representing a different topic. Each wire is encoded as a sequence of word indices, where the index corresponds to a word’s frequency in the dataset. For our demonstration, the most frequent words will be used and thus, the total number of features is .
Wisconsin Breast Cancer
This is a biological dataset created by Street et al. (1993) that consists of features that represent characteristics of cell nuclei that have been measured from an image of Fine Needle Aspirates (FNAs) of breast mass. The dataset consists of examples that belong to classes: malignant or benign. The total number of features here is .
RadioML2016.10b
This is a dataset of signal samples used by O’Shea et al. (2016) that consists of
-sample complex time-domain vectors withclasses, representing different modulation types. It consists of Signal to Noise Ratio (SNR) values ranging from - dB to dB in increments of dB; we only choose the results of the dB data to better illustrate the results. Each of the samples consists of a real part and a complex part and thus, the input dimensions are x, where the total number of features is . This dataset is unique because only pairs of features (belonging to the same sample) can be eliminated. AMBER, like FQI, is powerful in such situations as the pairs of features can be set to to evaluate their collective rank. This is also useful in the case of GANs, where sets of pixels/features in a -D pool need to be evaluated to craft adversarial attacks as elaborated by Papernot et al. (2016). The other feature selection methods fail in this case because they account for feature interactions between the pairs of features as well, which is why AMBER outperforms them as it does not. For the other methods, to eliminate pairs of features belonging to the same sample, we simply added the scores belonging to the two features to obtain a single score for the pairs of features belonging to the same sample before eliminating the sample. The final model used here is the ResNet as detailed by Ramjee et al. (2019).
4.3 Classification Accuracies
The final model classification accuracy plots across the selected features for the compared methods can be observed in Fig. 13. We observe the impressive performance delivered by AMBER that generally outperforms that of all considered methods, particularly when the number of selected features becomes very low (about average accuracy with out of features for the Cancer dataset and about average accuracy with out of samples for the RadioML dataset).
5.1 Feature Selection leads to higher accuracies
In some cases, such as the Wisconsin Breast Cancer and RadioML datasets, we observed that with AMBER, the accuracy of the final model trained using the selected subset of features is higher than that of the model trained using all the features. Kohavi and Sommerfield (1995) explain that, in most cases where this happens, the model was overfitting the data because some training examples belonging to two different classes were compactly packed in the feature space consisting of all the features. Once these examples are projected onto the feature space consisting of the selected subset of features, the same data can be better distinguished, as the decision boundary divides these training examples better.
For instance, in the case of the RadioML2016.10b dataset, the accuracy with all the samples ( features) was about , where the main source of error was the AM-DSM and WBFM classes that were often misclassified. After AMBER is used to reduce the number of samples to , the accuracy increased to about . To illustrate this, we used PCA to reduce the dimensions of the training set that belong to these two classes to -D and in accordance to Metsalu and Vilo (2015), plotted the training set before and after AMBER.
5.2 Overfitting the RM facilitates better Feature Selection
In Sec. 3, we elaborated on how more salient features possess weights of higher magnitude in the input layer than less salient features, as evidenced by Wang et al. (2004); this property of neural networks serves as the basis for AMBER. The performance of AMBER depends heavily on the performance of the RM that ranks the features. In some cases, however, even state-of-the-art models do not achieve high classification accuracies. In such cases, we can obtain better feature selection results with AMBER by overfitting the RM on the training set.
We will demonstrate this using the toy example illustrated in Fig. 20
that portrays the architecture used for the RM along with the corresponding hyperparameters. Here, featureis and feature is . Feature is more salient than feature as it is able to form a decision boundary that allows for better classification of the data (shown in (c)), while feature cannot (shown in (b)). Each of these features have weights in the input layer and as expected, the weights connected to feature manifest into weights of higher average magnitude than those belonging to feature as shown in Fig. 24
. As the number of training epochs increases, the difference in the average magnitudes of the weights increases. This implies that the RM will be able to better rank the saliency of features as the difference between the Relevance Scores of more and less salient features increases. Thus, we can overfit the RM on the training set by training it for a large number of epochs without regularization to enable better feature selection.
6 Concluding Remarks
AMBER presents a valuable balance in the trade-off between computational efficiency in feature selection, at which filter-based methods excel, and performance (i.e., classification accuracy), at which traditional wrapper methods excel. It is inspired by FQI, with two major differences:
- Instead of making the final selection of the desired feature set based on simulating the model’s performance with the elimination of only a single feature, AMBER simulates the final model’s performance with candidate combinations of selected features.
- The autoencoder is used to capture redundant features, a property that is missing in FQI as well as in most wrapper feature selection methods.
We found AMBER to require slightly more computation time than the four considered state-of-the-art methods, but far less time than state-of-the-art wrapper feature selection methods, as it does not require retraining the RM in each iteration. It is also worth mentioning that the final values of the weights connected to the input features after training depend on the initialization of these weights. We believe, and plan to investigate in future work, that different initialization schemes could create a larger difference between the magnitudes of the weights of more salient and less salient features, rather than always relying on overfitting the training set.
- Bauer Jr et al.  Kenneth W Bauer Jr, Stephen G Alsing, and Kelly A Greene. Feature screening using signal-to-noise ratios. Neurocomputing, 31(1-4):29–44, 2000.
- Belue and Bauer Jr  Lisa M Belue and Kenneth W Bauer Jr. Determining input features for multilayer perceptrons. Neurocomputing, 7(2):111–121, 1995.
- Bolón-Canedo et al.  Verónica Bolón-Canedo, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. A review of feature selection methods on synthetic data. Knowledge and information systems, 34(3):483–519, 2013.
- Chandrashekar and Sahin  Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
- De Rajat et al.  K De Rajat, Nikhil R Pal, and Sankar K Pal. Feature analysis: Neural network and fuzzy set theoretic approaches. Pattern Recognition, 30(10):1579–1590, 1997.
- Duda et al.  Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012.
- Fleuret  François Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine learning research, 5(Nov):1531–1555, 2004.
- Guyon et al.  Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.
- Guyon et al.  Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A Zadeh. Feature extraction: foundations and applications, volume 207. Springer, 2008.
- Hira and Gillies  Zena M Hira and Duncan F Gillies. A review of feature selection and feature extraction methods applied on microarray data. Advances in bioinformatics, 2015, 2015.
- Kohavi and Sommerfield  Ron Kohavi and Dan Sommerfield. Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In KDD, pages 192–197, 1995.
- Lal et al.  Thomas Navin Lal, Olivier Chapelle, Jason Weston, and André Elisseeff. Embedded methods. In Feature extraction, pages 137–165. Springer, 2006.
- LeCun et al.  Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Li et al.  Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. Feature selection. ACM Computing Surveys, 50(6):1–45, Dec 2017. ISSN 0360-0300. doi: 10.1145/3136625. URL http://dx.doi.org/10.1145/3136625.
- Liu et al.  Xiaoyu Liu, Diyu Yang, and Aly El Gamal. Deep neural network architectures for modulation classification. In 2017 IEEE 51st Asilomar Conference on Signals, Systems, and Computers, pages 915–919, 2017.
- Metsalu and Vilo  Tauno Metsalu and Jaak Vilo. Clustvis: a web tool for visualizing clustering of multivariate data using principal component analysis and heatmap. Nucleic acids research, 43(W1):W566–W570, 2015.
- Mitra et al.  Pabitra Mitra, CA Murthy, and Sankar K. Pal. Unsupervised feature selection using feature similarity. IEEE transactions on pattern analysis and machine intelligence, 24(3):301–312, 2002.
- Nie et al.  Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. Efficient and robust feature selection via joint ℓ2, 1-norms minimization. In Advances in neural information processing systems, pages 1813–1821, 2010.
- O’Shea et al.  T. O’Shea, J. Corgan, and T. Clancy. Convolutional radio modulation recognition networks. In Proc. International conference on engineering applications of neural networks, 2016.
- Papernot et al.  Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.
- Priddy et al.  Kevin L Priddy, Steven K Rogers, Dennis W Ruck, Gregory L Tarr, and Matthew Kabrisky. Bayesian selection of important features for feedforward neural networks. Neurocomputing, 5(2-3):91–103, 1993.
- Ramjee et al.  Sharan Ramjee, Shengtai Ju, Diyu Yang, Xiaoyu Liu, Aly El Gamal, and Yonina C Eldar. Fast deep learning for automatic modulation classification. arXiv preprint arXiv:1901.05850, 2019.
- Saeys et al.  Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. bioinformatics, 23(19):2507–2517, 2007.
- Sakurada and Yairi  Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2Nd Workshop on Machine Learning for Sensory Data Analysis, MLSDA’14, pages 4:4–4:11, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3159-3. doi: 10.1145/2689746.2689747. URL http://doi.acm.org/10.1145/2689746.2689747.
- Street et al.  W Nick Street, William H Wolberg, and Olvi L Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In Biomedical image processing and biomedical visualization, volume 1905, pages 861–871. International Society for Optics and Photonics, 1993.
- Verikas and Bacauskiene  Antanas Verikas and Marija Bacauskiene. Feature selection with neural networks. Pattern Recognition Letters, 23(11):1323–1335, 2002.
- Vidal-Naquet and Ullman  Michel Vidal-Naquet and Shimon Ullman. Object recognition with informative features and linear classification. In ICCV, volume 3, page 281, 2003.
- Wang et al.  Xizhao Wang, Yadong Wang, and Lijuan Wang. Improving fuzzy c-means clustering based on feature-weight learning. Pattern recognition letters, 25(10):1123–1132, 2004.
- Weston et al.  Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature selection for SVMs. In Advances in neural information processing systems, pages 668–674, 2001.
- Witten et al.  Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
- Yu and Liu  Lei Yu and Huan Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 856–863, 2003.