Sparse Centroid-Encoder: A Nonlinear Model for Feature Selection

by Tomojit Ghosh, et al.
Colorado State University

We develop a sparse optimization problem for the determination of the total set of features that discriminate two or more classes. This is a sparse implementation of the centroid-encoder for nonlinear data reduction and visualization, called Sparse Centroid-Encoder (SCE). We also provide a feature selection framework that first ranks each feature by its occurrence and then chooses the optimal number of features using a validation set. The algorithm is applied to a wide variety of data sets, including single-cell biological data, high-dimensional infectious disease data, hyperspectral data, image data, and speech data. We compared our method to various state-of-the-art feature selection techniques, including two neural network-based models (DFS and LassoNet), Sparse SVM, and Random Forest. We show empirically that SCE features produce better classification accuracy on unseen test data, often with fewer features.



1 Introduction

Technological advancement has made high-dimensional data readily available. For example, in bioinformatics, researchers seek to understand gene expression levels with microarray or next-generation sequencing techniques, where each point consists of over 50,000 measurements [Pease5022, shalon1996dna, metzker2010sequencing, reuter2015high]. The abundance of features demands the development of feature selection algorithms to improve a machine learning task, e.g., classification. Another important aspect of feature selection is knowledge discovery from data: which biomarkers are important to characterize a biological process, e.g., the immune response to infection by respiratory viruses such as influenza [o2013iterative]? Additional benefits of feature selection include improved visualization and understanding of data, reduced storage requirements, and faster algorithm training times.

Feature selection can be accomplished in various ways that can be broadly categorized into filter, wrapper, and embedded methods. In a filter method, each variable is ordered based on a score, after which a threshold is used to select the relevant features [lazar2012survey]. Variables are usually ranked using correlation [guyon2003introduction, yu2003feature] or mutual information [vergara2014review, fleuret2004fast]. In contrast, a wrapper method uses a model and determines the importance of a feature or a group of features by the generalization performance of the predetermined model [el2016review, hsu2002annigma]. Since evaluating every possible combination of features is an NP-hard problem, heuristics are used to find a subset of features. Wrapper methods are computationally intensive for larger data sets, in which case search techniques like Genetic Algorithms (GA) or Particle Swarm Optimization (PSO) [kennedy1995particle] are used. In embedded methods, feature selection criteria are incorporated within the model, i.e., the variables are picked during the training process [lal2006embedded]. Iterative Feature Removal (IFR) uses the absolute weights of a Sparse SVM model as a criterion to extract features from high-dimensional biological data sets [o2013iterative].

This paper proposes a new embedded variable selection approach called Sparse Centroid-Encoder (SCE) to extract features when class labels are available. Our method extends the Centroid-Encoder model [GHOSH201826, GhKi2020] by applying an $\ell_1$ penalty to a sparsity-promoting layer between the input and the first hidden layer. We evaluate Sparse Centroid-Encoder on diverse data sets and show that the selected features produce better generalization than other state-of-the-art techniques. Our results show that SCE picks fewer features to obtain high classification accuracy. As a feature selection tool, SCE uses a single model for the multi-class problem, without the need to create multiple one-against-one binary models typical of linear methods, e.g., Lasso [tibshirani1996regression] or Sparse SVM [chepushtanova2014band]. Although SCE can be used on both binary and multi-class problems, we focus on multi-class feature selection in this paper. The work of [li2016deep] also uses a similar sparse layer between the input and the first hidden layer, with an elastic net penalty, while minimizing the classification error with a softmax layer. The authors used Theano's symbolic differentiation [bergstra2010theano] to impose sparsity. In contrast, our approach minimizes the centroid-encoder loss with an explicit differentiation of the $\ell_1$ function using the sub-gradient. Unlike DFS, our model can capture intra-class variability by using multiple centroids per class. This property is beneficial for multi-modal data sets.

The article is organized as follows: In Section 2 we present the Sparse Centroid-Encoder algorithm. In Section 3 we apply SCE to a range of benchmarking data sets taken from the literature. In Section 4 we review related work on both linear and non-linear feature selection. In Section 5 we present our discussion and conclusions.

2 Sparse Centroid-Encoder

Centroid-encoder (CE) neural networks are the starting point of our approach [GhKi2020, GHOSH201826, aminian2021early]. We present a brief overview of CEs and demonstrate how they can be extended to perform non-linear feature selection.

2.1 Centroid-encoder

The CE neural network is a variation of an autoencoder and can be used for both visualization and classification tasks. Consider a data set with $N$ samples and $M$ classes. The classes are denoted $C_j$, $j = 1, \dots, M$, where the indices of the data associated with class $j$ are denoted $I_j$. We define the centroid of each class as $c_j = \frac{1}{|I_j|} \sum_{i \in I_j} x^i$, where $|I_j|$ is the cardinality of class $j$. Unlike the autoencoder, which maps each point to itself, the CE maps each point $x^i$ to its class centroid $c_j$ by minimizing the following cost function over the parameter set $\theta$:

$$\mathcal{L}_{ce}(\theta) = \frac{1}{2N} \sum_{j=1}^{M} \sum_{i \in I_j} \left\| c_j - f(x^i; \theta) \right\|_2^2 \tag{1}$$
The mapping $f$ is composed of a dimension-reducing mapping (encoder) followed by a dimension-increasing reconstruction mapping (decoder). The output of the encoder is used as a supervised visualization tool [GhKi2020, GHOSH201826], and attaching another layer to map to the one-hot encoded labels performs robust classification.
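As a concrete illustration, the CE cost of Equation 1 can be sketched in a few lines of NumPy. The function name and the identity map standing in for a trained network $f$ are our own illustrative choices, not part of the original model:

```python
import numpy as np

def centroid_encoder_loss(X, y, f):
    """Squared error between the network output f(x) and the centroid of
    each sample's class, scaled by 1/(2N) as in Equation 1."""
    loss = 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)   # class centroid c_j
        out = f(X[idx])                  # network output f(x; theta)
        loss += np.sum((out - centroid) ** 2)
    return loss / (2 * len(X))

# Tiny example: an untrained "network" (identity map) on 2-D toy data.
X = np.array([[0., 0.], [2., 0.], [4., 4.], [6., 4.]])
y = np.array([0, 0, 1, 1])
print(centroid_encoder_loss(X, y, lambda x: x))  # -> 0.5
```

A trained CE would drive this loss down by pulling every sample of a class toward that class's centroid.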


2.2 Sparse Centroid-encoder for feature selection

The Sparse Centroid-encoder (SCE) is a modification of the centroid-encoder architecture, as shown in Figure 1. Unlike the centroid-encoder, we do not use a bottleneck architecture, as visualization is not our aim here. The input layer is connected to the first hidden layer via the sparsity-promoting layer (SPL). Each node of the input layer has a weighted one-to-one connection to the corresponding node of the SPL; the number of nodes in these two layers is the same. The nodes in the SPL have no bias or non-linearity. The SPL is fully connected to the first hidden layer, so the weighted input from the SPL is passed to the hidden layer in the same way as in a standard feed-forward network. During training, an $\ell_1$ penalty is applied to the weights connecting the input layer and the SPL. The sparsity-promoting penalty drives most of these weights to near zero, and the corresponding input nodes/features can be discarded. The purpose of the SPL is therefore to select important features from the original input. Note that we only apply the penalty to the parameters of the SPL.

Figure 1: The architecture of Centroid-encoder and Sparse Centroid-encoder. Notice the Centroid-encoder uses a bottleneck architecture which is helpful for visualization. In contrast, the Sparse Centroid-encoder doesn’t use any bottleneck architecture; instead, it employs a sparse layer between the input and the first hidden layer to promote feature sparsity.

Denote $\theta_{spl}$ to be the parameters (weights) of the SPL and $\theta$ to be the parameters of the rest of the network. The cost function of sparse centroid-encoder is given by

$$\mathcal{L}_{sce}(\theta) = \frac{1}{2N} \sum_{j=1}^{M} \sum_{i \in I_j} \left\| c_j - f(x^i; \theta) \right\|_2^2 + \lambda \left\| \theta_{spl} \right\|_1 \tag{2}$$

where $\lambda$ is the hyper-parameter that controls the sparsity. A larger value of $\lambda$ promotes higher sparsity, resulting in more near-zero weights in the SPL. In other words, $\lambda$ is a knob that controls the number of features selected from the input data.
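The cost in Equation 2 can be sketched the same way: the SPL amounts to one multiplicative weight per input feature, and the $\ell_1$ penalty acts on those weights only. Function names and toy data below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sce_loss(X, y, w_spl, f, lam):
    """Sparse centroid-encoder cost (Equation 2): the SPL scales each input
    feature by its own weight before the rest of the network f, and the
    l1 penalty is applied to the SPL weights only."""
    Xs = X * w_spl                      # sparsity-promoting layer (elementwise)
    loss = 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)  # target is still the class centroid
        loss += np.sum((f(Xs[idx]) - centroid) ** 2)
    return loss / (2 * len(X)) + lam * np.abs(w_spl).sum()

# With all SPL weights at 1 and an identity "network", this is the CE cost
# plus lam times the number of features.
X = np.array([[0., 0.], [2., 0.], [4., 4.], [6., 4.]])
y = np.array([0, 0, 1, 1])
print(sce_loss(X, y, np.ones(2), lambda x: x, lam=0.1))  # -> 0.7
```

Driving an SPL weight to zero removes the corresponding feature from everything downstream, which is exactly how the penalty performs selection.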

Like the centroid-encoder, we trained the sparse centroid-encoder using error backpropagation, which requires the gradient of the cost function of Equation 2. As the $\ell_1$ term $\|\theta_{spl}\|_1$ is not differentiable at $0$, we implement this term using the sub-gradient.
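A minimal sketch of the sub-gradient update on the SPL weights (the plain gradient step and function names are our own simplification; the paper trains with Scaled Conjugate Gradient, as noted in Section 2.4):

```python
import numpy as np

def l1_subgradient(w):
    """Sub-gradient of |w|: sign(w) away from zero, with 0 chosen at w = 0."""
    return np.sign(w)

def spl_gradient_step(w_spl, grad_ce, lam, lr):
    """One descent step on the SPL weights: the backpropagated CE gradient
    plus lam times the l1 sub-gradient."""
    return w_spl - lr * (grad_ce + lam * l1_subgradient(w_spl))

# With zero CE gradient, the penalty alone shrinks weights toward zero
# and leaves exact zeros untouched.
w = np.array([0.5, -0.2, 0.0])
print(spl_gradient_step(w, np.zeros(3), lam=1.0, lr=0.1))  # -> [0.4 -0.1 0.0]
```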

2.3 Empirical Analysis of SCE

In this section we present an empirical analysis of our model. The results of feature selection for the digits 5 and 6 from the MNIST set are displayed in Figure 2. In panel (a), we compare the two terms that contribute to Equation 2, i.e., the centroid-encoder and $\ell_1$ costs, weighted with different values of $\lambda$. As expected, we observe that the CE cost monotonically decreases as $\lambda$ decreases, while the $\ell_1$ cost increases as $\lambda$ decreases. For larger values of $\lambda$, the model focuses more on minimizing the $\ell_1$-norm of the sparse layer, which results in smaller $\ell_1$ values. In contrast, the model pays more attention to minimizing the CE cost for small $\lambda$; hence we notice a smaller CE cost and a higher $\ell_1$ cost.

Figure 2: Analysis of Sparse Centroid-encoder. (a) Change of the two costs over $\lambda$. (b) Change of validation accuracy over $\lambda$. (c) Sparsity plot of the SPL weights for one value of $\lambda$. (d) Same as (c) for a different $\lambda$.

Panel (b) of Figure 2 shows the accuracy on a validation set as a function of nine different values of $\lambda$; the validation accuracy reached its peak at one of these values. In panels (c) and (d), we plotted the magnitudes of the feature weights of the sparse layer in descending order. The sharp decrease in the magnitude of the weights demonstrates the promotion of sparsity by SCE. The model effectively ignores features by setting their weights to approximately zero. Notice that the model produced a sparser solution for the larger $\lambda$, selecting only 32 features, compared to 122 chosen variables for the smaller $\lambda$. Figure 3 shows the positions of the selected features, i.e., pixels, on the digits 5 and 6. The intensity of the color represents the feature's importance: dark blue signifies a higher absolute weight, whereas light blue means a smaller absolute weight.

Figure 3: Demonstration of the sparsity of Sparse Centroid-encoder on MNIST digits 5 and 6. The digits are shown in white, and the selected pixels are marked in blue; the darkness of the blue indicates the relative importance of the pixel in distinguishing the two digits. We show the selected pixels for two choices of $\lambda$. Notice that for the larger $\lambda$, the model chose fewer features, whereas it picked more pixels for the smaller $\lambda$. $\lambda$ is the knob that controls the sparsity of the model.

Our last analysis shows how SCE extracts informative features from a multi-modal data set, i.e., a data set whose classes appear to have multiple clusters. In this case, one center per class may not be optimal, e.g., for the ISOLET data. To this end, we trained SCE using different numbers of centers per class, where the centers were determined using the standard k-Means algorithm [lloyd1982least, macqueen1967some]. After the feature selection, we calculated the validation accuracy and plotted it against the number of centers per class in Figure 4. The validation accuracy jumped significantly from one center to two centers per class. The increased accuracy indicates that the speech classes are multi-modal, which is further validated by the two-dimensional PCA plots of three of the classes shown in panels (b)-(d).
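The per-class centroid computation described above can be sketched with a bare-bones Lloyd's algorithm (function names and the tiny toy data are illustrative assumptions):

```python
import numpy as np

def class_centroids(X, y, k, iters=20, seed=0):
    """Per-class k-Means: returns a dict mapping each class label to its k
    centroids, so a multi-modal class gets one target per cluster instead
    of a single class mean."""
    rng = np.random.default_rng(seed)
    centroids = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Initialize with k distinct samples of this class.
        centers = Xc[rng.choice(len(Xc), size=k, replace=False)]
        for _ in range(iters):
            # Assign each sample to its nearest center, then recompute means.
            d = np.linalg.norm(Xc[:, None] - centers[None], axis=2)
            assign = d.argmin(axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = Xc[assign == j].mean(axis=0)
        centroids[c] = centers
    return centroids

# One class with two well-separated clusters: k=2 recovers both modes.
X = np.array([[0., 0.], [0.5, 0.], [10., 0.], [10.5, 0.]])
y = np.array([0, 0, 0, 0])
print(class_centroids(X, y, k=2)[0])
```

With these per-cluster centroids substituted for $c_j$ in Equation 2, each sample is pulled toward the center of its own cluster rather than the overall class mean.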

Figure 4: Sparse Centroid-encoder for a multi-modal data set. Panel (a) shows the increase in validation accuracy with the number of centroids per class. Panels (b)-(d) show two-dimensional PCA plots of three speech classes.

2.4 Feature Selection Workflow Using Sparse Centroid-Encoder

By design, sparse methods identify a small number of features that accomplish a classification task. If one is interested in all the discriminatory features that can be used to separate multiple classes, then one can repeat the process of removing good features. This section describes how sparse centroid-encoder (SCE) can be used iteratively to extract all discriminatory features from a data set; see [o2013iterative] for an application of this approach to sparse support vector machines.

SCE is a model based on a neural network architecture; hence it involves a non-convex optimization. As a result, multiple runs will produce different solutions, i.e., different feature sets on the same training set. These features may not be optimal for an unseen test set. To find robust features from a training set, we resort to frequency-based feature pruning. In this strategy, we first divide the entire training set into $n$ folds. On each of these folds, we run SCE and pick the top $k$ (user-selected) features. We repeat the process $n$ times to get $n$ feature sets. We then count the number of occurrences of each feature and call this number the frequency of the feature. We order the features by frequency and pick the optimal number using a validation set. We present the feature selection workflow in Figure 5. We trained SCE using Scaled Conjugate Gradient Descent [moller1993scaled]. The architectural details are given in the Appendices.
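The frequency-based ranking step can be sketched directly (the function name is an illustrative assumption):

```python
from collections import Counter

def rank_by_frequency(feature_sets):
    """Given the top-k feature lists from n runs of SCE, rank features by
    how many runs selected them (ties broken by feature index)."""
    freq = Counter(f for s in feature_sets for f in s)
    return sorted(freq, key=lambda f: (-freq[f], f))

# Three runs on three folds: feature 3 appears in every run, 7 and 12 in two.
runs = [[3, 7, 12], [3, 12, 40], [3, 7, 55]]
print(rank_by_frequency(runs))  # -> [3, 7, 12, 40, 55]
```

A validation set is then used to decide how far down this ranked list to cut.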

Figure 5: Feature selection workflow using Sparse Centroid-encoder. (a) First, the data set is partitioned into training and validation sets. (b) We further partition the training set into n splits. (c) On each of the training splits, we run Sparse Centroid-encoder to get n feature sets. (d) We calculate the occurrence of each feature among the n sets and call it the frequency of the feature. We rank features from high to low frequency to get an ordered set. (e) Finally, we pick the optimal number of features using a validation set.

3 Experimental Results

Here we present a range of comparative benchmarking experiments on nine data sets from a variety of domains against several feature selection models.

3.1 Experimental Details

We ran benchmarking experiments to compare the sparse centroid-encoder with other state-of-the-art feature selection methods. To make the evaluation objective, we compared the classification accuracy on unseen data using the selected features of the different models. All experiments share the following workflow:

  • SCE is used to select an optimal number of features on the training samples. We chose the penalty $\lambda$ from a set of values using a validation set.

  • Build classification models with these features on the training set. We used centroid-encoder and one-hidden-layer artificial neural networks as the classification models [aminian2021early].

  • Compute the accuracy on the sequestered test set using the trained models and report the mean accuracy with standard deviation.

3.2 Quantitative and Qualitative Analysis

Now we present the results from a comprehensive analysis across nine data sets.

3.2.1 Comparison with LassoNet on UCI data sets

In our first set of experiments, we compared SCE with LassoNet [lemhadri2021lassonet] on six publicly available data sets taken from different domains, including image (MNIST, Fashion MNIST, COIL-20), speech (ISOLET), activity recognition (Human Activity Recognition Using Smartphones), and biological (Mice Protein) data. These data sets have been used in the literature for benchmarking [lemhadri2021lassonet, balin2019concrete]. Following the experimental protocol of Lemhadri et al., we randomly partitioned each data set into a 70:10:20 split of training, validation, and test sets. We normalized the training partition by subtracting the mean and dividing each feature by its corresponding standard deviation, and used the mean and standard deviation of the training set to standardize the test samples. We used the training set for feature selection and the validation set for hyperparameter tuning. After the feature selection, we used a one-hidden-layer neural network classifier to predict the class labels on the test set. We ran the classifier ten times and present the mean accuracy with standard deviation in Table 1. Apart from showing the accuracy with the top 50 features, we also give classification results with the optimal number of features selected using the validation set.
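The standardization step of this protocol, sketched with illustrative names (statistics come from the training partition only, and the same mean and standard deviation are reused for the test split):

```python
import numpy as np

def standardize(train, test):
    """Z-score both partitions using the training set's mean and standard
    deviation, as in the protocol above."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    return (train - mu) / sd, (test - mu) / sd

train = np.array([[0., 0.], [2., 2.]])
test = np.array([[1., 3.]])
train_z, test_z = standardize(train, test)
print(test_z)  # -> [[0. 2.]]
```

Reusing the training statistics keeps the test set sequestered: no test information leaks into the preprocessing.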

As we can see from Table 1, the features of the Sparse Centroid-encoder produce better classification accuracy than LassoNet in most cases, especially on Mice Protein, Activity, and MNIST, where our model achieved noticeably better accuracy. On COIL-20, Lemhadri et al. preprocessed the data by resizing the images to 20 x 20; we ran our experiment on the original images. Apart from comparing with LassoNet, we also present accuracies using all features and the optimal number of features. In most cases, the optimal number of features gave a reasonably good performance compared to all features. It is noteworthy that MNIST and COIL-20 classification using the optimal number of variables is slightly better than using all of them, while requiring only a small fraction of the total features. The superior performance of our model on a wide range of data sets suggests that Sparse Centroid-encoder may be applied as a feature detector in many domains.

Data set Top 50 features Optimum no. of features Centers / Class All features
LassoNet SCE No. of features SCE for SCE
Mice Protein 22 1
MNIST 377 1
FMNIST 355 3
ISOLET 245 5
COIL-20 99.1 90 1
Activity 90 4
Table 1: Classification results using LassoNet and SCE features on six publicly available data sets. We compared LassoNet with SCE using the top 50 variables, and we also reported accuracy using the optimum number and all features for each data set. Numbers for LassoNet are reported from [lemhadri2021lassonet]. All the reported accuracies were measured on the test set.

3.2.2 Single Cell Data

GM12878 is a single-cell data set that has previously been used to test multiclass feature selection algorithms [li2016deep]. The samples were collected from the annotated DNA regions of a lymphoblastoid cell line. Each sample is represented by a 93-dimensional feature vector sampled from three classes: active enhancer (AE), active promoter (AP), and background (BG), where each class contains an equal number of samples. The data set is split equally into separate training, validation, and test sets. We used the validation set to tune hyper-parameters and to pick the optimal number of features. After the feature selection step, we merged the training and validation sets, trained centroid-encoder classifiers with the selected features, and report classification accuracy on the test set.

We use the published results for deep feature selection (DFS), shallow feature selection, LASSO, and Random Forest (RF) from the work of Li et al. to evaluate SCE, as shown in Table 2. To compare with Li et al., we used the top 16 features to report the mean accuracy on the test samples. We also report the test accuracy using the top 48 features picked from the validation set; this was the best result in our experiment. When restricted to the top 16, we see that the SCE features still outperform all the other models.

Models No. of Features Accuracy
SCE 48
SCE 16
Deep DFS
Shallow DFS
Random Forest
Table 2: Classification accuracies using the top 16 features selected by various techniques. Results of Deep DFS, Shallow DFS, LASSO, and Random Forest are reported from [li2016deep]. We also present accuracy with the top 48 features, which were selected using a validation set.

Among all the models, the LASSO features exhibit the worst performance. This relatively low accuracy is not surprising, given that LASSO is a linear model.

The classification performance gives a quantitative measure that doesn't reveal the biological significance of the selected genes. We surveyed the literature on the top genes selected by sparse centroid-encoder and provide detailed descriptions in the appendix. Some of these genes play an essential role in transcriptional activation, e.g., H4K20ME1 [barski2007high], TAF1 [wang2014crystal], and H3K27ME3 [cai2021h3k27me3]. The gene H3K27AC [creyghton2010histone] plays a vital role in separating active enhancers from inactive ones. In addition, many of these genes are related to the proliferation of lymphoblastoid cancer cells, e.g., POL2 [yamada2013nuclear], NRSF/REST [kreisler2010regulation], GCN5 [yin2015histone], and PML [salomoni2002role]. This survey confirms the biological significance of the selected genes.

3.2.3 Indian Pines Hyperspectral Imagery

Here we compare SCE with sparse support vector machines (SSVM) on the well-known Indian Pines hyperspectral imagery of a variety of crops, following the experiments in [chepushtanova2014band]. The task is to identify the frequency bands that are essential to correctly assign a test sample to one of the sixteen classes. We included all 220 bands in the feature selection step, including the twenty water absorption bands. Note that in the literature these noisy water absorption bands are often excluded before the experiments [reshma2016dimensionality, cao2017hyperspectral]. We wanted to check whether our model was able to reject them, and in fact it did: no water absorption bands appeared among the top features. In contrast, SSVM included some of these noisy bands, as described in [chepushtanova2014band].

We followed the experimental protocol in [chepushtanova2014band]. The entire data set is split in half into training and test sets. Because of the small size of the training set, we did 5-fold cross-validation on the training samples to tune hyper-parameters. After feature selection on the training set, we took the top $k$ features to build a CE classifier on the training set and predict the class labels of the test samples, repeating the classification task multiple times for each $k$. Note that we compared the performance of SCE and SSVM features without spatial smoothing, for a more direct comparison of the classification rates. Figure 6 presents the accuracy on the test data using the top bands selected on the training set. Classification using SCE features generally produces better accuracy, and SCE features yield better classification performance using fewer bands. In particular, the accuracy of the top SCE feature (band 13) is substantially higher than that of the top SSVM feature (band 1). See the Appendix for additional details on the selected feature sets.

Figure 6: Comparison of classification accuracy using SCE and SSVM features.

3.2.4 Respiratory Infections in Humans

GSE73072. This microarray data set is a collection of gene expression measurements taken from human blood samples as part of multiple clinical challenge studies [liu2016individualized] in which individuals were infected with the following respiratory viruses: HRV, RSV, H1N1, and H3N2. In our experiment we excluded the RSV study. Blood samples were taken from the individuals before and after inoculation. RMA normalization [irizarry2003exploration] was applied to the entire data set, and LIMMA [ritchie2015limma] was used to remove the subject-specific batch effect. Each sample is represented by 22,277 probes associated with gene expression. The data are publicly available on the NCBI Gene Expression Omnibus (GEO) under identifier GSE73072.

We conducted our last experiment on the GSE73072 human respiratory infection data, where the goal is to predict the classes control, shedder, and non-shedder at the very early phase of infection, i.e., in the time bin spanning hours 1-8. Controls are the pre-infection samples, whereas shedders and non-shedders are post-infection samples picked from the 1-8 hr time bin. Shedders actually disseminate virus, while non-shedders do not. We considered six studies, including two H1N1 (DEE3, DEE4), two H3N2 (DEE2, DEE5), and two HRV (Duke, UVA) studies. We used training samples as a validation set; the training set comprised all the studies except DEE5, which was kept out for testing. We did a leave-one-subject-out (LOSO) cross-validation on the test set using the selected features from the training set. In this experiment we compared SCE with Random Forest (RF).

The results on this data set are shown in Table 3. The top 35 features of SCE produce the best Balanced Success Rate (BSR) on the test study DEE5. For the Random Forest model, the best result is achieved with 30 features; we also include the results with 35 biomarkers, but the BSR didn't improve. Note that both models picked a relatively small number of features, 30 and 35 out of the 22,277 genes, but the SCE features clearly outperform RF. Although RF selects features for multiple classes using a single model, it weighs each individual feature by measuring the decrease in out-of-bag error. In contrast, SCE looks for a group of features while minimizing its cost. We believe this multivariate approach makes SCE a better feature detector than RF.

Time Bin Model No. of Features BSR
RF 30
RF 35
Table 3: Balanced success rate (BSR) of LOSO cross-validation on the DEE5 test set. The features selected from the training set are used to predict the classes control, shedder, and non-shedder.

4 Related Work

Feature selection has a long history spread across many fields, including bioinformatics, document classification, data mining, hyperspectral band selection, and computer vision. It is an active research area, and numerous techniques exist to accomplish the task. Here we describe the literature related to embedded methods, where the selection criteria are part of a model; the model can be either linear or non-linear.

4.1 Feature Selection using Linear Models

Adding an $\ell_1$ penalty to classification and regression methods naturally produces feature selectors. For example, the least absolute shrinkage and selection operator, or Lasso [tibshirani1996regression], has been used extensively for feature selection on various data sets [fonti2017feature, muthukrishnan2016lasso, kim2004gradient]. Elastic net, proposed by Zou et al. [zou2005regularization], combines the Lasso penalty with the Ridge Regression penalty [hoerl1970ridge] to overcome some limitations of Lasso, and has been widely applied, e.g., [marafino2015efficient, shen2011identifying, sokolov2016pathway]. Note that both Lasso and Elastic net are convex in the parameter space. The Support Vector Machine (SVM) [cortes1995support] is a state-of-the-art model for classification, regression, and feature selection. SVM-RFE is a linear feature selection model that iteratively removes the least discriminative features until a parsimonious set of predictive features is selected [guyon2002gene]. IFR [o2013iterative], on the other hand, selects a group of discriminatory features at each iteration and eliminates them from the data set; the process repeats until the accuracy of the model starts to drop significantly. Note that IFR uses a Sparse SVM (SSVM), which minimizes the $\ell_1$ norm of the model parameters. Lasso, Elastic net, and SVM-based techniques are mainly applied to binary problems; these models are extended to the multi-class setting by combining multiple binary one-against-one (OAO) or one-against-all (OAA) models. [chepushtanova2014band] used 120 Sparse SVM models to select discriminative bands from the Indian Pines data set, which has 16 classes. Random Forest [breiman2001random], a decision-tree-based technique, instead finds features from multi-class data using a single model. It does not use a Lasso or Elastic net penalty for feature selection; rather, it weighs the importance of each feature by measuring the out-of-bag error.

4.2 Feature Selection using Deep Neural Networks

While the linear models are fast and convex, they don’t capture the non-linear relationship among the input features (unless a kernel trick is applied). Because of the shallow architecture, these models don’t learn a high-level representation of input features. Moreover, there is no natural way to incorporate multi-class data in a single model. Non-linear models based on deep neural networks overcome these limitations. In this section, we will briefly discuss a handful of such models.

[scardapane2017group] used group Lasso [tibshirani1996regression] to impose sparsity on a group of variables instead of a single variable, applying group sparsity simultaneously to the input and hidden layers to remove features from the input data and the hidden activations. Li et al. proposed deep feature selection (DFS), a multilayer neural-network-based feature selection technique [li2016deep]. DFS uses a one-to-one linear layer between the input and the first hidden layer; as a sparse regularization, the authors used the elastic net penalty [zou2005regularization] on the variables of this one-to-one layer to induce sparsity. The standard softmax function is used in the output layer for classification, and with this setup the network is trained end-to-end by error backpropagation. Despite the deep architecture, its accuracy is not competitive, and experimental results have shown that the method did not outperform the random forest (RF) method. [kim2016opening] proposed a heuristics-based technique to assign importance to each feature. Using the ReLU activation, [roy2015feature] provided a way to measure the contribution of an input feature toward the hidden activations of the next layer. [han2018autoencoder] developed an unsupervised feature selection technique based on the autoencoder architecture: applying a norm penalty to the weights emanating from each input node, they measure the contribution of each feature in reconstructing the input, and the model discards the input features that make minimal contributions. [taherkhani2018deep] proposed an RBM-based [Hinton:2006:FLA:1161603.1161605, hinton2006reducing] feature selection model; this algorithm runs the risk of combinatorial explosion for data sets with many features (e.g., microarray gene expression data sets).

5 Discussion and Conclusion

In this paper, we presented Sparse Centroid-Encoder as an effective feature selection tool for multiclass classification problems. The benchmarking results span nine diverse data sets and six methods, providing evidence that the features of SCE produce better generalization performance than other models. We compared SCE with deep feature selection (deep DFS), a deep neural-network-based model, and found that the features of SCE significantly improved the classification rate on test data. The survey of the known functionality of the selected genes indicates plausible biological significance. In addition to extracting the most robust features, the model shows the ability to discard irrelevant features: on the Indian Pines hyperspectral image data set, our model didn't pick any of the water absorption bands, which are considered to be noisy. We also compared our model to multi-class extensions of sparse binary classifiers. Chepushtanova et al. used 120 binary-class SSVM models on the Indian Pines data set; similarly, Lasso needed three models for the GM12878 data. These models will suffer a combinatorial explosion as the number of classes increases. In contrast, SCE uses a single model to extract features from a multi-class data set.

SCE compares favorably to the neural-network-based models deep DFS and LassoNet, in which samples are mapped to class labels. SCE employs multiple centroids to capture the variability within a class, improving the prediction rate on unknown test samples. In particular, the prediction rate on ISOLET improved significantly from one centroid to multiple centroids, suggesting that the speech classes are multi-modal; the two-dimensional PCA of the ISOLET classes further confirms this. We also observed an enhanced classification rate on the FMNIST and Activity data with multiple centroids, whereas a single center per class performed better for other data sets (e.g., COIL-20, Mice Protein, GM12878). Hence, apart from producing an improved prediction rate using features that capture intra-class variance, our model can provide extra information about whether the data are unimodal or multi-modal. This aspect of sparse centroid-encoder distinguishes it from LassoNet and DFS, which do not model the multi-modal nature of the data.

We also presented a feature selection workflow to determine the optimal number of robust features. Our experimental results showed that the prediction rate using the optimal number of features produces the best results, which are often very competitive with using all features. For example, on the Mice Protein data, we obtained high accuracy using only 22 variables. On MNIST and COIL-20, our approach used only a small fraction of the total feature set and actually outperformed the prediction rate of all features. On the GSE73072 data set, we obtained strong accuracy on the holdout test set using only 35 out of 22,277 features.