A Compressive Classification Framework for High-Dimensional Data

by Muhammad Naveed Tabassum, et al.

We propose a compressive classification framework for settings where the data dimensionality is significantly higher than the sample size. The proposed method, referred to as compressive regularized discriminant analysis (CRDA), is based on linear discriminant analysis and has the ability to select significant features by using joint-sparsity promoting hard thresholding in the discriminant rule. Since the number of features is larger than the sample size, the method also uses state-of-the-art regularized sample covariance matrix estimators. Several analysis examples on real data sets, including image, speech signal, and gene expression data, illustrate the promising improvements offered by the proposed CRDA classifier in practice. Overall, the proposed method gives fewer misclassification errors than its competitors, while at the same time achieving accurate feature selection. The open-source R package and MATLAB toolbox of the proposed method (named compressiveRDA) are freely available.






I Introduction

High-dimensional (HD) classification is at the core of numerous contemporary statistical studies. An increasingly common occurrence is the collection of large amounts of information on each individual sample point, even though the number of sample points themselves may remain relatively small. Typical examples are gene expression and protein mass spectrometry data, and other areas of computational biology. Regularization and shrinkage are commonly used tools in many applications, such as regression or classification, to overcome the significant statistical challenges posed by high-dimension, low-sample-size (HDLSS) data settings, in which the number of features, p, is often several orders of magnitude larger than the sample size, n (i.e., p ≫ n). In this paper, we propose a novel classification framework for such HD data.

In many applications, the ideal classifier of an HD input vector should have three characteristics: (i) the proportion of correctly classified observations is high, (ii) the method is computationally feasible, allowing the training of the classifier in real-time with a personal computer, and (iii) the classifier selects only the significant features, thus yielding interpretable models. The last property is important, for example, in gene expression studies, where one needs to identify which genes influence the trait (e.g., disease). This is obviously important for the biological understanding of the disease process, but it can also help in the development of clinical tests for early diagnosis [1]. We propose an HD classification method that takes into account all three of the aforementioned key aspects.

Some of the standard off-the-shelf classification methods either lack feature elimination or require that it is done beforehand. For example, the support vector machine (SVM) classifier uses all features. Some other classifiers have a higher computational cost; e.g., the Logit-ASPLS approach (R-package: plsgenomics) depends on three hyperparameters. This increases the computational complexity of the method, especially when finer grids of hyperparameters are used and tuned using cross-validation (CV). In this paper, we propose a classification framework, referred to as the compressive regularized discriminant analysis (CRDA) method, which is computationally efficient and performs feature selection during the learning process. Our simulation studies and real data analysis examples illustrate that it often achieves superior classification performance compared to off-the-shelf classification approaches in HDLSS settings.

Many classification techniques assign a p-dimensional observation x to one of the G classes (groups or populations) based on the following rule:

ĝ = arg max_{g = 1, …, G} d_g(x),   (1)

where d_g(x) is called the discriminant function for population g. In linear discriminant analysis (LDA), d_g(x) is a linear function of x, d_g(x) = x⊤b_g + c_g, for some constant c_g and vector b_g, g = 1, …, G. The vector b_g depends on the unknown covariance matrix Σ of the populations (via its inverse matrix), which is commonly estimated by the pooled sample covariance matrix (SCM). In the HD setting, the SCM is no longer invertible, and therefore a regularized SCM (RSCM) is used for constructing an estimated discriminant function d̂_g(x). Such approaches are commonly referred to as regularized LDA methods, which we refer to simply as RDA. See e.g., [3, 1, 4, 5, 6].

Next, note that if the j-th entry of b_g is zero, then the j-th feature does not contribute to the classification of the g-th population. To eliminate unnecessary features, one can shrink b_g using element-wise soft-thresholding as in the shrunken centroids RDA (SCRDA) method [3]. Such methods are often difficult to tune because the shrinkage threshold parameter is the same across all groups, but different populations would often benefit from different shrinkage intensities. Consequently, they tend to yield higher false-positive rates (FPR).

Element-wise shrinkage does not achieve simultaneous feature elimination, as a feature eliminated from one group may still affect the discriminant function of another group. The proposed CRDA method instead promotes joint-sparsity via ℓ_q-norm based hard-thresholding. Another advantage is having a shrinkage parameter that is much easier to tune and interpret: namely, the joint-sparsity level K instead of a shrinkage threshold as in SCRDA. CRDA also uses two recent state-of-the-art regularized covariance matrix estimators proposed in [7] and [8, 9], respectively. The first estimator, called the Ell-RSCM estimator [7], is designed to attain the minimum mean squared error and is fast to compute due to automatic data-adaptive tuning of the involved regularization parameter. The second estimator is a penalized SCM (PSCM) proposed by [8], which we refer to as Rie-PSCM, that minimizes a penalized negative Gaussian likelihood function with a Riemannian distance penalty. Although a closed-form solution of the optimization problem is not available, the estimator can be computed using a fast iterative Newton-Raphson algorithm developed in [9].

We test the effectiveness of the proposed CRDA methods using partially synthetic genomic data as well as nine real data sets, and compare the obtained results against seven state-of-the-art HD classifiers that also perform feature selection. The results illustrate that, overall, the proposed CRDA method gives fewer misclassification errors than its competitors while achieving, at the same time, accurate feature selection with minimal computational effort. Preliminary results of the CRDA approach have appeared in [10].

The paper is structured as follows. Section II describes the regularized LDA procedure and the regularized covariance matrix estimators used. In Section III, CRDA is introduced as a compressive version of regularized LDA performing feature selection simultaneously with classification. Classification results for three different types (image, speech, genomics) of real data sets are also presented. Section IV illustrates the promising performance of CRDA in feature selection for a partially synthetic data set, while Section V provides analysis results for six real gene expression data sets against a large number of benchmark classifiers. Section VI provides discussion and concludes the paper. MATLAB and R toolboxes are available at http://users.spa.aalto.fi/esollila/crda and https://github.com/mntabassm/compressiveRDA.

Notations: We use ‖A‖_F to denote the Frobenius (matrix) norm, defined as ‖A‖_F = √tr(A⊤A) for all matrices A. The ℓ_q-norm is defined as

‖a‖_q = (Σ_i |a_i|^q)^{1/q}

for any vector a, with ‖a‖_∞ = max_i |a_i|. The soft-thresholding (ST) operator is defined as

S_Δ(a) = sign(a)(|a| − Δ)_+,

where (t)_+ = max(t, 0) and Δ ≥ 0 is a threshold constant. Finally, I(·) denotes the indicator function, which takes value 1 when its argument is true and value 0 otherwise, and E[·] denotes expectation.

II Regularized LDA

We are given a p-variate random vector x which we need to classify into one of G groups or populations. In LDA, one assumes that the class populations are p-variate multivariate normal (MVN), sharing a common positive definite symmetric covariance matrix Σ over each class but having distinct class mean vectors μ_g, g = 1, …, G. The problem is then to classify x to one of the MVN populations N_p(μ_g, Σ), g = 1, …, G. Sometimes prior knowledge is available on the proportions of each population, and we denote by π_g the prior probabilities of the classes (Σ_g π_g = 1). LDA uses the rule (1) with discriminant function

d_g(x) = x⊤Σ⁻¹μ_g − ½ μ_g⊤Σ⁻¹μ_g + ln π_g,   (2)

so that b_g = Σ⁻¹μ_g and c_g = −½ μ_g⊤Σ⁻¹μ_g + ln π_g for g = 1, …, G.
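To make the rule concrete, the following is a minimal sketch (not the paper's toolbox implementation) of evaluating the discriminant scores d_g(x) = x⊤Σ⁻¹μ_g − ½ μ_g⊤Σ⁻¹μ_g + ln π_g and assigning x by the argmax rule (1); the toy means, identity covariance, and uniform priors are illustrative assumptions.

```python
import numpy as np

def lda_discriminant(x, means, Sigma_inv, priors):
    """Evaluate d_g(x) = x^T b_g + c_g with b_g = Sigma^{-1} mu_g and
    c_g = -0.5 * mu_g^T Sigma^{-1} mu_g + ln(pi_g) for every class g."""
    scores = []
    for mu, pi in zip(means, priors):
        b = Sigma_inv @ mu
        scores.append(x @ b - 0.5 * mu @ b + np.log(pi))
    return np.array(scores)

def classify(x, means, Sigma_inv, priors):
    """Rule (1): assign x to the class with the largest discriminant score."""
    return int(np.argmax(lda_discriminant(x, means, Sigma_inv, priors)))

# Toy two-class example: identity covariance, well-separated means.
means = [np.array([2.0, 0.0]), np.array([-2.0, 0.0])]
Sigma_inv = np.eye(2)
priors = [0.5, 0.5]
```

A point near the first mean is then assigned to class 0, and a point near the second mean to class 1.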

The LDA rule involves a set of unknown parameters: the class mean vectors μ_g and the covariance matrix Σ. These are estimated from the training data set, which consists of n observations from the G classes. Let y_i ∈ {1, …, G} denote the class label associated with the i-th observation x_i. Then n_g = #{i : y_i = g} is the number of observations belonging to the g-th population, and we denote by π̂_g = n_g/n the relative sample proportions. We assume the observations in the training data set are centered by the sample mean vectors of the classes, μ̂_g = (1/n_g) Σ_{i : y_i = g} x_i.
Since the data matrix X = (x_1 − μ̂_{y_1}, …, x_n − μ̂_{y_n}) is centered group-wise, the pooled (over groups) SCM can be written simply as

S = (1/(n − G)) X X⊤.   (3)
In practice, an observation x is classified using an estimated discriminant function,

d̂_g(x) = x⊤Σ̂⁻¹μ̂_g − ½ μ̂_g⊤Σ̂⁻¹μ̂_g + ln π̂_g,   (4)

where μ̂_g is the sample mean vector of the g-th class and Σ̂ is an estimator of Σ. Note that in (4) the prior probabilities π_g are replaced by their estimates π̂_g. Commonly, the pooled SCM in (3) is used as the estimator of the covariance matrix Σ.

Since n < p, the pooled SCM is no longer invertible. To avoid the singularity of the estimated covariance matrix, a commonly used approach (cf. [11]) is to use a regularized SCM (RSCM) that shrinks the estimator towards a target matrix. One such method is reviewed in subsection II-A. An alternative approach is to use a penalized likelihood approach, in which a penalty term is added to the likelihood function with the aim of penalizing covariance matrices that are far from the desired target matrix. One prominent estimator of this type is reviewed in subsection II-B. There are many other covariance matrix estimators for high-dimensional data, including Bayes estimators [12], structured covariance matrix estimators [13, 14], robust shrinkage estimators [15, 16, 17, 18], etc., but most of these are not applicable in the HDLSS setting due to their high computational cost.

II-A Regularized SCM estimators

A commonly used RSCM estimator (see e.g., [11, 19, 7]) is of the form:

Σ̂(β) = β S + (1 − β) η I,   (5)

where β ∈ [0, 1] denotes the shrinkage (or regularization) parameter, η = tr(S)/p, and I denotes an identity matrix of appropriate size. We use the automated method proposed in [7] for choosing the tuning parameter β, and the respective estimators, called Ell1-RSCM and Ell2-RSCM. These estimators are based on a consistent estimator of the oracle parameter β_o that minimizes the mean squared error,

β_o = arg min_β E[ ‖Σ̂(β) − Σ‖²_F ].   (6)

See Appendix A for details.

Note that SCRDA [3] uses a different estimator, Σ̂(α) = α S + (1 − α) I, where the regularization parameter α is chosen using cross-validation (CV). The benefit of the Ell1-RSCM and Ell2-RSCM estimators is that they avoid the computationally intensive CV scheme and are fast to compute. In the HD setting, another computational burden is related to inverting the p × p matrix in (5). The matrix inversion can be done efficiently using the SVD-trick detailed in Appendix B.
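As a concrete illustration of the RSCM form (5), the sketch below builds β S + (1 − β) η I with η = tr(S)/p for a fixed, hand-picked β. This is an assumption-laden toy: the data-adaptive choice of β used by the Ell-RSCM estimators (Appendix A) is not reproduced here, and the value β = 0.7 is arbitrary.

```python
import numpy as np

def rscm(S, beta):
    """Regularized SCM of form (5): beta * S + (1 - beta) * eta * I,
    with eta = trace(S) / p (the mean of the eigenvalues of S)."""
    p = S.shape[0]
    eta = np.trace(S) / p
    return beta * S + (1.0 - beta) * eta * np.eye(p)

# A singular SCM from n < p centered samples still yields an invertible RSCM.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))      # n = 5 samples, p = 20 features
Xc = X - X.mean(axis=0)               # center (illustrative single-group case)
S = Xc.T @ Xc / X.shape[0]            # rank <= 4, hence singular
Sigma_hat = rscm(S, beta=0.7)
```

Note that shrinkage towards η I preserves the trace of S, since tr(β S + (1 − β) η I) = β tr(S) + (1 − β) tr(S) = tr(S).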

II-B Penalized SCM estimator using Riemannian distance

Assume that X = (x_1, …, x_n) is a random sample from a MVN distribution. Then the maximum-likelihood estimator of Σ is found by minimizing the (scaled) negative Gaussian log-likelihood function,

L(Σ) = tr(Σ⁻¹S) + ln |Σ|,

over the set of positive definite symmetric matrices. When n > p and the SCM S is non-singular, the minimum is attained at Σ̂ = S. Since n < p and S is singular, it is common to add a penalty function to the likelihood function to enforce a certain structure on the covariance matrix and to guarantee that the combined objective function has a unique (positive definite matrix) solution. One such penalized SCM (PSCM), proposed by [8], uses the Riemannian distance of Σ from a scaled identity matrix, η I, where η > 0 is a shrinkage constant. One then solves the penalized likelihood problem:

minimize_{Σ ≻ 0}  tr(Σ⁻¹S) + ln |Σ| + λ · dist²(Σ, η I),   (7)
where λ > 0 is the penalty parameter and dist(·, ·) denotes the Riemannian distance. As in [9], we choose η as the mean of the eigenvalues (d_1, …, d_p) of the SCM, i.e., η = (d_1 + ⋯ + d_p)/p = tr(S)/p. The eigenvalues of the solution are then shrunken towards η, and the solution tends to η I in the limit as λ → ∞ [8]. The solution to (7), which we refer to as the Rie-PSCM estimator, does not have a closed form, but a relatively efficient Newton-Raphson algorithm [9] can be used to compute it. We choose the associated penalty parameter λ using the cross-validation procedure proposed in [9].

III Compressive RDA

In order to explain the proposed compressive RDA (CRDA) approach, we first write the discriminant rule in (4) in vector form as

d̂(x) = (d̂_1(x), …, d̂_G(x)) = x⊤B + c,   (8)

where c = (c_1, …, c_G) collects the constant terms and the p × G coefficient matrix B is defined as

B = Σ̂⁻¹ (μ̂_1 ⋯ μ̂_G).   (9)

The discriminant function in (8) is linear in x with coefficient matrix B. Hence, if the j-th row of the coefficient matrix is a zero vector, then the j-th feature does not contribute to the classification rule and hence can be eliminated. If the coefficient matrix is row-sparse, then the discriminant rule achieves simultaneous feature elimination.

To achieve row-sparsity, we apply a hard-thresholding (HT) operator H_K(·) to the coefficient matrix B: it retains the K rows of B that attain the largest values of an HT-selector function h(·) and sets the elements of all other rows to zero. See Figure 1 for an illustration. The compressive RDA discriminant function is then defined as

d̂(x) = x⊤B̄ + c,   (10)

where

B̄ = H_K(B; h)   (11)

is the K-rowsparse coefficient matrix. By construction, the CRDA classifier uses only K ≤ p features and thus performs feature selection simultaneously with classification. The tuning parameter K determines how many features are selected. As HT-selector function h, we use the ℓ_q-norm of a row,

h(b) = ‖b‖_q,

or the sample variance of its elements,

h(b) = (1/G) Σ_g (b_g − b̄)²,

where b̄ denotes the mean of the elements of the row vector b. The two-dimensional CV search described in subsection III-A is used to choose an appropriate joint-sparsity level K and the best HT-selector function.
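The hard-thresholding step can be sketched as follows: score every row of B with the chosen HT-selector (here the ℓ_q-norm), keep the K highest-scoring rows, and zero out the rest. This is a minimal NumPy sketch, not the compressiveRDA toolbox code; the 4×2 matrix is an assumed toy example.

```python
import numpy as np

def hard_threshold_rows(B, K, q=2):
    """H_K(B; h) with h the row-wise l_q norm: keep the K rows of B with the
    largest l_q norms, zero out all other rows (q=np.inf uses the max norm)."""
    if np.isinf(q):
        scores = np.abs(B).max(axis=1)
    else:
        scores = (np.abs(B) ** q).sum(axis=1) ** (1.0 / q)
    keep = np.argsort(scores)[-K:]        # indices of the K largest rows
    H = np.zeros_like(B)
    H[keep] = B[keep]
    return H

B = np.array([[0.1, -0.2],
              [3.0,  0.5],
              [0.0,  0.1],
              [-1.0, 2.0]])
H = hard_threshold_rows(B, K=2)           # rows 1 and 3 survive
```

Features whose rows are zeroed are eliminated from the discriminant rule (10) for all classes at once, which is exactly the joint-sparsity property.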

Fig. 1: Computing the K-rowsparse coefficient matrix B̄ in (11).

Shrunken Centroids Regularized Discriminant Analysis (SCRDA) [3] uses the estimator Σ̂(α) = α S + (1 − α) I and element-wise soft-shrinkage. SCRDA can be expressed in the form (10) with

B̄ = S_Δ(B),

where the soft-thresholding function S_Δ(·) is applied element-wise to B. A disadvantage of SCRDA is that the threshold Δ is the same across all groups, although different populations would often benefit from different shrinkage intensities. Another drawback is that an upper bound for the shrinkage threshold Δ is data dependent. Also in SCRDA, a two-dimensional CV search is used to find the tuning parameter pair (α, Δ).

III-A Cross-validation scheme

The joint-sparsity order K and the HT-selector function h are chosen using a cross-validation (CV) scheme. For K we use a logarithmically spaced grid of values, where the upper bound K_max for the largest element in the grid is defined as follows. For each q, we calculate

K_q = Σ_{j=1}^p I( ‖b_j‖_q > (1/p) Σ_{i=1}^p ‖b_i‖_q ),

where b_j denotes the j-th row vector of B. Thus K_q counts the number of row vectors of B whose ℓ_q-norm is larger than the average of the ℓ_q-norms of the rows of B. As the final upper bound we use the minimum value, K_max = min_q K_q. We then use a logarithmically spaced grid of K values up to K_max.
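The upper-bound rule can be sketched as: for each candidate q, count the rows whose ℓ_q-norm exceeds the average row norm, then take the minimum of those counts. A small illustration under assumed inputs (the 4×2 matrix and the set of q values are toy choices):

```python
import numpy as np

def k_upper_bound(B, qs=(1, 2, np.inf)):
    """For each q, count rows of B whose l_q norm exceeds the mean row l_q
    norm; the grid's upper bound K_max is the minimum of those counts."""
    counts = []
    for q in qs:
        if np.isinf(q):
            norms = np.abs(B).max(axis=1)
        else:
            norms = (np.abs(B) ** q).sum(axis=1) ** (1.0 / q)
        counts.append(int((norms > norms.mean()).sum()))
    return min(counts)

B = np.array([[10.0, 0.0],
              [0.1, 0.1],
              [0.2, 0.0],
              [9.0, 1.0]])
K_max = k_upper_bound(B)   # two rows dominate the average for every q
```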

Apart from the way the lower and upper bounds for the grid of K values are chosen, the CV scheme is conventional. We split the training data set into Q folds, using Q − 1 folds as the training set and one fold as the validation set. We fit the model (the CRDA classifier defined by (10) and (11)) on the training set using a candidate pair (K, h) and compute the misclassification error rate on the validation set. This procedure is repeated so that each of the Q folds is left out in turn as the validation set, and the misclassification rates are accumulated. We then choose the pair (K, h) in the grid that had the smallest misclassification error rate. In case of ties, we choose the one having the smaller value of K, i.e., we prefer the model with the smaller number of features.
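The fold-splitting and sparsity-preferring tie-break can be sketched as below. The actual CRDA fit is deliberately replaced here by a hypothetical stand-in classifier (nearest class centroid on the K features with the largest between-class mean gap, binary classes only); the toy data and the `fit_predict` helper are illustrative assumptions, not the paper's method.

```python
import numpy as np

def cv_select(X, y, grid, fit_predict, Q=5, seed=0):
    """Q-fold CV: accumulate misclassification counts over held-out folds and
    return the grid value with the fewest errors.  Ties prefer earlier grid
    entries, so a grid sorted by increasing K prefers the sparser model."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, Q)
    errors = {g: 0 for g in grid}
    for v in range(Q):
        val = folds[v]
        tr = np.concatenate([folds[j] for j in range(Q) if j != v])
        for g in grid:
            errors[g] += int((fit_predict(X[tr], y[tr], X[val], g) != y[val]).sum())
    return min(grid, key=lambda g: (errors[g], grid.index(g)))

def fit_predict(Xtr, ytr, Xval, K):
    # Hypothetical stand-in for the CRDA fit (binary classes only):
    # nearest centroid on the K features with the largest mean gap.
    classes = np.unique(ytr)
    mus = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    feats = np.argsort(np.abs(mus[0] - mus[1]))[-K:]
    d = ((Xval[:, None, feats] - mus[None, :, feats]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

# Toy binary data: only feature 0 carries class information.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.standard_normal((40, 6))
X[:, 0] += 4.0 * y
best_K = cv_select(X, y, grid=[1, 2, 6], fit_predict=fit_predict)
```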

III-B CRDA variants

We consider the following three CRDA variants:

  • CRDA1 uses the Ell1-RSCM estimator to compute the coefficient matrix B. The tuning parameters, K and h, are chosen using the CV procedure of subsection III-A.

  • CRDA2 is as CRDA1, except that B is computed using the Ell2-RSCM estimator.

  • CRDA3 uses Rie-PSCM to compute the coefficient matrix B, with a fixed HT-selector function and the upper bound K_max as the joint-sparsity level.

Uniform priors (π̂_g = 1/G, g = 1, …, G) are used in all analysis results.

III-C Analysis of real data sets

Next, we explore the performance of the CRDA variants in the classification of three different types of data: speech data (data set #1), image data (data set #2) and gene expression data (data set #3), described in Table II. The first data set is constructed for classification of five English vowel letters spoken twice by thirty different subjects, while the second data set consists of modified observations from the MNIST data set of handwritten digits (namely, digits 4 and 9), used in the NIPS 2003 feature selection challenge. These data sets are available at the UC Irvine machine learning repository [20] under the names Isolet and Gisette, respectively. The third data set is from Khan et al. [21] and consists of small round blue cell tumors (SRBCTs) found in children, classified as BL (Burkitt lymphoma), EWS (Ewing's sarcoma), NB (neuroblastoma) and RMS (rhabdomyosarcoma). The available data is split randomly into training and test sets of fixed lengths (given in Table II), such that the relative proportions of observations in the classes are preserved as accurately as possible. Then 10 such random splits are performed and averages of the error rates are reported.

The basic metrics used are the test error rate (TER), the feature selection rate (FSR) and the average computational time (ACT) in seconds. TER is the misclassification rate calculated by classifying the observations of the test set using the classification rule estimated from the training set. FSR is defined as FSR = K/p, where K denotes the number of features used by the classifier; it is a measure of the interpretability of the chosen model. The FSR of CRDA3 will naturally always be higher than the FSR of CRDA1 or CRDA2.
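The two rates are simple to compute; the sketch below spells them out (the numeric inputs in the usage note are toy values, not results from the paper):

```python
import numpy as np

def ter(y_true, y_pred):
    """Test error rate (%): share of misclassified test observations."""
    return 100.0 * np.mean(np.asarray(y_true) != np.asarray(y_pred))

def fsr(n_selected, p):
    """Feature selection rate (%): share of the p features the model uses."""
    return 100.0 * n_selected / p
```

For instance, one error in four test points gives `ter([0, 1, 1, 0], [0, 1, 0, 0])` = 25.0, and selecting 115 of 2308 genes gives an FSR of about 5%.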

The results of the study reported in Table I are divided into three groups. The first group consists of the proposed CRDA classification methods, and the middle group consists of five state-of-the-art classifiers (cf. Table III). The third group consists of three standard off-the-shelf classifiers commonly used in machine learning, which do not perform feature selection. The methods in the third group are RF, which refers to a random forest of decision trees (where the maximum number of splits per tree was 9 and the number of learning cycles was 150); LogitBoost, which refers to the LogitBoost method [22] using stumps (i.e., two-terminal-node decision trees) as base learners and 120 boosting iterations; and LinSVM, which refers to a linear support vector machine (SVM) using a one-versus-one (OVO) coding design and a linear kernel. Note that since p ≫ n, one can always find a separating hyperplane, and hence a linear kernel suffices for the SVM.

We first discuss the results of the CRDA variants compared against one another. In the case of data set #2, CRDA1 and CRDA3 obtained notably better misclassification rates than CRDA2. In the case of data set #3, CRDA1 and CRDA2 obtained perfect 0% misclassification results while using only 5% of all features. This is a remarkable achievement, especially when compared to the TER/FSR rates of the competing estimators. In the case of data set #1, CRDA3 performed better than CRDA1 and CRDA2, but it attained this result using more features.

Next, we compare CRDA against the methods in the middle group. We notice from Table I that one of the CRDA variants is always the best performing method in terms of both accuracy (TER rates) and feature selection rates (FSR). Only in the case of data set #2, SCRDA is able to achieve equally good error rate (6.3%) as CRDA3.

Finally, we compare the CRDA methods against the methods in the third group. First we notice that the linear SVM is the best performing method among the third group, and for data set #2 it achieves the best error rate among all methods. However, it chooses all features.

As noted earlier, for data set #3 (Khan et al.), CRDA1 and CRDA2 attained perfect classification (0% error rate) while choosing only 5% of the 2308 genes. The heat map of 115 randomly selected genes is shown in Figure 2a, while Figure 2b shows the heat map of the 115 genes picked by CRDA1. While the former shows random patterns, clearly visible patterns can be seen in the latter heat map. Besides prognostic classification, the set of genes found important by CRDA1 can be used to develop an understanding of the genes that influence SRBCTs.

Fig. 2: For data set #3 (Khan et al.), the figure displays the heat map of (a) 115 randomly selected genes and (b) the 115 genes selected by the CRDA1 classifier. We note that the CRDA1 classifier obtained a 0% error rate on all 10 Monte Carlo splits of the data into training and test sets, and outperformed all other classifiers.
Method        | #1 Isolet-vowels | #2 Gisette    | #3 Khan et al.
              | TER    FSR       | TER    FSR    | TER    FSR
CRDA1         |  2.8   21.8      |  7.0   18.8   |  0      5.0
CRDA2         |  2.7   16.3      | 14.4    8.2   |  0      5.0
CRDA3         |  1.3   45.1      |  6.3   36.7   |  1.2   35.5
PLDA          |  3.6  100        | 40.5   50.8   | 18.8   82.0
SCRDA         |  1.6   88.2      |  6.3   87.0   |  2.8   81.5
varSelRF      |  3.0   28.9      |  7.9   18.8   |  2.4   78.8
Logit-ASPLS   | 26.4   98.6      | 27.8   90.0   | 43.2  100
SPCALDA       | 19.1   73.7      |  6.4   47.4   |  3.2   46.9
SVM           |  1.3  100.0      |  6.0  100.0   |  0.8  100.0
LogitBoost    |  3.0  100.0      |  7.7  100.0   | 10.4  100.0
RF            |  1.8  100.0      |  9.5  100.0   |  1.2  100.0
TABLE I: Classification results of CRDA variants and their competitors (see Table III) for data sets #1-#3 in terms of test error rate (TER) and feature selection rate (FSR), reported in percentages.
data set                   | p     | n (train) | n (test) | class sizes (train)                           | G  | Description
#1 Isolet-vowels [20]      | 617   | 100       | 200      | {20, 20, 20, 20, 20}                          | 5  | Spoken English vowel letters
#2 Gisette [20]            | 5000  | 350       | 6650     | {175, 175}                                    | 2  | OCR digit recognition ('4' vs '9')
#3 Khan et al. [21]        | 2308  | 38        | 25       | {5, 14, 7, 12}                                | 4  | SRBCTs
#4 Su et al. [23]          | 5565  | 62        | 40       | {15, 16, 17, 14}                              | 4  | Multiple mammalian tissues
#5 Gordon et al. [24]      | 12533 | 109       | 72       | {90, 19}                                      | 2  | Lung cancer
#6 Burczynski et al. [25]  | 22283 | 76        | 51       | {35, 25, 16}                                  | 3  | Inflammatory bowel diseases (IBD)
#7 Yeoh et al. [26]        | 12625 | 148       | 100      | {9, 16, 38, 12, 26, 47}                       | 6  | Acute lymphoblastic leukemia (ALL)
#8 Tirosh et al. [27]      | 13056 | 114       | 77       | {23, 22, 23, 46}                              | 4  | Yeast species
#9 Ramaswamy et al. [28]   | 16063 | 117       | 73       | {7, 6, 7, 7, 13, 7, 6, 6, 18, 7, 7, 7, 7, 12} | 14 | Cancer
TABLE II: Summary of the real data sets used in this paper (p = number of features, G = number of classes).

IV Feature selection performance

Next, we create a partially synthetic gene expression data set using the third data set in Table II to study how well the different classifiers are able to detect differentially expressed (DE) genes. Data set #3 consists of expression levels of p = 2308 genes; we randomly kept 115 (i.e., 5% of p) covariates of gene expression levels as significant (differentially expressed) features, while the remaining irrelevant (non-DE) features are generated as noise.

Since we now have a ground truth of differentially expressed genes available, we may evaluate the estimated models using the false-positive rate (FPR) and the false-negative rate (FNR). A false positive is a feature that is picked by the classifier but is not DE in any of the classes (its population mean is the same across all groups). A false negative is a feature that is DE in the true model (its population mean differs across groups) but is not picked by the classifier. Formally,

FPR = FP/N  and  FNR = (P − TP)/P,

where P is the number of truly positive (DE) features, N is the number of truly negative (non-DE) features, FP is the number of false positives and TP is the number of true positives, i.e., K = TP + FP gives the total number of features picked by the classifier. Again, recall that in the case of the CRDA methods, K is the sparsity level used in the hard-thresholding operator. Note that both FPR and FNR should be as small as possible for exact and accurate selection (and elimination) of genes.
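Given boolean masks of the picked features and the truly DE features, the two rates can be computed as follows (a sketch; the 10-feature masks are an assumed toy case, not the paper's data):

```python
import numpy as np

def fpr_fnr(selected, true_de):
    """FPR = FP / #non-DE and FNR = FN / #DE (both in %), from boolean masks
    of the features picked by the classifier and the truly DE features."""
    selected = np.asarray(selected, dtype=bool)
    true_de = np.asarray(true_de, dtype=bool)
    fp = np.sum(selected & ~true_de)       # picked but not DE
    fn = np.sum(~selected & true_de)       # DE but not picked
    return 100.0 * fp / np.sum(~true_de), 100.0 * fn / np.sum(true_de)

true_de = np.zeros(10, dtype=bool); true_de[:4] = True        # 4 DE, 6 non-DE
selected = np.zeros(10, dtype=bool); selected[[0, 1, 2, 5]] = True
FPR, FNR = fpr_fnr(selected, true_de)      # 1 false positive, 1 false negative
```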

First, we compare the gene selection performance of CRDA with the conventional gene selection approach [29], which performs a t-test and computes the p-values by permutation analysis using 10,000 permutations. A gene is considered to be differentially expressed when it shows both statistical and biological significance. Statistical significance is attested when the obtained p-value of the test statistic falls below the chosen significance level (false alarm rate). Biological significance is attested when the log ratio of the fold change of mean gene expressions falls outside an interval determined by a chosen cutoff limit. A volcano plot displays the −log10 of the p-values of the t-tests for each gene against the fold change, with the statistical significance level shown as a dotted horizontal line and the biological significance bounds as dotted vertical lines.

The volcano plot illustrates that 19 out of the 115 differentially expressed genes are both statistically and biologically significant (genes in the upper left and upper right corners of the plot), along with a few which are biologically but not statistically significant (lower left and lower right corners). When considering each class in turn against the rest, 60 out of the total of 115 DE genes are picked as DE. The results for CRDA2-based gene selection on the same data set are obtained by plotting the gene index j against the sample variance of the j-th row vector of the coefficient matrix B̄. Out of the 115 DE genes, CRDA2 selected 107.

Method       | Description                                                     | R-package    | Version
PLDA         | Penalized LDA [4]                                               | penalizedLDA | 1.1
SCRDA        | Shrunken centroids RDA [3]                                      | rda          |
varSelRF     | Random forest (RF) with variable selection [30]                 | varSelRF     | 0.7.8
Logit-ASPLS  | Adaptive sparse partial least squares based logistic regression | plsgenomics  | 1.5.2
SPCALDA      | Supervised principal component analysis based LDA               |              |
Lasso        | Regularized multinomial regression with Lasso penalty [32]      | GLMNet       | 3.0-2
EN           | Regularized multinomial regression with Elastic Net penalty [32]| GLMNet       | 3.0-2
TABLE III: State-of-the-art methods included in our comparative study. All of the included methods are designed with the aim of performing feature selection during the learning process. Tuning parameters are chosen using the default methods implemented in the R-packages, including the grids for parameter tuning. When CV is used, five-fold CV is selected.

As a baseline, we further compute the test error rate of the naive classifier that assigns all observations in the test set to the class with the most observations in the training set. Figure 3(a) displays the box plots of FSR and TER for the CRDA methods and three state-of-the-art classifiers. Again the CRDA classifiers obtained the smallest average TER while choosing the smallest number of features. Figure 3(b) provides the box plots of FPR and FNR for all methods. The proposed CRDA methods offer the best feature selection accuracy. On the contrary, the SCRDA and varSelRF methods produce large false-positive rates, meaning that they have chosen as DE a large set of genes that are not. PLDA performs equally well as CRDA in terms of FPR/FNR rates, but its misclassification rates are considerably larger. Furthermore, the CRDA methods exhibit the smallest amount of variation (i.e., standard error) in all measures (TER/FSR/FPR/FNR). We also notice that varSelRF does not perform very well compared to CRDA or other regularized LDA techniques such as SCRDA.

Fig. 3: Classification accuracy results for the partially synthetic gene expression data. (a) Box plots of test error rates (TER) and feature selection rates (FSR). (b) Box plots of false negative rates (FNR) and false positive rates (FPR), reported in percentages. Results are from a simulation study of 10 random splits of the data into training and test sets.

V Analysis of real gene expression data

We compare the performance of the proposed CRDA methods to seven state-of-the-art classifiers on a diverse set of real gene expression (microarray) data sets. These data sets, #4 – #9 in Table II, constitute a representative sample of the gene expression data sets available in the open literature, having varying numbers of features (p), sample sizes (n) and classes (G). As earlier, in each Monte Carlo trial the data is divided randomly into training and test sets of fixed lengths (given in Table II), such that the relative proportions of observations in the classes are preserved as accurately as possible. We do not include CRDA3 in this study due to the high computational cost of computing the Rie-PSCM covariance matrix estimator.

The Gene Expression Omnibus (GEO) data repository contains data sets #6 and #8 under accession numbers GDS1615 and GDS2910, respectively. The 9th data set (also known as the global cancer map (GCM) data set) is available online [33]. The remaining data sets, namely #4, #5 and #7, are included in the R-package datamicroarray (https://github.com/ramhiser/datamicroarray).

Figure 4 displays the average TER and FSR values. A CRDA method obtained the lowest error rate for 5 out of the 6 data sets considered; only for data set #5 (Gordon) did Logit-ASPLS perform better than CRDA. Furthermore, CRDA achieved this top performance while picking a greatly reduced number of genes: the proportion of genes selected ranged from 5.1% to 23.4%. Only Lasso and EN selected fewer genes than the CRDA methods, but their error rates were much higher in all cases except for data set #5 (Gordon).

Fig. 4: Test error rates (TER) and feature selection rates (FSR) in percentages, and average computational times (ACT) in seconds are reported for the last six real data sets given in Table II. Results are averages over random splits of data into training and test sets.

For some data sets, such as the seventh (Yeoh) and eighth (Tirosh), SCRDA and CRDA obtained comparable TER values, but the former fails to eliminate as many genes. Again we observe that many of the competing classifiers, such as varSelRF, PLDA and SPCALDA, perform rather poorly on some data sets; this is especially the case for the sixth (Burczynski) and ninth (Ramaswamy) data sets.

Data set #9 (Ramaswamy) is a very difficult data set for any classification method: it contains G = 14 classes, p = 16063 genes, and only 117 observations in the training set. The results obtained by CRDA clearly illustrate its top performance in very demanding HDLSS data scenarios. The CRDA classifiers outperformed most of the methods by a significant margin: CRDA gives the smallest error rate (25.2%) while at the same time reducing the number of features (genes) used in the classification procedure. The next best method is varSelRF, but it is still far behind (32.6%). We note that the Logit-ASPLS method failed on this data set due to computational and memory issues, and hence it is excluded.

Finally, the bottom panel of Figure 4 displays the ACT values (in seconds) of the methods. The variation in ACT between data sets is due to their different numbers of features (p) and observations (n). In terms of computational complexity, CRDA loses to Lasso, EN and LinSVM. However, compared to the other methods that are designed to perform feature selection, CRDA is among the most efficient. In particular, one can note that Logit-ASPLS and varSelRF are computationally very intensive.

VI Discussions and Conclusions

We proposed a novel compressive regularized discriminant analysis (CRDA) method for high-dimensional data sets, where the data dimensionality p is much larger than the sample size n (i.e., p ≫ n, where p is very large, often several thousand).

The method showcases accurate classification and feature selection while remaining fast to compute. Feature selection is important for post-selection inference in various problems, as it provides insight into which features (e.g., genes) are particularly important in expressing a particular trait (e.g., a cancer type).
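As an illustration of the joint-sparsity hard-thresholding idea underlying CRDA, the following minimal sketch keeps the K rows of a coefficient matrix with the largest row norms and zeroes out the rest. This is our own rendering for illustration, not the compressiveRDA implementation: the function name is ours, and the ℓ2 row norm is used here as one possible choice.

```python
import numpy as np

def hard_threshold_rows(B, K, q=2):
    """Joint-sparsity hard thresholding: keep the K rows of B with the
    largest ell-q norms and set all remaining rows to zero.

    B : (p, G) coefficient matrix (one column per class discriminant)
    K : number of features (rows) to retain
    """
    norms = np.linalg.norm(B, ord=q, axis=1)  # row-wise ell-q norms
    keep = np.argsort(norms)[-K:]             # indices of the K largest norms
    out = np.zeros_like(B)
    out[keep] = B[keep]                       # rows not kept correspond to dropped features
    return out
```

Features whose rows are zeroed out play no role in the resulting discriminant rule, which is what yields the feature selection interpretation.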

Our simulation studies and real data analysis examples illustrate that CRDA offers top performance in all three considered facets: classification accuracy, feature selection, and computational complexity. Software (both MATLAB and R) toolboxes to reproduce the results are available: see Section I for links to the packages.

Appendix A Tuning parameter selection of the Ell1-RSCM and Ell2-RSCM estimators

The Ell1-RSCM and Ell2-RSCM estimators [7] compute an estimate β̂ of the optimal shrinkage parameter β₀ derived in [7]. The estimate is a function of γ̂, an estimator of the sphericity parameter γ = p tr(Σ²)/[tr(Σ)]², and of κ̂, an estimate of the elliptical kurtosis parameter κ, calculated as the average sample (excess) kurtosis of the marginal variables scaled by 1/3. The formula is derived under the assumption that the data is a random sample from an unspecified elliptically symmetric distribution with mean vector μ and covariance matrix Σ.
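To make the kurtosis step concrete, here is a minimal sketch of estimating κ as the average marginal excess kurtosis scaled by 1/3. The function name is ours, and any finite-sample corrections applied in [7] or the compressiveRDA toolbox are omitted.

```python
import numpy as np

def elliptical_kurtosis_estimate(X):
    """Estimate the elliptical kurtosis parameter kappa as the average
    sample (excess) kurtosis of the p marginal variables, scaled by 1/3.
    X is an n x p data matrix whose rows are the observations."""
    Xc = X - X.mean(axis=0)          # center each marginal variable
    m2 = np.mean(Xc ** 2, axis=0)    # second sample moments
    m4 = np.mean(Xc ** 4, axis=0)    # fourth sample moments
    g2 = m4 / m2 ** 2 - 3.0          # per-variable excess kurtosis
    return np.mean(g2) / 3.0         # kappa = 0 for Gaussian data
```

For Gaussian data κ = 0, while heavier-tailed elliptical distributions yield κ > 0.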

The Ell1-RSCM and Ell2-RSCM estimators differ only in the estimate γ̂ of the sphericity parameter that is plugged into the shrinkage parameter formula. The estimate used by Ell1-RSCM is based on the sample spatial sign covariance matrix, defined as

Ŝ_sgn = (1/n) Σᵢ (xᵢ − μ̂)(xᵢ − μ̂)ᵀ / ‖xᵢ − μ̂‖²,

where μ̂ is the spatial median. In Ell2-RSCM, the estimate of sphericity is instead computed from the pooled SCM given in (3).
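The spatial sign covariance matrix and a plug-in sphericity estimate can be sketched as follows. This sketch assumes the center μ̂ (the spatial median, e.g., from a Weiszfeld-type algorithm) is supplied, the function names are ours, and the finite-sample bias corrections used in the actual Ell1-RSCM and Ell2-RSCM estimators of [7] are omitted.

```python
import numpy as np

def sign_covariance(X, mu):
    """Sample spatial sign covariance matrix of the n x p data X about
    the center mu (the spatial median in the Ell1-RSCM case)."""
    U = X - mu
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # project rows onto the unit sphere
    return (U.T @ U) / X.shape[0]

def sphericity_plugin(S):
    """Plug-in sphericity gamma = p * tr(S^2) / tr(S)^2 computed from a
    covariance matrix estimate S; equals 1 iff S is proportional to I."""
    p = S.shape[0]
    return p * np.trace(S @ S) / np.trace(S) ** 2
```

Note that the spatial sign covariance matrix always has unit trace, since each projected observation contributes a rank-one matrix of trace one.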

Appendix B On computing the inverse of the regularized SCM

The matrix inversion can be performed efficiently using the SVD trick [3]. Let X denote the p × n centered data matrix with (thin) SVD X = U D Vᵀ, where U is p × n with orthonormal columns, D is an n × n diagonal matrix, and V is an n × n orthogonal matrix. Direct computation of this SVD is time consuming when p is large; the trick is that V and D can be computed first from the eigendecomposition XᵀX = V D² Vᵀ, which involves only an n × n matrix. After computing V and D in this way, we obtain U = X V D⁻¹. Then, using the SVD representation of the SCM, S = X Xᵀ/(n − 1) = U D² Uᵀ/(n − 1), and simple algebra (the Woodbury matrix identity), one obtains a simple formula for the inverse of the regularized SCM in (5):

[β S + (1 − β) η I]⁻¹ = (1/((1 − β)η)) [ I − U Λ Uᵀ ],

where Λ = diag(λ₁, …, λₙ) with λᵢ = (β dᵢ²/(n − 1)) / (β dᵢ²/(n − 1) + (1 − β)η), and dᵢ denotes the i-th diagonal element of D. This reduces the complexity from O(p³) to O(p n²), which is a significant saving in the p ≫ n case.
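As an illustration, the following sketch inverts a regularized SCM of the form β S + (1 − β) η̂ I with η̂ = tr(S)/p (the Ell-RSCM form of [7]) using only an n × n eigendecomposition. The function name is ours, and the result can be verified against a dense inverse.

```python
import numpy as np

def rscm_inverse(X, beta):
    """Inverse of the regularized SCM  beta*S + (1 - beta)*eta*I  via the
    SVD trick, where S = Xc Xc^T / (n - 1) is the SCM of the centered
    p x n data Xc and eta = tr(S)/p.  X is an n x p data matrix, n << p."""
    n, p = X.shape
    Xc = (X - X.mean(axis=0)).T                 # p x n centered data matrix
    G = Xc.T @ Xc                               # small n x n Gram matrix
    d2, V = np.linalg.eigh(G)                   # eigenvalues are the d_i^2
    d2 = np.clip(d2, 0.0, None)
    pos = d2 > 1e-9 * max(d2.max(), 1.0)        # drop numerically zero modes
    U = Xc @ V[:, pos] / np.sqrt(d2[pos])       # left singular vectors of Xc
    lam = beta * d2[pos] / (n - 1)              # nonzero eigenvalues of beta*S
    eta = d2.sum() / ((n - 1) * p)              # tr(S) / p
    c = (1.0 - beta) * eta
    # Woodbury identity: (c I + U diag(lam) U^T)^{-1}
    #                  = (1/c) * (I - U diag(lam / (lam + c)) U^T)
    return (np.eye(p) - (U * (lam / (lam + c))) @ U.T) / c
```

Only the n × n eigendecomposition and a few p × n products are needed, so the cost is O(p n²) rather than the O(p³) of a dense inversion.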


The research was partially supported by the Academy of Finland grant no. 298118, which is gratefully acknowledged.


  • [1] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, “Class prediction by nearest shrunken centroids, with applications to DNA microarrays,” Statistical Science, pp. 104–117, 2003.
  • [2] G. Durif, L. Modolo, J. Michaelsson, J. E. Mold, S. Lambert-Lacroix, and F. Picard, “High dimensional classification with combined adaptive sparse PLS and logistic regression,” Bioinformatics, vol. 34, no. 3, pp. 485–493, 2018.
  • [3] Y. Guo, T. Hastie, and R. Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2006.
  • [4] D. M. Witten and R. Tibshirani, “Penalized classification using Fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.
  • [5] A. Sharma, K. K. Paliwal, S. Imoto, and S. Miyano, “A feature selection method using improved regularized linear discriminant analysis,” Machine vision and applications, vol. 25, no. 3, pp. 775–786, 2014.
  • [6] E. Neto, F. Biessmann, H. Aurlien, H. Nordby, and T. Eichele, “Regularized linear discriminant analysis of EEG features in dementia patients,” Frontiers in aging neuroscience, vol. 8, 2016.
  • [7] E. Ollila and E. Raninen, “Optimal shrinkage covariance matrix estimation under random sampling from elliptical distributions,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2707–2719, May 2019.
  • [8] P. L. Yu, X. Wang, and Y. Zhu, “High dimensional covariance matrix estimation by penalizing the matrix-logarithm transformed likelihood,” Computational Statistics & Data Analysis, vol. 114, pp. 12–25, 2017.
  • [9] D. E. Tyler and M. Yi, “Shrinking the sample covariance matrix using convex penalties on the matrix-log transformation,” arXiv preprint arXiv:1903.08281, 2019.
  • [10] M. N. Tabassum and E. Ollila, “Compressive regularized discriminant analysis of high-dimensional data with applications to microarray studies,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4204–4208.
  • [11] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.
  • [12] A. Coluccia, “Regularized covariance matrix estimation via empirical Bayes,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 2127–2131, 2015.
  • [13] B. Meriaux, C. Ren, M. N. El Korso, A. Breloy, and P. Forster, “Robust estimation of structured scatter matrices in (mis) matched models,” Signal Processing, vol. 165, pp. 163–174, 2019.
  • [14] A. Wiesel, T. Zhang et al., “Structured robust covariance estimation,” Foundations and Trends® in Signal Processing, vol. 8, no. 3, pp. 127–216, 2015.
  • [15] E. Ollila and D. E. Tyler, “Regularized M-estimators of scatter matrix,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 6059–6070, 2014.
  • [16] F. Pascal, Y. Chitour, and Y. Quek, “Generalized robust shrinkage estimator and its application to STAP detection problem,” IEEE Transactions on Signal Processing, vol. 62, no. 21, pp. 5640–5651, 2014.
  • [17] Y. Sun, P. Babu, and D. P. Palomar, “Regularized Tyler’s scatter estimator: Existence, uniqueness, and algorithms,” IEEE Transactions on Signal Processing, vol. 62, no. 19, pp. 5143–5156, 2014.
  • [18] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero, “Shrinkage algorithms for MMSE covariance estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5016–5029, 2010.
  • [19] E. Ollila, “Optimal high-dimensional shrinkage covariance estimation for elliptical distributions,” in Proc. European Signal Processing Conference (EUSIPCO 2017), Kos, Greece, 2017, pp. 1689–1693.
  • [20] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [21] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, p. 673, 2001.
  • [22] J. Friedman, T. Hastie, R. Tibshirani et al., “Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors),” The annals of statistics, vol. 28, no. 2, pp. 337–407, 2000.
  • [23] A. I. Su, M. P. Cooke, K. A. Ching, Y. Hakak, J. R. Walker, T. Wiltshire, A. P. Orth, R. G. Vega, L. M. Sapinoso, A. Moqrich, A. Patapoutian, G. M. Hampton, P. G. Schultz, and J. B. Hogenesch, “Large-scale analysis of the human and mouse transcriptomes.” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 7, pp. 4465–4470, Apr. 2002.
  • [24] G. J. Gordon, R. V. Jensen, L.-L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and R. Bueno, “Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma,” Cancer Research, vol. 62, no. 17, pp. 4963–4967, 2002.
  • [25] M. E. Burczynski, R. L. Peterson, N. C. Twine, K. A. Zuberek, B. J. Brodeur, L. Casciotti, V. Maganti, P. S. Reddy, A. Strahs, F. Immermann, W. Spinelli, U. Schwertschlag, A. M. Slager, M. M. Cotreau, and A. J. Dorner, “Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells,” The Journal of Molecular Diagnostics, vol. 8, no. 1, pp. 51 – 61, 2006.
  • [26] E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. K. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer cell, vol. 1, no. 2, pp. 133–143, 2002.
  • [27] I. Tirosh, A. Weinberger, M. Carmi, and N. Barkai, “A genetic signature of interspecies variations in gene expression,” Nature genetics, vol. 38, no. 7, p. 830, 2006.
  • [28] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15 149–15 154, 2001.
  • [29] X. Cui and G. A. Churchill, “Statistical tests for differential expression in cDNA microarray experiments,” Genome biology, vol. 4, no. 4, p. 210, 2003.
  • [30] R. Diaz-Uriarte, “GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest,” BMC Bioinformatics, vol. 8, no. 1, p. 328, Sep 2007.
  • [31] Y. Niu, N. Hao, and B. Dong, “A new reduced-rank linear discriminant analysis method and its applications,” Statistica Sinica, vol. 28, no. 1, pp. 189–202, 1 2018.
  • [32] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of statistical software, vol. 33, no. 1, p. 1, 2010.
  • [33] Cancer Program Legacy Publication Resources - Broad Institute, “Global cancer map dataset,” 2001, http://portals.broadinstitute.org/cgi-bin/cancer/publications/view/61, Last accessed on 2018-09-18.