I Introduction
High-dimensional (HD) classification is at the core of numerous contemporary statistical studies. An increasingly common occurrence is the collection of large amounts of information on each individual sample point, even though the number of sample points itself may remain relatively small. Typical examples arise in gene expression and protein mass spectrometry studies, and in other areas of computational biology. Regularization and shrinkage are commonly used tools in applications such as regression or classification to overcome the significant statistical challenges posed by huge-dimension, low-sample-size (HDLSS) data settings, in which the number of features, $p$, is often several orders of magnitude larger than the sample size, $n$ (i.e., $p \gg n$). In this paper, we propose a novel classification framework for such HD data.
In many applications, the ideal classifier of an HD input vector should have three characteristics: (i) the proportion of correctly classified observations is high, (ii) the method is computationally feasible, allowing the classifier to be trained in real time on a personal computer, and (iii) the classifier selects only the significant features, thus yielding interpretable models. The last property is important, for example, in gene expression studies, where one needs to identify which genes influence the trait (e.g., disease). This is obviously important for the biological understanding of the disease process, but it can also help in the development of clinical tests for early diagnosis
[1]. We propose an HD classification method that takes all three of these key aspects into account. Some of the standard off-the-shelf classification methods either lack feature elimination or require that it be done beforehand. For example, the support vector machine (SVM) classifier uses all features. Other classifiers have a higher computational cost; e.g., the LogitASPLS approach (R package: plsgenomics) [2] depends on three hyperparameters. This increases the computational complexity of the method, especially when finer grids of hyperparameters are used and tuned using cross-validation (CV). In this paper, we propose a classification framework referred to as the
compressive regularized discriminant analysis (CRDA) method, which is computationally efficient and performs feature selection during the learning process. Our simulation studies and real data analysis examples illustrate that it often achieves superior classification performance compared to off-the-shelf classification approaches in HDLSS settings. Many classification techniques assign a $p$-dimensional observation $\mathbf{x}$ to one of $G$ classes (groups or populations) based on the following rule:

$$\text{assign } \mathbf{x} \text{ to class } \hat g = \arg\max_{g \in \{1, \ldots, G\}} d_g(\mathbf{x}), \qquad (1)$$

where $d_g(\mathbf{x})$ is called the discriminant function for population $g$. In linear discriminant analysis (LDA), $d_g(\mathbf{x})$ is a linear function of $\mathbf{x}$, $d_g(\mathbf{x}) = \mathbf{x}^\top \mathbf{b}_g + c_g$, for some constant $c_g$ and vector $\mathbf{b}_g$, $g = 1, \ldots, G$. The vector $\mathbf{b}_g$ depends on the unknown covariance matrix of the populations (via its inverse), which is commonly estimated by the pooled sample covariance matrix (SCM). In the HD setting, the SCM is no longer invertible, and therefore a regularized SCM (RSCM) is used for constructing an estimated discriminant function $\hat d_g(\mathbf{x})$. Such approaches are commonly referred to as regularized LDA methods, which we refer to shortly as RDA; see, e.g., [3, 1, 4, 5, 6].
Next, note that if the $j$th entry of $\mathbf{b}_g$ is zero, then the $j$th feature does not contribute to the classification of the $g$th population. To eliminate unnecessary features, one can shrink $\mathbf{b}_g$ using elementwise soft-thresholding as in the shrunken centroids RDA (SCRDA) method [3]. Such methods are often difficult to tune because the shrinkage threshold parameter is the same across all groups, whereas different populations would often benefit from different shrinkage intensities. Consequently, they tend to yield higher false-positive rates (FPR).
Elementwise shrinkage does not achieve simultaneous feature elimination, as a feature eliminated from one group may still affect the discriminant functions of the other groups. The proposed CRDA method instead uses joint row-sparsity via norm-based hard-thresholding. Another advantage is a shrinkage parameter that is much easier to tune and interpret: namely, the joint-sparsity level instead of a shrinkage threshold as in SCRDA. CRDA also uses two recent state-of-the-art regularized covariance matrix estimators proposed in [7] and [8, 9], respectively. The first, called the Ell-RSCM estimator [7], is designed to attain the minimum mean squared error and is fast to compute due to automatic data-adaptive tuning of the involved regularization parameter. The second is a penalized SCM (PSCM) proposed by [8], which we refer to as RiePSCM, that minimizes a penalized negative Gaussian log-likelihood function with a Riemannian distance penalty. Although a closed-form solution of the optimization problem is not available, the estimator can be computed using a fast iterative Newton-Raphson algorithm developed in [9].
We test the effectiveness of the proposed CRDA methods using partially synthetic genomic data as well as nine real data sets, and compare the obtained results against seven state-of-the-art HD classifiers that also perform feature selection. The results illustrate that, overall, the proposed CRDA method gives fewer misclassification errors than its competitors while achieving, at the same time, accurate feature selection with minimal computational effort. Preliminary results of the CRDA approach have appeared in [10].
The paper is structured as follows. Section II describes the regularized LDA procedure and the regularized covariance matrix estimators used. In Section III, CRDA is introduced as a compressive version of regularized LDA that performs feature selection simultaneously with classification. Classification results for three different types of real data sets (speech, image, genomics) are also presented. Section IV illustrates the promising feature selection performance of CRDA on a partially synthetic data set, while Section V provides analysis results for six real gene expression data sets against a large number of benchmark classifiers. Section VI provides discussion and concludes the paper. MATLAB and R toolboxes are available at http://users.spa.aalto.fi/esollila/crda and https://github.com/mntabassm/compressiveRDA.
Notations: We use $\|\cdot\|_F$ to denote the Frobenius (matrix) norm, defined as $\|\mathbf{A}\|_F = \sqrt{\mathrm{tr}(\mathbf{A}^\top \mathbf{A})}$ for any matrix $\mathbf{A}$. The $\ell_q$ norm is defined as

$$\|\mathbf{a}\|_q = \Big( \sum_{i} |a_i|^q \Big)^{1/q}, \quad q \ge 1,$$

for any vector $\mathbf{a}$. The soft-thresholding (ST) operator is defined as

$$S_\Delta(x) = \mathrm{sign}(x)\, (|x| - \Delta)_+,$$

where $(t)_+ = \max(t, 0)$ and $\Delta \ge 0$ is a threshold constant. Finally, $\mathbb{1}(\cdot)$ denotes the indicator function, which takes value 1 when its argument is true and value 0 otherwise, and $\mathbb{E}[\cdot]$ denotes expectation.
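For concreteness, the elementwise soft-thresholding operator above can be sketched as follows (an illustrative Python/NumPy snippet; the paper's own toolboxes are in MATLAB and R):

```python
import numpy as np

def soft_threshold(x, delta):
    """Elementwise soft-thresholding: sign(x) * (|x| - delta)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

# Entries with magnitude below delta are set exactly to zero;
# the remaining entries are shrunk towards zero by delta.
x = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
y = soft_threshold(x, 1.0)
```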
II Regularized LDA
We are given a $p$-variate random vector $\mathbf{x}$ which we need to classify into one of $G$ groups or populations. In LDA, one assumes that the class populations are $p$-variate multivariate normal (MVN), sharing a common positive definite symmetric covariance matrix $\boldsymbol{\Sigma}$ over all classes but having distinct class mean vectors $\boldsymbol{\mu}_g$, $g = 1, \ldots, G$. The problem is then to classify $\mathbf{x}$ to one of the MVN populations $\mathcal{N}_p(\boldsymbol{\mu}_g, \boldsymbol{\Sigma})$, $g = 1, \ldots, G$. Sometimes prior knowledge is available on the proportions of each population, and we denote by $\pi_g$ the prior probabilities of the classes ($\sum_{g=1}^{G} \pi_g = 1$). LDA uses the rule (1) with discriminant function $d_g(\mathbf{x}) = \mathbf{x}^\top \mathbf{b}_g + c_g$, where $\mathbf{b}_g = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_g$ and $c_g = -\tfrac{1}{2} \boldsymbol{\mu}_g^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_g + \ln \pi_g$ for $g = 1, \ldots, G$.
The LDA rule involves a set of unknown parameters: the class mean vectors $\boldsymbol{\mu}_g$ and the covariance matrix $\boldsymbol{\Sigma}$. These are estimated from the training data set, which consists of $n_g$ observations from each of the $G$ classes ($n = n_1 + \cdots + n_G$). Let $y_i \in \{1, \ldots, G\}$ denote the class label associated with the $i$th observation $\mathbf{x}_i$. Then $n_g = \sum_{i=1}^{n} \mathbb{1}(y_i = g)$ is the number of observations belonging to the $g$th population, and we denote by $\hat\pi_g = n_g / n$ the relative sample proportions. We assume the observations in the training data set are centered by the sample mean vectors of the classes,

$$\mathbf{x}_i \leftarrow \mathbf{x}_i - \hat{\boldsymbol{\mu}}_g \ \text{ for all } i \text{ with } y_i = g, \quad \text{where } \hat{\boldsymbol{\mu}}_g = \frac{1}{n_g} \sum_{i :\, y_i = g} \mathbf{x}_i. \qquad (2)$$
Since the data matrix $\mathbf{X} = [\mathbf{x}_1 \, \cdots \, \mathbf{x}_n]^\top \in \mathbb{R}^{n \times p}$ is centered group-wise, the pooled (over groups) SCM can be written simply as

$$\mathbf{S} = \frac{1}{n - G}\, \mathbf{X}^\top \mathbf{X}. \qquad (3)$$
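As an illustration of (2)-(3), group-wise centering followed by the pooled SCM computation might look as follows (a Python/NumPy sketch; the divisor $n - G$ is an assumption of this sketch, since the exact normalization of the pooled SCM is not reproduced from the paper):

```python
import numpy as np

def pooled_scm(X, y):
    """Pooled SCM of group-wise centered data: center each class by its
    own sample mean (cf. (2)), then form X'X / (n - G) (cf. (3)),
    where G is the number of classes. Divisor n - G is an assumption."""
    n, p = X.shape
    classes = np.unique(y)
    G = len(classes)
    Xc = X.astype(float).copy()
    for g in classes:
        Xc[y == g] -= Xc[y == g].mean(axis=0)   # group-wise centering
    return (Xc.T @ Xc) / (n - G)
```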
In practice, an observation $\mathbf{x}$ is classified using an estimated discriminant function,

$$\hat d_g(\mathbf{x}) = \mathbf{x}^\top \hat{\mathbf{b}}_g - \tfrac{1}{2} \hat{\boldsymbol{\mu}}_g^\top \hat{\mathbf{b}}_g + \ln \hat\pi_g, \qquad (4)$$

where $\hat{\mathbf{b}}_g = \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_g$, and $\hat{\boldsymbol{\Sigma}}$ is an estimator of $\boldsymbol{\Sigma}$. Note that in (4) the prior probabilities $\pi_g$ are replaced by their estimates $\hat\pi_g$. Commonly, the pooled SCM in (3) is used as the estimator of the covariance matrix $\boldsymbol{\Sigma}$.
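The estimated LDA rule (4) amounts to a few matrix products; a minimal Python/NumPy sketch, assuming the class means are stored as the rows of `means` and an inverse covariance estimate is already available, is:

```python
import numpy as np

def lda_classify(X_test, means, Sigma_inv, priors):
    """Assign each row of X_test to the class maximizing the estimated
    discriminant d_g(x) = x'b_g - 0.5 * mu_g'b_g + log(pi_g),
    with b_g = Sigma_inv @ mu_g (cf. (4))."""
    B = Sigma_inv @ means.T                            # p x G coefficient matrix
    const = -0.5 * np.sum(means.T * B, axis=0) + np.log(priors)
    scores = X_test @ B + const                        # n_test x G
    return np.argmax(scores, axis=1)

# Toy example: two well-separated classes, identity covariance.
means = np.array([[0.0, 0.0], [4.0, 4.0]])             # G x p
pred = lda_classify(np.array([[0.1, -0.2], [3.9, 4.2]]),
                    means, np.eye(2), np.array([0.5, 0.5]))
```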
Since $p > n$, the pooled SCM is no longer invertible. To avoid the singularity of the estimated covariance matrix, a commonly used approach (cf. [11]) is to use a regularized SCM (RSCM) that shrinks the estimator towards a target matrix. One such method is reviewed in subsection II-A. An alternative is the penalized likelihood approach, in which a penalty term is added to the likelihood function with the aim of penalizing covariance matrices that are far from the desired target matrix. One prominent estimator of this type is reviewed in subsection II-B. There are many other covariance matrix estimators for high-dimensional data, including Bayes estimators [12], structured covariance matrix estimators [13, 14], robust shrinkage estimators [15, 16, 17, 18], etc., but most of these are not applicable in the HDLSS setting due to their high computational cost.
II-A Regularized SCM estimators
A commonly used RSCM estimator (see, e.g., [11, 19, 7]) is of the form:

$$\hat{\boldsymbol{\Sigma}}(\alpha) = \alpha \mathbf{S} + (1 - \alpha)\, \eta \mathbf{I}, \qquad \eta = \mathrm{tr}(\mathbf{S})/p, \qquad (5)$$

where $\alpha \in [0, 1)$ denotes the shrinkage (or regularization) parameter and $\mathbf{I}$ denotes an identity matrix of appropriate size. We use the automated method for choosing the tuning parameter $\alpha$ proposed in [7] and the respective estimators, called Ell1-RSCM and Ell2-RSCM. These estimators are based on a consistent estimator of the parameter value $\alpha$ that minimizes the mean squared error,

$$\mathrm{MSE}\big(\hat{\boldsymbol{\Sigma}}(\alpha)\big) = \mathbb{E}\Big[ \big\| \hat{\boldsymbol{\Sigma}}(\alpha) - \boldsymbol{\Sigma} \big\|_F^2 \Big]. \qquad (6)$$
See Appendix A for details.
Note that SCRDA [3] uses a different estimator, $\tilde{\boldsymbol{\Sigma}} = \alpha \mathbf{S} + (1 - \alpha) \mathbf{I}$, where the regularization parameter $\alpha$ is chosen using cross-validation (CV). The benefit of the Ell1-RSCM and Ell2-RSCM estimators is that they avoid the computationally intensive CV scheme and are fast to compute. In HD settings, another computational burden is related to inverting the matrix in (5). The matrix inversion can be done efficiently using the SVD trick detailed in Appendix B.
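A minimal sketch of an RSCM of the form (5) is given below (Python/NumPy; the trace-preserving scaling `eta = tr(S)/p` of the identity target is an assumption of this sketch, and the data-adaptive choice of `alpha` used by Ell1-/Ell2-RSCM is not reproduced here):

```python
import numpy as np

def rscm(S, alpha):
    """Regularized SCM: convex combination of the (possibly singular)
    pooled SCM S and a scaled identity target. The scaling
    eta = tr(S)/p keeps the total trace unchanged (an assumption of
    this sketch); for 0 <= alpha < 1 the result is positive definite."""
    p = S.shape[0]
    eta = np.trace(S) / p
    return alpha * S + (1 - alpha) * eta * np.eye(p)
```

Even when $p > n$ makes `S` rank-deficient, the shrunk matrix is invertible, which is what the estimated discriminant function requires.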
II-B Penalized SCM estimator using Riemannian distance
Assume that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is a random sample from an MVN distribution. Then the maximum likelihood estimator of $\boldsymbol{\Sigma}$ is found by minimizing the negative Gaussian log-likelihood function,

$$L(\boldsymbol{\Sigma}) = \ln \big| \boldsymbol{\Sigma} \big| + \mathrm{tr}\big( \boldsymbol{\Sigma}^{-1} \mathbf{S} \big),$$

over the set of positive definite symmetric matrices. When $n > p$ and the SCM is nonsingular, the minimum is attained at $\hat{\boldsymbol{\Sigma}} = \mathbf{S}$. Since here $p > n$ and $\mathbf{S}$ is singular, it is common to add a penalty function to the likelihood function to enforce a certain structure on the covariance matrix and to guarantee that the combined objective function has a unique (positive definite matrix) solution. One such penalized SCM (PSCM), proposed by [8], uses the Riemannian distance of $\boldsymbol{\Sigma}$ from a scaled identity matrix $\mu \mathbf{I}$, where $\mu > 0$ is a shrinkage constant. One then solves the penalized likelihood problem:
$$\min_{\boldsymbol{\Sigma} \succ \mathbf{0}} \; \ln \big| \boldsymbol{\Sigma} \big| + \mathrm{tr}\big( \boldsymbol{\Sigma}^{-1} \mathbf{S} \big) + \lambda \, \big\| \ln\big( \mu^{-1} \boldsymbol{\Sigma} \big) \big\|_F^2, \qquad (7)$$

where $\lambda > 0$ is the penalty parameter and $\ln(\cdot)$ in the penalty denotes the matrix logarithm. As in [9], we choose $\mu$ as the mean of the eigenvalues of the SCM, i.e., $\mu = \mathrm{tr}(\mathbf{S})/p$. The eigenvalues of the solution are then shrunken towards $\mu$, with $\hat{\boldsymbol{\Sigma}} \to \mu \mathbf{I}$ in the limit as $\lambda \to \infty$ [8]. The solution to (7), which we refer to as the RiePSCM estimator, does not have a closed form, but a relatively efficient Newton-Raphson algorithm [9] can be used to compute it. We choose the associated penalty parameter $\lambda$ using the cross-validation procedure proposed in [9].
III Compressive RDA
In order to explain the proposed compressive RDA (CRDA) approach, we first write the discriminant rule in (4) in vector form as

$$\text{assign } \mathbf{x} \text{ to class } \hat g = \arg\max_{g} \big[ \mathbf{B}^\top \mathbf{x} + \mathbf{c} \big]_g, \qquad (8)$$

where $\mathbf{c} = (c_1, \ldots, c_G)^\top$ with $c_g = -\tfrac{1}{2} \hat{\boldsymbol{\mu}}_g^\top \hat{\mathbf{b}}_g + \ln \hat\pi_g$, and the $p \times G$ coefficient matrix $\mathbf{B}$ is defined as

$$\mathbf{B} = \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{M}, \qquad \mathbf{M} = [\, \hat{\boldsymbol{\mu}}_1 \; \cdots \; \hat{\boldsymbol{\mu}}_G \,]. \qquad (9)$$
The discriminant function in (8) is linear in $\mathbf{x}$ with coefficient matrix $\mathbf{B}$. Hence, if the $j$th row of the coefficient matrix is a zero vector, then the $j$th feature does not contribute to the classification rule and can hence be eliminated. If the coefficient matrix is row-sparse, then the discriminant rule achieves simultaneous feature elimination.
To achieve row-sparsity, we apply a hard-thresholding (HT) operator $H_K(\cdot)$ to the coefficient matrix $\mathbf{B}$, defined as the transform that retains only the $K$ rows of $\mathbf{B}$ that attain the largest values of the HT-selector function and sets the elements of all other rows to zero. See Figure 1 for an illustration. The compressive RDA discriminant rule is then defined as

$$\text{assign } \mathbf{x} \text{ to class } \hat g = \arg\max_{g} \big[ \mathbf{B}_K^\top \mathbf{x} + \mathbf{c} \big]_g, \qquad (10)$$

where

$$\mathbf{B}_K = H_K(\mathbf{B}). \qquad (11)$$
By its construction, the CRDA classifier uses only $K$ features and thus performs feature selection simultaneously with classification. The tuning parameter $K$ determines how many features are selected. As the HT-selector function applied to each row $\mathbf{b}^{(j)}$ of $\mathbf{B}$, we use either a vector norm, $\|\mathbf{b}^{(j)}\|_q$, or the sample variance,

$$\mathrm{var}\big(\mathbf{b}^{(j)}\big) = \frac{1}{G - 1} \sum_{g=1}^{G} \big( b_{jg} - \bar b_j \big)^2,$$

where $\bar b_j = \frac{1}{G} \sum_{g=1}^{G} b_{jg}$. The two-dimensional CV search described in subsection III-A is used to choose an appropriate joint-sparsity level $K$ and the best HT-selector function.
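The row-wise hard-thresholding operator $H_K(\cdot)$ with the two kinds of HT-selector functions described above can be sketched as follows (Python/NumPy; the $\ell_2$ norm is used here purely as an illustrative choice of row norm):

```python
import numpy as np

def hard_threshold_rows(B, K, selector="l2"):
    """Row-wise hard-thresholding H_K(B): keep the K rows of B with the
    largest value of the HT-selector (a row norm, or the sample variance
    of the row's elements) and zero out all other rows, yielding a
    row-sparse coefficient matrix."""
    if selector == "l2":
        scores = np.linalg.norm(B, axis=1)
    else:                                   # "var": sample variance per row
        scores = np.var(B, axis=1, ddof=1)
    keep = np.argsort(scores)[-K:]          # indices of the K largest scores
    out = np.zeros_like(B)
    out[keep] = B[keep]
    return out
```

The zeroed rows correspond to features that are simultaneously eliminated from every class's discriminant function.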
Shrunken Centroids Regularized Discriminant Analysis (SCRDA) [3] uses the estimator $\tilde{\boldsymbol{\Sigma}} = \alpha \mathbf{S} + (1 - \alpha) \mathbf{I}$ and elementwise soft-shrinkage. SCRDA can be expressed in the form (8) with

$$\mathbf{B} = S_\Delta\big( \tilde{\boldsymbol{\Sigma}}^{-1} \mathbf{M} \big), \qquad (12)$$

where the soft-thresholding function $S_\Delta$ is applied elementwise. A disadvantage of SCRDA is that the threshold $\Delta$ is the same across all groups, whereas different populations would often benefit from different shrinkage intensities. Another drawback is that the upper bound for the shrinkage threshold is data dependent. In SCRDA, a two-dimensional CV search is likewise used to find the tuning parameter pair $(\alpha, \Delta)$.
III-A Cross-validation scheme
The joint-sparsity order $K$ and the HT-selector function are chosen using a cross-validation (CV) scheme. For $K$ we use a logarithmically spaced grid of values, where the upper bound for the largest element in the grid is defined as follows. For each candidate HT-selector function, we calculate

$$K^{\circ} = \sum_{j=1}^{p} \mathbb{1}\Big( \|\mathbf{b}^{(j)}\| > \frac{1}{p} \sum_{j'=1}^{p} \|\mathbf{b}^{(j')}\| \Big), \qquad (13)$$

where $\mathbf{b}^{(j)}$ denotes the $j$th row vector of $\mathbf{B}$. Thus (13) counts the number of row vectors of $\mathbf{B}$ whose norm is larger than the average of the norms of the row vectors of $\mathbf{B}$. As the final upper bound $K_{\mathrm{up}}$ we use the minimum of these values. We then use a logarithmically spaced grid of values up to $K_{\mathrm{up}}$.
Apart from the way the lower and upper bounds for the grid of $K$ values are chosen, the CV scheme used is conventional. We split the training data set into $Q$ folds, using $Q - 1$ folds as the training set and one fold as the validation set. We fit the model (the CRDA classifier defined by (10) and (11)) on the training set for each candidate pair of $K$ and HT-selector function, and compute the misclassification error rate on the validation set. This procedure is repeated so that each of the $Q$ folds is left out in turn as the validation set, and the misclassification rates are accumulated. We then choose the pair in the grid that had the smallest misclassification error rate. In case of ties, we choose the pair with the smaller value of $K$, i.e., we prefer the model with the smaller number of features.
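The CV scheme above can be sketched generically as follows (Python/NumPy; `fit_predict` is a hypothetical callback standing in for training and applying the classifier for a given pair, and the logarithmic grid construction is omitted):

```python
import numpy as np

def cv_select(fit_predict, X, y, K_grid, selectors, Q=5, rng=None):
    """Q-fold CV over the (K, selector) grid: fit on Q-1 folds, count
    misclassifications on the held-out fold, and pick the pair with the
    smallest accumulated error, breaking ties towards smaller K (fewer
    features). fit_predict(Xtr, ytr, Xval, K, sel) is a hypothetical
    callback returning predicted labels for Xval."""
    rng = np.random.default_rng(rng)
    folds = rng.permutation(len(y)) % Q          # random fold assignment
    best = (np.inf, np.inf, None)                # (errors, K, selector)
    for sel in selectors:
        for K in K_grid:
            err = 0
            for q in range(Q):
                tr, va = folds != q, folds == q
                err += np.sum(fit_predict(X[tr], y[tr], X[va], K, sel) != y[va])
            if (err, K) < best[:2]:              # lexicographic tie-break on K
                best = (err, K, sel)
    return best[1], best[2]
```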
III-B CRDA variants
We consider the following three CRDA variants:

CRDA1 uses the Ell1-RSCM estimator to compute the coefficient matrix $\mathbf{B}$. The tuning parameters, the joint-sparsity level $K$ and the HT-selector function, are chosen using the CV procedure of subsection III-A.

CRDA2 is as CRDA1 except that the coefficient matrix $\mathbf{B}$ is computed using the Ell2-RSCM estimator.

CRDA3 uses RiePSCM to compute the coefficient matrix $\mathbf{B}$; a fixed HT-selector function and a fixed joint-sparsity level are used instead of the CV search.

Uniform priors ($\hat\pi_g = 1/G$, $g = 1, \ldots, G$) are used in all analysis results.
III-C Analysis of real data sets
Next, we explore the performance of the CRDA variants in the classification of three different types of data: speech data (data set #1), image data (data set #2) and gene expression data (data set #3), described in Table II. The first data set is constructed for classification of five English vowel letters spoken twice by thirty different subjects, while the second data set consists of modified observations from the MNIST data set of handwritten digits (namely, digits 4 and 9) used in the NIPS 2003 feature selection challenge. These data sets are available at the UC Irvine machine learning repository [20] under the names Isolet and Gisette, respectively. The third data set is from Khan et al. [21] and consists of small round blue cell tumors (SRBCTs) found in children, which are classified as BL (Burkitt lymphoma), EWS (Ewing's sarcoma), NB (neuroblastoma) and RMS (rhabdomyosarcoma). The available data is split randomly into training and test sets of fixed size (see Table II), such that the relative proportions of observations in the classes are preserved as accurately as possible. Several such random splits are performed and averages of the error rates are reported. The basic metrics used are the test error rate (TER), the feature selection rate (FSR) and the average computational time (ACT) in seconds. TER is the misclassification rate obtained by classifying the observations from the test set using the classification rule estimated from the training set. FSR is defined as the fraction $K/p$ of features used by the classifier, and it is a measure of the interpretability of the chosen model. The FSR of CRDA3 will naturally always be higher than the FSR of CRDA1 or CRDA2.
The results of the study, reported in Table I, are divided into three groups. The first group consists of the proposed CRDA classification methods, while the middle group consists of five state-of-the-art classifiers (cf. Table III). The third group consists of three standard off-the-shelf classifiers commonly used in machine learning, which do not perform feature selection. The methods in the third group are RF, which refers to a random forest using decision trees (where the maximum number of splits per tree was 9 and the number of learning cycles was 150), LogitBoost, which refers to the LogitBoost method [22] using stumps (i.e., two-terminal-node decision trees) as base learners and 120 boosting iterations, and LinSVM, which refers to a linear support vector machine (SVM) using a one-versus-one (OVO) coding design and a linear kernel. Note that since $p \gg n$, one can always find a separating hyperplane, and hence a linear kernel suffices for the SVM.
We first compare the CRDA variants against one another. In the case of data set #2, CRDA1 and CRDA3 obtained notably better misclassification rates than CRDA2. In the case of data set #3, CRDA1 and CRDA2 obtained perfect 0% misclassification results while using only 5% of all features. This is a remarkable achievement, especially when compared to the TER/FSR rates of the competing methods. In the case of data set #1, CRDA3 had better performance than CRDA1 and CRDA2, but it attained this result using more features.
Next, we compare CRDA against the methods in the middle group. We notice from Table I that one of the CRDA variants is always the best-performing method in terms of both accuracy (TER) and feature selection rate (FSR). Only in the case of data set #2 is SCRDA able to achieve an error rate (6.3%) as good as that of CRDA3.
Finally, we compare the CRDA methods against the methods in the third group. We first notice that the linear SVM is the best-performing method among the third group, and for data set #2 it achieves the best error rate among all methods. However, it uses all features.
As noted earlier, for data set #3 (Khan et al.), CRDA1 and CRDA2 attained perfect classification (0% error rate) while choosing only 5% of the 2308 genes. The heat map of 115 randomly selected genes is shown in Figure 2a, while Figure 2b shows the heat map of the 115 genes picked by CRDA1. While the former shows random patterns, clearly visible patterns can be seen in the latter heat map. Besides prognostic classification, the set of genes found important by CRDA1 can be used to develop an understanding of the genes that influence SRBCTs.
Table I. Test error rate (TER, %) and feature selection rate (FSR, %) on data sets #1-#3.

            #1 Isolet-vowels    #2 Gisette       #3 Khan et al.
            TER     FSR         TER     FSR      TER     FSR
CRDA1       2.8     21.8        7.0     18.8     0       5.0
CRDA2       2.7     16.3        14.4    8.2      0       5.0
CRDA3       1.3     45.1        6.3     36.7     1.2     35.5
PLDA        3.6     100         40.5    50.8     18.8    82.0
SCRDA       1.6     88.2        6.3     87.0     2.8     81.5
varSelRF    3.0     28.9        7.9     18.8     2.4     78.8
LogitASPLS  26.4    98.6        27.8    90.0     43.2    100
SPCALDA     19.1    73.7        6.4     47.4     3.2     46.9
SVM         1.3     100.0       6.0     100.0    0.8     100.0
LogitBoost  3.0     100.0       7.7     100.0    10.4    100.0
RF          1.8     100.0       9.5     100.0    1.2     100.0
Table II. Summary of the data sets: dimension p, training and test set sizes, training class sizes and number of classes G.

Data set                   p      n_train  n_test  class sizes                                     G    Description
#1 Isolet-vowels [20]      617    100      200     {20, 20, 20, 20, 20}                            5    Spoken English vowel letters
#2 Gisette [20]            5000   350      6650    {175, 175}                                      2    OCR digit recognition ('4' vs '9')
#3 Khan et al. [21]        2308   38       25      {5, 14, 7, 12}                                  4    SRBCTs
#4 Su et al. [23]          5565   62       40      {15, 16, 17, 14}                                4    Multiple mammalian tissues
#5 Gordon et al. [24]      12533  109      72      {90, 19}                                        2    Lung cancer
#6 Burczynski et al. [25]  22283  76       51      {35, 25, 16}                                    3    Inflammatory bowel diseases (IBD)
#7 Yeoh et al. [26]        12625  148      100     {9, 16, 38, 12, 26, 47}                         6    Acute lymphoblastic leukemia (ALL)
#8 Tirosh et al. [27]      13056  114      77      {23, 22, 23, 46}                                4    Yeast species
#9 Ramaswamy et al. [28]   16063  117      73      {7, 6, 7, 7, 13, 7, 6, 6, 18, 7, 7, 7, 7, 12}   14   Cancer
IV Feature selection performance
Next, we create a partially synthetic gene expression data set, based on the third data set in Table II, to study how well the different classifiers are able to detect differentially expressed (DE) genes. Data set #3 consists of expression levels of $p = 2308$ genes; we randomly kept 115 (i.e., 5% of $p$) of the gene expression covariates as significant, i.e., differentially expressed, features, while the remaining irrelevant (non-DE) features are generated as pure noise.
Since we now have a ground truth of differentially expressed genes available, we may evaluate the estimated models using the false-positive rate (FPR) and the false-negative rate (FNR). A false positive is a feature that is picked by the classifier but is not DE in any of the classes. A false negative is a feature that is DE in the true model but is not picked by the classifier. Formally,

$$\mathrm{FPR} = \frac{\mathrm{FP}}{N} \quad \text{and} \quad \mathrm{FNR} = \frac{\mathrm{FN}}{P},$$

where $P$ is the number of truly positive (DE) features, $N$ is the number of truly negative (non-DE) features, $\mathrm{FP}$ is the number of false positives, $\mathrm{FN}$ is the number of false negatives and $\mathrm{TP}$ is the number of true positives; $\mathrm{FP} + \mathrm{TP}$ gives the total number of features picked by the classifier. Again, recall that in the case of the CRDA methods this total is the sparsity level $K$ used in the hard-thresholding operator. Note that both FPR and FNR should be as small as possible for exact and accurate selection (and elimination) of genes.
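The FPR/FNR computation is a simple count over boolean masks; a Python/NumPy sketch:

```python
import numpy as np

def fpr_fnr(selected, truly_de):
    """False-positive and false-negative rates of feature selection:
    FPR = FP / N (N = number of truly non-DE features),
    FNR = FN / P (P = number of truly DE features),
    with `selected` and `truly_de` boolean masks over the p features."""
    selected = np.asarray(selected, bool)
    truly_de = np.asarray(truly_de, bool)
    fp = np.sum(selected & ~truly_de)       # picked but not DE
    fn = np.sum(~selected & truly_de)       # DE but not picked
    return fp / np.sum(~truly_de), fn / np.sum(truly_de)
```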
First, we compare the gene selection performance of CRDA with the conventional gene selection approach [29], which performs a t-test and computes the p-values using permutation analysis with 10,000 permutations. A gene is considered differentially expressed when it shows both statistical and biological significance. Statistical significance is attested when the obtained p-value of the test statistic is below the significance level (false alarm rate). Biological significance is attested when the log ratio of the fold change of mean gene expressions falls outside a cutoff interval. The volcano plot in panel (a) displays the $-\log_{10}$ p-values of the tests for each gene against the fold change; the statistical significance level is shown as a dotted horizontal line and the biological significance bounds as dotted vertical lines. The plot illustrates that 19 out of the 115 differentially expressed genes are both statistically and biologically significant (genes in the upper left and upper right corners of the plot), along with a few that are biologically significant but not statistically significant (lower left and lower right corners). When considering each class in turn against the rest, 60 out of the total of 115 DE genes are picked as being DE. The result of CRDA2-based gene selection is shown in panel (b) for the same data set, where we have plotted the gene index $j$ against the sample variance of the $j$th row vector of the coefficient matrix $\mathbf{B}$. Of the 115 DE genes, CRDA2 selected 107.
Table III. Competing classifiers and their implementations.

Method      Reference                                                              R package or other  Version
PLDA        Penalized LDA [4]                                                      penalizedLDA        1.1
SCRDA       Shrunken centroids RDA [3]                                             rda                 1.0.2.2
varSelRF    Random forest (RF) with variable selection [30]                        varSelRF            0.7.8
LogitASPLS  Adaptive sparse partial least squares based logistic regression [2]    plsgenomics         1.5.2
SPCALDA     Supervised principal component analysis based LDA [31]                 SPCALDA             1.0
Lasso       Regularized multinomial regression with Lasso penalty [32]             GLMNet              3.02
EN          Regularized multinomial regression with Elastic Net (EN) penalty [32]  GLMNet              3.02
As a baseline, we further compute the test error rate of a naive classifier that assigns all observations in the test set to the class with the most observations in the training set. Figure 3(a) displays box plots of FSR and TER for the CRDA methods and three state-of-the-art classifiers. Again, the CRDA classifiers obtained the smallest average TER while choosing the smallest number of features. Figure 3(b) provides box plots of FPR and FNR for all methods. The proposed CRDA methods offer the best feature selection accuracy. In contrast, the SCRDA and varSelRF methods produce large false-positive rates, meaning that they select a large set of non-DE genes as DE. PLDA performs equally well as CRDA in terms of FPR/FNR, but its misclassification rates are considerably larger. Furthermore, the CRDA methods exhibit the smallest amount of variation (i.e., standard error) in all measures (TER/FSR/FPR/FNR). We also notice that varSelRF does not perform well compared to CRDA or other regularized LDA techniques such as SCRDA.
V Analysis of real gene expression data
We compare the performance of the proposed CRDA methods to seven state-of-the-art classifiers on a diverse set of real gene expression (microarray) data sets. These data sets, #4-#9 in Table II, constitute a representative set of the gene expression data sets available in the open literature, having varying numbers of features ($p$), sample sizes ($n$) and classes ($G$). As earlier, in each Monte Carlo trial, the data is divided randomly into a training set and a test set of fixed size, such that the relative proportions of observations in the classes are preserved as accurately as possible. We do not include CRDA3 in this study due to the high computational cost of computing the RiePSCM covariance matrix estimator.
The Gene Expression Omnibus (GEO) data repository contains data sets #6 and #8 under accession numbers GDS1615 and GDS2910, respectively. The ninth data set (also known as the global cancer map (GCM) data set) is available online [33]. The rest of the data sets, namely data sets #4, #5 and #7, are included in the R package datamicroarray (https://github.com/ramhiser/datamicroarray).
Figure 4 displays the average TER and FSR values. A CRDA method obtained the lowest error rate for 5 out of the 6 data sets considered; only for data set #5 (Gordon) did LogitASPLS perform better than CRDA. Furthermore, CRDA achieved this top performance while picking a greatly reduced number of genes: the fraction of genes selected ranged from 5.1% to 23.4%. Only Lasso and EN selected fewer genes than the CRDA methods, but their error rates were much higher in all cases except for data set #5 (Gordon).
For some data sets, such as the seventh (Yeoh) and the eighth (Tirosh), SCRDA and CRDA obtained comparable TER values, but the former fails to eliminate as many genes. Again, we observe that many of the competing classifiers, such as varSelRF, PLDA and SPCALDA, perform rather poorly on some data sets; this is especially the case for the sixth (Burczynski) and ninth (Ramaswamy) data sets.
Data set #9 (Ramaswamy) presents a very difficult challenge for any classification method: it contains $G = 14$ classes, $p = 16063$ genes and only 117 observations in the training set. The results obtained by CRDA clearly illustrate its top performance in very demanding HDLSS data scenarios. The CRDA classifiers outperformed most of the methods by a significant margin: CRDA gives the smallest error rate (25.2%) while at the same time reducing the number of features (genes) used in the classification procedure. The next best method is varSelRF, but it is still far behind (32.6%). We note that the LogitASPLS method failed on this data set due to computational and memory issues and was hence excluded.
Finally, the bottom panel of Figure 4 displays the ACT values (in seconds) of the methods. The variation in ACT between different data sets is due to their different numbers of features ($p$) and observations ($n$). In terms of computational complexity, CRDA loses to Lasso, EN and LinSVM. However, compared to the other methods that are designed to perform feature selection, CRDA is among the most efficient. In particular, LogitASPLS and varSelRF are computationally very intensive.
VI Discussion and Conclusions
We proposed a novel compressive regularized discriminant analysis (CRDA) method for high-dimensional data sets, where the data dimensionality is much larger than the sample size (i.e., $p \gg n$, where $p$ is very large, often several thousand). The method showcases accurate classification results and feature selection ability while being fast to compute. Feature selection is important for post-selection inference in various problems, as it provides insight into which features (e.g., genes) are particularly important in expressing a particular trait (e.g., cancer type).
Our simulation studies and real data analysis examples illustrate that CRDA offers top performance in all three considered facets: classification accuracy, feature selection, and computational complexity. Software (both MATLAB and R) toolboxes to reproduce the results are available: see Section I for links to the packages.
Appendix A Tuning parameter selection of the Ell1- and Ell2-RSCM estimators
The Ell1-RSCM and Ell2-RSCM estimators [7] compute an estimate of the optimal shrinkage parameter $\alpha$ in (5) as

(14)

where $\hat\gamma$ is an estimator of the sphericity parameter $\gamma = p \, \mathrm{tr}(\boldsymbol{\Sigma}^2) / \mathrm{tr}(\boldsymbol{\Sigma})^2$ and $\hat\kappa$ is an estimate of the elliptical kurtosis parameter $\kappa$, calculated from the average sample kurtosis of the marginal variables (suitably scaled). The formula above is derived under the assumption that the data is a random sample from an unspecified elliptically symmetric distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. The Ell1-RSCM and Ell2-RSCM estimators differ only in the estimate of the sphericity parameter used in (14). The estimate of sphericity used by Ell1-RSCM is
(15) 
where $\hat{\mathbf{S}}_{\mathrm{sgn}}$ is the sample spatial sign covariance matrix, defined as

$$\hat{\mathbf{S}}_{\mathrm{sgn}} = \frac{1}{n} \sum_{i=1}^{n} \frac{(\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top}{\| \mathbf{x}_i - \hat{\boldsymbol{\mu}} \|^2},$$

where $\hat{\boldsymbol{\mu}}$ is the spatial median. In Ell2-RSCM, the estimate of sphericity is computed as
(16) 
where $\mathbf{S}$ is the pooled SCM in (3).
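The sample spatial sign covariance matrix underlying the Ell1-RSCM sphericity estimate can be sketched as follows (Python/NumPy; any location estimate can be plugged in for the spatial median in this illustration):

```python
import numpy as np

def spatial_sign_scm(X, mu):
    """Sample spatial sign covariance matrix: the average of outer
    products of the centered observations projected onto the unit
    sphere, (1/n) * sum_i u_i u_i' with u_i = (x_i - mu)/||x_i - mu||.
    In Ell1-RSCM, mu would be the spatial median; here it is a
    user-supplied location estimate (an assumption of this sketch)."""
    U = np.asarray(X, float) - mu
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # project to unit sphere
    return (U.T @ U) / len(U)
```

By construction the result is symmetric with unit trace, since each projected observation has unit norm.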
Appendix B On computing the inverse of the regularized SCM
The matrix inversion can be performed efficiently using the SVD trick [3]. The SVD of the $n \times p$ centered data matrix $\mathbf{X}$ is

$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^\top,$$

where $\mathbf{U} \in \mathbb{R}^{n \times n}$ is orthogonal, $\mathbf{D} \in \mathbb{R}^{n \times n}$ is diagonal, and $\mathbf{V} \in \mathbb{R}^{p \times n}$ has orthonormal columns. Direct computation of this SVD is time consuming; the trick is that $\mathbf{U}$ and $\mathbf{D}$ can be computed first from the eigendecomposition of $\mathbf{X} \mathbf{X}^\top$, which is only an $n \times n$ matrix. After we compute $\mathbf{U}$ and $\mathbf{D}$, we may compute $\mathbf{V}$ as $\mathbf{V} = \mathbf{X}^\top \mathbf{U} \mathbf{D}^{-1}$. Then, using the SVD representation of the SCM, $\mathbf{S} \propto \mathbf{V} \mathbf{D}^2 \mathbf{V}^\top$, and simple algebra, one obtains a simple formula for the inverse of the regularized SCM $\hat{\boldsymbol{\Sigma}}(\alpha) = \alpha \mathbf{S} + (1 - \alpha) \eta \mathbf{I}$ in (5):

$$\hat{\boldsymbol{\Sigma}}(\alpha)^{-1} = \frac{1}{(1 - \alpha)\eta} \Big( \mathbf{I} - \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top \Big),$$

where $\boldsymbol{\Lambda}$ is the $n \times n$ diagonal matrix with entries $\Lambda_{ii} = \alpha \lambda_i / \big( \alpha \lambda_i + (1 - \alpha)\eta \big)$ and $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $\mathbf{S}$ corresponding to the columns of $\mathbf{V}$. This reduces the complexity from $\mathcal{O}(p^3)$ to $\mathcal{O}(p n^2)$, which is a significant saving in the $p \gg n$ case.
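Appendix B can be illustrated end to end: the sketch below inverts an RSCM of the form $\alpha\mathbf{S} + (1-\alpha)\eta\mathbf{I}$ via the eigendecomposition of the small $n \times n$ matrix $\mathbf{X}\mathbf{X}^\top$ (Python/NumPy; the divisor $n$ in the SCM and the scaling $\eta = \mathrm{tr}(\mathbf{S})/p$ are assumptions of this sketch):

```python
import numpy as np

def rscm_inverse(X, alpha):
    """Inverse of alpha*S + (1-alpha)*eta*I, with S = X'X/n and
    eta = tr(S)/p, computed without forming or inverting any p x p
    matrix (the "SVD trick"): the eigenpairs of S are obtained from
    the n x n matrix XX'/n, reducing O(p^3) work to O(p n^2)."""
    n, p = X.shape
    G = X @ X.T / n                        # n x n; same nonzero spectrum as S
    lam, V = np.linalg.eigh(G)
    pos = lam > 1e-12                      # keep the numerically nonzero part
    lam, V = lam[pos], V[:, pos]
    U = X.T @ V / np.sqrt(n * lam)         # matching unit eigenvectors of S
    eta = lam.sum() / p                    # tr(S)/p from the nonzero eigenvalues
    b = (1 - alpha) * eta
    d = alpha * lam / (alpha * lam + b)    # diagonal of the correction term
    return (np.eye(p) - (U * d) @ U.T) / b
```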
Acknowledgment
The research was partially supported by the Academy of Finland grant no. 298118 which is gratefully acknowledged.
References
 [1] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, “Class prediction by nearest shrunken centroids, with applications to DNA microarrays,” Statistical Science, pp. 104–117, 2003.
 [2] G. Durif, L. Modolo, J. Michaelsson, J. E. Mold, S. LambertLacroix, and F. Picard, “High dimensional classification with combined adaptive sparse pls and logistic regression,” Bioinformatics, vol. 34, no. 3, pp. 485–493, 2018.
 [3] Y. Guo, T. Hastie, and R. Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2006.
 [4] D. M. Witten and R. Tibshirani, “Penalized classification using Fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.
 [5] A. Sharma, K. K. Paliwal, S. Imoto, and S. Miyano, “A feature selection method using improved regularized linear discriminant analysis,” Machine vision and applications, vol. 25, no. 3, pp. 775–786, 2014.
 [6] E. Neto, F. Biessmann, H. Aurlien, H. Nordby, and T. Eichele, “Regularized linear discriminant analysis of EEG features in dementia patients,” Frontiers in aging neuroscience, vol. 8, 2016.
 [7] E. Ollila and E. Raninen, “Optimal shrinkage covariance matrix estimation under random sampling from elliptical distributions,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2707–2719, May 2019.
 [8] P. L. Yu, X. Wang, and Y. Zhu, “High dimensional covariance matrix estimation by penalizing the matrix-logarithm transformed likelihood,” Computational Statistics & Data Analysis, vol. 114, pp. 12–25, 2017.
 [9] D. E. Tyler and M. Yi, “Shrinking the sample covariance matrix using convex penalties on the matrixlog transformation,” arXiv preprint arXiv:1903.08281, 2019.
 [10] M. N. Tabassum and E. Ollila, “Compressive regularized discriminant analysis of high-dimensional data with applications to microarray studies,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4204–4208.
 [11] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.
 [12] A. Coluccia, “Regularized covariance matrix estimation via empirical Bayes,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 2127–2131, 2015.
 [13] B. Meriaux, C. Ren, M. N. El Korso, A. Breloy, and P. Forster, “Robust estimation of structured scatter matrices in (mis)matched models,” Signal Processing, vol. 165, pp. 163–174, 2019.
 [14] A. Wiesel, T. Zhang et al., “Structured robust covariance estimation,” Foundations and Trends® in Signal Processing, vol. 8, no. 3, pp. 127–216, 2015.
 [15] E. Ollila and D. E. Tyler, “Regularized M-estimators of scatter matrix,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 6059–6070, 2014.
 [16] F. Pascal, Y. Chitour, and Y. Quek, “Generalized robust shrinkage estimator and its application to STAP detection problem,” IEEE Transactions on Signal Processing, vol. 62, no. 21, pp. 5640–5651, 2014.
 [17] Y. Sun, P. Babu, and D. P. Palomar, “Regularized Tyler’s scatter estimator: Existence, uniqueness, and algorithms,” IEEE Transactions on Signal Processing, vol. 62, no. 19, pp. 5143–5156, 2014.
 [18] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero, “Shrinkage algorithms for MMSE covariance estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5016–5029, 2010.
 [19] E. Ollila, “Optimal highdimensional shrinkage covariance estimation for elliptical distributions,” in Proc. European Signal Processing Conference (EUSIPCO 2017), Kos, Greece, 2017, pp. 1689–1693.
 [20] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml

 [21] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, p. 673, 2001.
 [22] J. Friedman, T. Hastie, R. Tibshirani et al., “Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors),” The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
 [23] A. I. Su, M. P. Cooke, K. A. Ching, Y. Hakak, J. R. Walker, T. Wiltshire, A. P. Orth, R. G. Vega, L. M. Sapinoso, A. Moqrich, A. Patapoutian, G. M. Hampton, P. G. Schultz, and J. B. Hogenesch, “Largescale analysis of the human and mouse transcriptomes.” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 7, pp. 4465–4470, Apr. 2002.
 [24] G. J. Gordon, R. V. Jensen, L.-L. Hsiao, S. R. Gullans, J. E. Blumenstock, S. Ramaswamy, W. G. Richards, D. J. Sugarbaker, and R. Bueno, “Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma,” Cancer Research, vol. 62, no. 17, pp. 4963–4967, 2002.
 [25] M. E. Burczynski, R. L. Peterson, N. C. Twine, K. A. Zuberek, B. J. Brodeur, L. Casciotti, V. Maganti, P. S. Reddy, A. Strahs, F. Immermann, W. Spinelli, U. Schwertschlag, A. M. Slager, M. M. Cotreau, and A. J. Dorner, “Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells,” The Journal of Molecular Diagnostics, vol. 8, no. 1, pp. 51 – 61, 2006.
 [26] E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. K. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer Cell, vol. 1, no. 2, pp. 133–143, 2002.
 [27] I. Tirosh, A. Weinberger, M. Carmi, and N. Barkai, “A genetic signature of interspecies variations in gene expression,” Nature genetics, vol. 38, no. 7, p. 830, 2006.
 [28] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15 149–15 154, 2001.
 [29] X. Cui and G. A. Churchill, “Statistical tests for differential expression in cDNA microarray experiments,” Genome biology, vol. 4, no. 4, p. 210, 2003.
 [30] R. Diaz-Uriarte, “GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest,” BMC Bioinformatics, vol. 8, no. 1, p. 328, Sep 2007.
 [31] Y. Niu, N. Hao, and B. Dong, “A new reduced-rank linear discriminant analysis method and its applications,” Statistica Sinica, vol. 28, no. 1, pp. 189–202, 2018.
 [32] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of statistical software, vol. 33, no. 1, p. 1, 2010.
 [33] Cancer Program Legacy Publication Resources – Broad Institute, “Global cancer map dataset,” 2001, http://portals.broadinstitute.org/cgi-bin/cancer/publications/view/61, last accessed on 2018-09-18.