1 Introduction
Sparse signal approximations are widely used in many applications such as regression or classification where variableselection (i.e., ranking and selection of features) aims at reducing the number of variables (or features) without sacrificing accuracy measured by the test error. Reduction in the set of features facilitates interpretation as well as stabilizes estimation. This is often deemed necessary in the highdimensional (HD) context where the number of features,
, is often several magnitudes larger than the number of observations, , in the training dataset (i.e., ).Many classification techniques assign a dimensional observation to one of the classes (groups or populations) based on the following rule ∈group [ = argmax_g d_g() ], where is called the discriminant function for population . In linear discriminant analysis (LDA), is a linear function of , , for some constant and vector . The vector depends on the unknown covariance matrix of the populations (via its inverse matrix) which is commonly estimated by the pooled sample covariance matrix (SCM). In the HD setting, the SCM is nolonger invertible, and therefore regularized SCM (RSCM) is used for constructing an estimated discriminant function . Such approaches are commonly referred to as regularized LDA methods, which we refer shortly as RDA. See e.g., [1, 2, 3, 4, 5].
Next note that if the th entry of is zero, then the th feature does not contribute in the classification of th population. To eliminate unnecessary features, many authors have proposed to shrink using elementwise softthresholding, e.g., as in shrunken centroids (SC)RDA method [1]. These methods are often difficult to tune because the shrinkage threshold parameter is the same across all groups, but different populations would often benefit from different shrinkage intensity. Consequently, they tend to yield rather higher falsepositive (FP) rates.
Elementwise shrinkage does not achieve simultaneous feature selection as the eliminated feature from group may still affect the discriminant function of group . In this paper, we propose compressive regularized discriminant analysis (CRDA) that promotes simultaneous jointsparsity to pick fewer and differentially expressed variables. CRDA lends ideas from mixed norm minimization in the multiple measurement vectors (MMV) model [6], which is an extension of compressed sensing model to multivariate case. CRDA uses norm based hardthresholding which has the advantage of having a shrinkage parameter that is much easier to tune: namely, jointsparsity level instead of shrinkage threshold as in SCRDA. Our approach also employs a different RSCM estimator compared to SCRDA. The used RSCM has the benefit of being able to attain the minimum mean squared error [7, 8] for an appropriate choice of the regularization parameter. The optimal pair of the tuning parameters can be found via cross validation (CV), but we also propose a computationally simpler approach that uses the RSCM proposed in [8]. This facilitates the computations considerably as only a single variable, the jointsparsity level , needs to be tuned.
The paper is organized as follows. Section 2 describes RDA and SVD based inversion of the used RSCM. In Section 3, the proposed CRDA as well as our tuning parameter selection criteria is introduced. Section 4 provides the results on simulation studies which explore both the featureselection capability and misclassification errors of CRDA, and the competing methods. Classification results on four real microarray datasets are also provided. Section 5 concludes the paper.
2 Regularized LDA
We are given a variate random vector which we need to classify into one of the classes or populations. In LDA, one assumes that the class populations are variate multivariate normal (MVN) with a common positive definite symmetric covariance matrix over each class but distinct class mean vectors , . The problem is then to classify to one of the MVN populations, , . Sometimes prior knowledge is available on proportions of each population and we denote by ,
, the prior probabilities of the classes (
). LDA uses the rule (1) with discriminant functionwhere for .
The LDA rule involves a set of unknown parameters, the class mean vectors and the covariance matrix . These are estimated from the training dataset that consists of observations from each of the classes (). Let denote the class label associated with the th observation, so . Then is the number of observations belonging to th population, and we denote by the relative sample proportions. We assume observations in the training dataset are centered by the sample mean vectors of the classes, _g = ¯_g = 1ng ∑_ c(i)=g _i. Since is centered, the pooled (over groups) sample covariance matrix (SCM) can be written simply as
In practice, an observation is classified using an estimated discriminant function, ^d_g() = ^⊤_g  12 _g^⊤_g + ln_g, where , and is an estimator of . Note that in (2) the prior probabilities s are replaced by their estimates, s. Commonly, the pooled SCM is used as an estimator . Since we are in the regime, where , the pooled SCM is no longer invertible and hence can not be used in (2). To avoid the singularity of the estimated covariance matrix, a commonly used approach in the literature (cf. [7, 8]) is to use a regularized SCM (RSCM), = §+ (1 ) ηI where . SCRDA [1] uses an estimator . However, (2) has some theoretical justification since with an appropriate (data dependent) choice , the obtained RSCM in (2) will be a consistent minimum mean squared error (MMSE) estimator of . Such choices of have been proposed, e.g., in [7] and in [8].
In the HD setup, the main computational burden is related with inverting the matrix in (2). The inversion can be done using the SVDtrick, as follows [1, 9]. The SVD of is
where , , and . Direct computation of SVD is time consuming and the trick is that and can be computed first from SVD of , which is only an matrix. Here is an orthogonal matrix whose first columns are and is an diagonal matrix whose upper left corner matrix is . After we compute and from SVD of , we may compute from by . Then, using the SVD representation of the SCM, , and simple algebra, one obtains a simple formula for the inverse:
(1) 
where . This reduces the complexity from to which is a significant saving in case.
3 Compressive RDA
3.1 Proposed CRDA Approach
In order to explain the proposed compressive RDA approach, we first write the discriminant rule in vector form as
(2) 
where , and . Above notation extract the diagonal of the matrix into a vector, i.e., . The discriminant function in (2) is linear in with coefficient matrix . This means that if the th row of the coefficient matrix is a zero vector , then it implies that th predictor does not contribute to the classification rule and hence can be eliminated. If the coefficient matrix is rowsparse, then the method can be potentially used as a simultaneous feature elimination procedure. In microarray data analysis, this means that gene does not contribute in the classification procedure and thus the rowsparsity of the coefficient matrix allows, at the same time, identify differentially expressed genes.
In the MMV model [6], the goal is to achieve simultaneous sparse reconstruction (SSR) of the signal matrix. The task is to estimate the rowsparse signal matrix , given an observed measurement matrix and an (over complete) basis matrix (or dictionary) . rowsparsity of means that only rows of contain nonzero entries. Commonly, this goal is achieved by mixed matrix norm minimization, where
for some , where denotes the th row of . Values have been advocated in the literature. Many SSR algorithms use hardthresholding operator , defined as transform , which retains the elements of the rows of that possess largest norm and set elements of the other rows to zero. This leads us to define our compressive RDA discriminant function as
(3) 
where , and
where has been defined in (2) and are the sample mean vectors of the classes in (2). Fast formula to compute is given in (2).
Next let us draw attention to SCRDA [1] which uses instead of estimator in (2). Another difference is in its use of elementwise softshrinkage. Namely, SCRDA can also be written in the multivariate form (3), but using ^= S_Δ( ^^1 ^) where is the softthresholding function that is applied elementwise to its matrixvalued argument. That is, the th element of in (3.1) is
where for and denotes the th element of . One disadvantage of SCRDA is the shrinkage thresholding parameter which is the same across all groups, and different populations would often benefit from different shrinkage intensity. A sensible upper bound of is difficult to determine and is highly data dependent. The proposed CRDA on the other hand uses simple to tune jointsparsity level and has the benefit of offering simultaneous jointsparse recovery, i.e., features are eliminated across all groups instead of groupwise.

Setup I  Setup II  
TE  NFS  TE  NFS  
120 (116)  165 (163)  180 (174)  105 (101)  
95 (94)  126 (120)  184 (182)  96 (105)  
84 (81)  112 (114)  185 (177)  94 (96)  
PLDA  117  301  151  148  
SCRDA  97  227  291  349  
NSC  89  290  277  440  
Setup III  
Methods  TE  NFS  DR  FP  
46 (50)  205 (259)  90 (94)  12 (27)  
49 (46)  240 (203)  92 (92)  23 (10)  
50 (52)  238 (252)  89 (92)  27 (27)  
SCRDA  108  282  69  51 

Ramaswamy et al. dataset  Yeoh et al. dataset  Sun et al. dataset  Nakayama et al. dataset  
TE / 47  NFS  TE / 62  NFS  TE / 45  NFS  TE / 26  NFS  
10.6 (9.9)  2634 (4899)  9.6 (7.5)  2525 (4697)  12.5 (12.9)  23320 (27416)  8.3 (7.9)  2941 (6952)  
10.4 (10.3)  2683 (3968)  9.7 (6.0)  2273 (4659)  12.9 (13.3)  20589 (23484)  7.9 (7.6)  3142 (7755)  
10.3 (10.3)  3405 (4530)  9.3 (6.5)  846 (4697)  12.4 (13.5)  21354 (20207)  7.6 (7.6)  2719 (2340)  
PLDA  18.8  5023  NA  NA  15.2  21635  4.4  10479  
SCRDA  24  14874  NA  NA  15.7  54183  2.8  22283  
NSC  16.3  2337  NA  NA  15  30005  4.2  5908 
3.2 Model (Parameters) Selection
We employ fold CV to estimate the optimal pair using a 2D grid of candidate values of the tuning parameters, where and . Often there are several pairs that yield the minimal crossvalidation error from the training dataset and each pair can exhibit varying degree of sparsity (number of features selected). Among them, we would prefer the pair that had minimal number of features. Since a pair with minimal CV error may not yield a classifier that is at the same time sparse, one may wish to set a lower bound for the number of features selected (NFS) in order to enhance the interpretability of the discriminant function.
Let denote a CV error for a pair . To have a tradeoff between a minimum (CVbased) training error and NFS, we use a threshold and choose only the pairs which have CV error smaller than , i.e., pairs which verify . From these pairs, the final optimal pair is chosen as the one that has the smallest NFS value. For finding the optimal pair, we utilize a uniform grid of 100 values and a uniform grid of 25 values.
We compare the CV approach to computationally much lighter approach which uses the estimated parameter given in [8]. We note that value of can be computed efficiently using the SVD trick. Given the optimal RSCM based on we then estimate the sparsity level using CV estimate . This reduces the computational cost significantly.
4 Numerical examples
The simulation study investigates the performance of CRDA based classifiers using different simulation setups commonly used in the RDA literature (e.g. in [1, 3, 10, 11]) and draw a comparison with the available results, against the nearest shrunken centroids (NSC) [2], SCRDA [1] and PLDA [3]. For simulation setups I and II, we generate 1200 observations from MVN distribution, , with equal probabilities for each of groups. The observations are divided into three sets: (i) the validation set with 100 observations finds the tuning parameters, (ii) then 100 observations in the training set estimate and (iii) the rest 1000 form the test set for calculating misclassification test errors (TE). A total of out of features differ between the groups. In setup I, contains nonzeros for each group and rest all zeros, i.e., for . While, if and zero otherwise for setup II. Table 1 lists the average of the TE and NFS for each classifier using 25 MC trials and 5fold CV.
The third simulation setup resembles real gene expression data. We generate training and test observations each having features. All groups have equal probabilities and follow MVN distribution for . We have and contains all zeros except first entries (i.e., true positives) with value and . Each group employs following blockdiagonal autoregressive covariancestructure
where indicates the direct sum (not the Kronecker sum) of block matrices having the AR covariance structure
where is the correlation which is different for each group, namely, , and . This setup mimics real microarray data as genes are correlated within a pathway and independent between the pathways. Table 1 reveals the higher accuracy of the proposed CRDA methods compared to SCRDA when measured by TE, NFS, detection rate (DR) and FP rates. The results are averaged over 10 MonteCarlo trials using 10fold CV.
Next we do a comparison based on real microarray datasets. A summary of the used datasets is given below:
Dataset  Disease  
Ramaswamy et al. [12]  190  16,063  14  Cancer 
Yeoh et al. [13]  248  12,625  6  Leukemia 
Sun et al. [14]  180  54,613  4  Glioma 
Nakayama et al. [15]  105  22,283  10  Sarcoma 
We compute the results for each dataset over 10 trainingtest set splits, each with a random choice of training and test set containing 75% and 25% of the total observations, respectively. The average results of classification and geneselection by CRDA methods are given in Table 2 with available comparison results. The proposed CRDA based classifiers showcase better classification and featureselection results for all simulation setups. Overall, it seems that and norm based CRDA methods are doing better as compared to others. Moreover, the CRDA based on norm appears to have best overall performance. Note that the proposed CRDA classifiers outperform other methods with a significant margin in the case of Ramaswamy et al. (with 14 groups) and Sun et al. (of genes).
5 Discussions and Conclusions
We proposed a modified version of LDA, called compressive regularized discriminant CRDA, for analysis of data sets in high dimension low sample size situations. CRDA was shown to outperform competing methods in most of the cases. It also had the best detection rate which illustrates that the method can be a useful tool for accurate selection of (differentially expressed) genes in microarray studies.
References
 [1] Yaqian Guo, Trevor Hastie, and Robert Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2006.
 [2] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu, “Class prediction by nearest shrunken centroids, with applications to dna microarrays,” Statistical Science, pp. 104–117, 2003.
 [3] Daniela M Witten and Robert Tibshirani, “Penalized classification using fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.
 [4] Alok Sharma, Kuldip K Paliwal, Seiya Imoto, and Satoru Miyano, “A feature selection method using improved regularized linear discriminant analysis,” Machine vision and applications, vol. 25, no. 3, pp. 775–786, 2014.
 [5] Emanuel Neto, Felix Biessmann, Harald Aurlien, Helge Nordby, and Tom Eichele, “Regularized linear discriminant analysis of eeg features in dementia patients,” Frontiers in aging neuroscience, vol. 8, 2016.
 [6] M. F. Duarte and Y. C. Eldar, “Structured compressed sensing: From theory to applications,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053–4085, 2011.

[7]
Olivier Ledoit and Michael Wolf,
“A wellconditioned estimator for largedimensional covariance
matrices,”
Journal of multivariate analysis
, vol. 88, no. 2, pp. 365–411, 2004.  [8] Esa Ollila, “Optimal highdimensional shrinkage covariance estimation for elliptical distributions,” in Proc. European Signal Processing Conference (EUSIPCO 2017), Kos, Greece, 2017, pp. 1689–1693.
 [9] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, The elements of statistical learning, vol. 1, Springer series in statistics New York, 2001.
 [10] John A Ramey and Phil D Young, “A comparison of regularization methods applied to the linear discriminant function with highdimensional microarray data,” Journal of Statistical Computation and Simulation, vol. 83, no. 3, pp. 581–596, 2013.
 [11] Yan Zhou, Baoxue Zhang, Gaorong Li, Tiejun Tong, and Xiang Wan, “Gdrda: A new regularized discriminant analysis for highdimensional data,” Journal of Computational Biology, 2017.
 [12] Sridhar Ramaswamy, Pablo Tamayo, Ryan Rifkin, Sayan Mukherjee, ChenHsiang Yeang, Michael Angelo, Christine Ladd, Michael Reich, Eva Latulippe, Jill P Mesirov, et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15149–15154, 2001.
 [13] EngJuh Yeoh, Mary E Ross, Sheila A Shurtleff, W Kent Williams, Divyen Patel, Rami Mahfouz, Fred G Behm, Susana C Raimondi, Mary V Relling, Anami Patel, et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer cell, vol. 1, no. 2, pp. 133–143, 2002.

[14]
Lixin Sun, AiMin Hui, Qin Su, Alexander Vortmeyer, Yuri Kotliarov, Sandra
Pastorino, Antonino Passaniti, Jayant Menon, Jennifer Walling, Rolando
Bailey, et al.,
“Neuronal and gliomaderived stem cell factor induces angiogenesis within the brain,”
Cancer cell, vol. 9, no. 4, pp. 287–300, 2006.  [15] Robert Nakayama, Takeshi Nemoto, Hiro Takahashi, Tsutomu Ohta, Akira Kawai, Kunihiko Seki, Teruhiko Yoshida, Yoshiaki Toyama, Hitoshi Ichikawa, and Tadashi Hasegawa, “Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma,” Modern pathology, vol. 20, no. 7, pp. 749–759, 2007.
Comments
There are no comments yet.