Sparse signal approximations are widely used in many applications such as regression or classification where variable-selection (i.e., ranking and selection of features) aims at reducing the number of variables (or features) without sacrificing accuracy measured by the test error. Reduction in the set of features facilitates interpretation as well as stabilizes estimation. This is often deemed necessary in the high-dimensional (HD) context where the number of features,, is often several magnitudes larger than the number of observations, , in the training dataset (i.e., ).
Many classification techniques assign a -dimensional observation to one of the classes (groups or populations) based on the following rule ∈group [ = argmax_g d_g() ], where is called the discriminant function for population . In linear discriminant analysis (LDA), is a linear function of , , for some constant and vector . The vector depends on the unknown covariance matrix of the populations (via its inverse matrix) which is commonly estimated by the pooled sample covariance matrix (SCM). In the HD setting, the SCM is no-longer invertible, and therefore regularized SCM (RSCM) is used for constructing an estimated discriminant function . Such approaches are commonly referred to as regularized LDA methods, which we refer shortly as RDA. See e.g., [1, 2, 3, 4, 5].
Next note that if the -th entry of is zero, then the -th feature does not contribute in the classification of -th population. To eliminate unnecessary features, many authors have proposed to shrink using element-wise soft-thresholding, e.g., as in shrunken centroids (SC)RDA method . These methods are often difficult to tune because the shrinkage threshold parameter is the same across all groups, but different populations would often benefit from different shrinkage intensity. Consequently, they tend to yield rather higher false-positive (FP) rates.
Element-wise shrinkage does not achieve simultaneous feature selection as the eliminated feature from group may still affect the discriminant function of group . In this paper, we propose compressive regularized discriminant analysis (CRDA) that promotes simultaneous joint-sparsity to pick fewer and differentially expressed variables. CRDA lends ideas from mixed norm minimization in the multiple measurement vectors (MMV) model , which is an extension of compressed sensing model to multivariate case. CRDA uses -norm based hard-thresholding which has the advantage of having a shrinkage parameter that is much easier to tune: namely, joint-sparsity level instead of shrinkage threshold as in SCRDA. Our approach also employs a different RSCM estimator compared to SCRDA. The used RSCM has the benefit of being able to attain the minimum mean squared error [7, 8] for an appropriate choice of the regularization parameter. The optimal pair of the tuning parameters can be found via cross validation (CV), but we also propose a computationally simpler approach that uses the RSCM proposed in . This facilitates the computations considerably as only a single variable, the joint-sparsity level , needs to be tuned.
The paper is organized as follows. Section 2 describes RDA and SVD based inversion of the used RSCM. In Section 3, the proposed CRDA as well as our tuning parameter selection criteria is introduced. Section 4 provides the results on simulation studies which explore both the feature-selection capability and misclassification errors of CRDA, and the competing methods. Classification results on four real microarray datasets are also provided. Section 5 concludes the paper.
2 Regularized LDA
We are given a -variate random vector which we need to classify into one of the classes or populations. In LDA, one assumes that the class populations are -variate multivariate normal (MVN) with a common positive definite symmetric covariance matrix over each class but distinct class mean vectors , . The problem is then to classify to one of the MVN populations, , . Sometimes prior knowledge is available on proportions of each population and we denote by ,
, the prior probabilities of the classes (). LDA uses the rule (1) with discriminant function
where for .
The LDA rule involves a set of unknown parameters, the class mean vectors and the covariance matrix . These are estimated from the training dataset that consists of observations from each of the classes (). Let denote the class label associated with the -th observation, so . Then is the number of observations belonging to -th population, and we denote by the relative sample proportions. We assume observations in the training dataset are centered by the sample mean vectors of the classes, _g = ¯_g = 1ng ∑_ c(i)=g _i. Since is centered, the pooled (over groups) sample covariance matrix (SCM) can be written simply as
In practice, an observation is classified using an estimated discriminant function, ^d_g() = ^⊤_g - 12 _g^⊤_g + ln_g, where , and is an estimator of . Note that in (2) the prior probabilities -s are replaced by their estimates, -s. Commonly, the pooled SCM is used as an estimator . Since we are in the regime, where , the pooled SCM is no longer invertible and hence can not be used in (2). To avoid the singularity of the estimated covariance matrix, a commonly used approach in the literature (cf. [7, 8]) is to use a regularized SCM (RSCM), = §+ (1- ) ηI where . SCRDA  uses an estimator . However, (2) has some theoretical justification since with an appropriate (data dependent) choice , the obtained RSCM in (2) will be a consistent minimum mean squared error (MMSE) estimator of . Such choices of have been proposed, e.g., in  and in .
where , , and . Direct computation of SVD is time consuming and the trick is that and can be computed first from SVD of , which is only an matrix. Here is an orthogonal matrix whose first -columns are and is an diagonal matrix whose upper left corner matrix is . After we compute and from SVD of , we may compute from by . Then, using the SVD representation of the SCM, , and simple algebra, one obtains a simple formula for the inverse:
where . This reduces the complexity from to which is a significant saving in case.
3 Compressive RDA
3.1 Proposed CRDA Approach
In order to explain the proposed compressive RDA approach, we first write the discriminant rule in vector form as
where , and . Above notation extract the diagonal of the matrix into a vector, i.e., . The discriminant function in (2) is linear in with coefficient matrix . This means that if the -th row of the coefficient matrix is a zero vector , then it implies that -th predictor does not contribute to the classification rule and hence can be eliminated. If the coefficient matrix is row-sparse, then the method can be potentially used as a simultaneous feature elimination procedure. In microarray data analysis, this means that gene does not contribute in the classification procedure and thus the row-sparsity of the coefficient matrix allows, at the same time, identify differentially expressed genes.
In the MMV model , the goal is to achieve simultaneous sparse reconstruction (SSR) of the signal matrix. The task is to estimate the -rowsparse signal matrix , given an observed measurement matrix and an (over complete) basis matrix (or dictionary) . -rowsparsity of means that only rows of contain non-zero entries. Commonly, this goal is achieved by mixed matrix norm minimization, where
for some , where denotes the -th row of . Values have been advocated in the literature. Many SSR algorithms use hard-thresholding operator , defined as transform , which retains the elements of the rows of that possess largest norm and set elements of the other rows to zero. This leads us to define our compressive RDA discriminant function as
where , and
Next let us draw attention to SCRDA  which uses instead of estimator in (2). Another difference is in its use of element-wise soft-shrinkage. Namely, SCRDA can also be written in the multivariate form (3), but using ^= S_Δ( ^^-1 ^) where is the soft-thresholding function that is applied element-wise to its matrix-valued argument. That is, the -th element of in (3.1) is
where for and denotes the -th element of . One disadvantage of SCRDA is the shrinkage thresholding parameter which is the same across all groups, and different populations would often benefit from different shrinkage intensity. A sensible upper bound of is difficult to determine and is highly data dependent. The proposed CRDA on the other hand uses simple to tune joint-sparsity level and has the benefit of offering simultaneous joint-sparse recovery, i.e., features are eliminated across all groups instead of group-wise.
|Setup I||Setup II|
|120 (116)||165 (163)||180 (174)||105 (101)|
|95 (94)||126 (120)||184 (182)||96 (105)|
|84 (81)||112 (114)||185 (177)||94 (96)|
|46 (50)||205 (259)||90 (94)||12 (27)|
|49 (46)||240 (203)||92 (92)||23 (10)|
|50 (52)||238 (252)||89 (92)||27 (27)|
|Ramaswamy et al. dataset||Yeoh et al. dataset||Sun et al. dataset||Nakayama et al. dataset|
|TE / 47||NFS||TE / 62||NFS||TE / 45||NFS||TE / 26||NFS|
|10.6 (9.9)||2634 (4899)||9.6 (7.5)||2525 (4697)||12.5 (12.9)||23320 (27416)||8.3 (7.9)||2941 (6952)|
|10.4 (10.3)||2683 (3968)||9.7 (6.0)||2273 (4659)||12.9 (13.3)||20589 (23484)||7.9 (7.6)||3142 (7755)|
|10.3 (10.3)||3405 (4530)||9.3 (6.5)||846 (4697)||12.4 (13.5)||21354 (20207)||7.6 (7.6)||2719 (2340)|
3.2 Model (Parameters) Selection
We employ -fold CV to estimate the optimal pair using a 2D grid of candidate values of the tuning parameters, where and . Often there are several pairs that yield the minimal cross-validation error from the training dataset and each pair can exhibit varying degree of sparsity (number of features selected). Among them, we would prefer the pair that had minimal number of features. Since a pair with minimal CV error may not yield a classifier that is at the same time sparse, one may wish to set a lower bound for the number of features selected (NFS) in order to enhance the interpretability of the discriminant function.
Let denote a CV error for a pair . To have a trade-off between a minimum (CV-based) training error and NFS, we use a threshold and choose only the pairs which have CV error smaller than , i.e., pairs which verify . From these pairs, the final optimal pair is chosen as the one that has the smallest NFS value. For finding the optimal pair, we utilize a uniform grid of 100 -values and a uniform grid of 25 -values.
We compare the CV approach to computationally much lighter approach which uses the estimated parameter given in . We note that value of can be computed efficiently using the SVD trick. Given the optimal RSCM based on we then estimate the sparsity level using CV estimate . This reduces the computational cost significantly.
4 Numerical examples
The simulation study investigates the performance of CRDA based classifiers using different simulation setups commonly used in the RDA literature (e.g. in [1, 3, 10, 11]) and draw a comparison with the available results, against the nearest shrunken centroids (NSC) , SCRDA  and PLDA . For simulation setups I and II, we generate 1200 observations from MVN distribution, , with equal probabilities for each of groups. The observations are divided into three sets: (i) the validation set with 100 observations finds the tuning parameters, (ii) then 100 observations in the training set estimate and (iii) the rest 1000 form the test set for calculating misclassification test errors (TE). A total of out of features differ between the groups. In setup I, contains nonzeros for each group and rest all zeros, i.e., for . While, if and zero otherwise for setup II. Table 1 lists the average of the TE and NFS for each classifier using 25 MC trials and 5-fold CV.
The third simulation setup resembles real gene expression data. We generate training and test observations each having features. All groups have equal probabilities and follow MVN distribution for . We have and contains all zeros except first -entries (i.e., true positives) with value and . Each group employs following block-diagonal auto-regressive covariance-structure
where indicates the direct sum (not the Kronecker sum) of block matrices having the AR covariance structure
where is the correlation which is different for each group, namely, , and . This setup mimics real microarray data as genes are correlated within a pathway and independent between the pathways. Table 1 reveals the higher accuracy of the proposed CRDA methods compared to SCRDA when measured by TE, NFS, detection rate (DR) and FP rates. The results are averaged over 10 Monte-Carlo trials using 10-fold CV.
Next we do a comparison based on real microarray datasets. A summary of the used datasets is given below:
|Ramaswamy et al. ||190||16,063||14||Cancer|
|Yeoh et al. ||248||12,625||6||Leukemia|
|Sun et al. ||180||54,613||4||Glioma|
|Nakayama et al. ||105||22,283||10||Sarcoma|
We compute the results for each dataset over 10 training-test set splits, each with a random choice of training and test set containing 75% and 25% of the total observations, respectively. The average results of classification and gene-selection by CRDA methods are given in Table 2 with available comparison results. The proposed CRDA based classifiers showcase better classification and feature-selection results for all simulation setups. Overall, it seems that and -norm based CRDA methods are doing better as compared to others. Moreover, the CRDA based on -norm appears to have best overall performance. Note that the proposed CRDA classifiers outperform other methods with a significant margin in the case of Ramaswamy et al. (with 14 groups) and Sun et al. (of genes).
5 Discussions and Conclusions
We proposed a modified version of LDA, called compressive regularized discriminant CRDA, for analysis of data sets in high dimension low sample size situations. CRDA was shown to outperform competing methods in most of the cases. It also had the best detection rate which illustrates that the method can be a useful tool for accurate selection of (differentially expressed) genes in microarray studies.
-  Yaqian Guo, Trevor Hastie, and Robert Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2006.
-  Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu, “Class prediction by nearest shrunken centroids, with applications to dna microarrays,” Statistical Science, pp. 104–117, 2003.
-  Daniela M Witten and Robert Tibshirani, “Penalized classification using fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.
-  Alok Sharma, Kuldip K Paliwal, Seiya Imoto, and Satoru Miyano, “A feature selection method using improved regularized linear discriminant analysis,” Machine vision and applications, vol. 25, no. 3, pp. 775–786, 2014.
-  Emanuel Neto, Felix Biessmann, Harald Aurlien, Helge Nordby, and Tom Eichele, “Regularized linear discriminant analysis of eeg features in dementia patients,” Frontiers in aging neuroscience, vol. 8, 2016.
-  M. F. Duarte and Y. C. Eldar, “Structured compressed sensing: From theory to applications,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053–4085, 2011.
Olivier Ledoit and Michael Wolf,
“A well-conditioned estimator for large-dimensional covariance
Journal of multivariate analysis, vol. 88, no. 2, pp. 365–411, 2004.
-  Esa Ollila, “Optimal high-dimensional shrinkage covariance estimation for elliptical distributions,” in Proc. European Signal Processing Conference (EUSIPCO 2017), Kos, Greece, 2017, pp. 1689–1693.
-  Jerome Friedman, Trevor Hastie, and Robert Tibshirani, The elements of statistical learning, vol. 1, Springer series in statistics New York, 2001.
-  John A Ramey and Phil D Young, “A comparison of regularization methods applied to the linear discriminant function with high-dimensional microarray data,” Journal of Statistical Computation and Simulation, vol. 83, no. 3, pp. 581–596, 2013.
-  Yan Zhou, Baoxue Zhang, Gaorong Li, Tiejun Tong, and Xiang Wan, “Gd-rda: A new regularized discriminant analysis for high-dimensional data,” Journal of Computational Biology, 2017.
-  Sridhar Ramaswamy, Pablo Tamayo, Ryan Rifkin, Sayan Mukherjee, Chen-Hsiang Yeang, Michael Angelo, Christine Ladd, Michael Reich, Eva Latulippe, Jill P Mesirov, et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15149–15154, 2001.
-  Eng-Juh Yeoh, Mary E Ross, Sheila A Shurtleff, W Kent Williams, Divyen Patel, Rami Mahfouz, Fred G Behm, Susana C Raimondi, Mary V Relling, Anami Patel, et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer cell, vol. 1, no. 2, pp. 133–143, 2002.
Lixin Sun, Ai-Min Hui, Qin Su, Alexander Vortmeyer, Yuri Kotliarov, Sandra
Pastorino, Antonino Passaniti, Jayant Menon, Jennifer Walling, Rolando
Bailey, et al.,
“Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain,”Cancer cell, vol. 9, no. 4, pp. 287–300, 2006.
-  Robert Nakayama, Takeshi Nemoto, Hiro Takahashi, Tsutomu Ohta, Akira Kawai, Kunihiko Seki, Teruhiko Yoshida, Yoshiaki Toyama, Hitoshi Ichikawa, and Tadashi Hasegawa, “Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma,” Modern pathology, vol. 20, no. 7, pp. 749–759, 2007.