# Compressive Regularized Discriminant Analysis of High-Dimensional Data with Applications to Microarray Studies

We propose a modification of linear discriminant analysis, referred to as compressive regularized discriminant analysis (CRDA), for analysis of high-dimensional datasets. CRDA is specially designed for feature elimination purpose and can be used as gene selection method in microarray studies. CRDA lends ideas from ℓ_q,1 norm minimization algorithms in the multiple measurement vectors (MMV) model and utilizes joint-sparsity promoting hard thresholding for feature elimination. A regularization of the sample covariance matrix is also needed as we consider the challenging scenario where the number of features (variables) is comparable or exceeding the sample size of the training dataset. A simulation study and four examples of real-life microarray datasets evaluate the performances of CRDA based classifiers. Overall, the proposed method gives fewer misclassification errors than its competitors, while at the same time achieving accurate feature selection.

## Authors

• 4 publications
• 17 publications
05/09/2020

### A Compressive Classification Framework for High-Dimensional Data

We propose a compressive classification framework for settings where the...
02/03/2016

### High-Dimensional Regularized Discriminant Analysis

Regularized discriminant analysis (RDA), proposed by Friedman (1989), is...
07/04/2018

### Diagonal Discriminant Analysis with Feature Selection for High Dimensional Data

We introduce a new method of performing high dimensional discriminant an...
10/02/2020

### Regularized K-means through hard-thresholding

We study a framework of regularized K-means methods based on direct pena...
11/01/2017

04/09/2018

### High-dimensional Linear Discriminant Analysis: Optimality, Adaptive Algorithm, and Missing Data

This paper aims to develop an optimality theory for linear discriminant ...
03/19/2016

### L0-norm Sparse Graph-regularized SVD for Biclustering

Learning the "blocking" structure is a central challenge for high dimens...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Sparse signal approximations are widely used in many applications such as regression or classification where variable-selection (i.e., ranking and selection of features) aims at reducing the number of variables (or features) without sacrificing accuracy measured by the test error. Reduction in the set of features facilitates interpretation as well as stabilizes estimation. This is often deemed necessary in the high-dimensional (HD) context where the number of features,

, is often several magnitudes larger than the number of observations, , in the training dataset (i.e., ).

Many classification techniques assign a -dimensional observation to one of the classes (groups or populations) based on the following rule ∈group[  = argmax_g d_g()  ], where is called the discriminant function for population . In linear discriminant analysis (LDA), is a linear function of , , for some constant and vector . The vector depends on the unknown covariance matrix of the populations (via its inverse matrix) which is commonly estimated by the pooled sample covariance matrix (SCM). In the HD setting, the SCM is no-longer invertible, and therefore regularized SCM (RSCM) is used for constructing an estimated discriminant function . Such approaches are commonly referred to as regularized LDA methods, which we refer shortly as RDA. See e.g., [1, 2, 3, 4, 5].

Next note that if the -th entry of is zero, then the -th feature does not contribute in the classification of -th population. To eliminate unnecessary features, many authors have proposed to shrink using element-wise soft-thresholding, e.g., as in shrunken centroids (SC)RDA method [1]. These methods are often difficult to tune because the shrinkage threshold parameter is the same across all groups, but different populations would often benefit from different shrinkage intensity. Consequently, they tend to yield rather higher false-positive (FP) rates.

Element-wise shrinkage does not achieve simultaneous feature selection as the eliminated feature from group may still affect the discriminant function of group . In this paper, we propose compressive regularized discriminant analysis (CRDA) that promotes simultaneous joint-sparsity to pick fewer and differentially expressed variables. CRDA lends ideas from mixed norm minimization in the multiple measurement vectors (MMV) model [6], which is an extension of compressed sensing model to multivariate case. CRDA uses -norm based hard-thresholding which has the advantage of having a shrinkage parameter that is much easier to tune: namely, joint-sparsity level instead of shrinkage threshold as in SCRDA. Our approach also employs a different RSCM estimator compared to SCRDA. The used RSCM has the benefit of being able to attain the minimum mean squared error [7, 8] for an appropriate choice of the regularization parameter. The optimal pair of the tuning parameters can be found via cross validation (CV), but we also propose a computationally simpler approach that uses the RSCM proposed in [8]. This facilitates the computations considerably as only a single variable, the joint-sparsity level , needs to be tuned.

The paper is organized as follows. Section 2 describes RDA and SVD based inversion of the used RSCM. In Section 3, the proposed CRDA as well as our tuning parameter selection criteria is introduced. Section 4 provides the results on simulation studies which explore both the feature-selection capability and misclassification errors of CRDA, and the competing methods. Classification results on four real microarray datasets are also provided. Section 5 concludes the paper.

## 2 Regularized LDA

We are given a -variate random vector which we need to classify into one of the classes or populations. In LDA, one assumes that the class populations are -variate multivariate normal (MVN) with a common positive definite symmetric covariance matrix over each class but distinct class mean vectors , . The problem is then to classify to one of the MVN populations, , . Sometimes prior knowledge is available on proportions of each population and we denote by ,

, the prior probabilities of the classes (

). LDA uses the rule (1) with discriminant function

 dg(\x)=\x⊤\bebg−12\m⊤g\bebg+lnpg,

where for .

The LDA rule involves a set of unknown parameters, the class mean vectors and the covariance matrix . These are estimated from the training dataset that consists of observations from each of the classes (). Let denote the class label associated with the -th observation, so . Then is the number of observations belonging to -th population, and we denote by the relative sample proportions. We assume observations in the training dataset are centered by the sample mean vectors of the classes, _g = ¯_g = 1ng ∑_ c(i)=g _i. Since is centered, the pooled (over groups) sample covariance matrix (SCM) can be written simply as

 \lx@sectionsign=1n\Xs\Xs⊤.

In practice, an observation is classified using an estimated discriminant function, ^d_g() = ^⊤_g - 12 _g^⊤_g + ln_g, where , and is an estimator of . Note that in (2) the prior probabilities -s are replaced by their estimates, -s. Commonly, the pooled SCM is used as an estimator . Since we are in the regime, where , the pooled SCM is no longer invertible and hence can not be used in (2). To avoid the singularity of the estimated covariance matrix, a commonly used approach in the literature (cf. [7, 8]) is to use a regularized SCM (RSCM), =  §+ (1- )   ηI where . SCRDA [1] uses an estimator . However, (2) has some theoretical justification since with an appropriate (data dependent) choice , the obtained RSCM in (2) will be a consistent minimum mean squared error (MMSE) estimator of . Such choices of have been proposed, e.g., in [7] and in [8].

In the HD setup, the main computational burden is related with inverting the matrix in (2). The inversion can be done using the SVD-trick, as follows [1, 9]. The SVD of is

 \Xs=\Us\Ds\Vs⊤,

where , , and . Direct computation of SVD is time consuming and the trick is that and can be computed first from SVD of , which is only an matrix. Here is an orthogonal matrix whose first -columns are and is an diagonal matrix whose upper left corner matrix is . After we compute and from SVD of , we may compute from by . Then, using the SVD representation of the SCM, , and simple algebra, one obtains a simple formula for the inverse:

 \Sigit =\Us[(\aln\Ds2+(1−\al)η\boIm)−1−1(1−\al)η\boIm]\Us⊤ +1(1−\al)η\boI\pdim, (1)

where . This reduces the complexity from to which is a significant saving in case.

## 3 Compressive RDA

### 3.1 Proposed CRDA Approach

In order to explain the proposed compressive RDA approach, we first write the discriminant rule in vector form as

 \bod(\x) =(d1(\x),…,dG(\x)) =\x⊤\B−12diag(\M⊤\B)+ln\bop, (2)

where , and . Above notation extract the diagonal of the matrix into a vector, i.e., . The discriminant function in (2) is linear in with coefficient matrix . This means that if the -th row of the coefficient matrix is a zero vector , then it implies that -th predictor does not contribute to the classification rule and hence can be eliminated. If the coefficient matrix is row-sparse, then the method can be potentially used as a simultaneous feature elimination procedure. In microarray data analysis, this means that gene does not contribute in the classification procedure and thus the row-sparsity of the coefficient matrix allows, at the same time, identify differentially expressed genes.

In the MMV model [6], the goal is to achieve simultaneous sparse reconstruction (SSR) of the signal matrix. The task is to estimate the -rowsparse signal matrix , given an observed measurement matrix and an (over complete) basis matrix (or dictionary) . -rowsparsity of means that only rows of contain non-zero entries. Commonly, this goal is achieved by mixed matrix norm minimization, where

 ∥\B∥q,1=\pdim∑i=1∥\beb[i]∥q,

for some , where denotes the -th row of . Values have been advocated in the literature. Many SSR algorithms use hard-thresholding operator , defined as transform , which retains the elements of the rows of that possess largest norm and set elements of the other rows to zero. This leads us to define our compressive RDA discriminant function as

 ^\bod(\x) =(^d1(\x),…,^dG(\x)) =\x⊤^\B−12diag(^\M⊤^\B)+ln\bom\pr, (3)

where , and

 ^\B=HK(^\Sig−1^\M,q)

where has been defined in (2) and are the sample mean vectors of the classes in (2). Fast formula to compute is given in (2).

Next let us draw attention to SCRDA [1] which uses instead of estimator in (2). Another difference is in its use of element-wise soft-shrinkage. Namely, SCRDA can also be written in the multivariate form (3), but using ^= S_Δ( ^^-1 ^) where is the soft-thresholding function that is applied element-wise to its matrix-valued argument. That is, the -th element of in (3.1) is

 ^bij=SΔ(tij)=sign(tij)(|tij|−Δ)+

where for and denotes the -th element of . One disadvantage of SCRDA is the shrinkage thresholding parameter which is the same across all groups, and different populations would often benefit from different shrinkage intensity. A sensible upper bound of is difficult to determine and is highly data dependent. The proposed CRDA on the other hand uses simple to tune joint-sparsity level and has the benefit of offering simultaneous joint-sparse recovery, i.e., features are eliminated across all groups instead of group-wise.

### 3.2 Model (Parameters) Selection

We employ -fold CV to estimate the optimal pair using a 2D grid of candidate values of the tuning parameters, where and . Often there are several pairs that yield the minimal cross-validation error from the training dataset and each pair can exhibit varying degree of sparsity (number of features selected). Among them, we would prefer the pair that had minimal number of features. Since a pair with minimal CV error may not yield a classifier that is at the same time sparse, one may wish to set a lower bound for the number of features selected (NFS) in order to enhance the interpretability of the discriminant function.

Let denote a CV error for a pair . To have a trade-off between a minimum (CV-based) training error and NFS, we use a threshold and choose only the pairs which have CV error smaller than , i.e., pairs which verify . From these pairs, the final optimal pair is chosen as the one that has the smallest NFS value. For finding the optimal pair, we utilize a uniform grid of 100 -values and a uniform grid of 25 -values.

We compare the CV approach to computationally much lighter approach which uses the estimated parameter given in [8]. We note that value of can be computed efficiently using the SVD trick. Given the optimal RSCM based on we then estimate the sparsity level using CV estimate . This reduces the computational cost significantly.

## 4 Numerical examples

The simulation study investigates the performance of CRDA based classifiers using different simulation setups commonly used in the RDA literature (e.g. in [1, 3, 10, 11]) and draw a comparison with the available results, against the nearest shrunken centroids (NSC) [2], SCRDA [1] and PLDA [3]. For simulation setups I and II, we generate 1200 observations from MVN distribution, , with equal probabilities for each of groups. The observations are divided into three sets: (i) the validation set with 100 observations finds the tuning parameters, (ii) then 100 observations in the training set estimate and (iii) the rest 1000 form the test set for calculating misclassification test errors (TE). A total of out of features differ between the groups. In setup I, contains nonzeros for each group and rest all zeros, i.e., for . While, if and zero otherwise for setup II. Table 1 lists the average of the TE and NFS for each classifier using 25 MC trials and 5-fold CV.

The third simulation setup resembles real gene expression data. We generate training and test observations each having features. All groups have equal probabilities and follow MVN distribution for . We have and contains all zeros except first -entries (i.e., true positives) with value and . Each group employs following block-diagonal auto-regressive covariance-structure

 \Sigg=\Sig(ρg)⊕\Sig(−ρg)⊕⋯⊕\Sig(ρg)⊕\Sig(−ρg),

where indicates the direct sum (not the Kronecker sum) of block matrices having the AR covariance structure

 [\Sig(ρg)]1≤i,j≤100=ρ|i−j|g

where is the correlation which is different for each group, namely, , and . This setup mimics real microarray data as genes are correlated within a pathway and independent between the pathways. Table 1 reveals the higher accuracy of the proposed CRDA methods compared to SCRDA when measured by TE, NFS, detection rate (DR) and FP rates. The results are averaged over 10 Monte-Carlo trials using 10-fold CV.

Next we do a comparison based on real microarray datasets. A summary of the used datasets is given below:

 Dataset N p G Disease Ramaswamy et al. [12] 190 16,063 14 Cancer Yeoh et al. [13] 248 12,625 6 Leukemia Sun et al. [14] 180 54,613 4 Glioma Nakayama et al. [15] 105 22,283 10 Sarcoma

We compute the results for each dataset over 10 training-test set splits, each with a random choice of training and test set containing 75% and 25% of the total observations, respectively. The average results of classification and gene-selection by CRDA methods are given in Table 2 with available comparison results. The proposed CRDA based classifiers showcase better classification and feature-selection results for all simulation setups. Overall, it seems that and -norm based CRDA methods are doing better as compared to others. Moreover, the CRDA based on -norm appears to have best overall performance. Note that the proposed CRDA classifiers outperform other methods with a significant margin in the case of Ramaswamy et al. (with 14 groups) and Sun et al. (of genes).

## 5 Discussions and Conclusions

We proposed a modified version of LDA, called compressive regularized discriminant CRDA, for analysis of data sets in high dimension low sample size situations. CRDA was shown to outperform competing methods in most of the cases. It also had the best detection rate which illustrates that the method can be a useful tool for accurate selection of (differentially expressed) genes in microarray studies.

## References

• [1] Yaqian Guo, Trevor Hastie, and Robert Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2006.
• [2] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu, “Class prediction by nearest shrunken centroids, with applications to dna microarrays,” Statistical Science, pp. 104–117, 2003.
• [3] Daniela M Witten and Robert Tibshirani, “Penalized classification using fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.
• [4] Alok Sharma, Kuldip K Paliwal, Seiya Imoto, and Satoru Miyano, “A feature selection method using improved regularized linear discriminant analysis,” Machine vision and applications, vol. 25, no. 3, pp. 775–786, 2014.
• [5] Emanuel Neto, Felix Biessmann, Harald Aurlien, Helge Nordby, and Tom Eichele, “Regularized linear discriminant analysis of eeg features in dementia patients,” Frontiers in aging neuroscience, vol. 8, 2016.
• [6] M. F. Duarte and Y. C. Eldar, “Structured compressed sensing: From theory to applications,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053–4085, 2011.
• [7] Olivier Ledoit and Michael Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,”

Journal of multivariate analysis

, vol. 88, no. 2, pp. 365–411, 2004.
• [8] Esa Ollila, “Optimal high-dimensional shrinkage covariance estimation for elliptical distributions,” in Proc. European Signal Processing Conference (EUSIPCO 2017), Kos, Greece, 2017, pp. 1689–1693.
• [9] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, The elements of statistical learning, vol. 1, Springer series in statistics New York, 2001.
• [10] John A Ramey and Phil D Young, “A comparison of regularization methods applied to the linear discriminant function with high-dimensional microarray data,” Journal of Statistical Computation and Simulation, vol. 83, no. 3, pp. 581–596, 2013.
• [11] Yan Zhou, Baoxue Zhang, Gaorong Li, Tiejun Tong, and Xiang Wan, “Gd-rda: A new regularized discriminant analysis for high-dimensional data,” Journal of Computational Biology, 2017.
• [12] Sridhar Ramaswamy, Pablo Tamayo, Ryan Rifkin, Sayan Mukherjee, Chen-Hsiang Yeang, Michael Angelo, Christine Ladd, Michael Reich, Eva Latulippe, Jill P Mesirov, et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15149–15154, 2001.
• [13] Eng-Juh Yeoh, Mary E Ross, Sheila A Shurtleff, W Kent Williams, Divyen Patel, Rami Mahfouz, Fred G Behm, Susana C Raimondi, Mary V Relling, Anami Patel, et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer cell, vol. 1, no. 2, pp. 133–143, 2002.
• [14] Lixin Sun, Ai-Min Hui, Qin Su, Alexander Vortmeyer, Yuri Kotliarov, Sandra Pastorino, Antonino Passaniti, Jayant Menon, Jennifer Walling, Rolando Bailey, et al.,

Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain,”

Cancer cell, vol. 9, no. 4, pp. 287–300, 2006.
• [15] Robert Nakayama, Takeshi Nemoto, Hiro Takahashi, Tsutomu Ohta, Akira Kawai, Kunihiko Seki, Teruhiko Yoshida, Yoshiaki Toyama, Hitoshi Ichikawa, and Tadashi Hasegawa, “Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma,” Modern pathology, vol. 20, no. 7, pp. 749–759, 2007.