A Doubly Regularized Linear Discriminant Analysis Classifier with Automatic Parameter Selection

by   Alam Zaib, et al.

Linear discriminant analysis (LDA) based classifiers tend to falter in many practical settings where the training data size is smaller than, or comparable to, the number of features. As a remedy, regularized LDA (RLDA) methods have been proposed. However, the classification performance of these methods varies depending on the size of the training and test data. In this paper, we propose a doubly regularized LDA classifier that we denote as R2LDA. In the proposed R2LDA approach, two regularization operations are carried out: one involving only the training data set, while the other also includes the given test data sample. The proposed R2LDA algorithm, unlike the classical RLDA techniques, caters for errors due to training data as well as the possible noise in the test data. Choosing the two regularization parameters in R2LDA can be automated through existing methods based on least squares (LS). In particular, we show that a constrained perturbation regularization approach (COPRA) is well suited for the regularization parameter selection task needed for the proposed R2LDA classifier. Results obtained from both synthetic and real data demonstrate the consistency and effectiveness of the proposed R2LDA-COPRA classifier, especially in scenarios involving noisy test data.




I Introduction

The idea of linear discriminant analysis (LDA) was originally conceived by R. A. Fisher [12] and is based on the assumption of Gaussian distribution of data with a common class covariance matrix. Owing to its simplicity, LDA has been successfully applied to various classification and recognition problems such as detection [30], speech recognition [3], cancer genomics [23, 21] and face recognition [28], to mention a few.

The performance of LDA based classifiers depends heavily on accurate estimation of the class statistics in the form of sample covariance matrices and mean vectors. These estimates are fairly accurate when the number of available samples is large compared to the data dimensionality. In practical high-dimensional data settings, the challenge is to cope with the limitation in the number of available samples. In this case, the sample covariance estimates become highly perturbed and ill-conditioned, resulting in severe performance degradation. To alleviate this problem, the sample covariance matrix is replaced with a regularized or ridge covariance matrix [25], giving the name regularized LDA (RLDA) [20, 18, 19]. The performance of RLDA classifiers is ultimately dictated by the choice of the regularization parameter. It is essential to judiciously set the value of the regularization parameter to reap the full benefits of RLDA. Towards this end, various regularization techniques have been proposed, e.g., cross-validation [13] has been one of the classical techniques for estimating the ridge parameter as evidenced in [14, 6, 33, 21, 32]. However, the search mechanisms of these methods lead to high computational complexity. In addition, they are not based on performance optimizing criteria.

Recently, an optimal regularization method that minimizes the asymptotic classification error was derived in [35, 10]. The method is based on recent results from random matrix theory. In [11], the latter method was extended to a more general class of discriminant analysis based classifiers, with LDA obtained as a special case. Despite being elegant approaches, both [10] and [11] require a grid search mechanism to find the best value of the regularization parameter. In [26], an improved RLDA classifier is proposed which avoids the grid search but is limited to spiked-model covariance structures. It is worth mentioning here that these theoretical results strongly rely on the Gaussian assumption, and so they might not apply equally well to real data. Moreover, the performance of the above mentioned approaches deteriorates significantly when the test data is contaminated with noise that is not observed during the training stage.

Focusing on binary classification, this paper presents a doubly regularized LDA (R2LDA) classifier by expressing the LDA score function as an inner product of two vectors which are linearly related to the mean vectors and the data covariance matrices. These vectors are estimated by using a perturbation regularization approach [27] where the regularization parameters can be selected to be optimal in the mean-squared-error (MSE) sense. The proposed method takes care of the ill-conditioning of the sample covariance matrices and the uncertainties in the training or the test data. In addition, the proposed method has the following distinctive features:

  • Two regularization parameters are calculated based on both the training and test data. These parameters can be tuned independently to cope with the different perturbations including those in the test data. This is to be contrasted with existing approaches which utilize a single regularization operation based solely on the training data. This feature makes the proposed approach more robust to noise that is unobserved in the training data but occurs in the test data.

  • The regularization parameter selection approach is agnostic to the underlying distribution of the data contrary to [10, 11, 26], which rely on the Gaussian assumption.

II RLDA Classification

We consider the binary classification problem of assigning a multivariate observation vector to one of two classes . Let be a prior probability that belongs to a class and assume that the class conditional densities are Gaussian with mean vectors and non-negative covariance matrices .

LDA employs the Bayesian discriminant rule, which assigns to the class with maximum posterior probability. Let and represent the available training samples pertaining to the classes and , respectively, where is the number of samples in class and is the total number of training samples. The LDA score function reads [1]


where is the matrix transpose operation. The unbiased mean vector estimates and the pooled sample covariance matrix are given by


where the sample covariance matrices are defined as


The class assignment rule for is as follows:


A major source of error in the above formulation is the inversion of the covariance matrix . In many practical setups where is comparable to , becomes ill-conditioned, or even singular. To get around this issue, in (1) is replaced with a regularized matrix , where and is the identity matrix of dimension . This replacement results in the RLDA score function [10, 35]


In this work, we employ a form of regularization different from that in (5). In the proposed regularized LDA classifier, we apply two separate regularization operations, which help in accounting for errors in the training data and in providing robustness against error contributions that are present in the test data.
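As a point of reference for the single-parameter baseline described above, the following is a minimal sketch of a ridge-regularized LDA score in its common textbook form. The function names, the additive ridge term `gamma * I`, the sign convention (positive score favours class 0) and the toy data are assumptions made for illustration; the exact parameterization used in [10, 35] may differ.

```python
import numpy as np

def class_statistics(X0, X1):
    """Sample means and pooled sample covariance of two training sets (rows = samples)."""
    n0, n1 = X0.shape[0], X1.shape[0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.cov(X0, rowvar=False)  # unbiased estimate (divides by n0 - 1)
    S1 = np.cov(X1, rowvar=False)
    pooled = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)
    return m0, m1, pooled

def rlda_score(x, m0, m1, pooled, gamma, prior0=0.5, prior1=0.5):
    """Ridge-regularized LDA score (assumed form); positive values favour class 0."""
    p = pooled.shape[0]
    centred = x - 0.5 * (m0 + m1)
    # Solve (pooled + gamma I) w = (m0 - m1) instead of forming an explicit inverse.
    w = np.linalg.solve(pooled + gamma * np.eye(p), m0 - m1)
    return centred @ w - np.log(prior1 / prior0)

# Toy data (hypothetical): fewer samples per class than features.
rng = np.random.default_rng(0)
p = 20
X0 = rng.normal(loc=+0.5, size=(15, p))
X1 = rng.normal(loc=-0.5, size=(15, p))
m0, m1, pooled = class_statistics(X0, X1)
score_at_m0 = rlda_score(m0, m0, m1, pooled, gamma=1.0)
score_at_m1 = rlda_score(m1, m0, m1, pooled, gamma=1.0)
```

With equal priors, the score evaluated at each class mean has the sign of that class, which is a quick sanity check on the sign convention. Note that the single parameter `gamma` is fixed from training data alone, which is the limitation the doubly regularized formulation targets.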

III The Proposed R2LDA Classifier

Existing RLDA techniques are based on (5), with estimated by selecting the regularization parameter using only the training data. This makes these techniques vulnerable to errors in the test data, especially when the error statistics of the test data deviate from those of the training data. To address this issue, we reformulate the LDA score function (1) as


where , , , , and . Based on the last two definitions, our proposed R2LDA method aims to obtain regularized estimates of and to improve the computation of the score function in (6). To this end, we utilize the linear models


where and are additive noise vectors. Each of (7) and (8) can be represented by the model


where represents model noise. To simplify our derivations, we make the following assumptions:

  1. The noise vector has zero mean and an unknown covariance matrix .

  2. The unknown random vector is zero mean with an unknown positive semi-definite diagonal covariance matrix .

  3. The vectors and are mutually independent.

In Section V, we will see that these simplifying assumptions still work for different classification examples.

Focusing on (9), regularization methods, commonly named ridge regression or Tikhonov regularization [29, 22, 16], can be applied to obtain a stabilized estimate of . This estimate can be expressed in a closed form as [19]


Based on (10), we can estimate and and substitute the results in (6) to obtain the R2LDA score function in the form


where and are the regularization parameters associated with the linear systems (7) and (8), respectively. The second equality in (11) follows directly from substituting into (10) the eigenvalue decomposition (EVD) of , given by , where is the matrix of eigenvectors and is the diagonal matrix of eigenvalues of .

Now, it only remains to set the values of the regularization parameters and . In the following section, we present a robust method to compute the regularization parameter for the regularized least-squares (RLS) solution in (10).
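To make the role of the EVD concrete, here is a small sketch of the standard ridge (Tikhonov) estimate for a generic linear model y = A x + noise, computed both directly and through the eigen-decomposition of a symmetric positive semi-definite model matrix. The symbols `A`, `y` and `gamma`, and the restriction to symmetric `A`, are assumptions chosen for illustration and are not taken from the source's equations.

```python
import numpy as np

def ridge_direct(A, y, gamma):
    """Closed-form ridge estimate (A^T A + gamma I)^{-1} A^T y."""
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(k), A.T @ y)

def ridge_via_evd(A, y, gamma):
    """Same estimate for symmetric PSD A = U diag(d) U^T, using only diagonal inverses."""
    d, U = np.linalg.eigh(A)
    # (A^2 + gamma I)^{-1} A = U diag(d / (d^2 + gamma)) U^T
    return U @ ((d / (d**2 + gamma)) * (U.T @ y))

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 4))
A = B @ B.T            # symmetric PSD with rank 4 < 6, so A itself is singular
y = rng.normal(size=6)
x_direct = ridge_direct(A, y, gamma=0.1)
x_evd = ridge_via_evd(A, y, gamma=0.1)
```

Both routes agree numerically. The EVD route is convenient when the same decomposition is reused for several values of the regularization parameter, since only a diagonal matrix changes.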

Remark 1

Compared to the conventional RLDA score function (5), the new formulation (11) involves two regularization operations. Note that the estimation of the class mean vectors results in perturbations in both and . In addition, also has errors coming from the test data. By carrying out two independent estimations to obtain regularized estimates of and in (6), we can optimize the choice of two different regularization parameters to cope with the different perturbations in and . This is the key advantage of the proposed R2LDA method over the classical RLDA based on (5), which uses a single regularization operation that involves only the training data. It will also become clearer that the proposed R2LDA still uses the statistics from the training data only, which is a fundamental requirement of any machine learning algorithm.

IV Regularization Parameter Selection

Several methods have been proposed in the literature for selecting the regularization parameter required in (10), e.g., the L-curve [15], the generalized cross-validation (GCV) [31], and the quasi-optimal method [2, 8], to mention a few. These methods use different criteria, which result in different values of the regularization parameter (see [7]).

In this work, we adopt the constrained perturbation regularization approach (COPRA) [27], which allows for regularization parameter selection in a way that optimizes the MSE. We adapt this algorithm to the setting of the problem at hand. COPRA works by introducing an artificial perturbation in the linear model to improve the singular-value structure of the resulting model matrix, and hence, is well suited to the naturally perturbed model at hand. To proceed, we start by replacing in (9) by a perturbed version to obtain the model


where is an unknown perturbation matrix which is norm bounded by a positive number , i.e., . One can consider to be a way to perturb to make the solution of (12) stable [27]. The perturbation can also be thought of as a genuine error in the model due to the noisy nature of , which is the case for (9). To obtain an estimate of , we consider the minimization of the worst-case residual error. Namely, we pursue the following optimization:


Interestingly, as shown in [9, 27, 4], the min-max problem (13) can be converted to a minimization problem whose solution is given by (10) with the constraint


We observe that the solution of (13) depends on the bound (in addition to the other linear system parameters) and is agnostic to the structure of the perturbation matrix . Note that both and are unknown. However, we can substitute (10) and the EVD of in (14) and manipulate to obtain


where is the matrix trace operator. Since in (15) is stochastic in nature, we consider a value of that would represent the average case. To this end we replace with its expected value , which can be written based on (9) in the following form:


Owing to the ill-conditioning of , it is likely that some of its eigenvalues are very close to, or even, zero. Therefore, the EVD of can be written in the form,


where and are diagonal matrices containing most significant and least significant eigenvalues, respectively. This partitioning is introduced as a general case form. For the special case where all eigenvalues are significant, we set and no partitioning is required. A threshold based approach to find the point of this partitioning is recommended in [27]. However, a simple and intuitive rule is used here to determine the value of as the smaller value of and , i.e., . The rationale behind (17) will be explained subsequently (see Remark 2).
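The partitioning step can be illustrated with a short sketch that splits the EVD of a sample covariance matrix into a block of the largest eigenvalues and a block of the remaining ones. The split point `k = min(p, n)` used below is an assumed reading of the rule stated above (the two quantities being compared are not legible in this copy), and the function name and toy sizes are hypothetical.

```python
import numpy as np

def partition_eigenvalues(S, k):
    """Split the EVD of symmetric S into the k largest eigenvalues and the rest."""
    d, U = np.linalg.eigh(S)          # eigh returns eigenvalues in ascending order
    order = np.argsort(d)[::-1]       # re-sort to descending
    d, U = d[order], U[:, order]
    return (d[:k], U[:, :k]), (d[k:], U[:, k:])

rng = np.random.default_rng(2)
p, n = 10, 6                          # more features than samples
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)           # rank is at most n - 1, so S is singular
k = min(p, n)                         # assumed split point, see the note above
(d_sig, U_sig), (d_rest, U_rest) = partition_eigenvalues(S, k)
```

In this toy case the discarded block contains only numerically zero eigenvalues, so the retained block reproduces `S` up to round-off.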

Now, we substitute (16) and (17) in (15) and manipulate to obtain (18).


Next, we proceed by eliminating the dependency of on the unknowns and in (18) by using the MSE criterion. The MSE of the RLS estimator (10) can be written as [19]


By differentiating (19), the regularization parameter that minimizes the MSE is given by


By substituting (20) in (18), we obtain (21), which shows a bound that does not depend on the statistics of or that of the noise. Note that the derivation of (16)–(18) is largely based on Assumptions 1–3. Ultimately, by using (21), we can eliminate from (15) to obtain (22), where and . Equation (22), which is non-linear in , can be solved by using Newton’s method [34] to obtain the optimal value of . The iterations should be initialized from a positive initial guess close to zero to avoid missing the positive root, as explained in [27].
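The root-finding step can be sketched as a scalar Newton iteration started from a small positive value. Since (22) is not reproduced in this copy, the function `f` below is only a stand-in with a similar flavour (a sum over eigenvalues that depends on the regularization parameter); `newton_positive_root`, `f`, `df`, the eigenvalues `d` and the constant `c` are all hypothetical.

```python
import numpy as np

def newton_positive_root(f, df, gamma0=1e-6, tol=1e-10, max_iter=100):
    """Newton's method for a scalar equation f(gamma) = 0, started near zero."""
    gamma = gamma0
    for _ in range(max_iter):
        gamma_new = gamma - f(gamma) / df(gamma)
        if gamma_new <= 0.0:
            gamma_new = 0.5 * gamma   # safeguard: keep the iterate positive
        if abs(gamma_new - gamma) < tol * max(1.0, abs(gamma_new)):
            return gamma_new
        gamma = gamma_new
    return gamma

# Stand-in equation (NOT equation (22)): sum_i d_i / (d_i + gamma) = c.
d = np.array([4.0, 1.0, 0.25])
c = 1.5
f = lambda g: np.sum(d / (d + g)) - c
df = lambda g: -np.sum(d / (d + g) ** 2)
root = newton_positive_root(f, df)
```

For this stand-in the positive root is at gamma = 1, because 4/5 + 1/2 + 1/5 = 1.5. The function is decreasing and convex for positive gamma, so iterates started near zero approach the root from below without overshooting, which mirrors the motivation for a small positive initial guess.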

Remark 2

Equation (22) is based on the contribution of only the significant eigenvalues of which occupy the diagonal of the matrix . In this case, if Newton’s method iterations start from a small initial value of , the (diagonal) matrix inversion operation required to compute the right-hand side of (22) will be numerically stable since the diagonal elements of are not overly small. This highlights the benefit of partitioning and truncation of the insignificant eigenvalues in (17).



IV-A Summary of the Proposed R2LDA-COPRA Algorithm

The main steps involved in the proposed R2LDA algorithm based on COPRA are summarized as follows:

  1. Estimate the class statistics , and from the training data by using (2) and (3).

  2. Compute , and the EVD of to determine and corresponding to the most significant eigenvalues.

  3. Set in (22) and solve using Newton’s method to obtain .

  4. For a given test sample, compute . Then repeat step 3 by setting to obtain .

  5. Compute the R2LDA score function given in (11) and classify the given test sample according to (4).

By using (19), COPRA guarantees that we obtain the best (in terms of MSE) regularized estimates of and required to form our R2LDA score function in (11). This does not guarantee optimal classification performance based on (11). However, our results show that the proposed R2LDA algorithm still outperforms classical RLDA classifiers of the form given by (5). It is worth mentioning here that COPRA can be replaced with other regularization methods to compute the regularization parameters and . Further, the proposed R2LDA algorithm uses only the statistics from the training data (step 1). The computations in step 4 and step 5 use the given test sample and not the test data or the noise statistics.
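The structure of a doubly regularized score can be sketched as follows, under a strong assumption that should be read as one plausible interpretation and not as the authors' exact equations: the two vectors in the inner product are taken to be the whitened, centred test sample and the whitened mean difference, each estimated by ridge regression with the square root of the pooled covariance as the model matrix. The function name `r2lda_score` and its arguments are hypothetical, and the two regularization parameters are supplied by the caller instead of being computed by COPRA.

```python
import numpy as np

def r2lda_score(x, m0, m1, pooled, gamma1, gamma2, prior0=0.5, prior1=0.5):
    """Doubly regularized LDA score under the assumed whitening interpretation.

    Use positive gamma values when `pooled` is singular.
    """
    d, U = np.linalg.eigh(pooled)
    d = np.clip(d, 0.0, None)               # guard against tiny negative round-off
    centred = U.T @ (x - 0.5 * (m0 + m1))   # depends on the test sample
    diff = U.T @ (m0 - m1)                  # depends on training data only
    # Product of two ridge filters, one per regularization parameter.
    weights = d / ((d + gamma1) * (d + gamma2))
    return np.sum(centred * weights * diff) - np.log(prior1 / prior0)

rng = np.random.default_rng(3)
p = 5
B = rng.normal(size=(p, p))
pooled = B @ B.T + np.eye(p)                # well-conditioned toy covariance
m0, m1 = rng.normal(size=p), rng.normal(size=p)
x = rng.normal(size=p)
score_a = r2lda_score(x, m0, m1, pooled, gamma1=0.3, gamma2=0.05)
score_b = r2lda_score(x, m0, m1, pooled, gamma1=0.05, gamma2=0.3)
score_unreg = r2lda_score(x, m0, m1, pooled, gamma1=0.0, gamma2=0.0)
```

Under this interpretation, setting both parameters to zero on a well-conditioned covariance recovers the plain LDA score, and the score is symmetric in the two parameters. The practical difference from single-parameter RLDA is that one of the two values can be re-selected for every test sample.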

V Results

We demonstrate the performance of the proposed R2LDA classification against the RLDA techniques of the asymptotic error estimator (Asym)[10] and the improved error estimator (Impr)[26]. We also consider GCV [31] and bounded perturbation regularization (BPR) [5] as alternatives to COPRA in selecting the two regularization parameters of the R2LDA classifier. We consider both synthetic and real data for performance comparison.

Fig. 1: Gaussian data. Misclassification rate versus training data size for different test data noise levels. Panels (a)–(c): Gaussian.
Fig. 2: MNIST data. Misclassification rate versus training data size for different test data noise levels. Panels (a)–(c): MNIST (1,7); (d)–(f): MNIST (5,8); (g)–(i): MNIST (7,9).
Fig. 3: Phoneme data. Misclassification rate versus training data size for different test data noise levels. Panels (a)–(c): Phonemes.

The synthetic data was generated using a Gaussian data model with class covariance matrices and mean vectors defined as: , which is of dimensionality and has on the main diagonal and as off-diagonal elements; and , where . The parameter was chosen according to Mahalanobis distance () between classes defined as, [10]. We set . A training set of size for the class was generated in each trial. We set . For the test data, we generated an independent set of samples for each class. A total of 500 training trials were carried out, each followed by 500 test trials.
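A minimal sketch of this kind of synthetic setup is given below. The numerical values (dimension, diagonal and off-diagonal entries, target distance, sample sizes) and the choice of opposite means along the all-ones direction are hypothetical placeholders, since the actual values are not legible in this copy.

```python
import numpy as np

def equicorrelated_cov(p, diag, off):
    """Covariance with `diag` on the main diagonal and `off` everywhere else."""
    return (diag - off) * np.eye(p) + off * np.ones((p, p))

def mahalanobis(m0, m1, cov):
    """Mahalanobis distance between two mean vectors under a common covariance."""
    diff = m0 - m1
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def means_for_distance(cov, target):
    """Opposite means along the all-ones direction, scaled to a target distance."""
    direction = np.ones(cov.shape[0])
    unit = mahalanobis(direction, -direction, cov)   # distance when scale == 1
    scale = target / unit
    return scale * direction, -scale * direction

rng = np.random.default_rng(4)
p, n_per_class = 8, 12                               # hypothetical sizes
cov = equicorrelated_cov(p, diag=1.0, off=0.1)
m0, m1 = means_for_distance(cov, target=2.0)
X0 = rng.multivariate_normal(m0, cov, size=n_per_class)
X1 = rng.multivariate_normal(m1, cov, size=n_per_class)
```

Because the Mahalanobis distance scales linearly with the separation of the means, a single rescaling is enough to hit a prescribed class separation.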

For real data, we use the MNIST dataset of gray-scale images of handwritten digits [24], and the phonemes dataset considered in [17]. The latter is based on log-periodograms (of length ) of digitized speech frames extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, U.S. Department of Commerce) [17], which is widely used in speech recognition. The MNIST images are vectorized to result in data of dimensionality equal to . For binary classification, selected pairs of digits were used. For the phonemes dataset, we used only two phonemes for binary classification, transcribed as “sh” as in “she” and “dcl” as in “dark”. Real data results were obtained from 100 training attempts. In each attempt, the training samples were chosen randomly from the dataset. Each trained model was tested using 500 examples, which were also randomly selected from the dataset.

For both the synthetic and real datasets, zero-mean Gaussian noise with standard deviation was added during the test phase. The properties of the noise were not known by the proposed R2LDA classifier, nor were they known by any of the benchmarks.
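The evaluation protocol described here, in which noise is injected only at test time, can be sketched as follows. The helper `misclassification_rate`, the stand-in linear score, the toy means and the noise levels are hypothetical and are not the benchmarks or settings used in the paper.

```python
import numpy as np

def misclassification_rate(score_fn, X_test, labels, noise_std, rng):
    """Error rate when zero-mean Gaussian noise is added to test samples only."""
    noisy = X_test + rng.normal(scale=noise_std, size=X_test.shape)
    scores = np.array([score_fn(x) for x in noisy])
    predicted = np.where(scores > 0, 0, 1)    # positive score -> class 0
    return float(np.mean(predicted != labels))

rng = np.random.default_rng(5)
p = 4
m0, m1 = np.full(p, 1.0), np.full(p, -1.0)
X_test = np.vstack([rng.normal(m0, 1.0, size=(250, p)),
                    rng.normal(m1, 1.0, size=(250, p))])
labels = np.repeat([0, 1], 250)
# Stand-in linear score with known means; any trained score function fits here.
score_fn = lambda x: (x - 0.5 * (m0 + m1)) @ (m0 - m1)
clean_rate = misclassification_rate(score_fn, X_test, labels, 0.0, rng)
noisy_rate = misclassification_rate(score_fn, X_test, labels, 3.0, rng)
```

The classifier is never told the noise level, which matches the condition that the noise properties are unknown to all methods being compared.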

V-A Discussion

Figs. 1–3 show the classification error versus the size of the training data for different scenarios. Fig. 1 presents the results for the (synthetic) Gaussian data, while Fig. 2 and Fig. 3 present results for the MNIST and phonemes datasets, respectively. The MNIST results are based on the digit pairs (1,7), (5,8) and (7,9). From the results in Figs. 1–3, we observe the following:

  • On average, the R2LDA methods outperform the RLDA methods.

  • The R2LDA remains more consistent and stable than the RLDA methods as the level of noise in the test data increases. This is more visible with the MNIST and phonemes datasets.

  • Amongst the R2LDA classifiers, R2LDA-COPRA is the most consistent. R2LDA-GCV and R2LDA-BPR falter occasionally, as in Fig. 2(a) and Fig. 2(g).

VI Conclusions

We presented a novel regularized LDA classifier based on a dual regularization approach to provide robustness against both training and test data perturbations. In the proposed classifier, the regularization parameters are obtained by solving a non-linear equation using Newton’s method. Results based on both synthetic and real data demonstrate the effectiveness of our method, especially when noise is present in the test data. Although the proposed method is presented for binary classification, it can be easily extended to multi-class problems.


  • [1] T. W. Anderson (1951) Classification by multivariate analysis. Psychometrika 16 (1), pp. 31–50.
  • [2] A.B. Aries, Z. Nashed, and V.A. Morozov (2012) Methods for solving incorrectly posed problems. Springer New York.
  • [3] C. Avendano, S. Van Vuuren, and H. Hermansky (1996) Data based filter design for RASTA-like channel normalization in ASR. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, Vol. 4, pp. 2087–2090.
  • [4] T. Ballal and T. Y. Al-Naffouri (2015) Improved linear least squares estimation using bounded data uncertainty. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3427–3431.
  • [5] T. Ballal, M. A. Suliman, and T. Y. Al-Naffouri (2017) Bounded perturbation regularization for linear least squares estimation. IEEE Access 5, pp. 27551–27562.
  • [6] T. V. Bandos, L. Bruzzone, and G. Camps-Valls (2009) Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing 47 (3), pp. 862–873.
  • [7] F. Bauer and M. A. Lukas (2011) Comparing parameter choice methods for regularization of ill-posed problems. Math. Comput. Simul. 81 (9), pp. 1795–1841.
  • [8] F. Bauer and M. Reiß (2008) Regularization independent of the noise level: an analysis of quasi-optimality. Inverse Problems 24 (5), pp. 055009.
  • [9] S. Chandrasekaran, G. H. Golub, M. Gu, and A. H. Sayed (1998) Parameter estimation in the presence of bounded data uncertainties. SIAM J. Matrix Analysis and Applications 19, pp. 235–252.
  • [10] B. Daniyar, J. Alex, and Z. Amin (2016) An efficient method to estimate the optimum regularization parameter in RLDA. Bioinformatics 32 (22), pp. 3461–3468.
  • [11] K. Elkhalil, A. Kammoun, R. Couillet, T. Y. Al-Naffouri, and M. S. Alouini (2017) Asymptotic performance of regularized quadratic discriminant analysis based classifiers. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
  • [12] R. A. Fisher (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), pp. 179–188.
  • [13] J. H. Friedman (1989) Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), pp. 165–175.
  • [14] Y. Guo, T. Hastie, and R. Tibshirani (2007) Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8 (1), pp. 86–100.
  • [15] P. C. Hansen and D. P. O’Leary (1993) The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J. Sci. Comput. 14 (6), pp. 1487–1503.
  • [16] P. C. Hansen (2010) Discrete inverse problems: insight and algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
  • [17] T. Hastie, A. Buja, and R. Tibshirani (1995) Penalized discriminant analysis. Ann. Statist. 23 (1), pp. 73–102.
  • [18] A. E. Hoerl and R. W. Kennard (1970) Ridge regression: applications to nonorthogonal problems. Technometrics 12 (1), pp. 69–82.
  • [19] A. E. Hoerl and R. W. Kennard (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, pp. 55–67.
  • [20] A. E. Hoerl (1962) Application of ridge analysis to regression problems. Chemical Engineering Progress 58 (3), pp. 54–59.
  • [21] D. Huang, Y. Quan, M. He, and B. Zhou (2009) Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data. 28, pp. 1–8.
  • [22] B. B. John (1963) Reviewed work: solutions of ill-posed problems by A. N. Tikhonov, V. Y. Arsenin. Mathematics of Computation 32 (144), pp. 1320–1322.
  • [23] S. Kim, E. R. Dougherty, I. Shmulevich, K. R. Hess, S. R. Hamilton, J. M. Trent, G. N. Fuller, and W. Zhang (2002) Identification of combination gene sets for glioma classification. 1 (13), pp. 1229–1236.
  • [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • [25] P. J. D. Pillo (1976) The application of bias to discriminant analysis. Communications in Statistics - Theory and Methods 5 (9), pp. 843–854.
  • [26] H. Sifaou, A. Kammoun, and M. Alouini (2018) Improved LDA classifier based on spiked models. In 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).
  • [27] M. A. Suliman, T. Ballal, and T. Y. Al-Naffouri (2018) Perturbation-based regularization for signal estimation in linear discrete ill-posed problems. Signal Processing 152, pp. 35–46.
  • [28] D. L. Swets and J. J. Weng (1996) Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (8), pp. 831–836.
  • [29] A. N. Tikhonov (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, pp. 1035–1038.
  • [30] K. R. Varshney (2012) Generalization error of linear discriminant analysis in spatially-correlated sensor networks. IEEE Transactions on Signal Processing 60 (6), pp. 3295–3301.
  • [31] G. Wahba (1990) Spline models for observational data. Society for Industrial and Applied Mathematics, Philadelphia.
  • [32] J. Ye, T. Xiong, Q. Li, R. Janardan, J. Bi, V. Cherkassky, and C. Kambhamettu (2006) Efficient model selection for regularized linear discriminant analysis. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM ’06, New York, NY, USA, pp. 532–539.
  • [33] J. Ye and T. Xiong (2006) Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. J. Mach. Learn. Res. 7, pp. 1183–1204.
  • [34] C.J. Zarowski (2004) An introduction to numerical analysis for electrical and computer engineers. Wiley.
  • [35] A. Zollanvari and E. R. Dougherty (2015) Generalized consistent error estimator of linear discriminant analysis. IEEE Transactions on Signal Processing 63 (11), pp. 2804–2814.