Improved Design of Quadratic Discriminant Analysis Classifier in Unbalanced Settings

The use of quadratic discriminant analysis (QDA) or its regularized version (R-QDA) for classification is often not recommended, due to its well-acknowledged high sensitivity to the estimation noise of the covariance matrix. This is all the more the case in unbalanced data settings, for which it has been found that R-QDA becomes equivalent to the classifier that assigns all observations to the same class. In this paper, we propose an improved R-QDA that is based on the use of two regularization parameters and a modified bias, properly chosen to avoid the inappropriate behavior of R-QDA in unbalanced settings and to ensure the best possible classification performance. The design of the proposed classifier builds on a refined asymptotic analysis of its performance when the number of samples and the number of features grow large simultaneously, which makes it possible to cope efficiently with the high dimensionality frequently met within the big data paradigm. The performance of the proposed classifier is assessed on both real and synthetic data sets and is shown to be much better than what one would expect from a traditional R-QDA.

1 Introduction

Discriminant analysis encompasses a wide variety of techniques used for classification purposes. These techniques, commonly classified among the model-based methods in machine learning (Devijver and Kittler, 1982), rely on the assumption of a parametric model in which the outcome is described by a set of explanatory variables that follow a given distribution. Among them, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) stand out as the most representative. LDA is often connected with, or confused with, Fisher discriminant analysis (FDA) (Fisher, 1936), a method that projects the data onto a subspace and turns out to coincide with LDA when the target subspace has two dimensions. Both LDA and QDA are obtained by maximizing the posterior probability under the assumption that observations follow a normal distribution, with the single difference that LDA assumes a common covariance matrix across classes while QDA allows the most general situation in which classes possess different means and covariances. If the data perfectly follow the normal distributions and the statistics are perfectly known, QDA turns out to be the optimal classifier, achieving the lowest possible classification error rate (Tibshirani, 2009). It coincides with LDA when the covariances are equal but outperforms it when they differ. However, in practical scenarios, the use of QDA does not always yield the expected performance. This is because the mean and covariance of each class, which are in general unknown, are estimated from the available training data, whose class labels are perfectly known. The obtained estimates are then used as plug-in estimators in the classification rules associated with LDA and QDA. The estimation error in the class statistics causes a provable degradation of performance, which becomes severe when the number of samples is comparable to or smaller than the feature dimension. In this latter situation, QDA and LDA, which rely on inverting the sample covariance matrix, cannot be used at all. To overcome this issue, one technique consists in using a regularized estimate of the covariance matrix as a plug-in estimator, giving rise to regularized LDA (R-LDA) and regularized QDA (R-QDA). However, this solution alone does not allow for a significant reduction of the estimation noise. The situation is even worse for R-QDA, since the number of samples used to estimate the covariance matrix of each class is smaller than that available to LDA. This is probably the reason why LDA provides better performance than QDA in many scenarios, although it may wrongly assume the covariances to be equal across classes.

A question of major theoretical and practical interest is to determine to what extent the estimation noise of the covariance matrix impacts the performance of R-LDA and R-QDA. In this respect, the study of LDA, and subsequently that of R-LDA, has received particular attention, dating back to the early work of Raudys (Raudys, 1967), before being revisited using recent advances in random matrix theory in a series of recent works (Zollanvari and Dougherty, 2015; Wang and Jiang, 2018). The theoretical analysis of QDA and R-QDA is, however, scarcer and very often limited to specific situations in which the number of samples is larger than the dimension of the statistics (McFarland and Richards, 2002), or to specific structures of the covariance matrices (Cheng, 2004; Li and Shao, 2015; Jiang et al., 2015). It was only recently that the work in (Elkhalil et al., 2017) considered the analysis of R-QDA for general structures of the covariance matrices and identified the asymptotic conditions under which QDA avoids the trivial behavior of always returning the same class or guessing it at random. In particular, the work in (Elkhalil et al., 2017) assumes balanced data across classes, since otherwise R-QDA tends to assign all observations to one class, thereby limiting its use in general settings.

This constitutes the main motivation of the present work. Based on a careful investigation of the asymptotic behavior of R-QDA under unbalanced settings in binary classification problems, we propose to amend the traditional R-QDA to cope with cases in which the proportions of training data from the two classes are not equal. The new classifier uses two different regularization parameters instead of a common one, as well as an optimized bias properly chosen to minimize the misclassification error rate. Interestingly, we show that the proposed classifier not only outperforms R-LDA and R-QDA but also other state-of-the-art classification methods, opening promising avenues for its use in practical scenarios.

The rest of the paper is organized as follows. In Section 2, we provide an overview of the quadratic discriminant classifier and identify the issues related to its use in unbalanced settings. In Section 3, we propose an improved version of the R-QDA classifier that overcomes these problems, and we design a consistent estimator of the misclassification error rate that can be used as an alternative to the traditional cross-validation approach. Finally, Section 4 contains simulations on both synthetic and real data that confirm our theoretical results.

Notations

Scalars, vectors and matrices are respectively denoted by non-boldface, boldface lowercase and boldface uppercase characters.

$\mathbf{0}_{m \times n}$ and $\mathbf{1}_{m \times n}$ are respectively the matrices of zeros and of ones of size $m \times n$, and $\mathbf{I}_p$ denotes the $p \times p$ identity matrix. The notation $\|\cdot\|$ stands for the Euclidean norm for vectors and the spectral norm for matrices. $(\cdot)^{T}$, $\operatorname{tr}(\cdot)$ and $\det(\cdot)$ stand for the transpose, the trace and the determinant of a matrix, respectively. For two functions $f$ and $g$, we write $f = O(g)$ if there exists a constant $M > 0$ such that $|f| \leq M |g|$, and $f = o(g)$ if $f/g \to 0$. $\mathbb{P}$, $\xrightarrow{p}$ and $\xrightarrow{a.s.}$ respectively denote the probability measure, the convergence in probability and the almost sure convergence of random variables. $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, i.e. $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\, dt$.

2 Regularized quadratic discriminant analysis

As aforementioned, R-QDA is equivalent to the classifier that assigns all observations to the same class when it is designed from an unbalanced set of training samples. Such behavior led the authors in (Elkhalil et al., 2017) to consider the analysis of R-QDA only under balanced training samples. In this section, we show that this behavior can be easily predicted through a close examination of the mean and variance of the classification rule associated with R-QDA. This constitutes an important step that paves the way towards the improved R-QDA presented in the next section. But prior to that, we first review the traditional R-QDA for binary classification.

2.1 Regularized QDA for binary classification

For ease of presentation, we focus on binary classification problems involving two distinct classes. We assume that the data follow a Gaussian mixture model, such that observations in class $\mathcal{C}_i$, $i \in \{0, 1\}$, are drawn from a multivariate Gaussian distribution with mean $\boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_i$. More formally, we assume that

$$\mathbf{x} \mid \mathbf{x} \in \mathcal{C}_i \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad i \in \{0, 1\}. \tag{1}$$

Let $\pi_i$, $i = 0, 1$, denote the prior probability that $\mathbf{x}$ belongs to class $\mathcal{C}_i$. The classification rule associated with the QDA classifier is given by

(2)

which is used to classify the observations according to the following rule:

(3)

As seen from (2), the classification rule of QDA involves the true parameters of the Gaussian distributions, namely the means and covariances of each class. In practice, these parameters are unknown. One approach is to estimate them from the available training data, whose class labels are perfectly known; the obtained estimates are then used as plug-in estimators in (2). In particular, consider the case in which $n_0$ and $n_1$ training observations are available for classes $\mathcal{C}_0$ and $\mathcal{C}_1$, and denote by $\mathcal{T}_0$ and $\mathcal{T}_1$ their respective sample sets. The sample estimates of the mean and covariance of each class are then given by:

$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{\mathbf{x}_l \in \mathcal{T}_i} \mathbf{x}_l, \qquad \hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i - 1} \sum_{\mathbf{x}_l \in \mathcal{T}_i} \left(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i\right)\left(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i\right)^{T}.$$

In case the number of samples $n_0$ or $n_1$ is smaller than the number of features $p$, the sample covariance matrix cannot be used as a plug-in estimator, since its inverse is not defined. A popular approach to circumvent this issue is to consider a regularized estimator of the inverse of the covariance matrix, given by

(4)

where $\gamma > 0$ is a regularization parameter that serves to shrink the sample covariance matrix towards the identity. Replacing the inverse covariance matrices in (2) by their regularized estimates yields the following classification rule:

(5)

The R-QDA classifier misclassifies an observation when the decision based on (5) disagrees with its true class. Conditioning on the training samples, the classification error associated with class $\mathcal{C}_i$ is thus given by

(6)

which yields the following expression for the total misclassification error probability:

(7)
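To fix ideas, the following minimal Python sketch implements a plug-in R-QDA classifier of the kind described above. The ridge form $\hat{\boldsymbol{\Sigma}}_i + \gamma \mathbf{I}_p$ for the regularized covariance estimate and the sign convention of the discriminant are assumptions made for illustration; the sketch captures the structure of (2)-(7) rather than the exact expressions.

```python
import numpy as np

def fit_rqda(X0, X1, gamma):
    """Estimate class means and ridge-regularized covariances from training data.

    X0, X1 : arrays of shape (n_i, p) with the training samples of classes 0 and 1.
    gamma  : common regularization parameter (assumed ridge form: Sigma_hat + gamma * I).
    """
    params = []
    for X in (X0, X1):
        p = X.shape[1]
        mu = X.mean(axis=0)
        R = np.cov(X, rowvar=False) + gamma * np.eye(p)   # regularized covariance (assumed form)
        params.append((mu, np.linalg.inv(R), np.linalg.slogdet(R)[1]))
    return params

def rqda_score(x, params, priors=(0.5, 0.5)):
    """Plug-in quadratic discriminant score; positive values favour class 0 (assumed convention)."""
    (mu0, H0, ld0), (mu1, H1, ld1) = params
    q0 = (x - mu0) @ H0 @ (x - mu0)
    q1 = (x - mu1) @ H1 @ (x - mu1)
    return 0.5 * (q1 - q0) + 0.5 * (ld1 - ld0) + np.log(priors[0] / priors[1])

# toy usage on unbalanced Gaussian classes
rng = np.random.default_rng(0)
p, n0, n1 = 50, 40, 200
X0 = rng.standard_normal((n0, p))
X1 = 0.5 + 1.3 * rng.standard_normal((n1, p))
params = fit_rqda(X0, X1, gamma=1.0)
x_test = rng.standard_normal(p)
priors = (n0 / (n0 + n1), n1 / (n0 + n1))
print("assigned class:", 0 if rqda_score(x_test, params, priors) > 0 else 1)
```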

2.2 Identification of the problems of the R-QDA classifier in unbalanced data settings

In this section, we unveil several issues pertaining to the use of the R-QDA classification rule in high-dimensional settings. These issues can be revealed through a careful investigation of the asymptotic distribution of the classification rule associated with R-QDA. We first recall that this rule is a quadratic function of the Gaussian test observation and, as such, behaves like a Gaussian random variable with a certain mean and variance as long as the Lyapunov conditions are met (Billingsley, 1995). To gain direct insight into how R-QDA behaves, we assume that there is asymptotically no error in treating the classification rule, when the observation belongs to class $\mathcal{C}_i$, as a Gaussian random variable whose mean and variance are taken with respect to the distribution of the testing observation, a suitable scaling factor being used to produce fluctuations of order one. For R-QDA to behave appropriately (including achieving a vanishing classification error rate), the two class-conditional means should be of opposite signs and sufficiently large in magnitude, while the variances should remain bounded. This latter condition on the variances is already ensured provided that the spectral norms of the covariance matrices are bounded and the difference between the mean vectors has a suitably bounded norm. Under these assumptions, and taking the expectation over the testing observation, the class-conditional means satisfy:

(8)
(9)

It can be easily seen that, under the assumption that $n_0$, $n_1$ and $p$ grow large at the same rate and that the spectral norms of $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}_1$ are bounded uniformly in $p$, the means are asymptotically approximated as:

(10)

Several important remarks are in order regarding (10). First, the prior probabilities $\pi_0$ and $\pi_1$ asymptotically play no role in the classification, since the corresponding term in (10) tends to zero. Second, one can easily see that if the distance between the covariance matrices is sufficiently small, which occurs for instance when $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ has sufficiently small rank (Elkhalil et al., 2017), the means are given by:

(a) Regularized covariance estimate
(b) Perfect knowledge of the covariance matrices
Figure 1: Histogram of the classification rule with a regularized covariance estimate and with perfect knowledge of the covariance matrices, for $p$ features and unbalanced training sizes $n_0$ and $n_1$. The testing set contains 5000 and 10000 samples for the first and second class, respectively.

It thus appears that the direct use of R-QDA poses two main issues. The first one concerns the bias term, whose contribution to the class-conditional means is asymptotically independent of the mean vectors and the prior probabilities. This makes R-QDA perform classification only on the basis of the covariance matrices; it is thus important to modify the bias term. The second issue is that, unlike the balanced case, in which the class-conditional means were shown to differ at the leading order when the difference between the covariance matrices has a sufficient number of eigenvalues of the appropriate order (Elkhalil et al., 2017), in the unbalanced case the means coincide, up to the leading order, for both classes. This can be clearly illustrated through Figure 1, which displays the histogram of the classification rule of R-QDA and that of QDA with perfect knowledge of the statistics. As can be seen, the use of R-QDA does not allow discrimination between the two classes, since the means of the classification rule under class $\mathcal{C}_0$ and under class $\mathcal{C}_1$ are the same at the highest order. Based on random matrix theory results, we can prove that such behavior is caused by the use of the same regularization parameter for both $\hat{\boldsymbol{\Sigma}}_0$ and $\hat{\boldsymbol{\Sigma}}_1$. In light of these observations, we propose to replace the classification rule of R-QDA by the following rule:

(11)

where 1) $\gamma_0$ and $\gamma_1$ are two class-specific regularization parameters, carefully devised so that the class-conditional means of the rule are of the appropriate order and reflect the class under consideration, and 2) $\omega$ is a bias term that will be set to the value minimizing the asymptotic classification error rate.
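To illustrate the structure of (11), the short Python sketch below evaluates a quadratic score with one regularization parameter per class and an additive bias. Only this structure is taken from the text; the symbols $\gamma_0$, $\gamma_1$ and $\omega$, the ridge form of the regularized covariance estimates, and the sign convention are illustrative assumptions, and the actual values of the parameters are prescribed later by Theorems 2-4.

```python
import numpy as np

def improved_rqda_score(x, X0, X1, gamma0, gamma1, omega):
    """Quadratic score with one regularization parameter per class and an additive bias.

    The ridge form Sigma_hat_i + gamma_i * I and the sign convention are assumptions;
    in the paper, the second regularizer and the bias are set by closed-form expressions.
    """
    score = omega  # the bias replaces the prior-driven term of standard R-QDA
    for X, gamma, sign in ((X0, gamma0, -1.0), (X1, gamma1, +1.0)):
        p = X.shape[1]
        mu = X.mean(axis=0)
        R = np.cov(X, rowvar=False) + gamma * np.eye(p)
        quad = (x - mu) @ np.linalg.solve(R, x - mu)
        score += 0.5 * sign * (quad + np.linalg.slogdet(R)[1])
    return score

# decision: assign the observation to class 0 when the score is positive (assumed convention)
rng = np.random.default_rng(1)
p, n0, n1 = 30, 25, 150
X0, X1 = rng.standard_normal((n0, p)), 0.4 + rng.standard_normal((n1, p))
x = rng.standard_normal(p)
print(improved_rqda_score(x, X0, X1, gamma0=1.0, gamma1=0.5, omega=0.0) > 0)
```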

3 Design of the improved R-QDA classifier

In this section, we propose an improved design of the R-QDA classifier that fixes the aforementioned issues met in unbalanced settings. The design is based on an asymptotic analysis of the statistic in (11) under the following asymptotic regime, which was also considered in (Elkhalil et al., 2017):

Assumption 1 (Data scaling).
Assumption 2 (Mean scaling).
Assumption 3 (Covariance scaling).
Assumption 4. The matrix $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ has exactly a fixed number of eigenvalues of dominant order, while the remaining eigenvalues are of lower order.

Assumptions 1 and 3 are standard and describe a growth regime in which the number of features scales comparably with the number of samples while the spectral norms of both covariance matrices remain bounded. Assumption 2 specifies the smallest distance between the mean vectors that still allows them to discriminate between the two classes, while Assumption 4, introduced in (Elkhalil et al., 2017), ensures that the difference between the covariance matrices has a contribution of the same order of magnitude as that of the difference between the mean vectors.

Under the asymptotic regime specified by Assumptions 1-4, and along the same lines as in (Elkhalil et al., 2017), we analyze the classification error rate of the proposed classifier with classification rule (11). Before presenting the corresponding result, we first introduce the following notation, which defines deterministic objects that naturally appear when using random matrix theory results.

For $i \in \{0, 1\}$, consider the unique positive solution to the following fixed-point equation:

(12)

Its existence and uniqueness follow from standard results in random matrix theory (Hachem et al., 2008). For $i \in \{0, 1\}$, we also define the matrices

(13)

and the scalars

(14)

With this notation at hand, we are now in a position to state the first asymptotic result.

Theorem 1. Under Assumptions 1-4, and assuming that the regularization parameters $\gamma_0$ and $\gamma_1$ are of order one, the classification error rate associated with class $\mathcal{C}_i$ satisfies:

(15)

where

(16)
(17)
(18)
(19)

Proof. The proof follows along the same lines as in (Elkhalil et al., 2017) and is therefore omitted.

Remark: Under Assumption 4, the expressions above can asymptotically be simplified, and one of the involved terms converges to zero as the dimensions grow to infinity. However, in our simulations, we chose to work with the non-simplified expressions and to keep this term, since we observed that doing so yields better accuracy in finite-dimensional simulations.

The result of Theorem 1 provides guidelines on how to choose $\gamma_0$ and $\gamma_1$ as well as the optimal bias $\omega$. As discussed before, the design should require the mean of the classification rule to be of the appropriate order and to reflect the class under consideration. This mean is represented in the asymptotic expression of the classification error rate by a quantity which, at first sight, does not meet these requirements; moreover, under Assumptions 3-4, the class of the testing observation is not reflected in this quantity. To solve this issue, we need to design $\gamma_0$ and $\gamma_1$ so that this quantity is of the required order, or equivalently,

(20)

so that the class-conditional means become different at the highest order. To this end, we prove that it suffices to select the regularization parameter associated with the class with the largest number of samples as stated in the following theorem.

Theorem 2. Under Assumptions 1-4, and assuming without loss of generality that class $\mathcal{C}_0$ has the smaller number of training samples, if $\gamma_1$ is chosen as

(21)

where $\gamma_0$ is fixed to a given constant, then condition (20) is satisfied.
Proof. See Appendix A.
It is worth mentioning that in the balanced case, plugging $n_0 = n_1$ into (21) yields equal regularization parameters; it is thus not necessary to use different regularization parameters when the classes are balanced. With the regularization parameters set in this way, the optimal bias $\omega$ can be chosen so that the asymptotic total classification error rate, obtained by weighting the asymptotic class-conditional error rates of Theorem 1 by the prior probabilities, is minimized.

Theorem 3. The optimal bias that minimizes the asymptotic classification error rate is given by:

(22)

Proof. See Appendix B.
Before proceeding further, it is important to note that, thanks to the careful choice of the regularization parameters provided in Theorem 2, the quantities entering the expression of the optimal bias are of the appropriate order for both classes.
On another note, it is worth mentioning that even in the case of balanced classes, characterized by equal regularization parameters as proved in Theorem 2, the optimal bias differs from the one used in standard R-QDA. As such, the proposed design improves on the traditional R-QDA studied in (Elkhalil et al., 2017) even in the balanced case, by optimally adapting the bias term to the situation where the covariance matrices are not known.
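The role of the bias can be illustrated numerically. Once the classification score is asymptotically Gaussian under each class, the bias acts as a decision threshold, and the best threshold minimizes the prior-weighted sum of two Gaussian tail probabilities. The sketch below carries out this one-dimensional minimization; the means, standard deviations, and priors used are illustrative placeholders rather than the deterministic quantities of Theorem 1.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def total_error(omega, m0, s0, m1, s1, pi0, pi1):
    """Prior-weighted error when the score is N(m_i, s_i^2) under class i and
    class 0 is decided whenever score + omega > 0 (assumed convention)."""
    eps0 = norm.cdf(-(m0 + omega) / s0)        # class-0 observation sent to class 1
    eps1 = 1.0 - norm.cdf(-(m1 + omega) / s1)  # class-1 observation sent to class 0
    return pi0 * eps0 + pi1 * eps1

# illustrative asymptotic statistics (placeholders, not the paper's expressions)
m0, s0, m1, s1 = 2.0, 1.0, -1.5, 1.5
pi0, pi1 = 0.2, 0.8
res = minimize_scalar(total_error, bounds=(-10.0, 10.0), method="bounded",
                      args=(m0, s0, m1, s1, pi0, pi1))
print("optimal bias:", res.x, "asymptotic error:", res.fun)
```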
Theorems 2 and 3 can be used to obtain an optimized design of the proposed R-QDA classifier. As can be seen, the improved classifier employs only one free regularization parameter, associated with the class that has the smallest number of training samples; assume, without loss of generality, that $\mathcal{C}_0$ is such a class. The regularization parameter associated with the other class cannot be chosen arbitrarily and should be set as in (21), while the bias is selected according to (22). However, pursuing this design is not possible in practice, due to the dependence of (21) and (22) on the true covariance matrices. To solve this issue, we propose in the following theorem consistent estimators, depending only on the training samples, of the quantities arising in (21) and (22).

Theorem 4. Assume, without loss of generality, that $\mathcal{C}_0$ is the class with the smallest number of training samples, and let $\gamma_0$ be its associated regularization parameter. Define the estimator $\hat{\gamma}_1$ as:

(23)

Then, $\hat{\gamma}_1$ consistently estimates the regularization parameter $\gamma_1$ given in (21). Define further the quantities:

(24)

where the quantity involved in (24) writes as:

(25)

Finally, let $\hat{\omega}$ be given by:

(26)

Then, $\hat{\omega}$ consistently estimates the optimal bias given in (22).

Proof. See Appendix C.

It is worth mentioning that, unlike $\gamma_1$, the estimator $\hat{\gamma}_1$ is random. It does not satisfy (21) with equality, but it ensures (20) with high probability. Its use as a replacement for $\gamma_1$ leads asymptotically to the same results as the improved classifier using $\gamma_1$ itself.

With these consistent estimators at hand, we are now in a position to present the improved design of the R-QDA classifier.

Algorithm 1: Improved design of the R-QDA classifier.
Input: the regularization parameter $\gamma_0$ associated with the class $\mathcal{C}_0$ with the smallest number of training samples, and the training samples of both classes.
Output: estimates $\hat{\gamma}_1$ and $\hat{\omega}$ of the parameters to be plugged into (11).

  1. Compute $\hat{\gamma}_1$ as in (23).

  2. Compute $\hat{\omega}$ as in (26).

  3. Return $\hat{\gamma}_1$ and $\hat{\omega}$, to be plugged into the classification rule (11).

The improved design described in Algorithm 1 depends on the regularization parameter $\gamma_0$ associated with the class with the smallest number of training samples. One possible way to adjust this parameter is to resort to a traditional cross-validation approach, which consists in estimating, using a set of held-out data, the classification error rate for a set of candidate values of $\gamma_0$. Such an approach is, however, computationally expensive and cannot be used to test a large number of candidate values. As an alternative, we propose to build a consistent estimator of the classification error rate based on results from random matrix theory. This is the objective of the following theorem.

Theorem 5. Under Assumptions 1-4, a consistent estimator of the misclassification error rate associated with class $\mathcal{C}_i$ is given by an expression involving the quantity defined in (25), in the sense that the difference between this estimator and the true misclassification error rate converges to zero almost surely. It is worth noting that for $i = 1$, the corresponding quantity is replaced by its class-1 counterpart.
Proof. The proof of this theorem can be derived from the results established in Theorem 2 of (Elkhalil et al., 2017) and is therefore omitted.
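For completeness, the sketch below illustrates the tuning loop around Algorithm 1 using an empirical hold-out error in place of the consistent estimator of Theorem 5, and selecting the second regularization parameter and the bias by grid search instead of through (23) and (26); it reproduces the structure of the procedure, not the paper's closed-form estimators.

```python
import numpy as np
from itertools import product

def fit(X0, X1, g0, g1):
    """Per-class ridge-regularized statistics (assumed convention)."""
    stats = []
    for X, g in ((X0, g0), (X1, g1)):
        p = X.shape[1]
        R = np.cov(X, rowvar=False) + g * np.eye(p)
        stats.append((X.mean(axis=0), np.linalg.inv(R), np.linalg.slogdet(R)[1]))
    return stats

def score(x, stats, omega):
    (mu0, H0, ld0), (mu1, H1, ld1) = stats
    q0 = (x - mu0) @ H0 @ (x - mu0)
    q1 = (x - mu1) @ H1 @ (x - mu1)
    return 0.5 * (q1 - q0) + 0.5 * (ld1 - ld0) + omega

def holdout_error(stats, omega, V0, V1):
    """Balanced hold-out error; a stand-in for the consistent estimator of Theorem 5."""
    e0 = np.mean([score(x, stats, omega) < 0 for x in V0])   # class-0 points misclassified
    e1 = np.mean([score(x, stats, omega) > 0 for x in V1])   # class-1 points misclassified
    return 0.5 * (e0 + e1)

rng = np.random.default_rng(2)
p, n0, n1 = 40, 30, 200
make = lambda n, shift, scale: shift + scale * rng.standard_normal((n, p))
X0, X1 = make(n0, 0.0, 1.0), make(n1, 0.3, 1.2)    # unbalanced training sets
V0, V1 = make(200, 0.0, 1.0), make(200, 0.3, 1.2)  # hold-out sets

best = None
for g0, g1 in product([0.1, 0.5, 1.0, 5.0], repeat=2):
    stats = fit(X0, X1, g0, g1)
    for omega in np.linspace(-2.0, 2.0, 9):
        err = holdout_error(stats, omega, V0, V1)
        if best is None or err < best[0]:
            best = (err, g0, g1, omega)
print("best (error, gamma0, gamma1, omega):", best)
```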

4 Numerical results

4.1 Validation with synthetic data

In this section, we assess the performance of our improved R-QDA classifier and compare it with the standard R-QDA classifier in the case of unbalanced data. To this end, we start by generating, for both classes, synthetic data that comply with the assumptions used throughout this work, in order to validate our theoretical results.
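A possible generator for such synthetic data is sketched below: the two classes are Gaussian, with means differing by a vector of controlled norm and covariances differing by a low-rank perturbation, and with unbalanced training sizes. The specific dimensions, rank, and magnitudes are illustrative choices and need not match those used to produce the figures.

```python
import numpy as np

def generate_classes(p=100, n0=50, n1=500, mean_gap=1.0, rank=3, spike=4.0, seed=0):
    """Two Gaussian classes: mu_1 = mu_0 + d with ||d|| = mean_gap, and
    Sigma_1 = Sigma_0 plus a rank-`rank` positive perturbation (illustrative construction)."""
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(p)
    d = rng.standard_normal(p)
    mu1 = mu0 + mean_gap * d / np.linalg.norm(d)
    Sigma0 = np.eye(p)
    U = np.linalg.qr(rng.standard_normal((p, rank)))[0]  # orthonormal spike directions
    Sigma1 = Sigma0 + spike * U @ U.T
    X0 = rng.multivariate_normal(mu0, Sigma0, size=n0)   # unbalanced training sizes
    X1 = rng.multivariate_normal(mu1, Sigma1, size=n1)
    return X0, X1

X0, X1 = generate_classes()
print(X0.shape, X1.shape)
```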

Figure 2: Average misclassification error rate versus the regularization parameter $\gamma_0$, using the G-estimator, for $p$ features and unbalanced training sizes $n_0$ and $n_1$.
Figure 3: Average misclassification error rate versus the dimension $p$, for unbalanced training sizes $n_0$ and $n_1$.

In Figure 2 and Figure 3, we plot the classification error rate of the improved classifier and of the traditional R-QDA classifier with respect to the regularization parameter $\gamma_0$ and the feature dimension $p$, respectively. As can be seen, the standard R-QDA has a classification error rate that converges to the prior of the most dominant class, which reveals that, as expected, it tends to assign all observations to the same class, which in this case coincides with the class that presents the highest number of training samples. In contrast, the proposed R-QDA classifier achieves much better performance, making it more suitable for unbalanced settings. We finally note that the consistent estimator based on the results of Theorem 5 is accurate and can thus be used to properly adjust the regularization parameter $\gamma_0$.

4.2 Experiment with real data

In this section, we test the performance of the proposed R-QDA classifier on the public USPS dataset of handwritten digits (Lecun et al., 1998) and on the EEG dataset. The USPS dataset is composed of labeled digit images, each represented by its pixel values as features. The EEG dataset is composed of 5 classes containing 4097 observations. We consider the classification of two classes from each dataset. We tune the regularization parameter $\gamma_0$ to the value that minimizes the consistent estimate of the misclassification error rate given by Theorem 5. The values of $\hat{\gamma}_1$ and $\hat{\omega}$ are then computed based on (23) and (26). Figures 4 and 5 compare the performance of the proposed classifier with that of other state-of-the-art classification algorithms tuned by cross-validation, for different proportions of training samples between the two classes. As seen, our classifier not only outperforms the standard R-QDA but also the other existing classification algorithms. This suggests that the use of different regularization parameters across classes in the QDA classification rule, along with an adequate tuning of the bias, makes the QDA classifier more robust to the estimation noise of the covariance matrices in unbalanced settings.

Figure 4: Comparison of the performance of our improved R-QDA classifier with that of other machine learning algorithms on the EEG dataset.
Figure 5: Comparison of the performance of our improved R-QDA classifier with that of other machine learning algorithms on the USPS dataset.

5 Conclusion

A common belief holds that the use of R-QDA leads in general to lower classification performance than many other existing classification methods, even though it is a classifier derived from the maximum likelihood principle under a general Gaussian mixture model. As a matter of fact, contrary to other existing classifiers, the main issue of R-QDA lies in its high sensitivity to the estimation noise of the parameters of the Gaussian mixture model. Through a careful investigation of the classification rule of R-QDA, we proved that, in the case of unbalanced training data, the estimation noise leads R-QDA to assign all observations to the same class, which explains its inefficiency in classifying data under such settings. In this work, we proposed to modify the design of R-QDA so that it becomes more resilient to the estimation noise. In particular, we proposed to use two regularization parameters, one for each class, as well as a carefully designed bias in order to optimize the classification performance. Our design, which leverages advanced results from random matrix theory, clearly shows that there is room for improving basic classification methods through the use of advanced statistical tools.

References

  • P. Billingsley (1995) Probability and Measure. 3rd edition. Wiley.
  • Y. Cheng (2004) Asymptotic probabilities of misclassification of two discriminant functions in cases of high dimensional data. Statistics & Probability Letters 67, pp. 9–17.
  • P. Devijver and J. Kittler (1982) Pattern Recognition: A Statistical Approach.
  • K. Elkhalil, A. Kammoun, R. Couillet, T. Y. Al-Naffouri, and M. Alouini (2017) A large dimensional study of regularized discriminant analysis classifiers. arXiv:1711.00382.
  • R. A. Fisher (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), pp. 179–188.
  • W. Hachem, O. Khorunzhiy, P. Loubaton, J. Najim, and L. Pastur (2008) A new approach for mutual information analysis of large dimensional multi-antenna channels. IEEE Transactions on Information Theory 54 (9), pp. 3987–4004.
  • B. Jiang, X. Wang, and C. Leng (2015) QUDA: a direct approach for sparse quadratic discriminant analysis. Journal of Machine Learning Research 19.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • Q. Li and J. Shao (2015) Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica 25.
  • H. R. McFarland and D. St. P. Richards (2002) Exact misclassification probabilities for plug-in normal quadratic discriminant functions. Journal of Multivariate Analysis 82, pp. 299–330.
  • S. Raudys (1967) On determining training sample size of a linear classifier. Computing Systems 28, pp. 79–87 (in Russian).
  • R. Tibshirani (2009) The Elements of Statistical Learning.
  • C. Wang and B. Jiang (2018) On the dimension effect of regularized linear discriminant analysis. Electronic Journal of Statistics 12, pp. 2709–2742.
  • A. Zollanvari and E. R. Dougherty (2015) Generalized consistent error estimator of linear discriminant analysis. IEEE Transactions on Signal Processing 63 (11), pp. 2804–2814.

Appendix A.

As discussed in the paper, the design of the regularization parameters $\gamma_0$ and $\gamma_1$ should ensure that:

(27)

where the involved quantities are defined from the regularized sample covariance matrices. Using a standard identity valid for any two square matrices, (27) boils down to:

or equivalently:

Using Assumption 4, it can readily be seen that the first term is of the required order. To satisfy (27), we thus only need to design $\gamma_0$ and $\gamma_1$ such that:

or equivalently:

Under Assumption 4,