1 Introduction
Discriminant analysis encompasses a wide variety of techniques used for classification purposes. These techniques, commonly recognized among the class of model-based methods in the field of machine learning (Devijver and Kittler, 1982), rely on the assumption of a parametric model in which the outcome is described by a set of explanatory variables that follow a certain distribution. Among them, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) stand out as the most representative. LDA is often connected with, or confused with, Fisher discriminant analysis (FDA) (Fisher, 1936), a method that projects the data onto a subspace and turns out to coincide with LDA when the target subspace has two dimensions. Both LDA and QDA are obtained by maximizing the posterior probability under the assumption that observations follow a normal distribution, the single difference being that LDA assumes a common covariance across classes while QDA allows the most general situation, in which classes possess different means and covariances. If the data follow the normal distributions perfectly and the statistics are perfectly known, QDA is the optimal classifier, achieving the lowest possible classification error rate
(Tibshirani, 2009). It coincides with LDA when the covariances are equal but outperforms it when they differ. However, in practical scenarios, the use of QDA has not always been shown to yield the expected performance. This is because the mean and covariance of each class, which are in general unknown, are estimated from available training data with perfectly known classes. The obtained estimates are then used as plug-in estimators in the classification rules associated with LDA and QDA. The estimation error of the class statistics provably degrades the performance, reaching very high levels when the number of samples is comparable to or smaller than their dimension. In this latter situation, QDA and LDA, which rely on computing the inverse of the covariance matrix, cannot be used. To overcome this issue, one technique consists in using a regularized estimate of the covariance matrix as a plug-in estimator, giving the name regularized LDA (RLDA) or regularized QDA (RQDA) to the associated classifiers. However, this solution does not allow for a significant reduction of the estimation noise. The situation is even worse for RQDA, since the number of samples used to estimate the covariance matrix of each class is lower than for LDA. This is probably the reason why LDA has provided better performance than QDA in many scenarios, although it might wrongly consider the covariances across classes to be equal. A question of major theoretical and practical interest is to investigate to which extent the estimation noise of the covariance matrix impacts the performance of RLDA and RQDA. In this respect, the study of LDA, and subsequently that of RLDA, has received particular attention, dating back to the early works of Raudys (Raudys, 1967)
, before being investigated again using recent advances in random matrix theory in a series of works (Zollanvari and Dougherty, 2015; Wang and Jiang, 2018). However, the theoretical analysis of QDA and RQDA is scarcer and very often limited to specific situations in which the number of samples exceeds the dimension of the statistics (McFarland and Richards, 2002), or to specific structures of the covariance matrices (Cheng, 2004; Li and Shao, 2015; Jiang et al., 2015). It was only recently that the work in (Elkhalil et al., 2017) considered the analysis of RQDA for general structures of the covariance matrices and identified the asymptotic conditions necessary for QDA not to exhibit the trivial behavior of always returning the same class or randomly guessing it. In particular, the work in (Elkhalil et al., 2017) assumes balanced data across classes, because otherwise RQDA would tend to assign all observations to one class, thereby limiting the use of RQDA in general settings. This lies behind the main motivation of the present work. Based on a careful investigation of the asymptotic behavior of RQDA under unbalanced settings in binary classification problems, we propose to amend the traditional RQDA to cope with cases in which the proportions of training data from the two classes are not equal. The new classifier is based on two different regularization parameters instead of a common one, as well as an optimized bias properly chosen to minimize the misclassification error rate. Interestingly, we show that the proposed classifier not only outperforms RLDA and RQDA but also other state-of-the-art classification methods, opening promising avenues for the use of the proposed classifier in practical scenarios.
The rest of the paper is organized as follows. In Section 2, we provide an overview of the quadratic discriminant classifier and identify the issues related to its use in unbalanced settings. In Section 3, we propose an improved version of the RQDA classifier that overcomes all these problems, and we design a consistent estimator of the misclassification error rate that can be used as an alternative to the traditional cross-validation approach. Finally, Section 4 contains simulations on both synthetic and real data that confirm our theoretical results.
Notations
Scalars, vectors and matrices are respectively denoted by non-boldface, boldface lowercase and boldface uppercase characters.
and are respectively the matrices of zeros and ones of size , and denotes the identity matrix. The notation stands for the Euclidean norm for vectors and the spectral norm for matrices. , and stand for the transpose, the trace and the determinant of a matrix, respectively. For two functions f and g, we say that , if such that . We also say that , if such that . , and respectively denote the probability measure, the convergence in probability and the almost sure convergence of random variables.
denotes the cumulative distribution function (CDF) of the standard normal distribution, i.e. .

2 Regularized quadratic discriminant analysis
As aforementioned, RQDA is equivalent to a classifier that assigns all observations to the same class when designed from a set of unbalanced training samples. This behavior led the authors in (Elkhalil et al., 2017) to consider the analysis of RQDA only under a balanced training sample. In this section, we show that this behavior can easily be predicted through a close examination of the mean and variance of the classification rule associated with RQDA. This constitutes an important step that will pave the way towards the improved RQDA presented in the next section. But prior to that, we first review the traditional RQDA for binary classification.
2.1 Regularized QDA for binary classification
For ease of presentation, we focus on binary classification problems with two distinct classes. We assume that the data follow a Gaussian mixture model, such that observations in class , are drawn from a multivariate Gaussian distribution with mean and covariance . More formally, we assume that (1)
Let , i = 0, 1, denote the prior probability that belongs to class . The classification rule associated with the QDA classifier is given by (2)
which is used to classify the observations based on the following rule:
(3) 
As seen from (2), the classification rule of QDA involves the true parameters of the Gaussian distributions, namely the means and covariances associated with each class. In practice, these parameters are not known. One approach to solving this issue is to estimate them using the available training data. The obtained estimates are then used as plug-in estimators in (2). In particular, consider the case in which training observations for each class are available and denote by and their respective samples. The sample estimates of the mean and covariance of each class are then given by:
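These are the standard per-class sample estimates. As a concrete sketch, the following assumes the common 1/n_i normalization for the sample covariance; the paper may use 1/(n_i - 1) instead, which is not recoverable from this excerpt.

```python
import numpy as np

def class_statistics(X):
    """Sample mean and sample covariance of one class.

    X: (n_i, p) array holding the n_i training observations of class i.
    The 1/n_i normalization below is an assumed convention.
    """
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / X.shape[0]
    return mu_hat, sigma_hat
```

The estimates of each class are computed independently from that class's training samples only, which is why unbalanced data makes one of the two covariance estimates much noisier than the other.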
In case the number of samples or is less than the number of features, the sample covariance matrix cannot be used as a plug-in estimator since its inverse is not defined. A popular approach to circumvent this issue is to consider a regularized estimator of the inverse of the covariance matrix given by
(4) 
where is a regularization parameter, which serves to shrink the sample covariance matrix towards the identity. Replacing by yields the following classification rule
(5) 
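A minimal sketch of a plug-in rule of this type is given below. Since the exact displays (4)-(5) are not legible in this excerpt, the shrinkage form (S_i + gamma*I)^{-1} and the sign convention (positive score selects class 1) are assumptions, not the paper's verbatim formulas.

```python
import numpy as np

def rqda_score(x, mu0, S0, mu1, S1, pi0, pi1, gamma):
    """Plug-in RQDA discriminant for a test point x (sketch).

    Textbook QDA rule with each inverse sample covariance replaced by
    the regularized inverse (S_i + gamma * I)^{-1}. Positive score ->
    class 1, negative -> class 0 (assumed convention).
    """
    p = x.shape[0]
    H0 = np.linalg.inv(S0 + gamma * np.eye(p))
    H1 = np.linalg.inv(S1 + gamma * np.eye(p))
    d0, d1 = x - mu0, x - mu1
    return (
        0.5 * (d0 @ H0 @ d0 - d1 @ H1 @ d1)          # quadratic terms
        + 0.5 * (np.linalg.slogdet(H1)[1]
                 - np.linalg.slogdet(H0)[1])          # log-determinant bias
        + np.log(pi1 / pi0)                           # prior bias
    )
```

The log-determinant and log-prior terms form the data-independent bias that Section 2.2 shows to dominate asymptotically in unbalanced settings.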
The RQDA classifier wrongly classifies observation if when or if when . Conditioning on the training sample , the classification error associated with class is thus given by
(6) 
which gives the following expression for the total misclassification error probability
(7) 
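The conditional errors in (6)-(7) can be approximated numerically by Monte Carlo once the training set (and hence the rule) is fixed; the sketch below assumes the sign convention that a positive score selects class 1.

```python
import numpy as np

def conditional_error_mc(score_fn, mu, Sigma, true_class, n_mc=20000, seed=0):
    """Monte Carlo estimate of the class-conditional error in (6).

    Draws fresh test points from N(mu, Sigma) of the true class and
    counts how often the training-conditioned rule `score_fn` (a fixed
    function of a single observation) disagrees with that class.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    X = mu + rng.normal(size=(n_mc, mu.shape[0])) @ L.T
    scores = np.array([score_fn(x) for x in X])
    predicted = (scores > 0).astype(int)  # assumed: score > 0 -> class 1
    return np.mean(predicted != true_class)
```

The total error (7) is then the prior-weighted sum of the two conditional errors.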
2.2 Identification of the problems of the RQDA classifier in unbalanced data settings
In this section, we unveil several issues pertaining to the use of the classification rule (2) of RQDA in high-dimensional settings. These issues can be revealed through a careful investigation of the asymptotic distribution of the classification rule associated with RQDA. We first recall that the classification rule associated with RQDA is a quadratic function of the Gaussian test observation and as such behaves like a Gaussian random variable with a certain mean and variance as long as the Lyapunov conditions are met (Billingsley, 1995). To get direct insight into how RQDA behaves, we assume that there is asymptotically no error in assuming that, when belongs to class , behaves like a Gaussian with mean and variance , where the expected value and variance are taken with respect to the distribution of the testing observation , and the scaling factor is used to produce fluctuations of order . For RQDA to lead to appropriate behavior (including a perfect classification error rate), the means should be of opposite signs (namely and ) and at least of order , while the variances should be . This latter condition on the variance is ensured provided that the spectral norms of the covariances are bounded and the difference between the mean vectors has norm at most . Under these assumptions, and taking the expectation over the testing observation, and satisfy:
(8)  
(9) 
It can be easily seen that under the assumption that , and the spectral norms of are bounded uniformly in , the means are asymptotically approximated as:
(10) 
Several important remarks are in order regarding (10). First, we note that the prior probabilities and do not asymptotically play any role in the classification, since the term tends to zero. Second, one can easily see that if the distance between the covariances is such that and , which occurs for instance when has at most rank (Elkhalil et al., 2017), the means are given by:
It thus appears that the direct use of RQDA poses two main issues. The first one concerns the bias term, whose contribution to and is asymptotically independent of the mean vectors and the prior probabilities. This makes RQDA perform classification only on the basis of the covariance matrix. It is thus important to modify the bias term. The second issue is that, unlike the balanced case for which and were shown to differ when there are exactly eigenvalues of order (Elkhalil et al., 2017), and are, up to order , the same for both classes. This is clearly illustrated in Figure 1, which displays the histogram associated with the classification rule of RQDA and that of QDA with perfect knowledge of the statistics. As can be seen, the use of RQDA does not allow discrimination between the two classes, since the mean of the classification rule under class or class is the same at the highest order. Based on random matrix theory results, we can prove that such behavior is caused by the use of the same regularization parameter for both and . In light of these observations, we propose to replace the classification rule of RQDA by the following rule: (11)
where 1) and are two regularization parameters, one per class, carefully devised so that the means when or are and reflect the class under consideration, and 2) is a bias term set to the value that minimizes the asymptotic classification error rate.
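Structurally, the modified rule keeps the quadratic terms of RQDA but uses two different regularized inverses and replaces the data-independent bias by a free parameter. The sketch below reproduces only this structure; the exact placement of the bias in (11) is not recoverable from this excerpt and is an assumption.

```python
import numpy as np

def improved_rqda_score(x, mu0, S0, mu1, S1, gamma0, gamma1, omega):
    """Sketch of the modified rule (11): per-class regularizers, free bias.

    gamma0, gamma1: class-specific regularization parameters.
    omega: tunable bias replacing the log-determinant/log-prior bias
    (assumed additive placement). Positive score -> class 1 (assumed).
    """
    p = x.shape[0]
    H0 = np.linalg.inv(S0 + gamma0 * np.eye(p))
    H1 = np.linalg.inv(S1 + gamma1 * np.eye(p))
    d0, d1 = x - mu0, x - mu1
    return 0.5 * (d0 @ H0 @ d0 - d1 @ H1 @ d1) + omega
```

Sections 3 then shows how to pick (gamma0, gamma1) jointly and how to set omega optimally.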
3 Design of the improved RQDA classifier
In this section, we propose an improved design of the RQDA classifier that fixes the aforementioned issues encountered in unbalanced settings. The design is based on an asymptotic analysis of the statistic in (11) under the following asymptotic regime, which was also considered in (Elkhalil et al., 2017):
Assumption 1 (Data scaling). and
Assumption 2 (Mean scaling).
Assumption 3 (Covariance scaling). ,
Assumption 4. Matrix has exactly eigenvalues of order . The remaining eigenvalues are of order .
Assumptions 1 and 3 are standard and are often used to describe a growth regime in which the number of features scales comparably with the number of samples and the spectral norms of both covariance matrices remain bounded. Assumption 2 defines the smallest distance between the mean vectors for which they can still be used to discriminate between the two classes, while Assumption 4, introduced in (Elkhalil et al., 2017), ensures that the difference between the covariances has a contribution of the same order of magnitude as that of the difference between the mean vectors.
Under the asymptotic regime specified by Assumptions 1-4, and along the same lines as in (Elkhalil et al., 2017), we analyze the classification error rate of the proposed classifier with classification rule (11). Before presenting the corresponding result, we first introduce the following notation, which defines deterministic objects that naturally appear when using random matrix theory results.
For , let be the unique positive solution to the following equation:
(12) 
The existence and uniqueness of follow from standard results in random matrix theory (Hachem et al., 2008). For , we also define matrices , as:
(13) 
and the scalars and as:
(14) 
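Fixed-point equations such as (12) are typically solved by simple iteration. Since (12) itself is not legible in this excerpt, the sketch below uses a standard Marchenko-Pastur-type form, delta = (1/n) tr[ Sigma (gamma I + Sigma/(1+delta))^{-1} ], which is the usual shape of such deterministic-equivalent equations (Hachem et al., 2008); the paper's exact equation may differ.

```python
import numpy as np

def solve_delta(Sigma, n, gamma, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for a deterministic-equivalent delta (sketch).

    Iterates delta <- (1/n) tr[ Sigma (gamma*I + Sigma/(1+delta))^{-1} ]
    until convergence. Existence and uniqueness of the positive solution
    guarantee convergence of this simple scheme for this assumed form.
    """
    p = Sigma.shape[0]
    delta = 1.0  # arbitrary positive starting point
    for _ in range(max_iter):
        new = np.trace(
            Sigma @ np.linalg.inv(gamma * np.eye(p) + Sigma / (1 + delta))
        ) / n
        if abs(new - delta) < tol:
            return new
        delta = new
    return delta
```

For Sigma = I the equation reduces to a scalar quadratic, which gives a quick sanity check on the iteration.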
With these notations at hand, we are now in a position to state the first asymptotic result: Theorem 1 Under Assumptions 1-4, and assuming that the regularization parameters and are , the classification error rate associated with class satisfies:
(15) 
where
(16)  
(17)  
(18)  
(19) 
Proof. The proof follows along the same lines as in (Elkhalil et al., 2017) and is as such omitted.
Remark: Under Assumption 4, it can be shown that can asymptotically be simplified to
Moreover, the term is and as such converges to zero as grows to infinity. However, in our simulations, we chose to work with the non-simplified expressions for and to keep the term , since we observed that doing so yields better accuracy in finite-dimensional simulations.
The result of Theorem 1 provides guidelines on how to choose and and the optimal bias . As discussed before, the design should require the mean of the classification rule to be and to reflect the class under consideration. This mean is represented in the asymptotic expression of the classification error rate by the quantity , which, at first sight, is as and . Moreover, the class of the testing observation is not reflected in , since under Assumptions 3-4, in case , . To solve this issue, we need to design and such that is , or equivalently,
(20) 
so that becomes different from at its highest order. To this end, we prove that it suffices to select the regularization parameter associated with the class with the largest number of samples as:
Theorem 2 Under Assumptions 1-4, and assuming that , if
(21) 
where is fixed to a given constant, then .
Proof. See Appendix A.
It is worth mentioning that in the balanced case, plugging into (21) yields . It is thus not necessary to use different regularization parameters when the classes are balanced.
With the regularization parameters thus set, the optimal bias can be chosen so that the asymptotic classification error rate, given by:
is minimized.
Theorem 3 The optimal bias that minimizes the asymptotic classification error rate is given by:
(22) 
where
Proof. See Appendix B.
Before proceeding further, it is important to note that, thanks to the careful choice of the regularization parameters and provided in Theorem 2, the term is for ,
Additionally, it can easily be shown that the term is of order . As a result, both and are .
On another note, it is worth mentioning that even in the case of balanced classes , characterized by as proved in Theorem 2, the optimal bias is different from the one used in RQDA. As such, the proposed design improves on the traditional RQDA studied in (Elkhalil et al., 2017) even in the balanced case, by optimally adapting the bias term to the case where the covariance matrices are not known.
Theorem 2 and Theorem 3 can be used to obtain an optimized design of the proposed RQDA classifier.
As can be seen, the improved classifier employs only one free regularization parameter, associated with the class that presents the smallest number of training samples. Assume is such a class. The regularization parameter associated with the other class cannot be arbitrarily chosen and should be set as in (21), while the bias is selected according to (22). However, pursuing this design is not possible in practice due to the dependence of (21) and (22) on the true covariance matrices. To solve this issue, we propose in the following theorem consistent estimators, depending only on the training samples, of the quantities arising in (21) and (22).
Theorem 4 Assume and let be the regularization parameter associated with class . Let be given by:
and define as:
(23) 
Then,
where is given in (21). Define , and as:
(24)  
where writes as:
(25) 
Let be given by:
(26) 
Then,
where is given in (22).
Proof. See Appendix C.
It is worth mentioning that, unlike , is random. It does not satisfy (21) with equality but ensures (20) with high probability. Its use as a replacement for leads asymptotically to the same results as the improved classifier using .
With these consistent estimators at hand, we are now in a position to present the improved design of the RQDA classifier:
Algorithm 1: Improved design of the RQDA classifier.
Input: Assuming , let be the regularization parameter associated with class , the training samples in and
Output: Estimates of the parameters and to be plugged into (11)
Compute as in (23)
Compute as in (26)
Return and , to be plugged into the classification rule (11)
The improved design described in Algorithm 1 depends on the regularization parameter associated with the class with the smallest number of training samples. One possible way to adjust this parameter is to resort to a traditional cross-validation approach, which consists in estimating, using a set of testing data, the classification error rate for a set of candidate values of the regularization parameter . Such an approach is however computationally expensive and cannot be used to test a large number of candidate values for . As an alternative, we rather propose to build a consistent estimator of the classification error rate based on results from random matrix theory. This is the objective of the following theorem:
Theorem 5 Under Assumptions 1-4, a consistent estimator of the misclassification error rate associated with class is given by:
where is given in (25) and
in the sense that:
It is worth noting that for i=1, is replaced by .
Proof. The proof of this theorem can be derived from the results established in Theorem 2 of (Elkhalil et al., 2017) and is as such omitted.
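Once such a training-data-only error estimator is available, tuning the free regularization parameter reduces to a cheap one-dimensional grid scan, with no cross-validation folds. The helper below is generic: `error_estimator(gamma)` stands in for any consistent estimate of the total error (the estimator of Theorem 5 would play this role); the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def tune_gamma(error_estimator, gamma_grid):
    """Pick the regularizer minimizing a plug-in error estimate.

    error_estimator: callable gamma -> estimated misclassification rate,
    built from the training data alone. Returns the best gamma and the
    corresponding estimated error.
    """
    errors = [error_estimator(g) for g in gamma_grid]
    best = int(np.argmin(errors))
    return gamma_grid[best], errors[best]
```

Because each evaluation reuses the same training statistics, a fine grid costs little compared with refitting and re-testing per fold as cross-validation would require.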
4 Numerical results
4.1 Validation with synthetic data
In this section, we assess the performance of our improved RQDA classifier and compare it with the standard QDA classifier in the case of unbalanced data. To this end, we start by generating synthetic data for both classes that comply with the different assumptions used throughout this work, in order to validate our theoretical results.
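A sketch of such a generator is given below. All concrete choices (dimensions, sample sizes, the rank and strength of the covariance perturbation) are illustrative values consistent with Assumptions 1-4, not the paper's exact experimental setup.

```python
import numpy as np

def make_unbalanced_gaussian_data(p=100, n0=300, n1=60, rank=3, seed=0):
    """Synthetic two-class Gaussian data in the spirit of Assumptions 1-4.

    Class 0 is N(0, I_p). Class 1 has a mean shift of O(1) Euclidean norm
    and a covariance differing from I_p by a rank-`rank` perturbation,
    mirroring the low-rank covariance difference of Assumption 4.
    n0 >> n1 creates the unbalanced setting studied in the paper.
    """
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(p)
    mu1[:rank] = 1.0                      # O(1)-norm mean difference
    U, _ = np.linalg.qr(rng.normal(size=(p, rank)))
    Sigma1 = np.eye(p) + 2.0 * U @ U.T    # rank-`rank` covariance difference
    X0 = rng.normal(size=(n0, p))
    X1 = mu1 + rng.normal(size=(n1, p)) @ np.linalg.cholesky(Sigma1).T
    return X0, X1
```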
In Figures 2 and 3, we plot the classification error rate of the improved classifier and of the traditional RQDA classifier with respect to the regularization parameter and the features' dimension , respectively. As can be seen, the standard RQDA has a classification error rate that converges to the prior of the dominant class, which reveals that, as expected, it tends to assign all observations to the same class, in this case the class with the highest number of training samples. In contrast, the proposed RQDA classifier achieves much better performance, making it more suitable for unbalanced settings. We finally note that the consistent estimator based on the results of Theorem 5 is accurate and can thus be used to properly adjust the regularization parameter .
4.2 Experiment with real data
In this section, we test the performance of the proposed RQDA classifier on the public USPS dataset of handwritten digits (Lecun et al., 1998) and on an EEG dataset. The USPS dataset is composed of labeled digit images, each with features represented by pixels. The EEG dataset is composed of 5 classes that contain 4097 observations, and each observation has features. We consider the classification of two classes from each dataset, composed of and samples. Based on the results of Theorem 3, we tune the regularization factor to the value that minimizes the consistent estimate of the misclassification error rate. The values of and are then computed based on (23) and (26). Figures 4 and 5 compare the performance of the proposed classifier with other state-of-the-art classification algorithms using cross-validation, for different proportions of and . As seen, our classifier, termed in the figures, not only outperforms the standard QDA but also other existing classification algorithms. This suggests that the use of different regularization parameters across classes in the QDA classification rule, along with an adequate tuning of the bias, makes the QDA classifier more robust to the estimation noise of the covariance matrices in unbalanced settings.
5 Conclusion
A common belief holds that the use of RQDA generally leads to lower classification performance than many other existing classification methods, even though it is a classifier derived from the maximum likelihood principle under a general Gaussian mixture model. As a matter of fact, contrary to other existing classifiers, the main issue of RQDA lies in its high sensitivity to the estimation noise of the parameters of the Gaussian mixture model. Through a careful investigation of the classification rule of RQDA, we prove that, in the case of unbalanced training data, the estimation noise leads RQDA to assign all observations to the same class, which explains its inefficiency in classifying data under such settings. In this work, we propose to modify the design of RQDA so that it becomes more resilient to the estimation noise. In particular, we propose to use a separate regularization parameter for each class as well as a carefully designed bias to optimize the classification performance. Our design, which leverages advanced results from random matrix theory, clearly shows that there is room for improving basic classification methods through the use of advanced statistical tools.
References
Billingsley (1995). Probability and Measure, 3rd edition. Wiley.
Cheng (2004). Asymptotic probabilities of misclassification of two discriminant functions in cases of high dimensional data. Statistics & Probability Letters 67, pp. 9–17.
Devijver and Kittler (1982). Pattern Recognition: A Statistical Approach.
Elkhalil et al. (2017). A large dimensional study of regularized discriminant analysis classifiers. arXiv:1711.00382.
Fisher (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(2), pp. 179–188.
Hachem et al. (2008). A new approach for mutual information analysis of large dimensional multi-antenna channels. IEEE Transactions on Information Theory 54(9), pp. 3987–4004.
Jiang et al. (2015). QUDA: a direct approach for sparse quadratic discriminant analysis. Journal of Machine Learning Research 19.
Lecun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
Li and Shao (2015). Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica 25.
McFarland and Richards (2002). Exact misclassification probabilities for plug-in normal quadratic discriminant functions. Journal of Multivariate Analysis 82, pp. 299–330.
Raudys (1967). On determining training sample size of a linear classifier. Computing Systems 28, pp. 79–87 (in Russian).
Tibshirani (2009). The Elements of Statistical Learning.
Wang and Jiang (2018). On the dimension effect of regularized linear discriminant analysis. Electronic Journal of Statistics 12, pp. 2709–2742.
Zollanvari and Dougherty (2015). Generalized consistent error estimator of linear discriminant analysis. IEEE Transactions on Signal Processing 63(11), pp. 2804–2814.
Appendix A.
As discussed in the paper, the design of the regularization parameters and should ensure that:
(27) 
where with . Using the relation for any two square matrices and , (27) boils down to:
or equivalently:
Using Assumption 4, it can readily be seen that the first term is . To satisfy (27), we thus only need to design and such that:
or equivalently:
Under Assumption 4,