A Large Dimensional Analysis of Regularized Discriminant Analysis Classifiers

11/01/2017
by Khalil Elkhalil, et al.

This article carries out a large dimensional analysis of standard regularized discriminant analysis classifiers designed on the assumption that data arise from a Gaussian mixture model with different means and covariances. The analysis relies on fundamental results from random matrix theory (RMT) when both the number of features and the cardinality of the training data within each class grow large at the same pace. Under mild assumptions, we show that the asymptotic classification error approaches a deterministic quantity that depends only on the means and covariances associated with each class as well as the problem dimensions. Such a result permits a better understanding of the performance of regularized discriminant analysis in practical, large but finite, dimensions, and can be used to determine and pre-estimate the optimal regularization parameter that minimizes the misclassification error probability. Despite being theoretically valid only for Gaussian data, our findings are shown to yield high accuracy in predicting the performance achieved with real data sets drawn from the popular USPS database, thereby making an interesting connection between theory and practice.



Code Repositories

Large-Dimensional-Discriminant-Analysis-Classifiers-with-Random-Matrix-Theory

This Julia code can be used to reproduce the results of the paper "A Large Dimensional Analysis of Regularized Discriminant Analysis Classifiers".



I Introduction

Linear discriminant analysis (LDA) is an old concept that dates back to Fisher and generalizes the Fisher discriminant [2, 3]. Given two statistically defined datasets, or classes, the Fisher discriminant analysis is designed to maximize the ratio of the variance between classes to the variance within classes, and is useful for both classification and dimensionality reduction [4, 5]. LDA, on the other hand, relying merely on the concept of model-based classification [4], is conceived so that the misclassification rate is minimized under a Gaussian assumption for the data. Interestingly, both ideas lead to the same classifier when the data of both classes share the same covariance matrix. Maintaining the Gaussian assumption but considering the general case of distinct covariance matrices, quadratic discriminant analysis (QDA) becomes the optimal classifier in terms of minimizing the misclassification rate when both the statistical means and the covariances of the classes are known.

In practice, these parameters are rarely known and must be estimated from training data. Assuming the number of training samples is high enough, QDA and LDA should remain asymptotically optimal. It is however often the case in practice that the data dimension is comparable to, if not larger than, the number of observations. In such circumstances, the covariance matrix estimate becomes ill-conditioned or even non-invertible, which leads to poor classification performance.

To overcome this difficulty, many techniques can be considered. One can resort to dimensionality reduction so as to embed the data in a low-dimensional space that retains most of the useful information from a classification point of view [6, 7]. This ensures a higher number of training samples than the effective data dimension. Questions as to which dimensions should be selected, or to what extent the dimension should be reduced, remain open. Another alternative involves the regularized versions of LDA and QDA, denoted throughout this paper by R-LDA and R-QDA [5, 8]. Both approaches constitute the main focus of this article.

There exist many works on the performance analysis of discriminant analysis classifiers. In [9], an exact analysis of QDA is carried out by relying on properties of Wishart matrices. This allows for exact expressions of the misclassification probability for any sample size and dimension. The analysis is however only valid as long as the number of training samples exceeds the dimension. Generalizing it to regularized versions is beyond analytical reach. This motivated further studies to consider asymptotic regimes. In [10, 11], the authors consider large-dimensional asymptotics and observe that LDA and QDA fall short even when the exact covariance matrix is known. [10] thus proposed improved LDA and PCA methods that exploit sparsity assumptions on the statistical means and covariances, which are however not necessarily met in practice. This leads us to consider in the present work the double asymptotic regime in which both the dimension and the number of training samples tend to infinity with a fixed ratio. This regime leverages results from random matrix theory [12, 13, 14, 15, 16]. For LDA, this regime was first considered in [17] under the assumption of equal covariance matrices. It was extended to the analysis of R-LDA in [8] and to the Euclidean distance discriminant rule in [18]. To the best of the authors' knowledge, the general case in which the covariances across classes are different has never been treated. As shown in the course of the paper, a major difficulty of the analysis resides in choosing the assumptions governing the growth rates of the means and covariances so as to obtain a non-trivial asymptotic classification performance.

This motivates the present work. Particularly, we propose a large dimensional analysis of both R-LDA and R-QDA in the double asymptotic regime under general Gaussian assumptions. Precisely, under technical, yet mild, assumptions controlling the distances between the class means and covariances, we prove that the probability of misclassification converges to a non-trivial deterministic quantity that only depends on the class statistics and on the ratio of the dimension to the number of training samples. Interestingly, R-LDA and R-QDA require different growth regimes, reflecting a fundamental difference in the way they leverage the information about the means and covariances. Notably, R-QDA requires a larger minimal distance between the class means than R-LDA does. However, R-LDA does not seem to leverage the information about the distance between the covariance matrices. The results of [8] are in particular recovered when the spectral norm of the difference of the covariance matrices is small. These findings provide insights into when LDA or QDA should be preferred in practical scenarios.

To sum up, our main results are as follows:

  • Under mild assumptions, we establish the convergence of the misclassification rate for both R-LDA and R-QDA classifiers to a deterministic error as a function of the statistical parameters associated with each class.

  • We design a consistent estimator for the misclassification rate for both R-LDA and R-QDA classifiers allowing for a pre-estimate of the optimal regularization parameter.

  • We validate our theoretical findings on both synthetic and real data drawn from the USPS dataset and illustrate the good accuracy of our results in both settings.

The remainder is organized as follows. We give an overview of discriminant analysis for binary classification in Section II. The main results are presented in Section III, the proofs of which are deferred to the Appendix. In Section IV, we design a consistent estimator of the misclassification error rate. We validate our analysis for real data in Section V and conclude the article in Section VI.

Notations

Scalars, vectors and matrices are respectively denoted by non-boldface, boldface lowercase and boldface uppercase characters. $\mathbf{0}_{m\times n}$ and $\mathbf{1}_{m\times n}$ are respectively the matrices of zeros and ones of size $m\times n$, and $\mathbf{I}_p$ denotes the $p\times p$ identity matrix. The notation $\|\cdot\|$ stands for the Euclidean norm for vectors and the spectral norm for matrices. $(\cdot)^{T}$, $\mathrm{tr}(\cdot)$ and $\det(\cdot)$ stand for the transpose, the trace and the determinant of a matrix, respectively. For two functionals $f$ and $g$, we say that $f = O(g)$ if there exists a constant $M$ such that $|f| \leq M |g|$. $\mathbb{P}(\cdot)$, $\xrightarrow{d}$, $\xrightarrow{p}$ and $\xrightarrow{a.s.}$ respectively denote the probability measure, the convergence in distribution, the convergence in probability and the almost sure convergence of random variables. $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, i.e., $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^{2}/2}\,dt$.

II Discriminant Analysis for Binary Classification

This paper studies binary discriminant analysis techniques, which employ a discriminant rule to assign an input data vector to the class to which it most likely belongs. The discriminant rule is designed based on available training data with known class labels. In this paper, we consider the case in which a Bayesian discriminant rule is employed. Hence, we assume that observations from class $\mathcal{C}_i$ are independent and sampled from a multivariate Gaussian distribution with mean $\boldsymbol{\mu}_i$ and non-negative covariance matrix $\boldsymbol{\Sigma}_i$. Formally speaking, an observation vector $\mathbf{x}$ is classified to $\mathcal{C}_i$ if

(1)

As stated in [19], for distinct covariance matrices $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}_1$, the discriminant rule is equivalent to assigning $\mathbf{x}$ to class $\mathcal{C}_0$ if the discriminant score

(2)

is positive, and to class $\mathcal{C}_1$ otherwise, where $\pi_i$ is the prior probability of class $\mathcal{C}_i$. In particular,

(3)

When the considered classes have the same covariance matrix, i.e., $\boldsymbol{\Sigma}_0 = \boldsymbol{\Sigma}_1$, the discriminant function simplifies to [5, 4, 8]

(4)

Classification hence obeys the following rule:

(5)

Since the discriminant function in (4) is linear in $\mathbf{x}$, the corresponding classification method is referred to as linear discriminant analysis. As can be seen from (3) and (5), the classification rules assume knowledge of the class statistics, namely the associated covariance matrices and mean vectors. In practice, these statistics are estimated from the available training data. As such, we assume that $n_0$ and $n_1$ independent training samples are respectively available from classes $\mathcal{C}_0$ and $\mathcal{C}_1$ to estimate the mean and the covariance matrix of each class. For that, we consider the following sample estimates

where the pooled sample covariance matrix combines the training samples of both classes. To avoid singularity issues when the dimension exceeds the number of training samples, we use the ridge estimator of the inverse of the covariance matrix [5]

(6)
(7)

where $\gamma > 0$ is a regularization parameter. Replacing the true statistics by their sample estimates and the inverse covariance matrices by (6) and (7) in (4) and (3), we obtain the following discriminant rules

(8)
(9)

The corresponding classification methods will be denoted respectively by R-LDA and R-QDA. Conditioned on the training samples, the classification errors associated with R-LDA and R-QDA when the observation belongs to class $\mathcal{C}_i$ are given by

(10)
(11)

while the total classification errors are respectively given by

In the following, we propose to analyze the asymptotic classification errors of both R-LDA and R-QDA when the dimension and the numbers of training samples grow large at the same rate. For R-LDA, our results cover a more general setting than the one studied in [8], in that they apply to the case where both classes have distinct covariance matrices.
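
To make the above pipeline concrete, the following minimal Python sketch implements the plug-in classifiers described in this section. It is an illustration only: the helper names are ours, and the ridge convention H = (I + gamma * Sigma_hat)^{-1} is one common parameterization assumed here; the paper's exact definitions are the ones given in (6)-(7).

```python
import numpy as np

def fit_rda(X0, X1, gamma):
    """Sample means, per-class and pooled covariances, and ridge-type inverses.

    Assumes the ridge convention H = inv(I + gamma * Sigma_hat); the paper's
    exact parameterization is the one given in its equations (6)-(7).
    """
    p = X0.shape[1]
    n0, n1 = X0.shape[0], X1.shape[0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.cov(X0, rowvar=False)                         # per-class sample covariances
    S1 = np.cov(X1, rowvar=False)
    Sp = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)  # pooled sample covariance
    I = np.eye(p)
    H = np.linalg.inv(I + gamma * Sp)                     # R-LDA ridge inverse
    H0 = np.linalg.inv(I + gamma * S0)                    # R-QDA ridge inverses
    H1 = np.linalg.inv(I + gamma * S1)
    priors = (n0 / (n0 + n1), n1 / (n0 + n1))             # plug-in class priors
    return dict(m0=m0, m1=m1, H=H, H0=H0, H1=H1, priors=priors)

def rlda_score(x, f):
    """Linear discriminant score; positive -> class 0, negative -> class 1."""
    return (x - (f["m0"] + f["m1"]) / 2) @ f["H"] @ (f["m0"] - f["m1"]) \
        + np.log(f["priors"][0] / f["priors"][1])

def rqda_score(x, f):
    """Quadratic discriminant score; positive -> class 0, negative -> class 1."""
    q0 = -0.5 * (x - f["m0"]) @ f["H0"] @ (x - f["m0"]) + 0.5 * np.linalg.slogdet(f["H0"])[1]
    q1 = -0.5 * (x - f["m1"]) @ f["H1"] @ (x - f["m1"]) + 0.5 * np.linalg.slogdet(f["H1"])[1]
    return q0 - q1 + np.log(f["priors"][0] / f["priors"][1])
```

A test observation is then assigned to class 0 whenever the returned score is positive, mirroring the sign rule underlying (5), (8) and (9).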

III Main Results

The main contributions of the present work are twofold. First, we carry out an asymptotic analysis of the classification error rate for both R-LDA and R-QDA, showing that it converges to a deterministic quantity that depends solely on the statistics of the observations associated with each class. Such a result allows a better understanding of the impact of these parameters on the performance. Second, we build consistent estimates of the asymptotic misclassification error rates for both classifiers. An estimator of the misclassification error rate has been provided in

[8], but only for R-LDA when the classes are assumed to have identical covariance matrices. Our results regarding R-LDA in this respect extend the one in [8] to the case where the covariance matrices are not equal. The treatment of R-QDA is however new and constitutes the main contribution of the present work.

III-A Asymptotic Performance of R-LDA with Distinct Covariance Matrices

In this section, we present an asymptotic analysis of the R-LDA classifier. Our analysis is mainly based on recent results from RMT concerning properties of Gram matrices of mixture models [16]. We recall that [8] made a similar analysis of R-LDA in the double asymptotic regime when both classes have a common covariance matrix, thereby not requiring these advanced tools. As such, our results can be viewed as a generalization of [8] to the case where both classes have distinct covariance matrices. This permits evaluating the performance of R-LDA in practical scenarios where the assumption of common covariance matrices cannot be guaranteed. To allow the derivations, we shall consider the following growth rate assumptions.

Assumption. 1 (Data scaling).

The data dimension and the total number of training samples grow to infinity at the same rate.

Assumption. 2 (Class scaling).

The number of training samples of each class remains a non-vanishing fraction of the total number of training samples.

Assumption. 3 (Covariance scaling).

The spectral norms of the class covariance matrices remain bounded as the dimension grows.

Assumption. 4 (Mean scaling).

Let $\boldsymbol{\mu} = \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1$ denote the difference between the class means. Then, $\|\boldsymbol{\mu}\| = O(1)$.

These assumptions are mainly considered to achieve an asymptotically non-trivial classification error. Assumption 3 is frequently met within the framework of random matrix theory [16]. Under the setting of Assumption 3, Assumption 4 ensures that a non-trivial classification rate is obtained: if $\|\boldsymbol{\mu}\|$ scales faster than the prescribed rate, then perfect asymptotic classification is achieved; however, if it scales slower, classification is asymptotically impossible. Assumptions 1 and 2 respectively control the growth rates of the data dimension and of the training samples.

III-A1 Deterministic Equivalent

We are now in a position to derive a deterministic equivalent of the misclassification error rate of R-LDA. Indeed, conditioned on the training data, the probability of misclassification is given by [8]

(12)

where

(13)
(14)

The total misclassification probability is thus given by

(15)
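
When validating expressions such as (12)-(15) by simulation, the class-conditional error for a fixed training set can also be approximated directly by Monte Carlo on fresh Gaussian test data. The helper below is a sketch under our own naming conventions; `score` stands for any already trained discriminant function, such as the hypothetical helpers sketched at the end of Section II.

```python
import numpy as np

def conditional_error_mc(score, mu, Sigma, true_class, n_mc=100_000, seed=0):
    """Monte Carlo approximation of a class-conditional misclassification
    probability: draw test points from N(mu, Sigma) and count how often the
    (fixed, already trained) discriminant assigns them to the wrong class.

    `score` is any callable x -> discriminant value, with the convention that
    a positive score means class 0 and a negative score means class 1.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)                    # Sigma assumed positive definite
    X = mu + rng.standard_normal((n_mc, len(mu))) @ L.T
    predictions = np.array([0 if score(x) > 0 else 1 for x in X])
    return np.mean(predictions != true_class)
```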

Prior to stating the main result concerning R-LDA, we shall introduce the following quantities, which naturally appear as a result of applying the results of [16]. Consider the matrix defined as follows

(16)

where the involved scalar quantities satisfy the following fixed-point equations

(17)

Also define and as

(18)

where

(19)

The quantities in (17) can be computed in an iterative fashion, with convergence guaranteed after a few iterations (see [16] for more details); a numerical sketch of such an iteration is given after Remark 1 below. Moreover, define

(20)
(21)

where . With these definitions at hand, we state the following theorem

Theorem. 1.

Under Assumptions 1-4, we have

(22)
(23)

As a consequence, the conditional misclassification probability converges almost surely to a deterministic quantity

(24)

where

(25)
Proof.

See Appendix A. ∎

Remark. 1.

As stated earlier, if $\|\boldsymbol{\mu}\|$ scales faster than the rate prescribed in Assumption 4, perfect asymptotic classification is achieved. This can be seen by noticing that the corresponding quantity in (25) would grow indefinitely large with the dimension, thereby making the conditional error rates vanish.

When the spectral norm of the difference between the class covariance matrices converges to zero, the asymptotic misclassification error rate of each class coincides with the one derived in [8], which was obtained under the assumption of a common covariance matrix.
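
Deterministic equivalents such as those appearing in (16)-(25) are computed by iterating a small system of scalar fixed-point equations. The sketch below illustrates the mechanics on a generic two-class resolvent system of the standard random-matrix form; it is a stand-in for, not a transcription of, the exact equations (17), and all variable names are ours.

```python
import numpy as np

def fixed_point_deltas(Sigmas, ns, gamma, tol=1e-10, max_iter=1000):
    """Iterate a generic two-class RMT fixed point until convergence.

    Stand-in system (substitute the paper's equations (17) as needed):
        Q       = inv(gamma * I + sum_i (n_i / n) * Sigma_i / (1 + delta_i))
        delta_i = (1 / n) * trace(Sigma_i @ Q)
    Starting from delta = 0, simple iteration typically converges in a few
    tens of steps, as noted in [16].
    """
    p = Sigmas[0].shape[0]
    n = sum(ns)
    deltas = np.zeros(len(Sigmas))
    I = np.eye(p)
    for _ in range(max_iter):
        Q = np.linalg.inv(gamma * I + sum((ni / n) * S / (1 + d)
                                          for S, ni, d in zip(Sigmas, ns, deltas)))
        new = np.array([np.trace(S @ Q) / n for S in Sigmas])
        if np.max(np.abs(new - deltas)) < tol:
            deltas = new
            break
        deltas = new
    return deltas, Q
```

In an actual implementation, the converged matrix and scalars would then feed the expressions in (20)-(25) to produce the predicted misclassification probability.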

Corollary. 1.

In the case where the spectral norm of $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ converges to zero (including the common covariance case where $\boldsymbol{\Sigma}_0 = \boldsymbol{\Sigma}_1$), the conditional misclassification error rate converges almost surely to

where

where $\boldsymbol{\Sigma}$ stands for either of the two covariance matrices (they can be used interchangeably in this regime), and the involved scalar is the unique positive solution to the following equation:

Proof.

When the spectral norm of the covariance difference converges to zero, we first prove that, up to a vanishing error, the key deterministic equivalents can be simplified so as to depend on only one of the two covariance matrices. In the sequel, we take one of them as the common reference. We then have

It follows that where .

The above relations allow us to simplify functionals involving the matrix defined in (16). To see this, we decompose it as

where the decomposition follows from the resolvent identity and the last term denotes a matrix whose spectral norm converges to zero. It can easily be shown, using the inequality $\|\mathbf{A}\mathbf{B}\| \leq \|\mathbf{A}\|\,\|\mathbf{B}\|$ for two matrices $\mathbf{A}$ and $\mathbf{B}$, that:

Hence, for ,

and . Using the same notations as in [8] we have in particular for , and , where is the fixed-point solution in [8, Proposition 1]

Moreover,

It follows that

Using the same arguments, it is also easy to show that

Corollary 1 is useful because it allows us to specify the range of applications of Theorem 1 in which the information on the covariance matrices is essential for the classification task. It also shows that R-LDA is robust against small perturbations of the covariance matrices. Similar observations have been made in [20], where it was shown via a Monte Carlo study that LDA is robust to deviations from the modeling assumptions.

III-B Asymptotic Performance of R-QDA

In this part, we state the main results regarding the derivation of deterministic approximations of the R-QDA classification error. These results have been obtained under some specific assumptions, carefully chosen so that an asymptotically non-trivial classification error (i.e., one converging neither to zero nor to one) is achieved. We particularly highlight how the provided asymptotic approximations depend on statistical parameters such as the means and covariances of the classes, thus allowing a better understanding of the performance of the R-QDA classifier. Ultimately, these results can be exploited to improve the performance by allowing an optimal setting of the regularization parameter.

III-B1 Technical Assumptions

We consider the following double asymptotic regime, in which the data dimension and the numbers of training samples grow to infinity with the following assumptions met:

Assumption. 5 (Data scaling).

The data dimension and the number of training samples of each class grow to infinity at the same rate.

Assumption. 6 (Mean scaling).

.

Assumption. 7 (Covariance scaling).

The spectral norms of the class covariance matrices remain bounded.

Assumption. 8.

Matrix has exactly eigenvalues of order . The remaining eigenvalues are of order .

Assumption 5 also implies a balance condition on the per-class training sizes. As we shall see later, if this is not satisfied, R-QDA performs asymptotically as the classifier that assigns all observations to the same class. The second assumption governs the distance between the two classes in terms of the Euclidean distance between the means. This is mandatory in order to avoid asymptotically perfect classification. It is a much stronger assumption than the corresponding mean scaling for R-LDA, since larger values of the distance between the means are allowed. This can be understood as R-QDA being subject to the strong noise induced when estimating the covariance matrices, which requires a large distance between the means for them to play a role in classification. A similar assumption is required to control the distance between the covariance matrices. Particularly, the spectral norms of the covariance matrices are required to be bounded, as stated in Assumption 7, while their difference should satisfy Assumption 8. The latter assumption implies that, for any matrix of bounded spectral norm,

III-B2 Central Limit Theorem (CLT)

It can be easily shown that the R-QDA conditional classification error in (11) can be expressed as

(26)

where

Computing this probability amounts to evaluating the cumulative distribution function (CDF) of quadratic forms of Gaussian random vectors, which cannot be derived in closed form in general. It can, however, be approximated by considering asymptotic regimes that allow us to exploit central limit results for quadratic forms. Under Assumptions 5-8, a central limit theorem (CLT) on the discriminant score is established.

Proposition. 1 (CLT).

Assume that Assumptions 5-8 hold true. Assume also that for

(27)

Then,

(28)
Proof.

The proof is mainly based on the application of Lyapunov's CLT for sums of independent but non-identically distributed random variables [21]. The detailed proof is postponed to Appendix B. ∎
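
The Gaussian approximation in Proposition 1 can be illustrated numerically: conditionally on the training data, the discriminant score is a quadratic form in a Gaussian vector, and such forms are asymptotically normal under Lyapunov-type conditions. The toy experiment below (with our own arbitrary choices of dimension and matrices, unrelated to the paper's setting) standardizes a Gaussian quadratic form and measures its distance to the standard normal law.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = 400                                      # toy dimension
A = rng.standard_normal((p, p)) / np.sqrt(p)
A = (A + A.T) / 2                            # symmetric, bounded spectral norm
b = rng.standard_normal(p) / np.sqrt(p)

# For x ~ N(0, I_p) and q(x) = x' A x + b' x (A symmetric):
#   E[q] = tr(A)   and   Var[q] = 2 tr(A^2) + b' b
mean_q = np.trace(A)
var_q = 2 * np.trace(A @ A) + b @ b

X = rng.standard_normal((20000, p))
q = np.sum((X @ A) * X, axis=1) + X @ b      # quadratic form for each row of X
z = (q - mean_q) / np.sqrt(var_q)

# A small Kolmogorov-Smirnov distance to N(0, 1) supports the normal approximation.
print(stats.kstest(z, 'norm').statistic)
```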

The condition in (27) will be proven to hold almost surely. Hence, as a by-product of the above proposition, we obtain the following expression for the conditional classification error.

Corollary. 2.

Under the setting of Proposition 1, the conditional classification error in (11) satisfies

(29)

As such, an asymptotic equivalent of the conditional classification error can be derived. This is the subject of the next subsection.

III-B3 Deterministic Equivalents

This part is devoted to the derivation of deterministic equivalents of some random quantities involved in the R-QDA conditional classification error. Before that, we shall introduce the following notation, which arises as a result of applying standard results from random matrix theory. We define, for each class, the unique positive solution to the following fixed-point equation (mathematical details regarding the existence and uniqueness of this solution can be found in [14]).

Define as

and the scalar and as

Define , and as

(30)
(31)
(32)

As shall be shown in Appendix C, these quantities are deterministic approximations in probability of their random counterparts. We therefore get the following result.

Theorem. 2.

Under Assumptions 5-8, the following convergence holds for each class:

Proof.

The proof is postponed to Appendix C. ∎

At first sight, this quantity appears to be of a diverging order, given the orders of the terms it involves. Following this line of thought, the asymptotic misclassification probability would be expected to converge to a trivial limit. This statement is, fortunately, false. Assumptions 5 and 8 were carefully designed so that the relevant quantities remain bounded. In particular, the following is proven in Appendix D.

Proposition. 2.

Under Assumptions 5-8, the deterministic quantities involved above are uniformly bounded as the dimension grows to infinity.

Proof.

The proof is deferred to Appendix D. ∎

Remark. 2.

The results of Theorem 2, along with Proposition 2, show that the classification error converges to a non-trivial deterministic quantity that depends only on the statistical means and covariances of each class. The major importance of this result is that it allows finding a good choice of the regularization parameter as the value that minimizes the asymptotic classification error. While a closed-form expression for this value seems out of reach, it can be numerically approximated by a simple one-dimensional line search algorithm.
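
The line search mentioned in Remark 2 can be as simple as evaluating the (asymptotic or estimated) error curve on a grid of regularization parameters. The sketch below uses a hypothetical callable `error_curve`, standing for whichever expression (Theorem 1, Theorem 2 or the consistent estimators of Section IV) one wishes to minimize.

```python
import numpy as np

def best_gamma(error_curve, gammas=None):
    """One-dimensional line search over the regularization parameter.

    `error_curve` is any callable gamma -> predicted misclassification
    probability (a placeholder for the deterministic expressions of
    Theorems 1-2 or their consistent estimates).
    """
    if gammas is None:
        gammas = np.logspace(-3, 3, 200)          # log-spaced search grid
    errors = np.array([error_curve(g) for g in gammas])
    i = int(np.argmin(errors))
    return gammas[i], errors[i]

# Toy usage with a made-up smooth error curve (illustration only):
g_star, e_star = best_gamma(lambda g: 0.2 + 0.1 * (np.log10(g) - 0.5) ** 2)
print(g_star, e_star)
```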

Remark. 3.

Using Assumption 8, it can be shown that this quantity can be asymptotically simplified to

(33)

where $\boldsymbol{\Sigma}$ stands for either of the two covariance matrices. The above relation comes from the fact that, up to a vanishing error, the two covariance matrices can be used interchangeably in the corresponding terms. This, in particular, implies that the associated quantities coincide up to a vanishing error. It is noteworthy that the same artifice does not work for the remaining terms, because their normalization is not sufficient to produce vanishing differences. We should also mention that, although (33) takes a simpler form, we chose to work with the expression (32) in the simulations and when computing the consistent estimates, since we found that it provides the highest accuracy.

III-B4 Some Special Cases

  1. It is important to note that we could have considered a smaller order for the distance between the class means. In this case, the classification error rate would still converge to a non-trivial limit, but would not asymptotically depend on the difference between the means. This is because, in this case, the difference in covariance matrices dominates that of the means and as such represents the discriminant metric that asymptotically matters.

  2. Another interesting case to highlight is the one in which . From Theorem 2 and using (33), it is easy to show that the total classification error converges as

    (34)

    where , and have respectively the same definitions as , and upon dropping the class index , since quantities associated with class or class can be used interchangeably in the asymptotic regime. It is easy to see that in this case if scales slower than , classification is asymptotically impossible. This must be contrasted with the results of R-LDA, which provides non-vanishing misclassification rates for . This means that in this particular setting, R-QDA is asymptotically beaten by R-LDA which achieves perfect classification.

  3. When occurring for instance when or is of finite rank, and , then where does not depend on and as such the misclassification error probability associated with both classes converge respectively to and with some probability depending solely on the statistics. The total misclassification error associated with R-QDA converges to .

  4. When the required scaling is violated, the corresponding quantities grow unboundedly as the dimension increases. This reveals that, asymptotically, the discriminant score of R-QDA keeps the same sign for all observations. The classifier would thus return the same class regardless of the observation under consideration.

The above remarks should help draw some hints on when R-LDA or R-QDA should be used. Particularly, if the Frobenius norm of the difference between the class covariance matrices is small, relying on this difference is not recommended. One should rather rely on the information on the difference between the class means, in other words favor the use of R-LDA over R-QDA.

IV General Consistent Estimator of the Testing Error

In the machine learning field, evaluating the performance of algorithms is a crucial step that not only serves to ensure their efficacy but also to properly set the parameters involved in their design, a process known in machine learning parlance as model selection. The traditional way to evaluate performance consists in devoting a part of the training data to the design of the underlying method, while performance is assessed on the remaining data, called testing data, treated as unseen data since they do not intervene in the design step. Among the many existing computational methods built on these ideas are cross-validation [22, 23] and bootstrap [24, 25] techniques. Despite being widely used in the machine learning community, these methods have the drawback of being computationally expensive and, most importantly, of relying on mere computations, which does not help in gaining a better understanding of the performance of the underlying algorithm. As far as LDA and QDA classifiers are concerned, the results of the previous section allow a deeper understanding of the classification performance with respect to the covariances and means associated with both classes. However, as these results are expressed in terms of the unknown covariances and means, they cannot be relied upon to assess the classification performance. In this section, we address this question and provide consistent estimators of the classification performance for both R-LDA and R-QDA classifiers that approximate in probability their asymptotic expressions.
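
For comparison with the consistent estimators developed in this section, the conventional cross-validation baseline discussed above can be sketched as follows. The function names and the train/predict interface are our own; any classifier, such as the hypothetical R-LDA/R-QDA helpers sketched at the end of Section II wrapped into a `train_and_predict` callable, can be plugged in.

```python
import numpy as np

def cv_error(X, y, train_and_predict, gamma, k=5, seed=0):
    """k-fold cross-validated misclassification rate for one value of the
    regularization parameter.

    `train_and_predict(X_train, y_train, X_test, gamma)` is any callable
    returning predicted labels (0/1) for X_test.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        y_hat = train_and_predict(X[mask], y[mask], X[fold], gamma)
        errors.append(np.mean(y_hat != y[fold]))
    return float(np.mean(errors))

# Model selection by exhaustive search over a gamma grid (computationally heavy,
# which is precisely what the consistent estimators below avoid):
# gammas = np.logspace(-2, 2, 30)
# gamma_cv = min(gammas, key=lambda g: cv_error(X, y, train_and_predict, g))
```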

IV-A R-LDA

The following theorem provides the expression of the class-conditional true error estimator for each class.

Theorem. 3.

Under Assumptions 1-4, denote