Large Dimensional Discriminant Analysis Classifiers with Random Matrix Theory
This Julia code reproduces the results of the paper "A Large Dimensional Analysis of Regularized Discriminant Analysis Classifiers".
This article carries out a large dimensional analysis of standard regularized discriminant analysis classifiers designed on the assumption that data arise from a Gaussian mixture model with different means and covariances. The analysis relies on fundamental results from random matrix theory (RMT) when both the number of features and the cardinality of the training data within each class grow large at the same pace. Under mild assumptions, we show that the asymptotic classification error approaches a deterministic quantity that depends only on the means and covariances associated with each class as well as the problem dimensions. Such a result permits a better understanding of the performance of regularized discriminant analysis in practical, large but finite, dimensions, and can be used to determine and pre-estimate the optimal regularization parameter that minimizes the misclassification error probability. Despite being theoretically valid only for Gaussian data, our findings are shown to yield high accuracy in predicting the performance achieved with real data sets drawn from the popular USPS database, thereby making an interesting connection between theory and practice.
Linear discriminant analysis (LDA) is a long-standing concept that dates back to Fisher and generalizes the Fisher discriminant [2, 3]. Given two statistically defined datasets, or classes, the Fisher discriminant analysis is designed to maximize the ratio of the variance between classes to the variance within classes, and is useful for both classification and dimensionality reduction
[4, 5]. LDA, on the other hand, relying merely on the concept of model-based classification [4], is conceived so that the misclassification rate is minimized under a Gaussian assumption on the data. Interestingly, both ideas lead to the same classifier when the data of both classes share the same covariance matrix. Maintaining the Gaussian assumption but considering the general case of distinct covariance matrices, quadratic discriminant analysis (QDA) becomes the optimal classifier in terms of minimizing the misclassification rate when both the statistical means and the covariances of the classes are known. In practice, these parameters are rarely given and must be estimated from training data. Assuming the number of training samples is high enough, QDA and LDA should remain asymptotically optimal. It is however often the case in practice that the data dimension is comparable to, if not larger than, the number of observations. In such circumstances, the covariance matrix estimate becomes ill-conditioned or even non-invertible, which leads to poor classification performance.
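The singularity issue described above is easy to reproduce numerically. The following sketch (NumPy; all parameter values are chosen for illustration only) shows that with fewer samples than features the sample covariance matrix is rank-deficient, while a small ridge term restores invertibility:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 30                        # more features than observations
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)          # p x p sample covariance matrix

rank = np.linalg.matrix_rank(S)
print(rank)                          # at most n - 1 = 29 < p: S is singular

gamma = 0.1                          # illustrative ridge regularization parameter
S_reg = S + gamma * np.eye(p)        # shifting the spectrum restores invertibility
print(np.linalg.matrix_rank(S_reg))
```

Since `S` is built from `n` centered samples, its rank cannot exceed `n - 1`, whatever the dimension `p`; any positive `gamma` makes `S_reg` full-rank.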
To overcome this difficulty, many techniques can be considered. One can resort to dimensionality reduction so as to embed the data in a low-dimensional space that retains most of the useful information from a classification point of view [6, 7]. This ensures a higher number of training samples than the effective data size. Questions as to which dimensions should be selected, or to what extent the dimension should be reduced, remain open. Another alternative involves the regularized versions of LDA and QDA, denoted throughout this paper by R-LDA and R-QDA [5, 8]. Both approaches constitute the main focus of this article.
There exist many works on the performance analysis of discriminant analysis classifiers. In [9], an exact analysis of QDA is carried out by relying on properties of Wishart matrices. This allows for exact expressions of the misclassification probability for all sample sizes and dimensions, but is only valid as long as the sample covariance matrices are invertible. Generalizing this analysis to regularized versions is however beyond analytical reach. This motivated further studies to consider asymptotic regimes. In [10, 11] the authors consider large asymptotic regimes and observe that LDA and QDA fall short even when the exact covariance matrix is known. [10] thus proposed improved LDA and PCA that exploit sparsity assumptions on the statistical means and covariances, which are however not necessarily met in practice. This leads us to consider in the present work the double asymptotic regime in which both the data dimension and the number of samples tend to infinity at a fixed ratio. This regime leverages results from random matrix theory [12, 13, 14, 15, 16]. For LDA analysis, this regime was first considered in [17] under the assumption of equal covariance matrices. It was extended to the analysis of R-LDA in [8] and to the Euclidean distance discriminant rule in [18]. To the best of the authors' knowledge, the general case in which the covariances across classes are different has never been treated. As shown in the course of the paper, a major difficulty of the analysis resides in choosing the assumptions governing the growth rates of the means and covariances so as to ensure nontrivial asymptotic classification performance.
This motivates the present work. Particularly, we propose a large dimensional analysis of both R-LDA and R-QDA in the double asymptotic regime under general Gaussian assumptions. Precisely, under technical, yet mild, assumptions controlling the distances between the class means and covariances, we prove that the probability of misclassification converges to a non-trivial deterministic quantity that depends only on the class statistics as well as the ratio of dimension to sample size. Interestingly, R-LDA and R-QDA require different growth regimes, reflecting a fundamental difference in the way they leverage the information about the means and covariances. Notably, R-QDA requires a larger minimal distance between the class means than R-LDA does. However, R-LDA does not seem to leverage the information about the distance between the covariance matrices. The results of [8] are in particular recovered when the spectral norm of the difference of the covariance matrices is small. These findings lead to insights into when LDA or QDA should be preferred in practical scenarios.
To sum up, our main results are as follows:
Under mild assumptions, we establish the convergence of the misclassification rate for both R-LDA and R-QDA classifiers to a deterministic error as a function of the statistical parameters associated with each class.
We design a consistent estimator for the misclassification rate for both R-LDA and R-QDA classifiers allowing for a pre-estimate of the optimal regularization parameter.
We validate our theoretical findings on both synthetic and real data drawn from the USPS dataset and illustrate the good accuracy of our results in both settings.
The remainder is organized as follows. We give an overview of discriminant analysis for binary classification in Section II. The main results are presented in Section III, the proofs of which are deferred to the Appendix. In Section IV, we design a consistent estimator of the misclassification error rate. We validate our analysis for real data in Section V and conclude the article in Section VI.
Scalars, vectors and matrices are respectively denoted by non-boldface, boldface lowercase and boldface uppercase characters.
and are respectively the matrix of zeros and ones of size , while denotes the identity matrix. The notation stands for the Euclidean norm for vectors and the spectral norm for matrices. , and stand for the transpose, the trace and the determinant of a matrix, respectively. For two functionals and , we say that if such that . , , and respectively denote the probability measure, the convergence in distribution, the convergence in probability and the almost sure convergence of random variables.
denotes the cumulative distribution function (CDF) of the standard normal distribution, i.e.
. This paper studies binary discriminant analysis techniques, which employ a discriminant rule to assign an input data vector to the class to which it most likely belongs. The discriminant rule is designed based on available training data with known class labels. In this paper, we consider the case in which a Bayesian discriminant rule is employed. Hence, we assume that observations from class
are independent and are sampled from a multivariate Gaussian distribution with mean
and non-negative covariance matrix . Formally speaking, an observation vector is classified to , , if (1)
As stated in [19], for distinct covariance matrices and , the discriminant rule is equivalent to assigning to class if
(2)
is positive and class otherwise, where
is the prior probability for class
. In particular, (3)
When the considered classes have the same covariance matrix, i.e., , the discriminant function simplifies to [5, 4, 8]
(4)
Classification hence obeys the following rule:
(5)
Since the discriminant function is linear in the observation, the corresponding classification method is referred to as linear discriminant analysis. As can be seen from (3) and (5), the classification rules assume knowledge of the class statistics, namely their associated covariance matrices and mean vectors. In practice, these statistics can be estimated using the available training data. As such, we assume that independent training samples are available to estimate the mean and the covariance matrix of each class. For that, we consider the following sample estimates
where is the pooled sample covariance matrix for both classes. To avoid singularity issues when the dimension exceeds the number of samples, we use the ridge estimator of the inverse of the covariance matrix [5]
(6)
(7)
where is a regularization parameter. Replacing the class statistics by their estimates (6) and (7) in (4) and (3), we obtain the following discriminant rules
(8)
(9)
The corresponding classification methods will be denoted respectively by R-LDA and R-QDA. Conditioned on the training samples, the classification errors associated with R-LDA and R-QDA when the observation belongs to a given class are given by
(10)
(11)
while the total classification errors are respectively given by
In the following, we propose to analyze the asymptotic classification errors of both R-LDA and R-QDA when the dimension and the sample sizes grow large at the same rate. For R-LDA, our results cover a more general setting than the one studied in [8], in that they apply to the case where the two classes have distinct covariance matrices.
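As an illustration of the two classifiers, the following sketch implements commonly used forms of the regularized discriminant scores on synthetic Gaussian data. The exact regularization in (6)-(9) may differ in normalization, so this is a plausible rendering rather than the paper's precise rule; all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 20, 100
mu0, mu1 = np.zeros(p), np.full(p, 0.8)
X0 = rng.standard_normal((n, p)) + mu0        # class-0 training samples
X1 = 2.0 * rng.standard_normal((n, p)) + mu1  # class 1: shifted mean, larger covariance

gamma = 0.5                                   # regularization parameter

def ridge_stats(X):
    """Sample mean, ridge-regularized inverse covariance and its log-determinant.
    (S + gamma*I)^{-1} is one common ridge form; the paper's (6)-(7) may differ."""
    m = X.mean(axis=0)
    R = np.cov(X, rowvar=False) + gamma * np.eye(p)
    return m, np.linalg.inv(R), np.linalg.slogdet(R)[1]

m0, H0, ld0 = ridge_stats(X0)
m1, H1, ld1 = ridge_stats(X1)
# pooled covariance estimate, used by the linear rule
Hp = np.linalg.inv(0.5 * (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
                   + gamma * np.eye(p))

def rlda(x):
    """Linear score: positive -> class 0 (equal priors assumed)."""
    return (x - 0.5 * (m0 + m1)) @ Hp @ (m0 - m1)

def rqda(x):
    """Quadratic score: positive -> class 0 (equal priors assumed)."""
    d0, d1 = x - m0, x - m1
    return 0.5 * (ld1 - ld0) + 0.5 * (d1 @ H1 @ d1 - d0 @ H0 @ d0)

# empirical test error on fresh data from the same mixture
T0 = rng.standard_normal((500, p)) + mu0
T1 = 2.0 * rng.standard_normal((500, p)) + mu1
err_lda = 0.5 * (np.mean([rlda(x) <= 0 for x in T0])
                 + np.mean([rlda(x) > 0 for x in T1]))
err_qda = 0.5 * (np.mean([rqda(x) <= 0 for x in T0])
                 + np.mean([rqda(x) > 0 for x in T1]))
print(f"R-LDA error: {err_lda:.3f}, R-QDA error: {err_qda:.3f}")
```

Because the two classes here differ in both mean and covariance, the quadratic rule can exploit the covariance gap that the linear rule ignores, which is exactly the distinction the analysis below quantifies.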
The main contributions of the present work are twofold. First, we carry out an asymptotic analysis of the classification error rate for both R-LDA and R-QDA, showing that they converge to deterministic quantities that depend solely on the observation statistics associated with each class. Such a result allows a better understanding of the impact of these parameters on the performance. Second, we build consistent estimates of the asymptotic misclassification error rates for both classifiers. An estimator of the misclassification error rate has been provided in
[8], but only for R-LDA when the classes are assumed to have identical covariance matrices. Our results regarding R-LDA in this respect extend the ones in [8] to the case where the covariance matrices are not equal. The treatment of R-QDA is however new and constitutes the main contribution of the present work.

In this section, we present an asymptotic analysis of the R-LDA classifier. Our analysis is mainly based on recent results from RMT concerning some properties of Gram matrices of mixture models [16]. We recall that [8] made a similar analysis of R-LDA in the double asymptotic regime when both classes have a common covariance matrix, thereby not requiring these advanced tools. As such, our results can be viewed as a generalization of [8] to the case where the two classes have distinct covariance matrices. This permits evaluating the performance of R-LDA in practical scenarios where the assumption of common covariance matrices cannot always be guaranteed. To allow the derivations, we shall consider the following growth rate assumptions
.
, for .
, for .
Let . Then, .
These assumptions are mainly considered to achieve an asymptotically non-trivial classification error. Assumption 3 is frequently met within the framework of random matrix theory [16]. Under the setting of Assumption 3, Assumption 4 ensures that a nontrivial classification rate is obtained: if the distance between the class means scales faster than required, then perfect asymptotic classification is achieved; if it scales slower, classification is asymptotically impossible. Assumptions 1 and 2 respectively control the growth rates of the data and of the training samples.
We are now in a position to derive a deterministic equivalent of the misclassification error rate of R-LDA. Indeed, conditioned on the training data, the probability of misclassification is given by [8]:
(12)
where
(13)
(14)
The total misclassification probability is thus given by
(15)
Prior to stating the main result concerning R-LDA, we shall introduce the following quantities, which naturally appear as a result of applying [16]. Let be the matrix defined as follows
(16)
where , , satisfy the following fixed-point equations
(17)
Also define and as
(18)
where
(19)
The quantities in (17) can be computed in an iterative fashion, where convergence is guaranteed after a few iterations (see [16] for more details). Moreover, define
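The fixed-point system (17) is not reproduced here, but the iterative scheme is easy to sketch. Below, a scalar fixed point of the generic deterministic-equivalent form delta = f(delta) is solved by plain iteration; the function f and all parameter values are illustrative assumptions, not the paper's exact system:

```python
import numpy as np

# Illustrative only: the actual system (17) from [16] is not reproduced here.
# We iterate a generic scalar fixed point delta = f(delta) of the kind that
# arises for deterministic equivalents of resolvents, e.g.
#   f(delta) = (1/n) * tr( Sigma @ inv( gamma*I + Sigma/(1+delta) ) )
rng = np.random.default_rng(2)
p, n, gamma = 100, 200, 1.0
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                      # a positive semi-definite "covariance"

def f(delta):
    return np.trace(Sigma @ np.linalg.inv(
        gamma * np.eye(p) + Sigma / (1 + delta))) / n

delta, tol = 0.0, 1e-10
for _ in range(200):                     # convergence is typically fast
    new = f(delta)
    if abs(new - delta) < tol:
        break
    delta = new
print(f"fixed point delta ~ {delta:.6f}")
```

Starting from zero, the iterates here increase monotonically and are bounded above, which is why simple iteration suffices; systems of the type in (17) are solved the same way, coordinate by coordinate.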
(20)
(21)
where . With these definitions at hand, we state the following theorem.
See Appendix A. ∎
As stated earlier, if scales faster than , perfect asymptotic classification is achieved. This can be seen by noticing that would grow indefinitely large with , thereby making the conditional error rates vanish.
When converges to zero, the asymptotic misclassification error rate of each class coincides with the one derived in [8] obtained when .
In the case where (including the common covariance case where ), the conditional misclassification error rate converges almost surely to
where
where or , and is the unique positive solution to the following equation:
When , we first prove that up to an error , the key deterministic equivalents can be simplified to depend only on (or ). In the sequel, we take . As , we have
It follows that where .
The above relations allow us to simplify functionals involving the matrix . To see that, we decompose as
where follows from the resolvent identity and denotes a matrix with spectral norm converging to zero. It can be easily shown, using the inequality for two matrices and in , that:
Hence, for ,
and . Using the same notations as in [8] we have in particular for , and , where is the fixed-point solution in [8, Proposition 1]
Moreover,
It follows that
Using the same arguments, it is also easy to show that
∎
Corollary 1 is useful because it allows us to specify the range of applications of Theorem 1 in which the information on the covariance matrix is essential for the classification task.
Also, it shows that R-LDA is robust against small perturbations in the covariance matrix. Similar observations have been made in [20], where it was shown via a Monte Carlo study that LDA is robust against departures from the modeling assumptions.
In this part, we state the main results regarding the derivation of deterministic approximations of the R-QDA classification error. Such results have been obtained under some specific assumptions, carefully chosen so that an asymptotically non-trivial classification error (i.e., one converging neither to zero nor to one) is achieved. We particularly highlight how the provided asymptotic approximations depend on such statistical parameters as the means and covariances within the classes, thus allowing a better understanding of the performance of the R-QDA classifier. Ultimately, these results can be exploited in order to improve performance by allowing an optimal setting of the regularization parameter.
We consider the following double asymptotic regime in which for with the following assumptions met
and .
.
.
Matrix has exactly eigenvalues of order . The remaining eigenvalues are of order .
Assumption 5 also implies that for . As we shall see later, if this is not satisfied, R-QDA performs asymptotically as the classifier that assigns all observations to the same class. The second assumption governs the distance between the two classes in terms of the Euclidean distance between the means. This is mandatory in order to avoid asymptotic perfect classification. This is a much stronger assumption than Assumption 2 for R-LDA, since we allow larger values for the distance. This can be understood as R-QDA being subject to strong noise induced when estimating the covariance matrices, which requires a larger distance between the means so that they can play a role in classification. A similar assumption is required to control the distance between the covariance matrices. Particularly, the spectral norms of the covariance matrices are required to be bounded, as stated in Assumption 7, while their difference should satisfy Assumption 8. The latter assumption implies that, for any matrix of bounded spectral norm,
It can be easily shown that the R-QDA conditional classification error in (11) can be expressed as
(26)
where
Computing this probability amounts to evaluating the cumulative distribution function (CDF) of quadratic forms of Gaussian random vectors, which cannot be derived in closed form in general. However, it can still be approximated by considering asymptotic regimes that allow the exploitation of central limit results for quadratic forms. Under Assumptions
5-8, a central limit theorem (CLT) on the involved random variable is established. The condition in (27) will be proven to hold almost surely. Hence, as a by-product of the above proposition, we obtain the following expression for the conditional classification error
As such, an asymptotic equivalent of the conditional classification error can be derived. This is the subject of the next subsection.
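The CLT mechanism can be checked with a quick Monte Carlo experiment: for a Gaussian vector x and a matrix A of bounded spectral norm (here a hypothetical diagonal A, chosen only for illustration), the quadratic form x^T A x has mean tr A and variance 2 tr A^2, and its standardized version is close to standard normal for large dimension:

```python
import numpy as np

rng = np.random.default_rng(3)
p, trials = 400, 20000
a = rng.uniform(0.5, 1.5, size=p)   # eigenvalues of a hypothetical bounded matrix A
X = rng.standard_normal((trials, p))
q = (X**2) @ a                      # quadratic forms x^T A x for diagonal A

mean_th = a.sum()                   # E[x^T A x] = tr A
var_th = 2 * (a**2).sum()           # Var[x^T A x] = 2 tr A^2 for Gaussian x
z = (q - mean_th) / np.sqrt(var_th) # standardized quadratic form

print(f"empirical mean of z: {z.mean():.3f}, std: {z.std():.3f}")
frac = np.mean(np.abs(z) < 1.96)    # should be near 0.95 if z is close to N(0,1)
print(f"fraction within 1.96 sigma: {frac:.3f}")
```

The approximate normality of such standardized quadratic forms is what the CLT-based analysis of the R-QDA score relies on; the actual score involves data-dependent matrices, which is where the deterministic equivalents of the next subsection enter.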
This part is devoted to the derivation of deterministic equivalents of some random quantities involved in the R-QDA conditional classification error. Before that, we shall introduce the following notations, which arise as a result of applying standard results from random matrix theory. We define, for each class, the unique positive solution to the following fixed-point equation (mathematical details treating the existence and uniqueness of this solution can be found in [14]):
Define as
and the scalar and as
Define , and as
(30)
(31)
(32)
As shall be shown in Appendix C, these quantities are deterministic approximations in probability of , and . We therefore get
The proof is postponed to Appendix C. ∎
At first sight, the quantity appears to be of order , since and are . Following this line of thought, the asymptotic misclassification error probability would be expected to converge to a trivial value. This statement is, fortunately, false: Assumptions 5 and 8 were carefully designed so that and are of order . In particular, the following is proven in Appendix D
The proof is deferred to Appendix D. ∎
The results of Theorem 2 along with Proposition 2 show that the classification error converges to a non-trivial deterministic quantity that depends only on the statistical means and covariances within each class. The major importance of this result is that it allows choosing a good regularization parameter as the value that minimizes the asymptotic classification error. While a closed-form expression for this value appears to be out of reach, it can be numerically approximated by a simple one-dimensional line search.
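The line search can be sketched as follows. Here `eps` is a toy placeholder for the deterministic error curve of Theorem 2 (its minimizer at gamma = 0.7 is fabricated for illustration), and a logarithmic grid search recovers it:

```python
import numpy as np

# Sketch: the asymptotic error eps(gamma) from Theorem 2 has no closed-form
# minimizer, but a one-dimensional search over gamma suffices in practice.
def eps(gamma):
    # toy convex placeholder standing in for the deterministic error curve,
    # with a fabricated minimum at gamma = 0.7
    return 0.1 + 0.05 * (np.log(gamma) - np.log(0.7))**2

grid = np.logspace(-3, 2, 400)          # candidate regularization parameters
gamma_star = grid[np.argmin([eps(g) for g in grid])]
print(f"approximately optimal gamma: {gamma_star:.3f}")
```

In practice `eps` would be replaced by the deterministic approximation (or its consistent estimate from the next section), evaluated once per grid point; a golden-section search would reduce the number of evaluations further.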
Using Assumption 8, it can be shown that can be asymptotically simplified to
(33)
where or . The above relation comes from the fact that, up to an error of order , the matrices or can be used interchangeably in or and in the terms involved in . This, in particular, implies that and are the same up to a vanishing error. It is noteworthy that the same artifice could not work for the terms and , because the normalization, being with , is not sufficient to provide vanishing terms. We should also mention that, although (33) takes a simpler form, we chose to work with the expression (32) in the simulations and when computing the consistent estimates, since we found that it provides the highest accuracy.
It is important to note that we could have considered . In this case, the classification error rate would still converge to a non-trivial limit, but would not asymptotically depend on the difference . This is because, in this case, the difference in covariance matrices dominates that of the means and as such represents the discriminant metric that asymptotically matters.
Another interesting case to highlight is the one in which . From Theorem 2 and using (33), it is easy to show that the total classification error converges as
(34)
where , and have respectively the same definitions as , and upon dropping the class index , since quantities associated with class or class can be used interchangeably in the asymptotic regime. It is easy to see that in this case, if scales slower than , classification is asymptotically impossible. This must be contrasted with the results of R-LDA, which provides non-vanishing misclassification rates for . This means that, in this particular setting, R-QDA is asymptotically beaten by R-LDA, which achieves perfect classification.
When occurring for instance when or is of finite rank, and , then where does not depend on and as such the misclassification error probability associated with both classes converge respectively to and with some probability depending solely on the statistics. The total misclassification error associated with R-QDA converges to .
When , the quantities and grow unboundedly as the dimension increases. This reveals that, asymptotically, the discriminant score of R-QDA keeps the same sign for all observations. The classifier would thus return the same class regardless of the observation under consideration.
The above remarks should provide some guidance on when R-LDA or R-QDA should be used. Particularly, if the Frobenius norm of the difference between the class covariance matrices is small, using the information on this difference is not recommended. We should rather rely on the information on the difference between the classes' means, in other words favoring the use of R-LDA over R-QDA.
In the machine learning field, evaluating the performance of algorithms is a crucial step that not only serves to ensure their efficacy but also to properly set the parameters involved in their design, a process known in machine learning parlance as model selection. The traditional way to evaluate performance consists in devoting a part of the training data to the design of the underlying method, whereas performance is tested on the remaining data, called testing data, treated as unseen since it does not intervene in the design step. Among the many existing computational methods built on these ideas are the cross-validation [22, 23] and the bootstrap [24, 25] techniques. Despite being widely used in the machine learning community, these methods have the drawback of being computationally expensive and, most importantly, of relying on mere computations, which does not lead to a better understanding of the performance of the underlying algorithm. As far as the LDA and QDA classifiers are concerned, the results of the previous section allow a deeper understanding of the classification performance with respect to the covariances and means associated with both classes. However, as these results are expressed in terms of the unknown covariances and means, they cannot be relied upon directly to assess the classification performance. In this section, we address this question and provide consistent estimators of the classification performance for both R-LDA and R-QDA that approximate in probability their asymptotic expressions. The following theorem provides the expression of the class-conditional true error estimator for each class.
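For contrast, a cross-validation estimate of the error requires one full refit per fold, which is the computational burden the consistent estimators of this section avoid. A minimal 5-fold sketch, using a simple Euclidean-distance discriminant rule as in [18] (all sizes and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 10, 200
X = np.vstack([rng.standard_normal((n, p)),
               rng.standard_normal((n, p)) + 0.6])   # two shifted Gaussian classes
y = np.repeat([0, 1], n)

def nearest_mean_error(Xtr, ytr, Xte, yte):
    """Euclidean-distance discriminant rule: assign to the closest class mean."""
    m0, m1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    pred = (np.linalg.norm(Xte - m1, axis=1) < np.linalg.norm(Xte - m0, axis=1))
    return np.mean(pred != yte)

# 5-fold cross-validation: each fold requires a full refit on 4/5 of the data
idx = rng.permutation(2 * n)
folds = np.array_split(idx, 5)
errs = []
for k in range(5):
    te = folds[k]
    tr = np.concatenate([folds[j] for j in range(5) if j != k])
    errs.append(nearest_mean_error(X[tr], y[tr], X[te], y[te]))
cv_err = float(np.mean(errs))
print(f"5-fold CV error estimate: {cv_err:.3f}")
```

A consistent plug-in estimator of the asymptotic error, by contrast, is computed once from the training statistics alone, with no data splitting or refitting.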