Assume we have a dataset of instances $\{(x_i, y_i)\}_{i=1}^n$ with sample size $n$ and dimensionality $d$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The $y_i$'s are the class labels. We would like to classify the space of data using these instances. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) (Friedman et al., 2009) are two well-known supervised classification methods in statistical and probabilistic learning. This paper is a tutorial on these two classifiers, in which the theory for binary and multi-class classification is detailed. Then, the relations of LDA and QDA to metric learning, kernel Principal Component Analysis (PCA), Fisher Discriminant Analysis (FDA), logistic regression, the Bayes optimal classifier, Gaussian naive Bayes, and the Likelihood Ratio Test (LRT) are explained for a better understanding of these two fundamental methods. Finally, some experiments on synthetic datasets are reported and analyzed for illustration.
2 Optimization for the Boundary of Classes
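First suppose the data is one dimensional, $x \in \mathbb{R}$. Assume we have two classes with the Cumulative Distribution Functions (CDFs) $F_1(x)$ and $F_2(x)$, respectively. Let the Probability Density Functions (PDFs) of these CDFs be:
$$f_1(x) = \frac{\partial F_1(x)}{\partial x}, \qquad f_2(x) = \frac{\partial F_2(x)}{\partial x}.$$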
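We assume that the two classes have normal (Gaussian) distributions, which is the most common default assumption in real-world applications. The mean of one of the two classes is greater than the other; we assume $\mu_1 < \mu_2$. An instance $x \in \mathbb{R}$ belongs to one of these two classes:
$$x \sim \begin{cases} \mathcal{N}(\mu_1, \sigma_1^2) & \text{if } x \in \mathcal{C}_1, \\ \mathcal{N}(\mu_2, \sigma_2^2) & \text{if } x \in \mathcal{C}_2, \end{cases}$$
where $\mathcal{C}_1$ and $\mathcal{C}_2$ denote the first and second class, respectively.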
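For an instance $x$, we may make an error in estimating the class it belongs to. At a point, which we denote by $x^*$, the probabilities of the two classes are equal; therefore, the point $x^*$ is on the boundary of the two classes. As we have $\mu_1 < \mu_2$, we can say $\mu_1 < x^* < \mu_2$, as shown in Fig. 1. Therefore, if $x < x^*$ or $x > x^*$, the instance $x$ belongs to the first or second class, respectively. Hence, estimating $x < x^*$ or $x > x^*$ to belong to the second or first class, respectively, is an error in estimation of the class. The probability of this error can be stated as:
$$P(\text{error}) = P(x > x^*, x \in \mathcal{C}_1) + P(x < x^*, x \in \mathcal{C}_2).$$
As we have $P(A, B) = P(A \mid B)\, P(B)$, we can say:
$$P(\text{error}) = P(x > x^* \mid x \in \mathcal{C}_1)\, P(x \in \mathcal{C}_1) + P(x < x^* \mid x \in \mathcal{C}_2)\, P(x \in \mathcal{C}_2),$$
which we want to minimize:
$$\min_{x^*} \; P(\text{error}),$$
by finding the best boundary of classes, i.e., $x^*$.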
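According to the definition of the CDF, we have:
$$P(x < c \mid x \in \mathcal{C}_1) = F_1(c) \implies P(x > x^* \mid x \in \mathcal{C}_1) = 1 - F_1(x^*),$$
$$P(x < x^* \mid x \in \mathcal{C}_2) = F_2(x^*).$$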
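The prior probabilities of the two classes are:
$$P(x \in \mathcal{C}_1) = \pi_1, \qquad P(x \in \mathcal{C}_2) = \pi_2,$$
where we denote the priors $P(x \in \mathcal{C}_1)$ and $P(x \in \mathcal{C}_2)$ by $\pi_1$ and $\pi_2$, respectively.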
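Substituting the CDF expressions and the priors into the probability of error gives the objective $(1 - F_1(x^*))\, \pi_1 + F_2(x^*)\, \pi_2$. We take the derivative with respect to $x^*$ and set it to zero for the sake of minimization:
$$\frac{\partial P(\text{error})}{\partial x^*} = -f_1(x^*)\, \pi_1 + f_2(x^*)\, \pi_2 = 0 \implies f_1(x^*)\, \pi_1 = f_2(x^*)\, \pi_2. \tag{12}$$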
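Another way to obtain this expression is to equate the posterior probabilities, which gives the equation of the boundary of classes:
$$P(x \in \mathcal{C}_1 \mid X = x) = P(x \in \mathcal{C}_2 \mid X = x). \tag{13}$$
According to Bayes rule, the posterior is:
$$P(x \in \mathcal{C}_1 \mid X = x) = \frac{P(X = x \mid x \in \mathcal{C}_1)\, P(x \in \mathcal{C}_1)}{P(X = x)} = \frac{f_1(x)\, \pi_1}{\sum_{k=1}^{|\mathcal{C}|} f_k(x)\, \pi_k}, \tag{14}$$
where $|\mathcal{C}|$ is the number of classes, which is two here. The $f_1(x)$ and $\pi_1$ are the likelihood (class conditional) and prior probabilities, respectively, and the denominator is the marginal probability.
Therefore, Eq. (13) becomes:
$$\frac{f_1(x)\, \pi_1}{\sum_{k=1}^{|\mathcal{C}|} f_k(x)\, \pi_k} = \frac{f_2(x)\, \pi_2}{\sum_{k=1}^{|\mathcal{C}|} f_k(x)\, \pi_k} \implies f_1(x)\, \pi_1 = f_2(x)\, \pi_2. \tag{15}$$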
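Now let us think of the data as multivariate with dimensionality $d$. The PDF of the multivariate Gaussian distribution, $x \sim \mathcal{N}(\mu, \Sigma)$, is:
$$f(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\!\Big(-\frac{(x - \mu)^\top \Sigma^{-1} (x - \mu)}{2}\Big), \tag{16}$$
where $x \in \mathbb{R}^d$, $\mu \in \mathbb{R}^d$ is the mean, $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix, and $|\cdot|$ is the determinant of a matrix. The constant $\pi \approx 3.14$ in this equation should not be confused with the prior $\pi_k$ in Eq. (12) or (15). Therefore, Eq. (12) or (15) becomes:
$$\frac{1}{\sqrt{(2\pi)^d\, |\Sigma_1|}} \exp\!\Big(-\frac{(x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)}{2}\Big)\, \pi_1 = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_2|}} \exp\!\Big(-\frac{(x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2)}{2}\Big)\, \pi_2, \tag{17}$$
where the distributions of the first and second class are $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, respectively.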
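As a minimal sketch of the one-dimensional case at the start of this section, the boundary condition $f_1(x^*)\,\pi_1 = f_2(x^*)\,\pi_2$ can be solved numerically by bisection between the two means. The function names, the parameter values, and the choice of bisection are illustrative assumptions and not part of the derivation; the sketch also assumes that $f_1 \pi_1 - f_2 \pi_2$ changes sign exactly once between $\mu_1$ and $\mu_2$.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the univariate normal N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def boundary_1d(mu1, sigma1, pi1, mu2, sigma2, pi2, tol=1e-12):
    """Bisection on g(x) = f1(x)*pi1 - f2(x)*pi2 over [mu1, mu2], assuming mu1 < mu2."""
    def g(x):
        return gaussian_pdf(x, mu1, sigma1) * pi1 - gaussian_pdf(x, mu2, sigma2) * pi2

    lo, hi = mu1, mu2
    if g(lo) <= 0 or g(hi) >= 0:
        raise ValueError("no sign change between the means for these parameters")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:   # class 1 still dominates: boundary lies to the right
            lo = mid
        else:            # class 2 dominates: boundary lies to the left
            hi = mid
    return 0.5 * (lo + hi)

# Equal variances and equal priors: the boundary is the midpoint of the means.
x_star = boundary_1d(0.0, 1.0, 0.5, 4.0, 1.0, 0.5)

# A larger prior for class 1 pushes the boundary toward the mean of class 2.
x_star_skewed = boundary_1d(0.0, 1.0, 0.8, 4.0, 1.0, 0.2)
```

For equal variances the boundary also has a closed form, $x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_2 - \mu_1} \ln\frac{\pi_1}{\pi_2}$, which the numerical result should match.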
3 Linear Discriminant Analysis for Binary Classification
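In Linear Discriminant Analysis (LDA), we assume that the two classes have equal covariance matrices:
$$\Sigma_1 = \Sigma_2 = \Sigma. \tag{18}$$
Therefore, Eq. (17) becomes:
$$\frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\!\Big(-\frac{(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1)}{2}\Big)\, \pi_1 = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\!\Big(-\frac{(x - \mu_2)^\top \Sigma^{-1} (x - \mu_2)}{2}\Big)\, \pi_2$$
$$\implies \exp\!\Big(-\frac{(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1)}{2}\Big)\, \pi_1 = \exp\!\Big(-\frac{(x - \mu_2)^\top \Sigma^{-1} (x - \mu_2)}{2}\Big)\, \pi_2$$
$$\overset{(a)}{\implies} -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \ln(\pi_1) = -\frac{1}{2} (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2) + \ln(\pi_2),$$
where $(a)$ takes the natural logarithm of both sides of the equation.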
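We can simplify the quadratic term as:
$$(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) = x^\top \Sigma^{-1} x - x^\top \Sigma^{-1} \mu_1 - \mu_1^\top \Sigma^{-1} x + \mu_1^\top \Sigma^{-1} \mu_1 \overset{(a)}{=} x^\top \Sigma^{-1} x + \mu_1^\top \Sigma^{-1} \mu_1 - 2\, \mu_1^\top \Sigma^{-1} x, \tag{19}$$
where $(a)$ is because $x^\top \Sigma^{-1} \mu_1 = \mu_1^\top \Sigma^{-1} x$, as it is a scalar and $\Sigma^{-1}$ is symmetric, so $\Sigma^{-\top} = \Sigma^{-1}$. Thus, we have:
$$-\frac{1}{2} x^\top \Sigma^{-1} x - \frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1 + \mu_1^\top \Sigma^{-1} x + \ln(\pi_1) = -\frac{1}{2} x^\top \Sigma^{-1} x - \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2 + \mu_2^\top \Sigma^{-1} x + \ln(\pi_2).$$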
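Therefore, if we multiply both sides of the equation by $2$ and collect all terms on one side, we have:
$$2\, \big(\Sigma^{-1} (\mu_2 - \mu_1)\big)^\top x + \big(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2\big) + 2 \ln\!\Big(\frac{\pi_2}{\pi_1}\Big) = 0,$$
which is the equation of a line in the form of $a^\top x + b = 0$. Therefore, if we consider Gaussian distributions for the two classes where the covariance matrices are assumed to be equal, the decision boundary of classification is a line (a hyperplane when $d > 2$). Because of the linearity of the decision boundary which discriminates the two classes, this method is named linear discriminant analysis.
The terms were collected on the side corresponding to the second class; therefore, if we define $\delta(x): \mathbb{R}^d \to \mathbb{R}$ as the left-hand side of this equation,
$$\delta(x) := 2\, \big(\Sigma^{-1} (\mu_2 - \mu_1)\big)^\top x + \big(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2\big) + 2 \ln\!\Big(\frac{\pi_2}{\pi_1}\Big),$$
the class of an instance $x$ is estimated as:
$$\widehat{\mathcal{C}}(x) = \begin{cases} 1, & \text{if } \delta(x) < 0, \\ 2, & \text{if } \delta(x) > 0. \end{cases} \tag{22}$$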
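The binary LDA rule can be sketched in a few lines of NumPy. Here $\delta(x)$ is taken to be $2(\Sigma^{-1}(\mu_2-\mu_1))^\top x + (\mu_1^\top\Sigma^{-1}\mu_1 - \mu_2^\top\Sigma^{-1}\mu_2) + 2\ln(\pi_2/\pi_1)$, with class 1 chosen when $\delta(x) < 0$. The function names and the toy parameters are assumptions made for illustration, and the true means, covariance, and priors are assumed known.

```python
import numpy as np

def lda_delta(x, mu1, mu2, sigma, pi1, pi2):
    """Signed linear discriminant: negative favors class 1, positive favors class 2."""
    sigma_inv = np.linalg.inv(sigma)
    a = 2.0 * sigma_inv @ (mu2 - mu1)                       # normal vector of the boundary
    b = mu1 @ sigma_inv @ mu1 - mu2 @ sigma_inv @ mu2 + 2.0 * np.log(pi2 / pi1)
    return a @ x + b

def lda_classify(x, mu1, mu2, sigma, pi1, pi2):
    """Return 1 or 2 according to the sign of the discriminant."""
    return 1 if lda_delta(x, mu1, mu2, sigma, pi1, pi2) < 0 else 2

# Toy parameters (assumed for illustration only).
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 2.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

label_near_mu1 = lda_classify(np.array([0.1, -0.2]), mu1, mu2, sigma, 0.5, 0.5)
label_near_mu2 = lda_classify(np.array([1.9, 2.2]), mu1, mu2, sigma, 0.5, 0.5)

# With equal priors, the midpoint of the means lies exactly on the boundary.
delta_mid = lda_delta(0.5 * (mu1 + mu2), mu1, mu2, sigma, 0.5, 0.5)
```

Using `np.linalg.inv` keeps the sketch close to the formula; for larger or ill-conditioned covariance matrices, `np.linalg.solve` would usually be the safer choice.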
4 Quadratic Discriminant Analysis for Binary Classification
In Quadratic Discriminant Analysis (QDA), we relax the assumption of equality of the covariance matrices:
$$\Sigma_1 \neq \Sigma_2,$$
which means the covariances are not necessarily equal (if they are actually equal, the decision boundary will be linear and QDA reduces to LDA).
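Therefore, Eq. (17) becomes:
$$\frac{1}{\sqrt{(2\pi)^d\, |\Sigma_1|}} \exp\!\Big(-\frac{(x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)}{2}\Big)\, \pi_1 = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_2|}} \exp\!\Big(-\frac{(x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2)}{2}\Big)\, \pi_2$$
$$\overset{(a)}{\implies} -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln(|\Sigma_1|) - \frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) + \ln(\pi_1) = -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln(|\Sigma_2|) - \frac{1}{2} (x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2) + \ln(\pi_2),$$
where $(a)$ takes the natural logarithm of both sides of the equation. According to Eq. (19), we have:
$$-\frac{1}{2} \ln(|\Sigma_1|) - \frac{1}{2} x^\top \Sigma_1^{-1} x - \frac{1}{2} \mu_1^\top \Sigma_1^{-1} \mu_1 + \mu_1^\top \Sigma_1^{-1} x + \ln(\pi_1) = -\frac{1}{2} \ln(|\Sigma_2|) - \frac{1}{2} x^\top \Sigma_2^{-1} x - \frac{1}{2} \mu_2^\top \Sigma_2^{-1} \mu_2 + \mu_2^\top \Sigma_2^{-1} x + \ln(\pi_2).$$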
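Therefore, if we multiply both sides of the equation by $2$ and collect all terms on one side, we have:
$$x^\top \big(\Sigma_1^{-1} - \Sigma_2^{-1}\big)\, x + 2\, \big(\Sigma_2^{-1} \mu_2 - \Sigma_1^{-1} \mu_1\big)^\top x + \big(\mu_1^\top \Sigma_1^{-1} \mu_1 - \mu_2^\top \Sigma_2^{-1} \mu_2\big) + \ln\!\Big(\frac{|\Sigma_1|}{|\Sigma_2|}\Big) + 2 \ln\!\Big(\frac{\pi_2}{\pi_1}\Big) = 0,$$
which is in the quadratic form $x^\top A\, x + b^\top x + c = 0$. Therefore, if we consider Gaussian distributions for the two classes, the decision boundary of classification is quadratic. Because of the quadratic decision boundary which discriminates the two classes, this method is named quadratic discriminant analysis.
If we take $\delta(x)$ to be the left-hand side of this equation, the class of an instance $x$ is estimated as in Eq. (22): the first class if $\delta(x) < 0$ and the second class if $\delta(x) > 0$.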
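The quadratic discriminant can be checked numerically against its definition: $\delta(x)$ should equal twice the difference of the log scaled posteriors, $2\,[\ln(f_2(x)\pi_2) - \ln(f_1(x)\pi_1)]$. The sketch below is a hedged illustration; the function names and toy parameters are assumptions, and the class parameters are assumed known.

```python
import numpy as np

def qda_delta(x, mu1, sigma1, pi1, mu2, sigma2, pi2):
    """Quadratic discriminant x^T A x + b^T x + c; negative favors class 1."""
    s1_inv = np.linalg.inv(sigma1)
    s2_inv = np.linalg.inv(sigma2)
    quad = x @ (s1_inv - s2_inv) @ x
    lin = 2.0 * (s2_inv @ mu2 - s1_inv @ mu1) @ x
    const = mu1 @ s1_inv @ mu1 - mu2 @ s2_inv @ mu2
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return quad + lin + const + (logdet1 - logdet2) + 2.0 * np.log(pi2 / pi1)

def log_scaled_posterior(x, mu, sigma, pi):
    """ln(f(x) * pi) for a Gaussian class conditional f = N(mu, sigma)."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return (-0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(pi))

# Toy parameters (assumed for illustration only).
mu1, sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, sigma2 = np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
pi1, pi2 = 0.5, 0.5
x = np.array([1.0, 2.0])

delta = qda_delta(x, mu1, sigma1, pi1, mu2, sigma2, pi2)
delta_direct = 2.0 * (log_scaled_posterior(x, mu2, sigma2, pi2)
                      - log_scaled_posterior(x, mu1, sigma1, pi1))

label_at_mu1 = 1 if qda_delta(mu1, mu1, sigma1, pi1, mu2, sigma2, pi2) < 0 else 2
label_at_mu2 = 1 if qda_delta(mu2, mu1, sigma1, pi1, mu2, sigma2, pi2) < 0 else 2
```

When `sigma1` equals `sigma2`, the quadratic and log-determinant terms vanish and the expression reduces to the linear discriminant of LDA.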
5 LDA and QDA for Multi-class Classification
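For multiple classes, indexed by $k \in \{1, \dots, |\mathcal{C}|\}$, we work with the scaled posterior $f_k(x)\, \pi_k$ of Eq. (12) or (15), where $f_k(x)$ is the Gaussian PDF of Eq. (16) with mean $\mu_k$ and covariance matrix $\Sigma_k$. Taking the natural logarithm gives:
$$\ln\!\big(f_k(x)\, \pi_k\big) = -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln(|\Sigma_k|) - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln(\pi_k).$$
We drop the constant term $-\frac{d}{2} \ln(2\pi)$, which is the same for all classes (note that this term is a multiplicative factor before taking the logarithm). Thus, the scaled posterior of the $k$-th class becomes:
$$\delta_k(x) := -\frac{1}{2} \ln(|\Sigma_k|) - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln(\pi_k). \tag{28}$$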
In QDA, the class of the instance $x$ is estimated as:
$$\widehat{\mathcal{C}}(x) = \arg\max_{k} \; \delta_k(x),$$
because it maximizes the posterior of that class. In this expression, $\delta_k(x)$ is Eq. (28).
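In LDA, we assume that the covariance matrices of the classes are equal:
$$\Sigma_1 = \dots = \Sigma_{|\mathcal{C}|} = \Sigma.$$
Therefore, Eq. (28) becomes:
$$\delta_k(x) = -\frac{1}{2} \ln(|\Sigma|) - \frac{1}{2} x^\top \Sigma^{-1} x - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \mu_k^\top \Sigma^{-1} x + \ln(\pi_k).$$
We drop the constant terms $-\frac{1}{2} \ln(|\Sigma|)$ and $-\frac{1}{2} x^\top \Sigma^{-1} x$, which are the same for all classes (note that, before taking the logarithm, the first term appears as a multiplicative factor and the second as a multiplicative exponential factor). Thus, the scaled posterior of the $k$-th class becomes:
$$\delta_k(x) := \mu_k^\top \Sigma^{-1} x - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \ln(\pi_k).$$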
In conclusion, QDA and LDA deal with maximizing the posterior of classes but work with the likelihoods (class conditional) and priors.
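The multi-class rules can be sketched as score functions followed by an arg-max. The sketch assumes the QDA score $-\frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \ln\pi_k$ and the LDA score $\mu_k^\top\Sigma^{-1}x - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \ln\pi_k$; the function names and toy parameters are illustrative assumptions.

```python
import numpy as np

def qda_scores(x, means, covs, priors):
    """Scaled log-posterior of every class with class-specific covariances."""
    scores = []
    for mu, sigma, pi in zip(means, covs, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(sigma)
        scores.append(-0.5 * logdet - 0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(pi))
    return np.array(scores)

def lda_scores(x, means, sigma, priors):
    """Scaled log-posterior of every class with one shared covariance."""
    scores = []
    for mu, pi in zip(means, priors):
        w = np.linalg.solve(sigma, mu)          # Sigma^{-1} mu_k
        scores.append(w @ x - 0.5 * mu @ w + np.log(pi))
    return np.array(scores)

# Toy parameters (assumed for illustration only); class indices are 0-based.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
shared = np.array([[1.0, 0.2], [0.2, 1.5]])
priors = [0.5, 0.3, 0.2]
x = np.array([2.5, 0.4])

s_qda = qda_scores(x, means, [shared] * 3, priors)
s_lda = lda_scores(x, means, shared, priors)
predicted_qda = int(np.argmax(s_qda))
predicted_lda = int(np.argmax(s_lda))

# With a shared covariance the two scores differ only by a class-independent constant.
gap = s_qda - s_lda
```

Because the gap is the same for every class, the arg-max is unchanged, which is why the dropped terms do not affect the LDA decision.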
6 Estimation of Parameters in LDA and QDA
In LDA and QDA, we have several parameters which are required in order to calculate the posteriors. These parameters are the means and the covariance matrices of classes and the priors of classes.
The priors of the classes are tricky to calculate. It is somewhat of a chicken-and-egg problem, because we want to know the class probabilities (priors) to estimate the class of an instance, but we do not have the priors and must estimate them. Usually, the prior of the $k$-th class is estimated according to the sample size of the $k$-th class:
$$\widehat{\pi}_k = \frac{n_k}{n}, \tag{32}$$
where $n_k$ and $n$ are the number of training instances in the $k$-th class and in total, respectively. This estimation considers a Bernoulli distribution for choosing every instance out of the overall training set to be in the $k$-th class.
The mean of the $k$-th class can be estimated using Maximum Likelihood Estimation (MLE), or the Method of Moments (MOM), for the mean of a Gaussian distribution:
$$\mathbb{R}^d \ni \widehat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^{n} x_i\, \mathbb{I}\big(\mathcal{C}(x_i) = k\big),$$
where $\mathbb{I}(\cdot)$ is the indicator function, which is one if its condition is satisfied and zero otherwise.
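In QDA, the covariance matrix of the $k$-th class is estimated using MLE:
$$\mathbb{R}^{d \times d} \ni \widehat{\Sigma}_k = \frac{1}{n_k} \sum_{i=1}^{n} (x_i - \widehat{\mu}_k)(x_i - \widehat{\mu}_k)^\top\, \mathbb{I}\big(\mathcal{C}(x_i) = k\big).$$
Alternatively, we can use the unbiased estimate of the covariance matrix:
$$\mathbb{R}^{d \times d} \ni \widehat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i=1}^{n} (x_i - \widehat{\mu}_k)(x_i - \widehat{\mu}_k)^\top\, \mathbb{I}\big(\mathcal{C}(x_i) = k\big).$$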
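In LDA, we assume that the covariance matrices of the classes are equal; therefore, we use the weighted average of the estimated covariance matrices as the common covariance matrix in LDA:
$$\mathbb{R}^{d \times d} \ni \widehat{\Sigma} = \frac{\sum_{k=1}^{|\mathcal{C}|} n_k\, \widehat{\Sigma}_k}{\sum_{r=1}^{|\mathcal{C}|} n_r} = \frac{\sum_{k=1}^{|\mathcal{C}|} n_k\, \widehat{\Sigma}_k}{n},$$
where the weights are the cardinalities of the classes.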
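The estimators of this section can be sketched as follows: priors as class frequencies, per-class means, per-class covariances (MLE or unbiased), and the pooled covariance weighted by class size. The function name, the return layout, and the tiny dataset are assumptions chosen so that the results are easy to verify by hand.

```python
import numpy as np

def estimate_parameters(X, y, unbiased=False):
    """Estimate priors, means, per-class covariances, and the pooled covariance."""
    classes = np.unique(y)
    n = X.shape[0]
    priors, means, covs, counts = [], [], [], []
    for k in classes:
        Xk = X[y == k]                       # instances of the k-th class
        nk = Xk.shape[0]
        mu = Xk.mean(axis=0)
        centered = Xk - mu
        denom = nk - 1 if unbiased else nk   # unbiased estimate vs. MLE
        covs.append(centered.T @ centered / denom)
        priors.append(nk / n)
        means.append(mu)
        counts.append(nk)
    # Common covariance for LDA: average weighted by class cardinality.
    pooled = sum(nk * cov for nk, cov in zip(counts, covs)) / n
    return classes, np.array(priors), means, covs, pooled

# Tiny hand-checkable dataset (assumed for illustration only).
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

classes, priors, means, covs, pooled = estimate_parameters(X, y)
```

For this dataset the MLE covariance of class 0 is the identity matrix, and the per-class result agrees with `np.cov(Xk.T, bias=True)`.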
7 LDA and QDA are Metric Learning!
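Recall Eq. (28), which is the scaled posterior for QDA. First, assume that the covariance matrices are all equal (as we have in LDA) and that they are all the identity matrix:
$$\Sigma_1 = \dots = \Sigma_{|\mathcal{C}|} = I,$$
which means that all the classes are assumed to be spherically distributed in the $d$-dimensional space. Under this assumption, Eq. (28) becomes:
$$\delta_k(x) = -\frac{1}{2} (x - \mu_k)^\top (x - \mu_k) + \ln(\pi_k), \tag{38}$$
because $|I| = 1$, $\ln(1) = 0$, and $I^{-1} = I$. If we assume that the priors are all equal, the term $\ln(\pi_k)$ is constant and can be dropped:
$$\delta_k(x) = -\frac{1}{2} (x - \mu_k)^\top (x - \mu_k) = -\frac{1}{2} d_k^2, \tag{39}$$
where $d_k$ is the Euclidean distance from the mean of the $k$-th class:
$$d_k = \|x - \mu_k\|_2 = \sqrt{(x - \mu_k)^\top (x - \mu_k)}. \tag{40}$$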
Thus, QDA or LDA reduces to a simple Euclidean distance from the means of the classes if the covariance matrices are all the identity matrix and the priors are equal. Distance from the class means is one of the simplest classification methods, where the metric used is the Euclidean distance.
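A small numerical check of this reduction: with identity covariances and equal priors, maximizing $\delta_k(x) = -\frac{1}{2}\|x - \mu_k\|_2^2$ selects the same class as the nearest-mean rule. The means and query points below are made-up values for illustration.

```python
import numpy as np

def delta_identity_equal_priors(x, means):
    """Scaled posterior -0.5 * ||x - mu_k||^2 for every class."""
    return np.array([-0.5 * (x - mu) @ (x - mu) for mu in means])

# Toy class means and query points (assumed for illustration only).
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
points = [np.array([0.5, 0.2]), np.array([3.1, 1.0]), np.array([1.0, 3.5])]

# Class chosen by maximizing the scaled posterior.
by_posterior = [int(np.argmax(delta_identity_equal_priors(p, means))) for p in points]

# Class chosen by the nearest mean in Euclidean distance.
by_distance = [int(np.argmin([np.linalg.norm(p - mu) for mu in means])) for p in points]
```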
Eq. (39) has a very interesting message. We know that in metric Multi-Dimensional Scaling (MDS) (Cox & Cox, 2000) and kernel Principal Component Analysis (PCA), we have (see (Ham et al., 2004) and Chapter 2 in (Strange & Zwiggelaar, 2014)):
$$K = -\frac{1}{2} H D H, \tag{41}$$
where $D \in \mathbb{R}^{n \times n}$ is the distance matrix whose elements are the (in classical MDS, squared) distances between the data instances, $K \in \mathbb{R}^{n \times n}$ is the kernel matrix over the data instances, $\mathbb{R}^{n \times n} \ni H := I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ is the centering matrix, and $\mathbb{R}^{n} \ni \mathbf{1} := [1, 1, \dots, 1]^\top$. If the elements of the distance matrix are obtained using the Euclidean distance, MDS is equivalent to Principal Component Analysis (PCA) (Jolliffe, 2011).
Comparing Eqs. (39) and (41) shows an interesting connection between the posterior of a class in QDA and the kernel over the data instances of the class. In this comparison, Eq. (41) should be considered for a single class and not the entire data, so $D \in \mathbb{R}^{n_k \times n_k}$, $K \in \mathbb{R}^{n_k \times n_k}$, and $H \in \mathbb{R}^{n_k \times n_k}$.
Now, consider the case where the covariance matrices are still all the identity matrix but the priors are not equal. In this case, we have Eq. (38). If we take the exponential (the inverse of the logarithm) of this expression, $\pi_k$ becomes a scale factor (weight). This means that we are still using a distance metric to measure the distance of an instance from the means of the classes, but we are scaling the distances by the priors of the classes. If a class occurs more often, i.e., its prior is larger, it must have a larger posterior, so we reduce the distance from the mean of that class. In other words, we move the decision boundary according to the priors of the classes (see Fig. 2).
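To make the boundary shift concrete, consider two classes with identity covariances. Equating $-\frac{1}{2}\|x-\mu_1\|^2 + \ln\pi_1$ and $-\frac{1}{2}\|x-\mu_2\|^2 + \ln\pi_2$ along the line joining the means gives a boundary displaced from the midpoint by $t = \ln(\pi_1/\pi_2)\,/\,\|\mu_2 - \mu_1\|_2$ toward $\mu_2$. This closed form is derived here for the sketch and is not stated in the text above; the names and values are illustrative assumptions.

```python
import numpy as np

def scores_identity_cov(x, means, priors):
    """Scaled posterior -0.5 * ||x - mu_k||^2 + ln(pi_k) for every class."""
    return np.array([-0.5 * (x - mu) @ (x - mu) + np.log(pi)
                     for mu, pi in zip(means, priors)])

def boundary_offset(mu1, mu2, pi1, pi2):
    """Signed shift of the boundary from the midpoint, measured from mu1 toward mu2."""
    return np.log(pi1 / pi2) / np.linalg.norm(mu2 - mu1)

# Toy means (assumed for illustration only).
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
direction = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)
midpoint = 0.5 * (mu1 + mu2)

t_equal = boundary_offset(mu1, mu2, 0.5, 0.5)     # no shift for equal priors
t_skewed = boundary_offset(mu1, mu2, 0.9, 0.1)    # class 1 is more likely

# At the shifted boundary point the two scaled posteriors coincide.
boundary_point = midpoint + t_skewed * direction
scores_at_boundary = scores_identity_cov(boundary_point, [mu1, mu2], [0.9, 0.1])
```

A positive shift means the region assigned to the more probable class grows, which is the effect described for Fig. 2.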
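As the next step, consider the more general case where the covariance matrices are not equal, as we have in QDA. We apply Singular Value Decomposition (SVD) to the covariance matrix of the $k$-th class:
$$\Sigma_k = U_k \Lambda_k U_k^\top,$$
where the left and right matrices of singular vectors are equal because the covariance matrix is symmetric. Therefore:
$$\Sigma_k^{-1} = U_k \Lambda_k^{-1} U_k^{-1} = U_k \Lambda_k^{-1} U_k^\top,$$
where $U_k^{-1} = U_k^\top$ because $U_k$ is an orthogonal matrix. Therefore, we can simplify the following term:
$$(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) = (x - \mu_k)^\top U_k \Lambda_k^{-1} U_k^\top (x - \mu_k) = (U_k^\top x - U_k^\top \mu_k)^\top \Lambda_k^{-1} (U_k^\top x - U_k^\top \mu_k).$$
As $\Lambda_k^{-1}$ is a diagonal matrix with non-negative elements (because $\Sigma_k$ is a covariance matrix), we can decompose it as:
$$\Lambda_k^{-1} = \Lambda_k^{-1/2} \Lambda_k^{-1/2}.$$
Therefore:
$$(U_k^\top x - U_k^\top \mu_k)^\top \Lambda_k^{-1/2} \Lambda_k^{-1/2} (U_k^\top x - U_k^\top \mu_k) \overset{(a)}{=} \big(\Lambda_k^{-1/2} U_k^\top x - \Lambda_k^{-1/2} U_k^\top \mu_k\big)^\top \big(\Lambda_k^{-1/2} U_k^\top x - \Lambda_k^{-1/2} U_k^\top \mu_k\big),$$
where $(a)$ is because $\Lambda_k^{-\top/2} = \Lambda_k^{-1/2}$, since it is diagonal. We define the following transformation:
$$\phi_k: x \mapsto \Lambda_k^{-1/2} U_k^\top x,$$
which also results in the transformation of the mean: $\phi_k: \mu_k \mapsto \Lambda_k^{-1/2} U_k^\top \mu_k$. Therefore, Eq. (28) can be restated as:
$$\delta_k(x) = -\frac{1}{2} \ln(|\Sigma_k|) - \frac{1}{2} \big(\phi_k(x) - \phi_k(\mu_k)\big)^\top \big(\phi_k(x) - \phi_k(\mu_k)\big) + \ln(\pi_k).$$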
Ignoring the terms $-\frac{1}{2} \ln(|\Sigma_k|)$ and $\ln(\pi_k)$, we can see that the transformation has changed the covariance matrix of the class to the identity matrix. Therefore, QDA (and also LDA) can be seen as a simple comparison of distances from the means of the classes after applying a transformation to the data of every class. In other words, we are learning the metric using the SVD of the covariance matrix of every class. Thus, LDA and QDA can be seen as metric learning (Yang & Jin, 2006; Kulis, 2013) from this perspective. Note that in metric learning, a valid distance metric is defined as (Yang & Jin, 2006):
$$\|x - \mu_k\|_A^2 := (x - \mu_k)^\top A\, (x - \mu_k), \tag{44}$$
where $A$ is a positive semi-definite matrix, i.e., $A \succeq 0$. In QDA, we are also using $(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)$. The covariance matrix is positive semi-definite by the properties of covariance matrices. Moreover, when it is invertible (as QDA requires), it is positive definite and so is its inverse; hence $\Sigma_k^{-1} \succeq 0$. Therefore, QDA is using metric learning (and, as will be discussed in the next section, it can be seen as a manifold learning method, too).
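QDA and LDA can also be related to the Mahalanobis distance, which is defined as:
$$\|x - \mu\|_M^2 := (x - \mu)^\top \Sigma^{-1} (x - \mu), \tag{45}$$
where $\Sigma$ is the covariance matrix of the cloud of data whose mean is $\mu$. The intuition of the Mahalanobis distance is that if we have several data clouds (e.g., classes), the distance from the class with larger variance should be scaled down, because that class takes up more of the space and is therefore more probable to occur. The scaling down appears in the inverse of the covariance matrix. Comparing $(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)$ in QDA or LDA with Eq. (45) shows that QDA and LDA are, in a sense, using the Mahalanobis distance.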
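The per-class transformation $x \mapsto \Lambda_k^{-1/2} U_k^\top x$ can be verified numerically: the squared Euclidean distance after the transformation should equal the squared Mahalanobis distance $(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)$, and the transformed covariance should be the identity. The sketch uses `np.linalg.eigh` rather than an SVD routine, which gives the same factorization for a symmetric positive definite matrix; the names and values are illustrative assumptions.

```python
import numpy as np

def whitening_map(sigma):
    """Return W = Lambda^{-1/2} U^T for a symmetric positive definite sigma."""
    eigvals, U = np.linalg.eigh(sigma)        # sigma = U diag(eigvals) U^T
    return np.diag(1.0 / np.sqrt(eigvals)) @ U.T

# Toy class parameters and query point (assumed for illustration only).
sigma_k = np.array([[2.0, 0.6],
                    [0.6, 1.0]])
mu_k = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

W = whitening_map(sigma_k)
diff = x - mu_k

# Squared Mahalanobis distance in the original space.
mahalanobis_sq = diff @ np.linalg.solve(sigma_k, diff)

# Squared Euclidean distance after transforming both x and the mean.
euclidean_sq_after_map = np.sum((W @ x - W @ mu_k) ** 2)

# Covariance of the transformed class.
whitened_cov = W @ sigma_k @ W.T
```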
8 LDA ≡ FDA
In the previous section, we saw that LDA and QDA can be seen as metric learning. We know that metric learning can be seen as a family of manifold learning methods. We briefly explain the reason for this assertion: as $A \succeq 0$, we can say $A = U U^\top \succeq 0$. Therefore, Eq. (44) becomes:
$$\|x - \mu_k\|_A^2 = (x - \mu_k)^\top U U^\top (x - \mu_k) = (U^\top x - U^\top \mu_k)^\top (U^\top x - U^\top \mu_k),$$
which means that metric learning can be seen as a comparison of simple Euclidean distances after the transformation $\phi: x \mapsto U^\top x$, which is a projection into a subspace with projection matrix $U$. Thus, metric learning is a manifold learning approach. This gives a hint that Fisher Discriminant Analysis (FDA) (Fisher, 1936; Welling, 2005), which is a manifold learning approach (Tharwat et al., 2017), might have a connection to LDA, especially because the names FDA and LDA are often used interchangeably in the literature. In fact, other names for FDA are Fisher LDA (FLDA) and even LDA.
We know that if we project (transform) the data of a class using a projection vector $u \in \mathbb{R}^d$ to a $p$ dimensional subspace ($p \leq d$), i.e.:
$$x \mapsto u^\top x,$$
for all data instances of the class, the mean and the covariance matrix of the class are transformed as:
$$\mu \mapsto u^\top \mu, \qquad \Sigma \mapsto u^\top \Sigma\, u,$$
because of the properties of the mean and variance.
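The Fisher criterion (Xu & Lu, 2006) is the ratio of the between-class variance, $\sigma_b^2$, and within-class variance, $\sigma_w^2$:
$$f(u) := \frac{\sigma_b^2}{\sigma_w^2} = \frac{(u^\top \mu_2 - u^\top \mu_1)^2}{u^\top \Sigma_1 u + u^\top \Sigma_2 u} = \frac{\big(u^\top (\mu_2 - \mu_1)\big)^2}{u^\top (\Sigma_1 + \Sigma_2)\, u}.$$
FDA maximizes the Fisher criterion:
$$\max_{u} \; \frac{\big(u^\top (\mu_2 - \mu_1)\big)^2}{u^\top (\Sigma_1 + \Sigma_2)\, u},$$
which can be restated as:
$$\max_{u} \; \big(u^\top (\mu_2 - \mu_1)\big)^2 \quad \text{subject to} \quad u^\top (\Sigma_1 + \Sigma_2)\, u = 1.$$
The Lagrangian is:
$$\mathcal{L} = u^\top (\mu_2 - \mu_1)(\mu_2 - \mu_1)^\top u - \lambda \big(u^\top (\Sigma_1 + \Sigma_2)\, u - 1\big),$$
where $\lambda$ is the Lagrange multiplier. Equating the derivative of $\mathcal{L}$ to zero gives:
$$\frac{\partial \mathcal{L}}{\partial u} = 2\, (\mu_2 - \mu_1)(\mu_2 - \mu_1)^\top u - 2 \lambda\, (\Sigma_1 + \Sigma_2)\, u = 0 \implies (\mu_2 - \mu_1)(\mu_2 - \mu_1)^\top u = \lambda\, (\Sigma_1 + \Sigma_2)\, u,$$
which is a generalized eigenvalue problem $\big((\mu_2 - \mu_1)(\mu_2 - \mu_1)^\top, (\Sigma_1 + \Sigma_2)\big)$ according to (Ghojogh et al., 2019b). The projection vector is the eigenvector of $(\Sigma_1 + \Sigma_2)^{-1} (\mu_2 - \mu_1)(\mu_2 - \mu_1)^\top$; since $(\mu_2 - \mu_1)^\top u$ is a scalar, we can say:
$$u \propto (\Sigma_1 + \Sigma_2)^{-1} (\mu_2 - \mu_1).$$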
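In LDA, the equality of covariance matrices is assumed. Thus, according to Eq. (18), we can say:
$$u \propto (2 \Sigma)^{-1} (\mu_2 - \mu_1) \propto \Sigma^{-1} (\mu_2 - \mu_1).$$
According to Eq. (46), we have:
$$u^\top x \propto \big(\Sigma^{-1} (\mu_2 - \mu_1)\big)^\top x. \tag{53}$$
Comparing Eq. (53) with Eq. (23) shows that LDA and FDA are equivalent up to a scaling factor $\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2$ (note that this term is multiplied as an exponential factor before taking the logarithm to obtain Eq. (23), so this term is a scaling factor). Hence, we can say:
$$\text{LDA} \equiv \text{FDA}.$$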
In other words, FDA projects into a subspace. On the other hand, according to Section 7, LDA can be seen as metric learning with a subspace in which the Euclidean distance is used after projecting onto that subspace. The two subspaces of FDA and LDA are the same subspace. It should be noted that in manifold (subspace) learning, the scale does not matter because all the distances scale similarly.
Note that LDA assumes one (and not several) Gaussian for every class, and so does FDA. That is why FDA faces problems with multi-modal data (Sugiyama, 2007).
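The equivalence of the two directions can be checked numerically under the LDA assumption $\Sigma_1 = \Sigma_2 = \Sigma$: the leading eigenvector of $(\Sigma_1+\Sigma_2)^{-1}(\mu_2-\mu_1)(\mu_2-\mu_1)^\top$ should be parallel to $\Sigma^{-1}(\mu_2-\mu_1)$. The sketch below uses a plain eigen-decomposition of the product matrix; the parameter values are made up for illustration.

```python
import numpy as np

# Toy parameters with a shared covariance (assumed for illustration only).
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
sigma = np.array([[1.5, 0.4],
                  [0.4, 1.0]])

d = mu2 - mu1
S_w = sigma + sigma                      # Sigma_1 + Sigma_2 with equal covariances

# FDA direction: leading eigenvector of (Sigma_1 + Sigma_2)^{-1} d d^T.
M = np.linalg.solve(S_w, np.outer(d, d))
eigvals, eigvecs = np.linalg.eig(M)
u_fda = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# LDA direction: normal vector of the linear boundary, Sigma^{-1} d.
u_lda = np.linalg.solve(sigma, d)

# Absolute cosine of the angle between the two directions (sign and scale are arbitrary).
cosine = abs(u_fda @ u_lda) / (np.linalg.norm(u_fda) * np.linalg.norm(u_lda))
```

The absolute value is taken because an eigenvector is only defined up to sign and scale, which matches the remark that scale does not matter in subspace learning.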
9 Relation to Logistic Regression
According to Eqs. (16) and (32), Gaussian and Bernoulli distributions are used for the likelihood (class conditional) and prior, respectively, in LDA and QDA. Thus, we make assumptions about the likelihood and prior, although we finally work with the posterior in LDA and QDA according to Eq. (15). Logistic regression (Kleinbaum et al., 2002) asks: why make assumptions about the likelihood and prior when we ultimately want to work with the posterior? Let us instead make an assumption directly about the posterior.
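In logistic regression, first a linear function is applied to the data to have $\beta^\top x'$, where $\mathbb{R}^{d+1} \ni x' = [x^\top, 1]^\top$ and $\beta \in \mathbb{R}^{d+1}$ include the intercept. Then, the logistic function is used in order to have a value in the range $(0, 1)$ to simulate a probability. Therefore, in logistic regression, the posterior is assumed to be:
$$P\big(\mathcal{C}(x) \mid X = x\big) = \Big(\frac{\exp(\beta^\top x')}{1 + \exp(\beta^\top x')}\Big)^{\mathcal{C}(x)} \Big(\frac{1}{1 + \exp(\beta^\top x')}\Big)^{1 - \mathcal{C}(x)},$$
where $\mathcal{C}(x) \in \{0, 1\}$ for the two classes. Logistic regression considers the coefficient $\beta$ as the parameter to be optimized and uses Newton's method (Boyd & Vandenberghe, 2004) for the optimization. Therefore, in summary, logistic regression makes an assumption about the posterior, while LDA and QDA make assumptions about the likelihood and prior.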
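A minimal sketch of fitting the logistic posterior with Newton's method is given below, assuming labels in $\{0, 1\}$ and a constant column appended for the intercept. The function name, the fixed iteration count, the tiny ridge term added to the Hessian for numerical safety, and the toy data are all assumptions of this sketch, not details taken from the cited references.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, ridge=1e-8):
    """Maximize the Bernoulli log-likelihood of the logistic posterior by Newton steps."""
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])            # x' = [x^T, 1]^T
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))        # posterior of class 1
        grad = Xb.T @ (y - p)                       # gradient of the log-likelihood
        weights = p * (1.0 - p)
        hess = Xb.T @ (Xb * weights[:, None]) + ridge * np.eye(Xb.shape[1])
        beta = beta + np.linalg.solve(hess, grad)   # Newton update
    return beta

# Small one-dimensional dataset with overlapping classes (assumed for illustration only),
# so that the maximum-likelihood solution is finite.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

beta = fit_logistic_newton(X, y)

def posterior_class1(x_value):
    """Assumed helper: posterior of class 1 at a scalar input."""
    return 1.0 / (1.0 + np.exp(-(beta[0] * x_value + beta[1])))

# The gradient should vanish at the fitted coefficients.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
grad_norm = np.linalg.norm(Xb.T @ (y - 1.0 / (1.0 + np.exp(-Xb @ beta))))
```

Unlike LDA and QDA, no class means or covariances are estimated here; only the coefficients of the posterior are optimized.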
10 Relation to Bayes Optimal Classifier and Gaussian Naive Bayes
The Bayes classifier maximizes the posteriors of the classes (Murphy, 2012):
$$\widehat{\mathcal{C}}(x) = \arg\max_{k} \; P(x \in \mathcal{C}_k \mid X = x).$$