Fisher and Kernel Fisher Discriminant Analysis: Tutorial

06/22/2019, by Benyamin Ghojogh, et al.

This is a detailed tutorial paper which explains Fisher Discriminant Analysis (FDA) and kernel FDA. We start with projection and reconstruction. Then, one- and multi-dimensional FDA subspaces are covered. Scatters in two- and then multi-class cases are explained in FDA. Then, we discuss the rank of the scatters and the dimensionality of the subspace. A real-life example is also provided for interpreting FDA. Then, the possible singularity of the scatter is discussed in order to introduce robust FDA. PCA and FDA directions are also compared. We also prove that FDA and linear discriminant analysis are equivalent. Fisher forest is also introduced as an ensemble of Fisher subspaces useful for handling data with different features and dimensionality. Afterwards, kernel FDA is explained for both one- and multi-dimensional subspaces with both two- and multi-class cases. Finally, some simulations are performed on the AT&T face dataset to illustrate FDA and compare it with PCA.


1 Introduction

Assume we have a dataset of instances or data points $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ with sample size $n$ and dimensionality $\boldsymbol{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The $\{\boldsymbol{x}_i\}_{i=1}^{n}$ are the input data to the model and the $\{y_i\}_{i=1}^{n}$ are the observations (labels). We define $\mathbb{R}^{d \times n} \ni \boldsymbol{X} := [\boldsymbol{x}_1, \dots, \boldsymbol{x}_n]$ and $\mathbb{R}^{n} \ni \boldsymbol{y} := [y_1, \dots, y_n]^\top$. We can also have an out-of-sample data point, $\boldsymbol{x}_t \in \mathbb{R}^d$, which is not in the training set. If there are $n_t$ out-of-sample data points, $\{\boldsymbol{x}_{t,i}\}_{i=1}^{n_t}$, we define $\mathbb{R}^{d \times n_t} \ni \boldsymbol{X}_t := [\boldsymbol{x}_{t,1}, \dots, \boldsymbol{x}_{t,n_t}]$. Usually, the data points exist on a subspace or sub-manifold. Subspace or manifold learning tries to learn this sub-manifold (Ghojogh et al., 2019b).

Here, we consider the case where the observations come from a discrete set, so that the task is classification. Assume the dataset consists of $c$ classes, where $n_j$ denotes the sample size (cardinality) of the $j$-th class.

We want to find a subspace (or sub-manifold) which separates the classes as much as possible while the data also become as spread out as possible. Fisher Discriminant Analysis (FDA) (Friedman et al., 2009) pursues this goal. It was first proposed in (Fisher, 1936) by Sir Ronald Aylmer Fisher (1890 – 1962), who was a genius in statistics. He proposed many important concepts in modern statistics, such as variance (Fisher, 1919), FDA (Fisher, 1936), Fisher information (Frieden, 2004), Analysis of Variance (ANOVA) (Fisher, 1992), etc. The paper (Fisher, 1936), which proposed FDA, was also the first paper introducing the well-known Iris flower dataset. Note that Fisher's work mostly concentrated on statistics in the area of genetics. Much of his work was about variance, so it is no wonder that FDA is all about variance and scatters.

Kernel FDA (Mika et al., 1999, 2000) pursues the goal of FDA in the feature space. FDA and kernel FDA have had many different applications. Some examples of applications of FDA are face recognition (Fisherfaces) (Belhumeur et al., 1997; Etemad & Chellappa, 1997; Zhao et al., 1999), action recognition (Fisherposes) (Ghojogh et al., 2017; Mokari et al., 2018), and gesture recognition (Samadani et al., 2013). Some examples of applications of kernel FDA are face recognition (kernel Fisherfaces) (Yang, 2002; Liu et al., 2004) and palmprint recognition (Wang & Ruan, 2006).

In the literature, FDA is sometimes referred to as Linear Discriminant Analysis (LDA) or Fisher LDA (FLDA). This is because FDA and LDA (Ghojogh & Crowley, 2019a) are equivalent, although LDA is a classification method and not a subspace learning algorithm. In this paper, we will prove why they are equivalent.

2 Projection Formulation

2.1 Projection

Assume we have a data point $\boldsymbol{x} \in \mathbb{R}^d$. We want to project this data point onto the vector space spanned by $p$ vectors $\{\boldsymbol{u}_1, \dots, \boldsymbol{u}_p\}$, where each vector is $d$-dimensional and usually $p \ll d$. We stack these vectors column-wise in the matrix $\mathbb{R}^{d \times p} \ni \boldsymbol{U} := [\boldsymbol{u}_1, \dots, \boldsymbol{u}_p]$. In other words, we want to project $\boldsymbol{x}$ onto the column space of $\boldsymbol{U}$, denoted by $\mathbb{Col}(\boldsymbol{U})$.

The projection of $\boldsymbol{x}$ onto $\mathbb{Col}(\boldsymbol{U})$ and then its representation in $\mathbb{R}^d$ (its reconstruction) can be seen as a linear system of equations:

$$\mathbb{R}^d \ni \hat{\boldsymbol{x}} := \boldsymbol{U} \boldsymbol{\beta}, \qquad (1)$$

where we should find the unknown coefficients $\boldsymbol{\beta} \in \mathbb{R}^p$.

If the $\boldsymbol{x}$ lies in $\mathbb{Col}(\boldsymbol{U})$, this linear system has an exact solution, so $\hat{\boldsymbol{x}} = \boldsymbol{x} = \boldsymbol{U} \boldsymbol{\beta}$. However, if $\boldsymbol{x}$ does not lie in this space, there is no exact solution and we should solve for the projection of $\boldsymbol{x}$ onto $\mathbb{Col}(\boldsymbol{U})$ and then its reconstruction. In other words, we should solve Eq. (1) in the least-squares sense. In this case, $\hat{\boldsymbol{x}}$ and $\boldsymbol{x}$ are different and we have a residual:

$$\boldsymbol{r} := \boldsymbol{x} - \hat{\boldsymbol{x}} = \boldsymbol{x} - \boldsymbol{U} \boldsymbol{\beta}, \qquad (2)$$

which we want to be small. As can be seen in Fig. 1, the smallest residual vector is orthogonal to $\mathbb{Col}(\boldsymbol{U})$; therefore:

$$\boldsymbol{U}^\top (\boldsymbol{x} - \boldsymbol{U} \boldsymbol{\beta}) = \boldsymbol{0} \implies \boldsymbol{\beta} = (\boldsymbol{U}^\top \boldsymbol{U})^{-1} \boldsymbol{U}^\top \boldsymbol{x}. \qquad (3)$$

It is noteworthy that Eq. (3) is also the formula of the coefficients in linear regression (Friedman et al., 2009), where the input data are the rows of $\boldsymbol{U}$ and the labels are $\boldsymbol{x}$; however, our goal here is different.

Plugging Eq. (3) in Eq. (1) gives us:

$$\hat{\boldsymbol{x}} = \boldsymbol{U} (\boldsymbol{U}^\top \boldsymbol{U})^{-1} \boldsymbol{U}^\top \boldsymbol{x}.$$

We define:

$$\mathbb{R}^{d \times d} \ni \boldsymbol{\Pi} := \boldsymbol{U} (\boldsymbol{U}^\top \boldsymbol{U})^{-1} \boldsymbol{U}^\top, \qquad (4)$$

as the "projection matrix" because it projects $\boldsymbol{x}$ onto $\mathbb{Col}(\boldsymbol{U})$ (and reconstructs back). Note that $\boldsymbol{\Pi}$ is also referred to as the "hat matrix" in the literature because it puts a hat on top of $\boldsymbol{x}$.

If the vectors $\{\boldsymbol{u}_1, \dots, \boldsymbol{u}_p\}$ are orthonormal (the matrix $\boldsymbol{U}$ is orthogonal), we have $\boldsymbol{U}^\top \boldsymbol{U} = \boldsymbol{I}$ and thus $(\boldsymbol{U}^\top \boldsymbol{U})^{-1} = \boldsymbol{I}$. Therefore, Eq. (4) is simplified to:

$$\boldsymbol{\Pi} = \boldsymbol{U} \boldsymbol{U}^\top. \qquad (5)$$

So, we have:

$$\hat{\boldsymbol{x}} = \boldsymbol{\Pi} \boldsymbol{x} = \boldsymbol{U} \boldsymbol{U}^\top \boldsymbol{x}. \qquad (6)$$

Figure 1: The residual and projection onto the column space of $\boldsymbol{U}$.
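To make Eqs. (1) to (6) concrete, the following is a minimal NumPy sketch (not from the paper; variable names such as U, x, and beta are illustrative) that projects a vector onto the column space of a matrix, first with the general projection matrix of Eq. (4) and then with the simplified form of Eqs. (5) and (6) after orthonormalizing the basis:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 2
U = rng.standard_normal((d, p))    # p basis vectors, stacked column-wise
x = rng.standard_normal(d)         # a point generally outside Col(U)

# Eq. (3): least-squares coefficients; Eq. (1): reconstruction x_hat = U beta
beta = np.linalg.solve(U.T @ U, U.T @ x)
x_hat = U @ beta

# Eq. (4): the projection ("hat") matrix gives the same reconstruction
Pi = U @ np.linalg.inv(U.T @ U) @ U.T
assert np.allclose(Pi @ x, x_hat)

# Eq. (2): the residual is orthogonal to Col(U)
assert np.allclose(U.T @ (x - x_hat), 0)

# Eqs. (5)-(6): with orthonormal columns, Pi simplifies to U U^T
Q, _ = np.linalg.qr(U)             # orthonormal basis of the same column space
assert np.allclose(Q @ Q.T @ x, x_hat)
```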

2.2 Projection onto a Subspace

In subspace learning, the projection of a vector $\boldsymbol{x} \in \mathbb{R}^d$ onto the column space of $\boldsymbol{U} \in \mathbb{R}^{d \times p}$ (a $p$-dimensional subspace spanned by $\{\boldsymbol{u}_j\}_{j=1}^{p}$ where $\boldsymbol{u}_j \in \mathbb{R}^d$) is defined as:

$$\mathbb{R}^p \ni \widetilde{\boldsymbol{x}} := \boldsymbol{U}^\top \boldsymbol{x}, \qquad (7)$$
$$\mathbb{R}^d \ni \hat{\boldsymbol{x}} := \boldsymbol{U} \widetilde{\boldsymbol{x}} = \boldsymbol{U} \boldsymbol{U}^\top \boldsymbol{x}, \qquad (8)$$

where $\widetilde{\boldsymbol{x}}$ and $\hat{\boldsymbol{x}}$ denote the projection and reconstruction of $\boldsymbol{x}$, respectively.

If we have $n$ data points, $\{\boldsymbol{x}_i\}_{i=1}^{n}$, which can be stored column-wise in a matrix $\boldsymbol{X} \in \mathbb{R}^{d \times n}$, the projection and reconstruction of $\boldsymbol{X}$ are defined as:

$$\mathbb{R}^{p \times n} \ni \widetilde{\boldsymbol{X}} := \boldsymbol{U}^\top \boldsymbol{X}, \qquad (9)$$
$$\mathbb{R}^{d \times n} \ni \hat{\boldsymbol{X}} := \boldsymbol{U} \widetilde{\boldsymbol{X}} = \boldsymbol{U} \boldsymbol{U}^\top \boldsymbol{X}, \qquad (10)$$

respectively.

If we have an out-of-sample data point $\boldsymbol{x}_t$ which was not used in the calculation of $\boldsymbol{U}$, its projection and reconstruction are defined as:

$$\mathbb{R}^p \ni \widetilde{\boldsymbol{x}}_t := \boldsymbol{U}^\top \boldsymbol{x}_t, \qquad (11)$$
$$\mathbb{R}^d \ni \hat{\boldsymbol{x}}_t := \boldsymbol{U} \widetilde{\boldsymbol{x}}_t = \boldsymbol{U} \boldsymbol{U}^\top \boldsymbol{x}_t, \qquad (12)$$

respectively.

In case we have $n_t$ out-of-sample data points, $\{\boldsymbol{x}_{t,i}\}_{i=1}^{n_t}$, which can be stored column-wise in a matrix $\boldsymbol{X}_t \in \mathbb{R}^{d \times n_t}$, the projection and reconstruction of $\boldsymbol{X}_t$ are defined as:

$$\mathbb{R}^{p \times n_t} \ni \widetilde{\boldsymbol{X}}_t := \boldsymbol{U}^\top \boldsymbol{X}_t, \qquad (13)$$
$$\mathbb{R}^{d \times n_t} \ni \hat{\boldsymbol{X}}_t := \boldsymbol{U} \widetilde{\boldsymbol{X}}_t = \boldsymbol{U} \boldsymbol{U}^\top \boldsymbol{X}_t, \qquad (14)$$

respectively.

For the properties of the projection matrix $\boldsymbol{U} \boldsymbol{U}^\top$, refer to (Ghojogh & Crowley, 2019c).
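As a hedged illustration of Eqs. (7) to (14), the sketch below (assuming orthonormal columns in U, obtained here by QR, and randomly generated data) projects and reconstructs both training data and out-of-sample data stored column-wise:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n, n_t = 4, 2, 10, 3
# orthonormal projection directions (columns of U)
U, _ = np.linalg.qr(rng.standard_normal((d, p)))

X = rng.standard_normal((d, n))        # training data, column-wise
X_t = rng.standard_normal((d, n_t))    # out-of-sample data, column-wise

X_tilde = U.T @ X          # Eq. (9): projection, p x n
X_hat = U @ X_tilde        # Eq. (10): reconstruction, d x n

Xt_tilde = U.T @ X_t       # Eq. (13): out-of-sample projection
Xt_hat = U @ Xt_tilde      # Eq. (14): out-of-sample reconstruction

print(X_tilde.shape, X_hat.shape)      # (2, 10) (4, 10)
```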

2.2.1 Projection onto a One-dimensional Subspace

Considering the data $\{\boldsymbol{x}_i\}_{i=1}^{n}$, the mean of the data is:

$$\mathbb{R}^d \ni \boldsymbol{\mu} := \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i, \qquad (15)$$

and the centered data point is:

$$\mathbb{R}^d \ni \breve{\boldsymbol{x}}_i := \boldsymbol{x}_i - \boldsymbol{\mu}. \qquad (16)$$

The centered data matrix is:

$$\mathbb{R}^{d \times n} \ni \breve{\boldsymbol{X}} := [\breve{\boldsymbol{x}}_1, \dots, \breve{\boldsymbol{x}}_n] = \boldsymbol{X} \boldsymbol{H}, \qquad (17)$$

where $\mathbb{R}^{n \times n} \ni \boldsymbol{H} := \boldsymbol{I} - (1/n)\, \boldsymbol{1} \boldsymbol{1}^\top$ is the centering matrix (see Appendix A in (Ghojogh & Crowley, 2019c)).

In Eq. (8), if $p = 1$, we are projecting $\boldsymbol{x}$ onto only one vector $\boldsymbol{u}$ and then reconstructing it. If the data point is centered, the reconstruction is:

$$\hat{\breve{\boldsymbol{x}}} = \boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{x}}.$$

The squared length (squared $\ell_2$-norm) of this reconstructed vector is:

$$\|\boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{x}}\|_2^2 = \breve{\boldsymbol{x}}^\top \boldsymbol{u} \boldsymbol{u}^\top \boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{x}} \overset{(a)}{=} \breve{\boldsymbol{x}}^\top \boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{x}} \overset{(b)}{=} (\boldsymbol{u}^\top \breve{\boldsymbol{x}})^2, \qquad (18)$$

where $(a)$ is because $\boldsymbol{u}$ is a unit (normal) vector, i.e., $\boldsymbol{u}^\top \boldsymbol{u} = 1$, and $(b)$ is because $\boldsymbol{u}^\top \breve{\boldsymbol{x}} = \breve{\boldsymbol{x}}^\top \boldsymbol{u}$ is a scalar.

Suppose we have $n$ data points $\{\boldsymbol{x}_i\}_{i=1}^{n}$, where $\{\breve{\boldsymbol{x}}_i\}_{i=1}^{n}$ are the centered data. The summation of the squared lengths of their projections is:

$$\sum_{i=1}^{n} \|\boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{x}}_i\|_2^2 = \sum_{i=1}^{n} (\boldsymbol{u}^\top \breve{\boldsymbol{x}}_i)^2 = \sum_{i=1}^{n} \boldsymbol{u}^\top \breve{\boldsymbol{x}}_i \breve{\boldsymbol{x}}_i^\top \boldsymbol{u}. \qquad (19)$$

Considering $\breve{\boldsymbol{X}} = [\breve{\boldsymbol{x}}_1, \dots, \breve{\boldsymbol{x}}_n] = \boldsymbol{X} \boldsymbol{H}$, we have:

$$\sum_{i=1}^{n} \breve{\boldsymbol{x}}_i \breve{\boldsymbol{x}}_i^\top = \breve{\boldsymbol{X}} \breve{\boldsymbol{X}}^\top = \boldsymbol{X} \boldsymbol{H} \boldsymbol{H}^\top \boldsymbol{X}^\top = \boldsymbol{X} \boldsymbol{H} \boldsymbol{X}^\top =: \boldsymbol{S}, \qquad (20)$$

where $\boldsymbol{S} \in \mathbb{R}^{d \times d}$ is called the "covariance matrix" or "scatter matrix" (note that $\boldsymbol{H}$ is symmetric and idempotent, so $\boldsymbol{H} \boldsymbol{H}^\top = \boldsymbol{H}$). If the data were already centered, we would have $\boldsymbol{S} = \boldsymbol{X} \boldsymbol{X}^\top$.

Plugging Eq. (20) in Eq. (19) gives us:

$$\sum_{i=1}^{n} \|\boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{x}}_i\|_2^2 = \boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u}. \qquad (21)$$

Note that we can also say that $\boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u}$ is the variance of the data projected onto the subspace spanned by $\boldsymbol{u}$. In other words, $\boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u} = \mathbb{V}\text{ar}(\boldsymbol{u}^\top \boldsymbol{X})$. This makes sense because when a non-random quantity (here $\boldsymbol{u}$) is multiplied by the random data (here $\boldsymbol{X}$), it has a squared (quadratic) effect on the variance, and $\boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u}$ is quadratic in $\boldsymbol{u}$.

Therefore, $\boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u}$ can be interpreted in two ways: (I) the squared length of the reconstruction and (II) the variance of the projection.

If we consider the $n$ data points in the matrix $\boldsymbol{X}$, the squared length of the reconstruction of the centered data is:

$$\|\boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{X}}\|_F^2 = \mathbf{tr}(\breve{\boldsymbol{X}}^\top \boldsymbol{u} \boldsymbol{u}^\top \boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{X}}) \overset{(a)}{=} \mathbf{tr}(\breve{\boldsymbol{X}}^\top \boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{X}}) \overset{(b)}{=} \mathbf{tr}(\boldsymbol{u}^\top \breve{\boldsymbol{X}} \breve{\boldsymbol{X}}^\top \boldsymbol{u}) \overset{(c)}{=} \boldsymbol{u}^\top \breve{\boldsymbol{X}} \breve{\boldsymbol{X}}^\top \boldsymbol{u},$$

where $\mathbf{tr}(\cdot)$ denotes the trace of a matrix, $(a)$ is because $\boldsymbol{u}$ is a unit vector, $(b)$ is because of the cyclic property of the trace, and $(c)$ is because $\boldsymbol{u}^\top \breve{\boldsymbol{X}} \breve{\boldsymbol{X}}^\top \boldsymbol{u}$ is a scalar. Hence, we have:

$$\|\boldsymbol{u} \boldsymbol{u}^\top \breve{\boldsymbol{X}}\|_F^2 = \boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u}. \qquad (22)$$

2.2.2 Projection onto a Multi-dimensional Subspace

In Eq. (10), if $p > 1$, we are projecting the data onto a subspace with dimensionality greater than one (spanned by $\{\boldsymbol{u}_1, \dots, \boldsymbol{u}_p\}$) and then reconstructing them. If the data are assumed to be centered, the reconstruction is:

$$\hat{\breve{\boldsymbol{X}}} = \boldsymbol{U} \boldsymbol{U}^\top \breve{\boldsymbol{X}}.$$

The squared length (squared Frobenius norm) of this reconstructed matrix is:

$$\|\boldsymbol{U} \boldsymbol{U}^\top \breve{\boldsymbol{X}}\|_F^2 = \mathbf{tr}(\breve{\boldsymbol{X}}^\top \boldsymbol{U} \boldsymbol{U}^\top \boldsymbol{U} \boldsymbol{U}^\top \breve{\boldsymbol{X}}) \overset{(a)}{=} \mathbf{tr}(\breve{\boldsymbol{X}}^\top \boldsymbol{U} \boldsymbol{U}^\top \breve{\boldsymbol{X}}) \overset{(b)}{=} \mathbf{tr}(\boldsymbol{U}^\top \breve{\boldsymbol{X}} \breve{\boldsymbol{X}}^\top \boldsymbol{U}),$$

where $(a)$ is because $\boldsymbol{U}$ is an orthogonal matrix (its columns are orthonormal, so $\boldsymbol{U}^\top \boldsymbol{U} = \boldsymbol{I}$) and $(b)$ is because of the cyclic property of the trace. Thus, we have:

$$\|\boldsymbol{U} \boldsymbol{U}^\top \breve{\boldsymbol{X}}\|_F^2 = \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S} \boldsymbol{U}). \qquad (23)$$
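The following short numerical check, a sketch rather than code from the paper, verifies Eqs. (22) and (23): the squared reconstruction length of the centered data equals $\boldsymbol{u}^\top \boldsymbol{S} \boldsymbol{u}$ for one direction and $\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S} \boldsymbol{U})$ for several orthonormal directions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, p = 4, 50, 2
X = rng.standard_normal((d, n))

H = np.eye(n) - np.ones((n, n)) / n       # centering matrix, Eq. (17)
S = X @ H @ X.T                           # scatter matrix, Eq. (20)
Xc = X @ H                                # centered data

# one unit direction (Eq. 22): squared reconstruction length = u^T S u
u, _ = np.linalg.qr(rng.standard_normal((d, 1)))
lhs = np.linalg.norm(u @ u.T @ Xc, "fro") ** 2
assert np.isclose(lhs, (u.T @ S @ u).item())

# several orthonormal directions (Eq. 23): = trace(U^T S U)
U, _ = np.linalg.qr(rng.standard_normal((d, p)))
lhs = np.linalg.norm(U @ U.T @ Xc, "fro") ** 2
assert np.isclose(lhs, np.trace(U.T @ S @ U))
```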

3 Fisher Discriminant Analysis

3.1 One-dimensional Subspace

3.1.1 Scatters in Two-Class Case

Assume we have two classes, $\{\boldsymbol{x}_i^{(1)}\}_{i=1}^{n_1}$ and $\{\boldsymbol{x}_i^{(2)}\}_{i=1}^{n_2}$, where $n_1$ and $n_2$ denote the sample size of the first and second class, respectively, and $\boldsymbol{x}_i^{(j)}$ denotes the $i$-th instance of the $j$-th class.

If the data instances of the $j$-th class are projected onto a one-dimensional subspace (vector $\boldsymbol{u}$) by $\boldsymbol{u}^\top \boldsymbol{x}$, the mean and the variance of the projected data are $\boldsymbol{u}^\top \boldsymbol{\mu}_j$ and $\boldsymbol{u}^\top \boldsymbol{S}_j \boldsymbol{u}$, respectively, where $\boldsymbol{\mu}_j$ and $\boldsymbol{S}_j$ are the mean and covariance matrix (scatter) of the $j$-th class. The mean of the $j$-th class is:

$$\mathbb{R}^d \ni \boldsymbol{\mu}_j := \frac{1}{n_j} \sum_{i=1}^{n_j} \boldsymbol{x}_i^{(j)}. \qquad (24)$$

According to Appendix A, after projection onto the one-dimensional subspace, the distance between the means of the classes is:

$$(\boldsymbol{u}^\top \boldsymbol{\mu}_1 - \boldsymbol{u}^\top \boldsymbol{\mu}_2)^2 \overset{(a)}{=} \mathbf{tr}\big(\boldsymbol{u}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{u}\big) \overset{(b)}{=} \mathbf{tr}\big((\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{u} \boldsymbol{u}^\top\big) \overset{(c)}{=} \boldsymbol{u}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{u} \overset{(d)}{=} \boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}, \qquad (25)$$

where $(a)$ is because $\boldsymbol{u}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is a scalar, $(b)$ is because of the cyclic property of the trace, $(c)$ is because $\boldsymbol{u}^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{u}$ is a scalar, and $(d)$ is because we define:

$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_B := (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top, \qquad (26)$$

as the between-scatter of the classes.

Eq. (25) can also be interpreted according to Eq. (22): the $\boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}$ is the variance of the projection of the class means, or the squared length of the reconstruction of the class means.

We saw that the variance of the projection is $\boldsymbol{u}^\top \boldsymbol{S}_j \boldsymbol{u}$ for the $j$-th class. If we add up the variances of the projections of the two classes, we have:

$$\boldsymbol{u}^\top \boldsymbol{S}_1 \boldsymbol{u} + \boldsymbol{u}^\top \boldsymbol{S}_2 \boldsymbol{u} = \boldsymbol{u}^\top (\boldsymbol{S}_1 + \boldsymbol{S}_2)\, \boldsymbol{u} = \boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}, \qquad (27)$$

where:

$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_W := \boldsymbol{S}_1 + \boldsymbol{S}_2 = \sum_{i=1}^{n_1} (\boldsymbol{x}_i^{(1)} - \boldsymbol{\mu}_1)(\boldsymbol{x}_i^{(1)} - \boldsymbol{\mu}_1)^\top + \sum_{i=1}^{n_2} (\boldsymbol{x}_i^{(2)} - \boldsymbol{\mu}_2)(\boldsymbol{x}_i^{(2)} - \boldsymbol{\mu}_2)^\top, \qquad (28)$$

is the within-scatter of the classes. According to Eq. (22), the $\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}$ is the summation of the projection variances of the class instances, or the summation of the reconstruction lengths of the class instances.
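For the two-class case, a minimal sketch (with synthetic Gaussian classes as an assumption) of computing the between-scatter of Eq. (26), the within-scatter of Eq. (28), and their projected counterparts in Eqs. (25) and (27) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n1, n2 = 3, 20, 25
X1 = rng.standard_normal((d, n1)) + 2.0   # class 1, column-wise
X2 = rng.standard_normal((d, n2)) - 2.0   # class 2, column-wise

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)

# Eq. (26): between-scatter (rank one for two classes)
S_B = np.outer(mu1 - mu2, mu1 - mu2)

# Eq. (28): within-scatter = sum of the two class scatters
S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T
S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T
S_W = S1 + S2

# after projection onto a unit vector u (Eqs. 25 and 27)
u = np.array([1.0, 0.0, 0.0])
between = u @ S_B @ u     # squared distance between projected class means
within = u @ S_W @ u      # sum of projected class scatters
print(between, within)
```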

3.1.2 Scatters in Multi-Class Case: Variant 1

Assume $\{\boldsymbol{x}_i^{(j)}\}_{i=1}^{n_j}$ are the instances of the $j$-th class, where we now have multiple classes. In this case, the between-scatter is defined as:

$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_B := \sum_{j=1}^{c} (\boldsymbol{\mu}_j - \boldsymbol{\mu})(\boldsymbol{\mu}_j - \boldsymbol{\mu})^\top, \qquad (29)$$

where $c$ is the number of classes and:

$$\mathbb{R}^d \ni \boldsymbol{\mu} := \frac{1}{n} \sum_{j=1}^{c} n_j\, \boldsymbol{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i, \qquad (30)$$

is the weighted mean of the class means, i.e., the total mean of the data.

It is noteworthy that some works define the between-scatter in a weighted way:

$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_B := \sum_{j=1}^{c} n_j\, (\boldsymbol{\mu}_j - \boldsymbol{\mu})(\boldsymbol{\mu}_j - \boldsymbol{\mu})^\top. \qquad (31)$$

If we extend Eq. (28) to $c$ classes, the within-scatter is defined as:

$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_j := \sum_{i=1}^{n_j} (\boldsymbol{x}_i^{(j)} - \boldsymbol{\mu}_j)(\boldsymbol{x}_i^{(j)} - \boldsymbol{\mu}_j)^\top, \qquad (32)$$
$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_W := \sum_{j=1}^{c} \boldsymbol{S}_j = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\boldsymbol{x}_i^{(j)} - \boldsymbol{\mu}_j)(\boldsymbol{x}_i^{(j)} - \boldsymbol{\mu}_j)^\top, \qquad (33)$$

where $n_j$ is the sample size of the $j$-th class.

In this case, the projected between- and within-scatters are:

$$\boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}, \qquad (34)$$
$$\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}, \qquad (35)$$

where $\boldsymbol{S}_B$ and $\boldsymbol{S}_W$ are given by Eqs. (29) and (33), respectively.
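A small helper, written here as an illustrative sketch (the function name and the toy data are not from the paper), computes the unweighted between-scatter of Eq. (29) and the within-scatter of Eqs. (32) and (33) for an arbitrary number of classes:

```python
import numpy as np

def multiclass_scatters(X, y):
    """Unweighted between-scatter (Eq. 29) and within-scatter (Eq. 33).

    X : (d, n) data stored column-wise;  y : (n,) integer class labels.
    """
    d = X.shape[0]
    mu = X.mean(axis=1)                                  # total mean, Eq. (30)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for j in np.unique(y):
        Xj = X[:, y == j]
        mu_j = Xj.mean(axis=1)
        S_B += np.outer(mu_j - mu, mu_j - mu)            # Eq. (29)
        D = Xj - mu_j[:, None]
        S_W += D @ D.T                                   # Eqs. (32)-(33)
    return S_B, S_W

# toy usage with three synthetic classes
rng = np.random.default_rng(4)
X = np.hstack([rng.standard_normal((2, 30)) + m for m in (0, 4, 8)])
y = np.repeat([0, 1, 2], 30)
S_B, S_W = multiclass_scatters(X, y)
```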

3.1.3 Scatters in Multi-Class Case: Variant 2

There is another variant for the multi-class case in FDA. In this variant, the within-scatter is the same as Eq. (33). The between-scatter is, however, different.

The total-scatter is defined as the covariance matrix of the whole data, regardless of the classes (Welling, 2005):

$$\mathbb{R}^{d \times d} \ni \boldsymbol{S}_T := \frac{1}{n} \sum_{i=1}^{n} (\boldsymbol{x}_i - \boldsymbol{\mu})(\boldsymbol{x}_i - \boldsymbol{\mu})^\top, \qquad (36)$$

where the total mean $\boldsymbol{\mu}$ is given by Eq. (30). We can also use the scaled total-scatter, dropping the $1/n$ factor. On the other hand, the (scaled) total scatter is equal to the summation of the within- and between-scatters (where the between-scatter is the weighted version of Eq. (31)):

$$\boldsymbol{S}_T = \boldsymbol{S}_W + \boldsymbol{S}_B. \qquad (37)$$

Therefore, the between-scatter, in this variant, is obtained as:

$$\boldsymbol{S}_B = \boldsymbol{S}_T - \boldsymbol{S}_W. \qquad (38)$$
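The decomposition in Eq. (37) can be checked numerically; the sketch below assumes the scaled total-scatter (Eq. (36) without the $1/n$ factor) and the weighted between-scatter of Eq. (31):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.hstack([rng.standard_normal((2, 20)) + m for m in (0, 3, 6)])
y = np.repeat([0, 1, 2], 20)
mu = X.mean(axis=1)

S_T = (X - mu[:, None]) @ (X - mu[:, None]).T   # scaled total-scatter
S_W = np.zeros((2, 2))
S_B = np.zeros((2, 2))
for j in np.unique(y):
    Xj = X[:, y == j]
    mu_j = Xj.mean(axis=1)
    S_W += (Xj - mu_j[:, None]) @ (Xj - mu_j[:, None]).T   # Eq. (33)
    S_B += Xj.shape[1] * np.outer(mu_j - mu, mu_j - mu)    # weighted, Eq. (31)

print(np.allclose(S_T, S_W + S_B))   # True: Eq. (37)
```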

3.1.4 Fisher Subspace: Variant 1

In FDA, we want to maximize the projection variance (scatter) of the class means and minimize the projection variance (scatter) of the class instances. In other words, we want to maximize $\boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}$ and minimize $\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}$. The reason is that after projection, we want the within-scatter of every class to be small and the between-scatter of the classes to be large; therefore, the instances of every class get close to one another while the classes get far from each other. The two mentioned optimization problems are:

$$\max_{\boldsymbol{u}} \ \boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}, \qquad (39)$$
$$\min_{\boldsymbol{u}} \ \boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}. \qquad (40)$$

We can merge these two optimization problems into a regularized optimization problem:

$$\max_{\boldsymbol{u}} \ \boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u} - \alpha\, \boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}, \qquad (41)$$

where $\alpha > 0$ is the regularization parameter. Another way of merging Eqs. (39) and (40) is:

$$\max_{\boldsymbol{u}} \ f(\boldsymbol{u}) := \frac{\boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}}, \qquad (42)$$

where $f(\boldsymbol{u})$ is referred to as the Fisher criterion (Xu & Lu, 2006). The Fisher criterion is a generalized Rayleigh-Ritz quotient (see Appendix B):

$$f(\boldsymbol{u}) = R(\boldsymbol{S}_B, \boldsymbol{S}_W; \boldsymbol{u}) = \frac{\boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}}. \qquad (43)$$

According to Eq. (165) in Appendix B, the optimization in Eq. (42) is equivalent to:

$$\max_{\boldsymbol{u}} \ \boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}, \qquad (44)$$
$$\text{subject to} \ \ \boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u} = 1.$$

The Lagrangian (Boyd & Vandenberghe, 2004) is:

$$\mathcal{L} = \boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u} - \lambda (\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u} - 1),$$

where $\lambda$ is the Lagrange multiplier. Equating the derivative of $\mathcal{L}$ to zero gives:

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{u}} = 2\, \boldsymbol{S}_B\, \boldsymbol{u} - 2 \lambda\, \boldsymbol{S}_W\, \boldsymbol{u} \overset{\text{set}}{=} \boldsymbol{0} \implies \boldsymbol{S}_B\, \boldsymbol{u} = \lambda\, \boldsymbol{S}_W\, \boldsymbol{u}, \qquad (45)$$

which is a generalized eigenvalue problem $(\boldsymbol{S}_B, \boldsymbol{S}_W)$ according to (Ghojogh et al., 2019a). The $\boldsymbol{u}$ is the eigenvector with the largest eigenvalue (because the optimization is a maximization) and the $\lambda$ is the corresponding eigenvalue. The $\boldsymbol{u}$ is referred to as the Fisher direction or Fisher axis. The projection and reconstruction are according to Eqs. (9) and (10), respectively, where $\boldsymbol{u}$ is used instead of $\boldsymbol{U}$. The out-of-sample projection and reconstruction are according to Eqs. (13) and (14), respectively, with $\boldsymbol{u}$ rather than $\boldsymbol{U}$.

One possible solution to the generalized eigenvalue problem is (Ghojogh et al., 2019a):

$$\boldsymbol{u} = \mathrm{eig}(\boldsymbol{S}_W^{-1} \boldsymbol{S}_B), \qquad (46)$$

where $\mathrm{eig}(\cdot)$ denotes the eigenvector of the matrix with the largest eigenvalue. Although the solution in Eq. (46) is a little dirty (Ghojogh et al., 2019a), because $\boldsymbol{S}_W$ might be singular and not invertible, this solution is very common for FDA. In some works, the diagonal of $\boldsymbol{S}_W$ is strengthened slightly to make it full rank and invertible (Ghojogh et al., 2019a):

$$\boldsymbol{u} = \mathrm{eig}\big((\boldsymbol{S}_W + \varepsilon \boldsymbol{I})^{-1} \boldsymbol{S}_B\big), \qquad (47)$$

where $\varepsilon$ is a very small positive number, large enough to make $\boldsymbol{S}_W + \varepsilon \boldsymbol{I}$ full rank.

In a future section, we will cover robust FDA, which tackles this problem. On the other hand, the generalized eigenvalue problem $(\boldsymbol{S}_B, \boldsymbol{S}_W)$ has a rigorous solution (Ghojogh et al., 2019a; Wang, 2015) which does not require non-singularity of $\boldsymbol{S}_W$.

Another way to solve the optimization in Eq. (42) is to take the derivative of the Fisher criterion and set it to zero:

$$\frac{\partial f(\boldsymbol{u})}{\partial \boldsymbol{u}} = \frac{2\, \boldsymbol{S}_B\, \boldsymbol{u}\, (\boldsymbol{u}^\top \boldsymbol{S}_W \boldsymbol{u}) - 2\, \boldsymbol{S}_W\, \boldsymbol{u}\, (\boldsymbol{u}^\top \boldsymbol{S}_B \boldsymbol{u})}{(\boldsymbol{u}^\top \boldsymbol{S}_W \boldsymbol{u})^2} \overset{\text{set}}{=} \boldsymbol{0} \overset{(a)}{\implies} \boldsymbol{S}_B\, \boldsymbol{u} = f(\boldsymbol{u})\, \boldsymbol{S}_W\, \boldsymbol{u}, \qquad (48)$$

where $(a)$ is because $f(\boldsymbol{u}) = (\boldsymbol{u}^\top \boldsymbol{S}_B \boldsymbol{u}) / (\boldsymbol{u}^\top \boldsymbol{S}_W \boldsymbol{u})$ is a scalar. Eq. (48) is a generalized eigenvalue problem $(\boldsymbol{S}_B, \boldsymbol{S}_W)$ (Ghojogh et al., 2019a) with $\boldsymbol{u}$ as the eigenvector with the largest eigenvalue (because the optimization is a maximization) and $f(\boldsymbol{u})$ as the corresponding eigenvalue. Therefore, the Fisher criterion is the eigenvalue of the Fisher direction.
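As a sketch of how the one-dimensional Fisher direction can be computed in practice, the code below solves the generalized eigenvalue problem of Eq. (45) with scipy.linalg.eigh and compares it with the "strengthened diagonal" solution of Eq. (47); the toy two-class data are an assumption:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
X1 = rng.standard_normal((2, 50))                             # class 1
X2 = rng.standard_normal((2, 50)) + np.array([[4.0], [2.0]])  # class 2, shifted
mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)

S_B = np.outer(mu1 - mu2, mu1 - mu2)                     # Eq. (26)
S_W = ((X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T
       + (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T)    # Eq. (28)

# Eq. (45): S_B u = lambda S_W u; eigh solves the generalized problem
# directly (S_W must be positive definite) with ascending eigenvalues
eigvals, eigvecs = eigh(S_B, S_W)
u = eigvecs[:, -1]                     # Fisher direction: largest eigenvalue

# Eq. (47): strengthened-diagonal solution, same direction up to sign/scale
M = np.linalg.inv(S_W + 1e-8 * np.eye(2)) @ S_B
vals, vecs = np.linalg.eig(M)
u2 = vecs[:, np.argmax(vals.real)].real

cosine = abs(u @ u2) / (np.linalg.norm(u) * np.linalg.norm(u2))
print(np.isclose(cosine, 1.0))         # True
```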

3.1.5 Fisher Subspace: Variant 2

Another way to find the FDA direction is to consider another version of the Fisher criterion. According to Eq. (38) for $\boldsymbol{S}_B$, the Fisher criterion becomes (Welling, 2005):

$$f(\boldsymbol{u}) := \frac{\boldsymbol{u}^\top \boldsymbol{S}_B\, \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}} = \frac{\boldsymbol{u}^\top (\boldsymbol{S}_T - \boldsymbol{S}_W)\, \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}} = \frac{\boldsymbol{u}^\top \boldsymbol{S}_T\, \boldsymbol{u}}{\boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u}} - 1. \qquad (49)$$

The $-1$ is a constant and is dropped in the optimization; therefore:

$$\max_{\boldsymbol{u}} \ \boldsymbol{u}^\top \boldsymbol{S}_T\, \boldsymbol{u}, \qquad (50)$$
$$\text{subject to} \ \ \boldsymbol{u}^\top \boldsymbol{S}_W\, \boldsymbol{u} = 1,$$

whose solution is similarly obtained as:

$$\boldsymbol{S}_T\, \boldsymbol{u} = \lambda\, \boldsymbol{S}_W\, \boldsymbol{u}, \qquad (51)$$

which is a generalized eigenvalue problem $(\boldsymbol{S}_T, \boldsymbol{S}_W)$ according to (Ghojogh et al., 2019a).

3.2 Multi-dimensional Subspace

In case the Fisher subspace is the span of several Fisher directions $\{\boldsymbol{u}_1, \dots, \boldsymbol{u}_p\}$, where $p \le d$, the projected between- and within-scatters are defined as:

$$\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B\, \boldsymbol{U}), \qquad (52)$$
$$\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U}), \qquad (53)$$

where $\mathbb{R}^{d \times p} \ni \boldsymbol{U} := [\boldsymbol{u}_1, \dots, \boldsymbol{u}_p]$. In this case, maximizing the Fisher criterion is:

$$\max_{\boldsymbol{U}} \ f(\boldsymbol{U}) := \frac{\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B\, \boldsymbol{U})}{\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U})}. \qquad (54)$$

The Fisher criterion is a generalized Rayleigh-Ritz quotient (see Appendix B). According to Eq. (165) in Appendix B, the optimization in Eq. (54) is equivalent to:

$$\max_{\boldsymbol{U}} \ \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B\, \boldsymbol{U}), \qquad (55)$$
$$\text{subject to} \ \ \boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U} = \boldsymbol{I}.$$

The Lagrangian (Boyd & Vandenberghe, 2004) is:

$$\mathcal{L} = \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B\, \boldsymbol{U}) - \mathbf{tr}\big(\boldsymbol{\Lambda}^\top (\boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U} - \boldsymbol{I})\big),$$

where $\boldsymbol{\Lambda} \in \mathbb{R}^{p \times p}$ is a diagonal matrix whose diagonal entries are the Lagrange multipliers. Equating the derivative of $\mathcal{L}$ to zero gives:

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{U}} = 2\, \boldsymbol{S}_B\, \boldsymbol{U} - 2\, \boldsymbol{S}_W\, \boldsymbol{U} \boldsymbol{\Lambda} \overset{\text{set}}{=} \boldsymbol{0} \implies \boldsymbol{S}_B\, \boldsymbol{U} = \boldsymbol{S}_W\, \boldsymbol{U} \boldsymbol{\Lambda}, \qquad (56)$$

which is a generalized eigenvalue problem $(\boldsymbol{S}_B, \boldsymbol{S}_W)$ according to (Ghojogh et al., 2019a). The columns of $\boldsymbol{U}$ are the eigenvectors sorted from the largest to the smallest eigenvalues (because the optimization is a maximization) and the diagonal entries of $\boldsymbol{\Lambda}$ are the corresponding eigenvalues. The columns of $\boldsymbol{U}$ are referred to as the Fisher directions or Fisher axes. The projection and reconstruction are according to Eqs. (9) and (10), respectively. The out-of-sample projection and reconstruction are according to Eqs. (13) and (14), respectively.

One possible solution to the generalized eigenvalue problem is (Ghojogh et al., 2019a):

$$\boldsymbol{U} = \mathrm{eig}(\boldsymbol{S}_W^{-1} \boldsymbol{S}_B), \qquad (57)$$

where $\mathrm{eig}(\cdot)$ denotes the eigenvectors of the matrix stacked column-wise, sorted from the largest to the smallest eigenvalues. Again, we can have (Ghojogh et al., 2019a):

$$\boldsymbol{U} = \mathrm{eig}\big((\boldsymbol{S}_W + \varepsilon \boldsymbol{I})^{-1} \boldsymbol{S}_B\big). \qquad (58)$$

Another way to solve the optimization in Eq. (54) is to take the derivative of the Fisher criterion and set it to zero:

$$\frac{\partial f(\boldsymbol{U})}{\partial \boldsymbol{U}} = \frac{2\, \boldsymbol{S}_B \boldsymbol{U}\, \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W \boldsymbol{U}) - 2\, \boldsymbol{S}_W \boldsymbol{U}\, \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B \boldsymbol{U})}{\big(\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W \boldsymbol{U})\big)^2} \overset{\text{set}}{=} \boldsymbol{0} \overset{(a)}{\implies} \boldsymbol{S}_B\, \boldsymbol{U} = f(\boldsymbol{U})\, \boldsymbol{S}_W\, \boldsymbol{U}, \qquad (59)$$

where $(a)$ is because $f(\boldsymbol{U}) = \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B \boldsymbol{U}) / \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W \boldsymbol{U})$ is a scalar. Eq. (59) is a generalized eigenvalue problem $(\boldsymbol{S}_B, \boldsymbol{S}_W)$ (Ghojogh et al., 2019a) with the columns of $\boldsymbol{U}$ as the eigenvectors, sorted from the largest to the smallest eigenvalues (because the optimization is a maximization).

Again, another way to find the FDA directions is to consider the other version of the Fisher criterion. According to Eq. (38) for $\boldsymbol{S}_B$, the Fisher criterion becomes (Welling, 2005):

$$f(\boldsymbol{U}) := \frac{\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_B\, \boldsymbol{U})}{\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U})} = \frac{\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_T\, \boldsymbol{U})}{\mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U})} - 1. \qquad (60)$$

The $-1$ is a constant and is dropped in the optimization; therefore:

$$\max_{\boldsymbol{U}} \ \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_T\, \boldsymbol{U}), \qquad (61)$$
$$\text{subject to} \ \ \boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U} = \boldsymbol{I},$$

whose solution is similarly obtained as:

$$\boldsymbol{S}_T\, \boldsymbol{U} = \boldsymbol{S}_W\, \boldsymbol{U} \boldsymbol{\Lambda}, \qquad (62)$$

which is a generalized eigenvalue problem $(\boldsymbol{S}_T, \boldsymbol{S}_W)$ according to (Ghojogh et al., 2019a).
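Putting the pieces together, the following sketch (not the authors' implementation; the function name fda_directions and the toy data are illustrative) computes the multi-dimensional Fisher directions via the generalized eigenvalue problem of Eqs. (56) to (58) and projects the data as in Eq. (9):

```python
import numpy as np
from scipy.linalg import eigh

def fda_directions(X, y, p):
    """Fisher directions as the top-p generalized eigenvectors of (S_B, S_W).

    A sketch of Eqs. (56)-(58); X is (d, n) column-wise, y holds class labels.
    """
    d = X.shape[0]
    mu = X.mean(axis=1)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for j in np.unique(y):
        Xj = X[:, y == j]
        mu_j = Xj.mean(axis=1)
        S_B += np.outer(mu_j - mu, mu_j - mu)                 # Eq. (29)
        S_W += (Xj - mu_j[:, None]) @ (Xj - mu_j[:, None]).T  # Eq. (33)
    # Eq. (58): slight diagonal strengthening keeps S_W invertible
    eigvals, eigvecs = eigh(S_B, S_W + 1e-8 * np.eye(d))
    order = np.argsort(eigvals)[::-1]                         # largest first
    return eigvecs[:, order[:p]]

# toy usage: 3 classes in 4 dimensions, projected onto p = 2 Fisher directions
rng = np.random.default_rng(7)
means = np.array([[0, 0, 0, 0], [5, 0, 0, 0], [0, 5, 0, 0]], dtype=float)
X = np.hstack([rng.standard_normal((4, 40)) + m[:, None] for m in means])
y = np.repeat([0, 1, 2], 40)
U = fda_directions(X, y, p=2)
X_proj = U.T @ X               # Eq. (9): projection onto the Fisher subspace
```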

3.3 Discussion on Dimensionality of the Fisher Subspace

In general, the rank of a covariance (scatter) matrix over $d$-dimensional data with sample size $n$ is at most $\min(d, n-1)$. The $d$ is because the covariance matrix is a $d \times d$ matrix, and the $n$ is because we iterate over $n$ data instances in calculating the covariance matrix. The $-1$ is because of subtracting the mean in the calculation of the covariance matrix. For clarification, assume we only have one instance, which becomes zero after removing the mean; this makes the covariance matrix a zero matrix.

According to Eq. (33), the rank of the $\boldsymbol{S}_W$ is at most $\min(d, n-1)$ because all the instances of all the classes are considered. Hence, the rank of $\boldsymbol{S}_W^{-1}$ is also at most $\min(d, n-1)$. According to Eq. (29), the rank of the $\boldsymbol{S}_B$ is at most $\min(d, c-1)$ because we have $c$ iterations (one per class mean) in its calculation and the total mean is subtracted.

In Eq. (57), we have $\boldsymbol{S}_W^{-1} \boldsymbol{S}_B$, whose rank is:

$$\mathrm{rank}(\boldsymbol{S}_W^{-1} \boldsymbol{S}_B) \le \min\big(\mathrm{rank}(\boldsymbol{S}_W^{-1}), \mathrm{rank}(\boldsymbol{S}_B)\big) \overset{(a)}{=} c - 1, \qquad (63)$$

where $(a)$ is because we usually have $n - 1 > c - 1$ and $d > c - 1$. Therefore, the rank of $\boldsymbol{S}_W^{-1} \boldsymbol{S}_B$ is limited by the rank of $\boldsymbol{S}_B$, which is at most $c - 1$.

According to Eq. (57), only the $c - 1$ leading eigenvalues will be valid and the rest are zero or very small. Therefore, the $p$, which is the dimensionality of the Fisher subspace, is at most $c - 1$. The $p \le c - 1$ leading eigenvectors are considered as the Fisher directions and the rest of the eigenvectors are invalid and ignored.
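A quick numerical illustration of this rank argument, under the assumption of synthetic Gaussian classes with equal sample sizes, shows that $\boldsymbol{S}_B$ has rank at most $c - 1$ and that only the leading $c - 1$ generalized eigenvalues are non-negligible:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
d, c, n_per = 10, 4, 30
# c Gaussian classes with equal sample sizes in d dimensions
class_means = [3.0 * rng.standard_normal((d, 1)) for _ in range(c)]
X = np.hstack([rng.standard_normal((d, n_per)) + m for m in class_means])
y = np.repeat(np.arange(c), n_per)

mu = X.mean(axis=1)
S_B = np.zeros((d, d))
S_W = np.zeros((d, d))
for j in range(c):
    Xj = X[:, y == j]
    mu_j = Xj.mean(axis=1)
    S_B += np.outer(mu_j - mu, mu_j - mu)                      # Eq. (29)
    S_W += (Xj - mu_j[:, None]) @ (Xj - mu_j[:, None]).T       # Eq. (33)

print(np.linalg.matrix_rank(S_B))                  # at most c - 1 = 3
eigvals = eigh(S_B, S_W, eigvals_only=True)[::-1]  # descending order
print(np.round(eigvals[:c], 6))                    # only the first c-1 are non-negligible
```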

4 Interpretation of FDA: The Example of a Man with Weak Eyes

In this section, we interpret FDA using a real-life example in order to better understand the essence of Fisher's method. Consider a man who has two eye problems: (1) he is color-blind and (2) his eyes are also very weak.

Suppose there are two sets of balls with red and blue colors. The man wants to discriminate the balls into red and blue classes; however, he needs help because of his eye problems.

First, consider his color-blindness. In order to help him, we separate the balls into two sets of red and blue. In other words, we increase the distances of the balls with different colors to give him a clue about which balls belong to the same class. This means that we are increasing the between-scatter of the two classes to help him.

Second, consider his very weak eyes. Although the balls with different colors are almost separated, everything is blurry to him. Thus, we put the balls of the same color closer to one another. In other words, we decrease the within-scatter of every class. In this way, the man sees every class as almost one blurry ball, so he can discriminate the classes better.

Recall Eq. (57), which includes $\boldsymbol{S}_W^{-1} \boldsymbol{S}_B$. The $\boldsymbol{S}_B$ implies that we want to increase the between-scatter, as we did in the first help. The $\boldsymbol{S}_W^{-1}$ implies that we want to decrease the within-scatter, as done in the second help to the man. In conclusion, FDA increases the between-scatter and decreases the within-scatter (collapses each class (Globerson & Roweis, 2006)) at the same time, for better discrimination of the classes.

5 Robust Fisher Discriminant Analysis

Robust FDA (RFDA) (Deng et al., 2007; Guo & Wang, 2015) addresses the problem of singularity (or near-singularity) of $\boldsymbol{S}_W$. In RFDA, the $\boldsymbol{S}_W$ is decomposed using eigenvalue decomposition (Ghojogh et al., 2019a):

$$\boldsymbol{S}_W = \boldsymbol{\Phi} \boldsymbol{\Lambda} \boldsymbol{\Phi}^\top, \qquad (64)$$

where $\boldsymbol{\Phi}$ and $\boldsymbol{\Lambda}$ include the eigenvectors and eigenvalues of $\boldsymbol{S}_W$, respectively. The eigenvalues are sorted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ and the eigenvectors (columns of $\boldsymbol{\Phi}$) are sorted accordingly. If $\boldsymbol{S}_W$ is close to singularity, the first $k$ eigenvalues are valid and the remaining $d - k$ eigenvalues are either very small or zero. The appropriate $k$ is obtained as:

(65)

In RFDA, the invalid eigenvalues are replaced with a value $\lambda^*$:

$$\boldsymbol{\Lambda}' := \mathbf{diag}(\lambda_1, \dots, \lambda_k, \lambda^*, \dots, \lambda^*), \qquad (66)$$

where $\lambda^*$ is obtained as (Deng et al., 2007):

(67)

Hence, the $\boldsymbol{S}_W$ is replaced with:

$$\boldsymbol{S}_W' := \boldsymbol{\Phi} \boldsymbol{\Lambda}' \boldsymbol{\Phi}^\top, \qquad (68)$$

and the robust Fisher directions are the eigenvectors of the generalized eigenvalue problem $(\boldsymbol{S}_B, \boldsymbol{S}_W')$ (Ghojogh et al., 2019a).
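The following is only a sketch of the RFDA idea in Eqs. (64) to (68); the rule for choosing $k$ (an energy threshold) and the replacement value $\lambda^*$ (the mean of the discarded tail) are assumptions standing in for the exact formulas of Eqs. (65) and (67) in (Deng et al., 2007):

```python
import numpy as np

def robust_within_scatter(S_W, energy=0.98):
    """Sketch of the RFDA idea (Eqs. 64-68): keep the leading eigenvalues of
    S_W and replace the small/invalid tail with a single value.

    Assumptions (not from the paper): k keeps `energy` of the spectrum's sum,
    and the replacement value lambda* is the mean of the discarded tail.
    """
    eigvals, Phi = np.linalg.eigh(S_W)          # ascending order
    eigvals, Phi = eigvals[::-1], Phi[:, ::-1]  # sort descending, Eq. (64)
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), energy) + 1
    lam_star = eigvals[k:].mean() if k < len(eigvals) else eigvals[-1]
    eigvals_robust = eigvals.copy()
    eigvals_robust[k:] = lam_star               # Eq. (66)
    return Phi @ np.diag(eigvals_robust) @ Phi.T  # Eq. (68)

# usage: S_W_robust = robust_within_scatter(S_W); the robust Fisher
# directions then come from the generalized problem (S_B, S_W_robust)
```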

Figure 2: Comparison of FDA and PCA directions for two-dimensional data with two classes: (a) a case where FDA and PCA are orthogonal, (b) a case where FDA and PCA are equivalent (parallel), and (c) a case between the two extreme cases of (a) and (b).

6 Comparison of FDA and PCA Directions

The FDA directions capture the directions along which the instances of different classes fall apart and the instances within one class fall close to each other. On the other hand, the PCA directions capture the directions along which the data have maximum variance (spread) regardless of the classes (Ghojogh & Crowley, 2019c). In some datasets, the FDA and PCA directions are orthogonal, while in others they are parallel (equivalent). Cases between these two extremes can also happen, depending on the spread of the classes in the dataset. Figure 2 shows these cases for some two-dimensional datasets.

Moreover, considering Eq. (38) for $\boldsymbol{S}_B$, the Fisher criterion becomes Eqs. (49) and (60) for one-dimensional and multi-dimensional Fisher subspaces, respectively. In these equations, the $-1$ is a constant and is dropped in the optimization. This has an important message about FDA: the Fisher direction maximizes the total variance (spread) of the data, as also done in PCA, while at the same time it minimizes the within-scatters of the classes (by making use of the class labels). In other words, the optimization of FDA is equivalent to (we repeat Eq. (61) here):

$$\max_{\boldsymbol{U}} \ \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_T\, \boldsymbol{U}), \qquad (69)$$
$$\text{subject to} \ \ \boldsymbol{U}^\top \boldsymbol{S}_W\, \boldsymbol{U} = \boldsymbol{I},$$

while the optimization of PCA is (Ghojogh & Crowley, 2019c):

$$\max_{\boldsymbol{U}} \ \mathbf{tr}(\boldsymbol{U}^\top \boldsymbol{S}_T\, \boldsymbol{U}), \qquad (70)$$
$$\text{subject to} \ \ \boldsymbol{U}^\top \boldsymbol{U} = \boldsymbol{I}.$$

The solutions to Eqs. (69) and (70) are the generalized eigenvalue problem $(\boldsymbol{S}_T, \boldsymbol{S}_W)$ and the eigenvalue problem for $\boldsymbol{S}_T$, respectively (Ghojogh et al., 2019a).
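To see the contrast between Eqs. (69) and (70) numerically, the sketch below (with assumed synthetic data: two elongated classes whose means differ along the direction of small within-class spread) computes the leading PCA direction from $\boldsymbol{S}_T$ and the leading FDA direction from the generalized eigenvalue problem $(\boldsymbol{S}_T, \boldsymbol{S}_W)$:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(9)
# two elongated classes: the within-class spread is along the x-axis,
# while the class means differ along the y-axis
cov = np.array([[9.0, 0.0], [0.0, 0.4]])
X1 = rng.multivariate_normal([0, 0], cov, 200).T
X2 = rng.multivariate_normal([0, 3], cov, 200).T
X = np.hstack([X1, X2])

mu, mu1, mu2 = X.mean(axis=1), X1.mean(axis=1), X2.mean(axis=1)
S_T = (X - mu[:, None]) @ (X - mu[:, None]).T               # total scatter
S_W = ((X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T
       + (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T)       # Eq. (28)

# PCA direction: top eigenvector of S_T (Eq. 70)
pca_dir = np.linalg.eigh(S_T)[1][:, -1]
# FDA direction: top generalized eigenvector of (S_T, S_W) (Eq. 69)
fda_dir = eigh(S_T, S_W)[1][:, -1]
fda_dir /= np.linalg.norm(fda_dir)

print("PCA direction:", np.round(pca_dir, 2))   # roughly along the x-axis
print("FDA direction:", np.round(fda_dir, 2))   # roughly along the y-axis
```

On such data the two directions are nearly orthogonal, which corresponds to case (a) of Figure 2.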

7 FDA ≡ LDA

The FDA is also referred to as Linear Discriminant Analysis (LDA) and Fisher LDA (FLDA). Note that FDA is a manifold (subspace) learning method and LDA (Ghojogh & Crowley, 2019a) is a classification method. However, LDA can be seen as a metric learning method (Ghojogh & Crowley, 2019a), and since metric learning is a manifold learning method (see Appendix A), there is a connection between FDA and LDA.

We know that FDA is a projection-based subspace learning method. Consider the projection vector $\boldsymbol{u}$. According to Eq. (7), the projection of a data point $\boldsymbol{x}$ is:

$$\mathbb{R} \ni \widetilde{x} := \boldsymbol{u}^\top \boldsymbol{x}, \qquad (71)$$

which can be done for all the data instances of every class. Thus, the mean and the covariance (now a scalar variance) of the $j$-th class are transformed as:

$$\widetilde{\mu}_j := \boldsymbol{u}^\top \boldsymbol{\mu}_j, \qquad (72)$$
$$\widetilde{\sigma}_j^2 := \boldsymbol{u}^\top \boldsymbol{S}_j\, \boldsymbol{u}, \qquad (73)$$

because of the characteristics of mean and variance.

According to Eq. (42), the Fisher criterion is the ratio of the between-class variance, $\sigma_B^2$, and the within-class variance, $\sigma_W^2$:

$$f := \frac{\sigma_B^2}{\sigma_W^2} = \frac{(\widetilde{\mu}_1 - \widetilde{\mu}_2)^2}{\widetilde{\sigma}_1^2 + \widetilde{\sigma}_2^2}. \qquad (74)$$

The FDA maximizes the Fisher criterion: