In this paper, we discuss a new type of discriminant analysis based on a projection onto the generalized difference subspace (GDS), which represents the difference among multiple class subspaces [gds]. GDS is defined as a generalization of the difference subspace (DS), which represents the difference between two subspaces. DS is in turn a natural extension of the difference vector between two vectors.
The orthogonal projection of data or subspaces onto a GDS, called GDS projection, has two properties that are useful for feature extraction, and either can be exploited depending on the task. The first is that it enlarges the angles among class subspaces, bringing them closer to mutual orthogonality [gds]. As a result, GDS projection works as a quasi-orthogonalization and is an effective feature extraction technique for subspace-based classifiers such as the subspace method and the mutual subspace method [cmsm, kcmsm]. The second is that it performs discriminative feature extraction through a mechanism similar to Fisher discriminant analysis (FDA) [fda1, fda2].
In this paper, we clarify the latter property both theoretically and empirically by exploring the close connection between GDS projection and FDA. A direct proof of this connection is not straightforward, however, because the two formulations differ significantly. To circumvent this difficulty, we introduce geometrical Fisher discriminant analysis (gFDA), a discriminant analysis based on a Fisher criterion that is simplified in terms of class representation. We then prove the connection between GDS projection and FDA indirectly via gFDA, which serves as an intermediate concept between the two: gFDA inherits its intrinsic mechanism from GDS projection and its discriminant ability from FDA.
Our simplification starts with the introduction of a heuristic principle that the directions of the sample mean vector and the first principal component vector of a class are nearly equivalent, given the condition that the PCA without data centering (subtracting the mean) is applied to calculate the principal component vectors.
This heuristic principle enables us to represent the original Fisher criterion using the principal component vectors and their weights (eigenvalues) of all the classes involved in the classification task. Based on this representation, we simplify the original Fisher criterion in terms of class representation by adding constraints on the data distribution step by step, and finally approximate it compactly with only several principal component vectors of all the classes. Since a set of the principal component vectors of each class spans a class subspace, our simplified criterion is defined on the basis of the geometrical relationship between the class subspaces. In this sense, we name this new type of Fisher discriminant analysis based on the simplified criterion geometrical Fisher discriminant analysis (gFDA).
We further introduce the normalization of the data projected onto the discriminant space. This normalization is in principle required to get the best performance out of gFDA and GDS projection, as explained through the geometrical structure in Sec.5.5.
The discriminant criterion of gFDA leads to a generalized eigenvalue problem for the matrix product of the between-class and within-class matrices. This formulation makes it difficult to examine the connection between gFDA and GDS projection. Thus, we transform the generalized eigenvalue problem into a simpler regular eigenvalue problem for a linear combination of the between-class and within-class matrices. The linear combination form leads to the observation that gFDA is equivalent to GDS projection with a small correction term, under the condition that there is no overlap between class subspaces. As a consequence, we can verify the close connection between FDA and GDS projection via gFDA, since gFDA can be regarded as an approximation of FDA.
The linear combination form also enables gFDA to deal with the situation where only a few samples are available. In such a situation, the within-class matrix becomes singular, so that FDA cannot in principle be computed. This problem is called the small sample size (SSS) problem of FDA [fda2]. To address the SSS problem, many extensions of FDA have been proposed [overviewLDA, newLDA, regFDA, subspaceLDA]. gFDA differs from these conventional methods in that it bypasses the SSS problem by representing the discriminant criterion as a linear combination, which can be solved regardless of the number of samples. gFDA can work even with only one sample without any specific modification, unlike most of the above extensions.
Besides, the subspace representation can enhance the robustness of gFDA against the SSS problem. In many applications, a class subspace can be stably generated from even a small amount of data; for example, in 3D object recognition, it is well known that a set of images of a 3D convex object with Lambertian reflectance under various illumination conditions can be represented by a low-dimensional subspace (from 3 to 9 dimensions), which is called the illumination subspace [basri, Belhumeur, illuminationSpace]. This means that the illumination subspace of a 3D object such as a face can be stably and accurately estimated from only a small number (from 3 to 9) of face images under different illuminations. This property of the subspace representation works effectively against the SSS problem.
Our main contributions are summarized as follows:
We verify that the projection of data onto the generalized difference subspace, GDS projection, works as a discriminant analysis through a mechanism similar to the Fisher discriminant analysis.
To show this property, we make the following three contributions:
We propose a new discriminant analysis, geometrical Fisher discriminant analysis (gFDA), which maximizes a simplified Fisher criterion under a heuristic principle: the directions of sample mean vector and first principal component vector of a class are nearly the same.
We prove that gFDA is equivalent to GDS projection with a small correction term.
We show the close connection of GDS projection and FDA indirectly by regarding gFDA as an intermediate concept between them.
We demonstrate the effectiveness of gFDA and GDS projection through extensive comparison experiments with several extensions of FDA on two public databases, the Yale face B+ database and the CMU face database, focusing on the small sample size (SSS) problem.
The rest of this paper is organized as follows. Section 2 and Section 3 provide preliminary concepts. In Section 2, we describe the concept and the definition of the generalized difference subspace (GDS). In Section 3, we overview the fundamentals of FDA with the Fisher criterion. In Section 4, we introduce a heuristic principle on the relationship between the first principal component vector and the mean vector of a class. Then, we simplify the Fisher criterion by using the heuristic relationship and construct the geometrical Fisher discriminant analysis (gFDA) with the simplified criterion. In Section 5, we describe the geometrical mechanism of gFDA and prove that gFDA has dual forms of objective function. In Section 6, we show the close connection between FDA and GDS projection via gFDA. In Section 7, we demonstrate the effectiveness of gFDA through evaluation experiments, focusing on the situation of a small sample size. Section 8 concludes the paper.
2 Generalized difference subspace
In this section, we describe the concept of generalized difference subspace (GDS). As a preliminary to its definition, we describe how to generate a class subspace from the data set for each class. We then define the difference subspace (DS) for two subspaces and extend DS to GDS.
2.1 Generation of class subspace
The principal component vectors of a class are obtained by applying the principal component analysis (PCA) without data centering to a set of data from the class.
Given a set of -dimensional data of class , where an image with pixels is regarded as an dimensional data , the principal component vectors of class are obtained by the following procedure:
An auto-correlation matrix is computed as from .
The principal component vectors of class are obtained as the unit eigenvectors corresponding to the largest eigenvalues of . If we use all the eigenvalues, we obtain the spectral decomposition of the matrix .
Throughout the paper, the principal component vectors of a class are used as the orthonormal basis vectors of the corresponding class subspace. In the following, we use the terms principal component vector and orthonormal basis vector interchangeably.
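The procedure above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `class_subspace_basis`, the column-wise data layout, and the 1/n normalization of the autocorrelation matrix are assumptions made here.

```python
import numpy as np

def class_subspace_basis(X, n_dims):
    """Uncentered PCA. X has shape (d, n): one d-dimensional sample per column.

    Returns the n_dims leading unit eigenvectors (as columns) of the
    autocorrelation matrix, together with the corresponding eigenvalues.
    """
    n = X.shape[1]
    R = X @ X.T / n                        # autocorrelation matrix, no mean subtraction
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]
    return eigvecs[:, order], eigvals[order]

# Toy data with a non-zero mean (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50)) + 5.0
basis, weights = class_subspace_basis(X, n_dims=3)
```

The columns of `basis` are orthonormal, so `basis @ basis.T` is the orthogonal projection matrix of the class subspace used in the later sections.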
2.2 Geometrical definition of DS
We formulate the difference subspace between an -dimensional subspace and an -dimensional subspace in an -dimensional vector space. When there is no overlap between these subspaces, canonical angles (for convenience ) can be obtained between them [cangle1, cangle2]. Let be the difference vector, , between the canonical vectors and , which form the -th canonical angle . All are orthogonal to each other. Thus, after normalizing the length of each difference vector to 1, we regard the normalized difference vectors as the orthonormal basis vectors of the difference subspace. Thus, is defined as .
2.3 Analytical definition of DS
The difference subspace defined geometrically in Sec.2.2 can also be defined analytically by using the orthogonal projection matrices of the two class subspaces [gds].
Theorem. The -th basis vector of the difference subspace is equal to the normalized eigenvector of that corresponds to the -th smallest eigenvalue smaller than 1, where and are the orthogonal projection matrices of the two class subspaces, defined by and , respectively.
The eigenvectors of the sum of the two projection matrices corresponding to eigenvalues smaller than 1 span the difference subspace.
The eigenvectors of the sum of the two projection matrices corresponding to eigenvalues larger than 1 span the principal component subspace.
These relations lead to the conclusion that the sum subspace of the two subspaces, spanned by all the eigenvectors of the sum matrix, is represented by the orthogonal direct sum of the principal component subspace and the difference subspace. Fig.2 shows the conceptual diagram of this direct sum decomposition. This means that the difference subspace can be defined as the subspace that is produced by removing the principal component subspace from the sum subspace. Hence, the difference subspace can be regarded as the subspace that does not include the principal component of the two subspaces; that is, it contains only the difference component between them.
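The analytical definition can be checked numerically on the simplest case of two one-dimensional subspaces. The sketch below is illustrative only: the function name `difference_subspace`, the tolerance `tol`, and the toy vectors are assumptions made here. It relies on the fact that, for unit vectors u and v, the vector u - v is an eigenvector of the sum of the two projection matrices with eigenvalue 1 - cos(theta), where theta is the angle between u and v.

```python
import numpy as np

def difference_subspace(U, V, tol=1e-10):
    """Eigenvectors of P1 + P2 whose eigenvalues lie strictly between 0 and 1.

    U and V hold orthonormal basis vectors of the two subspaces as columns.
    Eigenvalues equal to 0 (outside the sum subspace) or at least 1 are excluded.
    """
    P1 = U @ U.T                           # orthogonal projection onto span(U)
    P2 = V @ V.T                           # orthogonal projection onto span(V)
    w, E = np.linalg.eigh(P1 + P2)         # ascending eigenvalues
    keep = (w > tol) & (w < 1.0 - tol)
    return E[:, keep], w[keep]

theta = np.pi / 3
u = np.array([[1.0], [0.0]])
v = np.array([[np.cos(theta)], [np.sin(theta)]])

D, w_ds = difference_subspace(u, v)
d_geo = (u - v) / np.linalg.norm(u - v)    # normalized difference vector (geometrical definition)
alignment = abs((D.T @ d_geo).item())      # |cosine| between the two definitions
```

In this toy case the analytically obtained basis vector is parallel to the normalized difference vector, which is what the theorem states for the general case.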
2.4 Definition of GDS
To deal with the difference between two or more subspaces, the concept of the difference subspace was generalized under the analytical definition [gds]. Fig.3 shows the conceptual diagram of the generalized difference subspace (GDS) for subspaces.
Given -dimensional subspaces in -dimensional vector space, a generalized difference subspace can be defined as such a subspace that is produced by removing the principal component subspace , of all the subspaces, from the sum subspace of . Thus, the generalized difference subspace is spanned by eigenvectors, corresponding to the smallest eigenvalues, of the following sum matrix :
where denotes the orthogonal projection matrix of the class subspace.
The generalized difference subspace contains only the essential component for discriminating all the classes, since it is the orthogonal complement of the principal component subspace that represents the common information of all the class subspaces.
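A minimal NumPy sketch of this construction is given below. It is not the authors' code: the names `generalized_difference_subspace` and `random_basis`, the tolerance used to discard directions outside the sum subspace, and the choice of `n_gds` are assumptions for illustration. How many eigenvectors to keep is a design parameter that the sketch leaves to the caller.

```python
import numpy as np

def generalized_difference_subspace(bases, n_gds, tol=1e-10):
    """bases: list of (d, N_c) matrices with orthonormal columns, one per class.

    Returns the n_gds eigenvectors of the sum of projection matrices that
    correspond to the smallest non-zero eigenvalues, and those eigenvalues.
    """
    G = sum(U @ U.T for U in bases)            # sum of orthogonal projection matrices
    eigvals, eigvecs = np.linalg.eigh(G)       # ascending eigenvalues
    in_sum_space = eigvals > tol               # drop directions outside the sum subspace
    eigvals, eigvecs = eigvals[in_sum_space], eigvecs[:, in_sum_space]
    return eigvecs[:, :n_gds], eigvals[:n_gds]

def random_basis(rng, dim, n_dims):
    """Hypothetical helper: a random orthonormal basis obtained by QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(dim, n_dims)))
    return Q

rng = np.random.default_rng(0)
bases = [random_basis(rng, 10, 2) for _ in range(3)]   # three 2-dimensional class subspaces
gds_basis, gds_eigvals = generalized_difference_subspace(bases, n_gds=4)
```

Data or class subspaces would then be projected with `gds_basis.T @ x`; the two discarded directions with the largest eigenvalues play the role of the principal component subspace in this toy setting.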
3 Fisher discriminant analysis
Fisher discriminant analysis (FDA) is a method for obtaining a discriminant space that can distinguish multiple classes effectively [fda1, fda2]. Such a discriminant space can be found by maximizing the Fisher criterion of the data projected onto it.
The Fisher criterion consists of within-class covariance matrix and between-class covariance matrix. Given classes, each of which contains the data set (), the within-class covariance matrix is defined as
where and indicates the mean vector of class . The between-class covariance matrix is defined by the following equation:
where indicates the mean vector over all the classes; can also be represented by
The Fisher criterion of the data projected on a 1-dimensional subspace spanned by vector is defined as
where the vector that maximizes function can be obtained by solving the generalized eigenvalue problem
Discriminant space is spanned by eigenvectors, , corresponding to the largest eigenvalues of the above eigenvalue problem.
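For reference, the quantities described in this section can be written in a standard textbook form as follows. The notation is an assumption made here and may differ from the authors' in symbols and normalization: C is the number of classes, X_c the data set of class c with n_c samples, n the total number of samples, \mu_c the class mean, and \mu = \frac{1}{n}\sum_c n_c \mu_c the overall mean.

```latex
\begin{align}
\Sigma_W &= \frac{1}{n}\sum_{c=1}^{C}\sum_{x \in X_c} (x-\mu_c)(x-\mu_c)^{\top},\\
\Sigma_B &= \frac{1}{n}\sum_{c=1}^{C} n_c\,(\mu_c-\mu)(\mu_c-\mu)^{\top}
          = \frac{1}{2n^{2}}\sum_{i=1}^{C}\sum_{j=1}^{C} n_i n_j\,(\mu_i-\mu_j)(\mu_i-\mu_j)^{\top},\\
J(d) &= \frac{d^{\top}\Sigma_B\, d}{d^{\top}\Sigma_W\, d},
\qquad
\Sigma_B\, d = \lambda\,\Sigma_W\, d .
\end{align}
```

The second expression for \Sigma_B, in terms of pairwise differences of the class means, follows by expanding the product and using \sum_c n_c/n = 1; it is the form that connects most directly to the difference vectors used later.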
4 Geometrical Fisher discriminant analysis
In this section, we first approximate the Fisher criterion, in Eq.(5), based on a heuristic principle. We then simplify it by adding constraints on data distribution in incremental steps. Finally, we construct the proposed gFDA by maximizing the simplified Fisher criterion.
4.1 Equivalence between the class mean and first principal component vector
We introduce a heuristic principle on the equivalence between the first principal component vector and the mean vector of class .
Heuristic principle: For each class subspace, the first principal component vector and the mean vector correspond closely in direction, under the condition that , where is the maximum variance of the class distribution among all the dimensions.
Under this heuristic principle, the first and the remaining principal component vectors have different characteristics; the projected data on the first principal component vector should have a positive mean value, while those on the remaining vectors should have zero mean value. Fig.4 shows an example of the histograms of the projections on each principal component vector in the case of front face images. We can observe that the means of the projections onto the principal component vectors are all nearly zero except the first one, which supports that the heuristic principle should be valid in real data.
The mechanism that the condition of yields our heuristic principle can be considered as follows. Given a data set, of class , the autocorrelation matrix and the covariance matrix are defined as
where indicates the mean vector of the class. There is the following relationship between the two matrices and the mean vector:
The condition of ensures , thus is dominant in constructing . Therefore, the direction of the first principal component vector corresponding to the largest eigenvalue of almost coincides with that of vector .
The heuristic principle holds to a high degree even under a loose condition. In a simulation using randomly generated Gaussian distributions in vector spaces of various dimensions, the two directions coincided with a correlation above 0.998 even under the condition that . In many tasks, for example object image classification, the condition of can hold in most cases. Furthermore, we experimentally confirm the validity of the heuristic principle on face data in Section 7.
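The mechanism can be illustrated with a small simulation. The sketch below is not the simulation reported above: the dimension, sample size, mean norm, and noise level are arbitrary choices made here, and the 1/n normalization is an assumption. It checks that the autocorrelation matrix equals the covariance matrix plus the outer product of the mean vector, and that the first principal component of the uncentered data is nearly parallel to the mean when the mean norm dominates the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 20, 500
mean_direction = np.ones(dim) / np.sqrt(dim)

# Isotropic Gaussian samples (unit variance) around a mean of norm 10; columns are samples.
X = 10.0 * mean_direction[:, None] + rng.normal(size=(dim, n))

m = X.mean(axis=1)                                   # sample mean vector
R = X @ X.T / n                                      # autocorrelation matrix (no centering)
Sigma = (X - m[:, None]) @ (X - m[:, None]).T / n    # covariance matrix (1/n normalization)

eigvals, eigvecs = np.linalg.eigh(R)
phi1 = eigvecs[:, -1]                                # first principal component (largest eigenvalue)
cosine = abs(phi1 @ m) / np.linalg.norm(m)           # |cosine| between phi1 and the mean
```

Reducing the mean norm relative to the noise level makes `cosine` drop, which is one way to probe how loose the condition can be for a given data distribution.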
4.2 Simplification of the Fisher criterion
The within-class covariance matrix defined in Eq.(2) can be rewritten by the autocorrelation matrix and the mean vectors of the th class as follows:
where . By using the spectral decomposition of , can be rewritten as
where , and and indicate the th eigenvalue of the autocorrelation matrix of the class and its corresponding eigenvector, respectively.
Furthermore, by using the heuristic principle, where , and assuming that = for all the classes, we replace with :
where represents the variance of the data projected on the th principal component vector , and .
With the heuristic principle, the between-class covariance can be represented with as follows:
We refer to an FDA based on the Fisher criterion of as approximated FDA (aFDA), which is more or less equivalent to the original FDA. We simplify the representation of and in the following two steps.
Simplification-I: First, we use only the eigenvectors corresponding to eigenvalues larger than a specified threshold and discard the eigenvectors corresponding to smaller eigenvalues:
Moreover, assuming that the norms of the mean vectors of all the classes are equal to , we can simplify to using as follows:
We refer to an FDA based on the simplified Fisher criterion of as simplified FDA (sFDA).
Simplification-II: Next, we assume that all the values of are equal to . Although this assumption may seem extreme, it can enhance the robustness of sFDA when few training data are available, as will be shown later. Under this assumption, we further simplify to using as
It is possible to define several types of Fisher-like criteria as combinations of the above simplified matrices. In this paper, we are interested in the simplest criterion defined by and and consider the following objective function :
Since the term of is constant, we can ignore it and obtain vector by maximizing the following objective function instead of :
The sequence of simplifications is summarized in Fig.5. We define an FDA based on the above simplified Fisher criterion as geometrical FDA (gFDA). Finally, the vectors that maximize are obtained by solving the following generalized eigenvalue problem:
4.3 Small sample size problem
In many practical applications, the dimension of data is much larger than the total number of data, . In such a case, Eq.(24) cannot be solved since is singular. This issue is called the small sample size (SSS) problem [fda2], which has been well known as a critical limitation of FDA.
To overcome the SSS problem, various types of extensions of FDA have been proposed [overviewLDA]. There are two typical solutions widely used due to their simple implementation. One is to use PCA to reduce the dimension before applying FDA [fda2]. The other is to add a regularization term to matrix [regFDA] as follows:
where is a parameter that controls the strength of the regularization and is the identity matrix.
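The effect of the regularization term can be seen on a toy problem in which the data dimension exceeds the number of samples. The sketch below is only an illustration: the variable names, the value of `epsilon`, the 1/n normalization, and the synthetic data are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_per_class, n_classes = 10, 2, 2

# Two classes with two 10-dimensional samples each: fewer samples than dimensions.
classes = [rng.normal(size=(dim, n_per_class)) + 3.0 * c for c in range(n_classes)]

n_total = n_per_class * n_classes
Sigma_w = np.zeros((dim, dim))
for X in classes:
    centered = X - X.mean(axis=1, keepdims=True)
    Sigma_w += centered @ centered.T / n_total    # within-class covariance

epsilon = 1e-3                                    # regularization strength (arbitrary choice)
Sigma_w_reg = Sigma_w + epsilon * np.eye(dim)     # regularized within-class matrix

rank_plain = np.linalg.matrix_rank(Sigma_w)       # rank-deficient, so not invertible
rank_reg = np.linalg.matrix_rank(Sigma_w_reg)     # full rank after regularization
```

The regularized matrix is invertible for any positive `epsilon`, but the resulting discriminant space depends on that value, which is why it has to be tuned in practice.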
In addition to the above simple methods, nullLDA [newLDA] is also often used to address the SSS problem. In this method, all the data are first projected onto the null space of the within-class scatter matrix, and then a between-class scatter matrix is calculated from the projections. Finally, a discriminant space is obtained by solving the eigenvalue problem of the between-class scatter matrix. Many extensions of FDA based on similar ideas have been proposed to circumvent the SSS problem [sssLDA, subspaceLDA, overviewLDA].
For gFDA, the objective function can be rewritten as a linear combination of the two symmetric matrices, as will be proved in Sec.5.3. This enables gFDA to avoid the SSS problem and to work even with only one sample without any modification. In terms of computational cost, however, it is desirable to combine it with PCA-based dimensionality reduction, which can greatly reduce the data dimension. For gFDA, the dimension of the original space can in fact be reduced to the number of orthonormal basis vectors without losing any structural information of the class subspaces, since the orthonormal basis vectors over all the classes are linearly independent, assuming no overlap among class subspaces.
4.4 Comparison of FDA and gFDA
Fig.6 shows the comparisons of projections onto discriminant spaces generated by FDA (left) and gFDA (right), where we used sets of face images from the Yale face database. In this database, each subject class contains 45 front face images which were collected under different lighting conditions. It is known that all the possible images of a face under various lighting conditions are contained in an illumination cone [9PL]. The illumination cone of a subject can be accurately approximated by a convex cone formed by a set of nine front face images of the subject under nine specific lighting conditions. These nine images are called the 9PL images [9PL] in the Yale face database. Further, the illumination cone is contained in a 9-dimensional illumination subspace, which can be generated by applying PCA to a set of the 9PL images. Hence, a 9-dimensional illumination subspace can in principle contain the other 36 images under different illumination conditions. For more details of the Yale database, see Section 7.
We used the 9PL images as training data, and used the remaining face images as test data. The dimension of each class subspace was set to 9. In Fig.6, a row represents the case of 2, 3 or 4 classes. We used FDA with PCA dimensionality reduction [fda2], since the original FDA cannot be used under this setting due to the SSS problem. In contrast, gFDA avoids the SSS problem by using the linear combination form, which will be described in Sec. 5.3.
5 Geometrical mechanism of gFDA
5.1 Criterion based on class subspaces
Our Fisher-like criterion is defined by using only the principal component vectors . This can be interpreted as that is determined based on the geometry of the class subspaces, which are spanned by the principal component vectors of each class .
More specifically, the denominator of is the sum of the orthogonal projection matrices of all the class subspaces, and the numerator is the autocorrelation matrix of all the difference vectors among the first orthonormal basis vectors, namely, their mean vectors. Hence, can be maximized by minimizing the sum of the projections onto all the class subspaces while simultaneously maximizing the projections of the differences between the mean vectors. Reflecting this mechanism, we name the discriminant analysis based on our Fisher-like criterion "geometrical FDA (gFDA)".
In the following, we look at the geometrical characteristics of gFDA through the simplest case of two classes whose distributions, and , lie on one-dimensional subspaces, , , and . Assume that the two classes have an identical number of data, , and the same variance, and that the norms of their mean vectors are 1.0, as shown in Fig.7.
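The two matrices described above can be assembled directly from the class bases. The sketch below is a minimal illustration under stated assumptions: the function name `gfda_matrices` is hypothetical, the between-class matrix is summed over unordered class pairs without any normalization constant, and the first column of each basis is taken to be the first principal component vector. The random bases only exercise the code; they are not ordered by eigenvalue as real class subspaces would be.

```python
import numpy as np
from itertools import combinations

def gfda_matrices(bases):
    """bases: list of (d, N_c) matrices with orthonormal columns, one per class.

    Returns (Sigma_b, Sigma_w):
      Sigma_w is the sum of the orthogonal projection matrices of all class subspaces,
      Sigma_b is the autocorrelation matrix of the difference vectors between the
      first basis vectors of every pair of classes.
    """
    dim = bases[0].shape[0]
    Sigma_w = sum(U @ U.T for U in bases)
    Sigma_b = np.zeros((dim, dim))
    for U_i, U_j in combinations(bases, 2):        # unordered class pairs (i < j)
        diff = U_i[:, 0] - U_j[:, 0]
        Sigma_b += np.outer(diff, diff)
    return Sigma_b, Sigma_w

rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.normal(size=(10, 2)))[0] for _ in range(3)]
Sigma_b, Sigma_w = gfda_matrices(bases)
```

Neither matrix depends on the number of samples per class once the bases are available, which is the property that the later discussion of the SSS problem builds on.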
A basis vector of the discriminant space is obtained by solving , where and in this case. The has the same direction as that of . Thus, we can see from the geometry shown in Fig.7 that the projection onto the discriminant space spanned by minimizes the sum of the projections of the two class subspaces, while maintaining the difference vector, . For the data distributions, and , we can see that the projection reduces the within-class variances, while maintaining the between-class variance. This mechanism can work independently of the type of class distribution, although it may be generally approximated by a normal distribution. In more general cases with classes and multiple dimensions, gFDA still has a similar geometry as in this simple case.
5.2 Two-step process
It is well known that the whole process of FDA consists of two steps: whitening and PCA. The process of gFDA in the form of can also be divided into these two steps, as shown in Fig.8.
We consider the case that -dimensional class subspaces in are given, assuming that there is no overlap between class subspaces. For the simplicity of discussion, to make the matrix full rank, we assume that the dimensionality of the vector space can be reduced from to by applying PCA-based dimension reduction. Thus, in the following, we consider -dimensional class subspaces in . The details of each step are as follows:
In the first step, whitening such that is applied to orthonormal basis vectors of -dimensional class subspaces. As a result, the orthonormal basis vectors of all the classes are orthogonalized to each other. A subspace spanned by these orthogonalized basis vectors in the first step is called normalized space in contrast with the original feature vector space. Let the orthogonalized basis vectors be in the normalized space.
In the second step, PCA is applied to a set of the difference vectors between the first principal component vectors, , where = and . We obtain principal component vectors from , since the rank of is . Note that can also be represented by .
The obtained principal component vectors span the discriminant space .
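The two steps can be sketched as follows for a toy configuration in which the class subspaces together fill the whole (reduced) vector space. This is an illustrative sketch, not the authors' implementation: the names `gfda_two_step` and `random_bases`, the use of the symmetric inverse square root as the whitening matrix, and the unnormalized sum over unordered class pairs are assumptions made here.

```python
import numpy as np
from itertools import combinations

def random_bases(rng, n_classes, n_dims):
    """Hypothetical helper: random class bases in a space of dimension n_classes * n_dims."""
    dim = n_classes * n_dims
    return [np.linalg.qr(rng.normal(size=(dim, n_dims)))[0] for _ in range(n_classes)]

def gfda_two_step(bases):
    n_classes = len(bases)
    Sigma_w = sum(U @ U.T for U in bases)          # sum of projection matrices (full rank here)

    # Step 1: whitening with W = Sigma_w^(-1/2), so that W @ Sigma_w @ W.T is the identity.
    w, E = np.linalg.eigh(Sigma_w)
    W = E @ np.diag(w ** -0.5) @ E.T
    whitened = [W @ U for U in bases]              # basis vectors become mutually orthonormal

    # Step 2: PCA of the difference vectors between the whitened first basis vectors.
    dim = Sigma_w.shape[0]
    Sigma_b_white = np.zeros((dim, dim))
    for A, B in combinations(whitened, 2):
        diff = A[:, 0] - B[:, 0]
        Sigma_b_white += np.outer(diff, diff)
    vals, vecs = np.linalg.eigh(Sigma_b_white)
    top = np.argsort(vals)[::-1][: n_classes - 1]  # rank is n_classes - 1

    # Map the principal components back to the original space.
    return W @ vecs[:, top], vals[top], whitened

rng = np.random.default_rng(0)
bases = random_bases(rng, n_classes=3, n_dims=2)
discriminant_vectors, top_eigvals, whitened = gfda_two_step(bases)
stacked = np.hstack(whitened)                      # all whitened basis vectors side by side
```

With this pair-sum convention, the retained eigenvalues all come out equal to the number of classes, which anticipates the characteristic discussed in the next subsection.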
5.3 Dual forms of objective function
The objective function of our simplified Fisher criterion is represented as a generalized eigenvalue problem for the matrix product . In the following, we prove that the objective function can also be represented as a simpler regular eigenvalue problem for the linear combination of and under the same setting as in the previous section. We consider a set of -dimensional class subspaces in .
The flow of our proof is summarized as follows:
C1: The eigenvalues of matrix are all equal to , independently of the dimensionality of each class subspace.
C2: The characteristic C1 leads to the following equivalence:
where we note that in the former equation we need to take the eigenvectors corresponding to the largest eigenvalues, while in the latter we need to take the eigenvectors corresponding to the smallest eigenvalues (zero).
C3: The two sets of eigenvectors obtained from the two eigenvalue problems in C2 are different. In fact, the former eigenvectors are not orthogonal, since the matrix is not symmetric. In contrast, the latter eigenvectors are orthogonal to each other, since matrix is symmetric. However, the two subspaces spanned by the respective sets of eigenvectors coincide completely. Therefore, gFDA has dual forms of its objective function.
Proof of C1. The characteristic C1 can be proved as follows: has the same eigenvalues as , where is the whitening such that , as described in the previous section. is represented with the difference vectors between the first orthonormal basis vectors, , which are orthogonalized by whitening :
Let be the autocorrelation matrix of the difference vectors among the standard basis of :
Since both the basis of and the standard basis of are orthonormal bases of , the two autocorrelation matrices and have the same eigenvalues, although the corresponding eigenvectors are different.
can be written as
In the above equation, the first matrix has eigenvalues of and the second one has one and zeros as its eigenvalues. Hence, matrix has eigenvalues of as its non-zero eigenvalues. Therefore, we can confirm that has eigenvalues of as well.
Proof of C2. Next, we prove characteristic C2. By substituting into Eq.(24), we obtain
Further, we can rewrite the equation as
where, by considering that is not a zero vector, has zero eigenvalues.
This characteristic means that the eigenspace (solution space) of corresponding to the zero eigenvalue is equivalent to that of corresponding to the eigenvalue of . In other words, the discriminant space spanned by the eigenvectors of corresponding to zero eigenvalues coincides with that spanned by the eigenvectors of . In the following, we use to indicate .
Thus far, we have considered the vector space with the reduced dimension of . However, we should note that the above discussion also holds in the original vector space with the dimension of , so that the linear combination form is valid there as well.
Here, we reiterate that the linear combination form can mitigate the SSS problem, since it can be stably calculated independently of the number of data and the dimension of vector space, unlike the matrix product form.
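The dual form can be verified numerically on the same kind of toy configuration. The sketch below is hedged in two respects: the constant `alpha` is set to the number of classes, which is the shared eigenvalue only under the convention assumed here (between-class matrix summed over unordered class pairs, without normalization), and all variable names are hypothetical. It takes the eigenvectors of the linear combination with the smallest eigenvalues and checks that they also satisfy the generalized eigenvalue equation of the matrix product form.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n_classes, n_dims = 3, 2
dim = n_classes * n_dims                           # class subspaces fill the whole space
bases = [np.linalg.qr(rng.normal(size=(dim, n_dims)))[0] for _ in range(n_classes)]

Sigma_w = sum(U @ U.T for U in bases)              # sum of projection matrices
Sigma_b = np.zeros((dim, dim))
for U_i, U_j in combinations(bases, 2):            # unordered class pairs (i < j)
    diff = U_i[:, 0] - U_j[:, 0]
    Sigma_b += np.outer(diff, diff)

alpha = float(n_classes)                           # shared eigenvalue under this convention
M = Sigma_w - Sigma_b / alpha                      # linear combination form (symmetric)
vals, vecs = np.linalg.eigh(M)                     # ascending eigenvalues
D = vecs[:, : n_classes - 1]                       # eigenvectors of the smallest eigenvalues

# If D solves the matrix product form, then Sigma_b @ D equals alpha * Sigma_w @ D.
residual = Sigma_b @ D - alpha * (Sigma_w @ D)
```

Because `M` is symmetric, `np.linalg.eigh` returns orthonormal eigenvectors directly, and no inverse of the within-class matrix is ever formed; this is the practical advantage of the linear combination form when that matrix is singular.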
5.4 Generation of discriminant space
As stated in the previous section, all of discriminant vectors have the same discriminant ability, . This characteristic suggests that each individual vector of does not have a meaning, rather a subspace spanned by them should be considered to be essential. Hence, we define a subspace spanned by discriminant vectors as discriminant space , where the discriminant vectors are orthogonalized to each other by using the Gram-Schmidt orthonormalization.
5.5 Effectiveness of normalization
To get the maximum performance out of gFDA in a classification task, we essentially need to incorporate the normalization of orthogonal projection of data on the discriminant space into the mechanism of gFDA, where the normalization is defined as .
In the following, we discuss the reason for this from the viewpoint of the geometrical structure of gFDA. In a normalized space, only the first basis vectors are selected from the orthogonalized basis vectors of all the class subspaces, and the remaining basis vectors are discarded. As a result of this operation, the data of class are projected onto only the in the normalized space as shown in Fig.9a, when all the data of class are completely contained within the -th class subspace spanned by . Such a situation corresponds to the case in which the illumination subspace of an object contains any image of the object under various illumination conditions, as described in Sec.4.4. However, as it is in general difficult to generate such an illumination subspace in practical applications, the projected data points of the -th class in the normalized space can have nonzero components on the basis vectors of other classes. As a result, they are projected away from as shown in Fig.9c. This geometrical relationship remains in the discriminant space as shown in Figs.9b and d.
We should note that in the above process, the variation of the projections in the direction of necessarily remains even if we could generate the illumination subspace of class . Namely, we cannot in principle remove it. A valid way of ignoring this extra variation is to measure the similarity between the projections of an orthogonalized input and onto the discriminant space by the angle between them. Here, we need to use instead of as a similarity in order to deal with the cases in which the angle between and is over 90 degrees, i.e., . This angle-based classification corresponds to Euclidean-distance-based classification with the normalization, according to the law of cosines.
6 Connection of gFDA and GDS projection
In this section, we show a close connection between gFDA and GDS projection. To this end, we prove that gFDA is equivalent to GDS projection with a small correction term.
6.1 GDS projection with a small correction term
According to the new form of for gFDA in the previous section, we notice that gFDA is closely related to GDS projection [gds], which uses the eigenvectors corresponding to the smallest eigenvalues of only , because given . From the standpoint of GDS projection, can be regarded as a small correction on itself. Thus, we can regard gFDA as GDS projection with a small correction term of . Fig.10 summarizes the whole flow of the simplification from gFDA to GDS projection that has been discussed so far. The close connection suggests that GDS projection has discriminant ability and robustness against the SSS problem, as gFDA does. Fig.11 shows the comparison between gFDA and GDS projection on the examples that were used for the comparison of FDA and gFDA in Fig.6. We can see high similarity between the results of these two methods.
6.2 Geometry gap between gFDA and GDS
We now discuss the relationship between gFDA and GDS projection in more detail. For this purpose, we introduce a pair of vectors, and , between the -th orthonormal basis vectors and of classes and , where and . Note that .
With and , we rewrite matrix in Eq.(1) for GDS projection as follows:
Eq.(32) indicates that finding the smallest eigenvalues of can be regarded as the minimization problem on the sum of the projections of all the vectors and . In the same way, we rewrite for gFDA as follows:
We notice that only the weights on , that is, on the difference vectors, are different between Eq.(33) and Eq.(34), which are and , respectively. This difference in the weights produces a geometrical gap between gFDA and GDS projection. We measure the gap by using an index , which is defined as follows: