Fisher Discriminant Analysis (FDA) [friedman2001elements], first proposed in [fisher1936use], is a powerful subspace learning method that tries to minimize the intra-class scatter and maximize the inter-class scatter of the data for better separation of classes. FDA treats all pairs of classes the same way; however, some classes might be much farther from one another than others. In other words, the distances between classes differ. Closer classes need more attention because classifiers may more easily confuse them, whereas classes far from each other are generally easier to separate. The same problem exists in Kernel FDA (KFDA) [mika1999fisher] and in most subspace learning methods that are based on the generalized eigenvalue problem, such as FDA and KFDA [ghojogh2019roweis]; hence, a weighting procedure might be more appropriate.
In this paper, we propose several weighting procedures for FDA and KFDA. The contributions of this paper are three-fold: (1) proposing Cosine-Weighted FDA (CW-FDA) as a new modification of FDA, (2) proposing Automatically Weighted FDA (AW-FDA) as a new version of FDA in which the weights are set automatically, and (3) proposing Weighted KFDA (W-KFDA) to have weighting procedures in the feature space, where both the existing and the newly proposed weighting methods can be used in the feature space.
The paper is organized as follows: In Section 2, we briefly review the theory of FDA and KFDA. In Section 3, we formulate the weighted FDA, review the existing weighting methods, and then propose CW-FDA and AW-FDA. Section 4 proposes weighted KFDA in the feature space. In addition to using the existing methods for weighted KFDA, two versions of CW-KFDA and also AW-KFDA are proposed. Section 5 reports the experiments. Finally, Section 6 concludes the paper.
2 Fisher and Kernel Discriminant Analysis
2.1 Fisher Discriminant Analysis
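Let $\{x_i^{(j)}\}_{i=1}^{n_j}$ denote the samples of the $j$-th class, where $n_j$ is the class's sample size. Suppose $\mu_j \in \mathbb{R}^d$, $c$, $n$, and $U \in \mathbb{R}^{d \times d}$ denote the mean of the $j$-th class, the number of classes, the total sample size, and the projection matrix in FDA, respectively. Although some methods solve FDA using a least squares problem [zhang2010regularized, diaz2019deep], the regular FDA [fisher1936use] maximizes the Fisher criterion [xu2006analysis]:
$$f(U) := \frac{\mathbf{tr}(U^\top S_B U)}{\mathbf{tr}(U^\top S_W U)}, \tag{1}$$
where $\mathbf{tr}(\cdot)$ is the trace of a matrix. The Fisher criterion is a generalized Rayleigh-Ritz quotient [parlett1998symmetric]. We may recast the problem to [ghojogh2019fisher]:
$$\underset{U}{\text{maximize}} \quad \mathbf{tr}(U^\top S_B U) \quad \text{subject to} \quad U^\top S_W U = I, \tag{2}$$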
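where $S_W \in \mathbb{R}^{d \times d}$ and $S_B \in \mathbb{R}^{d \times d}$ are the intra- (within) and inter-class (between) scatters, respectively [ghojogh2019fisher]:
$$S_W := \sum_{j=1}^{c} \sum_{i=1}^{n_j} (x_i^{(j)} - \mu_j)(x_i^{(j)} - \mu_j)^\top = \sum_{j=1}^{c} \breve{X}_j \breve{X}_j^\top, \tag{3}$$
$$S_B := \sum_{r=1}^{c} \sum_{\ell=1}^{c} (\mu_r - \mu_\ell)(\mu_r - \mu_\ell)^\top = \sum_{r=1}^{c} M_r M_r^\top, \tag{4}$$
where $\mathbb{R}^{d \times n_j} \ni \breve{X}_j := [x_1^{(j)} - \mu_j, \dots, x_{n_j}^{(j)} - \mu_j]$ and $\mathbb{R}^{d \times c} \ni M_r := [\mu_r - \mu_1, \dots, \mu_r - \mu_c]$. The mean of the $j$-th class is $\mu_j := (1/n_j) \sum_{i=1}^{n_j} x_i^{(j)}$. The Lagrange relaxation [boyd2004convex] of the optimization problem is $\mathcal{L} = \mathbf{tr}(U^\top S_B U) - \mathbf{tr}\big(\Lambda^\top (U^\top S_W U - I)\big)$, where $\Lambda$ is a diagonal matrix which includes the Lagrange multipliers. Setting the derivative of the Lagrangian to zero gives:
$$\frac{\partial \mathcal{L}}{\partial U} = 2 S_B U - 2 S_W U \Lambda = 0 \implies S_B U = S_W U \Lambda, \tag{5}$$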
which is the generalized eigenvalue problem $(S_B, S_W)$, where the columns of $U$ and the diagonal of $\Lambda$ are the eigenvectors and eigenvalues, respectively [ghojogh2019eigenvalue]. The $p$ leading columns of $U$ (so as to have $U \in \mathbb{R}^{d \times p}$) are the FDA projection directions, where $p$ is the dimensionality of the subspace. Note that $p \leq \min(d, n-1, c-1)$ because of the ranks of the inter- and intra-class scatter matrices [ghojogh2019fisher].
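The generalized eigenvalue problem above can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function name `fda_directions`, the rows-as-samples layout, and the small ridge term `reg` (added so that the Cholesky factor of the intra-class scatter exists) are all assumptions of this sketch.

```python
import numpy as np

def fda_directions(X, y, p, reg=1e-6):
    """Return the p leading FDA directions for samples X (n x d) with labels y."""
    classes = np.unique(y)
    d = X.shape[1]
    means = np.stack([X[y == c].mean(axis=0) for c in classes])  # (c, d)

    # Intra-class scatter: outer products of class-centered samples.
    S_W = np.zeros((d, d))
    for c, mu in zip(classes, means):
        Xc = X[y == c] - mu
        S_W += Xc.T @ Xc

    # Inter-class scatter summed over all ordered pairs of class means.
    diffs = means[:, None, :] - means[None, :, :]                # (c, c, d)
    S_B = np.einsum("rli,rlj->ij", diffs, diffs)

    # Solve S_B u = lambda S_W u by whitening with a Cholesky factor of S_W.
    # The ridge reg * I is an assumption that keeps S_W positive definite.
    L = np.linalg.cholesky(S_W + reg * np.eye(d))
    L_inv = np.linalg.inv(L)
    eigvals, V = np.linalg.eigh(L_inv @ S_B @ L_inv.T)
    order = np.argsort(eigvals)[::-1][:p]                        # leading eigenpairs
    U = L_inv.T @ V[:, order]                                    # U^T S_W U is (nearly) I
    return U, eigvals[order], S_W, S_B
```

Projecting a data matrix `X` onto the subspace is then `X @ U` under the same layout assumption.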
2.2 Kernel Fisher Discriminant Analysis
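Let the scalar and matrix kernels be denoted by $k(x_1, x_2) := \phi(x_1)^\top \phi(x_2)$ and $K(X_1, X_2) := \Phi(X_1)^\top \Phi(X_2)$, respectively, where $\phi(\cdot)$ and $\Phi(\cdot)$ are the pulling functions. According to the representation theory [alperin1993local], any solution must lie in the span of all the training vectors; hence, $\Phi(U) = \Phi(X)\, Y$, where $X$ denotes the training data and $Y$ contains the coefficients. The optimization of kernel FDA is [mika1999fisher, ghojogh2019fisher]:
$$\underset{Y}{\text{maximize}} \quad \mathbf{tr}(Y^\top \Delta_B Y) \quad \text{subject to} \quad Y^\top \Delta_W Y = I, \tag{6}$$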
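where $\Delta_W \in \mathbb{R}^{n \times n}$ and $\Delta_B \in \mathbb{R}^{n \times n}$ are the intra- and inter-class scatters in the feature space, respectively [mika1999fisher, ghojogh2019fisher]:
$$\Delta_W := \sum_{j=1}^{c} K_j H_j K_j^\top, \tag{7}$$
$$\Delta_B := \sum_{r=1}^{c} \sum_{\ell=1}^{c} (\xi_r - \xi_\ell)(\xi_r - \xi_\ell)^\top = \sum_{r=1}^{c} \Xi_r \Xi_r^\top, \tag{8}$$
where $\mathbb{R}^{n_j \times n_j} \ni H_j := I - (1/n_j) \mathbf{1}\mathbf{1}^\top$ is the centering matrix, the $(a, b)$-th entry of $K_j \in \mathbb{R}^{n \times n_j}$ is $k(x_a, x_b^{(j)})$, the $a$-th entry of $\xi_j \in \mathbb{R}^n$ is $(1/n_j) \sum_{b=1}^{n_j} k(x_a, x_b^{(j)})$, and $\mathbb{R}^{n \times c} \ni \Xi_r := [\xi_r - \xi_1, \dots, \xi_r - \xi_c]$.
The $p$ leading columns of $Y$ (so as to have $Y \in \mathbb{R}^{n \times p}$) are the KFDA projection directions which span the subspace. Note that $p \leq \min(n, c-1)$ because of the ranks of the inter- and intra-class scatter matrices in the feature space [ghojogh2019fisher].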
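As an illustration of how the feature-space scatters can be assembled from a precomputed kernel matrix, the sketch below builds both matrices with NumPy. The function name `kfda_scatters` and the assumption that `K` is the full training kernel matrix, with rows and columns in the same order as the labels `y`, are choices of this sketch and are not taken from the paper.

```python
import numpy as np

def kfda_scatters(K, y):
    """Build the feature-space intra- and inter-class scatters from a kernel matrix.

    K: (n, n) kernel matrix over the training data; y: (n,) class labels.
    """
    n = K.shape[0]
    Delta_W = np.zeros((n, n))
    xi = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_j = idx.size
        K_j = K[:, idx]                                      # (n, n_j) kernel block of class j
        H_j = np.eye(n_j) - np.full((n_j, n_j), 1.0 / n_j)   # centering matrix
        Delta_W += K_j @ H_j @ K_j.T
        xi.append(K_j.mean(axis=1))                          # class mean in kernel coordinates
    xi = np.stack(xi)                                        # (c, n)
    diffs = xi[:, None, :] - xi[None, :, :]                  # pairwise mean differences
    Delta_B = np.einsum("rli,rlj->ij", diffs, diffs)
    return Delta_W, Delta_B
```

With a linear kernel, both matrices reduce to the input-space scatters sandwiched between the data matrix and its transpose, which gives a convenient sanity check.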
3 Weighted Fisher Discriminant Analysis
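The optimization of Weighted FDA (W-FDA) is as follows:
$$\underset{U}{\text{maximize}} \quad \mathbf{tr}(U^\top \widehat{S}_B U) \quad \text{subject to} \quad U^\top S_W U = I, \tag{9}$$
where the weighted inter-class scatter, $\widehat{S}_B \in \mathbb{R}^{d \times d}$, is defined as:
$$\widehat{S}_B := \sum_{r=1}^{c} \sum_{\ell=1}^{c} \alpha_{r\ell} (\mu_r - \mu_\ell)(\mu_r - \mu_\ell)^\top = \sum_{r=1}^{c} M_r A_r M_r^\top, \tag{10}$$
where $\alpha_{r\ell} \geq 0$ is the weight for the pair of the $r$-th and $\ell$-th classes, $M_r := [\mu_r - \mu_1, \dots, \mu_r - \mu_c]$, and $\mathbb{R}^{c \times c} \ni A_r := \mathbf{diag}([\alpha_{r1}, \dots, \alpha_{rc}]^\top)$. In FDA, we have $\alpha_{r\ell} = 1, \forall r, \ell \in \{1, \dots, c\}$. However, it is better for the weights to decrease with the distances between classes, so as to concentrate more on the nearby classes. We denote the distance between the $r$-th and $\ell$-th classes by $d_{r\ell} := \|\mu_r - \mu_\ell\|_2$. The solution to Eq. (9) is the generalized eigenvalue problem $(\widehat{S}_B, S_W)$, and the $p$ leading columns of $U$ span the subspace.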
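The weighted inter-class scatter can be sketched as below, using one diagonal weight matrix per class. The function name `weighted_between_scatter` and the convention that `alpha[r, l]` stores the weight of the class pair `(r, l)` are assumptions made for illustration.

```python
import numpy as np

def weighted_between_scatter(means, alpha):
    """Weighted inter-class scatter from class means (c x d) and pair weights (c x c)."""
    c, d = means.shape
    S_B_hat = np.zeros((d, d))
    for r in range(c):
        M_r = (means[r] - means).T      # (d, c): columns are mu_r - mu_l
        A_r = np.diag(alpha[r])         # (c, c) diagonal weight matrix of class r
        S_B_hat += M_r @ A_r @ M_r.T
    return S_B_hat
```

Passing `alpha = np.ones((c, c))` recovers the unweighted inter-class scatter of regular FDA.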
3.1 Existing Manual Methods
In the following, we review some of the existing weights for W-FDA.
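Approximate Pairwise Accuracy Criterion: The Approximate Pairwise Accuracy Criterion (APAC) method [loog2001multiclass] has the weight function:
$$\alpha_{r\ell} := \frac{1}{2\, d_{r\ell}^2}\, \mathbf{erf}\Big(\frac{d_{r\ell}}{2\sqrt{2}}\Big), \tag{11}$$
where $\mathbf{erf}(\cdot)$ is the error function:
$$\mathbf{erf}(x) := \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt. \tag{12}$$
This method approximates the Bayes error for class pairs.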
Powered Distance Weighting: The powered distance (POW) method [lotlikar2000fractional] uses the following weight function:
$$\alpha_{r\ell} := \frac{1}{d_{r\ell}^m}, \tag{13}$$
where $m > 0$ is an integer. As $\alpha_{r\ell}$ is supposed to drop faster than the increase of $d_{r\ell}^2$ in the scatter, we should have $m \geq 3$ (we use in the experiments).
Confused Distance Maximization: The Confused Distance Maximization (CDM) method [zhang2012confused] uses the confusion probability among the classes as the weight function:
$$\alpha_{r\ell} := \begin{cases} n_{\ell|r} / n_r & \text{if } r \neq \ell, \\ 0 & \text{otherwise}, \end{cases} \tag{14}$$
where $n_{\ell|r}$ is the number of points of class $r$ classified as class $\ell$ by a classifier such as quadratic discriminant analysis [zhang2012confused, ghojogh2019linear]. One problem of the CDM method is that if the classes are classified perfectly, all weights become zero. Being conditioned on the performance of a classifier is another flaw of this method.
$k$-Nearest Neighbors Weighting: The $k$-Nearest Neighbor ($k$NN) method [zhang2013evaluation] tries to push every class away from its $k$ nearest neighbor classes by defining the weight function as
$$\alpha_{r\ell} := \begin{cases} 1 & \text{if } \mu_\ell \in k\text{NN}(\mu_r), \\ 0 & \text{otherwise}. \end{cases} \tag{15}$$
The $k$NN and CDM methods are sparse so as to make use of the betting on sparsity principle [friedman2001elements, hastie2015statistical]. However, these methods have some shortcomings. For example, if two classes are far from one another in the input space, they are not considered in $k$NN or CDM, but in the obtained subspace, they may fall close to each other, which is not desirable. Another flaw of the $k$NN method is the assignment of $1$ to all $k$NN pairs, although among the $k$NN pairs some might be comparably closer.
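The distance-based weights reviewed above can be sketched as follows. This is an illustration under stated assumptions: class distances are taken as Euclidean distances between class means, the means are assumed distinct (so no distance off the diagonal is zero), the default `m=3` is only an illustrative choice, and all function names are hypothetical.

```python
import math
import numpy as np

def pairwise_mean_distances(means):
    """Euclidean distances between all pairs of class means (c x d input)."""
    diffs = means[:, None, :] - means[None, :, :]
    return np.linalg.norm(diffs, axis=2)

def apac_weights(D):
    """APAC weights: erf(d / (2 sqrt(2))) / (2 d^2) for each distinct class pair."""
    c = D.shape[0]
    W = np.zeros_like(D)
    for r in range(c):
        for l in range(c):
            if r != l:
                W[r, l] = math.erf(D[r, l] / (2 * math.sqrt(2))) / (2 * D[r, l] ** 2)
    return W

def pow_weights(D, m=3):
    """Powered-distance weights d^(-m); the diagonal is left at zero."""
    W = np.zeros_like(D)
    off = ~np.eye(D.shape[0], dtype=bool)
    W[off] = D[off] ** (-float(m))
    return W

def knn_weights(D, k):
    """Binary weights: 1 for the k nearest neighbor classes of each class, else 0."""
    c = D.shape[0]
    W = np.zeros_like(D)
    for r in range(c):
        others = [l for l in range(c) if l != r]
        nearest = sorted(others, key=lambda l: D[r, l])[:k]
        W[r, nearest] = 1.0
    return W
```

Note that the binary weights are not symmetric in general, since class `l` can be among the nearest neighbors of class `r` without the converse holding.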
3.2 Cosine Weighted Fisher Discriminant Analysis
The literature has shown that cosine similarity works very well with FDA, especially for face recognition [perlibakas2004distance, mohammadzade2012projection]. Moreover, according to opposition-based learning [tizhoosh2005opposition], capturing the similarity and dissimilarity of data points can improve the performance of learning. A promising operator for capturing similarity and dissimilarity (opposition) is cosine. Hence, we propose CW-FDA, as a manually weighted method, with cosine as the weight, defined as
$$\alpha_{r\ell} := 0.5 \times \big[1 + \cos(\angle(\mu_r, \mu_\ell))\big] = 0.5 \times \Big[1 + \frac{\mu_r^\top \mu_\ell}{\|\mu_r\|_2\, \|\mu_\ell\|_2}\Big], \tag{16}$$
to have $\alpha_{r\ell} \in [0, 1]$. Hence, the $r$-th weight matrix is $A_r := \mathbf{diag}(\alpha_{r\ell}, \forall \ell)$, which is used in Eq. (10). Note that as we do not care about $\alpha_{rr}$, because the inter-class scatter for $r = \ell$ is zero, we can set $\alpha_{rr} = 0$.
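A cosine-based weight matrix of this kind might be computed as in the sketch below. It assumes that the cosine between class means is shifted and scaled into the unit interval as `0.5 * (1 + cos)`, which is one natural way to obtain non-negative weights; the function name `cosine_weights` is hypothetical, and the class means are assumed to be non-zero vectors.

```python
import numpy as np

def cosine_weights(means):
    """Cosine-based pair weights in [0, 1] from class means (c x d); diagonal set to 0."""
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    cos = (means @ means.T) / (norms @ norms.T)
    # Clip guards against round-off pushing the cosine slightly outside [-1, 1].
    alpha = 0.5 * (1.0 + np.clip(cos, -1.0, 1.0))
    np.fill_diagonal(alpha, 0.0)        # a class paired with itself contributes nothing
    return alpha
```

Row `r` of the returned matrix holds the diagonal of the weight matrix of class `r`.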
3.3 Automatically Weighted Fisher Discriminant Analysis
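In AW-FDA, there are $c + 1$ matrix optimization variables, which are $U$ and $A_r \in \mathbb{R}^{c \times c}, \forall r \in \{1, \dots, c\}$, because the optimal weights are found at the same time as the Fisher criterion is maximized. Moreover, to use the betting on sparsity principle [friedman2001elements, hastie2015statistical], we can make the weight matrix sparse, so we use the “$\ell_0$” norm for the weights to be sparse. The optimization problem is as follows:
$$\underset{U,\, A_r}{\text{minimize}} \quad -\mathbf{tr}(U^\top \widehat{S}_B U) \quad \text{subject to} \quad U^\top S_W U = I, \quad \|A_r\|_0 \leq k, \ \forall r \in \{1, \dots, c\}. \tag{17}$$
We use alternating optimization [jain2017non] to solve this problem:
$$U^{(\tau+1)} := \arg\min_{U} \Big( -\mathbf{tr}(U^\top \widehat{S}_B^{(\tau)} U) \ \Big|\ U^\top S_W U = I \Big), \tag{18}$$
$$A_r^{(\tau+1)} := \arg\min_{A_r} \Big( -\mathbf{tr}(U^{(\tau+1)\top}\, \widehat{S}_B\, U^{(\tau+1)}) \ \Big|\ \|A_r\|_0 \leq k \Big), \quad \forall r, \tag{19}$$
where $\tau$ denotes the iteration.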
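Since we use an iterative solution for the optimization, it is better to normalize the weights in the weighted inter-class scatter; otherwise, the weights gradually explode to maximize the objective function. We use the $\ell_2$ (or Frobenius) norm for normalization for ease of taking derivatives. Hence, for AW-FDA, we slightly modify the weighted inter-class scatter as
$$\widehat{S}_B := \sum_{r=1}^{c} M_r \breve{A}_r M_r^\top, \quad \breve{A}_r := A_r / \|A_r\|_F^2, \tag{20}$$
where $\|A_r\|_F^2 = \|\mathbf{diag}(A_r)\|_2^2$ because $A_r$ is diagonal, and $\|\cdot\|_F$ is the Frobenius norm.
As discussed before, the solution to Eq. (18) is the generalized eigenvalue problem $(\widehat{S}_B^{(\tau)}, S_W)$. We use a step of gradient descent [nocedal2006numerical] to solve Eq. (19), followed by satisfying the “$\ell_0$” norm constraint [jain2017non]. The gradient is calculated as follows. Let
$$f(U, A_r) := -\mathbf{tr}(U^\top \widehat{S}_B U). \tag{21}$$
Using the chain rule, we have:
$$\mathbb{R}^{c \times c} \ni \frac{\partial f}{\partial A_r} = \mathbf{vec}^{-1}_{c \times c} \Big[ \Big(\frac{\partial \breve{A}_r}{\partial A_r}\Big)^\top \Big(\frac{\partial \widehat{S}_B}{\partial \breve{A}_r}\Big)^\top \mathbf{vec}\Big(\frac{\partial f}{\partial \widehat{S}_B}\Big) \Big], \tag{22}$$
where we use the Magnus-Neudecker convention in which matrices are vectorized, $\mathbf{vec}(\cdot)$ vectorizes the matrix, and $\mathbf{vec}^{-1}_{c \times c}$ is de-vectorization to a $c \times c$ matrix. We have $\partial f / \partial \widehat{S}_B = -U U^\top$, whose vectorization has dimensionality $d^2$. For the second derivative, we have:
$$\mathbb{R}^{d^2 \times c^2} \ni \frac{\partial \widehat{S}_B}{\partial \breve{A}_r} = M_r \otimes M_r, \tag{23}$$
where $\otimes$ denotes the Kronecker product. The third derivative is:
$$\mathbb{R}^{c^2 \times c^2} \ni \frac{\partial \breve{A}_r}{\partial A_r} = \frac{1}{\|A_r\|_F^2} \Big( I - \frac{2}{\|A_r\|_F^2}\, \mathbf{vec}(A_r)\, \mathbf{vec}(A_r)^\top \Big). \tag{24}$$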
The learning rate of gradient descent is calculated using line search [nocedal2006numerical].
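Because each weight matrix is diagonal, the gradient with respect to its diagonal can be written without Kronecker products. The sketch below assumes that the weights of class `r` are normalized by their squared Frobenius norm, so that the part of the objective depending on the diagonal `a` is `-(a . g) / ||a||^2`, where `g[l]` is the squared norm of the projected difference between the means of classes `r` and `l`. The function names are hypothetical, and the gradient can be checked against finite differences.

```python
import numpy as np

def pair_energies(U, means, r):
    """g[l] = ||U^T (mu_r - mu_l)||^2 for every class l; means is (c x d), U is (d x p)."""
    M_r = (means[r] - means).T                  # (d, c): columns are mu_r - mu_l
    return np.sum((U.T @ M_r) ** 2, axis=0)

def objective_r(a, g):
    """Part of the objective that depends on the diagonal weights a of class r."""
    return -(a @ g) / (a @ a)

def gradient_r(a, g):
    """Analytic gradient of objective_r with respect to a (quotient rule)."""
    sq = a @ a
    return -(g / sq - 2.0 * (a @ g) * a / sq ** 2)
```

A gradient descent step is then `a - eta * gradient_r(a, g)`, with the step size `eta` chosen by a line search.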
After the gradient descent step, to satisfy the condition $\|A_r\|_0 \leq k$, the solution is projected onto the set of this condition. Because $\mathbf{tr}(U^\top \widehat{S}_B U)$ should be maximized, this projection sets the $(c - k)$ smallest diagonal entries of $A_r$ to zero [jain2017non]. In case $k = c$, the projection of the solution is itself, and all the weights are kept.
After solving the optimization, the $p$ leading columns of $U$ are the AW-FDA projection directions that span the subspace.
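The projection onto the sparsity constraint amounts to hard thresholding of the diagonal weights. A minimal sketch, with the hypothetical name `project_l0` and the diagonal of one weight matrix passed as a vector:

```python
import numpy as np

def project_l0(a, k):
    """Keep the k largest entries of the weight vector a and set the rest to zero."""
    a = np.asarray(a, dtype=float)
    if k <= 0:
        return np.zeros_like(a)
    if k >= a.size:
        return a.copy()                 # nothing to discard: all weights are kept
    keep = np.argsort(a)[-k:]           # indices of the k largest entries
    out = np.zeros_like(a)
    out[keep] = a[keep]
    return out
```

For instance, `project_l0([0.2, 0.9, 0.1, 0.5], 2)` keeps only the entries `0.9` and `0.5`.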
4 Weighted Kernel Fisher Discriminant Analysis
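We define the optimization for Weighted Kernel FDA (W-KFDA) as:
$$\underset{Y}{\text{maximize}} \quad \mathbf{tr}(Y^\top \widehat{\Delta}_B Y) \quad \text{subject to} \quad Y^\top \Delta_W Y = I, \tag{25}$$
where the weighted inter-class scatter in the feature space, $\widehat{\Delta}_B \in \mathbb{R}^{n \times n}$, is defined as:
$$\widehat{\Delta}_B := \sum_{r=1}^{c} \sum_{\ell=1}^{c} \alpha_{r\ell} (\xi_r - \xi_\ell)(\xi_r - \xi_\ell)^\top = \sum_{r=1}^{c} \Xi_r A_r \Xi_r^\top. \tag{26}$$
The solution to Eq. (25) is the generalized eigenvalue problem $(\widehat{\Delta}_B, \Delta_W)$, and the $p$ leading columns of $Y$ span the subspace.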
4.1 Manually Weighted Methods in the Feature Space
All the existing weighting methods in the literature for W-FDA can be used as weights in W-KFDA to have W-FDA in the feature space. Therefore, Eqs. (11), (13), (14), and (15) can be used as weights in Eq. (26) to have W-KFDA with APAC, POW, CDM, and $k$NN weights, respectively. To the best of our knowledge, W-KFDA is novel and has not appeared in the literature. Note that there is a weighted KFDA in the literature [hamid2012weighted], but that is for data integration, which is for another purpose and has an entirely different approach.
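The CW-FDA can be used in the feature space to have CW-KFDA. For this, we propose two versions of CW-KFDA: (I) In the first version, we use Eq. (16), or $A_r := \mathbf{diag}(\alpha_{r\ell}, \forall \ell)$, in Eq. (26). (II) In the second version, we notice that cosine is based on the inner product, so the normalized kernel matrix between the means of classes can be used instead, in order to use the similarity/dissimilarity in the feature space rather than in the input space. Let $\mathbb{R}^{d \times c} \ni M := [\mu_1, \dots, \mu_c]$. Let $\widehat{K}_{r\ell} := K_{r\ell} / \sqrt{K_{rr} K_{\ell\ell}}$ be the normalized kernel matrix [ah2010normalized], where $K_{r\ell}$ denotes the $(r, \ell)$-th element of the kernel matrix $K(M, M) \in \mathbb{R}^{c \times c}$. The weights are $\alpha_{r\ell} := 0.5 \times [1 + \widehat{K}_{r\ell}]$, or $A_r := \mathbf{diag}(\alpha_{r\ell}, \forall \ell)$. We set $\alpha_{rr} = 0$.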
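The second version can be sketched by replacing the inner product between class means with a kernel evaluated on them. The code below is an illustration only: it assumes the same `0.5 * (1 + .)` mapping as for the input-space cosine, and the helper `rbf_kernel`, its `gamma` parameter, and the name `kernel_cosine_weights` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Radial basis kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_cosine_weights(means, kernel):
    """Pair weights from the normalized kernel matrix between class means (c x d)."""
    K = kernel(means, means)                    # (c, c) kernel between class means
    scale = np.sqrt(np.diag(K))
    K_hat = K / np.outer(scale, scale)          # normalized kernel: unit diagonal
    alpha = 0.5 * (1.0 + K_hat)
    np.fill_diagonal(alpha, 0.0)
    return alpha
```

With a linear kernel, `kernel_cosine_weights(means, lambda A, B: A @ B.T)` coincides with the input-space cosine weights, which makes the relation between the two versions explicit.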
4.2 Automatically Weighted Kernel Fisher Discriminant Analysis
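Similar to before, the optimization in AW-KFDA is:
$$\underset{Y,\, A_r}{\text{minimize}} \quad -\mathbf{tr}(Y^\top \widehat{\Delta}_B Y) \quad \text{subject to} \quad Y^\top \Delta_W Y = I, \quad \|A_r\|_0 \leq k, \ \forall r \in \{1, \dots, c\}, \tag{27}$$
where $\widehat{\Delta}_B := \sum_{r=1}^{c} \Xi_r \breve{A}_r \Xi_r^\top$, with $\breve{A}_r$ the normalized weight matrix of Eq. (20). This optimization is solved similarly to Eq. (17), where we have $Y$ rather than $U$. Here, the solution to Eq. (18) is the generalized eigenvalue problem $(\widehat{\Delta}_B^{(\tau)}, \Delta_W)$. Let $f(Y, A_r) := -\mathbf{tr}(Y^\top \widehat{\Delta}_B Y)$. Eq. (19) is solved similarly, but we use $\partial f / \partial \widehat{\Delta}_B = -Y Y^\top$ and $\partial \widehat{\Delta}_B / \partial \breve{A}_r = \Xi_r \otimes \Xi_r$.
After solving the optimization, the $p$ leading columns of $Y$ span the AW-KFDA subspace. Recall $\Phi(U) = \Phi(X)\, Y$. The projection of some data $X_t$ is $\Phi(U)^\top \Phi(X_t) = Y^\top \Phi(X)^\top \Phi(X_t) = Y^\top K(X, X_t)$.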
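The projection of new data in the feature space needs only the kernel between the training data and the new data. A small sketch, with the hypothetical name `project_kernel` and the kernel matrix passed in precomputed:

```python
import numpy as np

def project_kernel(Y, K_train_new):
    """Embed new data in the kernel subspace.

    Y: (n, p) coefficient matrix; K_train_new: (n, n_t) kernel between
    the n training samples and the n_t new samples. Returns a (p, n_t) embedding.
    """
    return Y.T @ K_train_new
```

With a linear kernel this agrees with projecting in the input space onto the directions obtained by multiplying the training data by the coefficients, which is a quick way to test an implementation.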
5 Experiments
5.1 Dataset
For experiments, we used the public ORL face recognition dataset [web_att_dataset] because face recognition has been a challenging task and FDA has frequently been used for face recognition (e.g., see [belhumeur1997eigenfaces, perlibakas2004distance, mohammadzade2012projection]). This dataset includes 40 classes, each having ten different poses of the facial picture of a subject, resulting in 400 total images. For computational reasons, we selected the first 20 classes and resampled the images to pixels. Please note that massive datasets are not feasible for KFDA/FDA because these methods require solving a generalized eigenvalue problem. Some samples of this dataset are shown in Fig. 1. The data were split into training and test sets with portions and were standardized to have mean zero and variance one.
5.2 Evaluation of the Embedding Subspaces
For the evaluation of the embedded subspaces, we used the 1-Nearest Neighbor (1NN) classifier because it is useful to evaluate the subspace by the closeness of the projected data samples. The training and out-of-sample (test) accuracies of classification are reported in Table 1. In the input space, $k$NN with have the best results but in , AW-FDA outperforms it in the generalization (test) result. The performances of CW-FDA and AW-FDA with are promising, although not the best. For instance, AW-FDA with outperforms weighted FDA with the APAC, POW, and CDM methods in the training embedding, while it has the same performance as $k$NN. In most cases, AW-FDA with all $k$ values has better performance than FDA, which shows the effectiveness of the obtained weights compared to the equal weights in FDA. Also, the sparse in AW-FDA outperforming FDA (with dense weights equal to one) validates the betting on sparsity.
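The 1NN evaluation of an embedding can be sketched as below. The function name `one_nn_accuracy` and the rows-as-samples layout of the projected data are assumptions of this illustration; it uses brute-force Euclidean distances, which is adequate for a dataset of a few hundred samples.

```python
import numpy as np

def one_nn_accuracy(Z_train, y_train, Z_test, y_test):
    """Accuracy of a 1-nearest-neighbor classifier on projected data (rows are samples)."""
    # Squared Euclidean distances between every test and every training sample.
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    pred = y_train[np.argmin(d2, axis=1)]       # label of the closest training sample
    return float(np.mean(pred == y_test))
```

Note that evaluating the training set against itself in this way trivially returns the label of each sample, so a training accuracy requires excluding each sample from its own neighbor search.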
In the feature space, where we used the radial basis kernel, AW-KFDA has the best performance with entirely accurate recognition. Both versions of CW-KFDA outperform regular KFDA and KFDA with CDM and $k$NN (with ) weighting. They also have better generalization than APAC and $k$NN with all $k$ values. Overall, the results show the effectiveness of the proposed weights in the input and feature spaces. Moreover, the existing weighting methods, which were for the input space, have outstanding performance when used in our proposed weighted KFDA (in the feature space). This shows the validity of the proposed weighted KFDA even for the existing weighting methods.
[Table 1: Training and test accuracy of 1NN classification in the embedding subspaces of the input and feature spaces.]
5.3 Comparison of Fisherfaces
Figure 2 depicts the four leading eigenvectors obtained from the different methods, including FDA itself. These ghost faces, or so-called Fisherfaces [belhumeur1997eigenfaces], capture the critical facial features that discriminate the classes in the subspace. Note that Fisherfaces cannot be shown for kernel FDA because its projection directions are $n$-dimensional. CDM has captured some pixels as features because all its weights have become zero due to its explained flaw (see Section 3.1 and Fig. 3). The Fisherfaces, in most of the methods including CW-FDA, capture information of facial organs such as hair, forehead, eyes, chin, and mouth. The features of AW-FDA are more akin to the Haar wavelet features, which are useful for facial feature detection [wang2014analysis].
5.4 Comparison of the Weights
We show the weights obtained by the different methods in Fig. 3. The weights of APAC and POW are too small, while the range of weights in the other methods is more reasonable. The weights of CDM have all become zero because the samples were perfectly classified (recall the flaw of CDM). The weights of the $k$NN method are only zero and one, which is a flaw of this method because, amongst the neighbors, some classes are closer. This issue does not exist in AW-FDA with different $k$ values. Moreover, although not all the obtained weights are visually interpretable, some non-zero weights in AW-FDA or AW-KFDA, with e.g. , show the meaningfulness of the obtained weights (noticing Fig. 1). For example, the non-zero pairs , , , , in AW-FDA and the pairs , , , in AW-KFDA make sense visually because the subjects wear glasses, so their classes are close to one another.
6 Conclusion
In this paper, we discussed that FDA and KFDA have a fundamental flaw: they treat all pairs of classes in the same way, while some classes are closer to each other and should be processed with more care for better discrimination. We proposed CW-FDA with cosine weights and also AW-FDA, in which the weights are found automatically. We also proposed a weighted KFDA to weight FDA in the feature space. We proposed AW-KFDA and two versions of CW-KFDA, and also utilized the existing weighting methods for weighted KFDA. The experiments, in which we evaluated the embedding subspaces, the Fisherfaces, and the weights, showed the effectiveness of the proposed methods. The proposed weighted FDA methods outperformed regular FDA and many of the existing weighting methods for FDA. For example, AW-FDA with outperformed weighted FDA with the APAC, POW, and CDM methods in the training embedding. In the feature space, AW-KFDA obtained perfect discrimination.