1 Introduction
Kernel alignment [2] is a way to incorporate class label information into kernels, which are traditionally constructed directly from the data without using class labels. Kernel alignment can be viewed as a measure of consistency between the similarity function (the kernel) and the class structure of the data; improving this consistency makes the data more separable under the label-aligned kernel. Kernel alignment has recently been applied to pattern recognition and feature selection [3, 28, 10, 11, 4].

In this paper, we show that if we use the widely used linear kernel together with a kernel built from class indicators, the resulting kernel alignment objective is very similar to the well-known linear discriminant analysis (LDA), involving the familiar between-class scatter matrix and total scatter matrix. We call this objective function kernel alignment induced LDA (kaLDA). If we transform the data into a linear subspace, the optimal transformation is the one that maximizes this kaLDA objective.
We further analyze kaLDA and propose a Stiefel-manifold gradient descent algorithm to solve it. We also extend kaLDA to multi-label problems. Surprisingly, the scatter matrices arising in multi-label kernel alignment are identical to those developed for Multi-label LDA [21].
We perform extensive experiments comparing kaLDA with other approaches on 8 single-label datasets and 6 multi-label datasets. Results show that the kernel alignment LDA approach performs well in terms of classification accuracy and F1 score.
2 From Kernel Alignment to LDA
Kernel alignment is a similarity measure between a kernel function and a target function. In other words, kernel alignment evaluates the degree of fit between the data in kernel space and the target function. For this reason, we usually build the target kernel from the class indicator function, while the other kernel is built from the data matrix. By measuring the similarity between the data kernel and the class indicator kernel, we get a sense of how easily the data can be separated in the kernel subspace. The alignment of two kernels $K_1$ and $K_2$ is given as [2]:
$\rho(K_1, K_2) = \dfrac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F\, \langle K_2, K_2 \rangle_F}},$   (1)

where $\langle A, B \rangle_F = \mathrm{Tr}(A^T B)$ is the Frobenius inner product.
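For concreteness, Eq. (1) can be computed directly from two precomputed Gram matrices. The following NumPy sketch (the function name is ours, for illustration) evaluates the alignment:

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Alignment of Eq. (1): <K1,K2>_F / sqrt(<K1,K1>_F <K2,K2>_F)."""
    inner = np.sum(K1 * K2)  # Frobenius inner product Tr(K1^T K2)
    return inner / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
```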
We first introduce some notation, then present Theorem 2.1 and the kernel alignment projective function.
Let the data matrix be $X = (x_1, \dots, x_n) \in \mathbb{R}^{p \times n}$, where $p$ is the data dimension, $n$ is the number of data points, and $x_i$ is a data point. Let the normalized class indicator matrix be $Q \in \mathbb{R}^{n \times k}$, which was used to prove the equivalence between PCA and K-means clustering [26, 5]:

$Q_{ij} = \begin{cases} 1/\sqrt{n_j}, & \text{if } x_i \text{ belongs to class } j, \\ 0, & \text{otherwise,} \end{cases}$   (2)

where $k$ is the total number of classes and $n_j$ is the number of data points in class $j$. The mean of class $j$ is $m_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i$ and the total mean of the data is $m = \frac{1}{n} \sum_{i=1}^{n} x_i$.
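A small sketch of Eq. (2), assuming labels are encoded as integers $0, \dots, k-1$; the orthonormality $Q^T Q = I_k$ noted in the comment is what makes $\langle K_2, K_2 \rangle_F$ a constant in Theorem 2.1 below:

```python
import numpy as np

def class_indicator(labels, k):
    """Normalized class indicator Q of Eq. (2): Q[i, j] = 1/sqrt(n_j) if x_i is in class j."""
    n = len(labels)
    Q = np.zeros((n, k))
    for j in range(k):
        members = (labels == j)
        Q[members, j] = 1.0 / np.sqrt(members.sum())  # assumes every class is non-empty
    return Q  # columns are orthonormal: Q.T @ Q = I_k
```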
Theorem 2.1
Define the data kernel and the class label kernel as follows:

$K_1 = X^T X, \qquad K_2 = Q Q^T.$   (3)

Then we have

$\rho(K_1, K_2) = c\, \dfrac{\mathrm{Tr}(S_b)}{\sqrt{\mathrm{Tr}(S_t^2)}},$   (4)

where $c = 1/\sqrt{k}$ is a constant independent of the data $X$.
Furthermore, let $G \in \mathbb{R}^{p \times m}$ be a linear transformation to an $m$-dimensional subspace,

$x_i \rightarrow \tilde{x}_i = G^T x_i, \quad \text{i.e.,} \quad X \rightarrow \tilde{X} = G^T X;$   (5)

then we have

$\rho(\tilde{K}_1, K_2) = c\, \dfrac{\mathrm{Tr}(G^T S_b G)}{\sqrt{\mathrm{Tr}\big[(G^T S_t G)^2\big]}},$   (6)
where

$S_b = \sum_{j=1}^{k} n_j\, (m_j - m)(m_j - m)^T,$   (7)

$S_t = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T.$   (8)
Theorem 2.1 shows that kernel alignment can be expressed using the scatter matrices $S_b$ and $S_t$. In applications, we adjust $G$ so that the kernel alignment is maximized, i.e., we solve the following problem:

$\max_{G}\ \dfrac{\mathrm{Tr}(G^T S_b G)}{\sqrt{\mathrm{Tr}\big[(G^T S_t G)^2\big]}}.$   (9)
In general, the columns of $G$ are assumed to be linearly independent.
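A direct transcription of Eq. (9), given precomputed scatter matrices, reads as follows (a sketch; Section 3 additionally restricts $G$ to orthonormal columns):

```python
import numpy as np

def kalda_objective(G, Sb, St):
    """kaLDA objective of Eq. (9): Tr(G^T Sb G) / sqrt(Tr((G^T St G)^2))."""
    M = G.T @ St @ G
    return np.trace(G.T @ Sb @ G) / np.sqrt(np.trace(M @ M))
```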
A striking feature of this kernel alignment problem is that it is very similar to classic LDA.
2.1 Proof of Theorem 2.1 and Analysis
Here we note a useful lemma and then prove Theorem 2.1.
In most data analyses, the data are centered, i.e., $\sum_{i=1}^{n} x_i = 0$. Here we assume the data are already centered; the following results remain correct if the data are not centered. We have the following relations:
Lemma 1
The scatter matrices can be expressed as:

$S_b = X Q Q^T X^T,$   (10)

$S_t = X X^T.$   (11)
These results are previously known; see, for example, Theorem 3 of [5]. Theorem 2.1 now follows: using Lemma 1, $\langle K_1, K_2 \rangle_F = \mathrm{Tr}(X^T X Q Q^T) = \mathrm{Tr}(S_b)$ and $\langle K_1, K_1 \rangle_F = \mathrm{Tr}\big[(X^T X)^2\big] = \mathrm{Tr}(S_t^2)$, while $\langle K_2, K_2 \rangle_F = \mathrm{Tr}\big[(Q^T Q)^2\big] = k$ since $Q^T Q = I_k$; thus Eq. (4) holds with $c = 1/\sqrt{k}$. Substituting $\tilde{X} = G^T X$ gives Eq. (6).
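Lemma 1 is easy to check numerically on random centered data, reusing the class_indicator sketch from above (a consistency check for illustration, not part of the paper's algorithm):

```python
import numpy as np

p, n, k = 5, 30, 3
labels = np.repeat(np.arange(k), n // k)   # 10 points per class
rng = np.random.default_rng(0)
X = rng.normal(size=(p, n))
X -= X.mean(axis=1, keepdims=True)         # center: sum_i x_i = 0

Q = class_indicator(labels, k)
m = X.mean(axis=1)                         # zero vector after centering
Sb = sum((labels == j).sum()
         * np.outer(X[:, labels == j].mean(axis=1) - m,
                    X[:, labels == j].mean(axis=1) - m) for j in range(k))
St = sum(np.outer(X[:, i] - m, X[:, i] - m) for i in range(n))
assert np.allclose(Sb, X @ Q @ Q.T @ X.T)  # Eq. (10)
assert np.allclose(St, X @ X.T)            # Eq. (11)
```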
2.2 Relation to Classical LDA
In classical LDA, the between-class scatter matrix $S_b$ is defined as in Eq. (7), and the within-class scatter matrix $S_w$ and total scatter matrix $S_t$ are defined as:

$S_w = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - m_j)(x_i - m_j)^T, \qquad S_t = S_b + S_w,$   (12)

where $m_j$ is the mean of class $j$. Classical LDA finds a projection matrix $G$ that minimizes the within-class scatter and maximizes the between-class scatter using the following objective:
$\max_{G}\ \mathrm{Tr}\big[(G^T S_w G)^{-1} (G^T S_b G)\big],$   (13)

or

$\max_{G}\ \dfrac{\mathrm{Tr}(G^T S_b G)}{\mathrm{Tr}(G^T S_w G)}.$   (14)
Eq. (14) is also called the trace ratio (TR) problem [22]. It is easy to see¹ that Eq. (14) can be expressed as

$\max_{G}\ \dfrac{\mathrm{Tr}(G^T S_b G)}{\mathrm{Tr}(G^T S_t G)}.$   (15)

¹ Eq. (14) is equivalent to $\min_G \mathrm{Tr}(G^T S_w G)/\mathrm{Tr}(G^T S_b G)$, which, using $S_t = S_b + S_w$, is $\min_G \mathrm{Tr}(G^T S_t G)/\mathrm{Tr}(G^T S_b G) - 1$. Reversing to maximization, we obtain Eq. (15).
As we can see, the kernel alignment LDA objective of Eq. (9) is very similar to Eq. (15). Thus kernel alignment provides an interesting alternative explanation of LDA. In fact, we can similarly show that in Eq. (9), $\mathrm{Tr}(G^T S_b G)$ is also maximized, as in standard LDA. First, Eq. (9) is equivalent to

$\max_{G}\ \mathrm{Tr}(G^T S_b G), \quad \text{s.t. } \mathrm{Tr}\big[(G^T S_t G)^2\big] = c,$

where $c$ is a fixed value. The precise value of $c$ is unimportant, since the scale of $G$ is undefined in LDA: if $G$ is an optimal solution and $\lambda$ is any real number, then $\lambda G$ is also an optimal solution with the same optimal objective value. The above optimization is approximately equivalent to

$\max_{G}\ \mathrm{Tr}(G^T S_b G), \quad \text{s.t. } \mathrm{Tr}(G^T S_t G) = c'.$

Since $S_t = S_b + S_w$, this is the same as

$\max_{G}\ \mathrm{Tr}(G^T S_b G), \quad \text{s.t. } \mathrm{Tr}(G^T S_b G) + \mathrm{Tr}(G^T S_w G) = c'.$

In other words, $\mathrm{Tr}(G^T S_b G)$ is maximized while $\mathrm{Tr}(G^T S_w G)$ is minimized, recovering the main theme of LDA.
3 Computational Algorithm
In this section, we develop an efficient algorithm to solve the kaLDA objective of Eq. (9):

$\max_{G}\ J(G) = \dfrac{\mathrm{Tr}(G^T S_b G)}{\sqrt{\mathrm{Tr}\big[(G^T S_t G)^2\big]}}, \quad \text{s.t. } G^T G = I.$   (16)
The condition $G^T G = I$ ensures that the different columns of $G$ are mutually orthogonal. The gradient of $J(G)$ is

$\dfrac{\partial J}{\partial G} = \dfrac{2\, S_b G}{\sqrt{b}} - \dfrac{2a\, S_t G\, (G^T S_t G)}{b^{3/2}},$   (17)

where $a = \mathrm{Tr}(G^T S_b G)$ and $b = \mathrm{Tr}\big[(G^T S_t G)^2\big]$.
The constraint $G^T G = I$ keeps $G$ on the Stiefel manifold, and movement along this manifold imposes a restriction on the gradient; the required geometry has been worked out in [6]. The gradient that preserves the manifold structure is

$\nabla J = \dfrac{\partial J}{\partial G} - G \left(\dfrac{\partial J}{\partial G}\right)^T G.$   (18)
Thus the algorithm computes the new $G$ as follows:

$G \leftarrow G + \alpha\, \nabla J.$   (19)
The step size $\alpha$ is usually chosen as:

$\alpha = \gamma\, \dfrac{\|G\|_F}{\|\nabla J\|_F},$   (20)

where $\gamma$ is a small positive constant.
Occasionally, due to loss of numerical accuracy, we apply a projection (re-orthonormalization) to restore $G^T G = I$. Starting from the standard LDA solution for $G$, this procedure is iterated until it converges to a local optimum; in practice, the objective converges quickly when $\gamma$ is chosen properly. Figure 1 shows that the objective converges in about 200 iterations on the datasets ATT, Binalpha, Mnist, and Umist (the datasets are described in the experiments section). In summary, the kernel alignment LDA (kaLDA) procedure is shown in Algorithm 1.
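The iteration can be sketched as follows (our reading of Algorithm 1; the LDA-based initialization via pinv, the normalized step-size rule, and the stopping test are reconstructions under the assumptions stated above, not the authors' exact code):

```python
import numpy as np

def kalda(X, Q, m_dim, gamma=0.1, max_iter=1000, tol=1e-8):
    """Maximize Eq. (16) over G with G^T G = I by Stiefel-manifold gradient ascent."""
    X = X - X.mean(axis=1, keepdims=True)   # center the data
    Sb = X @ Q @ Q.T @ X.T                  # Eq. (10)
    St = X @ X.T                            # Eq. (11)
    # Start from the standard LDA solution: top eigenvectors of pinv(St) @ Sb.
    w, V = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    order = np.argsort(-w.real)
    G, _ = np.linalg.qr(V.real[:, order[:m_dim]])
    prev = -np.inf
    for _ in range(max_iter):
        a = np.trace(G.T @ Sb @ G)
        M = G.T @ St @ G
        b = np.trace(M @ M)
        if abs(a / np.sqrt(b) - prev) < tol:        # objective has stabilized
            break
        prev = a / np.sqrt(b)
        dJ = 2.0 * Sb @ G / np.sqrt(b) - 2.0 * a * St @ G @ M / b**1.5  # Eq. (17)
        grad = dJ - G @ dJ.T @ G                    # Stiefel gradient, Eq. (18)
        alpha = gamma * np.linalg.norm(G) / max(np.linalg.norm(grad), 1e-12)  # Eq. (20)
        G, _ = np.linalg.qr(G + alpha * grad)       # Eq. (19) + projection to G^T G = I
    return G
```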
To show the effectiveness of the proposed kaLDA, we visualize a real dataset in 2D subspaces in Figure 2. In this example, we take 3 classes of the 644-dimensional Umist data, with 18 data points in each class. Figure 2(a) shows the original data projected onto the 2D PCA subspace: blue points belong to class 1, red circles to class 2, and black squares to class 3. Data points from the three classes are mixed together in the 2D PCA subspace, and it is difficult to find a linear boundary separating points of different classes. Figure 2(b) shows the data in the 2D standard LDA subspace, where points from different classes are projected into distinct clusters. Figure 2(c) shows the data projected into the 2D kaLDA subspace: compared to Figure 2(b), the within-class distances are much smaller and the distances between different classes are larger.
4 Extension to Multilabel Data
Multi-label problems arise frequently in image and video annotation, multi-topic text categorization, music classification, etc. [21]. In multi-label data, a data point can have several class labels, i.e., belong to several classes. For example, an image could carry the labels "cloud", "building", and "tree". This differs from the single-label setting, where each point has exactly one class label. Multi-label data are natural and common in everyday life: a film can simultaneously be classified as "drama", "romance", and "historic" (if it is about a true story), and a news article can have topic labels such as "economics" and "sports".
The kernel alignment approach extends easily and naturally to multi-label data, because the class label kernel can be defined clearly and unambiguously from the class label matrix for both single-label and multi-label datasets, while the data kernel is defined as usual. In the following we develop this approach further.
One important result of our kernel alignment approach for single-label data is its close relationship with LDA. For multi-label data, each data point can belong to several classes, so the standard scatter matrices become ambiguous: they are defined only for single-label data, where each data point belongs to exactly one class. However, our kernel alignment approach applied to multi-label data leads to new definitions of the scatter matrices and a similar objective function; this can be viewed as a generalization of LDA from single-label to multi-label data via the kernel alignment approach.
Indeed, the new scatter matrices we obtain from the kernel alignment approach are identical to those of the so-called "multi-label LDA" [21], which was developed from a class-separation, probabilistic point of view, very different from ours. The fact that these two approaches lead to the same set of scatter matrices shows that the resulting multi-label LDA framework has a broad theoretical basis.
We first present some notation for multi-label data and then describe the kernel alignment approach for multi-label data in Theorem 4.1. The class label matrix $Y \in \{0, 1\}^{n \times k}$ for the data is given as:

$Y_{ij} = \begin{cases} 1, & \text{if } x_i \text{ belongs to class } j, \\ 0, & \text{otherwise.} \end{cases}$   (21)

Let $n_j = \sum_{i=1}^{n} Y_{ij}$ be the number of data points in class $j$. Note that for multi-label data, $\sum_{j=1}^{k} n_j \geq n$. The normalized class indicator matrix is given as:

$Q_{ij} = Y_{ij} / \sqrt{n_j}.$   (22)
Let $\ell_i$ be the number of classes that $x_i$ belongs to; thus the $\ell_i$ are the weights of the data points, a point being counted once for each class it carries. Define the diagonal weight matrix $W = \mathrm{diag}(\ell_1, \dots, \ell_n)$. The kernel alignment formulation for multi-label data can be stated as
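A sketch of Eqs. (21), (22) and the weight matrix, taking the binary label matrix $Y$ as input; the diagonal form $W = \mathrm{diag}(\ell_1, \dots, \ell_n)$ follows our reading of the definitions above:

```python
import numpy as np

def multilabel_indicator(Y):
    """Y: n-by-k binary label matrix (Eq. 21).
    Returns the normalized indicator Q (Eq. 22) and the weight matrix W."""
    n_j = Y.sum(axis=0)              # class sizes; for multi-label data, sum(n_j) >= n
    Q = Y / np.sqrt(n_j)             # Q[i, j] = Y[i, j] / sqrt(n_j)
    ell = Y.sum(axis=1)              # ell_i = number of classes x_i belongs to
    W = np.diag(ell.astype(float))   # diagonal weight matrix (our reading of the text)
    return Q, W
```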
Theorem 4.1
For multi-label data $X$, let the data kernel and the class label kernel be

$K_1 = W^{1/2}\, X^T X\, W^{1/2}, \qquad K_2 = W^{-1/2}\, Q Q^T\, W^{-1/2}.$   (23)

We have the alignment

$\rho(K_1, K_2) = c\, \dfrac{\mathrm{Tr}(S_b)}{\sqrt{\mathrm{Tr}(S_t^2)}},$   (24)

where $c$ is a constant independent of the data $X$, and $S_b$, $S_t$ are given in Eqs. (27), (28).
Furthermore, let $G \in \mathbb{R}^{p \times m}$ be the linear transformation to an $m$-dimensional subspace,

$X \rightarrow \tilde{X} = G^T X;$   (25)

then we have

$\rho(\tilde{K}_1, K_2) = c\, \dfrac{\mathrm{Tr}(G^T S_b G)}{\sqrt{\mathrm{Tr}\big[(G^T S_t G)^2\big]}}.$   (26)
The matrices in Theorem 4.1 are defined as:

$S_b = \sum_{j=1}^{k} n_j\, (m_j - m)(m_j - m)^T,$   (27)

$S_t = \sum_{j=1}^{k} \sum_{i=1}^{n} Y_{ij}\, (x_i - m)(x_i - m)^T = \sum_{i=1}^{n} \ell_i\, (x_i - m)(x_i - m)^T,$   (28)

where $m_j$ is the mean of class $j$ and $m$ is the global mean, defined as:

$m_j = \dfrac{1}{n_j} \sum_{i=1}^{n} Y_{ij}\, x_i, \qquad m = \dfrac{1}{n} \sum_{i=1}^{n} x_i.$   (29)
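The multi-label scatter matrices of Eqs. (27), (28) can be assembled directly from $X$ and $Y$; a sketch (the simple global mean is our assumption, consistent with the centered-data analysis below):

```python
import numpy as np

def multilabel_scatter(X, Y):
    """Sb, St of Eqs. (27)-(28) for p-by-n data X and n-by-k binary labels Y."""
    n_j = Y.sum(axis=0).astype(float)
    m = X.mean(axis=1, keepdims=True)    # global mean, Eq. (29)
    M = (X @ Y) / n_j                    # class means m_j as columns, Eq. (29)
    D = M - m                            # centered class means
    Sb = (D * n_j) @ D.T                 # Eq. (27): sum_j n_j (m_j - m)(m_j - m)^T
    Xc = X - m
    ell = Y.sum(axis=1).astype(float)    # each point counted once per label
    St = (Xc * ell) @ Xc.T               # Eq. (28): sum_i ell_i (x_i - m)(x_i - m)^T
    return Sb, St
```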
Therefore, we can seek an optimal subspace for multi-label data by solving Eq. (16) with $S_b$ and $S_t$ given in Eqs. (27), (28).
4.1 Proof of Theorem 4.1 and Equivalence to Multi-label LDA
Here we note a useful lemma for multi-label data and then prove Theorem 4.1. We consider the case where the data are centered, i.e., $\sum_{i=1}^{n} x_i = 0$; the results also hold when the data are not centered, but the proofs are slightly more complicated.

Lemma 2
For centered multi-label data, the scatter matrices of Eqs. (27), (28) can be expressed as:

$S_b = X Q Q^T X^T,$   (30)

$S_t = X W X^T.$   (31)

Proof
To prove Eq. (31), note that $\sum_{j=1}^{k} Y_{ij} = \ell_i$; thus

$S_t = \sum_{j=1}^{k} \sum_{i=1}^{n} Y_{ij}\, x_i x_i^T = \sum_{i=1}^{n} \ell_i\, x_i x_i^T = X W X^T.$

To prove Eq. (26), substitute the transformed data $\tilde{X} = G^T X$ into Eqs. (30), (31): the scatter matrices become $G^T S_b G$ and $G^T S_t G$, and Eq. (26) follows from the definition of alignment, as in the proof of Theorem 2.1.

For single-label data, $W = I$, so Eqs. (30), (31) reduce to Eqs. (10), (11), and Theorem 4.1 reduces to Theorem 2.1.
As we can see, surprisingly, the scatter matrices of Eqs. (27), (28) arising in Theorem 4.1 are identical to those of the Multi-label LDA proposed in [21].
Table 1: Attributes of the single-label datasets (n: number of data points; p: dimension; k: number of classes).

Data      | n    | p    | k
----------|------|------|----
Caltec07  | 210  | 432  | 7
Caltec20  | 1230 | 432  | 20
MSRC      | 210  | 432  | 7
ATT       | 400  | 644  | 40
Binalpha  | 1014 | 320  | 26
Mnist     | 150  | 784  | 10
Umist     | 360  | 644  | 20
Pie       | 680  | 1024 | 68
Table 2: Classification accuracy on the single-label datasets (average of 5-fold cross validation).

Data     | kaLDA  | LDA    | TR     | sdpLDA | MMC    | RLDA   | OCM
---------|--------|--------|--------|--------|--------|--------|-------
Caltec07 | 0.7524 | 0.6619 | 0.6762 | 0.5619 | 0.6000 | 0.7952 | 0.7619
Caltec20 | 0.7068 | 0.6320 | 0.4465 | 0.3386 | 0.5838 | 0.6812 | 0.6696
MSRC     | 0.7762 | 0.6857 | 0.5714 | 0.5952 | 0.5667 | 0.7333 | 0.7286
ATT      | 0.9775 | 0.9750 | 0.9675 | 0.9750 | 0.9750 | 0.9675 | 0.9675
Binalpha | 0.7817 | 0.6078 | 0.4620 | 0.2507 | 0.7638 | 0.7983 | 0.8204
Mnist    | 0.8800 | 0.8733 | 0.8667 | 0.8467 | 0.8467 | 0.8667 | 0.8467
Umist    | 0.9900 | 0.9900 | 0.9917 | 0.9133 | 0.9633 | 0.9800 | 0.9783
Pie      | 0.8765 | 0.8838 | 0.8441 | 0.8632 | 0.8676 | 0.6515 | 0.6515
5 Related Work
Linear Discriminant Analysis (LDA) is a widely used dimension reduction and subspace learning algorithm, and many LDA reformulations have been published in recent years. The Trace Ratio problem seeks a subspace transformation matrix $G$ such that the within-class distance is minimized and the between-class distance is maximized; formally, it maximizes the ratio of two trace terms, $\mathrm{Tr}(G^T S_b G)/\mathrm{Tr}(G^T S_t G)$ [22, 13], where $S_t$ is the total scatter matrix and $S_b$ is the between-class scatter matrix. Other popular LDA approaches include regularized LDA (RLDA) [9], the Orthogonal Centroid Method (OCM) [18], Uncorrelated LDA (ULDA) [23], and Orthogonal LDA (OLDA) [23]. These approaches mainly compute an eigen-decomposition involving the scatter matrices, but use different formulations of the total scatter matrix [24].
Maximum Margin Criterion (MMC) [17] is a simpler and more efficient method: it finds a subspace projection matrix $G$ that maximizes $\mathrm{Tr}\big[G^T (S_b - S_w) G\big]$. Though in a different way, MMC also maximizes the between-class distance while minimizing the within-class distance. Semi-definite positive LDA (sdpLDA) [14], derived from the maximum margin principle, solves a related maximization whose criterion involves the largest eigenvalue of a product of scatter matrices.

Multi-label problems arise frequently in image and video annotation and many other related applications, such as multi-topic text categorization [21].
There are many multi-label dimension reduction approaches, such as Multi-label Linear Regression (MLR), Multi-label informed Latent Semantic Indexing (MLSI) [25], Multi-label Dimensionality reduction via Dependence Maximization (MDDM) [27], Multi-Label Least Squares (MLLS) [12], and Multi-label Linear Discriminant Analysis (MLDA) [21].

6 Experiments
Table 3: Attributes of the multi-label datasets (n: number of data points; p: dimension; k: number of classes).

Data      | n     | p   | k
----------|-------|-----|----
MSRC-MOM  | 591   | 384 | 23
Barcelona | 139   | 48  | 4
Emotion   | 593   | 72  | 6
Yeast     | 2,417 | 103 | 14
MSRC-SIFT | 591   | 240 | 23
Scene     | 2,407 | 294 | 6
Table 4: Macro-averaged classification accuracy on the multi-label datasets (average of 5-fold cross validation).

Data      | kaLDA  | MLSI   | MDDM   | MLLS   | MLDA
----------|--------|--------|--------|--------|-------
MSRC-MOM  | 0.9150 | 0.8962 | 0.9044 | 0.8994 | 0.9036
Barcelona | 0.6579 | 0.6436 | 0.6470 | 0.6524 | 0.6290
Emotion   | 0.7634 | 0.7397 | 0.7540 | 0.7529 | 0.7619
Yeast     | 0.7405 | 0.7317 | 0.7371 | 0.7364 | 0.7368
MSRC-SIFT | 0.8839 | 0.8762 | 0.8800 | 0.8807 | 0.8858
Scene     | 0.8870 | 0.8534 | 0.8713 | 0.8229 | 0.8771
In this section, we first compare kernel alignment LDA (kaLDA) with six other methods on 8 single-label datasets, and then compare the multi-label version of kaLDA with four other methods on 6 multi-label datasets.
6.1 Comparison with Trace Ratio w.r.t. subspace dimension
Eight single-label datasets are used in this experiment. They come from different domains: the image scene datasets Caltec [8] and MSRC [16], the face datasets ATT, Umist, and Pie [19], and the digit datasets Mnist [15] and Binalpha. Table 1 summarizes the attributes of these datasets.
Caltec07 and Caltec20 are subsets of the Caltech 101 data. Only the HOG feature is used in this paper.
MSRC is an image scene dataset including trees, buildings, planes, cows, faces, cars, and so on. It has 210 images from 7 classes, and each image is represented by a 432-dimensional feature vector.
ATT contains 400 images of 40 persons, with 10 images per person. The images have been resized to $28 \times 23$ (644 dimensions).
Binalpha contains 26 binary handwritten alphabets, with 1014 images in total; each image is represented by a 320-dimensional vector.
Mnist is a handwritten digit dataset; the digits have been size-normalized and centered. The subset used here has 10 classes and 150 images in total, with 784 dimensions per image.
Umist (the Sheffield Face Database) is a face image dataset with 360 images of 20 individuals of mixed race, gender, and appearance.
Pie is a face database collected by the Carnegie Mellon Robotics Institute between October and December 2000. In total, it contains 68 different persons.
In this part, we compare the classification accuracy of kaLDA and Trace Ratio [22] with respect to the subspace dimension; note that the dimension of the subspace found by kaLDA is not restricted to $k-1$. After the subspace projection, a KNN classifier is applied to perform classification. Results are shown in Figure 3, where solid lines denote kaLDA accuracy and dashed lines denote Trace Ratio accuracy. As we can see, in Figures 3(a), 3(b), 3(c), 3(g), and 3(h), kaLDA has higher accuracy than Trace Ratio when using the same number of reduced features. In Figures 3(d), 3(e), and 3(f), kaLDA has classification accuracy competitive with Trace Ratio. Moreover, kaLDA is more stable than Trace Ratio: for example, in Figures 3(f) and 3(g), the Trace Ratio accuracy decreases as the number of features increases.

6.2 Comparison with other LDA methods
Table 5: Macro-averaged F1 score on the multi-label datasets.

Dataset   | kaLDA  | MLSI   | MDDM   | MLLS   | MLDA
----------|--------|--------|--------|--------|-------
MSRC-MOM  | 0.6104 | 0.5244 | 0.5593 | 0.5426 | 0.5571
Barcelona | 0.7377 | 0.7286 | 0.7301 | 0.7341 | 0.7169
Emotion   | 0.6274 | 0.5873 | 0.6101 | 0.6041 | 0.6200
Yeast     | 0.5757 | 0.5568 | 0.5696 | 0.5691 | 0.5693
MSRC-SIFT | 0.4712 | 0.4334 | 0.4522 | 0.4544 | 0.4773
Scene     | 0.6851 | 0.5911 | 0.6411 | 0.5048 | 0.6568
Table 6: Micro-averaged F1 score on the multi-label datasets.

Dataset   | kaLDA  | MLSI   | MDDM   | MLLS   | MLDA
----------|--------|--------|--------|--------|-------
MSRC-MOM  | 0.5138 | 0.4064 | 0.4432 | 0.4370 | 0.4448
Barcelona | 0.6969 | 0.6891 | 0.6861 | 0.6904 | 0.6772
Emotion   | 0.6203 | 0.5779 | 0.6030 | 0.5961 | 0.6151
Yeast     | 0.4249 | 0.4026 | 0.4205 | 0.4216 | 0.4213
MSRC-SIFT | 0.3943 | 0.3510 | 0.3637 | 0.3667 | 0.3959
Scene     | 0.6966 | 0.6006 | 0.6493 | 0.5062 | 0.6643
We compare kaLDA with six other methods: LDA, Trace Ratio (TR), sdpLDA, Maximum Margin Criterion (MMC), regularized LDA (RLDA), and the Orthogonal Centroid Method (OCM). All methods reduce the data to $k-1$ dimensions, and KNN is applied to perform classification after the data are projected into the selected subspace. The other algorithms were introduced in the related work section. The final classification accuracy, averaged over 5-fold cross validation, is reported in Table 2; the first column reports the kaLDA accuracy. kaLDA has the highest accuracy on 4 out of 8 datasets: Caltec20, MSRC, ATT, and Mnist. On Umist and Pie, the kaLDA results are very close to the highest accuracy. Overall, kaLDA performs better than the other methods.
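For reference, the single-label evaluation protocol can be sketched as follows, reusing the class_indicator and kalda sketches above; scikit-learn and the choice of one nearest neighbor are our tooling assumptions, since the extracted text elides the KNN parameter:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, labels, k, m_dim, n_splits=5):
    """Average KNN accuracy over folds after projecting onto the kaLDA subspace.
    X: p-by-n data; labels: integer array of length n; k: number of classes."""
    accs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X.T):
        Q = class_indicator(labels[tr], k)   # indicator built from the training fold
        G = kalda(X[:, tr], Q, m_dim)        # fit the projection on training data only
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit((G.T @ X[:, tr]).T, labels[tr])
        accs.append(knn.score((G.T @ X[:, te]).T, labels[te]))
    return float(np.mean(accs))
```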
6.3 Multilabel Classification
Six multi-label datasets are used in this part, covering image features, music emotion, and so on. Table 3 summarizes the attributes of these datasets.
The MSRC-MOM and MSRC-SIFT datasets are provided by Microsoft Research in Cambridge and include 591 images of 23 classes. MSRC-MOM uses the moment invariants (MOM) feature of the images, with 384 dimensions per image; MSRC-SIFT uses the SIFT feature, with 240 dimensions per image. About 80% of the images are annotated with at least one class, with about three classes per image on average.

Barcelona contains 139 images with 4 classes: "building", "flora", "people", and "sky". Each image has at least two labels.
Emotion [20] is a music emotion dataset comprising 593 songs labeled with 6 emotions; its feature dimension is 72.
Yeast [7] is a multi-label dataset containing functional classes of genes of the yeast Saccharomyces cerevisiae.
Scene [1] contains images of still scenes with semantic labels. It has 2,407 images from 6 classes.
We use 5-fold cross validation to evaluate the classification performance of the different algorithms. A K-Nearest Neighbor (KNN) classifier is used after the subspace projection. The algorithms compared in this section are Multi-label informed Latent Semantic Indexing (MLSI), Multi-label Dimensionality reduction via Dependence Maximization (MDDM), Multi-Label Least Squares (MLLS), and Multi-label Linear Discriminant Analysis (MLDA); all were introduced in the related work section.
We compare the performance of kaLDA and the other algorithms using macro-averaged accuracy (Table 4), macro-averaged F1 score (Table 5), and micro-averaged F1 score (Table 6). Accuracy and F1 score are computed using the standard binary classification definitions for each class. In multi-label classification, the macro average is a class-wise average that gives equal weight to every class, whereas the micro average pools the decisions over all samples and is therefore influenced by the number of samples in each class [21]. kaLDA achieves the highest classification accuracy on 5 out of 6 datasets; on the remaining dataset, MSRC-SIFT, the kaLDA result is very close to the best method, MLDA, and beats all the remaining methods. kaLDA also achieves the highest macro and micro F1 scores on 5 out of 6 datasets, and the second-highest macro and micro F1 scores on MSRC-SIFT. Overall, kaLDA outperforms the other multi-label algorithms in terms of classification accuracy and macro and micro F1 score.
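To make the macro/micro distinction concrete, here is how the two averages behave on a toy multi-label prediction (the arrays are illustrative placeholders; scikit-learn's f1_score implements both conventions):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label predictions: rows are samples, columns are classes.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

macro = f1_score(y_true, y_pred, average='macro')  # per-class F1, then unweighted mean
micro = f1_score(y_true, y_pred, average='micro')  # pool all label decisions, then one F1
print(macro, micro)
```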
7 Conclusions
In this paper, we have proposed a new kernel alignment induced LDA (kaLDA), whose objective function is very similar to the classical LDA objective. A Stiefel-manifold gradient descent algorithm solves the kaLDA objective efficiently. We have also extended kaLDA to multi-label problems. Extensive experiments show the effectiveness of kaLDA on both single-label and multi-label problems.
Acknowledgment. This work is partially supported by US NSF CCF-0917274 and NSF DMS-0915228.
References

[1] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004)
[2] Cristianini, N., Shawe-Taylor, J., Elisseeff, A., Kandola, J.S.: On kernel-target alignment. Advances in Neural Information Processing Systems 14, 367 (2002)
[3] Cristianini, N., et al.: Method of using kernel alignment to extract significant features from a large dataset. US Patent 7,299,213 (2007)
[4] Cuturi, M.: Fast global alignment kernels. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 929–936 (2011)
[5] Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the International Conference on Machine Learning (ICML 2004) (2004)
[6] Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
[7] Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: NIPS, vol. 14, pp. 681–687 (2001)
[8] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1), 59–70 (2007)
[9] Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1), 86–100 (2007)
[10] Hoi, S.C., Lyu, M.R., Chang, E.Y.: Learning the unified kernel machines for classification. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 187–196. ACM (2006)
[11] Howard, A., Jebara, T.: Transformation learning via kernel alignment. In: International Conference on Machine Learning and Applications (ICMLA '09), pp. 301–308. IEEE (2009)
[12] Ji, S., Tang, L., Yu, S., Ye, J.: Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 381–389. ACM (2008)
[13] Jia, Y., Nie, F., Zhang, C.: Trace ratio problem revisited. IEEE Transactions on Neural Networks 20(4), 729–735 (2009)
[14] Kong, D., Ding, C.: A semi-definite positive linear discriminant analysis and its applications. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 942–947. IEEE (2012)
[15] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
[16] Lee, Y.J., Grauman, K.: Foreground focus: Unsupervised learning from partially matching images. International Journal of Computer Vision 85(2), 143–166 (2009)
[17] Li, H., Jiang, T., Zhang, K.: Efficient and robust feature extraction by maximum margin criterion. IEEE Transactions on Neural Networks 17(1), 157–165 (2006)
[18] Park, H., Jeon, M., Rosen, J.B.: Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics 43(2), 427–448 (2003)
[19] Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression (PIE) database of human faces. Tech. Rep. CMU-RI-TR-01-02, Robotics Institute, Pittsburgh, PA (January 2001)
[20] Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.P.: Multi-label classification of music into emotions. In: ISMIR, vol. 8, pp. 325–330 (2008)
[21] Wang, H., Ding, C., Huang, H.: Multi-label linear discriminant analysis. In: Computer Vision – ECCV 2010, pp. 126–139. Springer (2010)
[22] Wang, H., Yan, S., Xu, D., Tang, X., Huang, T.: Trace ratio vs. ratio trace for dimensionality reduction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8. IEEE (2007)
[23] Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 483–502 (2005)
[24] Ye, J., Ji, S.: Discriminant analysis for dimensionality reduction: An overview of recent developments. In: Biometrics: Theory, Methods, and Applications. Wiley-IEEE Press, New York (2010)
[25] Yu, K., Yu, S., Tresp, V.: Multi-label informed latent semantic indexing. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 258–265. ACM (2005)
[26] Zha, H., Ding, C., Gu, M., He, X., Simon, H.: Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS '01), pp. 1057–1064
[27] Zhang, Y., Zhou, Z.H.: Multi-label dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data (TKDD) 4(3), 14 (2010)
[28] Zhu, X., Kandola, J., Ghahramani, Z., Lafferty, J.D.: Nonparametric transforms of graph kernels for semi-supervised learning. In: Advances in Neural Information Processing Systems, pp. 1641–1648 (2004)