Kernel alignment is a way to incorporate class label information into kernels, which are traditionally constructed directly from data without using class labels. Kernel alignment can be viewed as a measure of consistency between the similarity function (the kernel) and the class structure in the data. Improving this consistency makes the data more separable under the class-label-aligned kernel. Kernel alignment has recently been applied to pattern recognition and feature selection [3, 28, 10, 11, 4].
In this paper, we find that if we use the widely used linear kernel and a kernel built from class indicators, the resulting kernel alignment objective is very similar to the widely used linear discriminant analysis (LDA), involving the well-known between-class scatter matrix $S_b$ and total scatter matrix $S_t$. We call this objective function kernel alignment induced LDA (kaLDA). If we transform the data into a linear subspace, the optimal transformation is obtained by maximizing this kaLDA objective.
We further analyze this kaLDA objective and propose a Stiefel-manifold gradient descent algorithm to solve it. We also extend kaLDA to multi-label problems. Surprisingly, the scatter matrices arising in multi-label kernel alignment are identical to those developed in Multi-label LDA.
We perform extensive experiments comparing kaLDA with other approaches on 8 single-label and 6 multi-label datasets. Results show that the kernel alignment LDA approach performs well in terms of classification accuracy and F1 score.
2 From Kernel Alignment to LDA
Kernel alignment is a similarity measure between a kernel function and a target function. In other words, kernel alignment evaluates how well the data in kernel space fit the target function. For this reason, we usually set the target function to be the class indicator function, while the other kernel is constructed from the data matrix. By measuring the similarity between the data kernel and the class indicator kernel, we get a sense of how easily the data can be separated in the kernel subspace. The alignment of two kernels $K_1$ and $K_2$ is given as
$$A(K_1, K_2) = \frac{\langle K_1, K_2\rangle_F}{\|K_1\|_F \, \|K_2\|_F}.$$
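As a quick numerical illustration of this definition, the following is a minimal numpy sketch (our own example; the toy data, class sizes, and function name are hypothetical, not from the paper):

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Kernel-target alignment: <K1, K2>_F / (||K1||_F ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Toy example: linear data kernel vs. class-indicator kernel (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 30))               # p = 5 features, n = 30 points
labels = np.repeat([0, 1, 2], 10)          # 3 classes, 10 points each
Y = np.eye(3)[labels]                      # n x K one-hot class indicator
F = Y / np.sqrt(Y.sum(axis=0))             # normalized class indicator
print(kernel_alignment(X.T @ X, F @ F.T))  # alignment between data and label kernels
```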
We first introduce some notation, and then present Theorem 2.1 and the kernel alignment projective function.
Let the data matrix be $X = (x_1, \ldots, x_n) \in \Re^{p \times n}$, where $p$ is the data dimension, $n$ is the number of data points, and $x_i$ is a data point. Let the normalized class indicator matrix be $F = (f_1, \ldots, f_K) \in \Re^{n \times K}$, with $F_{ik} = 1/\sqrt{n_k}$ if $x_i$ belongs to class $k$ and $F_{ik} = 0$ otherwise, which was used to prove the equivalence between PCA and K-means clustering [26, 5]. Here $K$ is the total number of classes and $n_k$ is the number of data points in class $k$. The class mean is $m_k = \frac{1}{n_k}\sum_{x_i \in C_k} x_i$ and the total mean of the data is $m = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Define the data kernel and the class label kernel as
$$K_X = X^\top X, \qquad K_Y = F F^\top.$$
Theorem 2.1. With the above definitions,
$$A(K_X, K_Y) = \frac{\operatorname{tr}(S_b)}{c\,\sqrt{\operatorname{tr}(S_t S_t)}},$$
where $c = \|K_Y\|_F$ is a constant independent of $X$.
Furthermore, let $G \in \Re^{p \times k}$ be a linear transformation to a $k$-dimensional subspace, $X \to G^\top X$; then
$$A(K_X, K_Y) = \frac{\operatorname{tr}(G^\top S_b G)}{c\,\sqrt{\operatorname{tr}\big[(G^\top S_t G)^2\big]}}.$$
Theorem 2.1 shows that kernel alignment can be expressed using the scatter matrices $S_b$ and $S_t$. In applications, we adjust $G$ such that the kernel alignment is maximized, i.e., we solve the following problem:
$$\max_{G} \; \frac{\operatorname{tr}(G^\top S_b G)}{\sqrt{\operatorname{tr}\big[(G^\top S_t G)^2\big]}}, \quad \text{s.t. } G^\top G = I.$$
In general, the columns of $G$ are assumed to be linearly independent.
A striking feature of this kernel alignment problem is that it is very similar to classic LDA.
2.1 Proof of Theorem 2.1 and Analysis
Here we note a useful lemma and then prove Theorem 2.1.
In most data analyses, the data are centered, i.e., $\sum_i x_i = 0$. Here we assume the data are already centered; the following results remain correct if the data are not centered. We have the following relations:
The scatter matrices can be expressed as
$$S_b = X F F^\top X^\top, \qquad S_t = X X^\top.$$
These results are previously known, for example, Theorem 3 of .
where we used Lemma 1, and $c = \|K_Y\|_F$ is a constant independent of the data $X$.
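These identities are easy to verify numerically. The following sketch (our own check, assuming centered data, the linear kernel $K_X = X^\top X$, and the indicator kernel $K_Y = FF^\top$ defined above) confirms $S_b = XFF^\top X^\top$, $S_t = XX^\top$, and the alignment expression of Theorem 2.1:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, K = 5, 30, 3
labels = np.repeat(np.arange(K), n // K)
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)        # center the data: sum_i x_i = 0

Y = np.eye(K)[labels]                        # one-hot indicator, n x K
n_k = Y.sum(axis=0)                          # class sizes
F = Y / np.sqrt(n_k)                         # normalized class indicator

# Scatter matrices from their definitions (m = 0 after centering)
m = X.mean(axis=1, keepdims=True)
S_b = sum(n_k[k] * (X[:, labels == k].mean(axis=1, keepdims=True) - m)
                 @ (X[:, labels == k].mean(axis=1, keepdims=True) - m).T
          for k in range(K))
S_t = (X - m) @ (X - m).T

print(np.allclose(S_b, X @ F @ F.T @ X.T))   # S_b = X F F^T X^T
print(np.allclose(S_t, X @ X.T))             # S_t = X X^T for centered data

# Alignment equals tr(S_b) / (c * sqrt(tr(S_t S_t))) with c = ||K_Y||_F
K_X, K_Y = X.T @ X, F @ F.T
align = np.sum(K_X * K_Y) / (np.linalg.norm(K_X) * np.linalg.norm(K_Y))
c = np.linalg.norm(K_Y)
print(np.isclose(align, np.trace(S_b) / (c * np.sqrt(np.trace(S_t @ S_t)))))
```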
2.2 Relation to Classical LDA
In classical LDA, the between-class scatter matrix is defined as in Eq.(7), and the within-class scatter matrix and total scatter matrix are defined as
$$S_w = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - m_k)(x_i - m_k)^\top, \qquad
S_t = \sum_{i=1}^{n} (x_i - m)(x_i - m)^\top = S_b + S_w,$$
where $m_k$ is the mean of class $k$ and $m$ is the total mean. Classical LDA finds a projection matrix $G$ that minimizes $\operatorname{tr}(G^\top S_w G)$ and maximizes $\operatorname{tr}(G^\top S_b G)$ using the following objective:
$$\max_{G} \; \frac{\operatorname{tr}(G^\top S_b G)}{\operatorname{tr}(G^\top S_w G)}.$$
Eq.(14) is also called the trace ratio (TR) problem. It is easy to see that Eq.(14) can be expressed as
$$\max_{G} \; \frac{\operatorname{tr}(G^\top S_b G)}{\operatorname{tr}(G^\top S_t G)},$$
since Eq.(14) is equivalent to minimizing $\operatorname{tr}(G^\top S_w G)/\operatorname{tr}(G^\top S_b G)$, which, using $S_t = S_b + S_w$, equals $\operatorname{tr}(G^\top S_t G)/\operatorname{tr}(G^\top S_b G) - 1$; reversing to maximization gives Eq.(15).
As we can see, the kernel alignment LDA objective function Eq.(9) is very similar to Eq.(15). Thus kernel alignment provides an interesting alternative explanation of LDA. In fact, we can similarly show that in Eq.(9), $\operatorname{tr}(G^\top S_b G)$ is maximized as in standard LDA. First, Eq.(9) is equivalent to
where $\eta$ is a fixed value. The precise value of $\eta$ is unimportant, since the scale of $G$ is undefined in LDA: if $G^*$ is an optimal solution and $a$ is any nonzero real number, then $aG^*$ is also an optimal solution with the same optimal objective value. The above optimization is approximately equivalent to
This is the same as
In other words, $\operatorname{tr}(G^\top S_b G)$ is maximized while $\operatorname{tr}(G^\top S_t G)$ is minimized, recovering the main theme of LDA.
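One way to write out the chain of reformulations described above, assuming the objective form of Theorem 2.1 (our own sketch, with $\eta$ and $\eta'$ as hypothetical fixed constants):

```latex
% Sketch of the reformulation chain, assuming the objective of Theorem 2.1.
\begin{align*}
\max_{G^\top G = I}
  \frac{\operatorname{tr}(G^\top S_b G)}
       {\sqrt{\operatorname{tr}\!\big[(G^\top S_t G)^2\big]}}
\;&\Longleftrightarrow\;
  \max_{G}\ \operatorname{tr}(G^\top S_b G)
  \quad \text{s.t.}\ \operatorname{tr}\!\big[(G^\top S_t G)^2\big] = \eta \\
\;&\approx\;
  \max_{G}\ \operatorname{tr}(G^\top S_b G)
  \quad \text{s.t.}\ \operatorname{tr}(G^\top S_t G) = \eta' ,
\end{align*}
% i.e., tr(G'S_bG) is maximized while tr(G'S_tG) is kept fixed or small,
% which is the main theme of classical LDA.
```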
3 Computational Algorithm
In this section, we develop an efficient algorithm to solve the kaLDA objective function Eq.(9):
$$\max_{G^\top G = I} \; J(G) = \frac{\operatorname{tr}(G^\top S_b G)}{\sqrt{\operatorname{tr}\big[(G^\top S_t G)^2\big]}}.$$
The condition $G^\top G = I$ ensures that the columns of $G$ are mutually independent. The gradient of $J(G)$ is
$$\frac{\partial J}{\partial G} = \frac{2\,S_b G}{\sqrt{\operatorname{tr}\big[(G^\top S_t G)^2\big]}} - \frac{2\,\operatorname{tr}(G^\top S_b G)\; S_t G\, G^\top S_t G}{\operatorname{tr}\big[(G^\top S_t G)^2\big]^{3/2}}.$$
The constraint $G^\top G = I$ restricts $G$ to the Stiefel manifold. Variation of $G$ on this manifold follows parallel transport, which places a restriction on the gradient; this has been worked out in . The gradient that preserves the manifold structure is
$$\nabla J = \frac{\partial J}{\partial G} - G \left(\frac{\partial J}{\partial G}\right)^{\!\top} G.$$
Thus the algorithm computes the new $G$ as follows:
$$G \leftarrow G + \alpha \, \nabla J.$$
The step size $\alpha$ is usually chosen as:
Occasionally, due to loss of numerical accuracy, we use a projection to restore $G^\top G = I$. Starting with the standard LDA solution for $G$, the algorithm is iterated until it converges to a local optimum. In fact, the objective function converges quickly when the step size is chosen properly. Figure 1 shows that the objective converges in about 200 iterations on the ATT, Binalpha, Mnist, and Umist datasets (more details about the datasets are given in the experiments section). In summary, the kernel alignment LDA (kaLDA) procedure is shown in Algorithm 1.
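A minimal numpy sketch of this procedure (our own illustration, not the paper's implementation: the gradient is derived from the objective $\operatorname{tr}(G^\top S_b G)/\sqrt{\operatorname{tr}[(G^\top S_t G)^2]}$, the tangent-space projection follows the Stiefel-manifold treatment referenced above, and the step size, iteration count, and random initialization are placeholder choices; the paper initializes from the standard LDA solution):

```python
import numpy as np

def kalda(S_b, S_t, k, alpha=1e-3, n_iter=500, seed=0):
    """Gradient ascent for J(G) = tr(G'S_bG) / sqrt(tr[(G'S_tG)^2]) with G'G = I."""
    p = S_b.shape[0]
    rng = np.random.default_rng(seed)
    # Random orthonormal start; the paper starts from the standard LDA solution.
    G, _ = np.linalg.qr(rng.normal(size=(p, k)))
    for _ in range(n_iter):
        M = G.T @ S_t @ G
        a = np.trace(G.T @ S_b @ G)
        b = np.trace(M @ M)
        # Euclidean gradient of J(G)
        dJ = 2.0 * S_b @ G / np.sqrt(b) - 2.0 * a * (S_t @ G @ M) / b ** 1.5
        # Keep only the component tangent to the Stiefel manifold
        grad = dJ - G @ dJ.T @ G
        G = G + alpha * grad
        # Re-orthonormalize to restore G'G = I (numerical safeguard)
        G, _ = np.linalg.qr(G)
    return G

# Usage sketch: G = kalda(S_b, S_t, k=2)
```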
To show the effectiveness of the proposed kaLDA, we visualize a real dataset in a 2-D subspace in Figure 2. In this example, we take 3 classes of the 644-dimensional Umist data, with 18 data points in each class. Figure 2(a) shows the original data projected onto the 2-D PCA subspace. Blue points are in class 1, red circles are in class 2, and black squares are in class 3. Data points from the three classes are mixed together in the 2-D PCA subspace, and it is difficult to find a linear boundary separating points of different classes. Figure 2(b) shows the data in the 2-D standard LDA subspace; data points from different classes have been projected into different clusters. Figure 2(c) shows the data projected into the 2-D kaLDA subspace. Compared to Figure 2(b), the within-class distances in Figure 2(c) are much smaller, and the distances between different classes are larger.
4 Extension to Multi-label Data
Multi-label problems arise frequently in image and video annotation, multi-topic text categorization, music classification, etc. In multi-label data, a data point can have several class labels (i.e., belong to several classes). For example, an image could have the labels “cloud”, “building”, and “tree”. This differs from the single-label setting, where each point has exactly one class label. Multi-label data are very natural and common in everyday life. For example, a film can be simultaneously classified as “drama”, “romance”, and “historic” (if it is about a true story), and a news article can have topic labels such as “economics”, “sports”, etc.
The kernel alignment approach can be easily and naturally extended to multi-label data, because the class label kernel can be clearly and unambiguously defined from the class label matrix for both single-label and multi-label datasets, while the data kernel is defined as usual. In the following we develop this approach further.
One important result of our kernel alignment approach for single-label data is its close relationship with LDA. For multi-label data, each data point can belong to several classes, so the standard scatter matrices are ambiguous: they are only defined for single-label data where each data point belongs to exactly one class. However, our kernel alignment approach on multi-label data leads to new definitions of the scatter matrices and a similar objective function; this can be viewed as a generalization of LDA from single-label to multi-label data via the kernel alignment approach.
Indeed, the new scatter matrices we obtain from the kernel alignment approach are identical to those of the so-called “multi-label LDA”, which was developed from a class-separation, probabilistic point of view, very different from ours. The fact that these two approaches lead to the same set of scatter matrices shows that the resulting multi-label LDA framework has a broad theoretical basis.
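To make the multi-label notation of the next paragraphs concrete, here is a toy numpy sketch (the label matrix and all values are our own hypothetical example):

```python
import numpy as np

# Toy multi-label class indicator: 5 points, 3 classes ("cloud", "building", "tree").
Y = np.array([[1, 0, 0],
              [1, 1, 0],    # this point carries two labels
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])   # this point carries all three labels
n_k = Y.sum(axis=0)         # points per class; sum(n_k) >= n for multi-label data
ell = Y.sum(axis=1)         # number of classes each point belongs to
F = Y / np.sqrt(n_k)        # normalized class indicator, F_ik = Y_ik / sqrt(n_k)
K_Y = F @ F.T               # class-label kernel built from the indicator
print(n_k, ell)
```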
We first present some notation for multi-label data and then describe the kernel alignment approach for multi-label data in Theorem 4.1. The class label matrix $Y \in \{0,1\}^{n \times K}$ is given as $Y_{ik} = 1$ if $x_i$ belongs to class $k$, and $Y_{ik} = 0$ otherwise.
Let $n_k = \sum_{i} Y_{ik}$ be the number of data points in class $k$. Note that for multi-label data, $\sum_k n_k \ge n$. The normalized class indicator matrix is given as $F_{ik} = Y_{ik}/\sqrt{n_k}$.
Let $\ell_i = \sum_k Y_{ik}$ be the number of classes that $x_i$ belongs to; these define the weights of the data points. Define the corresponding diagonal weight matrix $W$. The kernel alignment formulation for multi-label data can be stated as
For multi-label data $X$, let the data kernel and class label kernel be
We have the alignment
Furthermore, let $G$ be the linear transformation to a $k$-dimensional subspace,
The matrices in Theorem 4.1 are defined as:
where $m_k$ is the mean of class $k$ and $m$ is the global mean, defined as:
4.1 Proof of Theorem 4.1 and Equivalence to Multi-label LDA
Here we note a useful lemma for multi-label data and then prove Theorem 4.1. We consider the case where the data are centered, i.e., $\sum_i x_i = 0$. The results also hold when the data are not centered, but the proofs are slightly more complicated.
From the definitions of $Y$ and $F$ for multi-label data, we have
Thus we recover Eq.(27).
To prove Eq.(31), note that thus
To prove Eq.(26),
5 Related Work
Linear Discriminant Analysis (LDA) is a widely used dimension reduction and subspace learning algorithm, and many LDA reformulations have been published in recent years. The Trace Ratio problem is to find a subspace transformation matrix such that the within-class distance is minimized and the between-class distance is maximized. Formally, Trace Ratio maximizes the ratio of two trace terms, $\operatorname{tr}(G^\top S_b G)/\operatorname{tr}(G^\top S_t G)$ [22, 13], where $S_t$ is the total scatter matrix and $S_b$ is the between-class scatter matrix. Other popular LDA approaches include regularized LDA (RLDA), the Orthogonal Centroid Method (OCM), Uncorrelated LDA (ULDA), Orthogonal LDA (OLDA), etc. These approaches mainly compute the eigendecomposition of $S_t^{-1}S_b$ (or a variant), but use different formulations of the total scatter matrix $S_t$.
Maximum Margin Criterion (MMC) is a simpler and more efficient method. MMC finds a subspace projection matrix to maximize $\operatorname{tr}\big[G^\top (S_b - S_w) G\big]$. Although formulated differently, MMC also maximizes the between-class distance while minimizing the within-class distance. Semi-Definite Positive LDA (sdpLDA) solves the maximization of $\operatorname{tr}\big[G^\top (S_b - \lambda_1 S_w) G\big]$, where $\lambda_1$ is the largest eigenvalue of $S_w^{-1}S_b$. sdpLDA is derived from the maximum margin principle.
Multi-label problems arise frequently in image and video annotation and many other related applications, such as multi-topic text categorization. There are many multi-label dimension reduction approaches, such as Multi-label Linear Regression (MLR), Multi-label informed Latent Semantic Indexing (MLSI), Multi-label Dimensionality reduction via Dependence Maximization (MDDM), Multi-Label Least Squares (MLLS), and Multi-label Linear Discriminant Analysis (MLDA).
6 Experiments
In this section, we first compare kernel alignment LDA (kaLDA) with six other methods on 8 single-label datasets, and then compare the multi-label version of kaLDA with four other methods on 6 multi-label datasets.
6.1 Comparison with Trace Ratio w.r.t. subspace dimension
Eight single-label datasets are used in this experiment. These datasets come from different domains, including the image scene datasets Caltec and MSRC, the face datasets ATT, Umist, and Pie, and the digit datasets Mnist and Binalpha. Table 1 summarizes the attributes of these datasets.
Caltec07 and Caltec20 are subsets of the Caltech 101 data. Only the HOG feature is used in this paper.
MSRC is an image scene dataset that includes tree, building, plane, cow, face, car, and other classes. It has 210 images from 7 classes, and each image has 432 dimensions.
ATT data contains 400 images of 40 persons, with 10 images per person. The images have been resized.
Binalpha data contains the 26 binary handwritten alphabets. It has 1014 images in total, and each image has 320 dimensions.
Mnist is a handwritten digits dataset. The digits have been size-normalized and centered. It has 10 classes and 150 images in total, with 784 dimensions per image.
Umist is a face image dataset (Sheffield Face database) with 360 images of 20 individuals of mixed race, gender, and appearance.
Pie is a face database collected by the Carnegie Mellon Robotics Institute between October and December 2000. In total, it contains 68 different persons.
In this part, we compare the classification accuracy of kaLDA and Trace Ratio with respect to the subspace dimension. The dimension of the subspace that kaLDA can find is not restricted to $K-1$. After subspace projection, a KNN classifier is applied to perform classification. Results are shown in Figure 3, where solid lines denote kaLDA accuracy and dashed lines denote Trace Ratio accuracy. As we can see, in Figures 3(a), 3(b), 3(c), 3(g), and 3(h), kaLDA has higher accuracy than Trace Ratio when using the same number of reduced features. In Figures 3(d), 3(e), and 3(f), kaLDA has classification accuracy comparable to Trace Ratio. However, kaLDA is more stable than Trace Ratio: for example, in Figures 3(f) and 3(g), the Trace Ratio accuracy decreases as the number of features increases.
6.2 Comparison with other LDA methods
We compare kaLDA with six other methods: LDA, Trace Ratio (TR), sdpLDA, Maximum Margin Criterion (MMC), regularized LDA (RLDA), and the Orthogonal Centroid Method (OCM). All LDA methods reduce the data to the same subspace dimension. A KNN classifier is applied after the data is projected into the selected subspace. The other algorithms have been introduced in the related work section. The final classification accuracy is the average over 5-fold cross-validation and is reported in Table 2. The first column, “kaLDA”, reports kaLDA classification accuracy. kaLDA has the highest accuracy on 4 out of 8 datasets: Caltec20, MSRC, ATT, and Mnist. For Umist and Pie, kaLDA results are very close to the highest accuracy. Overall, kaLDA performs better than all other methods.
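The single-label evaluation protocol described above can be sketched as follows (our own illustration using scikit-learn; the projection matrix G is assumed to come from kaLDA or a baseline, the neighbor count is a placeholder, and for simplicity the projection is applied to all data rather than re-learned on each training fold as in the paper):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_projection(X, y, G, n_neighbors=1, cv=5):
    """X: p x n data, y: labels, G: p x k projection (from kaLDA or a baseline)."""
    Z = (G.T @ X).T                                   # n x k projected features
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    return cross_val_score(knn, Z, y, cv=cv).mean()   # mean 5-fold CV accuracy
```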
6.3 Multi-label Classification
Six multi-label datasets are used in this part. These datasets include image features, music emotion data, and so on. Table 3 summarizes the attributes of these datasets.
The MSRC-MOM and MSRC-SIFT datasets are provided by Microsoft Research Cambridge and include 591 images of 23 classes. MSRC-MOM uses the Moment invariants (MOM) features of the images, and each image has 384 dimensions; MSRC-SIFT uses the SIFT features, and each image has 240 dimensions. About 80% of the images are annotated with at least one class, with about three classes per image on average.
The Barcelona dataset contains 139 images with 4 classes: “building”, “flora”, “people”, and “sky”. Each image has at least two labels.
Emotion is a music emotion dataset comprising 593 songs with 6 emotions. The feature dimension of Emotion is 72.
Yeast is a multi-label dataset containing functional classes of genes of the yeast Saccharomyces cerevisiae.
Scene  contains images of still scenes with semantic indexing. It has 2407 images from 6 classes.
We use 5-fold cross-validation to evaluate the classification performance of the different algorithms. A K-Nearest Neighbour (KNN) classifier is used after the subspace projection. The algorithms compared in this section include Multi-label informed Latent Semantic Indexing (MLSI), Multi-label Dimensionality reduction via Dependence Maximization (MDDM), Multi-Label Least Squares (MLLS), and Multi-label Linear Discriminant Analysis (MLDA); these algorithms have been introduced in the related work section.
We compare the performance of kaLDA and the other algorithms using macro-averaged accuracy (Table 4), macro-averaged F1 score (Table 5), and micro-averaged F1 score (Table 6). Accuracy and F1 score are computed using the standard binary classification definitions. In multi-label classification, the macro average is a class-wise average that gives equal weight to every class, whereas the micro average pools counts over all samples and is therefore influenced by the number of samples in each class. kaLDA achieves the highest classification accuracy on 5 out of 6 datasets; on the remaining MSRC-SIFT dataset, kaLDA is very close to the best method, MLDA, and beats all the other methods. kaLDA also achieves the highest macro and micro F1 scores on 5 out of 6 datasets, and the second highest macro and micro F1 scores on MSRC-SIFT. Overall, kaLDA outperforms the other multi-label algorithms in terms of classification accuracy and macro and micro F1 score.
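The macro/micro distinction can be illustrated with scikit-learn's f1_score (a generic illustration, not the paper's evaluation code):

```python
import numpy as np
from sklearn.metrics import f1_score

# Multi-label ground truth and predictions as binary indicator matrices (n x K).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])
# Macro: F1 computed per class, then averaged with equal weight per class.
print(f1_score(y_true, y_pred, average="macro"))
# Micro: counts pooled over all samples and classes, so frequent classes weigh more.
print(f1_score(y_true, y_pred, average="micro"))
```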
In this paper, we propose a new kernel alignment induced LDA (kaLDA). The objective function of kaLDA is very similar to the classical LDA objective. A Stiefel-manifold gradient descent algorithm solves the kaLDA objective efficiently. We have also extended kaLDA to multi-label problems. Extensive experiments show the effectiveness of kaLDA on both single-label and multi-label problems.
Acknowledgment. This work is partially supported by US NSF CCF-0917274 and NSF DMS-0915228.
-  Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern recognition 37(9), 1757–1771 (2004)
-  Cristianini, N., Shawe-taylor, J., Elisseeff, A., Kandola, J.S.: On kernel target alignment. Advances in neural information processing systems 14, 367 (2002)
-  Cristianini, N., et al.: Method of using kernel alignment to extract significant features from a large dataset (2007), US Patent 7,299,213
-  Cuturi, M.: Fast global alignment kernels. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 929–936 (2011)
-  Ding, C., He, X.: K-means clustering via principal component analysis. In: Proc of international conference on Machine learning (ICML 2004) (2004)
-  Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
-  Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: NIPS. vol. 14, pp. 681–687 (2001)
-  Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1), 59–70 (2007)
-  Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1), 86–100 (2007)
-  Hoi, S.C., Lyu, M.R., Chang, E.Y.: Learning the unified kernel machines for classification. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 187–196. ACM (2006)
-  Howard, A., Jebara, T.: Transformation learning via kernel alignment. In: Machine Learning and Applications, 2009. ICMLA’09. International Conference on. pp. 301–308. IEEE (2009)
-  Ji, S., Tang, L., Yu, S., Ye, J.: Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 381–389. ACM (2008)
-  Jia, Y., Nie, F., Zhang, C.: Trace ratio problem revisited. Neural Networks, IEEE Transactions on 20(4), 729–735 (2009)
-  Kong, D., Ding, C.: A semi-definite positive linear discriminant analysis and its applications. In: 2012 IEEE 12th International Conference on Data Mining (ICDM). pp. 942–947. IEEE (2012)
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
-  Lee, Y.J., Grauman, K.: Foreground focus: Unsupervised learning from partially matching images. International Journal of Computer Vision 85(2), 143–166 (2009)
-  Li, H., Jiang, T., Zhang, K.: Efficient and robust feature extraction by maximum margin criterion. Neural Networks, IEEE Transactions on 17(1), 157–165 (2006)
-  Park, H., Jeon, M., Rosen, J.B.: Lower dimensional representation of text data based on centroids and least squares. BIT Numerical mathematics 43(2), 427–448 (2003)
-  Sim, T., Baker, S., Bsat, M.: The cmu pose, illumination, and expression (pie) database of human faces. Tech. Rep. CMU-RI-TR-01-02, Robotics Institute, Pittsburgh, PA (January 2001)
-  Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.P.: Multi-label classification of music into emotions. In: ISMIR. vol. 8, pp. 325–330 (2008)
-  Wang, H., Ding, C., Huang, H.: Multi-label linear discriminant analysis. In: Computer Vision–ECCV 2010, pp. 126–139. Springer (2010)
-  Wang, H., Yan, S., Xu, D., Tang, X., Huang, T.: Trace ratio vs. ratio trace for dimensionality reduction. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. pp. 1–8. IEEE (2007)
-  Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. In: Journal of Machine Learning Research. pp. 483–502 (2005)
-  Ye, J., Ji, S.: Discriminant analysis for dimensionality reduction: An overview of recent developments. Biometrics: Theory, Methods, and Applications. Wiley-IEEE Press, New York (2010)
-  Yu, K., Yu, S., Tresp, V.: Multi-label informed latent semantic indexing. In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 258–265. ACM (2005)
-  Zha, H., Ding, C., Gu, M., He, X., Simon, H.: Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS’01) pp. 1057–1064
-  Zhang, Y., Zhou, Z.H.: Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data (TKDD) 4(3), 14 (2010)
-  Zhu, X., Kandola, J., Ghahramani, Z., Lafferty, J.D.: Nonparametric transforms of graph kernels for semi-supervised learning. In: Advances in neural information processing systems. pp. 1641–1648 (2004)