Kernel Alignment Inspired Linear Discriminant Analysis

Kernel alignment measures the degree of similarity between two kernels. In this paper, inspired by kernel alignment, we propose a new Linear Discriminant Analysis (LDA) formulation, kernel alignment LDA (kaLDA). We first define two kernels, the data kernel and the class-indicator kernel. The problem is to find a subspace that maximizes the alignment between the subspace-transformed data kernel and the class-indicator kernel. Surprisingly, the kernel-alignment-induced kaLDA objective function is very similar to that of classical LDA and can be expressed using the between-class and total scatter matrices. The formulation extends naturally to multi-label data. We use a Stiefel-manifold gradient descent algorithm to solve the problem. We perform experiments on 8 single-label and 6 multi-label datasets. Results show that kaLDA performs very well on many single-label and multi-label problems.


1 Introduction

Kernel alignment [2] is a way to incorporate class label information into kernels, which are traditionally constructed directly from the data without using class labels. Kernel alignment can be viewed as a measure of consistency between the similarity function (the kernel) and the class structure of the data. Improving this consistency makes the data more separable when the class-label-aligned kernel is used. Kernel alignment has recently been applied to pattern recognition and feature selection [3, 28, 10, 11, 4].

In this paper, we find that if we use the widely used linear kernel and a kernel built from class indicators, the resulting kernel alignment function is very similar to the widely used linear discriminant analysis (LDA), involving the well-known between-class scatter matrix S_b and total scatter matrix S_t. We call this objective function kernel alignment induced LDA (kaLDA). When the data are transformed into a linear subspace, the optimal subspace is found by maximizing this kaLDA objective.

We further analyze this kaLDA and propose a Stiefel-manifold gradient descent algorithm to solve it. We also extend kaLDA to multi-label problems. Surprisingly, the scatter matrices arising in multi-label kernel alignment are identical to the matrices developed in Multi-label LDA [21].

We perform extensive experiments comparing kaLDA with other approaches on 8 single-label and 6 multi-label datasets. Results show that the kernel alignment LDA approach has good performance in terms of classification accuracy and F1 score.

2 From Kernel Alignment to LDA

Kernel alignment is a similarity measure between a kernel function and a target function. In other words, kernel alignment evaluates the degree of fitness between the data in kernel space and the target function. For this reason, we usually set the target function to be the class indicator function, while the other kernel is built from the data matrix. By measuring the similarity between the data kernel and the class-indicator kernel, we get a sense of how easily the data can be separated in the kernel subspace. The alignment of two kernels K_1 and K_2 is given as [2]:

A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \, \langle K_2, K_2 \rangle_F}}.   (1)
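
For concreteness, a minimal NumPy sketch of Eq.(1) as reconstructed above: the Frobenius inner product of two Gram matrices, normalized by their Frobenius norms. The example data are illustrative only.

```python
import numpy as np

def kernel_alignment(K1, K2):
    """Kernel alignment of Eq.(1): Frobenius inner product of two Gram
    matrices, normalized by their Frobenius norms."""
    inner = np.sum(K1 * K2)                      # <K1, K2>_F
    return inner / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

# Toy usage: alignment between a linear data kernel and a class-indicator kernel.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 21))                 # p = 5 features, n = 21 points
labels = np.arange(21) % 3                       # 3 classes, 7 points each
Y = np.zeros((21, 3))
Y[np.arange(21), labels] = 1.0
Y /= np.sqrt(Y.sum(axis=0))                      # normalized indicator, cf. Eq.(2)
print(kernel_alignment(X.T @ X, Y @ Y.T))        # a value between 0 and 1 here
```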

We first introduce some notation, and then present Theorem 2.1 and the kernel alignment projective function.

Let the data matrix be X = (x_1, \ldots, x_n) \in \mathbb{R}^{p \times n}, where p is the data dimension, n is the number of data points, and x_i is a data point. Let the normalized class indicator matrix be Y \in \mathbb{R}^{n \times k}, which was used to prove the equivalence between PCA and K-means clustering [26, 5], and

Y_{ij} = \begin{cases} 1/\sqrt{n_j} & \text{if } x_i \text{ belongs to class } j, \\ 0 & \text{otherwise,} \end{cases}   (2)

where k is the total number of classes and n_j is the number of data points in class j. The class mean is m_j = \frac{1}{n_j}\sum_{x_i \in \text{class } j} x_i and the total mean of the data is m = \frac{1}{n}\sum_{i=1}^{n} x_i.
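
A small helper, following the reconstruction of Eq.(2) above, that builds the normalized indicator from integer labels and recovers the class means and total mean; the names are ours.

```python
import numpy as np

def normalized_indicator(labels, k):
    """Normalized class indicator of Eq.(2): Y[i, j] = 1/sqrt(n_j) if point i
    belongs to class j, else 0, so that Y^T Y = I_k."""
    n = len(labels)
    Y = np.zeros((n, k))
    Y[np.arange(n), labels] = 1.0
    return Y / np.sqrt(Y.sum(axis=0))            # divide column j by sqrt(n_j)

labels = np.array([0, 0, 1, 1, 1, 2])
X = np.arange(18, dtype=float).reshape(3, 6)     # p = 3 features, n = 6 points
Y = normalized_indicator(labels, 3)
print(np.allclose(Y.T @ Y, np.eye(3)))           # True: orthonormal columns
m_j = np.stack([X[:, labels == j].mean(axis=1) for j in range(3)], axis=1)  # class means
m = X.mean(axis=1)                               # total mean
```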

Theorem 2.1

Define the data kernel K_X and the class label kernel K_Y as follows:

K_X = X^T X, \quad K_Y = Y Y^T.   (3)

Then we have

A(K_X, K_Y) = \frac{\mathrm{tr}(S_b)}{\sqrt{k}\,\sqrt{\mathrm{tr}(S_t S_t)}},   (4)

where \sqrt{k} is a constant independent of X.

Furthermore, let G \in \mathbb{R}^{p \times r} be a linear transformation to an r-dimensional subspace,

X \rightarrow G^T X,   (5)

then we have

A(K_{G^T X}, K_Y) = \frac{\mathrm{tr}(G^T S_b G)}{\sqrt{k}\,\sqrt{\mathrm{tr}(G^T S_t G \, G^T S_t G)}},   (6)

where

S_b = \sum_{j=1}^{k} n_j (m_j - m)(m_j - m)^T,   (7)
S_t = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T.   (8)

Theorem 2.1 shows that the kernel alignment can be expressed using the scatter matrices S_b and S_t. In applications, we adjust G so that the kernel alignment is maximized, i.e., we solve the following problem:

\max_G \; J(G) = \frac{\mathrm{tr}(G^T S_b G)}{\sqrt{\mathrm{tr}(G^T S_t G \, G^T S_t G)}}.   (9)

In general, the columns of G are assumed to be linearly independent.

A striking feature of this kernel alignment problem is that it is very similar to classic LDA.
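
As a small sketch under the reconstruction above, the objective of Eq.(9) can be evaluated directly from S_b, S_t, and a candidate projection G; the function and variable names are ours.

```python
import numpy as np

def kalda_objective(G, Sb, St):
    """kaLDA objective of Eq.(9): tr(G^T Sb G) / sqrt(tr(G^T St G G^T St G))."""
    M = G.T @ St @ G
    return np.trace(G.T @ Sb @ G) / np.sqrt(np.trace(M @ M))

# Toy usage with scatter matrices built from Eqs.(7)-(8).
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 30))
labels = np.arange(30) % 3
m = X.mean(axis=1)
Sb = sum((labels == j).sum()
         * np.outer(X[:, labels == j].mean(axis=1) - m,
                    X[:, labels == j].mean(axis=1) - m) for j in range(3))
St = (X - m[:, None]) @ (X - m[:, None]).T
G, _ = np.linalg.qr(rng.standard_normal((4, 2)))   # a random orthonormal G
print(kalda_objective(G, Sb, St))
```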

2.1 Proof of Theorem 2.1 and Analysis

Here we note a useful lemma and then prove Theorem 2.1.

In most data analysis, the data are centered, i.e., \sum_i x_i = 0. Here we assume the data are already centered. The following results remain correct if the data are not centered. We have the following relations:

Lemma 1

The scatter matrices can be expressed as:

S_b = X Y Y^T X^T,   (10)
S_t = X X^T.   (11)

These results are previously known, for example, Theorem 3 of [5].
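
A quick numerical check of Lemma 1 as reconstructed above: for centered data, the indicator-matrix expressions coincide with the definitional sums of Eqs.(7)-(8). The snippet is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, k = 5, 40, 4
X = rng.standard_normal((p, n))
X -= X.mean(axis=1, keepdims=True)          # center the data
labels = np.arange(n) % k
Y = np.zeros((n, k)); Y[np.arange(n), labels] = 1.0
Y /= np.sqrt(Y.sum(axis=0))                 # normalized indicator, Eq.(2)

# Definitional scatter matrices, Eqs.(7)-(8)
m = X.mean(axis=1)
Sb_def = sum((labels == j).sum()
             * np.outer(X[:, labels == j].mean(axis=1) - m,
                        X[:, labels == j].mean(axis=1) - m) for j in range(k))
St_def = (X - m[:, None]) @ (X - m[:, None]).T

# Lemma 1: Sb = X Y Y^T X^T, St = X X^T (for centered data)
print(np.allclose(Sb_def, X @ Y @ Y.T @ X.T))   # True
print(np.allclose(St_def, X @ X.T))             # True
```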

Proof of Theorem 2.1. To prove Eq.(4), we substitute K_X = X^T X and K_Y = Y Y^T into Eq.(1) and obtain, noting Y^T Y = I_k,

A(K_X, K_Y) = \frac{\mathrm{tr}(X^T X \, Y Y^T)}{\sqrt{\mathrm{tr}(X^T X \, X^T X)\,\mathrm{tr}(Y Y^T Y Y^T)}} = \frac{\mathrm{tr}(S_b)}{\sqrt{k}\,\sqrt{\mathrm{tr}(S_t S_t)}},

where we used Lemma 1; \sqrt{k} = \sqrt{\mathrm{tr}(Y Y^T Y Y^T)} is a constant independent of the data X.

To prove Eq.(6), we replace X by the projected data G^T X:

A(K_{G^T X}, K_Y) = \frac{\mathrm{tr}(X^T G G^T X \, Y Y^T)}{\sqrt{\mathrm{tr}\!\big((X^T G G^T X)^2\big)\,\mathrm{tr}(Y Y^T Y Y^T)}} = \frac{\mathrm{tr}(G^T \, X Y Y^T X^T \, G)}{\sqrt{k}\,\sqrt{\mathrm{tr}(G^T X X^T G \, G^T X X^T G)}},

thus we obtain Eq.(6) using Lemma 1.

2.2 Relation to Classical LDA

In classical LDA, the between-class scatter matrix S_b is defined as in Eq.(7), and the within-class scatter matrix S_w and the total scatter matrix S_t are defined as:

S_w = \sum_{j=1}^{k} \sum_{x_i \in \text{class } j} (x_i - m_j)(x_i - m_j)^T, \quad S_t = S_b + S_w,   (12)

where m_j are the class means. Classical LDA finds a projection matrix G that minimizes the within-class scatter and maximizes the between-class scatter using the following objective:

\max_G \; \mathrm{tr}\!\big((G^T S_w G)^{-1} (G^T S_b G)\big),   (13)

or

\max_G \; \frac{\mathrm{tr}(G^T S_b G)}{\mathrm{tr}(G^T S_w G)}.   (14)

Eq.(14) is also called the trace ratio (TR) problem [22]. It is easy to see that Eq.(14) can be expressed as

\max_G \; \frac{\mathrm{tr}(G^T S_b G)}{\mathrm{tr}(G^T S_t G)}.   (15)

(Footnote 1: Eq.(14) is equivalent to \min_G \mathrm{tr}(G^T S_w G)/\mathrm{tr}(G^T S_b G), which is \min_G \mathrm{tr}(G^T S_t G)/\mathrm{tr}(G^T S_b G) - 1. Reversing to maximization and using S_t = S_b + S_w, we obtain Eq.(15).)

As we can see, the kernel alignment LDA objective function Eq.(9) is very similar to Eq.(15). Thus kernel alignment provides an interesting alternative explanation of LDA. In fact, we can similarly show that in Eq.(9), \mathrm{tr}(G^T S_b G) is maximized as in standard LDA. First, Eq.(9) is equivalent to

\max_G \; \mathrm{tr}(G^T S_b G), \quad \text{s.t.} \;\; \mathrm{tr}(G^T S_t G \, G^T S_t G) = c_0,

where c_0 is a fixed value. The precise value of c_0 is unimportant, since the scale of G is undefined in LDA: if G^* is an optimal solution and a is any real number, then aG^* is also an optimal solution with the same optimal objective function value. The above optimization is approximately equivalent to

\max_G \; \mathrm{tr}(G^T S_b G), \quad \text{s.t.} \;\; \mathrm{tr}(G^T S_t G) = c_1.

This is the same as

\max_G \; \mathrm{tr}(G^T S_b G) - \lambda \, \mathrm{tr}(G^T S_t G).

In other words, \mathrm{tr}(G^T S_b G) is maximized while \mathrm{tr}(G^T S_w G) is minimized, recovering the main theme of LDA.
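
The scale argument above can be checked numerically: under our reconstruction of Eq.(9), the objective is invariant to rescaling G. A minimal sketch:

```python
import numpy as np

def kalda_objective(G, Sb, St):
    M = G.T @ St @ G
    return np.trace(G.T @ Sb @ G) / np.sqrt(np.trace(M @ M))

rng = np.random.default_rng(3)
p = 6
A = rng.standard_normal((p, p)); Sb = A @ A.T        # PSD stand-ins for the scatter matrices
B = rng.standard_normal((p, p)); St = B @ B.T
G = rng.standard_normal((p, 2))

print(np.isclose(kalda_objective(G, Sb, St),
                 kalda_objective(7.3 * G, Sb, St)))   # True: J(aG) = J(G)
```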

3 Computational Algorithm

In this section, we develop an efficient algorithm to solve the kaLDA objective function of Eq.(9):

\max_G \; J_1(G) = \frac{\mathrm{tr}(G^T S_b G)}{\sqrt{\mathrm{tr}(G^T S_t G \, G^T S_t G)}}, \quad \text{s.t.} \;\; G^T G = I.   (16)

The condition G^T G = I ensures that different columns of G are mutually independent. The gradient of J_1 is

\frac{\partial J_1}{\partial G} = \frac{2\, S_b G}{\sqrt{b}} - \frac{2a}{b^{3/2}}\; S_t G \, G^T S_t G,   (17)

where a = \mathrm{tr}(G^T S_b G) and b = \mathrm{tr}(G^T S_t G \, G^T S_t G).

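Our reconstruction of the gradient in Eq.(17) can be sanity-checked against finite differences; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def kalda_objective(G, Sb, St):
    M = G.T @ St @ G
    return np.trace(G.T @ Sb @ G) / np.sqrt(np.trace(M @ M))

def kalda_grad(G, Sb, St):
    """Euclidean gradient of Eq.(17) (our reconstruction):
    dJ/dG = 2 Sb G / sqrt(b) - 2 a (St G G^T St G) / b^(3/2),
    with a = tr(G^T Sb G) and b = tr(G^T St G G^T St G)."""
    a = np.trace(G.T @ Sb @ G)
    M = G.T @ St @ G
    b = np.trace(M @ M)
    return 2.0 * Sb @ G / np.sqrt(b) - 2.0 * a * (St @ G @ M) / b ** 1.5

# Finite-difference check of the gradient.
rng = np.random.default_rng(4)
p, r = 5, 2
A = rng.standard_normal((p, p)); Sb = A @ A.T
B = rng.standard_normal((p, p)); St = B @ B.T + np.eye(p)
G = rng.standard_normal((p, r))
eps, num = 1e-6, np.zeros((p, r))
for i in range(p):
    for j in range(r):
        E = np.zeros((p, r)); E[i, j] = eps
        num[i, j] = (kalda_objective(G + E, Sb, St)
                     - kalda_objective(G - E, Sb, St)) / (2 * eps)
print(np.allclose(num, kalda_grad(G, Sb, St), atol=1e-5))   # True
```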
The constraint G^T G = I places G on the Stiefel manifold. Variations of G on this manifold follow parallel transport, which imposes a restriction on the gradient; this has been worked out in [6]. The gradient that preserves the manifold structure is

\nabla J_1 = \frac{\partial J_1}{\partial G} - G \left(\frac{\partial J_1}{\partial G}\right)^{T} G.   (18)

Thus the algorithm computes the new G as follows:

G \leftarrow G + \eta \, \nabla J_1.   (19)

The step size \eta is usually chosen as:

(20)

Occasionally, due to loss of numerical accuracy, we use a projection to restore G^T G = I. Starting with the standard LDA solution for G, the algorithm is iterated until it converges to a locally optimal solution. In fact, the objective function converges quickly when the step size \eta is chosen properly. Figure 1 shows that J_1 converges in about 200 iterations for the datasets ATT, Binalpha, Mnist, and Umist (more details about the datasets are given in the experiments section). In summary, the kernel alignment LDA (kaLDA) procedure is shown in Algorithm 1.

1: Input: data matrix X, class indicator matrix Y
2: Output: projection matrix G
3: Compute S_b and S_t using Eq.(10) and Eq.(11)
4: Initialize G using the classical LDA solution
5: repeat
6:     Compute the gradient using Eq.(17) and Eq.(18)
7:     Update G using Eq.(19)
8: until J_1 converges
Algorithm 1: Kernel alignment LDA (kaLDA)
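
A minimal NumPy sketch of Algorithm 1 under the reconstructions above: scatter matrices from Lemma 1, the Euclidean gradient of Eq.(17), the Stiefel gradient of Eq.(18), and a gradient-ascent step as in Eq.(19). The step-size rule of Eq.(20) is not reproduced; a small fixed step and a QR re-orthonormalization (in place of the occasional projection) are used instead. Function and parameter names are ours.

```python
import numpy as np

def kalda(X, labels, r, eta=1e-3, iters=500):
    """Stiefel-manifold gradient ascent for kaLDA (sketch of Algorithm 1).
    X: p x n centered data; labels: length-n integer labels; r: subspace dim."""
    p, n = X.shape
    k = labels.max() + 1
    Y = np.zeros((n, k)); Y[np.arange(n), labels] = 1.0
    Y /= np.sqrt(Y.sum(axis=0))                  # normalized indicator, Eq.(2)
    Sb, St = X @ Y @ Y.T @ X.T, X @ X.T          # Lemma 1, Eqs.(10)-(11)

    # Initialize G from the classical LDA solution (top eigenvectors of St^+ Sb).
    w, V = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    G, _ = np.linalg.qr(np.real(V[:, np.argsort(-np.real(w))[:r]]))

    for _ in range(iters):
        a = np.trace(G.T @ Sb @ G)
        M = G.T @ St @ G
        b = np.trace(M @ M)
        dJ = 2 * Sb @ G / np.sqrt(b) - 2 * a * (St @ G @ M) / b ** 1.5   # Eq.(17)
        grad = dJ - G @ dJ.T @ G                 # Stiefel gradient, Eq.(18)
        G = G + eta * grad                       # ascent step, Eq.(19)
        G, _ = np.linalg.qr(G)                   # restore G^T G = I
    return G

# Usage on toy data: learn the projection, then map into the 2-D kaLDA subspace.
rng = np.random.default_rng(5)
X = np.hstack([rng.standard_normal((10, 20)) + c for c in range(3)])
labels = np.repeat(np.arange(3), 20)
X = X - X.mean(axis=1, keepdims=True)
G = kalda(X, labels, r=2)
Z = G.T @ X                                      # projected samples
```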
Figure 1: Objective J_1 converges using the Stiefel-manifold gradient descent algorithm, for (a) ATT, (b) Binalpha, (c) Mnist, and (d) Umist.

To show the effectiveness of the proposed kaLDA, we visualize a real dataset in a 2-D subspace in Figure 2. In this example, we take 3 classes of the 644-dimensional Umist data, with 18 data points in each class. Figure 2(a) shows the original data projected into the 2-D PCA subspace. Blue points are in class 1, red circles are in class 2, and black squares are in class 3. Data points from the three classes are mixed together in the 2-D PCA subspace, and it is difficult to find a linear boundary separating points of different classes. Figure 2(b) shows the data in the 2-D standard LDA subspace; data points in different classes are now projected into different clusters. Figure 2(c) shows the data projected into the 2-D kaLDA subspace. Compared to Figure 2(b), the within-class distances in Figure 2(c) are much smaller and the distances between different classes are larger.

Figure 2: Visualization of the Umist data in (a) the 2-D PCA subspace, (b) the 2-D LDA subspace, and (c) the 2-D kaLDA subspace.

4 Extension to Multi-label Data

Multi-label problems arise frequently in image and video annotation, multi-topic text categorization, music classification, etc. [21]. In multi-label data, a data point can have several class labels (i.e., belong to several classes). For example, an image could have the labels “cloud”, “building”, and “tree”. This is different from the single-label problem, where one point can have only one class label. Multi-label data are very natural and common in everyday life. For example, a film can be simultaneously classified as “drama”, “romance”, and “historic” (if it is about a true story), and a news article can have topic labels such as “economics”, “sports”, etc.

The kernel alignment approach can be easily and naturally extended to multi-label data, because the class label kernel can be clearly and unambiguously defined using the class label matrix for both single-label and multi-label datasets. The data kernel is defined as usual. In the following, we develop this approach further.

One important result of our kernel alignment approach for single-label data is its close relationship with LDA. For multi-label data, each data point can belong to several classes. The standard scatter matrices are then ambiguous, because they are only defined for single-label data in which each data point belongs to exactly one class. However, our kernel alignment approach on multi-label data leads to new definitions of the scatter matrices and a similar objective function; this can be viewed as a generalization of LDA from single-label to multi-label data via the kernel alignment approach.

Indeed, the new scatter matrices we obtain from the kernel alignment approach are identical to those of the so-called “multi-label LDA” [21], which was developed from a class-separation, probabilistic point of view, very different from ours. The fact that these two approaches lead to the same set of scatter matrices shows that the resulting multi-label LDA framework has a broad theoretical basis.

We first present some notation for multi-label data and then describe the kernel alignment approach for multi-label data in Theorem 4.1. The class label matrix Y \in \{0, 1\}^{n \times k} for the data X is given as:

Y_{ij} = \begin{cases} 1 & \text{if } x_i \text{ belongs to class } j, \\ 0 & \text{otherwise.} \end{cases}   (21)

Let n_j be the number of data points in class j. Note that for multi-label data, \sum_{j=1}^{k} n_j \ge n. The normalized class indicator matrix \tilde{Y} is given as:

\tilde{Y}_{ij} = Y_{ij} / \sqrt{n_j}.   (22)
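
A small helper following the reconstruction of Eqs.(21)-(22): the binary label matrix is normalized column-wise by 1/\sqrt{n_j}. The per-point label counts l_i are also returned; the exact form of the diagonal weight matrix W used in the paper is not reproduced here.

```python
import numpy as np

def multilabel_indicator(Yb):
    """Yb: n x k binary label matrix (Eq.(21); a point may carry several 1s).
    Returns the column-normalized indicator of Eq.(22) and the label counts l_i."""
    n_j = Yb.sum(axis=0)                 # class sizes; note sum(n_j) >= n
    l = Yb.sum(axis=1)                   # number of classes each point belongs to
    return Yb / np.sqrt(n_j), l

Yb = np.array([[1, 0, 1],                # point 0 has labels 0 and 2
               [0, 1, 0],
               [1, 1, 0],
               [0, 0, 1]], dtype=float)
Ytil, l = multilabel_indicator(Yb)
```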

Let l_i be the number of classes that x_i belongs to; the weight of x_i is defined in terms of l_i, and W is the corresponding diagonal weight matrix. The kernel alignment formulation for multi-label data can be stated as

Theorem 4.1

For multi-label data X, let the data kernel K_X and the class label kernel K_Y be

(23)

We have the alignment

(24)

where the constant in the denominator is independent of the data X, and S_b, S_t are given in Eqs.(27, 28).

Furthermore, let G \in \mathbb{R}^{p \times r} be the linear transformation to an r-dimensional subspace,

(25)

we have

(26)

The scatter matrices S_b and S_t in Theorem 4.1 are defined as:

(27)
(28)

where m_j is the mean of class j and m is the global mean, defined as:

(29)

Therefore, we can seek an optimal subspace for multi-label data by solving Eq.(16) with S_b and S_t given in Eqs.(27, 28).

4.1 Proof of Theorem 4.1 and Equivalence to Multi-label LDA

Here we note a useful lemma for multi-label data and then prove Theorem 4.1. We consider the case where the data are centered, i.e., \sum_i x_i = 0. The results also hold when the data are not centered, but the proofs are slightly more complicated.

Lemma 2

For multi-label data, S_b and S_t of Eqs.(27, 28) can be expressed as

(30)
(31)
Proof

From the definitions above for multi-label data, we have

Thus we recover S_b of Eq.(27).

Eq.(31) is proved similarly.

Proof of Theorem 4.1. Using Lemma 2, to prove Eq.(24),

where the constant is independent of X.

To prove Eq.(26),

For single-label data, l_i = 1 for all i, and Eqs.(30, 31) reduce to Eqs.(10, 11); Theorem 4.1 then reduces to Theorem 2.1.

As we can see, surprisingly, the scatter matrices of Eqs.(27, 28) arising in Theorem 4.1 are identical to those in the multi-label LDA proposed in [21].

Data n p k
Caltec07 210 432 7
Caltec20 1230 432 20
MSRC 210 432 7
ATT 400 644 40
Binalpha 1014 320 26
Mnist 150 784 10
Umist 360 644 20
Pie 680 1024 68
Table 1: Attributes of the single-label datasets.
Data kaLDA LDA TR sdpLDA MMC RLDA OCM
Caltec07 0.7524 0.6619 0.6762 0.5619 0.6000 0.7952 0.7619
Caltec20 0.7068 0.6320 0.4465 0.3386 0.5838 0.6812 0.6696
MSRC 0.7762 0.6857 0.5714 0.5952 0.5667 0.7333 0.7286
ATT 0.9775 0.9750 0.9675 0.9750 0.9750 0.9675 0.9675
Binalpha 0.7817 0.6078 0.4620 0.2507 0.7638 0.7983 0.8204
Mnist 0.8800 0.8733 0.8667 0.8467 0.8467 0.8667 0.8467
Umist 0.9900 0.9900 0.9917 0.9133 0.9633 0.9800 0.9783
Pie 0.8765 0.8838 0.8441 0.8632 0.8676 0.6515 0.6515
Table 2: Classification accuracy on the single-label datasets (data reduced to k−1 dimensions).

5 Related Work

Linear Discriminant Analysis (LDA) is a widely used dimension reduction and subspace learning algorithm, and many LDA reformulations have been published in recent years. The Trace Ratio problem is to find a subspace transformation matrix such that the within-class distance is minimized and the between-class distance is maximized. Formally, Trace Ratio maximizes the ratio of two trace terms, \mathrm{tr}(G^T S_b G)/\mathrm{tr}(G^T S_t G) [22, 13], where S_t is the total scatter matrix and S_b is the between-class scatter matrix. Other popular LDA approaches include regularized LDA (RLDA) [9], the Orthogonal Centroid Method (OCM) [18], Uncorrelated LDA (ULDA) [23], Orthogonal LDA (OLDA) [23], etc. These approaches mainly compute the eigendecomposition of the matrix S_t^{-1} S_b, but use different formulations of the total scatter matrix S_t [24].

Maximum Margin Criterion (MMC) [17] is a simpler and more efficient method. MMC finds a subspace projection matrix G to maximize \mathrm{tr}\big(G^T (S_b - S_w) G\big). Though in a different way, MMC also maximizes the between-class distance while minimizing the within-class distance. Semi-Definite Positive LDA (sdpLDA) [14] solves a related maximization in which the within-class scatter is weighted by the largest eigenvalue of a ratio of the scatter matrices; sdpLDA is derived from the maximum margin principle.

Multi-label problems arise frequently in image and video annotation and many other related applications, such as multi-topic text categorization [21]. There are many multi-label dimension reduction approaches, such as Multi-label Linear Regression (MLR), Multi-label informed Latent Semantic Indexing (MLSI) [25], Multi-label Dimensionality reduction via Dependence Maximization (MDDM) [27], Multi-Label Least Square (MLLS) [12], and Multi-label Linear Discriminant Analysis (MLDA) [21].

6 Experiments

Data n p k
MSRC-MOM 591 384 23
Barcelona 139 48 4
Emotion 593 72 6
Yeast 2,417 103 14
MSRC-SIFT 591 240 23
Scene 2,407 294 6
Table 3: Attributes of the multi-label datasets.
Data kaLDA MLSI MDDM MLLS MLDA
MSRC-MOM 0.9150 0.8962 0.9044 0.8994 0.9036
Barcelona 0.6579 0.6436 0.6470 0.6524 0.6290
Emotion 0.7634 0.7397 0.7540 0.7529 0.7619
Yeast 0.7405 0.7317 0.7371 0.7364 0.7368
MSRC-SIFT 0.8839 0.8762 0.8800 0.8807 0.8858
Scene 0.8870 0.8534 0.8713 0.8229 0.8771
Table 4: Classification accuracy on the multi-label datasets (data reduced to k−1 dimensions).

In this section, we first compare kernel alignment LDA (kaLDA) with six other methods on 8 single-label datasets, and then compare the multi-label version of kaLDA with four other methods on 6 multi-label datasets.

6.1 Comparison with Trace Ratio w.r.t. subspace dimension

Figure 3: Classification accuracy w.r.t. the dimension of the subspace, for (a) Caltec07, (b) Caltec20, (c) MSRC, (d) ATT, (e) Binalpha, (f) Mnist, (g) Umist, and (h) Pie.

Eight single-label datasets are used in this experiment. They come from different domains: the image scene datasets Caltec [8] and MSRC [16], the face datasets ATT, Umist, and Pie [19], and the digit datasets Mnist [15] and Binalpha. Table 1 summarizes their attributes.

Caltec07 and Caltec20 are subsets of Caltech 101 data. Only the HOG feature is used in this paper.

MSRC is an image scene dataset that includes tree, building, plane, cow, face, car, and other classes. It has 210 images from 7 classes, and each image has 432 dimensions.

The ATT data contains 400 images of 40 persons, with 10 images per person. The images have been resized; each image has 644 dimensions (Table 1).

The Binalpha data contains 26 binary handwritten letters. It has 1014 images in total, and each image has 320 dimensions.

Mnist is a handwritten digit dataset. The digits have been size-normalized and centered. It has 10 classes and 150 images in total, with 784 dimensions per image.

Umist is a face image dataset (Sheffield Face database) with 360 images from 20 individuals with mixed race, gender and appearance.

Pie is a face database collected by the Carnegie Mellon Robotics Institute between October and December 2000. In total, it contains 68 different persons.

In this part, we compare the classification accuracy of kaLDA and Trace Ratio [22] with respect to the subspace dimension. The dimension of the subspace that kaLDA can find is not restricted to k−1. After subspace projection, a KNN classifier is applied to perform classification. Results are shown in Figure 3, where the solid line denotes kaLDA accuracy and the dashed line denotes Trace Ratio accuracy. As we can see, in Figures 3(a), 3(b), 3(c), 3(g), and 3(h), kaLDA has higher accuracy than Trace Ratio when using the same number of reduced features. In Figures 3(d), 3(e), and 3(f), kaLDA has classification accuracy competitive with Trace Ratio. Moreover, kaLDA is more stable than Trace Ratio: for example, in Figures 3(f) and 3(g), we observe a decrease in Trace Ratio accuracy as the number of features increases.

6.2 Comparison with other LDA methods

Dataset kaLDA MLSI MDDM MLLS MLDA
MSRC-MOM 0.6104 0.5244 0.5593 0.5426 0.5571
Barcelona 0.7377 0.7286 0.7301 0.7341 0.7169
Emotion 0.6274 0.5873 0.6101 0.6041 0.6200
Yeast 0.5757 0.5568 0.5696 0.5691 0.5693
MSRC-SIFT 0.4712 0.4334 0.4522 0.4544 0.4773
Scene 0.6851 0.5911 0.6411 0.5048 0.6568
Table 5: Macro F1 score on the multi-label datasets (data reduced to k−1 dimensions).
Dataset kaLDA MLSI MDDM MLLS MLDA
MSRC-MOM 0.5138 0.4064 0.4432 0.4370 0.4448
Barcelona 0.6969 0.6891 0.6861 0.6904 0.6772
Emotion 0.6203 0.5779 0.6030 0.5961 0.6151
Yeast 0.4249 0.4026 0.4205 0.4216 0.4213
MSRC-SIFT 0.3943 0.3510 0.3637 0.3667 0.3959
Scene 0.6966 0.6006 0.6493 0.5062 0.6643
Table 6: Micro F1 score on the multi-label datasets (data reduced to k−1 dimensions).

We compare kaLDA with six other methods: LDA, Trace Ratio (TR), sdpLDA, Maximum Margin Criterion (MMC), regularized LDA (RLDA), and the Orthogonal Centroid Method (OCM). All methods reduce the data to k−1 dimensions. A KNN classifier is applied to perform classification after the data are projected into the selected subspace. The other algorithms have been introduced in the related work section. The final classification accuracy is the average over 5-fold cross-validation and is reported in Table 2. The first column, “kaLDA”, reports the kaLDA classification accuracy. kaLDA has the highest accuracy on 4 out of 8 datasets: Caltec20, MSRC, ATT, and Mnist. For Umist and Pie, the kaLDA results are very close to the highest accuracy. Overall, kaLDA performs better than all other methods.

6.3 Multi-label Classification

Six multi-label datasets are used in this part, covering image features, music emotion, and other domains. Table 3 summarizes their attributes.

The MSRC-MOM and MSRC-SIFT datasets are provided by Microsoft Research Cambridge and include 591 images of 23 classes. MSRC-MOM uses the moment invariants (MOM) features, with 384 dimensions per image; MSRC-SIFT uses SIFT features, with 240 dimensions per image. About 80% of the images are annotated with at least one class, with about three classes per image on average.

The Barcelona dataset contains 139 images with 4 classes, i.e., “building”, “flora”, “people”, and “sky”. Each image has at least two labels.

Emotion [20] is a music emotion dataset comprising 593 songs with 6 emotions. Its feature dimension is 72.

Yeast [7] is a multi-label dataset containing functional classes of genes of the yeast Saccharomyces cerevisiae.

Scene [1] contains images of still scenes with semantic indexing. It has 2407 images from 6 classes.

We use 5-fold cross-validation to evaluate the classification performance of the different algorithms. A K-Nearest Neighbour (KNN) classifier is used after the subspace projection. The algorithms we compare against in this section are Multi-label informed Latent Semantic Indexing (MLSI), Multi-label Dimensionality reduction via Dependence Maximization (MDDM), Multi-Label Least Square (MLLS), and Multi-label Linear Discriminant Analysis (MLDA); they have been introduced in the related work section.

We compare the performance of kaLDA and the other algorithms using classification accuracy (Table 4), macro-averaged F1 score (Table 5), and micro-averaged F1 score (Table 6). Accuracy and F1 score are computed using the standard binary classification definitions for each class. In multi-label classification, the macro average is a class-wise average that gives equal weight to every class, whereas the micro average pools predictions over all classes and is therefore influenced by the number of samples in each class [21]. kaLDA achieves the highest classification accuracy on 5 out of 6 datasets. On the remaining dataset, MSRC-SIFT, the kaLDA result is very close to that of the best method, MLDA, and beats all the other methods. kaLDA also achieves the highest macro and micro F1 scores on 5 out of 6 datasets, and the second highest macro and micro F1 scores on MSRC-SIFT. Overall, kaLDA outperforms the other multi-label algorithms in terms of classification accuracy and macro and micro F1 score.
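
For reference, a small illustration of how the two F1 averages differ, using scikit-learn on a toy multi-label prediction matrix; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions (n = 6 points, k = 3 labels).
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])

# Macro F1: compute F1 per class, then average with equal weight per class.
print(f1_score(Y_true, Y_pred, average='macro'))
# Micro F1: pool true/false positives and negatives over all classes first,
# so frequent classes carry more weight.
print(f1_score(Y_true, Y_pred, average='micro'))
```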

7 Conclusions

In this paper, we propose a new kernel alignment induced LDA (kaLDA). The objective function of kaLDA is very similar to the classical LDA objective. A Stiefel-manifold gradient descent algorithm solves the kaLDA objective efficiently. We have also extended kaLDA to multi-label problems. Extensive experiments show the effectiveness of kaLDA on both single-label and multi-label problems.

Acknowledgment. This work is partially supported by US NSF CCF-0917274 and NSF DMS-0915228.

References

  • [1] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004)

  • [2] Cristianini, N., Shawe-taylor, J., Elisseeff, A., Kandola, J.S.: On kernel target alignment. Advances in neural information processing systems 14, 367 (2002)
  • [3] Cristianini, N., et al.: Method of using kernel alignment to extract significant features from a large dataset (2007), US Patent 7,299,213
  • [4] Cuturi, M.: Fast global alignment kernels. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 929–936 (2011)

  • [5] Ding, C., He, X.: K-means clustering via principal component analysis. In: Proc. of International Conference on Machine Learning (ICML 2004) (2004)

  • [6] Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications 20(2), 303–353 (1998)
  • [7] Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: NIPS. vol. 14, pp. 681–687 (2001)
  • [8] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1), 59–70 (2007)

  • [9] Guo, Y., Hastie, T., Tibshirani, R.: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1), 86–100 (2007)
  • [10] Hoi, S.C., Lyu, M.R., Chang, E.Y.: Learning the unified kernel machines for classification. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 187–196. ACM (2006)
  • [11] Howard, A., Jebara, T.: Transformation learning via kernel alignment. In: Machine Learning and Applications, 2009. ICMLA’09. International Conference on. pp. 301–308. IEEE (2009)
  • [12] Ji, S., Tang, L., Yu, S., Ye, J.: Extracting shared subspace for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 381–389. ACM (2008)
  • [13] Jia, Y., Nie, F., Zhang, C.: Trace ratio problem revisited. Neural Networks, IEEE Transactions on 20(4), 729–735 (2009)

  • [14] Kong, D., Ding, C.: A semi-definite positive linear discriminant analysis and its applications. In: 2012 IEEE 12th International Conference on Data Mining (ICDM). pp. 942–947. IEEE (2012)
  • [15] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • [16] Lee, Y.J., Grauman, K.: Foreground focus: Unsupervised learning from partially matching images. International Journal of Computer Vision 85(2), 143–166 (2009)

  • [17] Li, H., Jiang, T., Zhang, K.: Efficient and robust feature extraction by maximum margin criterion. Neural Networks, IEEE Transactions on 17(1), 157–165 (2006)

  • [18] Park, H., Jeon, M., Rosen, J.B.: Lower dimensional representation of text data based on centroids and least squares. BIT Numerical mathematics 43(2), 427–448 (2003)
  • [19] Sim, T., Baker, S., Bsat, M.: The cmu pose, illumination, and expression (pie) database of human faces. Tech. Rep. CMU-RI-TR-01-02, Robotics Institute, Pittsburgh, PA (January 2001)
  • [20] Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.P.: Multi-label classification of music into emotions. In: ISMIR. vol. 8, pp. 325–330 (2008)
  • [21] Wang, H., Ding, C., Huang, H.: Multi-label linear discriminant analysis. In: Computer Vision–ECCV 2010, pp. 126–139. Springer (2010)
  • [22] Wang, H., Yan, S., Xu, D., Tang, X., Huang, T.: Trace ratio vs. ratio trace for dimensionality reduction. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. pp. 1–8. IEEE (2007)
  • [23] Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. In: Journal of Machine Learning Research. pp. 483–502 (2005)
  • [24] Ye, J., Ji, S.: Discriminant analysis for dimensionality reduction: An overview of recent developments. Biometrics: Theory, Methods, and Applications. Wiley-IEEE Press, New York (2010)
  • [25] Yu, K., Yu, S., Tresp, V.: Multi-label informed latent semantic indexing. In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 258–265. ACM (2005)
  • [26] Zha, H., Ding, C., Gu, M., He, X., Simon, H.: Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS’01), pp. 1057–1064 (2001)
  • [27] Zhang, Y., Zhou, Z.H.: Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data (TKDD) 4(3),  14 (2010)
  • [28] Zhu, X., Kandola, J., Ghahramani, Z., Lafferty, J.D.: Nonparametric transforms of graph kernels for semi-supervised learning. In: Advances in Neural Information Processing Systems. pp. 1641–1648 (2004)