In real-world classification applications, an instance is often associated with more than one class label. For example, a scene image can be annotated with several tags boutell2004learning , a document may belong to multiple topics ueda2002parametric , and a piece of music may be associated with different genres turnbull2008semantic . Thus, multi-label learning has attracted a lot of attention in recent years zhang2014review .
Current studies on multi-label learning try to incorporate label correlations of different orders zhang2014review . However, existing approaches mostly focus on global label correlations shared by all instances furnkranz2008multilabel ; ji2008extracting ; read2011classifier . For example, the labels “fish” and “ocean” are highly correlated, and so are “stock” and “finance”. On the other hand, some label correlations are shared only by a local subset of the data huang2012 . For example, “apple” is related to “fruit” in gourmet magazines, but to “digital devices” in technology magazines. Previous studies exploit either global or local label correlations; considering both is clearly more beneficial.
Another problem with label correlations is that they are usually difficult to specify manually. As label correlations may vary in different contexts and there is no unified measure for specifying appropriate correlations, they are usually estimated from the observed data. Some approaches learn label hierarchies by hierarchical clustering Punera2005Automatically or Bayesian network structure learning zhang2010multi . However, a hierarchical structure may not exist in some applications. For example, labels such as “desert”, “mountains”, “sea”, “sunset” and “trees” have no natural hierarchical correlations, and label hierarchies may not be useful. Others estimate label correlations by the co-occurrence of labels in the training data NIPS2011_4239 . However, this may cause overfitting. Moreover, co-occurrence is less meaningful for labels with very few positive instances.
In multi-label learning, some labels may be missing from the training set. For example, human labelers may ignore object classes that they do not know or are not interested in. Recently, multi-label learning with missing labels has attracted much attention. Xu et al. xu2013speedup and Yu et al. Yu2014 considered using a low-rank structure on the instance-label mapping. A more direct approach to modeling the label dependency approximates the label matrix as a product of two low-rank matrices goldberg2010transduction . This leads to simpler recovery of the missing labels, and produces a latent representation of the label matrix.
In the missing label cases, estimation of label correlation becomes even more difficult, as the observed label distribution is different from the true one. As a result, the aforementioned methods (based on hierarchical clustering and co-occurrence, for example) will produce biased estimates of label correlations.
In this paper, we propose a new approach called “Multi-Label Learning with GLObal and loCAL Correlation” (GLOCAL), which simultaneously recovers the missing labels, trains the linear classifiers, and exploits both global and local label correlations. It learns a latent label representation. Classifier outputs are encouraged to be similar on highly positively correlated labels, and dissimilar on highly negatively correlated labels. We do not assume the presence of external knowledge sources specifying the label correlations. Instead, these correlations are learned simultaneously with the latent label representations and the instance-label mapping.
The rest of the paper is organized as follows. In Section 2, related work on multi-label learning with label correlations is reviewed. In Section 3, the problem formulation and the GLOCAL approach are presented. Experimental results are reported in Section 4. Finally, Section 5 concludes the paper.
Notation: For a matrix $A$, $A^\top$ denotes its transpose, $\mathrm{tr}(A)$ is its trace, $\|A\|_F$ is its Frobenius norm, and $\mathrm{diag}(A)$ returns a vector containing the diagonal elements of $A$. For two matrices $A$ and $B$, $A \circ B$ denotes the Hadamard (element-wise) product. For a vector $a$, $\|a\|$ is its $\ell_2$-norm, and $\mathrm{Diag}(a)$ returns a diagonal matrix with $a$ on the diagonal.
2 Related Work
Multi-label learning has been widely studied in recent years. Based on the degree of label correlations used, existing methods can be divided into three categories zhang2014review : (i) first-order; (ii) second-order; and (iii) high-order. In the first-order strategy, label correlations are not considered, and the multi-label problem is transformed into multiple independent binary classification problems. For example, BR boutell2004learning trains a classifier for each label independently. In the second-order strategy, pairwise label relations are considered. For example, CLR furnkranz2008multilabel transforms the multi-label learning problem into a pairwise label ranking problem. In the high-order strategy, the influence of all other labels on each label is taken into account. For example, CC read2011classifier transforms the multi-label learning problem into a chain of binary classification problems, with the ground-truth labels encoded into the features.
Most previous studies focus on global label correlations. However, MLLOC huang2012 demonstrates that sometimes label correlations may only be shared by a local data subset. Specifically, it enhances the feature representation of each instance by embedding a code into the feature space, which encodes the influence of the instance's labels on the local label correlations. This has several limitations. First, when the dimensionality of the feature space is large, the code is less discriminative and is dominated by the original features. Second, MLLOC considers only the local label correlations, but not the global ones. Third, MLLOC cannot learn with missing labels.
In some real-world applications, labels are partially observed, and multi-label learning with missing labels has attracted much attention. MAXIDE xu2013speedup is based on fast low-rank matrix completion, and has strong theoretical guarantees. However, it only works in the transductive setting. Moreover, a label correlation matrix has to be specified manually. LEML Yu2014 also relies on a low-rank structure, and works in an inductive setting. However, it only implicitly uses global label correlations. ML-LRC xu2014learning adopts a low-rank structure to capture global label correlations, and addresses the missing labels by introducing a supplementary label matrix. However, only global label correlations are taken into account. Obviously, it would be more desirable to learn both global and local label correlations simultaneously.
Manifold regularization belkin2006manifold exploits instance similarity by forcing the predicted values on similar instances to be similar. A similar idea can be adapted to the label manifold, and so predicted values for correlated labels should be similar. However, the Laplacian matrix is based on some label similarity or correlation matrix, which can be hard to specify as discussed in Section 1.
3 The Proposed Approach
In multi-label learning, an instance can be associated with multiple class labels. Let $\mathcal{L} = \{\lambda_1, \dots, \lambda_l\}$ be the set of $l$ class labels. We denote the feature vector of an instance by $x \in \mathbb{R}^d$, and its ground-truth label vector by $y \in \{-1, 1\}^l$, where $y_j = 1$ if the instance has class label $\lambda_j$, and $y_j = -1$ otherwise. As mentioned in Section 1, instances in the training data may be partially labeled, i.e., some labels may be missing. We adopt the general setting that both positive and negative labels can be missing goldberg2010transduction ; xu2013speedup ; Yu2014 . The observed label vector is denoted $\tilde{y} \in \{-1, 0, 1\}^l$, where $\tilde{y}_j = 0$ if class label $\lambda_j$ is not labeled (i.e., it is missing), and $\tilde{y}_j = y_j$ otherwise. Given the training data $\{(x_i, \tilde{y}_i)\}_{i=1}^n$, our goal is to learn a mapping $f : \mathbb{R}^d \to \mathbb{R}^l$.
In this paper, we propose the GLOCAL algorithm, which learns and exploits both global and local label correlations via label manifolds. To recover the missing labels, learning of the latent label representation and classifier training are performed simultaneously.
3.1 Basic Model
Let $Y = [y_1, \dots, y_n] \in \{-1, 1\}^{l \times n}$ be the ground-truth label matrix, where each $y_i$ is the label vector of instance $x_i$. As discussed in Section 1, $Y$ is assumed to be low-rank; let its rank be $k$. Thus, $Y$ can be written as the low-rank decomposition $Y = UV$, where $U \in \mathbb{R}^{l \times k}$ and $V \in \mathbb{R}^{k \times n}$. Intuitively, $V$ represents the latent labels, which are more compact and more semantically abstract than the original labels, while the matrix $U$ maps the latent labels back to the original label space.
In general, the labels are only partially observed. Let the observed label matrix be $\tilde{Y} \in \{-1, 0, 1\}^{l \times n}$, and let $\Omega$ be the set containing indices of the observed labels in $\tilde{Y}$ (i.e., indices of the nonzero elements in $\tilde{Y}$). We focus on minimizing the reconstruction error on the observed labels, i.e., $\|\Pi_\Omega(\tilde{Y} - UV)\|_F^2$, where $[\Pi_\Omega(A)]_{ij} = A_{ij}$ if $(i,j) \in \Omega$, and 0 otherwise. Moreover, we use a linear mapping $W \in \mathbb{R}^{d \times k}$ to map instances to the latent labels. This is learned by minimizing $\|V - W^\top X\|_F^2$, where $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ is the instance matrix. Combining these two, we obtain the following optimization problem:
$$\min_{U, V, W} \|\Pi_\Omega(\tilde{Y} - UV)\|_F^2 + \lambda_1 \|V - W^\top X\|_F^2 + \lambda_2 R(U, V, W), \qquad (1)$$
where $R(U, V, W)$ is a regularizer and $\lambda_1, \lambda_2$ are tradeoff parameters. While the square loss is used in Eqn. (1), it can be replaced by any differentiable loss function. The prediction on an instance $x$ is $f(x) = U W^\top x$. Let $F = U W^\top X$; then $F_{ji}$ is the predicted value of the $j$th label on instance $x_i$.
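The basic model can be sketched numerically. The following NumPy illustration uses random matrices and hypothetical dimensions; the notation ($X$ for instances, $Y \approx UV$ for the low-rank label decomposition, $W$ for the instance-to-latent mapping, a Frobenius-norm regularizer as one common choice of $R$) is assumed here for concreteness:

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, d, k = 6, 20, 10, 3          # labels, instances, features, latent dim

X = rng.standard_normal((d, n))    # instance matrix
U = rng.standard_normal((l, k))    # maps latent labels back to original labels
V = rng.standard_normal((k, n))    # latent label representation
W = rng.standard_normal((d, k))    # instance-to-latent-label mapping

Y_obs = np.sign(rng.standard_normal((l, n)))   # labels in {-1, 1}
J = (rng.random((l, n)) < 0.7).astype(float)   # indicator: 1 = observed
Y_obs = Y_obs * J                              # missing entries set to 0

lam1, lam2 = 1.0, 0.1
recon = np.linalg.norm(J * (Y_obs - U @ V), 'fro') ** 2   # masked reconstruction
fit   = np.linalg.norm(V - W.T @ X, 'fro') ** 2           # latent-label fitting
reg   = sum(np.linalg.norm(M, 'fro') ** 2 for M in (U, V, W))
objective = recon + lam1 * fit + lam2 * reg

F = U @ (W.T @ X)                  # classifier outputs on all instances
y_pred = np.sign(F[:, 0])          # predicted label vector of the first instance
```

The masked term only penalizes disagreement on observed entries, which is what allows the recovered product $UV$ to fill in the missing labels.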
3.2 Global and Local Manifold Regularizers
Exploiting label correlations is an essential ingredient of multi-label learning. Here, we use label correlations to regularize the model: intuitively, the more positively correlated two labels are, the closer the corresponding classifier outputs should be, and vice versa. Let $S_0 \in \mathbb{R}^{l \times l}$ be the global label correlation matrix. The manifold regularizer $\frac{1}{2} \sum_{i,j} [S_0]_{ij} \|f_i - f_j\|^2$ should then have a small value melacci2011primallapsvm . Here, $f_i$, the $i$th row of the output matrix $F_0 = U W^\top X$, is the vector of classifier outputs for the $i$th label on the $n$ samples. Let $D_0$ be the diagonal matrix with diagonal $S_0 \mathbf{1}$, where $\mathbf{1}$ is the vector of ones. The manifold regularizer can be equivalently written as $\mathrm{tr}(F_0^\top L_0 F_0)$ luo2009non , where $L_0 = D_0 - S_0$ is the Laplacian matrix of $S_0$.
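The equivalence between the pairwise form and the trace form can be checked numerically. In this sketch, the output matrix $F$ (rows indexed by labels) and the symmetric "correlation" matrix $S$ are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
l, n = 5, 12
F = rng.standard_normal((l, n))          # row i: outputs of label i on n samples

S = rng.random((l, l))
S = (S + S.T) / 2                        # symmetric correlation matrix
D = np.diag(S @ np.ones(l))              # degree matrix
L = D - S                                # Laplacian of S

pairwise = sum(S[i, j] * np.sum((F[i] - F[j]) ** 2)
               for i in range(l) for j in range(l))
trace_form = 2 * np.trace(F.T @ L @ F)   # pairwise sum equals 2 tr(F^T L F)

print(np.isclose(pairwise, trace_form))  # True
```

So minimizing the trace term pulls the outputs of positively correlated labels together, exactly as the pairwise form suggests.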
As discussed in Section 1, label correlations may vary from one local region to another. Assume that the data is partitioned into $g$ groups $X_1, \dots, X_g$, where $X_m$ has size $n_m$. This partitioning can be obtained from domain knowledge (e.g., gene pathways subramanian2005gene and networks chuang2007network in bioinformatics applications) or by clustering. Let $Y_m$ be the label submatrix in $Y$ corresponding to $X_m$, and $S_m$ be the local label correlation matrix of group $m$. Similar to the global label correlation, to encourage the classifier outputs to be similar on positively correlated labels and dissimilar on negatively correlated ones, we minimize $\sum_{m=1}^g \mathrm{tr}(F_m^\top L_m F_m)$, where $L_m$ is the Laplacian matrix of $S_m$ and $F_m = U W^\top X_m$ is the classifier output matrix for group $m$.
Combining global and local label correlations with Eqn. (1), we have the following optimization problem:
$$\min_{U, V, W} \|\Pi_\Omega(\tilde{Y} - UV)\|_F^2 + \lambda_1 \|V - W^\top X\|_F^2 + \lambda_2 R(U, V, W) + \lambda_3\, \mathrm{tr}(F_0^\top L_0 F_0) + \lambda_4 \sum_{m=1}^g \mathrm{tr}(F_m^\top L_m F_m), \qquad (2)$$
where $\lambda_3, \lambda_4$ are tradeoff parameters.
Intuitively, a large local group should contribute more to the global label correlations. In particular, the following Lemma shows that when the cosine similarity is used to compute the correlation matrices, $S_0$ is a weighted combination of the $S_m$'s, with the entries of larger groups weighted more heavily.
Lemma 1. Let $[S_0]_{ij} = \frac{y_i^\top y_j}{\|y_i\|\,\|y_j\|}$ and $[S_m]_{ij} = \frac{(y_i^m)^\top y_j^m}{\|y_i^m\|\,\|y_j^m\|}$, where $y_i$ is the $i$th row of $Y$, and $y_i^m$ is the $i$th row of $Y_m$. Then, $[S_0]_{ij} = \sum_{m=1}^g \frac{\|y_i^m\|\,\|y_j^m\|}{\|y_i\|\,\|y_j\|} [S_m]_{ij}$.
In general, when the global label correlation matrix is a linear combination of the local label correlation matrices, the following Proposition shows that the global label Laplacian matrix is also a linear combination of the local label Laplacian matrices with the same combination coefficients.
Proposition 1. If $S_0 = \sum_{m=1}^g \alpha_m S_m$, then $L_0 = \sum_{m=1}^g \alpha_m L_m$.
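The Proposition follows from the linearity of the degree operator $S \mapsto \mathrm{Diag}(S\mathbf{1})$. A quick numerical check with random symmetric matrices and hypothetical combination weights:

```python
import numpy as np

def laplacian(S):
    """Laplacian L = D - S, with D the diagonal degree matrix of S."""
    return np.diag(S.sum(axis=1)) - S

rng = np.random.default_rng(2)
l, g = 6, 3
alphas = rng.random(g)                        # combination coefficients
S_locals = []
for _ in range(g):
    A = rng.random((l, l))
    S_locals.append((A + A.T) / 2)            # symmetric local correlation matrix

S0 = sum(a * S for a, S in zip(alphas, S_locals))
L0 = laplacian(S0)                            # Laplacian of the combination
L_combo = sum(a * laplacian(S) for a, S in zip(alphas, S_locals))

print(np.allclose(L0, L_combo))  # True
```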
The success of label manifold regularization hinges on a good correlation matrix (or, equivalently, a good Laplacian matrix). In multi-label learning, a rudimentary approach is to compute the correlation coefficient between two labels by cosine similarity wang2009image . However, this can be noisy, since some labels may have very few positive instances in the training data. When labels can be missing, this computation may even become misleading, since the distribution of the observed labels may differ substantially from the ground-truth label distribution.
In this paper, instead of specifying any correlation metric or label correlation matrix, we learn the Laplacian matrices directly. Note that the Laplacian matrices are symmetric and positive semidefinite. Thus, for $m = 0, 1, \dots, g$, we decompose $L_m$ as $Z_m Z_m^\top$, where $Z_m \in \mathbb{R}^{l \times k}$. For simplicity, $k$ is set to the dimensionality of the latent representation $V$. As a result, learning the Laplacian matrices is transformed into learning the $Z_m$'s. Note that optimization w.r.t. $Z_m$ may lead to the trivial solution $Z_m = 0$. To avoid this problem, we add the constraint that the diagonal entries of $Z_m Z_m^\top$ are 1, for $m = 0, \dots, g$. This constraint also enables us to obtain a normalized Laplacian matrix chung1997spectral .
Let $J \in \{0, 1\}^{l \times n}$ be the indicator matrix with $J_{ij} = 1$ if $(i,j) \in \Omega$, and 0 otherwise. $\Pi_\Omega(\tilde{Y} - UV)$ can then be rewritten as the Hadamard product $J \circ (\tilde{Y} - UV)$. Combining the decomposition of the Laplacian matrices and the diagonal constraints on the $Z_m Z_m^\top$'s, we obtain the optimization problem:
$$\min_{U, V, W, \{Z_m\}} \|J \circ (\tilde{Y} - UV)\|_F^2 + \lambda_1 \|V - W^\top X\|_F^2 + \lambda_2 R(U, V, W) + \lambda_3\, \mathrm{tr}(F_0^\top Z_0 Z_0^\top F_0) + \lambda_4 \sum_{m=1}^g \mathrm{tr}(F_m^\top Z_m Z_m^\top F_m)$$
$$\text{s.t.}\quad \mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}, \quad m = 0, 1, \dots, g. \qquad (4)$$
Moreover, we will use the regularizer $R(U, V, W) = \|U\|_F^2 + \|V\|_F^2 + \|W\|_F^2$.
3.3 Learning by Alternating Minimization
Problem (4) can be solved by alternating minimization (Algorithm 1). In each iteration, we update one of the variables $U, V, W, Z_0, \dots, Z_g$ by gradient descent, keeping the others fixed. Specifically, the MANOPT toolbox manopt is used to implement gradient descent with line search: on the Euclidean space for the updates of $U$, $V$ and $W$, and on the manifold for the updates of the $Z_m$'s.
With $U$, $V$ and $W$ fixed, problem (4) reduces to
$$\min_{Z_m} \mathrm{tr}(F_m^\top Z_m Z_m^\top F_m) \quad \text{s.t.}\quad \mathrm{diag}(Z_m Z_m^\top) = \mathbf{1} \qquad (5)$$
for each $m \in \{0, 1, \dots, g\}$. Due to the constraint $\mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}$, it has no closed-form solution, and we solve it with projected gradient descent. The gradient of the objective w.r.t. $Z_m$ is
$$2 F_m F_m^\top Z_m.$$
To satisfy the constraint, we project each row of $Z_m$ onto the unit-norm ball after each update:
$$z_i \leftarrow z_i / \|z_i\|,$$
where $z_i$ is the $i$th row of $Z_m$.
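A single projected-gradient step for one $Z_m$ can be sketched as follows. The step size, dimensions, and the stand-in output matrix $F_m$ are all made up for illustration; in practice the step size comes from line search:

```python
import numpy as np

rng = np.random.default_rng(3)
l, n_m, k = 6, 15, 3
F_m = rng.standard_normal((l, n_m))    # classifier outputs on group m (stand-in)
Z = rng.standard_normal((l, k))        # current Z_m

eta = 0.01                             # step size (hypothetical)
grad = 2 * (F_m @ F_m.T) @ Z           # gradient of tr(F_m^T Z Z^T F_m)
Z = Z - eta * grad
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # project rows to unit norm

L_m = Z @ Z.T                          # learned Laplacian for group m
print(np.allclose(np.diag(L_m), 1.0))  # diagonal constraint satisfied
```

Row-wise normalization is exactly what keeps the diagonal of $Z_m Z_m^\top$ equal to 1, since $[Z_m Z_m^\top]_{ii} = \|z_i\|^2$.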
With the $Z_m$'s, $U$ and $W$ fixed, problem (4) reduces to
$$\min_V \|J \circ (\tilde{Y} - UV)\|_F^2 + \lambda_1 \|V - W^\top X\|_F^2 + \lambda_2 \|V\|_F^2.$$
Note that the columns of $V$ are independent of each other, and thus $V$ can be solved column-by-column. Let $v_i$ and $j_i$ be the $i$th columns of $V$ and $J$, respectively. The optimization problem for $v_i$ can be written as:
$$\min_{v_i} \|j_i \circ (\tilde{y}_i - U v_i)\|^2 + \lambda_1 \|v_i - W^\top x_i\|^2 + \lambda_2 \|v_i\|^2. \qquad (6)$$
Setting the gradient w.r.t. $v_i$ to 0, we obtain the following closed-form solution for $v_i$:
$$v_i = \left(U^\top \mathrm{Diag}(j_i)\, U + (\lambda_1 + \lambda_2) I\right)^{-1} \left(U^\top \mathrm{Diag}(j_i)\, \tilde{y}_i + \lambda_1 W^\top x_i\right).$$
This involves computing a matrix inverse for each $v_i$. If this is expensive, we can use gradient descent instead. The gradient of the objective in (6) w.r.t. $v_i$ is
$$-2 U^\top \mathrm{Diag}(j_i)(\tilde{y}_i - U v_i) + 2 \lambda_1 (v_i - W^\top x_i) + 2 \lambda_2 v_i.$$
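Assuming the tradeoff names $\lambda_1, \lambda_2$ and the Frobenius-norm regularizer used above, the closed-form column update can be sketched and verified numerically (all inputs random); the gradient of the column objective should vanish at the solution:

```python
import numpy as np

rng = np.random.default_rng(4)
l, d, k = 8, 10, 3
lam1, lam2 = 1.0, 0.1

U = rng.standard_normal((l, k))
W = rng.standard_normal((d, k))
x_i = rng.standard_normal(d)
y_i = np.sign(rng.standard_normal(l))
j_i = (rng.random(l) < 0.7).astype(float)   # 1 = label observed
y_i = y_i * j_i                             # observed label column
D_i = np.diag(j_i)

# v_i = (U^T D_i U + (lam1 + lam2) I)^{-1} (U^T D_i y_i + lam1 W^T x_i)
A = U.T @ D_i @ U + (lam1 + lam2) * np.eye(k)
b = U.T @ D_i @ y_i + lam1 * (W.T @ x_i)
v_i = np.linalg.solve(A, b)

# gradient of the column objective, evaluated at the closed-form solution
grad = (-2 * U.T @ D_i @ (y_i - U @ v_i)
        + 2 * lam1 * (v_i - W.T @ x_i)
        + 2 * lam2 * v_i)
print(np.allclose(grad, 0))  # True
```

Since $A$ is only $k \times k$, `np.linalg.solve` is cheap here; the gradient-descent alternative in the text matters when $k$ is large.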
With the $Z_m$'s, $V$ and $W$ fixed, problem (4) reduces to
$$\min_U \|J \circ (\tilde{Y} - UV)\|_F^2 + \lambda_2 \|U\|_F^2 + \lambda_3\, \mathrm{tr}(F_0^\top L_0 F_0) + \lambda_4 \sum_{m=1}^g \mathrm{tr}(F_m^\top L_m F_m).$$
Again, we use gradient descent, and the gradient w.r.t. $U$ is:
$$-2 \left(J \circ (\tilde{Y} - UV)\right) V^\top + 2 \lambda_2 U + 2 \lambda_3 L_0 U W^\top X X^\top W + 2 \lambda_4 \sum_{m=1}^g L_m U W^\top X_m X_m^\top W.$$
With the $Z_m$'s, $U$ and $V$ fixed, problem (4) reduces to
$$\min_W \lambda_1 \|V - W^\top X\|_F^2 + \lambda_2 \|W\|_F^2 + \lambda_3\, \mathrm{tr}(F_0^\top L_0 F_0) + \lambda_4 \sum_{m=1}^g \mathrm{tr}(F_m^\top L_m F_m).$$
The gradient w.r.t. $W$ is:
$$-2 \lambda_1 X (V - W^\top X)^\top + 2 \lambda_2 W + 2 \lambda_3 X X^\top W U^\top L_0 U + 2 \lambda_4 \sum_{m=1}^g X_m X_m^\top W U^\top L_m U.$$
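As a sanity check on the $U$-gradient (keeping, for brevity, only the reconstruction, regularization and global-manifold terms; the local-group terms have the same shape), the derived expression can be compared against a finite-difference approximation. All matrices and the tradeoff values are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)
l, n, d, k = 5, 12, 7, 3
lam2, lam3 = 0.1, 0.5

X = rng.standard_normal((d, n))
Y = np.sign(rng.standard_normal((l, n)))
J = (rng.random((l, n)) < 0.7).astype(float)
Y = Y * J                                  # observed label matrix
U = rng.standard_normal((l, k))
V = rng.standard_normal((k, n))
W = rng.standard_normal((d, k))
Z0 = rng.standard_normal((l, k))
L0 = Z0 @ Z0.T                             # stand-in global Laplacian

def obj_U(U):
    F0 = U @ W.T @ X
    return (np.linalg.norm(J * (Y - U @ V), 'fro') ** 2
            + lam2 * np.linalg.norm(U, 'fro') ** 2
            + lam3 * np.trace(F0.T @ L0 @ F0))

grad = (-2 * (J * (Y - U @ V)) @ V.T
        + 2 * lam2 * U
        + 2 * lam3 * L0 @ U @ (W.T @ X @ X.T @ W))

# finite-difference check along a random direction E
E = rng.standard_normal(U.shape)
eps = 1e-6
num = (obj_U(U + eps * E) - obj_U(U - eps * E)) / (2 * eps)
print(np.isclose(num, np.sum(grad * E), rtol=1e-4, atol=1e-6))
```

The $W$-gradient can be checked the same way by perturbing $W$ instead of $U$.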
4 Experiments
In this section, extensive experiments are performed on text and image datasets. Performance in both the full-label and missing-label cases is discussed.
4.1.1 Data sets
For text data, eleven Yahoo datasets (Arts, Business, Computers, Education, Entertainment, Health, Recreation, Reference, Science, Social and Society; http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar) and the Enron dataset (http://mulan.sourceforge.net/datasets-mlc.html) are used. For image data, the Corel5k (also from the mulan repository above) and Image (http://cse.seu.edu.cn/people/zhangml/files/Image.rar) datasets are used. In the sequel, each dataset is denoted by its first three letters (“Society” is denoted “Soci”, to distinguish it from “Social”). Detailed information on the datasets is shown in Table 1. For each dataset, we randomly select a fraction of the instances for training, and use the rest for testing.
In the GLOCAL algorithm, we use the $k$-means clustering algorithm to partition the data into local groups. The solution of Eqn. (1) is used to warm-start $U$, $V$ and $W$, and the $Z_m$'s are randomly initialized. GLOCAL is compared with the following state-of-the-art multi-label learning algorithms:
BR boutell2004learning , which trains an independent binary classifier for each label, without considering label correlations;
MLLOC huang2012 , which exploits local label correlations by encoding them into the instance’s feature representation;
LEML Yu2014 , which learns a linear instance-to-label mapping with low-rank structure, and implicitly takes advantage of global label correlation;
ML-LRC xu2014learning , which learns and exploits low-rank global label correlations for multi-label classification with missing labels.
Note that BR does not take label correlations into account; MLLOC considers only local label correlations; LEML implicitly uses global label correlations; and ML-LRC models global label correlations directly. As for the ability to handle missing labels, BR and MLLOC can only learn with fully labeled data.
For simplicity, we set $\lambda_3 = \lambda_4$ in GLOCAL. The other parameters, as well as those of the baseline methods, are selected via 5-fold cross-validation on the training set. All the algorithms are implemented in Matlab (with some C++ code for LEML).
4.1.3 Performance Evaluation
Let $n_t$ be the number of test instances, $Y_i^+$ and $Y_i^-$ be the sets of positive and negative labels associated with the $i$th instance, and $\mathcal{X}_j^+$ and $\mathcal{X}_j^-$ be the sets of positive and negative instances belonging to the $j$th label. Given input $x$, let $\mathrm{rank}(x, j)$ be the rank of label $j$ in the predicted label ranking (sorted in descending order of the predicted values). For performance evaluation, we use the following popular metrics in multi-label learning zhang2014review :
Ranking loss (Rkl): This is the fraction of (positive, negative) label pairs for which the negative label is ranked higher than the positive label, averaged over instances:
$$\mathrm{Rkl} = \frac{1}{n_t} \sum_{i=1}^{n_t} \frac{\left|\{(u, v) \in Y_i^+ \times Y_i^- : \mathrm{rank}(x_i, u) > \mathrm{rank}(x_i, v)\}\right|}{|Y_i^+|\,|Y_i^-|}.$$
Average AUC (Auc): This is the fraction of (positive, negative) instance pairs for which the positive instance is ranked higher than the negative instance, averaged over all labels:
$$\mathrm{Auc} = \frac{1}{l} \sum_{j=1}^{l} \frac{\left|\{(a, b) \in \mathcal{X}_j^+ \times \mathcal{X}_j^- : f_j(a) \ge f_j(b)\}\right|}{|\mathcal{X}_j^+|\,|\mathcal{X}_j^-|}.$$
Coverage (Cvg): This counts how many steps are needed, on average, to move down the predicted label ranking so as to cover all the positive labels of an instance:
$$\mathrm{Cvg} = \frac{1}{n_t} \sum_{i=1}^{n_t} \left(\max_{j \in Y_i^+} \mathrm{rank}(x_i, j) - 1\right).$$
Average precision (Ap): This is the average fraction of positive labels ranked higher than a particular positive label:
$$\mathrm{Ap} = \frac{1}{n_t} \sum_{i=1}^{n_t} \frac{1}{|Y_i^+|} \sum_{j \in Y_i^+} \frac{\left|\{k \in Y_i^+ : \mathrm{rank}(x_i, k) \le \mathrm{rank}(x_i, j)\}\right|}{\mathrm{rank}(x_i, j)}.$$
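Ranking loss and average AUC, as defined above, can be computed directly from a score matrix. A small NumPy sketch with made-up scores and ground truth (labels in rows, instances in columns, as in the rest of the paper):

```python
import numpy as np

def ranking_loss(Y, F):
    """Fraction of (positive, negative) label pairs ordered wrongly, per instance."""
    n = Y.shape[1]
    total = 0.0
    for i in range(n):
        pos, neg = F[Y[:, i] > 0, i], F[Y[:, i] < 0, i]
        if len(pos) == 0 or len(neg) == 0:
            continue
        total += np.mean(pos[:, None] <= neg[None, :])   # ties count as errors
    return total / n

def average_auc(Y, F):
    """Fraction of (positive, negative) instance pairs ordered correctly, per label."""
    aucs = []
    for j in range(Y.shape[0]):
        pos, neg = F[j, Y[j] > 0], F[j, Y[j] < 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        aucs.append(np.mean(pos[:, None] >= neg[None, :]))
    return float(np.mean(aucs))

Y = np.array([[ 1, -1,  1],      # ground-truth labels in {-1, 1}
              [-1,  1,  1],
              [ 1,  1, -1]])
F = np.array([[0.9, 0.2, 0.8],   # predicted scores
              [0.1, 0.7, 0.6],
              [0.8, 0.9, 0.1]])

print(ranking_loss(Y, F), average_auc(Y, F))  # 0.0 1.0 (perfect ranking here)
```

Coverage and average precision follow the same pattern, ranking each column of $F$ in descending order and scanning the positive labels.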
For Auc and Ap, the higher the better; for Rkl and Cvg, the lower the better. To reduce statistical variability, results are averaged over 10 independent repetitions. In the tables, an entry is highlighted when it is significantly better than the others (paired t-tests at 95% significance level).
4.2 Learning with Full Labels
In this experiment, all elements in the training label matrix are observed. Performance on the test data is shown in Table 2. As expected, BR performs worst, since it treats each label independently and ignores label correlations. MLLOC considers only local label correlations, and LEML only makes use of the low-rank structure. Though ML-LRC takes advantage of both the low-rank structure and label correlations, only global label correlations are considered. As a result, GLOCAL is the best overall, as it models both global and local label correlations.
To show the example correlations learned by GLOCAL, we use two local groups extracted from the Image dataset. Figure 1 shows that local label correlation does vary from group to group, and is different from global correlation. For group 1, “sunset” is highly correlated with “desert” and “sea” (Figure 1(c)). This can also be seen from the images in Figure 1(a). Moreover, “trees” sometimes co-occurs with “deserts” (first and last images in Figure 1(a)). However, in group 2 (Figure 1(d)), “mountain” and “sea” often occur together and “trees” occurs less often with “desert” (Figure 1(b)). Figure 1(e) shows the learned global label correlation: “sea” and “sunset”, “mountain” and “trees” are positively correlated, whereas “desert” and “sea”, “desert” and “trees” are negatively correlated. All these correlations are consistent with intuition.
To further validate the effectiveness of global and local label correlations, we study two degenerate versions of GLOCAL: (i) GLObal, which uses only global label correlations; and (ii) loCAL, which uses only local label correlations. Note that the local groups obtained by clustering are not of equal sizes. For some datasets, the largest cluster contains a majority of the instances, while the small ones contain far fewer. Global correlation is then dominated by the local correlation matrix of the largest cluster (Proposition 1), obscuring the performance difference on the whole test set. Hence, we focus on the performance on the small clusters. As can be seen from Table 3, using only global or only local correlations may be good enough on some datasets (such as Health). On the other hand, considering both types of correlations, as in GLOCAL, achieves comparable or better performance.
Table 3 (excerpt): Performance of GLObal, loCAL and GLOCAL on the small clusters (mean ± std).

| Dataset | Metric | GLObal | loCAL | GLOCAL |
| Art | Rkl (↓) | 0.137±0.003 | 0.137±0.002 | 0.130±0.005 |
| Art | Auc (↑) | 0.863±0.003 | 0.863±0.002 | 0.870±0.005 |
| Art | Cvg (↓) | 5.286±0.046 | 5.286±0.046 | 5.197±0.065 |
| Art | Ap (↑) | 0.602±0.013 | 0.602±0.010 | 0.631±0.011 |
| Bus | Rkl (↓) | 0.040±0.002 | 0.040±0.002 | 0.040±0.003 |
| Bus | Auc (↑) | 0.958±0.003 | 0.958±0.003 | 0.958±0.003 |
| Bus | Cvg (↓) | 2.529±0.035 | 2.528±0.040 | 2.528±0.040 |
| Bus | Ap (↑) | 0.882±0.002 | 0.882±0.002 | 0.886±0.003 |
| Com | Rkl (↓) | 0.095±0.002 | 0.095±0.002 | 0.092±0.002 |
| Com | Auc (↑) | 0.905±0.002 | 0.905±0.002 | 0.908±0.001 |
| Com | Cvg (↓) | 4.482±0.032 | 4.486±0.040 | 4.364±0.055 |
| Com | Ap (↑) | 0.677±0.003 | 0.676±0.003 | 0.678±0.005 |
| Edu | Rkl (↓) | 0.101±0.002 | 0.101±0.002 | 0.097±0.002 |
| Edu | Auc (↑) | 0.899±0.002 | 0.899±0.002 | 0.903±0.002 |
| Edu | Cvg (↓) | 4.803±0.033 | 4.805±0.036 | 4.672±0.051 |
| Edu | Ap (↑) | 0.605±0.003 | 0.605±0.003 | 0.624±0.005 |
| Ent | Rkl (↓) | 0.091±0.002 | 0.091±0.002 | 0.086±0.003 |
| Ent | Auc (↑) | 0.909±0.002 | 0.909±0.002 | 0.914±0.002 |
| Ent | Cvg (↓) | 2.817±0.027 | 2.797±0.035 | 2.709±0.059 |
| Ent | Ap (↑) | 0.748±0.003 | 0.749±0.004 | 0.759±0.006 |
| Hea | Rkl (↓) | 0.054±0.002 | 0.054±0.003 | 0.053±0.004 |
| Hea | Auc (↑) | 0.945±0.003 | 0.946±0.003 | 0.947±0.003 |
| Hea | Cvg (↓) | 3.508±0.036 | 3.506±0.049 | 3.504±0.041 |
| Hea | Ap (↑) | 0.810±0.004 | 0.810±0.004 | 0.812±0.006 |