Multi-label learning deals with the problem where an instance can be associated with multiple labels simultaneously. Formally speaking, let $\mathcal{X} = \mathbb{R}^n$ be the $n$-dimensional feature space and $\mathcal{Y}$ be the label space with $q$ labels. Given the multi-label training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathcal{X}$ is a feature vector and $y_i$ is the label vector, the goal of multi-label learning is to learn a model $f$, which maps from the space of feature vectors to the space of label vectors. As a learning framework that handles objects with multiple semantics, multi-label learning has been widely applied in many real-world applications, such as image annotation [Yang et al.2016], document categorization [Li, Ouyang, and Zhou2015], bioinformatics [Zhang and Zhou2006], and information retrieval [Gopal and Yang2010].
The most straightforward multi-label learning approach [Boutell et al.2004] is to decompose the problem into a set of independent binary classification tasks, one for each label. Although this strategy is easy to implement, it may result in degraded performance because it ignores correlations among labels. To compensate for this deficiency, the exploitation of label correlations has been widely accepted as a key component of effective multi-label learning approaches [Gibaja and Ventura2015, Zhang and Zhou2014].
So far, many methods have been developed to improve the performance of multi-label learning by exploring various types of label correlations [Tsoumakas et al.2009, Cesa-Bianchi, Gentile, and Zaniboni2006, Petterson and Caetano2011, Huang, Zhou, and Zhou2012, Huang, Yu, and Zhou2012, Zhu, Kwok, and Zhou2018]. There has been increasing interest in exploiting the label correlations by taking the label correlation matrix as prior knowledge [Hariharan et al.2010, Cai et al.2013, Huang et al.2016, Huang et al.2018]. Concretely, these methods directly calculate the label correlation matrix by the similarity between label vectors using common similarity measures, and then incorporate the label correlation matrix into model training for further enhancing the predictions of multiple label assignments. However, the label correlations are simply obtained by common similarity measures, which may not be able to reflect complex relationships among labels. Besides, these methods exploit label correlations by manipulating the hypothesis space, while the final predictions are not explicitly correlated.
To address the above limitations, we make a key assumption that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Based on this assumption, a novel multi-label learning approach named CAMEL, i.e., CollAboration based Multi-labEl Learning, is proposed. Different from most of the existing approaches that calculate the label correlation matrix simply by common similarity measures, CAMEL presents a novel method to learn such a matrix and shows that it is equivalent to sparse reconstruction in the label space. The learned label correlation matrix is capable of reflecting the collaborative relationships among labels regarding the final predictions. Subsequently, CAMEL seamlessly incorporates the learned label correlations into the desired multi-label predictive model. Specifically, a label-independent embedding is introduced, which aims to fit the final predictions with the learned label correlations while guiding the estimation of the model parameters simultaneously. The effectiveness of CAMEL is demonstrated by experimental results on a number of datasets.
In recent years, many algorithms have been proposed to deal with multi-label learning tasks. In terms of the order of label correlations being considered, these approaches can be roughly categorized into three strategies [Zhang and Zhou2014, Gibaja and Ventura2015].
For the first-order strategy, the multi-label learning problem is tackled in a label-by-label manner where label correlations are ignored. Intuitively, one can easily decompose the multi-label learning problem into a series of independent binary classification problems (one for each label) [Boutell et al.2004]. The second-order strategy takes into consideration pairwise relationships between labels, such as the ranking between relevant labels and irrelevant labels [Elisseeff and Weston2002] or the interaction of paired labels [Zhu et al.2005]. For the high-order strategy, high-order relationships among labels are considered. Following this strategy, numerous multi-label algorithms have been proposed. For example, by modeling all other labels' influences on each label, a shared subspace [Ji et al.2008] is extracted for model training. By addressing connections among random subsets of labels, a chain of binary classifiers [Read et al.2011] is sequentially trained.
Recently, there has been increasing interest in second-order approaches [Hariharan et al.2010, Cai et al.2013, Huang et al.2016, Huang et al.2018] that take the label correlation matrix as prior knowledge for model training. These approaches normally calculate the label correlation matrix directly from the similarity between label vectors using common similarity measures, and then incorporate it into model training to further enhance the predictions of multiple label assignments. For instance, cosine similarity is widely used to calculate the label correlation matrix [Cai et al.2013, Huang et al.2016, Huang et al.2018]. Such a label correlation matrix is further incorporated into a structured sparsity-inducing norm regularization [Cai et al.2013] for regularizing the learning hypotheses, or used for performing joint label-specific feature selection and model training [Huang et al.2016, Huang et al.2018]. In addition, there are also some high-order approaches that exploit label correlations in the hypothesis space without relying on the label correlation matrix. For example, a boosting approach [Huang, Yu, and Zhou2012] is proposed to exploit label correlations with a hypothesis reuse mechanism.
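As an illustration of the similarity-based correlation matrices discussed above, the following minimal sketch computes the cosine similarity between every pair of label columns. It assumes a 0/1 label encoding, and the function name `cosine_label_correlation` is a hypothetical helper, not taken from any of the cited methods.

```python
import math


def cosine_label_correlation(Y):
    """Cosine similarity between the columns (labels) of a label matrix.

    Y is a list of rows; Y[i][j] is 1 if instance i carries label j, else 0
    (the 0/1 encoding is an assumption of this sketch).
    """
    q = len(Y[0])
    cols = [[row[j] for row in Y] for j in range(q)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    C = [[0.0] * q for _ in range(q)]
    for a in range(q):
        for b in range(q):
            if norms[a] == 0.0 or norms[b] == 0.0:
                continue  # a label that never occurs gets similarity 0 here
            dot = sum(u * v for u, v in zip(cols[a], cols[b]))
            C[a][b] = dot / (norms[a] * norms[b])
    return C


# Toy label matrix: 4 instances, 3 labels.
Y = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [1, 1, 0]]
C = cosine_label_correlation(Y)
```

Each entry of `C` depends only on the two labels involved, which is why a matrix built this way captures pairwise (second-order) relationships only.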
Note that most of the existing approaches using a label correlation matrix are second-order and focus on the hypothesis space. Such simple label correlations exploited in the hypothesis space may not correctly depict the real relationships among labels, and the final predictions are not explicitly correlated. In the next section, a novel high-order approach with a crafted label correlation matrix that focuses on the label space will be introduced.
The CAMEL Approach
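Following the notations used in Introduction, the training set can be alternatively represented by $\mathcal{D} = \{(X, Y)\}$, where $X = [x_1, \ldots, x_m]^\top \in \mathbb{R}^{m \times n}$ denotes the instance matrix, and $Y = [y_1, \ldots, y_m]^\top$ denotes the $m \times q$ label matrix. In addition, we denote by $Y_j$ the $j$-th column vector of the matrix $Y$ (versus $y_j$ for the $j$-th row vector of $Y$), and $Y_{-j}$ represents the matrix that excludes the $j$-th column vector of $Y$.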
Label Correlation Learning
To characterize the collaborative relationships among labels regarding the final predictions, CAMEL works by learning a label correlation matrix $S = [s_{ij}]_{q \times q}$, where $s_{ij}$ reflects the contribution of the $i$-th label to the $j$-th label. Guided by the assumption that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels, we take the given label matrix $Y$ as the final prediction, and propose to learn the label correlation matrix $S$ in the following way:
where $\alpha$ is the tradeoff parameter that controls the collaboration degree. In other words, $\alpha$ is used to balance the $j$-th label's own prediction and the predictions of other labels. Since each label is normally correlated with only a few labels, the collaborative relationships between one label and other labels could be sparse. With a slight abuse of notation, we denote by $S_j$ the $j$-th column vector of $S$ excluding $s_{jj}$ ($s_{jj} = 0$). Under canonical sparse representation, the coefficient vector $S_j$ is learned by solving the following optimization problem:
where $\tilde{\lambda}$ controls the sparsity of the coefficient vector $S_j$. By properly rewriting the above problem and setting $\lambda$ to a suitably rescaled $\tilde{\lambda}$, it is easy to derive the following equivalent optimization problem:
Here, this problem aims to estimate the collaborative relationships between the $j$-th label and the other labels via sparse reconstruction. The first term corresponds to the linear reconstruction error via the $\ell_2$ norm, and the second term controls the sparsity of the reconstruction coefficients by using the $\ell_1$ norm. The relative importance of each term is balanced by the tradeoff parameter $\lambda$, which is set empirically in the experiments. To solve problem (3), the popular Alternating Direction Method of Multipliers (ADMM) [Boyd et al.2011] is employed, and detailed information is given in Appendix A. After solving problem (3) for each label, the weight matrix $S$ can be accordingly constructed with all diagonal elements set to 0. Note that for most of the existing second-order approaches using a label correlation matrix [Hariharan et al.2010, Cai et al.2013, Huang et al.2016, Huang et al.2018], only pairwise relationships are considered, and the relationships between one label and the other labels are treated separately. For CAMEL, in contrast, since the final prediction of each label is determined by the predictions of all other labels and itself, the relationships among all labels are exploited in a collaborative manner. That is, the relationships between one label and the other labels are coordinated (influenced by each other). Therefore, CAMEL is a high-order approach.
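The construction of the label correlation matrix described above can be sketched as follows. The sketch assumes that problem (3) has the standard lasso form, i.e., minimizing $\frac{1}{2}\|Y_{-j}S_j - Y_j\|_2^2 + \lambda\|S_j\|_1$, and a 0/1 label encoding. For brevity it swaps in plain coordinate descent as the solver rather than the ADMM procedure used by CAMEL, and the function names `lasso_cd` and `learn_label_correlation` are hypothetical.

```python
def soft_threshold(v, t):
    """Scalar soft-thresholding, the proximal map of t * |.|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_cd(cols, y, lam, n_iter=200):
    """Minimise 0.5*||sum_k s[k]*cols[k] - y||^2 + lam*||s||_1
    by cyclic coordinate descent (a stand-in solver, not ADMM)."""
    p = len(cols)
    s = [0.0] * p
    residual = list(y)                      # y - A s, with s = 0 initially
    sq = [sum(v * v for v in col) for col in cols]
    for _ in range(n_iter):
        for k in range(p):
            if sq[k] == 0.0:
                continue                    # an all-zero column cannot help
            col = cols[k]
            # correlation of column k with the residual that leaves out column k
            rho = sum(c * r for c, r in zip(col, residual)) + sq[k] * s[k]
            new = soft_threshold(rho, lam) / sq[k]
            delta = new - s[k]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                s[k] = new
    return s


def learn_label_correlation(Y, lam):
    """Reconstruct each label column from the other columns; S[i][j] holds the
    coefficient of label i when reconstructing label j, and S[j][j] stays 0."""
    q = len(Y[0])
    cols = [[float(row[j]) for row in Y] for j in range(q)]
    S = [[0.0] * q for _ in range(q)]
    for j in range(q):
        others = [i for i in range(q) if i != j]
        coef = lasso_cd([cols[i] for i in others], cols[j], lam)
        for i, c in zip(others, coef):
            S[i][j] = c
    return S


# Toy data: label 2 always co-occurs with label 0; label 1 is mostly unrelated.
Y = [[1, 1, 1],
     [0, 1, 0],
     [1, 0, 1],
     [0, 0, 0],
     [1, 0, 1],
     [0, 1, 0]]
S = learn_label_correlation(Y, lam=0.1)
```

On this toy matrix the coefficient linking labels 0 and 2 is close to 1 while the coefficient of the unrelated label is driven exactly to 0, which is the sparsity effect the $\ell_1$ term is meant to produce.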
Multi-Label Classifier Training
In this section, we propose a novel multi-label learning approach by seamlessly integrating the learned label correlations into the desired predictive model. Suppose the ordinary prediction matrix of $X$ is denoted by $f(X) = [f_1(X), f_2(X), \ldots, f_q(X)]$, where $f_j(\cdot)$ denotes the individual predictor of the $j$-th label. In the ordinary setting, each label predictor is only in charge of a single label, while label correlations are fully lost. To absorb the learned label correlations into predictions, we reuse the assumption that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels, and propose to compute the final prediction of the $j$-th label as follows:
where $\alpha$ is consistent with problem (1) and controls the collaboration degree of label predictions. By considering all the label predictions simultaneously, we thus obtain the following compact representation:
Here, the whole multi-label learning problem could be considered as two parallel subproblems, i.e., training the ordinary model and fitting the final predictions by the modeling outputs with label correlations. Thus, we propose to learn a label-independent embedding denoted by $Z$, which works as a bridge between model training and prediction fitting. This brings several advantages. First, the two subproblems can be solved via alternation, which encourages the mutual adaptation of model training and prediction fitting. Second, the relative importance of the two subproblems can be controlled by a tradeoff parameter. Third, closed-form solutions and a kernel extension can be easily derived. Let $G$ denote the $q \times q$ collaboration matrix determined by $\alpha$ and $S$; the proposed formulation is given as follows:
where $\Omega(f)$ controls the complexity of the model $f$, and $\lambda_1$ and $\lambda_2$ are the tradeoff parameters determining the relative importance of the above three terms. To instantiate the above formulation, we choose to train the widely-used model $f(X) = \phi(X)W + \mathbf{1}b^\top$, where $W$ and $b$ are the model parameters, $\mathbf{1}$ denotes the column vector with all elements equal to 1, and $\phi(\cdot)$ is a feature mapping that maps the feature space to some higher (possibly infinite) dimensional Hilbert space. For the regularization term to control the model complexity, we adopt the widely-used squared Frobenius norm, i.e., $\Omega(f) = \|W\|_F^2$. To further facilitate a kernel extension for the general nonlinear case, we finally present the formulation as a constrained optimization problem:
Problem (7) is convex with respect to $W$ and $b$ with $Z$ fixed, and also convex with respect to $Z$ with $W$ and $b$ fixed. Therefore, it is a biconvex problem [Gorski, Pfeuffer, and Klamroth2007], and can be solved by an alternating approach.
Updating $W$ and $b$ with $Z$ fixed
With $Z$ fixed, problem (7) reduces to
By deriving the Lagrangian of the above constrained problem and setting the gradient with respect to $W$ to 0, it is easy to show that $W$ is proportional to $\phi(X)^\top A$, where $A$ is the matrix that stores the Lagrangian multipliers. Let $K = \phi(X)\phi(X)^\top$ be the kernel matrix with its element $k_{ij} = \phi(x_i)^\top \phi(x_j) = \kappa(x_i, x_j)$, where $\kappa(\cdot, \cdot)$ represents the kernel function. For CAMEL, the Gaussian kernel function is employed, with its width $\sigma$ set to the average Euclidean distance of all pairs of training instances. In this way, we choose to optimize with respect to $A$ and $b$ instead, and the closed-form solutions are reported as follows:
The detailed derivation is provided in Appendix B.
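The kernel setup described above can be sketched as follows. The sketch assumes the common Gaussian form $\kappa(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ and averages the distance over all unordered pairs of distinct training instances; both choices, and the function name `gaussian_kernel_matrix`, are assumptions of this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def gaussian_kernel_matrix(X):
    """Return (K, sigma) for training instances X (needs at least 2 rows).

    sigma is the mean Euclidean distance over all pairs i < j, and
    K[i][j] = exp(-d(i, j)^2 / (2 * sigma^2)) (assumed kernel form).
    """
    m = len(X)
    D = [[euclidean(X[i], X[j]) for j in range(m)] for i in range(m)]
    pair_dists = [D[i][j] for i in range(m) for j in range(i + 1, m)]
    sigma = sum(pair_dists) / len(pair_dists)
    K = [[math.exp(-D[i][j] ** 2 / (2.0 * sigma ** 2)) for j in range(m)]
         for i in range(m)]
    return K, sigma


# Three collinear points with pairwise distances 5, 10 and 5.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
K, sigma = gaussian_kernel_matrix(X)
```

Tying $\sigma$ to the data scale in this way removes one hyperparameter from the search, at the cost of computing all pairwise distances once.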
Updating $Z$ with $W$ and $b$ fixed
When $W$ and $b$ are fixed, the modeling output matrix $f(X) = \phi(X)W + \mathbf{1}b^\top$ can be calculated directly from $K$, $A$, and $b$. By inserting it, problem (7) reduces to:
Setting the gradient with respect to $Z$ to 0, we can obtain the following closed-form solution:
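To make the update of the embedding concrete, the sketch below assumes that the reduced problem has the form $\min_Z \frac{1}{2}\|Z - F\|_F^2 + \frac{\lambda_1}{2}\|ZG - Y\|_F^2$, where $F$ stands for the fixed modeling output matrix. Under that assumption, setting the gradient to zero gives $Z(I + \lambda_1 GG^\top) = F + \lambda_1 YG^\top$. The exact scaling of the terms and the helper names are assumptions of this illustration.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def solve(M, B):
    """Solve M X = B (B may have several columns) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [[aug[i][n + k] / aug[i][i] for k in range(len(B[0]))] for i in range(n)]


def update_Z(F, Y, G, lam1):
    """Closed-form minimiser of 0.5*||Z - F||_F^2 + 0.5*lam1*||Z G - Y||_F^2."""
    q = len(G)
    Gt = transpose(G)
    GGt = matmul(G, Gt)
    M = [[(1.0 if i == j else 0.0) + lam1 * GGt[i][j] for j in range(q)]
         for i in range(q)]
    YGt = matmul(Y, Gt)
    R = [[f + lam1 * y for f, y in zip(fr, yr)] for fr, yr in zip(F, YGt)]
    # Z M = R is equivalent to M Z^T = R^T because M is symmetric.
    return transpose(solve(M, transpose(R)))


# Toy example with two labels; G = (1 - alpha) I + alpha S is assumed,
# with alpha = 0.5 and S = [[0, 0.8], [0.6, 0]].
G = [[0.5, 0.4],
     [0.3, 0.5]]
F = [[0.9, 0.2],
     [0.1, 0.7]]
Y = [[1.0, 0.0],
     [0.0, 1.0]]
lam1 = 1.0
Z = update_Z(F, Y, G, lam1)
```

Because $I + \lambda_1 GG^\top$ is positive definite, the linear system always has a unique solution, so this step needs no iterative solver.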
Once the iterative process converges, the predicted label vector $\hat{y}$ of a test instance $x$ is given as:
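The collaborative combination of label predictions used throughout this section can be sketched as follows. The sketch assumes that the final score of label $j$ is $(1-\alpha) f_j(x) + \alpha \sum_{i \neq j} s_{ij} f_i(x)$, which corresponds to multiplying the modeling outputs by $G = (1-\alpha)I + \alpha S$. This weighting, and the function names, are assumptions of the illustration; how the real-valued scores are turned into a label vector is left open here.

```python
def collaboration_matrix(S, alpha):
    """Assumed form G = (1 - alpha) * I + alpha * S for a q x q matrix S."""
    q = len(S)
    return [[(1.0 - alpha) * (1.0 if i == j else 0.0) + alpha * S[i][j]
             for j in range(q)] for i in range(q)]


def collaborative_scores(f_x, S, alpha):
    """Combine the per-label modeling outputs f_x (length q) of one instance.

    Label j receives (1 - alpha) * f_x[j] plus alpha times the contributions
    S[i][j] * f_x[i] of the other labels.
    """
    G = collaboration_matrix(S, alpha)
    q = len(S)
    return [sum(f_x[i] * G[i][j] for i in range(q)) for j in range(q)]


# Toy correlation matrix with zero diagonal: labels 0 and 1 support each other.
S = [[0.0, 0.8, 0.0],
     [0.6, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
f_x = [0.9, 0.2, 0.1]
scores = collaborative_scores(f_x, S, alpha=0.5)
independent = collaborative_scores(f_x, S, alpha=0.0)
```

With `alpha=0.0` the scores reduce to the independent per-label outputs, so the collaboration degree interpolates between a first-order predictor and one driven by the other labels.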
In this section, we conduct extensive experiments on various datasets to validate the effectiveness of CAMEL.
For comprehensive performance evaluation, we collect sixteen benchmark multi-label datasets. For each dataset $\mathcal{S}$, we denote by $|\mathcal{S}|$, $dim(\mathcal{S})$, $L(\mathcal{S})$, $LCard(\mathcal{S})$, and $F(\mathcal{S})$ the number of examples, the number of features (dimensions), the number of distinct class labels, the average number of labels associated with each example, and the feature type, respectively. Table 1 summarizes the detailed characteristics of these datasets, which are organized in ascending order of $|\mathcal{S}|$. According to $|\mathcal{S}|$, we further roughly divide these datasets into regular-size datasets and large-size datasets. For performance evaluation, 10-fold cross-validation is conducted on these datasets, and mean metric values with standard deviations are recorded.
For performance evaluation, we use seven widely-used evaluation metrics, including One-error, Hamming loss, Coverage, Ranking loss, Average precision, Macro-averaging F1, and Micro-averaging F1. Note that for all the employed multi-label evaluation metrics, their values vary within the interval [0,1]. In addition, for the last three metrics, larger values indicate better performance, and we use the symbol $\uparrow$ to denote this. For the first four metrics, smaller values indicate better performance, which is denoted by $\downarrow$. More detailed information about these evaluation metrics can be found in [Zhang and Zhou2014].
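As a small illustration of two of these metrics, the sketch below computes Hamming loss and One-error following their standard definitions. It assumes 0/1 label vectors and real-valued scores for ranking; the function names are hypothetical helpers.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    m, q = len(Y_true), len(Y_true[0])
    wrong = sum(t != p
                for true_row, pred_row in zip(Y_true, Y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / (m * q)


def one_error(Y_true, scores):
    """Fraction of instances whose top-ranked label is not a relevant label."""
    misses = 0
    for true_row, score_row in zip(Y_true, scores):
        top = max(range(len(score_row)), key=lambda j: score_row[j])
        if true_row[top] != 1:
            misses += 1
    return misses / len(Y_true)


Y_true = [[1, 0, 1],
          [0, 1, 0]]
Y_pred = [[1, 0, 0],
          [0, 1, 1]]
scores = [[0.9, 0.1, 0.4],
          [0.2, 0.3, 0.8]]
hl = hamming_loss(Y_true, Y_pred)   # 2 wrong entries out of 6
oe = one_error(Y_true, scores)      # the second instance ranks an irrelevant label first
```

Both values lie in [0,1] and smaller is better, matching the $\downarrow$ convention above.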
CAMEL is compared with three well-established and two state-of-the-art multi-label learning algorithms, including the first-order approach BR [Boutell et al.2004], the second-order approaches LLSF [Huang et al.2016] and JFSC [Huang et al.2018], and the high-order approaches ECC [Read et al.2011], and RAKEL [Tsoumakas, Katakis, and Vlahavas2011]. Here, LLSF and JFSC are the state-of-the-art counterparts using label correlation matrix.
BR, ECC, and RAKEL are implemented under the MULAN multi-label learning package [Tsoumakas et al.2011] by using the logistic regression model as the base classifier. Furthermore, parameters suggested in the corresponding literature are used, e.g., an ensemble size of 30 for ECC, together with the suggested ensemble configuration for RAKEL. For LLSF and JFSC, the parameters are chosen from candidate sets. For the proposed approach CAMEL, the tradeoff parameter $\lambda_1$ is empirically set to 1, while the tradeoff parameter $\lambda_2$ and the collaboration degree $\alpha$ are each chosen from a candidate set. All of these parameters are decided by conducting 5-fold cross-validation on the training set.
Tables 2 and 3 report the detailed experimental results on the regular-size and large-size datasets respectively, where the best performance among all the algorithms is shown in boldface. From the two result tables, we can see that CAMEL outperforms the other comparing algorithms in most cases. Specifically, on the regular-size datasets (Table 2), across all the evaluation metrics, CAMEL ranks first in 80.4% (45/56) of the cases, and on the large-size datasets (Table 3), across all the evaluation metrics, CAMEL ranks first in 69.6% (39/56) of the cases. Compared with the three well-established algorithms BR, ECC, and RAKEL, CAMEL introduces a new type of label correlations, i.e., collaborative relationships among labels, and achieves superior performance in 93.8% (315/336) of the cases. Compared with the two state-of-the-art algorithms LLSF and JFSC, instead of employing simple similarity measures to regularize the hypothesis space, CAMEL introduces a novel method to learn label correlations for explicitly correlating the final predictions, and achieves superior performance in 80.4% (180/224) of the cases. These comparative results demonstrate the effectiveness of the collaboration based multi-label learning approach.
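The reported percentages and case counts can be checked with a few lines of arithmetic. The sketch below assumes that the sixteen datasets split evenly into eight regular-size and eight large-size datasets, which is what the total of 56 cases per table (with seven metrics) suggests; that split is an inference, not a stated fact.

```python
N_METRICS = 7
N_DATASETS = 16
PER_GROUP = 8  # assumption: eight datasets in each of the two size groups

# (description, wins, total number of cases)
reported = [
    ("ranks first, regular-size", 45, PER_GROUP * N_METRICS),
    ("ranks first, large-size", 39, PER_GROUP * N_METRICS),
    ("vs. BR, ECC, RAKEL", 315, N_DATASETS * N_METRICS * 3),
    ("vs. LLSF, JFSC", 180, N_DATASETS * N_METRICS * 2),
]

totals = [total for _, _, total in reported]
percentages = [round(100.0 * wins / total, 1) for _, wins, total in reported]

for (name, wins, total), pct in zip(reported, percentages):
    print(f"{name}: {wins}/{total} = {pct}%")
```

The computed totals (56, 56, 336, 224) and rounded percentages agree with the figures quoted in the text.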
In this section, we first investigate the sensitivity of CAMEL with respect to the two tradeoff parameters $\lambda_1$ and $\lambda_2$, and the parameter $\alpha$ that controls the degree of collaboration, and then illustrate the convergence of CAMEL. Due to the page limit, we only report the experimental results on the enron dataset using the Coverage ($\downarrow$) metric. Concretely, we study the performance of CAMEL when we vary one parameter while keeping the other parameters fixed at their best setting. Figures 1(a), 1(b), and 1(c) show the sensitivity curves of CAMEL with respect to each of the three parameters. It can be seen that $\alpha$ and $\lambda_2$ have an important influence on the final performance, because they control the collaboration degree and the model complexity, respectively. Figure 1(d) illustrates the convergence of CAMEL by using the difference of the optimization variable between two successive iterations. From Figure 1(d), we can observe that this difference quickly decreases to 0 within a few iterations, which empirically demonstrates the convergence of CAMEL.
In this paper, we make a key assumption for multi-label learning that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Guided by this assumption, we propose a novel method to learn the high-order label correlations via sparse reconstruction in the label space. Besides, by seamlessly integrating the learned label correlations into model training, we propose a novel multi-label learning approach that aims to explicitly account for the correlated predictions of labels while training the desired model simultaneously. Extensive experimental results show that our approach outperforms the state-of-the-art counterparts.
Despite the demonstrated effectiveness of CAMEL, it only considers the global collaborative relationships between labels, by assuming that such collaborative relationships are shared by all the instances. However, as different instances have different characteristics, such collaborative relationships may be shared by only a subset of instances rather than all the instances. Therefore, our future work is to explore different collaborative relationships between labels for different subsets of instances.
This work was supported by MOE, NRF, and NTU.
Appendix A. The ADMM Procedure
Following the ADMM procedure, the above constrained optimization problem (13) can be solved as a series of unconstrained minimization problems using the augmented Lagrangian function, which is presented as:
Here, $\rho$ is the penalty parameter, and the remaining new variable is the Lagrange multiplier. By introducing the scaled dual variable (the Lagrange multiplier divided by $\rho$), a sequential minimization of the scaled ADMM iterations can be conducted by updating the primal variable, the auxiliary variable, and the scaled dual variable sequentially:
where the soft-thresholding operator is the proximity operator of the $\ell_1$ norm, which is defined elementwise as $\mathcal{S}_{\kappa}(a) = \mathrm{sign}(a)\max(|a| - \kappa, 0)$.
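The scaled ADMM iterations can be sketched as follows, in the standard form for the lasso given by Boyd et al. The sketch assumes that problem (13) reads $\min \frac{1}{2}\|As - y\|_2^2 + \lambda\|z\|_1$ subject to $s - z = 0$, with $A$ standing for $Y_{-j}$ and $y$ for $Y_j$. The variable names, the fixed iteration count, and the choice $\rho = 1$ are assumptions of this illustration.

```python
def soft_threshold(v, kappa):
    """Elementwise proximity operator of kappa * ||.||_1."""
    return [max(a - kappa, 0.0) - max(-a - kappa, 0.0) for a in v]


def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def admm_lasso(A, y, lam, rho=1.0, n_iter=300):
    """Scaled ADMM for 0.5*||A s - y||^2 + lam*||z||_1 subject to s = z."""
    p = len(A[0])
    At = [list(c) for c in zip(*A)]
    AtA = [[sum(a * b for a, b in zip(ci, cj)) for cj in At] for ci in At]
    Aty = [sum(a * b for a, b in zip(ci, y)) for ci in At]
    M = [[AtA[i][j] + (rho if i == j else 0.0) for j in range(p)] for i in range(p)]
    s, z, u = [0.0] * p, [0.0] * p, [0.0] * p
    for _ in range(n_iter):
        # s-update: ridge-like linear system (A^T A + rho I) s = A^T y + rho (z - u)
        s = solve(M, [Aty[i] + rho * (z[i] - u[i]) for i in range(p)])
        # z-update: soft-thresholding of s + u with threshold lam / rho
        z = soft_threshold([si + ui for si, ui in zip(s, u)], lam / rho)
        # u-update: accumulate the constraint violation s - z
        u = [ui + si - zi for ui, si, zi in zip(u, s, z)]
    return z


# Reconstruct a label column from two others; the first one is an exact copy.
A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
coef = admm_lasso(A, y, lam=0.1)
```

Returning the auxiliary variable `z` rather than `s` yields coefficients that are exactly zero where the soft-thresholding step suppresses them.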
Appendix B. Model Parameter Optimization
The Lagrangian of problem (8) is expressed as:
where $\mathrm{tr}(\cdot)$ is the trace operator, and $A$ is the introduced matrix that stores the Lagrangian multipliers. Besides, we have used a standard property of the trace operator. By setting the gradients with respect to each variable to 0, respectively, the following equations are induced:
The above linear equations can be simplified by the following steps:
Here, we define , then we can obtain:
In this way, can be calculated by .
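A derivation of this kind can be checked numerically. The sketch below assumes that problem (8) has the least-squares form $\min \frac{1}{2}\|E\|_F^2 + \frac{\lambda_2}{2}\|W\|_F^2$ subject to $Z - \phi(X)W - \mathbf{1}b^\top = E$. Under that assumption the stationarity conditions give $W = \phi(X)^\top A / \lambda_2$, $\mathbf{1}^\top A = 0$, and $HA + \mathbf{1}b^\top = Z$ with $H = K/\lambda_2 + I$, so that $b^\top = \mathbf{1}^\top H^{-1} Z / (\mathbf{1}^\top H^{-1} \mathbf{1})$ and $A = H^{-1}(Z - \mathbf{1}b^\top)$. The scaling of the objective and the function names are assumptions and may differ from the formulation in the paper by constant factors.

```python
def solve(M, B):
    """Solve M X = B (B may have several columns) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [[aug[i][n + k] / aug[i][i] for k in range(len(B[0]))] for i in range(n)]


def fit_dual(K, Z, lam2):
    """Return (A, b) under the assumed objective, given kernel matrix K (m x m)
    and embedding Z (m x q)."""
    m, q = len(K), len(Z[0])
    H = [[K[i][j] / lam2 + (1.0 if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    # Solve H [P | r] = [Z | 1] in one pass: P = H^{-1} Z, r = H^{-1} 1.
    sol = solve(H, [list(Z[i]) + [1.0] for i in range(m)])
    HinvZ = [row[:q] for row in sol]
    Hinv1 = [row[q] for row in sol]
    denom = sum(Hinv1)                                   # 1^T H^{-1} 1
    b = [sum(HinvZ[i][j] for i in range(m)) / denom for j in range(q)]
    A = [[HinvZ[i][j] - Hinv1[i] * b[j] for j in range(q)] for i in range(m)]
    return A, b


# Toy kernel matrix (symmetric, positive definite) and embedding.
K = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
Z = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
lam2 = 0.5
A, b = fit_dual(K, Z, lam2)
```

Under the same assumption, the modeling output on the training data is $KA/\lambda_2 + \mathbf{1}b^\top$, which equals $Z - A$.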
- [Boutell et al.2004] Boutell, M. R.; Luo, J.; Shen, X.; and Brown, C. M. 2004. Learning multi-label scene classification. Pattern Recognition 37(9):1757–1771.
- [Boyd et al.2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J.; et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1–122.
- [Cai et al.2013] Cai, X.; Nie, F.; Cai, W.; and Huang, H. 2013. New graph structured sparsity model for multi-label image annotations. In Proceedings of the IEEE International Conference on Computer Vision, 801–808.
- [Cesa-Bianchi, Gentile, and Zaniboni2006] Cesa-Bianchi, N.; Gentile, C.; and Zaniboni, L. 2006. Hierarchical classification: combining bayes with svm. In Proceedings of the 23rd International Conference on Machine learning, 177–184.
- [Elisseeff and Weston2002] Elisseeff, A., and Weston, J. 2002. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, 681–687.
- [Gibaja and Ventura2015] Gibaja, E., and Ventura, S. 2015. A tutorial on multilabel learning. ACM Computing Surveys 47(3):52.
- [Gopal and Yang2010] Gopal, S., and Yang, Y. 2010. Multilabel classification with meta-level features. In Proceedings of the 33rd international ACM SIGIR Conference on Research and Development in Information Retrieval, 315–322.
- [Gorski, Pfeuffer, and Klamroth2007] Gorski, J.; Pfeuffer, F.; and Klamroth, K. 2007. Biconvex sets and optimization with biconvex functions: A survey and extensions. Mathematical Methods of Operations Research 66(3):373–407.
- [Hariharan et al.2010] Hariharan, B.; Zelnik-Manor, L.; Varma, M.; and Vishwanathan, S. 2010. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning, 423–430.
- [Huang et al.2016] Huang, J.; Li, G.; Huang, Q.; and Wu, X. 2016. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering 28(12):3309–3323.
- [Huang et al.2018] Huang, J.; Li, G.; Huang, Q.; and Wu, X. 2018. Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics 48(3):876–889.
- [Huang, Yu, and Zhou2012] Huang, S.-J.; Yu, Y.; and Zhou, Z.-H. 2012. Multi-label hypothesis reuse. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 525–533.
- [Huang, Zhou, and Zhou2012] Huang, S.-J.; Zhou, Z.-H.; and Zhou, Z. 2012. Multi-label learning by exploiting label correlations locally. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, 949–955.
- [Ji et al.2008] Ji, S.; Tang, L.; Yu, S.; and Ye, J. 2008. Extracting shared subspace for multi-label classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 381–389.
- [Li, Ouyang, and Zhou2015] Li, X.; Ouyang, J.; and Zhou, X. 2015. Supervised topic models for multi-label classification. Neurocomputing 149:811–819.
- [Petterson and Caetano2011] Petterson, J., and Caetano, T. S. 2011. Submodular multi-label learning. In Advances in Neural Information Processing Systems, 1512–1520.
- [Read et al.2011] Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2011. Classifier chains for multi-label classification. Machine Learning 85(3):333.
- [Tsoumakas et al.2009] Tsoumakas, G.; Dimou, A.; Spyromitros, E.; Mezaris, V.; Kompatsiaris, I.; and Vlahavas, I. 2009. Correlation-based pruning of stacked binary relevance models for multi-label learning. In Proceedings of the 1st International Workshop on Learning from Multi-Label Data, 101–116.
- [Tsoumakas et al.2011] Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J.; and Vlahavas, I. 2011. Mulan: A java library for multi-label learning. Journal of Machine Learning Research 12(7):2411–2414.
- [Tsoumakas, Katakis, and Vlahavas2011] Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2011. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23(7):1079–1089.
- [Yang et al.2016] Yang, H.; Tianyi Zhou, J.; Zhang, Y.; Gao, B.-B.; Wu, J.; and Cai, J. 2016. Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 280–288.
- [Zhang and Zhou2006] Zhang, M.-L., and Zhou, Z.-H. 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10):1338–1351.
- [Zhang and Zhou2014] Zhang, M.-L., and Zhou, Z.-H. 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8):1819–1837.
- [Zhu et al.2005] Zhu, S.; Ji, X.; Xu, W.; and Gong, Y. 2005. Multi-labelled classification using maximum entropy method. In Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval, 274–281.
- [Zhu, Kwok, and Zhou2018] Zhu, Y.; Kwok, J. T.; and Zhou, Z.-H. 2018. Multi-label learning with global and local label correlation. IEEE Transactions on Knowledge and Data Engineering 30(6):1081–1094.