Introduction
Multi-label learning deals with the problem where an instance can be associated with multiple labels simultaneously. Formally, let $\mathcal{X} = \mathbb{R}^d$ be the $d$-dimensional feature space and $\mathcal{Y} = \{0,1\}^q$ be the label space with $q$ labels. Given the multi-label training set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathcal{X}$ is a feature vector and $\mathbf{y}_i \in \mathcal{Y}$ is the corresponding label vector, the goal of multi-label learning is to learn a model $h: \mathcal{X} \rightarrow \mathcal{Y}$, which maps from the space of feature vectors to the space of label vectors. As a learning framework that handles objects with multiple semantics, multi-label learning has been widely applied in many real-world applications, such as image annotation [Yang et al. 2016], document categorization [Li, Ouyang, and Zhou 2015], bioinformatics [Zhang and Zhou 2006], and information retrieval [Gopal and Yang 2010].

The most straightforward multi-label learning approach [Boutell et al. 2004] is to decompose the problem into a set of independent binary classification tasks, one for each label. Although this strategy is easy to implement, it may result in degraded performance because the correlations among labels are ignored. To compensate for this deficiency, the exploitation of label correlations has been widely accepted as a key component of effective multi-label learning approaches [Gibaja and Ventura 2015, Zhang and Zhou 2014].
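The decomposition strategy just described can be sketched with scikit-learn, which provides both a synthetic multi-label data generator and a wrapper that fits one independent binary classifier per label (the data sizes and the logistic regression base model here are illustrative assumptions, not the setup used in this paper):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data: 200 instances, 20 features, q = 5 labels.
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=5, random_state=0)

# Binary relevance: one independent binary classifier per label,
# ignoring any correlations among the labels.
br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
Y_pred = br.predict(X)
print(Y_pred.shape)
```

Each column of `Y_pred` is produced by its own classifier, which is exactly why label correlations are lost under this strategy.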
So far, many methods have been developed to improve the performance of multi-label learning by exploring various types of label correlations [Tsoumakas et al. 2009, Cesa-Bianchi, Gentile, and Zaniboni 2006, Petterson and Caetano 2011, Huang, Zhou, and Zhou 2012, Huang, Yu, and Zhou 2012, Zhu, Kwok, and Zhou 2018]. There has been increasing interest in exploiting label correlations by taking the label correlation matrix as prior knowledge [Hariharan et al. 2010, Cai et al. 2013, Huang et al. 2016, Huang et al. 2018]. Concretely, these methods directly calculate the label correlation matrix from the similarity between label vectors using common similarity measures, and then incorporate the label correlation matrix into model training for further enhancing the predictions of multiple label assignments. However, label correlations obtained by such simple similarity measures may not be able to reflect complex relationships among labels. Moreover, these methods exploit label correlations by manipulating the hypothesis space, while the final predictions are not explicitly correlated.
To address the above limitations, we make a key assumption that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Based on this assumption, a novel multi-label learning approach named CAMEL, i.e., CollAboration based Multi-labEl Learning, is proposed. Different from most of the existing approaches that calculate the label correlation matrix simply by common similarity measures, CAMEL presents a novel method to learn such a matrix, and shows that doing so is equivalent to sparse reconstruction in the label space. The learned label correlation matrix is capable of reflecting the collaborative relationships among labels regarding the final predictions. Subsequently, CAMEL seamlessly incorporates the learned label correlations into the desired multi-label predictive model. Specifically, a label-independent embedding is introduced, which aims to fit the final predictions with the learned label correlations while guiding the estimation of the model parameters simultaneously. The effectiveness of CAMEL is clearly demonstrated by experimental results on a number of datasets.
Related Work
In recent years, many algorithms have been proposed to deal with multi-label learning tasks. In terms of the order of label correlations being considered, these approaches can be roughly categorized into three strategies [Zhang and Zhou 2014, Gibaja and Ventura 2015].
For the first-order strategy, the multi-label learning problem is tackled in a label-by-label manner where label correlations are ignored. Intuitively, one can simply decompose the multi-label learning problem into a series of independent binary classification problems (one for each label) [Boutell et al. 2004]. The second-order strategy takes into consideration pairwise relationships between labels, such as the ranking between relevant and irrelevant labels [Elisseeff and Weston 2002] or the interaction of paired labels [Zhu et al. 2005]. For the third strategy, high-order relationships among labels are considered. Following this strategy, numerous multi-label algorithms have been proposed. For example, by modeling all other labels' influences on each label, a shared subspace [Ji et al. 2008] is extracted for model training. By addressing connections among random subsets of labels, a chain of binary classifiers [Read et al. 2011] is sequentially trained.

Recently, there has been increasing interest in second-order approaches [Hariharan et al. 2010, Cai et al. 2013, Huang et al. 2016, Huang et al. 2018]
that take the label correlation matrix as prior knowledge for model training. These approaches normally calculate the label correlation matrix directly from the similarity between label vectors using common similarity measures, and then incorporate the label correlation matrix into model training for further enhancing the predictions of multiple label assignments. For instance, cosine similarity is widely used to calculate the label correlation matrix [Cai et al. 2013, Huang et al. 2016, Huang et al. 2018]. Such a label correlation matrix is further incorporated into a structured sparsity-inducing norm regularization [Cai et al. 2013] for regularizing the learning hypotheses, or used for performing joint label-specific feature selection and model training [Huang et al. 2016, Huang et al. 2018]. In addition, there are also some high-order approaches that exploit label correlations in the hypothesis space without relying on the label correlation matrix. For example, a boosting approach [Huang, Yu, and Zhou 2012] is proposed to exploit label correlations with a hypothesis reuse mechanism.

Note that most of the existing approaches using a label correlation matrix are second-order and focus on the hypothesis space. Such simple label correlations exploited in the hypothesis space may not correctly depict the real relationships among labels, and the final predictions are not explicitly correlated. In the next section, a novel high-order approach with a crafted label correlation matrix that focuses on the label space will be introduced.
The CAMEL Approach
Following the notations used in the Introduction, the training set can be alternatively represented by $(X, Y)$, where $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$ denotes the instance matrix, and $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^\top \in \{0,1\}^{n \times q}$ denotes the label matrix. In addition, we denote by $\mathbf{m}_j$ the $j$th column vector of a matrix $M$ (versus $\mathbf{m}^i$ for the $i$th row vector of $M$), and $M_{-j}$ represents the matrix that excludes the $j$th column vector of $M$.
Label Correlation Learning
To characterize the collaborative relationships among labels regarding the final predictions, CAMEL works by learning a label correlation matrix $S \in \mathbb{R}^{q \times q}$, where $S_{kj}$ reflects the contribution of the $k$th label to the $j$th label. Guided by the assumption that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels, we take the given label matrix $Y$ as the final prediction, and propose to learn the label correlation matrix in the following way:
$$ \mathbf{y}_j = (1-\alpha)\,\mathbf{y}_j + \alpha\, Y_{-j}\,\mathbf{s}_j, \quad j = 1, \ldots, q \qquad (1) $$
where $\alpha \in (0,1)$ is the tradeoff parameter that controls the collaboration degree. In other words, $\alpha$ is used to balance the $j$th label's own prediction and the predictions of the other labels. Since each label is normally correlated with only a few labels, the collaborative relationships between one label and the other labels should be sparse. With a slight abuse of notation, we denote by $\mathbf{s}_j \in \mathbb{R}^{q-1}$ the $j$th column vector of $S$ excluding the diagonal element $S_{jj}$ ($S_{jj} = 0$). Under canonical sparse representation, the coefficient vector $\mathbf{s}_j$ is learned by solving the following optimization problem:
$$ \min_{\mathbf{s}_j}\ \frac{1}{2}\,\big\| \mathbf{y}_j - (1-\alpha)\,\mathbf{y}_j - \alpha\, Y_{-j}\,\mathbf{s}_j \big\|_2^2 + \mu\,\|\mathbf{s}_j\|_1 \qquad (2) $$
where $\mu$ controls the sparsity of the coefficient vector $\mathbf{s}_j$. By properly rewriting the above problem and setting $\lambda = \mu/\alpha^2$, it is easy to derive the following equivalent optimization problem:
$$ \min_{\mathbf{s}_j}\ \frac{1}{2}\,\big\| \mathbf{y}_j - Y_{-j}\,\mathbf{s}_j \big\|_2^2 + \lambda\,\|\mathbf{s}_j\|_1 \qquad (3) $$
Here, this problem aims to estimate the collaborative relationships between the $j$th label and the other labels via sparse reconstruction. The first term corresponds to the linear reconstruction error via the $\ell_2$ norm, and the second term controls the sparsity of the reconstruction coefficients via the $\ell_1$ norm. The relative importance of the two terms is balanced by the tradeoff parameter $\lambda$, which is empirically fixed in the experiments. To solve problem (3), the popular Alternating Direction Method of Multipliers (ADMM) [Boyd et al. 2011] is employed, and detailed information is given in Appendix A. After solving problem (3) for each label, the weight matrix $S$ can accordingly be constructed, with all diagonal elements set to 0. Note that for most of the existing second-order approaches using a label correlation matrix [Hariharan et al. 2010, Cai et al. 2013, Huang et al. 2016, Huang et al. 2018], only pairwise relationships are considered, and the relationships between one label and the other labels are treated separately. For CAMEL, in contrast, since the final prediction of each label is determined by the predictions of all the other labels together with its own, the relationships among all labels are exploited in a collaborative manner; that is, the relationships between one label and the other labels are coordinated (they influence each other). Therefore, CAMEL is a high-order approach.
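The per-label sparse reconstruction can be sketched with an off-the-shelf L1 solver standing in for the ADMM procedure; the function name, the regularization value, and the use of scikit-learn's Lasso (whose objective carries an extra 1/n scaling relative to problem (3)) are assumptions of this sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

def learn_label_correlations(Y, lam=0.01):
    """Reconstruct each label column from the remaining columns with a
    sparse linear model, collecting the coefficients into a q x q matrix
    S whose diagonal is fixed to 0."""
    n, q = Y.shape
    S = np.zeros((q, q))
    for j in range(q):
        others = np.delete(np.arange(q), j)
        reg = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        reg.fit(Y[:, others], Y[:, j])   # sparse reconstruction of label j
        S[others, j] = reg.coef_         # contribution of label k to label j
    return S

rng = np.random.default_rng(0)
Y = (rng.random((100, 6)) < 0.3).astype(float)
Y[:, 1] = Y[:, 0]                        # make label 1 a copy of label 0
S = learn_label_correlations(Y)
print(np.diag(S))                        # all zeros: no self-contribution
```

With label 1 an exact copy of label 0, the learned entry `S[0, 1]` should be large while most unrelated entries shrink toward zero, which is the sparse collaborative pattern the approach relies on.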
Multi-Label Classifier Training
In this section, we propose a novel multi-label learning approach by seamlessly integrating the learned label correlations into the desired predictive model. Suppose the ordinary prediction matrix of $X$ is denoted by $F = [\mathbf{f}_1, \ldots, \mathbf{f}_q] \in \mathbb{R}^{n \times q}$, where $\mathbf{f}_j$ denotes the outputs of the $j$th individual label predictor. In the ordinary setting, each label predictor is only in charge of a single label, so label correlations are completely lost. To absorb the learned label correlations into the predictions, we reuse the assumption that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels, and propose to compute the final prediction of the $j$th label as follows:
$$ \hat{\mathbf{y}}_j = (1-\alpha)\,\mathbf{f}_j + \alpha\, F_{-j}\,\mathbf{s}_j \qquad (4) $$
where $\alpha$ is consistent with problem (1) and controls the collaboration degree of the label predictions. By considering all the label predictions simultaneously, we thus obtain the following compact representation:
$$ \hat{Y} = F\,\big((1-\alpha)\,I + \alpha\,S\big) \qquad (5) $$
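A quick numerical check (random matrices, purely illustrative) confirms that combining each label's own prediction with the other labels' predictions column by column coincides with the compact matrix form when the diagonal of S is zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, alpha = 4, 3, 0.4
F = rng.random((n, q))                  # ordinary per-label predictions
S = rng.random((q, q))
np.fill_diagonal(S, 0.0)                # no self-contribution

# Label-wise combination: own prediction blended with the others'.
Y_hat = np.empty_like(F)
for j in range(q):
    Y_hat[:, j] = (1 - alpha) * F[:, j] + alpha * F @ S[:, j]

# Compact matrix form of the same combination.
T = (1 - alpha) * np.eye(q) + alpha * S
print(np.allclose(Y_hat, F @ T))
```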
Here, the whole multi-label learning problem can be considered as two parallel subproblems, i.e., training the ordinary model and fitting the final predictions by the modeling outputs with label correlations. Thus, we propose to learn a label-independent embedding, denoted by $G \in \mathbb{R}^{n \times q}$, which works as a bridge between model training and prediction fitting. This brings several advantages: first, the two subproblems can be solved via alternation, which encourages the mutual adaptation of model training and prediction fitting; second, the relative importance of the two subproblems can be controlled by a tradeoff parameter; third, closed-form solutions and a kernel extension can be easily derived. Let $T = (1-\alpha)\,I + \alpha\,S$; the proposed formulation is given as follows:
$$ \min_{f,\,G}\ \frac{1}{2}\,\| f(X) - G \|_F^2 + \frac{\lambda_1}{2}\,\| G\,T - Y \|_F^2 + \frac{\lambda_2}{2}\,\Omega(f) \qquad (6) $$
where $\Omega(f)$ controls the complexity of the model $f$, and $\lambda_1$ and $\lambda_2$ are the tradeoff parameters determining the relative importance of the above three terms. To instantiate the above formulation, we choose to train the widely-used model $f(\mathbf{x}) = W^\top \phi(\mathbf{x}) + \mathbf{b}$, where $W$ and $\mathbf{b}$ are the model parameters, $\mathbf{1}$ denotes the column vector with all elements equal to 1, and $\phi(\cdot)$ is a feature mapping that maps the feature space to some higher (possibly infinite) dimensional Hilbert space. For the regularization term controlling the model complexity, we adopt the widely-used squared Frobenius norm, i.e., $\Omega(f) = \|W\|_F^2$. To further facilitate a kernel extension for the general nonlinear case, we finally present the formulation as a constrained optimization problem:
$$ \min_{W,\,\mathbf{b},\,Z,\,G}\ \frac{1}{2}\,\| Z - G \|_F^2 + \frac{\lambda_1}{2}\,\| G\,T - Y \|_F^2 + \frac{\lambda_2}{2}\,\| W \|_F^2 \quad \text{s.t.}\ Z = \Phi\,W + \mathbf{1}\mathbf{b}^\top \qquad (7) $$

where $\Phi = [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)]^\top$.
Optimization
Problem (7) is convex with respect to $W$ and $\mathbf{b}$ with $G$ fixed, and also convex with respect to $G$ with $W$ and $\mathbf{b}$ fixed. Therefore, it is a biconvex problem [Gorski, Pfeuffer, and Klamroth 2007], and can be solved by an alternating approach.
Updating $W$ and $\mathbf{b}$ with $G$ fixed
With $G$ fixed, problem (7) reduces to
$$ \min_{W,\,\mathbf{b},\,Z}\ \frac{1}{2}\,\| Z - G \|_F^2 + \frac{\lambda_2}{2}\,\| W \|_F^2 \quad \text{s.t.}\ Z = \Phi\,W + \mathbf{1}\mathbf{b}^\top \qquad (8) $$
By deriving the Lagrangian of the above constrained problem and setting the gradient with respect to $W$ to 0, it is easy to show that $W = \Phi^\top A$, where $A$ is the matrix that stores the Lagrangian multipliers. Let $K$ be the kernel matrix with its $(i,j)$th element $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$, where $\kappa(\cdot,\cdot)$ represents the kernel function. For CAMEL, the Gaussian kernel function is employed, with the bandwidth $\sigma$ set to the average Euclidean distance between all pairs of training instances. In this way, we choose to optimize with respect to $A$ and $\mathbf{b}$ instead, and the closed-form solutions are reported as follows:
$$ \mathbf{b} = \frac{G^\top M\,\mathbf{1}}{\mathbf{1}^\top M\,\mathbf{1}}, \qquad A = M\,\big(G - \mathbf{1}\mathbf{b}^\top\big) \qquad (9) $$
where $M = (K + \lambda_2 I)^{-1}$. The detailed derivation is provided in Appendix B.
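The Gaussian kernel with the bandwidth set to the average pairwise Euclidean distance can be sketched as follows; scaling the squared distance by 2*sigma^2 (rather than sigma^2) is an assumption of this sketch:

```python
import numpy as np

def gaussian_kernel(X):
    """Gaussian kernel matrix; the bandwidth sigma is the average Euclidean
    distance over all pairs of training instances."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    dist = np.sqrt(sq)
    sigma = dist[np.triu_indices_from(dist, k=1)].mean()  # average pairwise distance
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.random.default_rng(0).random((50, 8))
K = gaussian_kernel(X)
print(K.shape)
```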
Updating $G$ with $W$ and $\mathbf{b}$ fixed
When $W$ and $\mathbf{b}$ are fixed, the modeling output matrix is calculated by $Z = \Phi\,W + \mathbf{1}\mathbf{b}^\top = K A + \mathbf{1}\mathbf{b}^\top$. By inserting $Z$, problem (7) reduces to:
$$ \min_{G}\ \frac{1}{2}\,\| Z - G \|_F^2 + \frac{\lambda_1}{2}\,\| G\,T - Y \|_F^2 \qquad (10) $$
Setting the gradient with respect to $G$ to 0, we can obtain the following closed-form solution:
$$ G = \big(Z + \lambda_1\, Y\,T^\top\big)\,\big(I + \lambda_1\, T\,T^\top\big)^{-1} \qquad (11) $$
Once the iterative process converges, the predicted label vector of a test instance $\mathbf{x}^*$ is given as:
$$ \hat{\mathbf{y}}^* = \mathbb{1}\big[\, T^\top\big(A^\top \mathbf{k}^* + \mathbf{b}\big) > 0.5 \,\big], \qquad \mathbf{k}^* = \big[\kappa(\mathbf{x}_1,\mathbf{x}^*), \ldots, \kappa(\mathbf{x}_n,\mathbf{x}^*)\big]^\top \qquad (12) $$
The pseudo-code of CAMEL is presented in Algorithm 1. Since the proposed formulation is biconvex, this alternating optimization process is guaranteed to converge [Gorski, Pfeuffer, and Klamroth 2007].
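The overall alternating procedure can be sketched end-to-end as follows. This is a minimal reconstruction under stated assumptions (a kernelized least-squares model step with a bias term, a closed-form embedding step, and illustrative parameter values), not a verbatim transcription of Algorithm 1:

```python
import numpy as np

def camel_train(K, Y, S, alpha=0.5, lam1=1.0, lam2=1.0, n_iter=30):
    """Alternate between (i) fitting the kernel model (A, b) to the current
    embedding G and (ii) refitting G so that G @ T matches Y while staying
    close to the model outputs Z."""
    n, q = Y.shape
    T = (1 - alpha) * np.eye(q) + alpha * S
    G = Y.astype(float).copy()                   # initialize the embedding
    one = np.ones((n, 1))
    M = np.linalg.inv(K + lam2 * np.eye(n))      # factor shared by all iterations
    for _ in range(n_iter):
        # Model step: closed-form kernel least squares with a bias.
        denom = (one.T @ M @ one).item()
        b = (G.T @ M @ one) / denom
        A = M @ (G - one @ b.T)
        Z = K @ A + one @ b.T
        # Embedding step: trade off closeness to Z against fitting G @ T to Y.
        G = (Z + lam1 * Y @ T.T) @ np.linalg.inv(np.eye(q) + lam1 * T @ T.T)
    return A, b, T

rng = np.random.default_rng(0)
X = rng.random((60, 5))
Y = (rng.random((60, 4)) < 0.3).astype(float)
S = np.zeros((4, 4))                             # no label correlations in this demo
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sq.mean())
A, b, T = camel_train(K, Y, S)
scores = (K @ A + np.ones((60, 1)) @ b.T) @ T    # collaborative training scores
print(scores.shape)
```

Thresholding `scores` then yields the final label assignments; with a nonzero correlation matrix S, each column of `scores` mixes in the other labels' model outputs.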
Experiments
In this section, we conduct extensive experiments on various datasets to validate the effectiveness of CAMEL.
Experimental Setup
Datasets
For comprehensive performance evaluation, we collect sixteen benchmark multi-label datasets. For each dataset $\mathcal{S}$, we denote by $|\mathcal{S}|$, $\dim(\mathcal{S})$, $L(\mathcal{S})$, $LCard(\mathcal{S})$, and $F(\mathcal{S})$ the number of examples, the number of features (dimensions), the number of distinct class labels, the average number of labels associated with each example, and the feature type, respectively. Table 1 summarizes the detailed characteristics of these datasets, which are organized in ascending order of $|\mathcal{S}|$. According to $|\mathcal{S}|$, we further roughly divide these datasets into regular-size datasets ($|\mathcal{S}| < 5000$) and large-size datasets ($|\mathcal{S}| \geq 5000$). For performance evaluation, 10-fold cross-validation is conducted on these datasets, where mean metric values with standard deviations are recorded.
Evaluation Metrics
For performance evaluation, we use seven widely-used evaluation metrics, including One-error, Hamming loss, Coverage, Ranking loss, Average precision, Macro-averaging F1, and Micro-averaging F1. Note that for all the employed multi-label evaluation metrics, their values vary within the interval [0,1]. For the last three metrics, larger values indicate better performance, and we use the symbol ↑ to represent such positive logic; for the first four metrics, smaller values indicate better performance, which is represented by ↓. More detailed information about these evaluation metrics can be found in [Zhang and Zhou 2014].

Comparing Approaches
CAMEL is compared with three well-established and two state-of-the-art multi-label learning algorithms, including the first-order approach BR [Boutell et al. 2004], the second-order approaches LLSF [Huang et al. 2016] and JFSC [Huang et al. 2018], and the high-order approaches ECC [Read et al. 2011] and RAKEL [Tsoumakas, Katakis, and Vlahavas 2011]. Here, LLSF and JFSC are the state-of-the-art counterparts using a label correlation matrix.
BR, ECC, and RAKEL are implemented under the MULAN multi-label learning package [Tsoumakas et al. 2011], using the logistic regression model as the base classifier. Furthermore, the parameters suggested in the corresponding literature are used, i.e., ECC: ensemble size 30; RAKEL: ensemble size 2q with k = 3. For LLSF, JFSC, and the proposed approach CAMEL, the tradeoff parameters are chosen by grid search over the candidate ranges suggested in the corresponding literature, and all of these parameters are decided by conducting 5-fold cross-validation on the training set.

Table 1: Characteristics of the benchmark multi-label datasets.

Dataset  |S|  dim(S)  L(S)  LCard(S)  F(S)

cal500  502  68  174  26.04  numeric 
emotions  593  72  6  1.87  numeric 
genbase  662  1185  27  1.25  nominal 
medical  978  1449  45  1.25  nominal 
enron  1702  1001  53  3.38  nominal 
image  2000  294  5  1.24  numeric 
scene  2407  294  5  1.07  numeric 
yeast  2417  103  14  4.24  numeric 
science  5000  743  40  1.45  numeric 
arts  5000  462  26  1.64  numeric 
business  5000  438  30  1.59  numeric 
rcv1s1  6000  944  101  2.88  nominal 
rcv1s2  6000  944  101  2.63  nominal 
rcv1s3  6000  944  101  2.61  nominal 
rcv1s4  6000  944  101  2.48  nominal 
rcv1s5  6000  944  101  2.64  nominal 
Experimental Results

Table 2: Experimental results (mean±std) of the comparing algorithms on the regular-size datasets.

One-error ↓
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.129±0.053  0.292±0.052  0.001±0.001  0.110±0.021  0.207±0.038  0.242±0.033  0.175±0.027  0.218±0.027
BR     0.893±0.038  0.284±0.077  0.017±0.016  0.322±0.055  0.646±0.023  0.387±0.027  0.361±0.036  0.244±0.028
ECC    0.295±0.036  0.296±0.074  0.010±0.013  0.156±0.037  0.421±0.034  0.406±0.023  0.306±0.020  0.238±0.030
RAKEL  0.634±0.039  0.300±0.070  0.009±0.007  0.243±0.055  0.532±0.007  0.402±0.024  0.280±0.031  0.244±0.027
LLSF   0.138±0.050  0.412±0.051  0.002±0.005  0.120±0.020  0.250±0.042  0.327±0.030  0.259±0.020  0.394±0.029
JFSC   0.116±0.051  0.438±0.086  0.002±0.005  0.128±0.024  0.278±0.041  0.346±0.023  0.266±0.022  0.242±0.021

Hamming loss ↓
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.136±0.005  0.203±0.021  0.001±0.001  0.011±0.001  0.045±0.003  0.144±0.012  0.072±0.009  0.190±0.005
BR     0.189±0.005  0.216±0.028  0.002±0.002  0.026±0.003  0.111±0.006  0.210±0.014  0.139±0.009  0.205±0.007
ECC    0.154±0.005  0.214±0.027  0.009±0.004  0.011±0.002  0.067±0.002  0.210±0.016  0.112±0.006  0.204±0.010
RAKEL  0.195±0.004  0.238±0.025  0.002±0.001  0.020±0.002  0.092±0.004  0.223±0.013  0.139±0.008  0.224±0.009
LLSF   0.138±0.006  0.267±0.022  0.001±0.001  0.010±0.002  0.048±0.002  0.180±0.010  0.109±0.003  0.278±0.009
JFSC   0.191±0.004  0.295±0.019  0.001±0.001  0.010±0.001  0.051±0.003  0.188±0.012  0.110±0.007  0.206±0.006

Coverage ↓
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.752±0.019  0.312±0.031  0.012±0.005  0.028±0.012  0.239±0.028  0.156±0.016  0.062±0.006  0.446±0.010
BR     0.786±0.015  0.319±0.026  0.014±0.006  0.113±0.030  0.580±0.023  0.216±0.018  0.168±0.015  0.463±0.011
ECC    0.796±0.019  0.310±0.029  0.013±0.003  0.034±0.012  0.291±0.020  0.233±0.022  0.135±0.010  0.460±0.010
RAKEL  0.962±0.016  0.362±0.027  0.014±0.005  0.095±0.018  0.513±0.019  0.253±0.017  0.169±0.013  0.544±0.013
LLSF   0.778±0.025  0.362±0.032  0.021±0.006  0.031±0.014  0.283±0.023  0.192±0.007  0.092±0.006  0.601±0.020
JFSC   0.730±0.026  0.392±0.046  0.014±0.007  0.030±0.012  0.314±0.024  0.200±0.009  0.102±0.007  0.455±0.011

Ranking loss ↓
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.177±0.009  0.180±0.032  0.001±0.001  0.016±0.008  0.079±0.028  0.128±0.013  0.058±0.005  0.162±0.007
BR     0.233±0.007  0.182±0.030  0.003±0.004  0.091±0.027  0.304±0.014  0.204±0.017  0.151±0.015  0.176±0.008
ECC    0.219±0.007  0.172±0.031  0.002±0.002  0.022±0.010  0.118±0.008  0.225±0.023  0.117±0.010  0.179±0.009
RAKEL  0.366±0.008  0.225±0.029  0.002±0.001  0.073±0.018  0.244±0.017  0.221±0.018  0.131±0.014  0.240±0.009
LLSF   0.184±0.012  0.245±0.033  0.002±0.003  0.019±0.010  0.107±0.009  0.174±0.006  0.093±0.005  0.346±0.017
JFSC   0.188±0.010  0.271±0.041  0.003±0.003  0.017±0.008  0.118±0.013  0.183±0.007  0.105±0.007  0.179±0.009

Average precision ↑
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.515±0.018  0.788±0.035  0.997±0.003  0.917±0.017  0.718±0.025  0.843±0.018  0.897±0.012  0.775±0.013
BR     0.345±0.018  0.783±0.040  0.988±0.008  0.750±0.036  0.388±0.016  0.753±0.016  0.771±0.021  0.754±0.013
ECC    0.442±0.014  0.789±0.036  0.991±0.008  0.884±0.023  0.557±0.015  0.738±0.020  0.811±0.012  0.756±0.014
RAKEL  0.329±0.016  0.763±0.039  0.993±0.006  0.800±0.032  0.456±0.019  0.735±0.017  0.804±0.022  0.720±0.014
LLSF   0.507±0.021  0.716±0.035  0.997±0.005  0.912±0.015  0.682±0.028  0.790±0.014  0.843±0.008  0.601±0.015
JFSC   0.492±0.020  0.691±0.040  0.997±0.004  0.908±0.016  0.655±0.025  0.779±0.011  0.835±0.010  0.746±0.012

Macro-averaging F1 ↑
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.180±0.032  0.625±0.052  0.971±0.030  0.779±0.043  0.325±0.044  0.660±0.030  0.787±0.023  0.411±0.018
BR     0.167±0.019  0.620±0.044  0.951±0.029  0.640±0.060  0.236±0.016  0.553±0.027  0.623±0.026  0.391±0.021
ECC    0.236±0.027  0.622±0.043  0.928±0.037  0.755±0.054  0.303±0.030  0.540±0.030  0.662±0.026  0.395±0.015
RAKEL  0.187±0.020  0.614±0.044  0.958±0.030  0.689±0.051  0.256±0.017  0.540±0.028  0.644±0.024  0.381±0.020
LLSF   0.180±0.031  0.615±0.056  0.971±0.031  0.769±0.057  0.292±0.043  0.554±0.031  0.615±0.007  0.235±0.016
JFSC   0.239±0.031  0.345±0.023  0.971±0.031  0.772±0.043  0.339±0.048  0.559±0.035  0.705±0.019  0.300±0.007

Micro-averaging F1 ↑
       cal500       emotions     genbase      medical      enron        image        scene        yeast
CAMEL  0.337±0.017  0.649±0.041  0.988±0.012  0.835±0.019  0.580±0.023  0.659±0.031  0.780±0.026  0.655±0.010
BR     0.339±0.016  0.639±0.050  0.978±0.014  0.611±0.032  0.359±0.015  0.558±0.028  0.619±0.023  0.633±0.013
ECC    0.364±0.015  0.642±0.046  0.907±0.035  0.796±0.023  0.452±0.015  0.541±0.030  0.653±0.023  0.643±0.017
RAKEL  0.351±0.018  0.629±0.049  0.983±0.011  0.678±0.042  0.392±0.014  0.541±0.031  0.629±0.026  0.632±0.016
LLSF   0.325±0.015  0.637±0.049  0.992±0.003  0.823±0.027  0.534±0.025  0.557±0.032  0.618±0.008  0.280±0.018
JFSC   0.473±0.013  0.406±0.022  0.995±0.006  0.818±0.018  0.555±0.026  0.565±0.033  0.695±0.022  0.609±0.012

Table 3: Experimental results (mean±std) of the comparing algorithms on the large-size datasets.

One-error ↓
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.457±0.021  0.462±0.024  0.404±0.019  0.403±0.018  0.413±0.019  0.331±0.016  0.404±0.010  0.101±0.009
BR     0.760±0.015  0.642±0.022  0.742±0.019  0.723±0.024  0.718±0.021  0.662±0.021  0.715±0.015  0.417±0.016
ECC    0.574±0.022  0.526±0.023  0.471±0.020  0.441±0.021  0.448±0.021  0.378±0.019  0.425±0.016  0.153±0.008
RAKEL  0.623±0.014  0.543±0.024  0.613±0.019  0.592±0.022  0.578±0.020  0.552±0.020  0.575±0.014  0.201±0.009
LLSF   0.486±0.013  0.454±0.027  0.409±0.015  0.406±0.016  0.415±0.021  0.333±0.016  0.399±0.018  0.122±0.016
JFSC   0.489±0.027  0.447±0.027  0.418±0.016  0.407±0.014  0.418±0.025  0.337±0.015  0.407±0.023  0.122±0.019

Hamming loss ↓
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.030±0.001  0.055±0.002  0.026±0.008  0.023±0.001  0.023±0.001  0.018±0.001  0.022±0.001  0.024±0.001
BR     0.072±0.002  0.079±0.003  0.056±0.001  0.053±0.001  0.053±0.001  0.041±0.001  0.051±0.002  0.049±0.001
ECC    0.036±0.002  0.063±0.002  0.028±0.001  0.024±0.001  0.024±0.001  0.019±0.001  0.024±0.001  0.030±0.001
RAKEL  0.042±0.002  0.075±0.002  0.046±0.001  0.039±0.001  0.035±0.001  0.035±0.001  0.036±0.003  0.035±0.002
LLSF   0.036±0.002  0.054±0.002  0.027±0.001  0.025±0.001  0.025±0.001  0.019±0.001  0.023±0.001  0.048±0.007
JFSC   0.035±0.002  0.057±0.002  0.029±0.001  0.025±0.001  0.025±0.001  0.019±0.001  0.025±0.001  0.027±0.002

Coverage ↓
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.189±0.010  0.205±0.009  0.151±0.008  0.142±0.012  0.131±0.006  0.143±0.003  0.132±0.005  0.082±0.006
BR     0.303±0.011  0.204±0.009  0.393±0.011  0.341±0.013  0.351±0.018  0.294±0.015  0.336±0.013  0.141±0.002
ECC    0.196±0.009  0.229±0.009  0.166±0.011  0.154±0.007  0.154±0.008  0.108±0.003  0.145±0.001  0.104±0.001
RAKEL  0.209±0.012  0.214±0.008  0.273±0.011  0.329±0.012  0.293±0.017  0.273±0.012  0.246±0.012  0.107±0.003
LLSF   0.197±0.014  0.195±0.011  0.141±0.009  0.146±0.008  0.133±0.008  0.109±0.006  0.133±0.006  0.086±0.013
JFSC   0.196±0.011  0.233±0.018  0.140±0.006  0.143±0.009  0.136±0.010  0.106±0.005  0.139±0.006  0.086±0.011

Ranking loss ↓
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.139±0.007  0.135±0.008  0.058±0.003  0.077±0.005  0.047±0.003  0.057±0.002  0.073±0.002  0.040±0.004
BR     0.245±0.009  0.145±0.006  0.197±0.006  0.190±0.008  0.198±0.010  0.173±0.009  0.181±0.006  0.088±0.006
ECC    0.151±0.006  0.164±0.007  0.074±0.005  0.069±0.003  0.070±0.002  0.047±0.004  0.063±0.003  0.055±0.002
RAKEL  0.195±0.007  0.156±0.008  0.183±0.006  0.153±0.008  0.178±0.010  0.112±0.009  0.123±0.006  0.067±0.005
LLSF   0.149±0.009  0.141±0.009  0.060±0.003  0.060±0.004  0.048±0.003  0.034±0.003  0.045±0.003  0.045±0.009
JFSC   0.147±0.008  0.159±0.009  0.061±0.003  0.062±0.006  0.061±0.004  0.047±0.003  0.060±0.003  0.045±0.008

Average precision ↑
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.624±0.016  0.607±0.018  0.615±0.009  0.644±0.012  0.635±0.010  0.717±0.008  0.626±0.009  0.891±0.009
BR     0.383±0.011  0.514±0.013  0.353±0.011  0.382±0.015  0.382±0.015  0.443±0.013  0.390±0.009  0.709±0.008
ECC    0.516±0.020  0.553±0.018  0.545±0.016  0.587±0.015  0.585±0.016  0.677±0.017  0.600±0.009  0.844±0.005
RAKEL  0.487±0.012  0.526±0.015  0.424±0.012  0.489±0.016  0.459±0.014  0.479±0.012  0.432±0.009  0.858±0.007
LLSF   0.594±0.021  0.631±0.016  0.627±0.009  0.637±0.008  0.632±0.013  0.714±0.010  0.625±0.013  0.867±0.013
JFSC   0.595±0.020  0.621±0.020  0.606±0.008  0.630±0.009  0.624±0.014  0.700±0.012  0.624±0.013  0.874±0.018

Macro-averaging F1 ↑
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.310±0.038  0.312±0.029  0.250±0.023  0.258±0.022  0.247±0.025  0.340±0.031  0.253±0.016  0.326±0.046
BR     0.215±0.048  0.257±0.020  0.232±0.018  0.210±0.017  0.221±0.019  0.313±0.016  0.236±0.019  0.249±0.017
ECC    0.285±0.024  0.282±0.021  0.271±0.023  0.257±0.022  0.266±0.012  0.334±0.018  0.285±0.014  0.326±0.032
RAKEL  0.267±0.028  0.275±0.019  0.266±0.019  0.237±0.023  0.243±0.017  0.322±0.017  0.255±0.018  0.307±0.024
LLSF   0.312±0.038  0.219±0.032  0.261±0.022  0.257±0.025  0.270±0.027  0.334±0.031  0.217±0.018  0.325±0.028
JFSC   0.308±0.039  0.305±0.032  0.308±0.026  0.249±0.019  0.258±0.024  0.337±0.032  0.254±0.019  0.318±0.036

Micro-averaging F1 ↑
       science      arts         rcv1s1       rcv1s2       rcv1s3       rcv1s4       rcv1s5       business
CAMEL  0.428±0.018  0.415±0.015  0.401±0.015  0.437±0.017  0.431±0.025  0.491±0.017  0.441±0.015  0.746±0.011
BR     0.277±0.013  0.349±0.018  0.301±0.009  0.310±0.009  0.307±0.013  0.356±0.009  0.321±0.009  0.595±0.003
ECC    0.343±0.028  0.377±0.018  0.385±0.016  0.410±0.022  0.414±0.013  0.482±0.024  0.440±0.011  0.690±0.007
RAKEL  0.337±0.014  0.368±0.017  0.341±0.010  0.337±0.008  0.335±0.014  0.369±0.008  0.350±0.008  0.701±0.014
LLSF   0.446±0.025  0.368±0.018  0.463±0.016  0.432±0.018  0.428±0.023  0.478±0.017  0.438±0.019  0.693±0.035
JFSC   0.449±0.026  0.442±0.017  0.456±0.008  0.422±0.011  0.424±0.012  0.482±0.013  0.438±0.011  0.712±0.021
Tables 2 and 3 report the detailed experimental results on the regular-size and large-size datasets respectively, where the best performance among all the algorithms is shown in boldface. From the two tables, we can see that CAMEL outperforms the other comparing algorithms in most cases. Specifically, on the regular-size datasets (Table 2), across all the evaluation metrics, CAMEL ranks first in 80.4% (45/56) of cases, and on the large-size datasets (Table 3), CAMEL ranks first in 69.6% (39/56) of cases. Compared with the three well-established algorithms BR, ECC, and RAKEL, CAMEL introduces a new type of label correlations, i.e., collaborative relationships among labels, and achieves superior performance in 93.8% (315/336) of cases. Compared with the two state-of-the-art algorithms LLSF and JFSC, instead of employing simple similarity measures to regularize the hypothesis space, CAMEL introduces a novel method to learn label correlations for explicitly correlating the final predictions, and achieves superior performance in 80.4% (180/224) of cases. These comparative results clearly demonstrate the effectiveness of the collaboration based multi-label learning approach.
Sensitivity Analysis
In this section, we first investigate the sensitivity of CAMEL with respect to the two tradeoff parameters $\lambda_1$ and $\lambda_2$, and the parameter $\alpha$ that controls the degree of collaboration, and then illustrate the convergence of CAMEL. Due to the page limit, we only report the experimental results on the enron dataset using the Coverage (↓) metric. Concretely, we study the performance of CAMEL when varying one parameter while keeping the other parameters fixed at their best settings. Figures 1(a), 1(b), and 1(c) show the sensitivity curves of CAMEL with respect to $\lambda_1$, $\lambda_2$, and $\alpha$ respectively. It can be seen that $\alpha$ and $\lambda_2$ have an important influence on the final performance, because they control the collaboration degree and the model complexity, respectively. Figure 1(d) illustrates the convergence of CAMEL by using the difference of the optimization variable $G$ between two successive iterations, i.e., $\Delta G = \|G^{t+1} - G^{t}\|_F$. From Figure 1(d), we can observe that $\Delta G$ quickly decreases to 0 within a few iterations. Hence the convergence of CAMEL is demonstrated.
Conclusion
In this paper, we make a key assumption for multi-label learning that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Guided by this assumption, we propose a novel method to learn the high-order label correlations via sparse reconstruction in the label space. In addition, by seamlessly integrating the learned label correlations into model training, we propose a novel multi-label learning approach that explicitly accounts for the correlated predictions of labels while training the desired model simultaneously. Extensive experimental results show that our approach outperforms the state-of-the-art counterparts.

Despite the demonstrated effectiveness of CAMEL, it only considers the global collaborative relationships between labels, by assuming that such collaborative relationships are shared by all the instances. However, as different instances have different characteristics, such collaborative relationships may be shared by only a subset of instances rather than all of them. Therefore, our future work is to explore different collaborative relationships between labels for different subsets of instances.
Acknowledgements
This work was supported by MOE, NRF, and NTU.
Appendix A. The ADMM Procedure
To solve problem (3) by ADMM, we first reformulate problem (3) into the following equivalent form:
$$ \min_{\mathbf{s}_j,\,\mathbf{z}}\ \frac{1}{2}\,\big\| \mathbf{y}_j - Y_{-j}\,\mathbf{s}_j \big\|_2^2 + \lambda\,\|\mathbf{z}\|_1 \quad \text{s.t.}\ \mathbf{s}_j - \mathbf{z} = \mathbf{0} \qquad (13) $$
Following the ADMM procedure, the above constrained optimization problem (13) can be solved as a series of unconstrained minimization problems using the augmented Lagrangian function, which is presented as:
$$ \mathcal{L}_\rho(\mathbf{s}_j, \mathbf{z}, \boldsymbol{\nu}) = \frac{1}{2}\,\big\| \mathbf{y}_j - Y_{-j}\,\mathbf{s}_j \big\|_2^2 + \lambda\,\|\mathbf{z}\|_1 + \boldsymbol{\nu}^\top(\mathbf{s}_j - \mathbf{z}) + \frac{\rho}{2}\,\|\mathbf{s}_j - \mathbf{z}\|_2^2 \qquad (14) $$
Here, $\rho > 0$ is the penalty parameter and $\boldsymbol{\nu}$ is the Lagrange multiplier. By introducing the scaled dual variable $\mathbf{u} = \boldsymbol{\nu}/\rho$, a sequential minimization of the scaled ADMM iterations can be conducted by updating the three variables $\mathbf{s}_j$, $\mathbf{z}$, and $\mathbf{u}$ sequentially:
$$ \begin{aligned} \mathbf{s}_j^{t+1} &= \big(Y_{-j}^\top Y_{-j} + \rho I\big)^{-1}\big(Y_{-j}^\top \mathbf{y}_j + \rho\,(\mathbf{z}^t - \mathbf{u}^t)\big) \\ \mathbf{z}^{t+1} &= \mathcal{S}_{\lambda/\rho}\big(\mathbf{s}_j^{t+1} + \mathbf{u}^t\big) \\ \mathbf{u}^{t+1} &= \mathbf{u}^t + \mathbf{s}_j^{t+1} - \mathbf{z}^{t+1} \end{aligned} \qquad (15) $$
where $\mathcal{S}_{\kappa}(\cdot)$ is the proximity operator of the $\ell_1$ norm, which is defined element-wise as $\mathcal{S}_{\kappa}(a) = \mathrm{sign}(a)\,\max(|a| - \kappa,\, 0)$.
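The three updates can be sketched directly in code; the penalty rho, the regularization value, the iteration count, and the synthetic data are illustrative assumptions of this sketch:

```python
import numpy as np

def soft_threshold(a, kappa):
    """Proximity operator of the l1 norm (element-wise soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_sparse_reconstruction(Y_rest, y, lam=0.1, rho=1.0, n_iter=200):
    """Scaled ADMM for min_s 0.5 * ||y - Y_rest @ s||^2 + lam * ||s||_1."""
    m = Y_rest.shape[1]
    s = np.zeros(m); z = np.zeros(m); u = np.zeros(m)
    # The s-update is a ridge-regression system; invert it once up front.
    P = np.linalg.inv(Y_rest.T @ Y_rest + rho * np.eye(m))
    r = Y_rest.T @ y
    for _ in range(n_iter):
        s = P @ (r + rho * (z - u))            # s-update (quadratic step)
        z = soft_threshold(s + u, lam / rho)   # z-update (sparsity step)
        u = u + s - z                          # dual update
    return z

rng = np.random.default_rng(0)
Y_rest = rng.standard_normal((50, 8))
true_s = np.zeros(8); true_s[2] = 1.5          # only column 2 contributes
y = Y_rest @ true_s + 0.01 * rng.standard_normal(50)
s = admm_sparse_reconstruction(Y_rest, y)
print(np.nonzero(np.abs(s) > 0.1)[0])
```

Returning the `z` iterate (rather than `s`) yields exact zeros in the recovered coefficient vector, since the soft-thresholding step is what enforces sparsity.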
Appendix B. Model Parameter Optimization
The Lagrangian of problem (8) is expressed as:
$$ \mathcal{L}(W, \mathbf{b}, Z, A) = \frac{1}{2}\,\| Z - G \|_F^2 + \frac{\lambda_2}{2}\,\| W \|_F^2 + \lambda_2\,\mathrm{tr}\big(A^\top (Z - \Phi\,W - \mathbf{1}\mathbf{b}^\top)\big) \qquad (16) $$
where $\mathrm{tr}(\cdot)$ is the trace operator, and $A$ is the introduced matrix that stores the (conveniently scaled) Lagrangian multipliers. Besides, we have used the property of the trace operator that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$. By setting the gradients w.r.t. $W$, $Z$, and $\mathbf{b}$ to 0 respectively, the following equations are induced:
$$ W = \Phi^\top A, \qquad Z = G - \lambda_2\,A, \qquad A^\top \mathbf{1} = \mathbf{0} \qquad (17) $$
Together with the constraint $Z = \Phi\,W + \mathbf{1}\mathbf{b}^\top$, the above linear equations can be simplified by the following steps:
$$ G - \lambda_2\,A = \Phi\,\Phi^\top A + \mathbf{1}\mathbf{b}^\top \ \Longrightarrow\ \big(K + \lambda_2 I\big)\,A = G - \mathbf{1}\mathbf{b}^\top \qquad (18) $$
Here, we define $M = (K + \lambda_2 I)^{-1}$; then we can obtain:
$$ A = M\,\big(G - \mathbf{1}\mathbf{b}^\top\big) \qquad (19) $$
In this way, by further using the condition $A^\top \mathbf{1} = \mathbf{0}$, $\mathbf{b}$ can be calculated by $\mathbf{b} = G^\top M \mathbf{1} / (\mathbf{1}^\top M \mathbf{1})$.
References

[Boutell et al. 2004] Boutell, M. R.; Luo, J.; Shen, X.; and Brown, C. M. 2004. Learning multi-label scene classification. Pattern Recognition 37(9):1757–1771.
 [Boyd et al. 2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122.
 [Cai et al. 2013] Cai, X.; Nie, F.; Cai, W.; and Huang, H. 2013. New graph structured sparsity model for multi-label image annotations. In Proceedings of the IEEE International Conference on Computer Vision, 801–808.
 [Cesa-Bianchi, Gentile, and Zaniboni 2006] Cesa-Bianchi, N.; Gentile, C.; and Zaniboni, L. 2006. Hierarchical classification: combining Bayes with SVM. In Proceedings of the 23rd International Conference on Machine Learning, 177–184.
 [Elisseeff and Weston 2002] Elisseeff, A., and Weston, J. 2002. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, 681–687.
 [Gibaja and Ventura 2015] Gibaja, E., and Ventura, S. 2015. A tutorial on multilabel learning. ACM Computing Surveys 47(3):52.
 [Gopal and Yang 2010] Gopal, S., and Yang, Y. 2010. Multilabel classification with meta-level features. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 315–322.
 [Gorski, Pfeuffer, and Klamroth 2007] Gorski, J.; Pfeuffer, F.; and Klamroth, K. 2007. Biconvex sets and optimization with biconvex functions: A survey and extensions. Mathematical Methods of Operations Research 66(3):373–407.
 [Hariharan et al. 2010] Hariharan, B.; Zelnik-Manor, L.; Varma, M.; and Vishwanathan, S. 2010. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning, 423–430.
 [Huang et al. 2016] Huang, J.; Li, G.; Huang, Q.; and Wu, X. 2016. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering 28(12):3309–3323.
 [Huang et al. 2018] Huang, J.; Li, G.; Huang, Q.; and Wu, X. 2018. Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics 48(3):876–889.
 [Huang, Yu, and Zhou 2012] Huang, S.-J.; Yu, Y.; and Zhou, Z.-H. 2012. Multi-label hypothesis reuse. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 525–533.

[Huang, Zhou, and Zhou2012]
Huang, S.J.; Zhou, Z.H.; and Zhou, Z.
2012.
Multilabel learning by exploiting label correlations locally.
In
Proceedings of the 26th AAAI Conference on Artificial Intelligence
, 949–955.  [Ji et al.2008] Ji, S.; Tang, L.; Yu, S.; and Ye, J. 2008. Extracting shared subspace for multilabel classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 381–389.
 [Li, Ouyang, and Zhou2015] Li, X.; Ouyang, J.; and Zhou, X. 2015. Supervised topic models for multilabel classification. Neurocomputing 149:811–819.
 [Petterson and Caetano2011] Petterson, J., and Caetano, T. S. 2011. Submodular multilabel learning. In Advances in Neural Information Processing Systems, 1512–1520.
 [Read et al.2011] Read, J.; Pfahringer, B.; Holmes, G.; and Frank, E. 2011. Classifier chains for multilabel classification. Machine Learning 85(3):333.
 [Tsoumakas et al.2009] Tsoumakas, G.; Dimou, A.; Spyromitros, E.; Mezaris, V.; Kompatsiaris, I.; and Vlahavas, I. 2009. Correlationbased pruning of stacked binary relevance models for multilabel learning. In Proceedings of the 1st International Workshop on Learning from MultiLabel Data, 101–116.
 [Tsoumakas et al.2011] Tsoumakas, G.; SpyromitrosXioufis, E.; Vilcek, J.; and Vlahavas, I. 2011. Mulan: A java library for multilabel learning. Journal of Machine Learning Research 12(7):2411–2414.
 [Tsoumakas, Katakis, and Vlahavas2011] Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2011. Random klabelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23(7):1079–1089.
 [Yang et al.2016] Yang, H.; Tianyi Zhou, J.; Zhang, Y.; Gao, B.B.; Wu, J.; and Cai, J. 2016. Exploit bounding box annotations for multilabel object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 280–288.

[Zhang and Zhou2006]
Zhang, M.L., and Zhou, Z.H.
2006.
Multilabel neural networks with applications to functional genomics and text categorization.
IEEE Transactions on Knowledge and Data Engineering 18(10):1338–1351.  [Zhang and Zhou2014] Zhang, M.L., and Zhou, Z.H. 2014. A review on multilabel learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8):1819–1837.
 [Zhu et al.2005] Zhu, S.; Ji, X.; Xu, W.; and Gong, Y. 2005. Multilabelled classification using maximum entropy method. In Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval, 274–281.
 [Zhu, Kwok, and Zhou2018] Zhu, Y.; Kwok, J. T.; and Zhou, Z.H. 2018. Multilabel learning with global and local label correlation. IEEE Transactions on Knowledge and Data Engineering 30(6):1081–1094.