1 Introduction
Multilabel learning deals with the problem in which each instance is associated with multiple labels simultaneously. Due to its ability to cope with real-world objects carrying multiple semantic meanings, multilabel learning has been successfully applied in various application domains [38], such as tag recommendation [20, 28], bioinformatics [4, 6, 36], information retrieval [10, 41], rule mining [30, 33], web mining [21, 29], and so on. Formally speaking, suppose the given multilabel data set is denoted by D = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d is a feature vector with d dimensions (features) and y_i ∈ {−1, +1}^q is the corresponding label vector, with the size of the label space being q. Here, y_ij = +1 indicates that the i-th instance has the j-th label (or equivalently, the j-th label is a relevant label of x_i); otherwise, the j-th label is an irrelevant label of x_i. Let X = R^d be the d-dimensional feature space and Y = {−1, +1}^q be the q-dimensional label space; multilabel learning aims to induce a mapping function f: X → Y, which is able to correctly predict the label vectors of unseen instances.
To solve the multilabel learning problem, the most straightforward solution is Binary Relevance (BR) [1, 31], which decomposes the original learning problem into a set of independent binary classification problems. However, this solution generally achieves mediocre performance, as label correlations are regrettably ignored. To ease this problem, a large number of multilabel learning approaches take label correlations into account explicitly or implicitly to improve the learning performance. Examples include chains of binary classifiers [25, 26], ensembles of multiclass classifiers [32], and label-specific features [35, 13, 14].
Although a considerable number of methods have been proposed to enhance the performance of multilabel learning, it still remains unknown whether data augmentation is helpful to multilabel learning. Data augmentation [17, 34, 24] is a widely used technique in many machine learning tasks: it applies small mutations to the original training data and synthetically creates new examples to virtually increase the number of training examples, thereby achieving better generalization performance. In this paper, we provide the first attempt to leverage the data augmentation technique to improve the performance of multilabel learning. We show that data augmentation can not only capture the local label correlations and label importances, but also potentially makes the learning function smoother.
Specifically, we propose a novel data augmentation approach, which is motivated by the observation that local data characteristics can be captured by clustering [18, 19]. The cluster center is the average feature vector of all the instances in the cluster, and can thus be regarded as a local representative of the cluster. If we consider the cluster center as a new instance, its corresponding label vector (labeling information) is supposed to be the average of the label vectors of all the instances in the cluster. Such a data augmentation approach brings multiple important advantages for multilabel learning. First, the local label correlations (within a cluster) can be captured by the label vector of the cluster center; local label correlations have already been shown to be very helpful to multilabel learning by existing works [16, 42]. Second, the labeling importance degree of each label in the cluster is reflected by the label vector of the cluster center. Many existing multilabel learning approaches [22, 12, 39, 11] have shown that great performance can be achieved by taking into account the labeling importance degree of each relevant label. Third, each cluster center can also be considered as the label smoothing [23] of all the instances in the cluster. Note that the label vector of each real instance is binary (entries in {−1, +1}), while the label vector of the cluster center is continuous (entries in [−1, +1]), which potentially makes the learning function smoother. In addition, our proposed augmentation approach can be considered as a generalization of the popular mixup approach [34] to the case of multiple examples.
With the augmented training data at hand, we further propose a novel regularization term. Inspired by the cluster assumption [3, 40] that instances in the same cluster are supposed to have the same label, this term bridges the gap between the real examples and the virtual examples: the modeling outputs of each real instance and its corresponding cluster center should be similar. Such a regularization term naturally promotes the local smoothness of the learning function. The effectiveness of the proposed approach is clearly demonstrated by extensive experimental results on a number of real-world multilabel data sets.
In summary, our main contributions are threefold:

We propose a novel data augmentation approach to enlarge the multilabel training set by generating multiple compact examples.

We propose a novel regularization approach that bridges the gap between the real examples and the virtual examples.

Extensive experimental results clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.
2 Related Work
There have been a huge number of approaches proposed to deal with the multilabel learning problem. According to the order of label correlations, most of the existing approaches can be roughly divided into three categories. Approaches in the first category [1, 31, 37] do not take label correlations into consideration, and normally tackle the multilabel learning problem in a label-by-label manner. Although this kind of approach is simple and intuitive, it can only achieve passable performance due to the neglect of label correlations. To address this issue, approaches in the second category [9, 13, 14] take into account the pairwise (second-order) correlations between labels. In addition, approaches in the third category [25, 26, 32, 7] consider high-order correlations among multiple labels (e.g., label subsets or all labels). Note that the existing approaches only exploit label correlations from the given training examples, and there still remains the question of whether we can exploit label correlations from virtual examples.
Data augmentation [17, 34, 24] is a widely used technique in many machine learning tasks: it applies small mutations to the original training data and synthetically creates new examples to virtually increase the number of training examples, thereby achieving better generalization. Traditional data augmentation techniques [17] for image classification tasks normally generate new examples from the original training data by flipping, distorting, adding a small amount of noise, or cropping a patch from an original image. Apart from these traditional techniques, the SamplePairing approach [17] randomly chooses two examples (x_a, y_a) and (x_b, y_b); the new example takes the averaged features (x_a + x_b)/2, with its label randomly decided to be either y_a or y_b. On the other hand, given such two examples, the new example generated by the mixup approach [34] is represented as (λx_a + (1 − λ)x_b, λy_a + (1 − λ)y_b) with λ ∈ [0, 1]. Although satisfactory performance has been achieved by these two approaches, they only focus on generating new examples by manipulating exactly two real examples. How to generate new examples from multiple real examples, and how to apply the generated examples to improve the performance of multilabel learning, still remain unknown. These questions will be answered in the next section.
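For concreteness, the two pairing-based schemes can be sketched as follows. This is an illustrative NumPy sketch; the function names are ours, and the SamplePairing label choice follows the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairing(xa, ya, xb, yb):
    # SamplePairing-style: average the features; the label is randomly
    # taken from one of the two examples (following the text above)
    y = ya if rng.random() < 0.5 else yb
    return (xa + xb) / 2.0, y

def mixup(xa, ya, xb, yb, lam):
    # mixup: the same convex combination for features and labels
    return lam * xa + (1 - lam) * xb, lam * ya + (1 - lam) * yb

xa, ya = np.array([1.0, 0.0]), np.array([1.0, 0.0])
xb, yb = np.array([0.0, 1.0]), np.array([0.0, 1.0])
x_new, y_new = mixup(xa, ya, xb, yb, lam=0.3)
print(x_new, y_new)   # -> [0.3 0.7] [0.3 0.7]
```

Both schemes operate on exactly two examples, which is the limitation the proposed approach removes.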
3 The Proposed Approach
In this section, we present our approach IMCC (Incorporating Multiple Clustering Centers). Following the notations used in the Introduction, we denote the feature matrix by X = [x_1, …, x_n]ᵀ ∈ R^{n×d} and the label matrix by Y = [y_1, …, y_n]ᵀ ∈ {−1, +1}^{n×q}, where n is the number of examples. IMCC works in two elementary steps: virtual example generation and multilabel model training.
3.1 Virtual Examples Generation
In the first step, IMCC aims to generate a number of virtual examples that could be useful for the subsequent model training step. In order to generate new examples, we have to gain some insights from the existing examples. To achieve this, clustering techniques are widely used as standalone tools for data analysis [35]. In this paper, the popular k-means algorithm [19] is adopted, due to its simplicity and effectiveness. Suppose the instances are partitioned into m disjoint clusters {C_1, C_2, …, C_m}. If the i-th instance is partitioned into the k-th cluster C_k, then x_i ∈ C_k. Typically, the cluster center is a representative instance of the cluster. Hence for each cluster C_k, its center c_k is defined as:
c_k = (Σ_{i=1}^{n} I(x_i ∈ C_k) x_i) / (Σ_{i=1}^{n} I(x_i ∈ C_k))    (1)
where I(·) is an indicator function, i.e., I(p) equals 1 if the predicate p is true, and 0 otherwise. From one point of view, c_k is the local representative of the instances belonging to the k-th cluster, hence its semantic meaning could be the average of the semantic meanings of all the instances in the cluster. In other words, let ŷ_k denote the labeling information of c_k; then ŷ_k should be the average of the label vectors of all the instances in C_k:
ŷ_k = (Σ_{i=1}^{n} I(x_i ∈ C_k) y_i) / (Σ_{i=1}^{n} I(x_i ∈ C_k))    (2)
In this way, we obtain a complementary training set D̃ = {(c_k, ŷ_k)}_{k=1}^{m}. Here we give a concrete example to illustrate the advantages of the proposed data augmentation approach. Suppose there is a cluster including three examples (x_1, y_1), (x_2, y_2), and (x_3, y_3), where y_1 = (+1, −1, +1, +1), y_2 = (+1, +1, −1, +1), and y_3 = (+1, −1, +1, +1). Hence the virtual example is given as (c, ŷ), where c = (x_1 + x_2 + x_3)/3 and the label vector is ŷ = (1, −1/3, 1/3, 1). First, it is clear that our proposed data augmentation approach can be considered as a generalization of the popular mixup approach [34] to the case of multiple examples. Second, the generated label vector contains soft labels, which are able to describe the labeling importance degree of each label [12, 39, 8] in the cluster. As we can see, the first and the fourth labels are the most important. Third, as each soft label vector is generated by aggregating the local labeling information in the cluster, the local label correlations can be captured. Concretely, the first and the fourth labels co-occur in the same cluster, hence they have very strong local correlations. Besides, there is a negative value for the second label, which suggests that the second label may possess the opposite semantic meaning against the other labels, since the other labels have positive values. Fourth, the soft label vector of the cluster center can also be considered as the label smoothing [23] of all the instances in the cluster. Note that the label vector of each real instance is binary (entries in {−1, +1}), while the label vector of the cluster center is continuous (entries in [−1, +1]), which potentially makes the learning function smoother.
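The generation step amounts to a few lines on top of k-means. The following sketch is one way to realize Eqs. (1) and (2); the function name is ours, and the toy cluster uses hypothetical labels:

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_virtual_examples(X, Y, n_clusters):
    """One virtual example per cluster: the feature centroid (Eq. (1))
    paired with the average label vector of the cluster (Eq. (2))."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    Xc = km.cluster_centers_                       # virtual feature vectors
    Yc = np.zeros((n_clusters, Y.shape[1]))
    for k in range(n_clusters):
        Yc[k] = Y[km.labels_ == k].mean(axis=0)    # soft label vectors
    return Xc, Yc

# toy cluster of three examples with labels in {-1, +1}
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]])
Y = np.array([[+1, -1, +1, +1],
              [+1, +1, -1, +1],
              [+1, -1, +1, +1]])
Xc, Yc = generate_virtual_examples(X, Y, n_clusters=1)
print(Yc[0])   # -> approximately [1, -1/3, 1/3, 1]
```

The soft label vector directly exhibits the properties discussed above: dominant labels stay near +1 while a contested label shrinks toward 0 or turns negative.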
3.2 Multi-Label Model Training
For compact representations of the complementary training set, the additional feature matrix and the corresponding label matrix are denoted by X̃ = [c_1, …, c_m]ᵀ ∈ R^{m×d} and Ỹ = [ŷ_1, …, ŷ_m]ᵀ ∈ [−1, +1]^{m×q}, respectively. Note that there are soft labels (ranging from −1 to +1) in Ỹ, while there are hard labels (either −1 or +1) in Y.
With the original data set D and the complementary data set D̃, the objective function can be designed as follows:
min_{W, b} Σ_{i=1}^{n} ||Wᵀx_i + b − y_i||₂² + λ1 Σ_{k=1}^{m} ||Wᵀc_k + b − ŷ_k||₂² + λ2 ||W||_F²    (3)
where W ∈ R^{d×q} and b ∈ R^q are the model parameters, and the widely used Frobenius norm of W is employed to reduce the model complexity to avoid overfitting. The tradeoff hyperparameters λ1 and λ2 control the importance of learning from virtual examples and of the model complexity, respectively. In a compact representation, problem (3) can be equivalently stated as follows:

min_{W, b} ||XW + 1_n bᵀ − Y||_F² + λ1 ||X̃W + 1_m bᵀ − Ỹ||_F² + λ2 ||W||_F²    (4)
where 1_n and 1_m denote the all-ones vectors of size n and m, respectively. Inspired by the cluster assumption [3, 40] that instances in the same cluster are supposed to have the same label, we propose a novel regularization approach: the modeling output of each instance should be similar to that of its corresponding cluster center. Thus the regularization term is stated as:
Ω(W) = Σ_{i=1}^{n} Σ_{k=1}^{m} I(x_i ∈ C_k) ||Wᵀx_i − Wᵀc_k||₂²    (5)
Note that the clusters are disjoint, hence each instance x_i corresponds to exactly one cluster center c_k such that x_i ∈ C_k is true. Here, we specially introduce a matrix S ∈ {0, 1}^{n×m} where S_ik = I(x_i ∈ C_k). In this way, problem (5) is equivalent to:
Ω(W) = ||XW − S X̃W||_F²    (6)
By combining problem (4) and problem (6), the final objective function is given as:
min_{W, b} ||XW + 1_n bᵀ − Y||_F² + λ1 ||X̃W + 1_m bᵀ − Ỹ||_F² + λ2 ||W||_F² + λ3 ||XW − S X̃W||_F²    (7)
where λ3 is a tradeoff hyperparameter that controls the importance of the regularization term.
3.3 Optimization
For optimization, it is not hard to compute the derivatives of problem (7) with respect to W and b:
∂L/∂W = 2Xᵀ(XW + 1_n bᵀ − Y) + 2λ1 X̃ᵀ(X̃W + 1_m bᵀ − Ỹ) + 2λ2 W + 2λ3 (X − SX̃)ᵀ(X − SX̃)W    (8)
∂L/∂b = 2(XW + 1_n bᵀ − Y)ᵀ1_n + 2λ1 (X̃W + 1_m bᵀ − Ỹ)ᵀ1_m    (9)
By setting ∂L/∂W and ∂L/∂b to 0, we can obtain the optimal values of W and b, shown as follows:
W = (XᵀX + λ1 X̃ᵀX̃ + λ2 I_d + λ3 (X − SX̃)ᵀ(X − SX̃))⁻¹ (Xᵀ(Y − 1_n bᵀ) + λ1 X̃ᵀ(Ỹ − 1_m bᵀ))    (10)
b = (1/(n + λ1 m)) ((Y − XW)ᵀ1_n + λ1 (Ỹ − X̃W)ᵀ1_m)    (11)
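To illustrate how such closed-form solutions arise, the following sketch solves a simplified variant of problem (3) in which the bias term and the cluster regularizer are dropped. This is a hypothetical reduction for illustration, not the paper's full solution:

```python
import numpy as np

def solve_w(X, Y, Xc, Yc, lam1=1.0, lam2=0.1):
    """Closed-form W for the simplified objective
        min_W ||XW - Y||_F^2 + lam1 ||Xc W - Yc||_F^2 + lam2 ||W||_F^2 ;
    setting the gradient to zero yields a single linear system."""
    d = X.shape[1]
    A = X.T @ X + lam1 * (Xc.T @ Xc) + lam2 * np.eye(d)
    B = X.T @ Y + lam1 * (Xc.T @ Yc)
    return np.linalg.solve(A, B)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 5)), rng.normal(size=(20, 3))
Xc, Yc = rng.normal(size=(4, 5)), rng.normal(size=(4, 3))
W = solve_w(X, Y, Xc, Yc)
# stationarity check: the gradient of the simplified objective vanishes at W
g = 2 * X.T @ (X @ W - Y) + 2 * 1.0 * Xc.T @ (Xc @ W - Yc) + 2 * 0.1 * W
print(np.abs(g).max() < 1e-8)   # -> True
```

The full problem (7) adds the bias and cluster-regularizer terms, which enlarge the linear system but leave the derivation pattern unchanged.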
3.4 Kernel Extension
In the previous section, we presented the closed-form solutions of the linear model. However, such a simple linear model cannot handle the nonlinear case, which may deteriorate the learning performance when the data are not linearly separable. To address this problem, in this section we show that our approach can be easily extended to a kernel-based nonlinear model.
Specifically, we use a nonlinear feature mapping φ(·), which maps the original feature space to some higher (maybe infinite) dimensional Hilbert space, i.e., φ: R^d → H. By the representer theorem [27], the optimal value of W can be represented by a linear combination of the input features Φ = [φ(x_1), …, φ(x_n)]ᵀ, which means W = ΦᵀA, where A ∈ R^{n×q} is a coefficient matrix. In other words, A is a new variable that can be used to replace W. Note that the kernel matrix is normally given as K = ΦΦᵀ, hence ΦW = ΦΦᵀA = KA, where the (i, j)-th element of K is defined as K_ij = φ(x_i)ᵀφ(x_j) = κ(x_i, x_j), and κ(·, ·) denotes the kernel function. Similarly, Φ̃W = K̃A, where Φ̃ = [φ(c_1), …, φ(c_m)]ᵀ and K̃ = Φ̃Φᵀ ∈ R^{m×n} with its element K̃_ki = κ(c_k, x_i). In addition, (Φ − SΦ̃)W = (K − SK̃)A. With these notations in mind, we can obtain the following objective function:
min_{A, b} ||KA + 1_n bᵀ − Y||_F² + λ1 ||K̃A + 1_m bᵀ − Ỹ||_F² + λ2 tr(AᵀKA) + λ3 ||(K − SK̃)A||_F²    (12)
where tr(·) denotes the trace operator, and we use its important property, i.e., tr(AB) = tr(BA). Since W = ΦᵀA, we have ||W||_F² = tr(WᵀW) = tr(AᵀΦΦᵀA) = tr(AᵀKA). Similarly, the fourth term of problem (12) can also be derived in the same manner. To solve problem (12), it is not hard to obtain the derivatives with respect to A and b:
∂L/∂A = 2K(KA + 1_n bᵀ − Y) + 2λ1 K̃ᵀ(K̃A + 1_m bᵀ − Ỹ) + 2λ2 KA + 2λ3 (K − SK̃)ᵀ(K − SK̃)A    (13)
∂L/∂b = 2(KA + 1_n bᵀ − Y)ᵀ1_n + 2λ1 (K̃A + 1_m bᵀ − Ỹ)ᵀ1_m    (14)
Setting ∂L/∂A and ∂L/∂b to 0, we can also obtain the closed-form solutions:
A = (KK + λ1 K̃ᵀK̃ + λ2 K + λ3 (K − SK̃)ᵀ(K − SK̃))⁻¹ (K(Y − 1_n bᵀ) + λ1 K̃ᵀ(Ỹ − 1_m bᵀ))    (15)
b = (1/(n + λ1 m)) ((Y − KA)ᵀ1_n + λ1 (Ỹ − K̃A)ᵀ1_m)    (16)
In this paper, the Gaussian kernel function is adopted, i.e., κ(x_i, x_j) = exp(−||x_i − x_j||²/(2σ²)), where the kernel parameter σ is empirically set to the averaged pairwise Euclidean distance between instances.
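A minimal sketch of this kernel setup follows. The helper names are ours, and we assume the common parameterization exp(−||x − x′||²/(2σ²)); the paper's exact scaling of σ may differ:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def default_sigma(X):
    # kernel width: averaged pairwise Euclidean distance, as stated above
    return pdist(X).mean()

def gaussian_kernel(A, B, sigma):
    # K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma^2))
    D = cdist(A, B)
    return np.exp(-(D ** 2) / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
sigma = default_sigma(X)        # mean of {5, 10, 5} = 20/3
K = gaussian_kernel(X, X, sigma)
print(K.shape, K[0, 0])         # -> (3, 3) 1.0
```

The same routine yields the cross-kernel between cluster centers and training instances by passing different row sets for A and B.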
Table 1: Characteristics of the multilabel data sets.
Data set  |S|  dim(S)  L(S)  F(S)  Card(S)  Den(S)  DL(S)  PDL(S)
cal500  502  68  174  numeric  26.044  0.150  502  1.000 
image  2000  294  5  numeric  1.236  0.247  20  0.010 
scene  2407  294  5  numeric  1.074  0.179  15  0.006 
yeast  2417  103  14  numeric  4.237  0.300  198  0.082 
enron  1702  1001  53  nominal  3.378  0.064  753  0.442 
genbase  662  1185  27  nominal  1.252  0.046  32  0.048 
medical  978  1449  45  nominal  1.245  0.028  94  0.096 
arts  5000  462  26  numeric  1.636  0.063  462  0.924 
bibtex  7395  1836  159  nominal  2.402  0.015  2856  0.386 
computer  5000  681  33  nominal  1.508  0.046  253  0.051 
corel5k  5000  499  374  nominal  3.522  0.009  3175  0.635 
education  5000  550  33  nominal  1.461  0.044  308  0.062 
health  5000  612  32  nominal  1.662  0.052  257  0.051 
social  5000  1047  39  nominal  1.283  0.033  226  0.045 
society  5000  636  27  nominal  1.692  0.063  582  0.116 
3.5 Test Phase
Once the model parameters A and b are learned, we denote the optimal solutions as A* and b*. Then, the predicted label vector ŷ of a test instance x* is given as:
ŷ = sign(A*ᵀk* + b*),  where k* = [κ(x*, x_1), …, κ(x*, x_n)]ᵀ    (17)
where sign(z) is applied element-wise and returns +1 if z ≥ 0, otherwise −1. The pseudo code of IMCC is presented in Algorithm 1.
4 Experiments
In this section, we evaluate the performance of our proposed IMCC approach by comparing it with multiple state-of-the-art approaches on a number of real-world multilabel data sets, in terms of several widely used evaluation metrics.
4.1 Experimental Setup
4.1.1 Data Sets
In order to obtain a comprehensive and persuasive performance evaluation, we collect 15 real-world multilabel data sets for experimental analysis. For each data set S, we denote by |S|, dim(S), L(S), and F(S) the number of examples, the number of dimensions (features), the number of class labels, and the feature type, respectively. In addition, following [12, 39], the properties of each data set are further characterized by several statistics, including the label cardinality Card(S), the label density Den(S), the number of distinct label sets DL(S), and the proportion of distinct label sets PDL(S). The detailed definitions of these multilabel statistics can be found in [26]. Table 1 reports the detailed information of all the data sets. According to |S|, we divide the data sets into two parts: the regular-scale data sets with |S| < 5000 and the large-scale data sets with |S| ≥ 5000. For each data set, we randomly sample 80% of the examples to form the training set, and the remaining 20% form the test set. We repeat this sampling process 10 times, and record the mean prediction value with the standard deviation.
4.1.2 Comparing Algorithms
We compare our proposed approach IMCC with six state-of-the-art multilabel learning approaches (including BR with SVM as the base model, denoted by BRsvm in the result tables). Each algorithm is configured with the suggested parameters according to the respective literature.

ECC [26]
: It is an ensemble of classifier chains, where the order of classifier chains is randomly generated. The employed base model is SVM, and the ensemble size is set to 10.

MAHR [15]: It uses a boosting approach and exploits label correlations via a hypothesis reuse mechanism. The boosting round is set to .

LIFT [35]: It constructs different features for different labels, and trains a binary SVM model for each label based on the label-specific features.

LLSF [13]: It learns labelspecific features for multilabel learning. Parameters and are searched in , and is searched in .

JFSC [14]: It performs joint feature selection and classification for multilabel learning. Parameters , , and are searched in , and is searched in .

IMCC: This is our proposed approach, which incorporates multiple cluster centers for multilabel learning. The regularization hyperparameters λ1, λ2, and λ3 are searched in , and the number of clusters m is searched in .
For all the above approaches, the searched parameters are chosen by five-fold cross-validation on the training set.
4.1.3 Evaluation Metrics
To comprehensively measure the performance of each multilabel learning approach, we adopt five widely used evaluation metrics, including one-error, hamming loss, ranking loss, coverage, and average precision. Note that for all the adopted multilabel evaluation metrics, their values lie in the interval [0, 1]. Denote the test set by {(x_i, Y_i⁺)}_{i=1}^{n_t}, where each x_i is a feature vector with d dimensions (features) and Y_i⁺ is its ground-truth relevant label set, with the size of the label space being q. With the learned optimal model parameters, we can obtain the predicted scores f(x_i, ·) and the predicted label vector of each test instance x_i.

One-error: It evaluates the fraction of instances whose top-ranked predicted label does not belong to the ground-truth relevant label set. The smaller the value of one-error, the better the performance of the classifier.

One-error = (1/n_t) Σ_{i=1}^{n_t} I(arg max_y f(x_i, y) ∉ Y_i⁺)    (18)

where f(x_i, y) denotes the predicted score of label y for x_i, and I(p) returns 1 if p holds and 0 otherwise.

Hamming loss: It evaluates the fraction of instance-label pairs which have been misclassified. The smaller the value of hamming loss, the better the performance of the classifier.

Hamming loss = (1/n_t) Σ_{i=1}^{n_t} (1/q) |h(x_i) Δ Y_i⁺|    (19)

where h(x_i) denotes the set of labels predicted as relevant for x_i, and Δ denotes the symmetric difference between two sets.
Ranking loss: It evaluates the average fraction of misordered label pairs. The smaller the value of ranking loss, the better the performance of the classifier.

Ranking loss = (1/n_t) Σ_{i=1}^{n_t} |{(y, y′) ∈ Y_i⁺ × Ȳ_i⁺ : f(x_i, y) ≤ f(x_i, y′)}| / (|Y_i⁺| |Ȳ_i⁺|)    (20)

where Ȳ_i⁺ denotes the irrelevant label set of x_i, |Y_i⁺| denotes the number of relevant labels of x_i, and |Ȳ_i⁺| denotes the number of irrelevant labels.

Coverage: It evaluates how many steps are needed, on average, to move down the ranked label list of an instance so as to cover all its relevant labels. The smaller the value of coverage, the better the performance of the classifier.

Coverage = (1/n_t) Σ_{i=1}^{n_t} (1/q) (max_{y ∈ Y_i⁺} rank(x_i, y) − 1)    (21)

where rank(x_i, y) indicates the rank of label y in the descending ordering of the predicted scores f(x_i, ·), and the normalization by q keeps the value in [0, 1].

Average precision: It evaluates the average fraction of relevant labels ranked higher than a particular relevant label. The larger the value of average precision, the better the performance of the classifier.

Average precision = (1/n_t) Σ_{i=1}^{n_t} (1/|Y_i⁺|) Σ_{y ∈ Y_i⁺} |{y′ ∈ Y_i⁺ : rank(x_i, y′) ≤ rank(x_i, y)}| / rank(x_i, y)    (22)
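As a sanity check, the metrics above can be sketched in a few lines of NumPy. The function names are ours; scores are real-valued predictions and labels are in {−1, +1}:

```python
import numpy as np

def one_error(scores, Y):
    """Fraction of instances whose top-scored label is irrelevant.
    scores: (n, q) real-valued predictions; Y: (n, q) labels in {-1, +1}."""
    top = scores.argmax(axis=1)
    return np.mean(Y[np.arange(len(Y)), top] != 1)

def hamming_loss(Y_pred, Y):
    # fraction of misclassified instance-label pairs
    return np.mean(Y_pred != Y)

def ranking_loss(scores, Y):
    # average fraction of (relevant, irrelevant) pairs that are misordered
    # (ties are counted as misordered here)
    n, q = Y.shape
    total = 0.0
    for i in range(n):
        pos = np.flatnonzero(Y[i] == 1)
        neg = np.flatnonzero(Y[i] != 1)
        if len(pos) == 0 or len(neg) == 0:
            continue
        mis = sum(scores[i, p] <= scores[i, j] for p in pos for j in neg)
        total += mis / (len(pos) * len(neg))
    return total / n

scores = np.array([[0.9, 0.1, 0.5]])
Y_true = np.array([[1, -1, 1]])
print(one_error(scores, Y_true), ranking_loss(scores, Y_true))  # -> 0.0 0.0
```

Here both relevant labels outscore the single irrelevant one, so both ranking-based losses vanish.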
4.2 Experimental Results
Table 2: Experimental results (mean±std, with the rank shown in parentheses) of each comparing algorithm on the regular-scale data sets.
Comparing  One-error  

algorithms  cal500  image  scene  yeast  enron  genbase  medical 
IMCC  0.1160.024(1)  0.2530.021(1)  0.1790.017(1)  0.2100.015(1)  0.2300.014(2)  0.0020.005(1)  0.1170.018(1) 
BRsvm  0.1190.025(3)  0.3120.018(3)  0.2600.022(6)  0.2250.016(4)  0.2850.023(6)  0.1010.313(5)  0.2350.044(7) 
ECC  0.1180.023(2)  0.3210.020(4)  0.2410.016(3)  0.2360.020(5)  0.2980.019(7)  0.1010.314(5)  0.2230.067(5) 
MAHR  0.1860.092(7)  0.3060.016(2)  0.2310.010(2)  0.2380.017(6)  0.2650.016(5)  0.0020.003(2)  0.1460.027(4) 
LLSF  0.1220.023(5)  0.3310.021(7)  0.2540.015(5)  0.3580.023(7)  0.2260.017(1)  0.0020.003(3)  0.1260.016(2) 
JFSC  0.1190.023(4)  0.3290.026(6)  0.2700.011(7)  0.2170.011(2)  0.2390.014(3)  0.0040.005(4)  0.1430.022(3) 
LIFT  0.1220.024(5)  0.3260.024(5)  0.2410.019(3)  0.2210.013(3)  0.2510.022(4)  0.1010.314(5)  0.2300.051(6) 
Comparing  Hamming loss  
algorithms  cal500  image  scene  yeast  enron  genbase  medical 
IMCC  0.1370.003(1)  0.1480.009(1)  0.0770.004(1)  0.1910.005(1)  0.0460.002(1)  0.0020.001(4)  0.0100.001(1) 
BRsvm  0.1370.003(1)  0.1810.011(3)  0.1050.004(5)  0.1990.005(2)  0.0510.002(4)  0.0050.012(5)  0.0130.007(5) 
ECC  0.1540.004(7)  0.2560.011(7)  0.1550.009(7)  0.2490.005(6)  0.0610.002(7)  0.0050.012(5)  0.0150.031(7) 
MAHR  0.1410.003(6)  0.1710.007(2)  0.0910.003(2)  0.2070.005(5)  0.0510.001(4)  0.0010.001(1)  0.0100.001(1) 
LLSF  0.1380.003(3)  0.1810.009(3)  0.1030.003(4)  0.3010.004(7)  0.0460.002(1)  0.0010.001(1)  0.0100.001(1) 
JFSC  0.1380.003(4)  0.1860.008(6)  0.1180.004(6)  0.1990.005(2)  0.0520.002(6)  0.0010.001(1)  0.0100.001(1) 
LIFT  0.1390.003(5)  0.1810.010(1)  0.0980.004(3)  0.1990.005(2)  0.0470.001(3)  0.0050.012(5)  0.0130.007(5) 
Comparing  Ranking loss  
algorithms  cal500  image  scene  yeast  enron  genbase  medical 
IMCC  0.1810.005(1)  0.1370.010(1)  0.0610.007(1)  0.1570.005(1)  0.0740.006(1)  0.0010.003(1)  0.0180.005(2) 
BRsvm  0.1830.004(2)  0.1690.011(4)  0.0890.007(5)  0.1690.003(3)  0.0840.008(4)  0.0090.013(6)  0.0260.010(6) 
ECC  0.1890.004(6)  0.1650.009(2)  0.0810.005(3)  0.1710.006(4)  0.0840.007(4)  0.0090.013(6)  0.0250.010(5) 
MAHR  0.2750.010(7)  0.1650.008(2)  0.0830.005(4)  0.1810.005(6)  0.1290.006(7)  0.0050.003(4)  0.0270.008(7) 
LLSF  0.1870.007(5)  0.1780.014(7)  0.0910.005(6)  0.3410.007(7)  0.0810.008(2)  0.0020.002(2)  0.0170.005(1) 
JFSC  0.1840.006(4)  0.1750.015(6)  0.0960.005(7)  0.1710.005(4)  0.0980.007(6)  0.0010.001(1)  0.0190.006(3) 
LIFT  0.1830.004(2)  0.1710.013(5)  0.0780.004(2)  0.1680.005(2)  0.0810.007(2)  0.0080.014(5)  0.0240.010(4) 
Comparing  Coverage  
algorithms  cal500  image  scene  yeast  enron  genbase  medical 
IMCC  0.7470.014(2)  0.1670.013(1)  0.0660.007(1)  0.4410.006(1)  0.2210.017(1)  0.0110.006(1)  0.0280.008(1) 
BRsvm  0.7510.014(4)  0.1910.012(4)  0.0890.006(5)  0.4580.006(4)  0.2350.021(5)  0.0220.014(5)  0.0410.013(6) 
ECC  0.7650.013(6)  0.1870.010(2)  0.0810.004(3)  0.4550.008(2)  0.2280.018(3)  0.0220.014(5)  0.0390.012(5) 
MAHR  0.8940.012(7)  0.1890.008(3)  0.0840.004(4)  0.4770.007(6)  0.3390.020(7)  0.0130.002(3)  0.0410.010(6) 
LLSF  0.7470.015(2)  0.1940.015(5)  0.0920.004(6)  0.6270.009(7)  0.2220.019(2)  0.0130.003(3)  0.0280.008(1) 
JFSC  0.7420.014(1)  0.1940.015(5)  0.0920.005(6)  0.4550.007(2)  0.2650.017(6)  0.0110.002(1)  0.0290.009(3) 
LIFT  0.7510.017(4)  0.1940.015(5)  0.0790.003(2)  0.4610.007(5)  0.2280.018(3)  0.0220.014(5)  0.0380.011(4) 
Comparing  Average precision  
algorithms  cal500  image  scene  yeast  enron  genbase  medical 
IMCC  0.5050.005(1)  0.8340.012(1)  0.8930.010(1)  0.7770.008(1)  0.7040.013(1)  0.9970.004(1)  0.9120.012(1) 
BRsvm  0.5010.006(2)  0.7970.011(3)  0.8470.012(5)  0.7620.008(3)  0.6570.016(4)  0.9440.152(6)  0.8410.132(7) 
ECC  0.4910.003(6)  0.7970.011(3)  0.8570.008(4)  0.7560.011(5)  0.6570.013(4)  0.9440.152(6)  0.8520.134(5) 
MAHR  0.4410.010(7)  0.8010.008(2)  0.8610.006(2)  0.7450.009(6)  0.6410.013(7)  0.9940.003(4)  0.8920.018(4) 
LLSF  0.5010.010(2)  0.7890.014(5)  0.8470.007(5)  0.6170.007(7)  0.7030.015(2)  0.9960.003(2)  0.9080.009(2) 
JFSC  0.5010.007(2)  0.7890.016(5)  0.8360.007(7)  0.7620.008(3)  0.6430.013(6)  0.9960.003(2)  0.8990.013(3) 
LIFT  0.4960.006(5)  0.7890.015(5)  0.8590.010(3)  0.7660.007(2)  0.6840.013(3)  0.9470.153(5)  0.8480.023(6) 
Table 3: Experimental results (mean±std, with the rank shown in parentheses) of each comparing algorithm on the large-scale data sets.
Comparing  One-error  
algorithms  arts  bibtex  computer  corel5k  education  health  social  society 
IMCC  0.4560.013(1)  0.3610.008(4)  0.3330.014(1)  0.6610.009(2)  0.4620.016(2)  0.2540.011(2)  0.2720.004(1)  0.3860.018(1) 
BRsvm  0.4560.014(1)  0.4030.015(7)  0.4070.209(4)  0.7020.105(5)  0.2710.031(1)  0.4680.367(5)  0.4090.311(5)  0.4460.195(4) 
ECC  0.4820.010(5)  0.3940.012(6)  0.4130.206(6)  0.7180.099(6)  0.5710.226(5)  0.4730.364(6)  0.4140.309(6)  0.4520.193(6) 
MAHR  0.5480.011(7)  0.3710.005(5)  0.4090.014(5)  0.9070.008(7)  0.6030.021(7)  0.3210.015(4)  0.3280.007(4)  0.4460.015(4) 
LLSF  0.4610.011(4)  0.3490.004(1)  0.3370.017(2)  0.6240.011(1)  0.4660.013(3)  0.2460.015(1)  0.2730.008(2)  0.3940.017(2) 
JFSC  0.5120.012(6)  0.3580.007(3)  0.3810.014(3)  0.6750.008(3)  0.5150.022(4)  0.2960.009(3)  0.3230.008(3)  0.4230.018(3) 
LIFT  0.4560.011(1)  0.3550.011(2)  0.4130.206(6)  0.6830.112(4)  0.5810.221(6)  0.4780.361(7)  0.4270.302(7)  0.4690.187(7) 
Comparing  Hamming loss  
algorithms  arts  bibtex  computer  corel5k  education  health  social  society 
IMCC  0.0570.001(3)  0.0130.0(3)  0.0330.002(1)  0.0090.001(1)  0.0380.001(1)  0.0330.001(1)  0.0210.001(1)  0.0510.001(1) 
BRsvm  0.0540.001(1)  0.0130.0(3)  0.0360.009(3)  0.0110.001(5)  0.1990.009(7)  0.0410.015(5)  0.0240.011(5)  0.0550.012(4) 
ECC  0.0770.004(7)  0.0140.0(6)  0.0460.009(7)  0.0110.001(5)  0.0590.013(6)  0.0480.015(6)  0.0310.011(7)  0.0610.012(6) 
MAHR  0.0570.001(3)  0.0130.0(3)  0.0370.002(5)  0.0090.001(1)  0.0410.001(4)  0.0380.002(4)  0.0220.001(3)  0.0560.001(5) 
LLSF  0.0570.001(3)  0.0120.0(1)  0.0340.001(2)  0.0090.001(1)  0.0380.001(1)  0.0330.001(1)  0.0210.001(1)  0.0520.001(2) 
JFSC  0.0570.001(3)  0.0170.0(7)  0.0360.002(3)  0.0090.001(1)  0.0390.001(3)  0.0360.001(3)  0.0220.001(3)  0.0530.001(3) 
LIFT  0.0540.001(1)  0.0120.0(1)  0.0370.009(5)  0.0110.001(5)  0.0440.013(5)  0.0480.016(6)  0.0240.011(5)  0.0610.012(6) 
Comparing  Ranking loss  
algorithms  arts  bibtex  computer  corel5k  education  health  social  society 
IMCC  0.1110.003(1)  0.0630.002(1)  0.0770.004(4)  0.1110.002(1)  0.0720.004(1)  0.0460.003(1)  0.0520.004(2)  0.1260.005(3) 
BRsvm  0.1140.004(2)  0.0850.001(6)  0.0710.010(2)  0.1230.003(3)  0.1560.012(6)  0.0490.016(3)  0.0520.012(2)  0.1230.012(2) 
ECC  0.1150.004(4)  0.0830.002(4)  0.0680.010(1)  0.1220.003(2)  0.0760.015(2)  0.0480.016(2)  0.0490.011(1)  0.1210.012(1) 
MAHR  0.2010.010(7)  0.0940.004(7)  0.1250.006(7)  0.2660.018(7)  0.2090.012(7)  0.0770.006(7)  0.0950.006(7)  0.2110.008(7) 
LLSF  0.1210.004(5)  0.0690.002(2)  0.0890.005(5)  0.1260.004(5)  0.0810.004(4)  0.0620.003(5)  0.0610.005(5)  0.1370.005(5) 
JFSC  0.1220.004(6)  0.0830.003(4)  0.0950.004(6)  0.1380.002(6)  0.0810.005(4)  0.0690.005(6)  0.0780.006(6)  0.1460.006(6) 
LIFT  0.1140.004(3)  0.0740.002(3)  0.0740.011(3)  0.1230.003(3)  0.0780.015(3)  0.0510.016(4)  0.0520.011(2)  0.1260.013(3) 
Comparing  Coverage  
algorithms  arts  bibtex  computer  corel5k  education  health  social  society 
IMCC  0.1730.004(1)  0.1240.003(1)  0.1180.006(4)  0.2690.006(1)  0.1050.005(2)  0.0960.006(4)  0.0810.006(4)  0.2070.007(4) 
BRsvm  0.1740.006(3)  0.1580.003(6)  0.1070.010(2)  0.2890.006(4)  0.2910.015(7)  0.0890.015(2)  0.0710.011(2)  0.1890.014(2) 
ECC  0.1730.007(1)  0.1560.003(5)  0.1050.009(1)  0.2870.006(3)  0.1030.015(1)  0.0880.014(1)  0.0680.011(1)  0.1880.014(1) 
MAHR  0.2790.012(7)  0.1710.004(7)  0.1740.007(7)  0.5150.027(7)  0.2640.014(6)  0.1360.009(7)  0.1280.007(7)  0.3070.009(7) 
LLSF  0.1890.006(6)  0.1320.004(2)  0.1310.006(5)  0.2810.006(2)  0.1190.005(5)  0.1210.005(5)  0.0910.006(5)  0.2160.008(5) 
JFSC  0.1840.006(5)  0.1510.004(4)  0.1420.005(6)  0.3190.004(6)  0.1140.007(4)  0.1330.008(6)  0.1090.007(6)  0.2310.011(6) 
LIFT  0.1740.006(3)  0.1410.003(3)  0.1110.010(3)  0.2890.006(4)  0.1060.015(3)  0.0910.015(3)  0.0710.011(2)  0.1910.014(3) 
Comparing  Average precision  
algorithms  arts  bibtex  computer  corel5k  education  health  social  society 
IMCC  0.6340.008(1)  0.6080.006(2)  0.7230.010(1)  0.2960.002(2)  0.6480.013(2)  0.7950.008(1)  0.7860.007(1)  0.6480.010(1) 
BRsvm  0.6270.009(2)  0.5380.010(7)  0.6850.099(3)  0.2710.027(4)  0.8070.014(1)  0.6950.167(6)  0.7190.155(4)  0.6220.086(3) 
ECC  0.6170.007(5)  0.5480.008(6)  0.6850.099(3)  0.2650.027(5)  0.5910.115(5)  0.6980.168(5)  0.7190.153(4)  0.6190.087(5) 
MAHR  0.5240.008(7)  0.5740.005(5)  0.6350.010(7)  0.0990.005(7)  0.4810.016(7)  0.7250.009(4)  0.7150.007(6)  0.5610.010(7) 
LLSF  0.6270.007(2)  0.6130.005(1)  0.7140.011(2)  0.3050.008(1)  0.6420.010(3)  0.7860.008(2)  0.7800.008(2)  0.6390.010(2) 
JFSC  0.5970.007(6)  0.5930.006(3)  0.6850.009(3)  0.2610.003(6)  0.6150.014(4)  0.7610.006(3)  0.7510.007(3)  0.6220.010(3) 
LIFT  0.6270.007(2)  0.5850.007(4)  0.6780.098(6)  0.2810.028(3)  0.5820.113(6)  0.6880.164(7)  0.7080.152(7)  0.6090.085(6) 
Tables 2 and 3 report the detailed experimental results of each algorithm on the regular-scale and large-scale data sets, respectively. In the two tables, the best results are highlighted in boldface, and the number in each bracket indicates the rank of the algorithm.
Table 4: Friedman statistics F_F on each evaluation metric, and the critical value at the 0.05 significance level (k = 7 comparing algorithms, N = 15 data sets).
Evaluation metric  F_F  critical value (α = 0.05)

One-error  4.57  2.209
Hamming loss  6.06  
Ranking loss  13.74  
Coverage  6.76  
Average precision  11.45
In order to further systematically analyze the relative performance of the comparing algorithms, we use the popular Friedman test [5] for the comparison of multiple algorithms over a number of data sets, with respect to each evaluation metric. Specifically, suppose k algorithms are compared on N data sets, and let R_j denote the j-th algorithm's average rank over all the data sets (mean ranks are shared in case of ties). Under the null hypothesis that all algorithms perform equally, the Friedman statistic F_F is calculated by:

F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F),  with  χ²_F = (12N / (k(k + 1))) (Σ_{j=1}^{k} R_j² − k(k + 1)²/4)    (23)

where F_F is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
In this paper, the number of comparing algorithms is k = 7 and the number of data sets is N = 15. Table 4 summarizes the Friedman statistics F_F for each evaluation metric, together with the critical value at the 0.05 significance level. As shown in Table 4, the null hypothesis of equal performance is clearly rejected at significance level α = 0.05 for every metric. Consequently, the post-hoc test [5] is used for further analysis. It makes sense to employ the Nemenyi test [5] to indicate whether our proposed IMCC approach achieves superior performance to the comparing algorithms, by treating IMCC as the control algorithm. The significant differences between IMCC and the other algorithms can be determined by comparing their average ranks with the Critical Difference (CD) [5], where CD = q_α √(k(k + 1)/(6N)).
Given k = 7 and N = 15, with q_0.05 = 2.949 for the Nemenyi test, we can obtain CD ≈ 2.326. The performance of an algorithm is considered significantly different from that of IMCC if their average ranks over all data sets differ by at least one CD. Figure 1 shows the CD diagrams for each evaluation metric. In Figure 1, a comparing algorithm is connected to IMCC if its average rank is within one CD of that of IMCC; otherwise, the performance difference between IMCC and that algorithm is significant.
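The test statistics above can be reproduced with a short script. The helper names are ours, and the Nemenyi constant q_0.05 = 2.949 for seven algorithms is taken from standard critical-value tables:

```python
import numpy as np
from scipy import stats

def friedman_F(ranks):
    """Friedman statistic F_F from an (N data sets) x (k algorithms)
    matrix of ranks, plus the F critical value at alpha = 0.05."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                    # average rank per algorithm
    chi2 = 12 * N / (k * (k + 1)) * (R @ R - k * (k + 1) ** 2 / 4)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    crit = stats.f.ppf(0.95, k - 1, (k - 1) * (N - 1))
    return F, crit

def nemenyi_cd(k, N, q_alpha=2.949):          # q_0.05 = 2.949 for k = 7
    return q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# with k = 7 algorithms and N = 15 data sets, as in the experiments:
print(round(nemenyi_cd(7, 15), 3))            # -> 2.326
print(round(stats.f.ppf(0.95, 6, 84), 3))     # F critical value at alpha = 0.05
F_small, _ = friedman_F(np.array([[1., 2.], [1., 2.], [2., 1.]]))
print(round(F_small, 2))                      # -> 0.25
```

An algorithm whose average rank differs from IMCC's by more than the printed CD is judged significantly different.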
Based on the above experimental results, the following observations can be made:

As shown in Tables 2 and 3, IMCC ranks first on all evaluation metrics on the image, scene, and yeast data sets, and on most metrics on arts. The former three are regular-scale data sets with a limited number of examples, on which IMCC benefits most from data augmentation.

From Tables 2 and 3, we can observe that, across all evaluation metrics and all fifteen data sets, IMCC ranks first in 72.00% of the cases and ranks in the top three in 89.33% of the cases. It is also worth noting that IMCC ranks first in 88.57% of the cases on the regular-scale data sets (Table 2), while it ranks first in 55.50% of the cases on the large-scale data sets (Table 3). These results indicate that IMCC is superior to the other comparing algorithms in most cases, and that IMCC tends to work better on regular-scale data sets. This observation accords with the widely accepted intuition that data augmentation is normally more helpful on regular-scale data sets than on large-scale ones: since large-scale data sets already provide relatively adequate training examples, data augmentation may be less useful in this case. Despite this, IMCC still achieves competitive performance against the other state-of-the-art approaches on the large-scale data sets.