Incorporating Multiple Cluster Centers for Multi-Label Learning

04/17/2020 · by Senlin Shu, et al.

Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learning. In this paper, (to the best of our knowledge) we provide the first attempt to leverage the data augmentation technique to improve the performance of multi-label learning. Specifically, we first propose a novel data augmentation approach that performs clustering on the real examples and treats the cluster centers as virtual examples, and these virtual examples naturally embody the local label correlations and label importances. Then, motivated by the cluster assumption that examples in the same cluster should have the same label, we propose a novel regularization term to bridge the gap between the real examples and virtual examples, which can promote the local smoothness of the learning function. Extensive experimental results on a number of real-world multi-label data sets clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.


1 Introduction

Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Due to its ability to cope with real-world objects with multiple semantic meanings, multi-label learning has been successfully applied in various application domains [38], such as tag recommendation [20, 28], bioinformatics [4, 6, 36], information retrieval [10, 41], rule mining [30, 33], web mining [21, 29], and so on. Formally speaking, suppose the given multi-label data set is denoted by $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is a feature vector with $d$ dimensions (features) and $\mathbf{y}_i \in \{-1,+1\}^{q}$ is the corresponding label vector with the size of the label space being $q$. Here, $y_{ij} = +1$ indicates that the $i$-th instance has the $j$-th label (or equivalently, the $j$-th label is a relevant label of $\mathbf{x}_i$); otherwise the $j$-th label is an irrelevant label of $\mathbf{x}_i$. Let $\mathcal{X} = \mathbb{R}^{d}$ be the $d$-dimensional feature space and $\mathcal{Y} = \{-1,+1\}^{q}$ be the $q$-dimensional label space; multi-label learning aims to induce a mapping function $f: \mathcal{X} \rightarrow \mathcal{Y}$, which is able to correctly predict the label vector of unseen instances.

To solve the multi-label learning problem, the most straightforward solution is Binary Relevance (BR) [1, 31], which decomposes the original learning problem into a set of independent binary classification problems. However, this solution generally achieves mediocre performance, as label correlations are regrettably ignored. To alleviate this problem, a large number of multi-label learning approaches take label correlations into account, explicitly or implicitly, to improve the learning performance. Examples include chains of binary classifiers [25, 26], ensembles of multi-class classifiers [32], label-specific features [35, 13], and feature selection [14].

Although a considerable number of methods have been proposed to enhance the performance of multi-label learning, it still remains unknown whether data augmentation is helpful to multi-label learning. Data augmentation [17, 34, 24] is a widely used technique in many machine learning tasks: it applies small mutations to the original training data and synthetically creates new examples to virtually increase the number of training examples, thereby achieving better generalization performance. In this paper, we provide the first attempt to leverage the data augmentation technique to improve the performance of multi-label learning. We show that data augmentation can not only capture the local label correlations and label importances, but also potentially make the learning function smoother.

Specifically, we propose a novel data augmentation approach, which is motivated by the observation that local data characteristics can be captured by clustering [18, 19]. The cluster center is the average feature vector of all the instances in the cluster, and it can therefore be regarded as a local representative of the cluster. If we consider the cluster center as a new instance, its corresponding label vector (labeling information) is supposed to be the average label vector of all the instances in the cluster. Such a data augmentation approach brings multiple important advantages for multi-label learning. First, the local label correlations (within a cluster) can be captured by the label vector of the cluster center; local label correlations have already been shown to be very helpful for multi-label learning by existing works [16, 42]. Second, the labeling importance degree of each label in the cluster can be reflected by the label vector of the cluster center. Many existing multi-label learning approaches [22, 12, 39, 11] have shown that great performance can be achieved by taking into account the labeling importance degree of each relevant label. Third, each cluster center can also be considered as the label smoothing [23] of all the instances in the cluster. Note that the label vector of each real instance is binary (in $\{-1,+1\}^{q}$), while the label vector of the cluster center is continuous (in $[-1,+1]^{q}$), which potentially makes the learning function smoother. In addition, our proposed augmentation approach can be considered as a generalization of the popular mixup approach [34] to the case of multiple examples.

With the augmented training data at hand, we further propose a novel regularization term. Inspired by the cluster assumption [3, 40] that instances in the same cluster are supposed to have the same label, we present a novel regularization term to bridge the gap between the real examples and the virtual examples. Specifically, the modeling output of each real instance and its corresponding cluster center should be similar. Such regularization term naturally promotes the local smoothness of the learning function. The effectiveness of the proposed approach is clearly demonstrated by extensive experimental results on a number of real-world multi-label data sets.

In summary, our main contributions are three-fold:

  • We propose a novel data augmentation approach that enlarges the multi-label training set by generating multiple cluster centers as compact virtual examples.

  • We propose a novel regularization approach that bridges the gap between the real examples and the virtual examples.

  • Extensive experimental results clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.

2 Related Work

A huge number of approaches have been proposed to deal with the multi-label learning problem. According to the order of label correlations they exploit, most existing approaches can be roughly divided into three categories. Approaches in the first category [1, 31, 37] do not take label correlations into consideration, and normally tackle the multi-label learning problem in a label-by-label manner. Although these approaches are simple and intuitive, they can only achieve passable performance due to the neglect of label correlations. To address this issue, approaches in the second category [9, 13, 14] take into account the pairwise (second-order) correlations between labels. In addition, approaches in the third category [25, 26, 32, 7] consider high-order correlations among multiple labels (e.g., label subsets or all labels). Note that the existing approaches only exploit label correlations from the given training examples, and there still remains the question of whether we can exploit label correlations from virtual examples.

Data augmentation [17, 34, 24] is a widely used technique in many machine learning tasks: it applies small mutations to the original training data and synthetically creates new examples to virtually increase the number of training examples, thereby achieving better generalization. Traditional data augmentation techniques [17] for image classification tasks normally generate new examples from the original training data by flipping, distorting, adding a small amount of noise to, or cropping a patch from an original image. Apart from the traditional techniques, the SamplePairing approach [17] randomly chooses two examples $(\mathbf{x}_i, \mathbf{y}_i)$ and $(\mathbf{x}_j, \mathbf{y}_j)$, and the new example is generated (randomly decided) as either $\big(\tfrac{1}{2}(\mathbf{x}_i + \mathbf{x}_j), \mathbf{y}_i\big)$ or $\big(\tfrac{1}{2}(\mathbf{x}_i + \mathbf{x}_j), \mathbf{y}_j\big)$. On the other hand, given the same two examples, the new example generated by the mixup approach [34] is represented as $\big(\lambda\mathbf{x}_i + (1-\lambda)\mathbf{x}_j,\ \lambda\mathbf{y}_i + (1-\lambda)\mathbf{y}_j\big)$ with a mixing coefficient $\lambda \in [0,1]$, as sketched below. Although satisfactory performance has been achieved by these two approaches, they only focus on generating new examples by manipulating exactly two real examples. How to generate new examples from multiple real examples and how to apply the generated examples to improve the performance of multi-label learning still remain unknown. These questions will be answered in the next section.
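To make the contrast between the two pairing-based schemes concrete, below is a minimal Python/NumPy sketch (not taken from [17] or [34]; the function names and the Beta(0.2, 0.2) choice for mixup's mixing coefficient are our own illustrative assumptions):

    import numpy as np

    def sample_pairing(x_i, y_i, x_j, y_j, rng=None):
        """SamplePairing-style augmentation: average the two feature vectors
        and keep the label of one of the two examples (randomly decided)."""
        rng = rng if rng is not None else np.random.default_rng()
        x_new = (x_i + x_j) / 2.0
        y_new = y_i if rng.random() < 0.5 else y_j
        return x_new, y_new

    def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
        """mixup-style augmentation: the same convex combination is applied
        to both the feature vectors and the label vectors."""
        rng = rng if rng is not None else np.random.default_rng()
        lam = rng.beta(alpha, alpha)
        return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j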

3 The Proposed Approach

In this section, we present our approach IMCC (Incorporating Multiple Cluster Centers). Following the notations used in the Introduction, we denote the feature matrix by $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^{\top} \in \mathbb{R}^{n \times d}$ and the label matrix by $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^{\top} \in \{-1,+1\}^{n \times q}$, where $n$ is the number of examples. IMCC works in two elementary steps: virtual example generation and multi-label model training.

3.1 Virtual Examples Generation

In the first step, IMCC aims to generate a number of virtual examples that are useful for the subsequent model training step. In order to generate new examples, we have to gain some insight from the existing examples. To achieve this, clustering techniques are widely used as stand-alone tools for data analysis [35]. In this paper, the popular $k$-means algorithm [19] is adopted due to its simplicity and effectiveness. Suppose the $n$ instances are partitioned into $m$ disjoint clusters $C_1, \ldots, C_m$; if the $i$-th instance is assigned to the $k$-th cluster, then $\mathbf{x}_i \in C_k$. Typically, the cluster center is a representative instance of the cluster. Hence, for each cluster $C_k$, its cluster center $\mathbf{c}_k$ is defined as:

$\mathbf{c}_k = \dfrac{\sum_{i=1}^{n} \mathbb{I}(\mathbf{x}_i \in C_k)\,\mathbf{x}_i}{\sum_{i=1}^{n} \mathbb{I}(\mathbf{x}_i \in C_k)} \qquad (1)$

where $\mathbb{I}(\cdot)$ is the indicator function, i.e., $\mathbb{I}(A)$ equals 1 if $A$ is true and 0 otherwise. From this point of view, $\mathbf{c}_k$ is the local representative of the instances belonging to the $k$-th cluster, hence its semantic meaning can be regarded as the average of the semantic meanings of all the instances in the cluster. In other words, let $\hat{\mathbf{y}}_k$ denote the labeling information of $\mathbf{c}_k$; then $\hat{\mathbf{y}}_k$ should be the average of the label vectors of all the instances in $C_k$:

$\hat{\mathbf{y}}_k = \dfrac{\sum_{i=1}^{n} \mathbb{I}(\mathbf{x}_i \in C_k)\,\mathbf{y}_i}{\sum_{i=1}^{n} \mathbb{I}(\mathbf{x}_i \in C_k)} \qquad (2)$

In this way, we obtain a complementary training set $\hat{\mathcal{D}} = \{(\mathbf{c}_k, \hat{\mathbf{y}}_k)\}_{k=1}^{m}$. Here we give a concrete example to illustrate the advantages of the proposed data augmentation approach. Suppose there is a cluster containing three examples $(\mathbf{x}_1, \mathbf{y}_1)$, $(\mathbf{x}_2, \mathbf{y}_2)$, and $(\mathbf{x}_3, \mathbf{y}_3)$ over four labels, where, for instance, $\mathbf{y}_1 = (+1, -1, +1, +1)$, $\mathbf{y}_2 = (+1, -1, -1, +1)$, and $\mathbf{y}_3 = (+1, -1, +1, +1)$. Hence the virtual example is given as $(\mathbf{c}, \hat{\mathbf{y}})$ with $\mathbf{c} = \frac{1}{3}(\mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3)$, where the label vector is $\hat{\mathbf{y}} = (+1, -1, +\frac{1}{3}, +1)$. First, it is clear that our proposed data augmentation approach can be considered as a generalization of the popular mixup approach [34] to the case of multiple examples. Second, the generated label vector contains soft labels, which are able to describe the labeling importance degree of each label [12, 39, 8] in the cluster; as we can see, the first and the fourth labels are the most important. Third, as each soft label vector is generated by aggregating the local labeling information in the cluster, the local label correlations can be captured. Concretely, the first and the fourth labels co-occur in the same cluster, hence they have very strong local correlations. Besides, the second label has a negative value, which suggests that it may possess the opposite semantic meaning against the other labels, since the other labels have positive values. Fourth, the soft label vector of the cluster center can also be considered as the label smoothing [23] of all the instances in the cluster. Note that the label vector of each real instance is binary (in $\{-1,+1\}^{q}$), while the label vector of the cluster center is continuous (in $[-1,+1]^{q}$), which potentially makes the learning function smoother.
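A minimal sketch of this virtual-example generation step, assuming the features are stored in a NumPy array X of shape (n, d) and the labels in Y of shape (n, q) with entries in {-1, +1}; it relies on scikit-learn's KMeans, and the function name generate_virtual_examples is ours rather than the paper's:

    import numpy as np
    from sklearn.cluster import KMeans

    def generate_virtual_examples(X, Y, m, random_state=0):
        """Cluster the real instances into m groups and return:
        X_hat[k]  = mean feature vector of cluster k   (Eq. (1)),
        Y_hat[k]  = mean label vector of cluster k     (Eq. (2)),
        assign[i] = index of the cluster containing instance i."""
        km = KMeans(n_clusters=m, n_init=10, random_state=random_state).fit(X)
        assign = km.labels_
        X_hat = np.vstack([X[assign == k].mean(axis=0) for k in range(m)])
        Y_hat = np.vstack([Y[assign == k].mean(axis=0) for k in range(m)])
        return X_hat, Y_hat, assign

The rows of X_hat coincide (up to numerical precision) with km.cluster_centers_; they are recomputed here only to keep the correspondence with Eq. (1) and Eq. (2) explicit.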

3.2 Multi-Label Model Training

For a compact representation of the complementary training set, the additional feature matrix and the corresponding label matrix are denoted by $\hat{\mathbf{X}} = [\mathbf{c}_1, \ldots, \mathbf{c}_m]^{\top} \in \mathbb{R}^{m \times d}$ and $\hat{\mathbf{Y}} = [\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_m]^{\top} \in [-1,+1]^{m \times q}$, respectively. Note that $\hat{\mathbf{Y}}$ contains soft labels (ranging from -1 to +1), while $\mathbf{Y}$ contains hard labels (either -1 or +1).

With the original data set $\{\mathbf{X}, \mathbf{Y}\}$ and the complementary data set $\{\hat{\mathbf{X}}, \hat{\mathbf{Y}}\}$, the objective function can be designed as follows:

$\min_{\mathbf{W}, \mathbf{b}} \ \sum_{i=1}^{n} \big\|\mathbf{W}^{\top}\mathbf{x}_i + \mathbf{b} - \mathbf{y}_i\big\|_2^2 + \alpha \sum_{k=1}^{m} \big\|\mathbf{W}^{\top}\mathbf{c}_k + \mathbf{b} - \hat{\mathbf{y}}_k\big\|_2^2 + \beta \|\mathbf{W}\|_F^2 \qquad (3)$

where $\mathbf{W} \in \mathbb{R}^{d \times q}$ and $\mathbf{b} \in \mathbb{R}^{q}$ are the model parameters, and the widely used Frobenius norm of $\mathbf{W}$ is employed to reduce the model complexity and avoid overfitting. The trade-off hyperparameters $\alpha$ and $\beta$ control the importance of learning from virtual examples and of the model complexity, respectively. In a compact representation, problem (3) can be equivalently stated as follows:

$\min_{\mathbf{W}, \mathbf{b}} \ \big\|\mathbf{X}\mathbf{W} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y}\big\|_F^2 + \alpha \big\|\hat{\mathbf{X}}\mathbf{W} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}}\big\|_F^2 + \beta \|\mathbf{W}\|_F^2 \qquad (4)$

where $\mathbf{1}_n$ and $\mathbf{1}_m$ denote the all-ones vectors of size $n$ and $m$, respectively. Inspired by the cluster assumption [3, 40] that instances in the same cluster are supposed to have the same label, we propose a novel regularization term requiring that the modeling output of each instance should be similar to that of its corresponding cluster center. Thus the regularization term is stated as:

$\Omega(\mathbf{W}) = \sum_{i=1}^{n} \sum_{k=1}^{m} \mathbb{I}(\mathbf{x}_i \in C_k)\, \big\|(\mathbf{W}^{\top}\mathbf{x}_i + \mathbf{b}) - (\mathbf{W}^{\top}\mathbf{c}_k + \mathbf{b})\big\|_2^2 \qquad (5)$

Note that the clusters are disjoint, hence for each instance $\mathbf{x}_i$ there is exactly one cluster center $\mathbf{c}_k$ such that $\mathbf{x}_i \in C_k$ is true. Here, we specially introduce an indicator matrix $\mathbf{S} \in \{0,1\}^{n \times m}$ with $S_{ik} = \mathbb{I}(\mathbf{x}_i \in C_k)$. In this way, since the bias term cancels, the regularization term (5) is equivalent to:

$\Omega(\mathbf{W}) = \big\|(\mathbf{X} - \mathbf{S}\hat{\mathbf{X}})\mathbf{W}\big\|_F^2 \qquad (6)$

By combining problem (4) and problem (6), the final objective function is given as:

$\min_{\mathbf{W}, \mathbf{b}} \ \big\|\mathbf{X}\mathbf{W} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y}\big\|_F^2 + \alpha \big\|\hat{\mathbf{X}}\mathbf{W} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}}\big\|_F^2 + \beta \|\mathbf{W}\|_F^2 + \gamma \big\|(\mathbf{X} - \mathbf{S}\hat{\mathbf{X}})\mathbf{W}\big\|_F^2 \qquad (7)$

where $\gamma$ is a trade-off hyperparameter that controls the importance of the regularization term.

3.3 Optimization

For optimization, it is not hard to compute the derivatives of the objective of problem (7), denoted by $\mathcal{L}$, with respect to $\mathbf{W}$ and $\mathbf{b}$:

$\dfrac{\partial \mathcal{L}}{\partial \mathbf{W}} = 2\mathbf{X}^{\top}(\mathbf{X}\mathbf{W} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y}) + 2\alpha\hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\mathbf{W} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}}) + 2\beta\mathbf{W} + 2\gamma(\mathbf{X} - \mathbf{S}\hat{\mathbf{X}})^{\top}(\mathbf{X} - \mathbf{S}\hat{\mathbf{X}})\mathbf{W} \qquad (8)$
$\dfrac{\partial \mathcal{L}}{\partial \mathbf{b}} = 2(\mathbf{X}\mathbf{W} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y})^{\top}\mathbf{1}_n + 2\alpha(\hat{\mathbf{X}}\mathbf{W} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}})^{\top}\mathbf{1}_m \qquad (9)$

By setting $\frac{\partial \mathcal{L}}{\partial \mathbf{W}}$ and $\frac{\partial \mathcal{L}}{\partial \mathbf{b}}$ to 0, the optimal $\mathbf{W}$ and $\mathbf{b}$ satisfy the following coupled linear equations, which can be solved in closed form by substituting one into the other:

$\mathbf{W} = \big(\mathbf{X}^{\top}\mathbf{X} + \alpha\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}} + \beta\mathbf{I}_d + \gamma(\mathbf{X} - \mathbf{S}\hat{\mathbf{X}})^{\top}(\mathbf{X} - \mathbf{S}\hat{\mathbf{X}})\big)^{-1}\big(\mathbf{X}^{\top}(\mathbf{Y} - \mathbf{1}_n\mathbf{b}^{\top}) + \alpha\hat{\mathbf{X}}^{\top}(\hat{\mathbf{Y}} - \mathbf{1}_m\mathbf{b}^{\top})\big) \qquad (10)$
$\mathbf{b} = \dfrac{1}{n + \alpha m}\big[(\mathbf{Y} - \mathbf{X}\mathbf{W})^{\top}\mathbf{1}_n + \alpha(\hat{\mathbf{Y}} - \hat{\mathbf{X}}\mathbf{W})^{\top}\mathbf{1}_m\big] \qquad (11)$
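The following sketch solves the linear model under the reconstruction above (problem (7)); instead of iterating between Eq. (10) and Eq. (11), it absorbs the bias by appending a constant feature, excludes that row from the Frobenius-norm penalty, and performs a single linear solve, which yields the same joint minimizer. All names are ours, not the paper's:

    import numpy as np

    def fit_linear_imcc(X, Y, X_hat, Y_hat, assign, alpha, beta, gamma):
        """Minimize ||X W + 1 b^T - Y||^2 + alpha ||X_hat W + 1 b^T - Y_hat||^2
                  + beta ||W||^2 + gamma ||(X - S X_hat) W||^2."""
        n, d = X.shape
        m = X_hat.shape[0]
        S = np.zeros((n, m))
        S[np.arange(n), assign] = 1.0             # cluster-indicator matrix
        Xa = np.hstack([X, np.ones((n, 1))])      # append a constant bias feature
        Xh = np.hstack([X_hat, np.ones((m, 1))])
        Z = Xa - S @ Xh                           # bias column cancels: rows of S sum to 1
        D = np.eye(d + 1)
        D[-1, -1] = 0.0                           # do not penalize the bias row
        lhs = Xa.T @ Xa + alpha * Xh.T @ Xh + beta * D + gamma * Z.T @ Z
        rhs = Xa.T @ Y + alpha * Xh.T @ Y_hat
        W_aug = np.linalg.solve(lhs, rhs)
        return W_aug[:-1], W_aug[-1]              # W (d x q) and b (q,)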

3.4 Kernel Extension

In the previous subsection, we have shown the closed-form solution of the linear model. However, such a simple linear model cannot handle the nonlinear case, which may deteriorate the learning performance when the data are not linearly separable. To address this problem, in this section we show that our approach can be easily extended to a kernel-based nonlinear model.

Specifically, we use a nonlinear feature mapping $\phi(\cdot)$, which maps the original feature space to a higher (possibly infinite) dimensional Hilbert space $\mathcal{H}$, i.e., $\phi: \mathbb{R}^{d} \rightarrow \mathcal{H}$. Let $\Phi = [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)]^{\top}$ and $\hat{\Phi} = [\phi(\mathbf{c}_1), \ldots, \phi(\mathbf{c}_m)]^{\top}$ denote the mapped feature matrices of the real and virtual examples, respectively. By the representer theorem [27], the optimal $\mathbf{W}$ can be represented as a linear combination of the mapped input features, i.e., $\mathbf{W} = \Phi^{\top}\mathbf{A}$, where $\mathbf{A} \in \mathbb{R}^{n \times q}$ is a coefficient matrix. In other words, $\mathbf{A}$ is a new variable that can be used to replace $\mathbf{W}$. Note that the kernel matrix is given as $\mathbf{K} = \Phi\Phi^{\top} \in \mathbb{R}^{n \times n}$, hence $\Phi\mathbf{W} = \mathbf{K}\mathbf{A}$, where the $(i,j)$-th element of $\mathbf{K}$ is defined as $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j) = \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle$ and $\kappa(\cdot,\cdot)$ denotes the kernel function. Similarly, $\hat{\Phi}\mathbf{W} = \hat{\mathbf{K}}\mathbf{A}$, where $\hat{\mathbf{K}} \in \mathbb{R}^{m \times n}$ with element $\hat{K}_{kj} = \kappa(\mathbf{c}_k, \mathbf{x}_j)$. In addition, $\|\mathbf{W}\|_F^2 = \mathrm{tr}(\mathbf{A}^{\top}\mathbf{K}\mathbf{A})$. With these notations in mind, we can obtain the following objective function:

$\min_{\mathbf{A}, \mathbf{b}} \ \big\|\mathbf{K}\mathbf{A} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y}\big\|_F^2 + \alpha \big\|\hat{\mathbf{K}}\mathbf{A} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}}\big\|_F^2 + \beta\, \mathrm{tr}(\mathbf{A}^{\top}\mathbf{K}\mathbf{A}) + \gamma\, \mathrm{tr}\big(\mathbf{A}^{\top}(\mathbf{K} - \mathbf{S}\hat{\mathbf{K}})^{\top}(\mathbf{K} - \mathbf{S}\hat{\mathbf{K}})\mathbf{A}\big) \qquad (12)$

where $\mathrm{tr}(\cdot)$ denotes the trace operator, and we use its important property $\|\mathbf{M}\|_F^2 = \mathrm{tr}(\mathbf{M}^{\top}\mathbf{M})$. Since $\mathbf{W} = \Phi^{\top}\mathbf{A}$ and $\mathbf{K} = \Phi\Phi^{\top}$, we have $\|\mathbf{W}\|_F^2 = \mathrm{tr}(\mathbf{A}^{\top}\mathbf{K}\mathbf{A})$. Similarly, the fourth term of problem (12) can be derived in the same manner. To solve problem (12), it is not hard to obtain the derivatives with respect to $\mathbf{A}$ and $\mathbf{b}$:

$\dfrac{\partial \mathcal{L}}{\partial \mathbf{A}} = 2\mathbf{K}(\mathbf{K}\mathbf{A} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y}) + 2\alpha\hat{\mathbf{K}}^{\top}(\hat{\mathbf{K}}\mathbf{A} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}}) + 2\beta\mathbf{K}\mathbf{A} + 2\gamma(\mathbf{K} - \mathbf{S}\hat{\mathbf{K}})^{\top}(\mathbf{K} - \mathbf{S}\hat{\mathbf{K}})\mathbf{A} \qquad (13)$
$\dfrac{\partial \mathcal{L}}{\partial \mathbf{b}} = 2(\mathbf{K}\mathbf{A} + \mathbf{1}_n\mathbf{b}^{\top} - \mathbf{Y})^{\top}\mathbf{1}_n + 2\alpha(\hat{\mathbf{K}}\mathbf{A} + \mathbf{1}_m\mathbf{b}^{\top} - \hat{\mathbf{Y}})^{\top}\mathbf{1}_m \qquad (14)$

Setting $\frac{\partial \mathcal{L}}{\partial \mathbf{A}}$ and $\frac{\partial \mathcal{L}}{\partial \mathbf{b}}$ to 0, we can likewise obtain the solutions (again coupled linear equations that can be solved jointly):

$\mathbf{A} = \big(\mathbf{K}\mathbf{K} + \alpha\hat{\mathbf{K}}^{\top}\hat{\mathbf{K}} + \beta\mathbf{K} + \gamma(\mathbf{K} - \mathbf{S}\hat{\mathbf{K}})^{\top}(\mathbf{K} - \mathbf{S}\hat{\mathbf{K}})\big)^{-1}\big(\mathbf{K}(\mathbf{Y} - \mathbf{1}_n\mathbf{b}^{\top}) + \alpha\hat{\mathbf{K}}^{\top}(\hat{\mathbf{Y}} - \mathbf{1}_m\mathbf{b}^{\top})\big) \qquad (15)$
$\mathbf{b} = \dfrac{1}{n + \alpha m}\big[(\mathbf{Y} - \mathbf{K}\mathbf{A})^{\top}\mathbf{1}_n + \alpha(\hat{\mathbf{Y}} - \hat{\mathbf{K}}\mathbf{A})^{\top}\mathbf{1}_m\big] \qquad (16)$

In this paper, the Gaussian kernel function is adopted, i.e., $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp\big(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / (2\sigma^2)\big)$, where the kernel parameter $\sigma$ is empirically set to the average pairwise Euclidean distance between instances.
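A sketch of the kernelized solver under the same reconstruction, with the Gaussian kernel width set to the average pairwise Euclidean distance as described above; it again solves for A and b jointly by appending a constant column, and all function and variable names are ours:

    import numpy as np
    from scipy.spatial.distance import cdist, pdist

    def gaussian_kernel(A, B, sigma):
        """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
        return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma ** 2))

    def fit_kernel_imcc(X, Y, X_hat, Y_hat, assign, alpha, beta, gamma):
        """Solve the kernelized problem (12) for the coefficient matrix A and the bias b."""
        n, m = X.shape[0], X_hat.shape[0]
        sigma = pdist(X).mean()                   # average pairwise Euclidean distance
        K = gaussian_kernel(X, X, sigma)          # n x n kernel matrix
        K_hat = gaussian_kernel(X_hat, X, sigma)  # m x n kernel matrix of cluster centers
        S = np.zeros((n, m))
        S[np.arange(n), assign] = 1.0
        P = np.hstack([K, np.ones((n, 1))])       # unknowns stacked as [A; b^T]
        P_hat = np.hstack([K_hat, np.ones((m, 1))])
        Z = np.hstack([K - S @ K_hat, np.zeros((n, 1))])
        G = np.zeros((n + 1, n + 1))
        G[:n, :n] = K                             # tr(A^T K A) penalty; bias not penalized
        lhs = P.T @ P + alpha * P_hat.T @ P_hat + beta * G + gamma * Z.T @ Z
        rhs = P.T @ Y + alpha * P_hat.T @ Y_hat
        Theta = np.linalg.solve(lhs, rhs)
        return Theta[:-1], Theta[-1], sigma       # A (n x q), b (q,), kernel width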

Input:  $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$: the multi-label training set; $\alpha$, $\beta$, $\gamma$: the regularization hyperparameters; $m$: the number of clusters; $\mathbf{x}^{*}$: the unseen test instance
Output:  $\mathbf{y}^{*}$: the predicted label vector for the test instance
1:  perform $k$-means clustering on $\{\mathbf{x}_i\}_{i=1}^{n}$ to obtain $m$ disjoint clusters;
2:  calculate the cluster centers according to Eq. (1);
3:  calculate the label vectors of the cluster centers according to Eq. (2);
4:  calculate the optimal solutions $\mathbf{A}^{*}$ and $\mathbf{b}^{*}$ according to Eq. (15) and Eq. (16);
5:  return the predicted label vector $\mathbf{y}^{*}$ according to Eq. (17).
Algorithm 1 The IMCC Algorithm
Data set  #Examples  #Features  #Labels  Feature type  LCard  LDen  DL  PDL
cal500 502 68 174 numeric 26.044 0.150 502 1.000
image 2000 294 5 numeric 1.236 0.247 20 0.010
scene 2407 294 6 numeric 1.074 0.179 15 0.006
yeast 2417 103 14 numeric 4.237 0.300 198 0.082
enron 1702 1001 53 nominal 3.378 0.064 753 0.442
genbase 662 1185 27 nominal 1.252 0.046 32 0.048
medical 978 1449 45 nominal 1.245 0.028 94 0.096
arts 5000 462 26 numeric 1.636 0.063 462 0.092
bibtex 7395 1836 159 nominal 2.402 0.015 2856 0.386
computer 5000 681 33 nominal 1.508 0.046 253 0.051
corel5k 5000 499 374 nominal 3.522 0.009 3175 0.635
education 5000 550 33 nominal 1.461 0.044 308 0.062
health 5000 612 32 nominal 1.662 0.052 257 0.051
social 5000 1047 39 nominal 1.283 0.033 226 0.045
society 5000 636 27 nominal 1.692 0.063 582 0.116
Table 1: Characteristics of the benchmark multi-label data sets (LCard: label cardinality; LDen: label density; DL: number of distinct label sets; PDL: proportion of distinct label sets).

3.5 Test Phase

Once the model parameters $\mathbf{A}$ and $\mathbf{b}$ are learned, we denote the optimal solutions by $\mathbf{A}^{*}$ and $\mathbf{b}^{*}$. Then, the predicted label vector of the test instance $\mathbf{x}^{*}$ is given as:

$\mathbf{y}^{*} = \mathrm{sign}\big(\mathbf{A}^{*\top}\mathbf{k}^{*} + \mathbf{b}^{*}\big), \quad \text{with } \mathbf{k}^{*} = [\kappa(\mathbf{x}_1, \mathbf{x}^{*}), \ldots, \kappa(\mathbf{x}_n, \mathbf{x}^{*})]^{\top} \qquad (17)$

where $\mathrm{sign}(\cdot)$ operates element-wise and returns $+1$ if its argument is greater than 0, and $-1$ otherwise. The pseudo code of IMCC is presented in Algorithm 1; a sketch of this prediction step is given below.
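A small sketch of the prediction step, reusing the hypothetical gaussian_kernel helper and the (A, b, sigma) triplet returned by the kernel solver sketched in Section 3.4:

    import numpy as np

    def predict_imcc(X_train, A_coef, b, sigma, X_test):
        """Eq. (17): y* = sign(A^T k* + b), with k*_i = kappa(x_i, x*)."""
        K_test = gaussian_kernel(X_test, X_train, sigma)  # t x n test kernel matrix
        scores = K_test @ A_coef + b                      # real-valued outputs f(x*)
        return np.where(scores > 0, 1, -1), scores        # predicted labels and scores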

4 Experiments

In this section, we evaluate the performance of our proposed IMCC approach by comparing with multiple state-of-the-art approaches on a number of real-world multi-label data sets, in terms of several widely used evaluation metrics.

4.1 Experimental Setup

4.1.1 Data Sets

In order to conduct a persuasive and comprehensive performance evaluation, we collect 15 real-world multi-label data sets for experimental analysis. For each data set, we report the number of examples, the number of dimensions (features), the number of class labels, and the feature type. In addition, following [12, 39], the properties of each data set are further characterized by several statistics, including label cardinality (LCard), label density (LDen), the number of distinct label sets (DL), and the proportion of distinct label sets (PDL). The detailed definitions of these multi-label statistics can be found in [26]. Table 1 reports the detailed information of all the data sets. According to the number of examples, we divide the data sets into two parts: the regular-scale data sets (fewer than 5000 examples) and the large-scale data sets (at least 5000 examples). For each data set, we randomly sample 80% of the examples to form the training set, and the remaining 20% of the examples form the test set. We repeat such a sampling process 10 times, and report the mean prediction value with the standard deviation.

4.1.2 Comparing Algorithms

We compare our proposed approach IMCC with 6 state-of-the-art multi-label learning approaches. Each algorithm is configured with the suggested parameters according to the respective literature.

  • BRsvm [1]: It decomposes the multi-label classification problem into independent binary (one-versus-rest) classification problems. The employed base model is binary SVM, which is trained by the libsvm toolbox [2].

  • ECC [26]: It is an ensemble of classifier chains, where the chain order of each member is randomly generated. The employed base model is SVM, and the ensemble size is set to 10.

  • MAHR [15]: It uses a boosting approach and exploits label correlations via a hypothesis reuse mechanism. The number of boosting rounds is set as suggested in [15].

  • LIFT [35]: It constructs label-specific features for different labels and trains a binary SVM model for each label based on these label-specific features.

  • LLSF [13]: It learns label-specific features for multi-label learning. Its regularization parameters are searched over the candidate ranges recommended in [13].

  • JFSC [14]: It performs joint feature selection and classification for multi-label learning. Its regularization parameters are searched over the candidate ranges recommended in [14].

  • IMCC: This is our proposed approach, which incorporates multiple cluster centers for multi-label learning. The regularization hyperparameters $\alpha$, $\beta$, and $\gamma$, as well as the number of clusters $m$, are searched over predefined candidate sets.

For all the above approaches, the searched parameters are chosen by five-fold cross validation on the training set (a sketch of this tuning protocol is given below).
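To make the tuning protocol concrete, the following sketch runs a grid search with five-fold cross validation; it assumes the hypothetical helpers generate_virtual_examples, fit_kernel_imcc, and predict_imcc sketched in Section 3, treats the candidate sets param_grid and m_grid as placeholders (the concrete grids are not reproduced in this text), and uses negative hamming loss as the selection criterion purely for illustration:

    import itertools
    import numpy as np
    from sklearn.model_selection import KFold

    def tune_imcc(X, Y, param_grid, m_grid, n_splits=5):
        """Pick (alpha, beta, gamma, m) by cross validation on the training set."""
        best, best_score = None, -np.inf
        for alpha, beta, gamma, m in itertools.product(param_grid, param_grid,
                                                       param_grid, m_grid):
            fold_scores = []
            kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
            for tr, va in kf.split(X):
                X_hat, Y_hat, assign = generate_virtual_examples(X[tr], Y[tr], m)
                A_coef, b, sigma = fit_kernel_imcc(X[tr], Y[tr], X_hat, Y_hat,
                                                   assign, alpha, beta, gamma)
                y_pred, _ = predict_imcc(X[tr], A_coef, b, sigma, X[va])
                fold_scores.append(-np.mean(y_pred != Y[va]))  # negative hamming loss
            if np.mean(fold_scores) > best_score:
                best_score, best = np.mean(fold_scores), (alpha, beta, gamma, m)
        return best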

4.1.3 Evaluation Metrics

To comprehensively measure the performance of each multi-label learning approach, we adopt five widely used evaluation metrics: one-error, hamming loss, ranking loss, coverage, and average precision. Note that the values of all the adopted metrics lie in the interval [0, 1]. Given the test set $\mathcal{T} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{t}$, where $\mathbf{x}_i$ is a feature vector with $d$ dimensions and $\mathbf{y}_i \in \{-1,+1\}^{q}$ is the corresponding ground-truth label vector, the learned model (with optimal parameters $\mathbf{A}^{*}$ and $\mathbf{b}^{*}$) produces the real-valued output $f(\mathbf{x}_i) \in \mathbb{R}^{q}$ and the predicted label vector $\mathrm{sign}(f(\mathbf{x}_i))$ for each test instance. Let $Y_i = \{j \mid y_{ij} = +1\}$ denote the relevant label set of $\mathbf{x}_i$ and $\bar{Y}_i$ its complement. The five metrics are defined as follows; a computational sketch of all five metrics is given after this list.

  • One-error: It evaluates the fraction of test instances whose top-ranked predicted label does not belong to the ground-truth relevant label set. The smaller the value of one-error, the better the performance of the classifier.

    $\mathrm{OneError} = \dfrac{1}{t}\sum_{i=1}^{t}\mathbb{I}\Big(\arg\max_{j} f_{j}(\mathbf{x}_i) \notin Y_i\Big) \qquad (18)$

    where $f_{j}(\mathbf{x}_i)$ denotes the modeling output on the $j$-th label, and $\mathbb{I}(\cdot)$ returns 1 if its argument holds and 0 otherwise.

  • Hamming loss: It evaluates the fraction of instance-label pairs that are misclassified. The smaller the value of hamming loss, the better the performance of the classifier.

    $\mathrm{HammingLoss} = \dfrac{1}{t}\sum_{i=1}^{t}\dfrac{1}{q}\sum_{j=1}^{q}\mathbb{I}\big(\mathrm{sign}(f_{j}(\mathbf{x}_i)) \neq y_{ij}\big) \qquad (19)$
  • Ranking loss: It evaluates the average fraction of misordered label pairs, i.e., pairs in which an irrelevant label is ranked at least as high as a relevant label. The smaller the value of ranking loss, the better the performance of the classifier.

    $\mathrm{RankingLoss} = \dfrac{1}{t}\sum_{i=1}^{t}\dfrac{1}{|Y_i|\,|\bar{Y}_i|}\Big|\big\{(u, v) \mid f_{u}(\mathbf{x}_i) \leq f_{v}(\mathbf{x}_i),\ (u, v) \in Y_i \times \bar{Y}_i\big\}\Big| \qquad (20)$

    where $|Y_i|$ and $|\bar{Y}_i|$ denote the numbers of relevant and irrelevant labels of $\mathbf{x}_i$, respectively.

  • Coverage: It evaluates how many steps are needed, on average, to move down the ranked label list of an instance so as to cover all its relevant labels (normalized by the number of labels $q$). The smaller the value of coverage, the better the performance of the classifier.

    $\mathrm{Coverage} = \dfrac{1}{t}\sum_{i=1}^{t}\dfrac{1}{q}\Big(\max_{j \in Y_i}\mathrm{rank}_{f}(\mathbf{x}_i, j) - 1\Big) \qquad (21)$

    where $\mathrm{rank}_{f}(\mathbf{x}_i, j)$ denotes the rank of the $j$-th label when all labels are sorted in descending order of $f(\mathbf{x}_i)$.

  • Average precision: It evaluates the average fraction of relevant labels that are ranked higher than a particular relevant label. The larger the value of average precision, the better the performance of the classifier.

    $\mathrm{AvgPrec} = \dfrac{1}{t}\sum_{i=1}^{t}\dfrac{1}{|Y_i|}\sum_{j \in Y_i}\dfrac{\big|\{j' \in Y_i \mid \mathrm{rank}_{f}(\mathbf{x}_i, j') \leq \mathrm{rank}_{f}(\mathbf{x}_i, j)\}\big|}{\mathrm{rank}_{f}(\mathbf{x}_i, j)} \qquad (22)$

    where the numerator counts the relevant labels that are ranked no lower than label $j$.
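The following sketch computes all five metrics from the ground-truth label matrix Y (entries in {-1, +1}) and the real-valued prediction matrix F, following the standard definitions in Eqs. (18)-(22) (coverage is normalized by q); the function name and the handling of instances with empty relevant or irrelevant label sets are our own choices:

    import numpy as np

    def multilabel_metrics(Y, F):
        """Return one-error, hamming loss, ranking loss, coverage, average precision."""
        t, q = Y.shape
        rel = (Y > 0)                                      # relevant-label mask
        one_err = np.mean([not rel[i, np.argmax(F[i])] for i in range(t)])
        ham = np.mean(np.where(F > 0, 1, -1) != Y)
        rank = (-F).argsort(axis=1).argsort(axis=1) + 1    # rank 1 = highest score
        rloss, cov, ap = [], [], []
        for i in range(t):
            R, I = np.where(rel[i])[0], np.where(~rel[i])[0]
            if len(R) == 0 or len(I) == 0:
                continue                                   # skip degenerate instances
            # ranking loss: fraction of (relevant, irrelevant) pairs that are misordered
            rloss.append(np.mean([F[i, u] <= F[i, v] for u in R for v in I]))
            # coverage: steps down the ranking needed to cover all relevant labels
            cov.append((rank[i, R].max() - 1) / q)
            # average precision over the relevant labels
            ap.append(np.mean([np.sum(rank[i, R] <= rank[i, j]) / rank[i, j] for j in R]))
        return one_err, ham, np.mean(rloss), np.mean(cov), np.mean(ap)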

4.2 Experimental results

Comparing One-error
algorithms cal500 image scene yeast enron genbase medical
IMCC 0.1160.024(1) 0.2530.021(1) 0.1790.017(1) 0.2100.015(1) 0.2300.014(2) 0.0020.005(1) 0.1170.018(1)
BRsvm 0.1190.025(3) 0.3120.018(3) 0.2600.022(6) 0.2250.016(4) 0.2850.023(6) 0.1010.313(5) 0.2350.044(7)
ECC 0.1180.023(2) 0.3210.020(4) 0.2410.016(3) 0.2360.020(5) 0.2980.019(7) 0.1010.314(5) 0.2230.067(5)
MAHR 0.1860.092(7) 0.3060.016(2) 0.2310.010(2) 0.2380.017(6) 0.2650.016(5) 0.0020.003(2) 0.1460.027(4)
LLSF 0.1220.023(5) 0.3310.021(7) 0.2540.015(5) 0.3580.023(7) 0.2260.017(1) 0.0020.003(3) 0.1260.016(2)
JFSC 0.1190.023(4) 0.3290.026(6) 0.2700.011(7) 0.2170.011(2) 0.2390.014(3) 0.0040.005(4) 0.1430.022(3)
LIFT 0.1220.024(5) 0.3260.024(5) 0.2410.019(3) 0.2210.013(3) 0.2510.022(4) 0.1010.314(5) 0.2300.051(6)
Comparing Hamming loss
algorithms cal500 image scene yeast enron genbase medical
IMCC 0.1370.003(1) 0.1480.009(1) 0.0770.004(1) 0.1910.005(1) 0.0460.002(1) 0.0020.001(4) 0.0100.001(1)
BRsvm 0.1370.003(1) 0.1810.011(3) 0.1050.004(5) 0.1990.005(2) 0.0510.002(4) 0.0050.012(5) 0.0130.007(5)
ECC 0.1540.004(7) 0.2560.011(7) 0.1550.009(7) 0.2490.005(6) 0.0610.002(7) 0.0050.012(5) 0.0150.031(7)
MAHR 0.1410.003(6) 0.1710.007(2) 0.0910.003(2) 0.2070.005(5) 0.0510.001(4) 0.0010.001(1) 0.0100.001(1)
LLSF 0.1380.003(3) 0.1810.009(3) 0.1030.003(4) 0.3010.004(7) 0.0460.002(1) 0.0010.001(1) 0.0100.001(1)
JFSC 0.1380.003(4) 0.1860.008(6) 0.1180.004(6) 0.1990.005(2) 0.0520.002(6) 0.0010.001(1) 0.0100.001(1)
LIFT 0.1390.003(5) 0.1810.010(1) 0.0980.004(3) 0.1990.005(2) 0.0470.001(3) 0.0050.012(5) 0.0130.007(5)
Comparing Ranking loss
algorithms cal500 image scene yeast enron genbase medical
IMCC 0.1810.005(1) 0.1370.010(1) 0.0610.007(1) 0.1570.005(1) 0.0740.006(1) 0.0010.003(1) 0.0180.005(2)
BRsvm 0.1830.004(2) 0.1690.011(4) 0.0890.007(5) 0.1690.003(3) 0.0840.008(4) 0.0090.013(6) 0.0260.010(6)
ECC 0.1890.004(6) 0.1650.009(2) 0.0810.005(3) 0.1710.006(4) 0.0840.007(4) 0.0090.013(6) 0.0250.010(5)
MAHR 0.2750.010(7) 0.1650.008(2) 0.0830.005(4) 0.1810.005(6) 0.1290.006(7) 0.0050.003(4) 0.0270.008(7)
LLSF 0.1870.007(5) 0.1780.014(7) 0.0910.005(6) 0.3410.007(7) 0.0810.008(2) 0.0020.002(2) 0.0170.005(1)
JFSC 0.1840.006(4) 0.1750.015(6) 0.0960.005(7) 0.1710.005(4) 0.0980.007(6) 0.0010.001(1) 0.0190.006(3)
LIFT 0.1830.004(2) 0.1710.013(5) 0.0780.004(2) 0.1680.005(2) 0.0810.007(2) 0.0080.014(5) 0.0240.010(4)
Comparing Coverage
algorithms cal500 image scene yeast enron genbase medical
IMCC 0.7470.014(2) 0.1670.013(1) 0.0660.007(1) 0.4410.006(1) 0.2210.017(1) 0.0110.006(1) 0.0280.008(1)
BRsvm 0.7510.014(4) 0.1910.012(4) 0.0890.006(5) 0.4580.006(4) 0.2350.021(5) 0.0220.014(5) 0.0410.013(6)
ECC 0.7650.013(6) 0.1870.010(2) 0.0810.004(3) 0.4550.008(2) 0.2280.018(3) 0.0220.014(5) 0.0390.012(5)
MAHR 0.8940.012(7) 0.1890.008(3) 0.0840.004(4) 0.4770.007(6) 0.3390.020(7) 0.0130.002(3) 0.0410.010(6)
LLSF 0.7470.015(2) 0.1940.015(5) 0.0920.004(6) 0.6270.009(7) 0.2220.019(2) 0.0130.003(3) 0.0280.008(1)
JFSC 0.7420.014(1) 0.1940.015(5) 0.0920.005(6) 0.4550.007(2) 0.2650.017(6) 0.0110.002(1) 0.0290.009(3)
LIFT 0.7510.017(4) 0.1940.015(5) 0.0790.003(2) 0.4610.007(5) 0.2280.018(3) 0.0220.014(5) 0.0380.011(4)
Comparing Average precision
algorithms cal500 image scene yeast enron genbase medical
IMCC 0.5050.005(1) 0.8340.012(1) 0.8930.010(1) 0.7770.008(1) 0.7040.013(1) 0.9970.004(1) 0.9120.012(1)
BRsvm 0.5010.006(2) 0.7970.011(3) 0.8470.012(5) 0.7620.008(3) 0.6570.016(4) 0.9440.152(6) 0.8410.132(7)
ECC 0.4910.003(6) 0.7970.011(3) 0.8570.008(4) 0.7560.011(5) 0.6570.013(4) 0.9440.152(6) 0.8520.134(5)
MAHR 0.4410.010(7) 0.8010.008(2) 0.8610.006(2) 0.7450.009(6) 0.6410.013(7) 0.9940.003(4) 0.8920.018(4)
LLSF 0.5010.010(2) 0.7890.014(5) 0.8470.007(5) 0.6170.007(7) 0.7030.015(2) 0.9960.003(2) 0.9080.009(2)
JFSC 0.5010.007(2) 0.7890.016(5) 0.8360.007(7) 0.7620.008(3) 0.6430.013(6) 0.9960.003(2) 0.8990.013(3)
LIFT 0.4960.006(5) 0.7890.015(5) 0.8590.010(3) 0.7660.007(2) 0.6840.013(3) 0.9470.153(5) 0.8480.023(6)
Table 2: Predictive results of each algorithm (mean ± standard deviation) on the regular-scale data sets. The best results are highlighted, and the number in brackets indicates the rank of the algorithm.
Comparing One-error
algorithms arts bibtex computer corel5k education health social society
IMCC 0.4560.013(1) 0.3610.008(4) 0.3330.014(1) 0.6610.009(2) 0.4620.016(2) 0.2540.011(2) 0.2720.004(1) 0.3860.018(1)
BRsvm 0.4560.014(1) 0.4030.015(7) 0.4070.209(4) 0.7020.105(5) 0.2710.031(1) 0.4680.367(5) 0.4090.311(5) 0.4460.195(4)
ECC 0.4820.010(5) 0.3940.012(6) 0.4130.206(6) 0.7180.099(6) 0.5710.226(5) 0.4730.364(6) 0.4140.309(6) 0.4520.193(6)
MAHR 0.5480.011(7) 0.3710.005(5) 0.4090.014(5) 0.9070.008(7) 0.6030.021(7) 0.3210.015(4) 0.3280.007(4) 0.4460.015(4)
LLSF 0.4610.011(4) 0.3490.004(1) 0.3370.017(2) 0.6240.011(1) 0.4660.013(3) 0.2460.015(1) 0.2730.008(2) 0.3940.017(2)
JFSC 0.5120.012(6) 0.3580.007(3) 0.3810.014(3) 0.6750.008(3) 0.5150.022(4) 0.2960.009(3) 0.3230.008(3) 0.4230.018(3)
LIFT 0.4560.011(1) 0.3550.011(2) 0.4130.206(6) 0.6830.112(4) 0.5810.221(6) 0.4780.361(7) 0.4270.302(7) 0.4690.187(7)
Comparing Hamming loss
algorithms arts bibtex computer corel5k education health social society
IMCC 0.0570.001(3) 0.0130.0(3) 0.0330.002(1) 0.0090.001(1) 0.0380.001(1) 0.0330.001(1) 0.0210.001(1) 0.0510.001(1)
BRsvm 0.0540.001(1) 0.0130.0(3) 0.0360.009(3) 0.0110.001(5) 0.1990.009(7) 0.0410.015(5) 0.0240.011(5) 0.0550.012(4)
ECC 0.0770.004(7) 0.0140.0(6) 0.0460.009(7) 0.0110.001(5) 0.0590.013(6) 0.0480.015(6) 0.0310.011(7) 0.0610.012(6)
MAHR 0.0570.001(3) 0.0130.0(3) 0.0370.002(5) 0.0090.001(1) 0.0410.001(4) 0.0380.002(4) 0.0220.001(3) 0.0560.001(5)
LLSF 0.0570.001(3) 0.0120.0(1) 0.0340.001(2) 0.0090.001(1) 0.0380.001(1) 0.0330.001(1) 0.0210.001(1) 0.0520.001(2)
JFSC 0.0570.001(3) 0.0170.0(7) 0.0360.002(3) 0.0090.001(1) 0.0390.001(3) 0.0360.001(3) 0.0220.001(3) 0.0530.001(3)
LIFT 0.0540.001(1) 0.0120.0(1) 0.0370.009(5) 0.0110.001(5) 0.0440.013(5) 0.0480.016(6) 0.0240.011(5) 0.0610.012(6)
Comparing Ranking loss
algorithms arts bibtex computer corel5k education health social society
IMCC 0.1110.003(1) 0.0630.002(1) 0.0770.004(4) 0.1110.002(1) 0.0720.004(1) 0.0460.003(1) 0.0520.004(2) 0.1260.005(3)
BRsvm 0.1140.004(2) 0.0850.001(6) 0.0710.010(2) 0.1230.003(3) 0.1560.012(6) 0.0490.016(3) 0.0520.012(2) 0.1230.012(2)
ECC 0.1150.004(4) 0.0830.002(4) 0.0680.010(1) 0.1220.003(2) 0.0760.015(2) 0.0480.016(2) 0.0490.011(1) 0.1210.012(1)
MAHR 0.2010.010(7) 0.0940.004(7) 0.1250.006(7) 0.2660.018(7) 0.2090.012(7) 0.0770.006(7) 0.0950.006(7) 0.2110.008(7)
LLSF 0.1210.004(5) 0.0690.002(2) 0.0890.005(5) 0.1260.004(5) 0.0810.004(4) 0.0620.003(5) 0.0610.005(5) 0.1370.005(5)
JFSC 0.1220.004(6) 0.0830.003(4) 0.0950.004(6) 0.1380.002(6) 0.0810.005(4) 0.0690.005(6) 0.0780.006(6) 0.1460.006(6)
LIFT 0.1140.004(3) 0.0740.002(3) 0.0740.011(3) 0.1230.003(3) 0.0780.015(3) 0.0510.016(4) 0.0520.011(2) 0.1260.013(3)
Comparing Coverage
algorithms arts bibtex computer corel5k education health social society
IMCC 0.1730.004(1) 0.1240.003(1) 0.1180.006(4) 0.2690.006(1) 0.1050.005(2) 0.0960.006(4) 0.0810.006(4) 0.2070.007(4)
BRsvm 0.1740.006(3) 0.1580.003(6) 0.1070.010(2) 0.2890.006(4) 0.2910.015(7) 0.0890.015(2) 0.0710.011(2) 0.1890.014(2)
ECC 0.1730.007(1) 0.1560.003(5) 0.1050.009(1) 0.2870.006(3) 0.1030.015(1) 0.0880.014(1) 0.0680.011(1) 0.1880.014(1)
MAHR 0.2790.012(7) 0.1710.004(7) 0.1740.007(7) 0.5150.027(7) 0.2640.014(6) 0.1360.009(7) 0.1280.007(7) 0.3070.009(7)
LLSF 0.1890.006(6) 0.1320.004(2) 0.1310.006(5) 0.2810.006(2) 0.1190.005(5) 0.1210.005(5) 0.0910.006(5) 0.2160.008(5)
JFSC 0.1840.006(5) 0.1510.004(4) 0.1420.005(6) 0.3190.004(6) 0.1140.007(4) 0.1330.008(6) 0.1090.007(6) 0.2310.011(6)
LIFT 0.1740.006(3) 0.1410.003(3) 0.1110.010(3) 0.2890.006(4) 0.1060.015(3) 0.0910.015(3) 0.0710.011(2) 0.1910.014(3)
Comparing Average precision
algorithms arts bibtex computer corel5k education health social society
IMCC 0.6340.008(1) 0.6080.006(2) 0.7230.010(1) 0.2960.002(2) 0.6480.013(2) 0.7950.008(1) 0.7860.007(1) 0.6480.010(1)
BRsvm 0.6270.009(2) 0.5380.010(7) 0.6850.099(3) 0.2710.027(4) 0.8070.014(1) 0.6950.167(6) 0.7190.155(4) 0.6220.086(3)
ECC 0.6170.007(5) 0.5480.008(6) 0.6850.099(3) 0.2650.027(5) 0.5910.115(5) 0.6980.168(5) 0.7190.153(4) 0.6190.087(5)
MAHR 0.5240.008(7) 0.5740.005(5) 0.6350.010(7) 0.0990.005(7) 0.4810.016(7) 0.7250.009(4) 0.7150.007(6) 0.5610.010(7)
LLSF 0.6270.007(2) 0.6130.005(1) 0.7140.011(2) 0.3050.008(1) 0.6420.010(3) 0.7860.008(2) 0.7800.008(2) 0.6390.010(2)
JFSC 0.5970.007(6) 0.5930.006(3) 0.6850.009(3) 0.2610.003(6) 0.6150.014(4) 0.7610.006(3) 0.7510.007(3) 0.6220.010(3)
LIFT 0.6270.007(2) 0.5850.007(4) 0.6780.098(6) 0.2810.028(3) 0.5820.113(6) 0.6880.164(7) 0.7080.152(7) 0.6090.085(6)
Table 3: Predictive results of each algorithm (mean ± standard deviation) on the large-scale data sets. The best results are highlighted, and the number in brackets indicates the rank of the algorithm.

Tables 2 and 3 report the detailed experimental results of each algorithm on regular-scale and large-scale data sets, respectively. For the two tables, the best results are highlighted (in boldface), and the number in each bracket indicates the ranking of this algorithm.

Evaluation metric  Friedman statistic F_F  Critical value (α = 0.05)
One-error  4.57  2.209
Hamming loss  6.06  2.209
Ranking loss  13.74  2.209
Coverage  6.76  2.209
Average precision  11.45  2.209
Table 4: Friedman statistics F_F for each evaluation metric and the critical value at the 0.05 significance level (k = 7 comparing algorithms, N = 15 data sets).

In order to further systematically analyze the relative performance of the comparing algorithms, we employ the widely used Friedman test [5] for the comparison of multiple algorithms over a number of data sets, with respect to each evaluation metric. Specifically, given $k$ algorithms to be compared on $N$ data sets, let $R_j$ denote the average rank of the $j$-th algorithm over all data sets (mean ranks are shared in case of ties). Under the null hypothesis that all algorithms perform equally well, the Friedman statistic is calculated as

$\chi_{F}^{2} = \dfrac{12N}{k(k+1)}\Big[\sum_{j=1}^{k} R_{j}^{2} - \dfrac{k(k+1)^{2}}{4}\Big], \qquad F_{F} = \dfrac{(N-1)\,\chi_{F}^{2}}{N(k-1) - \chi_{F}^{2}} \qquad (23)$

where $F_{F}$ follows the $F$-distribution with $(k-1)$ and $(k-1)(N-1)$ degrees of freedom.

In this paper, the number of comparing algorithms is $k = 7$ and the number of data sets is $N = 15$. Table 4 summarizes the Friedman statistics for each evaluation metric together with the critical value at the 0.05 significance level. As shown in Table 4, the null hypothesis of equal performance is clearly rejected at the 0.05 significance level for every metric. Consequently, the post-hoc Nemenyi test [5] is employed for further analysis, treating IMCC as the control algorithm, to indicate whether IMCC achieves superior performance to the comparing algorithms. The significant difference between IMCC and another algorithm is determined by comparing their average ranks with the Critical Difference (CD) [5].

For the Nemenyi test, $\mathrm{CD} = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}$; with $k = 7$, $N = 15$, and $q_{0.05} = 2.949$, we obtain $\mathrm{CD} \approx 2.33$ (a computational sketch is given below). The performance of an algorithm is considered significantly different from that of IMCC if their average ranks over all data sets differ by at least one CD. Figure 1 shows the CD diagrams for each evaluation metric: a comparing algorithm is connected to IMCC if its average rank is within one CD of that of IMCC; otherwise, its performance is deemed significantly different from that of IMCC.
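A small sketch of this statistical procedure, given a matrix holding one metric's results with one row per data set and one column per algorithm; the default value q_0.05 = 2.949 corresponds to k = 7 algorithms in the standard two-tailed Nemenyi table, and the function name is ours:

    import numpy as np
    from scipy.stats import rankdata, f as f_dist

    def friedman_nemenyi(perf, lower_is_better=True, q_alpha=2.949):
        """Return the Friedman statistic F_F (Eq. (23)), the critical value of the
        F-distribution with (k-1, (k-1)(N-1)) degrees of freedom at alpha = 0.05,
        the Nemenyi critical difference CD, and the average rank of each algorithm."""
        N, k = perf.shape
        ranks = np.vstack([rankdata(row if lower_is_better else -row) for row in perf])
        R = ranks.mean(axis=0)                             # average rank per algorithm
        chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
        F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
        crit = f_dist.ppf(0.95, k - 1, (k - 1) * (N - 1))
        cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))    # Nemenyi critical difference
        return F_F, crit, cd, R

For the setting in this paper (k = 7, N = 15), crit evaluates to about 2.209, matching Table 4.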

(a) One-error
(b) Hamming loss
(c) Ranking loss
(d) Coverage
(e) Average precision
Figure 1: Comparison of IMCC (control algorithm) against other comparing algorithms based on the Nemenyi test. Algorithms not connected with IMCC in the CD diagram are considered to have significantly different performance from IMCC.

Based on the above experimental results, the following observations can be made:

  • As shown in Table 2 and Table 3, IMCC ranks first on all evaluation metrics on four data sets (image, scene, yeast, and arts). Three of these are regular-scale data sets with a limited number of examples, where IMCC benefits most from data augmentation.

  • From Table 2 and Table 3, we can observe that, across all evaluation metrics and all fifteen data sets, IMCC ranks first in 72.00% of the cases and ranks among the top three in 89.33% of the cases. It is also worth noting that IMCC ranks first in 88.57% of the cases on the regular-scale data sets (Table 2), while it ranks first in 55.50% of the cases on the large-scale data sets (Table 3). These results indicate that IMCC is superior to the other comparing algorithms in most cases and tends to work better on regular-scale data sets. This observation accords with the widely accepted intuition that data augmentation is normally more helpful on regular-scale data sets than on large-scale data sets: since large-scale data sets already provide relatively adequate training examples, data augmentation may be less useful in that case. Despite this, IMCC still achieves competitive performance against the other state-of-the-art approaches on the large-scale data sets.

  • From Figure 1, we can observe that IMCC achieves the best average rank among all algorithms on every evaluation metric. It is also worth noting that IMCC significantly outperforms each comparing algorithm on at least two evaluation metrics. Moreover, on the one-error and average precision metrics, only LLSF is competitive with IMCC (i.e., IMCC significantly outperforms the other five algorithms on these two metrics).

(a) Varying the trade-off hyperparameters on enron
(b) Varying the trade-off hyperparameters on yeast
(c) Varying