In multi-label data, each example is typically associated with a small number of labels, much smaller than the total number of labels. This results in a sparse label matrix, in which only a small fraction of the example-label pairs carry a positive class value. From the viewpoint of each separate label, this gives rise to class imbalance, which has recently been recognized as a key challenge in multi-label learning [6, 7, 11, 18, 31].
Approaches for handling class imbalance in multi-label data can be divided into two categories: a) reducing the imbalance level of multi-label data via resampling techniques, including synthetic data generation [5, 6, 7, 8], and b) making multi-label learning methods resilient to class imbalance [11, 18, 31]. This work focuses on the first category, whose approaches can be coupled with any multi-label learning method and are therefore more flexible.
Existing resampling approaches for multi-label data focus on class imbalance at the global scale of the whole dataset. However, previous studies of class imbalance in binary and multi-class classification [19, 20] have found that the distribution of class values in the local neighbourhood of minority examples, rather than the global imbalance level, is the main reason for the difficulty of a classifier to recognize the minority class. We hypothesize that this finding is also true, and even more important to consider, in the more complex setting of multi-label data, where it has not been examined yet.
Consider, for example, the 2-dimensional multi-label datasets (a) and (b) in Fig. 1, concerning points in a plane. The points are characterized by three labels, concerning their shape (triangles, circles), their border (solid, none) and their color (green, red). These datasets have the same level of label imbalance. Yet (b) appears much more challenging, due to the presence of sub-concepts for the triangles and the points without border, and the overlap between the green and red points as well as between the points with a solid border and those with no border.
This work proposes a novel multi-label synthetic oversampling method, named MLSOL, whose seed instance selection and synthetic instance generation processes depend on the local distribution of the labels. This allows MLSOL to create more diverse and better labelled synthetic instances. Furthermore, we consider the coupling of MLSOL and other resampling methods with a simple but flexible ensemble framework to further improve its performance and robustness. Experimental results on 13 multi-label datasets demonstrate the effectiveness of the proposed sampling approach, especially its ensemble version, for three different imbalance-aware evaluation metrics and six different multi-label methods.
The remainder of this paper is organized as follows. Section 2 offers a brief review of methods for addressing class imbalance in multi-label data. Then, our approach is introduced in Section 3. Section 4 presents and discusses the experimental results. Finally, Section 5 summarizes the main contributions of this work.
2 Related Work
A first approach to dealing with class imbalance in the context of multi-label data is to use resampling techniques, which are applied in a pre-processing step and are independent of the particular multi-label learning algorithm that will subsequently be applied to the data. LP-RUS and LP-ROS are twin sampling methods: the former removes instances assigned the most frequent labelset (i.e. particular combination of label values), while the latter replicates instances whose labelset appears the fewest times.
Instead of considering the whole labelset, several sampling methods alleviate the imbalance of the dataset at the level of individual labels, i.e. by increasing the frequency of minority labels and reducing the number of appearances of majority labels. ML-RUS and ML-ROS simply delete instances with majority labels and clone examples with minority labels, respectively. MLeNN heuristically eliminates instances that carry only majority labels and whose labelset is similar to those of their neighbours, based on the Edited Nearest Neighbor (ENN) rule. To make a multi-label dataset more balanced, MLSMOTE randomly selects an instance containing minority labels, together with its neighbours, to generate synthetic instances, which are assigned the labels that appear in more than half of the seed instance and its neighbours.
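The "more than half" label assignment described for MLSMOTE can be illustrated with a minimal sketch. It is not the reference implementation: `majority_vote_labels` is a hypothetical helper name, and the toy label vectors are invented for illustration.

```python
def majority_vote_labels(seed_labels, neighbour_labels):
    """Keep a label only if it appears in more than half of the seed and its neighbours."""
    group = [seed_labels] + list(neighbour_labels)
    half = len(group) / 2
    q = len(seed_labels)
    return [1 if sum(y[j] for y in group) > half else 0 for j in range(q)]


# Toy example: one seed and four neighbours, three labels.
seed = [1, 0, 1]
neighbours = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
synthetic_labels = majority_vote_labels(seed, neighbours)
print(synthetic_labels)
```

Note that the resulting labels depend only on the seed and its neighbourhood, not on where the synthetic point falls between them.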
REMEDIAL tackles the concurrence of labels with different imbalance levels in one instance, the level of which is assessed by a dedicated measure, by decomposing each such complex instance into two simpler examples. However, it may introduce extra confusion into the learning task, since it yields several pairs of instances with the same features and disparate labels. REMEDIAL can be used either as a standalone sampling method or as the first stage of other sampling techniques, e.g. RHwRSMT combines REMEDIAL with MLSMOTE.
Apart from resampling methods, another group of approaches makes the multi-label learning method itself handle the class-imbalance problem directly. Some methods deal with the imbalance issue by transforming the multi-label dataset into several binary or multi-class classification problems. COCOA converts the original multi-label dataset into one binary dataset and several multi-class datasets for each label, and builds imbalance classifiers with the assistance of sampling for each dataset. SOSHF transforms the multi-label learning task into an imbalanced single-label classification task via cost-sensitive clustering, and the new task is addressed by oblique structured Hellinger decision trees. Besides, many approaches aim to modify existing multi-label learning methods to handle the class-imbalance problem. ECCRU3 makes ECC resilient to class imbalance by coupling it with undersampling and improving the exploitation of majority examples. Apart from ECCRU3, modified models based on neural networks [16, 23, 26], SVM, hypernetwork and BR [10, 12, 25, 28] have been proposed as well. Furthermore, other strategies, such as representation learning, constrained submodular minimization and balanced pseudo-labels, have been used to address the imbalance obstacle of multi-label learning.
3 Our Approach
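We start by introducing our mathematical notation. Let $\mathcal{X} = \mathbb{R}^d$ be a $d$-dimensional input feature space, $L = \{l_1, l_2, \ldots, l_q\}$ a label set containing $q$ labels and $\mathcal{Y} = \{0, 1\}^q$ a $q$-dimensional label space. $D = \{(x_i, y_i) \mid 1 \le i \le n\}$ is a multi-label training data set containing $n$ instances. Each instance $(x_i, y_i)$ consists of a feature vector $x_i \in \mathcal{X}$ and a label vector $y_i \in \mathcal{Y}$, where $y_{ij}$ is the $j$-th element of $y_i$ and $y_{ij} = 1$ ($0$) denotes that $l_j$ is (not) associated with the $i$-th instance. A multi-label method learns from $D$ the mapping function $h: \mathcal{X} \to \{0, 1\}^q$ and (or) $f: \mathcal{X} \to \mathbb{R}^q$ that, given an unseen instance $x$, outputs a label vector $\hat{y} \in \{0, 1\}^q$ with the predicted labels of $x$ and (or) a real-valued vector $\hat{f} \in \mathbb{R}^q$ with the corresponding relevance degrees, respectively.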
We propose a novel Multi-Label Synthetic Oversampling approach based on the Local distribution of labels (MLSOL). The pseudo-code of MLSOL is shown in Algorithm 1. First, some auxiliary variables, namely the weight vector $w$ and the type matrix $T$, used for seed instance selection and synthetic example generation respectively, are calculated based on the local label distribution of the instances (lines 3-6 in Algorithm 1). Then, in each iteration, the seed and reference instances are selected, upon which a synthetic example is generated and added to the dataset. The loop (lines 7-12 in Algorithm 1) terminates when the expected number of new examples has been created. The following subsections detail the definition of the auxiliary variables as well as the strategies for picking seed instances and creating synthetic examples.
3.1 Selection of Seed Instances
We sample seed instances with replacement, with the probability of selection being proportional to the minority class values it is associated with, weighted by the difficulty of correctly classifying these values based on the proportion of opposite (majority) class values in the local neighborhood of the instance.
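For each instance $x_i$ we first retrieve its $k$ nearest neighbours, $\mathrm{kNN}(x_i)$. Then, for each label $l_j$, we compute the proportion of neighbours having the opposite class with respect to the class of the instance, and store the result in the matrix $C \in \mathbb{R}^{n \times q}$ according to the following equation, where $[\![\pi]\!]$ is the indicator function that returns 1 if $\pi$ is true and 0 otherwise:

$$C_{ij} = \frac{1}{k} \sum_{x_m \in \mathrm{kNN}(x_i)} [\![ y_{mj} \neq y_{ij} ]\!] \tag{1}$$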
The values in $C$ range from 0 to 1, with values close to 0 (1) indicating a safe (hostile) neighborhood of similarly (oppositely) labelled examples. A value of $C_{ij} = 1$ can further be viewed as a hint that $x_i$ is an outlier in this neighborhood with respect to $l_j$.
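The computation of the local imbalance matrix described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `knn_indices` and `local_imbalance` are hypothetical, a brute-force Euclidean neighbour search is used, and the toy data are invented.

```python
import math


def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i] (Euclidean, excluding i itself)."""
    dists = sorted((math.dist(X[i], X[m]), m) for m in range(len(X)) if m != i)
    return [m for _, m in dists[:k]]


def local_imbalance(X, Y, k):
    """C[i][j]: fraction of the k neighbours of x_i whose j-th label differs from y_ij."""
    q = len(Y[0])
    C = []
    for i in range(len(X)):
        nbrs = knn_indices(X, i, k)
        C.append([sum(Y[m][j] != Y[i][j] for m in nbrs) / k for j in range(q)])
    return C


# Toy data: four 1-D points with two labels each.
X = [[0.0], [1.0], [2.0], [10.0]]
Y = [[1, 0], [1, 0], [0, 0], [1, 1]]
C = local_imbalance(X, Y, k=2)
for row in C:
    print(row)
```

In this toy example the fourth point obtains a value of 1 for the second label, since none of its neighbours shares its positive value.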
The next step is to aggregate the values in $C$ per training example, $x_i$, in order to arrive at a single sampling weight, $w_i$, characterizing the difficulty in correctly predicting the minority class values of this example. A straightforward way to do this is to simply sum these values for the labels where the instance contains the minority class. Assuming for simplicity of presentation that the value 1 corresponds to the minority class, we arrive at this aggregation as follows:

$$w_i = \sum_{j=1}^{q} C_{ij} \, [\![ y_{ij} = 1 ]\!] \tag{2}$$
There are two issues with this. The first one is that we have also taken into account the outliers. We omit them by adding a second indicator function requiring $C_{ij}$ to be less than 1. The second issue is that this aggregation does not take into account the global level of class imbalance of each of the labels. The fewer the minority samples, the higher the difficulty of correctly classifying the corresponding minority class. In contrast, Equation 2 treats all labels equally. To resolve this issue, we can normalize the values of the non-outlier minority examples in $C$ so that they sum to 1 per label, by dividing by the sum of the values of all non-outlier minority examples of that label. This will increase the relative importance of the weights of labels with fewer samples. Addressing these two issues we arrive at the following proposed aggregation:

$$w_i = \sum_{j=1}^{q} \frac{C_{ij} \, [\![ y_{ij} = 1 ]\!] \, [\![ C_{ij} < 1 ]\!]}{\sum_{m=1}^{n} C_{mj} \, [\![ y_{mj} = 1 ]\!] \, [\![ C_{mj} < 1 ]\!]} \tag{3}$$
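The proposed aggregation and the subsequent weighted sampling of seeds with replacement might look as follows. This is a sketch under stated assumptions: `seed_weights` is a hypothetical name, the value 1 is taken to be the minority class of every label, labels without any non-outlier minority example are simply skipped (an assumption, since the text does not cover that case), and the toy inputs are invented.

```python
import random


def seed_weights(C, Y):
    """Per-instance sampling weights: per-label normalised sum over non-outlier minority values."""
    n, q = len(Y), len(Y[0])
    w = [0.0] * n
    for j in range(q):
        # Contribution of instance i to label j: C_ij if minority and not an outlier, else 0.
        contrib = [C[i][j] if Y[i][j] == 1 and C[i][j] < 1 else 0.0 for i in range(n)]
        total = sum(contrib)
        if total > 0:  # assumption: skip labels with no usable minority example
            for i in range(n):
                w[i] += contrib[i] / total
    return w


# Toy inputs (hand-picked, two labels, four instances).
C = [[0.5, 0.0], [0.5, 0.0], [1.0, 0.0], [0.5, 1.0]]
Y = [[1, 0], [1, 0], [0, 0], [1, 1]]
w = seed_weights(C, Y)

# Seeds are drawn with replacement, with probability proportional to w.
rng = random.Random(0)
seeds = rng.choices(range(len(w)), weights=w, k=10)
print(w, seeds)
```

In this example the third instance has weight 0 and is never drawn as a seed, while the only minority value of the second label is an outlier and therefore contributes nothing.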
3.2 Synthetic Instance Generation
The definition of the type of each instance-label pair is indispensable for assigning appropriate labels to the new instances that we create. Inspired by earlier work on the types of minority class examples, we distinguish minority class instances into four types, namely safe, borderline, rare and outlier, according to the proportion of neighbours from the same (minority) class:
Safe: the instance is located in a region dominated by minority examples.
Borderline: the instance lies on the decision boundary between the minority and majority classes.
Rare: the instance, accompanied by isolated pairs or triples of minority class examples, is located in the majority class area and distant from the decision boundary. An instance is typed as rare only if its neighbours from the minority class are of type rare or outlier. Otherwise there are some safe or borderline examples in its proximity, which suggests that it is rather a borderline instance.
Outlier: the instance is surrounded by majority examples.
For the sake of uniform representation, the type of a majority class instance is defined as majority. Let $T$ be the type matrix and $T_{ij}$ be the type of $y_{ij}$. The detailed steps for obtaining $T$ are illustrated in Algorithm 2.
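One possible way to derive the type matrix from the local imbalance values is sketched below. It is not Algorithm 2 itself: the cut-offs `low=0.3` and `high=0.7` are assumptions chosen for illustration (the numeric conditions are not given in the text above), the function names are hypothetical, the neighbour lists are passed in precomputed, and the toy inputs are hand-picked.

```python
SAFE, BORDERLINE, RARE, OUTLIER, MAJORITY = "safe", "borderline", "rare", "outlier", "majority"


def initial_type(c, is_minority, low=0.3, high=0.7):
    """Type from the proportion c of opposite-class neighbours (low/high are assumed cut-offs)."""
    if not is_minority:
        return MAJORITY
    if c < low:
        return SAFE
    if c < high:
        return BORDERLINE
    if c < 1:
        return RARE
    return OUTLIER


def type_matrix(C, Y, neighbours, low=0.3, high=0.7):
    """Assign initial types, then re-type rare values that have a safe/borderline minority neighbour."""
    n, q = len(Y), len(Y[0])
    T = [[initial_type(C[i][j], Y[i][j] == 1, low, high) for j in range(q)] for i in range(n)]
    changed = True
    while changed:  # terminates: every change turns one rare entry into borderline
        changed = False
        for i in range(n):
            for j in range(q):
                if T[i][j] == RARE and any(
                    T[m][j] in (SAFE, BORDERLINE) for m in neighbours[i]
                ):
                    T[i][j] = BORDERLINE
                    changed = True
    return T


# Toy inputs: one label, four instances, hand-picked neighbour lists.
C = [[0.2], [0.8], [0.8], [0.4]]
Y = [[1], [1], [1], [0]]
neighbours = [[1, 3], [0, 2], [1, 3], [0, 1]]
T = type_matrix(C, Y, neighbours)
print(T)
```

Here the second instance starts as rare but has a safe minority neighbour, so it becomes borderline; that change then propagates to the third instance.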
Once the seed instance $(x_s, y_s)$ has been decided, the reference instance $(x_r, y_r)$ is randomly chosen from the $k$ nearest neighbours of the seed instance. Using the selected seed and reference instances, a new synthetic instance is generated according to Algorithm 3. The feature values of the synthetic instance $(x_c, y_c)$ are interpolated along the line that connects the two input samples (lines 1-2 in Algorithm 3). Once $x_c$ is determined, we compute $cd \in [0, 1]$, which indicates whether the synthetic instance is closer to the seed ($cd < 0.5$) or closer to the reference instance ($cd > 0.5$) (lines 3-4 in Algorithm 3).
With respect to label assignment, we employ a scheme that considers the labels and types of the seed and reference instances as well as the location of the synthetic instance. This scheme is able to create informative instances for difficult minority class labels without bringing in noise for majority labels. For each label $l_j$, $y_{cj}$ is set to $y_{sj}$ (lines 6-7 in Algorithm 3) if $y_{sj}$ and $y_{rj}$ belong to the same class. In the case where $y_{sj}$ is the majority class, the seed instance and the reference example are exchanged to guarantee that $y_{sj}$ is always the minority class (lines 9-11 in Algorithm 3). Then $\theta$, a threshold for $cd$, is specified based on the type of the seed label, $T_{sj}$ (lines 12-16 in Algorithm 3), and is used to determine the instance (seed or reference) whose label will be copied to the synthetic example. For the safe, borderline and rare types, where the minority (seed) example is surrounded by several majority instances and runs a higher risk of being misclassified, the cut-point of label assignment is closer to the majority (reference) instance. Specifically, one setting of $\theta$ places the frontier of label assignment at the midpoint between the seed and reference instance, another makes the range of the minority class three times as large as that of the majority class, and a third ensures that the generated instance is always assigned the minority class regardless of its location. With respect to an outlier, a singular point placed in the majority class region, all possible synthetic instances are assigned the majority class, due to the inability of an outlier to cover the input space. Finally, $y_{cj}$ is set to $y_{sj}$ if $cd$ is not larger than $\theta$, otherwise $y_{cj}$ is equal to $y_{rj}$ (lines 17-20 in Algorithm 3).
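The generation step described above can be sketched as follows. Several details are assumptions and not taken from the text: the mapping of thresholds to types in `THETA` (0.5 for safe, 0.75 for borderline, slightly above 1 for rare, slightly below 0 for outlier), the use of a single interpolation factor for all features, the definition of the relative distance as `ds / (ds + dr)`, and the per-label (rather than global) exchange of seed and reference. The name `generate_instance` is hypothetical.

```python
import math
import random

# Assumed thresholds per seed type; which type receives which value is an assumption.
THETA = {"safe": 0.5, "borderline": 0.75, "rare": 1.0 + 1e-5, "outlier": 0.0 - 1e-5}


def generate_instance(xs, ys, ts, xr, yr, tr, rng):
    """Create one synthetic instance from a seed (xs, ys, ts) and a reference (xr, yr, tr)."""
    # Features: a random point on the segment between seed and reference.
    t = rng.random()
    xc = [a + t * (b - a) for a, b in zip(xs, xr)]
    ds, dr = math.dist(xc, xs), math.dist(xc, xr)
    cd = ds / (ds + dr) if ds + dr > 0 else 0.0  # 0 = at the seed, 1 = at the reference

    yc = []
    for j in range(len(ys)):
        if ys[j] == yr[j]:
            yc.append(ys[j])  # both agree: copy the shared value
            continue
        if ts[j] == "majority":
            # Exchange roles so that the minority side acts as the seed for this label.
            seed_label, ref_label, seed_type, d = yr[j], ys[j], tr[j], 1.0 - cd
        else:
            seed_label, ref_label, seed_type, d = ys[j], yr[j], ts[j], cd
        yc.append(seed_label if d <= THETA[seed_type] else ref_label)
    return xc, yc


rng = random.Random(0)
xc, yc = generate_instance(
    [0.0, 0.0], [1, 1], ["outlier", "rare"],
    [1.0, 1.0], [0, 0], ["majority", "majority"],
    rng,
)
print(xc, yc)
```

With an outlier seed the first label always follows the majority reference, while with a rare seed the second label always keeps the minority value, wherever the synthetic point falls on the segment.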
Compared with MLSMOTE, MLSOL is able to generate more diverse and well-labeled synthetic instances. As the example in Figure 2 shows, given a seed instance, the labels of the synthetic instance are fixed in MLSMOTE, while the labels of the new instance change according to its location in MLSOL, which avoids the introduction of noise as well.
3.3 Ensemble of Multi-Label Sampling (EMLS)
Ensembling is an effective strategy for increasing overall accuracy and overcoming over-fitting, but it has not yet been applied to multi-label sampling approaches. To improve the robustness of MLSOL and existing multi-label sampling methods, we propose an ensemble framework called EMLS, in which any multi-label sampling approach and classifier can be embedded. In EMLS, $M$ multi-label learning models are trained, each built upon a re-sampled dataset generated by a multi-label sampling method with a different random seed. Existing and proposed multi-label sampling methods involve many random operations [6, 7], so employing different random seeds guarantees the diversity of the training sets of the models in the ensemble. The bipartition threshold of each label is then decided by maximizing the F-measure on the training set, as COCOA and ECCRU3 do. Given a test example, the predicted relevance scores are calculated as the average of the relevance degrees output by the $M$ models; the labels whose relevance degree is larger than the corresponding bipartition threshold are predicted as "1", and the rest as "0".
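The aggregation and thresholding steps of such an ensemble can be sketched as below. This is an illustration under stated assumptions, not the Mulan implementation: the function names are hypothetical, candidate thresholds are taken as midpoints between distinct training scores (the text does not specify the search grid), and the scores are invented toy values standing in for the outputs of already trained models.

```python
def f_measure(y_true, y_pred):
    """F-measure of one label; defined as 0 when there are no positives at all (assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, y_true):
    """Threshold maximising the F-measure of one label on the training scores."""
    distinct = sorted(set(scores))
    candidates = [distinct[0] - 1.0] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: f_measure(y_true, [int(s > t) for s in scores]))


def average_scores(per_model):
    """per_model[m][i][j] -> mean over the models of the score of label j for example i."""
    n_models = len(per_model)
    n, q = len(per_model[0]), len(per_model[0][0])
    return [[sum(model[i][j] for model in per_model) / n_models for j in range(q)]
            for i in range(n)]


def predict(per_model, thresholds):
    """Label j is predicted 1 when the averaged score exceeds its bipartition threshold."""
    return [[int(s > t) for s, t in zip(row, thresholds)] for row in average_scores(per_model)]


# Toy usage: threshold for one label, then prediction with two models and two labels.
threshold = best_threshold([0.9, 0.6, 0.4, 0.2], [1, 1, 0, 0])
labels = predict([[[0.8, 0.1]], [[0.4, 0.3]]], [0.5, 0.5])
print(threshold, labels)
```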
3.4 Complexity Analysis
The cost of MLSOL comprises searching the $k$NN of the input instances, computing $C$, $w$ and $T$, and creating the synthetic instances, where the last term depends on the number of generated examples. Of these, the $k$NN search is the most time-consuming part.
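Let $\Theta_t$ and $\Theta_p$ denote the complexity of training and prediction of the multi-label learning method respectively, and $\Theta_s$ the complexity of a multi-label sampling approach. With an ensemble of $M$ models, each trained on its own re-sampled dataset, the complexity of EMLS is $O(M \Theta_p)$ for prediction and $O(M(\Theta_s + \Theta_t))$ for training.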
4 Empirical Analysis
4.1 Setup
Table 1 shows detailed information for the 13 benchmark multi-label datasets used in this study, obtained from Mulan's repository (http://mulan.sourceforge.net/datasets-mlc.html). In textual datasets with more than 1000 features, we applied a simple dimensionality reduction approach that retains the top 10% (bibtex, enron, medical) or top 1% (rcv1subset1, rcv1subset2, yahoo-Arts1, yahoo-Business1) of the features ordered by number of non-zero values (i.e. frequency of appearance). In addition, we removed labels containing only one minority class instance, because when splitting the dataset into training and test sets, the training set may contain only majority class instances for those extremely imbalanced labels.
Four multi-label sampling methods are used for comparison, namely the state-of-the-art MLSMOTE and RHwRSMT, which integrates REMEDIAL and MLSMOTE, as well as their ensemble versions, called EMLSMOTE and ERHwRSMT respectively. Furthermore, the base learning approach without any sampling, denoted as Default, is also used for comparison. For all sampling methods, the number of nearest neighbours is set to 5 and the Euclidean distance is used to measure the distance between examples. In MLSOL, the sampling ratio is set to 0.3. In RHwRSMT, the threshold for decoupling instance is set to . For MLSMOTE and RHwRSMT, the label generation strategy is . The ensemble size is set to 5 for all ensemble methods. In addition, six multi-label learning methods are employed as base learners, comprising four standard multi-label learning methods (BR, MLkNN, CLR, RAkEL), as well as two state-of-the-art methods addressing the class imbalance problem (COCOA and ECCRU3).
Three widely used imbalance-aware evaluation metrics are employed to measure the performance of the methods, namely macro-averaged F-measure, macro-averaged AUC-ROC (area under the receiver operating characteristic curve) and macro-averaged AUCPR (area under the precision-recall curve). For simplicity, we omit "macro-averaged" in further references to these metrics in the rest of this paper.
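As a reminder of what macro-averaging means for the first of these metrics, the following sketch computes the F-measure separately for each label and then takes the unweighted mean, so that rare labels count as much as frequent ones. The name `macro_f_measure` is hypothetical, the convention of scoring a label with no positives and no positive predictions as 0 is an assumption, and the toy matrices are invented.

```python
def macro_f_measure(Y_true, Y_pred):
    """Unweighted mean over labels of the per-label F-measure."""
    q = len(Y_true[0])
    total = 0.0
    for j in range(q):
        tp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0  # assumed convention for empty labels
    return total / q


# Toy example: four examples, two labels.
Y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
Y_pred = [[1, 0], [0, 0], [0, 1], [0, 1]]
score = macro_f_measure(Y_true, Y_pred)
print(round(score, 4))
```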
The experiments were conducted on a machine with 4 × 10-core CPUs running at 2.27 GHz. We apply -fold cross validation with multi-label stratification to each dataset and the average results are reported. The implementation of our approach and the scripts of our experiments are publicly available at Mulan's GitHub repository (https://github.com/tsoumakas/mulan/tree/master/mulan). The default parameters are used for the base learners.
4.2 Results and Analysis
Detailed experimental results are listed in the supplementary material of this paper. The statistical significance of the differences among the methods in our empirical study is examined by employing the Friedman test, followed by the Wilcoxon signed rank test with Bergmann-Hommel's correction at the 5% level, following literature guidelines [1, 14]. Table 2 shows the average rank of each method as well as its significant wins/losses versus each of the other methods, for each of the three evaluation metrics and each of the six base multi-label methods. The best results are highlighted in bold typeface.
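The average ranks reported in such a comparison are obtained by ranking the methods on every dataset and averaging the ranks per method. A minimal sketch is given below; `average_ranks` is a hypothetical name, ties are assumed to share the mean of the tied ranks, higher scores are assumed to be better, and the score table is made up for illustration and is not taken from the experiments.

```python
def average_ranks(scores):
    """scores[d][m]: score of method m on dataset d (higher is better); rank 1 is best."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        pos = 0
        while pos < n_methods:
            end = pos
            # Extend the block while the next method is tied with the current one.
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2 + 1  # tied methods share the mean rank
            for m in order[pos:end + 1]:
                totals[m] += mean_rank
            pos = end + 1
    return [t / len(scores) for t in totals]


# Made-up scores of three methods on three datasets.
toy_scores = [
    [0.7, 0.6, 0.5],
    [0.8, 0.8, 0.4],
    [0.5, 0.6, 0.7],
]
ranks = average_ranks(toy_scores)
print([round(r, 3) for r in ranks])
```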
We start our discussion by looking at the single-model versions of the three resampling methods. We first notice that RHwRSMT achieves the worst results and is even worse than no resampling at all (Default). This is mainly due to the additional confusion introduced by REMEDIAL, i.e. the several pairs of instances with the same features and disparate labels. MLSOL and MLSMOTE exhibit similar total wins and losses, especially in AUCPR, which is considered the most appropriate measure in the context of class imbalance. Moreover, the wins and losses of MLSOL and MLSMOTE are not that different from those of no resampling at all. This is particularly true when using a multi-label learning method that already handles class imbalance, such as COCOA and ECCRU3, which is not surprising.
We then notice that the ensemble versions of the three multi-label resampling methods outperform their corresponding single-model versions in all cases. This verifies the known effectiveness of resampling approaches in reducing the error, in particular by reducing the variance component of the expected error. Ensembling enables MLSMOTE and MLSOL to achieve much better results than no resampling, and it even helps RHwRSMT do slightly better than no resampling.
Focusing on the ensemble versions of the three resampling methods we notice that EMLSOL achieves the best average rank and the most significant wins without suffering any significant loss in all 18 different pairs of the 6 base multi-label methods and the 3 evaluation measures, with the exception that MLSMOTE with MLkNN as base learner achieves best average rank in terms of F-measure. EMLSMOTE comes second in total wins and losses in most cases, while ERHwRSMT does much worse than EMLSMOTE.
An interesting observation here is that while MLSOL and MLSMOTE have similar performance, MLSOL benefitted much more than MLSMOTE from the ensemble approach. This happens because randomization plays a more important role in MLSOL than in MLSMOTE. MLSOL uses weighted sampling for seed instance selection, while MLSMOTE takes all minority samples into account instead. This allows EMLSOL to create more diverse models, which achieve greater error correction when aggregated.
5 Conclusion
We proposed MLSOL, a new synthetic oversampling approach for tackling the class-imbalance problem in multi-label data. Based on the local distribution of labels, MLSOL selects more important and informative seed instances and generates more diverse and well-labeled synthetic instances. In addition, we employed MLSOL within a simple ensemble framework, which exploits the random aspects of our approach during sampling training examples to use as seeds and during the generation of synthetic training examples.
We experimentally compared the proposed approach against two state-of-the-art resampling methods on 13 benchmark multi-label datasets. The results offer strong evidence of the superiority of MLSOL, especially of its ensemble version, in three different imbalance-aware evaluation measures using six different underlying base multi-label methods.
Bin Liu is supported by the China Scholarship Council (CSC) under the Grant CSC No. 201708500095.
References
[1] Benavoli, A., Corani, G., Mangili, F.: Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research 17, 1–10 (2016)
[2] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004). https://doi.org/10.1016/j.patcog.2004.03.009
[3] Cao, P., Liu, X., Zhao, D., Zaiane, O.: Cost Sensitive Ranking Support Vector Machine for Multi-label Data Learning. In: Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). pp. 244–255. Springer International Publishing, Cham (2017)
[4] Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: A First Approach to Deal with Imbalance in Multi-label Datasets. In: Proceedings of the 8th International Conference on Hybrid Artificial Intelligent Systems (HAIS 2013). vol. 8073 LNAI, pp. 150–160 (2013). https://doi.org/10.1007/978-3-642-40846-5_16
[5] Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLeNN: A first approach to heuristic multilabel undersampling. In: Intelligent Data Engineering and Automated Learning – IDEAL 2014. vol. 8669 LNCS, pp. 1–9. Springer International Publishing (2014). https://doi.org/10.1007/978-3-319-10840-7_1
[6] Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems 89, 385–397 (2015). https://doi.org/10.1016/j.knosys.2015.07.019
[7] Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 163, 3–16 (2015). https://doi.org/10.1016/j.neucom.2014.08.091
[8] Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing 326-327, 39–53 (2019). https://doi.org/10.1016/j.neucom.2016.08.158
[9] Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: REMEDIAL-HwR: Tackling multilabel imbalance through label decoupling and data resampling hybridization. Neurocomputing 326-327, 110–122 (2019). https://doi.org/10.1016/j.neucom.2017.01.118
[10] Chen, K., Lu, B.L., Kwok, J.T.: Efficient Classification of Multi-label and Imbalanced Data using Min-Max Modular Classifiers. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings. pp. 1770–1775. IEEE (2006). https://doi.org/10.1109/IJCNN.2006.246893
[11] Daniels, Z.A., Metaxas, D.N.: Addressing Imbalance in Multi-Label Classification Using Structured Hellinger Forests. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. pp. 1826–1832 (2017)
[12] Dendamrongvit, S., Kubat, M.: Undersampling Approach for Imbalanced Training Sets and Induction from Multi-label Text-Categorization Domains. In: Proceedings of the 13th Pacific-Asia International Conference on Knowledge Discovery and Data Mining (PAKDD'09). pp. 40–52 (2009). https://doi.org/10.1007/978-3-642-14640-4_4
[13] Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via calibrated label ranking. Machine Learning 73(2), 133–153 (2008). https://doi.org/10.1007/s10994-008-5064-8
[14] Garcia, S., Herrera, F.: An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
[15] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer (2016)
[16] Li, C., Shi, G.: Improvement of learning algorithm for the multi-instance multi-label RBF neural networks trained with imbalanced samples. Journal of Information Science and Engineering 29(4), 765–776 (2013)
[17] Li, L., Wang, H.: Towards Label Imbalance in Multi-label Classification with Many Labels. arXiv preprint arXiv:1604.01304 (2016)
[18] Liu, B., Tsoumakas, G.: Making Classifier Chains Resilient to Class Imbalance. In: 10th Asian Conference on Machine Learning (ACML 2018). pp. 280–295. Beijing (2018)
[19] Napierala, K., Stefanowski, J.: Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems 46(3), 563–597 (2016)
[20] Sáez, J.A., Krawczyk, B., Woźniak, M.: Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets. Pattern Recognition 57, 164–178 (2016). https://doi.org/10.1016/j.patcog.2016.03.012
[21] Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE (2015). https://doi.org/10.1371/journal.pone.0118432
[22] Sechidis, K., Tsoumakas, G., Vlahavas, I.: On the Stratification of Multi-label Data. In: Proc. 2011 European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 145–158. Springer Berlin Heidelberg, Athens, Greece (2011)
[23] Sozykin, K., Khan, A.M., Protasov, S., Hussain, R.: Multi-label Class-imbalanced Action Recognition in Hockey Videos via 3D Convolutional Neural Networks. In: 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). pp. 146–151 (2018)
[24] Sun, K.W., Lee, C.H.: Addressing class-imbalance in multi-label learning via two-stage multi-label hypernetwork. Neurocomputing 266, 375–389 (2017). https://doi.org/10.1016/j.neucom.2017.05.049
[25] Tahir, M.A., Kittler, J., Yan, F.: Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognition 45(10), 3738–3750 (2012). https://doi.org/10.1016/j.patcog.2012.03.014
[26] Tepvorachai, G., Papachristou, C.: Multi-label imbalanced data enrichment process in neural net classifier training. In: Proceedings of the International Joint Conference on Neural Networks. pp. 1301–1307 (2008). https://doi.org/10.1109/IJCNN.2008.4633966
[27] Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23(7), 1079–1089 (2011)
[28] Wan, S., Duan, Y., Zou, Q.: HPSLPred: An Ensemble Multi-Label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source. Proteomics 17(17-18), 1700262 (2017). https://doi.org/10.1002/pmic.201700262
[29] Wu, B., Lyu, S., Ghanem, B.: Constrained Submodular Minimization for Missing Labels and Class Imbalance in Multi-label Learning. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. pp. 2229–2236. AAAI'16, AAAI Press (2016)
[30] Zeng, W., Chen, X., Cheng, H.: Pseudo labels for imbalanced multi-label learning. In: 2014 International Conference on Data Science and Advanced Analytics (DSAA). pp. 25–31 (2014). https://doi.org/10.1109/DSAA.2014.7058047
[31] Zhang, M.L., Li, Y.K., Liu, X.Y.: Towards class-imbalance aware multi-label learning. In: Proceedings of the 24th International Conference on Artificial Intelligence. pp. 4041–4047 (2015)
[32] Zhang, M.L., Zhou, Z.H.: ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7), 2038–2048 (2007)