
Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-label Data

09/25/2021
by   Payel Sadhukhan, et al.

There is often a mixture of very frequent labels and very infrequent labels in multi-label datasets. This variation in label frequency, a type of class imbalance, creates a significant challenge for building efficient multi-label classification algorithms. In this paper, we tackle this problem by proposing a minority class oversampling scheme, UCLSO, which integrates Unsupervised Clustering and Label-Specific data Oversampling. Clustering is performed to find the key distinct and locally connected regions of a multi-label dataset (irrespective of the label information). Next, for each label, we explore the distributions of minority points in the cluster sets. Only the minority points within a cluster are used to generate the synthetic minority points used for oversampling. Even though the cluster set is the same across all labels, the distributions of the synthetic minority points will vary across the labels. The training dataset is augmented with the set of label-specific synthetic minority points, and classifiers are trained to predict the relevance of each label independently. Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performed very well compared to the other competing algorithms.


1 Introduction

In a multi-label dataset a single datapoint is associated with more than one relevant label. This type of data is obtained naturally from real-world domains like text [14, 10], bioinformatics [1], video [25], images [2, 19, 16] and music [15]. We denote a multi-label dataset as $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$. Here, $\mathbf{x}_i \in \mathbb{R}^d$ is the input datapoint in $d$ dimensions, and $\mathbf{y}_i = (y_{i1}, \ldots, y_{iq}) \in \{0,1\}^q$ is the corresponding label assignment for $\mathbf{x}_i$ among the $q$ possible labels. $y_{ij}$ indicates whether the $j$-th label is applicable (or relevant) for the datapoint: $y_{ij} = 1$ denotes that the label is relevant to $\mathbf{x}_i$, and $y_{ij} = 0$ denotes that the label is not applicable, or is irrelevant, to $\mathbf{x}_i$. The target of multi-label learning is to build a model that can correctly predict all of the relevant labels for a datapoint $\mathbf{x}$.

Multi-label datasets often exhibit an imbalance in the representation of the different labels: some labels are relevant to a very large number of datapoints while other labels are relevant to only a few. If we treat the prediction of each label's relevance as a separate binary classification task, this is an instance of the class imbalance problem common in binary classification. The labels in a multi-label dataset often have widely varying degrees of imbalance, and this is a challenging aspect of building multi-label classification models.

Addressing label imbalance to improve multi-label classification is an active field of research, and several methods have been proposed to address this problem [38, 7, 17]. There is, however, room for significant improvement. Label-specific oversampling is one way to address the varying label imbalances in multi-label datasets. In this light, we propose UCLSO, which integrates Unsupervised Clustering and Label-Specific data Oversampling. The essence of the UCLSO approach is to integrate information about the proximity of points and their label-specific class memberships to tackle class imbalance in multi-label datasets. In this work, i) synthetic minority points are generated from local data clusters which are obtained from unsupervised clustering of the feature space, and ii) the cardinality of the label-specific oversampled minority set obtained from a local cluster depends on the cluster's share of minority datapoints for that label. In effect, the method oversamples the minority class by focusing on the per-cluster local distributions of the minority datapoints, maintaining the local minority distribution ratio. The key highlights of our work are:

  • We propose UCLSO, a new minority class oversampling method for multi-label datasets, which generates synthetic minority datapoints specifically in the minority regions of the input space.

  • UCLSO preserves the intrinsic class distributions of the local clusters in order to avoid generating synthetic minority datapoints in the majority region, or as outliers in the input space.

  • UCLSO ensures that the number of synthetic minority points added to a region is proportionate to the original minority density in that region.

  • In UCLSO, datapoints belonging to individual clusters (consistent across the labels) have distinct label relevance which vary across the different labels. We integrate this label-specific information along with the information from the previous step to obtain sets of label-specific synthetic minority points.

  • An empirical study involving 12 well-known real-world multi-label datasets and nine competing methods indicates that UCLSO shows promising results and is able to perform better, in general, than the competing methods.

The remainder of the paper is structured as follows. Section 2 discusses the relevant existing work in the multi-label domain. In Section 3 we first describe the motivations of our approach and then present the steps of the proposed UCLSO algorithm. The experiment design is described in Section 4 and the results of the experiments are discussed in Section 5. Finally, Section 6 concludes the paper and discusses some directions for future work.

2 Related Work

Existing multi-label classification methods are principally classified into two types: i) problem transformation methods that modify the multi-label dataset in different ways such that it can be used with existing multi-class classification algorithms [32, 8, 27, 41], and ii) algorithm adaptation approaches that modify existing machine learning algorithms to directly handle multi-label datasets [40, 18, 36, 41].

Multi-label algorithms can also be categorised based on whether and how they take label associations into account: i) first-order, ii) second-order, or iii) higher-order approaches, depending on the number of labels that are considered together to train the models. First-order approaches do not consider any label association and learn a classifier for each label independently of all other labels [40, 31, 37]. In second-order methods, pair-wise label associations are explored to achieve enhanced learning of multi-label data [21, 8]. Higher-order approaches consider associations among more than two labels [2]. A number of diverse techniques have facilitated higher-order label associations through interesting schemes including classifier chains [5, 26], RAkEL [32], random graph ensembles [29], DMLkNN [35], IBLR-ML+ [6], and Stacked-MLkNN [20].

In recent years, data transformation has been a popular choice for handling multi-label datasets. The two principal ways of data transformation in the multi-label domain are: i) feature extraction or selection, and ii) data oversampling or undersampling. One of the earliest applications of feature extraction in multi-label learning was LIFT [39], which brought significant performance improvements. Most feature selection or extraction methods select a label-specific feature set for each label to improve the discerning capability of the label-specific classifiers. Subsequently, a number of different feature selection and extraction approaches have been proposed [13, 34, 33, 11]. Recently, the class imbalance problem in multi-label learning has received more interest from researchers. One common approach to handling imbalance is to balance the cardinalities of the relevant and irrelevant classes for each label. One way of achieving this is by removing points from the majority class of each label, for example using random undersampling [30] or Tomek-link based undersampling [22]. Another way is to add synthetic minority points to the minority class [17, 28, 3]. Although these approaches have been shown to be effective, there is still a lot of room for improvement.

3 Unsupervised Clustering and Label Specific data Oversampling (UCLSO)

In this section we discuss the motivation and then present the proposed approach: Unsupervised Clustering and Label-Specific data Oversampling (UCLSO).

3.1 Motivation

Figure 1: A toy dataset illustrating the problems with oversampling, and how UCLSO addresses them

Let us consider the two-dimensional toy dataset with two labels (1 and 2) shown in Figure 1(a). The imbalance ratios of labels 1 and 2 in this dataset are 24.7 and 14.4 respectively. Figure 1(b) shows 5 clusters in this dataset found using k-means. In Figures 1(c) and (d), we mark the points with respect to their label-specific class memberships: red and blue indicate majority and minority class points respectively. Data pre-processing via minority class oversampling is a popular choice to tackle the issue of imbalance in imbalanced datasets [12]. In a multi-label dataset, due to spatial and quantitative variation of class-memberships across the labels, we need label-specific oversampling. Figures 1(e) and (f) show label-specific SMOTE-based [4] oversampling (synthetic points in yellow) for label 1 and label 2 respectively. It can be seen that SMOTE generates synthetic minority points in majority regions on a number of occasions for both labels 1 and 2 (highlighted by black circles in Figures 1(e) and (f)).

In order to achieve effective learning, we need to prevent encroachment on the majority space during oversampling. We tackle this issue by clustering the feature space using k-means. Clustering the dataset gives us k localized subspaces, and oversampling only within each cluster can prevent encroachment on the majority class.

This work is motivated by an effort to balance the cardinalities of the minority and majority classes of the labels without encroaching on the majority class spaces, as well as an effort to preserve the underlying distribution of the datapoints.

As indicated in Figures 1(e) and (f), a generic oversampling for all labels will not be fruitful as different labels have different quantitative and spatial distributions of the minority points. There are two aspects we need to keep in mind. i) Where should we perform the oversampling? To answer this, we cluster the feature space in an unsupervised manner (only the feature attributes of the points are taken into account). ii) If there is more than one subspace in which to perform oversampling, how much should we oversample in each subspace? We look into the label-specific distribution of the minority points in the clusters to decide this. The degree of label-specific oversampling in a cluster should be proportional to its original minority class distribution for that label. Figures 1(g) and (h) show the oversampling on labels 1 and 2 through the proposed method UCLSO. The degree of encroachment in the majority class region is much less for UCLSO compared to SMOTE. The next section will present the proposed UCLSO method in detail.
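To make these two steps concrete, the short sketch below (a minimal illustration under assumed names and toy data, using scikit-learn's KMeans; it is not the authors' implementation) clusters a feature matrix and then counts, for each label, how the minority points fall across the clusters. These per-cluster counts are exactly the quantities that later decide how much to oversample in each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy multi-label data: X is the feature matrix, Y holds two binary labels.
# (Both are illustrative; any imbalanced multi-label dataset would do.)
X = rng.normal(size=(500, 2))
Y = np.zeros((500, 2), dtype=int)
Y[rng.choice(500, 20, replace=False), 0] = 1   # label 1: rare -> class "1" is the minority
Y[rng.choice(500, 35, replace=False), 1] = 1   # label 2: rare -> class "1" is the minority

# Step 1 ("where"): unsupervised clustering of the feature space only.
k = 5
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Step 2 ("how much"): per-label distribution of minority points over the clusters.
for label in range(Y.shape[1]):
    minority = Y[:, label] == 1
    counts = np.bincount(clusters[minority], minlength=k)
    shares = counts / max(counts.sum(), 1)     # cluster-wise share of the minority points
    print(f"label {label}: minority per cluster {counts.tolist()}, shares {np.round(shares, 2).tolist()}")
```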

3.2 Approach

1:procedure UCLSO(D, k)  ▷ D: Training dataset, k: number of k-means clusters
2:    C_1, ..., C_k = k_means(D, k)  ▷ Cluster input space
3:    for each label l in {1, ..., q} do
4:        S_l = ∅
5:        for each cluster C_j, j = 1, ..., k do
6:            n_jl = number of original minority points for label l in C_j
7:            g_jl = number of synthetic points for C_j, computed with Eq. (2)  ▷ Find the synthetic minority instance shares of each cluster
8:            Min_jl = set of original minority points for label l in C_j
9:            for t = 1, ..., g_jl do  ▷ Generate synthetic points
10:                x_a = a point selected randomly from Min_jl
11:                x_b = x_a's randomly selected nearest neighbour in Min_jl
12:                α ∈ [0, 1] selected randomly
13:                s = x_a + α(x_b − x_a)  ▷ Synthetic pt. for label l from C_j, Eq. (1)
14:                S_l = S_l ∪ {s}
15:            end for
16:        end for
17:        D_l = D ∪ S_l  ▷ Augment original data
18:    end for
19:    return D_1, ..., D_q  ▷ Per label augmented synthetic datasets
20:end procedure
Algorithm 1 UCLSO

Following the motivation in the previous section, in this work we propose a first-order oversampling method for multi-label classification datasets, UCLSO, that handles class-imbalance for each label independently.

The main idea of this minority oversampling method is to keep the synthetic minority points concentrated within the minority regions of the input space. Synthetic minority points are introduced in the minority regions in a guided fashion, while avoiding their introduction in non-minority regions. This should improve the representation of the minority regions for each label, which helps the classifier training to define a better decision boundary for that imbalanced label.

A common approach to generating synthetic minority datapoints is to select two points within a neighbourhood and then generate a synthetic point by interpolation at a random location between them. For a label with a high imbalance ratio and sparsely distributed minority points, the neighbours from the same class for this label can lie far apart. Consequently, the neighbourhood can encompass a large volume of feature space. Therefore, oversampling in such a manner may lead to the generation of synthetic minority points which end up in the majority region of the input space.

To tackle this issue, we partition the original points into $k$ clusters $C_1, \ldots, C_k$, based only on the input-space inter-point Euclidean distances. We use the k-means algorithm to perform this clustering. After clustering the datapoints, for each cluster $C_j$ we randomly select $\mathbf{x}_a$, a minority point (for the label under consideration) from the cluster, and $\mathbf{x}_b$, a randomly chosen nearest neighbour of $\mathbf{x}_a$ among the minority points of $C_j$. We compute the synthetic minority point by interpolation at a random location on the direction vector connecting $\mathbf{x}_a$ and $\mathbf{x}_b$. The synthetic point is computed as follows

$$\mathbf{s}_{jl}^{(t)} = \mathbf{x}_a + \alpha\,(\mathbf{x}_b - \mathbf{x}_a) \qquad (1)$$

where $\mathbf{s}_{jl}^{(t)}$ is the $t$-th synthetic datapoint generated in cluster $C_j$ for the label $l$, and $\alpha \in [0, 1]$ is a random number sampled from the uniform distribution, which decides the location of the synthetic point between $\mathbf{x}_a$ and $\mathbf{x}_b$.
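As a small illustration of Eq. (1), the helper below (hypothetical function and variable names, not taken from the paper's code) interpolates between a minority point and one of its minority nearest neighbours from the same cluster.

```python
import numpy as np

def interpolate_synthetic(x_a: np.ndarray, x_b: np.ndarray,
                          rng: np.random.Generator) -> np.ndarray:
    """Eq. (1): a synthetic point placed uniformly at random between x_a and x_b."""
    alpha = rng.uniform(0.0, 1.0)
    return x_a + alpha * (x_b - x_a)

# Example: a minority point and its nearest minority neighbour from the same cluster.
rng = np.random.default_rng(42)
x_a = np.array([0.2, 1.0])
x_b = np.array([0.5, 1.4])
print(interpolate_synthetic(x_a, x_b, rng))
```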

The number of synthetic minority points generated from a cluster is directly proportional to the share of original minority points in that cluster. Therefore, more synthetic minority points will be introduced in the clusters with more original minority points. This is because we are more confident about adding minority points to a region which originally had comparatively more original minority points. The number of synthetic minority points to be added to cluster $C_j$ for label $l$ is computed as follows

$$g_{jl} = \frac{n_{jl}}{|Min_l|}\,\big(|Maj_l| - |Min_l|\big) \qquad (2)$$

where $Min_l$ and $Maj_l$ are the sets of minority and majority datapoints for the label $l$ respectively, and $n_{jl}$ is the number of original minority datapoints for label $l$ in cluster $C_j$. In this way, the clusters which have more original minority points will be populated with more synthetic minority points.
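The cluster-wise allocation of Eq. (2) can be sketched as follows; the function name and the rounding of the proportional shares to integer counts are assumptions made only for illustration.

```python
import numpy as np

def synthetic_counts_per_cluster(n_jl: np.ndarray, n_min: int, n_maj: int) -> np.ndarray:
    """Eq. (2): split the (|Maj_l| - |Min_l|) points to generate across clusters
    in proportion to each cluster's share n_jl / |Min_l| of the original minority points."""
    total_to_add = max(n_maj - n_min, 0)
    shares = n_jl / max(n_min, 1)
    return np.rint(shares * total_to_add).astype(int)

# Example for one label: 30 minority vs 470 majority points, minority spread over 5 clusters.
n_jl = np.array([12, 9, 6, 3, 0])            # original minority count per cluster
print(synthetic_counts_per_cluster(n_jl, n_min=30, n_maj=470))   # [176 132  88  44   0]
```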

Following the above steps, once we obtain the synthetic minority set $S_l$ for the label $l$, the original training dataset $D$ is appended with $S_l$ to get an augmented dataset $D_l$ for each label $l$. This augmented training set, $D_l$, is used to train a binary classifier model for the corresponding label $l$. The above process is summarised in Algorithm 1.
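Putting the pieces together, the sketch below mirrors Algorithm 1 under several assumptions (the nearest-neighbour search, the choice of scikit-learn's KMeans and LinearSVC, and all variable names are illustrative rather than the authors' exact setup): for every label it generates the synthetic set $S_l$ cluster by cluster and trains one binary classifier on the augmented data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def uclso_fit(X, Y, k=5, n_neighbors=5, seed=0):
    """X: (n, d) features, Y: (n, q) binary labels; assumes every label has some minority points."""
    rng = np.random.default_rng(seed)
    clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    models = []
    for l in range(Y.shape[1]):
        y = Y[:, l]
        minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        synthetic = []
        for j in range(k):
            min_j = minority[clusters[minority] == j]       # minority points of label l in C_j
            if len(min_j) < 2:
                continue                                    # nothing to interpolate between
            # Eq. (2): cluster's share of the points needed to balance the two classes.
            g_jl = int(round(len(min_j) / len(minority) * max(len(majority) - len(minority), 0)))
            nn = NearestNeighbors(n_neighbors=min(n_neighbors + 1, len(min_j))).fit(X[min_j])
            for _ in range(g_jl):
                a = rng.integers(len(min_j))
                neigh = nn.kneighbors(X[min_j[a]].reshape(1, -1), return_distance=False)[0][1:]
                b = rng.choice(neigh)                       # random nearest minority neighbour
                alpha = rng.uniform()                       # Eq. (1)
                synthetic.append(X[min_j[a]] + alpha * (X[min_j[b]] - X[min_j[a]]))
        X_l = np.vstack([X, np.asarray(synthetic)]) if synthetic else X
        y_l = np.concatenate([y, np.ones(len(synthetic), dtype=int)])
        models.append(LinearSVC().fit(X_l, y_l))            # one binary classifier per label
    return models
```

At prediction time, each of the returned models scores its own label independently, in line with the first-order nature of the method.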

4 Experiments

We performed a set of experiments to evaluate the effectiveness of the proposed UCLSO method. This section describes the datasets, algorithms, experimental setup, and evaluation processes used for the experiments.

Dataset  Instances  Inputs  Labels  Type  Cardinality  Density  Distinct-Labelsets  Prop.-Distinct-Labelsets  IR-min  IR-max  IR-avg
yeast 2417 103 13 numeric 4.233 0.325 189 0.078 1.328 12.500 2.778
emotions 593 72 6 numeric 1.869 0.311 27 0.046 1.247 3.003 2.146
medical 978 144 14 numeric 1.075 0.077 42 0.043 2.674 43.478 11.236
cal500 502 68 124 numeric 25.058 0.202 502 1.000 1.040 24.390 3.846
rcv1-s1 6000 472 42 numeric 2.458 0.059 574 0.096 3.342 49.000 24.966
rcv1-s2 6000 472 39 numeric 2.170 0.056 489 0.082 3.216 47.780 26.370
rcv1-s3 6000 472 39 numeric 2.150 0.055 488 0.081 3.205 49.000 26.647
enron 1702 50 24 nominal 3.113 0.130 547 0.321 1.000 43.478 5.348
bibtex 7395 183 26 nominal 0.934 0.036 377 0.051 6.097 47.974 32.245
llog 1460 100 18 nominal 0.851 0.047 109 0.075 7.538 46.097 24.981
corel5k 5000 499 44 nominal 2.241 0.050 1037 0.207 3.460 50.000 17.857
slashdot 3782 53 14 nominal 1.134 0.081 118 0.031 5.464 35.714 10.989
Table 1: Description of datasets

Several well-known multi-label datasets were selected, which are listed in Table 1 (available from the Mulan repository: http://mulan.sourceforge.net/datasets-mlc.html). Here, Instances, Inputs and Labels indicate the total number of datapoints, the number of predictor variables, and the number of potential labels respectively in each dataset. Type indicates whether the input space is numeric or nominal. Distinct Labelsets indicates the number of unique combinations of labels, and the proportion of distinct labelsets is this count divided by the number of instances. Cardinality is the average number of labels per datapoint, and Density is obtained by dividing Cardinality by the number of Labels.
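For reference, the summary statistics reported in Table 1 can be computed directly from a binary label matrix; the sketch below is illustrative only and assumes the common definition of the per-label imbalance ratio as the majority-to-minority count ratio.

```python
import numpy as np

def multilabel_stats(Y: np.ndarray) -> dict:
    """Y: (n_instances, n_labels) binary label matrix."""
    n, q = Y.shape
    cardinality = Y.sum(axis=1).mean()                    # average labels per datapoint
    density = cardinality / q                             # cardinality normalised by label count
    distinct = len({tuple(row) for row in Y})             # unique label combinations
    pos = Y.sum(axis=0)
    ir = np.maximum(pos, n - pos) / np.maximum(np.minimum(pos, n - pos), 1)  # per-label imbalance ratio
    return {"cardinality": cardinality, "density": density,
            "distinct_labelsets": distinct, "prop_distinct": distinct / n,
            "ir_min": float(ir.min()), "ir_max": float(ir.max()), "ir_avg": float(ir.mean())}
```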

The datasets are modified as recommended in [38, 12]. Labels with a very high degree of imbalance (an imbalance ratio of 50 or greater) or with too few positive samples (a threshold of 20 in this case) are removed. For the text datasets (medical, enron, rcv1, bibtex), only the input-space features with the highest document frequencies are retained.

To compare the performance of the different approaches, we selected the label-based macro-averaged F-Score and the label-based macro-averaged AUC, as recommended in [38]. For the experiments evaluating the proposed algorithm we performed a cross-validation experiment. The experimental setup and environment were kept identical to Zhang et al. [38]. For clustering, the number of clusters for the k-means step of UCLSO was set to a fixed value. In the classification phase, a set of linear SVM classifiers is used, one for each label.
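Both evaluation measures are computed per label and then macro-averaged; the snippet below shows a standard way of obtaining them with scikit-learn (presented as an assumption about the usual computation, not as the exact evaluation code used here).

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def macro_f1(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    """Label-based macro-averaged F-Score: F1 per label, then averaged."""
    return float(np.mean([f1_score(Y_true[:, l], Y_pred[:, l], zero_division=0)
                          for l in range(Y_true.shape[1])]))

def macro_auc(Y_true: np.ndarray, Y_scores: np.ndarray) -> float:
    """Label-based macro-averaged AUC: ROC AUC per label on real-valued scores."""
    return float(np.mean([roc_auc_score(Y_true[:, l], Y_scores[:, l])
                          for l in range(Y_true.shape[1])]))
```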

We compare the performance of UCLSO against several state-of-the-art multi-label classification algorithms: COCOA [38], THRSEL [24], IRUS [30], SMOTE-EN [4], RML [23], as well as binary relevance (BR), calibrated label ranking (CLR) [8], ensemble classifier chains (ECC) [27] and RAkEL [32]. We base our experiments on the experiment presented in Zhang et al. [38], and extend the results of that paper by adding the performance of UCLSO.

5 Results

Tables 2 and 3 show the label-based macro-averaged F-Score and label-based macro-averaged AUC results respectively (note that Table 3 does not include results for RML [23] as its implementation does not provide prediction scores), along with the relative ranks in brackets (lower ranks are better) of the algorithms compared for each dataset. The last row of both tables indicates the average rank of each algorithm. The best values are highlighted in boldface.

Also, to further analyse the differences between the algorithms, we performed a non-parametric statistical test for multiple classifier comparison. Following [9], we performed a Friedman test with Finner p-value adjustments, and the critical difference plots from the test results are shown in Figure 2 (the full result tables are in the supplementary material: https://github.com/phoxis/uclso/blob/main/UCLSO_ICONIP2021_Supplementary_Material.pdf).
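A rough sketch of such a testing procedure is given below; it is not the code behind Figure 2, and it assumes the average-rank z-statistic for comparisons against a control and the Finner step-down adjustment as described in [9].

```python
import numpy as np
from scipy.stats import friedmanchisquare, norm, rankdata

def friedman_with_finner(scores: np.ndarray, names: list, control: int = 0):
    """scores: (n_datasets, n_methods) table of a metric (higher is better)."""
    stat, p_omnibus = friedmanchisquare(*scores.T)        # omnibus test over all methods
    n, m = scores.shape
    avg_ranks = rankdata(-scores, axis=1).mean(axis=0)    # average rank per method (rank 1 = best)
    se = np.sqrt(m * (m + 1) / (6.0 * n))                 # std. error of rank differences
    others = [j for j in range(m) if j != control]
    p = np.array([2 * norm.sf(abs(avg_ranks[control] - avg_ranks[j]) / se) for j in others])
    order = np.argsort(p)                                 # Finner step-down adjustment
    adj = np.maximum.accumulate(1 - (1 - p[order]) ** (len(p) / np.arange(1, len(p) + 1)))
    adjusted = np.empty_like(p)
    adjusted[order] = np.minimum(adj, 1.0)
    return p_omnibus, dict(zip([names[j] for j in others], adjusted))
```

Methods whose adjusted p-value falls below the chosen significance level would be declared significantly different from the control (here, UCLSO).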

UCLSO COCOA THRSEL IRUS SMOTE-EN RML BR CLR ECC RAkEL
yeast 0.505 (1) 0.461 (3) 0.427 (5  ) 0.426 (6 ) 0.436 (4  ) 0.471 (2  ) 0.409 (9 ) 0.413 (8 ) 0.389 (10 ) 0.420 (7)
emotions 0.658 (2) 0.666 (1) 0.560 (9  ) 0.622 (5 ) 0.575 (8  ) 0.645 (3  ) 0.550 (10) 0.595 (7 ) 0.638 (4  ) 0.613 (6)
medical 0.783 (1) 0.759 (2) 0.733 (3.5) 0.537 (10) 0.700 (8  ) 0.707 (7  ) 0.718 (6 ) 0.724 (5 ) 0.733 (3.5) 0.672 (9)
cal500 0.273 (2) 0.210 (5) 0.252 (3  ) 0.277 (1 ) 0.235 (4  ) 0.209 (6  ) 0.169 (8 ) 0.081 (10) 0.092 (9  ) 0.193 (7)
rcv1-s1 0.443 (1) 0.364 (3) 0.292 (5  ) 0.252 (8 ) 0.313 (4  ) 0.387 (2  ) 0.285 (6 ) 0.227 (9 ) 0.192 (10 ) 0.272 (7)
rcv1-s2 0.432 (1) 0.342 (3) 0.275 (5  ) 0.234 (8 ) 0.305 (4  ) 0.363 (2  ) 0.272 (6 ) 0.226 (9 ) 0.173 (10 ) 0.263 (7)
rcv1-s3 0.480 (1) 0.339 (3) 0.275 (5  ) 0.225 (8 ) 0.302 (4  ) 0.371 (2  ) 0.271 (6 ) 0.211 (9 ) 0.163 (10 ) 0.257 (7)
enron 0.352 (1) 0.342 (2) 0.291 (5  ) 0.293 (4 ) 0.266 (8  ) 0.307 (3  ) 0.246 (9 ) 0.244 (10) 0.268 (6  ) 0.267 (7)
bibtex 0.442 (1) 0.318 (3) 0.303 (4  ) 0.253 (8 ) 0.283 (5  ) 0.326 (2  ) 0.263 (7 ) 0.265 (6 ) 0.212 (10 ) 0.252 (9)
llog 0.181 (1) 0.082 (6) 0.096 (3  ) 0.124 (2 ) 0.095 (4.5) 0.095 (4.5) 0.031 (7 ) 0.024 (8 ) 0.022 (10 ) 0.023 (9)
corel5k 0.209 (2) 0.196 (3) 0.146 (4  ) 0.105 (6 ) 0.125 (5  ) 0.215 (1  ) 0.089 (7 ) 0.049 (10) 0.054 (9  ) 0.084 (8)
slashdot 0.443 (1) 0.374 (2) 0.355 (4  ) 0.257 (10) 0.366 (3  ) 0.343 (5  ) 0.291 (8 ) 0.290 (9 ) 0.304 (6  ) 0.296 (7)
Avg. rank 1.25 3.00 4.62 6.33 5.12 3.29 7.42 8.33 8.12 7.5
Table 2: Each cell indicates the averaged label-based macro-averaged F-Score (best score in bold) along with the relative rank of the corresponding algorithm in brackets. The last row indicates the overall average ranks.
UCLSO COCOA THRSEL IRUS SMOTE-EN BR CLR ECC RAkEL
yeast 0.666 (3) 0.711 (1  ) 0.576 (8.5) 0.658 (4  ) 0.582 (7) 0.576 (8.5) 0.650 (5  ) 0.705 (2  ) 0.641 (6)
emotions 0.819 (3) 0.844 (2  ) 0.687 (8.5) 0.802 (4  ) 0.698 (7) 0.687 (8.5) 0.796 (6  ) 0.850 (1  ) 0.797 (5)
medical 0.967 (1) 0.964 (2  ) 0.869 (7.5) 0.955 (3.5) 0.873 (6) 0.869 (7.5) 0.955 (3.5) 0.952 (5  ) 0.856 (9)
cal500 0.550 (4) 0.558 (2  ) 0.509 (8.5) 0.545 (5  ) 0.512 (7) 0.509 (8.5) 0.561 (1  ) 0.557 (3  ) 0.528 (6)
rcv1-s1 0.919 (1) 0.889 (3  ) 0.643 (7.5) 0.882 (4  ) 0.626 (9) 0.643 (7.5) 0.891 (2  ) 0.881 (5  ) 0.728 (6)
rcv1-s2 0.912 (1) 0.882 (2.5) 0.640 (7.5) 0.880 (4  ) 0.622 (9) 0.640 (7.5) 0.882 (2.5) 0.874 (5  ) 0.721 (6)
rcv1-s3 0.956 (1) 0.880 (2  ) 0.633 (7.5) 0.872 (4.5) 0.628 (9) 0.633 (7.5) 0.877 (3  ) 0.872 (4.5) 0.718 (6)
enron 0.719 (5) 0.752 (1  ) 0.597 (8.5) 0.738 (3  ) 0.619 (7) 0.597 (8.5) 0.720 (4  ) 0.750 (2  ) 0.650 (6)
bibtex 0.844 (4) 0.877 (2  ) 0.673 (8.5) 0.894 (1  ) 0.706 (6) 0.673 (8.5) 0.811 (5  ) 0.873 (3  ) 0.696 (7)
llog 0.721 (1) 0.663 (4  ) 0.518 (7.5) 0.676 (2  ) 0.561 (6) 0.518 (7.5) 0.612 (5  ) 0.673 (3  ) 0.514 (9)
corel5k 0.695 (4) 0.718 (3  ) 0.559 (7.5) 0.687 (5  ) 0.596 (6) 0.559 (7.5) 0.740 (1  ) 0.723 (2  ) 0.552 (9)
slashdot 0.806 (1) 0.774 (2  ) 0.632 (8.5) 0.753 (4  ) 0.714 (6) 0.632 (8.5) 0.742 (5  ) 0.765 (3  ) 0.638 (7)
Avg. ranks 2.42 2.21 8.00 3.67 7.08 8.00 3.58 3.21 6.83
Table 3: Each cell indicates the averaged label-based macro-averaged AUC score (best score in bold) along with the relative rank of the corresponding algorithm in brackets. The last row indicates the average ranks.

Table 2 clearly shows that the overall performance of the proposed UCLSO algorithm is better than all the other algorithms, attaining the best average rank of 1.25. The second best rank is attained by COCOA (avg. rank 3.00). The proposed method UCLSO also achieved much better performance than the other approaches for many datasets, attaining the top rank on nine of the datasets and the second rank on the remaining three. These results also show that methods that explicitly consider the label imbalance issue perform better than those that do not. The other algorithms which specifically address label imbalance attained the following order: RML (avg. rank 3.29), THRSEL (avg. rank 4.62), SMOTE-EN (avg. rank 5.12) and IRUS (avg. rank 6.33). The algorithms which do not consider the label imbalances, namely BR (avg. rank 7.42), RAkEL (avg. rank 7.5), ECC (avg. rank 8.12), and CLR (avg. rank 8.33), all performed poorly.

The multiple classifier comparison results in Figure 2(a) show that when UCLSO is compared with the other algorithms, the null hypothesis can be rejected at the chosen significance level for all of them except COCOA and RML. Therefore, based on the statistical test, UCLSO is significantly better than the other algorithms, except COCOA and RML.

(a) label-based macro-averaged F-Score
(b) label-based macro-averaged AUC
Figure 2: Critical difference plots. The scale indicates the average ranks. The methods which are not connected by the horizontal lines are significantly different at the chosen significance level.

Table 3 shows the label-based macro-averaged AUC scores; the proposed method UCLSO attained the second best average rank of 2.42, very close to COCOA, which attained the best rank of 2.21. Interestingly, UCLSO attained more first ranks (six) than COCOA (two). Also, ECC performed better than UCLSO on six of the datasets, while COCOA performed better than ECC on nine of the datasets. It is also interesting to notice that ECC and CLR had higher rankings for the label-based macro-averaged AUC metric than for the macro-averaged F-Score, whereas a simple BR still performed poorly. As ECC and CLR take label associations into consideration, in a chaining and a ranking fashion respectively, this helped improve their comparative performances. RAkEL, on the other hand, although it takes label associations into account, is sensitive to the label subset size (the value of k) and to the specific label combinations, which can lead to an even higher degree of imbalance. The difference between the label-based macro-averaged AUC results and the F-Score results also indicates the importance of thresholding the predictions when deciding the relevance of a certain label.

The multiple classifier comparison results shown in Figure 2(b) indicate that when UCLSO is compared with the others, the null hypothesis could not be rejected for COCOA, ECC, CLR and IRUS at the chosen significance level. However, UCLSO performed significantly better than RAkEL, SMOTE-EN, THRSEL and BR. Overall, the experiments demonstrate the effectiveness of the proposed method UCLSO, as it outperforms the compared state-of-the-art algorithms in almost all cases.

6 Conclusion and Future Work

In this work we have proposed an algorithm to address the class imbalance of labels in multi-label classification problems. The proposed algorithm, Unsupervised Clustering and Label-Specific data Oversampling (UCLSO), oversamples label-specific minority datapoints in a multi-label problem to balance the sizes of the majority and the minority classes of each label. The oversampling of the minority classes for each label is done in a way such that more minority class samples are generated in regions (or clusters) where the density of minority points is high. This avoids the introduction of minority datapoints in majority regions in the input space. The number of samples introduced per cluster also depends on the share of the minority class for that cluster.

An experiment with 12 well-known multi-label datasets and other state-of-the-art algorithms demonstrates the efficacy of UCLSO with respect to the label-based macro-averaged F-Score. UCLSO attained the best average rank, and the degree of its improvement over existing approaches was significant. This shows that UCLSO successfully improves the classification of imbalanced multi-label data. In future work, we would like to incorporate imbalance-informed clustering to extend our scheme. Moreover, it would be interesting to combine the oversampling technique with the learning of label associations, another key component of multi-label data.

References

  • [1] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya (2006) Hierarchical multi-label prediction of gene function. Bioinformatics 22 (7), pp. 830–836. Cited by: §1.
  • [2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown (2004) Learning multi-label scene classification. Pattern Recognition 37 (9), pp. 1757–1771. Cited by: §1, §2.
  • [3] F. Charte, A. J. Rivera, M. J. del Jesus, and F. Herrera (2015) MLSMOTE: approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems 89, pp. 385–397. External Links: ISSN 0950-7051 Cited by: §2.
  • [4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002-06) SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16 (1), pp. 321–357. External Links: ISSN 1076-9757 Cited by: §3.1, §4.
  • [5] W. Cheng, E. Hüllermeier, and K. J. Dembczynski (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 279–286. Cited by: §2.
  • [6] W. Cheng and E. Hüllermeier (2009) Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76 (2-3), pp. 211–225. Cited by: §2.
  • [7] Z. Daniels and D. Metaxas (2017) Addressing imbalance in multi-label classification using structured hellinger forests. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §1.
  • [8] J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, and K. Brinker (2008-11) Multilabel classification via calibrated label ranking. Mach. Learn. 73 (2), pp. 133–153. External Links: ISSN 0885-6125 Cited by: §2, §2, §4.
  • [9] S. García, A. Fernández, J. Luengo, and F. Herrera (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Information sciences 180 (10), pp. 2044–2064. Cited by: §5.
  • [10] S. Godbole and S. Sarawagi (2004) Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 22–30. Cited by: §1.
  • [11] (2017) Granular multi-label feature selection based on mutual information. Pattern Recognition 67, pp. 410 – 423. External Links: ISSN 0031-3203 Cited by: §2.
  • [12] H. He and E. A. Garcia (2009-09) Learning from imbalanced data. IEEE Trans. on Knowl. and Data Eng. 21 (9), pp. 1263–1284. External Links: ISSN 1041-4347 Cited by: §3.1, §4.
  • [13] J. Huang, G. Li, Q. Huang, and X. Wu (2018-03) Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics 48 (3), pp. 876–889. External Links: ISSN 2168-2267 Cited by: §2.
  • [14] T. Joachims (1998) Text categorization with support vector machines: learning with many relevant features. In European Conference on Machine Learning, pp. 137–142. Cited by: §1.
  • [15] T. Li and M. Ogihara (2006-06) Toward intelligent music information retrieval. Multimedia, IEEE Transactions on 8 (3), pp. 564–574. External Links: ISSN 1520-9210 Cited by: §1.
  • [16] X. Li, F. Zhao, and Y. Guo (2014) Multi-label image classification with a probabilistic label enhancement model. In Uncertainty in Artificial Intelligence, Cited by: §1.
  • [17] B. Liu and G. Tsoumakas (2019) Synthetic oversampling of multi-label data based on local label distribution. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 180–193. Cited by: §1, §2.
  • [18] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz (2014) Large-scale multi-label text classification - revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 437–452. Cited by: §2.
  • [19] G. Nasierding, G. Tsoumakas, and A. Z. Kouzani (2009-10) Clustering based multi-label classification for image annotation and retrieval. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pp. 4514–4519. Cited by: §1.
  • [20] A. Pakrashi and B. M. Namee (2017) Stacked-MLkNN: a stacking based improvement to multi-label k-nearest neighbours. In LIDTA@PKDD/ECML, Cited by: §2.
  • [21] S. Park and J. Fürnkranz (2007) Efficient pairwise classification. In ECML 2007. LNCS (LNAI, pp. 658–665. Cited by: §2.
  • [22] R. M. Pereira, Y. M.G. Costa, and C. N. Silla Jr. (2020) MLTL: a multi-label approach for the tomek link undersampling algorithm. Neurocomputing 383, pp. 95–105. External Links: ISSN 0925-2312 Cited by: §2.
  • [23] J. Petterson and T. S. Caetano (2010) Reverse multi-label learning. In Advances in Neural Information Processing Systems 23, pp. 1912–1920. Cited by: §4, footnote 2.
  • [24] I. Pillai, G. Fumera, and F. Roli (2013-07) Threshold optimisation for multi-label classifiers. Pattern Recogn. 46 (7), pp. 2055–2065. External Links: ISSN 0031-3203 Cited by: §4.
  • [25] G. Qi, X. Hua, Y. Rui, J. Tang, T. Mei, and H. Zhang (2007) Correlative multi-label video annotation. In Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, New York, NY, USA, pp. 17–26. External Links: ISBN 978-1-59593-702-5 Cited by: §1.
  • [26] J. Read, L. Martino, and D. Luengo (2013) Efficient monte carlo optimization for multi-label classifier chains. pp. 3457–3461. External Links: ISSN 1520-6149 Cited by: §2.
  • [27] J. Read, B. Pfahringer, G. Holmes, and E. Frank (2011) Classifier chains for multi-label classification. Machine learning 85 (3), pp. 333. Cited by: §2, §4.
  • [28] P. Sadhukhan and S. Palit (2019) Reverse-nearest neighborhood based oversampling for imbalanced, multi-label datasets. Pattern Recognition Letters 125, pp. 813 – 820. External Links: ISSN 0167-8655 Cited by: §2.
  • [29] H. Su and J. Rousu (2015-05-01) Multilabel classification through random graph ensembles. Machine Learning 99 (2). Cited by: §2.
  • [30] M. A. Tahir, J. Kittler, and F. Yan (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification. 45 (10), pp. 3738–3750. External Links: ISSN 0031-3203 Cited by: §2, §4.
  • [31] E. A. Tanaka, S. R. Nozawa, A. A. Macedo, and J. A. Baranauskas (2015) A multi-label approach using binary relevance and decision trees applied to functional genomics. Journal of Biomedical Informatics 54, pp. 85–95. External Links: ISSN 1532-0464 Cited by: §2.
  • [32] G. Tsoumakas, I. Katakis, and I. Vlahavas (2011-07) Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23 (7), pp. 1079–1089. External Links: ISSN 1041-4347 Cited by: §2, §2, §4.
  • [33] J. Xu, J. Liu, J. Yin, and C. Sun (2016) A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously. Knowledge-Based Systems 98, pp. 172–184. Cited by: §2.
  • [34] J. Xu (2018) A weighted linear discriminant analysis framework for multi-label feature extraction. Neurocomputing 275, pp. 107–120. Cited by: §2.
  • [35] Z. Younes, F. Abdallah, and T. Denœux (2008) Multi-label classification algorithm derived from k-nearest neighbor rule with label dependencies. In 2008 16th European Signal Processing Conference, pp. 1–5. Cited by: §2.
  • [36] M.L. Zhang and Z.H. Zhou (2006) Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, pp. 1338–1351. Cited by: §2.
  • [37] M. Zhang, Y. Li, X. Liu, and X. Geng (2018) Binary relevance for multi-label learning: an overview. Frontiers of Computer Science 12 (2), pp. 191–202. Cited by: §2.
  • [38] M. Zhang, Y. Li, H. Yang, and X. Liu (2020) Towards class-imbalance aware multi-label learning. IEEE Transactions on Cybernetics. Cited by: §1, §4, §4, §4.
  • [39] M. Zhang and L. Wu (2015-01) Lift: multi-label learning with label-specific features. Pattern Analysis and Machine Intelligence, IEEE Transactions on 37 (1), pp. 107–120. External Links: ISSN 0162-8828 Cited by: §2.
  • [40] M. Zhang and Z. Zhou (2007-07) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. 40 (7), pp. 2038–2048. External Links: ISSN 0031-3203 Cited by: §2, §2.
  • [41] M. Zhang and Z. Zhou (2013) A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26 (8), pp. 1819–1837. Cited by: §2.