Learning from imbalanced datasets can be very challenging because the classes are not equally represented [provost2000machine]. There might not be enough examples for a learner to form a sound hypothesis that models the under-represented classes well. Hence, the classification results are often biased towards the majority classes. The curse of imbalanced learning is prevalent in real-world applications. In medical research, models are usually trained to predict a dichotomous outcome from a series of observable features [elhassan2016classification]. For example, learning from a cancer dataset that mostly contains non-cancer samples is perceived to be difficult. Other practical applications with more severely skewed datasets are detection of fraudulent telephone calls [fawcett1996combining], detection of oil spills in satellite images [kubat1998machine], detection of network intrusions [lee1998data], and information retrieval and filtering tasks [lewis1994heterogeneous]. In these scenarios, the imbalance ratio of majority class to minority class can go up to 100,000 [chawla2002smote]. Even though class imbalance can exist in multi-class applications, we focus only on the binary class scenario in this paper, as it is feasible to reduce a multi-class classification problem to a series of binary classification problems [allwein2000reducing].
Overall accuracy is typically chosen to evaluate the predictive power of machine learning classifiers trained on balanced datasets. For imbalanced datasets, overall accuracy is no longer an effective metric. For example, in the information retrieval and filtering task studied by Lewis and Catlett (1994), only 0.2% of the cases are interesting [kubat1998machine]. A dummy classifier which always predicts the majority class would easily achieve an overall accuracy of 99.8%. However, this predictive model is uninformative, as we are more interested in classifying the minority class. Common alternatives to overall accuracy in assessing imbalanced learning models are the F measure, G mean, and Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) [swets1988measuring]. By convention, the majority class is regarded as the negative class and the minority class as the positive class [chawla2002smote, kubat1997addressing]. Figure 1 shows a confusion matrix, which is typically used to visualize and assess the performance of predictive models. Based on this confusion matrix, the evaluation metrics used in this paper are mathematically formulated as follows:
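Written with the confusion-matrix counts TP, FP, TN and FN (positive = minority class), the standard textbook definitions of these metrics are sketched below. The symbol names and the choice $\beta = 1$ for the F1 measure reported in the experiments are assumptions, not taken from the source.

```latex
\begin{align*}
\text{Overall accuracy} &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall (sensitivity)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
F_\beta &= \frac{(1 + \beta^2)\,\text{Precision} \cdot \text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}} \\
\text{G mean} &= \sqrt{\text{Recall} \cdot \text{Specificity}}
\end{align*}
```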
Section 2 briefly reviews the literature on dealing with imbalanced datasets. Section 3 proposes our WOTBoost algorithm in detail. Section 4 compares the experimental results of the WOTBoost algorithm with other baseline methods in terms of precision, recall, F measure, G mean, specificity, and AUC. Section 5 discusses the results and proposes directions for future work.
There have been ongoing efforts in this research domain to find better ways of tackling the imbalanced learning problem. Most state-of-the-art methodologies fall into two major categories: 1) the data level approach, or 2) the algorithm level approach [ali2015classification].
2.1 Data level approach
On the data level, skewed datasets can be balanced by either 1) oversampling the minority class data examples, or 2) undersampling the majority class data examples.
Oversampling aims to overcome the class imbalance by artificially creating new data from the under-represented class. However, simply duplicating the minority class samples could cause overfitting. One of the most widely used techniques is SMOTE. The SMOTE algorithm generates synthetic examples for the minority class by randomly placing newly created instances between minority class data points and their close neighbors [chawla2002smote]. This technique not only models the minority class better by introducing a bias towards the minority instances, but also has a lower chance of overfitting, because SMOTE forces the learners to create larger and less specific decision regions. Based on SMOTE, Hui Han et al. proposed Borderline-SMOTE, which only synthesizes minority examples on the decision borderline [han2005borderline]. Borderline-SMOTE classifies minority examples into a "safe type" and a "dangerous type". The "safe type" is located in homogeneous regions where the majority of data examples belong to the same class. On the other hand, the "dangerous type" data points are outliers and most likely lie within the decision regions of the opposite class. The intention behind Borderline-SMOTE is to give more weight to the "dangerous type" minority examples, as they are deemed more difficult to learn [napierala2016types]. Haibo He et al. adopted the same philosophy and proposed the ADASYN algorithm, which uses a weighted distribution over the minority class data. The weights are assigned to minority examples based on their level of difficulty in learning. In other words, harder examples have larger weights and thus a higher chance of receiving more synthesized data. Prior to generating synthetic data, ADASYN inspects the $K$ nearest neighbors of each minority class example $x_i$ and counts the number of neighbors from the majority class, $\Delta_i$. Next, the difficulty of learning is calculated as the ratio $r_i = \Delta_i / K$ [he2008adasyn]. ADASYN assigns higher weights to the difficult minority samples. On the contrary, Safe-Level-SMOTE gives more priority to safer minority instances and achieves better accuracy than SMOTE and Borderline-SMOTE [bunkhumpornpat2009safe].
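The two ideas described above, interpolating between a minority point and one of its minority neighbors (SMOTE) and scoring difficulty by the share of majority points among the nearest neighbors (ADASYN), can be sketched as follows. This is a minimal illustration using only the Python standard library, not the reference implementation of either paper; the function names, the tuple-based data layout, and the default `k=5` are assumptions made for the example.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def smote_sample(x, minority, k=5, rng=random):
    """Create one synthetic point on the segment between x and a random minority neighbour.

    Assumes x is an element of `minority` (it is excluded by identity).
    """
    others = [p for p in minority if p is not x]
    neighbour = rng.choice(k_nearest(x, others, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbour))


def adasyn_difficulty(x, minority, majority, k=5):
    """Fraction of the k nearest neighbours of x that belong to the majority class."""
    labelled = [(p, 0) for p in minority if p is not x] + [(p, 1) for p in majority]
    nearest = sorted(labelled, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(is_major for _, is_major in nearest) / k
```

A minority point surrounded mostly by majority points gets a difficulty close to 1, so a scheme in the spirit of ADASYN would generate more synthetic samples around it than around a point whose neighbors are all minority examples.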
Undersampling approaches imbalanced learning by removing a certain number of examples from the majority class while keeping the original minority data points untouched. Random undersampling is the most common method in this category [ali2015classification]. Elhassan AT et al. combined an undersampling algorithm with Tomek Link (T-Link) to create a balanced dataset [elhassan2016classification, thai2010learning]. However, undersampling methods may suffer severe information loss. In this paper, we mainly focus on the oversampling technique and its variants [tomek1976experiment].
2.2 Algorithm level approach
On the algorithm level, there are typically three mainstream approaches: a) Improved algorithms, b) cost sensitive learning, and c) ensemble method [ali2015classification, sun2009classification].
2.2.1 Improved algorithms
This approach generally attempts to tailor the classification algorithm to learn directly from the skewed dataset by shifting the decision boundary in favor of the minority class. Tasadduq Imam et al. proposed z-SVM to counter the inherent bias in datasets by introducing a weight parameter, $z$, for the minority class to correct the decision boundary during model fitting [imam2006z]. Other modified SVM classifiers have also been reported, such as GSVM_RU and BSVM [tang2009svms, hoi2004biased]. One special form of improved algorithm for imbalanced datasets is one-class learning. This method aims to generalize the hypothesis on a training dataset which only contains the target class [manevitz2001one, devi2019learning].
2.2.2 Cost sensitive learning
This technique penalizes the misclassifications of different classes with varying costs. Specifically, it assigns higher costs to misclassification of the target classes. Hence, false negatives are penalized more than false positives [zadrozny2001learning, margineantu2002class]. In cost sensitive learning, a cost weight distribution is predefined in favor of the target classes.
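As a minimal sketch of the idea, the function below sums misclassification costs with a higher price on false negatives. The label convention (1 = minority/target class) and the cost values 5.0 and 1.0 are hypothetical choices for illustration, not values from the cited works.

```python
def total_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Sum misclassification costs; label 1 is the minority (target) class."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += fn_cost  # missed minority example (false negative)
        elif truth == 0 and pred == 1:
            cost += fp_cost  # false alarm on a majority example (false positive)
    return cost
```

Under these costs, a classifier that always predicts the majority class on labels `[1, 1, 0, 0, 0]` pays 10.0, while one that finds both minority examples at the price of a single false alarm pays 1.0, so the cost-sensitive objective prefers the latter even though both make errors.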
2.2.3 Ensemble method
An ensemble method trains a series of weak learners over a fixed number of iterations. A weak learner is a classifier whose accuracy is just barely above chance. At each round, a weak learner is created and a weak hypothesis is formed. The predictive outcome is produced by aggregating all of these weak hypotheses using weighted voting [dietterich2000ensemble]. For example, the AdaBoost.M2 algorithm calculates the pseudo-loss of each weak hypothesis during boosting. The pseudo-loss is computed over all data examples with respect to the incorrect classifications and the weight distribution associated with them at each iteration. The weight distribution is updated with respect to the pseudo-loss at the current iteration and carried over to the next round of boosting. Hence, the learners in the next iteration concentrate on the examples which are hard to learn [freund1996experiments]. Since AdaBoost is well suited to learning from imbalanced datasets, several works build on this boosting framework [chawla2003smoteboost, seiffert2010rusboost, guo2004learning]. SMOTEBoost combines the merits of SMOTE and boosting by adding a SMOTE procedure at the beginning of each round of boosting. It aims to improve the true positives without sacrificing the accuracy on the majority class. RUSBoost alleviates class imbalance by introducing random undersampling into the standard boosting procedure, and is a faster and simpler alternative to SMOTEBoost [seiffert2010rusboost]. Ashutosh Kumar et al. proposed the RUSTBoost algorithm, which adds a redundancy-driven modified Tomek-Link based undersampling procedure before RUSBoost [kumar2019improvement]. A Tomek-Link pair is a pair of closest data points from different classes. However, all the boosting algorithms mentioned above treat the data examples equally. Krystyna Napierala et al. highlighted that the various types of minority examples (e.g. safe, borderline, rare, and outlier) have unequal influence on the classification outcome. As such, algorithms should be designed to focus on the examples which are not easy to learn [napierala2016types]. DataBoost-IM is reported to discriminate between different types of data examples beforehand and adjust the weight distribution accordingly during boosting [guo2004learning].
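The Tomek-Link pairs mentioned above can be found with a brute-force search for mutual nearest neighbors that carry different labels. The sketch below uses the commonly stated definition of a Tomek link; it does not reproduce the redundancy-driven modification used in RUSTBoost, and the function name and data layout are assumptions.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    def nearest(i):
        # Index of the point closest to points[i], excluding i itself.
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links
```

An undersampling step based on these links would typically drop the majority member of each returned pair, which cleans the class boundary without touching minority examples.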
The contributions of this paper are: 1) We identify the minority class examples which are harder to learn at each round of boosting and generate more synthetic data for them. 2) We tested our proposed algorithm extensively on 18 publicly accessible datasets and compared the results with the most commonly used algorithms. To our knowledge, this might be the first work to carry out such a comprehensive comparison study of ensemble methods combined with oversampling. 3) We inspected the distributions of the 18 datasets and discuss why WOTBoost performs better on certain datasets.
In this section, we propose the WOTBoost algorithm, which combines a weighted oversampling algorithm with the standard boosting procedure. The Weighted Oversampling Technique populates synthetic data based on the weights associated with each minority example; in other words, minority samples with higher weights are synthesized more. The algorithm is an ensemble method and creates a series of classifiers over an arbitrary number of iterations. The boosting procedure is detailed in Algorithms 1 and 2: a) We introduce a weighted oversampling step at the beginning of each boosting iteration. b) We adjust the weighted oversampling strategy using the updated weights (i.e. at line 8 in Algorithm 1) associated with the minority class during each round of boosting [provost2000machine]. The boosting algorithm gives more weight to the samples that were misclassified in the previous round. Hence, WOTBoost generates more synthetic examples for the minority data that were misclassified in previous iterations. Meanwhile, boosting also adds more weight to misclassified majority class data and forces the learner to focus on these data as well. Therefore, we combine the merits of the weighted oversampling technique and AdaBoost.M2. The goal is to improve the discriminative power of the classifier on difficult minority examples without sacrificing accuracy on the majority class.
Algorithm 1 presents the details of the boosting procedure, which is a modified version of AdaBoost.M2 [freund1996experiments]. It takes a training dataset with $m$ data samples, $\{(x_1, y_1), \ldots, (x_m, y_m)\}$. $x_i$ is the $i$th feature vector in $n$-dimensional space, and $y_i$ is the true label associated with $x_i$. $y$ is the predicted label. We initialize a mislabel distribution, $B$, which contains all the misclassified data instances (i.e. $y \neq y_i$). In addition, we initialize a weight distribution for the training data by assigning equal weights to all samples. During each round of boosting (step 1 - step 9), a weak learner is built on a training dataset which is the output of the weighted oversampling procedure. The weak learner formulates a weak hypothesis which is just slightly better than random guessing, hence the name [freund1996experiments]. This is good enough, as the final output aggregates all the weak hypotheses using weighted voting. For error estimation, the pseudo-loss of a weak hypothesis is calculated as specified at step 5. Instead of the ordinary training loss, the pseudo-loss is adopted to force the ensemble method to focus on mislabeled data. More justification for using the pseudo-loss can be found in [freund1996experiments, freund1997decision]. Once the pseudo-loss is computed, the weight distribution, $D_t$, is updated accordingly and normalized at step 5 - step 8.
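For reference, the pseudo-loss, weight update, and final vote of the original AdaBoost.M2 [freund1996experiments] can be written as below. This is the published base algorithm rather than a transcription of Algorithm 1, whose step numbering and any modifications are not reproduced here. The symbols are assumptions following Freund and Schapire: $B$ is the set of mislabel pairs $(i, y)$ with $y \neq y_i$, $D_t$ the weight distribution at round $t$, $h_t(x, y) \in [0, 1]$ the weak hypothesis, and $Z_t$ a normalization constant.

```latex
\begin{align*}
\epsilon_t &= \frac{1}{2} \sum_{(i, y) \in B} D_t(i, y)\,\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr) \\
\beta_t &= \frac{\epsilon_t}{1 - \epsilon_t} \\
D_{t+1}(i, y) &= \frac{D_t(i, y)}{Z_t}\,\beta_t^{\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)} \\
h_{fin}(x) &= \arg\max_{y} \sum_{t=1}^{T} \left(\log \frac{1}{\beta_t}\right) h_t(x, y)
\end{align*}
```

Because $\beta_t < 1$ whenever the pseudo-loss is below one half, pairs that the weak hypothesis handles well are multiplied by a smaller factor, so after normalization the poorly handled pairs carry relatively more weight in the next round.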
Algorithm 2 presents the weighted oversampling procedure. The inputs to the oversampling technique are the weight distribution, $D_t$, and an arbitrary number of synthetic data samples, $N$. It uses the weight distribution as the oversampling strategy to decide how many samples to synthesize for each minority example, as described at step 1 in Algorithm 2. As mentioned previously, the ensemble method assigns more weight to misclassified data. Therefore, this oversampling strategy helps the classifier learn a broader representation of mislabeled data by placing more similar samples around them.
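One plausible reading of this strategy is sketched below: the synthetic budget is split among minority points in proportion to their boosting weights, and each point then receives SMOTE-style interpolated neighbors. The proportional allocation rule, the largest-remainder rounding, and all function and parameter names are assumptions made for illustration; the exact rule is the one given in Algorithm 2.

```python
import math
import random


def allocate_counts(weights, n_synthetic):
    """Split n_synthetic among minority points in proportion to their weights."""
    total = sum(weights)
    shares = [w / total * n_synthetic for w in weights]
    counts = [int(s) for s in shares]
    # Hand the leftover samples to the points with the largest fractional share.
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[: n_synthetic - sum(counts)]:
        counts[i] += 1
    return counts


def weighted_oversample(minority, weights, n_synthetic, k=5, rng=random):
    """Generate n_synthetic points, placing more of them around highly weighted minority points."""
    synthetic = []
    for idx, count in enumerate(allocate_counts(weights, n_synthetic)):
        x = minority[idx]
        others = [p for j, p in enumerate(minority) if j != idx]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(count):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation position in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

With weights `[1, 1, 8]` and a budget of 10, the third point receives 8 of the 10 synthetic samples, which mirrors the intent that frequently misclassified minority examples are synthesized more.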
In this section, we conduct a comprehensive comparison study of WOTBoost algorithm with decision tree, SMOTE + decision tree, ADASYN + decision tree, and SMOTEBoost. Figure 2 shows how the models are built and assessed.
|Dataset||Instances||Attributes||Outcome Frequency||Imbalanced Ratio||No. of safe minority||No. of unsafe minority||unsafe minority%|
|Pima Indian Diabetes||768||9||Maj: 500 Min: 268||1.9||86||182||67.9%|
|Vowel Recognition||990||14||Maj: 900 Min: 90||10.0||89||1||1.1%|
|Mammography||11183||7||Maj: 10923 Min: 260||42||107||153||58.8%|
|Ionosphere||351||35||Maj: 225 Min: 126||1.8||57||69||54.8%|
|Vehicle||846||19||Maj: 647 Min: 199||3.3||154||45||22.6%|
|Phoneme||5404||6||Maj: 3818 Min: 1580||2.4||980||606||38.2%|
|Haberman||306||4||Maj: 225 Min: 81||2.8||8||73||90.1%|
|Wisconsin||569||31||Maj: 357 Min: 212||1.7||175||37||17.5%|
|Blood Transfusion||748||5||Maj: 570 Min: 178||3.2||23||155||87.1%|
|PC1||1484||9||Maj: 1032 Min: 77||13.4||8||69||89.6%|
|Heart||294||14||Maj: 188 Min: 106||1.8||17||89||84.0%|
|Segment||2310||20||Maj: 1980 Min: 330||6.0||246||84||25.5%|
|Yeast||1484||9||Maj: 1240 Min: 244||5.1||95||149||61.1%|
|Oil||937||50||Maj: 896 Min: 41||21.9||0||41||100.0%|
|Adult||48842||7||Maj: 37155 Min: 11687||3.2||873||10814||92.5%|
|Satimage||6430||37||Maj: 5805 Min: 625||9.3||328||297||47.5%|
|Forest cover||581012||11||Maj: 35754 Min: 2747||13.0||2079||668||24.3%|
We evaluate these 5 models extensively on 18 publicly accessible imbalanced datasets. The imbalance ratio (i.e. the count of majority class samples divided by the count of minority class samples) of these datasets varies from 1.7 to 42. Since some of the datasets have more than 2 classes and we are only interested in the binary class problem in this paper, we pre-processed these datasets into binary class datasets following the rules in the literature [chawla2002smote, han2005borderline, bunkhumpornpat2009safe, he2008adasyn, chawla2003smoteboost, kumar2019improvement]. Only numeric attributes are included when processing the datasets. The details of data cleaning are not provided here due to the page limit. Characteristics of these datasets are summarized in table 1.
We compared the WOTBoost algorithm with a naive decision tree classifier, a decision tree classifier after SMOTE, a decision tree classifier after ADASYN, and SMOTEBoost. Figure 2 shows that the clean datasets were split evenly into training and testing sets during each iteration [he2008adasyn]. As a control group, a naive decision tree model learned directly from the imbalanced training dataset. SMOTE and ADASYN were used separately to balance the training dataset before passing it to decision tree classifiers. SMOTEBoost and WOTBoost took in imbalanced training datasets and synthesized new samples for the minority class at each round of boosting. Both used a decision tree as the weak learner [chawla2003smoteboost]. Models are evaluated on a separate testing dataset, and the evaluation metrics used in this study are precision, recall, F1 measure, G mean, specificity, and area under the ROC curve. The final performance assessments are averaged over 100 such runs and are summarized in table 3. During each testing run, we oversampled the training dataset so that the minority class and majority class are equally represented in all models [he2008adasyn]. For SMOTE, ADASYN, SMOTEBoost, and WOTBoost, we set the number of nearest neighbors to 5.
5 Results and discussion
We highlight the best model and its performance in boldface for each dataset in table 3. Figure 3 presents the performance comparison of these 5 models on G mean and AUC score across the 18 datasets. To assess the effectiveness of the proposed algorithm on these imbalanced datasets, we counted the cases in which the WOTBoost algorithm outperforms or matches the other models on each metric. The results presented in table 2 show that WOTBoost has the most wins on G mean (6 times) and AUC (7 times). As defined in equation 4 in the first section, G mean is the square root of the product of positive accuracy (i.e. recall or sensitivity) and negative accuracy (i.e. specificity). Meanwhile, the area under the ROC curve, or AUC, is typically used for model selection, and it examines the true positive rate and false positive rate at various thresholds. Hence, both evaluation metrics consider the accuracy of both classes. Therefore, we argue that WOTBoost indeed improves learning on the minority class while maintaining accuracy on the majority class.
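To make concrete why these two metrics account for both classes, the sketch below computes G mean from confusion-matrix counts and AUC as the fraction of (positive, negative) score pairs that are ranked correctly, which is equivalent to the area under the ROC curve. The function names and the pairwise formulation are choices made for this illustration, not the evaluation code used in the study.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Square root of recall (positive accuracy) times specificity (negative accuracy)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(recall * specificity)


def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

For a dummy classifier that always predicts the majority class on 1,000 samples with 2 positives, overall accuracy is 0.998 but `g_mean(tp=0, fn=2, tn=998, fp=0)` is 0.0, because recall is zero.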
In table 3, we observe that WOTBoost has the best G mean and AUC score on Pima Indian Diabetes, whereas SMOTEBoost is the winner on Ionosphere on the same metrics. Considering that these two datasets have similar global imbalance ratios, this naturally raises the question: are there other factors that influence classification performance? To understand why WOTBoost performs better on certain datasets, we investigated the local characteristics of the minority class in these datasets. We used t-SNE to visualize the distribution of these two datasets, as shown in Figure 4. The t-SNE algorithm allows us to visualize high-dimensional datasets by projecting them onto a two-dimensional plane. Figure 4 indicates that there is more overlap between the two classes in Pima Indian Diabetes, whereas there are more "safe" minority class samples in Ionosphere. It is likely that WOTBoost learns better when there are more difficult minority examples. Figure 5 shows the distribution of Pima Indian Diabetes before and after applying WOTBoost. We highlight one of the regions where minority samples are difficult to learn. The WOTBoost algorithm is able to populate synthetic data for these minority samples.
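Table 1 shows the number of safe and unsafe minority samples in the datasets. We consider a minority class sample to be safe if its 5 nearest neighbors contain at most 1 majority class sample; otherwise, it is labeled as an unsafe minority [han2005borderline, napierala2016types]. Unsafe minority percentage is computed by dividing the number of unsafe minority samples by the total number of minority samples, i.e. $\text{unsafe minority \%} = \frac{\text{No. of unsafe minority}}{\text{No. of safe minority} + \text{No. of unsafe minority}} \times 100\%$. For Pima Indian Diabetes, for example, this gives $182 / 268 = 67.9\%$.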
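The safe/unsafe labelling rule stated above (5 nearest neighbors, at most 1 majority neighbor for a safe sample) can be sketched with a brute-force neighbor search. The function name, the parameter names, and the tuple-based data layout are assumptions for illustration; this is not the code used to produce table 1.

```python
import math


def unsafe_minority_percentage(minority, majority, k=5, max_majority=1):
    """Percentage of minority points with more than max_majority majority points among their k nearest neighbours."""
    unsafe = 0
    for i, x in enumerate(minority):
        # Pool every other point, tagged with whether it is a majority sample.
        labelled = [(p, False) for j, p in enumerate(minority) if j != i]
        labelled += [(p, True) for p in majority]
        nearest = sorted(labelled, key=lambda item: math.dist(x, item[0]))[:k]
        if sum(is_major for _, is_major in nearest) > max_majority:
            unsafe += 1
    return 100.0 * unsafe / len(minority)
```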
We observed that the unsafe minority percentages are around 50% or higher in most of the datasets where WOTBoost has the best G mean or AUC in table 3. For example, Adult, Haberman, Blood Transfusion, Pima Indian Diabetes, and Satimage have 92.5%, 90.1%, 87.1%, 67.9%, and 47.5% unsafe minority samples among their total minority class samples, respectively. Meanwhile, the global imbalance ratios of these datasets range from 1.9 to 10.0. Hence, WOTBoost might be a good candidate for tackling imbalanced datasets with a large proportion of unsafe minority samples and relatively high between-class imbalance ratios.
|Dataset||Methods||OA||Precision||Recall||F_measure||G_mean||Specificity||Sensitivity||ROC AUC||Outcome Frequency||Imbalanced ratio|
|Pima Indian Diabetes||DT||0.71 ± 0.02||0.61 ± 0.04||0.54 ± 0.05||0.57 ± 0.03||0.66 ± 0.02||0.80 ± 0.03||0.54 ± 0.05||0.67 ± 0.02||Maj: 500 Min: 268||1.9|
|Abalone||DT||0.93 ± 0.01||0.46 ± 0.12||0.46 ± 0.10||0.46 ± 0.08||0.66 ± 0.08||0.96 ± 0.01||0.46 ± 0.10||0.71 ± 0.04||Maj: 689 Min: 42||16.4|
|Abalone||WOT||0.94 ± 0.01||0.55 ± 0.33||0.34 ± 0.11||0.42 ± 0.13||0.58 ± 0.18||0.98 ± 0.01||0.34 ± 0.11||0.66 ± 0.05||Maj: 689 Min: 42||16.4|
|Vowel Recognition||DT||0.97 ± 0.00||0.90 ± 0.06||0.79 ± 0.06||0.84 ± 0.04||0.88 ± 0.03||0.99 ± 0.00||0.79 ± 0.06||0.89 ± 0.03||Maj: 900 Min: 90||10.0|
|Ionosphere||DT||0.86 ± 0.02||0.83 ± 0.06||0.73 ± 0.06||0.77 ± 0.04||0.82 ± 0.03||0.92 ± 0.04||0.73 ± 0.06||0.83 ± 0.03||Maj: 225 Min: 126||1.8|
|Vehicle||DT||0.94 ± 0.01||0.85 ± 0.04||0.88 ± 0.04||0.87 ± 0.03||0.92 ± 0.02||0.95 ± 0.01||0.88 ± 0.04||0.92 ± 0.02||Maj: 647 Min: 199||3.3|
|Phoneme||DT||0.86 ± 0.00||0.75 ± 0.01||0.74 ± 0.01||0.75 ± 0.01||0.82 ± 0.00||0.90 ± 0.00||0.74 ± 0.01||0.82 ± 0.00||Maj: 3818 Min: 1580||2.4|
|Haberman||DT||0.67 ± 0.03||0.38 ± 0.06||0.25 ± 0.08||0.30 ± 0.05||0.46 ± 0.05||0.83 ± 0.05||0.25 ± 0.08||0.54 ± 0.03||Maj: 225 Min: 81||2.8|
|Wisconsin||DT||0.95 ± 0.01||0.93 ± 0.03||0.93 ± 0.01||0.93 ± 0.01||0.95 ± 0.01||0.96 ± 0.02||0.93 ± 0.01||0.95 ± 0.01||Maj: 357 Min: 212||1.7|
|Blood Transfusion||DT||0.72 ± 0.01||0.39 ± 0.06||0.28 ± 0.08||0.32 ± 0.06||0.49 ± 0.07||0.86 ± 0.01||0.28 ± 0.08||0.57 ± 0.04||Maj: 570 Min: 178||3.2|
|PC1||DT||0.90 ± 0.01||0.25 ± 0.05||0.27 ± 0.05||0.26 ± 0.04||0.50 ± 0.04||0.94 ± 0.03||0.27 ± 0.05||0.61 ± 0.02||Maj: 1032 Min: 77||13.4|
|Heart||DT||0.77 ± 0.03||0.68 ± 0.06||0.63 ± 0.08||0.65 ± 0.05||0.73 ± 0.04||0.84 ± 0.05||0.63 ± 0.08||0.74 ± 0.03||Maj: 188 Min: 106||1.8|
|Segment||DT||0.96 ± 0.00||0.88 ± 0.04||0.88 ± 0.03||0.88 ± 0.02||0.93 ± 0.02||0.98 ± 0.00||0.88 ± 0.03||0.93 ± 0.01||Maj: 1980 Min: 330||6.0|
|Yeast||DT||0.83 ± 0.01||0.46 ± 0.03||0.59 ± 0.05||0.51 ± 0.03||0.72 ± 0.03||0.87 ± 0.01||0.59 ± 0.05||0.73 ± 0.02||Maj: 1240 Min: 244||5.1|
|Oil||DT||0.93 ± 0.01||0.35 ± 0.11||0.48 ± 0.13||0.41 ± 0.11||0.68 ± 0.12||0.96 ± 0.01||0.48 ± 0.13||0.72 ± 0.06||Maj: 896 Min: 41||21.9|
|Adult||DT||0.75 ± 0.00||0.48 ± 0.00||0.47 ± 0.00||0.48 ± 0.00||0.63 ± 0.00||0.84 ± 0.00||0.47 ± 0.00||0.66 ± 0.00||Maj: 37155 Min: 11687||3.2|
|Satimage||DT||0.91 ± 0.00||0.53 ± 0.02||0.51 ± 0.03||0.52 ± 0.02||0.70 ± 0.02||0.95 ± 0.00||0.51 ± 0.03||0.73 ± 0.01||Maj: 5805 Min: 625||9.3|
|Forest cover||DT||0.97 ± 0.00||0.81 ± 0.01||0.82 ± 0.01||0.82 ± 0.01||0.90 ± 0.00||0.99 ± 0.00||0.82 ± 0.01||0.90 ± 0.00||Maj: 35754 Min: 2747||13.0|
DT=Decision Tree, S=SMOTE, A=ADASYN, SM=SMOTEBoost, WOT=WOTBoost
In this paper, we proposed the WOTBoost algorithm to better learn from imbalanced datasets. The goal was to improve classification performance on the minority class without sacrificing accuracy on the majority class. We carried out a comprehensive comparison between the WOTBoost algorithm and 4 other classification models. Results indicate that WOTBoost has the best G mean and AUC scores in 6 out of 18 datasets. WOTBoost showed more balanced performance, for example in G mean, than the other classification models, particularly SMOTEBoost. Even though WOTBoost is not a cure-all for the imbalanced learning problem, it is likely to produce promising results for datasets that contain a large proportion of unsafe minority samples and possibly relatively high global imbalance ratios. We hope that our contribution to this research domain provides more insights and directions.
In addition, our study demonstrates that prior knowledge of the minority class distribution can improve the learning performance of classifiers [provost2000machine, guo2004learning, he2008adasyn, napierala2016types, han2005borderline, bunkhumpornpat2009safe]. Further investigation of data-driven sampling may produce interesting findings in this domain.