MixBoost: Synthetic Oversampling with Boosted Mixup for Handling Extreme Imbalance

by Anubha Kabra, et al.

Training a classification model on a dataset where the instances of one class outnumber those of the other class is a challenging problem. Such imbalanced datasets are standard in real-world situations such as fraud detection, medical diagnosis, and computational advertising. We propose an iterative data augmentation method, MixBoost, which intelligently selects (Boost) and then combines (Mix) instances from the majority and minority classes to generate synthetic hybrid instances that have characteristics of both classes. We evaluate MixBoost on 20 benchmark datasets, show that it outperforms existing approaches, and test its efficacy through significance testing. We also present ablation studies to analyze the impact of the different components of MixBoost.





I Introduction

Several real-world situations (fault detection [24], disease classification [17], software failures [5], oil spill detection [16], and protein sequence detection [3]) involve learning from datasets where the instances of one class (the majority class) far outnumber those of the other class (the minority class). In certain extremely imbalanced datasets, the ratio of minority to majority instances is very low, or the number of minority class instances is very small [19]. For instance, in gamma-ray anomaly detection [27], the datasets for learning have 10 anomalies and approximately 25,000 benign signatures. In fraud detection, there are typically about 5 fraudulent examples in a dataset of over 300,000 transactions [29].

Training a binary classification model on such extremely imbalanced datasets is a challenging problem. A popular class of methods alleviates this problem by augmenting the training data with synthetic instances before training the classification model [6, 7, 13, 19]. These data augmentation methods generate synthetic minority class instances using either minority or majority class instances from the training dataset. However, existing methods often generate instances that do not improve (or even worsen) classifier performance.

Fig. 1: Illustration of the limitation of synthetic oversampling techniques. Existing methods (SMOTE [6] and its variants, SWIM [19]) select candidate instances from one of the classes and create synthetic instances based on these selected instances. Therefore, most generated synthetic instances lie near clusters (often within the convex hull) of instances that are often already correctly classified by the classification model (region A). Further, these methods generate fewer and poorer quality synthetic instances in regions of the input space where the model does not perform well (regions B and C).
Fig. 2: High-level summary of the data-augmentation pipeline. The triangle represents the data instances in the non-augmented training dataset: green for the minority class and yellow for the majority class. The square represents the synthetic generated instances that are added to the training dataset (augmentation) prior to training. Traditional approaches such as SWIM [19], SMOTE [6], and its variants generate synthetic minority class instances (denoted by green). In contrast, MixBoost generates synthetic hybrid instances that interpolate the minority and majority class instances (denoted by the green-yellow color scheme).

Intuitively, these methods generate good quality synthetic instances in regions of the input space where the classification model is already accurate. In regions where the model is inaccurate and requires synthetic instances, these methods often do not perform as well. Figure 1 illustrates this intuition. We believe that using the trained model (whose performance we are trying to improve) to guide the data augmentation process will allow us to augment regions of the input space where the model performs poorly. Further, existing methods generate synthetic homogeneous instances, i.e., instances that belong to a single class (usually the minority class). Recent work in Computer Vision [25, 22, 20] has demonstrated the value of augmenting training datasets with non-homogeneous hybrid instances to learn more robust representations.

We describe a data augmentation method (MixBoost) that improves classifier performance on such imbalanced datasets. Our key contributions are:

  • We introduce a data augmentation method, MixBoost, that generates synthetic hybrid instances to augment an imbalanced dataset (Section II). MixBoost has two innovative components. First, it mixes instances of the minority and majority classes to generate synthetic hybrid instances that have elements of both classes (Section II-D). Second, it intelligently selects the instances for mixing using a novel entropy-weighted low-high technique (Section II-C).

  • We show that MixBoost outperforms state-of-the-art data augmentation methods on multiple highly imbalanced benchmark datasets for different levels of class imbalance (Section IV).

There are several categories of methods to augment an imbalanced dataset prior to training a classification model. Under-sampling methods discard instances of the majority class at random to balance the class distribution. However, removing instances of the majority class can lead to a loss of information and subsequently degrade classifier performance. Over-sampling methods duplicate instances of the minority class at random to balance the class distribution. These methods add duplicates of existing instances to the training dataset and therefore do not add new information [11, 12]. More sophisticated data augmentation methods augment the training datasets by creating synthetic minority class instances. For instance, SMOTE [6] creates synthetic minority instances by interpolating minority class instances in the training data. However, SMOTE ignores majority class data when creating synthetic instances. Variants of SMOTE (Borderline SMOTE [13], ADASYN [15], SMOTEBoost [7]) use the distribution of majority class data to filter instances created using minority class data. However, since these methods use only minority class data, the synthetic instances they create are restricted to the convex-hull of the minority class distribution. To expand the space spanned by generated instances, SWIM [19] creates instances by inflating minority class data along the density contours of majority class data.
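The core interpolation idea behind SMOTE can be sketched in a few lines of numpy. This is a simplified illustration, not the reference implementation: the function name `smote_like`, the choice of `k`, and the toy data are ours. Note that every synthetic point lies on a segment between two minority instances, which is exactly the convex-hull restriction criticized above.

```python
import numpy as np

def smote_like(minority, n_synthetic, k=3, seed=0):
    """SMOTE-style oversampling sketch: interpolate each sampled minority
    instance toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        # indices of the k nearest minority neighbours of x (excluding x)
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        x_nn = minority[rng.choice(neighbours)]
        u = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(x + u * (x_nn - x))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like(minority, n_synthetic=5)
```

Because each synthetic point interpolates two existing minority instances, all generated points here stay inside the unit square spanned by the minority class.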

Fig. 3: High-level architecture of MixBoost. In the (1) candidate selection (Boost) step, MixBoost uses entropy to intelligently select instances from the training dataset. In the (2) hybrid generation (Mix) step, we mix the selected instances to create synthetic hybrid instances. The interpolation ratio controls the extent of overlap between the instances to be mixed. We then augment the training dataset with the generated hybrid instances and train a classifier on the augmented dataset.

Existing data augmentation methods give promising results in a variety of situations. However, these methods focus on either the majority class or the minority class. Approaches that focus on the minority class often overlook the majority class and can degrade classifier performance by generating borderline or overlapping instances. On the other hand, approaches that focus on the majority class tend to overfit the training distribution. In contrast, MixBoost combines instances from the majority and minority classes to create synthetic hybrid instances. Each synthetic hybrid instance contains elements of a majority class and a minority class instance. Further, MixBoost uses an information-theoretic measure along with the trained classifier to select the instances to combine. Figure 2 illustrates the essential idea and distinction to related approaches.

Figure 3 shows the high-level architecture of MixBoost. MixBoost generates synthetic hybrid instances iteratively. Each iteration consists of a Boost step followed by a Mix step. In the Boost step, MixBoost uses the trained classifier to intelligently select pairs of instances from the minority and majority classes for mixing. Intuitively, MixBoost selects pairs such that one instance in each pair is close to the decision boundary of the trained classifier, and the other instance is far from it. In the Mix step, MixBoost mixes the selected instances to create synthetic hybrid instances that are used to augment the training dataset. The classifier is re-trained on the original data augmented with the hybrid instances before the next iteration.
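Under stated assumptions, the iteration loop can be sketched as follows. The classifier (`LogisticRegression`), the pairing rule (most-uncertain majority instance with most-certain minority instance), and all names are our illustrative simplifications of the procedure, not the paper's exact algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mixboost_sketch(X, y, iterations=5, per_iter=4, seed=0):
    """Sketch of a MixBoost-style loop: per iteration, (re)train a classifier
    on the augmented data, pick instance pairs using its prediction entropy,
    and append mixup-style hybrids with soft labels."""
    rng = np.random.default_rng(seed)
    X_aug, y_aug = X.copy(), y.astype(float).copy()
    for _ in range(iterations):
        clf = LogisticRegression().fit(X_aug, np.round(y_aug))
        probs = clf.predict_proba(X)  # score only the original instances
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
        for _ in range(per_iter):
            i = maj[np.argmax(entropy[maj])]    # uncertain majority instance
            j = mino[np.argmin(entropy[mino])]  # confident minority instance
            lam = rng.beta(0.5, 0.5)            # interpolation ratio
            X_aug = np.vstack([X_aug, lam * X[j] + (1 - lam) * X[i]])
            y_aug = np.append(y_aug, lam * 1.0)  # soft hybrid label
    return X_aug, y_aug

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 2.], [2., 3.]])
y = np.array([0, 0, 0, 0, 1, 1])
X_aug, y_aug = mixboost_sketch(X, y)
```

With 5 iterations of 4 hybrids each, the augmented dataset grows by 20 instances, each carrying a soft label between the two hard class labels.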

In Section II we delineate the proposed approach. Next, in Section III we present our experimental setup and evaluation strategy. In Section IV we compare the quantitative performance of MixBoost with multiple strong baselines on 20 benchmark datasets and test the efficacy of our framework through significance tests. In Section V we perform ablation studies to analyze the impact of different design choices. Finally, we summarize related work in Section VI.

II Approach

In this section, we explain in detail the steps of the proposed MixBoost algorithm. In Section II-A we define the terminology used throughout the paper. In Section II-B we provide a high-level overview of the approach, followed by detailed explanations of its stages in Sections II-C and II-D.

II-A Definitions

A binary classification task is characterized by a dataset $X$ and corresponding labels $y$. Throughout the paper, the class label $0$ corresponds to the majority class and the label $1$ to the minority class. Let $N$ be the number of instances in the dataset and $d$ the number of features per instance. The training objective is to learn a function $f$ that maps the instances in $X$ to their corresponding labels $y$. For imbalanced datasets, the number of instances of the majority class far outnumbers that of the minority class. The dataset, denoted by $D$, is split into a training dataset $D_{train}$ and a testing dataset $D_{test}$; individual datapoints from these datasets are referred to as $x_{train}$ and $x_{test}$ respectively. We train a binary classification model $f$ on $D_{train}$ and evaluate it on $D_{test}$ using the G-Mean [16] and ROC-AUC [26] metrics.

MixBoost uses the trained classifier along with the training dataset to create synthetic hybrid instances. These instances are added to the training dataset to create an augmented dataset, which is in turn used to re-train the classifier. This represents a single iteration of MixBoost. MixBoost is run for a fixed number of such iterations to expand the augmented dataset. We report performance by evaluating, on the test dataset, the model trained using the final augmented dataset obtained at the end of all iterations.

II-B MixBoost: High-Level Summary

MixBoost generates synthetic hybrid instances by interpolating instances from the majority and minority classes. A single iteration of MixBoost (summarized in Figure 3) consists of the following steps:

  1. Candidate Selection (Boost): We sample candidate instances from the majority and minority classes in the training dataset prior to mixing. We experiment with both random and guided sampling techniques (Section II-C).

  2. Hybrid Generation (Mix): We mix the sampled instances to generate a set of synthetic hybrid instances (Section II-D).

At the end of a MixBoost iteration, we add the generated synthetic instances to the training dataset and retrain the classifier on the augmented dataset.

For ease of explanation, we focus on generating a single synthetic hybrid instance; the steps to generate several instances follow from this single-instance case. MixBoost creates a synthetic hybrid instance by mixing a majority class instance $x_{maj}$ and a minority class instance $x_{min}$. There are two important considerations in this process. First, how should MixBoost select the best candidates $x_{maj}$ and $x_{min}$ from the training dataset? Second, how should MixBoost intelligently combine $x_{maj}$ and $x_{min}$ to create the synthetic hybrid instance? We first consider the problem of instance selection.

II-C MixBoost: Candidate Selection (the Boost step)

Unlike existing works, MixBoost creates synthetic datapoints that are conditioned on both majority and minority class instances. For generating these synthetic hybrids, we propose two alternative strategies for selecting the candidate instances.

II-C1 R-Selection

First, we propose R-Selection, which randomly selects the majority and minority candidates $x_{maj}$ and $x_{min}$ from the training dataset using uniform random variables. The probability of selecting an instance is equal for all instances within the majority and minority classes. Formally, if $p_{maj}$ is the probability of selecting a given majority class instance and $p_{min}$ is the probability of selecting a given minority class instance, then:

$$p_{maj} = \frac{1}{N_{maj}}, \qquad p_{min} = \frac{1}{N_{min}}$$

where $N_{maj}$ is the number of majority class instances and $N_{min}$ is the number of minority class instances in the training dataset.
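A minimal sketch of R-Selection in numpy (the helper name `r_selection` is ours); each majority index is drawn with equal probability, and likewise for minority indices:

```python
import numpy as np

def r_selection(y, rng=None):
    """R-Selection sketch: pick one majority (label 0) and one minority
    (label 1) index uniformly at random, i.e. with probability 1/N_maj
    and 1/N_min respectively."""
    rng = rng or np.random.default_rng(0)
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    return rng.choice(maj), rng.choice(mino)

y = np.array([0, 0, 0, 0, 1, 1])
i_maj, i_min = r_selection(y)
```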

II-C2 Entropy Weighted (EW) Selection

R-Selection assumes that all synthetic hybrid instances generated by mixing any combination of minority/majority class instances are equally helpful additions to the augmented training dataset. However, it is possible that mixing certain combinations of minority/majority class instances from the training dataset can result in better training performance.

For a given classification model $f$, entropy is a measure of the uncertainty in the model's predictions. Consequently, we introduce an alternate strategy called Entropy-Weighted selection (EW-Selection). EW-Selection posits that selecting candidates from the training dataset by weighting them with the uncertainty in their model predictions will result in the generation of more useful hybrid instances. Formally, if $p(x)$ is the probability distribution over target classes output by the classifier $f$ for the instance $x$ in the training dataset, then the entropy of the sample, denoted by $H(x)$, is computed as

$$H(x) = -\sum_{c} p_c(x) \log p_c(x)$$

where the summation is over the elements of the vector $p(x)$.

For a given data instance, a high $H(x)$ implies that the model is less certain about the class prediction for the instance, while a low $H(x)$ implies that the model is more certain. We hypothesize that augmenting the training dataset with synthetic instances in the vicinity of high-entropy feature sub-spaces (where the model is currently uncertain about its predictions) can improve training performance.

Let $E_{maj}$ denote the sum of entropy values $H(x)$ over majority class instances, and $E_{min}$ the sum over minority class instances. Then the entropy ratios, which give the probability of selecting a majority class instance $x_i$ and a minority class instance $x_j$ respectively for mixing, are computed as

$$p_i = \frac{H(x_i)}{E_{maj}}, \qquad p_j = \frac{H(x_j)}{E_{min}}$$

In practice, we find that selecting one candidate instance with a high entropy ratio and the other with a low entropy ratio leads to synthetic hybrid instances that result in the best performance. MixBoost interleaves this low-high selection with the generation of hybrid instances using the Mix step described below.
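A sketch of EW-Selection under our reading of the low-high rule. Weighting the minority pick by inverse entropy is our assumption for the "low" side; the text only states that one instance should have a high and the other a low entropy ratio:

```python
import numpy as np

def ew_selection(probs, y, rng=None):
    """EW-Selection sketch: an instance's selection weight is its prediction
    entropy normalised by its class's entropy sum (the 'entropy ratio'). The
    majority pick favours high entropy; the minority pick favours low entropy
    (inverse-entropy weighting, our assumption)."""
    rng = rng or np.random.default_rng(0)
    H = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # prediction entropy
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    p_maj = H[maj] / H[maj].sum()          # entropy ratios (majority class)
    inv = 1.0 / (H[mino] + 1e-12)
    p_min_low = inv / inv.sum()            # favour low-entropy minority picks
    i_maj = rng.choice(maj, p=p_maj)       # high-entropy-weighted majority
    i_min = rng.choice(mino, p=p_min_low)  # low-entropy-weighted minority
    return i_maj, i_min

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.4, 0.6], [0.1, 0.9]])
y = np.array([0, 0, 0, 1, 1])
i_maj, i_min = ew_selection(probs, y)
```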

II-D MixBoost: Hybrid Generation (the Mix step)

Let $(x_{maj}, y_{maj})$ and $(x_{min}, y_{min})$ be the instances selected (in the Boost step) from the majority and minority class respectively. We adopt the mixing strategy introduced in [25]:

$$x_s = \lambda \, x_{min} + (1 - \lambda) \, x_{maj}, \qquad y_s = \lambda \, y_{min} + (1 - \lambda) \, y_{maj}$$

$(x_s, y_s)$ is the generated synthetic hybrid instance and $\lambda$ is the interpolation ratio. Intuitively, $\lambda$ controls the extent of overlap between the two instances used to generate the synthetic hybrid instance. MIXUP [25] and BC+ [22] are existing image augmentation techniques that use linear and non-linear interpolations respectively. BC+ samples $\lambda$ from the Uniform distribution, while MIXUP samples $\lambda$ from the Beta$(\alpha, \alpha)$ distribution. For most tasks, MIXUP uses $\alpha = 1$, which also reduces to sampling from a Uniform distribution [20]. In contrast, MixBoost works with skewed data distributions arising from extreme class imbalance. Correspondingly, we explore sampling $\lambda$ from several probability density functions and discuss our findings in Section V.

The resultant synthetic hybrid instances $(x_s, y_s)$ are used to augment the training dataset, and the classifier is (re)trained on the augmented dataset using a cross-entropy loss, with the soft labels $y_s$ for the synthetic instances and the one-hot labels for the original instances as ground truths. In subsequent iterations, this classifier is used to update the entropy values for candidates in the original training dataset to facilitate the next iteration of MixBoost. The entire workflow of our approach is shown in Figure 3.
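The Mix step reduces to a two-line interpolation. A sketch, assuming the convention that the interpolation ratio weights the minority instance (with the minority label encoded as 1):

```python
import numpy as np

def mix(x_maj, x_min, rng=None):
    """Mix step sketch (mixup-style linear interpolation): sample the
    interpolation ratio lambda from Beta(0.5, 0.5), as used in the paper's
    experiments, and blend both features and labels."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(0.5, 0.5)
    x_s = lam * x_min + (1 - lam) * x_maj  # hybrid features
    y_s = lam * 1.0 + (1 - lam) * 0.0      # soft label (minority = 1)
    return x_s, y_s

x_s, y_s = mix(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```

Because the same ratio blends features and labels, a hybrid lying close to the minority instance also carries a label close to 1.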

III Experimental Settings

Fig. 4: Code snippet that illustrates how to use MixBoost in Python.

III-A Datasets

To evaluate MixBoost against state-of-the-art oversampling methods, we compare performance on binary classification on 20 benchmark datasets (Table I; the datasets can be downloaded from [8]). To ensure evaluation consistency, we use the same datasets and imbalance ratios as in [19]. For robust evaluation, we split each dataset into equal training and testing halves. We randomly down-sample the minority class instances in the training dataset to simulate different levels of extreme imbalance. Specifically, we test at three levels of imbalance, with minority training dataset sizes of 4, 7, and 10, to simulate extreme imbalance scenarios [19]. We further skew the data distribution (2 and 3 minority class instances) to show that MixBoost outperforms existing methods even with fewer minority class training instances.
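The down-sampling protocol can be sketched as follows (function name ours): keep every majority training instance and retain only a handful of randomly chosen minority instances, mimicking the R4/R7/R10 settings:

```python
import numpy as np

def downsample_minority(X, y, n_minority, seed=0):
    """Simulate the extreme-imbalance settings: keep all majority (label 0)
    instances and only n_minority randomly chosen minority (label 1) ones."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == 0)
    keep_min = rng.choice(np.flatnonzero(y == 1), size=n_minority,
                          replace=False)
    idx = np.concatenate([maj, keep_min])
    return X[idx], y[idx]

X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)
X_r4, y_r4 = downsample_minority(X, y, n_minority=4)
```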

III-B Classification

For all experiments, we use Naive Bayes, Nearest Neighbour, Decision Tree, Support Vector Machine, and Multi-Layer Perceptron (MLP) binary classifiers. For each dataset, we compare the best performance of each data augmentation method across these binary classifiers against the performance of MixBoost. We use the sklearn [18] package in Python for our experiments; all classifiers use the default parameters. We use tensorflow to implement the MLP. For the MLP, we use the Adam optimizer and a batch size of 500, and train it for 300 epochs. We use the same hyper-parameters for all datasets. We normalize the data before training and testing.

III-C Data Augmentation

We use each data augmentation method to generate $N$ synthetic instances, where $N$ is the number of instances in the training dataset. For MixBoost, we find that generating the instances over 5 iterations ($N/5$ in each iteration) leads to the best results. We sample the interpolation ratio $\lambda$ from a Beta$(0.5, 0.5)$ distribution. We explore the sensitivity of MixBoost to these hyper-parameters in Section V.

III-D Evaluation

We compare data augmentation methods on two metrics. First, we use the G-Mean [16], which is the geometric mean of the True Positive Rate (TPR, for the majority class) and the True Negative Rate (TNR, for the minority class):

$$\text{G-Mean} = \sqrt{TPR \times TNR}$$

We use the G-Mean because it is both immune to imbalance [16] and provides information about extreme imbalance [19]. Second, for completeness, we also compare methods using ROC-AUC scores. We repeat all experiments 30 times and report the mean and standard deviation (rounded to the nearest decimal place) of the metrics for each dataset. Figure 4 includes a code snippet illustrating MixBoost usage.
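The G-Mean metric is straightforward to compute from a confusion matrix. A sketch using sklearn (helper name ours; here the minority class is treated as the positive label 1, one common convention):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    """G-Mean: geometric mean of the true positive rate and the true
    negative rate, computed from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return np.sqrt(tpr * tnr)

# One false positive out of three negatives: TPR = 1, TNR = 2/3
score = g_mean([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```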

We compare MixBoost to Random Over-sampling and Under-sampling (ROS, RUS), SMOTE [6], Borderline-SMOTE (B1, B2) [13], SMOTE with Tomek Links [16], ADASYN [15], and SWIM [19]. To make reporting easier, we merge all re-sampling approaches (ROS, RUS, SMOTE, B1, B2, SMOTE with Tomek Links, ADASYN) as Alternative Re-sampling Techniques (ALT) and report the best performing combination of data augmentation method and classifier for each dataset. We report the results from SWIM [19] separately since it is the most recent work, and it outperforms methods in ALT on most datasets. Finally, Baseline represents the case where no data augmentation is used before training the binary classifier.

Dataset Features No. of majority instances R4 R7 R10
Abalone 9-18 8 689 1:173 1:99 1:69
Diabetes 8 500 1:125 1:72 1:50
Wisconsin 9 444 1:111 1:64 1:45
Wine Q. Red 4 11 1546 1:387 1:221 1:155
Wine Q. White 11 880 1:220 1:126 1:88
Vowel 10 13 898 1:225 1:129 1:90
Pima Indians 8 500 1:125 1:72 1:50
Vehicle 0 18 641 1:160 1:91 1:64
Vehicle 1 18 624 1:156 1:89 1:62
Vehicle 2 18 622 1:155 1:88 1:62
Vehicle 3 18 627 1:156 1:89 1:62
Ring Norm 20 3736 1:934 1:534 1:374
Waveform 21 600 1:150 1:86 1:60
PC4 37 1280 1:320 1:183 1:128
Piechart 37 644 1:161 1:92 1:65
Pizza Cutter 37 609 1:153 1:87 1:61
Ada Agnostic 48 3430 1:858 1:490 1:343
Forest Cover 54 2970 1:743 1:425 1:297
Spam Base 57 2788 1:697 1:399 1:279
Mfeat Karhu. 64 1800 1:450 1:258 1:180
TABLE I: Description of the datasets used in our experiments. To ensure evaluation consistency, we use the same datasets and configuration as proposed by [19]. R4, R7, and R10 denote the ratio of class imbalance (minority:majority) after down-sampling the training datasets to have 4, 7, and 10 minority class instances respectively to simulate the extreme imbalance [19] scenarios (as discussed in Section I)
Dataset Baseline ALT SWIM MixBoost
Abalone 9-18 0.481 0.612 0.723 0.743
Diabetes 0.259 0.509 0.509 0.701
Wisconsin 0.874 0.956 0.958 0.969
Wine Q. Red 4 0.224 0.502 0.535 0.815
Wine Q.White 3v7 0.451 0.572 0.730 0.750
Vowel 10 0.724 0.738 0.812 0.845
Pima Indians 0.276 0.479 0.509 0.700
Vehicle 0 0.534 0.758 0.814 0.900
Vehicle 1 0.541 0.739 0.791 0.735
Vehicle 2 0.450 0.549 0.560 0.880
Vehicle 3 0.402 0.505 0.569 0.651
Ring Norm 0.274 0.933 0.899 0.580
Waveform 0.301 0.701 0.688 0.844
PC4 0.572 0.559 0.611 0.737
PieChart 0.455 0.516 0.576 0.741
Pizza Cutter 0.468 0.506 0.552 0.725
Ada Agnostic 0.451 0.445 0.539 0.690
Forest Cover 0.561 0.554 0.550 0.917
Spam Base 0.440 0.550 0.685 0.872
Mfeat Karhunen 0.274 0.933 0.899 0.927
TABLE II: Comparative G-Mean results (mean) for MixBoost and existing over-sampling methods. These results represent the R4 setting where the training dataset has 4 minority class instances. Baseline refers to the case where the classifier is trained without data augmentation. ALT is the score of the best performing data augmentation strategy (other than SWIM [19] and MixBoost) as described in Section III-D. The best score for each dataset is highlighted in bold.

IV Results

In this section, we present a quantitative analysis of the proposed MixBoost oversampling technique. First, we compare MixBoost with existing state-of-the-art techniques and report results using the G-Mean metric (Table II). Next, for completeness, we report comparative performance on the ROC-AUC metric (Table IV), contrast the different MixBoost instance selection strategies (Table III), and perform significance tests for different levels of imbalance (Section IV-B).

IV-A Quantitative Results

As in previous works, we compare MixBoost with existing state-of-the-art techniques in Table II using the G-Mean metric, which is considered a better metric for evaluating performance under extreme imbalance [19, 16]. The results shown here are obtained by down-sampling to the most extreme imbalance setting, R4, where there are only 4 minority class instances (see Table I). MixBoost outperforms the existing methods on 17 out of the 20 datasets; SWIM achieves the best performance on only one, while ALT achieves superior performance on 2 datasets. For each dataset, ALT reports the best performance among all algorithms other than SWIM and MixBoost, as stated in Section III-D. For MixBoost, the better of the R-Selection and EW-Selection results is reported for each dataset.

Dataset MixBoost
R-Selection EW-Selection
Abalone 9-18 0.743 ± 0.03 0.735 ± 0.06
Diabetes 0.701 ± 0.05 0.560 ± 0.04
Wisconsin 0.960 ± 0.02 0.969 ± 0.08
Wine Q. Red 4 0.714 ± 0.04 0.815 ± 0.08
Wine Q. White 3v7 0.743 ± 0.06 0.750 ± 0.05
Vowel 10 0.845 ± 0.04 0.854 ± 0.07
Pima Indians 0.700 ± 0.05 0.597 ± 0.04
Vehicle 0 0.900 ± 0.05 0.850 ± 0.02
Vehicle 1 0.700 ± 0.03 0.735 ± 0.03
Vehicle 2 0.880 ± 0.02 0.638 ± 0.03
Vehicle 3 0.651 ± 0.06 0.600 ± 0.03
Ring Norm 0.550 ± 0.04 0.580 ± 0.03
Waveform 0.812 ± 0.03 0.844 ± 0.05
PC4 0.720 ± 0.08 0.737 ± 0.04
PieChart 0.611 ± 0.06 0.741 ± 0.07
Pizza Cutter 0.725 ± 0.05 0.678 ± 0.07
Ada Agnostic 0.690 ± 0.02 0.648 ± 0.03
Forest Cover 0.910 ± 0.05 0.917 ± 0.02
Spam Base 0.872 ± 0.03 0.834 ± 0.05
Mfeat Karhunen 0.888 ± 0.07 0.927 ± 0.05
TABLE III: Comparative G-Mean results (mean ± standard deviation over 30 independent runs) for the different candidate selection strategies of our approach MixBoost (refer to Section IV for details).
Dataset SWIM MixBoost
Abalone 9-18 0.790 ± 0.08 0.820 ± 0.06
Diabetes 0.778 ± 0.03 0.800 ± 0.02
Wisconsin 0.899 ± 0.08 0.972 ± 0.01
Wine Q. Red 4 0.950 ± 0.03 0.971 ± 0.03
Wine Q. White 3 vs 7 0.640 ± 0.09 0.783 ± 0.05
Vowel 10 0.895 ± 0.05 0.958 ± 0.03
Pima Indians 0.778 ± 0.03 0.800 ± 0.02
Vehicle 0 0.628 ± 0.04 0.896 ± 0.07
Vehicle 1 0.592 ± 0.06 0.750 ± 0.02
Vehicle 2 0.824 ± 0.03 0.921 ± 0.03
Vehicle 3 0.585 ± 0.05 0.664 ± 0.03
Ring Norm 0.704 ± 0.24 0.892 ± 0.08
Waveform 0.828 ± 0.02 0.869 ± 0.02
PC4 0.686 ± 0.09 0.794 ± 0.03
PieChart 0.661 ± 0.06 0.743 ± 0.08
Pizza Cutter 0.662 ± 0.08 0.735 ± 0.04
Ada Agnostic 0.671 ± 0.05 0.798 ± 0.02
Forest Cover 0.881 ± 0.07 0.970 ± 0.02
Spam Base 0.769 ± 0.12 0.835 ± 0.07
Mfeat Karhunen 0.980 ± 0.01 0.992 ± 0.00
TABLE IV: ROC-AUC scores (mean ± standard deviation) for the EW-Selection strategy of MixBoost and for SWIM. The best score for each dataset is highlighted in bold.

Table III compares the performance of the different selection strategies used in MixBoost. The relative performance of the EW-Selection and R-Selection strategies indicates that, in several cases, leveraging an entropy prior during candidate selection leads to the creation of more useful synthetic hybrid candidates than selecting instances at random. Interestingly, both R-Selection and EW-Selection independently outperform existing methods on 17 out of 20 datasets.

Table IV reports comparative results on the ROC-AUC metric. For ease of analysis, we restrict the comparison to SWIM, since it is the strongest baseline, consistently outperforms the other techniques, and is the most recent state-of-the-art method. The results highlight that MixBoost outperforms SWIM on the ROC-AUC metric as well, achieving superior performance on all 20 datasets.

To study the sensitivity of MixBoost, we analyze and compare the standard deviations for the different MixBoost variants along with SWIM in Tables III and IV. The results show that MixBoost has a mean standard deviation (over all 20 datasets) of 0.047, which is comparable to SWIM's 0.044. In the next section, we also probe the statistical significance of the observed results, with respect to the strongest baseline (SWIM), across varying levels of class imbalance.

Fig. 5: Posteriors for SWIM versus MixBoost (left and right vertex) on the datasets with 4, 7 and 10 minority class instances ((a), (b), and (c)) for the Bayesian sign-rank test. A higher concentration of points on one of the sides of the triangle shows that a given method has a higher probability of being statistically significantly better [19]. The top vertex indicates the case where neither method is statistically significantly better than the other.

IV-B Statistical Significance

We use the Bayesian signed test [4] to evaluate the results presented in the previous section. The Bayesian signed test is used to compare two classification methods over several datasets; we use it to compare MixBoost to SWIM over all 20 datasets, for training datasets down-sampled to 4, 7, and 10 minority class instances. Using a Bayesian method enables us to ask questions about posterior probabilities that we cannot answer using null hypothesis tests. Concretely, we wish to ascertain, based on the experiments, the probability that MixBoost is better than SWIM for data augmentation. We repeat the setup described in [19], where, based on the assumption of the Dirichlet process, the posterior probability for the Bayesian signed test is calculated as a mixture of Dirac deltas centered on the observations.

Figure 5 shows the posterior plots of the Bayesian signed test for the three down-sampled datasets. The posteriors are calculated with the prior parameters of the Dirichlet process set as suggested by [19]. The posterior plots report the samples from the posterior (blue cloud of points), the simplex (the large triangle), and three regions. The three regions of the triangle denote the following: (1) the region in the bottom left indicates the case where it is more probable that SWIM is better than MixBoost; (2) the region in the bottom right indicates the opposite, i.e., that it is more probable that MixBoost is better than SWIM; (3) the region at the top indicates that it is likely that neither method is better. The closer the points are to the sides of the triangle, the larger the statistical difference between the methods. Figure 5 shows that the probability that MixBoost is better than SWIM for data augmentation is high for all three down-sampled training datasets. For all three plots, the point cloud is concentrated in the bottom-right region, indicating that MixBoost is statistically significantly better than SWIM.

V Ablation Studies

In this section, we evaluate the specific impact of different components of MixBoost and measure the sensitivity of our framework to hyper-parameters.

To balance the consistency of results and ease of analysis and presentation, we restrict our discussion to five datasets, namely Pima Indians, Waveform, PC4, Piechart and Forest Cover. These datasets are selected to ensure maximum diversity in feature dimensionality (ranging from 8 features for Pima Indians to 54 for Forest Cover) and the extent of class imbalance (ranging from 1:125 for Pima Indians to 1:743 for Forest Cover).

For each study (unless otherwise specified), we augment the training dataset with $N$ synthetic hybrid instances, where $N$ is the number of instances in the training dataset. As in the main experiments, these synthetic hybrid instances are generated equally over five MixBoost iterations ($N/5$ instances in each iteration). For the scope of this analysis, we use the EW-Selection technique in MixBoost and sample the interpolation ratio $\lambda$ from a Beta$(0.5, 0.5)$ distribution.

V-A Impact of sampling instances over multiple iterations

In this experiment, we evaluate the importance of generating synthetic hybrid instances over multiple iterations. We define a variant of MixBoost, called MixBoost-1-Iteration (denoted MixBoost-1-Iter), where all the synthetic instances are generated in a single iteration. This single-step generation is in contrast to the MixBoost configuration used previously, where the instances are generated equally over five iterations.

We augment the training dataset with the generated instances and train the binary classifier on the augmented dataset. Table V shows the results. We observe that sampling instances over multiple (five) iterations in MixBoost outperforms the single-step variant on each dataset. This result corroborates the importance of the iterative (Boost) characteristic of MixBoost.

Dataset        MixBoost-1-Iter   MixBoost
Pima Indians   0.491 ± 0.02      0.597 ± 0.04
Waveform       0.843 ± 0.04      0.844 ± 0.05
PC4            0.613 ± 0.08      0.737 ± 0.04
Piechart       0.721 ± 0.02      0.741 ± 0.07
Forest Cover   0.910 ± 0.00      0.917 ± 0.02
TABLE V: Scores for the single-step variant (MixBoost-1-Iter) and the proposed iterative MixBoost. For all selected datasets, the iterative variant of MixBoost outperforms the single-step one.

V-B Impact of choice of distributions for sampling

For sampling the interpolation ratio λ, we experiment with linear and non-linear Probability Density Functions (PDFs). Table VI shows the results. On all datasets, using a non-linear PDF (Beta with α = β = 0.5) in the Mix step leads to better classifier performance than a linear (Uniform) PDF.
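The practical difference between the two PDFs is easy to verify numerically: both are symmetric with mean 1/2, but the U-shaped Beta(0.5, 0.5) places far more mass near 0 and 1, so most hybrids stay close to one of the two parent instances (a standalone NumPy sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

lam_uniform = rng.uniform(0.0, 1.0, size=n)   # linear (flat) PDF
lam_beta = rng.beta(0.5, 0.5, size=n)         # non-linear, U-shaped PDF

# Both distributions are symmetric with mean 1/2, but Beta(0.5, 0.5)
# concentrates mass near 0 and 1, so extreme interpolation ratios
# (hybrids close to one parent instance) are sampled far more often.
frac_extreme_uniform = np.mean((lam_uniform < 0.1) | (lam_uniform > 0.9))
frac_extreme_beta = np.mean((lam_beta < 0.1) | (lam_beta > 0.9))
```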

Dataset        Uniform        Beta
Pima Indians   0.389 ± 0.08   0.597 ± 0.04
Waveform       0.661 ± 0.01   0.844 ± 0.05
PC4            0.242 ± 0.01   0.737 ± 0.04
Piechart       0.312 ± 0.07   0.741 ± 0.07
Forest Cover   0.629 ± 0.01   0.917 ± 0.02
TABLE VI: Scores when we sample λ from different distributions. For all datasets, sampling λ from the Beta(0.5, 0.5) distribution leads to the best performance.

V-C Impact of generating hybrid-instances

Existing approaches augment training datasets by oversampling the minority class instances. MixBoost generates synthetic hybrid instances that have elements of both the minority and the majority class. Concretely, in the Mix step, we generate instances whose target class vector lies in-between the majority class vector and the minority class vector. Alternatively, in the Mix step, we could have generated synthetic instances whose target class vector corresponded to either the majority or the minority class. To evaluate this alternative, we create a variant of MixBoost that assigns either the majority class vector or the minority class vector to the generated instance, depending on which side of 0.5 the interpolation ratio λ falls. We label this variant MixBoost-OneHot.
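The two labeling schemes can be contrasted in a few lines (an illustrative NumPy snippet; the class-vector conventions and the 0.5 snapping threshold for the one-hot variant are our assumptions, not the authors' exact implementation):

```python
import numpy as np

# Class-vector conventions (assumed): majority = [1, 0], minority = [0, 1].
Y_MAJORITY = np.array([1.0, 0.0])
Y_MINORITY = np.array([0.0, 1.0])

def mix_labels(lam, one_hot=False):
    """Target vector for a synthetic hybrid built with interpolation
    ratio lam (weight on the minority parent). MixBoost keeps the soft,
    in-between label; the MixBoost-OneHot variant snaps to the dominant
    parent class (assumed threshold of 0.5)."""
    if one_hot:
        return Y_MINORITY if lam > 0.5 else Y_MAJORITY
    return lam * Y_MINORITY + (1.0 - lam) * Y_MAJORITY

hybrid_label = mix_labels(0.3)                 # soft, in-between label
one_hot_label = mix_labels(0.3, one_hot=True)  # snaps to the majority class
```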

Dataset        MixBoost-OneHot   MixBoost
Pima Indians   0.458 ± 0.03      0.597 ± 0.04
Waveform       0.831 ± 0.01      0.844 ± 0.05
PC4            0.698 ± 0.07      0.737 ± 0.04
Piechart       0.566 ± 0.06      0.741 ± 0.07
Forest Cover   0.898 ± 0.02      0.917 ± 0.02
TABLE VII: Scores comparing MixBoost to a variant that generates non-hybrid instances (MixBoost-OneHot) in the Mix step. For all considered datasets, generating hybrid instances improves the performance of the binary classifier.

Table VII shows the results comparing MixBoost to MixBoost-OneHot. We observe that generating hybrid instances with in-between target class labels (MixBoost) outperforms generating instances with either a majority or a minority class label (MixBoost-OneHot). We posit that training with hybrid instances that are not explicitly attributed to either class enables the learning of more robust decision boundaries by regulating the confidence of the classifier away from the training distribution.

V-D Impact of decreasing number of minority class instances

Since MixBoost uses an intelligent selection technique for generating synthetic hybrid instances, we study whether it outperforms existing approaches when fewer minority class training instances are available. We generate variations of the training dataset with 2 and 3 minority class instances. We then run MixBoost on these variations and train a classifier on the augmented dataset. Table VIII shows the results. MixBoost achieves better results than SWIM while using fewer minority class training examples. For instance, for the Waveform and Pima Indians datasets, MixBoost outperforms SWIM using half the number of minority class instances (2 vs. 4).

Dataset        SWIM (4)       MixBoost (2)   MixBoost (3)   MixBoost (4)
Pima Indians   0.499 ± 0.13   0.534 ± 0.08   0.616 ± 0.02   0.597 ± 0.04
Waveform       0.652 ± 0.03   0.693 ± 0.06   0.727 ± 0.03   0.844 ± 0.05
PC4            0.661 ± 0.06   0.593 ± 0.06   0.688 ± 0.02   0.737 ± 0.04
PieChart       0.612 ± 0.05   0.533 ± 0.13   0.579 ± 0.09   0.741 ± 0.07
Forest Cover   0.522 ± 0.03   0.452 ± 0.05   0.643 ± 0.03   0.917 ± 0.02
TABLE VIII: Scores for data augmentation using MixBoost and SWIM as we reduce the number of minority class instances by down-sampling (shown in parentheses). MixBoost achieves the best performance on all the datasets. Additionally, MixBoost outperforms SWIM on 4/5 datasets when using fewer minority class instances than SWIM (3 vs. 4).

V-E Impact of number of generated synthetic hybrid-instances

MixBoost generates synthetic instances iteratively. At each iteration, the classifier is retrained using all instances generated up to that iteration. We expect the marginal gain of generating instances to decrease as more instances are generated. Figure 6 validates this expectation: the performance of the classifier plateaus as MixBoost adds more synthetic instances to the training dataset. This plateau in performance is important for the practical deployment of MixBoost.

Fig. 6: Variation in scores (averaged over 30 runs) for MixBoost as we increase the number of generated synthetic hybrid instances. N represents the total number of training instances in the original dataset. The different colored lines indicate different datasets. The gain from generating additional synthetic instances is initially high and then falls gradually as the number of generated instances increases.

V-F Impact of number of minority class instances

MixBoost mixes an instance of the minority class with an instance of the majority class to create a synthetic hybrid instance. We expect the quality of the generated synthetic instances, which we measure by classifier performance, to improve as we increase the number of minority class instances in the training dataset. Figure 7 confirms that classifier performance improves as the number of minority class instances in the training data increases.

Fig. 7: Variation in scores (averaged over 30 runs) for different numbers of minority class instances using MixBoost. For each dataset, classifier performance increases with the number of minority class instances.

VI Related Work

There are broadly two categories of approaches for dealing with classification problems on imbalanced datasets: sampling-based approaches and cost-based approaches [9, 23]. In this paper, we focus on sampling-based approaches. The most straightforward re-sampling strategies are Random Under-Sampling (RUS) and Random Over-Sampling (ROS). They balance the class distribution by either randomly deleting instances of the majority class (RUS) or randomly duplicating instances of the minority class (ROS). However, deleting instances from the training dataset can lead to a loss of information, and duplicating instances of the minority class does not add any new information.
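For reference, RUS and ROS amount to a few lines of index sampling (a NumPy sketch on toy data, not a library implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_under_sample(X_majority, n_minority):
    """RUS: randomly delete majority instances until the classes match."""
    keep = rng.choice(len(X_majority), size=n_minority, replace=False)
    return X_majority[keep]

def random_over_sample(X_minority, n_majority):
    """ROS: randomly duplicate minority instances until the classes match."""
    dup = rng.choice(len(X_minority), size=n_majority, replace=True)
    return X_minority[dup]

X_majority = rng.normal(size=(1000, 4))    # toy imbalanced dataset
X_minority = rng.normal(size=(10, 4))
X_maj_rus = random_under_sample(X_majority, len(X_minority))
X_min_ros = random_over_sample(X_minority, len(X_majority))
```

Sampling without replacement discards majority information, while sampling with replacement only repeats existing minority points, which motivates the synthetic approaches discussed next.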

SMOTE (Synthetic Minority Oversampling Technique) [6] alleviates these shortcomings. SMOTE balances the class distribution by generating synthetic data instances. It generates a synthetic instance by interpolating between a minority class instance and one of its k-nearest minority class neighbors in the training data. SMOTE has proved useful in a variety of contexts [6, 7, 13]. However, since SMOTE generates synthetic instances using only minority class instances, the generated instances lie within the convex hull of the minority class distribution. Further, because it does not use majority class instances, SMOTE can increase the overlap between classes. Therefore, if the classes are extremely imbalanced, SMOTE can degrade classifier performance. Extensions of SMOTE [10] add a post-processing step that tries to remove generated instances that might degrade the performance of the classifier. These methods incorporate the majority class distribution into the post-processing step of the data augmentation process. For instance, Adaptive Synthetic Oversampling (ADASYN) [15], borderline SMOTE [13], and Majority Weighted Minority Oversampling [14] use majority class instances present in the neighborhood of generated instances for post-processing. However, these methods use only the minority class distribution when generating synthetic instances. Therefore, if the number of minority class instances is very small, the quality of generated instances and, consequently, the performance of the classifier are degraded. [2] create a variant of SMOTE by using the Mahalanobis distance (instead of the Euclidean distance) for generating synthetic instances. However, as with SMOTE, they generate instances using only minority class data. SWIM [19] uses information from the majority class to generate synthetic instances, and the authors show that the technique outperforms previous methods across several benchmark datasets at extreme levels of class imbalance.
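The core SMOTE interpolation step described above can be sketched as follows (an illustrative NumPy snippet, not the reference implementation from [6]); note that the output is a convex combination of two minority instances, so it necessarily stays inside the minority convex hull:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_min, k=3):
    """Generate one SMOTE-style synthetic instance: pick a minority
    instance, pick one of its k nearest minority neighbors, and
    interpolate a random fraction of the way between them."""
    i = rng.integers(len(X_min))
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]    # skip the instance itself
    j = rng.choice(neighbors)
    gap = rng.uniform(0.0, 1.0)
    return X_min[i] + gap * (X_min[j] - X_min[i])

X_min = rng.normal(size=(20, 5))              # toy minority class
synthetic = smote_sample(X_min)
```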


[21] approach the problem of generating synthetic instances from a different direction. They use a Generative Adversarial Network (GAN) trained on minority class data to generate synthetic training instances. However, their method is unable to consistently outperform existing state-of-the-art methods (see the results in [21]). Further, training a GAN for each dataset requires significant compute and hyper-parameter tuning. For these reasons, we do not compare MixBoost to their method.

VII Conclusion

In the present work, we tackle the problem of binary classification on extremely imbalanced datasets. To this end, we propose MixBoost, a technique for iterative synthetic oversampling. MixBoost intelligently selects and then combines instances from the majority and minority classes to generate synthetic hybrid instances. We show that MixBoost outperforms existing methods on several benchmark datasets. We also evaluate the impact of the different components of MixBoost through ablation studies. As directions of future study, we wish to evaluate MixBoost for multi-class classification and to adapt the idea of iterative sampling and generation through interleaved Boost and Mix steps to regression tasks.


  • [2] Lida Abdi and Sattar Hashemi. 2015. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE transactions on Knowledge and Data Engineering 28, 1 (2015), 238–251.
  • [3] Ali Al-Shahib, Rainer Breitling, and David Gilbert. 2005. Feature selection and the class imbalance problem in predicting protein function from sequence. Applied Bioinformatics 4, 3 (2005), 195–203.
  • [4] Alessio Benavoli, Giorgio Corani, Francesca Mangili, Marco Zaffalon, and Fabrizio Ruggeri. 2014. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process.. In ICML (JMLR Workshop and Conference Proceedings), Vol. 32. JMLR.org, 1026–1034. http://dblp.uni-trier.de/db/conf/icml/icml2014.html#BenavoliCMZR14
  • [5] Kwabena Ebo Bennin, Jacky Keung, Passakorn Phannachitta, Akito Monden, and Solomon Mensah. 2017. Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Transactions on Software Engineering 44, 6 (2017), 534–550.
  • [6] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
  • [7] Nitesh V Chawla, Aleksandar Lazarevic, Lawrence O Hall, and Kevin W Bowyer. 2003. SMOTEBoost: Improving prediction of the minority class in boosting. In European conference on principles of data mining and knowledge discovery. Springer, 107–119.
  • [8] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository.
  • [9] Charles Elkan. 2001. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, Vol. 17. Lawrence Erlbaum Associates Ltd, 973–978.
  • [10] Alberto Fernández, Salvador Garcia, Francisco Herrera, and Nitesh V Chawla. 2018. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research 61 (2018), 863–905.
  • [11] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. 2011. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 4 (2011), 463–484.
  • [12] Salvador García and Francisco Herrera. 2009. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary computation 17, 3 (2009), 275–306.
  • [13] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. 2005a. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing. Springer, 878–887.
  • [14] H He, Y Bai, E Garcia, and S ADASYN Li. 2008a. Adaptive synthetic sampling approach for imbalanced learning. IEEE International Joint Conference on Neural Networks, 2008 (IEEE World Congress on Computational Intelligence).
  • [15] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008b. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 1322–1328.
  • [16] Miroslav Kubat, Stan Matwin, et al. 1997a. Addressing the curse of imbalanced training sets: one-sided selection. In Icml, Vol. 97. Nashville, USA, 179–186.
  • [17] S Miri Rostami and M Ahmadzadeh. 2018. Extracting predictor variables to construct breast cancer survivability model with class imbalance problem. Journal of AI and Data Mining 6, 2 (2018), 263–276.
  • [18] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research 12, Oct (2011), 2825–2830.
  • [19] Shiven Sharma, Colin Bellinger, Bartosz Krawczyk, Osmar Zaiane, and Nathalie Japkowicz. 2018. Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 447–456.
  • [20] Cecilia Summers and Michael J Dinneen. 2019. Improved mixed-example data augmentation. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1262–1270.
  • [21] Fabio Henrique Kiyoiti dos Santos Tanaka and Claus Aranha. 2019. Data Augmentation Using GANs. arXiv preprint arXiv:1904.09135 (2019).
  • [22] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Between-Class Learning for Image Classification. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017), 5486–5494.
  • [23] Huihui Wang, Yang Gao, Yinghuan Shi, and Hao Wang. 2016. A fast distributed classification algorithm for large-scale imbalanced data. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1251–1256.
  • [24] Shuo Wang, Leandro L Minku, and Xin Yao. 2014. Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering 27, 5 (2014), 1356–1368.
  • [25] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
  • [26] Andrew P. Bradley. 1997. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recogn. 30, 7 (July 1997), 1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2
  • [27] Shiven Sharma, Anil Somayaji, and Nathalie Japkowicz. 2018b. Learning over subconcepts: Strategies for 1-class classification. Computational Intelligence 34, 2 (2018), 440–467.
  • [28] Diding Suhandy and Meinilwita Yulia. 2017. The use of partial least square regression and spectral data in UV-visible region for quantification of adulteration in Indonesian palm civet coffee. International journal of food science 2017 (2017).
  • [29] Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449–475.