Building classifiers for imbalanced datasets is a difficult task. A dataset is said to be balanced when it has approximately the same number of samples from all classes. A well-balanced dataset provides a fair view to the classifier, helping it learn decision boundaries without any bias. Generally, the goal of classifiers is to maximize the accuracy of predictions. When a dataset is imbalanced, labeling new samples as belonging to the majority class decreases the likelihood of making mistakes during prediction. For instance, if the minority class makes up just 1% of the dataset, predicting every data point as belonging to the majority class will lead to a 99% accuracy. Consequently, the classification of minority samples is highly compromised. However, in sensitive applications like fraud detection, medical diagnosis, and the detection of defects in production, the minority class with rare instances is of greater interest. Thus, several methods have been proposed to balance datasets before feeding them to the classifier. For the scope of this paper, we restrict our discussion to the binary classification task only. We will assume the minority class to be positive and the majority class to be negative.
The techniques used for balancing datasets can be broadly classified into two categories: data-level and algorithm-level. Data-level techniques focus on oversampling the minority class or under-sampling the majority class, either randomly or in a directed manner. Algorithm-level techniques aim to alter the costs of the various classes, adjust the decision threshold, or use ensemble learning. One of the most popular data-level techniques, SMOTE, creates new minority examples by interpolating between minority observations in the original dataset. An extension of this technique, ADASYN, assigns weights to different minority class examples based on the difficulty of learning those examples.
In this paper, we present a novel data-level approach based on Genetic Algorithms that considers both the difficulty of learning the features of an example and the performance improvement caused by oversampling it during the process of resampling. The process of oversampling a dataset to make it balanced can be considered analogous to a population growing and evolving over time. Since genetic algorithms replicate the process of natural selection and evolution in computational problems, their application to imbalanced datasets could be expected to perform well. This expectation is supported by the findings of this paper, which show better classification results in terms of F1 score in 8 of the 9 datasets we experimented with. We believe that it is not necessary to completely balance a dataset for its features to be understood by a classifier. In fact, adding more artificial data than necessary for proper classification can sometimes degrade performance. Thus, our algorithm terminates when it encounters a decline in the F1 score, thereby ensuring that the results will not be worse than the original classifier in the case where they cannot be improved.
The rest of this paper is organized as follows: We introduce GenSample, our genetic oversampling algorithm, in Section III and discuss it in terms of the selection, crossover, and mutation operations of Genetic Algorithms. In Section IV, we perform experiments on 9 real-world datasets to evaluate the performance of GenSample. We compare the results with 3 benchmark algorithms (the naive Decision Tree, SMOTE, and ADASYN) and show that GenSample performs better in general.
II Related Work
Traditional ways to handle an imbalance in datasets include data-level techniques like under-sampling the majority class or oversampling the minority class, and algorithm-level techniques like cost-sensitive learning and ensemble methods.
Random undersampling eliminates some majority class examples, leading to a better imbalance ratio in the dataset. The concept of Tomek Links, introduced by Tomek , was one of the first systematic procedures for undersampling, focusing on eliminating borderline majority examples. Kubat et al. extend the concept of Tomek Links to remove majority class samples while leaving the minority class as it is, a method called ‘One-sided selection’ . Elhassan et al. also combine random undersampling with Tomek Links to balance datasets . However, undersampling tends to discard important features of the data. Thus, it is rarely used as a standalone technique; instead, it is either coupled with oversampling or avoided.
The simplest oversampling techniques work by duplicating minority class entities, i.e., by generating new data points that are replicas of existing ones. The advantage of this technique is that it is extremely safe: we only add examples that are valid observations, and by repeatedly presenting them to the classifier, we help it learn these rare features better. Chawla et al. proposed a novel oversampling technique called SMOTE (Synthetic Minority oversampling Technique) which systematically generates new minority data along the line joining each minority point and one of its nearest neighbors . Han et al. extended the idea of SMOTE to Borderline-SMOTE, which focuses on strengthening the data points near the decision boundary, oversampling the borderline points the most . ADASYN, presented by He et al. , incorporates the concept of adaptively synthesizing new points depending on the ratio of majority class samples in the k-neighborhood of a point. This idea of distinguishing between different types of examples was also proposed in .
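SMOTE's interpolation step described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the reference one; the function name and the uniform choice of neighbor are assumptions:

```python
import numpy as np

def smote_sample(x, minority_neighbors, rng=None):
    """Generate one synthetic minority point on the line segment between
    a minority example x and a randomly chosen minority nearest neighbor."""
    rng = np.random.default_rng(rng)
    nb = minority_neighbors[rng.integers(len(minority_neighbors))]
    gap = rng.random()  # position along the segment, in [0, 1)
    return x + gap * (nb - x)
```

Because the synthetic point is a convex combination of two real minority points, each of its coordinates stays within the range spanned by the two parents.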
II-C Algorithm-Level Approaches
Cost-Sensitive Learning tackles the imbalance problem by penalizing the misclassification of an example based on its class. This allows us to place more emphasis on correctly identifying the instances of the minority class by giving that class a higher weight. Ensemble methods have also proven to be promising for handling imbalanced datasets. For instance, SMOTEBoost oversamples the minority class by leveraging SMOTE at each step of boosting while learning the weak classifier .
The use of metaheuristic algorithms for handling imbalance in datasets has been rare. Yu et al. presented ACOSampling, which uses Ant Colony Optimization for undersampling in DNA microarray data . A recent algorithm, GASMOTE, proposed by Jiang et al. in , augments SMOTE with a genetic algorithm to perform oversampling. In this method, an individual in the population is a sequence of sampling rates for all minority instances. These individuals are evolved until the optimal oversampling rate for each example is reached.
III GenSample Algorithm
As described by Napierala and Stefanowski , minority class examples can be divided into four categories: safe, borderline, rare, and outliers. Safe examples are located in homogeneous regions of similar examples and are easier to classify. Borderline examples are the ones close to the decision boundary, causing more difficulty during classification. Rare examples are the outlying pairs or triplets of minority class points corresponding to under-represented minority regions, making them more fit for resampling than the former two types. Outliers, as the name suggests, are the scarce minority examples scattered in regions of the majority, representing either noise or a valid, extremely rare subconcept. There are contrasting opinions on whether outliers should be discarded or resampled. However, many studies on medical problems have shown that these outliers are authentic, under-represented minority samples. Hence, GenSample considers these points the most difficult to learn and oversamples them the most.
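The four categories can be assigned from the fraction of majority samples among a point's k nearest neighbors. The cut-offs below follow the common 5-NN scheme (0-1 majority neighbors: safe; 2-3: borderline; 4: rare; 5: outlier) and are an assumption, not values taken from this paper:

```python
def minority_type(majority_ratio):
    """Categorize a minority example by the fraction of majority samples
    among its k nearest neighbors (assumed thresholds for k = 5)."""
    if majority_ratio < 0.3:
        return "safe"        # mostly minority neighborhood
    elif majority_ratio < 0.7:
        return "borderline"  # near the decision boundary
    elif majority_ratio < 1.0:
        return "rare"        # small, isolated minority group
    return "outlier"         # surrounded entirely by the majority
```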
The main idea of the GenSample algorithm is to iteratively learn which minority samples are best suited for resampling. We set aside one-third of the training data as a validation dataset. The algorithm first generates the initial population, which consists of all the minority examples from the training data. It then uses the fitness function described in Section III-B to pick the fittest individual of the population as the first parent for crossover. The second parent is randomly selected from the k-nearest minority neighbors of the first parent. An interpolation between the two parents, similar to SMOTE, produces two children, each of which is evaluated for its fitness. The fitter of the two children replaces the least fit individual of the population. Since each minority example is important to our classifier, we eliminate the least fit individual only from the population, not from the original dataset used for classification. This can be interpreted as considering that individual unfit for reproduction but fit enough to survive. Thus, we can view the population as the set of individuals who participate in crossover. During evaluation, we fit a Decision Tree classifier on the entire training data, not just the population, and test the fitted model on the validation dataset. Each of the above-mentioned steps is elaborated in the following sections and formally presented in Algorithm 1.
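One generation of the loop described above can be sketched as follows. The fitness function is passed in as a stand-in, since the real fitness requires training a classifier and scoring it on the validation set; the helper name and the interpolation range are assumptions:

```python
import numpy as np

def gensample_step(population, fitness_fn, k=5, rng=None):
    """One GenSample generation (sketch): pick the fittest individual, mate
    it with a random one of its k nearest minority neighbors, and let the
    fitter child replace the least fit member of the breeding population."""
    rng = np.random.default_rng(rng)
    fits = np.array([fitness_fn(x) for x in population])
    p1 = population[int(np.argmax(fits))]       # fittest parent
    dists = np.linalg.norm(population - p1, axis=1)
    order = np.argsort(dists)[1:k + 1]          # k nearest neighbors, excluding p1
    p2 = population[rng.choice(order)]          # random mate
    delta = rng.uniform(0.0, 0.5)               # each child stays near one parent
    c1 = p1 + delta * (p2 - p1)
    c2 = p2 + delta * (p1 - p2)
    child = c1 if fitness_fn(c1) >= fitness_fn(c2) else c2
    new_pop = population.copy()
    new_pop[int(np.argmin(fits))] = child       # unfit for reproduction only
    return new_pop, child
```

Note that the replaced individual leaves only the breeding pool here; in the full algorithm it remains in the training data used for classification.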
III-A Initial Population
As mentioned before, the initial population consists of all the minority class samples as individuals. To evaluate the dataset during resampling, we use the F1 score measured on the validation dataset.
III-B Fitness Function
The fitness of an individual depends on the type of minority class example it is, i.e. safe, borderline, rare or outlier, as well as the amount of performance improvement achieved by oversampling it. The more challenging it is to classify an example, the more it should be resampled. Consequently, its fitness value should be high. The fitness function interpolates between the two above measures as follows:
fitness(x_i) = α · D(x_i) + (1 − α) · ΔF1(x_i)

where α is a constant such that 0 ≤ α ≤ 1. D(x_i) depends on the category of the minority example x_i and is calculated using the number of majority class samples in the k-neighborhood of x_i. ΔF1(x_i) is the change in F1 score produced by resampling x_i.
The difficulty weight is assigned according to the category of the example, which is determined by the ratio of majority samples in the k-neighborhood of the point under consideration: outliers receive the highest weight, followed by rare, borderline, and safe examples. The minority label weights used are based on empirical results. A small randomization ensures that samples belonging to the same category do not end up with the same fitness value.
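The interpolation between the difficulty weight and the F1 improvement can be sketched as below. The parameter and argument names are illustrative, with the default of 0.75 taken from the value reported in the experiments:

```python
def fitness(difficulty_weight, delta_f1, alpha=0.75):
    """Fitness of a minority example: a convex combination of its learning
    difficulty and the F1-score change obtained by resampling it."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * difficulty_weight + (1.0 - alpha) * delta_f1
```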
After calculating the fitness of all the individuals in the population, the one with the highest fitness is selected as the first parent for crossover. The second parent is randomly picked from the k nearest minority neighbors of the first parent, where k can be selected by cross-validation.
The crossover mechanism generates children by interpolating between the parents, similar to SMOTE. If we draw a line joining the two parents, the newly generated sample will lie somewhere on the line segment between the parent points.
However, if the fittest individual is an outlier, its nearest neighbors are going to be very far from it. When we randomly select a point along the line, we might not generate a point close to the outlier at all. To ensure that this does not happen, we produce two children, each one closer to one of the parents:
child_1 = parent_1 + δ · (parent_2 − parent_1)
child_2 = parent_2 + δ · (parent_1 − parent_2)

Here, δ is drawn uniformly from (0, 0.5). Hence, each child will be closer to one parent than the other.
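The two-child crossover can be written directly from the description above; drawing the interpolation factor below 0.5 is what guarantees each child lands nearer its own parent (a minimal sketch, with assumed function and variable names):

```python
import numpy as np

def crossover(p1, p2, rng=None):
    """Produce two children on the segment between p1 and p2, one closer
    to each parent (interpolation factor restricted to [0, 0.5))."""
    rng = np.random.default_rng(rng)
    delta = rng.uniform(0.0, 0.5)
    c1 = p1 + delta * (p2 - p1)  # closer to p1
    c2 = p2 + delta * (p1 - p2)  # closer to p2
    return c1, c2
```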
The fitness of the children is evaluated by adding them to the dataset one after the other. We then observe which child causes the greater performance improvement, and the fitter child replaces the least fit individual in the population. However, the least fit individual is not removed from the dataset because it could contain important information. It is only considered unfit for reproduction.
The aim of mutation in a genetic algorithm is to maintain diversity in the population and ensure that the optimization does not get stuck in a local maximum. We use the explore-exploit trade-off to prevent premature convergence. The selection function exploits its current path by choosing the fittest individual in the population most of the time. But, with a small probability, it might choose to explore by picking a random individual from the population as the first parent. This ensures that a potentially promising parent with a low fitness value is given a chance to increase its fitness, and that the same individual is not picked indefinitely as the parent.
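The epsilon-greedy selection described above can be sketched as follows; the paper only says "a small probability", so epsilon = 0.1 is an assumed default:

```python
import numpy as np

def select_parent(population, fitnesses, epsilon=0.1, rng=None):
    """Exploit by returning the fittest individual most of the time;
    with probability epsilon, explore by returning a random individual."""
    rng = np.random.default_rng(rng)
    if rng.random() < epsilon:
        idx = int(rng.integers(len(population)))  # explore
    else:
        idx = int(np.argmax(fitnesses))           # exploit
    return population[idx]
```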
The above-mentioned steps are repeated until one of the following two terminating conditions is met:
The desired imbalance ratio is reached
Adding a new sample causes a degradation in performance
The second condition ensures that we do not add samples beyond what is needed for classification. Oversampling more than necessary can lead to ambiguities in the dataset, making it harder for the classifier to find the decision boundary. Hence, we resample the minority class only as much as needed.
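The two terminating conditions can be checked with a small helper. The target ratio of 1.0 (a fully balanced dataset) is an assumed default, since the desired imbalance ratio is a parameter of the algorithm:

```python
def should_stop(n_minority, n_majority, f1_history, target_ratio=1.0):
    """Stop when the desired imbalance ratio is reached, or when the most
    recently added synthetic sample lowered the F1 score."""
    balanced = n_minority >= target_ratio * n_majority
    degraded = len(f1_history) >= 2 and f1_history[-1] < f1_history[-2]
    return balanced or degraded
```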
IV Experiments
We evaluated the GenSample algorithm on 9 datasets with different imbalance ratios, sizes, and numbers and types of features. First, the parameter settings used for the experiments are described. Next, we describe the datasets used and the modifications made to them for our binary classification problem. A discussion of the metrics used for evaluation is presented next, and finally, the results of the experiments are put forth.
IV-A Experimental Setup
Formally described in Algorithm 1, the GenSample algorithm begins by computing the fitness of each individual in the original population. The relative importance of the minority class type and the performance improvement obtained by resampling it are both controlled by the parameter α of the fitness function. Empirically, the value α = 0.75 works best, though it can also be selected by cross-validation; we chose not to use cross-validation to reduce the complexity of the algorithm. The value of k for kNN is set to 5, though k = 7 also produces good results in many cases.
The GenSample algorithm is compared with the naive Decision Tree (C4.5), SMOTE + Decision Tree, and ADASYN + Decision Tree algorithms. We take the average of 100 runs for all the algorithms to obtain stable results. Again, the value k = 5 is used for both SMOTE and ADASYN. For both of these algorithms, we oversample the minority class until the number of minority samples equals the number of majority samples.
TABLE I: Attributes of the datasets (Dataset Name, Total Datapoints, Minority Datapoints, Majority Datapoints, Number of Features, Imbalance Ratio).
We tested the algorithm on 9 datasets commonly used in the literature for benchmarking. For each experiment, we randomly divide the data into 50% training and 50% testing datasets. Their attributes are summarized in Table I. Most of these datasets are publicly available on the UCI Machine Learning Repository . We made a few modifications to these datasets for the binary classification problem, similar to other experimental setups in the literature for such problems. They are described below:
IV-B1 Ionosphere Dataset
This dataset  consists of radar data collected by a system in Goose Bay, Labrador, with 2 classes and 34 features. There are 225 ‘good radar’ instances and 126 ‘bad radar’ instances. Thus, we choose ‘good radar’ as the majority class and ‘bad radar’ as the minority class.
IV-B2 Heart Dataset
The Heart Dataset  is a binary dataset that predicts the presence of heart diseases in patients using 13 attributes like age, sex and blood sugar. The presence of heart disease is rare and constitutes about one-third of the data points.
IV-B3 Iris Dataset
This is a 3-class dataset  which uses 4 features to classify an iris plant into one of the categories ‘Iris-versicolor’, ‘Iris-setosa’, and ‘Iris-virginica’. Each of these classes has 50 samples. We choose ‘Iris-virginica’ as the minority class and collapse the other two into the majority class. Thus, we get a skewed dataset with 50 minority and 100 majority samples.
IV-B4 Parkinson Dataset
This dataset  uses 22 attributes to differentiate people with Parkinson’s Disease (PD) from those without. There are 48 positive examples of people diagnosed with PD, hence that is chosen as the minority class. The majority class has 147 examples, resulting in an imbalance ratio of 3.1.
IV-B5 Blood Transfusion Dataset
This dataset  presents blood donation statistics where each data point represents an individual. The data points are divided into two classes based on whether the individual donated blood in March 2007 using 4 features. We select ‘yes’ as the minority class with 178 data points and ‘no’ as the majority class with 570 data points.
IV-B6 Vehicle Dataset
This dataset  uses 2D silhouettes of objects in the form of an image to classify the type of 3D object: a double-decker bus, a Chevrolet van, a Saab 9000, or an Opel Manta 400. We collapse the bus, Saab, and Opel into the negative class and use the van as the positive class. This results in 199 positive examples and 647 negative examples, giving an imbalance ratio of 3.3.
IV-B7 CMC Dataset
This dataset  tries to predict the current contraceptive method choice of women from the following 3 categories: 1: No-use, 2: Long-term, 3: Short-term. It has a total of 1473 examples with 9 features. ‘Long-term’ is selected as the minority class and has 333 samples. The other 2 classes are combined into the majority class with 1140 samples.
IV-B8 Yeast Dataset
The Yeast Dataset  classifies the localization site of protein into one of ‘MIT’, ‘CYT’, ‘NUC’, ‘ME3’, ‘ME2’, ‘ME1’, ‘EXC’, ‘VAC’, ‘POX’, ‘ERL’ classes. We choose ‘MIT’ as the minority class with 244 data points and the rest are combined into the majority class with 1240 data points. The classification is done using 8 features.
IV-B9 PC1 Dataset
It is one of the NASA Metrics Data Program defect data sets . It is highly skewed with 1032 majority points and only 77 minority points, leading to an imbalance ratio of 13.4. Each example is represented by 21 features.
TABLE II: Classification results of the naive Decision Tree, SMOTE, ADASYN, and GenSample on the nine datasets, along with the number of times each algorithm wins on each metric.
IV-C Evaluation Metrics
Overall Accuracy (OA) is one of the most common metrics for classification tasks in Machine Learning. It is defined as the ratio of the number of correct predictions to the total number of predictions.
In terms of positives and negatives, we can rewrite the definition as:

OA = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
However, when the dataset is imbalanced, accuracy is not an effective measure of a classifier’s performance. Suppose we have 100 data points; 95 of them belong to one class and the remaining 5 to another. The classifier, having seen so many examples of the majority class, tends to predict all samples as belonging to it. It will achieve 95% accuracy in this case but will perform terribly on the minority class. Since the rare examples are of greater interest in most applications, the high accuracy rate is not indicative of the class-wise performance of the classifier. This phenomenon, also known as the ‘Accuracy Paradox’, motivates the use of the following additional metrics to evaluate classifiers. Nonetheless, we report the overall accuracy of the classifiers to examine the effect of oversampling on it.
Precision is defined as the fraction of samples identified as positive by the classifier that are actually positive: Precision = TP / (TP + FP). Precision helps us examine how accurate the claims of our classifier on the positive class are.
Recall can be defined as the fraction of truly positive samples that the classifier correctly identifies as positive: Recall = TP / (TP + FN). It is a very important metric in sensitive applications where it would be risky not to identify the rare instances. To illustrate, if the minority class corresponds to the presence of tumors in patients, we would not want to risk a patient’s life by failing to diagnose a tumor when it exists.
The F1 score is the harmonic mean of Precision and Recall: F1 = 2 · Precision · Recall / (Precision + Recall). Since the harmonic mean, unlike the arithmetic mean, punishes extreme differences between the precision and recall values, the F1 score is an excellent metric for building a balanced classification model.
IV-C4 Geometric Mean
The Geometric Mean of the accuracies of the positive and negative classes is more effective at evaluating classifiers dealing with imbalanced data. It can be calculated as follows: G-mean = √(TPR × TNR), where TPR = TP / (TP + FN) is the true positive rate and TNR = TN / (TN + FP) is the true negative rate.
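All of the reported metrics can be computed from the confusion-matrix counts; the helper below is illustrative (the function and key names are not from the paper):

```python
import math

def imbalance_metrics(tp, tn, fp, fn):
    """Overall accuracy, precision, recall, F1, and G-mean from counts."""
    total = tp + tn + fp + fn
    oa = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # true positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0          # true negative rate
    gmean = math.sqrt(recall * tnr)
    return {"OA": oa, "precision": precision, "recall": recall,
            "F1": f1, "G-mean": gmean}
```

On the all-majority classifier from the accuracy-paradox example (95 negatives, 5 positives, everything predicted negative), this yields OA = 0.95 but recall, F1, and G-mean of 0, which is exactly the failure that accuracy alone hides.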
The Area Under the ROC Curve (AUC) is another metric for evaluating classifiers. However, as demonstrated in , when the dataset is imbalanced, the AUC does not capture the relative performance of two models well. Hence, we chose not to report it.
IV-D Experimental Results and Discussion
Table II presents the results of evaluating GenSample and the other benchmark algorithms (naive decision tree, SMOTE + decision tree, and ADASYN + decision tree) on the 9 datasets mentioned previously. The best value of each performance metric is highlighted in the table. At the end, the number of times each algorithm wins over all the datasets is tabulated for ease of comparison.
The first conclusion we can draw from the table is that the overall accuracy is always better with GenSample. This is an important result because it shows that our algorithm tries not to compromise the accuracy of the majority class to improve that of the minority. The F1 score shows an improvement on all the datasets except Blood Transfusion. Since GenSample terminates when performance starts decreasing, we can also observe that even when its F1 score is not the best of the 4 algorithms, it is never less than that of the naive Decision Tree. Thus, our algorithm at least ensures that performance does not degrade when it cannot be improved. This is not guaranteed by SMOTE and ADASYN. This result can be better visualized in Figure 1.
The winning times in Table II also show that GenSample is generally high on precision. It performs better in terms of precision on 8 out of the 9 datasets, with a minimal number of ties. Even when its precision value is not the highest, it is only slightly less than that of the best performer. GenSample also achieves good results in terms of recall and geometric mean, doing better half of the time. As with the F1 score, the performance is never worse than that of the decision tree.
V Conclusion and Future Work
This paper presents GenSample, a genetic algorithm for handling imbalance in datasets. The algorithm generates synthetic minority data points based on the difficulty of learning a sample point and the performance improvement achieved by oversampling it. The algorithm terminates when the desired imbalance ratio is reached or when adding a synthetic data point to the dataset causes a performance deterioration. Due to the early termination condition, the algorithm ensures that the classification performance does not degrade when it cannot be further improved. We investigate the behavior of GenSample by evaluating it on 9 commonly used imbalanced datasets with 6 different metrics. We observe that for 8 of the 9 datasets, the F1 score and precision are better for GenSample, and the overall accuracy of GenSample is always the highest. Moreover, the recall and geometric mean have the highest value more than 50% of the time.
In the future, we will examine other heuristics which may lead to better results. A promising avenue of research is to investigate the effectiveness of GenSample in combination with ensemble methods. Data-level techniques have shown considerable improvement when augmented with boosting, so a similar enhancement can be expected from combining GenSample with boosting.
According to the ‘No Free Lunch Theorem’ , no single model can work best for every problem. Thus, although the results for GenSample are not the best for every dataset and metric, they definitely indicate that GenSample holds promise in the field of imbalanced learning.
-  N. Chawla, K. Bowyer, L. Hall, W. Kegelmeyer, “SMOTE: Synthetic Minority oversampling Technique”, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
-  H. He, Y. Bai, E. Garcia, S. Li, “ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning”, In Proceedings of the International Joint Conference on Neural Networks, pp. 1322–1328, 2008.
-  Melanie Mitchell, “An Introduction to Genetic Algorithms”, Cambridge, MA: MIT Press, 1996.
-  S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Handling imbalanced datasets: A review,” GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 25-36, 2006.
-  K. Napierala, J. Stefanowski, “Types of minority class examples and their influence on learning classifiers from imbalanced data”, Journal of Intelligent Information Systems, vol. 46, no. 3, pp. 563–597, 2016.
-  D. Dua, C. Graff, “UCI Machine Learning Repository”, University of California, Irvine, School of Information and Computer Sciences, 2017.
-  I. Yeh, K. Yang, T. Ting, “Knowledge discovery on RFM model using Bernoulli sequence”, Expert Systems with Applications, 2008.
-  J. Shirabad, T. Menzies, “The PROMISE Repository of Software Engineering Databases”, School of Information Technology and Engineering, University of Ottawa, Canada, 2005.
-  M. Little, P. Mcsharry, S. Roberts, D. Costello, I. Moroz, “Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection”, BioMedical Engineering OnLine, 2007.
-  A. Swalin, “Choosing the Right Metric for Evaluating Machine Learning Models- Part 2”, https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428, 2018.
-  M. Kubat, S. Matwin, “Addressing the curse of Imbalanced Training Sets: One Sided Selection”, In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186, 1997.
-  I. Tomek: “An experiment with the edited nearest-neighbor rule”, IEEE Transactions on systems, Man, and Cybernetics, vol. 6, 1976.
-  T. Elhassan, M. Aljurf, F. Al-Mohanna, and M. Shoukri, “Classification of Imbalance Data using Tomek Link (T-Link) Combined with Random Under-sampling (RUS) as a Data Reduction Method”, Journal of Informatics and Data Mining, 2016.
-  H. Han, W. Wang, and B. Mao, “Borderline-SMOTE: A New oversampling Method in Imbalanced Data Sets Learning”, In: International conference on intelligent computing. Springer, pp. 878–887, 2005.
-  K. Jiang, J. Lu, and K. Xia, “A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE”, Arabian Journal for Science and Engineering, vol. 41, no. 8, pp. 3255–3266, 2016.
-  H. Yu, J. Ni, J. Zhao, “ACOSampling: An ant Colony Optimization-based undersampling method for classifying imbalanced DNA microarray data”, Neurocomputing vol. 101, pp. 309–318, 2012.
-  N. Chawla, A. Lazarevic, L. Hall, and K. Bowyer, “SMOTEBoost: Improving prediction of the minority class in boosting,” in Proc. 7th Eur. Conf. Principles Pract. Knowl. Discov. Databases, Croatia, pp. 107–119, 2003.
-  D. Wolpert and W. Macready, “No Free Lunch Theorems for Optimization”, IEEE Transactions on Evolutionary Computation Vol. 1, No. 1, 1997.