I Introduction
Many real-life problems, such as disease diagnosis [1], weather prediction [2] and fraud detection [3], can be modelled as classification problems and tackled with machine learning models. In most cases, however, the data obtained are not balanced: it is not possible to collect the same number of samples for every class, so the resulting dataset is class imbalanced. This imbalance poses a serious challenge for machine learning models. They become biased towards the majority class and hence mostly fail to detect the minority classes. In such a scenario, a model may achieve good accuracy and still score poorly on other performance metrics such as the F1 score [4], AUC [5] and G-mean [6]. Because class imbalance is both challenging and damaging, it has received significant attention in the literature [7]. The most common methods use resampling to bring balance to the dataset. Resampling can be done by reducing the number of majority class samples, a technique popularly known as undersampling. Common undersampling techniques include cluster centroids [8], Tomek links [9] and the neighbourhood cleaning rule [10]. Resampling can also be done by increasing the number of minority class samples, either by duplicating existing data or by generating new data. This technique is called oversampling. SMOTE [11], several variants of SMOTE [12] [13] and ADASYN [14] are frequently used oversampling techniques.
Despite the significant body of work in this area, there is still scope for improvement. Against this backdrop, we revisit neural network based approaches for undersampling. In recent years, neural networks have been used successfully for tasks such as image recognition and natural language processing. We explore whether their ability to capture intricate patterns within data can be used to address class imbalance.
II Related Work
Class imbalance, being a challenging problem, has attracted the attention of many researchers over the years. To balance imbalanced data, strategies such as oversampling and undersampling have been employed, with work dating back as early as 1972. A popular undersampling algorithm, the Edited Nearest Neighbor (ENN) rule, was proposed in [15]. ENN removes the data points whose class label does not match that of the majority of their k nearest neighbors. Another popular undersampling algorithm, Tomek link removal (TLL), was introduced in [9]. It detects pairs of data points, called Tomek links, that are each other's nearest neighbors but have different class labels. Undersampling is then done either by removing both points of every Tomek link or by removing only the majority class point of each link. The NearMiss (NM) methods undersample the majority class based on the distances between majority and minority points [16]. NearMiss-1 retains the majority class points whose mean distance to their k nearest minority class points is lowest, where k is a tunable hyperparameter. NearMiss-2 retains the majority class points whose mean distance to their k farthest minority class points is lowest. The final version, NearMiss-3, retains the k nearest majority class points for every minority class point. In addition to these techniques, the cluster centroids undersampler [8] uses k-means clustering to balance an imbalanced dataset by reducing the number of majority samples.
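As an illustration of the Tomek link rule described above, the following is a minimal brute-force sketch in plain Python. It is not the implementation of [9] or of any library; the function names and the toy data are hypothetical and chosen only for this example.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label):
    """Drop only the majority-class member of every Tomek link."""
    drop = {
        k
        for pair in tomek_links(points, labels)
        for k in pair
        if labels[k] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data (illustrative only): label 0 is the majority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 0]
links = tomek_links(points, labels)
new_points, new_labels = remove_majority_in_links(points, labels, 0)
```

Here the first two points form the only Tomek link, so only the majority point at the origin is removed. The quadratic neighbour search is adequate for a sketch; a practical implementation would use a spatial index.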
III Methods
We use either an autoencoder or a simple artificial neural network to fit the minority class. Figure 1 and Figure 2 depict two such models. A threshold on the number of input attributes decides which network is used, and we set this threshold to 30. If the number of input attributes is more than 30, we fit the minority samples with an autoencoder; otherwise, we fit them with a simple neural network with two or three hidden layers. Notably, over-fitting was not a major concern for our task, because the model has to reproduce the minority samples with approximately 100% accuracy. If we cannot fit the minority samples well, we may lose information when predicting the majority samples: a model that is not strong enough may propagate its error to those predictions.
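The threshold rule can be written down in a few lines. The sketch below only decides which model family to use; the function name is hypothetical, and the attribute counts are those listed for the four datasets in Table III.

```python
def choose_model_kind(n_features, threshold=30):
    """Pick the network family from the number of input attributes.

    More than `threshold` attributes -> autoencoder,
    otherwise -> simple feedforward network.
    """
    return "autoencoder" if n_features > threshold else "feedforward"


# Attribute counts of the datasets in Table III.
n_attributes = {"Ionosphere": 34, "Balance": 4, "Pima": 8, "Satimage": 36}
choices = {name: choose_model_kind(n) for name, n in n_attributes.items()}
```

Under this rule, Ionosphere and Satimage would be fitted with an autoencoder, and Balance and Pima with a feedforward network.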
Figure 1: Simple neural network to generate input. The input nodes (four in the figure) are shown in green, the two hidden layers in blue and the output layer (four nodes) in red. The lines between nodes represent the connections between layers; the network is fully connected. Each hidden layer in the figure has five neurons (nodes).
Figure 2: Autoencoder with an eight-node input layer, three hidden layers of six, four and six nodes, and an eight-node output layer.
III-A Undersampling Algorithm
III-A1 Algorithm 1: Hard Neural Network Based Undersampling
Suppose the dataset under consideration contains a set of minority samples and a larger set of majority samples. In this algorithm, we train a neural network (an autoencoder or a feedforward network, chosen by the predefined threshold discussed in the previous section) to learn the feature values of the minority samples, and then use the same network to predict the features of the majority samples. Next, we calculate the Euclidean distance between the predicted and the real feature values of each majority sample and store these distances in a list, keyed by the index of the corresponding majority class sample. We sort the list in descending order of distance and choose the first samples from it, as many as there are minority samples (see the NUS1 column of Table IV). The final dataset combines the minority class data from the original dataset with the majority class data chosen by our approach. In effect, we choose the majority class samples that are far, in terms of Euclidean distance, from their predicted values. In other words, our undersampling approach removes the majority class samples that lie in the vicinity of the minority class samples and retains those located further away. Hence the decision boundary becomes better defined and the resulting balanced dataset becomes more separable. As a consequence, this algorithm outperforms most other undersampling algorithms on most datasets. However, we noted that it performs best when there is no overlap between the data points of the two classes, as will be evident in Section V, where we generate artificial data points and observe the performance of the algorithm on them. For overlapping data, we propose another algorithm in the next subsection.
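The selection step of the hard algorithm can be sketched as follows. This is a minimal illustration, not the authors' code: the trained network is replaced by a hypothetical stand-in (`centroid_model`) that maps every input to the minority centroid, and all names and toy values are assumptions made for the example. In practice `reconstruct` would be the prediction function of the network fitted on the minority class.

```python
import math


def nus1_select(majority, reconstruct, n_keep):
    """Hard selection: keep the n_keep majority samples whose prediction
    by the minority-trained model lies farthest from the sample itself."""
    # Distance between each majority sample and its predicted features,
    # stored together with the index of the sample.
    errors = [(math.dist(x, reconstruct(x)), i) for i, x in enumerate(majority)]
    # Sort in descending order of distance and keep the first n_keep indices.
    errors.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in errors[:n_keep]]


def centroid_model(minority):
    """Hypothetical stand-in for the trained network: it 'predicts'
    the minority centroid for every input."""
    dim = len(minority[0])
    centre = tuple(sum(p[d] for p in minority) / len(minority) for d in range(dim))
    return lambda x: centre


# Toy data (illustrative only).
minority = [(2.0, 2.0), (2.2, 1.8)]
majority = [(0.0, 0.0), (1.9, 2.0), (-1.0, -1.0), (2.5, 2.5)]

model = centroid_model(minority)
kept = nus1_select(majority, model, n_keep=len(minority))
balanced = minority + [majority[i] for i in kept]
```

With these toy values the two majority points closest to the minority cluster are dropped and the two farthest ones are retained, which mirrors the behaviour described in the text.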
III-A2 Algorithm 2: Soft Neural Network Based Undersampling
As discussed at the end of the previous subsection, our proposed Hard Neural Network Based Undersampling algorithm (NUS-1) does not perform well when the classes in the dataset overlap. To resolve this issue, we propose a second algorithm, soft neural network based undersampling (NUS-2), which differs from NUS-1 in how the majority samples are selected. In NUS-1 we choose exactly a fixed number of majority class samples, namely those at the top of the list of samples that are farthest from their predicted values; that is why we call it hard undersampling. In NUS-2, we first predict the minority samples with the model that was fitted on the minority class and calculate the Euclidean distance between each minority sample and its prediction. From these distances we compute the maximum and also the half-average, that is, the average of the larger half of the distances. We then predict the majority class samples with the same model and select them as follows. We feed one sample to the model, obtain its clone generated by the model and calculate the Euclidean distance between the two. If this distance is higher than the chosen threshold (either the maximum distance or the half-average distance of the minority samples), we include the sample in the final dataset as a majority class sample. The soft neural network based undersampling algorithm performs better than all the other undersampling algorithms on overlapping data, as will be observed in Section V, where we examine the effect of different undersampling algorithms on artificially generated overlapping data.
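The soft selection rule can be sketched in the same style. Again this is only an illustration under stated assumptions: `reconstruct` stands for the prediction function of the network fitted on the minority class, and here it is replaced by a hypothetical stand-in that maps every input to the origin. Function names and toy values are not from the paper.

```python
import math


def minority_thresholds(minority, reconstruct):
    """Return (maximum distance, half-average distance), where the
    half-average is the mean of the larger half of the distances between
    the minority samples and their predictions."""
    d = sorted(math.dist(x, reconstruct(x)) for x in minority)
    upper_half = d[len(d) // 2:]
    return d[-1], sum(upper_half) / len(upper_half)


def nus2_select(majority, reconstruct, threshold):
    """Soft selection: keep every majority sample whose distance to its
    prediction exceeds the threshold; the number kept is not fixed."""
    return [
        i
        for i, x in enumerate(majority)
        if math.dist(x, reconstruct(x)) > threshold
    ]


def reconstruct(x):
    """Hypothetical stand-in for the trained model."""
    return (0.0, 0.0)


# Toy data (illustrative only), with the minority centred on the origin.
minority = [(3.0, 0.0), (-3.0, 0.0), (0.0, 2.0), (0.0, -2.0), (0.0, 1.0), (0.0, -1.0)]
majority = [(0.0, 2.5), (0.0, 2.8), (4.0, 0.0), (0.5, 0.0), (0.0, -3.5)]

max_d, half_avg = minority_thresholds(minority, reconstruct)
kept_strict = nus2_select(majority, reconstruct, max_d)     # threshold = maximum
kept_loose = nus2_select(majority, reconstruct, half_avg)   # threshold = half-average
```

The maximum distance gives the stricter threshold and retains fewer majority samples than the half-average, which is why the choice between the two acts as a tunable parameter.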
IV Results Analysis
IV-A Overview of the experiments
We have designed our experiments as follows. We under-sample the dataset under consideration using different undersampling algorithms. Subsequently, the under-sampled dataset is fed to a number of classifiers and we evaluate the classification results thereof. In Table I, we list the classifiers and the undersampling algorithms we used.
Undersampling Algorithms | Classifier Algorithms |
Edited Nearest Neighbour (ENN) [15] | Random forest (RF) [17] |
All KNN (AKNN) [9] | Gradient-boosting (GradBoost) [18] |
Near Miss (NM-1, NM-2, NM-3) [16] | K-nearest neighbour (KNN) [19] |
Neighbourhood Cleaning Rule (NCR) [10] | Stochastic gradient descent (SGD) [20] |
Random Undersampling (RUS) | Logistic Regression (LR) [21] |
Tomek Link (TLL) [9] | |
The result analysis section is organised as follows. First, we briefly describe the datasets used in this paper. Then, we present the metrics used for comparison. After that, we show the results produced by the classifiers: Gradient Boosting (GradBoost), Stochastic Gradient Descent (SGD), K-nearest neighbour (KNN), Random Forest (RF) and Logistic Regression (LR). We used the scikit-learn, scipy, numpy and pandas packages to implement these algorithms and for data conversion [22, 23, 24, 8], the keras package to implement the neural network and the autoencoder [25], and the matplotlib package for graphical representation [24]. We undersampled each dataset with different undersampling algorithms: Edited Nearest Neighbour (ENN) [15], All KNN, Near Miss (versions 1, 2 and 3) [16], Tomek link undersampler (TLL) [9], Random Undersampler (RUS) and the two proposed algorithms, Neural Network Based Undersampling 1 and 2 (NUS-1 and NUS-2), also called the hard and soft undersampling algorithms. The undersampled binary-class data were then classified by the classifiers stated above, and we report the metric values produced by each classifier for comparison.
IV-B Evaluation Criteria
For evaluating the performance of our proposed algorithms, we use performance metrics based on the ROC (Receiver Operating Characteristics) curve [26]. Let + and - represent the positive and negative class labels. The confusion matrix in Table II summarises the performance of a classification algorithm. Based on this confusion matrix, the performance metrics defined in this section are used to evaluate how well imbalanced datasets are learned after applying our proposed algorithms.
 | Predicted + | Predicted - |
Actual + | True Positive (TP) | False Negative (FN) |
Actual - | False Positive (FP) | True Negative (TN) |
To compare the effect of the different undersampling algorithms on classification, we use the area under the Receiver Operating Characteristics (ROC) curve [26], popularly known as AUC. The AUC measures the degree of separability between classes: a model with a higher AUC is more capable of distinguishing the classes than a model with a lower AUC. The problem with an imbalanced dataset is that any machine learning algorithm trained on it becomes biased towards the majority class. In addition, overlap between samples of different classes degrades the model because it cannot distinguish between the classes. Both effects are reflected in a lower AUC during evaluation. Undersampling can potentially solve the imbalance problem by removing some samples from the majority class and thus making the dataset more balanced, and the AUC becomes higher when the model is trained on the balanced data. In Tables V, VIII, XI and XIV, we show the AUC values of different machine learning models on some originally imbalanced datasets [27] resampled by several undersampling techniques. The G-mean [6] is defined as the square root of the product of the true positive rate and the true negative rate:
$$\text{G-mean} = \sqrt{\frac{TP}{TP+FN}\times\frac{TN}{TN+FP}} \tag{1}$$
The F1 measure is another popular metric for evaluating classification algorithms, defined as follows.
$$F_1 = \frac{2\times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}} \tag{2}$$
The terms precision and recall in this formula refer to the fraction of predicted positives that are true positives and the fraction of actual positives that are correctly predicted, respectively:
$$\text{precision} = \frac{TP}{TP+FP}, \qquad \text{recall} = \frac{TP}{TP+FN}$$
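For concreteness, the sketch below computes these metrics from the four confusion matrix counts of Table II, using the standard definitions: precision TP/(TP+FP), recall TP/(TP+FN), F1 as their harmonic mean, and G-mean as the geometric mean of the true positive and true negative rates, as in [6]. The function name and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Metrics derived from the confusion matrix counts.

    No guard against empty classes is included: the counts are assumed
    to make every denominator non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate (sensitivity)
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
        "accuracy": accuracy,
    }


# Illustrative counts for an imbalanced test set (50 positives, 150 negatives).
m = classification_metrics(tp=40, fn=10, fp=20, tn=130)
```

In this example the accuracy is 0.85 while the F1 score is roughly 0.73, which illustrates why accuracy alone can be misleading on imbalanced data.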
IV-C Description of Dataset
We used four real-world datasets to evaluate the proposed algorithms, all from the UCI machine learning repository [28]. The imbalance ratio is defined as the number of majority samples divided by the number of minority samples. The datasets are described in Table III.
Dataset | #attribute | #min | #maj | Ratio |
Ionosphere | 34 | 126 | 225 | 1.78 |
Balance | 4 | 49 | 576 | 11.8 |
Pima | 8 | 268 | 500 | 1.9 |
Satimage | 36 | 626 | 5809 | 9.27 |
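As a quick check, the ratios in Table III can be recomputed from the sample counts, assuming the imbalance ratio is the number of majority samples divided by the number of minority samples. That assumption is consistent with every row of the table. The variable names below are illustrative.

```python
# (minority count, majority count, ratio as printed in Table III)
table_iii = {
    "Ionosphere": (126, 225, 1.78),
    "Balance": (49, 576, 11.8),
    "Pima": (268, 500, 1.9),
    "Satimage": (626, 5809, 9.27),
}


def imbalance_ratio(n_min, n_maj):
    """Majority-to-minority ratio."""
    return n_maj / n_min


ratios = {name: imbalance_ratio(n_min, n_maj)
          for name, (n_min, n_maj, _) in table_iii.items()}

# Largest deviation between the recomputed and the printed ratios.
max_gap = max(abs(ratios[name] - printed)
              for name, (_, _, printed) in table_iii.items())
```

The recomputed values agree with the printed ones to within rounding.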
The number of majority samples selected by each undersampling algorithm is given in Table IV.
Dataset | ENN | AKNN | NM1 | NM2 | NM3 | NUS1 | NUS2 | CC | NCR | TLL | RUS |
Ionosphere | 216 | 215 | 126 | 126 | 99 | 126 | 105 | 126 | 146 | 225 | 126 |
Balance | 452 | 427 | 49 | 49 | 49 | 49 | 161 | 49 | 544 | 571 | 49 |
Pima | 279 | 249 | 268 | 268 | 268 | 268 | 204 | 268 | 261 | 450 | 268 |
Satimage | 5319 | 5213 | 626 | 626 | 626 | 626 | 3045 | 626 | 5449 | 5770 | 626 |
We found that our proposed undersamplers, NUS1 and NUS2, outperform the other undersamplers in almost all cases and for almost all classifiers. NUS1 and NUS2 resample the data in such a way that the classes become more separable, as can be seen in Figures 3 and 4. This leads to higher AUC, G-mean and F1 values and hence better performance. Note that we used a number of classifiers to verify that the proposed undersampling algorithms are not classifier dependent.
[Tables V to VII: results on the Balance dataset, with Table V reporting AUC. Rows: ENN, AKNN, NM1, NM2, NM3, NUS1, NUS2, CC, NCR, TLL, RUS. Columns: GradBoost, SGD, KNN, RF, LR.]
[Tables VIII to X: results on the Pima dataset, with Table VIII reporting AUC. Rows: ENN, AKNN, NM1, NM2, NM3, NUS1, NUS2, CC, NCR, TLL, RUS. Columns: GradBoost, SGD, KNN, RF, LR.]
[Tables XI to XIII: results on the Satimage dataset, with Table XI reporting AUC. Rows: ENN, AKNN, NM1, NM2, NM3, NUS1, NUS2, CC, NCR, TLL, RUS. Columns: GradBoost, SGD, KNN, RF, LR.]
[Tables XIV to XVI: results on the Ionosphere dataset, with Table XIV reporting AUC. Rows: ENN, AKNN, NM1, NM2, NM3, NUS1, NUS2, CC, NCR, TLL, RUS. Columns: GradBoost, SGD, KNN, RF, LR.]
V Undersampling on Artificial Dataset
The datasets on which we experimented so far have many features, which makes it difficult to visualise how the undersamplers actually undersample them. Hence we generated two artificial datasets using the scikit-learn package [22] to visualise the effect of the different undersamplers. The first dataset has two features, which makes it easy to plot and visualise. It contains 1000 majority samples and 100 minority samples, so the ratio of majority to minority samples is 10:1. The centers of the two clusters are [0.0 0.0] and [2.0 2.0], and their standard deviations are 1.5 and 0.5, respectively. The effect of each undersampler is shown in Figure 3.
Next, we generated the second dataset, in which the majority and minority samples overlap. For the second dataset, the ratio of majority to minority is . In this case, the number of majority samples was the same as before but the number of minority samples was . We chose the centers of the two classes to be [0.0 0.0] and [0.02 0.05] respectively, to introduce the overlap. The standard deviations of the two clusters were respectively. The result of each sampler is shown in Figure 3 and Figure 4.
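A dataset with the shape of the first artificial dataset can be generated as sketched below. The paper used scikit-learn for this; the sketch swaps in Python's standard `random` module so that it stays self-contained. The helper name, the seed, and the assignment of the first center and the larger standard deviation to the majority class are assumptions.

```python
import random


def gaussian_cluster(n, centre, std, rng):
    """Draw n points from an isotropic Gaussian around `centre`."""
    return [tuple(rng.gauss(c, std) for c in centre) for _ in range(n)]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility

# First artificial dataset: 1000 majority and 100 minority samples (10:1).
majority = gaussian_cluster(1000, (0.0, 0.0), 1.5, rng)
minority = gaussian_cluster(100, (2.0, 2.0), 0.5, rng)

X = majority + minority
y = [0] * len(majority) + [1] * len(minority)  # 0 = majority, 1 = minority


def mean_point(points):
    """Component-wise mean, used here only as a sanity check."""
    return tuple(sum(p[d] for p in points) / len(points) for d in range(2))


majority_mean = mean_point(majority)
minority_mean = mean_point(minority)
```

The overlapping second dataset would be produced the same way with the centers [0.0 0.0] and [0.02 0.05]; its sample counts and standard deviations are not reproduced here.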
VI Comparison between the proposed algorithms
The classification results and the figures showing the effect of each undersampler indicate that NUS1 performs well when there is little overlap in the data. NUS1 retains the majority data points that are most distant from most of the minority data points. With overlapping data, however, some minority samples may still overlap with the retained majority data, whereas with non-overlapping data this problem is minimal. Hence, NUS1 tends to make a non-overlapping dataset linearly separable after balancing. NUS2, on the other hand, finds the perimeter of the minority samples by calculating the average distance between the minority samples and the samples generated for them by the model, and then retains the majority samples that lie outside this perimeter. In this way the overlap is removed, so NUS2 performs better on overlapping data. However, the choice between the maximum and the average distance is a tunable parameter. Comparing the two proposed methods on a dataset also gives an indirect indication of the nature of the data.
VI-A Case study
We now consider a particular case that may arise from certain data distributions. The majority class may contain outliers, that is, data points that lie far from the minority data points, and the ratio of majority to minority samples may also be very high. In this case, the outliers should be removed from the majority data before applying the proposed hard and soft undersampling algorithms. For Figure 5 we generated an artificial dataset using the scikit-learn [22] package. The ratio of majority to minority is . The two proposed algorithms, NUS-1 and NUS-2, always select the 50 points located farthest from the minority class when undersampling. When outliers are present, the algorithms may therefore always choose the outlier data points. The data and the effect of the undersampling algorithms on the data points are shown in Figure 5.
VII Concluding remarks
In this paper, we proposed two algorithms to address the class imbalance problem. The main target is to balance the data, i.e. to bring the number of majority samples down to the number of minority samples. This approach may have some drawbacks. If the majority to minority ratio is very high, there is a high probability of losing information from the majority class. In this scenario, the accuracy of predicting majority samples could be used as a parameter to choose which batch of majority samples should be retained, in order to mitigate the loss. Future work may address this issue.
References
- [1] Bartosz Krawczyk, Mikel Galar, Łukasz Jeleń, and Francisco Herrera. Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Applied Soft Computing, 38:714–726, 2016.
- [2] Sun Choi, Young Jin Kim, Simon Briceno, and Dimitri Mavris. Prediction of weather-induced airline delays based on machine learning algorithms. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pages 1–6. IEEE, 2016.
- [3] Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web, 16(4):449–475, 2013.
- [4] CJ Van Rijsbergen. Information retrieval 2nd edition butterworths. London available on internet, 1979.
- [5] Jin Huang and Charles X Ling. Using auc and accuracy in evaluating learning algorithms. IEEE Transactions on knowledge and Data Engineering, 17(3):299–310, 2005.
- [6] Miroslav Kubat, Stan Matwin, et al. Addressing the curse of imbalanced training sets: one-sided selection. In Icml, volume 97, pages 179–186. Nashville, USA, 1997.
- [7] Yanmin Sun, Andrew KC Wong, and Mohamed S Kamel. Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04):687–719, 2009.
- [8] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.
- [9] Ivan Tomek. A generalization of the k-nn rule. IEEE Transactions on Systems, Man, and Cybernetics, (2):121–126, 1976.
- [10] Jorma Laurikkala. Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe, pages 63–66. Springer, 2001.
- [11] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
- [12] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pages 878–887. Springer, 2005.
- [13] Gustavo EAPA Batista, Ronaldo C Prati, and Maria Carolina Monard. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1):20–29, 2004.
- [14] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE, 2008.
- [15] Dennis L Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3):408–421, 1972.
- [16] Inderjeet Mani and I Zhang. knn approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets, volume 126, 2003.
- [17] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
- [18] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
- [19] RO Duda and PE Hart. Pattern classification and scene analysis–john wiley & sons. New York, NY, 1973.
- [20] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116. ACM, 2004.
- [21] Raymond E Wright. Logistic regression. 1995.
- [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [23] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.
- [24] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.
- [25] François Chollet et al. Keras. https://keras.io, 2015.
- [26] Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861–874, 2006.
- [27] Zejin Ding. Diversified ensemble classifiers for highly imbalanced data learning and their application in bioinformatics. 2011.
- [28] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.