SMOTified-GAN for class imbalanced pattern classification problems

08/06/2021 ∙ by Anuraganand Sharma, et al. ∙ UNSW

Class imbalance in a dataset is a major problem for classifiers, resulting in poor prediction: a high true positive rate (TPR) but a low true negative rate (TNR) when the majority class in the training dataset is positive. Generally, the pre-processing technique of oversampling the minority class(es) is used to overcome this deficiency. Our focus is on the hybridization of the Generative Adversarial Network (GAN) and the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance problems. We propose a novel two-phase oversampling approach with the synergy of SMOTE and GAN. The initial minority-class data generated by SMOTE is further enhanced by GAN, which produces better quality samples. We name it SMOTified-GAN, as the GAN works on pre-sampled minority data produced by SMOTE rather than generating the samples from random noise itself. The experimental results show that the quality of minority class(es) samples is improved on a variety of tested benchmark datasets. Performance is improved by up to 9% over the next best algorithm tested on F1-score measurements. Its time complexity is also reasonable, around O(N^2 d^2 T) for a sequential algorithm.


1 Introduction

Class imbalance problem (CIP) refers to a type of classification problem where some classes are either severely or moderately underrepresented in comparison to other classes ling_class_2010. The unequal distribution makes many conventional machine learning algorithms considerably less effective, especially for the prediction of minority classes castro_novel_2013. A number of solutions have been proposed at the data and algorithm levels to deal with class imbalance, such as pre-processing for oversampling or under-sampling, data augmentation, cost-sensitive learning/model penalization and one-class classification ling_class_2010; makki_experimental_2019; haixiang_learning_2017; johnson_survey_2019.

An imbalanced dataset poses a major problem for classifiers, which become biased towards the majority class. This imbalanced class distribution degrades the performance of the classifier model, causing a high true positive rate (TPR) and a low true negative rate (TNR) when the majority samples are positive wei_role_2013. Data imbalance is commonly seen in fraud/fault/anomaly detection wei_effective_2013; lee_fault_2020; zhuo_gaussian_2020; huang_igan-ids_2020; makki_experimental_2019, medical diagnosis of lethal and rare diseases johnson_survey_2019; bria_addressing_2020; qin_gan-based_2020, software defect prediction software_defect, natural disaster prediction haixiang_learning_2017, etc.

The most commonly used pre-processing technique is oversampling, since undersampling removes important information and does not result in accurate classification hilario_learning_2018. Oversampling, too, suffers from the inclusion of illegitimate samples, which is still an active area of research zhu_improving_2020; tao_adaptive_2020. The Synthetic Minority Over-sampling Technique (SMOTE) chawla2002smote is considered a “de facto” standard oversampling method. It is simple and effective; however, it may not produce diverse samples. SMOTE uses interpolation to randomly generate new samples from the nearest neighborhood of minority class data. It has been successfully used in regression torgo2013smote and classification problems jeatrakul2010classification for a wide range of models li2020toxic. A review of SMOTE and its applications is given in fernandez2018smote.
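To make this pre-processing step concrete, the following is a minimal sketch of SMOTE-based oversampling using the imbalanced-learn library; the toy feature matrix and labels are placeholders for any of the imbalanced tabular datasets discussed later.

# Minimal sketch of SMOTE-based oversampling using the imbalanced-learn library.
# The feature matrix X and label vector y are toy placeholders for an imbalanced
# tabular dataset (e.g. one of the UCI datasets used later in the paper).
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                    # 1000 samples, 8 features
y = (rng.random(1000) < 0.1).astype(int)          # roughly 10% minority class

print("before:", Counter(y))
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))                  # classes are now balanced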

In the case of an imbalanced dataset, data samples can also be generated with a data augmentation approach based on generative models. Generative Adversarial Networks (GANs) and their variations are commonly used to generate new “fake” samples zareapoor_oversampling_2021. GAN was originally designed to generate realistic-looking images; however, it can also generate minority class samples, thereby balancing the class distribution and avoiding over-fitting effectively zhang_imbalanced_2018. Imbalanced data classification is ubiquitous across application domains, and data augmentation techniques based on variations of GAN have been successfully applied in many of them, such as skin lesion classification for better diagnosis qin_gan-based_2020 or detection of pipeline leakage in petrochemical systems xu_predicting_2019.

Bayesian inference provides a principled framework to estimate an unknown quantity represented by the posterior distribution (parameters of a model), which is updated via Bayes’ theorem as more information becomes available knill1996perception; box2011bayesian; andrieu2003introduction. Markov Chain Monte Carlo (MCMC) sampling is typically used to implement Bayesian inference neal2011mcmc. It features a likelihood function that takes into account the prior distribution to accept or reject samples obtained from a proposal distribution in order to construct the posterior distribution of model parameters, such as the weights of a neural network mackay1992practical; andrieu2003introduction; neal2011mcmc; van_ravenzwaaij_simple_2018. A major limitation of MCMC sampling is the high computational cost of sampling from the posterior distribution das_racog_2015; johndrow_mcmc_2019. Recently, there has been much progress in MCMC sampling via the use of gradient-based proposals and parallel computing in Bayesian deep learning chandra2019langevin; chandra2021revisiting; chandra2021bayesian. However, these advances have mostly been limited to quantifying uncertainty in model parameters (weights) rather than quantifying uncertainty in data or addressing class imbalance problems. In the case of class imbalance, MCMC sampling has been used for benchmark real-world imbalanced datasets das_racog_2015; das_wracog_2013, and MCMC methods have been applied to handle imbalanced categorical data johndrow_mcmc_2019. Das et al. das_racog_2015 used Gibbs sampling (an MCMC method) to generate new minority class samples.

Another approach is the data-dependent cost matrix, where a weighted misclassification cost is assigned to the misclassified classes haixiang_learning_2017; however, it is not easy to determine this cost das_racog_2015. The cost-sensitive loss function has penalty-based weights for misclassification errors from both the majority and minority classes. The hybrid neural network with a cost-sensitive support vector machine (hybrid NN-CSSVM) kim_hybrid_2020 considers a different cost for each type of misclassification. Castro et al. castro_novel_2013 improved the misclassification error for imbalanced data by setting the cost parameter according to the ratio of majority samples in the training set. The one-class problem irigoien_towards_2014; lee_fault_2020; gao_handling_2020 also has a “minority” class, but it is generally treated as an outlier class that is removed from the training data. One-class modeling usually uses feature mapping or feature fitting to enforce the feature learning process gao_handling_2020.

In this paper, we propose a novel hybrid approach that combines the strengths and overcomes the deficiencies of two independent models, SMOTE and GAN. We refer to it as SMOTified-GAN, as it relies on promising samples generated by SMOTE rather than completely random samples. This leads to more feasible and diverse data, which are further enhanced through GAN to produce better quality samples. We obtain impressive results for our proposed method on numerical benchmark CIP datasets, mainly from the UCI library Dua:2019. Its efficiency is also reasonable: it is a two-phase process whose cost is the combination of SMOTE’s and GAN’s.

The rest of the paper is organised as follows. Section 2 presents the state-of-the-art techniques to solve CIPs. Section 3 discusses the proposed method – SMOTified-GAN. Section 4 shows the experimental results and Section 5 discusses the outcome of the experiments. Lastly, Section 6 concludes the paper by summarizing the results and proposing some further extensions to the research.

2 Related Work on class imbalance problems

2.1 Synthetic Minority Oversampling TEchnique (SMOTE)

SMOTE is a “de facto” standard for pre-processing imbalanced data. It is not completely random sampling; rather, it uses interpolation among neighboring minority class examples. It is efficient and easy to implement. For each minority example, its k-nearest neighbors (KNN) are found, from which neighbors are randomly selected and interpolated with to create new samples. The pseudocode is given in Algorithm 1. The parameters N and d are the size and dimension of the minority class respectively, M is the size of the majority class, G is the required number of over-samples, and k is the parameter for the k-nearest neighbors. Lines 1-5 find the KNN for each minority sample and then interpolate with them to create new samples. Lines 6-12 describe the interpolation step, where samples are created and added to the minority class. Its time complexity on a single machine is of the order O(N^2 d) raschka_stat_2018; fernandez_smote_2018; chawla_smote_2002.

// Input: d-dimensional minority samples X = {x_1, ..., x_N} of size N and a majority class of size M; G over-samples are required. k defines the k-nearest neighbors.
1 g <- G / N  // over-samples to create per minority example
2 for i = 1 to N do
3     nn_i <- KNN(x_i, k)  // k nearest minority-class neighbors of x_i
4     X <- interpolate(x_i, nn_i, g, X)  // add g synthetic samples for x_i
5 end for
// subroutine for interpolate
6 interpolate (x_i, nn_i, g, X)
7 while g > 0 do
8     j <- random integer from 1...k  // pick one of the k neighbors
9     diff <- nn_i[j] - x_i  // difference vector between the selected neighbor and x_i
10    r <- random number between 0 and 1
11    X <- X ∪ {x_i + r · diff};  g <- g - 1
12 end while
return X
Algorithm 1 Pseudocode for SMOTE
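For illustration, the following NumPy sketch implements the interpolation idea of Algorithm 1; the helper names (knn_indices, smote_oversample) and the toy minority set are ours, not part of the original implementation.

# Illustrative NumPy sketch of the SMOTE interpolation in Algorithm 1 (not the
# reference implementation). X_min is the minority-class feature matrix.
import numpy as np

def knn_indices(X_min, i, k):
    """Indices of the k nearest minority neighbors of sample i (excluding itself)."""
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    return np.argsort(d)[1:k + 1]

def smote_oversample(X_min, G, k=5, rng=np.random.default_rng(0)):
    """Create G synthetic minority samples by interpolating towards random neighbors."""
    N = len(X_min)
    synthetic = []
    for _ in range(G):
        i = rng.integers(N)                        # a random minority sample x_i
        j = rng.choice(knn_indices(X_min, i, k))   # one of its k nearest neighbors
        r = rng.random()                           # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + r * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.random.default_rng(1).normal(size=(30, 4))   # toy minority class
X_new = smote_oversample(X_min, G=60)                   # 60 synthetic samples
print(X_new.shape)                                      # (60, 4)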

Many variants of SMOTE have been successfully applied to various application domains such as bioinformatics, video surveillance, fault detection and high-dimensional gene expression data sets won_push_2020; fernandez_smote_2018; blagus_evaluation_2012. These include regular SMOTE, Borderline-SMOTE, SVM-SMOTE and KMeans-SMOTE zheng_smote_2020. Kovacs kovacs_smote-variants_2019 provides implementations of 85 variants of SMOTE in a Python library.

2.2 Generative Adversarial Network (GAN)

GAN is a class of machine learning frameworks in which two neural networks contest with each other and improve continuously and simultaneously. The technique learns to generate new data with the same statistics as the training set by capturing the true data distribution goodfellow_generative_2014.

GAN has been successfully used for data augmentation. The two neural networks of GAN learn the target distribution and generate new samples that achieve a similar distributional structure in the generated over-sampled data. A GAN is simply the synergy of two deep learning networks that produce “fake” data examples emulating the properties of the real data zhang_imbalanced_2018; suh_cegan_2021; ali-gombe_mfc-gan_2019.

GAN was not designed for oversampling imbalanced classes but for creating “fake” images that are hard to distinguish from real ones. However, its success in data augmentation for over-sampling has led to many variations of GAN for solving CIP goodfellow_generative_2014; sorin_creating_2020; lin_fpgan_2021; yi_progressive_2020.

Figure 1: Process of “fake” sample generation with GAN

The first network is called the Generator, whose responsibility is to take a vector of random values and generate data similar to the real data used in training. The second network is called the Discriminator, which takes input data from both the real training data and the “fake” data from the generator, and tries to classify them correctly. This process is shown in Figure 1.
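As a minimal sketch of the two networks for tabular data, the following PyTorch modules can serve as a generator/discriminator pair; the layer sizes and noise dimension here are placeholders, while the settings actually used in the experiments are listed later in Table 2.

# Minimal PyTorch sketch of the two GAN networks for tabular data. Layer sizes
# and the noise dimension are placeholders, not the settings of Table 2.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, data_dim),        # outputs a synthetic feature vector
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),               # logit: real vs. fake
        )

    def forward(self, x):
        return self.net(x)

G, D = Generator(noise_dim=16, data_dim=8), Discriminator(data_dim=8)
z = torch.randn(32, 16)                          # batch of random noise vectors
print(D(G(z)).shape)                             # torch.Size([32, 1])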

The time complexity of GAN can be roughly expressed in terms of two additional parameters, the layer size and the total number of training iterations T, together with the sample size and dimension. Its convergence rate with Stochastic Gradient Descent depends on the variance of the dataset sharma_guided_2021. See Section 3 for further details.

3 SMOTified-GAN for Class Imbalance Problem

Our proposed method tries to overcome the deficiencies of both SMOTE and GAN in a common model. We have named it SMOTified-GAN, as it diversifies the original samples produced by SMOTE through GAN. Additionally, the quality of the samples is further enhanced by emulating the realistic samples. The process of SMOTified-GAN is shown in Figure 3.

Even though SMOTE is widely used as an oversampling technique, it suffers from some deficiencies. The major drawback of SMOTE is that it focuses on local information and therefore does not generate a diverse set of data, as shown in Figure 2(a). Additionally, Figure 2(b) shows that the 5 nearest neighbors of a minority sample are first blindly chosen and then interpolated to obtain the corresponding synthetic samples; there is even a high chance that a synthetic sample is misclassified because it falls close to a majority sample hu_novel_2013. The generated data are generally insufficiently realistic compared to GAN, which captures the true data distribution in order to generate data for the minority class zheng_conditional_2020.

(a) Low-diversity with SMOTE taken from jefferson_bank_2020
(b) Interpolation with SMOTE taken from hu_novel_2013
Figure 2: SMOTE processing for oversampling

GAN is not an ideal fit for oversampling, as it was originally designed to produce realistic-looking images with convolutional neural networks (CNNs) rather than over-samples for the minority class. Additionally, GAN may face a data scarcity problem, as the minority class is already small and model training requires more of its data to be sacrificed for validation and testing purposes, though cross-validation techniques may solve this problem to some extent. GAN has two networks, as mentioned in the previous section, where the objective of the generator network is to generate data that fools the discriminator network into classifying it as “real”. To optimize the generator's performance, the loss of the discriminator is maximized when data comes from the generator; that is, the objective of the generator is to generate data that the discriminator classifies as “real”.

To optimize the performance of the discriminator, its loss is minimized when given batches of both real and generated data. That is, the objective of the discriminator is to not be “fooled” by the generator noauthor_train_2021; goodfellow_generative_2014.

The discriminator score can be given as:

$$\text{score}_D = \overline{\log \hat{Y}} + \overline{\log\left(1 - \hat{Y}_{gen}\right)}$$

or, equivalently, as a loss to be minimized,

$$\mathcal{L}_D = -\left[\,\overline{\log \hat{Y}} + \overline{\log\left(1 - \hat{Y}_{gen}\right)}\,\right],$$

where $\hat{Y}$ contains the discriminator output probabilities for the real data $X$, $\hat{Y}_{gen}$ contains the discriminator output probabilities for the generated data $X_{gen}$, and $\overline{\,\cdot\,}$ denotes the mean over a mini-batch.

The generator score is:

$$\text{score}_G = \overline{\log \hat{Y}_{gen}}.$$

The pseudocode for the GAN algorithm is given in Algorithm 2, where gradient() and update() are functions that determine the gradient for a mini-batch using the Stochastic Gradient Descent (SGD) optimizer sharma_guided_2018, or any of its variations such as Adam kingma_adam:_2014 or RMSprop zeiler_adadelta:_2012, and update the weights, respectively. Once the algorithm terminates, ‘good’ fake samples are collected with accumulateFakeEx based on classification accuracy.

// Input: training data examples x and noise samples z from an appropriate random number generator. An optional parameter G is the number of fake samples needed.
// initialize parameters θ_d and θ_g
// B_t is the mini-batch indices for iteration t and T is the total number of iterations
1 GAN (x, z)
2 for t = 1 to T do
      // generally the discriminator step size S is 1
      // subscripts d and g refer to the discriminator and generator entity respectively
3     for s = 1 to S do
4         sample a mini-batch of real data x[B_t] and noise z[B_t]
5         g_d <- gradient(θ_d; x[B_t], z[B_t])  // discriminator gradient on real and fake data
6         θ_d <- update(θ_d, g_d)
7     end for
8     sample a mini-batch of noise z[B_t]
9     g_g <- gradient(θ_g; z[B_t])  // generator gradient on generated data only
10    θ_g <- update(θ_g, g_g)
11 end for
12 x_fake <- accumulateFakeEx(θ_g, G)  // collect ‘good’ fake samples based on classification accuracy
return x_fake
Algorithm 2 Pseudocode for GAN
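A hedged PyTorch sketch of one iteration of Algorithm 2 is given below, reusing the Generator/Discriminator sketch from Section 2.2; the helper name gan_training_step and the optimiser choices are illustrative rather than the reference implementation.

# Illustrative PyTorch training step corresponding to Algorithm 2, reusing the
# Generator G and Discriminator D sketched in Section 2.2.
import torch
import torch.nn as nn

def gan_training_step(G, D, opt_g, opt_d, real_batch, noise_dim):
    bce = nn.BCEWithLogitsLoss()
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # --- discriminator update: real -> 1, fake -> 0 ---
    z = torch.randn(b, noise_dim)
    fake = G(z).detach()                      # do not backprop into the generator here
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator update: try to make the discriminator output 1 on fakes ---
    z = torch.randn(b, noise_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()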

Goodfellow et al. NIPS2014_IanG used the sigmoid $\sigma(\cdot)$ as the activation function, which results in the following scores to minimize:

Discriminator: $\mathcal{L}_D = -\left[\,\overline{\log \sigma(\tilde{Y})} + \overline{\log\left(1 - \sigma(\tilde{Y}_{gen})\right)}\,\right]$

Generator: $\mathcal{L}_G = -\,\overline{\log \sigma(\tilde{Y}_{gen})}$

where $\tilde{Y}$ and $\tilde{Y}_{gen}$ are the outputs of the Discriminator D on the real data and on the data produced by the Generator model G respectively, before the activation function is applied.
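The following small sketch illustrates, under the assumption of a PyTorch implementation, how these log-sigmoid scores correspond to the BCEWithLogitsLoss listed later in Table 2; the tensor names stand for pre-activation discriminator outputs on real and generated data.

# Small sketch relating the log-sigmoid scores above to BCEWithLogitsLoss.
# y_real_logits / y_fake_logits stand for pre-activation discriminator outputs.
import torch
import torch.nn.functional as F

y_real_logits = torch.randn(128, 1)
y_fake_logits = torch.randn(128, 1)

# Discriminator loss: -[ mean(log σ(Ỹ)) + mean(log(1 - σ(Ỹ_gen))) ]
loss_d_manual = -(F.logsigmoid(y_real_logits).mean()
                  + torch.log(1 - torch.sigmoid(y_fake_logits)).mean())
loss_d_bce = (F.binary_cross_entropy_with_logits(y_real_logits, torch.ones_like(y_real_logits))
              + F.binary_cross_entropy_with_logits(y_fake_logits, torch.zeros_like(y_fake_logits)))
print(torch.allclose(loss_d_manual, loss_d_bce, atol=1e-5))   # True

# Generator loss (non-saturating form): -mean(log σ(Ỹ_gen))
loss_g_manual = -F.logsigmoid(y_fake_logits).mean()
loss_g_bce = F.binary_cross_entropy_with_logits(y_fake_logits, torch.ones_like(y_fake_logits))
print(torch.allclose(loss_g_manual, loss_g_bce, atol=1e-5))   # True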

The formalization of SMOTified-GAN is not very different from that of the original GAN. Only the random noise input of GAN is replaced with the repertoire of over-sampled minority examples from SMOTE. The modified scores can be written as:

Discriminator score: $\text{score}_D = \overline{\log D(X_{min})} + \overline{\log\left(1 - D\!\left(G(X_{s})\right)\right)}$

Generator score: $\text{score}_G = \overline{\log D\!\left(G(X_{s})\right)}$

where $X_{min}$ denotes the training samples of the minority class(es) and $X_{s}$ is the over-sampled data of the same class(es) generated by a different algorithm, SMOTE in this case. The pseudocode for SMOTified-GAN is given in Algorithm 3, and its implementation is not too difficult either. The Python code is available at https://github.com/anuraganands.

As illustrated in Figure 3, SMOTified-GAN has two sections. The first replaces the random number generator (refer to Figure 1) with the repertoire of over-samples from SMOTE. The second continues with the GAN process using the new samples from SMOTE. Algorithm 3 also shows this process in two steps: Line (1) calls the SMOTE function given in Algorithm 1, and then Line (2) calls the GAN function given in Algorithm 2, but this time the generated samples $X_s$ are used instead of random noise $z$.

// Input: minority examples X from a training data set that requires G over-samples;
// user-defined parameter k for the k-nearest neighbors.
// First execute SMOTE given in Algorithm 1, then GAN given in Algorithm 2.
1 X_s <- call Algorithm_1 (X, G, k)  // generate over-sampled minority examples X_s
2 call Algorithm_2 (x, X_s)  // run GAN with X_s in place of the random noise z
Algorithm 3 Pseudocode for SMOTified-GAN
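The sketch below strings the two phases together in Python, assuming the imbalanced-learn SMOTE implementation and the Generator/Discriminator sketches from Section 2.2; all names are illustrative and do not reproduce the released code.

# End-to-end sketch of the two-phase SMOTified-GAN idea: SMOTE first produces
# minority over-samples, which are then fed to the GAN generator instead of
# random noise and refined adversarially.
import torch
import torch.nn as nn
from imblearn.over_sampling import SMOTE

def smotified_gan_oversample(X, y, minority_label, epochs=2000):
    # Phase 1: SMOTE (Algorithm 1) -- keep only the synthetic minority samples
    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    X_s = torch.tensor(X_res[len(X):], dtype=torch.float32)          # SMOTE-generated samples
    X_min = torch.tensor(X[y == minority_label], dtype=torch.float32)

    # Phase 2: GAN (Algorithm 2) with X_s replacing the noise input z
    d = X.shape[1]
    G, D = Generator(noise_dim=d, data_dim=d), Discriminator(data_dim=d)
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-5)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-5)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        fake = G(X_s)                                                 # refine the SMOTE samples
        loss_d = bce(D(X_min), torch.ones(len(X_min), 1)) + \
                 bce(D(fake.detach()), torch.zeros(len(X_s), 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        loss_g = bce(D(G(X_s)), torch.ones(len(X_s), 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return G(X_s).detach().numpy()                                    # refined minority over-samples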

Its time complexity for a sequential algorithm is the combination of SMOTE’s and GAN’s time complexities. Since the over-sampled minority set is only a small fraction of the training data, the combined cost can be simplified to around $O(N^2 d^2 T)$.

Figure 3: Process of sample generation with SMOTified-GAN

The major difference between our proposed SMOTified-GAN and GAN is the use of a ready-made repertoire of samples generated by SMOTE instead of a set of random noise to begin with. Intuitively, this improves the input samples, which in turn produces better over-samples. This natural synergy of SMOTE and GAN gives the naïve GAN a jump-start with “realistic” data before further refinement.

4 Experiments and Results

In this section, we provide experimental results of the over-sampling methods, namely SMOTE, GAN and SMOTified-GAN, on different datasets taken from the CIP literature bhowan_reusing_2014; nunez_improving_2017; napierala_types_2016. The over-sampled data is augmented into the training data, which is then fed into neural networks (NN) for classification. We also test on the original datasets without using any data augmentation technique.

4.1 Datasets

We evaluate and compare our model on different datasets that feature class imbalance, as shown in Table 1. The datasets were mainly obtained from the UCI machine learning repository Dua:2019 and the KEEL-dataset repository alcala2011keel, and have been used in a number of methods for CIPs bhowan_reusing_2014; nunez_improving_2017; napierala_types_2016.

Dataset Features Classes Instances Minority Class (%) Description
Abalone Dua:2019 8 2 4177 20.1 Predict the age of abalone from physical measurements
Credit-Card Fraud noauthor_credit_nodate 30 2 284807 0.172 This dataset has 492 frauds out of 284,807 transactions.
Ionosphere Dua:2019 34 2 351 35.71 Classification of radar returns from the ionosphere
Shuttle Dua:2019 9 2 58000 0.294 Approximately 80 percent of the data belongs to class 1
Spambase Dua:2019 57 2 4601 39.39 Classifying Email as Spam or Non-Spam
Connect4 Dua:2019 42 2 376640 3.84 Contains connect-4 positions
Yeast Dua:2019 8 2 513 9.94 Predicting the Cellular Localization Sites of Proteins
Ecoli Dua:2019 7 2 335 5.97 This data contains protein localization sites
Pageblocks Dua:2019 10 2 471 5.94 Classifying all the blocks of the page layout of a document
Wine Dua:2019 11 2 655 2.74 Using chemical analysis determine the origin of wines
Poker Dua:2019 10 2 1476 1.15 Purpose is to predict poker hands
Table 1: Dataset description

4.2 Experimental Setup

We used the naïve GAN model goodfellow_generative_2014 and naïve SMOTE chawla_smote_2002 in this paper. SMOTified-GAN uses the above two models; however, it is flexible enough to work with other combinations of different variations as well. The parameter settings, such as learning rate, total epochs and loss functions, are shown in Table 2. The GAN generator neural network features 3 hidden layers with 128 neurons in each layer. The GAN discriminator network is similar to the generator network, with the major difference of having only two layers: a linear layer followed by a LeakyReLU layer with alpha = 0.2. For GAN training, we use the binary cross-entropy loss with a training data batch size of 128 and an initial learning rate of 0.00001 with the Adam optimiser.

After basic pre-processing steps, SMOTE oversampling is done with k nearest neighbors. The stopping criteria for SMOTified-GAN's and naïve GAN's training are based on the validation error to avoid over-fitting. Additionally, it is ensured that the discriminator and generator losses remain significant and do not approach zero.

Parameter | Neural Network | Generator | Discriminator
Total neurons per hidden layer | 256, 128 | 128, 256, 512, 1024 | 512, 256, 128
Optimizer | Adam | Adam | Adam
Loss Function | Mean Absolute Error | BCEWithLogitsLoss | BCEWithLogitsLoss
Activation | ReLU | ReLU | LeakyReLU (0.2)
Normalization | - | BatchNorm1d | -
Learning Rate | 0.00001 | 0.00001 | 0.00001
Table 2: Parameter Settings
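A configuration sketch corresponding to Table 2 is given below; the input/output dimensions and noise size are dataset-dependent placeholders, while the hidden sizes, activations, optimiser and learning rates follow the table.

# Sketch of the three networks configured according to Table 2 (classifier NN,
# GAN generator, GAN discriminator). Input/output dimensions and the noise size
# are placeholders that depend on the dataset.
import torch.nn as nn
import torch.optim as optim

def make_classifier(d_in, n_classes=2):
    return nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                         nn.Linear(256, 128), nn.ReLU(),
                         nn.Linear(128, n_classes))

def make_generator(noise_dim, d_out):
    return nn.Sequential(nn.Linear(noise_dim, 128), nn.BatchNorm1d(128), nn.ReLU(),
                         nn.Linear(128, 256), nn.BatchNorm1d(256), nn.ReLU(),
                         nn.Linear(256, 512), nn.BatchNorm1d(512), nn.ReLU(),
                         nn.Linear(512, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
                         nn.Linear(1024, d_out))

def make_discriminator(d_in):
    return nn.Sequential(nn.Linear(d_in, 512), nn.LeakyReLU(0.2),
                         nn.Linear(512, 256), nn.LeakyReLU(0.2),
                         nn.Linear(256, 128), nn.LeakyReLU(0.2),
                         nn.Linear(128, 1))                 # logit for BCEWithLogitsLoss

G, D = make_generator(noise_dim=32, d_out=34), make_discriminator(d_in=34)   # e.g. 34 features (Ionosphere)
opt_g = optim.Adam(G.parameters(), lr=1e-5)
opt_d = optim.Adam(D.parameters(), lr=1e-5)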

4.3 Preliminary investigation

The experiment has been conducted on 11 benchmark imbalanced datasets, which are trained on an NN to test the efficacy of the various oversampling techniques. We use SMOTE, GAN and our proposed SMOTified-GAN for oversampling, and also test with the original data without any data augmentation. The quality of classification and comparative results are shown in Table 3. As expected, all datasets show high train and test accuracy due to the high imbalance in the datasets, so it is important to look at the F1-scores to capture both precision and recall. The best F1-scores are shown in bold font.

It is clear from the experimental results that SMOTified-GAN outperforms the other oversampling techniques. Only Connect4 is an outlier, where all oversampling techniques show poorer results than the non-oversampling baseline. Surprisingly, SMOTE also performs 3.6% worse than the original training dataset without any data augmentation. This can be attributed to the fact that the dataset is highly imbalanced, with the minority class constituting only 3.84% of the training dataset, which does not provide enough data for generalization; hence the minority class should not be over-sampled blindly for a given dataset. Conversely, no data augmentation on datasets such as Ecoli (6.0% minority class) and Wine (2.7% minority class) gives very poor and unacceptable results. Here, data augmentation techniques, especially SMOTified-GAN, show much better results, improved by 92.2% and 52.7% respectively.

SMOTified-GAN gives the best results, considering the F1-score, for all other datasets with diverse proportions of minority class, such as Credit-Card Fraud (0.2% minority class), Spambase (39.4%), Yeast (9.9%) and Wine (2.7%). The remaining datasets, Ionosphere, Shuttle, Ecoli, Pageblocks and Poker, also favor SMOTified-GAN. Figure 4 on the comparative F1-scores shows that SMOTified-GAN outperforms the other algorithms on 10/11 datasets. Its performance is significantly improved, by 9% for Pageblocks and 10% for Ecoli. GAN and SMOTE give mixed results on different datasets: GAN produces better results than SMOTE 10/11 times, while training without data augmentation is better than GAN and SMOTE 2/11 and 4/11 times respectively. SMOTified-GAN also produces better precision and recall for most of the datasets.

Figure 4: Comparison of F1 scores

Datasets like Yeast, Ecoli, Wine, Poker and Pageblocks have a small number of minority class instances relative to the majority class, which allows SMOTified-GAN to show its potential over the other algorithms, as seen in the respective results. In datasets like Ecoli and Wine the minority instances are so few that the non-oversampling method completely fails to predict the minority class. All models give high train and test accuracy on all datasets, which is attributed to the dominance of the majority class; hence the true performance measure is the minority-class F1-score, which depends on both precision and recall. Overall, the proposed SMOTified-GAN outperforms the others in terms of F1-score with a comparatively low standard deviation of results.

The training loss curves of SMOTified-GAN's generator and discriminator models with respect to the number of epochs during training on selected datasets are shown in Figure 6. In general, the discriminator's loss curve converges fairly quickly, whereas the generator's loss curve shows high fluctuations; however, these fluctuations generally become steady at around 2000 epochs. We also plot the validation F1-score in the same graph to determine the termination criterion: training stops once the validation F1-score reaches its highest value, to avoid over-fitting.
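A minimal sketch of this termination rule, assuming a helper that trains one epoch and another that returns the validation F1-score, is given below; the patience value is illustrative.

# Sketch of the termination rule described above: stop GAN training once the
# validation F1-score stops improving. train_one_epoch, validation_f1 and
# patience are illustrative placeholders.
def train_with_f1_early_stopping(train_one_epoch, validation_f1, max_epochs=5000, patience=200):
    best_f1, best_epoch = -1.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one generator/discriminator update pass
        f1 = validation_f1()                   # F1 on the held-out validation split
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:   # no improvement for `patience` epochs
            break
    return best_f1, best_epoch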

4.4 Results

Table 3 presents a summary of the experimental results with the NN using the respective oversampling methods, SMOTE, GAN and SMOTified-GAN, and also without any augmentation. It shows training and test accuracy along with F1-score, precision and recall, demonstrating the effectiveness of the methods for class-imbalanced datasets. We report the mean, standard deviation and best performance for the respective evaluation metrics over 30 experimental runs, where each run has a different randomised initial position in weight space, in order to incorporate model uncertainty in our results.
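The following sketch indicates how such run statistics could be collected, assuming a placeholder train_and_evaluate routine that performs one full train/test cycle with a given random seed.

# Sketch of how the reported statistics (mean, best, std over 30 runs) could be
# collected; train_and_evaluate is a placeholder for one full train/test cycle.
import numpy as np
from sklearn.metrics import f1_score

def summarise_runs(train_and_evaluate, n_runs=30):
    f1s = []
    for seed in range(n_runs):
        y_true, y_pred = train_and_evaluate(seed=seed)      # different random initial weights per run
        f1s.append(f1_score(y_true, y_pred, pos_label=1))   # minority class assumed to be label 1
    f1s = np.array(f1s)
    return {"mean": f1s.mean(), "best": f1s.max(), "std": f1s.std()}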

Dataset Oversampling methods Train Test F1 Precision Recall
Abalone Non-oversampled 0.9080 (0.9108,0.0019) 0.9072 (0.9114,0.0028) 0.7556 (0.7658,0.0090) 0.80 0.73
SMOTE 0.8969 (0.9022,0.0035) 0.8622 (0.8827,0.0105) 0.7259 (0.7566,0.0200) 0.78 0.72
GAN 0.9422 (0.9439,0.0013) 0.9070 (0.9125,0.0040) 0.7555 (0.7687,0.0061) 0.80 0.74
SMOTified-GAN 0.9427 (0.9441,0.0008) 0.9075 (0.9126,0.0036) 0.7612 (0.7711,0.0065) 0.80 0.75
Credit-Card Fraud Non-oversampled 0.9996 (0.9997,0.0001) 0.9991 (0.9993,0.0001) 0.8066 (0.8214,0.0327) 0.84 0.80
SMOTE 0.9996 (0.9997,0.0001) 0.9990 (0.9991,0.0001) 0.7099 (0.7409,0.0210) 0.80 0.69
GAN 0.9995 (0.9997,0.0001) 0.9991 (0.9994,0.0001) 0.8069 (0.8214,0.0241) 0.84 0.80
SMOTified-GAN 0.9993 (0.9994,0.0001) 0.9992 (0.9993,0.0001) 0.8118 (0.8243,0.0202) 0.85 0.80
Ionosphere Non-oversampled 0.9878 (0.9928,0.0027) 0.9728 (0.9857,0.0126) 0.9621 (0.9803,0.0179) 0.98 0.98
SMOTE 0.9914 (0.9916,0.0007) 0.9738 (0.9857,0.0113) 0.9632 (0.9803,0.0164) 0.97 0.99
GAN 0.9901 (0.9944,0.0017) 0.9767 (1.0000,0.0086) 0.9701 (1.0000,0.0210) 1.00 1.00
SMOTified-GAN 0.9903 (0.9944,0.0023) 0.9823 (1.0000,0.0068) 0.9777 (1.0000,0.0169) 1.00 1.00
Shuttle Non-oversampled 0.9994 (0.9998,0.0006) 0.9992 (0.9996,0.0005) 0.8256 (0.9350,0.2294) 0.92 0.96
SMOTE 0.9996 (0.9996,0.0000) 0.9990 (0.9993,0.0001) 0.8465 (0.8837,0.0220) 0.83 0.97
GAN 0.9995 (0.9999,0.0005) 0.9989 (0.9996,0.0009) 0.8497 (0.9367,0.2240) 0.93 0.95
SMOTified-GAN 0.9996 (0.9997,0.0004) 0.9993 (0.9996,0.0006) 0.8632 (0.9368,0.2009) 0.93 0.95
Spambase Non-oversampled 0.9476 (0.9527,0.0020) 0.9309 (0.9380,0.0027) 0.9152 (0.9213,0.0032) 0.91 0.92
SMOTE 0.9455 (0.9526,0.0026) 0.9276 (0.9336,0.0031) 0.9129 (0.9204,0.0049) 0.92 0.92
GAN 0.9571 (0.9599,0.0019) 0.9319 (0.9380,0.0030) 0.9172 (0.9222,0.0036) 0.93 0.92
SMOTified-GAN 0.9583 (0.9602,0.0012) 0.9323 (0.9380,0.0026) 0.9174 (0.9222,0.0031) 0.94 0.91
Connect4 Non-oversampled 0.9947 (0.9966,0.0014) 0.9948 (0.9965,0.0013) 0.9361 (0.9578,0.0151) 0.92 1.00
SMOTE 0.9967 (0.9970,0.0001) 0.9912 (0.9930,0.0007) 0.9011 (0.9167,0.0068) 0.85 1.00
GAN 0.9962 (0.9982,0.0010) 0.9938 (0.9965,0.0021) 0.9251 (0.9577,0.0180) 0.92 1.00
SMOTified-GAN 0.9966 (0.9986,0.0009) 0.9946 (0.9965,0.0017) 0.9355 (0.9578,0.0158) 0.92 1.00
Yeast Non-oversampled 0.9748 (0.9780,0.0013) 0.9323 (0.9417,0.0097) 0.6987 (0.7272,0.0305) 0.80 0.67
SMOTE 0.9638 (0.9757,0.0083) 0.9139 (0.9417,0.0152) 0.7012 (0.7857,0.0446) 0.82 0.75
GAN 0.9782 (0.9811,0.0028) 0.9595 (0.9514,0.0036) 0.8173 (0.8333,0.0169) 0.83 0.83
SMOTified-GAN 0.9663 (0.9703,0.0023) 0.9611 (0.9611,0.0044) 0.8221 (0.8695,0.0223) 0.91 0.83
Ecoli Non-oversampled 0.9328 (0.9328,0.0000) 0.9701 (0.9701,0.0000) 0.0000 (0.0000,0.0000) 0.00 0.00
SMOTE 0.9905 (0.9959,0.0020) 0.9577 (0.9701,0.0100) 0.5684 (0.6666,0.0890) 0.50 1.00
GAN 0.9885 (0.9919,0.0013) 0.9880 (1.0000,0.0099) 0.8266 (1.0000,0.1964) 1.00 1.00
SMOTified-GAN 0.9861 (0.9879,0.0010) 0.9960 (1.0000,0.0077) 0.9222 (1.0000,0.1433) 1.00 1.00
Pageblocks Non-oversampled 0.9627 (0.9627,0.0000) 0.9775 (0.9894,0.0050) 0.7803 (0.9090,0.0700) 1.00 0.82
SMOTE 0.9955 (1.0000,0.0030) 0.9761 (1.0000,0.0130) 0.8480 (1.0000,0.0780) 1.00 1.00
GAN 0.9943 (0.9972,0.0020) 0.9858 (1.0000,0.0098) 0.9038 (1.0000,0.0793) 1.00 1.00
SMOTified-GAN 0.9943 (1.0000,0.0043) 0.9989 (1.0000,0.0042) 0.9926 (1.0000,0.0291) 1.00 1.00
Wine Non-oversampled 0.9770 (0.9770,0.0000) 0.9541 (0.9541,0.0000) 0.0000 (0.0000,0.0000) 0.00 0.00
SMOTE 0.9806 (0.9843,0.0020) 0.9081 (0.9389,0.0090) 0.3149 (0.5000,0.0651) 0.39 0.67
GAN 0.9841 (0.9873,0.0020) 0.9549 (0.9618,0.0073) 0.4489 (0.5454,0.1112) 0.46 0.67
SMOTified-GAN 0.9854 (0.9873,0.0010) 0.9558 (0.9694,0.0090) 0.5274 (0.6000,0.0780) 0.53 0.69
Poker Non-oversampled 0.9902 (0.9906,0.0008) 0.9949 (0.9966,0.0020) 0.6300 (0.8000,0.2530) 1.00 0.67
SMOTE 1.0000 (1.0000,0.0000) 1.0000 (1.0000,0.0000) 1.0000 (1.0000,0.0000) 1.00 1.00
GAN 1.0000 (1.0000,0.0000) 1.0000 (1.0000,0.0000) 1.0000 (1.0000,0.0000) 1.00 1.00
SMOTified-GAN 1.0000 (1.0000,0.0000) 1.0000 (1.0000,0.0000) 1.0000 (1.0000,0.0000) 1.00 1.00
Table 3: Comparison of experimental results on NN with baseline methods (SMOTE, GAN, SMOTified-GAN and non-oversampled original data)
Figure 5: Precision vs Recall (AUC) curves for (a) abalone-1, (b) creditcard, (c) ecoli, (d) ionosphere, (e) connect4, (f) pageblocks, (g) poker, (h) shuttle, (i) spambase
Figure 6: SMOTified-GAN's loss vs. epochs for the best F1-score on (a) abalone-1, (b) creditcard, (c) ecoli, (d) ionosphere, (e) pageblocks, (f) poker

Figure 5 presents precision-recall curves for the tested datasets, along with the corresponding area under the curve (AUC), a standard performance measure for imbalanced data. The results for nine datasets are illustrated for all the tested algorithms. It is clear from the graphs that our proposed SMOTified-GAN has the highest AUC for all the datasets except Abalone. For example, Figure 5(h) shows the AUC for the Shuttle dataset, where SMOTified-GAN, GAN, no data augmentation and SMOTE achieve 0.949, 0.911, 0.891 and 0.712 respectively, in descending order; SMOTified-GAN is better than the next best algorithm by up to 0.038. GAN and SMOTE show mixed results, as discussed earlier with the F1-scores.
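For reference, a precision-recall curve and its AUC can be computed as in the following sketch, where the toy labels and scores are placeholders for a classifier's predicted minority-class probabilities on a test set.

# Sketch of computing a precision-recall curve and its AUC for the minority
# class; y_true and y_score are toy placeholders.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = (rng.random(500) < 0.1).astype(int)                     # imbalanced ground truth (toy)
y_score = np.clip(y_true * 0.6 + rng.random(500) * 0.5, 0, 1)    # toy predicted minority probabilities

precision, recall, _ = precision_recall_curve(y_true, y_score)
print("PR-AUC:", auc(recall, precision))                         # area under the precision-recall curve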

5 Discussion

A significant improvement in the quality of classification is observed with the introduction of SMOTified-GAN as an oversampling technique. It clearly outperforms naïve GAN and SMOTE on most of the datasets. The F1-score is improved by up to 9% for the Ecoli dataset over the next best oversampling technique, and the precision also shows significant growth of around 7% to 8% for the Wine and Yeast datasets; the recall is not improved much. The most notable improvements over GAN are seen on the Abalone, Pageblocks, Wine and Shuttle datasets, and over SMOTE on the Credit-Card Fraud and Wine datasets.

Furthermore, although the best algorithm may not be clearly visible from the curves in Figure 5, the AUC measures for each graph show that SMOTified-GAN outperforms the other algorithms: the larger the area under the curve, the better the algorithm. For example, for the Shuttle dataset the order from best to worst is SMOTified-GAN (0.949), GAN (0.911), no oversampling (0.891) and then SMOTE (0.712), so SMOTified-GAN is 3.5% better than the next best algorithm. Similarly, it is 2.1% better than the second best algorithm for Spambase.

Future work can feature a Bayesian framework where MCMC sampling methods are used to incorporate uncertainty in the predictions and develop a probabilistic data generation process via GANs. The proposed framework can be used in a wide range of problems that face class imbalance issues. It can also be used to improve few-shot learning wang_generalizing_2020, addressing problems where a model finds it difficult to draw decision boundaries due to a lack of data. Moreover, we can investigate whether the method can address bias-variance problems in order to improve the generalisation ability of the model when the training data differs significantly from the test dataset.

6 Conclusion

We presented a framework that addresses class-imbalanced pattern classification problems by combining features of GAN and SMOTE. Our results show that the proposed framework significantly improves the majority of the class-imbalanced problems tested, with improvements of up to 9% in F1-score on the benchmark datasets. Since it is an offline pre-processing technique with a reasonable time complexity of around $O(N^2 d^2 T)$, it does not affect the efficiency of the training process. We also visualised the learning process and found that the AUC of SMOTified-GAN is better than the next best algorithm by up to 2.1% (for Spambase) and 3.5% (for Shuttle).

There are several possible future directions for this work, such as applying SMOTified-GAN with other neural networks such as CNNs and recurrent neural networks (RNNs) to over-sample imbalanced image datasets and time-series data, respectively. We would also like to investigate combining GAN with other over-sampling techniques such as MCMC. Furthermore, different variations of SMOTE and GAN could improve SMOTified-GAN further.

Code and Data

We provide Python code and data for extending this work further: https://github.com/sydney-machine-learning/GANclassimbalanced.

References