Imbalanced datasets occur frequently across the many fields in which Machine Learning (ML) is applied, including business, finance, banking, and medical science. Oversampling approaches are a popular choice for dealing with imbalanced datasets (SMOTE, Han2, He, Bunkhumpornpat2009, Barua2014). Here we present Localized Randomized Affine Shadowsampling (LoRAS), which produces better ML models for imbalanced datasets than state-of-the-art oversampling techniques such as SMOTE and several of its extensions. We use computational analyses and a mathematical proof to demonstrate that drawing samples from an approximated data manifold of the minority class is key to successful oversampling. We validated the approach on 28 imbalanced datasets, comparing LoRAS with several state-of-the-art oversampling techniques. Averaged over all these datasets, LoRAS performs better than the other oversampling techniques that we investigated. In addition, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique because it provides a better estimate of the local mean of the underlying data distribution in some neighbourhood of the minority class data space.
For imbalanced datasets, the number of instances in one (or more) class(es) is very high (or very low) compared to the other class(es). A class having a large number of instances is called a majority class and one having far fewer instances is called a minority class. This imbalance makes it difficult to learn from such datasets using standard ML approaches. Oversampling approaches are often used to counter this problem by generating synthetic samples for the minority class to balance the number of data points for each class. SMOTE is a widely used oversampling technique, which has received various extensions since its publication (SMOTE). The key idea behind SMOTE is to randomly sample artificial minority class data points along line segments joining a minority class data point to its nearest neighbors within the minority class.
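The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation of SMOTE: the function name `smote_sample`, the Euclidean distance, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Generate one SMOTE-style synthetic point from an (n, d) minority array."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(minority)
    # Pick a random minority point as the base point.
    i = rng.integers(n)
    base = minority[i]
    # Find its k nearest minority neighbours (Euclidean, excluding itself).
    dists = np.linalg.norm(minority - base, axis=1)
    dists[i] = np.inf
    neighbours = np.argsort(dists)[:min(k, n - 1)]
    # Pick one neighbour and interpolate on the segment joining the two points.
    j = rng.choice(neighbours)
    gap = rng.uniform(0.0, 1.0)
    return base + gap * (minority[j] - base)

# Toy minority class (assumed data, for illustration only).
rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))
synthetic = np.array([smote_sample(minority, k=5, rng=rng) for _ in range(100)])
```

Every synthetic point is a convex combination of two existing minority points, so it always lies on a line segment between them.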
The SMOTE algorithm, however, has several limitations. For example, it does not consider the distribution of minority classes and latent noise in a dataset (Hu2009). SMOTE is known to frequently over-generalize the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model (punt). Several other limitations of SMOTE are mentioned in Blagus2013. To overcome such limitations, several algorithms have been proposed as extensions of SMOTE. Some focus on improving the generation of synthetic data by combining SMOTE with other techniques, including the combination of SMOTE with Tomek-links (ElhassanT2016) (Gao, Wang), rough set theory (Ram), kernel based approaches (Mathew), Boosting (Chawla2), and Bagging (Hanifah). Other approaches choose subsets of the minority class data to generate SMOTE samples or cleverly limit the number of synthetic data points generated (Narayan). Some examples are Borderline1/2 SMOTE (Han2), ADAptive SYNthetic (ADASYN) (He), Safe Level SMOTE (Bunkhumpornpat2009), Majority Weighted Minority Oversampling TEchnique (MWMOTE) (Barua2014), Modified SMOTE (MSMOTE), and Support Vector Machine-SMOTE (SVM-SMOTE) (Suh) (see Table 1) (Hu2009). Recent comparative studies have focused on SMOTE, Borderline1/2 SMOTE models, ADASYN, and SVM-SMOTE (Suh, Ah-Pine2016), which is why we focus on these five models for a comparison with our newly developed oversampling technique LoRAS.
LoRAS allows us to resample the data uniformly from an approximated data manifold of the minority class data points, thus creating a more balanced and robust model. A LoRAS oversample is an unbiased estimator of the mean of the underlying local probability distribution followed by a minority class sample (assuming that it is some random variable). The variance of this estimator is significantly lower than that of a SMOTE-generated oversample, which is also an unbiased estimator of the same mean.
|Borderline1/2 SMOTE (Han2)||Identifies borderline samples and applies SMOTE on them|
|ADASYN (He)||Adaptively changes the weights of different minority samples|
|SVM-SMOTE (Suh)||Generates new minority samples near borderlines with SVM|
|Safe-Level-SMOTE (Bunkhumpornpat2009)||Generates data in areas that are completely safe|
|MWMOTE (Barua2014)||Identifies and weighs ambiguous minority class samples|
2 LoRAS: Localized Randomized Affine Shadowsampling
In this section we discuss our strategy to approximate the data manifold, given a small dataset. A typical dataset for a supervised ML problem consists of a set of features , that are used to characterize patterns in the data and a set of labels or ground truth. Ideally, the number of instances or samples should be significantly greater than the number of features. In order to maintain the mathematical rigor of our strategy we propose the following definition for a small dataset.
Consider a class or the whole dataset with samples and features. If , then we call the dataset, a small dataset.
The LoRAS algorithm is designed to learn from a small dataset by approximating the underlying data manifold. Assuming that is the best possible set of features to represent the data and all features are equally important, we can think of a data oversampling model to be a function , that is, uses parent data points (each with features) to produce an oversampled data point in .
We define a random affine combination of some arbitrary vectors as the affine linear combination of those vectors, such that the coefficients of the linear combination are chosen randomly. Formally, a vector , , is a random affine combination of vectors , () if , and are chosen randomly from a Dirichlet distribution.
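A random affine combination in the sense of this definition can be sketched as follows. The sketch assumes a Dirichlet distribution with all concentration parameters equal to 1; the helper name `random_affine_combination` and the example triangle are hypothetical.

```python
import numpy as np

def random_affine_combination(vectors, rng=None):
    """Return a random affine combination of the rows of `vectors` and its weights."""
    rng = np.random.default_rng() if rng is None else rng
    vectors = np.asarray(vectors, dtype=float)
    # Dirichlet(1, ..., 1) weights are non-negative and sum to one.
    alpha = rng.dirichlet(np.ones(len(vectors)))
    return alpha @ vectors, alpha

# Three vectors spanning a triangle in the plane (assumed toy input).
rng = np.random.default_rng(0)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
v, alpha = random_affine_combination(points, rng)
```

Because the weights are non-negative and sum to one, `v` always falls inside the triangle spanned by the three input vectors.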
The simplest way of augmenting a data point would be to take the average (or random affine combination as defined in Definition 2) of two data points as an augmented data point. But, when we have features, we can assume that the hypothetical manifold on which our data lies is -dimensional. An -dimensional manifold can be approximated by a collection of -dimensional planes.
Given sample points we could exactly derive the equation of an unique -dimensional plane containing these sample points. By Definition 1, for a small dataset, however, , and thus, there is even a possibility that . To resolve this problem, we create shadow data points or shadowsamples from our
parent data points in the minority class. Each shadow data point is generated by adding noise from a normal distribution,for all features , where is some function of the sample variance for the feature . For each of the data points we can generate shadow data points such that, . Now it is possible for us to choose shadow data points from the shadow data points even if .
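The shadowsample construction can be illustrated with a short sketch. It assumes zero-mean Gaussian noise with one standard deviation per feature. The function name `make_shadowsamples`, the noise level 0.005, and the count of 40 shadowsamples per parent are hypothetical values chosen only for this example.

```python
import numpy as np

def make_shadowsamples(parents, sigma, num_shadow_points, rng=None):
    """Return shadowsamples of shape (n_parents, num_shadow_points, n_features)."""
    rng = np.random.default_rng() if rng is None else rng
    parents = np.asarray(parents, dtype=float)
    n, d = parents.shape
    # One standard deviation per feature; a scalar is broadcast to all features.
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (d,))
    noise = rng.normal(loc=0.0, scale=sigma, size=(n, num_shadow_points, d))
    # Each shadowsample is its parent plus feature-wise Gaussian noise.
    return parents[:, None, :] + noise

# Fewer parents (3) than features (5): the situation the shadowsamples address.
rng = np.random.default_rng(0)
parents = rng.normal(size=(3, 5))
shadows = make_shadowsamples(parents, sigma=0.005, num_shadow_points=40, rng=rng)
```

With 3 parents and 40 shadowsamples each, 120 points are available, so choosing as many points as there are features becomes possible even though there are fewer parents than features.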
Since real life data are mostly nonlinear, to approximate the data manifold effectively, we have to localize our strategy. For each parent data point in a small dataset , let us denote by the set of k-nearest neighbors (including ) of in . We can always choose in such a way that . Every time we choose shadow data points as follows: we first choose a random parent data point and then restrict the domain of choice to the shadowsamples generated by the parent data points in .
We then take a random affine combination of the chosen shadowsamples to create one augmented Localized Random Affine Shadowsample or a LoRAS sample as defined in Definition 2. Thus, a LoRAS sample is an artificially generated sample drawn from an -dimensional plane, which locally approximates the underlying hypothetical -dimensional data manifold.
|C_maj:||Majority class parent data points|
|C_min:||Minority class parent data points|
|k:||Number of nearest neighbors to be considered per parent data point (default value : if , otherwise)|
||Sp|:||Number of generated shadowsamples per parent data point (default value : )|
|L:||List of standard deviations for normal distributions for adding noise to each feature (default value :)|
|Naff:||Number of shadow points to be chosen for a random affine combination (default value : )|
|Ngen:||Number of generated LoRAS points for each nearest neighbors group (default value : )|
Theoretically, we can generate LoRAS samples such that and use them for training an ML model. In this article, all imbalanced classification problems that we deal with are binary classification problems. For such a problem, there is a minority class C_min containing relatively few samples compared to a majority class C_maj. We can thus consider the minority class as a small dataset and use the LoRAS algorithm to oversample. For every data point we can denote a set of shadowsamples generated from as . In practice, one can also choose shadowsamples for an affine combination and choose a desired number of oversampled points to be generated using the algorithm. We can look at LoRAS as an oversampling algorithm as described in Algorithm 1.
The LoRAS algorithm described above can be used for oversampling minority classes in highly imbalanced datasets. The input variables for our algorithm are: the number of nearest neighbors per sample k, the number of generated shadow points per parent data point |Sp|, the list L of standard deviations of the normal distributions used to add noise to every feature and thus generate the shadowsamples, the number of shadowsamples to be chosen for affine combinations Naff, and the number of generated points for each nearest neighbors group Ngen.
The default values of the LoRAS parameters are given in Algorithm 1, which shows the pseudocode for the LoRAS algorithm. One could use a random grid search to find appropriate parameter combinations within given parameter ranges. As an output, our algorithm generates a LoRAS dataset for the oversampled minority class, which can subsequently be used to train an ML model.
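The steps of the algorithm can be put together in a compact sketch. This is a simplified illustration written for this text, not the authors' implementation: the function `loras_sketch` is hypothetical, its argument names only mirror the parameters k, |Sp|, L, Naff, and Ngen, every minority point is used once as the centre of a neighbourhood, and the parameter values in the example are arbitrary.

```python
import numpy as np

def loras_sketch(minority, k, num_shadow_points, sigma, num_aff_comb,
                 num_generated_points, rng=None):
    """Generate LoRAS-style samples from an (n, d) minority class array."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    n, d = minority.shape
    generated = []
    for p in minority:
        # k nearest minority neighbours of p, including p itself (distance 0).
        dists = np.linalg.norm(minority - p, axis=1)
        neighbourhood = minority[np.argsort(dists)[:k]]
        # Shadowsamples: Gaussian noise around every parent in the neighbourhood.
        noise = rng.normal(0.0, sigma, size=(len(neighbourhood), num_shadow_points, d))
        shadows = (neighbourhood[:, None, :] + noise).reshape(-1, d)
        for _ in range(num_generated_points):
            # Choose num_aff_comb distinct shadowsamples from the neighbourhood pool.
            idx = rng.choice(len(shadows), size=num_aff_comb, replace=False)
            # Random affine combination with Dirichlet(1, ..., 1) weights.
            alpha = rng.dirichlet(np.ones(num_aff_comb))
            generated.append(alpha @ shadows[idx])
    return np.array(generated)

# Toy minority class with 10 points and 4 features (assumed data).
rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 4))
oversampled = loras_sketch(minority, k=5, num_shadow_points=10, sigma=0.005,
                           num_aff_comb=4, num_generated_points=3, rng=rng)
```

The sketch requires `k * num_shadow_points >= num_aff_comb` so that enough shadowsamples exist in each neighbourhood; `sigma` may be a scalar or one value per feature.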
The implementation of the algorithm in Python (V 3.7.4) and an example Jupyter Notebook for the credit card fraud detection dataset are provided in the GitHub repository https://github.com/narek-davtyan/LoRAS. In our code on GitHub, |Sp| corresponds to num_shadow_points, L corresponds to list_sigma_f, Naff corresponds to num_aff_comb, and Ngen corresponds to num_generated_points.
we choose three of the closest neighbors (using knn) to build a neighborhood of, depicted as the box. (b) Extracting the four data points in the closest neighborhood of (including ). (c) Drawing shadow points from a normal distribution, centered at these parent data point . (d) We randomly choose three shadow points at a time to obtain a random affine combination of them (spanning a triangle). We finally generate a novel LoRAS sample point from the neighborhood of a single data point .
3 Case studies
For testing the potential of LoRAS as an oversampling approach, we designed benchmarking experiments with a total of 28 imbalanced datasets. With this number of diverse case studies we should have a comprehensive idea of the advantages of LoRAS over other existing oversampling methods.
3.1 Datasets used for validation
Here we provide a brief description of the datasets and the sources that we have used for our studies.
Scikit-learn imbalanced benchmark datasets: The imblearn.datasets package complements the sklearn.datasets package. It provides 27 pre-processed imbalanced datasets. The datasets span a large range of real-world problems from several fields such as business, computer science, biology, medicine, and technology. This collection of datasets was proposed in the imblearn.datasets Python library by Lema and benchmarked by Ding. Many of these datasets have been used in various research articles on oversampling approaches (Ding, saez).
Credit card fraud detection dataset: We obtained the description of this dataset from the website https://www.kaggle.com/mlg-ulb/creditcardfraud. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred in two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172 percent of all transactions. The dataset contains only numerical input variables, which are the result of a PCA transformation. The feature variables V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount. The labels are encoded in the ‘Class’ variable, which is the response variable and takes the value 1 in case of fraud and 0 otherwise (cfraud).
For each case study, we split the dataset into 50% training and 50% testing data. We did a pilot study with ML classifiers such as k-nearest neighbors (knn), Support Vector Machine (svm) (linear kernel), Logistic regression (lr), Random forest (rf), and AdaBoost. As inferred in (Blagus2013), knn, svm, and lr are effective models with SMOTE oversampling. For each dataset, except for the credit card fraud detection dataset, we chose the ML model among knn, svm, and lr that has the best classification accuracy for the minority class. For the credit card dataset we used both lr and rf models. For computational coding, we used the scikit-learn (V 0.21.2), numpy (V 1.16.4), pandas (V 0.24.2), and matplotlib (V 3.1.0) libraries in Python (V 3.7.4).
First, we trained the models with the unmodified dataset to observe how they perform without any oversampling. Then, we oversampled the minority class using SMOTE, Borderline1 SMOTE, Borderline2 SMOTE, SVM SMOTE, ADASYN, and LoRAS and retrained the ML algorithms on the oversampled datasets. We then measured the performance of our models using metrics such as Balanced Accuracy and F1-Score. In our study, we benchmark LoRAS against several other oversampling algorithms on the 27 benchmark datasets. To ensure a fair comparison, we oversampled such that the total number of augmented samples generated from the minority class was as close as possible to the number of samples in the majority class, as far as each oversampling algorithm allowed. For the credit card fraud detection dataset we compared the performances of several oversampling techniques, including LoRAS, and of several ML models, ensuring that we built the best possible ML model using customized parameter settings for the respective oversampling techniques. For this case we also chose the ML models lr and rf since their performances were the best.
LoRAS has several parameters (k, |Sp|, L, Naff, Ngen). For a fair comparison with other models, we chose the same value of k, the number of nearest neighbors of a minority class sample, wherever applicable. In case of LoRAS, for the parameter Naff we chose the number of features as input for all datasets and for L, we chose a list consisting of a constant value of for each dataset. For the parameter Ngen we used as the default value for the 27 benchmark datasets in the imblearn.datasets Python library.
For imbalanced datasets there are more meaningful performance measures than Accuracy, including Sensitivity or Recall, Precision, F1-Score (F-Measure), and Balanced Accuracy, all of which can be derived from the Confusion Matrix generated while testing the model. For a given class, the different combinations of recall and precision have the following meanings:
High Precision & High Recall: The model handled the classification task properly
High Precision & Low Recall: The model misses many data points of the particular class, but is highly reliable when it does assign a data point to that class
Low Precision & High Recall: The model classifies the data points of the particular class well, but misclassifies a high number of data points from other classes as the class in consideration
Low Precision & Low Recall: The model handled the classification task poorly
F1-Score is calculated as the harmonic mean of precision and recall and therefore balances a model in terms of precision and recall. Balanced Accuracy is the mean of the individual class accuracies and, in this context, it is more informative than the usual accuracy score. High Balanced Accuracy ensures that the ML algorithm learns adequately for each individual class. These measures have been defined and discussed thoroughly by AbdElrahman2013. We will use the above-mentioned performance measures wherever applicable in this article.
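To make these measures concrete, the following sketch computes them from the entries of a binary confusion matrix. The helper `binary_metrics` and the two toy prediction vectors are hypothetical and serve only to show why Accuracy alone can be misleading on imbalanced data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1 and balanced accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # accuracy on the positive class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # accuracy on the negative class
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Toy test set: 10 minority (label 1) and 90 majority (label 0) samples.
y_true = [1] * 10 + [0] * 90
# A classifier that always predicts the majority class.
lazy = binary_metrics(y_true, [0] * 100)
# A classifier that finds 8 of 10 minority samples with 5 false alarms.
better = binary_metrics(y_true, [1] * 8 + [0] * 2 + [1] * 5 + [0] * 85)
```

The majority-only classifier reaches an Accuracy of 0.9 but a Balanced Accuracy of only 0.5 and an F1-Score of 0, whereas the second classifier scores well on all three measures.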
Scikit-learn imbalanced datasets: In Table 2 we show the Balanced Accuracies and F1-Scores for the 27 benchmark datasets from the imblearn.datasets library.
Averaged over all datasets, LoRAS has the best Balanced Accuracy and F1-Score. As expected, SMOTE improved both Balanced Accuracy and F1-Score compared to model training without oversampling. Interestingly, the oversampling approaches SVM-SMOTE and Borderline1 SMOTE also improved the average F1-Score compared to SMOTE, but at the cost of a lower Balanced Accuracy. Of these two, SVM-SMOTE improved the F1-Score the most, but has the lower Balanced Accuracy. In contrast, our LoRAS approach produces a better Balanced Accuracy than SMOTE on average while maintaining the highest average F1-Score among all oversampling techniques.
|Oversampling technique||Highest Balanced Accuracy||Highest F1-Score|
From Table 3, we see that LoRAS performs best in terms of Balanced Accuracy and F1-Score for 11 and 9 datasets respectively.
Thus, LoRAS outperforms the other oversampling algorithms in terms of both Balanced Accuracy and F1-Score on the largest number of datasets.
We also observe a trend that the oversampling approaches that perform well in terms of F1-Score have a relatively weaker performance in terms of Balanced Accuracy.
Interestingly, LoRAS not only proves to be the best choice for the highest number of datasets but also retains its performance for both of the performance measures.
Credit card fraud detection dataset: The credit card fraud detection dataset has 492 fraud instances out of 284,807 transactions. The task is to predict fraudulent transactions. In Table 4, we show the number of samples generated by several oversampling approaches. For testing, we have 246 fraud samples and 142,158 non-fraudulent samples in each case.
|Oversampling technique||Minority Training samples||Majority Training samples|
We summarize our results in a tabular form in Table 5. In the table we show the scores of our models for the performance measures: Precision, Recall, F1-Score, and Balanced Accuracy for lr and rf ML models.
|Oversampling||ML model||Recall||Precision||F1-Score||Balanced Acc.|
From Table 5 we infer that the rf model with LoRAS oversampling has the best F1-Score. Interestingly, LoRAS with both lr and rf produces a Balanced Accuracy higher than 0.85 and an F1-Score higher than 0.8. Other models such as SVM SMOTE (with both lr and rf) and ADASYN with lr also produce very good results. Thus LoRAS produces a better F1-Score with a reasonable compromise on the Balanced Accuracy.
We have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique since it provides a better estimate of the mean of the underlying local data distribution of the minority class data space. Let be an arbitrary minority class sample. Let be the set of the k-nearest neighbors of , which we will consider as the neighborhood of . Both SMOTE and LoRAS focus on generating augmented samples within one neighborhood at a time. We assume that a random variable follows a shifted t-distribution with degrees of freedom, location parameter , and scaling parameter . Note that here is not referring to the standard deviation but sets the overall scaling of the distribution (Jackman), which we choose to be the sample variance in the neighborhood of . A shifted t-distribution is used to estimate population parameters if there are few samples (usually, 30) and/or the population variance is unknown. Since in SMOTE or LoRAS we generate samples from a small neighborhood, we can argue in favour of our assumption that locally, a minority class sample, as a random variable, follows a t-distribution. Following Blagus2013, we assume that if then and are independent. For , we also assume:
where, and denote the expectation and variance of the random variable respectively. However, the mean has to be estimated by an estimator statistic (i.e. a function of the samples). Both SMOTE and LoRAS can be considered as an estimator statistic for the mean of the t-distribution that follows locally.
Both SMOTE and LoRAS are unbiased estimators of the mean of the t-distribution that follows locally. However, the variance of the LoRAS estimator is less than the variance of SMOTE given that .
A shadowsample is a random variable where , the neighborhood of some arbitrary and follows .
assuming and are independent. Now, a LoRAS sample , where are shadowsamples generated from the elements of the neighborhood of , , such that . The affine combination coefficients follow a Dirichlet distribution with all concentration parameters assuming equal values of 1 (assuming all features to be equally important). For arbitrary ,
where denotes the covariance of two random variables and . Assuming and to be independent,
Thus is an unbiased estimator of . For ,
since is independent of . For an arbitrary , -th component of a LoRAS sample
For LoRAS, we take an affine combination of shadowsamples and SMOTE considers an affine combination of two minority class samples. Note, that since a SMOTE generated oversample can be interpreted as a random affine combination of two minority class samples, we can consider, for SMOTE, independent of the number of features. Also, from Equation 3, this implies that SMOTE is an unbiased estimator of the mean of the local data distribution. Thus, the variance of a SMOTE generated sample as an estimator of would be (since for SMOTE). But for LoRAS as an estimator of , when , the variance would be less than that of SMOTE. ∎
This implies that, locally, LoRAS can estimate the mean of the underlying t-distribution better than SMOTE.
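The variance argument can be checked numerically with a small Monte Carlo simulation. The sketch below works under simplifying assumptions that go beyond the text: one feature component, independent draws from a standard t-distribution with 5 degrees of freedom, independent Gaussian shadow noise, 10 shadowsamples per LoRAS-like sample, and Dirichlet weights with all concentration parameters equal to 1. Under these assumptions, a random affine combination of m independent draws with common variance s has variance 2s/(m+1), which the simulation compares against.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000
df = 5                   # degrees of freedom of the local t-distribution (assumed)
num_shadows = 10         # shadowsamples per affine combination (assumed)
noise_sd = 0.1           # standard deviation of the shadow noise (assumed)

def affine_estimates(samples):
    """Row-wise random affine combinations with Dirichlet(1, ..., 1) weights."""
    alpha = rng.dirichlet(np.ones(samples.shape[1]), size=samples.shape[0])
    return (alpha * samples).sum(axis=1)

# SMOTE-like estimator: affine combination of two minority samples.
smote_like = affine_estimates(rng.standard_t(df, size=(trials, 2)))

# LoRAS-like estimator: affine combination of noisy shadowsamples.
shadows = rng.standard_t(df, size=(trials, num_shadows))
shadows = shadows + rng.normal(0.0, noise_sd, size=shadows.shape)
loras_like = affine_estimates(shadows)

var_t = df / (df - 2)    # variance of a standard t-distribution with df > 2
print("SMOTE-like variance:", smote_like.var(), "theory:", 2 * var_t / 3)
print("LoRAS-like variance:", loras_like.var(),
      "theory:", 2 * (var_t + noise_sd**2) / (num_shadows + 1))
```

Both estimators are centred on the location of the t-distribution (zero here), and the empirical variance of the LoRAS-like estimator is markedly smaller, in line with the statement above.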
Oversampling with LoRAS produces comparatively balanced ML model performances on average, in terms of Balanced Accuracy and F1-Score. This is because, in most cases, LoRAS produces fewer misclassifications on the majority class with a reasonably small compromise for misclassifications on the minority class. Moreover, we infer that our LoRAS oversampling strategy can better estimate the mean of the underlying local distribution for a minority class sample (considering it a random variable).
The distribution of the minority class data points is taken into account by oversampling techniques such as Borderline1 SMOTE, Borderline2 SMOTE, SVM-SMOTE, and ADASYN (Gosain2017). SMOTE and LoRAS are the only two techniques, among the oversampling techniques we explored, that deal with the problem of imbalance just by generating new data points, independent of the distribution of the minority and majority class data points. Thus, comparing LoRAS and SMOTE gives a fair impression of the performance of our novel LoRAS algorithm as an oversampling technique, without considering any aspect of the distributions of the minority and majority class data points and relying just on resampling. Other extensions of SMOTE such as Borderline1 SMOTE, Borderline2 SMOTE, SVM-SMOTE, and ADASYN can also be built on the principle of the LoRAS oversampling strategy. According to our analyses, LoRAS already shows great potential on a broad variety of applications and emerges as a true alternative to SMOTE for processing highly unbalanced datasets.
Availability of code: The implementation of the algorithm in Python (V 3.7.4) and an example Jupyter Notebook for the credit card fraud detection dataset are provided in the GitHub repository https://github.com/narek-davtyan/LoRAS. In our code, |Sp| corresponds to num_shadow_points, L corresponds to list_sigma_f, Naff corresponds to num_aff_comb, and Ngen corresponds to num_generated_points.
Acknowledgements: We thank Prof. Ria Baumgrass from Deutsches Rheuma-Forschungszentrum (DRFZ), Berlin for enlightening discussions on small datasets occurring in her research related to cancer therapy, which led us to the current work. We thank the German Network for Bioinformatics Infrastructure (de.NBI) and Establishment of Systems Medicine Consortium in Germany e:Med for their support, as well as the German Federal Ministry for Education and Research (BMBF) programs (FKZ 01ZX1709C) for funding us.