Imbalanced class distribution is a challenge that arises in many real-world applications. It usually appears in the context of a binary classification problem, where members of the negatively labeled class vastly outnumber members of the positively labeled class. In such cases, learning models tend to be biased towards the negatively labeled class. At the same time, the positively labeled instances are often of greater importance. This issue is prevalent in medical diagnosis, fraud detection, network intrusion detection, and many other fields involving rare events. To combat the problem of class imbalance, researchers have proposed various strategies that can be broadly divided into four categories: resampling, cost-sensitive learning, one-class learning, and feature selection. Resampling involves balancing the class distribution by either undersampling the majority class or oversampling the minority class. This is a very popular approach that has been shown to perform well in various scenarios. However, it is not without limitations: undersampling leads to the loss of potentially valuable information, and oversampling may lead to overfitting. Cost-sensitive learning is based on the idea of increasing the penalty for misclassifying minority class instances. Since the classifier's objective is to minimize the overall cost, more emphasis is placed on instances of the minority class. One-class learning involves training a classifier on data with the target variable restricted to a single class. By ignoring all the majority class examples, a classifier can get a clearer picture of the minority class. Feature selection methods attempt to identify features that are effective in discriminating minority class instances. This approach is particularly effective in cases involving high-dimensional datasets.
In this paper, we propose a sampling approach based on kernel density estimation to deal with imbalanced class distribution. Kernel density estimation is a well-known method for estimating an unknown probability density function from a given sample [23, 25]. It estimates the unknown density function by averaging over a set of identical kernel functions, each centered at a sample point. Having estimated the density of the minority class, we can then generate new sample points from the estimated density function. The proposed technique offers an intelligent and effective approach to synthesizing new instances based on well-grounded statistical theory. Numerical experiments show that our method can perform better than other existing resampling techniques such as random sampling, SMOTE, ADASYN, and NearMiss. The paper is organized as follows. In Section 2, we give an overview of the relevant literature for our study. In Section 3, we describe the methodology used in the study. We present our results in Section 4, and Section 5 concludes the paper.
The problem of class imbalance arises in a number of real-life applications, and various approaches to address this issue have been put forth by researchers. Krawczyk presents a good overview of the current trends in the field. One of the common ways to tackle class imbalance is resampling, whereby the majority class is undersampled and/or the minority class is oversampled.
In the former, a portion of the majority class instances is sampled according to some strategy to achieve a more balanced class distribution. Similarly, in the latter approach the minority class is repeatedly sampled to increase its proportion relative to the majority class. One of the more popular undersampling techniques is NearMiss, where the negative samples are selected so that the average distance to the closest samples of the positive class is the smallest. In a slightly different variation of NearMiss, those negative samples are selected for which the average distance to the farthest samples of the positive class is the smallest. As shown by Liu et al., an informed undersampling technique can lead to good results. However, in general, undersampling inevitably leads to the loss of information. On the other hand, random sampling of the minority class (with replacement) can also cause issues such as overfitting. More advanced sampling techniques attempt to overcome the issue of overfitting by generating new samples of the minority class in a more intelligent manner. In this regard, Chawla et al. proposed a popular oversampling technique called SMOTE. In their approach, new instances are generated by random linear interpolation between the existing minority samples. Given a minority sample point, a new random point is chosen along the line segment joining it to one of its nearest neighbors. This method has proven to be effective in a number of applications. Another popular variant of SMOTE is an adaptive algorithm called ADASYN. It creates more examples in the neighborhood of the boundary between the two classes than in the interior of the minority class.
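The interpolation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference SMOTE implementation: the function name `smote_like_samples`, its parameters, and the toy data are hypothetical, and neighbors are found by brute-force Euclidean distance.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new points by interpolating between minority samples.

    Hypothetical helper for illustration only (simplified SMOTE-style step).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)    # randomly chosen base points
    nbr = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))             # interpolation weights in [0, 1)

    # New point lies on the segment joining the base point to its neighbor.
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

# Toy minority sample (assumed data, not from the paper).
X_min = np.random.default_rng(1).normal(size=(25, 2))
X_new = smote_like_samples(X_min, n_new=100)
print(X_new.shape)
```

Because every new point is a convex combination of two existing points, the generated samples cannot leave the region already spanned by the minority class.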
The sampling technique proposed in this paper relies on approximating the underlying density of the minority class based on existing samples. Probability density estimation techniques can be categorized into two groups: parametric and nonparametric. In parametric methods, a density function is assumed and its parameters are then estimated by maximizing the likelihood of obtaining the current sample. This approach introduces a specification bias and is susceptible to overfitting. Nonparametric approaches estimate the density directly from the data. Among the nonparametric methods, kernel density estimation (KDE) is the most popular approach in the current literature [25, 26]. It is a well-established technique within both the statistical and machine learning communities [2, 11]. KDE has been successfully used in a wide array of applications including breast cancer data analysis, image annotation, wind power forecasting, and forest density estimation.
A KDE-based sampling approach was used in an earlier study, where the authors applied a two-step procedure by first oversampling the minority samples using KDE and then constructing a radial basis function classifier. Numerical experiments on 6 datasets showed that their method can perform better than comparable techniques. Our paper differs from that study in that we perform a more systematic investigation of the KDE method. We delve deeper to analyze the difference between KDE and other sampling techniques. We also carry out a large number of numerical experiments to compare the performance of KDE to other standard sampling methods.
3. KDE Sampling
Nonparametric density estimation is an important tool in statistical data analysis. It is used to model the distribution of a variable based on a random sample. The resulting density function can be utilized to investigate various properties of the variable. Let $x_1, x_2, \ldots, x_n$ be an i.i.d. sample drawn from an unknown probability density function $f$. Then the kernel density estimate of $f$ is given by

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right),$$

where $K$ is the kernel function, $h > 0$ is the bandwidth parameter and $K_h(t) = \frac{1}{h}K\left(\frac{t}{h}\right)$. Intuitively, the true value of $f(x)$ is estimated as the average distance of $x$ from the sample data points $x_i$. The 'distance' between $x$ and $x_i$ is calculated via the kernel function $K$. A number of kernel functions can be used for this purpose, including Epanechnikov, exponential, tophat, linear and cosine. However, the most popular kernel function is the Gaussian function, i.e.

$$K(t) = \phi(t),$$

where $\phi$ is the standard normal density function. The bandwidth parameter $h$ controls the smoothness of the density estimate as well as the tradeoff between bias and variance. A large value of $h$ results in a very smooth (i.e. low variance) but high-bias density estimate. A small value of $h$ leads to an unsmooth (high variance) but low-bias density estimate. The value of $h$ has a much bigger effect on the KDE estimate than the choice of kernel. The value of $h$ can be determined by optimizing the mean integrated square error:

$$\mathrm{MISE}(h) = \mathbb{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right].$$
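In a simulation where the true density is known, the bias-variance tradeoff and the MISE can be approximated directly. The following sketch, which is not part of the original study, assumes standard normal data, a Gaussian kernel, and a Riemann-sum approximation of the integral on the interval from -5 to 5; the function names `phi`, `kde` and `estimate_mise` are hypothetical.

```python
import math
import random

def phi(t):
    """Standard normal density, used both as kernel and as the true density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(phi((x - xi) / h) for xi in sample) / (len(sample) * h)

def estimate_mise(h, n=50, repeats=20, seed=0):
    """Monte Carlo estimate of MISE(h) for samples of size n from N(0, 1)."""
    rng = random.Random(seed)
    step = 0.1
    grid = [i * step for i in range(-50, 51)]   # integration grid on [-5, 5]
    total = 0.0
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Integrated squared error of this sample's estimate (Riemann sum).
        total += sum((kde(x, sample, h) - phi(x)) ** 2 for x in grid) * step
    return total / repeats

for h in (0.1, 0.4, 3.0):
    print(f"h = {h}: estimated MISE = {estimate_mise(h):.4f}")
```

Under these assumptions, a very small bandwidth is penalized through variance and a very large one through bias, so an intermediate value gives the smallest estimated error.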
The MISE formula cannot be used directly since it involves the unknown density function $f$. Therefore, a number of other methods have been developed to determine the optimal value of $h$. The two most frequently used approaches to select the bandwidth value are rule-of-thumb methods and cross-validation. The rule-of-thumb methods approximate the optimal value of $h$ under certain assumptions about the underlying density function $f$ and its estimate $\hat{f}_h$. A common approach is to use Scott's rule of thumb for the value of $h$:

$$h \approx 1.06\,\hat{\sigma}\,n^{-1/5},$$

where $\hat{\sigma}$ is the sample standard deviation. The optimal bandwidth value can also be determined numerically through cross-validation. This is done by applying a grid search to find the value of $h$ that minimizes a sample-based estimate of the mean integrated square error.
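Kernel density estimation for multivariate variables follows essentially the same approach as the one-dimensional case described above. Given a sample $\mathbf{x}_1, \ldots, \mathbf{x}_n$ of $d$-dimensional random vectors drawn from a common distribution described by the density function $f$, the kernel density estimate is defined to be

$$\hat{f}_{\mathbf{H}}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} K_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_i),$$

where $\mathbf{H}$ is a $d \times d$ bandwidth matrix. The bandwidth matrix can be chosen in a variety of ways. In this study, we use the multivariate version of Scott's rule:

$$\mathbf{H} = n^{-2/(d+4)}\,\hat{\Sigma},$$

where $\hat{\Sigma}$ is the data covariance matrix. Furthermore, we use the multivariate normal distribution as the kernel function:

$$K_{\mathbf{H}}(\mathbf{x}) = (2\pi)^{-d/2}\,|\mathbf{H}|^{-1/2}\exp\left(-\tfrac{1}{2}\,\mathbf{x}^{\top}\mathbf{H}^{-1}\mathbf{x}\right).$$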
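As a rough check of the multivariate estimator, the sketch below evaluates a Gaussian KDE by hand and compares it with `scipy.stats.gaussian_kde`, whose default `bw_method='scott'` uses the factor $n^{-1/(d+4)}$ and a kernel covariance equal to the squared factor times the data covariance. The toy data and the helper `kde_manual` are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Toy sample of n = 200 points in d = 2 dimensions (assumed data).
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.6], [0.6, 1.0]], size=200)
n, d = X.shape

kde = gaussian_kde(X.T)                   # scipy expects shape (d, n)

scott_factor = n ** (-1.0 / (d + 4))
H = scott_factor**2 * np.cov(X.T)         # kernel covariance (bandwidth matrix)

def kde_manual(x, X, H):
    """Average of Gaussian kernels with covariance H centered at the rows of X."""
    n, d = X.shape
    diff = x - X
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(H), diff)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(H))
    return np.exp(-0.5 * quad).sum() / (n * norm)

x0 = np.array([0.5, -0.25])
print(kde_manual(x0, X, H), kde(x0)[0])   # the two values should agree
```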
We illustrate the difference between KDE sampling and other standard sampling methods in Figure 1. The original data in the figure consists of 100 uniformly distributed blue points, with the points within a radius of 2 from the center removed. The 25 orange points are generated in the center of the figure via a Gaussian distribution with a standard deviation of 2. As can be seen from the figure, KDE creates new sample points by 'spraying' around the existing minority class points. The points are created using a Gaussian distribution centered at randomly chosen existing minority class points. This process seems more intuitive than other sampling methods. On the other hand, the SMOTE method creates new sample points by interpolating between the existing minority class points. As a result, all SMOTE-generated points lie in the convex hull of the original minority class samples. Therefore, the new sampled data does not represent the true underlying population distribution well. Random sampling with replacement (ROS) creates new points by simply resampling the existing minority class points. As a result, the new sampled data differs little from the original data, albeit being denser at each sample location. The ADASYN plot resembles the SMOTE plot, but ADASYN creates a larger number of points at the edge of the minority cluster. NearMiss undersamples from the majority class, thereby losing a lot of information, as can be seen from its plot.
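The 'spraying' behaviour follows from the fact that a Gaussian KDE is an equal-weight mixture of Gaussians centered at the sample points, so drawing from it amounts to picking an existing point at random and adding Gaussian noise. A minimal sketch, assuming a Scott-type bandwidth matrix and toy data (the function name `kde_spray` is hypothetical):

```python
import numpy as np

def kde_spray(X_min, n_new, seed=0):
    """Draw n_new points from a Gaussian KDE fitted to the rows of X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    # Assumed bandwidth matrix: squared Scott factor times the data covariance.
    H = n ** (-2.0 / (d + 4)) * np.atleast_2d(np.cov(X_min, rowvar=False))
    centers = X_min[rng.integers(0, n, size=n_new)]          # pick existing points
    noise = rng.multivariate_normal(np.zeros(d), H, size=n_new)
    return centers + noise                                   # perturb them

# 25 minority points with standard deviation 2, as in the illustration.
X_min = np.random.default_rng(1).normal(scale=2.0, size=(25, 2))
X_new = kde_spray(X_min, n_new=200)

# Unlike interpolation, some sprayed points fall outside the bounding box
# (and hence outside the convex hull) of the original minority samples.
outside = ((X_new < X_min.min(axis=0)) | (X_new > X_min.max(axis=0))).any(axis=1)
print(f"{outside.sum()} of {len(X_new)} new points lie outside the bounding box")
```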
4. Numerical Experiments
In this section, we carry out a number of experiments to evaluate the performance of the KDE sampling method. To this end, we compare KDE to 4 standard sampling approaches used in the literature: Random Oversampling (ROS), SMOTE, ADASYN, and NearMiss. The implementation of all 4 sampling approaches is taken from the imblearn Python library with their default settings. The implementation of KDE is taken from the scipy.stats Python library with its default settings. In particular, we used the multivariate Gaussian KDE with its default bandwidth value determined by Scott's rule (see Equations 3 - 5). Note that the performance of the KDE method can be further optimized by choosing the bandwidth value via cross-validation.
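A KDE oversampler along these lines might look as follows. This is a sketch, not the code used in the experiments: the helper `kde_oversample` is hypothetical, and the choice to oversample until both classes have the same size is an assumption. It relies on `scipy.stats.gaussian_kde` with its default settings and on its `resample` method.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_oversample(X, y, minority_label=1, seed=0):
    """Append KDE-generated minority samples until the two classes are balanced."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y                                   # already balanced
    kde = gaussian_kde(X_min.T)                       # default bandwidth: Scott's rule
    X_new = kde.resample(size=n_new, seed=seed).T     # resample returns shape (d, size)
    X_res = np.vstack([X, X_new])
    y_res = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_res, y_res

# Toy data with a 9:1 imbalance (assumed, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(-5, 5, size=(90, 2)), rng.normal(0, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = kde_oversample(X, y)
print(np.bincount(y_res))
```

Only the training portion of a dataset should be resampled in this way, so that the test set keeps the original class distribution.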
The usual measures of classifier performance, such as the accuracy rate, are not suitable in the context of imbalanced datasets as the results can be misleading. For instance, given a dataset with 90% of instances labeled negative, we can achieve a 90% accuracy rate by simply guessing that all the instances are negative. Ideally, we would like a metric that measures classifier performance on both classes. To address this issue, authors often use the area under the ROC curve (AUC). AUC reflects classifier performance based on true positive and false positive rates, and it is not sensitive to class imbalance. However, AUC requires probabilities of the predicted labels, which are not available in certain algorithms such as KNN and SVM. Therefore, as an alternative to AUC we also use the G-mean,

$$G\text{-mean} = \sqrt{\mathrm{TPR} \cdot \mathrm{TNR}},$$

where TPR and TNR denote the true positive rate and the true negative rate, to measure classifier performance.
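The contrast between accuracy and the G-mean can be checked on a small worked example. The sketch assumes the usual definition of the G-mean as the geometric mean of the true positive rate and the true negative rate; the helper `g_mean` and the toy labels are hypothetical.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and the true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

# 90 negative and 10 positive instances.
y_true = [0] * 90 + [1] * 10

# A trivial classifier that labels everything negative.
all_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
print(accuracy, g_mean(y_true, all_negative))   # accuracy 0.9, G-mean 0.0

# A classifier with 9 false positives that finds 8 of the 10 positives.
informed = [1] * 9 + [0] * 81 + [1] * 8 + [0] * 2
print(g_mean(y_true, informed))                 # sqrt(0.9 * 0.8)
```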
4.1. Simulated Data
We begin by considering a situation similar to the one described in Figure 1. We use a dataset of size 1000 where the majority class points are uniformly distributed over a square grid, with the points within a radius of 2 from the center removed from the set. The minority class points are simulated using a Gaussian distribution centered at the center of the grid with a standard deviation of 2. We measure the performance of the sampling methods under different class imbalance ratios. As the base classifier we use a feedforward neural network with one hidden layer. The results of the experiment are presented in Figure 2. We can see that KDE sampling outperforms the other methods as measured by G-mean and F1-score. Moreover, KDE holds the edge under different class imbalance ratios. In terms of AUC, KDE is the best at the 80% imbalance ratio and the second best at two other imbalance ratios.
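A dataset of this kind can be generated along the following lines. The side length of the square, the minority fraction, and the decision to keep the total size fixed after removing the central disc are all assumptions made for illustration; the function name `make_ring_dataset` is hypothetical.

```python
import numpy as np

def make_ring_dataset(n_total=1000, minority_frac=0.1, half_width=6.0,
                      hole_radius=2.0, minority_std=2.0, seed=0):
    """Uniform majority class with a central hole, Gaussian minority class."""
    rng = np.random.default_rng(seed)
    n_min = int(round(n_total * minority_frac))
    n_maj = n_total - n_min

    # Rejection sampling: keep uniform points that fall outside the hole.
    majority = np.empty((0, 2))
    while len(majority) < n_maj:
        cand = rng.uniform(-half_width, half_width, size=(n_maj, 2))
        cand = cand[np.linalg.norm(cand, axis=1) > hole_radius]
        majority = np.vstack([majority, cand])
    majority = majority[:n_maj]

    # Minority points are Gaussian around the center of the square.
    minority = rng.normal(0.0, minority_std, size=(n_min, 2))

    X = np.vstack([majority, minority])
    y = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])
    return X, y

X, y = make_ring_dataset()
print(X.shape, np.bincount(y))
```

Varying `minority_frac` gives the different class imbalance ratios under which the sampling methods can be compared.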
Next, we consider a nearly (linearly) separable dataset as described in Figure 3. There are 500 majority class samples and 100 minority class samples, both uniformly distributed. The new data generated via various sampling techniques is illustrated in Figure 3. As can be seen, the new KDE minority samples are spread across a larger region. On the other hand, the samples generated by ROS, SMOTE, and ADASYN are more concentrated, which makes them more prone to overfitting.
A feedforward neural network is trained on each resampled dataset. The AUC results are given in Table 1. As can be seen from the table, KDE significantly outperforms the other sampling techniques.
Our last illustration is in 3-dimensional space as shown in Figure 4. The majority class samples consist of 500 points uniformly distributed over the cube, with the ball of radius 1.5 removed from the center of the set. The minority class samples consist of 100 points generated according to a Gaussian distribution. As can be seen from Figure 4, the KDE resampled data appears to be more diffuse, whereas the data generated by ROS, SMOTE, and ADASYN is more concentrated.
A feedforward neural network is trained on each resampled dataset and the results are presented in Table 2. As can be observed from the table, KDE achieves the best results in AUC and F1-score, and it is second best in terms of G-mean.
4.2. Real Life Data
In order to achieve a reasonably comprehensive evaluation of our method we used a range of datasets and classifiers. In particular, we used 12 real life datasets with various class imbalance ratios ranging from 1.86:1 to 42:1 (Table 3).

| # | Name | Repository & Target | Ratio | Samples | Features |
|---|------|---------------------|-------|---------|----------|
| 1 | diabetes | UCI, target: 1 | 1.86:1 | 768 | 8 |
| 2 | bank | UCI, target: yes | 7.6:1 | 43,193 | 24 |
| 3 | ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
| 4 | satimage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
| 5 | abalone | UCI, target: 7 | 9.7:1 | 4,177 | 10 |
| 6 | spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
| 7 | yeast_ml8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
| 8 | scene | LIBSVM, target: one label | 13:1 | 2,407 | 294 |
| 9 | libras_move | UCI, target: 1 | 14:1 | 360 | 90 |
| 10 | wine_quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
| 11 | letter_img | UCI, target: Z | 26:1 | 20,000 | 16 |
| 12 | yeast_me2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
| 13 | ozone_level | UCI, ozone, data | 34:1 | 2,536 | 72 |
| 14 | mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
During the experiments the data was split into training and testing parts, and the results were calculated on the testing part. Furthermore, each experiment was run twice using different training/testing splits. The average of the results of the two runs is reported in the paper. The results for each classifier are summarized in 3 separate tables below. When using the KNN algorithm, the KDE sampling method often yields significantly better results compared to other sampling methods (see Table 4). For instance, when used on the ecoli dataset the KDE method produces a G-mean of 0.753, which is 5% better than the second best method (SMOTE), and an F1-score of 0.691, which is 6% better than the second best method. Note that the KDE method performs well on datasets with both low and high imbalance ratios.
Using SVM to compare the sampling methods produces results that are similar to KNN. As can be seen from Table 5, KDE often yields significantly better results than other sampling methods. For instance, when used on the spectrometer dataset the KDE method produces a G-mean of 0.924, which is 12% better than the second best method (SMOTE), and an F1-score of 0.878, which is 14% better than the second best method. Note again that the KDE method performs well on datasets with both low and high imbalance ratios.
Using the NN classifier does not produce results as strong as those obtained with KNN and SVM. Although there are still instances (ecoli, mammography) where KDE outperforms other sampling methods, its advantage is not overwhelming (see Table 6). This may be the result of the particular network architecture used in the experiment: a single hidden layer with 32 fully connected nodes. It is possible that other architectures may produce better results for KDE sampling.
In this paper, we studied an oversampling technique based on KDE. We believe that KDE provides a natural and statistically sound approach to generating new minority samples in an imbalanced dataset. One of the main advantages of the KDE technique is its flexibility. By choosing different kernel functions, researchers can customize the sampling process. Additional flexibility is offered through the selection of the kernel bandwidth. KDE is a well-researched topic with a well-established statistical foundation. In addition, a variety of implementations of the KDE algorithm are available in Python, R, Julia and other programming languages. This makes KDE a very appealing tool to use in oversampling. In fact, KDE can similarly be used in undersampling.
We carried out a comprehensive study of the KDE sampling approach based on simulated and real life data. In particular, we used 3 simulated and 12 real life datasets that were tested on 3 different base classifiers. The results show that KDE can outperform other standard sampling methods. Based on the above analysis, we conclude that KDE should be considered a potent tool in dealing with the problem of imbalanced class distribution.
- [1] Abdi, L., and Hashemi, S. (2016). To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering, 28(1), 238-251.
- [2] Botev, Z. I., Grotowski, J. F., and Kroese, D. P. (2010). Kernel density estimation via diffusion. The Annals of Statistics, 38(5), 2916-2957.
- [3] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
- [4] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
- [5] Fernández, A., Garcia, S., Herrera, F., and Chawla, N. V. (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863-905.
- [6] Gao, M., Hong, X., Chen, S., Harris, C. J., and Khalaf, E. (2014). PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing, 138, 248-259.
- [7] Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
- [8] He, H., and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9).
- [9] Jeon, J., and Taylor, J. W. (2012). Using conditional kernel density estimation for wind power density forecasting. Journal of the American Statistical Association, 107(497), 66-79.
- [10] Jones E, Oliphant E, Peterson P, et al. SciPy: Open Source Scientific Tools for Python, 2001-, http://www.scipy.org/ [Online; accessed 2019-05-05].
- [11] Kim, J., and Scott, C. D. (2012). Robust kernel density estimation. Journal of Machine Learning Research, 13(Sep), 2529-2565.
- [12] Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221-232.
- [13] Lehmann, E. L. (2012). Model specification: the views of Fisher and Neyman, and later developments. In Selected Works of E. L. Lehmann (pp. 955-963). Springer, Boston, MA.
- [14] Lemaitre, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1), 559-563.
- [15] Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J., and Wasserman, L. (2011). Forest density estimation. Journal of Machine Learning Research, 12(Mar), 907-951.
- [16] Liu, X. Y., Wu, J., and Zhou, Z. H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
- [17] Maimon, O., and Rokach, L. (Eds.). (2005). Data mining and knowledge discovery handbook.
- [18] Maldonado, S., Weber, R., and Famili, F. (2014). Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines. Information Sciences, 286, 228-246.
- [19] Mani, I., and Zhang, I. (2003, August). kNN approach to unbalanced data distributions: a case study involving information extraction. In Proceedings of Workshop on Learning from Imbalanced Datasets (Vol. 126).
- [20] Moayedikia, A., Ong, K. L., Boo, Y. L., Yeoh, W. G., and Jensen, R. (2017). Feature selection for high dimensional imbalanced class data using harmony search. Engineering Applications of Artificial Intelligence, 57, 38-49.
- [21] Nguyen, H. M., Cooper, E. W., and Kamei, K. (2009, November). Borderline over-sampling for imbalanced data classification. In Proceedings: Fifth International Workshop on Computational Intelligence Applications (Vol. 2009, No. 1, pp. 24-29). IEEE SMC Hiroshima Chapter.
- [22] Raskutti, B., and Kowalczyk, A. (2004). Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter, 6(1), 60-69.
- [23] Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
- [24] Sheikhpour, R., Sarram, M. A., and Sheikhpour, R. (2016). Particle swarm optimization for bandwidth determination and feature selection of kernel density estimation based classifiers in diagnosis of breast cancer. Applied Soft Computing, 40, 113-131.
- [25] Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.
- [26] Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York.
- [27] Yavlinsky, A., Schofield, E., and Rüger, S. (2005, July). Automated image annotation using global features and robust nonparametric density estimation. In International Conference on Image and Video Retrieval (pp. 507-517). Springer, Berlin, Heidelberg.