Imbalanced datasets pose a problem in machine learning classification tasks and are present in a multitude of real-world industry datasets. These datasets are characterized as having class priors that are vastly different from one another, and that are skewed towards a majority class or classes (e.g. see Liu and Ghosh). In the case of a dataset with binary classes, an imbalance would cause the minority class to have a significantly smaller prior than the majority class, whereas a balanced dataset would have similar class priors.
Imbalanced datasets occur in a variety of industries such as retail banking, insurance, and telecommunications, with applications in fraud detection, customer acquisition, and so on. In these cases, it is most critical to correctly classify the minority class. False positives also carry business risk and can have harsh consequences; for example, incorrectly labelling a credit card transaction as fraudulent unnecessarily penalizes the customer for that transaction.
The challenge with imbalanced datasets arises from classifier bias towards majority class predictions, given that the classifier's objective function does not consider class differences. The canonical methods of addressing class imbalance include sampling techniques, changes to the cost function, and algorithm-level methods.
In SMOTE (Chawla et al.), the minority class is over-sampled by creating synthetic examples in the same feature space as the data. The synthetic examples lie on the line segments that join each minority sample to its K nearest minority class neighbours. This technique forces the decision region of the minority class to be more general. Altering the class distribution of a dataset does have downsides, however: under-sampling the majority class may discard useful data, and oversampling the minority class can lead to overfitting.
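The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference SMOTE implementation: the function name `smote_sketch`, the brute-force neighbour search, and the toy data are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Interpolate between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class (brute force).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Cycle through the minority samples so each seeds a similar number of points.
    base = np.arange(n_synthetic) % n
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))  # position along the joining segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square (illustrative only).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_minority, n_synthetic=6, k=2)
```

Every synthetic point is a convex combination of two real minority samples, which is why the synthetic data never leave the region already spanned by the minority class.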
With ADASYN (He et al.), examples of the minority class are generated adaptively, according to the distribution of the minority samples. More synthetic data is generated for minority samples that are difficult to learn than for those that are easier to learn. The algorithm uses a density distribution to determine the number of synthetic examples that need to be generated for each minority sample. This is in contrast to SMOTE, where an equal amount of synthetic data is generated for each minority sample.
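To make the adaptive allocation concrete, the sketch below computes how many synthetic points each minority sample would receive. It assumes, following the usual description of ADASYN, that difficulty is measured by the share of majority-class points among the k nearest neighbours; the function name `adasyn_allocation` and the toy data are hypothetical.

```python
import numpy as np

def adasyn_allocation(X, y, n_synthetic, k=5, minority_label=1):
    """Number of synthetic points to generate per minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    # Distances from each minority sample to every sample in the dataset.
    dists = np.linalg.norm(X[min_idx, None, :] - X[None, :, :], axis=2)
    dists[np.arange(len(min_idx)), min_idx] = np.inf  # exclude the sample itself
    nn = np.argsort(dists, axis=1)[:, :k]
    # r_i: share of majority-class points among the k nearest neighbours.
    r = (y[nn] != minority_label).mean(axis=1)
    if r.sum() == 0:
        density = np.full(len(min_idx), 1.0 / len(min_idx))
    else:
        density = r / r.sum()  # normalised density distribution
    return np.rint(density * n_synthetic).astype(int)

# One minority point sits inside the majority cluster; three sit far away.
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [0.25], [5.0], [5.1], [5.2]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
alloc = adasyn_allocation(X_toy, y_toy, n_synthetic=10, k=2)
```

In this toy example the whole budget goes to the minority point surrounded by majority samples, while the well-separated minority points receive none.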
In contrast to SMOTE and ADASYN, cost-sensitive learning techniques do not modify the imbalanced data distribution directly. Instead, the problem is targeted by using different cost matrices to describe the cost of misclassifying data samples as false negatives or false positives. These techniques can also handle learning when error costs are unequal. When misclassification costs are known, they can be incorporated directly into the cost function. McCarthy et al. showed that cost-sensitive learning and oversampling perform similarly, with no definitive winner between cost-sensitive learning, undersampling, and oversampling.
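As a small illustration of how known misclassification costs can enter a decision rule, the sketch below derives the probability threshold that minimizes expected cost. It assumes correct predictions cost nothing and that the classifier outputs calibrated probabilities; the function name and the cost values are illustrative and do not come from the experiments in this paper.

```python
import numpy as np

def cost_sensitive_predict(p_pos, cost_fp, cost_fn):
    """Predict positive whenever that choice has the lower expected cost.

    Expected cost of predicting positive: (1 - p) * cost_fp
    Expected cost of predicting negative: p * cost_fn
    so positive is preferred when p >= cost_fp / (cost_fp + cost_fn).
    """
    p_pos = np.asarray(p_pos, dtype=float)
    threshold = cost_fp / (cost_fp + cost_fn)
    return (p_pos >= threshold).astype(int), threshold

# Hypothetical costs: a missed positive is nine times worse than a false alarm.
preds, threshold = cost_sensitive_predict([0.05, 0.2, 0.6], cost_fp=1.0, cost_fn=9.0)
```

With these costs the threshold drops from 0.5 to 0.1, so a sample with a predicted probability of only 0.2 is already labelled positive.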
In contrast, generative approaches have shown promise by outperforming traditional sampling and cost-sensitive techniques. Liu et al. generated synthetic data points from the minority class by first learning the probability distribution of the minority class and subsequently adding samples from it to a resampled set until the desired proportion between minority and majority classes was reached. They generated artificial documents by sampling from the learned multinomial distribution of the minority class, with the objective of applying these documents for word prediction.
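A minimal sketch of this kind of generative oversampling for bag-of-words data is given below. The additive smoothing term `alpha`, the choice of document length, and the function name are assumptions made for the example and are not taken from the cited work.

```python
import numpy as np

def multinomial_oversample(X_min_counts, n_majority, doc_length=None, alpha=1.0, seed=0):
    """Fit a multinomial to minority word counts and sample until classes balance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_min_counts)
    # Smoothed word probabilities learned from the minority class.
    word_totals = X.sum(axis=0) + alpha
    probs = word_totals / word_totals.sum()
    if doc_length is None:
        doc_length = int(round(X.sum(axis=1).mean()))  # average minority document length
    n_needed = max(n_majority - X.shape[0], 0)
    synthetic = rng.multinomial(doc_length, probs, size=n_needed)
    return np.vstack([X, synthetic])

# Two minority documents over a three-word vocabulary (toy data).
X_min = np.array([[2, 0, 1], [1, 1, 1]])
X_balanced = multinomial_oversample(X_min, n_majority=5)
```

The returned matrix keeps the original minority documents in its first rows and appends sampled documents until the minority class matches the requested majority count.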
In this paper, we similarly focus on generative methods for oversampling and introduce a new generative modelling approach using Variational Autoencoders (VAE) to oversample the minority class in an imbalanced dataset, with a focus on binary target variables. However, the approach can be used easily in multi-class situations. We also extend our approach to image datasets, and allow our architecture to work with convolutional neural networks (CNN). Furthermore, to the best of our knowledge, the research conducted in this paper is the first of its kind to apply variational inference to oversample minority classes when dealing with imbalanced datasets.
The remainder of the paper is organized as follows. In the next section, we briefly review the Variational Autoencoder. In Section 3, we introduce the new generative model for computing synthetic observations of the minority class. The results of an application of the new method to a large real-world dataset are discussed in Section 4, where we show that the new method outperforms SMOTE with respect to a downstream binary classification task. Finally, in Section 5 we conclude with some closing remarks on the new method and present potential avenues for future research.
II Variational Autoencoders
Variational methods are employed in situations where the computation of complex integrals is not feasible (i.e., due to either mathematical intractability or extreme computational complexity). The essential idea in variational methods is to approximate the integrand with a function that is simpler to integrate, and to allow the algorithm to improve based on some a priori distributions. Variational Autoencoders (VAEs) were first introduced by Kingma and Welling. With VAEs, we are able to perform efficient approximate inference when learning probabilistic models whose (continuous) latent variables have intractable posterior distributions. Moreover, the objective function for VAEs is formed by obtaining a lower bound to the log marginal likelihood of the data, which is typical when learning latent variable models with variational inference. This function is called the evidence lower bound (ELBO), and is given by

$$\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right),$$

where $z$ is the latent variable, $q_\phi(z \mid x)$ is the variational distribution, and $\mathrm{KL}(q \,\|\, p)$ denotes the Kullback-Leibler divergence between two distributions $q$ and $p$. We get an unbiased estimate of the ELBO by sampling $z \sim q_\phi(z \mid x)$, and we perform stochastic gradient ascent to optimize this estimate with respect to $\theta$ and $\phi$. It should be noted that, in order to utilize backpropagation, a reparametrization trick is applied in order to sample $z$. That is, we sample random noise $\epsilon \sim p(\epsilon)$ and obtain $z = g_\phi(\epsilon, x)$, where $g_\phi$ is a continuous function that is differentiable with respect to $\phi$ and $x$. Finally, the VAE is considered a generative model, since it learns the conditional distribution $p_\theta(x \mid z)$. In other words, to sample from the model, one first randomly samples $z \sim p(z)$ and then samples an observation of $x$ from $p_\theta(x \mid z)$.
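The reparametrization trick and the divergence term of the ELBO can be illustrated with a short NumPy sketch. It assumes a diagonal Gaussian variational distribution and a standard normal prior, in which case the Kullback-Leibler term has a closed form; the function names are illustrative, and the encoder outputs are replaced by fixed arrays.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is differentiable in mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0], [0.0, 0.0]])  # stand-in for encoder means
log_var = np.zeros_like(mu)              # stand-in for encoder log-variances
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `log_var` when the same computation is written in an automatic differentiation framework.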
III VOS: Variational Oversampling
A VAE comprises two neural networks: one that learns the variational distribution $q_\phi(z \mid x)$ (the encoder), and another that learns the conditional distribution $p_\theta(x \mid z)$ (the decoder). Extensions of VAEs include those that consider several layers of latent variables, each layer requiring two neural networks, one for encoding and the other for decoding, as described above.
The new approach is simple, and requires only two stages of latent structure: the first latent variable, $z_1$, encodes a pattern $x$, whereas the second encoding, $z_2$, can be seen as summarizing both the information of $z_1$ and the target label $y$. This approach was inspired by Louizos, Swersky, Li, Welling, and Zemel, where the authors considered a two-stage latent structure to extract the features from a dataset while removing the undesirable effect of sensitive features. We refer to this new oversampling method as VOS, which stands for Variational Oversampling.
The modified ELBO for the new VOS is derived in a manner similar to that for the supervised case of the Variational Fair Autoencoder (VFAE) of Louizos et al. First, note that
It then follows, after an application of Jensen’s inequality that:
Next, we assume that , which then leads to the fact that
The assumed parametric forms of the involved distributions are as follows:
Note that the distribution of the pattern $x$ given the first encoding $z_1$, $p_\theta(x \mid z_1)$, is an appropriate distribution whose parameters are denoted by $\theta$. For continuous variables, we assume a Gaussian distribution; whereas for binary variables, we assume a Bernoulli distribution, whose parameter represents the probability that the random variable takes on the value of 1. It is also worth mentioning that all of the Gaussians above are assumed to have covariance structures whose off-diagonal elements are all zero (i.e., the dimensions of the latent representations are normally distributed and independent of one another).
IV Variational Methods for Image Data
In Wong et al., the authors defined a normalized random displacement field, such that each pixel in an image would be displaced by the corresponding vector. The displacement was governed by two parameters, $\alpha$ and $\sigma$, which controlled the strength and smoothness of the displacement, respectively. Testing their method on the MNIST dataset, however, they found that large displacements would result in images that no longer corresponded to the desired label.
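A displacement-field warp of this kind can be sketched with SciPy. This is an illustrative reconstruction under stated assumptions, not the cited authors' code: the field is smoothed with a Gaussian filter of width `sigma`, normalized so that its largest vector has unit length, and scaled by a strength parameter `alpha` (in pixels); the function name and the toy image are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_warp(image, alpha, sigma, seed=0):
    """Warp a 2-D image with a smoothed, normalized random displacement field."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Random per-pixel displacements, smoothed so neighbouring pixels move together.
    dx = gaussian_filter(rng.uniform(-1.0, 1.0, (h, w)), sigma)
    dy = gaussian_filter(rng.uniform(-1.0, 1.0, (h, w)), sigma)
    scale = np.sqrt(dx**2 + dy**2).max()
    if scale > 0:
        dx, dy = dx / scale, dy / scale  # largest displacement now has length 1
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([rows + alpha * dy, cols + alpha * dx])
    # Bilinear interpolation of the image at the displaced coordinates.
    return map_coordinates(image, coords, order=1, mode="reflect")

# Toy 28 x 28 image containing a vertical bar (a stand-in for a digit).
digit = np.zeros((28, 28))
digit[8:20, 12:16] = 1.0
warped = elastic_warp(digit, alpha=2.0, sigma=4.0)
```

Raising `alpha` moves pixels further, which is the regime in which a warped digit can stop resembling its original label.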
Zhang, Fu, Zang, Sigal, and Agam created synthetic images of building roofs to augment their original dataset, but found a “synthetic gap” between the distributions of the artificially generated images and the real images. The authors tried to train a sparse autoencoder simultaneously with real and synthetic images to minimize the synthetic gap.
Pu et al. employ a VAE-based method for deep deconvolutional learning, where a CNN serves as the encoder (i.e., the recognition model) that approximates the posterior distribution over the latent code of the decoder, which functions as the image generative model.
In this section, we consider two separate imbalanced datasets and apply the new VOS method to oversample the minority class. In order to assess the performance of the oversampling technique, we train a classifier on the balanced dataset and record its performance on an untouched (i.e., unbalanced) test set. The accuracy metrics we use to judge the quality of the oversampling (and of the classifier) are related to receiver operating characteristic (ROC) graphs. Under imbalanced conditions, traditional overall accuracy does not provide a comprehensive view of the learning algorithm's performance. In particular, the metrics used to analyze the three conditions were F1-score, precision, and recall.
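For reference, the three metrics can be computed from confusion counts as in the sketch below. The helper name and the toy labels are illustrative and are unrelated to the results reported in this paper.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (minority) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: three positives among eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

In this toy case overall accuracy is 0.75 even though one third of the positives are missed, which illustrates why accuracy alone is a poor summary under class imbalance.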
It also bears mentioning that in both of our examples, we performed $k$-fold cross-validation to determine the number of hidden units in the hidden layers of the generative VAE. In particular, if we denote by $L_i(a)$ the final loss on the $i$-th held-out set when using architecture $a$, then the optimal architecture is given by

$$a^* = \arg\min_{a} \frac{1}{k} \sum_{i=1}^{k} L_i(a).$$
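The cross-validated selection of an architecture can be sketched as follows, assuming the criterion is the average held-out loss across folds. The architecture names and loss values are made up purely for illustration and are not results from this paper.

```python
import numpy as np

def select_architecture(losses_by_arch):
    """Pick the architecture with the lowest mean held-out loss across folds."""
    mean_loss = {arch: float(np.mean(losses)) for arch, losses in losses_by_arch.items()}
    best = min(mean_loss, key=mean_loss.get)
    return best, mean_loss

# Hypothetical final losses on each of three held-out folds.
fold_losses = {
    "arch_small": [1.30, 1.28, 1.35],
    "arch_medium": [1.10, 1.15, 1.12],
    "arch_large": [1.14, 1.20, 1.18],
}
best_arch, mean_losses = select_architecture(fold_losses)
```

Averaging over folds rather than taking the single best fold keeps the choice from being driven by one favourable split.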
We use the scikit-learn implementation of logistic regression and set the accompanying parameters to their defaults, except for the inverse of regularization strength, which was set to 10. For SMOTE as well as ADASYN, we used the imblearn implementation with its default parameters. Our experiments were run on an AWS g2.8xlarge instance with four NVIDIA GRID GPUs (each with 1536 CUDA cores), 32 vCPUs, 60 GiB of memory, and 240 GB of SSD storage. Our implementation of VAEs is based on TensorFlow.
V-A Dataset 1: Credit Card Fraud Detection
The credit card fraud detection dataset (e.g., see Dal Pozzolo et al.) contains the transactions carried out by European cardholders over a two-day period in September 2013. Fraudulent transactions (i.e., the positive class) accounted for only 0.171% of the total 341,762 transactions, and so the dataset is highly imbalanced. We randomly split the set of transactions into a training set of 284,807 observations (492 of which were fraudulent) and a test set of the remaining 56,955 observations (91 of which were fraudulent).
For confidentiality purposes, the authors of the dataset were not able to provide the original features. Instead, they applied a PCA transformation to the original data, so the obfuscated features that we used were essentially principal components. The untransformed features that were provided were the time of transaction, the transaction amount, and the class label. In total there were 31 features, and the data contained only numerical variables.
We note that for the cross-validation procedure used to determine the architecture of the generative model, we fix the number of folds and restrict architectures to having a certain symmetric structure. This resulted in an optimal architecture wherein the hidden layers of the encoder and the decoder each consisted of 80 units, while both $z_1$ and $z_2$ were of dimension 20.
For the downstream classification task, we compare the results of three different classification algorithms, namely: logistic regression (LR), random forest (RF), and multi-layer perceptron (MLP). Furthermore, we also compare the accuracy metrics of the downstream task when trained on the balanced datasets produced by SMOTE and ADASYN. We report the accuracy metrics for all pairs of oversampling techniques and classifiers in Table I (note that the predicted column represents the number of transactions predicted as fraudulent).
As evidenced in Table I, oversampling with the VAE-based VOS significantly outperformed SMOTE as well as ADASYN and helped the classifier to achieve outstanding accuracy metrics on the test set. In particular, when using LR and MLP, the precision and F1 scores obtained with VOS were significantly higher than those of the other two oversampling techniques; in addition, the overall accuracy was also higher. It is important to note that the performance of the RF without any oversampling is comparable to that with SMOTE and ADASYN, and is much better than that of RF combined with VOS. The results of this experiment show the potential in applying variational inference for oversampling the minority class. We also note that, in the scikit-learn implementation of both LR and MLP, the sample_weight parameter of the fit method enables one to weight synthetic observations differently from real ones. We set sample_weight to 0.2 for synthetic observations and to 1 for real observations, on the basis that the predictive unit should not learn too much from the generated patterns relative to the real ones. However, in our experiments, changing this value had no significant impact.
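The weighting of synthetic versus real observations can be illustrated for logistic regression as below. This is a sketch on randomly generated stand-in data, not the pipeline used in the experiments: the arrays `X_real` and `X_syn` are hypothetical, and the synthetic rows are drawn from a shifted Gaussian rather than produced by VOS.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in "real" data with a rare positive class.
X_real = rng.normal(size=(200, 5))
y_real = (X_real[:, 0] > 1.0).astype(int)

# Stand-in "synthetic" minority rows (hypothetical; not generated by VOS).
X_syn = rng.normal(loc=[2.0, 0.0, 0.0, 0.0, 0.0], size=(100, 5))
y_syn = np.ones(100, dtype=int)

X_train = np.vstack([X_real, X_syn])
y_train = np.concatenate([y_real, y_syn])

# Real observations get weight 1, synthetic observations get weight 0.2.
weights = np.concatenate([np.ones(len(X_real)), np.full(len(X_syn), 0.2)])

# C is the inverse of the regularization strength, set to 10 as in the text.
clf = LogisticRegression(C=10).fit(X_train, y_train, sample_weight=weights)
train_pred = clf.predict(X_train)
```

Down-weighting the synthetic rows scales their contribution to the loss, so five synthetic observations count roughly as much as one real observation.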
V-B Dataset 2: Tumour Images
The second dataset used was the Breast Cancer Histopathological Image Classification (BreakHis) database, which has 7909 microscope images of breast tumour tissue collected from 82 patients using a range of magnifications (40X, 100X, 200X, and 400X). It contains 2480 benign and 5429 malignant samples. This is an example of a use case where the cost of misclassification is grave. Benign tumours are slow growing and localized, whereas malignant tumours are cancerous and can spread to other parts of the body and cause death. The training set had 4931 malignant and 2241 benign samples, whereas the test set had 498 malignant and 239 benign samples. The same cross-validation procedure mentioned above was used on this image dataset as well.
The PNG images of the breast cancer tumours were initially sized at 64 pixels x 64 pixels x 3 RGB colour channels. We flattened the images by turning each into a vector of dimension 64 x 64 x 3 = 12,288, and then applied a standard scaler across all of the images for normalization. The VOS algorithm was then used to oversample from these flattened vectors. Once oversampled, we reshaped the flattened vectors into their original three-dimensional shapes for passing to the CNN classifier. We used three convolutional layers, each with a kernel size of 3 x 3 pixels, a stride of 1, and 128 filters. We used ReLU activation functions, and applied dropout at each layer with a keep probability of 0.25. Max pooling operations were also used after each convolutional layer, and the last two layers of the network were fully connected with 1024 hidden units.
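The flatten, scale, and reshape steps described above can be sketched with NumPy on stand-in data. The random array below merely imitates a batch of 64 x 64 x 3 images, the per-feature standardization mimics what a standard scaler does, and the oversampling step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in batch of eight RGB images (random pixels, for illustration only).
images = rng.integers(0, 256, size=(8, 64, 64, 3)).astype(np.float64)

# Flatten each image into a vector of length 64 * 64 * 3 = 12288.
flat = images.reshape(len(images), -1)

# Standardize every pixel feature to zero mean and unit variance across images.
mean = flat.mean(axis=0)
std = flat.std(axis=0)
std[std == 0] = 1.0  # guard against constant pixels
scaled = (flat - mean) / std

# Oversampling would operate on `scaled` here; afterwards the vectors are
# reshaped back into image tensors for the CNN classifier.
batch = scaled.reshape(-1, 64, 64, 3)
```

Keeping `mean` and `std` from the training images allows the same transformation to be applied unchanged to the test images.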
We can see from Table II that the new VOS method also helps to improve accuracy when compared to using the CNN without balancing.
In this paper, we introduced a new generative approach for oversampling based on variational inference. In particular, we used a two-stage latent structure VAE to learn a sampling distribution of the original dataset. In order to learn the minority class distribution, the target responses $y$ augment the first encodings $z_1$ to learn the second encodings $z_2$. Our experimental results illustrated the superior performance of the new oversampling method versus SMOTE as well as ADASYN, and demonstrate the promise of this new method for dealing with imbalanced datasets.
With respect to future work, the authors are interested in testing variations of VAEs that lead to lower loss and thus better reconstructions, such as Importance Weighted Autoencoders. Learning richer covariance structures for the assumed Gaussians (i.e., relaxing the assumption of independence of the dimensions of the latent encodings $z_1$ and $z_2$) is also of interest, which we believe could lead to lower reconstruction losses, and thereby more useful synthetic observations of the minority classes.
- Kingma and Welling  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
- Liu et al.  A. Liu, J. Ghosh, and C. E. Martin, “Generative oversampling for mining imbalanced datasets.” in DMIN, 2007, pp. 66–72.
- Maloof  M. A. Maloof, “Learning when data sets are imbalanced and when costs are unequal and unknown,” in ICML-2003 workshop on learning from imbalanced data sets II, vol. 2, 2003, pp. 2–1.
- Chawla et al.  N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
- McCarthy et al.  K. McCarthy, B. Zabar, and G. Weiss, “Does cost-sensitive learning beat sampling for classifying rare classes?” in Proceedings of the 1st international workshop on Utility-based data mining. ACM, 2005, pp. 69–77.
- He et al.  H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning,” in Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008, pp. 1322–1328.
- Elkan  C. Elkan, “The foundations of cost-sensitive learning,” in International joint conference on artificial intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 973–978.
- Liang et al.  D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara, “Variational autoencoders for collaborative filtering,” arXiv preprint arXiv:1802.05814, 2018.
- Louizos et al.  C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, “The variational fair autoencoder,” arXiv preprint arXiv:1511.00830, 2015.
- Wong et al.  S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, “Understanding data augmentation for classification: when to warp?” in Digital Image Computing: Techniques and Applications (DICTA), 2016 International Conference on. IEEE, 2016, pp. 1–6.
- Zhang et al.  X. Zhang, Y. Fu, A. Zang, L. Sigal, and G. Agam, “Learning classifiers from synthetic data using a multichannel autoencoder,” arXiv preprint arXiv:1503.03163, 2015.
- Pu et al.  Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Advances in neural information processing systems, 2016, pp. 2352–2360.
- Dal Pozzolo et al.  A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, “Calibrating probability with undersampling for unbalanced classification,” in Computational Intelligence, 2015 IEEE Symposium Series on. IEEE, 2015, pp. 159–166.
- Fawcett  T. Fawcett, “Roc graphs: Notes and practical considerations for researchers,” Machine learning, vol. 31, no. 1, pp. 1–38, 2004.
- Provost and Fawcett  F. Provost and T. Fawcett, “Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions.”
- Spanhol et al.  F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte, “A dataset for breast cancer histopathological image classification,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 7, pp. 1455–1462, 2016.
- Burda et al.  Y. Burda, R. Grosse, and R. Salakhutdinov, “Importance weighted autoencoders,” arXiv preprint arXiv:1509.00519, 2015.