Most machine learning algorithms assume that the underlying class distribution (i.e. the percentage of examples belonging to each class) is balanced. However, in many real-world applications (e.g. anomaly detection, medical diagnosis, predictive maintenance, driver behavior detection or detection of oil spills), the number of examples from the negative class (majority class) significantly outnumbers the number of examples from the positive class (minority class or class of interest). In such situations, traditional machine learning algorithms tend to be biased towards the majority class. This problem is known as imbalanced learning or learning from imbalanced data.
Related Work. In the literature, many studies have been conducted to address the problem of imbalanced learning. Most of the proposed approaches can be categorized into groups depending on the way they deal with class imbalance. Data level approaches [2, 5, 19, 29, 30] focus on balancing the input data distribution in order to reduce the effect of class imbalance during the learning process. Algorithm level approaches [13, 16, 20, 40, 38] focus on developing or modifying existing algorithms to handle imbalanced datasets by giving more significance to positive examples. Finally, cost-sensitive approaches [12, 36, 28, 39] deal with class imbalance by incorporating different classification costs for each class.
Among these approaches, a group of techniques makes use of ensembles of classifiers. Ensemble learning [9, 35] aims at combining a set of classifiers in order to build a more efficient classifier than each of the individual classifiers alone. This strategy has been shown to be effective in a large number of applications [33, 41]. When dealing with imbalanced data, one of the main advantages of ensemble learning approaches is that they are agnostic to the choice of base learning algorithm. Many ensemble learning based approaches have been proposed to deal with imbalanced datasets, including but not limited to EasyEnsemble, SMOTEBagging, Balanced Random Forest or SMOTEBoost. In the ensemble learning literature, it is well known that controlling the trade-off between accuracy and diversity among classifiers plays a key role while learning a combination of classifiers [3, 22]. Moreover, Díez-Pastor et al. [10] and Yao et al. [37] showed that approaches that control the diversity among classifiers improve the performance of imbalanced classification tasks. With this in mind, our objective is to design an algorithm for imbalanced datasets which explicitly controls this trade-off between accuracy and diversity among classifiers.
Contribution. In this work, we propose an ensemble method that outputs a Diversity-Aware weighted Majority Vote over previously learned base classifiers for Imbalanced datasets (referred to as DAMVI). In order to learn weights over the base classifiers, we minimize an upper bound on the error of the majority vote, the PAC-Bayesian C-Bound [23, 17], which allows us to control the trade-off between accuracy and diversity. Concretely, after learning base classifiers for different bootstrapped samples of input data, the algorithm i) increases the weights of positive examples (minority class) which are “hard” to classify with uniformly weighted base classifiers; and ii) then learns weights over base classifiers by optimizing the C-Bound (with focus on “hard” positive examples). The key benefits of our approach are that it does not make any prior assumption on the underlying data distribution and that it is independent of the base learning algorithm. To show the potential of our algorithm, we empirically evaluate our approach on a predictive maintenance task, credit card fraud detection, webpage classification and medical applications. From our experiments, we show that DAMVI is more “consistent” and “stable” compared to state-of-the-art methods both in terms of F1-measure and Average Precision (AP) when the class distribution is highly imbalanced ( of Imbalance Ratio). This is due to the fact that our method is able to explicitly control the trade-off between accuracy and diversity among classifiers on hard positive examples.
Paper Organization. In the next section, we present the general notations and setting for our algorithm. In Section III, we derive our algorithm DAMVI for imbalanced datasets. In Section IV, we present the experimental results obtained with our approach, before concluding in Section V.
II Notations and Setting
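In this work, we consider a binary classification task where the examples are drawn from a fixed yet unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the $d$-dimensional input space and $\mathcal{Y} = \{-1, +1\}$ the label/output space. Typically, when learning with imbalanced data, the percentage of examples belonging to one class is significantly smaller than that of the other class. In our case, we assume that examples belonging to the positive class are in the minority. A learning algorithm is provided with a training sample of $n$ examples denoted by $S = \{(x_i, y_i)\}_{i=1}^{n}$, which is assumed to be independently and identically distributed (i.i.d.) according to $\mathcal{D}$. We further assume that we have a set $\mathcal{H}$ of classifiers from $\mathcal{X}$ to $\mathcal{Y}$. Given $\mathcal{H}$, our objective is to learn a weight distribution $Q$ over $\mathcal{H}$ that leads to a well-performing weighted majority vote $B_Q$,
$$B_Q(x) = \operatorname{sign}\left[\mathbb{E}_{h \sim Q}\, h(x)\right], \tag{1}$$
such that $B_Q$ has the smallest possible generalization error on $\mathcal{D}$, which is highly imbalanced in terms of class distribution. In other words, our goal is to learn $Q$ over $\mathcal{H}$ such that it minimizes the true risk of $B_Q$:
$$R_{\mathcal{D}}(B_Q) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \mathbb{1}_{[B_Q(x) \neq y]},$$
where $\mathbb{1}_{[p]}$ is equal to $1$ if the predicate $p$ is true, and $0$ otherwise. An important property of the above risk of the $Q$-weighted majority vote is that it is closely related to the Gibbs risk $R_{\mathcal{D}}(G_Q)$, which is defined as the expectation of the individual risks of each classifier that appears in the majority vote. Formally, we can define the Gibbs risk as follows:
$$R_{\mathcal{D}}(G_Q) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \mathbb{E}_{h \sim Q}\, \mathbb{1}_{[h(x) \neq y]}.$$
In fact, if $B_Q$ misclassifies $x$, then at least half of the classifiers (under measure $Q$) make a prediction error on $x$. Therefore, we have
$$R_{\mathcal{D}}(B_Q) \leq 2\, R_{\mathcal{D}}(G_Q).$$
Thus, an upper bound on $R_{\mathcal{D}}(G_Q)$ gives rise to an upper bound on $R_{\mathcal{D}}(B_Q)$. There exist other tighter relations [23, 17, 24], such as the PAC-Bayesian C-Bound, which involves the expected disagreement $d_{\mathcal{D}}(Q)$ between pairs of classifiers and can be expressed as follows (when $R_{\mathcal{D}}(G_Q) < \frac{1}{2}$):
$$R_{\mathcal{D}}(B_Q) \leq 1 - \frac{\left(1 - 2 R_{\mathcal{D}}(G_Q)\right)^2}{1 - 2\, d_{\mathcal{D}}(Q)}, \tag{3}$$
where $d_{\mathcal{D}}(Q) = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \mathbb{E}_{(h, h') \sim Q^2}\, \mathbb{1}_{[h(x) \neq h'(x)]}$.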
We provide the proof of the above C-Bound in Appendix VI-A. The expected disagreement measures the diversity/disagreement among classifiers. It is worth noting that, from the imbalanced data classification standpoint, where the notion of diversity among classifiers is known to be important [10, 37], Equation 3 directly captures the trade-off between the accuracy and the diversity among classifiers. Therefore, in this work, we propose a new algorithm (presented in Section III) for imbalanced learning which directly exploits the PAC-Bayesian C-Bound in order to learn a weighted majority vote classifier. Note that the PAC-Bayesian C-Bound has been shown to be an effective approach to learn a weighted majority vote over a set of classifiers in many applications, e.g. multimedia analysis and multiview learning [18, 1].
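To make the interplay between these quantities concrete, the following minimal sketch computes the empirical Gibbs risk, the expected disagreement and the resulting C-Bound value from a table of {-1, +1} predictions. The function names and the toy data are assumptions made for this illustration and are not taken from the authors' implementation.

```python
def gibbs_risk(preds, labels, q):
    """Empirical Gibbs risk: Q-weighted average error of the individual classifiers.

    preds[k][i] is the {-1, +1} prediction of classifier k on example i.
    """
    n = len(labels)
    return sum(
        q[k] * sum(1 for i in range(n) if preds[k][i] != labels[i]) / n
        for k in range(len(preds))
    )


def disagreement(preds, q):
    """Empirical expected disagreement between two classifiers drawn from Q."""
    n = len(preds[0])
    total = 0.0
    for k in range(len(preds)):
        for l in range(len(preds)):
            differ = sum(1 for i in range(n) if preds[k][i] != preds[l][i]) / n
            total += q[k] * q[l] * differ
    return total


def c_bound(preds, labels, q):
    """C-Bound value 1 - (1 - 2R)^2 / (1 - 2d); only valid when R < 1/2."""
    r = gibbs_risk(preds, labels, q)
    d = disagreement(preds, q)
    if r >= 0.5:
        raise ValueError("the C-Bound requires a Gibbs risk below 1/2")
    return 1.0 - (1.0 - 2.0 * r) ** 2 / (1.0 - 2.0 * d)


# Toy data (hypothetical): three classifiers, four examples, uniform weights.
labels = [+1, +1, -1, -1]
preds = [
    [+1, +1, -1, +1],  # wrong on example 3
    [+1, -1, -1, -1],  # wrong on example 1
    [-1, +1, -1, -1],  # wrong on example 0
]
q = [1 / 3, 1 / 3, 1 / 3]
print(gibbs_risk(preds, labels, q), disagreement(preds, q), c_bound(preds, labels, q))
```

In this toy case every classifier errs on a different example, so the disagreement is high and the bound (0.25) is much tighter than twice the Gibbs risk (0.5).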
III Learning a Majority Vote for Imbalanced Data
Our objective is to learn weights over a set of classifiers that lead to a well-performing weighted majority vote (given by Equation 1) to deal with imbalanced datasets. It has been shown that controlling the trade-off between accuracy and diversity among the set of classifiers plays an important role for imbalanced classification problems [10, 37]. Therefore, we utilize the PAC-Bayesian C-Bound (given by Equation 3), which explicitly controls this trade-off, in order to derive a diversity-aware ensemble learning algorithm (referred to as DAMVI, see Algorithm 1) for binary imbalanced classification tasks.
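Input: Training set $S = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,
Number of classifiers $K$,
Base learning algorithm $\mathcal{A}$.
Initialize: Empty set of classifiers $\mathcal{H} \leftarrow \emptyset$.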
For a given training set $S$ of size $n$, DAMVI (Algorithm 1) trains a set of base classifiers (using a base learning algorithm) on bootstrapped samples of $S$. (Footnote 1: Our algorithm is not limited to base learners learnt using bootstrapped samples. It is applicable to any set of base learners.) Then, we propose to update the weights of those training examples which belong to the minority class (in our case, the positive examples).
In this step, the weights of misclassified (resp. correctly classified) positive examples according to the uniformly weighted majority vote classifier increase (resp. decrease). Note that here we update the weights over the learning sample just once, focusing only on positive examples, whereas boosting algorithms (e.g. AdaBoost [14, 15]) repeatedly learn a “weak” classifier using a learning algorithm with a different probability distribution over the training sample. Intuitively, this step increases the weights of those positive examples which are “hard” to classify with the uniformly weighted classifier ensemble. This allows us to focus on “hard” positive examples while learning weights over the base classifiers.
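The reweighting step can be sketched as follows. The exact update formula is not reproduced in this text, so the sketch assumes an AdaBoost-style exponential factor applied once to the positive examples; the function name, the initial uniform weights and the final renormalisation are all assumptions of this illustration.

```python
import math


def reweight_positives(preds, labels):
    """One-shot reweighting of positive examples (assumed exponential form).

    ASSUMPTION: the factor exp(-y_i * f(x_i)) is used, where f is the uniform
    average of the base classifiers' {-1, +1} votes. Negative examples keep
    their initial weight; the result is renormalised to sum to 1.
    """
    n_classifiers = len(preds)
    n = len(labels)
    weights = [1.0 / n] * n
    for i in range(n):
        if labels[i] == +1:
            f = sum(preds[k][i] for k in range(n_classifiers)) / n_classifiers
            weights[i] *= math.exp(-labels[i] * f)
    total = sum(weights)
    return [w / total for w in weights]


# Toy data (hypothetical): example 0 is an "easy" positive, example 1 a "hard" one.
labels = [+1, +1, -1, -1]
preds = [
    [+1, -1, -1, -1],
    [+1, -1, -1, +1],
    [+1, +1, -1, -1],
]
weights = reweight_positives(preds, labels)
print(weights)
```

With this choice, a positive example misclassified by the uniform vote receives a factor larger than 1 and a correctly classified one a factor smaller than 1, which matches the qualitative behaviour described above.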
Then, we propose to learn the weights over the classifiers by optimizing the C-Bound, given by Equation 3, on the weighted training sample (Step 8). Since the weights form a distribution over the base classifiers, this is a constrained optimization problem: the bound is optimized subject to the weights being non-negative and summing to one.
Intuitively, on “hard” positive examples, the C-Bound tries to diversify the classifiers and at the same time controls the classification error of the classifiers, which is a key element for imbalanced datasets [10, 37]. As the above optimization problem is a constrained nonlinear problem, we use the Sequential Quadratic Programming algorithm [32], which uses a quasi-Newton method, to solve it.
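A minimal sketch of this optimization step is given below, assuming NumPy and SciPy are available. It writes the C-Bound through the first and second moments of the margin (see Appendix VI-A) and minimises it over the probability simplex with SciPy's SLSQP solver. The way the example weights enter the moments, the function names and the toy data are assumptions of this sketch, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize


def c_bound_objective(q, margins, sample_weights):
    """C-Bound on a weighted sample, written with the moments of the margin.

    margins[i, k] = y_i * h_k(x_i) in {-1, +1}; sample_weights sums to 1.
    The value is only a valid bound when the first moment is positive.
    """
    m = margins @ q                       # margin of the Q-weighted vote per example
    mu1 = np.dot(sample_weights, m)       # first moment  = 1 - 2 * Gibbs risk
    mu2 = np.dot(sample_weights, m ** 2)  # second moment = 1 - 2 * disagreement
    return 1.0 - mu1 ** 2 / max(mu2, 1e-12)


def learn_vote_weights(margins, sample_weights):
    """Minimise the C-Bound over the probability simplex with SLSQP."""
    k = margins.shape[1]
    q0 = np.full(k, 1.0 / k)  # uniform initialisation of the classifier weights
    result = minimize(
        c_bound_objective,
        q0,
        args=(margins, sample_weights),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda q: np.sum(q) - 1.0}],
    )
    return result.x


# Toy data (hypothetical): 6 examples, 3 classifiers, uniform example weights.
margins = np.array(
    [[1, 1, -1], [1, -1, 1], [1, 1, 1], [1, -1, -1], [-1, 1, 1], [1, 1, -1]],
    dtype=float,
)
sample_weights = np.full(len(margins), 1.0 / len(margins))
q = learn_vote_weights(margins, sample_weights)
print(q, c_bound_objective(q, margins, sample_weights))
```

In practice `sample_weights` would be the example weights produced by the reweighting step, so that errors and disagreements on "hard" positive examples count more in both moments.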
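Finally, the learned weights over the classifiers lead to a well-performing majority vote, given by Equation 1, tailored for imbalanced classification tasks. For any input example $x$, the final learned weighted majority vote over the base classifiers $h_1, \dots, h_K$ is given as follows:
$$B_Q(x) = \operatorname{sign}\left[\sum_{k=1}^{K} Q(h_k)\, h_k(x)\right].$$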
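In code, the prediction rule amounts to taking the sign of the weighted sum of the base classifiers' votes. The sketch below is a minimal illustration; the function name and the tie-breaking rule are assumptions.

```python
def weighted_majority_vote(votes, q):
    """Return the sign of sum_k q[k] * votes[k] for one example.

    votes[k] is the {-1, +1} prediction of base classifier k and q[k] its weight.
    ASSUMPTION: a weighted sum of exactly 0 is mapped to the positive class.
    """
    score = sum(w * v for w, v in zip(q, votes))
    return +1 if score >= 0 else -1


# The same votes lead to different decisions under different weights.
print(weighted_majority_vote([+1, -1, -1], [0.6, 0.2, 0.2]))
print(weighted_majority_vote([+1, -1, -1], [1 / 3, 1 / 3, 1 / 3]))
```

The example shows why the learned weights matter: a single heavily weighted classifier can overrule two lightly weighted ones.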
In this section, we present an empirical study of the performance of our algorithm DAMVI on the following datasets.
We have validated DAMVI on datasets from predictive maintenance, credit card fraud detection, webpage classification and medical applications. A description of these datasets is presented in Table I.
Predictive maintenance relies on equipment data (telemetry data) and historical maintenance data to track the performance of equipment in order to predict possible failures in advance. We considered the real-world Scania dataset (https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks), which is openly available and was collected from heavy Scania trucks in everyday usage. The positive class (minority class) corresponds to failures of a specific component of the Air Pressure System (APS) and the negative class corresponds to failures of components not related to the APS. The PCT Data consists of equipment data (sensor data) and maintenance data from trucks operating at Piraeus Container Terminal (PCT) in Athens, Greece. The positive class (minority class) corresponds to truck failures and the negative class corresponds to normally functioning trucks. This dataset is proprietary and was obtained thanks to a research collaboration.
Credit Card Fraud Detection: This task is composed of credit card transactions where positive class (minority class) examples are fraudulent transactions and negative class examples are non-fraudulent. The Credit Fraud dataset (https://www.kaggle.com/mlg-ulb/creditcardfraud) is an openly available real-world dataset consisting of credit card transactions that occurred during two days in September 2013. This dataset was collected and analyzed during a research collaboration between Worldline and ULB (Université Libre de Bruxelles).
Medical Datasets: We considered openly available datasets related to medical applications: Mammography and Protein Homo (https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.datasets.fetch_datasets.html). The Mammography dataset is composed of results from the eponymous breast screening method. The positive class (minority class) corresponds to a malignant mass and the negative class corresponds to a benign mass. The Protein Homo dataset is an openly available dataset from the 2004 KDD-Cup competition [4]. It is a protein homology prediction task where homologous (resp. non-homologous) sequences correspond to the positive (resp. negative) class.
IV-B Experimental Protocol
To study the performance of DAMVI, we considered following baseline approaches:
Random Oversampling Decision Tree (R-DT): This approach first balances the class distribution by randomly replicating minority class examples. Then, we learn a decision tree classifier on the oversampled data.
SMOTE Decision Tree (S-DT): This approach first oversamples the minority class examples using the Synthetic Minority Over-sampling Technique (SMOTE) algorithm. SMOTE oversamples the minority class by interpolating between several minority class examples that lie close together. After oversampling, we learn a decision tree classifier.
ADASYN Decision Tree (A-DT): This approach first oversamples the minority class examples using the Adaptive Synthetic (ADASYN) sampling algorithm. ADASYN computes a weight distribution over minority class examples to synthetically generate data for minority class examples that are harder to learn. After oversampling, we learn a decision tree classifier.
ROSBagging (R-BG): This approach first oversamples the minority class examples by following Random Oversampling (ROS) approach. Then, we learn an ensemble of decision tree classifiers on bootstrapped samples of oversampled data.
SMOTEBagging (S-BG): This approach first oversamples the minority class examples following the SMOTE algorithm. Then, we learn an ensemble of decision tree classifiers on bootstrapped samples of the oversampled data.
ADASYNBagging (A-BG): This approach first oversamples the minority class examples following the ADASYN algorithm. Then, we learn an ensemble of decision tree classifiers on bootstrapped samples of the oversampled data.
Balanced Bagging (BB): This approach balances the dataset using random undersampling. Then, an ensemble of decision tree classifiers is learnt on bootstrapped samples of the undersampled data.
Balanced Random Forest (BRF): This approach learns an ensemble of classification trees from balanced bootstrapped samples of original input data.
Easy Ensemble (EE): This approach learns an ensemble of AdaBoost learners trained on different balanced bootstrap samples.
For all oversampling based approaches (R-DT, S-DT, A-DT, R-BG, S-BG and A-BG), we used the ROS, SMOTE and ADASYN implementations of the imbalanced-learn Python package [27] to generate new minority class examples such that the number of minority class examples is equal to the number of majority class examples. For SMOTE and ADASYN, we considered nearest neighbours to generate synthetic examples.
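For readers unfamiliar with random oversampling, the following dependency-free sketch illustrates the idea behind ROS: minority examples are replicated at random until both classes have the same size. It is a stand-in written for this illustration (function name and toy data are hypothetical) and is not the imbalanced-learn implementation used in the experiments.

```python
import random


def random_oversample(X, y, rng, minority_label=+1):
    """Replicate randomly chosen minority examples until both classes have equal size."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    if not minority or len(minority) >= len(majority):
        return list(X), list(y)  # nothing to balance
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra  # original examples first, then the replicas
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 2 positive and 8 negative examples.
rng = random.Random(0)
X = [[i] for i in range(10)]
y = [+1, +1] + [-1] * 8
X_res, y_res = random_oversample(X, y, rng)
print(len(X_res), y_res.count(+1), y_res.count(-1))
```

SMOTE and ADASYN differ from this sketch in that they synthesise new points by interpolation instead of copying existing ones.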
For and BRF, we used implementations of imbalanced-learn python package with number of base learners equals to .
For our approach DAMVI (code available at https://github.com/goyalanil/DAMVI) and baselines , we fix the number of decision tree classifiers to and size of bootstrapped sample to of the size of original training data.
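Drawing bootstrapped samples whose size is a fixed fraction of the training set can be sketched as below. The fraction, the number of samples and the training set size used in the example are hypothetical placeholders, not the values used in the experiments.

```python
import random


def bootstrap_indices(n_examples, fraction, rng):
    """Draw a bootstrap sample (with replacement) of round(fraction * n_examples) indices."""
    size = max(1, round(fraction * n_examples))
    return [rng.randrange(n_examples) for _ in range(size)]


# Hypothetical settings: 3 bootstrap samples, each half the size of a 1000-example set.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [bootstrap_indices(1000, 0.5, rng) for _ in range(3)]
print([len(s) for s in samples])
```

Each list of indices would then be used to select the rows on which one base decision tree is trained.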
For our approach DAMVI, we learn the weights over base classifiers by optimizing the C-Bound on the weighted training sample.
For solving the constrained optimization problem, we used the Sequential Least SQuares Programming (SLSQP) solver (available in SciPy through `scipy.optimize.minimize` with `method="SLSQP"`) with uniform initialization of weights over the base classifiers, and scikit-learn [34] to learn the decision tree classifiers.
For all the experiments, we reserved of data for testing and the remaining for training.
Experiments are repeated times by each time splitting the training and the test sets at random over the
Under the imbalanced learning scenario, conventional evaluation metrics such as accuracy are unable to adequately represent the model’s performance on the minority class examples, which is typically the class of interest. Therefore, we evaluate the models based on two metrics: F1-score and Average Precision (AP), which are known to be relevant for imbalanced classification problems [8, 16, 20]. F1-score is defined as the harmonic mean of precision and recall. Average Precision (AP) is the area under the precision-recall curve, and it has been shown that AP, in the case of highly imbalanced datasets, is more informative than AUC ROC.
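Both metrics are simple to compute from first principles, as the following sketch shows. AP is computed here as the sum over ranks of the recall increment times the precision at that rank, a common step-wise approximation of the area under the precision-recall curve; the function names are assumptions, and ties between scores are not handled.

```python
def f1_score(y_true, y_pred, positive=+1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores, positive=+1):
    """Step-wise AP: mean of the precision values at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for t in y_true if t == positive)
    if n_pos == 0:
        return 0.0
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == positive:
            tp += 1
            total += tp / rank  # precision at this rank; recall grows by 1 / n_pos
    return total / n_pos


# Toy data (hypothetical): positives are ranked first and third.
y_true = [+1, -1, +1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
y_pred = [+1, +1, -1, -1, -1]
print(f1_score(y_true, y_pred), average_precision(y_true, scores))
```

Neither metric uses the number of true negatives, which is why both remain informative when the negative class dominates.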
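Table II: F1-score of DAMVI and the baselines on the Webpage, Mammography, Scania, Protein Homo, Credit Fraud and PCT Data datasets.
Table III: Average Precision (AP) of DAMVI and the baselines on the Webpage, Mammography, Scania, Protein Homo, Credit Fraud and PCT Data datasets.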
Firstly, we report the comparison of our algorithm DAMVI with all the considered baselines in Table II (for F1-score) and Table III (for Average Precision). As shown in Tables II and III, our proposed algorithm DAMVI performs best compared to baseline approaches for all datasets in terms of F1-score and for out of datasets in terms of Average Precision. Moreover, on PCT Data (where we have lowest imbalance ratio i.e. ), we perform significantly better than the baselines. According to the Wilcoxon rank sum test [26], in most cases, we are significantly better than the baselines with . We can also remark that DAMVI is more “stable” than R-BG (in general, the second best approach) according to the standard deviation values. Note that and BRF are able to create a diverse set of base classifiers on bootstrapped samples of input data. However, these approaches do not focus on learning the weights over the base classifiers tailored for imbalanced datasets. In contrast, DAMVI explicitly learns the weights by controlling the trade-off between the accuracy and the diversity among base classifiers by minimizing the PAC-Bayesian C-Bound (with focus on “hard” positive examples). Our results provide evidence that learning a diversity-aware weighted majority vote classifier is an effective way to deal with imbalanced datasets.
We also analyze the behaviour of all the approaches by artificially increasing and decreasing the imbalance for the Mammography dataset.
In order to create a dataset with a higher percentage of minority class examples than in the original dataset, we randomly undersample the majority class examples. Similarly, to create a dataset with a lower percentage of minority class examples than in the original dataset, we randomly undersample the minority class examples.
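The resampling used to vary the imbalance can be sketched as follows. Given a target fraction of positive examples, the sketch drops majority examples when the target is above the current fraction and minority examples when it is below. The function name, the rounding rule and the toy data are assumptions of this illustration.

```python
import random


def undersample_to_ratio(y, target_pos_fraction, rng, positive=+1):
    """Return sorted indices of a subsample whose positive fraction is close to the target.

    The target must lie strictly between 0 and 1.
    """
    pos = [i for i, label in enumerate(y) if label == positive]
    neg = [i for i, label in enumerate(y) if label != positive]
    current = len(pos) / len(y)
    t = target_pos_fraction
    if t > current:
        # Keep all positives and drop negatives: n_neg = n_pos * (1 - t) / t.
        n_neg = round(len(pos) * (1 - t) / t)
        neg = rng.sample(neg, min(len(neg), n_neg))
    else:
        # Keep all negatives and drop positives: n_pos = n_neg * t / (1 - t).
        n_pos = round(len(neg) * t / (1 - t))
        pos = rng.sample(pos, min(len(pos), n_pos))
    return sorted(pos + neg)


# Toy data (hypothetical): 10 positives and 90 negatives, i.e. 10% positives.
rng = random.Random(0)
y = [+1] * 10 + [-1] * 90
more_balanced = undersample_to_ratio(y, 0.20, rng)
less_balanced = undersample_to_ratio(y, 0.05, rng)
print(len(more_balanced), len(less_balanced))
```

Because only undersampling is used, no synthetic example is introduced, and the evaluated datasets remain subsets of the original one.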
Figure 1 illustrates the obtained results by showing the evolution of F1-score and Average Precision with respect to the imbalance ratio (i.e. percentage of positive class examples) on the Mammography dataset.
As shown in Figure 1, DAMVI performs better than baselines both in terms of F1-score and AP when the imbalance ratio (IR) is less than (except at for F1-score ).
This shows that DAMVI performs well even for highly imbalanced classification tasks ( of IR).
Below of IR, we can notice that EasyEnsemble (EE) gradually performs second best in terms of AP (but worst in terms of F1-score) and ROSBagging (R-BG) performs second best in terms of F1-score (but drastically drops in terms of AP).
However, our approach DAMVI remains “consistent” and “stable” both in terms of F1-score and AP throughout the evolution of imbalance ratio.
This shows that explicitly controlling the trade-off between the accuracy and the diversity among classifiers (by focusing on “hard” positive examples) plays an important role while learning an ensemble of classifiers for imbalanced datasets.
A note on the Complexity of the Algorithm: The complexity of learning a decision tree classifier is , where is the dimension of input space. We learn the weights over the base classifiers by optimizing Equation 3 (Step 8 of our algorithm) using the SLSQP method, which has time complexity of . Therefore, the overall complexity of DAMVI is . Note that we can easily parallelize DAMVI: by using machines, we can learn decision tree classifiers in parallel and then learn the weights over them.
In this paper, we considered the problem of imbalanced learning where the number of negative examples (majority class) significantly outnumbers the positive class (minority class or class of interest) examples. In order to deal with imbalanced datasets, we propose an ensemble learning based algorithm (referred to as DAMVI) that learns a diversity-aware weighted majority vote classifier over the base classifiers. After learning base classifiers, the algorithm i) increases the weights of positive examples (minority class) which are “hard” to classify with uniformly weighted base classifiers; and ii) then learns weights over base classifiers by optimizing the PAC-Bayesian C-Bound. We have validated our approach on various datasets and we show that DAMVI consistently performs better than state-of-the-art models. We also show that explicitly controlling the trade-off between the accuracy and the diversity among base classifiers (with focus on hard positive examples) is an effective strategy to deal with highly imbalanced datasets.
As future work, we would like to extend our algorithm to the semi-supervised case, where one has access to an additional unlabeled set during training. One possible way is to learn base classifiers using pseudo-labels (for unlabeled data) generated from the K-means classifier trained using labeled data. We would also like to extend our algorithm to the case of multiclass imbalanced classification problems. One possible solution is to make use of the multiclass C-Bound [25] to learn the diversity-aware weighted majority vote classifier.
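VI-A Proof of the C-Bound
Following [17], the proof of the C-Bound (Equation 3) relies on the margin of the weighted majority vote and its first and second statistical moments.
Let $M_Q$ be a random variable that outputs the margin of the weighted majority vote on the example $(x, y)$ drawn from distribution $\mathcal{D}$, given by:
$$M_Q(x, y) = \mathbb{E}_{h \sim Q}\, y\, h(x).$$
The first and second statistical moments of the margin are respectively given by
$$\mu_1(M_Q) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, M_Q(x, y) \quad \text{and} \quad \mu_2(M_Q) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \left[M_Q(x, y)\right]^2.$$
According to this definition, the risk of the weighted majority vote can be rewritten as follows:
$$R_{\mathcal{D}}(B_Q) = \Pr_{(x,y) \sim \mathcal{D}}\left(M_Q(x, y) \leq 0\right).$$
Moreover, the risk of the Gibbs classifier can be expressed thanks to the first statistical moment of the margin. Note that in the binary setting where $y \in \{-1, +1\}$ and $h : \mathcal{X} \to \{-1, +1\}$, we have $\mathbb{1}_{[h(x) \neq y]} = \frac{1}{2}\left(1 - y\, h(x)\right)$, and therefore
$$R_{\mathcal{D}}(G_Q) = \frac{1}{2}\left(1 - \mu_1(M_Q)\right).$$
Similarly, the expected disagreement can be expressed thanks to the second statistical moment of the margin by
$$d_{\mathcal{D}}(Q) = \frac{1}{2}\left(1 - \mu_2(M_Q)\right).$$
From above, we can easily deduce that $0 \leq d_{\mathcal{D}}(Q) \leq \frac{1}{2}$ as $0 \leq \mu_2(M_Q) \leq 1$. Therefore, the variance of the margin can be written as:
$$\operatorname{Var}(M_Q) = \mu_2(M_Q) - \left(\mu_1(M_Q)\right)^2.$$
The proof of the C-Bound
Applying the Cantelli-Chebyshev inequality (Theorem 1) to the margin with $a = \mu_1(M_Q) > 0$ (which is equivalent to $R_{\mathcal{D}}(G_Q) < \frac{1}{2}$), we obtain
$$R_{\mathcal{D}}(B_Q) = \Pr\left(M_Q - \mu_1(M_Q) \leq -\mu_1(M_Q)\right) \leq \frac{\operatorname{Var}(M_Q)}{\operatorname{Var}(M_Q) + \left(\mu_1(M_Q)\right)^2} = 1 - \frac{\left(\mu_1(M_Q)\right)^2}{\mu_2(M_Q)} = 1 - \frac{\left(1 - 2 R_{\mathcal{D}}(G_Q)\right)^2}{1 - 2\, d_{\mathcal{D}}(Q)}.$$
VI-B Mathematical Tools
Theorem 1 (Cantelli-Chebyshev inequality).
For any random variable $X$ s.t. $\mathbb{E}(X) = \mu$ and $\operatorname{Var}(X) = \sigma^2$, and for any $a > 0$, we have
$$\Pr\left(X - \mu \leq -a\right) \leq \frac{\sigma^2}{\sigma^2 + a^2}.$$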
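As a quick numerical sanity check, the Cantelli-Chebyshev inequality $\Pr(X - \mu \leq -a) \leq \sigma^2 / (\sigma^2 + a^2)$ can be evaluated on a small discrete distribution. The sketch below uses a hypothetical margin distribution and sets $a$ to the mean, which is the choice that yields the C-Bound; the function name and the numbers are assumptions of this illustration.

```python
def cantelli_sides(values, probs, a):
    """Return (P(X - mu <= -a), sigma^2 / (sigma^2 + a^2)) for a discrete variable X."""
    mu = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
    lhs = sum(p for v, p in zip(values, probs) if v - mu <= -a)
    return lhs, var / (var + a ** 2)


# Hypothetical margin distribution: 20% of the examples have a negative margin.
values = [-0.5, 0.5, 1.0]
probs = [0.2, 0.5, 0.3]
mean = sum(p * v for v, p in zip(values, probs))  # first moment, here 0.45
lhs, rhs = cantelli_sides(values, probs, a=mean)
print(lhs, rhs)
```

With $a$ equal to the mean, the left-hand side is the probability of a non-positive margin (the majority vote risk) and the right-hand side equals $1 - \mu_1^2 / \mu_2$, the C-Bound.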
This project has partially received funding from the European Union’s Horizon 2020 research and innovation programme under the grant agreement No 768994. The content of this paper does not reflect the official opinion of the European Union. Responsibility for the information and views expressed therein lies entirely with the author(s).
References
[1] (2019) Multiview boosting by controlling the diversity and the accuracy of view-specific voters. Neurocomputing 358, pp. 81–92.
[2] (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6 (1), pp. 20–29.
[3] (2010) “Good” and “bad” diversity in majority vote ensembles. In Multiple Classifier Systems, N. El Gayar, J. Kittler, and F. Roli (Eds.), Berlin, Heidelberg, pp. 124–133.
[4] (2004) KDD-cup 2004: results and analysis. ACM SIGKDD Explorations Newsletter 6 (2), pp. 95–108.
[5] SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
[6] (2003) SMOTEBoost: improving prediction of the minority class in boosting. In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 107–119.
[7] (2015) Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence, pp. 159–166.
[8] (2006) The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
[9] (2000) Ensemble methods in machine learning. In Multiple Classifier Systems, pp. 1–15.
[10] (2015-12) Diversity techniques improve the performance of the best imbalance learning ensembles. Inf. Sci. 325 (C), pp. 98–117.
[11] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[12] (2001) The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, Vol. 17, pp. 973–978.
[13] (1999) AdaCost: misclassification cost-sensitive boosting. In ICML, Vol. 99, pp. 97–105.
[14] (1997-08) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55 (1), pp. 119–139.
[15] (1995-09) Boosting a weak learning algorithm by majority. Inf. Comput. 121 (2), pp. 256–285.
[16] (2011) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (4), pp. 463–484.
[17] (2015) Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. JMLR 16, pp. 787–860.
[18] (2017) PAC-Bayesian analysis for a two-step hierarchical multiview learning approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 205–221.
[19] ADASYN: adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328.
[20] (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284.
[21] (2007) An empirical study of learning from imbalanced data using random forest. In 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Vol. 2, pp. 310–317.
[22] (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience.
[23] (2006) PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, pp. 769–776.
[24] (2002) PAC-Bayes & margins. In NIPS, pp. 423–430.
[25] (2014) On generalizing the C-bound to the multiclass and multi-label settings.
[26] (1975) Nonparametric statistical methods based on ranks. McGraw-Hill.
[27] (2017) Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18 (17), pp. 1–5.
[28] (2008) Cost-sensitive learning and the class imbalance problem. Vol. 2011, Citeseer.
[29] (2007) Generative oversampling for mining imbalanced datasets. In DMIN, pp. 66–72.
[30] (2008) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (2), pp. 539–550.
[31] Majority vote of diverse classifiers for late fusion. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 153–162.
[32] (2006) Numerical optimization. Springer Science & Business Media.
[33] (2008) Classifier ensembles: select real-world applications. Information Fusion 9 (1), pp. 4–20.
[34] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[35] (2012) Ensemble methods: a review. Advances in Machine Learning and Data Mining for Astronomy, pp. 563–582.
[36] (2010) Cost-sensitive learning methods for imbalanced data. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[37] (2009) Diversity analysis on imbalanced data sets by using ensemble models. In 2009 IEEE Symposium on Computational Intelligence and Data Mining, pp. 324–331.
[38] (2005) KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Transactions on Knowledge & Data Engineering (6).
[39] A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data. Pattern Recognition 77, pp. 160–172.
[40] (2003) Cost-sensitive learning by cost-proportionate example weighting. In ICDM, Vol. 3, pp. 435.
[41] (2012) Ensemble machine learning: methods and applications. Springer.