Imbalanced data classification refers to tasks of classifying datasets with significantly different numbers of instances among classes [haixiang2017learning]. Specifically, in the imbalanced binary classification problem, a large number of instances from one class usually dominates the dataset (the majority class), while only a few instances constitute the other class (the minority class). The problem of imbalanced binary data classification is common in engineering and scientific practice [huda2016BrainTumorImbalancedData][ashkezari2013EnergySectorApplication][triguero2015rosefw]. Since conventional classification algorithms usually suffer from unsatisfactory performance under imbalanced data, there have been studies focusing on designing algorithms specifically for this type of classification.
Among the widely used algorithms, ensemble methods, especially the dynamic ensemble of classifiers, have attracted considerable attention. The dynamic ensemble of classifiers trains multiple classifiers on different subsets of the data and adaptively selects or combines them during the inference process. This type of method attempts to mitigate performance degradation by selecting only the most competent base classifiers for a specific test instance [cruz2018MulticlassDynamicEnsembleSurvey][cruz2018dynamicDynamicEnsembleSurvey]. However, since the ensemble of classifiers is determined by local competence, the risk of overfitting increases at the same time. In this paper, inspired by the capacity of the Gaussian Mixture Model (GMM) [barber2012TextMoG][reynolds1995robust] to model the global and regional geometry of data and by the implicit regularization property of Stochastic Gradient Descent (SGD) [gardner1984learning][zhang2016implicitRegularisationSGD][schmidhuber2015SGD][bottou2018optimization], we propose a novel algorithm, the Adaptive Ensemble of classifiers with Regularization (AER), to solve binary classification problems in imbalanced scenarios. Here, the term ’Adaptive Ensemble’ indicates that the weight of each base classifier is adaptively chosen based on heterogeneous local geometries of the data, and the phrase ’with Regularization’ refers to the two types of regularization we introduce on top of the adaptive ensemble.
The algorithm first fits a Gaussian Mixture Model in an unsupervised manner to generate two types of subsets. The first type favors the representation of the global data manifold, and the second type emphasizes local geometries. One base classifier is learned for each sub-dataset, and the ensemble coefficients are learned by optimizing the cross-entropy loss between the combined probabilistic outputs and the labels with Stochastic Gradient Descent (SGD). Subsequently, during inference, the normalized coefficient of each individual classifier is determined by an interpolation between the on-the-fly likelihood and the trained classifier coefficient. Finally, since the output in binary classification can be determined by a flexible decision threshold, the final model uses an optimal threshold chosen on the validation data (rather than simply 0.5).
The proposed algorithm introduces two types of regularization on top of the standard dynamic ensemble. First, by training with SGD, the coefficients converge to a minimum-norm solution, which is equivalent to an implicit regularization of the coefficients. Second, by interpolating the likelihoods with the learned global coefficients, the global data geometry modeled by the GMM is utilized to normalize the likelihoods, introducing another form of regularization. By carefully choosing the interpolation parameter (often via validation), a balance between the ability to learn complex geometries and the ability to generalize leads to an overall satisfying performance. It can also be argued that, compared with other types of approaches, this resampling-based dynamic ensemble has better theoretical time and memory complexity, especially when the time complexity is super-linearly dependent on the number of instances (see Section 3 for more details).
With a specific implementation of a Gradient Boosting Machine [friedman2001greedy]-based individual classifier (XGBoost [Chen2016XGBoost]; we refer to the combined method as AER-XGBoost), the proposed AER method is tested on three datasets: UCI Bioassay, KEEL Abalone19, and GMM-based artificially generated data. Experimental results illustrate competitive performance of the algorithm and empirically justify the rationality of the proposed method. The overall performance of the proposed method outperforms multiple standard and state-of-the-art methods, including the recently proposed focal loss [lin2018FocalLoss]. In addition, for the specific case when the geometry of the data follows a Gaussian Mixture Model, the advantage of our method is especially significant.
1.1 Related Work
Performing high-accuracy classification with imbalanced data has long been a challenge, and a considerable number of academic publications have discussed algorithms to address the problem. The algorithms can be roughly categorized into four types [fernandez2013analysingImbalancedMethods][krawczyk2016ImbalancedLearningChallenge]. The first branch of methods is re-sampling, which aims to generate balanced data by under-sampling the majority class and/or over-sampling the minority instances [more2016ResamlpingSurvey]. The second branch consists of cost-sensitive algorithms, which address the problem by using imbalance-sensitive target functions and assigning special losses to certain types of misclassifications [khan2017costSensitiveTarget]. The third cluster of algorithms is the one-class learning method, which solves the problem by learning a representation of only the majority (or minority) data [bellinger2012oneClassLearning]. The final branch is the ensemble method, which is related to the algorithm employed in this paper.
Ensemble methods usually utilize a hybrid scheme of optimizing the model over both the data distribution and the algorithm parameters to obtain a satisfying overall performance [wozniak2014SurveyOnHybridIntelligentSystem]. For instance, both bagging and boosting algorithms utilize a strategy that puts emphasis on certain parts of the data at each iteration and combines multiple classifiers with adaptive parameters [de2005BaggingAndBoosing]. Meanwhile, since the individual classifiers in ensemble learning can be of a broad range of types, numerous publications have discussed the impact of different base classifiers, including Logistic Regression [galar2011LogisticRegressionEnsemble], Random Forest [khoshgoftaar2007RandomForestImbalancedData], and SVM [wang2017novelSVMensemble]. Bagging and boosting methods have been comprehensively examined for class-imbalanced data classification with the static ensemble approach [galar2012ReviewEnsembleImbalancedData]. However, confusing noise with minority data can be a major source of performance deficiency for static ensemble methods [wozniak2014SurveyOnHybridIntelligentSystem][galar2012ReviewEnsembleImbalancedData][khoshgoftaar2011ComparingBaggingAndBoosting].
In contrast, dynamic ensemble methods change the ensemble according to the instance under inference. This technique enhances the flexibility of the model and reduces prediction bias; however, it also increases computational complexity and the possibility of overfitting. Early work on dynamic ensembles, such as Woods (1997) [woods1997EearlyDynamicEnsemble], usually utilizes a ’rank-based’ selection combined with the Dynamic Classifier Selection (DCS) scheme, which selects the single model with the highest accuracy during the inference procedure. More sophisticated methods, in comparison, often adopt the Dynamic Ensemble Selection (DES) scheme, which selects multiple classifiers for prediction [ko2008DCSandDES]. For instance, Lin (2014) [lin2014libd3cClusterBased] proposed a method to dynamically ensemble classifiers based on clustering results; Cruz (2015) [cruz2015MetaLearningEnsemble] designed an algorithm to combine classifiers with meta-learning; and Xiao (2012) [xiao2012DynamicWithCostSensitive] used a cost-sensitive criterion to determine the ensembles of multiple classifiers. In a review of dynamic ensembles, especially for multi-class problems, Cruz (2018) [cruz2018MulticlassDynamicEnsembleSurvey] argued that dynamic ensembles of classifiers can in general provide favorable results.
The major weakness of dynamic ensemble methods is that they tend to overfit, which deteriorates their performance on test data [cruz2018MulticlassDynamicEnsembleSurvey][Lima2014Improving]. In machine learning, regularization is often used to reduce overfitting; however, there is limited research on dynamic ensembles with regularization for imbalanced data classification. For dynamic ensemble methods, a major obstacle to regularization is the constraints on the weights: conventional norm-based regularization reduces complexity by minimizing the norms (magnitudes) of the weights, but in this scenario the weight of each base classifier cannot be shrunk, since each weight should lie in [0,1] and all weights should sum to 1. Other existing regularization techniques are either not applicable to the scenario (e.g. NoiseOut [babaeizadeh2016noiseout], which is designed solely for neural networks) or considered too ’aggressive’ in the linear-combination setup (e.g. dropout [srivastava2014dropout], which opts out some base classifiers entirely and is likely to cause errors). In addition, balancing the prediction error against the normalized-weight constraints poses a significant challenge in optimization: an algebraic closed-form solution cannot be obtained by taking derivatives, and if one treats the weights as a categorical distribution to perform the optimization, the corresponding likelihood (Bernoulli in the binary classification regime) is not conjugate with the categorical distribution. Therefore, conventional (explicit) regularization is difficult to implement in dynamic ensembles. One of the major contributions of this paper is to resolve the above dilemma with the implicit regularization capability of SGD [zhang2016implicitRegularisationSGD][lin2015iterative][Lin2016Generalization] and with the global geometry provided by the GMM facilitating the implicit regularization.
Regarding applications of imbalanced data classification, since this type of data exists broadly in practice, the technique has been widely applied across different areas. In bioscience and medical research, imbalanced data classification has been utilized to identify tumors [huda2016BrainTumorImbalancedData] and diagnose cancer [krawczyk2016CancerImbalanceData]. Likewise, in software engineering, such techniques have been employed to detect bugs [xia2015BugDetection] or malignant software [chen2018MalwareDetection]. In other fields, such as financial fraud detection [mardani2013FraudDetection] and power transformation [ashkezari2013EnergySectorApplication], imbalanced data classification is also widely employed. Guo (2017) [haixiang2017learning] conducts a survey of applications of imbalanced data classification and shows the promising potential of applying such techniques to a broader range of problems.
The rest of the paper is organized as follows: Section 2 introduces the algorithm and its properties in detail. Section 3 analyzes the advantageous time and memory complexity of the proposed algorithm. Section 4 demonstrates the experimental results on the datasets mentioned above and discusses the results and implications. Lastly, Section 5 provides a general conclusion of the paper.
In this section, the proposed AER method will be introduced in detail in four major parts: Section 2.1 introduces the Gaussian Mixture Model fitting and the generation of the two types of subsets; regarding the training of the individual ’base’ classifiers, Section 2.2 discusses the specific implementation with XGBoost; the SGD training for the ensemble of classifiers is illustrated in Section 2.3; and finally, the weight interpolation and probabilistic prediction are shown in Section 2.4. The overall procedure of the algorithm is shown in Figure 1. The following subsections discuss and analyze each component of the algorithm in detail.
In this paper, denotes a single instance of the data in the dataset . To distinguish the majority and minority data, the authors use (key data, usually the minority) to denote the set of minority instances and (non-key data, usually the majority) to represent the set of majority ones. The size of the dataset is mostly denoted by and the dimension (the number of features) is represented by . In addition, denotes the number of components of the ensemble, which is also the number of Gaussian distributions in the Gaussian Mixture Model.
2.1 Gaussian Mixture Model Fitting and Subset Generation
The Gaussian Mixture Model (GMM) is a popular model in unsupervised learning and data manifold representation. The basic idea of GMM is straightforward: it utilizes the modeling capability of the Gaussian distribution and extends it to multiple centroids to improve expressiveness. The likelihood of a single instance under a GMM can be denoted as:
where the term denotes a multivariate Gaussian distribution with the corresponding mean and covariance. When fitting the model, the parameters can be obtained by maximizing the log-likelihood:
The maximization can be solved by the expectation-maximization (E-M) algorithm with a super-linear convergence rate [xu1996EMalgorithmSuperLinearConvergence]. In our implementation, the Gaussian Mixture Model provided by scikit-learn [pedregosa2011scikit] is adopted directly to perform the GMM fitting.
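For reference, the likelihood and the maximum-likelihood objective described above take the standard GMM form; the symbols below (the number of components, mixture weights, means, covariances, and dataset size) are assumptions of this sketch rather than the paper's original notation:

```latex
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad
\max_{\{\pi_k,\,\boldsymbol{\mu}_k,\,\boldsymbol{\Sigma}_k\}}
  \;\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}\!\left(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)
```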
It is notable that GMM fitting is sensitive to initialization. To obtain stable results, the optimization is performed 5 times for each training procedure, and the model with the highest log-likelihood is selected. Another non-learnable parameter of the GMM is the number of Gaussian distributions; in the proposed algorithm, this hyper-parameter also determines the number of components of the final ensemble. To find the number of components that optimally balances likelihood and computational complexity, the Bayesian information criterion (BIC) is adopted. The BIC can be computed as follows:
where the first quantity indicates the likelihood of an instance. In terms of the GMM presented in our method, the BIC is computed as follows:
where the remaining quantity stands for the number of parameters of the model. In the training process, a ’pool’ of candidate numbers of Gaussian centroids is given, and the algorithm computes the BIC of each model and picks the one with the smallest BIC. We choose BIC instead of the Akaike information criterion (AIC) because BIC tends to favor the model that overfits the data less [dziak2012AICBICmeasure]. Since regularization plays an important role in our algorithm, BIC is adopted for this hyper-parameter choice.
After obtaining the GMM with its Gaussian distributions, we form sub-datasets under two schemes. The first scheme selects the most representative data for each Gaussian distribution: specifically, for each Gaussian component, the algorithm selects the majority instances with the highest log-likelihood. Under the second scheme, the algorithm generates subsets whose majority instances are those with the highest likelihoods with respect to each Gaussian component, concatenated with majority instances randomly selected from the whole set. Majority subsets of both types are then combined with all of the minority instances.
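As a concrete sketch of this step, the BIC-based model selection and the two subset types might look as follows; the per-subset sizes `n_top` and `n_rand` are assumed parameters here, since the exact counts are given symbolically in the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_gmm(X, k_pool=(2, 3, 4, 5), seed=0):
    # Fit one GMM per candidate component count (5 restarts each, as in
    # the text) and keep the model with the lowest BIC.
    models = [GaussianMixture(n_components=k, covariance_type="diag",
                              n_init=5, random_state=seed).fit(X)
              for k in k_pool]
    return min(models, key=lambda gm: gm.bic(X))

def build_subsets(gm, X_maj, X_min, n_top, n_rand, seed=0):
    """Two subset types per Gaussian component:
    type 1: top-likelihood majority points (preserves global geometry);
    type 2: fewer top-likelihood points plus randomly drawn majority
            points (focuses on local geometry near the minority class).
    All minority instances are appended to every subset."""
    rng = np.random.default_rng(seed)
    type1, type2 = [], []
    for k in range(gm.n_components):
        # Per-component log-density of every majority instance.
        logp = multivariate_normal(mean=gm.means_[k],
                                   cov=np.diag(gm.covariances_[k])).logpdf(X_maj)
        top = X_maj[np.argsort(logp)[::-1][:n_top]]
        rand = X_maj[rng.choice(len(X_maj), size=n_rand, replace=False)]
        type1.append(np.vstack([top, X_min]))
        # Note: the random draw may overlap the top-likelihood points;
        # this is acceptable for a sketch.
        type2.append(np.vstack([top[:len(X_min)], rand, X_min]))
    return type1, type2
```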
After obtaining the above data, Tomek Links [tomek1976Tlink] are used to remove instances from the first type of subset that are considered noise. The Tomek Link follows the idea that if two instances are mutually nearest neighbors but belong to different classes, they are ’overlapping’ instances between the classes and are therefore likely to be noise. Formally, for two given data instances and a given distance measure, if for any other instance there exists:
then the pair will be considered a Tomek Link. If the corresponding labels of the Tomek Link pair belong to different classes, then the majority and/or minority instance in the pair is considered noise, and one or both of them are removed. In our algorithm, since the minority class is the more important one to spot, the majority instances in the Tomek Links are removed.
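A minimal sketch of this majority-side Tomek Link removal, implemented directly with mutual nearest neighbors (library routines such as those in imbalanced-learn could be substituted):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_majority(X, y, majority_label=0):
    """Drop majority points that form Tomek Links (mutual nearest
    neighbors of opposite classes), keeping all minority points."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Column 0 is each point itself (distance 0); column 1 is its
    # nearest other point. Assumes no duplicate rows.
    nbr = nn.kneighbors(X, return_distance=False)[:, 1]
    mutual = nbr[nbr] == np.arange(len(X))          # mutual nearest neighbors
    is_link = mutual & (y != y[nbr])                # opposite classes
    drop = is_link & (y == majority_label)          # remove majority side only
    return X[~drop], y[~drop]
```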
By performing the above process, sub-datasets with less significant label skewness become available. The reason for adopting a combination of two selection schemes is that this strategy achieves a balance between the recognition of majority and minority instances. The first type of subset preserves the information of the global geometry and contributes to the recognition of majority instances, while the second type puts emphasis on local geometry and improves the accuracy of spotting minority instances. Specifically, for the first type of subset, since majority instances make up most of the subset, it is able to preserve the global geometry (like a ‘zoomed-out’ version). On the other hand, for the second type of subset, since the numbers of majority and minority instances are almost the same, which makes the choice of the majority samples ‘highly selective’, the classifier is able to focus on the complex boundaries near the minority samples; this can be deemed ‘focusing on the local geometry’ (like a ‘zoomed-in’ version). Concatenating randomly selected majority instances adds a certain amount of global-geometry information to avoid overfitting.
The overall procedure of GMM fitting and subset generation is shown in Algorithm 1.
2.2 Fitting of Individual Base Classifier
As stated above, the specific classifier implemented in this paper is the Gradient Boosting Machine (GBM), a boosting-based algorithm. The GBM model can be expressed as follows:
Like other boosting methods, the training strategy of GBM is to learn from ’previous mistakes’. Specifically, the individual sub-model of GBM at each step takes as its current ’labels’ the gradient of the loss function with respect to the model accumulated up to the previous step, which can be expressed as:
where the loss can be any kind of loss function; it is usually the square loss for regression and the cross-entropy loss for classification. The gradient computed with equation 7 is also called the ’pseudo-residual’; since the gradient is recalculated at each step, the overall method is named Gradient Boosting Machine. After obtaining the current target, the parameters of the sub-model at the current step can be denoted as:
The overall model at each step is further determined by a ’learning rate’, which can be obtained by optimizing the following target function:
The above optimization can be solved simply either by taking a partial derivative or through a line search. By iterating the procedure from equation 7 to equation 9 until the convergence criteria are met, the integrated GBM model is obtained.
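The iteration from equations 7 to 9 can be sketched for the squared-loss case, where the pseudo-residual is simply the ordinary residual; the shallow trees, round count, and fixed learning rate below are assumptions of this sketch, not the paper's settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=50, lr=0.1):
    """Minimal GBM for squared loss: each round fits a shallow tree to
    the negative gradient (pseudo-residual) of the loss at the current
    model, then adds it with a fixed learning rate."""
    f = np.full(len(y), y.mean())       # initial constant model
    trees = []
    for _ in range(n_rounds):
        residual = y - f                # negative gradient of 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        f += lr * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def gbm_predict(init, trees, X, lr=0.1):
    # Sum the initial constant and the learning-rate-scaled tree outputs.
    return init + lr * sum(t.predict(X) for t in trees)
```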
In our implementation, an integrated, highly efficient, and scalable Gradient Boosting library, namely XGBoost [Chen2016XGBoost], is employed to fit GBM models and make predictions. For each generated sub-dataset, the algorithm fits one XGBoost model.
2.3 Stochastic Gradient Descent Training for the Ensemble of Classifiers
After training the individual models, each classifier can give a class prediction (0 or 1) for every data instance. The next step is to train the combination of the individual classifiers with SGD. For convenience, each individual model will be denoted as in this subsection. Hence, the linear combination of the models can be denoted as:
The constraints of equation 10 guarantee that the values of the predictions lie in [0, 1], and thus they can trivially be transferred to a binary-class prediction. To train the model, the two-class cross-entropy loss is adopted:
where the second line of the above equation is the vectorized expression. Notice that because of the constraint on the weights, the above optimization cannot be accomplished by simply taking derivatives and setting them to 0. To approximate the optimal solution and take regularization into consideration, SGD is adopted. Specifically, the gradient of target 11 with respect to the weights should be:
Notice that if gradient descent is applied directly, the update in formula 12 neither guarantees that the weights sum to 1 nor that each weight lies in the interval [0, 1]. Nevertheless, for a gradient-based method, we can simply re-normalize the weights after each update; in addition, weights exceeding the interval limits can be re-scaled to the boundary values (0 or 1). Thus, the update formula should be:
where is the learning rate, and denotes the re-scaling (mapping) function flooring at 0 and ceiling at 1, which can be denoted mathematically as:
It is recommended that the learning rate be set small enough to ensure that the algorithm converges. To determine whether the training procedure has converged, the relative change in the cross-entropy loss is adopted as the metric.
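One update of this constrained scheme, with the clamp-and-renormalize projection applied after the gradient step, can be sketched as follows; the batch handling is simplified, and the gradient corresponds to the cross-entropy loss of the linearly combined probabilities:

```python
import numpy as np

def sgd_step(w, H, y, lr):
    """One SGD step on the cross-entropy between the combined prediction
    p = H @ w and the labels y, followed by projection so the weights lie
    in [0, 1] and sum to 1. H is (batch, n_classifiers), holding each
    base classifier's probabilistic output for the batch."""
    p = np.clip(H @ w, 1e-12, 1.0 - 1e-12)
    # d/dw of the mean cross-entropy -(y log p + (1 - y) log(1 - p)).
    grad = H.T @ ((p - y) / (p * (1.0 - p))) / len(y)
    w = np.clip(w - lr * grad, 0.0, 1.0)   # floor at 0, ceiling at 1
    s = w.sum()
    return w / s if s > 0 else np.full_like(w, 1.0 / len(w))
```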
Another concern when using SGD is how to initialize the coefficients, since the optimization is sensitive to initial values. In the proposed algorithm, the initialization of the parameters is accomplished by a combination of AIC and BIC. Similar to BIC, AIC can be expressed as:
where the parameter balances AIC and BIC, and experiments suggest a well-performing trade-off value. Notice that here the AIC and BIC are computed with respect to each classifier, and lower AIC/BIC values indicate a more credible solution. Thus, we can use the normalized reciprocals of the values to initialize the linear combination. The initial values can be denoted as:
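The AIC/BIC-based initialization can be sketched as below; the name `gamma` and its default value stand in for the trade-off parameter mentioned above and are assumptions of this sketch:

```python
import numpy as np

def init_weights(aic, bic, gamma=0.5):
    """Initialize ensemble weights from per-classifier AIC/BIC scores.
    A lower combined score means a more credible classifier, so the
    weights are the normalized reciprocals of the scores. `gamma`
    (assumed) trades off AIC against BIC; positive scores are assumed."""
    score = gamma * np.asarray(aic, float) + (1.0 - gamma) * np.asarray(bic, float)
    w = 1.0 / score
    return w / w.sum()
```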
The overall procedure for optimizing the linearly combined base classifiers is given in Algorithm 2.
2.4 Weight Interpolation and the Probabilistic Prediction
The above three consecutive parts have discussed generating balanced sub-datasets and training the individual and combined classifiers. As a dynamic ensemble method, the coefficient of each individual classifier should be computed on the fly for the test instance(s) during inference. Since there are a number of base classifiers, each trained on its corresponding data subset, they have different impacts on the test data. Intuitively, by computing the ’distance’ (expressed via the likelihood) between a specific test instance and the Gaussian centroid a classifier is based on, the impact of that base classifier on the test instance can be evaluated without knowing the test label. Following this strategy, an interpolation scheme is adopted to adjust the weights and implement the dynamic ensemble according to the test data. The interpolation is based on the log-likelihood/exp-likelihood calculated for a specific test instance and the previously trained coefficient. This likelihood favors the classifier with the best local competence (the highest likelihood) for each test instance. For any test instance, the component of the likelihood will be:
However, this makes the prediction dedicated to the local data geometry and thus easy to overfit. In the proposed algorithm, the global data geometry modeled by the GMM in the data pre-processing stage is further included to constrain the dynamic fitting. In this way, the second type of regularization is introduced, and the influence of the local data geometry is reduced.
For any test instance , the component of the normalized likelihood will be:
And the final interpolation is computed as:
where the ’’ operation in the above equation means pairwise summation of two vectors, and the ’interpolation parameter’ has its optimum found via grid search on the validation data. It is also noticeable that the log-likelihood naturally takes multiple classifiers into consideration during classification, while the exp-likelihood often generates a nearly one-hot vector over the different Gaussian distributions. The results computed by equation 20 satisfy the condition of summing to 1 and lie in the interval [0, 1].
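A sketch of the interpolation step: the test instance's per-component log-likelihoods are normalized (a softmax-style normalization is assumed here) and mixed with the trained coefficients; `alpha` stands in for the interpolation parameter chosen on validation data:

```python
import numpy as np

def dynamic_weights(log_lik, w_trained, alpha=0.5):
    """Interpolate the trained global coefficients with the test
    instance's per-component likelihoods. log_lik holds the instance's
    log-likelihood under each Gaussian component; w_trained are the
    SGD-trained coefficients (non-negative, summing to 1)."""
    z = np.exp(log_lik - log_lik.max())     # numerically stable softmax
    lik = z / z.sum()
    w = alpha * lik + (1.0 - alpha) * w_trained
    return w / w.sum()                      # both terms sum to 1 already
```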
Following the above procedure, the algorithm outputs a probabilistic value in the unit interval for each sample. The output can be regarded as the probability of the positive (minority) class, and instead of simply setting all samples with outputs greater than 0.5 to 1 and the rest to 0, the threshold can be fine-tuned following the equation:
where the tuned quantity can be regarded as a ’threshold value’ whose optimum can be found via grid search on the validation data.
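The threshold grid search on validation data can be sketched as below; the selection metric (F1 here) is an assumption of this sketch, and any imbalance-aware score can be substituted:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(p_val, y_val, grid=np.linspace(0.05, 0.95, 19)):
    """Return the decision threshold from `grid` that maximizes the
    chosen metric on validation probabilities p_val."""
    scores = [f1_score(y_val, (p_val >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]
```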
The overall procedure of the proposed AER with the XGBoost implementation (AER-XGBoost) is given in Algorithm 3.
3 Time and Memory Complexities
In this section, we will show that the proposed AER method has favorable time and memory complexity. In particular, we will show theoretically that, under certain assumptions and for any classifier implemented within the AER framework, the time complexity is asymptotically at least as good as the original implementation, and the asymptotic memory complexity is always better than the full-batch implementation.
To begin with, let us recap the notation used in the AER model. Recall that denotes the number of instances and represents the number of features. For minority and majority data, and are used respectively. The skew rate is denoted as , and it is straightforward to get that . The number of Gaussian centroids is given as , and in most cases it should be small, as a large value would otherwise defeat the purpose of re-sampling (one could simply train balanced subsets covering all of the training set). Notice that this also implies as . The number of iterations of the GMM E-M algorithm is denoted by and the number of iterations of the SGD algorithm is denoted by . The time complexity of any machine learning classifier can be denoted as a polynomial of the number of instances and the number of features , where the exponents should be positive integers. Similarly, the memory complexity is denoted by . We care mostly about the complexity of the training process, as this is usually the part that consumes most of the time and memory.
To derive a bound that does not depend on the GMM-fitting or SGD part, and to help draw fair comparisons between AER-implemented and original methods, the analysis is based on the assumption that the GMM covariance inversion and likelihood are estimated with a diagonal covariance approximation. This removes the high-order terms of and reduces the time complexity of computing the GMM accordingly, as the inversion and multiplication of the covariance can be completed in linear time in the dimension. Also, we assume the choice of parameters is based on the validation set, whose size is considerably smaller, with , where is the number of validation data points.
3.1 Time Complexity
For any Machine Learning method with polynomial training time complexity , the AER time complexity can be denoted with . Under the assumptions stated above, the following theorem can be derived:
Given the conditions of and , the following property holds: If , which means , then there will be ; otherwise, if , which means , then there will be .
The theorem can be proved by a simple analysis. The time complexity of the AER method can be decomposed into 4 parts: the complexity of fitting the GMM model, the training complexity of the individual classifiers, the SGD training, and the validation part to find the optimal parameters. Each part has the following complexity:
Fitting the GMM model. The algorithm will fit Gaussian distributions, and it will take to fit the models under diagonal covariance. The overall complexity will be . Notice that the fact is used in the derivation.
Training of individual classifiers. For the first type of re-sampled data, the number of training instances will be ; and for the second type of re-sampled data, the amount of training samples will be . Given the polynomial form time complexity , the complexity of this part will be .
Stochastic gradient descent. This part will take , where is the batch size of SGD and is the number of iterations. The batch size is a constant and can therefore be hidden asymptotically, yielding the stated runtime.
Validation of the optimal and parameter. Under the diagonal covariance approximation, the likelihood estimation of a single data point will be . Estimating the whole validation set will be . The optimal parameter values need to be obtained via multiple runs, but this factor can be hidden as it is a constant.
Summarizing the above terms, the overall complexity will be . Since we have , the first term can be hidden. Since the conditions are given as and , the third term can be hidden. Finally, since we assumed a large training set and a small validation set with , the final part can be hidden, and the complexity will be .
Now for the two cases:
If one plugs in , one can derive .
If one finds , there will be: . The conclusion can be drawn by applying L'Hôpital's rule. ∎
The theorem indicates that by re-sampling the dataset, the proposed AER method can reduce the time complexity when the original complexity is super-linear with respect to the amount of data, and is no worse than the original full-batch implementation when the complexity is linear in the amount of data.
Table 1 illustrates a comparison of time complexity between common machine learning classifiers implemented with the original full-batch scheme and with the AER framework. From the table, it can be observed that the higher the order of the data-size term in an algorithm, the more advantage the AER framework brings. The Gradient Boosting Machine (GBM), which is the method of choice for our base classifier, is also listed in the table, with the number of trees in the algorithm denoted separately. Notice that our implementation of the GBM is based on XGBoost, a parallelized GBM method that does not fall into the polynomial-time regime of our analysis. Nevertheless, the rigorous analysis of the time complexity provides convincing support for the advantage of the proposed AER method.
3.2 Memory Complexity
For any Machine Learning method with polynomial training memory complexity , the AER memory complexity can be denoted with . With the assumptions stated above, one could get the following theorem:
For any ,
Similar to the analysis of time complexity, the memory complexity of AER is decomposed into the 4 parts:
Fitting the GMM model. The model needs to store values under the setting of diagonal variance, thus the memory complexity will be .
Training of individual classifiers. Similar to the time complexity proof, the two types of subsets will have their numbers of samples in and , respectively. A difference here is that, for memory complexity, the same memory can be reused for every Gaussian component. Thus, the memory complexity will be .
Stochastic gradient descent. One only needs to keep slots in the memory to update weights so that the memory complexity will be .
Validation of the optimal and parameter. For each Gaussian component, the validation process will take memory, and each data point will need . The likelihoods of the data will be stored, which adds another term to the complexity. The overall complexity of this part will be .
The final complexity will be given as . Since , the complexity can be simplified to . With a simple derivation, one can get ∎
Notice that the theorem on memory complexity is a stronger conclusion than its time counterpart. Firstly, it removes the restrictions on iteration counts, and the memory complexity is unconditionally bounded. Secondly, Theorem 2 proves a strict upper bound with little-o notation (meaning asymptotically strictly slower growth), regardless of the choice of parameters.
4 Experiments and Discussion
Three datasets are employed to test the performance of the proposed method empirically: the UCI Bioassay dataset (AID 362)111Available Publicly, url: https://archive.ics.uci.edu/ml/datasets/PubChem+Bioassay+Data#, the Abalone 19 dataset222Available Publicly, url: https://sci2s.ugr.es/keel/category.php?cat=imb&order=ir#sub2, and a set of artificially generated imbalanced data sampled from a 10-center Gaussian Mixture Model333Available Publicly, url: https://github.com/jhwjhw0123/GMM-Generated-data-imbalance-classification. For the UCI Bioassay data, detailed figures of the performance with respect to various parameters are included, while for the other 2 sets, the paper emphasizes comparisons between the optimal solutions of the proposed method and existing methods. For all three datasets, the time for executing the training, grid-search cross-validation, and testing processes is reported (see Section 4.2.2 for details), although the time for some of the methods is not available for the UCI Bioassay data.
The UCI Bioassay dataset was originally published in 2009 [schierz2009BioassayData], and it contains information relating screening bioassay status to the activeness of the outcome. There are 4279 records in the dataset, of which 60 are labeled as 'active' (minority data, labeled as 1) and the rest are labeled as 'inactive' (majority data, labeled as 0). The skew rate is about 1:70, and the training and testing data are split with a fixed rate. There are no missing values or non-numerical values in the dataset. As a popular dataset, previous academic endeavors have tested various methods on it. A 'Result' file accompanies the dataset for reference purposes, in which the performances of Naive Bayes, cost-sensitive SVM, and cost-sensitive C4.5 (Decision Tree) are documented. In this section, the performance of our algorithm will be compared with these given methods. In addition, to demonstrate the favorable performance of the proposed ensemble paradigm over novel imbalanced data classification algorithms of other branches, a comparison between the proposed method and XGBoost with the recently proposed focal loss [lin2018FocalLoss] is also illustrated. Notice that since the results of the conventional methods are retrieved from the literature, the exact running time of these methods is not available, and we are not able to perform statistical tests for this dataset as the original predictions are unknown to us.
The Abalone 19 dataset was originally presented in the KEEL dataset repository [alcala2011keel] as a real-life imbalanced binary classification example. There are 4174 records in the dataset, with 32 of them marked as 'positive', meaning they belong to the minority (class 19), and 4142 labeled as 'negative', indicating they are from the majority (the other classes). The skew ratio is around 1:129, more significant than that of the UCI Bioassay data. The number of features is 8, with one column represented as a categorical variable (the sex of the abalone). Unlike UCI Bioassay, there is insufficient literature providing reliable benchmark results for Abalone 19. Thus, in the experiments, algorithms including SVM, Decision Tree (C4.5/CART), and focal-loss XGBoost are tested with sk-learn implementations. Also, the execution time is included and statistical tests are performed to verify the effectiveness of the AER models.
Finally, to test the performance of the proposed method in the specific case where the data geometry truly follows a Gaussian Mixture Model distribution, a set of 8000 samples is generated through the sk-learn Make-Classification method. The skew rate is 1:79, with 7900 samples labeled as '0' and 100 marked as '1'. The number of features is specified as 15, with 9 of them generated from a 9-d GMM model, 3 of them obtained by combining the generated dimensions, and 2 of them filled with random noise. The number of Gaussian distributions in the corresponding GMM is 10. Notice that the 'number of Gaussian centroids' applies to both the majority and minority data, which means the 100 positive-labeled samples also come from 10 clusters. This increases the difficulty of classification and poses the challenge of learning a complex decision boundary while preserving good generalization ability. As we will see in the corresponding section, the proposed method performs well on this dataset, while some of the cost-sensitive methods completely lose the ability to grasp anything meaningful.
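The generation setup described above can be sketched with sk-learn's make_classification. The exact split between redundant and noise features and the random seed are assumptions for illustration, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=8000,
    n_features=15,
    n_informative=9,          # dimensions drawn from the underlying GMM
    n_redundant=3,            # combinations of the informative dimensions
    n_clusters_per_class=10,  # 10 Gaussian clusters per class
    weights=[7900 / 8000, 100 / 8000],  # 1:79 skew
    flip_y=0.0,               # no label noise, keep the skew exact
    random_state=0,           # illustrative seed
)
```

With `n_clusters_per_class=10`, the minority class is also spread over 10 clusters, reproducing the difficulty discussed above.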
It is noticeable that except for UCI Bioassay AID 362, the other two datasets are not partitioned into training/testing parts. Furthermore, for the UCI Bioassay data, the original results provided in the literature did not mention validation data, which is crucial in our case to determine the threshold value in equation 21. Thus, in the experiments, the training data of UCI Bioassay is split with a ratio of 5:1 for both majority and minority instances. Similarly, the Abalone 19 and GMM-generated data are split into training, validation, and testing data with a ratio of 3:1:1. The optimal parameter value is obtained via a grid search with resolution 0.025 on the validation set. One might have concerns regarding the fairness of comparing the performance of the proposed algorithm with that of other methods. However, since no additional data is provided to the proposed algorithm, the experimental results are not biased in favor of the proposed method, and the comparisons are not unfair against existing methods.
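A minimal sketch of the 3:1:1 stratified split, assuming synthetic stand-in data and illustrative seeds:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features/labels come from the datasets above.
rng = np.random.RandomState(0)
X = rng.randn(1000, 8)
y = (rng.rand(1000) < 0.05).astype(int)

# 3:1:1 train/validation/test, stratified so minority instances appear in every part.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)                  # 3 : (1 + 1)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)   # 1 : 1
```

Stratification matters here: with a 1:129 skew, an unstratified split can easily leave the validation set with no minority instances at all.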
4.2 Evaluation Metrics
4.2.1 Performance Metrics
For an ordinary classification problem, accuracy can simply be used as the sole metric to evaluate performance. However, for label-skewed data, an algorithm can often achieve a satisfying accuracy simply by predicting every instance as the majority class. Thus, in this scenario, the classification results for the majority and the minority data should be examined separately. Specifically, if one regards the minority data as Positive (P) and the majority as Negative (N), then combining the prediction results and the ground-truth labels yields four prediction outcomes: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). As a conventional analytical approach, precision and recall are introduced to evaluate the quality of the classification of the majority/minority data. The precision and recall metrics are computed as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN)
Notice that, in this paper, the concepts of 'precision' and 'recall' are extended to class-specific metrics, in contrast with conventional statistical analysis, which focuses only on the positive (minority) samples. Thus, in our experiments, both the majority and minority recalls are reported. On the premise of sufficient recall, the TP-FP ratio can also be employed to evaluate the quality of label-skewed data classification:

TP-FP ratio = R_min / (1 - R_maj)
where R_min and R_maj stand for the recall of the minority and majority classes, respectively. To evaluate the combined quality of precision and recall, the F1 score and G-Mean are introduced. They are computed as follows:

F1 = 2 * Precision * R_min / (Precision + R_min), G-Mean = sqrt(Precision * R_min)
The F1 score and G-Mean are commonly used metrics in imbalanced classification problems; notably, the G-mean is usually a more consistent metric and thus provides more reliable information in our experiments [kubat1997GmeanApologetics][luo2018feature].
In addition, the 'Balanced Accuracy' is also introduced:

Balanced Accuracy = (R_min + R_maj) / 2
To sum up, the following metrics are mainly used in this study: the recall of both majority and minority classes, the TP-FP ratio, the F1 score and G-mean of the minority class, and the Balanced Accuracy as an overall performance evaluation.
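The metrics above can be computed from the four prediction outcomes as follows. This is a sketch: the G-mean here is taken as the geometric mean of minority precision and recall, which is consistent with the values reported in the tables of this paper.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Class-specific metrics with the minority class as positive (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    rec_min = tp / (tp + fn)                   # minority (positive) recall
    rec_maj = tn / (tn + fp)                   # majority (negative) recall
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * rec_min / (prec + rec_min) if prec + rec_min else 0.0
    return {
        "minority_recall": rec_min,
        "majority_recall": rec_maj,
        "tp_fp_ratio": rec_min / (1.0 - rec_maj) if rec_maj < 1.0 else float("inf"),
        "f1": f1,
        "g_mean": np.sqrt(prec * rec_min),     # geometric mean of precision and recall
        "balanced_accuracy": (rec_min + rec_maj) / 2.0,
    }
```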
4.2.2 Execution Time
To give an illustration of the time complexity of the proposed method, the execution time of the proposed AER method for the training/testing processes is reported and compared with other methods. We utilized the time.time() method in Python to capture the running time of each part of the algorithm; thus, we are able to report the time of each individual part mentioned in section 2 and gain better insight into the time complexity. The final execution time is calculated by summing up the parts. Notice that program running time evaluation in Python is quite inexact, as the speed of a program can be largely affected by non-algorithmic factors. For example, a program with a Python interface and a C/C++ implementation can run 5+ times faster than a pure Python implementation of the same algorithm [prechelt2000empirical]. Also, if numerically invalid numbers (NaN, Inf) appear in any part of an execution, the program slows down considerably even if they do not affect the final result, leading to potentially unfair comparisons. Nevertheless, the running time can still be viewed as a straightforward demonstration to help understand the time complexity of the proposed method.
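The per-part timing can be sketched as below; `timed` and `timings` are hypothetical helper names, not identifiers from the paper's code.

```python
import time

timings = {}

def timed(name, fn, *args, **kwargs):
    # Accumulate wall-clock time per pipeline stage, so both per-part
    # times and their sum (the final execution time) can be reported.
    start = time.time()
    result = fn(*args, **kwargs)
    timings[name] = timings.get(name, 0.0) + (time.time() - start)
    return result
```

The overall execution time is then simply `sum(timings.values())`, matching the summation described above.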
Since the performance of machine learning algorithms can be significantly affected by hyper-parameters, grid search is often performed to obtain the best parameter setups. However, concerning running time, algorithms with more tunable parameters tend to have longer execution times, without this reflecting the real time complexity of the algorithm. On the other hand, parameter searching is deemed essential for good performance, and a complete model training process should include it. Thus, in this paper, both types of execution time (with and without grid search) are reported.
As mentioned above, thanks to the flexible time-recording facilities in Python, one is able to capture the time of each part of the algorithm. Therefore, the running time of three levels of the AER method is illustrated: the time of training an individual base classifier, the time of training the stacking of classifiers, and the overall running time of the AER. The purpose of the first two kinds of execution time is to validate the favourable time complexity proved in section 3, as one can observe that the running time is competitive against plain XGBoost even for the stacked classifiers. Notice that the diagonal approximation of the covariance matrix is not used in the experiments, resulting in a relatively longer overall running time of the AER (see section 4.4 for more details).
4.2.3 Statistical Test
To further justify the performance superiority of the proposed AER method, McNemar's test is applied to the Abalone 19 and GMM-generated datasets. The McNemar test is a nonparametric test commonly used in binary classification problems [kim2003financial][pal2013kernel], and the idea is based on verifying whether the two methods make mistakes on the same part of the sample. Essentially, if two methods make wrong predictions on the same portion of the data, there should not be a fundamental difference between them, and the null hypothesis will not be rejected. The effectiveness of using the McNemar test in binary classification tasks was comprehensively discussed in [dietterich1998approximate] and is now widely accepted in the community.
The McNemar tests are conducted on AERs with logarithm- and exponential-likelihoods to verify their statistical significance over other methods, including Decision Tree, SVM, and plain and focal XGBoosts. The tests are implemented with the Statsmodels package in Python [seabold2010statsmodels], and the contingency tables are computed in array form with Python's NumPy package [van2011numpy]. The chi-squared distribution is used in the test, and the test statistics and p-values are reported. Furthermore, to show the exact binomial distributions of the misclassification results, additional statistics are reported, which stand for the smaller of the Yes/No and No/Yes counts in the contingency table.
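The procedure above might be sketched as follows, with `mcnemar_compare` as a hypothetical helper; using the chi-squared (non-exact) form of the Statsmodels `mcnemar` routine is an assumption.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_a, pred_b):
    a_ok = np.asarray(pred_a) == np.asarray(y_true)
    b_ok = np.asarray(pred_b) == np.asarray(y_true)
    # 2x2 table of correct/incorrect agreement between the two methods
    table = np.array([
        [np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
        [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)],
    ])
    result = mcnemar(table, exact=False, correction=True)  # chi-squared form
    n_min = min(table[0, 1], table[1, 0])  # smaller off-diagonal count
    return result.statistic, result.pvalue, n_min
```

Only the off-diagonal counts (cases where exactly one method is correct) drive the statistic, which is precisely the "mistakes on the same part of the sample" intuition described above.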
4.3 Experimental results
4.3.1 UCI Bioassay
In the experiment, the number of Gaussian centroids is chosen from a candidate set. After validating with the minimum BIC value following equation 4, the final number of Gaussian centroids is optimized as 8. This gives a total of 16 base classifiers, of which the first eight are trained on the majority-dominating subsets and the rest are fitted with the nearly balanced subsets. The distribution of the trained weights is plotted in Figure 2 (rounded to 2 decimals for the convenience of plotting).
From the figure, we can see that classifiers trained on majority-dominated data generally receive larger weights (larger values and darker colors) because these classifiers better represent the global geometry. Nevertheless, the weights from the balanced subsets also make an indispensable contribution to the overall prediction.
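The BIC-based selection of the number of centroids mentioned above can be sketched as follows; the candidate list and seed are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_n_centroids(X, candidates=(2, 4, 8, 16)):
    # Fit one GMM per candidate component count and keep the count with
    # the minimum BIC, mirroring the validation step described above.
    best_k, best_bic = candidates[0], np.inf
    for k in candidates:
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```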
The grid search results for different values of the interpolation parameter are shown in Figures 3 and 4. The resolution of the grid is 0.025, and the system relies solely on the GMM likelihood at one extreme of the interpolation parameter and purely on the trained coefficients at the other. Notice that the performance on the training set is not shown in the figures because all the values for that set are nearly identical.
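The interpolation and its grid search might be sketched as below. The renormalization of the weights and which endpoint corresponds to the GMM likelihood are assumptions, and the function names are hypothetical.

```python
import numpy as np

def interpolate_weights(w_likelihood, w_trained, lam):
    # Convex combination of GMM-likelihood-based weights and SGD-trained
    # weights; lam = 1.0 relying solely on the likelihood side is an assumption.
    w = lam * np.asarray(w_likelihood) + (1.0 - lam) * np.asarray(w_trained)
    return w / w.sum()

def grid_search_interpolation(score_fn, w_likelihood, w_trained, resolution=0.025):
    # Evaluate a validation score at every grid point and return the best one.
    grid = np.arange(0.0, 1.0 + 1e-9, resolution)
    scores = [score_fn(interpolate_weights(w_likelihood, w_trained, g)) for g in grid]
    return float(grid[int(np.argmax(scores))])
```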
The interpolated weights consistently outperform both the purely GMM-likelihood weights and the purely trained linear combination for virtually all evaluation metrics. In addition, although the optimal interpolation values for the validation and test sets can differ, the validation optimum leads to a satisfying performance only slightly below the testing optimum. The comparison of validation-based and testing-based optima is given in Table 2.
| Validation optimal | Corresponding test balanced-accuracy | Optimal test balanced-accuracy |
|---|---|---|
After obtaining the optimal interpolation values, one could examine the performance and optimize the threshold parameter in equation 21 based on the training or validation data. With the selection metric stated above, for this dataset, we found that the difference between the average log-likelihoods of the validation and testing data is 1602.56, while the same metric between the training and testing data is 8663.5. Therefore, the validation data is picked to determine the threshold. With the interpolation parameter in Table 2, the performance with respect to the changing threshold value is illustrated in Figures 5 and 6.
The optimal threshold value based on the validation data is not far from the optimum on the test data. The differences between the validation-based and test-based optimal values are provided in Table 3. The overall algorithm tends to favor spotting majority instances over minority samples, as the optimal values under both settings are comparatively low. However, given that minority instances are sparse in the validation and test sets, the results are satisfying. For further insight into the performance of the proposed method, statistics regarding the F1 score and G-Mean are given in Figures 8 and 9 in the appendix. The figures show the change of the F1 score and G-mean metrics over different threshold values with Log- and Exp-likelihoods, respectively.
| Validation optimal | Test optimal |
|---|---|
For comparison with the proposed method, the recently proposed focal-loss method is implemented with XGBoost (the same base classifier used in our method). The results are shown in Figure 7, with the focal parameter chosen via validation. The figure demonstrates a strong performance, which verifies the effectiveness of this highly cited work. However, for the imbalanced classification case, as one can observe from the leftmost plot, the range of the focal parameter leading to a satisfying performance is quite restricted, and the overall performance declines drastically when the parameter goes out of this range. In contrast, Figures 5 and 6 reflect a more robust behavior in terms of the parameter range.
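A hedged sketch of a focal-loss custom objective for XGBoost follows. Central finite differences stand in for the analytic gradients derived in [lin2018FocalLoss], so this is an illustrative shortcut rather than the exact implementation used in the experiments.

```python
import numpy as np

def focal_loss(margin, y, gamma=2.0):
    # Binary focal loss on raw margins; p is the sigmoid of the margin.
    p = np.clip(1.0 / (1.0 + np.exp(-margin)), 1e-9, 1.0 - 1e-9)
    return -(y * (1.0 - p) ** gamma * np.log(p)
             + (1.0 - y) * p ** gamma * np.log(1.0 - p))

def focal_objective(preds, dtrain, gamma=2.0, eps=1e-4):
    # XGBoost-style custom objective returning per-sample grad and hess,
    # approximated with central finite differences w.r.t. the margin.
    y = dtrain.get_label()
    f0 = focal_loss(preds, y, gamma)
    fp = focal_loss(preds + eps, y, gamma)
    fm = focal_loss(preds - eps, y, gamma)
    grad = (fp - fm) / (2.0 * eps)
    hess = (fp - 2.0 * f0 + fm) / eps ** 2
    return grad, hess
```

In principle such a function could be passed as the `obj` argument of `xgboost.train`; with `gamma = 0` the loss reduces to the ordinary cross-entropy.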
Finally, Tables 4 and 5 illustrate the performance and execution time comparison between the classical methods (provided by a previous paper [schierz2009BioassayData]), the (focal-loss) XGBoost methods, and the proposed AER methods. From Table 4, it can be observed that the proposed AER algorithm, with both log- and exp-likelihoods, achieves a competitive performance. The exponential likelihood-based adaptive ensemble method achieves the best Balanced Accuracy, while its log-likelihood counterpart scores a better TP-FP ratio because of a stronger capability in spotting majority instances. Table 5 shows the running time of the XGBoost and AER components mentioned before; the execution times of the classical methods are not included, as they are retrieved from [schierz2009BioassayData]. It can be found that the running time required for an individual XGBoost in AER is much shorter than for the 'vanilla' version. Moreover, even if one sums up all 16 individual XGBoost classifiers, the running time is still competitive against the batch-implementation counterpart. The execution time without grid search for the overall AER is less appealing, partly due to the high computational complexity of fitting GMM models (the diagonal approximation is not used in the experiments). However, if one takes grid search into consideration, AER is again a favourable model.
| | Minority Recall | Majority Recall | TP-FP Ratio | F1 score | G-mean | Balanced Accuracy |
|---|---|---|---|---|---|---|
| Cost-sensitive Decision Tree (C4.5) | 75.00% | 85.16% | 5.1048 | 0.1241 | 0.2253 | 80.08% |
| The proposed AER (Log) | 75.00% | 89.34% | 7.0333 | 0.1622 | 0.2611 | 82.16% |
| The proposed AER (Exp) | 83.33% | 87.33% | 6.5732 | 0.1550 | 0.2668 | 85.33% |
| | Without Grid-search | With Grid-search | Test |
|---|---|---|---|
| AER individual classifier | ms | s | – |
| AER stacking classifiers | ms | s | – |
4.3.2 Abalone 19 Data
The base classifier implemented for the Abalone 19 data is XGBoost, the same as for UCI Bioassay. The candidate list for the number of Gaussian distributions is set to a range of values, and the number is selected through the BIC criterion. The optimal interpolation value is determined separately for the Log-likelihood and the Exp-likelihood. We again pick the values that lead to the optimal performance on the validation set to determine the threshold. The performance with respect to changing threshold values is given in Figures 10 and 11 in the appendix.
For comparison, the classification results of the cost-sensitive Decision Tree, cost-sensitive SVM, plain XGBoost, and focal-loss XGBoost are reported in Table 6. For both cost-sensitive methods, validation selects the most competitive class-weight parameter from a comprehensive list, including the real skew rate. The Decision Tree model is implemented with the sk-learn CART (very similar to C4.5), and the SVM model is fine-tuned with the best kernel among linear, RBF, and polynomial. The parameter of the focal loss is obtained via a 3-fold cross-validation grid search. Naive Bayes is not tested in this case, as almost all the features are real numbers/decimals and smoothing would therefore be problematic.
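The baseline setup might be sketched as follows; the weight grid shown is hypothetical except for the real-skew entry, and the helper name is illustrative.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical class-weight grid; the real-skew entry for Abalone 19
# would be roughly {0: 1, 1: 129}. The paper's exact list is not shown.
weight_grid = [{0: 1, 1: 10}, {0: 1, 1: 50}, {0: 1, 1: 129}]

def make_baselines(class_weight):
    # Cost-sensitive baselines: CART and an RBF-kernel SVM.
    return {
        "CART": DecisionTreeClassifier(class_weight=class_weight, random_state=0),
        "SVM-RBF": SVC(kernel="rbf", class_weight=class_weight, random_state=0),
    }
```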
| | Minority Recall | Majority Recall | TP-FP Ratio | F1 score | G-mean | Balanced Accuracy |
|---|---|---|---|---|---|---|
| Cost-sensitive Decision Tree (CART) | 14.29% | 99.64% | – | 0.1818 | 0.1889 | 56.96% |
| The proposed AER (Log) | 85.71% | 83.96% | 5.3426 | 0.082 | 0.1924 | 84.83% |
| The proposed AER (Exp) | 57.14% | 88.90% | 5.1491 | 0.0777 | 0.1543 | 73.02% |
From the table, it can be observed that the proposed methods (with both Log- and Exp-likelihoods) outperform existing algorithms in terms of balanced accuracy and G-mean score. The AER method with exponential likelihood has a lower balanced accuracy because of a relatively lower recall on minority instances, but it is still higher than those of existing methods. XGBoost with focal loss enjoys a competitive performance, but is still inferior to the proposed AER methods. Notice that the TP-FP ratio of the cost-sensitive CART is given as '–': since the recall of the majority class is close to 100% while the minority recall is quite low (around 14%), the metric cannot accurately reflect the performance and would be misleading if listed.
| | Without Grid-search | With Grid-search | Test |
|---|---|---|---|
| AER individual classifier | ms | s | – |
| AER stacking classifiers | ms | s | – |
Another interesting perspective on the table is the comparison between plain XGBoost (the base classifier used in the AER method) and the advanced methods based on it (including focal loss and AER). It can be observed that the plain XGBoost method performs poorly on this specific task, with a significant bias toward minority data while failing to spot majority instances. Focal loss and AER can be regarded as two 'recipes' to improve the performance, and AER is better in terms of the overall performance. However, the highly regarded focal loss does have its own merits: the majority recalls are relatively high, and the algorithm is concise.
Table 7 illustrates the running time comparison for the implemented methods. From the table, it can be observed that, similar to the situation with the UCI Bioassay data, XGBoost within the AER framework runs in significantly less time for individual classifiers, and the overall classifier training time of AER lies in the same range as plain XGBoost. For comparison purposes, the execution times of CART and SVM are reported. It can be observed that SVM is relatively slow, which can be partly attributed to a time complexity quadratic in the number of samples, and partly to the implementation via an external library in sk-learn. Again, the overall running time of AER is somewhat longer because of the use of full-precision covariance matrices. However, with proper optimization of the Python code, an execution time like this is still acceptable.
Finally, Table 8 shows the results of the McNemar test for the significance of the performance superiority of the Exp- and Log-likelihood AERs. From the table, it can be observed that the favourable performances of the AER models are corroborated by the McNemar tests in most cases. The null hypothesis between AER and XGBoost with focal loss is relatively hard to reject in both cases, confirming the strong performance of this widely favoured method. It is interesting that the AER model with exponential likelihood fails to reject the null hypothesis against SVM; we notice that such an observation does not occur elsewhere in the experiments, so the problem with this specific test might stem from the particular training/testing split.
4.3.3 GMM-generated Data
For the GMM-generated data, similar to UCI Bioassay and Abalone 19, an XGBoost-based AER is provided. The model chooses the number of Gaussian centroids from a candidate list, and a 10-centroid setup is finally selected. Notice that for this dataset, we actually know the number of Gaussian distributions, and the AER method correctly recovered this information. The optimal interpolation value is determined separately for the Log-likelihood and the Exp-likelihood. The validation set is used to determine the threshold value, and the performance with respect to changing threshold values is shown in Figures 12 and 13 in the appendix for the Log- and Exp-likelihood methods, respectively.
Again, for comparison, existing methods, including cost-sensitive SVM, cost-sensitive Decision Tree, and plain and focal-loss XGBoost, are tested on the same dataset. The results are summarized in Table 9. From the table, it can be found that only the cost-sensitive Decision Tree, focal-loss XGBoost, and the AER methods grasp useful information, while cost-sensitive SVM and plain XGBoost fail to learn any better-than-random classification boundary. As discussed above, the minority data is very hard to learn, as the 100 samples come from 10 different Gaussian distributions. Nevertheless, with the AER methods, especially under the exponential-likelihood setup, the model is able to maintain an acceptable performance (relatively high balanced accuracy and TP-FP ratio). The results in Table 9 indicate that if the data manifold follows a GMM, existing algorithms have difficulty learning classifiers, while the proposed AER method can serve as an ideal alternative in this specific case.
| | Minority Recall | Majority Recall | TP-FP Ratio | F1 score | G-mean | Balanced Accuracy |
|---|---|---|---|---|---|---|
| Cost-sensitive Decision Tree (CART) | 5.00% | 98.86% | 4.3889 | 0.0513 | 0.0513 | 51.93% |
| The proposed AER (Log) | 30.00% | 86.58% | 2.2358 | 0.0504 | 0.0909 | 58.29% |
| The proposed AER (Exp) | 20.00% | 95.06% | 4.0513 | 0.0743 | 0.0988 | 57.53% |
Tables 10 and 11 show the execution times of the different methods and the corresponding McNemar tests for the Log- and Exp-likelihood AERs. From Table 10, one can observe that SVM takes a disproportionately longer time as the size of the training set increases. The training time of the AER classifiers appears longer than plain XGBoost, but they still stay roughly in the same range. CART is favourable in terms of training time, but it cannot make valuable decisions on this task. From Table 11, it can be found that for the GMM-generated data, the performance significance of the AER models is verified by the McNemar test in most cases. The only case that fails to reject the null hypothesis is the Log-likelihood AER against focal XGBoost, but the p-value is not very large even in this case, indicating the effectiveness of the proposed AER method.
| | Without Grid-search | With Grid-search | Test |
|---|---|---|---|
| AER individual classifier | ms | s | – |
| AER stacking classifiers | s | s | – |
In addition to the above illustrations of the experimental results on the three datasets, some further points are worth discussing:
Training Stability. For cost-sensitive methods, the training process of imbalanced data classification can often be unstable, as the large weight on minority instances forces the classifier to struggle between losing an important instance and dropping a large cluster of samples. In the experiments, one can observe that cost-sensitive methods often lead to a 'one-sided' solution, and the training process of these methods sometimes suffers numerically (producing NaN) because of this instability. In contrast, the proposed AER algorithm avoids this instability because the subsets used in the training procedure are significantly less skewed. This property serves as another advantage of the proposed adaptive ensemble method.
Exponential and Log Likelihood methods. Another point to discuss is whether the exponential or the log-likelihood is preferable in practice. As stated in previous sections, the exponential likelihood tends to select one dominating Gaussian centroid, while the log-likelihood favors a 'soft' combination. In most cases, the exponential likelihood-based method provides the best performance. However, the variance of the performance across different parameter values is higher for the exponential likelihood than for the log-likelihood method. Thus, in practice, the log-likelihood-based option is recommended as a starting setup for general scenarios.
Execution Time. We theoretically proved the favourable time and memory complexities in section 3. However, from the tables in this section, some execution times of the proposed AER methods are longer than plain XGBoost. This theory-practice discrepancy can be explained by two factors: the logarithmic complexity of XGBoost and the full precision of the covariance matrices. The first factor means that the time complexity of XGBoost does not fall into the regime of our proof, and the second factor implies a cubic complexity in computing the matrix inverses and multiplications (in contrast with a linear complexity when using the diagonal approximation), which takes a long time to complete. The second factor is corroborated by the tables, where AER restricted to classifier training takes approximately the same time as plain XGBoost, while the overall time surges once the GMM modeling time is counted in.
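The two covariance treatments contrasted above correspond to sk-learn's `covariance_type` options; a minimal sketch:

```python
from sklearn.mixture import GaussianMixture

# Full covariances require inverting a d x d matrix per component in each
# EM iteration, while the diagonal approximation stays linear in d.
gmm_full = GaussianMixture(n_components=10, covariance_type="full")
gmm_diag = GaussianMixture(n_components=10, covariance_type="diag")
```

Switching to `"diag"` trades some modeling flexibility for the reduced GMM fitting cost discussed above.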
Effectiveness of SGD-based implicit and interpolation-based regularizations. The two types of regularization are the most significant contributions of our method, and their effectiveness can be verified by inspecting the optimal parameter values. For the UCI Bioassay data, from Figures 3 and 4, it can be observed that the optimal interpolation values for the validation and testing sets lie at neither extreme, indicating that interpolating the likelihood-based weights, which can be regarded as a form of regularization, leads to better performance (otherwise the optimum would occur at an endpoint). Results on the Abalone 19 and GMM-generated data illustrate similar behavior, which further supports our claim. Meanwhile, from Figures 3 and 4, it can be observed that solely using the SGD-learned weights (left-most point) outperforms the purely likelihood-based method (right-most point), which verifies the effectiveness of the SGD-based implicit regularization.
In this paper, a novel method, namely the Adaptive Ensemble of Classifiers with Regularization (AER), has been proposed for binary imbalanced data classification. Details of the method, including an implementation with XGBoost, are provided, and the related training formulas are derived. In addition to the regularization properties, theoretical proofs show that the method has favourable time and memory complexities. The performance of the proposed algorithm is tested on three datasets, and the empirical evidence shows that the overall performance is competitive with classical as well as state-of-the-art algorithms. In addition, the proposed method has other advantageous properties, such as preferable training stability, and it is novel in implementing regularization for dynamic ensemble methods.
Three major contributions have been made in this paper. First, the paper proposes an algorithm with state-of-the-art performance on binary imbalanced data. Compared to existing optimization methods and recent developments in the area (such as focal loss), the performance of the proposed method is competitive and even better in terms of comprehensive metrics (G-mean and Balanced Accuracy). Second, the proposed method has multiple advantages beyond classification performance, including a stable training process and preferable time and memory complexities. Third, the paper investigates the regularization problem in dynamic ensemble methods, which is relatively under-explored in previous publications. Experimental results show that regularization with Stochastic Gradient Descent and weight interpolation based on the global geometry of the data can improve performance and has great potential in the classification of binary imbalanced data.
Thanks go to the editor and reviewers for their constructive comments. We also thank Michael Tan of University College London for his writing suggestions.
This work is supported by the Research Project of the State Key Laboratory of Southwest Jiaotong University (No. TPL1502), University-Enterprise Cooperation Projects (17H1199, 19H0355), and the Natural Science Foundation of China (NSFC 51475391).