1 Introduction
Imbalanced data classification refers to tasks of classifying datasets with significantly different numbers of instances among classes [haixiang2017learning]. Specifically, in the imbalanced binary classification problem, a large number of instances from one class usually dominates the dataset (the majority class), while only a few instances constitute the other class (the minority class). The problem of imbalanced binary data classification is common in engineering and scientific practice [huda2016BrainTumorImbalancedData][ashkezari2013EnergySectorApplication][triguero2015rosefw]. Since conventional classification algorithms usually suffer from unsatisfactory performance under imbalanced data, a body of studies has focused on designing algorithms specifically for this type of classification.
Among the widely used algorithms, ensemble methods, especially the dynamic ensemble of classifiers, have attracted considerable attention. The dynamic ensemble of classifiers trains multiple classifiers characterized by different subsets and adaptively selects or combines them during inference. This type of method attempts to mitigate performance degradation by selecting only the most competent base classifiers for a specific test instance [cruz2018MulticlassDynamicEnsembleSurvey][cruz2018dynamicDynamicEnsembleSurvey]. However, since the ensemble of classifiers is determined by local competence, the risk of overfitting increases at the same time. In this paper, inspired by the capacity of the Gaussian Mixture Model (GMM) [barber2012TextMoG][reynolds1995robust] to model the global and regional geometry of data, and by the implicit regularization property of Stochastic Gradient Descent (SGD) [gardner1984learning][zhang2016implicitRegularisationSGD][schmidhuber2015SGD][bottou2018optimization], we propose a novel algorithm, the Adaptive Ensemble of classifiers with Regularization (AER), to solve binary classification problems under imbalanced scenarios. Here, the term 'Adaptive Ensemble' indicates that the weight of each base classifier is adaptively chosen based on heterogeneous local geometries of the data, and the phrase 'with Regularization' refers to the two types of regularization methods we introduce on top of the adaptive ensemble.
The algorithm first fits an unsupervised Gaussian Mixture Model to generate two types of subsets. The first type favors the representation of the global data manifold, and the second type emphasizes local geometries. One base classifier is learned for each sub-dataset, and their coefficients are learned by optimizing the cross-entropy loss between the combined probabilistic outputs and the labels with Stochastic Gradient Descent (SGD). Subsequently, during inference, the normalized coefficient of each individual classifier is determined by an interpolation between the on-the-fly likelihood and the trained classifier coefficient. Finally, since the output in binary classification can be determined by a flexible decision threshold, the final model is determined by an optimal threshold selected on the validation data (rather than simply 0.5).
The proposed algorithm introduces two types of regularization on top of the standard dynamic ensemble. First, by training with SGD, the coefficients converge to a solution with minimum norm, which is equivalent to an implicit regularization of the coefficients. Second, by interpolating the likelihoods with the learned global coefficients, the global data geometry modeled by the GMM is utilized to normalize the likelihoods, introducing another form of regularization. By carefully choosing an interpolation parameter (often via validation), a balance between the capability to learn complex geometries and the ability to generalize leads to an overall satisfying performance. In addition, it can be argued that, compared to other types of approaches, this resampling-based dynamic ensemble yields better theoretical time and memory complexity, especially when the time complexity depends superlinearly on the number of instances (see Section 3 for more details).
With a specific implementation of the individual classifier based on the Gradient Boosting Machine [friedman2001greedy] (XGBoost [Chen2016XGBoost]; we refer to the combined method as AER-XGBoost), the proposed AER method is tested on three datasets: UCI Bioassay, KEEL Abalone19, and GMM-based artificially-generated data. Experimental results illustrate competitive performance and empirically justify the rationale of the proposed method. The overall performance of the proposed method surpasses multiple standard and state-of-the-art methods, including the recently-proposed focal loss [lin2018FocalLoss]. In addition, in the specific case where the geometry of the data follows a Gaussian Mixture model, the advantage of our method is especially significant.
1.1 Related Work
Performing high-accuracy classification with imbalanced data has been a challenge for a long time, and a considerable number of academic publications have discussed algorithms to address the problem. The algorithms can be roughly categorized into four types [fernandez2013analysingImbalancedMethods][krawczyk2016ImbalancedLearningChallenge]. The first branch of methods is resampling, which aims to generate balanced data via undersampling the majority class and/or oversampling the minority instances [more2016ResamlpingSurvey]. The second branch is cost-sensitive algorithms, which address the problem by using imbalance-sensitive target functions and assigning special losses to certain types of misclassifications [khan2017costSensitiveTarget]. The third cluster is the one-class learning method, which solves the problem by learning a representation of the majority/minority data alone [bellinger2012oneClassLearning]. The final branch is the ensemble method, to which the algorithm employed in this paper is related.
Ensemble methods usually utilize a hybrid scheme of optimizing the model on both the data distribution and the algorithm parameters to obtain a satisfying overall performance [wozniak2014SurveyOnHybridIntelligentSystem]. For instance, both bagging and boosting algorithms emphasize certain parts of the data at each iteration and combine multiple classifiers with adaptive parameters [de2005BaggingAndBoosing]. Meanwhile, since the individual classifiers in ensemble learning can be of a broad range of types, numerous publications have discussed the impact of different base classifiers, including Logistic Regression, Random Forest, and SVM [galar2011LogisticRegressionEnsemble][khoshgoftaar2007RandomForestImbalancedData][wang2017novelSVMensemble]. Both bagging and boosting methods have been comprehensively examined on class-imbalanced data classification problems with the static ensemble approach [galar2012ReviewEnsembleImbalancedData]. However, confusing noises with minority data can be a major source of performance deficiency for static ensemble methods [wozniak2014SurveyOnHybridIntelligentSystem][galar2012ReviewEnsembleImbalancedData][khoshgoftaar2011ComparingBaggingAndBoosting]. In contrast, dynamic ensemble methods change the ensemble according to the instance under inference. This technique enhances the flexibility of the model and reduces prediction biases; however, it increases the computational complexity and the possibility of overfitting. Early work on dynamic ensembles, such as Woods (1997) [woods1997EearlyDynamicEnsemble], usually utilizes a 'rank-based' selection combined with the Dynamic Classifier Selection (DCS) scheme, which selects the single model with the highest accuracy during inference. More sophisticated methods, in comparison, often adopt the Dynamic Ensemble Selection (DES) scheme, which selects multiple classifiers for prediction [ko2008DCSandDES]. For instance, Lin (2014) [lin2014libd3cClusterBased] proposed a method to dynamically ensemble classifiers based on clustering results; Cruz (2015) [cruz2015MetaLearningEnsemble] designed an algorithm to combine classifiers with meta-learning; and Xiao (2012) [xiao2012DynamicWithCostSensitive] used a cost-sensitive criterion to determine the ensembles of multiple classifiers. In a review of dynamic ensembles, especially for multi-class problems, Cruz (2018) [cruz2018MulticlassDynamicEnsembleSurvey] argued that dynamic ensembles of classifiers can in general provide favorable results.
The major weakness of dynamic ensemble methods is that they tend to overfit, and their performance deteriorates on test data [cruz2018MulticlassDynamicEnsembleSurvey][Lima2014Improving]. In machine learning, regularization is often used to reduce overfitting; however, there is limited research on dynamic ensembles with regularization for imbalanced data classification. For dynamic ensemble methods, a major obstacle to regularization is the constraints on the weights: conventional norm-based regularization reduces complexity by minimizing the norms (magnitudes) of the weights, but in this scenario the weight of each base classifier cannot be shrunk, since each weight should lie in [0,1] and the sum of all weights should be 1. Other existing regularization techniques are either not applicable to the scenario (e.g. NoiseOut [babaeizadeh2016noiseout], which is designed solely for neural networks) or considered too 'aggressive' in linear combination setups (e.g. dropout [srivastava2014dropout], which opts out some base classifiers entirely and is likely to cause errors). In addition, balancing the prediction error while restricting the weights to be normalized poses a significant challenge in optimization: a derivation-based algebraic closed-form solution cannot be obtained, and if one treats the weights as a categorical distribution to perform optimization, the corresponding likelihood (Bernoulli in the binary classification regime) is not conjugate with the categorical distribution. Therefore, conventional (explicit) regularization is difficult to implement in dynamic ensembles. One of the major contributions of this paper is to solve the above dilemma with the implicit regularization capability of SGD [zhang2016implicitRegularisationSGD][lin2015iterative][Lin2016Generalization] and the facilitation of implicit regularization by the global geometry captured via the GMM.
Regarding the applications of imbalanced data classification, since this type of data exists broadly in practice, the technique has been widely applied in different areas. In bioscience and medical research, imbalanced data classification has been utilized to identify tumors [huda2016BrainTumorImbalancedData] and diagnose cancer [krawczyk2016CancerImbalanceData]. Likewise, in software engineering, such techniques have been employed to detect bugs [xia2015BugDetection] or malignant software [chen2018MalwareDetection]. In other fields, such as financial fraud detection [mardani2013FraudDetection] and power transformation [ashkezari2013EnergySectorApplication], imbalanced data classification is also comprehensively employed. Guo (2017) [haixiang2017learning] conducted a survey of applications of imbalanced data classification and showed the promising potential of applying such techniques to a broader range of problems.
The rest of the paper is arranged as follows: Section 2 introduces the algorithm and its properties in detail. Section 3 analyzes the advantageous time and memory complexity of the proposed algorithm. Section 4 demonstrates the experimental results on the datasets mentioned above and discusses the results and implications. Lastly, Section 5 provides a general conclusion of the paper.
2 Methods
In this section, the proposed AER method is introduced in detail in four major parts: Section 2.1 introduces the Gaussian Mixture Model fitting and the generation of the two types of subsets; regarding the training of the individual 'base' classifiers, Section 2.2 discusses the specific implementation with XGBoost; the SGD training for the ensemble of classifiers is illustrated in Section 2.3; and finally, the weight interpolation and probabilistic prediction are shown in Section 2.4. The overall procedure of the algorithm is shown in Figure 1. In Section 3, the time and memory complexities of the algorithm are discussed and analyzed.
In this paper, x denotes a single instance of the data in the dataset D. To distinguish the majority and minority data, we use D_min (the key data, usually the minority) to denote the set of minority instances and D_maj (the non-key data, usually the majority) to represent the set of majority ones. The size of the dataset is mostly denoted by n and the dimension (the number of features) is represented by d. In addition, M denotes the number of components of the ensemble, which is also the number of Gaussian distributions in the Gaussian Mixture Model.
2.1 Gaussian Mixture Model Fitting and Subset Generation
Gaussian Mixture Model (GMM) is a popular model in unsupervised learning and data manifold representation. The basic idea of GMM is straightforward: it utilizes the modeling capability of the Gaussian distribution and extends it to multiple centroids to improve expressiveness. The likelihood of a single instance x in a GMM can be denoted as:
(1) p(x) = ∑_{m=1}^{M} π_m N(x | μ_m, Σ_m),  with ∑_{m=1}^{M} π_m = 1
where N(x | μ_m, Σ_m) denotes a multivariate Gaussian distribution with μ_m as the mean and Σ_m as the covariance, and π_m are the mixture weights. When fitting the model, the parameters {π_m, μ_m, Σ_m} can be obtained by maximizing the log-likelihood:
(2) max_{π, μ, Σ} ∑_{i=1}^{n} log ∑_{m=1}^{M} π_m N(x_i | μ_m, Σ_m)
Equation 2 can be solved by the expectation-maximization (EM) algorithm with a superlinear convergence rate [xu1996EMalgorithmSuperLinearConvergence]. In our program, the Gaussian Mixture Model package provided by scikit-learn [pedregosa2011scikit] is directly adopted to perform the GMM fitting procedure. It is noticeable that GMM is sensitive to initialization. To obtain stable results, the optimization is performed 5 times for each training procedure, and the model with the highest log-likelihood is selected. Another non-learnable parameter of the GMM is the number of Gaussian distributions. In the proposed algorithm, this hyper-parameter also determines the number of components of the final ensemble. To obtain the number of components that optimally balances likelihood and computational complexity, the Bayesian information criterion (BIC) is adopted. The BIC metric can be computed as follows:
(3) BIC = k · ln(n) − 2 · ln(L̂)
where L̂ indicates the maximized likelihood of the model. In terms of the GMM presented in our method, the BIC is computed as follows:
(4) BIC = k · ln(n) − 2 · ∑_{i=1}^{n} log ∑_{m=1}^{M} π_m N(x_i | μ_m, Σ_m)
where k stands for the number of parameters of the model. In the training process, a 'pool' of possible numbers of Gaussian centroids is given, and the algorithm computes the BIC of each model and picks the one with the smallest BIC quantity. We choose the BIC instead of the Akaike information criterion (AIC), as the BIC tends to favor the model that overfits the data less [dziak2012AICBICmeasure]. Since regularization plays an important role in our algorithm, the BIC is adopted for this hyper-parameter choice.
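As a concrete sketch of this model-selection step (function and parameter names are our own; the candidate pool and data are illustrative), the BIC-based choice with 5 restarts per fit can be written with scikit-learn's GaussianMixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(6, 1, (200, 3))])

def fit_gmm_with_bic(X, candidate_ks=(1, 2, 3, 4, 5), n_init=5, seed=0):
    """Fit one GMM per candidate component count (5 restarts each, as in the
    text) and keep the model with the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in candidate_ks:
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              n_init=n_init, random_state=seed).fit(X)
        bic = gmm.bic(X)   # k*ln(n) - 2*ln(likelihood); lower is better
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, best_bic

model, bic = fit_gmm_with_bic(X)
```

The diagonal covariance type matches the approximation assumed in the complexity analysis of Section 3.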
After obtaining the GMM with M Gaussian distributions, we form sub-datasets based on two types of schemes. The first scheme selects the most representative data from each of the Gaussian distributions: specifically, for each Gaussian distribution N(μ_m, Σ_m), the algorithm selects the majority instances with the highest log-likelihood. Under the second scheme, the algorithm generates subsets whose majority instances are selected by the highest likelihood with respect to each Gaussian component, concatenated with majority instances randomly selected from the whole set. Both types of generated majority subsets are then combined with all of the minority instances.
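The second subset scheme can be sketched as follows (a simplified version under our own naming; taking one top-likelihood majority instance per minority instance and the size `n_rand` of the random complement are assumptions based on the description above):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def make_subsets(X_maj, X_min, gmm, n_rand=10, seed=0):
    """For each Gaussian component: take the highest-likelihood majority
    instances (as many as there are minority samples), add n_rand random
    majority instances, and combine with all minority instances."""
    rng = np.random.default_rng(seed)
    n_min = len(X_min)
    subsets = []
    for m in range(gmm.n_components):
        # Per-component log density of each majority instance (diagonal covariance)
        ll = multivariate_normal.logpdf(X_maj, mean=gmm.means_[m],
                                        cov=np.diag(gmm.covariances_[m]))
        top = X_maj[np.argsort(ll)[::-1][:n_min]]
        rand = X_maj[rng.choice(len(X_maj), size=n_rand, replace=False)]
        X_sub = np.vstack([top, rand, X_min])
        y_sub = np.r_[np.zeros(len(top) + n_rand), np.ones(n_min)]  # 0 = majority
        subsets.append((X_sub, y_sub))
    return subsets
```

Each returned subset is nearly balanced, with the random majority slice retaining some global-geometry information.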
After obtaining the above data, Tomek Links [tomek1976Tlink] are used to remove instances of the first type of subset that are considered noise. A Tomek Link follows the idea that if two instances are mutually nearest neighbors but belong to different classes, they are 'overlapping' instances between the classes and are therefore likely to be noise. Formally, for two given data instances x_i, x_j and a given distance measure d(·,·), if for any other instance x_k there holds:
(5) d(x_i, x_j) < d(x_i, x_k)  and  d(x_i, x_j) < d(x_j, x_k)
then (x_i, x_j) is considered a Tomek Link pair. If the corresponding labels of the Tomek Link pair belong to different classes, we consider the majority and/or minority instance in the pair as noise and remove one or both of them. In our algorithm, since the minority is the more important part to be spotted, the majority instances in the Tomek Links are removed.
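This cleaning step can be sketched with a mutual-nearest-neighbor search (a minimal implementation of the condition in equation 5 under our own naming; the imbalanced-learn library also ships a ready-made TomekLinks remover):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_majority(X, y, majority_label=0):
    """Remove majority instances that form Tomek Links, i.e. mutual nearest
    neighbors with opposite labels."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    neighbor = nn.kneighbors(X, return_distance=False)[:, 1]  # nearest other point
    drop = np.zeros(len(X), dtype=bool)
    for i, j in enumerate(neighbor):
        if neighbor[j] == i and y[i] != y[j]:     # mutual NN with different labels
            drop[i if y[i] == majority_label else j] = True
    return X[~drop], y[~drop]
```

Only the majority side of each link is flagged, matching the paper's choice of keeping all minority instances.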
By performing the above process, there are M available sub-datasets with less significant label skewness. The reason for adopting a combination of two selection schemes is that this strategy achieves a balance between the recognition of majority and minority instances. The first type of subset preserves the information of the global geometry and contributes to the recognition of the majority instances, while the second type puts emphasis on local geometry and improves the accuracy of spotting minority instances. Specifically, for the first type of subset, since the majority instances take up most of the portion, the subset preserves the global geometry (like a 'zoomed-out' version). On the other hand, for the second type of subset, since the numbers of majority and minority instances are almost the same, which means the choice of majority samples is 'highly selective', the classifier is able to focus on the complex boundaries near the minority samples; this could be deemed 'focusing on the local geometry' (like a 'zoomed-in' version). Concatenating randomly selected majority instances adds certain information about the global geometry to avoid overfitting.
The overall procedure of GMM fitting and subset generation is shown in Algorithm 1.
2.2 Fitting of Individual Base Classifier
As stated above, the specific classifier implemented in this paper is the Gradient Boosting Machine (GBM), a boosting-based algorithm. The model of a GBM with T steps can be expressed as follows:
(6) F_T(x) = ∑_{t=1}^{T} ρ_t · h_t(x; θ_t)
where h_t(·; θ_t) is the sub-model at step t with parameters θ_t, and ρ_t is its coefficient.
Like other boosting methods, the training strategy of the GBM is to learn from 'previous mistakes'. Specifically, the individual sub-model of the GBM at the t-th step sets the negative gradient of the loss function with respect to the model up to the (t−1)-th step as the current 'labels', which can be expressed as:
(7) r_{i,t} = −[ ∂L(y_i, F(x_i)) / ∂F(x_i) ]_{F = F_{t−1}}
where L denotes any kind of loss function, usually the square loss for regression and the cross-entropy loss for classification. The gradient computed with equation 7 is also named the 'pseudo-residual'; and since the gradient is calculated at each step, the overall method is named the Gradient Boosting Machine. After obtaining the current target, the parameters of the sub-model at the t-th step can be denoted as:
(8) θ_t = argmin_θ ∑_{i=1}^{n} ( r_{i,t} − h(x_i; θ) )²
The overall model at the t-th step is further determined by a 'learning rate' ρ_t, which can be obtained by optimizing the following target function:
(9) ρ_t = argmin_ρ ∑_{i=1}^{n} L( y_i, F_{t−1}(x_i) + ρ · h_t(x_i; θ_t) )
The above optimization can be solved simply either by taking a partial derivative or through a line search. By iterating the procedure from equation 7 to equation 9 until the convergence criterion is met, the integrated GBM model is obtained.
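The loop of equations 7 to 9 can be illustrated for the square loss, where the pseudo-residual reduces to y − F(x) (a from-scratch sketch with illustrative hyper-parameters; a fixed shrinkage rate stands in for the line search of equation 9):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_steps=20, lr=0.1):
    """Minimal GBM loop for the square loss: compute the pseudo-residual
    (eq. 7), fit a small tree to it (eq. 8), and add the scaled tree to
    the running model (eq. 9 with a fixed learning rate)."""
    F = np.full(len(y), y.mean())           # initial constant model
    trees = []
    for _ in range(n_steps):
        residual = y - F                    # negative gradient of the square loss
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        F += lr * tree.predict(X)
        trees.append(tree)
    return trees, F
```

Each iteration shrinks the residual, so the training error of the combined model decreases monotonically for a small enough learning rate.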
In our implementation, an integrated, highly efficient, and scalable Gradient Boosting interface, namely XGBoost [Chen2016XGBoost], is employed to fit and make predictions with the GBM. For each generated subset of data, the algorithm fits one XGBoost model.
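The per-subset fitting amounts to a short loop (the paper uses XGBoost; scikit-learn's GradientBoostingClassifier is used here as a stand-in so the sketch stays dependency-free, and the estimator count is illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_base_classifiers(subsets, seed=0):
    """Fit one boosted-tree classifier per generated subset.
    subsets is a list of (X_sub, y_sub) pairs."""
    models = []
    for X_sub, y_sub in subsets:
        clf = GradientBoostingClassifier(n_estimators=50, random_state=seed)
        models.append(clf.fit(X_sub, y_sub))
    return models
```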
2.3 Stochastic Gradient Descent Training for the Ensemble of Classifiers
After training the individual models, each classifier is able to give a class prediction (0 or 1) for every data instance. The next step is to train the combination of the individual classifiers with SGD. For convenience, each individual model is denoted as f_m(·) in this subsection. Hence, we can denote the linear combination of the models as:
(10) P(x) = ∑_{m=1}^{M} w_m · f_m(x),  s.t. ∑_{m=1}^{M} w_m = 1, w_m ∈ [0, 1]
The constraints in equation 10 guarantee that the values of the predictions lie in [0, 1], and thus they can trivially be transferred to a binary-class prediction. To train the model, the two-class cross-entropy loss is adopted:
(11) L(w) = −∑_{i=1}^{n} [ y_i log P(x_i) + (1 − y_i) log(1 − P(x_i)) ] = −[ yᵀ log P + (1 − y)ᵀ log(1 − P) ]
where the second expression in the above equation is the vectorized form. Notice that since there is a constraint on w, the above optimization cannot be accomplished by simply taking derivatives and setting them to 0. To approximate the optimal solution and take regularization into consideration, SGD is adopted. Specifically, the gradient of target 11 with respect to w_m is:
(12) ∂L/∂w_m = −∑_{i=1}^{n} [ y_i / P(x_i) − (1 − y_i) / (1 − P(x_i)) ] · f_m(x_i)
Notice that if gradient descent is applied, the update driven by the gradient in formula 12 neither guarantees that the weights sum to 1, nor warrants that each w_m lies in the interval [0, 1]. Nevertheless, for a gradient-based method, we can simply renormalize the weights after learning; in addition, weights exceeding the limits of the interval can be rescaled to the limit values (0 or 1). Thus, the update formula for w is:
(13) w ← Norm( g( w − η · ∇_w L ) )
where η is the learning rate, Norm(·) renormalizes the weights so that they sum to 1, and g(·) denotes the rescaling (mapping) function flooring at 0 and ceiling at 1. It can be mathematically denoted (element-wise) as:
(14) g(w) = min( max(w, 0), 1 )
It is recommended that the learning rate be kept small to ensure that the algorithm converges. To determine whether the training procedure has converged, the relative change of the cross-entropy loss is adopted as the metric.
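Putting equations 12 to 14 together, the weight training can be sketched as projected SGD (the base-classifier outputs below are synthetic, and the batch size, learning rate, and epoch count are illustrative choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 4, 600
F = rng.uniform(0.05, 0.95, size=(n, M))   # f_m(x_i): synthetic base-classifier outputs
y = (F.mean(axis=1) > 0.5).astype(float)   # synthetic labels correlated with the outputs

def train_weights(F, y, lr=0.02, epochs=300, batch=32, seed=0):
    """Projected SGD for the ensemble weights: a gradient step on the
    cross-entropy loss (eq. 12), clipping to [0, 1] (eq. 14), then
    renormalization so the weights sum to 1 (eq. 13)."""
    rng = np.random.default_rng(seed)
    n, M = F.shape
    w = np.full(M, 1.0 / M)                # uniform init (the paper uses AIC/BIC)
    for _ in range(epochs):
        idx = rng.choice(n, size=batch, replace=False)
        P = np.clip(F[idx] @ w, 1e-9, 1 - 1e-9)   # combined prediction, eq. (10)
        grad = -((y[idx] / P - (1 - y[idx]) / (1 - P)) @ F[idx]) / batch  # eq. (12)
        w = np.clip(w - lr * grad, 0.0, 1.0)      # g(.): floor at 0, ceiling at 1
        s = w.sum()
        w = w / s if s > 0 else np.full(M, 1.0 / M)  # renormalize (guard all-zero clip)
    return w

w = train_weights(F, y)
```

The clip-then-renormalize step is the projection that keeps the iterates on the constraint set after every update.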
Another concern regarding the SGD method is how to initialize the coefficients, as the optimization is sensitive to the initial values. In the proposed algorithm, the initialization of the parameters is accomplished by a combination of the AIC and the BIC. Similar to the BIC, the AIC can be expressed as:
(15) AIC = 2k − 2 ln(L̂)
where L̂ again denotes the maximized likelihood and k the number of parameters. Combining equations 3 and 15, we can compute a combined metric C_m for the m-th base classifier:
(16) C_m = β · AIC_m + (1 − β) · BIC_m
where β is the parameter balancing the AIC and the BIC, and experiments indicate that an intermediate value of β can be a well-performing trade-off. Notice that here the AIC and BIC are computed with respect to each classifier, and lower AIC/BIC values indicate a more credible solution. Thus, we can use the normalized reciprocals of the C_m values to initialize the linear combination. The initial values can be denoted as:
(17) w_m^(0) = (1 / C_m) / ∑_{j=1}^{M} (1 / C_j)
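The initialization can be sketched in a few lines (function naming is our own; the value β = 0.5 and the per-classifier log-likelihoods are illustrative assumptions):

```python
import numpy as np

def init_weights(log_likelihoods, k_params, n_samples, beta=0.5):
    """Initialize ensemble weights from per-classifier AIC/BIC scores via
    normalized reciprocals: lower AIC/BIC -> higher initial weight."""
    ll = np.asarray(log_likelihoods, dtype=float)
    aic = 2 * k_params - 2 * ll                  # eq. (15)
    bic = k_params * np.log(n_samples) - 2 * ll  # eq. (3)
    combined = beta * aic + (1 - beta) * bic     # eq. (16)
    recip = 1.0 / combined
    return recip / recip.sum()                   # eq. (17)

w0 = init_weights([-120.0, -150.0, -135.0], k_params=10, n_samples=500)
```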
The overall procedure of the optimization of the linearly combined base classifiers is given in Algorithm 2.
2.4 Weight Interpolation and the Probabilistic Prediction
The above three consecutive parts have discussed generating balanced sub-datasets and training the individual and combined classifiers. As a dynamic ensemble method, the coefficient of each individual classifier should be adjusted on-the-fly according to the test instance(s) at inference time. Since we have M base classifiers, each trained on its corresponding data subset, they have different impacts on the test data. Intuitively, by computing the 'distance' (expressed as a likelihood) between a specific test instance and the Gaussian centroid the classifier is based on, the impact of the base classifier on the test data can be evaluated without knowing the test label. Following this strategy, an interpolation scheme is adopted to adjust the weights and implement the dynamic ensemble according to the test data. The interpolation is based on the log-likelihood/exp-likelihood calculated for a specific test instance and the previously trained coefficients. This likelihood is favorable for choosing the best local competence (the highest likelihood) for each test instance. For any test instance x*, the m-th component of the likelihood is:
(18) l_m(x*) = N( x* | μ_m, Σ_m )
However, this makes the prediction dedicated to the local data geometry and easy to overfit. In the proposed algorithm, the global data geometry modeled by the GMM prepared in the data preprocessing stage is further included to constrain the dynamic fitting. In this way, the second type of regularization is introduced, and the influence of the local data geometry is reduced.
For any test instance x*, the m-th component of the normalized likelihood is:
(19) l̃_m(x*) = l_m(x*) / ∑_{j=1}^{M} l_j(x*)
And the final interpolation is computed as:
(20) w̃ = α · l̃(x*) + (1 − α) · w
where the '+' operation in the above equation means pairwise (element-wise) summation between the two vectors, and α is the 'interpolation parameter', whose optimum can be found via the validation data through a grid search. It is also noticeable that the log-likelihood naturally takes multiple classifiers into consideration during classification, while the exp-likelihood often generates a nearly one-hot vector with respect to the different Gaussian distributions. The weights computed by equation 20 satisfy the conditions of summing to 1 and lying in the interval [0, 1].
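The interpolation itself is a one-liner once the per-component likelihoods are available (a sketch under our own naming; α = 0.3 is illustrative, whereas the paper selects it on validation data):

```python
import numpy as np
from scipy.stats import multivariate_normal

def dynamic_weights(x, gmm_means, gmm_vars, w_trained, alpha=0.3):
    """Blend the trained global weights with the test instance's normalized
    per-component likelihoods (eqs. 19-20), assuming diagonal covariances."""
    lik = np.array([multivariate_normal.pdf(x, mean=m, cov=np.diag(v))
                    for m, v in zip(gmm_means, gmm_vars)])
    lik_norm = lik / lik.sum()                     # eq. (19)
    return alpha * lik_norm + (1 - alpha) * np.asarray(w_trained)  # eq. (20)
```

Since both terms are convex combinations, the output again sums to 1 and stays in [0, 1].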
Following the above procedure, the algorithm outputs a probabilistic value in the interval [0, 1] for each sample x. The output can be regarded as the probability of y = 1; and instead of simply labeling all samples with values greater than 0.5 as 1 and the opposite as 0, the decision threshold can be fine-tuned following the equation:
(21) ŷ = 1 if P(x) ≥ τ, and ŷ = 0 otherwise
where τ can be regarded as a 'threshold value' whose optimum can be found via the validation data through a grid search.
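The grid search over τ can be sketched as follows (the F1 objective and the grid resolution are our own assumptions; any validation metric suited to imbalanced data would fit here):

```python
import numpy as np

def tune_threshold(p_val, y_val, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the decision threshold tau maximizing F1 on validation data."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (p_val >= t).astype(int)
        tp = np.sum((pred == 1) & (y_val == 1))
        fp = np.sum((pred == 1) & (y_val == 0))
        fn = np.sum((pred == 0) & (y_val == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```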
The overall procedure of the proposed AER with the XGBoost implementation (AER-XGBoost) is shown in Algorithm 3.
3 Time and Memory Complexities
In this section, we show that the proposed AER method has favorable time and memory complexities. In particular, we show theoretically that, under certain assumptions and for any classifier implemented within the AER framework, the time complexity is asymptotically at least as good as that of the original implementation, and the asymptotic memory complexity is always better than that of the full-batch implementation.
To begin with, let us recap the notation used in the AER model. Recall that n denotes the number of instances and d represents the number of features. For the minority and majority data, n_min and n_maj are used, respectively. The skew rate is denoted as r = n_min / n_maj, and it is straightforward to get that n_min = r · n_maj ≤ r · n. The number of Gaussian centroids is given as M, and there should be M · n_min ≤ n_maj in most cases, as the resampling would otherwise miss its purpose (one could simply train balanced subsets that together include all of the training set). Notice that this also implies M · n_min ≤ n, as n_maj ≤ n. The number of iterations of the GMM EM algorithm is denoted as Q, and the number of iterations of the SGD algorithm is denoted by E. The time complexity of any machine learning classifier is denoted as a polynomial T(n, d) = O(n^a d^b) of the number of instances n and the number of features d, where a and b should be positive integers. Similarly, we denote the memory complexity with S(n, d) = O(n^p d^q). We care mostly about the complexities of the training process, as this is usually the part that consumes the most time and memory.
To derive a bound that does not depend on the GMM-fitting or SGD part, and to help draw fair comparisons between AER-implemented and original methods, the analysis is based on the assumption that the GMM covariance inversion and likelihood are estimated with a diagonal covariance approximation. This removes the high-order terms of d and reduces the time complexity of fitting the GMM to O(QMnd), as the inversion and multiplication of the covariance can be completed within O(d) time. Also, we assume the choice of α and τ is based on the validation set and that its size is considerably smaller, with the condition n_v ≪ n, where n_v is the number of validation data points.
3.1 Time Complexity
For any machine learning method with a polynomial training time complexity T(n, d) = O(n^a d^b), the corresponding AER time complexity is denoted by T_AER(n, d). Under the assumptions stated above, the following theorem can be derived:
Theorem 1.
Given the conditions QMnd = O(n^a d^b M^(1−a)) and ME = O(n^a d^b M^(1−a)), the following property holds: if a = 1, which means T(n, d) is linear in n, then T_AER(n, d) = O(T(n, d)); otherwise, if a > 1, which means T(n, d) is superlinear in n, then T_AER(n, d) = o(T(n, d)).
Proof.
The theorem can be proved by a simple analysis. The time complexity of the AER method can be decomposed into 4 parts: the complexity of fitting the GMM model, the training complexity of the individual classifiers, the SGD training, and the validation part that obtains the optimal α and τ. The parts have the following complexities:
1. Fitting the GMM model. The algorithm fits M Gaussian distributions; under the diagonal covariance assumption, each EM iteration over all components takes O(Mnd), so the overall complexity is O(QMnd).
2. Training of the individual classifiers. For the first type of resampled data, the number of training instances per subset is O(n/M); for the second type, the number of training samples is O(n_min), and M · n_min ≤ n. Given the polynomial form T(n, d) = O(n^a d^b), the complexity of this part is M · O((n/M)^a d^b) = O(n^a d^b M^(1−a)).
3. Stochastic gradient descent. This part takes O(BME), where B is the batch size of the SGD and E is the number of iterations. B is a constant and can therefore be hidden asymptotically, yielding O(ME) runtime.
4. Validation of the optimal α and τ parameters. Under the diagonal covariance approximation, the likelihood estimation of a single data point is O(Md). Estimating the whole validation set thus takes O(n_v · Md). The optimal α and τ values need to be obtained via multiple runs, but this factor is a constant and can be hidden.
Summarizing the above terms, the overall complexity is O(QMnd + n^a d^b M^(1−a) + ME + n_v Md). Since the condition QMnd = O(n^a d^b M^(1−a)) is given, the first term can be hidden; and since the condition ME = O(n^a d^b M^(1−a)) is given, the third term can be hidden. Finally, since we assume a large training set and a small validation set with n_v ≪ n, the final part can be hidden, and the complexity is T_AER(n, d) = O(n^a d^b M^(1−a)).
Now for the two cases:
If one plugs in a = 1, it can be derived that T_AER(n, d) = O(n d^b M^0) = O(n d^b) = O(T(n, d)).
If a > 1, there will be T_AER(n, d) / T(n, d) = O(M^(1−a)), which vanishes as M grows with n; the conclusion T_AER(n, d) = o(T(n, d)) can then be drawn by taking L'Hôpital's rule on the ratio of the two polynomials.
∎
The theorem indicates that, by resampling the dataset, the proposed AER method reduces the time complexity when the original complexity is superlinear with respect to the number of instances n, and is not worse than the original full-batch implementation when the complexity is linear in n.
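The gain can be sanity-checked numerically: splitting n instances into M subsets of size n/M turns an O(n^a) cost into M · (n/M)^a = n^a · M^(1−a), i.e. a factor of M^(a−1) cheaper (the numbers below are purely illustrative):

```python
# Illustrative check of the claimed speedup factor for a superlinear classifier.
n, M, a = 10_000, 10, 2        # instances, subsets, complexity exponent (a > 1)
full = n ** a                  # cost of one full-batch run, O(n^a)
split = M * (n // M) ** a      # cost of M runs on subsets of size n/M
print(full // split)           # speedup factor M^(a-1) = 10
```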
Table 1 illustrates a comparison of the time complexity of common machine learning classifiers implemented with the original full-batch scheme and with the AER framework. From the table, it can be observed that the higher the order of n in an algorithm, the more advantage the AER framework brings. The Gradient Boosting Machine (GBM), which is the method of choice for our base classifier, is also listed in the table, and T_r denotes the number of trees in the algorithm. Notice that our implementation of the GBM is based on XGBoost, which is a parallelized GBM method and does not fall into the polynomial-time regime of our analysis. Nevertheless, the rigorous analysis of the time complexity provides a convincing indication of the advantage of the proposed AER method.
Classifier | original scheme | AER
Naive Bayes | O(nd) | O(nd)
Decision Tree | O(n²d) | O(n²d · M⁻¹)
SVM | O(n³d) | O(n³d · M⁻²)
Gradient Boosting | O(T_r · n²d) | O(T_r · n²d · M⁻¹)
3.2 Memory Complexity
For any machine learning method with a polynomial training memory complexity S(n, d) = O(n^p d^q), the corresponding AER memory complexity is denoted by S_AER(n, d). With the assumptions stated above, one can derive the following theorem:
Theorem 2.
For any p ≥ 1, S_AER(n, d) = o(S(n, d)).
Proof.
Similar to the analysis of the time complexity, the memory complexity of the AER is decomposed into 4 parts:
1. Fitting the GMM model. The model needs to store O(Md) values under the diagonal covariance setting (the means, the diagonal covariances, and the mixture weights), thus the memory complexity is O(Md).
2. Training of the individual classifiers. As in the time complexity proof, the two types of subsets have O(n/M) and O(n_min) samples, respectively. A difference here is that, for the memory complexity, the same memory can be reused for every Gaussian component. Thus, the memory complexity is O((n/M)^p d^q).
3. Stochastic gradient descent. One only needs to keep O(M) slots in memory to update the weights, so the memory complexity is O(M).
4. Validation of the optimal α and τ parameters. For each Gaussian component, the validation process takes O(d) memory, and each data point needs O(d) as well. The likelihoods of the validation data are stored, which adds an O(n_v M) term. The overall complexity of this part is O(Md + n_v M).
The final complexity is given as O(Md + (n/M)^p d^q + M + n_v M). And since n_v ≪ n, the complexity can be simplified to O((n/M)^p d^q). With a simple derivation, one gets S_AER(n, d) = O(M^(−p) · S(n, d)) = o(S(n, d)). ∎
Notice that the theorem on the memory complexity is a stronger conclusion than its time counterpart. Firstly, it removes the restrictions on the iteration numbers, and the memory complexity is unconditionally bounded. Secondly, Theorem 2 proves a strict upper bound with little-o notation (which means the complexity asymptotically grows strictly more slowly), regardless of the choice of p.
4 Experiments and Discussion
4.1 Datasets
Three datasets are employed to test the performance of the proposed method empirically: the UCI Bioassay dataset (AID 362) (publicly available: https://archive.ics.uci.edu/ml/datasets/PubChem+Bioassay+Data#), the Abalone 19 dataset (publicly available: https://sci2s.ugr.es/keel/category.php?cat=imb&order=ir#sub2), and an artificially-generated imbalanced dataset sampled from a 10-center Gaussian Mixture Model (publicly available: https://github.com/jhwjhw0123/GMMGenerateddataimbalanceclassification). For the UCI Bioassay data, detailed figures of the performance with respect to various parameters are included, while for the other two sets, the paper emphasizes the comparison between the optimal solutions of the proposed method and existing methods. For all three datasets, the time of executing the training, grid-search cross-validation, and testing processes is reported (see section 4.2.2 for details), although the time for some of the methods is not available for the UCI Bioassay data.
The UCI Bioassay dataset was originally published in 2009 [schierz2009BioassayData], and it contains information on the relationship between screening bioassay status and the activeness of the outcome. There are 4279 records in the dataset, of which 60 are labeled as 'active' (minority data, labeled as 1) and the others are labeled as 'inactive' (majority data, labeled as 0). The skew rate is about 1:70, the training and testing data are split at a fixed ratio, and there are no missing or non-numerical values in the dataset. As a popular dataset, previous academic endeavors have tested various methods on it. A 'Result' file accompanies the dataset for reference purposes, in which the performances of Naive Bayes, cost-sensitive SVM, and cost-sensitive C4.5 (Decision Tree) are documented. In this section, the performance of our algorithm is compared with these given methods. In addition, to demonstrate the favorable performance of the proposed ensemble paradigm over novel imbalanced data classification algorithms of other branches, a comparison between the proposed method and XGBoost with the recently-proposed focal loss [lin2018FocalLoss] is also illustrated. Notice that since the results of the conventional methods are retrieved from the literature, the exact running time of these methods is not available, and we are not able to perform statistical tests for this dataset as the original predictions are unknown to us.
The Abalone 19 dataset was originally presented in the KEEL dataset collection [alcala2011keel] as a real-life imbalanced binary classification example. There are 4174 records in the dataset, with 32 of them marked as 'positive', which means they belong to the minority (class 19), and 4142 items labeled as 'negative', which indicates they are from the majority (the other classes). The skew ratio is around 1:129, more significant than that of the UCI Bioassay data. The number of features is 8, with one of the columns represented as a categorical variable (the sex of the abalone). Unlike UCI Bioassay, there is insufficient literature providing reliable benchmark results for Abalone 19. Thus, in the experiments, algorithms including SVM, Decision Tree (C4.5/CART), and focal-loss XGBoost are tested using sklearn-based implementations. Also, the execution time is included and statistical tests are performed to verify the effectiveness of the AER models.
Finally, to test the performance of the proposed method in the specific case where the data geometry truly follows a Gaussian Mixture Model distribution, a set of 8000 samples is generated through the sklearn make_classification method. The skew rate is 1:79, with 7900 samples labeled as '0' and 100 marked as '1'. The number of features is specified as 15, with 9 of them generated from a 9-dimensional GMM, 3 of them obtained by combining the generated dimensions, and 2 of them filled with random noise. The number of Gaussian distributions in the corresponding GMM is 10. Notice that this 'number of Gaussian centroids' applies to both the majority and the minority data, which means the 100 positively-labeled samples also come from 10 clusters. This increases the difficulty of classification and poses the challenge of learning a complex decision boundary while preserving good generalization ability. As we will see in the corresponding section, the proposed method performs well on this dataset, while some of the cost-sensitive methods completely fail to capture anything meaningful.
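A dataset of this shape can be generated with sklearn's make_classification. The exact parameters below (clusters per class, class weights, separation, random seed) are our assumptions for illustration, not necessarily the settings used to produce the published dataset.

```python
from sklearn.datasets import make_classification

# 8000 samples, 15 features: 9 informative (Gaussian-cluster geometry),
# 3 redundant (linear combinations of informative ones); the remaining
# features carry noise. weights=[0.9875] targets roughly a 1:79 skew.
X, y = make_classification(
    n_samples=8000,
    n_features=15,
    n_informative=9,
    n_redundant=3,
    n_clusters_per_class=10,   # assumed: 10 Gaussian clusters per class
    weights=[0.9875],
    flip_y=0.0,                # no label noise, keep the skew exact
    class_sep=1.0,
    random_state=0,
)
```

With flip_y=0.0 the minority count stays close to n_samples * (1 - weights[0]), i.e. about 100 positive samples.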
It is noticeable that, except for UCI Bioassay AID 362, the datasets are not pre-partitioned into training/testing parts. Furthermore, for the UCI Bioassay data, the original results provided in the literature did not mention a validation set, which is crucial in our case to determine the threshold value in equation 21. Thus, in the experiments, the training data of UCI Bioassay is split with a ratio of 5:1 for both majority and minority instances. Similarly, the Abalone 19 and GMM-generated data are split into training, validation, and testing sets with a ratio of 3:1:1. The optimal value is obtained via a grid search with resolution 0.025 on the validation set. One might have concerns regarding the fairness of comparing the performance of the proposed algorithm with that of other methods. However, since no additional data is provided to the proposed algorithm, the experimental results are not biased in favor of the proposed method, and the comparisons are not unfair to existing methods.
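The 3:1:1 split with per-class stratification can be sketched with two calls to sklearn's train_test_split; the toy data and variable names below are ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 980 majority, 20 minority samples.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 980 + [1] * 20)

# 3:1:1 -> hold out 1/5 for testing, then 1/4 of the rest for validation;
# stratify keeps the majority/minority ratio identical in every split.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)
```

Stratification matters here: with only 20 minority samples, an unstratified split can easily leave a partition with no positive instance at all.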
4.2 Evaluation Metrics
4.2.1 Performance Metrics
For an ordinary classification problem, accuracy can simply be used as the sole metric to evaluate performance. However, for label-skewed data, an algorithm can often achieve a satisfying accuracy simply by predicting every instance as the majority class. Thus, in this scenario, the detection results on the majority and the minority data should be examined separately. Specifically, if one regards the minority data as Positive (P) and the majority as Negative (N), then combining the prediction results with the ground-truth labels yields four prediction outcomes: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Following the conventional analytical approach, precision and recall are introduced to evaluate the quality of the classification of the majority/minority data. The precision and recall metrics are computed as follows:
precision = TP / (TP + FP),  recall = TP / (TP + FN)  (22)
Notice that, in this paper, the concepts of 'precision' and 'recall' are extended to class-specific metrics, in contrast with focusing only on the positive (minority) samples as in conventional statistical analysis. Thus, in our experiments, both the majority and minority recalls are reported. On the premise of sufficient recall, the TP-FP ratio can also be employed to evaluate the quality of label-skewed data classification:
TP-FP ratio = R_P / (1 - R_N)  (23)
where R_P and R_N stand for the recall of the minority and majority classes, respectively. To evaluate the overall qualities of precision and recall, the F1 score and G-Mean are introduced. They are computed as follows:
F1 = 2 · precision · recall / (precision + recall),  G-Mean = √(precision · recall)  (24)
The F1 score and G-Mean are commonly-used metrics in imbalanced classification problems; noticeably, the G-Mean is usually a more consistent metric, and thus provides more reliable information in our experiments [kubat1997GmeanApologetics][luo2018feature].
In addition, the balanced accuracy is also introduced:
Balanced Accuracy = (R_P + R_N) / 2  (25)
To sum up, the following metrics are mainly used in this study: the recall of both the majority and minority classes, the TP-FP ratio, the F1 score and G-Mean of the minority class, and the balanced accuracy as the overall performance evaluation.
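These metrics can be computed directly from the confusion counts. The helper below is our own sketch, assuming the definitions used in this paper: TP-FP ratio = R_P / (1 - R_N), G-Mean = √(precision · recall), and balanced accuracy = (R_P + R_N) / 2.

```python
import math

def imbalance_metrics(y_true, y_pred):
    """Class-specific metrics for binary labels (minority class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    rec_min = tp / (tp + fn)                      # minority recall (R_P)
    rec_maj = tn / (tn + fp)                      # majority recall (R_N)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "minority_recall": rec_min,
        "majority_recall": rec_maj,
        # ratio of true-positive rate to false-positive rate
        "tpfp_ratio": rec_min / (1.0 - rec_maj) if rec_maj < 1.0 else float("inf"),
        "f1": (2 * precision * rec_min / (precision + rec_min)
               if precision + rec_min else 0.0),
        "gmean": math.sqrt(precision * rec_min),
        "balanced_accuracy": 0.5 * (rec_min + rec_maj),
    }
```

As a sanity check, these formulas reproduce the reference figures of Table 4: a minority recall of 75.00% with a majority recall of 80.92% gives a TP-FP ratio of about 3.93 and a balanced accuracy of 77.96%.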
4.2.2 Execution Time
To give an illustration of the time complexity of the proposed method, the execution time of the proposed AER method on the training/testing processes is reported and compared with other methods. We utilized the time.time() method in Python to capture the running time of each part of the algorithm; thus, we are able to report the time of the individual parts mentioned in section 2 and gain better insight into the time complexity. The final execution time is calculated by summing up the different parts. Notice that program running time evaluation in Python is rather inexact, as the speed of a program can be largely affected by non-algorithmic factors. For example, a program with a Python interface and a C/C++ implementation can run more than 5 times faster than a pure Python implementation of the same algorithm [prechelt2000empirical]. Also, if numerically invalid numbers (NaN, Inf) appear in any part of an execution, the program will slow down considerably even if they do not affect the final result, leading to potentially unfair comparisons. Nevertheless, the running time can still be viewed as a straightforward demonstration to help understand the time complexity of the proposed method.
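Per-component timing of this kind can be captured with simple wall-clock checkpoints; the sketch below uses time.perf_counter, a higher-resolution alternative to the time.time call mentioned above (the helper and labels are our own illustration).

```python
import time

timings = {}

def timed(label, fn, *args, **kwargs):
    """Run fn, record its wall-clock duration under `label`, return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[label] = time.perf_counter() - start
    return result

# Example: time two stages separately, then sum them for the overall figure,
# mirroring how the final execution time is assembled from its parts.
timed("fit_gmm", lambda: sum(i * i for i in range(100000)))
timed("train_base_classifiers", lambda: sorted(range(50000), reverse=True))
total = sum(timings.values())
```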
Since the performance of machine learning algorithms can be significantly affected by hyper-parameters, grid search is often performed to obtain the best parameter setup. However, when measuring running time, algorithms with more tunable parameters tend to have a longer execution time, without this reflecting the real time complexity of the algorithm. On the other hand, parameter searching is essential for good performance, and a complete model training process should include this part. Thus, in this paper, both types of execution time (with and without grid search) are reported.
As mentioned above, thanks to the flexible time-recording facilities in Python, one is able to capture the time of each part of the algorithm. Therefore, the running time of three levels of the AER method is illustrated: the time of training an individual base classifier, the time of training the stacking of classifiers, and the overall running time of the AER. The purpose of the first two kinds of execution time is to validate the favourable time complexity proved in section 3, as one can observe that the running time is competitive against plain XGBoost even for the stacked classifiers. Notice that the diagonal approximation of the covariance matrix is not used in the experiments, resulting in a relatively longer overall running time of the AER (see section 4.4 for more details).
4.2.3 Statistical Test
To further justify the performance superiority of the proposed AER method, McNemar's test is applied to the Abalone 19 and GMM-generated datasets. The McNemar test is a non-parametric test commonly used in binary classification problems [kim2003financial][pal2013kernel], and the idea is based on verifying whether the two methods make mistakes on the same part of the sample. Essentially, if two methods make wrong predictions on the same portion of the data, there is no fundamental difference between them and the null hypothesis will not be rejected. The effectiveness of the McNemar test in binary classification tasks was comprehensively discussed in [dietterich1998approximate] and is now widely accepted in the community. The McNemar tests are conducted on the AERs with logarithm and exponential likelihoods to verify their statistical significance over the other methods, including Decision Tree, SVM, and plain and focal-loss XGBoost. The tests are implemented based on the Statsmodels package in Python [seabold2010statsmodels], and the contingency tables are computed in array form with Python's Numpy package [van2011numpy]. The chi-squared distribution is used in the test, and the statistics and the p-values are reported. Furthermore, to show the exact binomial distributions of the misclassification results, additional statistics
are reported, which stand for the smaller of the Yes/No and No/Yes counts in the contingency table.
4.3 Experimental results
4.3.1 UCI Bioassay
In the experiment, the number of Gaussian centroids is chosen from a candidate set. After validating with the minimum BIC value following equation 4, the final number of Gaussian centroids is optimized as 8. This leads to a total of 16 base classifiers, of which the first eight are trained on the majority-dominating subsets and the rest are fitted with the nearly-balanced subsets. The distribution of trained weights is plotted in Figure 2 (rounded to 2 decimals for the convenience of plotting).
From the figure, we can see that classifiers trained on majority-dominated data generally receive larger weights (with larger values and darker colors) because these classifiers better represent the global geometry. Nevertheless, the weights from the balanced subsets also make an indispensable contribution to the overall prediction.
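The BIC-based selection of the number of Gaussian centroids can be sketched with sklearn's GaussianMixture, whose bic method implements the criterion directly; the candidate list and synthetic data below are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder data: two well-separated blobs in 2-D.
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
               rng.normal(8.0, 1.0, size=(300, 2))])

candidates = [1, 2, 4, 8]          # hypothetical candidate list
bics = {}
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bics[k] = gmm.bic(X)           # lower BIC = better fit/complexity trade-off
best_k = min(bics, key=bics.get)
```

BIC penalizes the parameter count, so it tends to resist the over-segmentation that a pure likelihood criterion would favor.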
The grid search results for different values of the interpolation parameter are shown in Figures 3 and 4. The search is performed at a fixed resolution over the parameter range; at one extreme the system relies solely on the GMM, and at the other it depends purely on the trained coefficients. Notice that the performance on the training set is not shown in the figures because the training values are nearly saturated.
The performance of the interpolated weights consistently outperforms that of the purely GMM-likelihood weights or the purely trained linear combination for virtually all evaluation metrics. In addition, although the optimal values for the validation and test sets can differ, the validation optimum leads to a satisfying performance only slightly below the testing optimum. The comparison of validation-based optima and testing-based optima is given in Table 2.
Validation optimal  Corresponding test balanced accuracy  Optimal test balanced accuracy

Log-likelihood  0.25  0.8217  0.8294
Exp-likelihood  0.15  0.8669  0.8698
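The interpolation between GMM-likelihood-based weights and SGD-trained weights can be sketched as below. The functional form (a normalized convex combination controlled by the interpolation parameter) is our assumption for illustration; the function name is ours.

```python
import numpy as np

def interpolate_weights(likelihood_w, trained_w, lam):
    """Convex combination of two weight vectors over the base classifiers.

    lam = 1.0 relies purely on the (normalized) GMM likelihoods,
    lam = 0.0 relies purely on the SGD-trained coefficients.
    """
    lw = np.asarray(likelihood_w, dtype=float)
    tw = np.asarray(trained_w, dtype=float)
    lw = lw / lw.sum()                 # normalize each source of weights
    tw = tw / tw.sum()
    w = lam * lw + (1.0 - lam) * tw
    return w / w.sum()                 # keep the ensemble weights on the simplex

# Example with 4 base classifiers and lam = 0.25, the validated
# log-likelihood optimum reported in Table 2.
w = interpolate_weights([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], 0.25)
```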
After obtaining the optimal values of the interpolation parameter, one can examine the performance and optimize the threshold parameter in equation 21 based on the training or validation data. With the selection metric stated above, for this dataset, we found that the difference between the average log-likelihoods of the validation and testing data is 1602.56, while the same metric between the training and testing data is 8663.5. Therefore, the validation data is used to determine the threshold. With the interpolation parameter in Table 2, the performance with respect to the changing threshold value is illustrated in Figures 5 and 6.
The optimal threshold value based on the validation data is not far from the optimum on the test data. The differences between the validation-based and test-based optimal values are provided in Table 3. The overall algorithm tends to favor spotting majority instances over minority samples, as the optimal values under both settings are less than 0.5. However, given that minority instances are sparse in the validation and test sets, the results are satisfying. To provide further insight into the performance of the proposed method, the statistics regarding the F1 score and G-Mean are given in Figures 8 and 9 in the appendix, which show how the F1 score and G-Mean change over threshold values for the Log- and Exp-likelihoods, respectively.
Validation optimal  Test optimal  

Log-likelihood  0.35  0.35
Exp-likelihood  0.30  0.325
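The validation-based threshold selection described above can be sketched as a grid search over candidate thresholds that maximizes balanced accuracy on the validation set; the scoring choice, helper names, and toy scores are illustrative assumptions.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    pos, neg = y_true == 1, y_true == 0
    return 0.5 * ((y_pred[pos] == 1).mean() + (y_pred[neg] == 0).mean())

def select_threshold(scores, y_true, resolution=0.025):
    """Pick the decision threshold with the best balanced accuracy."""
    grid = np.arange(resolution, 1.0, resolution)
    return max(grid,
               key=lambda t: balanced_accuracy(y_true, (scores >= t).astype(int)))

# Toy validation set: minority scores concentrate above 0.6, majority below 0.4,
# so any threshold between the two bands separates the classes perfectly.
rng = np.random.default_rng(0)
y_va = np.array([0] * 90 + [1] * 10)
s_va = np.concatenate([rng.uniform(0.0, 0.4, 90), rng.uniform(0.6, 1.0, 10)])
theta = select_threshold(s_va, y_va)
```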
For comparison with the proposed method, the recently-proposed focal-loss method is implemented with XGBoost (the same model as the individual base classifier used in our method). The results are shown in Figure 7, with the focal parameter chosen via validation. The figure demonstrates a strong performance, which verifies the effectiveness of this highly-cited work. However, for the imbalanced classification case, as one can observe from the leftmost plot, the range of the focal parameter leading to a satisfying performance is quite restricted, and the overall performance declines drastically when the parameter goes out of this range. In contrast, Figures 5 and 6 reflect a more robust behavior in terms of the parameter range.
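The focal loss used in this comparison down-weights easy examples through the modulating factor (1 - p)^γ [lin2018FocalLoss]. A plain numerical sketch of the loss itself (not the XGBoost objective wiring, which additionally requires gradients and Hessians) is:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: -y (1-p)^g log(p) - (1-y) p^g log(1-p)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)     # guard the logarithms
    return -(y * (1 - p) ** gamma * np.log(p)
             + (1 - y) * p ** gamma * np.log(1 - p))

# gamma = 0 recovers plain cross-entropy; larger gamma shrinks the loss of
# well-classified samples much faster than that of hard samples.
easy = focal_loss(np.array([0.9]), np.array([1]), gamma=2.0)
hard = focal_loss(np.array([0.1]), np.array([1]), gamma=2.0)
```

This asymmetric shrinkage is what lets the minority class dominate the gradient signal, and it also explains the sensitivity to γ observed in Figure 7.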
Finally, Tables 4 and 5 illustrate the performance and execution time comparison between the classical methods (provided by a previous paper [schierz2009BioassayData]), the (focal-loss) XGBoost methods, and the proposed AER methods. From Table 4, it can be observed that the proposed AER algorithm, with both log and exp likelihoods, achieves a competitive performance. The exponential-likelihood-based adaptive ensemble method achieves the best balanced accuracy, while its log-likelihood counterpart scores a better TP-FP ratio because of a stronger capability in spotting majority instances. Table 5 shows the running time of the XGBoost and AER components mentioned before, while the execution times of the classical methods are not included because their results are retrieved from [schierz2009BioassayData]. It can be found that the required running time for an individual XGBoost in AER is much shorter than for the 'vanilla' version. Moreover, even if one sums up all 16 individual XGBoost classifiers, the running time is still competitive against the batch-implementation counterpart. The execution time without grid search for the AER overall is less appealing, partly due to the high computational complexity of fitting the GMM models (the diagonal approximation is not used in the experiments). However, if one takes grid search into consideration, AER is again a favourable model.
Minority Recall  Majority Recall  TP-FP Ratio  F1 score  G-Mean  Balanced Accuracy
Naive Bayes  75.00%  80.92%  3.9317  0.0989  0.1993  77.96%
Cost-sensitive SVM  75.00%  84.93%  5.0238  0.1224  0.2236  79.97%
Cost-sensitive Decision Tree (C4.5)  75.00%  85.16%  5.1048  0.1241  0.2253  80.08%
Plain XGBoost  83.33%  0.47%  0.8373  0.0232  0.0990  41.90%
Focal-loss XGBoost  83.33%  78.19%  3.8225  0.0971  0.2073  80.77%
The proposed AER (Log)  75.00%  89.34%  7.0333  0.1622  0.2611  82.16%
The proposed AER (Exp)  83.33%  87.33%  6.5732  0.1550  0.2668  85.33%
Without Grid-search  With Grid-search  Test

XGBoost  ms  s  ms 
AER individual classifier  ms  s  – 
AER stacking classifiers  ms  s  – 
AER overall  s  s  ms 
4.3.2 Abalone 19 Data
The base classifier implemented for the Abalone 19 data is XGBoost, the same as the implementation for UCI Bioassay. The candidate list for the number of Gaussian distributions is fixed in advance, and the number is retrieved through the BIC criterion. The optimal interpolation parameter is determined separately for the Log-likelihood and the Exp-likelihood, and we again pick the threshold values that lead to the optimal performance on the validation set, separately for the Log- and Exp-likelihood AER models. The performance with respect to changing threshold values is given in Figures 10 and 11 in the appendix.
For the purpose of comparison, the classification results of the cost-sensitive Decision Tree, cost-sensitive SVM, plain XGBoost, and focal-loss XGBoost are reported in Table 6. For both cost-sensitive methods, validation selects the most competitive class-weight parameter from a comprehensive list that includes the real skew rate. The Decision Tree model is implemented with the sklearn CART (very similar to C4.5), and the SVM model is fine-tuned with the best kernel among linear, RBF, and polynomial. The parameter of the focal loss is obtained via a 3-fold cross-validation grid search. Naive Bayes is not tested in this case, as almost all the features are real-valued and smoothing would therefore be problematic.
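The cost-sensitive baselines rely on sklearn's class_weight mechanism, which scales the loss contribution of each class. A minimal sketch follows; the toy data are ours, and the 129:1 weight mirrors the Abalone 19 skew rate as one candidate in the validated list.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = np.zeros(400, dtype=int)
y[:12] = 1                         # heavily skewed toy labels (~1:32)

# class_weight multiplies the penalty of minority-class errors, pushing the
# decision boundary toward recovering the rare positives.
cart = DecisionTreeClassifier(class_weight={0: 1, 1: 129},
                              random_state=0).fit(X, y)
svm = SVC(kernel="rbf", class_weight={0: 1, 1: 129}).fit(X, y)
```

Passing class_weight="balanced" instead would derive the weights automatically from the observed class frequencies.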
Minority Recall  Majority Recall  TP-FP Ratio  F1 score  G-Mean  Balanced Accuracy
Cost-sensitive SVM  28.57%  88.66%  2.5198  0.0388  0.0772  58.62%
Cost-sensitive Decision Tree (CART)  14.29%  99.64%  –  0.1818  0.1889  56.96%
Plain XGBoost  100%  0.12%  1.0012  0.0166  0.0916  50.06%
Focal-loss XGBoost  42.86%  85.52%  2.9607  0.0462  0.1022  64.19%
The proposed AER (Log)  85.71%  83.96%  5.3426  0.082  0.1924  84.83%
The proposed AER (Exp)  57.14%  88.90%  5.1491  0.0777  0.1543  73.02%
From the table, it can be observed that the proposed methods (with both Log and Exp likelihoods) outperform the existing algorithms in terms of balanced accuracy and G-Mean score. The AER method with exponential likelihood has a lower balanced accuracy because of a relatively lower recall on minority instances, but it is still higher than those of the existing methods. XGBoost with focal loss enjoys a competitive performance, but it is still inferior to the proposed AER methods. Notice that the TP-FP ratio of the cost-sensitive CART is given as '–': since the recall of the majority is 99.64% while the minority recall is quite low (around 14%), this metric cannot accurately reflect the performance and would be misleading if listed.
Without Grid-search  With Grid-search  Test

CART  ms  ms  ms 
SVM  ms  s  ms 
XGBoost  ms  s  ms 
AER individual classifier  ms  s  – 
AER stacking classifiers  ms  s  – 
AER overall  s  s  ms 
Another interesting perspective on the table is the comparison between plain XGBoost (the base classifier used in the AER method) and the advanced methods built on it (focal loss and AER). It can be observed that the plain XGBoost method performs poorly on this specific task, with a significant bias toward the minority data, while failing to spot majority instances. Focal loss and AER can be regarded as two 'recipes' to improve this behavior, and AER is better in terms of overall performance. However, the highly-regarded focal loss does have its own merits: its majority recall is relatively high, and the algorithm is concise.
Table 7 illustrates the running time comparison for the implemented methods. From the table it can be observed that, similar to the situation on the UCI Bioassay data, XGBoost within the AER framework runs in significantly less time for individual classifiers, and the overall classifier training time of AER lies in the same range as plain XGBoost. For comparison purposes, the execution times of CART and SVM are reported. It can be observed that SVM is relatively slow, which can be partly attributed to a time complexity quadratic in the number of samples, and partly to the implementation through an external library in sklearn. Again, the overall running time of AER is a bit longer because of the use of full-precision covariance matrices. However, with proper optimization of the Python code, such an execution time is still acceptable.
Finally, Table 8 demonstrates the results of the McNemar tests on the significance of the performance superiority of the Exp- and Log-likelihood AERs. From the table, it can be observed that the favourable performances of the AER models are corroborated by the McNemar tests in most cases. The null hypothesis between AER and XGBoost with focal loss is relatively hard to reject for both cases, confirming the strong performance of this widely-favoured method. It is interesting that the AER model with exponential likelihood failed to reject the null hypothesis against SVM; we note that such observations do not occur elsewhere in the experiments. Thus, the problem with this specific test might stem from the particular training/testing split.
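The McNemar procedure used above can be re-implemented compactly. The sketch below builds the contingency counts from two methods' correctness indicators and computes the continuity-corrected chi-squared statistic; it is a self-contained version of what the Statsmodels mcnemar call reports, and the prediction vectors are illustrative.

```python
import math
import numpy as np

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar chi-squared test on paired predictions."""
    ok_a, ok_b = pred_a == y_true, pred_b == y_true
    b = int(np.sum(ok_a & ~ok_b))     # A right, B wrong
    c = int(np.sum(~ok_a & ok_b))     # A wrong, B right
    stat = (abs(b - c) - 1) ** 2 / (b + c) if b + c else 0.0
    # survival function of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, min(b, c)   # min(b, c) is the reported n-statistic

# Illustrative case: method A fixes 15 of B's errors, B fixes only 2 of A's.
y = np.array([1] * 20 + [0] * 80)
pa = y.copy(); pa[:2] = 0                     # A makes 2 mistakes
pb = y.copy(); pb[3:18] = 1 - pb[3:18]        # B makes 15 different mistakes
stat, p, n_small = mcnemar_test(y, pa, pb)
```

Only the discordant counts b and c enter the statistic; samples on which both methods agree carry no evidence either way.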
4.3.3 GMMgenerated Data
For the GMM-generated data, similar to UCI Bioassay and Abalone 19, an XGBoost-based AER is provided. The model chooses the number of Gaussian centroids from a candidate list, and a 10-centroid setup is finally chosen. Notice that for this dataset we actually know the number of Gaussian distributions, and the AER method correctly recovered this information. The optimal interpolation parameter is determined separately for the Log-likelihood and the Exp-likelihood. The validation set is used to determine the threshold value, and the performance with respect to changing threshold values is shown in Figures 12 and 13 in the appendix for the Log- and Exp-likelihood methods, respectively.
Again, for the purpose of comparison, existing methods, including cost-sensitive SVM, cost-sensitive Decision Tree, and plain and focal-loss XGBoost, are tested on the same dataset. The results are summarized in Table 9. From the table, it can be found that only the cost-sensitive Decision Tree, focal-loss XGBoost, and the AER methods capture useful information, while cost-sensitive SVM and plain XGBoost fail to learn any more-than-random classification boundary. As discussed above, the minority data are very hard to learn, as the 100 samples come from 10 different Gaussian distributions. Nevertheless, the AER methods, especially under the exponential-likelihood setup, maintain an acceptable performance (a relatively high balanced accuracy and TP-FP ratio). The results in Table 9 indicate that if the data manifold follows a GMM, existing algorithms have difficulty learning classifiers, while the proposed AER method can serve as an ideal alternative in this specific case.
Minority Recall  Majority Recall  TP-FP Ratio  F1 score  G-Mean  Balanced Accuracy
Cost-sensitive SVM  0%  100%  –  –  –  50.00%
Cost-sensitive Decision Tree (CART)  5.00%  98.86%  4.3889  0.0513  0.0513  51.93%
Plain XGBoost  100%  0.063%  1.0006  –  –  50.03%
Focal-loss XGBoost  25.00%  84.49%  1.6122  0.0370  0.0707  54.74%
The proposed AER (Log)  30.00%  86.58%  2.2358  0.0504  0.0909  58.29%
The proposed AER (Exp)  20.00%  95.06%  4.0513  0.0743  0.0988  57.53%
Tables 10 and 11 demonstrate the execution time of the different methods and the corresponding McNemar tests for the Log- and Exp-likelihood AERs. From Table 10, one can observe that SVM takes a disproportionately longer time as the size of the training set increases. The training time of the AER classifiers appears longer than that of plain XGBoost, but they still roughly stay in the same range. CART is favourable in terms of training time, but it cannot make valuable decisions on this task. From Table 11, it can be found that, for the GMM-generated data, the performance significance of the AER models is verified by the McNemar tests in most cases. The only case that failed to reject the null hypothesis is the Log-likelihood AER against focal-loss XGBoost, but even in this case the p-value is not very large, indicating the effectiveness of the proposed AER method.
Without Grid-search  With Grid-search  Test

CART  ms  s  ms 
SVM  s  s  ms 
XGBoost  ms  s  ms 
AER individual classifier  ms  s  – 
AER stacking classifiers  s  s  – 
AER overall  s  s  ms 
4.4 Discussion
In addition to the above illustrations of the experimental results on the three datasets, some further points are worth discussing:

Training stability. For cost-sensitive methods, the training process of imbalanced data classification can often be unstable, as the large weight on the minority instances forces the classifier to struggle between losing an important instance and dropping a large cluster of samples. In the experiments, one can observe that cost-sensitive methods often lead to a 'one-sided' solution, and the training process of these methods sometimes suffers numerically (producing NaNs) because of this instability. In contrast, the proposed AER algorithm prevents this instability because the subsets used in the training procedure are significantly less skewed. This property serves as another advantage of the proposed adaptive ensemble method.

Exponential- and log-likelihood methods. Another point to discuss is whether the exponential or the log likelihood is preferable in practice. As stated in previous sections, the exponential likelihood tends to select one dominating Gaussian centroid, while the log-likelihood favors a 'soft' combination. In most cases, the exponential-likelihood-based method provides the best performance. However, the variance of performance across different values of the interpolation parameter and the threshold is higher for the exponential likelihood than for the log-likelihood method. Thus, in practice, the log-likelihood-based option is recommended as a starting setup for general scenarios.

Execution time. We theoretically proved the favourable time and memory complexities in section 3. However, from the tables in this section, some of the execution times of the proposed AER methods are longer than those of plain XGBoost. This theory-practice discrepancy can be explained by two factors: the logarithmic complexity of XGBoost and the full precision of the covariance matrices. The first factor means that the time complexity of XGBoost does not fall into the regime of our proof, and the second factor implies a cubic cost in the feature dimension for computing matrix inverses and multiplications (in contrast with the linear cost under the diagonal approximation), which takes a long time to complete. The second factor is confirmed by the observations from the tables, where AER with only the training of classifiers takes approximately the same time as plain XGBoost, while the overall time surges once the GMM modeling time is included.

Effectiveness of the SGD-based implicit and interpolation-based regularizations. The two types of regularization are the most significant contributions of our method, and their effectiveness can be verified by inspecting the optimal interpolation values. For the UCI Bioassay data, from Figures 3 and 4, it can be observed that the optimal interpolation parameter for the validation and testing sets lies at neither extreme of its range, indicating that 'interpolating' the likelihood-based weights, which can be regarded as a form of regularization, leads to a better performance (otherwise the optimum would occur at an endpoint). Results on Abalone 19 and the GMM-generated data are similar, which further supports our claim. Meanwhile, from Figures 3 and 4, it can be observed that solely using the SGD-learned weights (leftmost point) outperforms the purely likelihood-based method (rightmost point), which verifies the effectiveness of the SGD-based implicit regularization.
5 Conclusions
In this paper, a novel method, namely the Adaptive Ensemble of Classifiers with Regularization (AER), has been proposed for binary imbalanced data classification. Details of the method, including an implementation with XGBoost, are provided, and the related training formulas are derived. In addition to the regularization properties, theoretical proofs show that the method has favourable time and memory complexities. The performance of the proposed algorithm is tested on three datasets, and the empirical evidence shows that the overall performance is competitive with both classical and recent algorithms. In addition, the proposed method has other advantageous properties, such as preferable training stability, and it is novel in implementing regularization for dynamic ensemble methods.
Three major contributions are made in this paper. First, the paper proposes an algorithm with state-of-the-art performance on binary imbalanced data; compared to existing optimization methods and recent developments in the area (such as focal loss), the performance of the proposed method is competitive and even better in terms of the comprehensive metrics (G-Mean and balanced accuracy). Second, the proposed method has multiple advantages beyond classification performance, including a stable training process and preferable time and memory complexities. Third, the paper investigates the regularization problem in dynamic ensemble methods, which is relatively underdeveloped in previous publications. Experimental results show that regularization with Stochastic Gradient Descent and weight interpolation based on the global geometry of the data can improve performance and has great potential in the classification of binary imbalanced data.
Acknowledgement
Thanks go to the editor and reviewers for their constructive comments. Also we thank Michael Tan of University College London for his writing suggestions.
Funding Statement
This work is supported by Research Project of State Key Laboratory of Southwest Jiaotong University (No.TPL1502), UniversityEnterprise Cooperation Project (17H1199, 19H0355) and Natural Science Foundation of China (NSFC 51475391).