The SAMME.C2 algorithm for severely imbalanced multi-class classification

Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. Real-world classification problems with severely imbalanced class distributions are increasingly common. In such problems, minority classes have far fewer observations to learn from than majority classes. Despite this sparsity, a minority class is often the class of greater interest, yet developing a learning algorithm suited to such observations presents many challenges. In this article, we propose a novel multi-class classification algorithm, which we refer to as SAMME.C2, specialized to handle severely imbalanced classes. It blends the flexible mechanics of the boosting techniques from the SAMME algorithm, a multi-class classifier, and the Ada.C2 algorithm, a cost-sensitive binary classifier designed to address severe class imbalances. Not only do we provide the resulting algorithm, but we also establish the scientific and statistical formulation of our proposed SAMME.C2 algorithm. Through numerical experiments examining various degrees of classification difficulty, we demonstrate the consistently superior performance of our proposed model.

1 Introduction

In machine learning, classification involves the accurate assignment of a target class or label to input observations. When there are only two labels, it is called binary classification; when there are more than two labels, it is often referred to as multi-class classification. Classification algorithms generally fall into three categories (see hastie2009):

  • Linear classifiers: This type of algorithm separates the input observations using a line or hyperplane based on a linear combination of the input features. Examples include logistic regression, probit regression, and linear discriminant analysis.

  • Nonlinear classifiers: This type of algorithm separates the input observations based on nonlinear functions. Examples include decision trees, k-nearest neighbors (KNN), support vector machines (SVM), and neural networks.

  • Ensemble methods: This type of algorithm combines the predictions produced from multiple models. Examples include random forests, stochastic gradient boosting (e.g., XGBoost), and adaptive boosting (AdaBoost).

In imbalanced classification, the distribution of observations across classes is biased or skewed. In this case, minority classes have far fewer observations to learn from than majority classes. In spite of this sparsity, the minority class is often the class of greater interest, yet developing a learning algorithm suited to such observations presents many challenges. Several studies have dealt with imbalanced data, but mostly in the context of binary problems. The most common and direct approach is to use the algorithms listed above and handle the class imbalance at the data level. In this case, the class distribution of the input observations is rebalanced by oversampling the underrepresented classes or undersampling the overrepresented classes. One popular approach is to oversample underrepresented classes using SMOTE (Synthetic Minority Oversampling Technique), a technique developed by chawla2002smote. It is worth noting that generating synthetic observations to rebalance class distributions, especially in multi-class classification, has the disadvantage of increasing the overlap between classes and adding unnecessary noise.

Boosting is widely regarded as one of the most powerful classes of learning algorithms; it combines a sequence of weak models into a strong learner in order to improve predictive power. A specific boosting technique primarily developed for classification is AdaBoost, a class of so-called adaptive boosting algorithms. AdaBoost.M1, which combines several weak classifiers to produce a strong classifier, is the first practical boosting algorithm, introduced by freund1997decision. AdaBoost.M1 is an iterative process that starts with a distribution of equal observation weights. At each iteration, the process fits one weak classifier and subsequently adjusts the observation weights so that more weight is given to input observations that have been misclassified, allowing for increased learning. See Algorithm 1 in Appendix A.

AdaBoost.M1 has been extended to handle multi-class classification problems. One such extension is AdaBoost.M2, developed by freund1997decision, which is based on the optimization of a pseudo-loss function suitable for handling multi-class problems. Another extension is AdaBoost.MH, developed by schapire1999improv, which is based on the optimization of the Hamming loss function. Both of these extensions solve multi-class classification problems by reducing them to several different binary problems; such procedures can be slow and inefficient. A more popular multi-class AdaBoost extension is the algorithm called SAMME (Stagewise Additive Modeling using a Multi-class Exponential Loss Function) proposed by zhu2009mclass, which avoids these computational inefficiencies because it does not require multiple binary problems. See so2021cost for details of the iteration process of this algorithm. According to friedman2000addlog and hastie2009, the SAMME algorithm is equivalent to an additive model with a minimization of a multi-class exponential loss function and belongs to the traditional statistical family of forward stagewise additive models. Additional variations to these AdaBoost algorithms have appeared in ferreira2012review; a recent work of tanha2020boosting provides a comprehensive survey.

In order to further improve prediction in imbalanced classification, cost-sensitive learning algorithms provide a necessary additional layer of complexity that takes costs into consideration. The work of pazzani1994reduce was the first to introduce cost-sensitive algorithms that minimize misclassification costs in classification problems. The cost values, estimated as hyperparameters, are additional inputs to the learning procedure and are generally used to reduce misclassification costs, which attach penalties to predictions that lead to significant errors. Within the context of adaptive boosting algorithms, these costs are used to modify the updating of the observation weights at each iteration. For binary classification, Ada.C2 is the most well-known and attractive AdaBoost method that incorporates cost-sensitive learning (sun2007cost). For details of this algorithm, please see so2021cost.

In this article, we propose a novel multi-class classification algorithm, which we refer to as SAMME.C2, especially designed to handle imbalanced classes. This algorithm combines the advantages of the two algorithms described earlier: (1) SAMME, one of the AdaBoost algorithms for multi-class classification, which does not decompose the classification task into multiple binary problems and thereby avoids computational inefficiencies, and (2) the cost-sensitive learning employed in Ada.C2. zhu2009mclass showed that SAMME is equivalent to forward stagewise additive modeling with the minimization of a multi-class exponential loss function and that the resulting classifier is the optimal Bayes classifier. These mathematical proofs are important statistical justifications that the resulting classifiers are optimal. However, the training objective of the SAMME algorithm is to reduce the test error rate, which works quite well when classes are roughly balanced. When classes are severely imbalanced, the SAMME algorithm places more observation weight on classifying majority classes accurately because this contributes more to decreasing test errors. This comes at a large sacrifice in the ability to accurately classify minority classes. This leads us to add the attributes of cost-sensitive learning techniques to this algorithm. With cost-sensitive learning added to SAMME, SAMME.C2 demonstrates superior control of these issues attributable to class imbalances. This article extends the mathematical proofs to show that, with the addition of cost values, SAMME.C2 retains the same statistical foundations as SAMME.

The practical importance of multi-class classification tasks, especially with severely imbalanced classes, extends to multiple disciplines. Various ad-hoc algorithms, some of which are described above, have been employed. The works of liu2017hybrid, yuan2018regularized, jeong2020comparison, and mahmudah2021machine address real-life biomedical applications of such classification tasks in the detection of disease. Spam detection is widely studied in computer engineering; see mohammad2020improved, talpur2020multi, and dewi2017multiclass. The study conducted by kim2016detecting applies multi-class classification with cost-sensitive learning mechanisms to detect financial misstatements associated with fraud intention in finance. In operations research, han2019fault proposes a fault diagnosis model for planetary gear carrier packs as a tool for detecting manufacturing faults. Finally, in insurance, so2021cost examines the frequency of accidents as a multi-class classification problem with highly imbalanced classes, using observations of insured drivers with additional telematics information about driving behavior collected through a usage-based insurance policy.

The remainder of this paper is organized as follows. Section 2 introduces the details of the new SAMME.C2 algorithm, which is largely based on the integration of SAMME and Ada.C2. Section 3 presents the mathematical proofs that SAMME.C2 follows a forward stagewise additive model and is an optimal Bayes classifier. To demonstrate the algorithmic superiority of SAMME.C2, Section 4 presents numerical experiment results based on simulated datasets. To show the many varied applications of our work, this section additionally lists some practical research on multi-class classification. Section 5 concludes.

2 The SAMME.C2 algorithm

For our purpose, let us consider a set of input observations denoted by $\{(x_i, y_i)\}$ for $i = 1, \ldots, N$, where $x_i$ is a set of feature variables and $y_i$ is the target classification variable belonging to one of $K$ classes. In the binary case, $K = 2$. An important input variable is the cost value, which we denote here as $C(y_i)$ to emphasize that it is a function of the target variable; it is pre-determined by the hyperparameter optimization technique described below.

SAMME.C2 combines the benefits of boosting and cost-sensitive algorithms for handling class imbalances in multi-class classification problems. Given the input data $\{(x_i, y_i, C(y_i))\}_{i=1}^{N}$, the algorithm is an iterative process of fitting weak classifiers, denoted by $h_t$ at iteration $t$, and the process stops at time $T$. The stopping time $T$ can be a tuned hyperparameter. At iteration $t = 1$, we set equal observation weights as $D_1(i) = 1/N$. In each subsequent iteration $t$, we train a weak classifier using the distribution $D_t$. Any weak classifier can be used, but for our purpose we use the simplest weak classifiers, decision stumps. We update the distribution of the observation weights using

$$D_{t+1}(i) = \frac{C(y_i)\, D_t(i) \exp\left(-\alpha_t I(y_i = h_t(x_i))\right)}{\sum_{j=1}^{N} C(y_j)\, D_t(j) \exp\left(-\alpha_t I(y_j = h_t(x_j))\right)}, \tag{1}$$

which depends on the error rate of the $t$-th weak classifier given by

$$\epsilon_t = \frac{\sum_{i=1}^{N} D_t(i)\, I(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} D_t(i)} \tag{2}$$

and the weight of the $t$-th weak classifier given by

$$\alpha_t = \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) + \log(K - 1). \tag{3}$$

The final classifier is then determined at the final iteration using

$$H(x_i) = \arg\max_{k} \sum_{t=1}^{T} \alpha_t\, I(h_t(x_i) = k). \tag{4}$$

Details of the algorithm are given in Algorithm 4 in the appendix.
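The iteration described by equations (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_samme_c2` and `predict_samme_c2` are hypothetical, class labels are assumed to be coded as integers 0, ..., K-1, the decision stump is scikit-learn's `DecisionTreeClassifier(max_depth=1)`, and the clipping of the error rate and the early stop when the classifier weight is not positive are safeguards added here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_samme_c2(X, y, cost, n_rounds=50):
    """Fit a SAMME.C2-style ensemble of decision stumps (illustrative sketch).

    X: (N, p) features; y: integer labels in {0, ..., K-1};
    cost: length-K sequence of positive cost values, one per class.
    Returns the fitted stumps and their weights alpha_t.
    """
    y = np.asarray(y)
    cost = np.asarray(cost, dtype=float)
    n_classes = len(cost)
    c = cost[y]                                # C(y_i) for each observation
    d = np.full(len(y), 1.0 / len(y))          # D_1(i) = 1/N
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=d)
        miss = stump.predict(X) != y
        eps = np.sum(d * miss) / np.sum(d)     # error rate, equation (2)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # added guard against log(0)
        alpha = np.log((1 - eps) / eps) + np.log(n_classes - 1)  # equation (3)
        if alpha <= 0:                         # added guard: stop if no better than chance
            break
        d = c * d * np.exp(-alpha * (~miss))   # numerator of equation (1)
        d /= d.sum()                           # normalization in equation (1)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def predict_samme_c2(stumps, alphas, X, n_classes):
    """Equation (4): weighted vote of the weak classifiers."""
    votes = np.zeros((X.shape[0], n_classes))
    rows = np.arange(X.shape[0])
    for stump, alpha in zip(stumps, alphas):
        votes[rows, stump.predict(X)] += alpha
    return votes.argmax(axis=1)
```

Passing a cost vector of all ones to this sketch removes the cost factor from the weight update, which corresponds to the SAMME update.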

2.1 Comparison with Ada.C2 and SAMME

The iteration process for all three algorithms (Ada.C2, SAMME, and SAMME.C2) is exactly the same. The primary differences lie in the error rate and the weight of the $t$-th classifier, as well as in the updating of the distribution of the observation weights.

For SAMME and SAMME.C2, the computation of the error rate is exactly the same, even though the SAMME algorithm does not have cost values. For Ada.C2, unlike SAMME.C2, the cost values are used to compute the error rate of the $t$-th classifier using

$$\epsilon_t = \frac{\sum_{i=1}^{N} C(y_i)\, D_t(i)\, I(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} C(y_i)\, D_t(i)}. \tag{5}$$

For SAMME and SAMME.C2, the computation of the weight of the $t$-th classifier is exactly the same, even though the SAMME algorithm does not have cost values. For Ada.C2, the weight of the $t$-th classifier is given by

$$\alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right). \tag{6}$$

For the misclassified training samples to be properly boosted, the classification error at each iteration should be less than 1/2; otherwise, $\alpha_t$, which is a function of the classification error, will be negative and the observation weights will be updated in the wrong direction. In that case, the classification error can no longer be improved after the iteration. In the binary case, as in Ada.C2, this just requires that each weak learner performs a little better than random guessing. However, when $K > 2$, the random guessing accuracy rate is $1/K$, which is less than 1/2. Hence, multi-class problems need a much more accurate weak learner than binary problems, and if the weak learner is not chosen and trained accurately enough, the algorithm may fail. zhu2009mclass pointed this out and proposed the SAMME algorithm, which directly extends AdaBoost.M1 to the multi-class case by adding one term, $\log(K-1)$, to the updating equation of $\alpha_t$ at each iteration $t$. With this term, $\alpha_t$ is positive whenever the error rate is below $(K-1)/K$, that is, whenever the weak learner beats random guessing.
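The effect of the extra $\log(K-1)$ term can be checked numerically. The short sketch below compares the classifier weight with and without that term; the function names are hypothetical and chosen only for this illustration.

```python
import math


def alpha_adaboost(eps):
    """Classifier weight without the multi-class term: log((1 - eps) / eps)."""
    return math.log((1 - eps) / eps)


def alpha_samme(eps, n_classes):
    """SAMME / SAMME.C2 classifier weight: adds log(K - 1) to the term above."""
    return math.log((1 - eps) / eps) + math.log(n_classes - 1)


# A weak learner with a 60% error rate on a 3-class problem still beats
# random guessing (error rate 2/3), yet only the SAMME weight is positive.
print(alpha_adaboost(0.6) < 0, alpha_samme(0.6, 3) > 0)
```

For $K = 2$ the added term is $\log(1) = 0$, so the two weights coincide, and for $K = 3$ the SAMME weight changes sign exactly at an error rate of 2/3.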

The updating of the distribution of the observation weights for the subsequent iteration is exactly the same for the Ada.C2 and SAMME.C2 algorithms; this is not at all surprising since both algorithms consider cost values. For the SAMME algorithm, which does not have cost values, the distribution of the observation weights for the subsequent iteration is given by

$$D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t I(y_i = h_t(x_i))\right)}{\sum_{j=1}^{N} D_t(j) \exp\left(-\alpha_t I(y_j = h_t(x_j))\right)}. \tag{7}$$

The updating principle is based on how the algorithm correctly classifies (or misclassifies) majority and minority classes. For the SAMME algorithm without cost values, the weight is redistributed evenly among correctly classified (or misclassified) observations, regardless of whether they belong to a majority or minority class. For Ada.C2 and SAMME.C2, with the addition of cost values, the redistribution becomes uneven by assigning heavier weights to observations that belong to minority classes. This leads us to conclude that, for cost-sensitive learning mechanisms, after a sufficient number of iterations the weak classifiers are trained with a heavy emphasis on misclassified observations that are in the minority class. See Figure 1 of so2021cost.
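A single weight update makes the even versus uneven redistribution concrete. The sketch below is illustrative only: `update_weights` is a hypothetical helper, and the four observations, the value of the classifier weight, and the cost of 3 for the minority observation are made-up numbers.

```python
import numpy as np


def update_weights(d, correct, alpha, cost=None):
    """One normalized weight update.

    d: current weights D_t(i); correct: True where h_t(x_i) = y_i;
    cost: per-observation cost values C(y_i), or None for the SAMME update.
    """
    d = np.asarray(d, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    c = np.ones_like(d) if cost is None else np.asarray(cost, dtype=float)
    new_d = c * d * np.exp(-alpha * correct)
    return new_d / new_d.sum()


# Four observations: the first three belong to the majority class,
# the last one to the minority class. The last two are misclassified.
d = np.full(4, 0.25)
correct = np.array([True, True, False, False])
print(update_weights(d, correct, alpha=1.0))                     # SAMME
print(update_weights(d, correct, alpha=1.0, cost=[1, 1, 1, 3]))  # SAMME.C2
```

Without cost values the two misclassified observations receive the same weight; with the cost values the misclassified minority observation receives three times the weight of the misclassified majority observation.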

For a graphical display of the iteration process with emphasis on these differences, please refer to Figure 1. It can be noted that SAMME is a special case of SAMME.C2 obtained by setting all the cost values to 1, that is, $C(y_i) = 1$ for all observations $i$ and all classes $y_i$.

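Initial sample weights $D_1(i) = 1/N$, for $i = 1, \ldots, N$

Train weak classifier using the distribution $D_t$, for $t = 1, \ldots, T$

Get weak classifier $h_t$

Compute error rate of $h_t$
SAMME & SAMME.C2: $\epsilon_t = \dfrac{\sum_{i=1}^{N} D_t(i)\, I(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} D_t(i)}$
Ada.C2: $\epsilon_t = \dfrac{\sum_{i=1}^{N} C(y_i)\, D_t(i)\, I(y_i \neq h_t(x_i))}{\sum_{i=1}^{N} C(y_i)\, D_t(i)}$

Calculate weight of the $t$-th weak classifier
SAMME & SAMME.C2: $\alpha_t = \log\left(\dfrac{1 - \epsilon_t}{\epsilon_t}\right) + \log(K - 1)$
Ada.C2: $\alpha_t = \dfrac{1}{2} \log\left(\dfrac{1 - \epsilon_t}{\epsilon_t}\right)$

Update sample weights
Ada.C2 & SAMME.C2: $D_{t+1}(i) = \dfrac{C(y_i)\, D_t(i) \exp\left(-\alpha_t I(y_i = h_t(x_i))\right)}{\sum_{j=1}^{N} C(y_j)\, D_t(j) \exp\left(-\alpha_t I(y_j = h_t(x_j))\right)}$
SAMME: $D_{t+1}(i) = \dfrac{D_t(i) \exp\left(-\alpha_t I(y_i = h_t(x_i))\right)}{\sum_{j=1}^{N} D_t(j) \exp\left(-\alpha_t I(y_j = h_t(x_j))\right)}$

Return (final) classifier $H(x_i) = \arg\max_{k} \sum_{t=1}^{T} \alpha_t\, I(h_t(x_i) = k)$

Figure 1: Three AdaBoost algorithms: SAMME, Ada.C2, and SAMME.C2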

2.2 The cost optimization

The critical work involved in implementing SAMME.C2 is the process of determining the cost value given to each class. From the perspective of SAMME.C2, because cost values can be regarded as hyperparameters, this process amounts to optimizing or tuning hyperparameters in a learning algorithm. Various hyperparameter optimization methods may be used to optimize the cost values. Some of the frequently used strategies are grid search, random search (bergstra2012randomsearch), and sequential model-based optimization (bergstra2011smbo). Grid search and random search are simple and widely used. However, because the next trial set of hyperparameters is not chosen based on previous results, they are time-consuming. One of the most powerful strategies is sequential model-based optimization, also sometimes referred to as Bayesian optimization, in which the subsequent set of hyperparameters is determined based on the results of the previously evaluated sets of hyperparameters.

bergstra2011smbo and snoek2012practical showed that sequential model-based optimization outperforms both grid and random searches. However, using sequential model-based optimization requires an advanced level of statistical knowledge. For our purpose with SAMME.C2, we employ the Genetic Algorithm (GA), which is simple, easily understandable, and at the same time computationally efficient. Developed by holland1975 and described in muhlenbein1997GA, GA is a kind of random search technique; its primary difference from general random searches is that the subsequent trial set of hyperparameters is decided based on the results of previously evaluated sets of hyperparameters, just as in sequential model-based optimization.

In this algorithm, we first create the population set consisting of arbitrary cost vectors. Each cost vector has $K$ elements for a $K$-class problem. We then run SAMME.C2 and perform an evaluation step to obtain the performance metric corresponding to each cost vector. Here, the performance metric serves as the objective function.

  • In the selection step, two cost vectors are chosen from the population using the “choice by roulette” method, a typical GA operator that selects cost vectors with a larger performance metric with higher probability.

  • In the crossover step, we combine the two selected cost vectors into a single vector using their arithmetic average.

  • In the mutation step, we pick a random number within a tiny interval and use it to adjust the elements of the cost vector.

By repeating the selection, crossover, and mutation steps, we produce a new population of cost vectors. The procedure is then repeated for a pre-specified number of generations to obtain the population that produces the optimal cost vectors.
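The selection, crossover, and mutation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `genetic_search`, its default population size, number of generations, cost range, and mutation width are all assumptions, and `toy_objective` is a made-up stand-in for the performance metric that, in the setting described here, would come from fitting and evaluating SAMME.C2 with each cost vector.

```python
import numpy as np


def genetic_search(objective, n_classes, pop_size=20, n_generations=30,
                   low=1.0, high=10.0, mutation=0.05, seed=0):
    """Search for a cost vector that maximizes a non-negative `objective`."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(low, high, size=(pop_size, n_classes))  # initial population
    best_cost, best_score = None, -np.inf
    for _ in range(n_generations):
        scores = np.array([objective(c) for c in pop])        # evaluation step
        if scores.max() > best_score:
            best_score = float(scores.max())
            best_cost = pop[scores.argmax()].copy()
        # Roulette selection: probability proportional to the score
        # (a tiny constant keeps every probability strictly positive).
        probs = (scores + 1e-12) / (scores + 1e-12).sum()
        children = []
        for _ in range(pop_size):
            i, j = rng.choice(pop_size, size=2, replace=False, p=probs)
            child = (pop[i] + pop[j]) / 2.0                   # arithmetic-average crossover
            child += rng.uniform(-mutation, mutation, n_classes)  # small mutation
            children.append(np.clip(child, low, high))
        pop = np.array(children)
    return best_cost, best_score


# Hypothetical objective for demonstration only: peaks at the cost vector (1, 3, 6).
target = np.array([1.0, 3.0, 6.0])


def toy_objective(cost):
    return 1.0 / (1.0 + np.sum((cost - target) ** 2))


best_cost, best_score = genetic_search(toy_objective, n_classes=3)
print(best_cost, best_score)
```

Because each evaluation of the real objective requires training a full boosted model, the population size and number of generations directly control the total tuning cost.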

3 Proof of optimality

In this section, we provide a theoretical justification of the SAMME.C2 algorithm. Recall that an advantage of the SAMME algorithm is that it is statistically explainable or justifiable. In particular, zhu2009mclass proved that the SAMME algorithm is equivalent to fitting a forward stagewise additive model using a multi-class exponential loss function expressed as

(8)

In the same fashion, we demonstrate that the addition of cost sensitive learning to SAMME preserves these same theoretical aspects. To prove this, instead of (8), we use a loss function multiplied by cost values, which we may call a multi-class cost sensitive exponential loss function expressed as

(9)

Just as in the work of zhu2009mclass, we justify the use of multi-class cost-sensitive exponential loss function in (9) by first showing that the resulting classifier minimizing (9) is the optimal Bayes classifier. Note that the symbols , , and cost vector will be well defined in the subsequent subsections.

3.1 Terminology

Suppose we are given a set of data denoted by $\{(x_i, y_i, C(y_i))\}$ for $i = 1, \ldots, N$, where $x_i$ is a set of feature variables, $y_i$ is the corresponding response, which is a classification variable that belongs to the set $\{1, \ldots, K\}$, and $C(y_i)$ is the corresponding cost value, which is a function of $y_i$. For each observation, we attach a cost value that depends on which class the observation belongs to; these cost values are generated outside the algorithm but are based on the minority/majority characteristics of the classification variable. The objective is to learn from the data so that we can build a predictive model for identifying the class to which a particular observation belongs, given the set of feature variables. Without loss of generality, we re-code the response $y_i$ as a $K$-dimensional vector $Y_i$; all entries in this vector will be equal to $-1/(K-1)$ except a value of 1 in position $k$ if the observation has $y_i = k$. In effect, we have $Y_i = (Y_{i1}, \ldots, Y_{iK})^\top$ where

$$Y_{ik} = \begin{cases} 1, & \text{if } y_i = k, \\ -\dfrac{1}{K-1}, & \text{if } y_i \neq k. \end{cases}$$

This re-coding carries over the symmetry of the class label representation in the binary case (lee2004multicategory). It is straightforward to show that $\sum_{k=1}^{K} Y_{ik} = 0$ for all $i$. There is a one-to-one correspondence between $y_i$ and $Y_i$, and the two will be used interchangeably for convenience and clarity whenever possible; each equivalently refers to the class the observation belongs to.
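The re-coding can be illustrated with a short sketch. It assumes the symmetric coding used by zhu2009mclass, in which the entry for the true class is 1 and every other entry is $-1/(K-1)$; the function name `recode` is hypothetical, and labels are coded here as 0, ..., K-1 rather than 1, ..., K for convenience.

```python
import numpy as np


def recode(y, n_classes):
    """Symmetric coding: 1 in the true-class position, -1/(K-1) elsewhere."""
    y = np.asarray(y)
    coded = np.full((len(y), n_classes), -1.0 / (n_classes - 1))
    coded[np.arange(len(y)), y] = 1.0
    return coded


Y = recode([0, 2, 1], n_classes=3)
print(Y.sum(axis=1))     # each row sums to zero, up to rounding
print(Y.argmax(axis=1))  # the position of the 1 recovers the original label
```

The second line of output shows the one-to-one correspondence between the class label and its coded vector.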

3.2 The loss function for the optimal Bayes classifier

This section provides a theoretical justification for the use of the multi-class cost-sensitive exponential loss function in (9) in the optimization leading to the SAMME.C2 algorithm. More specifically, we show here that the resulting classifier is an optimal Bayes classifier. It is well known in classification problems that the Bayes classifier minimizes the probability of misclassification; see hastie2009.

Lemma 3.1.

Denote to be the classification variable with possible values belonging to , to be the re-coding of this variable as explained above, to be the cost vector, and

to be the classifier function. The following result leads us to the optimal classifier function under the multi-class cost-sensitive exponential loss function:

subject to where

(10)
Proof.

For this optimization, the Lagrangian can be written as

where is the Lagrange multiplier. By taking derivative with respect to and , we reach

and the constraint that

Next, by summing the first equations, we get

and by substituting the last equation, we obtain the following population minimizer, (10):

Note that the constraints on in Lemma 3.1 allow us to find the unique solution. The following proposition allows us to choose the optimal Bayes classifier.

Proposition 3.2.

Denote to be the classification variable with possible values belonging to . Given the feature variables , we find the optimal Bayes classifier using the multi-class cost-sensitive exponential loss function:

Proof.

It is clear that

and that

is fixed for all . It follows therefore that from (10), we have

Proposition 3.2 provides a theoretical justification for our estimated classifier in the SAMME.C2 algorithm, and the subsequent proposition provides for a formula to calculate the implied class probabilities within this framework.

Proposition 3.3.

The implied class probability under the optimal Bayes classifier has the form

Proof.

Equation (10) provides once ’s are determined. ∎

3.3 SAMME.C2 as forward stagewise additive modeling

In this section, we show that our SAMME.C2 algorithm is indeed equivalent to a forward stagewise additive modeling based on the optimization of the multi-class cost-sensitive exponential loss function expressed in (9).

Given the training set , (9) can be written as:

we wish to find such that

(11)

subject to .

Using forward stagewise modeling for learning, the solution to (11) has the linear form

where is the total number of iterations, and are the basis functions with corresponding coefficient . We require that each basis function satisfies the symmetric constraint whereby for all so that takes only one of the possible values from the set

(12)

Then, at iteration , the solution can be written as:

(13)

Forward stagewise modeling finds the solution to (11) by sequentially adding new basis functions to previously fitted model. Hence, at iteration , we only need to solve the following

(14)

where does not depend on either or and is equivalent to the unnormalized distribution of observation weights in the -th iteration in Algorithm 4. We notice that in (14) is a one-to-one correspondence with the multi-class classifier in Algorithm 4 in the following manner:

Therefore, in essence, solving for is equivalent to finding in Algorithm 4.

Proposition 3.4.

The solution to the optimization expressed as

has the following form:

(15)
(16)

where

and

Proof.

To find in (14), first, we fix . Let us consider the case where . We have

(17)

On the other hand, when , we have

(18)

Equations (17) and (18) lead us to

(19)

From (19), since only the last sum depends on the classifier , for a fixed value of , solution to results in:

Now let us find given . If we define as

(20)

plugging (17), (18), and (20) into (14), we would have

The summation component does not affect the minimization so that, differentiating with respect to and setting to zero, we get

and factoring out the term , we get

Since we are minimizing a convex function of , the optimal solution is

The terms and are equivalent to those in Algorithm (4). Subsequently, we can deduce the updating equation for the distribution of the observation weights in Algorithm (4) after normalization.

Proposition 3.5.

The distribution of the observation weights at each iteration simplifies to

(21)

Equation (21) is equivalent to the updating of the weights in Algorithm (4) after normalization.

Proof.

From equation (13), multiplying both sides by , exponentiating, and multiplying both sides by , we get

for which can be written as

where we define

The weight at each iteration can further be simplified to

(22)

Multiplying (22) by ,

which proves our proposition. To show equivalence to the updating of the weights in Algorithm (4), we note that the cases where and are equivalent to the cases where and , respectively. ∎

It should be straightforward to show that the final classifier is the solution

which is equivalent to

4 Numerical experiments

This section examines the differences between SAMME.C2 and SAMME in how each model is trained, further exploring the superiority of SAMME.C2 over SAMME in handling imbalanced data. To accomplish this, we make use of simulated datasets with a highly imbalanced three-class response variable. To generate such a simulated dataset, we utilize the Scikit-learn Python module described in pedregosa2011scikit. The make_classification function is called with the following parameterization:

    # Make simulation: a highly imbalanced three-class dataset
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=100000, n_features=50,
        n_informative=50,  # assumed value: the source reads "5-"
        n_redundant=0, n_repeated=0, n_classes=3, n_clusters_per_class=2,
        class_sep=2, flip_y=0, weights=[0.90, 0.09, 0.01], random_state=16)

This script generates 100,000 samples with 50 features and 3 classes, deliberately creating a highly imbalanced dataset by setting the class proportions to 90%, 9%, and 1%, respectively. Changing the class_sep parameter adjusts the difficulty of the classification task: with a lower value of class_sep, the samples are no longer easily separable. To investigate and compare how the algorithms run at different levels of difficulty, three datasets were created by adjusting this parameter: for high classification difficulty, we set class_sep=1; for medium classification difficulty, we set class_sep=1.5; and for low classification difficulty, we set class_sep=2. In Figure 2, we visualize these three levels of classification difficulty, with one difficulty level per column. The figure clearly shows that low classification difficulty means the samples are easily separable; the opposite holds for high classification difficulty. For ease of visualization, we use only 3 of the 50 features, and we display the 3-dimensional data structure as three pairwise 2-dimensional graphs. We refer the reader to the package documentation for further explanation of the other input parameters. For training, we use 75% of the data, and the rest is used for testing.
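Generating the three datasets and the 75%/25% split can be sketched as below. Several details are assumptions rather than statements from the text: the helper name `simulate` and the dictionary `CLASS_SEP` are hypothetical, the default `n_informative=50` is a guess because that value is garbled in the source, and the stratified split with the same seed is one reasonable choice among several.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# class_sep value used for each level of classification difficulty
CLASS_SEP = {"high": 1.0, "medium": 1.5, "low": 2.0}


def simulate(class_sep, n_samples=100000, n_informative=50, seed=16):
    """Simulate one imbalanced 3-class dataset and split it 75% / 25%."""
    X, y = make_classification(
        n_samples=n_samples, n_features=50, n_informative=n_informative,
        n_redundant=0, n_repeated=0, n_classes=3, n_clusters_per_class=2,
        class_sep=class_sep, flip_y=0, weights=[0.90, 0.09, 0.01],
        random_state=seed,
    )
    # Assumption: stratify on y so both parts keep the 90/9/1 class mix.
    return train_test_split(X, y, test_size=0.25, stratify=y, random_state=seed)


# Example: the medium-difficulty dataset (smaller n_samples for a quick run).
X_train, X_test, y_train, y_test = simulate(CLASS_SEP["medium"], n_samples=2000)
print(X_train.shape, X_test.shape)
```

With a 1% minority class, stratifying guards against a test set that contains almost no minority observations, which matters once per-class Recall is computed.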

Figure 2: Visualization of three synthetic datasets having 3 classes with different levels of classification difficulty.

For classification problems, the most common performance statistic is accuracy, which is the proportion of all observations that were correctly classified. For imbalanced datasets, accuracy is a misleading measure because it is dominated by the majority class. As an alternative statistic, we consider Recall to measure performance. The Recall statistic, sometimes called the sensitivity, for class $k$, denoted $\text{Recall}_k$, is defined to be the proportion of observations in class $k$ correctly classified. It has been discussed (fernandez2018learning) that Recall, or sensitivity, is usually a more interesting measure for imbalanced classification. To provide a single measure of performance for a given classifier, we use the geometric average of the Recall statistics, denoted as MAvG, as follows:

$$\text{MAvG} = \left( \prod_{k=1}^{K} \text{Recall}_k \right)^{1/K}. \tag{23}$$

It is straightforward to show that when we take the log of both sides of this performance metric, we get the average of the logs of all the Recall statistics. This log transformation leads us to a metric that gives equal importance to accurately classifying observations in every class. In the case of severely imbalanced datasets, the MAvG metric rewards correctly classifying more observations of the minority class at the expense of misclassifying some observations in the majority class. In effect, the MAvG metric is a sensible performance measure for severely imbalanced datasets; this performance metric is also used as the criterion for the hyperparameter optimization in GA to determine the cost values used in the SAMME.C2 algorithm. The concept of MAvG used for imbalanced datasets originated from the work of fowlkes1983gmean.
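The MAvG metric is simple to compute from predicted and true labels. The sketch below is a minimal illustration with a hypothetical function name `mavg`; it assumes labels coded as 0, ..., K-1 and that every class appears at least once among the true labels.

```python
import numpy as np


def mavg(y_true, y_pred, n_classes):
    """Geometric mean of the per-class Recall statistics."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = np.array([
        np.mean(y_pred[y_true == k] == k) for k in range(n_classes)
    ])
    return float(np.prod(recalls) ** (1.0 / n_classes))


# A 90% / 9% / 1% class mix in which every observation is assigned
# to the majority class.
y_true = np.repeat([0, 1, 2], [900, 90, 10])
y_pred = np.zeros_like(y_true)
print(np.mean(y_true == y_pred), mavg(y_true, y_pred, 3))
```

In this example the accuracy is 0.9 while MAvG is 0, because the Recall of both minority classes is zero; this is the behavior that makes MAvG informative where accuracy is not.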

To examine how SAMME.C2 and SAMME run, each algorithm is trained with 1,000 decision stumps on each of the three datasets. A decision stump is a decision tree of depth one, which plays the role of a weak learner in the algorithms. Figures 3, 4, and 5 show the resulting test errors and test MAvG of SAMME.C2 and SAMME as decision stumps are added, using the datasets with low, medium, and high levels of difficulty, respectively. All figures are plotted against the number of iterations, with each iteration corresponding to one new decision stump.

Figure 3: Comparison of test error and test MAvG between SAMME.C2 and SAMME with 1,000 decision stumps using the dataset of low level of classification difficulty.
Figure 4: Comparison of test error and test MAvG between SAMME.C2 and SAMME with 1,000 decision stumps using the dataset of medium level of classification difficulty.
Figure 5: Comparison of test error and test MAvG between SAMME.C2 and SAMME with 1,000 decision stumps using the dataset of high level of classification difficulty.

With the SAMME algorithm, the objective is to reduce the test error, that is, the misclassification rate. Therefore, when the model is trained with severely imbalanced data, it puts more weight on the majority class, since the majority class can significantly reduce the test error. For example, based on the simulated datasets in these numerical experiments, a model can be constructed that assigns all observations to the majority class. In that case, we get a misclassification rate of 10%, which can be deemed small. Therefore, the test error is not a meaningful performance metric for severely imbalanced datasets. All figures show small test errors for the SAMME algorithm. When the SAMME.C2 algorithm is used, test errors are low at the low level of classification difficulty and rapidly become worse as the difficulty increases.

On the other hand, all three figures show that the SAMME.C2 algorithm produces a better MAvG at every level of classification difficulty. We further note that at the high level of classification difficulty, SAMME.C2 produces a much better MAvG than the SAMME algorithm, in spite of its worse test errors. This leads us to infer that, in order to achieve a higher accuracy for the minority class, SAMME.C2 has to sacrifice accuracy for the majority class. This becomes clearer in the subsequent figure.

Figure 6: Comparison of Recall statistics of each class between SAMME.C2 and SAMME with 1,000 decision stumps using the dataset of low level (Top), medium level (Middle) and high level (Bottom) difficulty of classification.

In Figure 6, we can observe the mechanism of SAMME.C2 in more detail by examining the Recall statistics of each of the three classes. Regardless of the complexity of the classification task, SAMME.C2 classifies the minority classes much more accurately than SAMME. However, this accuracy for the minority classes is gained by sacrificing accuracy for the majority class. In other words, the primary difference between SAMME.C2 and SAMME is whether the model is trained to reduce test errors or to achieve a more balanced classification accuracy across all classes. Figure 6 shows that, as the difficulty of the classification task increases, SAMME.C2 has to give up correspondingly more accuracy in the majority class to correctly classify observations in the minority class. This is a very important result because, when observations in the minority class of a severely imbalanced dataset are extremely difficult to classify, SAMME assigns nearly all observations to the majority class. Put differently, SAMME assigns nearly no observations to the minority class.

The number of iterations needed to reach an optimal classifier is directly linked to the number of decision stumps we use as weak learners. The more weak learners we use, the closer we get to convergence of the MAvG performance metric, but each additional weak learner adds to the computational cost of the iterative algorithm. We therefore look for a reasonable number of decision stumps for the SAMME.C2 algorithm by exploring how the value of MAvG changes with the number of decision stumps. Figure 7 exhibits the results of this investigation.

In the figure, for each level of difficulty of the classification task, we examine how changing the number of trees affects the MAvG reached after training. The number of decision stumps (trees) is set to 50, to 100, and then increased in steps of 100 up to 1,000. Each time we train a model, the cost values are tuned anew through the Genetic Algorithm. The solid lines in Figure 7 show the 5-fold cross validation MAvGs. For reference, the corresponding 5-fold cross validation accuracy values are shown as dashed lines. For all levels of difficulty, MAvGs increase sharply up to 200 decision stumps; beyond that, they do not improve significantly as the number of decision stumps increases. Therefore, we conclude that at least 200 decision stumps are necessary for SAMME.C2 to perform suitably and favorably.
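One way to turn this reading of the curve into a rule is sketched below: pick the smallest number of decision stumps whose cross-validated MAvG is within a tolerance of the best value observed. The helper name `smallest_sufficient_size`, the tolerance, and the score values are all hypothetical placeholders chosen for illustration; they are not results from our experiments.

```python
def smallest_sufficient_size(sizes, scores, tol):
    """Return the smallest ensemble size whose score is within `tol` of the best score."""
    best = max(scores)
    for size, score in sorted(zip(sizes, scores)):
        if score >= best - tol:
            return size


# Grid of ensemble sizes as described in the text: 50, 100, then steps of 100 up to 1,000.
sizes = [50, 100] + list(range(200, 1001, 100))

# Placeholder scores shaped like a curve that rises sharply and then flattens.
# These numbers are made up for illustration only.
placeholder_scores = [0.40, 0.55, 0.70, 0.70, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71]

chosen = smallest_sufficient_size(sizes, placeholder_scores, tol=0.02)
```

The tolerance encodes what "do not improve significantly" means in practice, so it should be set relative to the fold-to-fold variation of the cross-validated scores.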

Figure 7: 5-fold cross validation of MAvG and accuracy of SAMME.C2 with adjustments of the number of decision stumps (trees).

Finally, we examine the number of populations in GA, explained in Section 2.2, needed to tune the cost values of each class for SAMME.C2. To narrow the search interval, the cost value of the smallest minority class is fixed at 0.999. Since the largest cost should go to the smallest minority class, the cost values of the other classes must lie between 0 and 0.999. Initial experiments showed that, when we run SAMME.C2 with over 200 decision stumps, the best cost values chosen by GA lie between 0.95 and 0.999. Based on these results, we search for the optimal cost values in the interval (0.95, 0.999) and allow for randomness of around 0.001 in the mutation step of GA. Figure 8 shows the 10 values of MAvG corresponding to the 10 cost values in each population. As explained in Section 2.2, the set of 10 cost values of each population is determined by the 10 MAvG values calculated with SAMME.C2 trained using the set of 10 cost values of the previous population. For all three levels of classification difficulty, we arrive at the best cost values quite rapidly: just after the 4th population, the largest MAvG of each population is nearly the same. Tuning the cost values therefore does not slow the overall estimation and training of the SAMME.C2 algorithm.
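The search just described can be sketched as a small genetic algorithm. The interval (0.95, 0.999), the fixed cost of 0.999, the population size of 10, and the mutation scale of 0.001 come from the text; everything else is an assumption made for illustration, including truncation selection, uniform crossover, elitism, Gaussian mutation, and the `toy_fitness` function, which merely stands in for the cross-validated MAvG of a trained SAMME.C2 model. The operators actually used are those of Section 2.2.

```python
import random

LOW, HIGH = 0.95, 0.999   # search interval for the free cost values (from the text)
MINORITY_COST = 0.999     # cost fixed for the smallest minority class (from the text)
POP_SIZE = 10             # candidate cost sets per population (from the text)
MUTATION_SD = 0.001       # scale of the mutation noise (from the text)


def toy_fitness(costs):
    """Hypothetical stand-in for the cross-validated MAvG of a trained model."""
    target = (0.97, 0.98)  # arbitrary optimum, for illustration only
    return -sum((c - t) ** 2 for c, t in zip(costs, target))


def clip(x):
    """Keep a cost value inside the search interval."""
    return min(HIGH, max(LOW, x))


def evolve(fitness, n_free=2, n_populations=10, seed=0):
    """Return the best cost set found and the best fitness of each population."""
    rng = random.Random(seed)
    population = [tuple(rng.uniform(LOW, HIGH) for _ in range(n_free))
                  for _ in range(POP_SIZE)]
    best_per_population = []
    for _ in range(n_populations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_population.append(fitness(ranked[0]))
        parents = ranked[: POP_SIZE // 2]   # assumed: keep the better half as parents
        children = [ranked[0]]              # assumed: elitism, best candidate survives
        while len(children) < POP_SIZE:
            a, b = rng.sample(parents, 2)
            # Assumed: uniform crossover followed by small Gaussian mutation.
            child = tuple(clip(rng.choice(pair) + rng.gauss(0.0, MUTATION_SD))
                          for pair in zip(a, b))
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best + (MINORITY_COST,), best_per_population


costs, history = evolve(toy_fitness)
```

In an actual run, each fitness evaluation would train SAMME.C2 with the candidate costs, so the number of populations directly multiplies the training cost.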

Figure 8: Number of populations in the Genetic Algorithm (GA) to tune the cost values for each class, according to level of difficulty: top (low level), middle (medium level), and bottom (high level).

We have used numerical experiments to gain a better understanding of SAMME.C2, especially in comparison with the SAMME algorithm. We find that SAMME.C2 is a far superior algorithm for learning from observations in the minority class, regardless of the level of classification difficulty embedded in the data. We have also examined how SAMME.C2 performs relative to other algorithms that handle severely imbalanced classes, using insurance telematics data; see so2021cost.

5 Concluding remarks

Because of its potential use in a vast array of disciplines, classification predictive modeling will continue to be an important tool in machine learning. One of the most challenging aspects of a classification task is finding an optimal procedure to handle observational data with a skewed distribution across several classes. There is now a growing body of literature that deals with real-world classification tasks involving highly imbalanced multi-class problems. In spite of this growing demand, there is insufficient work on methods to handle severely imbalanced data in multi-class classification.

In this paper, we presented what we believe is a promising algorithm for handling severely imbalanced multi-class classification. The proposed method, which we refer to as SAMME.C2, combines the benefits of iterative learning from weak learners through the AdaBoost scheme and increased repeated learning of observations in the minority class through a cost-sensitive learning scheme. We provided a mathematical proof that the optimal procedure resulting in SAMME.C2 is equivalent to an additive model with a minimization of a multi-class cost-sensitive exponential loss function. The algorithm therefore belongs to the traditional statistical family of forward stagewise additive models. We additionally showed that based on the same multi-class cost-sensitive exponential loss function, SAMME.C2 is an optimal Bayes classifier.

In order to expand our insights into SAMME.C2 relative to SAMME, our numerical experiments focus on the differences that arise under differing levels of difficulty in the classification task. We therefore synthetically generated three simulated datasets that are distinguished according to these degrees of classification difficulty. First, we note that the straightforward misclassification rate, or test error, does not work well for severely imbalanced datasets. As has been proposed in the literature, MAvG, a geometric average of the recall statistics for all classes, is a more rational performance metric, as it places emphasis on learning well from observations that belong to the minority classes. By recording and tracking test errors, MAvGs, and recall statistics, the results of our numerical experiments reveal the superiority of SAMME.C2 in classifying objects that belong to the minority class, regardless of the degree of classification difficulty. This comes at the expense of lower recall statistics for the majority class. For SAMME.C2, the recall statistics of the minority classes are much higher at each iteration than those of SAMME, but the recall statistics of the majority classes are lower at all iterations than those of SAMME. We also showed the computational efficiency of SAMME.C2 by investigating the number of weak learners, or iterations, needed to reach convergence. Based on our analysis, the iteration can reasonably be stopped after training as few as 200 decision stumps as weak learners.

Appendix A. Detailed steps of the various algorithms

Data: ,
Input:
Output: Final classifier
Set initial sample weights equally distributed: ;
for  do
      Train weak classifier using the distribution ;
      Get weak classifier ;
      Compute error rate ;
      Calculate weight ;
      Update sample weights ;
     
      end for
Algorithm 1 AdaBoost.M1
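
For readers who prefer code, the loop of AdaBoost.M1 can be sketched as follows. This is a minimal illustration in the common textbook form (classifier weight log((1 - err) / err), misclassified sample weights multiplied by the exponential of that weight, then renormalized), not a transcription of the listing above. The one-feature stump learner `fit_stump`, the stopping rule at an error of 1/2, and the small clamp on the error rate are our own assumptions.

```python
import math


def fit_stump(xs, ys, w):
    """Weighted one-feature decision stump: choose the threshold and the pair of
    labels (left, right) with the smallest weighted error."""
    labels = sorted(set(ys))
    best = None
    for thr in sorted(set(xs)):
        for left in labels:
            for right in labels:
                err = sum(wi for x, y, wi in zip(xs, ys, w)
                          if (left if x <= thr else right) != y)
                if best is None or err < best[0]:
                    best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right


def adaboost_m1(xs, ys, n_rounds):
    """Return a list of (alpha, weak_classifier) pairs."""
    n = len(xs)
    w = [1.0 / n] * n                      # equal initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        h = fit_stump(xs, ys, w)
        miss = [h(x) != y for x, y in zip(xs, ys)]
        err = sum(wi for wi, m in zip(w, miss) if m) / sum(w)
        if err >= 0.5:                     # weak learner no better than chance: stop
            break
        err = max(err, 1e-10)              # assumed clamp to avoid log of infinity
        alpha = math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Up-weight misclassified observations, then renormalize.
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of the weak classifiers."""
    votes = {}
    for alpha, h in ensemble:
        votes[h(x)] = votes.get(h(x), 0.0) + alpha
    return max(votes, key=votes.get)


# Toy data: the positive class sits in the middle, so no single stump fits it.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 1, 0, 0]
ensemble = adaboost_m1(xs, ys, n_rounds=3)
fitted = [predict(ensemble, x) for x in xs]
```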

Data: ,
Input: ,
Output: Final classifier
Set initial sample weights equally distributed: ;
for  do
      Train weak classifier using the distribution ;
      Get weak classifier ;
      Compute error rate ;
      Choose weight ;
      Update sample weights ;
     
      end for
Algorithm 2 Ada.C2: cost-sensitive binary AdaBoost
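
A single round of the cost-sensitive update can be sketched as below. It assumes the usual formulation of Ada.C2 for labels in {-1, +1}, in which the cost of each observation multiplies its weight outside the exponent and the classifier weight is half the log ratio of cost-weighted correct mass to cost-weighted incorrect mass; it is not a transcription of the listing above. The function name `adac2_round` and the cost values in the example are hypothetical.

```python
import math


def adac2_round(weights, costs, y, pred):
    """One Ada.C2 round for labels in {-1, +1}.

    Returns (alpha, new_weights). Assumes at least one correct and one
    incorrect prediction, so that both sums below are positive.
    """
    correct = sum(c * w for c, w, yi, pi in zip(costs, weights, y, pred) if yi == pi)
    wrong = sum(c * w for c, w, yi, pi in zip(costs, weights, y, pred) if yi != pi)
    alpha = 0.5 * math.log(correct / wrong)
    # Cost enters outside the exponent: costly observations gain weight every round.
    raw = [c * w * math.exp(-alpha * yi * pi)
           for c, w, yi, pi in zip(costs, weights, y, pred)]
    z = sum(raw)                            # normalization constant
    return alpha, [r / z for r in raw]


# One minority observation (+1) and three majority observations (-1);
# the weak classifier predicts the majority class for everything.
y = [1, -1, -1, -1]
pred = [-1, -1, -1, -1]
weights = [0.25] * 4
costs = [1.0, 0.5, 0.5, 0.5]                # hypothetical: higher cost for the minority
alpha, new_weights = adac2_round(weights, costs, y, pred)
```

After the update the misclassified minority observation carries as much weight as the three majority observations combined, which is how the cost-sensitive scheme forces later weak learners to attend to the minority class.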

Data: ,
Input:
Output: Final classifier
Set initial sample weights equally distributed: ;
for  do
      Train weak classifier using the distribution ;
      Get weak classifier ;
      Compute error rate