1 Introduction
In machine learning, classification involves the accurate assignment of a target class or label to input observations. When there are only two labels, the task is called binary classification; when there are more than two, it is often referred to as multiclass classification. Classification algorithms generally fall into three categories (see hastie2009):
Linear classifiers: This type of algorithm separates the input observations using a line or hyperplane based on a linear combination of the input features. Examples include logistic regression, probit regression, and linear discriminant analysis.

Nonlinear classifiers: This type of algorithm separates the input observations based on nonlinear functions. Examples include decision trees, k-nearest neighbors (KNN), support vector machines (SVM), and neural networks.

Ensemble methods: This type of algorithm combines the predictions produced by multiple models. Examples include random forests, stochastic gradient boosting (e.g., XGBoost), and adaptive boosting (AdaBoost).
In imbalanced classification, the distribution of observations across classes is biased or skewed. In this case, minority classes have far fewer observations to learn from than majority classes. In spite of this sparsity, the minority class is often considered the more interesting class, yet developing a sound learning algorithm suitable for such observations presents countless challenges. Much research has dealt with imbalanced data, though mostly in the context of binary problems. The commonplace and more direct approach is to use the algorithms listed above and handle the class imbalance at the data level. In this case, the class distribution of the input observations is rebalanced by oversampling (or undersampling) from the underrepresented (or overrepresented) classes. One popular approach is the oversampling of underrepresented classes based on
SMOTE (Synthetic Minority Oversampling Technique), a technique developed by chawla2002smote. It is worth noting that generating synthetic observations to rebalance class distributions, especially in multiclass classification, has the disadvantage of increasing class overlap with unnecessary additional noise. One popular class of algorithms, believed to be among the most powerful techniques, is boosting, which trains a sequence of weak models into a strong learner in order to improve predictive power. A specific boosting technique primarily developed for classification is AdaBoost, a class of so-called adaptive boosting algorithms. AdaBoost.M1, which combines several weak classifiers to produce a strong classifier, is the first practical boosting algorithm, introduced by freund1997decision. AdaBoost.M1 is an iterative process that starts with a distribution of equal observation weights. At each iteration, the process fits one weak classifier and subsequently adjusts the observation weights, with more weight given to input observations that have been misclassified, allowing for increased learning. See Algorithm 1 in Appendix A.
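As a point of reference, the adaptive boosting scheme just described is available off the shelf; a minimal sketch using scikit-learn's AdaBoostClassifier (whose default weak learners are decision stumps) might look as follows. The toy dataset and parameter values here are illustrative assumptions, not those used later in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# illustrative toy dataset; not the simulation of Section 4
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

# AdaBoost fits one weak classifier per iteration, reweighting
# misclassified observations before the next fit
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```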
AdaBoost.M1 has been extended to handle multiclass classification problems. One such extension is the so-called AdaBoost.M2, developed by freund1997decision, which is based on the optimization of a pseudo-loss function suitable for handling multiclass problems. Another extension is AdaBoost.MH, developed by schapire1999improv, which is based on the optimization of the Hamming loss function. Both of these extensions solve multiclass classification problems by reducing them into several different binary problems; such procedures can be slow and inefficient. A more popular multiclass AdaBoost extension is the algorithm called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) proposed by zhu2009mclass, which avoids these computational inefficiencies by not resorting to multiple binary problems. See so2021cost for details of the iteration process of this algorithm. According to friedman2000addlog and hastie2009, the SAMME algorithm is equivalent to an additive model minimizing a multiclass exponential loss function and belongs to the traditional statistical family of forward stagewise additive models. Additional variations of these AdaBoost algorithms appear in ferreira2012review; the recent work of tanha2020boosting provides a comprehensive survey.
In order to further improve prediction within imbalanced classification, cost-sensitive learning algorithms provide a necessary additional layer of complexity that takes costs into consideration. The work of pazzani1994reduce was the first to introduce cost-sensitive algorithms that minimize misclassification costs in classification problems. The cost values, estimated as hyperparameters, are additional inputs to the learning procedure and are generally used to reduce misclassification costs by attaching penalties to predictions that lead to significant errors. Within the context of adaptive boosting algorithms, these costs are used to modify the updating of the observation weights at each iteration. For binary classification, Ada.C2 is the most well-known and attractive AdaBoost method that incorporates cost-sensitive learning (sun2007cost). For details of this algorithm, please see so2021cost.

In this article, we suggest a novel multiclass classification algorithm, which we refer to as SAMME.C2, especially designed to handle imbalanced classes. This algorithm combines the advantages drawn from the two algorithms described earlier: (1) SAMME, an AdaBoost algorithm for multiclass classification that does not decompose the classification task into multiple binary problems, thereby avoiding computational inefficiencies, and (2) the cost-sensitive learning employed in Ada.C2. zhu2009mclass showed that SAMME is equivalent to forward stagewise additive modeling with minimization of a multiclass exponential loss function and proved that it yields a Bayes classifier. These mathematical proofs are important statistical justifications that the resulting classifiers are optimal. However, the training objective of the SAMME algorithm is to reduce test error rates, and this works quite well when classes are generally balanced. When classes are severely imbalanced, the SAMME algorithm places more observation weight on classifying majority classes accurately because this contributes more to decreasing test errors. In turn, this results in a huge sacrifice in the ability to accurately classify minority classes. This leads us to embrace the idea of adding the attributes of cost-sensitive learning techniques to this algorithm. When cost-sensitive learning is added to SAMME, SAMME.C2 is able to control these peculiar issues attributable to class imbalance. This article extends the mathematical proofs to show that, with the addition of cost values, SAMME.C2 retains the same statistical foundations as SAMME.
The practical importance of multiclass classification tasks, especially with severely imbalanced classes, extends to multiple disciplines. Various ad hoc algorithms, some of which are described above, have been employed. The works of liu2017hybrid, yuan2018regularized, jeong2020comparison, and mahmudah2021machine address real-life biomedical applications of such classification tasks in the detection of disease. Spam detection is widely studied in computer engineering; see mohammad2020improved, talpur2020multi, and dewi2017multiclass. The study by kim2016detecting applies multiclass classification with cost-sensitive learning mechanisms to detect financial misstatements associated with fraud intention in finance. In operations research, han2019fault proposes a fault diagnosis model for planetary gear carrier packs as a detection tool for manufacturing faults. Finally, in insurance, so2021cost examines the frequency of accidents as a multiclass classification problem with highly imbalanced classes using observations of insured drivers with additional telematics information about driving behavior through a usage-based insurance policy.
The remainder of the paper is as follows. Section 2 introduces the details of the new SAMME.C2 algorithm, which is largely based on the integration of SAMME and Ada.C2. Section 3 presents the mathematical proofs that SAMME.C2 follows a forward stagewise additive model and is an optimal Bayes classifier. To demonstrate the algorithmic superiority of SAMME.C2, Section 4 presents numerical experiment results based on simulated datasets. To show the many varied applications of our work, this section additionally lists some practical research on multiclass classification. Section 5 concludes the paper.
2 The SAMME.C2 algorithm
For our purpose, let us consider a set of input observations denoted by $(\boldsymbol{x}_i, y_i)$ for $i = 1, \ldots, n$, where $\boldsymbol{x}_i$ is a set of feature variables and $y_i$ is the target classification variable belonging to one of $K$ classes. In the binary case, $K = 2$. An important input variable is the cost value for $y_i$, which we denote here as $C(y_i)$ to emphasize that it is a function of the target variable, predetermined by the hyperparameter optimization technique described below.
SAMME.C2 combines the benefits of boosting and cost-sensitive algorithms for handling class imbalances in multiclass classification problems. Given the input data $(\boldsymbol{x}_i, y_i, C(y_i))$, $i = 1, \ldots, n$, the algorithm is an iterative process of fitting weak classifiers, denoted by $h^{(t)}$ at iteration $t$, and the process stops at time $T$. The stopping time $T$ can be a tuned hyperparameter. At iteration $t = 1$, we set equal observation weights $w_i^{(1)} = 1/n$. In each subsequent iteration $t$, we train a weak classifier using the distribution $\boldsymbol{w}^{(t)} = (w_1^{(t)}, \ldots, w_n^{(t)})$. Any weak classifier can be used, but for our purpose, the simplest weak classifiers are decision stumps. We update the distribution of the observation weights using
(1) $w_i^{(t+1)} = \dfrac{C(y_i)\, w_i^{(t)} \exp\!\big(\alpha^{(t)}\, \mathbb{1}(y_i \neq h^{(t)}(\boldsymbol{x}_i))\big)}{\sum_{j=1}^{n} C(y_j)\, w_j^{(t)} \exp\!\big(\alpha^{(t)}\, \mathbb{1}(y_j \neq h^{(t)}(\boldsymbol{x}_j))\big)}$

which depends on the error rate of the $t$th weak classifier given by

(2) $err^{(t)} = \sum_{i=1}^{n} w_i^{(t)}\, \mathbb{1}\big(y_i \neq h^{(t)}(\boldsymbol{x}_i)\big)$

and the weight of the $t$th weak classifier given by

(3) $\alpha^{(t)} = \log\dfrac{1 - err^{(t)}}{err^{(t)}} + \log(K - 1)$

The final classifier is then determined at the final iteration $T$ using

(4) $H(\boldsymbol{x}) = \underset{k}{\arg\max} \sum_{t=1}^{T} \alpha^{(t)}\, \mathbb{1}\big(h^{(t)}(\boldsymbol{x}) = k\big)$
Details of the algorithm are given in Algorithm 4 in the appendix.
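The iteration just described can be sketched in a few lines of Python. The snippet below is a simplified illustration rather than the authors' implementation: decision stumps serve as weak learners, a hypothetical `cost` array holds the per-observation cost values $C(y_i)$, and classes are assumed to be labeled 0, ..., K-1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def samme_c2_fit(X, y, cost, K, T=200):
    """Sketch of the SAMME.C2 iteration with decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # equal initial weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = w[miss].sum() / w.sum()         # weighted error rate
        if err <= 0 or err >= 1.0 - 1.0 / K:  # must beat multiclass random guessing
            break
        alpha = np.log((1 - err) / err) + np.log(K - 1)
        w = cost * w * np.exp(alpha * miss)   # cost-weighted update of the weights
        w /= w.sum()                          # normalize back to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def samme_c2_predict(stumps, alphas, X, K):
    """Weighted vote of the weak classifiers, as in the final classifier."""
    votes = np.zeros((len(X), K))
    for stump, alpha in zip(stumps, alphas):
        votes[np.arange(len(X)), stump.predict(X)] += alpha
    return votes.argmax(axis=1)

# hypothetical demo data: class 2 is the minority and gets the largest cost
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           weights=[0.70, 0.25, 0.05], random_state=0)
cost = np.where(y == 2, 0.999, 0.95)
stumps, alphas = samme_c2_fit(X, y, cost, K=3, T=25)
pred = samme_c2_predict(stumps, alphas, X, K=3)
```

Setting all entries of `cost` to 1 recovers plain SAMME under this sketch.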
2.1 Comparison with Ada.C2 and SAMME
The iteration processes for the three algorithms (Ada.C2, SAMME, and SAMME.C2) are essentially the same. However, the primary differences lie in the computation of the error rate and the weight of the $t$th classifier, as well as in the updating of the distribution of the observation weights.
For SAMME and SAMME.C2, the computation of the error rate is exactly the same, even though the SAMME algorithm does not have cost values. For Ada.C2, unlike SAMME.C2, the cost values are used to compute the error rate of the $t$th classifier using

(5) $err^{(t)} = \dfrac{\sum_{i=1}^{n} C(y_i)\, w_i^{(t)}\, \mathbb{1}\big(y_i \neq h^{(t)}(\boldsymbol{x}_i)\big)}{\sum_{i=1}^{n} C(y_i)\, w_i^{(t)}}$

For SAMME and SAMME.C2, the computation of the weight of the $t$th classifier is exactly the same, even though the SAMME algorithm does not have cost values. For Ada.C2, the weight of the $t$th classifier is given by

(6) $\alpha^{(t)} = \dfrac{1}{2} \log\dfrac{1 - err^{(t)}}{err^{(t)}}$
For the misclassified training samples to be properly boosted, the classification error at each iteration should be less than $1/2$; otherwise, $\alpha^{(t)}$, which is a function of the classification error, will be negative and the observation weights will be updated in the wrong direction. In that case, after the iteration, the classification error can no longer be improved. In the binary case, as in Ada.C2, this just requires that each weak learner performs a little better than random guessing. However, when $K > 2$, the random guessing accuracy rate is $1/K$, which is less than $1/2$. Hence, multiclass problems need a much more accurate weak learner than binary problems, and if the weak learner is not chosen and trained accurately enough, the algorithm may fail. zhu2009mclass pointed this out and suggested the SAMME algorithm, which directly extends AdaBoost.M1 to the multiclass case by adding one term, $\log(K-1)$, to the updating equation of $\alpha^{(t)}$ at each iteration $t$.
The updating of the distribution of the observation weights for the subsequent iteration is exactly the same for the Ada.C2 and SAMME.C2 algorithms; this is not at all surprising since both algorithms consider cost values. For the SAMME algorithm, which does not have cost values, the distribution of the observation weights for the subsequent iteration is given by

(7) $w_i^{(t+1)} = \dfrac{w_i^{(t)} \exp\!\big(\alpha^{(t)}\, \mathbb{1}(y_i \neq h^{(t)}(\boldsymbol{x}_i))\big)}{\sum_{j=1}^{n} w_j^{(t)} \exp\!\big(\alpha^{(t)}\, \mathbb{1}(y_j \neq h^{(t)}(\boldsymbol{x}_j))\big)}$
The updating principle is based on how the algorithm correctly classifies (or misclassifies) majority and minority classes. For the SAMME algorithm, without cost values, there is an even redistribution of weight upon correct classification (or misclassification) regardless of whether an observation belongs to a majority or minority class. For Ada.C2 and SAMME.C2, with the addition of cost values, the redistribution becomes uneven, assigning heavier weights to observations that belong to minority classes. This leads us to conclude that, after a sufficient number of iterations, for cost-sensitive learning mechanisms, weak classifiers are trained with a heavy emphasis on misclassified observations in the minority class. See Figure 1 of so2021cost.
For a graphical display of the iteration process with emphasis on these differences, please refer to Figure 1. It can be noted that SAMME is a special case of SAMME.C2 obtained by setting all the cost values to 1, that is, $C(y_i) = 1$ for all $i$.
2.2 The cost optimization
The critical work involved in implementing SAMME.C2 is the process of determining the cost value given to each class. From the perspective of SAMME.C2, because cost values can be regarded as hyperparameters, this process amounts to optimizing or tuning a hyperparameter in a learning algorithm. Various hyperparameter optimization methods may be used to optimize the cost values. Some of the frequently used strategies are grid search, random search (bergstra2012randomsearch), and sequential model-based optimization (bergstra2011smbo). The simplest and most widely used optimization algorithms are grid search and random search. However, since the next trial set of hyperparameters is not chosen based on previous results, these methods are time-consuming. One of the most powerful strategies is sequential model-based optimization, sometimes referred to as Bayesian optimization, in which the subsequent set of hyperparameters is determined based on the results of previously evaluated sets of hyperparameters.
bergstra2011smbo and snoek2012practical showed that sequential model-based optimization outperforms both grid and random searches. However, using sequential model-based optimization requires an advanced level of statistical knowledge. For our purpose with SAMME.C2, we employ the Genetic Algorithm (GA), which is simple, easily understandable, and at the same time computationally efficient. Developed by holland1975 and described in muhlenbein1997GA, GA is a kind of random search technique, but its primary difference from general random searches is that, just as in sequential model-based optimization, the subsequent trial set of hyperparameters is decided based on the results of previously evaluated sets of hyperparameters.

In this algorithm, we first create a population set consisting of arbitrary cost vectors, one element per class for a $K$-class problem. We then run SAMME.C2 and perform an evaluation step to obtain the performance metric corresponding to each cost vector. Here, the performance metric serves as the objective function.
In the selection step, two cost vectors are chosen from the population, employing the “choice by roulette” method typically used as an operator in GA, with the objective of selecting cost vectors having a larger performance metric with higher probability.

In the crossover step, we combine the two selected cost vectors into a single vector using an arithmetic average.

In the mutation step, we pick a random number within a tiny interval and use it to adjust the elements in the cost vector.
Repeating these selection, crossover, and mutation steps, we produce a new population of cost vectors, and the procedure is iteratively repeated a number of times to generate the population that will produce the optimal cost vectors.
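The selection, crossover, and mutation steps above can be sketched as a bare-bones GA loop. The population size, bounds, and objective below are placeholders (the actual objective in the paper is the MAvG of a trained SAMME.C2 model), and fitness values are assumed positive so that the roulette-wheel probabilities are well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_search(objective, n_costs, pop_size=10, n_generations=5,
              low=0.95, high=0.999, mutation=0.001):
    """Roulette selection, arithmetic-average crossover, tiny-interval mutation."""
    pop = rng.uniform(low, high, size=(pop_size, n_costs))
    for _ in range(n_generations):
        fitness = np.array([objective(c) for c in pop])  # assumed positive
        probs = fitness / fitness.sum()                   # roulette wheel
        children = []
        for _ in range(pop_size):
            i, j = rng.choice(pop_size, size=2, p=probs)        # selection
            child = (pop[i] + pop[j]) / 2.0                     # crossover
            child += rng.uniform(-mutation, mutation, n_costs)  # mutation
            children.append(np.clip(child, low, high))
        pop = np.array(children)
    fitness = np.array([objective(c) for c in pop])
    return pop[fitness.argmax()]

# toy objective standing in for the MAvG of a trained SAMME.C2 model
target = np.array([0.97, 0.98])
best = ga_search(lambda c: 1.0 / (1.0 + np.abs(c - target).sum()), n_costs=2)
```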
3 Proof of optimality
In this section, we provide a theoretical justification of the SAMME.C2 algorithm. Recall that an advantage of the SAMME algorithm is that it is statistically explainable and justifiable. In particular, zhu2009mclass proved that the SAMME algorithm is equivalent to fitting a forward stagewise additive model using a multiclass exponential loss function expressed as

(8) $L(\boldsymbol{y}, \boldsymbol{f}) = \exp\!\left(-\dfrac{1}{K}\, \boldsymbol{y}^{\top} \boldsymbol{f}\right)$

In the same fashion, we demonstrate that the addition of cost-sensitive learning to SAMME preserves these same theoretical properties. To prove this, instead of (8), we use a loss function multiplied by cost values, which we call a multiclass cost-sensitive exponential loss function, expressed as

(9) $L(\boldsymbol{y}, \boldsymbol{f}) = C(Y) \exp\!\left(-\dfrac{1}{K}\, \boldsymbol{y}^{\top} \boldsymbol{f}\right)$
Just as in the work of zhu2009mclass, we justify the use of the multiclass cost-sensitive exponential loss function in (9) by first showing that the classifier minimizing (9) is the optimal Bayes classifier. The symbols $\boldsymbol{y}$, $\boldsymbol{f}$, and the cost values will be precisely defined in the subsequent subsections.
3.1 Terminology
Suppose we are given a set of data denoted by $(\boldsymbol{x}_i, y_i, C(y_i))$ for $i = 1, \ldots, n$, where $\boldsymbol{x}_i$ is a set of feature variables, $y_i$ is the corresponding response, a classification variable that belongs to the set $\{1, \ldots, K\}$, and $C(y_i)$ is the corresponding cost value, which is a function of $y_i$. For each observation, we attach a cost value that depends on which class the observation belongs to; these values are generated outside the algorithm but are based on the minority/majority characteristics of the classification variable. The objective is to learn from the data so that we can build a predictive model identifying the class a particular observation belongs to, given the set of feature variables. Without loss of generality, we recode the response $y_i$ as a $K$-dimensional vector $\boldsymbol{y}_i$; all entries in this vector will be equal to $-\frac{1}{K-1}$ except for a value of 1 in position $k$ if the observation belongs to class $k$. In effect, we have $\boldsymbol{y}_i = (y_{i1}, \ldots, y_{iK})^{\top}$, where

$y_{ik} = \begin{cases} 1, & \text{if observation } i \text{ belongs to class } k, \\ -\dfrac{1}{K-1}, & \text{otherwise.} \end{cases}$

This recoding carries over the symmetry of class label representation in the binary case (lee2004multicategory). It is straightforward to show that $\sum_{k=1}^{K} y_{ik} = 0$ for all $i$. There is a one-to-one correspondence between $y_i$ and $\boldsymbol{y}_i$, and the two will be used interchangeably for convenience and clarity whenever possible; each equivalently refers to the class the observation belongs to.
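As a small illustration of this recoding (assuming classes labeled 0, ..., K-1 for array indexing), the following sketch builds the symmetric vector and lets one check that its entries sum to zero:

```python
import numpy as np

def recode(k, K):
    """Return the symmetric K-vector encoding of class k:
    1 in position k, -1/(K-1) everywhere else."""
    v = np.full(K, -1.0 / (K - 1))
    v[k] = 1.0
    return v

y_vec = recode(2, K=3)   # class 2 of 3 -> [-0.5, -0.5, 1.0]
```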
3.2 The loss function for the optimal Bayes classifier
This section provides a theoretical justification for the use of the multiclass cost-sensitive exponential loss function in (9) in the optimization leading to the SAMME.C2 algorithm. More specifically, we show here that the resulting classifier is an optimal Bayes classifier. It is well known in classification problems that this produces a classifier that minimizes the probability of misclassification. See hastie2009.

Lemma 3.1.
Denote $Y$ to be the classification variable with possible values belonging to $\{1, \ldots, K\}$, $\boldsymbol{y}$ to be the recoding of this variable as explained above, $C(\cdot)$ to be the cost values, and $\boldsymbol{f}(\boldsymbol{x}) = (f_1(\boldsymbol{x}), \ldots, f_K(\boldsymbol{x}))^{\top}$ to be the classifier function. The following result leads us to the optimal classifier function under the multiclass cost-sensitive exponential loss function:

$\boldsymbol{f}^*(\boldsymbol{x}) = \underset{\boldsymbol{f}(\boldsymbol{x})}{\arg\min}\; \mathbb{E}_{Y \mid \boldsymbol{x}}\!\left[ C(Y) \exp\!\left(-\tfrac{1}{K}\, \boldsymbol{y}^{\top} \boldsymbol{f}(\boldsymbol{x})\right)\right]$

subject to $f_1(\boldsymbol{x}) + \cdots + f_K(\boldsymbol{x}) = 0$, where

(10) $f_k^*(\boldsymbol{x}) = (K-1)\left(\log\big(C(k)\, P(Y = k \mid \boldsymbol{x})\big) - \dfrac{1}{K} \sum_{j=1}^{K} \log\big(C(j)\, P(Y = j \mid \boldsymbol{x})\big)\right)$
Proof.
For this optimization, the Lagrangian can be written as
where is the Lagrange multiplier. By taking derivative with respect to and , we reach
and the constraint that
Next, by summing the first equations, we get
and by substituting the last equation, we obtain the following population minimizer, (10):
∎
Note that the constraints on $\boldsymbol{f}$ in Lemma 3.1 allow us to find the unique solution. The following proposition allows us to choose the optimal Bayes classifier.
Proposition 3.2.
Denote $Y$ to be the classification variable with possible values belonging to $\{1, \ldots, K\}$. Given the feature variables $\boldsymbol{x}$, we find the optimal Bayes classifier using the multiclass cost-sensitive exponential loss function:

$\underset{k}{\arg\max}\; f_k^*(\boldsymbol{x}) = \underset{k}{\arg\max}\; C(k)\, P(Y = k \mid \boldsymbol{x})$
Proof.
Proposition 3.2 provides a theoretical justification for our estimated classifier in the SAMME.C2 algorithm, and the subsequent proposition provides a formula for calculating the implied class probabilities within this framework.
Proposition 3.3.
The implied class probability under the optimal Bayes classifier has the form
Proof.
Equation (10) provides the class probabilities once the $f_k$'s are determined. ∎
3.3 SAMME.C2 as forward stagewise additive modeling
In this section, we show that our SAMME.C2 algorithm is indeed equivalent to a forward stagewise additive modeling based on the optimization of the multiclass costsensitive exponential loss function expressed in (9).
Using forward stagewise modeling for learning, the solution to (11) has the linear form
where is the total number of iterations, and are the basis functions with corresponding coefficient . We require that each basis function satisfies the symmetric constraint whereby for all so that takes only one of the possible values from the set
(12) 
Then, at iteration , the solution can be written as:
(13) 
Forward stagewise modeling finds the solution to (11) by sequentially adding new basis functions to the previously fitted model. Hence, at each iteration, we only need to solve the following
(14) 
where does not depend on either or and is equivalent to the unnormalized distribution of observation weights in the $t$th iteration in Algorithm 4. We notice that in (14) is in one-to-one correspondence with the multiclass classifier in Algorithm 4 in the following manner:
Therefore, in essence, solving for is equivalent to finding in Algorithm 4.
Proposition 3.4.
The solution to the optimization expressed as
has the following form:
(15)  
(16) 
where
and
Proof.
To find in (14), first, we fix . Let us consider the case where . We have
(17) 
On the other hand, when , we have
(18) 
Equations (17) and (18) lead us to
(19) 
From (19), since only the last sum depends on the classifier , for a fixed value of , solution to results in:
Now let us find given . If we define as
(20) 
plugging (17), (18), and (20) into (14), we would have
The summation component does not affect the minimization so that, differentiating with respect to and setting to zero, we get
and factoring out the term , we get
Since we are minimizing a convex function of , the optimal solution is
∎
The terms and are equivalent to those in Algorithm 4. Subsequently, we can deduce the updating equation for the distribution of the observation weights in Algorithm 4 after normalization.
Proposition 3.5.
The distribution of the observation weights at each iteration simplifies to
(21) 
Equation (21) is equivalent to the updating of the weights in Algorithm 4 after normalization.
Proof.
From equation (13), multiplying both sides by , exponentiating, and multiplying both sides by , we get
which can be written as
where we define
The weight at each iteration can further be simplified to
(22) 
Multiplying (22) by ,
which proves our proposition. To show equivalence to the updating of the weights in Algorithm (4), we note that the cases where and are equivalent to the cases where and , respectively. ∎
It should be straightforward to show that the final classifier is the solution
which is equivalent to
4 Numerical experiments
This section examines the differences between SAMME.C2 and SAMME in how each model is trained, further exploring the superiority of SAMME.C2 over SAMME in handling imbalanced data. To accomplish this, we make use of a simulated dataset with a highly imbalanced three-class response variable. To generate such a dataset, we utilize the Scikit-learn Python module described in pedregosa2011scikit. The make_classification Application Programming Interface is employed with the parameterization executed as

```python
"""Make Simulation"""
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000, n_features=50, n_informative=5,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=2, class_sep=2, flip_y=0,
                           weights=[0.90, 0.09, 0.01], random_state=16)
```
This script generates 100,000 samples with 50 features and 3 classes, deliberately creating a highly imbalanced dataset by setting the ratio of the classes to 90%, 9%, and 1%, respectively. Changing the class_sep parameter adjusts the difficulty of the classification task: the samples are no longer easily separable for lower values of class_sep. To investigate and compare the running processes of the algorithms at different levels of difficulty, three datasets were created by adjusting this parameter: for high classification difficulty, we set class_sep=1; for medium difficulty, class_sep=1.5; and for low difficulty, class_sep=2. In Figure 2, we visualize these three levels of classification difficulty, with each level separated by columns. The figure clearly shows that low classification difficulty means the samples are easily separable; the opposite holds for high classification difficulty. For ease of visualization, we use only 3 of the 50 features and exhibit the 3-dimensional data structure by pairing and drawing three 2-dimensional graphs. We kindly ask the reader to refer to the package documentation for further explanation of the other input parameters. For training, we use 75% of the data; the rest is used for testing.
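The 75/25 split mentioned above can be reproduced with scikit-learn's train_test_split. Stratifying on the response, so that the 90/9/1 class ratio is preserved in both sets, is our assumption here rather than something the text states; a smaller sample size is used to keep the sketch light.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=50, n_informative=5,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=2, class_sep=2, flip_y=0,
                           weights=[0.90, 0.09, 0.01], random_state=16)

# 75% for training, 25% for testing; stratify=y keeps the class ratio
# (an assumption on our part) identical across the two sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=16)
```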
For classification problems, the most common performance statistic is accuracy, the proportion of all observations that are correctly classified. For obvious reasons, this is an irrelevant measure for imbalanced datasets. As an alternative, we consider the Recall statistic to measure performance. The Recall statistic, sometimes called sensitivity, for class $k$, denoted $R_k$, is defined to be the proportion of observations in class $k$ correctly classified. It has been discussed (fernandez2018learning) that the Recall, or sensitivity, is usually a more interesting measure for imbalanced classification. To provide a single measure of performance for a given classifier, we use the geometric average of the Recall statistics, denoted MAvG, as follows:

(23) $\text{MAvG} = \left( \prod_{k=1}^{K} R_k \right)^{1/K}$
It is straightforward to show that taking the log of this performance metric yields the average of the logs of all the Recall statistics. This log transformation leads to a metric that weights the importance of accurately classifying observations equally across all classes. In the case of severely imbalanced datasets, the MAvG metric rewards correctly classifying more observations of the minority class even at the price of misclassifying observations in the majority class. In effect, the MAvG metric is a sensible performance measure for severely imbalanced datasets; this performance metric is also used as the criterion for the hyperparameter optimization in GA to determine the cost values used in the SAMME.C2 algorithm. The concept of MAvG for imbalanced datasets originated from the work of fowlkes1983gmean.
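Using per-class recalls from scikit-learn, the MAvG metric of (23) can be computed in a couple of lines; the function name `mavg` is ours:

```python
import numpy as np
from sklearn.metrics import recall_score

def mavg(y_true, y_pred):
    """Geometric average of the per-class Recall statistics."""
    recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# e.g., recalls of 1.0 and 0.5 give sqrt(0.5) ~ 0.707
score = mavg([0, 0, 1, 1], [0, 0, 1, 0])
```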
To examine the running processes of SAMME.C2 and SAMME, each algorithm was trained with 1,000 decision stumps on the three datasets. A decision stump is a decision tree of depth one, which plays the role of a weak learner in the algorithms. Figures 3, 4, and 5 show the resulting test errors and test MAvG of SAMME.C2 and SAMME after training each newly added decision tree on the datasets of varying (low, medium, and high) levels of difficulty. All figures are produced for an increasing number of iterations, with each iteration adding a new decision stump.
With the SAMME algorithm, the objective is to reduce the test error, or misclassification rate. Therefore, when the model is trained with severely imbalanced data, it puts more weight on the majority class, since the majority class can significantly reduce the test error. For example, based on the simulated datasets in these numerical experiments, a model can be constructed that assigns all observations to the majority class. In that case, we get a misclassification rate of 10%, which can be deemed small. Therefore, the test error is not a meaningful performance metric for severely imbalanced datasets. All figures show small test errors for the SAMME algorithm; with the SAMME.C2 algorithm, test errors are clearly low for the low level of classification difficulty but rapidly become worse for very high levels of difficulty.
On the other hand, all three figures show that the SAMME.C2 algorithm produces a better MAvG performance metric at every level of classification difficulty. We further note that in the case of high classification difficulty, SAMME.C2 produces a much improved MAvG metric relative to the SAMME algorithm, despite the worse test errors. This leads us to infer that in order to achieve higher accuracy for the minority class, SAMME.C2 has to sacrifice accuracy for the majority class. This becomes clearer in the subsequent figure.
In Figure 6, we can observe the mechanism of SAMME.C2 in more detail by examining the Recall statistics of each of the three classes. Regardless of the complexity of the classification task, SAMME.C2 classifies minority classes much more accurately than SAMME. However, accuracy on minority classes is gained by sacrificing accuracy on the majority class. In other words, the primary difference between SAMME.C2 and SAMME lies in whether the model is trained to reduce test errors or to achieve a more balanced classification accuracy across all classes. As the difficulty of the classification task increases, Figure 6 shows that, to correctly classify observations in the minority class, SAMME.C2 has to correspondingly reduce accuracy on observations in the majority class. This is a very important result, because when observations in the minority class of severely imbalanced datasets are extremely difficult to classify, SAMME assigns nearly all observations to the majority class. Differently said, SAMME assigns nearly no observations to the minority class.
The number of iterations required to reach an optimal classifier is directly linked to the number of decision stumps we use as weak learners. The more weak learners we use, the closer we can get to a desired convergence of our MAvG performance metric. In essence, this can impact the computational efficiency of our iterative algorithm. To investigate this, we search for a reasonable number of decision stumps for the SAMME.C2 algorithm by exploring the change in the value of MAvG vis-à-vis the number of decision stumps. Figure 7 exhibits the results of this investigation.
In the figure, for each level of difficulty of the classification task, we examine how changing the number of trees affects reaching the optimal MAvG performance metric during training. The figure shows the effects at the various levels of classification difficulty, varying the number of decision stumps, or trees, from 50 to 100, and then in intervals of 100 up to 1,000. Each time we train a model, the cost values are newly tuned through the Genetic Algorithm. The results in Figure 7 exhibit solid lines determined according to 5-fold cross-validation MAvGs. For reference purposes, we also give the corresponding 5-fold cross-validation accuracy values, shown as dashed lines. For all levels of difficulty, MAvGs increase sharply up to about 200 decision stumps; after that, MAvGs do not improve significantly with an increasing number of decision stumps. Therefore, we conclude that at least 200 decision stumps are needed for SAMME.C2 to perform suitably and favorably.
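A stump-count sweep of this kind can be mimicked with a 5-fold cross-validated MAvG score. The grid and dataset below are deliberately much smaller than the paper's (which goes up to 1,000 stumps on 100,000 samples), and plain AdaBoost stands in for SAMME.C2 since the cost-tuning step is omitted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score

def mavg_score(y_true, y_pred):
    """Geometric average of per-class recalls (the MAvG metric)."""
    recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# small imbalanced dataset; stand-in for the paper's 100,000-sample simulation
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.90, 0.09, 0.01], random_state=16)

results = {}
for n in (25, 50, 100):   # toy grid; the paper sweeps 50 to 1,000 stumps
    scores = cross_val_score(
        AdaBoostClassifier(n_estimators=n, random_state=0),
        X, y, cv=5, scoring=make_scorer(mavg_score))
    results[n] = scores.mean()
```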
Finally, we examine the proper number of populations in the GA explained in Section 2.2 for tuning the cost values of each class in SAMME.C2. To narrow the interval from which values are selected, the cost value of the smallest minority class is fixed at 0.999. Since we should give the largest cost to the smallest minority class, cost values for the other classes should obviously be between 0 and 0.999. Initial experiments demonstrated that, when we run SAMME.C2 with over 200 decision stumps, the best cost values chosen by GA lie between 0.95 and 0.999. Based on these results, we determine the optimal cost values by choosing from the interval (0.95, 0.999), and we allow for randomness of around 0.001 in the mutation step of GA. Figure 8 shows 10 values of MAvG according to the 10 cost values in each population. As explained in Section 2.2, the set of 10 cost values of each population is determined by the 10 MAvG values calculated with SAMME.C2 trained using the set of 10 cost values of the previous population. For all three levels of classification difficulty, we arrive at the best cost values quite rapidly. We observe that, just after the 4th population, the largest MAvG for each population is nearly the same. The assignment of cost values in SAMME.C2 therefore does not slow the overall estimation and training of the SAMME.C2 algorithm.
We have used numerical experiments to better understand SAMME.C2, especially in comparison to the SAMME algorithm. We find that SAMME.C2 provides a far superior algorithm for learning and understanding observations in the minority class, regardless of the level of classification difficulty embedded in the data. We have also examined how SAMME.C2 performs relative to other algorithms that handle severely imbalanced classes, based on insurance telematics data; see so2021cost.
5 Concluding remarks
Because of its potential use in a vast array of disciplines, classification predictive modeling will continue to be an important toolkit in machine learning. One of the most challenging aspects of classification tasks is finding an optimal procedure for handling observations with a skewed distribution across several classes. There is now a growing body of literature dealing with real-world classification tasks involving highly imbalanced multiclass problems. In spite of this growing demand, there is insufficient work on methods to handle severely imbalanced data in multiclass classification.
In this paper, we presented what we believe is a promising algorithm for handling severely imbalanced multiclass classification. The proposed method, which we refer to as SAMME.C2, combines the benefits of iterative learning from weak learners through the AdaBoost scheme with the increased, repeated learning of minority-class observations through a cost-sensitive learning scheme. We provided a mathematical proof that the optimization resulting in SAMME.C2 is equivalent to fitting an additive model by minimizing a multiclass cost-sensitive exponential loss function; the algorithm therefore belongs to the traditional statistical family of forward stagewise additive models. We additionally showed that, under the same multiclass cost-sensitive exponential loss function, SAMME.C2 yields an optimal Bayes classifier.
In order to expand our insights into SAMME.C2 relative to SAMME, our numerical experiments focus on the differences that emerge when the classification task varies in difficulty. We therefore synthetically generated three simulated datasets distinguished by these degrees of difficulty. First, we note that straightforward misclassification rates, or test errors, do not work well for severely imbalanced datasets. As has been proposed in the literature, the MAvG, a geometric average of the recall statistics for all classes, is a more rational performance metric because it emphasizes the ability to train on and learn from observations belonging to the minority classes. By recording and tracking test errors, MAvGs, and recall statistics, our numerical experiments reveal the superiority of SAMME.C2 in classifying objects belonging to the minority class, regardless of the degree of classification difficulty. This comes at the small expense of sacrificing recall for the majority class: the minority-class recall statistics of SAMME.C2 improve much more at each iteration than those of SAMME, while its majority-class recall statistics are lower than those of SAMME at all iterations. We also demonstrated the computational efficiency of SAMME.C2 by investigating the optimal number of weak learners, or iterations, needed to reach convergence; based on our analysis, training as few as 200 decision stumps as weak learners is sufficient to stop the iterations.
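A small numeric illustration of why test error misleads on imbalanced data, using an assumed 95/5 two-class split rather than any dataset from the paper: a classifier that ignores the minority class entirely still attains high accuracy, while the MAvG, being a geometric mean of per-class recalls, collapses to zero.

```python
import math

# Assumed 95/5 imbalanced labels; the classifier always predicts class 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# MAvG: geometric mean of the recall of each class.
recalls = []
for c in set(y_true):
    idx = [i for i, t in enumerate(y_true) if t == c]
    recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
mavg = math.prod(recalls) ** (1 / len(recalls))

print(accuracy, mavg)  # 0.95 versus 0.0
```

A single unlearned class thus drives the MAvG to zero no matter how well the other classes are classified, which is exactly the sensitivity to minority classes that plain accuracy lacks.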