Robust and Efficient Boosting Method using the Conditional Risk

06/21/2018 ∙ by Zhi Xiao, et al. ∙ The University of Mississippi

Well-known for its simplicity and effectiveness in classification, AdaBoost, however, suffers from overfitting when class-conditional distributions have significant overlap. Moreover, it is very sensitive to noise that appears in the labels. This article tackles the above limitations simultaneously via optimizing a modified loss function (i.e., the conditional risk). The proposed approach has the following two advantages. (1) It is able to directly take into account label uncertainty with an associated label confidence. (2) It introduces a "trustworthiness" measure on training samples via the Bayesian risk rule, and hence the resulting classifier tends to have finite sample performance that is superior to that of the original AdaBoost when there is a large overlap between class conditional distributions. Theoretical properties of the proposed method are investigated. Extensive experimental results using synthetic data and real-world data sets from the UCI Machine Learning Repository are provided. The empirical study shows the high competitiveness of the proposed method in prediction accuracy and robustness when compared with the original AdaBoost and several existing robust AdaBoost algorithms.


I Introduction

For classification, AdaBoost is well-known as a simple but effective boosting algorithm with the goal of constructing a strong classifier by gradually combining weak learners [46, 12, 31]. Its improvement in classification accuracy stems from its ability to adaptively sample instances for each base classifier in the training process, more specifically from its re-weighting mechanism. It emphasizes the instances that were previously misclassified, and it decreases the importance of those that have been adequately trained. This adaptive scheme, however, causes an overfitting problem for noisy data or data from overlapping class distributions [9, 25, 43]. The problem stems from the uncertainty of observed labels. Classification is usually very challenging when classes overlap. Also, it is both expensive and difficult to obtain reliable labels [11]. In some applications (such as biomedical data), perfect training labels are almost impossible to obtain. Hence, how to make AdaBoost robust to noise and avoid overfitting becomes an important task. The aim of this paper is to construct a modified AdaBoost classification algorithm with a new perspective for tackling those problems.

I-A Related Work

Modifications to AdaBoost for dealing with noisy data can be grouped into three strategic categories. The first introduces robust loss functions as new criteria to be minimized, rather than using the original exponential loss. The second focuses on modifying the re-weighting rule across iterations in order to reduce or eliminate the effects of noisy data or outliers in the training sets. The third suggests more modest methods to combine weak learners that take advantage of base classifiers in other ways.

LogitBoost [13] is an outstanding example of a modification of the first strategic category. It uses the negative binomial log-likelihood loss function, which puts relatively less influence on instances with large negative margins in comparison with the exponential loss (the margin is generally defined as $y f(x)$; a negative margin implies a misclassification of the instance). Thus LogitBoost is less affected by contaminated data [15]. Based on the concept of robust statistics, Kanamori et al. [19] studied loss functions for robust boosting and proposed a transformation of loss functions in order to construct boosting algorithms that are more robust against outliers. Their usefulness has been confirmed empirically. However, the loss function they utilized was derived without considering efficiency. Onoda [26] proposed a set of algorithms that incorporate a normalization term into the original objective function to prevent overfitting. Sun et al. [35] and Sun et al. [36] modified AdaBoost using the regularization method. The approaches of the first category mainly differ in the loss functions and optimization techniques that are used. Sometimes, in the pursuit of robustness, it is hard to balance the complexity of a loss function with its computation cost.

In general, modification of a loss function leads to a new re-weighting rule for AdaBoost, but some heuristic algorithms directly rebuild their weight updating scheme to avoid skewed distributions of examples in the training set. For instance, Domingo and Watanabe [10] proposed MadaBoost, which bounds the weight assigned to every sample by its initial probability. Zhang et al. [49] introduced a parameter into the weight updating formula to reduce weight changes in the training process. Servedio [32] provided a new boosting algorithm, SmoothBoost, which produces only smooth distributions of weights but enables generation of a large margin in the final hypothesis. Utkin and Zhuk [40] took the minimax (pessimistic) approach to search for the optimal weights at each iteration in order to avoid outliers being heavily sampled in the next iteration.

Since the ensemble classifier in AdaBoost predicts a new instance by a weighted majority voting among weak learners, the classifier that achieves high training accuracy will greatly impact the predictive result because of its large coefficient. This can have a detrimental effect on the generalization error, especially when the training set itself is corrupted [30, 1]. With this in mind, the third strategy seeks to provide a better way to combine weak learners. Schapire and Singer [30] improved boosting in an extended framework where each weak hypothesis produces not only classifications but also confidence scores in order to smooth the predictions. Besides, another method called Modest AdaBoost [42] intends to decrease contributions of base learners in a modest way and forces them to work only in their domain.

The algorithms described above mainly focus on some robustifying principle, but they do not consider specific information in the training samples. Many other studies [37, 18, 16] introduced the noise level into the loss function and extended some of the above-mentioned methods. Nevertheless, most of these algorithms do not fundamentally change the fact that misclassified samples are weighted more than they were in the previous stage, though the increment of weights is smaller than that in AdaBoost. Thus mislabeled data may still hurt the final decision and cause overfitting.

In recent studies, many researchers have been inclined to use instance-based methods to make AdaBoost robust against label noise or outliers. They evaluated the reliability or usefulness of each sample using statistical methods, and took that information into account. Cao et al. [6] suggested a noise-detection based loss function that teaches AdaBoost to classify each sample into a mostly agreed class rather than using its observed label. Gao and Gao [14] set the weight of suspicious samples in each iteration to zero and eliminated their effects in AdaBoost. Essentially, these two methods use dynamic correcting and deleting techniques in the training process. In [43], the boosting algorithm directly works on a reduced training set whose “confusing” samples have been removed. Zhang and Zhang [48] considered a local boosting algorithm. Its reweighting rule and the combination of multiple classifiers utilize more local information of the training instances.

For handling label noise, it is natural to delete or correct suspicious instances first and then take the remaining “good” samples as prototypes for learning tasks. This idea is not just for AdaBoost but is also applicable to general methods in many fields (e.g., [39]). Some approaches aim at constructing a good noise purification mechanism under the framework of different methods, such as ensemble methods [41, 4, 5], KNN or its variants [29, 22, 17] and so on. Data preprocessing is a necessary step to improve the quality of the prediction models in some cases [28]. However, some correct samples along with some valuable information may be discarded, and in the meantime, some noisy samples may be retained or some new noisy samples may be introduced. This is the limitation of correcting and deleting techniques. To overcome this weakness, Rebbapragada and Brodley [27] used the confidence in the observed label as a weight for each instance during the training process and provided a novel framework for mitigating class noise. They showed empirically that this confidence weighting approach can outperform the discarding approach, but the method was only applied to the tree-based C4.5 classifier. The confidence-labeling technique they utilized fails to be a desirable label correction method. The authors of [45] and [50] considered and estimated the probability of an instance being from class 1 and used it as a soft label of the instance.

I-B An overview of the proposed approach

Inspired by instance-based methods and the construction of robust algorithms, we propose a novel boosting algorithm based on label confidence, called CB-AdaBoost. The observed label of each instance is treated as uncertain. Not only the correctness, but also the degree of correctness of the label, is evaluated according to a certain criterion before the training procedure. We introduce the confidence of each instance into the exponential loss function. With such a modification, the exponential losses for the misclassified and correctly classified cases are averaged with weights given by their corresponding probabilities, which are represented by the correctness certainty parameter. In this way, the algorithm treats instances differently based on their confidence, and thus, it moderately controls the training intensity for each observation. The modified loss function is indeed the conditional risk or inner risk, which is quite different from an asymmetric loss or a fuzzy loss.

Our method can make a smooth transition between full acceptance and full rejection of a sample label, thereby achieving robustness and efficiency at the same time. In addition, our label-confidence based learning has no threshold parameter, whereas correcting and deleting techniques have to define a confidence level for “suspect” instances so that they are relabeled or discarded in the training procedure. We derive theoretical results and also provide empirical evidence to show the superior performance of the proposed CB-AdaBoost.

The contributions of this paper are as follows.

  • A new loss function. We consider the conditional risk so that label uncertainty can be directly dealt with by the concept of label confidence. This new loss function also leads to consideration of the sign of the Bayesian risk rule on each of the sample points at the initialization of the procedure.

  • A simple modification of the adaptive boosting algorithm. Based on the new exponential loss function, AdaBoost has a simple explicit optimization solution at each iteration.

  • Theoretical and empirical justifications for efficiency and robustness of the proposed method.

  • Consistency of the CB-AdaBoost is studied.

  • Broad adaptivity. The proposed CB-AdaBoost is suitable for noise data and for class-overlapping data.

I-C Outline of the paper

The remainder of the paper is organized as follows. Section II reviews the original AdaBoost. In Section III, we propose a new AdaBoost algorithm. We discuss in detail the assignment of label confidence, the loss function, and the algorithm as well as its ability of adaptive learning in the label-confidence framework. Section IV is devoted to a study of the consistency property. In Section V, we illustrate how the proposed algorithm works and investigate its performance through empirical studies of both synthetic and real-world data sets. Finally, the paper concludes with some final remarks in Section VI. A proof of consistency is provided in the Appendix.

II Review of AdaBoost Algorithm

For binary classification, the main idea of AdaBoost is to produce a strong classifier by combining weak learners. This is obtained through an optimization that minimizes the exponential loss criterion over the training set. Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ denote a given training set consisting of $n$ independent training observations, where $x_i$ and $y_i \in \{-1, 1\}$ represent the input attributes and the class label of the $i$-th instance, respectively. The pseudo-code of AdaBoost is given in Algorithm 1 below.


Algorithm 1

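AdaBoost Algorithm
Input: $S = \{(x_i, y_i)\}_{i=1}^{n}$ and the maximum number of base classifiers $T$.
Initialize: For $i = 1, \dots, n$, $w_i^{(1)} = 1$, $D_1(i) = w_i^{(1)} / Z_1$, where $Z_1 = \sum_{i=1}^{n} w_i^{(1)}$ is the normalization factor.
For $t = 1$ To $T$
1 Draw instances from $S$ with replacement according to the distribution $D_t$ to form a training set $S_t$;
2 Train $S_t$ with the base learning algorithm and obtain a weak hypothesis $f_t$;
3 Compute $\varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(f_t(x_i) \neq y_i)$;
4 Let $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$; If $\alpha_t \le 0$, then $T = t - 1$ and abort loop.
5 Update $w_i^{(t+1)} = w_i^{(t)} \exp(-\alpha_t y_i f_t(x_i))$; $D_{t+1}(i) = w_i^{(t+1)} / Z_{t+1}$ for $i = 1, \dots, n$, where $Z_{t+1} = \sum_{i=1}^{n} w_i^{(t+1)}$;
End For
Output: sign($\sum_{t=1}^{T} \alpha_t f_t(x)$).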


In the AdaBoost Algorithm, the current classifier is induced on the weighted sampling data, and the resulting weighted error is computed. The individual weight of each of the observations is updated for the next iteration. AdaBoost is designed for clean training data, that is, each label $y_i$ is the true label of $x_i$. In this framework, any instance that was previously misclassified has a higher probability of being sampled in the next stage. In this way, the next classifier focuses more on those misclassified instances, and hence, the final ensemble classifier achieves high accuracy. For mislabeled data, however, those observations which were misclassified in the previous step are weighted less, and those correctly classified instances are weighted more than they should be. This leads to the next training set being seriously corrupted, and those mislabeled data eventually hurt the performance of the ensemble classifier. Therefore, some modifications should be introduced to make AdaBoost insensitive to class noise.
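As a minimal sketch of the loop in Algorithm 1, the following Python code boosts one-dimensional threshold stumps. Two simplifications are assumptions of this sketch rather than part of the paper: the distribution is passed to the weak learner as weights instead of being used for resampling, and the helper names (`best_stump`, `adaboost`, `predict`) are hypothetical.

```python
import math

def stump_output(x, threshold, polarity):
    # A 1-D decision stump: predicts `polarity` at or above the threshold.
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, dist):
    # Exhaustive weak learner: smallest weighted error over all thresholds.
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(d for x, y, d in zip(xs, ys, dist)
                      if stump_output(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    dist = [1.0 / n] * n                      # uniform initial distribution
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, dist)
        if err >= 0.5:                        # no better than random guessing
            break
        err = max(err, 1e-12)                 # guard against log of infinity
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified instances (y * f(x) = -1) gain weight.
        dist = [d * math.exp(-alpha * y * stump_output(x, threshold, polarity))
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_output(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

Because the labels are taken at face value, a single flipped label in `ys` would be up-weighted round after round, which is the sensitivity described above.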

III Label-confidence based Boosting Algorithm

III-A Label confidence

For the class noise data problem, the observed label $y$ associated with $x$ may be incorrectly assigned due to some random mechanism. For the class overlapping problem, the label $y$ associated with $x$ is a realization of a random label from some distribution. In our approach to deal with both problems, we treat the true label $Y$ as random. Let $y$ (either 1 or -1) be the observed label associated with $x$. We define a parameter $\gamma$ as the probability of being correctly labeled, that is, $P(Y = y \mid x) = \gamma$ and $P(Y = -y \mid x) = 1 - \gamma$ for $\gamma \in [0, 1]$. The quantity $2\gamma - 1$ measures “trustworthiness” of label $y$ and $|2\gamma - 1|$ represents confidence towards correctness or wrongness of the label. Thus we can use $\mathrm{sign}(2\gamma - 1)\, y$ as the trusted label with confidence level $|2\gamma - 1|$. For example, $\gamma = 1$ gives $2\gamma - 1 = 1$, which represents that we are confident about correctness of the label $y$, while $\gamma = 0$ gives $2\gamma - 1 = -1$, which represents certainty about the wrongness of $y$ so that $-y$ should be trusted. The label with $\gamma = 1/2$ is the most unsure or fuzzy case with 0 confidence. It is easy to see that the trusted label is exactly the Bayes rule. Let $\eta(x) = P(Y = 1 \mid x)$, and hence the Bayes rule is $\mathrm{sign}(2\eta(x) - 1)$, which is equal to $\mathrm{sign}(2\gamma - 1)\, y$ for both $y = 1$ and $y = -1$.

For given training data $S = \{(x_i, y_i)\}_{i=1}^{n}$, let a parameter vector $\gamma = (\gamma_1, \dots, \gamma_n)$ represent their probabilities of being correctly labeled. That is, the parameter $\gamma_i$ can be regarded as the confidence of sample $x_i$ being correctly labeled as $y_i$. In the next subsections, we first introduce the modified loss function based on a given $\gamma$, then propose the confidence based adaptive boosting method (CB-AdaBoost). At the end of the section we discuss the estimation of $\gamma$.
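The relation between the confidence parameter, the trusted label and the Bayes rule can be checked numerically. The sketch below assumes the notation $\gamma = P(Y = y \mid x)$ and $\eta(x) = P(Y = 1 \mid x)$; the function names are hypothetical.

```python
def trusted_label(y, gamma):
    """Return (trusted label, confidence) for observed label y in {-1, 1}."""
    trust = 2.0 * gamma - 1.0
    sign = 1 if trust > 0 else (-1 if trust < 0 else 0)
    return sign * y, abs(trust)

def bayes_rule(eta):
    """Sign of 2 * eta - 1, where eta = P(Y = 1 | x)."""
    trust = 2.0 * eta - 1.0
    return 1 if trust > 0 else (-1 if trust < 0 else 0)

def gamma_from_eta(y, eta):
    """Probability that the observed label y is correct."""
    return eta if y == 1 else 1.0 - eta
```

For instance, with $\eta(x) = 0.75$ an observed label $y = -1$ has $\gamma = 0.25$, so the trusted label is $+1$ with confidence $0.5$, which agrees with the Bayes rule.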

III-B Conditional-risk loss function

Given a clean training set with correct labels $y_i$'s available, the original AdaBoost minimizes the empirical exponential risk

$$\hat{R}(F) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i F(x_i)} \tag{III.1}$$

over all linear combinations $F$ of base classifiers in the given space $\mathcal{H}$, assuming that an exhaustive weak learner returns the best weak hypothesis on every round [13, 31]. Now in class noise data, the true label $Y_i$ is unknown. We only observe $y_i$ associated with $x_i$. Based on the assumption, given $x_i$, the probability that $Y_i$ is $y_i$ is $\gamma_i$. It is natural to consider the following empirical risk:

$$\hat{R}_{\gamma}(F) = \frac{1}{n} \sum_{i=1}^{n} \left[ \gamma_i e^{-y_i F(x_i)} + (1 - \gamma_i) e^{y_i F(x_i)} \right]. \tag{III.2}$$

That is, we treat the observed label as a fuzzy label with correctness confidence. In other words, we consider the modified exponential loss function

$$L_{\gamma}(y, F(x)) = \gamma e^{-y F(x)} + (1 - \gamma) e^{y F(x)}, \tag{III.3}$$

which has a straightforward interpretation. The label $y$ associated with $x$ is trusted with confidence $\gamma$ and it is corrected as $-y$ with confidence $1 - \gamma$. It is easy to check that the loss (III.3) equals the conditional expectation

$$E\left[ e^{-Y F(x)} \mid X = x \right], \tag{III.4}$$

which is the inner risk defined in [33]. The reason it is called the inner risk is because the true exponential risk is

$$E\left[ e^{-Y F(X)} \right] = E_X\left[ E\left( e^{-Y F(X)} \mid X \right) \right] \tag{III.5}$$

for any candidate function $F$. From this perspective, we consider minimizing the empirical inner risk of (III.5), while the original AdaBoost minimizes the empirical version of the joint risk on the left-hand side of (III.5). Steinwart and Christmann [33] showed in their Lemma 3.4 that the risk can be achieved by minimizing the inner risks, where the expectation is taken with respect to the marginal distribution of $X$, in contrast to the joint risk, where the expectation is taken with respect to the joint distribution of $(X, Y)$. Clearly, under the scenarios of overlapping class and label noise, the empirical inner risk (III.2) has an advantage over (III.1).
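To make the modified loss concrete, the following sketch evaluates the exponential loss and the confidence-weighted loss of this subsection, written here as $\gamma e^{-yF} + (1-\gamma) e^{yF}$ under the assumed notation. The closed-form pointwise minimizer $yF^{*} = \tfrac{1}{2}\ln\frac{\gamma}{1-\gamma}$ follows from setting the derivative to zero; it is a worked consequence added for illustration, and the function names are hypothetical.

```python
import math

def exp_loss(y, score):
    # Exponential loss used by the original AdaBoost.
    return math.exp(-y * score)

def conditional_risk_loss(y, score, gamma):
    # Trust y with probability gamma, trust -y with probability 1 - gamma.
    return gamma * math.exp(-y * score) + (1.0 - gamma) * math.exp(y * score)

def pointwise_minimizer(y, gamma):
    # Score minimizing conditional_risk_loss; requires 0 < gamma < 1.
    return 0.5 * y * math.log(gamma / (1.0 - gamma))

def empirical_risk(ys, scores, gammas):
    # Average confidence-weighted loss over a sample.
    return sum(conditional_risk_loss(y, s, g)
               for y, s, g in zip(ys, scores, gammas)) / len(ys)
```

With $\gamma = 1$ the modified loss coincides with the exponential loss, and with $\gamma < 1$ the optimal score is finite, so a doubtful label cannot drive the score to infinity.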

In [2], (III.3) is called the conditional $\phi$-risk with $\phi$ being the exponential loss function. A classification-calibrated condition on the conditional risk is provided to ensure a pointwise form of Fisher consistency for classification. In other words, if the condition is satisfied, the 0-1 loss can be surrogated by the convex loss $\phi$ in order to make the minimization computationally efficient. The exponential loss is classification-calibrated. Our proposed method utilizes a different empirical estimator of the exponential risk. Its consistency follows from the consistency result of AdaBoost [3] along with consistent estimation of $\gamma$. More details are presented in Section IV.

The loss (III.3) is closely related to the asymmetric loss used in the literature (e.g. [44, 24]), but the motivation and goal of the two losses are quite different. The asymmetric loss treats two classes unequally. Two misclassification errors produce different costs. However, the costs or weights do not necessarily sum up to 1. In asymmetric loss, the ratio of two costs is usually used to measure the degree of asymmetry and is often a constant parameter, while in (III.3) it is a function of $\gamma$. Also the loss (III.3) takes a linear combination of the exponential loss at $y$ and $-y$, while the asymmetric loss only takes one.

Indeed, $\gamma$ in the loss (III.3) is the posterior probability used in [38] for the support vector machine technique. The similarity is that we all use the sign of the Bayes rule as the trusted label. However, we also include the magnitude $|2\gamma - 1|$ in our loss function. We associate the trusted label with a confidence $|2\gamma - 1|$, while in [38] the confidence is always 1. The idea of label confidence is closely related to the fuzzy label used in fuzzy support vector machines [21]. The difference is that a fuzzy label only assigns an importance weight to the observed label without considering its correctness.

Next, we derive the proposed method based on the modified exponential loss function.

III-C Derivation of our algorithm

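For an additive model,

$$F_T(x) = \sum_{t=1}^{T} \alpha_t f_t(x), \tag{III.6}$$

where $f_t$ is a weak classifier in the $t$-th iteration, $\alpha_t$ is its coefficient and $F_T$ is an ensemble classifier. Our goal is to learn an ensemble classifier with a forward stage-wise estimation procedure by fitting an additive model to minimize the modified loss functions. Let us consider an update from $F_{t-1}$ to $F_t = F_{t-1} + \alpha_t f_t$ by minimizing (III.2). This is an optimization problem to find solutions $\alpha_t$ and $f_t$, that is,

$$(\alpha_t, f_t) = \arg\min_{\alpha, f} \sum_{i=1}^{n} \left[ \gamma_i e^{-y_i (F_{t-1}(x_i) + \alpha f(x_i))} + (1 - \gamma_i) e^{y_i (F_{t-1}(x_i) + \alpha f(x_i))} \right] = \arg\min_{\alpha, f} \sum_{i=1}^{n} \left[ w_i^{(t)} e^{-\alpha y_i f(x_i)} + v_i^{(t)} e^{\alpha y_i f(x_i)} \right], \tag{III.7}$$

where $w_i^{(t)} = \gamma_i e^{-y_i F_{t-1}(x_i)}$ and $v_i^{(t)} = (1 - \gamma_i) e^{y_i F_{t-1}(x_i)}$ are independent of $\alpha$ and $f$.

As we will show, $f_t$ and $\alpha_t$ can be derived separately in two steps. Let us first optimize the weak hypothesis $f_t$. The summation in (III.7) can be expressed alternatively as

$$(e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{n} (w_i^{(t)} - v_i^{(t)})\, I(y_i \neq f(x_i)) + e^{-\alpha} \sum_{i=1}^{n} w_i^{(t)} + e^{\alpha} \sum_{i=1}^{n} v_i^{(t)}.$$

Therefore, for any given value of $\alpha > 0$, (III.7) is equivalent to the minimization of

$$\sum_{i=1}^{n} (w_i^{(t)} - v_i^{(t)})\, I(y_i \neq f(x_i)). \tag{III.8}$$

It is worthwhile to mention that the term $w_i^{(t)} - v_i^{(t)}$ may be negative, hence it cannot be directly interpreted as the “weight” of the instance in the training set. According to the analytical solution of $f_t$, the base classifier is expected to correctly predict $y_i$ in the case of $w_i^{(t)} - v_i^{(t)} > 0$ and otherwise misclassify $y_i$. This is equivalent to solving

$$f_t = \arg\min_{f} \sum_{i=1}^{n} |w_i^{(t)} - v_i^{(t)}|\, I\left( \mathrm{sign}(w_i^{(t)} - v_i^{(t)})\, y_i \neq f(x_i) \right). \tag{III.9}$$

In other words, $f_t$ is actually the one that minimizes the prediction error over the set $\{(x_i, \tilde{y}_i^{(t)})\}_{i=1}^{n}$ with each instance weighted $|w_i^{(t)} - v_i^{(t)}|$. In each iteration, we treat $\tilde{y}_i^{(t)} = \mathrm{sign}(w_i^{(t)} - v_i^{(t)})\, y_i$ as the label of $x_i$ and $|w_i^{(t)} - v_i^{(t)}|$ as its importance. This provides a theoretical justification of the sampling scheme in our proposed algorithm, which is given later.

Next, we optimize $\alpha_t$. With $f_t$ fixed, $\alpha_t$ minimizes

$$e^{-\alpha} \sum_{i=1}^{n} \left[ w_i^{(t)} I(y_i = f_t(x_i)) + v_i^{(t)} I(y_i \neq f_t(x_i)) \right] + e^{\alpha} \sum_{i=1}^{n} \left[ w_i^{(t)} I(y_i \neq f_t(x_i)) + v_i^{(t)} I(y_i = f_t(x_i)) \right]. \tag{III.10}$$

Upon setting the derivative of (III.10) (with respect to $\alpha$) to zero, we obtain

$$\alpha_t = \frac{1}{2} \ln \frac{\sum_{i=1}^{n} \left[ w_i^{(t)} I(y_i = f_t(x_i)) + v_i^{(t)} I(y_i \neq f_t(x_i)) \right]}{\sum_{i=1}^{n} \left[ w_i^{(t)} I(y_i \neq f_t(x_i)) + v_i^{(t)} I(y_i = f_t(x_i)) \right]}. \tag{III.11}$$

Note that the condition that

$$\sum_{i=1}^{n} \left[ w_i^{(t)} I(y_i = f_t(x_i)) + v_i^{(t)} I(y_i \neq f_t(x_i)) \right] > \sum_{i=1}^{n} \left[ w_i^{(t)} I(y_i \neq f_t(x_i)) + v_i^{(t)} I(y_i = f_t(x_i)) \right] \tag{III.12}$$

should hold in order to ensure the value of $\alpha_t$ is positive.

The approximation on the $t$-th iteration is then updated as

$$F_t(x) = F_{t-1}(x) + \alpha_t f_t(x), \tag{III.13}$$

which leads to the following update of $w_i$ and $v_i$:

$$w_i^{(t+1)} = w_i^{(t)} e^{-\alpha_t y_i f_t(x_i)}$$

and

$$v_i^{(t+1)} = v_i^{(t)} e^{\alpha_t y_i f_t(x_i)}.$$

By repeating the procedure above, we can derive the iterative process for all rounds until $t = T$ or the condition (III.12) is not satisfied. The initial values take $w_i^{(1)} = \gamma_i$ and $v_i^{(1)} = 1 - \gamma_i$. Now we write the procedure into the pseudo-code of Algorithm 2.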


Algorithm 2

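CB-AdaBoost Algorithm
Input: $S = \{(x_i, y_i)\}_{i=1}^{n}$, $\gamma = (\gamma_1, \dots, \gamma_n)$ and $T$.
Initialize: For $i = 1, \dots, n$, $w_i^{(1)} = \gamma_i$, $v_i^{(1)} = 1 - \gamma_i$, $D_1(i) = |w_i^{(1)} - v_i^{(1)}| / Z_1$, where $Z_1 = \sum_{i=1}^{n} |w_i^{(1)} - v_i^{(1)}|$.
For $t = 1$ To $T$
1 Relabel all instances in $S$ to compose a new data set $\tilde{S}_t = \{(x_i, \tilde{y}_i^{(t)})\}_{i=1}^{n}$, where $\tilde{y}_i^{(t)} = \mathrm{sign}(w_i^{(t)} - v_i^{(t)})\, y_i$, $i = 1, \dots, n$;
2 Draw instances from $\tilde{S}_t$ with replacement according to the distribution $D_t$ to compose a training set $S_t$;
3 Train $S_t$ with the base learning algorithm and obtain a weak hypothesis $f_t$;
4 Let $\alpha_t = \frac{1}{2} \ln \frac{\sum_{i} [ w_i^{(t)} I(y_i = f_t(x_i)) + v_i^{(t)} I(y_i \neq f_t(x_i)) ]}{\sum_{i} [ w_i^{(t)} I(y_i \neq f_t(x_i)) + v_i^{(t)} I(y_i = f_t(x_i)) ]}$. If $\alpha_t \le 0$, then $T = t - 1$ and abort loop.
5 Update $w_i^{(t+1)} = w_i^{(t)} e^{-\alpha_t y_i f_t(x_i)}$; $v_i^{(t+1)} = v_i^{(t)} e^{\alpha_t y_i f_t(x_i)}$; $D_{t+1}(i) = |w_i^{(t+1)} - v_i^{(t+1)}| / Z_{t+1}$ for $i = 1, \dots, n$, where $Z_{t+1} = \sum_{i=1}^{n} |w_i^{(t+1)} - v_i^{(t+1)}|$;
End For
Output: sign($\sum_{t=1}^{T} \alpha_t f_t(x)$).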
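The steps of Algorithm 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the two running quantities are named `w` and `v` (with initial values $\gamma_i$ and $1-\gamma_i$), the weak learner is a hypothetical one-dimensional stump that receives the distribution as weights instead of a resampled set, and all function names are invented for the example.

```python
import math

def stump_output(x, threshold, polarity):
    return polarity if x >= threshold else -polarity

def fit_stump(xs, labels, dist):
    # Weighted-error minimizing stump for the (relabeled) training set.
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(d for x, y, d in zip(xs, labels, dist)
                      if stump_output(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best[1], best[2]

def cb_adaboost(xs, ys, gammas, rounds):
    w = list(gammas)                       # mass on "observed label is right"
    v = [1.0 - g for g in gammas]          # mass on "observed label is wrong"
    ensemble = []
    for _ in range(rounds):
        diff = [wi - vi for wi, vi in zip(w, v)]
        z = sum(abs(d) for d in diff)
        if z == 0.0:
            break
        dist = [abs(d) / z for d in diff]
        # Step 1: trusted label flips y whenever w - v is negative.
        trusted = [y if d >= 0 else -y for y, d in zip(ys, diff)]
        threshold, polarity = fit_stump(xs, trusted, dist)
        preds = [stump_output(x, threshold, polarity) for x in xs]
        # Step 4: coefficient from the confidence-weighted agreement ratio.
        num = sum(wi if p == y else vi for wi, vi, p, y in zip(w, v, preds, ys))
        den = sum(vi if p == y else wi for wi, vi, p, y in zip(w, v, preds, ys))
        if num <= den:                     # alpha would not be positive
            break
        alpha = 0.5 * math.log(num / max(den, 1e-12))
        ensemble.append((alpha, threshold, polarity))
        # Step 5: update both masses against the observed labels.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        v = [vi * math.exp(alpha * y * p) for vi, y, p in zip(v, ys, preds)]
    return ensemble

def cb_predict(ensemble, x):
    score = sum(a * stump_output(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

In a toy run where the point at `x = 2` carries a flipped label but a low confidence of 0.1, the first stump is fitted to the trusted label rather than the observed one, and the ensemble keeps predicting the clean class there.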


III-D Class noise mitigation

In this subsection, we study the effect of label confidence, and we investigate the adaptive ability of CB-AdaBoost in the mitigation of overfitting and class noise from the aspects of its re-weighting procedure and classifier combination rule.

First, the initialization of the distribution shows different initial emphases on training instances between Algorithm 1 and Algorithm 2. As discussed earlier, $|2\gamma_i - 1|$ represents the label certainty of the $i$-th instance, and it is used as the initial weight in Algorithm 2. The conditional risk type of loss function leads to this initialization and to a weighting strategy that distinguishes instances based on their own confidences. Consequently, the instances with a high certainty receive priority in training. This makes sense as these instances are usually those identifiable from a statistical standpoint, and thus, they are more valuable in classification. By contrast, Algorithm 1 treats each instance equally at the beginning without considering the reliability of the samples.

Second, we consider $\mathrm{sign}(2\gamma_i - 1)\, y_i$ as the initial label of $x_i$ in Algorithm 2. Under the mislabeled or class overlapping scenarios, this design makes sense because $|2\gamma_i - 1|$ represents the confidence towards correctness or wrongness of the label $y_i$. If $\gamma_i > 1/2$, $y_i$ should be trusted with confidence $2\gamma_i - 1$. Nevertheless, if $\gamma_i < 1/2$, $-y_i$ should be trusted with confidence $1 - 2\gamma_i$. The original AdaBoost trusts the label $y_i$ completely, which is inappropriate under mislabeling and class overlapping. As shown before, the trusted label in CB-AdaBoost has the same sign as the Bayes rule at sample point $x_i$. Intuitively, our method takes in more information at the initialization.

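Third, we take a detailed look at the weight updating formulas in Algorithm 2 and subsequently obtain the following results on the first re-weighting process. We say that an instance $(x_i, y_i)$ is misclassified at the $t$-th iteration if $f_t(x_i) \neq \tilde{y}_i^{(t)}$, where $\tilde{y}_i^{(t)}$ is the trusted label of $x_i$ used in that iteration (at the first iteration, $\tilde{y}_i^{(1)} = \mathrm{sign}(2\gamma_i - 1)\, y_i$); otherwise, it is correctly classified. Weights below are the unnormalized weights, which equal $|2\gamma_i - 1|$ before the first update.

Proposition 1. The misclassified instance receives larger weight for the next iteration.

Proof. Two types of misclassification are either $f_1(x_i) \neq y_i$ with $\gamma_i > 1/2$ or $f_1(x_i) = y_i$ with $\gamma_i < 1/2$. In the first case,

$$\left| \gamma_i e^{\alpha_1} - (1 - \gamma_i) e^{-\alpha_1} \right| = \gamma_i e^{\alpha_1} - (1 - \gamma_i) e^{-\alpha_1} > 2\gamma_i - 1,$$

while in the second case,

$$\left| \gamma_i e^{-\alpha_1} - (1 - \gamma_i) e^{\alpha_1} \right| = (1 - \gamma_i) e^{\alpha_1} - \gamma_i e^{-\alpha_1} > 1 - 2\gamma_i.$$

In both cases, the weight increases.

Proposition 2. If an instance is correctly classified and its certainty is high enough so that $|2\gamma_i - 1| > (e^{\alpha_1} - 1)/(e^{\alpha_1} + 1)$, then it receives smaller weight at the next iteration.

Proof. We can easily check two cases. For the case of $f_1(x_i) = y_i$ and $\gamma_i > 1/2$, when $\gamma_i > e^{\alpha_1}/(1 + e^{\alpha_1})$, we have

$$\left| \gamma_i e^{-\alpha_1} - (1 - \gamma_i) e^{\alpha_1} \right| < 2\gamma_i - 1.$$

For the case of $f_1(x_i) \neq y_i$ and $\gamma_i < 1/2$, if $\gamma_i < 1/(1 + e^{\alpha_1})$, we have

$$\left| \gamma_i e^{\alpha_1} - (1 - \gamma_i) e^{-\alpha_1} \right| < 1 - 2\gamma_i.$$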

Propositions 1 and 2 show that in the important first stage, CB-AdaBoost inherits the adaptive learning ability of AdaBoost and has the distinction that it adjusts the distribution of instances according to the current classification with respect to the commonly agreed information. Moreover, the degree of adjustment is managed by the confidence of each sample. For the following iterations, we can imagine the resampling process. The weights of instances with high confidence stay at a high level until most of them are sufficiently learned. After that, their proportion decreases rapidly while the proportion of instances with low confidence increases gradually. Once uncertain instances make up most of the training set, the training process is difficult to continue. On the other hand, once a new classifier becomes no better than a random guess, an early stop in the iterative process is possible. This is because the condition (III.12) no longer holds in that case. Thus, our proposed method effectively prevents the ensemble classifier from overfitting.
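The first re-weighting step can be checked numerically. The sketch below assumes the unnormalized weight of an instance is $|w - v|$ with $w = \gamma$ and $v = 1 - \gamma$ before the first update; the threshold $\tanh(\alpha_1/2) = (e^{\alpha_1}-1)/(e^{\alpha_1}+1)$ on $|2\gamma - 1|$ is derived here from those update formulas and should be read as an assumption of this example. The function names are hypothetical.

```python
import math

def initial_weight(gamma):
    # Unnormalized weight before any update: |gamma - (1 - gamma)|.
    return abs(2.0 * gamma - 1.0)

def weight_after_first_update(gamma, agreement, alpha):
    # agreement = y * f(x): +1 if the stump matches the observed label, else -1.
    w = gamma * math.exp(-alpha * agreement)
    v = (1.0 - gamma) * math.exp(alpha * agreement)
    return abs(w - v)

def certainty_threshold(alpha):
    # Certainty |2*gamma - 1| above which a correctly classified point shrinks.
    return math.tanh(alpha / 2.0)
```

With $\alpha_1 = 1$, an instance with $\gamma = 0.9$ loses weight when classified in agreement with its trusted label and gains weight otherwise, whereas an instance with $\gamma = 0.55$ gains weight even when classified correctly, because its certainty of $0.1$ is below the threshold of about $0.46$.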

Fourth, let us scrutinize the classifier ensemble rule.

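Proposition 3. In the framework of Algorithm 2, define $\varepsilon_t$ as the error rate of $f_t$ over its training set during the $t$-th iteration, that is, $\varepsilon_t = \sum_{i=1}^{n} D_t(i)\, I(f_t(x_i) \neq \tilde{y}_i^{(t)})$. We then have

$$\alpha_t \le \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}.$$

Proof. We can prove this result by giving an equivalent representation of $\alpha_t$ as:

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t + c_t}{\varepsilon_t + c_t},$$

where

$$c_t = \frac{\sum_{i=1}^{n} \min(w_i^{(t)}, v_i^{(t)})}{\sum_{i=1}^{n} |w_i^{(t)} - v_i^{(t)}|} \ge 0.$$

With the Condition (III.12) being satisfied, we obtain $\varepsilon_t < 1/2$, which implies

$$\frac{1 - \varepsilon_t + c_t}{\varepsilon_t + c_t} \le \frac{1 - \varepsilon_t}{\varepsilon_t}.$$

Thus, the proof of Proposition 3 is complete.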

It turns out that the coefficient $\alpha_t$ calculated in our modified algorithm does not take into account the full value of the odds ratio for each hypothesis. In fact, it is smaller than that calculated in AdaBoost, so our algorithm combines base classifiers and updates instance weights modestly. This effectively avoids the situation where some hypotheses dominated by substantial classification noise are exaggerated by their large coefficients in the final classifier.
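The comparison of coefficients can be illustrated with a small computation. The sketch assumes two per-instance masses `w` and `v` (trust in the observed label and in its flip), computes the coefficient directly as a log-ratio, and compares it with the AdaBoost coefficient evaluated at the same error rate with respect to the trusted labels. The offset `c`, defined as the total of `min(w, v)` divided by the total of `|w - v|`, comes from a derivation made for this example, and the function names are hypothetical.

```python
import math

def cb_alpha(w, v, agrees_with_observed):
    # Direct form: log-ratio of the two confidence-weighted sums.
    num = sum(wi if a else vi for wi, vi, a in zip(w, v, agrees_with_observed))
    den = sum(vi if a else wi for wi, vi, a in zip(w, v, agrees_with_observed))
    return 0.5 * math.log(num / den)

def trusted_error_and_offset(w, v, agrees_with_observed):
    z = sum(abs(wi - vi) for wi, vi in zip(w, v))
    # The trusted label equals the observed one when w >= v, its flip otherwise.
    eps = sum(abs(wi - vi) for wi, vi, a in zip(w, v, agrees_with_observed)
              if a != (wi >= vi)) / z
    c = sum(min(wi, vi) for wi, vi in zip(w, v)) / z
    return eps, c

def adaboost_alpha(eps):
    return 0.5 * math.log((1.0 - eps) / eps)
```

For `w = [0.9, 0.8, 0.3, 0.7]`, `v = [0.1, 0.2, 0.7, 0.3]` and a hypothesis that matches the first two observed labels only, both forms give the same coefficient, and it is smaller than the AdaBoost value at the same error rate.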

We have studied the CB-AdaBoost algorithm in detail and compared its advantages to the original one. Next, we discuss the remaining issue of how to estimate label confidence.

III-E Assignment of label confidence

In most cases, since it is difficult to track the data collection process and identify where corruptions will most likely occur, we evaluate the confidence on labels according to the statistical characteristics of the data itself. In this regard, [27] suggested a pair-wise expectation maximization method (PWEM) to compute confidence of labels. Cao et al. [6] applied KNN to detect suspicious examples. However, a direct application of these methods may not be efficient for data sets whose noise level is high. We believe that a cleaner data set can yield a better confidence estimate. Therefore, before confidence assignment, a noise filter is introduced to eliminate very suspicious instances so that we are able to extract more reliable statistical characteristics from the remaining data.

TABLE I: Average and standard deviation of the confidences for clean and mislabeled samples in two data sets (Normal and Sine) with different noise levels.

First, a noise filter scans over the original data set. Using a similarity measure between instances to find a neighborhood of each instance, one can compute the agreement rate for its label from its neighbors. The instances with an agreement rate below a certain threshold are eliminated. The above process can be repeated several times since some suspect instances may be exposed later when their neighborhood changes. In our experiment, the threshold is set to 0.07 at the beginning with an increment of 0.07 in each subsequent round. The process is repeated three times, and the final cut-off value for the agreement rate is 0.21 so that the sample size doesn’t decrease much. In the mean time, distributional information of the sample is kept relatively intact. Once a filtered data set, denoted as , is obtained, two methods can be used to compute label confidence.
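A minimal sketch of such an iterative filter is shown below. The thresholds 0.07, 0.14 and 0.21 are the ones stated in the text; the Euclidean distance, the neighborhood size `k`, the exclusion of a point from its own neighborhood, and the function names are assumptions made for this illustration.

```python
def agreement_rates(points, labels, k):
    """Fraction of each point's k nearest neighbours sharing its label."""
    rates = []
    for i, p in enumerate(points):
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbours = [j for _, j in ranked[:k]]
        agree = sum(1 for j in neighbours if labels[j] == labels[i])
        rates.append(agree / len(neighbours) if neighbours else 1.0)
    return rates

def noise_filter(points, labels, k, thresholds=(0.07, 0.14, 0.21)):
    """Return the indices that survive all filtering rounds."""
    keep = list(range(len(points)))
    for threshold in thresholds:
        pts = [points[i] for i in keep]
        labs = [labels[i] for i in keep]
        rates = agreement_rates(pts, labs, k)
        # Neighbourhoods are recomputed on the survivors in every round.
        keep = [i for i, r in zip(keep, rates) if r >= threshold]
    return keep
```

Recomputing the neighbourhoods after each round is what lets a suspect instance be exposed later, once the instances that supported its label have been removed.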

If the noise level over the training labels is known or can be estimated, we can represent the frequency of observations with label as follows:

where the noise level . This representation explains two sources for the composition of label : correctly labeled instances belonging to true class and mislabeled instances belonging to true class . Then , and utilizing the Bayesian formula, we assess the confidence as follows:

With conditional distribution type known, and can be estimated under while is directly set to be the sample proportion of class in .

The second method does not need to assume the noise level. KNN is used again to assign confidence to each label. Based on the filtered data set, the label agreement rate of each instance among its $k$ nearest neighbors can act as its confidence. So the confidence probability of an example $(x_i, y_i)$ is computed as follows:

$$\gamma_i = \frac{1}{k} \sum_{x_j \in N_k(x_i)} I(y_j = y_i), \tag{III.14}$$

where $N_k(x_i)$ represents the set containing the $k$ nearest neighbors of $x_i$ from the filtered data set. In our experiment, is used.
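The KNN confidence assignment can be sketched as follows: every instance of the original sample receives the share of its nearest neighbours, taken from the filtered set, that carry the same label. The Euclidean distance, the `keep` index list standing for the filtered data set, the exclusion of an instance from its own neighborhood, and the value of `k` in the usage are assumptions of this example.

```python
def knn_confidence(points, labels, keep, k):
    """Label agreement rate of every point among its k nearest kept points."""
    confidences = []
    for i, p in enumerate(points):
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, points[j])), j)
            for j in keep if j != i
        )
        neighbours = [j for _, j in ranked[:k]]
        agree = sum(1 for j in neighbours if labels[j] == labels[i])
        confidences.append(agree / len(neighbours))
    return confidences
```

An instance removed by the filter still gets a confidence value, which is typically low because its label disagrees with the surviving neighbours.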

In the simulation of Section V, we will evaluate the quality of confidence assigned by these two methods. In practice, however, the Bayesian method is usually infeasible since the noise level is unknown.

III-F Relationship to previous work

Note that our modified algorithm reduces to AdaBoost if we set the confidence on each label to one. The greater the confidence on each instance, the less CB-AdaBoost differs from AdaBoost in terms of the weight updating and base classifiers, as well as their coefficients in successive iterations. Rebbapragada and Brodley [27] proposed instance weighting via confidence in order to mitigate class noise. They attempted to assign confidence to instance labels such that incorrect labels receive lower confidences. We share a similar view in dealing with noisy data, but instance weighting via confidence itself seems to be a discarding technique rather than a correcting technique. That is, a low confidence implies an attempt to eliminate the example, while a high confidence implies keeping it. By contrast, our algorithm considers both the correctly labeled and mislabeled probability for an instance. Therefore, the loss function

explains the attitude towards an instance: delete it with and correct it by . In other words, our algorithm can be viewed as a composition technique of discarding and correcting. For the same reason, our algorithm differs from those proposed in [14] and [6]. In their discussions, they suggested heuristic algorithms to delete or revise suspicious examples during iterations in order to improve the accuracy of AdaBoost for mislabeled data. In our algorithm, the suspicious labels are similarly revised, which is a consequence of minimizing the modified loss function (III.2). The trusted label at each sample point is the sign of the Bayes rule and is associated with a confidence level.

Other closely related work includes [45] and [50]. Both consider the same confidence level of as , whereas our approach takes advantage of the observed label by considering . We evaluate confidence of the observed label , while they assess confidence of the positive label . In [50], the initial weight is very similar to our choice, but our re-weighting and classifier combination rules are different. [45] has a similar combination rule as ours, but the initial weights are different.

IV Consistency of CB-AdaBoosting

In this section, we study consistency of the proposed CB-AdaBoosting method with label confidences estimated by the KNN approach. Several authors have shown that the original and modified versions of AdaBoost are consistent. For example, Zhang and Yu [47] considered a general “boosting” with a step size restriction. Lugosi and Vayatis [23] proved the consistency of regularized boosting methods. Bartlett and Traskin [3] studied the stopping rule of the traditional AdaBoost that guarantees its consistency. In our algorithm, we use the exponential loss function, only with a different empirical version of the exponential risk. This enables us to adopt the stopping strategy used in [3] together with a consistency result on the nearest neighbor method ([34, 8]) to show that the proposed CB-AdaBoost is Bayes-risk consistent.

We use notation similar to [3]. Let be a pair of random values in with the joint distribution and the marginal distribution of being . The training sample data is available, having the same distribution as . The mislabel problem can be treated as the case being a contamination distribution. The CB-AdaBoost produces a classifier