Kalman Filter-based Heuristic Ensemble: A New Perspective on Ensemble Classification Using Kalman Filters

07/30/2018 ∙ by Arjun Pakrashi, et al. ∙ Insight Centre for Data Analytics

A classifier ensemble is a combination of multiple diverse classifier models whose outputs are aggregated into a single prediction. Ensembles have been repeatedly shown to perform better than single classifier models, and so they have long been a subject of research. The objective of this paper is to introduce a new perspective on ensemble classification by considering the training of the ensemble as a state estimation problem: the state is estimated from noisy measurements, and these measurements are combined using a Kalman filter, within which heuristics are used. An implementation of this perspective, the Kalman Filter-based Heuristic Ensemble (KFHE), is also presented in this paper. Experiments performed on several datasets indicate the effectiveness and the potential of KFHE when compared with boosting and bagging. Moreover, KFHE was found to perform comparatively better than bagging and boosting on datasets with noisy class-label assignments.


1 Introduction

An ensemble classification model is composed of multiple individual base classifiers, also known as component classifiers, the outputs of which are aggregated together into a single prediction. The classification accuracy of an ensemble model can be expected to exceed that of any of its individual base classifiers. The main motivation behind ensemble techniques is that a committee of experts working together on a problem is more likely to solve it accurately than a single expert working alone (kelleher2015fundamentals, ). Although many existing ensemble techniques (e.g. (Breiman1996, ; Friedman00greedyfunction, ; hastie2009multi, ; zhu2006multi, )) have been repeatedly shown in benchmark experiments to be effective (see (Narassiguin2016, ; Opitz:1999:PEM:3013545.3013549, )), current approaches still have limitations. For example, methods based on bagging, although robust, may not lead to models as accurate as those learned by more sophisticated methods such as those based on boosting (Narassiguin2016, ). Methods based on boosting, however, are sensitive to class-label noise and the presence of outliers in training datasets (Dietterich2000, ).

To address the limitations of current multi-class classification ensemble algorithms, this paper presents a new perspective on ensemble model training, framing it as a state estimation problem that can be solved using a Kalman filter (kalman1960, ; maybeck1982stochastic, ). Although Kalman filters are most commonly used to solve problems associated with time series data, this is not the case in this work. Rather, this work exploits the data fusion property of the Kalman filter to combine individual multi-class component classifier models to construct an ensemble.

The new perspective views the ensemble model to be trained as an unknown static state to be estimated. A Kalman filter can estimate an unknown static state by combining multiple uncertain measurements of it, exploiting the filter's data fusion property. In the new perspective the measurements are the individual component classifiers in the ensemble, and the uncertainties of these measurements are based on the classification errors of those component classifiers. The Kalman filter is used to combine the component classifier models into an overall ensemble model. This new perspective on ensemble training provides a framework within which different algorithms can be formulated. This paper describes one such new algorithm, the Kalman Filter-based Heuristic Ensemble (KFHE). In an evaluation experiment KFHE is shown to outperform methods based on boosting while maintaining the robustness of methods based on bagging. The contributions of this paper are:

  1. A new perspective on training multi-class ensemble classifiers, which views it as a state estimation problem and solves it using a Kalman filter (kalman1960, ; maybeck1982stochastic, ).

  2. A new multi-class ensemble classification algorithm, the Kalman Filter-based Heuristic Ensemble (KFHE).

  3. Extensive experiments comparing KFHE with state-of-the-art ensemble algorithms, demonstrating the effectiveness of KFHE in both the noise-free and noisy class-label scenarios.

The remainder of this paper is structured as follows. Section 2 discusses previous work on multi-class ensemble classification algorithms and provides a brief introduction to the Kalman filter. Section 3 introduces the new Kalman filter-based perspective on building multi-class classification ensembles. The Kalman Filter-based Heuristic Ensemble (KFHE) method based on this perspective is described in Section 4. The setup of an experiment evaluating the performance of KFHE against state-of-the-art approaches on a selection of datasets is described in Section 5, and a detailed discussion of the results of this experiment is presented in Section 6. Finally, Section 7 reflects on the newly proposed perspective and explores directions for future work.

2 Background

This section first reviews existing multi-class ensemble classification methods. Relevant aspects of the Kalman filter approach for state estimation, which serve as a basis for the explanation of KFHE, are then introduced.

2.1 Ensemble methods

The advent of ensemble approaches in machine learning in the early 1990s was due mainly to the works of Hansen and Salamon (Hansen:1990:NNE:628297.628429, ) and Schapire (Schapire1990, ). Hansen and Salamon (Hansen:1990:NNE:628297.628429, ) showed that multiple classifiers could be combined to achieve better performance than any individual classifier. Schapire (Schapire1990, ) proved that the learnability of strong learners and weak learners is equivalent, and then showed how to boost weak learners into strong learners. Since then many alternative and improved approaches to building ensembles have been introduced. Ensemble methods can still, however, be categorised into three fundamental types: bagging, boosting, and stacking.

Bagging (Breiman1996, ), or bootstrap aggregation, trains several base classifiers on bootstrap samples of a training dataset and combines the outputs of these base classifiers using simple aggregation such as majority voting. Training models on different samples of the training set introduces diversity into the ensemble, which is key to making ensembles work effectively. UnderBagging (UnderBagging:Barandela2003, ) is a variation of bagging addressing imbalanced datasets that performs undersampling before every bagging iteration, while keeping all minority class instances in every iteration. The Random Forest (Breiman2001rf, ) is an extension to bagging in which base classifiers (usually decision trees) are trained using a bootstrap sample of the dataset that has also been reduced to only a small random sample of the input space. The Rotational Forest (Rodriguez:2006:RFN:1159167.1159358, ) is another extension that attempts to build base classifiers that are simultaneously accurate and diverse. The input dataset is transformed by applying PCA (hastie01statisticallearning, ) to different subsets of the attributes of the dataset, and axis rotation is performed by combining the coefficient matrices found by PCA for each subset. This is repeated multiple times. Local Linear Forests modify random forests by viewing them as an adaptive kernel method and combining them with local linear regression (friedberg2018local, ).
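
The core bagging procedure described above, bootstrap sampling plus majority voting, can be sketched as follows. The 1-D decision stump base learner, the toy dataset, and all names here are illustrative assumptions, not part of the original bagging formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, y):
    """Fit a 1-D decision stump: choose the threshold minimising training error."""
    best_t, best_err = X[0], 2.0
    for t in np.unique(X):
        err = np.mean((X >= t).astype(int) != y)
        if err < best_err:
            best_t, best_err = t, err
    return lambda Xq, t=best_t: (Xq >= t).astype(int)

def bag(X, y, n_models=11):
    """Train stumps on bootstrap samples; predict by unweighted majority vote."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))   # bootstrap: sample with replacement
        models.append(train_stump(X[idx], y[idx]))
    return lambda Xq: (np.mean([m(Xq) for m in models], axis=0) >= 0.5).astype(int)

X = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
ensemble = bag(X, y)
```

Each bootstrap sample sees a slightly different view of the data, so the stumps disagree on borderline points, and the vote smooths out individual mistakes.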

Boosting (zhu2006multi, ) approaches iteratively learn component classifiers such that each one specialises on specific types of training examples. Each component classifier is trained using a weighted sample from a training dataset such that at each iteration the ensemble emphasises training examples that were misclassified in the previous iteration. Since the introduction of the original boosting algorithm, AdaBoost (freund1995desicion, ), several new approaches to boosting have been proposed. In LogitBoost (friedman2000additive, ), the logistic loss function is minimised while combining the sub-classifiers in a binary classification context. A linear programming approach to boosting, LPBoost (demiriz2002linear, ), was shown to be competitive with AdaBoost. This algorithm minimises the misclassification error and maximises the soft margin in the feature space generated by the predictions of the weak hypothesis components of the ensemble. A multi-class modification of the binary-class AdaBoost was introduced in (freund1995desicion, ), and an improvement of it was proposed in (hastie2009multi, ). RotBoost (ZHANG20081524, ) is a direct extension of the rotational forest approach (Rodriguez:2006:RFN:1159167.1159358, ) to include boosting. The Gradient Boosting Machine (GBM) (Friedman00greedyfunction, ) is a sequential tree-based ensemble method, where each tree corrects the errors of the previously trained trees. The Stochastic Gradient Boosting Machine (S-GBM) (FRIEDMAN2002367, ) improves GBM by training the component trees on bootstrap samples.
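
The iterative reweighting idea behind boosting can be illustrated with a minimal binary AdaBoost using hypothetical weighted 1-D decision stumps. This sketch shows the general scheme only (weight update and weighted vote), not the multi-class variants discussed above:

```python
import numpy as np

def weighted_stump(X, y, w):
    """Weighted 1-D decision stump: threshold and polarity minimising weighted error."""
    best = (X[0], False, 2.0)
    for t in np.unique(X):
        for flip in (False, True):
            pred = (X >= t).astype(int)
            if flip:
                pred = 1 - pred
            err = np.sum(w * (pred != y))
            if err < best[2]:
                best = (t, flip, err)
    t, flip, _ = best
    return lambda Xq: 1 - (Xq >= t).astype(int) if flip else (Xq >= t).astype(int)

def adaboost(X, y, n_rounds=10):
    """AdaBoost sketch: reweight the data each round so the next weak learner
    concentrates on the examples misclassified so far."""
    n = len(X)
    w = np.full(n, 1.0 / n)                     # uniform initial weights
    models, alphas = [], []
    for _ in range(n_rounds):
        h = weighted_stump(X, y, w)
        miss = h(X) != y
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # model vote: low error -> large vote
        w *= np.exp(np.where(miss, alpha, -alpha))
        w /= w.sum()
        models.append(h)
        alphas.append(alpha)
    def predict(Xq):
        score = sum(a * (2 * m(Xq) - 1) for a, m in zip(alphas, models))
        return (score > 0).astype(int)
    return predict

X = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1, 1, 0, 0, 1, 1])   # not separable by any single stump
clf = adaboost(X, y)
```

No single stump can label this "band" pattern correctly, but the weighted combination of a few stumps can, which is exactly the boosting effect.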

AdaBoost is sensitive to noisy class labels and performs poorly as the level of noise increases (Freund2001, ). This is mainly due to the exponential loss function AdaBoost uses to optimise the ensemble. If a training datapoint has a noisy class-label, AdaBoost will increase its weight for the next iteration, and will keep increasing the weight of the datapoint in a vain attempt to classify it correctly. Therefore, given enough such noisily labelled datapoints, AdaBoost can learn classifiers with poor generalisation ability. Although the performance of bagging also decreases in the presence of class-label noise, it does not degrade as severely as that of AdaBoost (Dietterich2000, ).

To overcome this problem with noisy class-labelled datasets, MadaBoost (Domingo:2000:MMA:648299.755176_madaboost, ) was proposed. MadaBoost changes the standard AdaBoost weight update rule by capping the weight of a datapoint at its initial value. Similarly, FilterBoost (NIPS2007_3321filterboost, ) optimises the log loss function, leading to a weight update rule which caps the upper bound of a datapoint's weight using a smooth function. BrownBoost (Freund2001, ) and Noise Detection Based AdaBoost (ND_AdaBoost) (CAO20124451, ) make AdaBoost more robust to class-label noise by explicitly identifying noisy examples and ignoring them. Robust Multi-class AdaBoost (Rob_MulAda) (SUN201687, ) is an extension of ND_AdaBoost to multi-class classification. Vote-Boosting (SABZEVARI2018119, ) decides the weight of each datapoint during training based on the disagreement between the predictions of the component classifiers that exist at each iteration. For lower levels of class-label noise, the datapoints with higher disagreement rates are emphasised, whereas for higher levels of class-label noise, datapoints on which the component classifiers agree are emphasised, in an attempt to achieve robustness to class-label noise. A comprehensive review and analysis of the different boosting variations can be found in (zhou2012ensemble, ).

Stacking (WOLPERT1992241, ; Ting97stackedgeneralization:, ) is a two-stage process in which the outputs of a collection of first-stage base classifiers are combined by a second-stage classifier to produce a final output. Seewald (Seewald:2002:MSB:645531.656165, ) empirically showed that the extension to stacking by Ting and Witten (Ting97stackedgeneralization:, ) does not perform well in the multi-class context, and proposed StackingC to overcome this drawback. In (MENAHEM20094097_troika, ) the weaknesses of StackingC were highlighted and shown to arise from the increasingly skewed class distributions caused by the binarisation of the multi-class problem, and a three-layered improved stacking method for multi-class classification, Troika (MENAHEM20094097_troika, ), was proposed. The stacking approach to building ensembles has received much less research attention than approaches based on bagging and boosting.
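
The two-stage stacking idea can be sketched as follows: first-stage base classifiers produce outputs which become the input features of a second-stage combiner. Here the base classifiers are hypothetical fixed stumps and the second stage is a least-squares linear combiner; practical stacking implementations instead fit the second stage on held-out (cross-validated) first-stage predictions to avoid overfitting:

```python
import numpy as np

def stump(feature, threshold):
    """A fixed first-stage base classifier: 1 if X[:, feature] >= threshold."""
    return lambda X: (X[:, feature] >= threshold).astype(float)

def stack(base_models, X, y):
    """Second stage: fit a linear combiner (least squares) on the base outputs."""
    Z = np.column_stack([m(X) for m in base_models] + [np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    def predict(Xq):
        Zq = np.column_stack([m(Xq) for m in base_models] + [np.ones(len(Xq))])
        return (Zq @ beta >= 0.5).astype(int)
    return predict

# Toy 2-feature AND problem: neither single stump is sufficient on its own,
# but the second-stage combiner can weigh their outputs jointly.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])
base = [stump(0, 0.5), stump(1, 0.5)]
clf = stack(base, X, y)
```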

2.2 The Kalman filter

The Kalman filter (kalman1960, ) is a mathematical tool for the stochastic estimation of the state of a linear system based on noisy measurements. Let there be a system which evolves linearly over time, and assume that the state of the system, which is unobservable, has to be estimated at each time step $t$. The state may be estimated in two ways. First, a linear model, which is used to update the state of the system from step $t-1$ to step $t$, can be used to get an a priori estimate of the state. This estimate will have a degree of uncertainty, as the linear model is unlikely to fully capture the true nature of the system. Estimating the state using this type of linear model is commonly known as the time update step. Second, an external sensor can provide a state estimate. This estimate will also have an associated uncertainty, referred to as measurement noise, introduced by inaccuracies in the measurement process.

Figure 1: A high-level illustration of a Kalman filter

Given these two state estimates, and their related uncertainties, the Kalman filter combines the a priori estimate and the measurement to generate an a posteriori state estimate, such that the uncertainty of the a posteriori estimate is minimised. This combination of a sensor measurement with an a priori estimate is commonly known as the measurement update step. The process iterates using the a posteriori estimate calculated in a measurement update step as input to the time update step of the next iteration. A high-level illustration of the Kalman filter is shown in Figure 1. More formally, the time update step in a Kalman filter can be defined as:

$\hat{x}^{-}_{t} = A\hat{x}_{t-1} + Bu_{t}$ (1)
$P^{-}_{t} = AP_{t-1}A^{T} + Q$ (2)

where:

  • $\hat{x}^{-}_{t}$ is the a priori estimate at step $t$ given knowledge of the state in the previous step

  • $\hat{x}_{t}$ is the a posteriori estimate at step $t$, which is found by combining the a priori estimate and the measurement

  • $A$ is the state transition matrix which defines the linear relationship between $\hat{x}_{t-1}$ and $\hat{x}^{-}_{t}$

  • $u_{t}$ is the control input vector, containing inputs which change the state based on some external effect

  • $B$ is the control input matrix applied to the control input vector

  • $P^{-}_{t}$ is the covariance matrix representing the uncertainty of the a priori estimate

  • $P_{t}$ is the covariance matrix representing the uncertainty of the a posteriori estimate at step $t$

  • $Q$ is the process noise covariance matrix, representing noise induced during the linear update

Similarly, the measurement update step can be defined as:

$\hat{x}_{t} = \hat{x}^{-}_{t} + K_{t}(z_{t} - H\hat{x}^{-}_{t})$ (3)
$K_{t} = P^{-}_{t}H^{T}(HP^{-}_{t}H^{T} + R_{t})^{-1}$ (4)
$P_{t} = (I - K_{t}H)P^{-}_{t}$ (5)

where

  • $z_{t}$ is the measurement of the system at time $t$

  • $R_{t}$ is the measurement noise covariance matrix

  • $H$ is a transformation matrix relating the state space to the measurement space (when they are the same space, then $H$ can be the identity matrix)

  • $K_{t}$ is the Kalman gain which drives the weighted combination of the measurement and the a priori state

  • $I$ indicates the identity matrix

The Kalman filter iterates through the time update and the measurement update steps. In this work time steps are considered equidistant and discrete. Hence, from this point, “time step” and “iteration” will be used interchangeably. At $t = 0$, an initial estimate for $\hat{x}_{0}$ and $P_{0}$ is used. Next, the time update step is performed using Eq. (1) and (2) to get $\hat{x}^{-}_{t}$ and $P^{-}_{t}$ respectively. The measurement $z_{t}$ and its related uncertainty $R_{t}$ are then obtained from a sensor or other appropriate source. These are combined with the a priori estimate using the measurement update step to find $\hat{x}_{t}$ and $P_{t}$ using Eq. (4), (3) and (5), which are then used in the next iteration $t+1$. A detailed explanation of Kalman filters can be found in (kalman1960, ; maybeck1982stochastic, ), and an intuitive description in (Welch:1995:IKF:897831, ).
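
The iteration above can be sketched for the scalar case (state, measurement, and covariances all scalars, with an identity observation model). The drifting-signal example and all parameter values are illustrative:

```python
import numpy as np

def time_update(x_post, p_post, A=1.0, B=0.0, u=0.0, Q=0.0):
    """Propagate the previous a posteriori estimate through the linear model
    (scalar form of Eq. (1)-(2))."""
    x_prior = A * x_post + B * u
    p_prior = A * p_post * A + Q
    return x_prior, p_prior

def measurement_update(x_prior, p_prior, z, R):
    """Blend the a priori estimate with measurement z of noise variance R
    (scalar form of Eq. (3)-(5), observation model = identity)."""
    K = p_prior / (p_prior + R)          # Kalman gain
    x_post = x_prior + K * (z - x_prior)
    p_post = (1 - K) * p_prior
    return x_post, p_post

# Track a slowly drifting scalar signal from noisy measurements.
rng = np.random.default_rng(0)
truth, x, p = 0.0, 0.0, 1.0
errors = []
for t in range(200):
    truth += 0.01                                 # the true state drifts slowly
    z = truth + rng.normal(0, 0.5)                # noisy sensor reading
    x, p = time_update(x, p, Q=0.01)              # a priori estimate
    x, p = measurement_update(x, p, z, R=0.25)    # a posteriori estimate
    errors.append(abs(x - truth))
```

After a burn-in period, the filtered estimate tracks the truth much more closely than the raw measurements, whose noise standard deviation is 0.5.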

It should be emphasised here that, although a Kalman filter is used and Kalman filters are most commonly used with time series data, the proposed method does not perform time series prediction. Rather the focus is on multi-class classification and the data fusion property of the Kalman filter is used to combine the individual multi-class classifiers in the ensemble. Also, the term “ensemble” in this work relates to multi-class ensemble classifiers, and should not be confused with Ensemble Kalman Filters (EnKF) (evensen2003ensemble, ).

Apart from their applications to time series data and sensor fusion, Kalman filters have been used previously in a small number of supervised and unsupervised machine learning applications. For example, (SISWANTORO2016112, ) improves the predictions of a neural network using a Kalman filter, although this method is essentially a post-processing of the neural network output. Properties of the Kalman filter have been used in combination with heuristics in population-based metaheuristic optimisation algorithms (TOSCANO20101955, ; Monson04thekalman, ), and in an unsupervised context in clustering (PAKRASHI2016704, ; pakrashikhka_10.1007/978-3-319-20294-5_39, ). To the best of the authors’ knowledge this is the first application of Kalman filters to training multi-class ensemble classifiers.

3 Training multi-class ensemble classifiers using a Kalman filter

This section introduces the new perspective on training multi-class ensemble classifiers using a Kalman filter. First, a toy example of static state estimation using a Kalman filter is presented, and then the new perspective is described.

3.1 A static state estimation problem: Estimating voltage level of a battery

Imagine that the exact voltage of a DC battery (which should remain constant) is unknown and needs to be estimated. A sensor is available to measure the voltage level of the battery. The measurements made by this sensor are unfortunately noisy, but the uncertainty associated with the measurements is known. This is a simple example of a static state estimation problem that can be solved by taking multiple noisy sensor measurements of the battery’s voltage, and combining these into a single accurate estimate using a Kalman filter.

The Kalman filter can be applied in this scenario as follows. As it is known that the voltage of the battery does not change, the state transition matrix $A$ in Eq. (1) is the identity matrix; the control input matrix $B$ in Eq. (2) is non-existent; and the process noise covariance matrix $Q$ in Eq. (2) is considered to be zero. The voltage read by the sensor at a particular measurement, and the related uncertainty of that value due to the limited accuracy of the sensor, give $z_{t}$ and $R_{t}$ in Eq. (3) and (4) respectively. Given this information, the Kalman filter time update and measurement update steps can be performed to combine the current estimated voltage, $\hat{x}^{-}_{t}$, and the measurement, $z_{t}$, to get a new and better estimate of the voltage. The process can be repeated, where at each step a new voltage measurement is received from the sensor and combined with the current estimated voltage using the measurement update step.

Note that, after $t$ iterations, the estimated voltage is a combination of the $t$ sensor output values, where the Kalman gain, $K_{t}$ in Eq. (4) and (5), controls the influence of each measurement in the combination. Therefore, after $t$ iterations, the estimated voltage, $\hat{x}_{t}$, can be seen as an ensemble of the values received from the sensor, which are optimally combined. This same idea can be applied to combine noisy base classifiers into a more accurate ensemble model.
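
The battery example can be written out directly. Because the state is static, the time update is the identity and only the measurement update does any work; the true voltage, sensor noise level, and initial guess below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
TRUE_VOLTAGE = 1.5          # the unknown constant to be estimated
R = 0.1                     # known variance of the sensor noise

x, p = 0.0, 1000.0          # poor initial guess with very high uncertainty
for t in range(100):
    z = TRUE_VOLTAGE + rng.normal(0, R ** 0.5)   # noisy sensor reading
    # Time update is trivial here: A = I, no control input, Q = 0,
    # so the a priori estimate equals the previous a posteriori estimate.
    K = p / (p + R)                 # Kalman gain
    x = x + K * (z - x)             # a posteriori estimate of the voltage
    p = (1 - K) * p                 # uncertainty shrinks every step
```

With a near-uninformative initial guess, the estimate converges to (approximately) the running mean of the measurements, and the uncertainty decays roughly as R divided by the number of measurements.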

3.2 Combining multi-class classifiers using the Kalman filter

A machine learning algorithm learns a hypothesis for a specific problem. Assume that all possible hypotheses make up a hypothesis space, as described in (DietterichHSpace, ). (The terms hypothesis and hypothesis space are used to introduce the high-level idea in connection with (DietterichHSpace, ), but the terms model and model space will be used synonymously throughout this text.) Any point in the hypothesis space represents one hypothesis. For a specific problem, there is at least one ideal hypothesis within this hypothesis space which the learning algorithm tries to reach. Different hypotheses within the hypothesis space differ in their trainable parameters, and the machine learning algorithm modifies these parameters. Therefore, the training process can be seen as a search through the hypothesis space towards the ideal hypothesis.

The perspective presented in this paper views the ideal hypothesis as the static state to be estimated, and the hypothesis space as a state space. When an individual component classifier, $\hat{h}_{t}$, is trained, it can be seen as a point in the hypothesis space. Here, $\hat{h}_{t}$ can be considered an attempt to measure the ideal state, with a related uncertainty indicated by the training error of $\hat{h}_{t}$. The Kalman filter can be used to estimate the ideal state by combining these multiple noisy measurements. The combination of these noisy measurements leads to an estimate of the state that is expected to be more accurate than the individual measurements, and so to an ensemble classification model that is more accurate than its component classifiers.

This is illustrated in Figure 2. The vertical axis is an abstract representation of the hypothesis space with each point along this axis representing a possible hypothesis. The star symbol on the vertical axis indicates the ideal hypothesis for a specific classification problem. The horizontal axis in Figure 2 represents training iterations proceeding from left to right. The circles are the estimates of the hypothesis at a time step (the combination of all models added to the ensemble to this point in the training process), and the plus symbols represent the measurement of the hypothesis at a time step (the last model added to the ensemble). The dashed and solid arrows connecting the state estimates indicate the combination of the measurement and the a priori estimate respectively. The goal of the process is to reach a hypothesis as close as possible to the ideal hypothesis (indicated by the horizontal line marked with a star) by combining multiple individual hypotheses using a Kalman filter.

Figure 2: An illustration of the state estimation perspective on training a classification ensemble. The vertical axis indicates the hypothesis space and the horizontal axis represents iterations. Any point along the vertical axis represents a hypothesis. The horizontal line marked with the star indicates the ideal hypothesis. The circles are the estimates, and the plus symbols are the measurements of the hypothesis. The absolute distance of the circles and the plus symbols from the bold horizontal line indicating the ideal hypotheses represents the uncertainty of the estimates. The target of the training algorithm is to navigate through the hypothesis space to get as close to the ideal hypothesis as possible.

To help with understanding the new perspective, the Kalman filter-based approach to ensemble training can be mapped directly back to the DC battery voltage estimation example described in Section 3.1. The ensemble model capturing the ideal hypothesis is equivalent to the actual voltage level of the DC battery. An individual component classifier, $\hat{h}_{t}$, is analogous to an output from the voltage sensor. The classification error of the model maps to the uncertainty related to the voltage sensor measurements. Just as the estimated voltage after $t$ iterations can be thought of as an ensemble of sensor measurements in the battery voltage estimation case, the trained individual classifiers combined using the Kalman filter lead to an ensemble of classifier models.

4 Kalman Filter-based Heuristic Ensemble (KFHE)

This section provides a detailed description of the Kalman Filter-based Heuristic Ensemble (KFHE) algorithm, based on the new perspective proposed in Section 3. First, Section 4.1 presents an overview of the algorithm and connects the high-level concepts from Section 3. Sections 4.2, 4.3 and 4.4 then discuss the details of the algorithm.

4.1 Algorithm overview

In KFHE the Kalman filter used to estimate an ensemble classifier, as described in Section 3, is referred to as the model Kalman filter, abbreviated to kf-m. To implement kf-m, the following questions must be answered:

  1. What should constitute a state?

  2. How should the time update step be defined?

  3. What should constitute a measurement?

  4. How should measurement uncertainty be evaluated?

Figure 3: Overall dataflow between kf-m and kf-w

The kf-m state estimates are essentially the trained component classifiers. A model specification (for example, the rules encoded in a decision tree or the weight values in a neural network) cannot be used directly as a state within the Kalman filter framework. Instead, the predictions made by a component classifier for the instances in the training dataset are used as the representation of the state, as shown in Figure 4. This allows states to be combined using the equations in Section 2.2. This representation is explained in detail in Section 4.2.

Heuristics are used to address the remaining questions. The time update step is implemented as the identity function, as it can be assumed that the ideal state is static and does not change over time (as indicated by the horizontal line in Figure 2). The measurement is a function of the output of the multi-class classifier trained at the $t$-th iteration. This model is trained using a weighted sample from the overall training dataset. The classification error of the model trained at the $t$-th iteration, measured against its predictions for the full training set, is used as the uncertainty of the measurement.

A Kalman filter is then used to combine a measurement, which is the classification model at step $t$ represented as shown in Figure 4, and the a priori estimate to get an a posteriori estimate. The a posteriori state estimate at the $t$-th iteration is considered the ensemble classifier up to the $t$-th iteration. This a posteriori estimate is used in the next iteration, and the process continues until a stopping condition is met. As the uncertainties of the estimates are represented as classification errors, the process moves towards estimating states expected to yield lower classification errors.

The use of weighted samples from the training set to train component classifiers at each step of the kf-m process gives rise to another question: how should the weights for the weighted sampling of the training dataset be decided? In KFHE the answer is through another Kalman filter, which is referred to as the weight Kalman filter and abbreviated to kf-w. The kf-w Kalman filter works very similarly to kf-m, but estimates sampling weights for the training dataset instead of the overall model state. This is described in detail in Section 4.3.

The interactions between the model Kalman filter, kf-m, and the weight Kalman filter, kf-w, are illustrated in Figure 3. Essentially kf-w provides weights for the measurement step in kf-m, and kf-m provides measurement errors back to kf-w for its measurement step. The training process is summarised in Algorithm 1 and the following subsections describe the workings of kf-m and kf-w in detail.

4.2 The model Kalman filter: kf-m

The model Kalman filter, kf-m, estimates the ensemble classifier by combining component classifiers into a single ensemble classification model. This is a static estimation problem, as the state to be estimated, the ideal ensemble classifier, does not change over time. For this reason the time update step for kf-m is the identity function, and the a posteriori estimate of iteration $t-1$ is directly transferred to the a priori estimate at iteration $t$.

Figure 4: The representation of a state for kf-m.

The trained base classifiers of the ensemble (the measurements) and the a posteriori state estimate (the ensemble classifier) are not themselves directly usable as states in the Kalman filter framework. Therefore a proxy numerical representation is required to perform the computations. The proxy representation of the state is shown in Figure 4, where each row represents a datapoint from the training set and holds the estimated scores for the classes for the corresponding datapoint. The class membership of a datapoint is determined by taking the class with the maximum score. For example, in Figure 4 the first datapoint has its highest prediction score on one class-label, and thus the first datapoint is considered a member of that class. This representation of a model is used as the state in the Kalman filter framework.
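
The proxy representation can be illustrated numerically. The score matrices below are hypothetical stand-ins for the matrix of Figure 4; the point is that class membership is an argmax over each row, and that two such matrices (an a priori estimate and a measurement, say) can be blended with a scalar Kalman gain:

```python
import numpy as np

# Hypothetical prediction-score matrix for 4 training datapoints and 3 classes:
# each row holds the ensemble's class scores for one datapoint.
state = np.array([
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
])

# Class membership: the class with the maximum score in each row.
membership = state.argmax(axis=1)

# Because states are plain numeric matrices, two of them can be blended
# elementwise with a scalar gain K, exactly as the Kalman equations require.
measurement = np.array([
    [0.1, 0.8, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.1, 0.7],
    [0.3, 0.6, 0.1],
])
K = 0.4
posterior = state + K * (measurement - state)
```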

Hence, the time update equations for kf-m are very simply defined as:

$\hat{x}^{m-}_{t} = \hat{x}^{m}_{t-1}$ (6)
$p^{m-}_{t} = p^{m}_{t-1}$ (7)

where

  • $\hat{x}^{m}_{t-1}$ is the a posteriori estimate from the previous iteration, and $\hat{x}^{m-}_{t}$ represents the a priori estimate at the present iteration. These are the predictions of the ensemble model at the $(t-1)$-th iteration in the representation shown in Figure 4, where the prediction for each datapoint is a vector of prediction scores, one score per class in the prediction problem.

  • $p^{m}_{t-1}$ and $p^{m-}_{t}$ are the uncertainties related to $\hat{x}^{m}_{t-1}$ and $\hat{x}^{m-}_{t}$ respectively.

Eq. (6) is derived directly from Eq. (1) by setting $A$ to the identity matrix and assuming that the control input term $Bu_{t}$ is non-existent (there is no control process involved in KFHE). $H$ in Eq. (4), (3), and (5) is set to the identity matrix. Also, it is assumed that no process noise is induced, hence $Q$ in Eq. (2) is set to $0$ to get Eq. (7). The superscript $m$ throughout indicates that these parameters are related to the model Kalman filter, kf-m.

The kf-m measurement step is more interesting. At every $t$-th iteration a new classification model, $\hat{h}_{t}$, is trained with a weighted sample of the training dataset. The sampling is done with replacement, with the same number of datapoints as in the original training dataset. The weights are designed to highlight the points which were misclassified previously, as is common in boosting algorithms (although here the weight updates are performed using the other Kalman filter, kf-w). The measurement is taken as the average of the previous prediction, $\hat{x}^{m-}_{t}$, and the prediction of this $t$-th model, $\hat{h}_{t}(D)$, as in Eq. (8). This effectively attempts to capture how much the model trained at the present iteration impacts the ensemble predictions up to iteration $t$. Therefore the measurement step and its related error for kf-m become:

$z^{m}_{t} = (\hat{x}^{m-}_{t} + \hat{h}_{t}(D)) / 2$ (8)
$r^{m}_{t} = err(z^{m}_{t}, y)$ (9)
$\hat{x}^{m}_{t} = \hat{x}^{m-}_{t} + K^{m}_{t}(z^{m}_{t} - \hat{x}^{m-}_{t})$ (10)
$K^{m}_{t} = p^{m-}_{t} / (p^{m-}_{t} + r^{m}_{t})$ (11)
$p^{m}_{t} = (1 - K^{m}_{t})p^{m-}_{t}$ (12)

where:

  • $\hat{h}_{t} = \zeta(D, \hat{w}_{t})$ is a model trained on the dataset $D$ using the learning algorithm $\zeta$, where the dataset is sampled using the weights $\hat{w}_{t}$.

  • $\hat{h}_{t}(D)$ indicates the predictions made by the trained model $\hat{h}_{t}$ for the datapoints in the dataset $D$.

  • $z^{m}_{t}$ represents the measurement heuristic, the representation of which is as explained in Figure 4.

  • $r^{m}_{t}$ is the uncertainty related to $z^{m}_{t}$, a misclassification rate calculated by comparing the class predictions of the current ensemble measurement, $z^{m}_{t}$, with the ground truth classes, $y$.

The remaining steps of the Kalman filter process to compute the Kalman gain, the a posteriori state estimate, and the variance are as described for the standard Kalman filter framework, but are repeated in Eq. (11), (10) and (12) for completeness. Note that the uncertainty and the Kalman gain are scalars in the KFHE implementation, as the state to be estimated is one model and only one measurement is taken per iteration.
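
One pass of the kf-m loop can be sketched as below. The component classifiers are replaced by a hypothetical fake_learner that emits noisy one-hot predictions, so this only illustrates the update arithmetic of Eq. (8)-(12), not the full KFHE training procedure (which trains real models on weighted samples):

```python
import numpy as np

rng = np.random.default_rng(0)
n, g = 100, 3                         # datapoints, classes
y = rng.integers(0, g, n)             # ground-truth class labels

def err(state, y):
    """Misclassification rate of a score-matrix state against the true labels."""
    return np.mean(state.argmax(axis=1) != y)

def fake_learner():
    """Stand-in for a trained component classifier: predicts the true class
    with probability 0.7, otherwise a random class (a weak, noisy model)."""
    noisy = np.where(rng.random(n) < 0.7, y, rng.integers(0, g, n))
    return np.eye(g)[noisy]           # one-hot prediction-score matrix

# kf-m loop: the time update is the identity, so only the measurement
# update (Eq. (8)-(12)) appears inside the loop.
x = fake_learner()                    # initial state: first model's predictions
p = 1.0                               # initial uncertainty (high)
for t in range(20):
    h = fake_learner()                # predictions of the t-th component model
    z = (x + h) / 2                   # Eq. (8): measurement
    r = max(err(z, y), 1e-10)         # Eq. (9): measurement uncertainty
    K = p / (p + r)                   # Eq. (11): scalar Kalman gain
    x = x + K * (z - x)               # Eq. (10): a posteriori estimate
    p = (1 - K) * p                   # Eq. (12): uncertainty update
```

Even though every individual fake model misclassifies roughly 20% of the points, the blended state accumulates evidence across iterations and ends with a substantially lower error.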

To initialise the kf-m process, the initial learner is trained as $\hat{h}_{0} = \zeta(D, \hat{w}_{0})$ and $\hat{x}^{m}_{0} = \hat{h}_{0}(D)$, where $\hat{w}_{0}$ is a uniform distribution. Also, the initial uncertainty $p^{m}_{0}$ is set to a high value, indicating that the initial a priori estimate is uncertain. After initialisation, the iteration starts at $t = 1$. The goal of the training phase is to compute and store the learned models $\hat{h}_{t}$ and the Kalman gain values $K^{m}_{t}$ for all iterations $t$.

To avoid measurements with large errors, if the measurement error is more than $(1 - 1/g)$, where $g$ is the number of classes, then the sampling weights, $\hat{w}_{t}$, are reset to a uniform distribution, a modification similar to that used in the AdaBoost implementation in (adabag, ).

4.3 The weight Kalman filter: kf-w

The previous description mentioned how a component learner depends on a vector of sampling weights, $\hat{w}_t$, which is estimated using kf-w. The purpose of $\hat{w}_t$ is to give more weight to the datapoints which were not classified correctly in the previous iteration, to encourage specialisation. The implementation of kf-w is very similar to the previous Kalman filter implementation. In this case the state estimated by the Kalman filter is a vector of real numbers representing weights. The time update step in this case is also the identity function:

$$\hat{w}_t^- = \hat{w}_{t-1} \qquad (13)$$
$$\hat{p}_t^{w-} = \hat{p}_{t-1}^{w} \qquad (14)$$

To estimate the measurement of the weights the following equations are used:

$$z_t^w \propto g(\mathbf{e}_t) \qquad (15)$$
$$r_t^w = r_t \qquad (16)$$

This heuristic derives the measurement of kf-w, $z_t^w$, from the classification error, $\mathbf{e}_t$, of the measurement of kf-m, as shown in Figure 3. In Eq. (15), the function $g$ can adjust the impact of misclassified datapoints on the weight vector. In the present work on KFHE, two options are explored: $g(\mathbf{e}_t) = \mathbf{e}_t$ and $g(\mathbf{e}_t) = \exp(\mathbf{e}_t)$, where the second option places more emphasis on misclassified datapoints. We refer to the variant of KFHE using the first, linear definition of $g$ as KFHE-l and the variant using the second, exponential definition as KFHE-e.

A trivial heuristic is used in this step to compute the measurement error, $r_t^w$, by setting it to $r_t$ (Eq. (16)). This assumes the weight measurement, $z_t^w$, has an error at most equal to the last measurement error for kf-m; that is, the weights will lead to a model with an error no greater than the last measurement by kf-m. The measurement update of kf-w becomes:

$$\hat{w}_t = \hat{w}_t^- + K_t^w \, (z_t^w - \hat{w}_t^-) \qquad (17)$$
$$K_t^w = \frac{\hat{p}_t^{w-}}{\hat{p}_t^{w-} + r_t^w} \qquad (18)$$
$$\hat{p}_t^w = (1 - K_t^w)\,\hat{p}_t^{w-} \qquad (19)$$

The superscript $w$ indicates that these parameters are related to kf-w. Here $\hat{w}_t$ and $z_t^w$ are vectors, with $\hat{w}_t^{(i)}$ and $z_t^{w(i)}$ representing the weight estimate and the weight measurement of the $t$th iteration for the $i$th datapoint.

The equations for kf-w to compute the Kalman gain, $K_t^w$; the a posteriori state estimate for the weights, $\hat{w}_t$; and the variance, $\hat{p}_t^w$, are shown in Eq. (18), (17) and (19). These are identical to those presented for kf-m in Section 4.2 (except for the superscripts), but are included here for completeness.

Initially, $\hat{w}_0$ is set to have equal weights for every datapoint in the training set, and $\hat{p}_0^w$ is initialised to $1$. Note that under this implementation, the calculation of the measurement error for kf-w and the initialisation of $\hat{p}_0^w$ make the Kalman gain $K_t^w$ the same as $K_t$. No information from the kf-w process needs to be stored to support predictions from the ensemble.
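A kf-w iteration can be sketched in the same style as kf-m. The helper name `kfw_step`, the per-datapoint error vector, and the toy values are illustrative assumptions; the sketch only shows the mechanics of Eq. (15)-(19) with the linear ("l") and exponential ("e") choices of $g$:

```python
import numpy as np

# One kf-w iteration: the weight measurement is derived from the per-datapoint
# classification errors of the kf-m measurement via g (linear for KFHE-l,
# exponential for KFHE-e), then blended into the running weight vector with a
# scalar Kalman gain whose measurement error is reused from kf-m (Eq. (16)).

def kfw_step(w_hat, p_prior, per_point_err, r_m, variant="l"):
    g = (lambda e: e) if variant == "l" else (lambda e: np.exp(e))
    z_w = g(per_point_err)
    z_w = z_w / z_w.sum()                   # normalise to a sampling distribution
    k = p_prior / (p_prior + r_m + 1e-12)   # Eq. (18), with r^w = r (Eq. (16))
    w_post = w_hat + k * (z_w - w_hat)      # Eq. (17)
    p_post = (1.0 - k) * p_prior            # Eq. (19)
    return w_post / w_post.sum(), p_post

err = np.array([0.9, 0.1, 0.2, 0.8])        # datapoints 0 and 3 mostly misclassified
w0 = np.full(4, 0.25)
w_l, _ = kfw_step(w0, 1.0, err, r_m=0.5, variant="l")
w_e, _ = kfw_step(w0, 1.0, err, r_m=0.5, variant="e")
```

In both variants the misclassified datapoints end up with larger sampling weights than the correctly classified ones, while the Kalman gain keeps the update from jumping all the way to the raw measurement.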

1: procedure kfhe_train($D$, $T$)    ($D$: training dataset, $T$: max iterations)
2:     Initialise kf-m: $\hat{h}_0$ and $\hat{p}_0$ following Section 4.2
3:     Initialise kf-w: $\hat{w}_0$ and $\hat{p}_0^w$ following Section 4.3
4:     $t \leftarrow 1$
5:     while $t \leq T$ do
6:         (kf-m, Section 4.2)
7:         kf-m time update: find $\hat{h}_t^-$, Eq. (6), and related uncertainty $\hat{p}_t^-$, Eq. (7)
8:         kf-m measurement: train $h_t$, compute $z_t$ and $r_t$, Eq. (8) and (9)
9:         if misclassification rate of $z_t$ is more than $(1 - 1/c)$ then
10:            reset $\hat{w}_t$ and $\hat{p}_t^w$ to initial values
11:            continue    (repeat measurement step)
12:        end if
13:        kf-m measurement update: compute $K_t$, $\hat{h}_t$ and $\hat{p}_t$, using Eq. (10)-(12)
14:        (kf-w, Section 4.3)
15:        kf-w time update: find $\hat{w}_t^-$ and related uncertainty $\hat{p}_t^{w-}$ using Eq. (13) and (14)
16:        kf-w measurement: compute $z_t^w$ and $r_t^w$ using Eq. (15) and (16)
17:        kf-w measurement update: compute $K_t^w$, $\hat{w}_t$ and $\hat{p}_t^w$ using Eq. (17)-(19)
18:        $t \leftarrow t + 1$
19:    end while
20:    return the learned classifier models, $h_t$, and the kf-m Kalman gain values, $K_t$
21: end procedure
Algorithm 1: KFHE training

4.4 Making predictions using KFHE

The goal of KFHE training is to calculate and store the trained base model, $h_t$, and Kalman gain, $K_t$, for each iteration, $t$, of the model Kalman filter, kf-m, for $t = 1$ to $T$ (the total number of component classifiers trained). Once this is done, generating predictions is straightforward. Given a new datapoint, $\mathbf{x}$, first $\hat{h}_0(\mathbf{x})$ is found using the initial model $h_0$. Then Eq. (8) and (10) are iteratively applied to generate predictions from each model, $h_t(\mathbf{x})$, which are combined using the appropriate Kalman gain values, $K_t$. The final value, $\hat{h}_T(\mathbf{x})$, is taken as the ensemble prediction, and is a vector containing a prediction score for each class. Datapoints are classified as belonging to the class with the maximum score. Algorithm 2 summarises the prediction process for KFHE.
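The prediction replay can be sketched compactly. The function name `kfhe_predict` and the toy score matrices are assumptions for illustration; the body follows the measurement/update recurrence described above:

```python
import numpy as np

# KFHE prediction sketch: replay the stored component predictions and Kalman
# gains. Each step forms the measurement as the average of the running
# estimate and the t-th component's prediction (Eq. (8)), then blends it in
# with the stored gain K_t (Eq. (10)).

def kfhe_predict(component_preds, gains):
    """component_preds: list of (n_points, n_classes) score matrices, the
    first being the initial model h_0; gains: list of scalar gains K_t."""
    y = component_preds[0]                 # initial a posteriori estimate
    for h_pred, k in zip(component_preds[1:], gains):
        z = (y + h_pred) / 2.0             # measurement, Eq. (8)
        y = y + k * (z - y)                # a posteriori update, Eq. (10)
    return y.argmax(axis=1)                # class with the maximum score

preds = [np.array([[0.6, 0.4]]),           # h_0 leans towards class 0
         np.array([[0.2, 0.8]]),           # later components lean towards class 1
         np.array([[0.1, 0.9]])]
labels = kfhe_predict(preds, gains=[0.9, 0.9])   # high-gain components dominate
```

With high gains the later, confident components override the initial model, so the single toy datapoint is assigned class 1.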

1: procedure kfhe_prediction($\mathbf{x}$, $h_t$, $K_t$, $T$)    ($\mathbf{x}$: test datapoint, $h_t$: the $t$th sub-classifier, $K_t$: the $t$th Kalman gain, $T$: max training iterations)
2:     $\hat{h}_0(\mathbf{x}) \leftarrow h_0(\mathbf{x})$    (initial a posteriori estimate)
3:     $t \leftarrow 1$
4:     while $t \leq T$ do
5:         compute $\hat{h}_t^-(\mathbf{x})$, the time update for kf-m, using Eq. (6)
6:         compute $z_t$, the measurement for kf-m, using $h_t(\mathbf{x})$ and Eq. (8)
7:         compute the a posteriori estimate $\hat{h}_t(\mathbf{x})$ using Eq. (10)
8:         $t \leftarrow t + 1$
9:     end while
10:    return $\hat{h}_T(\mathbf{x})$    (class-wise prediction scores)
11: end procedure
Algorithm 2 KFHE prediction

5 Experiments

This section describes the datasets, algorithms, experimental setup, and evaluation processes used in a set of experiments designed to evaluate the effectiveness of the KFHE algorithm. Two variants of KFHE, KFHE-e and KFHE-l (as described in Section 4.3), are evaluated and a set of state-of-the-art ensemble methods are used as benchmarks.

5.1 Datasets & performance measure

30 multi-class datasets (described in Table 1) from the UCI Machine Learning repository (Lichman:2013, ) are used. These datasets are frequently used in classifier benchmark experiments (Dietterich2000, ; ZHANG20081524, ; CAO20124451, ; SUN201687, ), cover diverse domains, have numbers of classes ranging from 2 to 15, and exhibit varying amounts of class imbalance.

dataset names #datapoints #dimensions #classes
mushroom 8124 22 2
iris 150 5 3
glass 214 10 6
car_eval 1728 7 4
cmc 1473 10 3
tvowel 871 4 6
balance_scale 625 5 3
breasttissue 106 10 6
german 1000 21 2
ilpd 579 11 2
ionosphere 351 34 2
knowledge 403 6 4
vertebral 310 7 2
sonar 208 61 2
diabetes 145 4 3
skulls 150 5 5
physio 464 37 3
flags 194 30 8
bupa 345 7 2
cleveland 303 14 5
haberman 306 4 2
hayes-roth 132 6 3
monks 432 7 2
newthyroid 432 7 3
yeast 1484 9 10
spam 4601 58 2
lymphography 148 19 4
movement_libras 360 91 15
SAheart 462 10 2
zoo 101 17 7
Table 1: The datasets used in this paper

To evaluate the performance of each model, the macro-averaged $F_1$-score (kelleher2015fundamentals, ) was used. The $F_1$-score in a binary classifier context indicates how precise as well as how robust a classifier model is, and it can be easily extended to a multi-class scenario. The macro-averaged $F_1$-score will be denoted as $F_1^{macro}$, and is defined as:

$$F_1^{macro} = \frac{1}{c} \sum_{k=1}^{c} \frac{2 \times prec_k \times rec_k}{prec_k + rec_k}$$

where $prec_k$ and $rec_k$ are the precision and recall values for the $k$th class, and $c$ is the number of classes. $F_1^{macro}$ is appropriate for this experiment because the datasets used exhibit different levels of class imbalance.
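The definition above can be computed directly. This is a minimal sketch matching the macro-averaging described here (unweighted mean of per-class $F_1$, with a zero convention for empty denominators); the function name is illustrative:

```python
import numpy as np

# Macro-averaged F1: per-class precision and recall, F1 per class, then the
# unweighted mean over classes, so minority classes count as much as the
# majority class.

def macro_f1(y_true, y_pred, n_classes):
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 1])
score = macro_f1(y_true, y_pred, n_classes=3)
# Per-class F1: 0.8, 0.5, 0.667, so the macro average is about 0.656.
```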

5.2 Experimental setup

The state-of-the-art methods used as benchmarks are AdaBoost (zhu2006multi, ), Bagging (Breiman1996, ), Gradient Boosting Machine (GBM) (Friedman00greedyfunction, ) and Stochastic Gradient Boosting Machine (S-GBM) (FRIEDMAN2002367, ). This set covers the different fundamental ensemble classifier types described in Section 2. For all algorithms, including KFHE-e and KFHE-l, the component learners are CART models (kelleher2015fundamentals, ). The performance of a single CART model is also included as a baseline to compare against the ensemble methods. The number of ensemble components is set to 100 for all algorithms (initial experiments showed that for all datasets there were no significant improvements in performance beyond 100 components).

All implementations and evaluations were performed in R (a version of KFHE is available at https://github.com/phoxis/kfhe). The AdaBoost and Bagging implementations were from the package adabag (adabag, ), and the GBM and S-GBM implementations were from the package gbm (gbm, ). As multi-class datasets were used in this experiment, the multi-class variant of AdaBoost, AdaBoost.SAMME (zhu2006multi, ), was used (this will be referred to simply as AdaBoost in the remainder of the paper). For the KFHE experiments, training was stopped when the value of $\hat{p}_t$ reached $0$, which can be interpreted as an indication that the state estimated by kf-m has no uncertainty.

The experiments were divided into two parts. First, to evaluate the effectiveness of KFHE-e and KFHE-l and to compare these to the state-of-the-art methods, the performance of all algorithms is assessed using the datasets listed in Table 1. Second, the robustness of the different algorithms to class-label noise is compared. For both sets of experiments, for each algorithm-dataset pair, a repeated $k$-fold cross-validation experiment was performed, and the mean of the $F_1$-scores across the folds was measured.

For the second set of experiments, class-label noise was introduced synthetically into each of the datasets in Table 1. To induce noise, a fraction of the datapoints from the training set was sampled randomly and the class of each selected datapoint was randomly changed, following a uniform distribution, to a different one. For each dataset in Table 1, datasets with 5%, 10%, 15% and 20% noise were generated. For each of these noisy datasets, a repeated $k$-fold cross-validation experiment was performed. For each fold, the noisy class labels were used in training, but the $F_1$-scores were computed with respect to the original unchanged dataset labels.
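The noise-injection procedure can be sketched as follows. The function name `add_label_noise` and the toy label vector are assumptions for the example; the body implements the scheme described above (sample a fraction of points, reassign each a uniformly chosen different class):

```python
import numpy as np

# Synthetic class-label noise: pick a fraction of training points at random
# (without replacement) and reassign each one a different class chosen
# uniformly. The evaluation labels stay untouched.

def add_label_noise(y, fraction, n_classes, rng):
    y_noisy = y.copy()
    n_flip = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    for i in idx:
        choices = [c for c in range(n_classes) if c != y[i]]  # force a change
        y_noisy[i] = rng.choice(choices)
    return y_noisy

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 100)     # 300 labels, 3 balanced classes
y_noisy = add_label_noise(y, fraction=0.10, n_classes=3, rng=rng)
# Exactly 10% of the labels now differ from the originals.
```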

6 Results

The experiment results comparing the performance of KFHE-e and KFHE-l to the other methods are shown first. Next, the results of the experiments comparing the performance of the different methods in the presence of noisy class-labels are presented. Statistical significance tests that analyse the differences between the proposed and other methods are also presented.

6.1 Performance comparison of the methods

The relative performance of each algorithm, based on the average $F_1$-scores achieved in the cross-validation experiments, on each of the datasets was ranked (from 1 to 7, where 1 implies best performance). The first row of Table 2 (labelled 0%) shows the average rank of each algorithm across the datasets (detailed performance results for each algorithm on each dataset are shown in Table 4 in A). These average ranks are also visualised in the first column of Figure 5 (also labelled 0%).

The average ranking shows that KFHE-e was able to attain the best average rank, 2.78; AdaBoost was very close with average rank 2.98, followed by KFHE-l with average rank 3.33. It is clear that KFHE-e outperformed GBM, S-GBM, Bagging and CART. Also, KFHE-l performs better overall than GBM, S-GBM, Bagging and CART. It was concluded that KFHE-l performed slightly less well than KFHE-e and AdaBoost due to the lack of emphasis on misclassified points in the weight measurement step. In Section 6.3, a statistical significance test is performed to uncover significant differences between methods on datasets with non-noisy class labels.

KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
0% 2.78 3.33 2.98 3.70 4.30 4.82 6.08
5% 3.07 3.07 3.77 3.27 4.08 4.62 6.13
10% 3.70 2.70 4.77 3.37 3.50 4.10 5.87
15% 3.87 2.68 4.83 3.48 3.27 3.75 6.12
20% 4.33 2.70 5.40 3.37 3.18 3.10 5.92
Table 2: Average ranks over all datasets for different levels of class label noise summarised from Tables 4 to 8 in A. Lower ranks are better, best ranks are highlighted in boldface. Percent values in the rows indicate class-label noise.
Figure 5: Changes in average rank with noisy class-labels, over the datasets used (y-axis is inverted to highlight lower average ranks are better).

The evolution of the key parameters of KFHE (the measurement error, $r_t$; the a posteriori variance, $\hat{p}_t$; the Kalman gain, $K_t$; and the training misclassification rate of the kf-m component) with respect to the number of ensemble iterations, $t$, for a selection of datasets (knowledge, diabetes, car_eval and lymphography) are plotted in Figure 6. The plots show the results of the initial iterations, after which, for most of the datasets, the error reduces to $0$. The plots for all of the datasets are given in D.

The plots in Figure 6 show that in all cases the value of $\hat{p}_t$ decreases monotonically, which can be interpreted as the system becoming more confident in the a posteriori estimate, and therefore the values of $K_t$ reduce and stabilise, implying less impact for subsequent measurements. This is because of the way the time update step was formulated in Section 4.2: no uncertainty is induced during the time update step, and no process noise is assumed. Therefore, in effect the steepness of $\hat{p}_t$ controls how much of the measurement is combined through Eq. (10) and (11). Also, it is interesting to note the similarity in the rates of change of the error rate of the ensemble and the value of $K_t$: for most of the datasets they show a similar trend. The value of $K_t$ indicates the fraction of the measurement which will be incorporated into the ensemble; a measurement with less error is incorporated more into the final model.
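The monotone decay described above follows directly from the update rules: with an identity time update and no process noise, $\hat{p}_t = (1 - K_t)\,\hat{p}_{t-1}$ with $K_t \in (0, 1]$ can only shrink. A short simulation with a constant measurement error (an illustrative assumption, since in practice $r_t$ varies per iteration) makes this concrete:

```python
# Simulate the variance/gain recurrence of kf-m with a fixed measurement
# error r: both the posterior variance p_t and the gain K_t decay
# monotonically, so later measurements are incorporated less and less.

p, r = 1.0, 0.3
ps, ks = [], []
for _ in range(10):
    k = p / (p + r)      # Kalman gain, Eq. (11)
    p = (1.0 - k) * p    # posterior variance, Eq. (12)
    ks.append(k)
    ps.append(p)
```

Since $r$ is fixed and $\hat{p}_t$ shrinks, each successive gain $K_t = \hat{p}_{t-1}/(\hat{p}_{t-1} + r)$ is strictly smaller than the last, mirroring the stabilisation seen in Figure 6.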

(a) knowledge
(b) diabetes
(c) car_eval
(d) lymphography
Figure 6: Changes in the parameters and the misclassification rate for the training sets for KFHE-e, for the knowledge, diabetes, car_eval, and lymphography datasets.

6.2 Performance for the noisy class-label case

The relative performance of each algorithm, based on the average $F_1$-scores achieved in the cross-validation experiments, on each of the datasets was ranked (from 1 to 7, where 1 implies best performance). This was performed separately for datasets with 5%, 10%, 15% and 20% induced class-label noise. Table 2 shows the average rank of each algorithm for each level of noise (detailed performance results for each algorithm on each dataset are shown in Tables 5 to 8 in A). These average ranks are also visualised in Figure 5. For ease of reading, the vertical axis in Figure 5 is inverted to highlight that lower average ranks are better.

Out of the algorithms tested, the KFHE-l algorithm performs most consistently in the presence of class-label noise. At the 5% noise level KFHE-e and KFHE-l had the same average rank, and as the class-noise level increases to 10%, 15% and 20%, KFHE-l attains the best average rank over the datasets. Along with KFHE-l, S-GBM and Bagging also improve their relative ranking. As the fraction of mislabelled datapoints in the training set increased, the average rank of AdaBoost degraded sharply. The performance of AdaBoost and Bagging in the presence of noisy class labels is studied in (Dietterich2000, ), where a similar result was found. On the other hand, the relative ranks of GBM and CART remained consistently stable.

It should be noted that the degradation of performance with respect to class-label noise in AdaBoost is more severe than in KFHE-e, although both of them use the $\exp$ function to increase the weights of the misclassified datapoints. This is due to the smoothing effect in the KFHE algorithm, which makes KFHE-e less sensitive to noise than AdaBoost. On the other hand, KFHE-l does not use $\exp$ in Eq. (15) for the weight measurement step, which makes it more robust to noise and allows it to achieve high performance across all noise levels.

Figure 7 shows the change in $F_1$-score for each algorithm on the knowledge, diabetes, car_eval, and lymphography datasets (similar plots for all datasets are given in C), as the amount of class-label noise increases (note that to highlight changes in performance, the vertical axes in these charts are scaled to narrow ranges of possible $F_1$-scores). These plots are derived from Tables 4-8. With few exceptions, the performances of KFHE-l, GBM, S-GBM and Bagging are not impacted by noise as much as the other approaches. Although KFHE-e is generally better than the other approaches when there is no class-label noise present, as the induced noise increases, the $F_1$-score for KFHE-e decreases, albeit less severely than in the case of AdaBoost.

(a) knowledge
(b) diabetes
(c) car_eval
(d) lymphography
Figure 7: Changes in -score with increased induced class-label noise for the knowledge, diabetes, car_eval, and lymphography datasets.

6.3 Statistical significance testing

This section presents two types of statistical significance tests that compare the performance of the different algorithms tested. First, to assess the overall differences in performance a multiple classifier comparison test was performed following the recommendations of (GARCIA20102044, ). Second, a comparison of each pair of algorithms in isolation is performed using the Wilcoxon’s Signed Rank Sum test (GARCIA20102044, ).

6.3.1 Multiple classifier comparison

To understand the overall effectiveness of the variants of KFHE (KFHE-e and KFHE-l), following the recommendations of García et al. (GARCIA20102044, ), a multiple classifier comparison significance test was performed (separate tests were performed on the performance of the algorithms at each noise level). First, a Friedman Aligned Rank test was performed. This indicated that, at all noise levels, the null hypothesis that the performance of all algorithms is similar can be rejected. To further investigate these differences, post-hoc pairwise Friedman Aligned Rank tests along with the Finner $p$-value adjustment (GARCIA20102044, ) were performed. Rank plots describing the results of the post-hoc tests are shown in Figure 8.

When no class-label noise is present, the results indicate that KFHE-e was significantly better than S-GBM, Bagging and CART, and that KFHE-l was also significantly better than S-GBM, Bagging and CART. Although KFHE-e attained a better average rank than AdaBoost, the null hypothesis could not be rejected, and so it cannot be concluded that the performances of KFHE-e and AdaBoost are significantly different. Similarly, KFHE-l attained a worse average rank than AdaBoost, but the tests did not identify a statistically significant difference.

The results of the experiment for the datasets with class-label noise indicate that, as the noise continues to increase, the relative performance of KFHE-l improves, but the relative performance of KFHE-e decreases. This is as expected, because of the chosen weight measurement heuristic for the two variants of KFHE, as explained in Section 4.3. KFHE-l was found to be statistically significantly better than S-GBM and Bagging at all but one noise level. KFHE-l was also found to be statistically significantly better than AdaBoost at the higher noise levels. Although the performance of KFHE-e decreases with increasing class-label noise, it does not decrease as sharply as that of AdaBoost. The complete details of the tests are given in Table 9 in B.

Overall these tests confirm that when no class-label noise is present KFHE-e performs as well as AdaBoost and GBM, but significantly better than S-GBM, Bagging and CART. KFHE-e, however, is not as robust to class label noise as the other approaches. KFHE-l, on the other hand, is robust to noise and performs very well in all class-label noise settings.

(a) Rank chart for no induced class-label noise
(b) Rank chart for 5% induced class-label noise
(c) Rank chart for 10% induced class-label noise
(d) Rank chart for 15% induced class-label noise
(e) Rank chart for 20% induced class-label noise
Figure 8: Rank plots from post-hoc Friedman Aligned Rank test with the Finner -value adjustment, using a significance level of . Algorithms connected with the horizontal bars in the sub-plots are similar based on this test.

6.3.2 Isolated algorithm pairs comparison

To further understand how individual algorithm pairs compare with each other, ignoring the other algorithms, a two-tailed Wilcoxon Signed Rank Sum test was performed for each pair of algorithms. It must be emphasised that the Wilcoxon rank sum test cannot be used to perform multiple classifier comparison without introducing Type I errors (rejecting the null hypothesis when it cannot be rejected), as it does not control the Family Wise Error Rate (FWER) (GARCIA20102044, ). Therefore, the $p$-values for each pair from this experiment should only be interpreted in isolation from any other algorithms. Table 3 shows the results of these tests for the datasets without any class-label noise (the corresponding tables in B show the results for the noisy cases). The cells in the lower diagonal show the $p$-values of the Wilcoxon Signed Rank Sum test for the corresponding algorithm pair, and the cells in the upper diagonal show the pairwise win/loss/tie counts.

KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
KFHE-e (19/11/0) (13/16/1) (23/7/0) (21/9/0) (24/6/0) (26/4/0)
KFHE-l 0.009519 *** (12/18/0) (16/14/0) (20/10/0) (25/5/0) (26/4/0)
AdaBoost 0.491795 0.028548 ** (19/11/0) (20/10/0) (22/8/0) (25/5/0)
GBM 0.003018 *** 0.144739 0.013515 ** (22/8/0) (20/10/0) (25/5/0)
S-GBM 0.001128 *** 0.004921 *** 0.002834 *** 0.003418 *** (19/11/0) (25/5/0)
Bagging 0.000210 *** 0.000034 *** 0.000415 *** 0.004108 *** 0.336640 (25/4/1)
CART 0.000010 *** 0.000006 *** 0.000019 *** 0.000055 *** 0.002057 *** 0.000016 ***
Table 3: Result of pairwise Wilcoxon Signed Rank Sum tests over the different datasets when no class-label noise is present. Upper diagonal: win/lose/tie. Lower diagonal: Wilcoxon Signed Rank Sum test $p$-values. * indicates $p < 0.1$, ** indicates $p < 0.05$ and *** indicates $p < 0.01$.

The results in Table 3 show that without class-label noise when compared in isolation KFHE-e performs significantly better than any other method, except AdaBoost. In the noise free case KFHE-l performs significantly better than S-GBM, Bagging and CART. Similarly, the test results at different noise levels (described in B) show that as class-label noise increases, the performance of KFHE-e starts to become significantly better than AdaBoost, although it is worse than other methods. When compared in isolation KFHE-l performs significantly better than almost all other methods at all noise levels.

7 Conclusion and future work

This paper introduces a new perspective on training multi-class ensemble classification models. The ensemble classifier model is viewed as a state to be estimated, and this state is estimated using a Kalman filter. Unlike more common applications of Kalman filters to time series data, this work exploits the sensor fusion property of the Kalman filter to combine multiple individual multi-class classifiers to build a multi-class ensemble classifier algorithm. Based on this new perspective a new multi-class ensemble classification algorithm, the Kalman Filter-based Heuristic Ensemble (KFHE), is proposed.

Detailed experiments on two slight variants of KFHE, KFHE-e and KFHE-l, were performed. KFHE-e is more effective on non-noisy class-labels, as it emphasises the misclassified training datapoints from one iteration of the training algorithm to the next, and KFHE-l is more effective on noisy class-labels as it does not emphasise misclassified training datapoints as much. Experiments show that KFHE-e and KFHE-l perform at least as well as, and in many cases, better than Bagging, S-GBM, GBM and AdaBoost. For datasets with noisy class labels, KFHE-l performed significantly better than all other methods across different levels of class-label noise. For these datasets KFHE-e performed more poorly than KFHE-l, GBM, and S-GBM, but better than AdaBoost.

KFHE can be seen as a hybrid ensemble approach mixing the benefits of both bagging and boosting. Bagging weighs each of the component learners' votes equally, whereas boosting finds the optimal weights with which the component learners are combined. KFHE does not find the optimal weights analytically as AdaBoost does, but attempts to combine the classifiers based on how good the measurement is in a given iteration.

Given the new perspective, other implementations that expand upon KFHE can also be designed following the framework and methods described in Sections 3 and 4. In future, it would be interesting to pursue the following studies:

  • The effect when process noise and a linear time update step are introduced.

  • Multiple and different types of measurements can also be performed. That is, instead of having one component classifier model per iteration, more than one classifier model could be used. This is analogous to having multiple noisy sensors measuring the DC voltage level of the toy example presented in Section 3.1.

  • To further study the effects of other types of noise (class-wise label noise, noise in input space, etc.), higher levels of noise induced in the class-label assignments, and performance on imbalanced class datasets.

Acknowledgements

This research was supported by Science Foundation Ireland (SFI) under Grant number SFI/12/RC/2289. The authors would like to thank Gevorg Poghosyan, PhD Research Student at Insight Centre for Data Analytics, School of Computer Science, University College Dublin, for feedback and discussions which led to the state space representation in Figure 2. Also, the authors would like to thank the unnamed reviewers for their detailed and constructive comments which helped to significantly improve the quality of the paper.

Appendix A Complete experiment results

KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
mushroom 1.0000 0.00 (1.5) 0.9968 0.00 (5) 1.0000 0.00 (1.5) 0.9997 0.00 (3) 0.9990 0.00 (4) 0.9941 0.00 (6.5) 0.9941 0.00 (6.5)
iris 0.9433 0.03 (4) 0.9487 0.03 (1) 0.9448 0.03 (2) 0.9403 0.04 (5) 0.9437 0.03 (3) 0.9376 0.03 (6) 0.9298 0.03 (7)
glass 0.7125 0.08 (3) 0.7153 0.07 (1) 0.7144 0.07 (2) 0.6666 0.08 (4) 0.5695 0.09 (6) 0.5965 0.09 (5) 0.5466 0.06 (7)
car_eval 0.9653 0.02 (2) 0.9011 0.03 (4) 0.9665 0.02 (1) 0.9131 0.04 (3) 0.8236 0.04 (7) 0.8569 0.04 (5) 0.8546 0.04 (6)
cmc 0.5222 0.02 (5) 0.5280 0.03 (2) 0.5038 0.02 (7) 0.5270 0.02 (4) 0.5275 0.02 (3) 0.5291 0.03 (1) 0.5187 0.03 (6)
tvowel 0.8451 0.02 (2) 0.8283 0.03 (3) 0.8279 0.02 (4) 0.8458 0.03 (1) 0.8236 0.03 (5) 0.8004 0.03 (6) 0.7855 0.03 (7)
balance_scale 0.6345 0.03 (1) 0.5984 0.01 (4) 0.6186 0.03 (2) 0.5935 0.02 (5) 0.6045 0.01 (3) 0.5861 0.02 (6) 0.5412 0.02 (7)
flags 0.3059 0.05 (3) 0.2771 0.05 (4) 0.3187 0.06 (2) 0.3236 0.06 (1) 0.2602 0.03 (5) 0.2525 0.03 (6) 0.2439 0.03 (7)
german 0.6907 0.03 (2) 0.6960 0.02 (1) 0.6837 0.03 (5) 0.6860 0.03 (3) 0.6852 0.03 (4) 0.6826 0.02 (6) 0.6550 0.03 (7)
ilpd 0.6126 0.04 (2) 0.5797 0.04 (5) 0.6153 0.04 (1) 0.5809 0.04 (4) 0.5733 0.04 (6) 0.5675 0.04 (7) 0.5865 0.04 (3)
ionosphere 0.9238 0.03 (2) 0.9157 0.03 (4) 0.9298 0.02 (1) 0.9179 0.03 (3) 0.9105 0.03 (5) 0.9004 0.03 (6) 0.8617 0.03 (7)
knowledge 0.9354 0.03 (2) 0.9315 0.03 (3) 0.9524 0.02 (1) 0.9155 0.03 (6) 0.8925 0.04 (7) 0.9184 0.03 (4) 0.9160 0.03 (5)
vertebral 0.8036 0.04 (4) 0.8090 0.04 (2) 0.8001 0.04 (6) 0.8030 0.04 (5) 0.8130 0.05 (1) 0.8042 0.04 (3) 0.7857 0.04 (7)
sonar 0.8072 0.05 (2) 0.7836 0.06 (4) 0.8371 0.05 (1) 0.7893 0.06 (3) 0.7797 0.06 (5) 0.7766 0.06 (6) 0.7021 0.05 (7)
skulls 0.2358 0.06 (5) 0.2380 0.06 (3) 0.2362 0.06 (4) 0.2514 0.08 (1) 0.2436 0.06 (2) 0.2300 0.06 (7) 0.2309 0.07 (6)
diabetes 0.9558 0.03 (7) 0.9647 0.03 (6) 0.9658 0.03 (5) 0.9725 0.02 (2) 0.9722 0.02 (3) 0.9727 0.03 (1) 0.9710 0.03 (4)
physio 0.9069 0.02 (4) 0.9109 0.03 (2) 0.9079 0.02 (3) 0.9046 0.02 (5) 0.9136 0.03 (1) 0.8959 0.03 (6) 0.8847 0.03 (7)
breasttissue 0.6766 0.08 (1) 0.6711 0.08 (2) 0.6606 0.08 (4) 0.6605 0.07 (5) 0.6347 0.08 (6) 0.6653 0.08 (3) 0.6338 0.08 (7)
bupa 0.7027 0.04 (2) 0.7114 0.04 (1) 0.6926 0.04 (6) 0.7018 0.04 (3) 0.6944 0.04 (5) 0.6954 0.04 (4) 0.6433 0.05 (7)
cleveland 0.2975 0.05 (2) 0.2845 0.04 (5) 0.3058 0.04 (1) 0.2938 0.04 (3) 0.2865 0.04 (4) 0.2736 0.04 (7) 0.2766 0.04 (6)
haberman 0.5504 0.05 (6) 0.5743 0.05 (5) 0.5465 0.05 (7) 0.5751 0.05 (4) 0.5996 0.04 (1) 0.5757 0.05 (3) 0.5772 0.05 (2)
hayes_roth 0.8602 0.05 (1) 0.8491 0.05 (3) 0.8510 0.04 (2) 0.6094 0.08 (6) 0.5683 0.10 (7) 0.7418 0.10 (4) 0.6691 0.10 (5)
monks 0.9997 0.00 (2) 0.9981 0.01 (3) 1.0000 0.00 (1) 0.9671 0.06 (4) 0.9114 0.06 (5) 0.9002 0.06 (6) 0.8178 0.09 (7)
newthyroid 0.3973 0.04 (4) 0.3867 0.04 (6) 0.3972 0.04 (5) 0.3742 0.04 (7) 0.4162 0.04 (2) 0.4275 0.04 (1) 0.4087 0.11 (3)
yeast 0.5339 0.05 (1) 0.4701 0.03 (4) 0.5209 0.05 (3) 0.5225 0.05 (2) 0.4359 0.02 (5) 0.4187 0.03 (6) 0.4069 0.03 (7)
spam 0.9477 0.01 (2) 0.9256 0.01 (4) 0.9508 0.01 (1) 0.9309 0.01 (3) 0.9225 0.01 (5) 0.9029 0.01 (6) 0.8870 0.01 (7)
lymphography 0.6733 0.19 (2) 0.5074 0.13 (3) 0.7089 0.18 (1) 0.4454 0.10 (4) 0.3973 0.03 (5) 0.3966 0.03 (6) 0.3704 0.04 (7)
movement_libras 0.7772 0.04 (1) 0.7488 0.05 (3) 0.7679 0.04 (2) 0.6434 0.05 (5) 0.6190 0.05 (6) 0.6715 0.05 (4) 0.5176 0.05 (7)
SAheart 0.6214 0.04 (6) 0.6408 0.04 (4) 0.6090 0.04 (7) 0.6436 0.04 (3) 0.6509 0.03 (1) 0.6466 0.04 (2) 0.6237 0.05 (5)
zoo 0.8548 0.12 (2) 0.8490 0.11 (3) 0.8740 0.11 (1) 0.8450 0.10 (4) 0.5455 0.11 (7) 0.5922 0.09 (5) 0.5840 0.05 (6)
Average rank 2.78 3.33 2.98 3.7 4.3 4.82 6.08
Table 4: Each cell in the table shows the mean and standard deviation of the $F_1$-score (higher value is better) for the cross-validation experiment for each algorithm and each of the datasets listed in Table 1. The values in parenthesis are the relative rankings of the algorithms on the dataset in the corresponding row (lower ranks are better).
KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
mushroom 0.9972 0.00 (4) 0.9941 0.00 (6) 0.9993 0.00 (2) 0.9997 0.00 (1) 0.9990 0.00 (3) 0.9941 0.00 (6) 0.9941 0.00 (6)
iris 0.9205 0.05 (7) 0.9413 0.03 (3) 0.9282 0.04 (6) 0.9359 0.03 (4) 0.9486 0.03 (1) 0.9436 0.03 (2) 0.9335 0.03 (5)
glass 0.6818 0.09 (2) 0.6969 0.07 (1) 0.6597 0.08 (3) 0.6498 0.08 (4) 0.5532 0.08 (6) 0.5876 0.10 (5) 0.5367 0.06 (7)
car_eval 0.8918 0.03 (1) 0.8639 0.03 (4) 0.8765 0.03 (3) 0.8820 0.04 (2) 0.8047 0.04 (7) 0.8451 0.03 (5) 0.8423 0.03 (6)
cmc 0.5223 0.02 (5) 0.5285 0.02 (3) 0.5047 0.03 (7) 0.5303 0.02 (2) 0.5274 0.02 (4) 0.5330 0.02 (1) 0.5197 0.03 (6)
tvowel 0.8346 0.03 (2) 0.8275 0.03 (4) 0.7903 0.03 (6) 0.8435 0.03 (1) 0.8329 0.03 (3) 0.7961 0.03 (5) 0.7805 0.04 (7)
balance_scale 0.5989 0.03 (1) 0.5940 0.02 (3) 0.5917 0.03 (4) 0.5912 0.02 (5) 0.5982 0.02 (2) 0.5799 0.02 (6) 0.5418 0.02 (7)
flags 0.3113 0.06 (3) 0.2988 0.05 (4) 0.3193 0.06 (1) 0.3185 0.06 (2) 0.2678 0.04 (5) 0.2518 0.03 (6) 0.2420 0.04 (7)
german 0.6732 0.03 (2) 0.6765 0.03 (1) 0.6690 0.03 (4) 0.6695 0.03 (3) 0.6655 0.03 (5) 0.6608 0.03 (6) 0.6348 0.04 (7)
ilpd 0.6130 0.04 (2) 0.5874 0.04 (3) 0.6220 0.04 (1) 0.5815 0.04 (4) 0.5755 0.04 (6) 0.5724 0.04 (7) 0.5759 0.04 (5)
ionosphere 0.9093 0.03 (2) 0.9087 0.03 (4) 0.9090 0.03 (3) 0.9018 0.03 (6) 0.9103 0.03 (1) 0.9023 0.03 (5) 0.8507 0.04 (7)
knowledge 0.9360 0.02 (2) 0.9300 0.02 (3) 0.9369 0.02 (1) 0.9188 0.03 (4) 0.8924 0.03 (7) 0.9177 0.03 (6) 0.9181 0.03 (5)
vertebral 0.7928 0.05 (5) 0.8073 0.05 (3) 0.7740 0.05 (6) 0.8024 0.05 (4) 0.8163 0.04 (1.5) 0.8163 0.05 (1.5) 0.7736 0.05 (7)
sonar 0.7900 0.05 (2) 0.7759 0.05 (3) 0.8116 0.05 (1) 0.7719 0.06 (4) 0.7708 0.05 (5) 0.7610 0.05 (6) 0.6867 0.06 (7)
skulls 0.2462 0.07 (3) 0.2226 0.06 (6) 0.2275 0.06 (5) 0.2550 0.07 (1) 0.2510 0.07 (2) 0.2295 0.06 (4) 0.1935 0.06 (7)
diabetes 0.9364 0.04 (6) 0.9731 0.03 (1) 0.9305 0.04 (7) 0.9722 0.02 (2) 0.9705 0.02 (5) 0.9710 0.03 (3) 0.9709 0.03 (4)
physio 0.8781 0.03 (5) 0.9092 0.02 (2) 0.8712 0.03 (6) 0.8995 0.02 (3) 0.9113 0.02 (1) 0.8955 0.03 (4) 0.8658 0.03 (7)
breasttissue 0.6557 0.07 (2) 0.6709 0.07 (1) 0.6511 0.08 (3) 0.6502 0.08 (4) 0.6285 0.08 (6) 0.6416 0.08 (5) 0.5927 0.08 (7)
bupa 0.6852 0.04 (2) 0.6962 0.04 (1) 0.6625 0.05 (6) 0.6839 0.04 (4) 0.6846 0.04 (3) 0.6795 0.05 (5) 0.6309 0.05 (7)
cleveland 0.2895 0.05 (4) 0.2906 0.05 (3) 0.3076 0.05 (1) 0.2922 0.05 (2) 0.2883 0.05 (5) 0.2793 0.04 (7) 0.2864 0.05 (6)
haberman 0.5429 0.05 (6) 0.5554 0.06 (5) 0.5342 0.05 (7) 0.5665 0.06 (3) 0.5793 0.06 (1) 0.5643 0.06 (4) 0.5738 0.06 (2)
hayes_roth 0.8022 0.07 (2) 0.8289 0.06 (1) 0.7815 0.07 (3) 0.5869 0.09 (6) 0.5208 0.09 (7) 0.7145 0.10 (4) 0.6695 0.10 (5)
monks 0.9644 0.02 (2) 0.9985 0.00 (1) 0.9311 0.03 (5) 0.9473 0.06 (3) 0.9268 0.06 (6) 0.9379 0.06 (4) 0.8498 0.09 (7)
newthyroid 0.3972 0.04 (5) 0.3962 0.04 (6) 0.4039 0.04 (4) 0.3960 0.04 (7) 0.4475 0.04 (1) 0.4395 0.03 (2) 0.4086 0.11 (3)
yeast 0.4797 0.06 (2) 0.4354 0.03 (4) 0.4521 0.07 (3) 0.4829 0.05 (1) 0.4312 0.03 (5) 0.4176 0.02 (6) 0.4021 0.03 (7)
spam 0.9325 0.01 (1) 0.9265 0.01 (4) 0.9311 0.01 (2) 0.9294 0.01 (3) 0.9219 0.01 (5) 0.9039 0.01 (6) 0.8866 0.01 (7)
lymphography 0.6436 0.16 (2) 0.4896 0.12 (3) 0.6507 0.15 (1) 0.4402 0.09 (4) 0.3954 0.04 (5) 0.3941 0.04 (6) 0.3704 0.05 (7)
movement_libras 0.7365 0.04 (1) 0.7138 0.05 (3) 0.7362 0.04 (2) 0.6064 0.06 (5) 0.5868 0.06 (6) 0.6517 0.06 (4) 0.5008 0.05 (7)
SAheart 0.6120 0.04 (5) 0.6252 0.04 (4) 0.6020 0.04 (7) 0.6255 0.04 (3) 0.6446 0.03 (1) 0.6405 0.04 (2) 0.6106 0.05 (6)
zoo 0.7765 0.11 (4) 0.8025 0.13 (2) 0.7841 0.12 (3) 0.8058 0.12 (1) 0.5566 0.11 (7) 0.5823 0.09 (5) 0.5704 0.07 (6)
Average rank 3.07 3.07 3.77 3.27 4.08 4.62 6.13
Table 5: Noise Level: . Each cell in the table shows the F-measure (higher values are better) from the times -fold cross-validation experiment with noise induced on the class labels. The values in parentheses are the relative rankings of the algorithms on the dataset in the corresponding row (lower ranks are better).
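The per-dataset ranks and the "Average rank" row reported in these tables can be reproduced with a simple ranking procedure: on each dataset the best (highest) F-measure gets rank 1, tied scores share the average of their ranks (which is how fractional ranks such as 1.5 arise), and each algorithm's ranks are then averaged over all datasets. The sketch below is illustrative; the function names (`rank_scores`, `average_ranks`) are our own, not from the paper.

```python
def rank_scores(scores):
    """Rank a list of scores where higher is better: the best score gets
    rank 1; tied scores share the average of their positions (e.g. 1.5)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # extend j over the run of scores tied with scores[order[i]]
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def average_ranks(score_rows):
    """Average each algorithm's per-dataset rank over all datasets
    (one row of scores per dataset, one column per algorithm)."""
    rank_rows = [rank_scores(row) for row in score_rows]
    n = len(rank_rows)
    return [sum(col) / n for col in zip(*rank_rows)]
```

For example, the `vertebral` row above, where two algorithms tie at 0.8163, yields the fractional ranks 1.5 shown in the table: `rank_scores([0.7928, 0.8073, 0.7740, 0.8024, 0.8163, 0.8163, 0.7736])` gives `[5, 3, 6, 4, 1.5, 1.5, 7]`.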
KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
mushroom 0.9942 0.00 (4) 0.9941 0.00 (6) 0.9970 0.00 (3) 0.9988 0.00 (1) 0.9987 0.00 (2) 0.9941 0.00 (6) 0.9941 0.00 (6)
iris 0.8749 0.06 (6) 0.9384 0.04 (3) 0.8569 0.06 (7) 0.9319 0.05 (5) 0.9487 0.03 (1) 0.9433 0.03 (2) 0.9380 0.03 (4)
glass 0.6850 0.09 (2) 0.6990 0.08 (1) 0.6801 0.07 (3) 0.6253 0.08 (4) 0.5658 0.08 (6) 0.6189 0.09 (5) 0.5641 0.10 (7)
car_eval 0.8660 0.04 (1) 0.8621 0.03 (2) 0.7311 0.04 (7) 0.8411 0.05 (3) 0.7421 0.05 (6) 0.8111 0.04 (4) 0.7901 0.04 (5)
cmc 0.5133 0.02 (6) 0.5232 0.02 (1) 0.4886 0.02 (7) 0.5220 0.02 (2) 0.5212 0.02 (4) 0.5217 0.02 (3) 0.5175 0.03 (5)
tvowel 0.8264 0.03 (2) 0.8187 0.03 (4) 0.7747 0.04 (7) 0.8383 0.03 (1) 0.8238 0.03 (3) 0.7961 0.03 (5) 0.7791 0.03 (6)
balance_scale 0.6016 0.03 (2) 0.5939 0.02 (3) 0.5929 0.03 (4) 0.5912 0.02 (5) 0.6024 0.02 (1) 0.5757 0.02 (6) 0.5332 0.02 (7)
flags 0.2716 0.05 (4) 0.2544 0.03 (6) 0.2860 0.04 (2) 0.2885 0.06 (1) 0.2763 0.03 (3) 0.2577 0.03 (5) 0.2484 0.04 (7)
german 0.6698 0.03 (4) 0.6786 0.03 (1) 0.6611 0.03 (6) 0.6695 0.03 (5) 0.6756 0.03 (2) 0.6706 0.03 (3) 0.6348 0.04 (7)
ilpd 0.5836 0.04 (2) 0.5782 0.04 (3) 0.5849 0.05 (1) 0.5737 0.04 (4) 0.5696 0.04 (5) 0.5650 0.04 (7) 0.5694 0.05 (6)
ionosphere 0.8922 0.04 (5) 0.9098 0.03 (1) 0.8867 0.04 (6) 0.9048 0.03 (4) 0.9095 0.03 (2) 0.9093 0.03 (3) 0.8486 0.04 (7)
knowledge 0.9132 0.03 (4) 0.9300 0.02 (1) 0.9122 0.03 (5) 0.9191 0.03 (2) 0.8906 0.03 (7) 0.9111 0.02 (6) 0.9156 0.03 (3)
vertebral 0.7904 0.05 (5) 0.7987 0.04 (3) 0.7749 0.04 (7) 0.7949 0.05 (4) 0.8099 0.05 (1) 0.8059 0.05 (2) 0.7838 0.05 (6)
sonar 0.7818 0.05 (2) 0.7719 0.06 (3) 0.7947 0.05 (1) 0.7588 0.06 (5) 0.7702 0.06 (4) 0.7505 0.06 (6) 0.6677 0.06 (7)
skulls 0.2396 0.06 (4) 0.2362 0.07 (5) 0.2275 0.05 (6) 0.2547 0.06 (3) 0.2586 0.06 (1) 0.2557 0.07 (2) 0.2247 0.07 (7)
diabetes 0.9079 0.05 (7) 0.9529 0.04 (4) 0.9193 0.05 (6) 0.9567 0.04 (2) 0.9578 0.04 (1) 0.9545 0.04 (3) 0.9527 0.04 (5)
physio 0.8736 0.03 (6) 0.9114 0.03 (1) 0.8458 0.04 (7) 0.8977 0.03 (4) 0.9100 0.03 (2) 0.8988 0.03 (3) 0.8822 0.03 (5)
breasttissue 0.6136 0.08 (5) 0.6489 0.10 (2) 0.6150 0.09 (4) 0.6721 0.09 (1) 0.5737 0.10 (7) 0.6470 0.09 (3) 0.5890 0.07 (6)
bupa 0.6737 0.05 (5) 0.6894 0.04 (3) 0.6670 0.05 (6) 0.6823 0.05 (4) 0.6901 0.04 (1) 0.6898 0.05 (2) 0.6241 0.05 (7)
cleveland 0.2805 0.05 (4) 0.2849 0.05 (2) 0.2828 0.05 (3) 0.2889 0.06 (1) 0.2654 0.05 (6) 0.2608 0.04 (7) 0.2700 0.04 (5)
haberman 0.5436 0.06 (6) 0.5729 0.06 (5) 0.5326 0.05 (7) 0.5819 0.05 (3) 0.5975 0.05 (1) 0.5858 0.06 (2) 0.5796 0.07 (4)
hayes_roth 0.7349 0.07 (2) 0.7971 0.07 (1) 0.7291 0.09 (3) 0.5641 0.09 (6) 0.5330 0.11 (7) 0.7036 0.09 (4) 0.6459 0.08 (5)
monks 0.9336 0.02 (2) 0.9835 0.02 (1) 0.8884 0.03 (6) 0.9058 0.08 (5) 0.9069 0.05 (4) 0.9162 0.05 (3) 0.8280 0.09 (7)
newthyroid 0.3922 0.04 (7) 0.4255 0.04 (4) 0.4015 0.04 (6) 0.4178 0.04 (5) 0.4601 0.04 (3) 0.4641 0.04 (2) 0.4903 0.08 (1)
yeast 0.5061 0.05 (1) 0.4508 0.03 (3) 0.4499 0.05 (4) 0.4756 0.05 (2) 0.4382 0.03 (5) 0.4099 0.03 (6) 0.3958 0.03 (7)
spam 0.9215 0.01 (4) 0.9251 0.01 (2) 0.9128 0.01 (5) 0.9279 0.01 (1) 0.9231 0.01 (3) 0.8993 0.01 (6) 0.8848 0.01 (7)
lymphography 0.5765 0.15 (1) 0.4878 0.12 (3) 0.5567 0.12 (2) 0.4466 0.10 (4) 0.3963 0.04 (5) 0.3924 0.03 (6) 0.3878 0.05 (7)
movement_libras 0.7025 0.06 (1) 0.6890 0.05 (3) 0.6957 0.05 (2) 0.5678 0.05 (6) 0.5818 0.05 (5) 0.6224 0.05 (4) 0.4789 0.05 (7)
SAheart 0.6259 0.04 (5) 0.6410 0.04 (3) 0.6137 0.05 (7) 0.6359 0.05 (4) 0.6503 0.04 (1) 0.6436 0.04 (2) 0.6141 0.05 (6)
zoo 0.7423 0.11 (2) 0.7612 0.11 (1) 0.7238 0.10 (3) 0.6636 0.10 (4) 0.5182 0.08 (6) 0.5183 0.08 (5) 0.5104 0.06 (7)
Average rank 3.70 2.70 4.77 3.37 3.50 4.10 5.87
Table 6: Noise Level: . Each cell in the table shows the F-measure (higher values are better) from the times -fold cross-validation experiment with noise induced on the class labels. The values in parentheses are the relative rankings of the algorithms on the dataset in the corresponding row (lower ranks are better).
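The captions refer to noise induced on the class labels. One common way to inject such noise, sketched below under the assumption that a fixed fraction of training labels is flipped to a different, uniformly chosen class, is shown here for concreteness; the paper's exact protocol may differ, and the function name `add_label_noise` is our own.

```python
import random


def add_label_noise(labels, classes, noise_level, seed=42):
    """Flip a fraction `noise_level` of the labels to a different class,
    chosen uniformly at random from the remaining classes."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(noise_level * len(labels)))
    # pick which instances to corrupt, then reassign each a wrong class
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy
```

With `noise_level=0.2`, exactly 20% of the labels differ from the originals after the call, since every flipped label is guaranteed to change class.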
KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
mushroom 0.9941 0.00 (4.5) 0.9941 0.00 (4.5) 0.9967 0.00 (3) 0.9992 0.00 (1) 0.9990 0.00 (2) 0.9934 0.00 (6.5) 0.9934 0.00 (6.5)
iris 0.8619 0.06 (6) 0.9379 0.04 (3) 0.8438 0.06 (7) 0.9317 0.05 (4) 0.9499 0.03 (1) 0.9463 0.04 (2) 0.9299 0.03 (5)
glass 0.5976 0.09 (2) 0.6240 0.08 (1) 0.5901 0.09 (4) 0.5909 0.09 (3) 0.5076 0.08 (6) 0.5345 0.08 (5) 0.4721 0.08 (7)
car_eval 0.8374 0.04 (3) 0.8394 0.04 (2) 0.6708 0.04 (7) 0.8563 0.04 (1) 0.7599 0.05 (5) 0.8108 0.05 (4) 0.7577 0.07 (6)
cmc 0.5182 0.03 (5) 0.5199 0.02 (4) 0.4889 0.03 (7) 0.5221 0.02 (3) 0.5245 0.03 (2) 0.5258 0.03 (1) 0.4949 0.04 (6)
tvowel 0.8261 0.03 (3) 0.8208 0.03 (4) 0.7552 0.03 (7) 0.8274 0.03 (2) 0.8311 0.03 (1) 0.7924 0.03 (5) 0.7758 0.03 (6)
balance_scale 0.5908 0.03 (3) 0.5948 0.03 (2) 0.5808 0.04 (5) 0.5871 0.02 (4) 0.5949 0.02 (1) 0.5674 0.02 (6) 0.5326 0.03 (7)
flags 0.3032 0.06 (1) 0.2998 0.05 (3) 0.2958 0.06 (4) 0.3021 0.06 (2) 0.2463 0.03 (5) 0.2451 0.03 (6) 0.2352 0.04 (7)
german 0.6488 0.03 (3) 0.6522 0.03 (1) 0.6376 0.03 (5) 0.6491 0.03 (2) 0.6432 0.03 (4) 0.6275 0.03 (6) 0.6202 0.04 (7)
ilpd 0.5645 0.04 (3) 0.5699 0.04 (1) 0.5698 0.04 (2) 0.5592 0.04 (4) 0.5564 0.04 (5) 0.5557 0.04 (6) 0.5517 0.04 (7)
ionosphere 0.8572 0.04 (5) 0.8892 0.04 (3) 0.8416 0.05 (6) 0.8714 0.04 (4) 0.9025 0.04 (1) 0.8995 0.04 (2) 0.8043 0.06 (7)
knowledge 0.9050 0.03 (4) 0.9291 0.03 (1) 0.8915 0.03 (6) 0.9090 0.03 (3) 0.8835 0.03 (7) 0.9133 0.03 (2) 0.9006 0.04 (5)
vertebral 0.7463 0.04 (5) 0.7790 0.05 (3) 0.7275 0.05 (6) 0.7641 0.05 (4) 0.7997 0.05 (1) 0.7866 0.05 (2) 0.7267 0.05 (7)
sonar 0.7548 0.06 (2) 0.7451 0.06 (5) 0.7462 0.06 (4) 0.7330 0.06 (6) 0.7643 0.06 (1) 0.7485 0.07 (3) 0.6356 0.08 (7)
skulls 0.2545 0.06 (3) 0.2635 0.06 (1) 0.2587 0.06 (2) 0.2530 0.05 (4.5) 0.2408 0.08 (6) 0.2530 0.06 (4.5) 0.2183 0.06 (7)
diabetes 0.8668 0.06 (6) 0.9494 0.04 (5) 0.8662 0.08 (7) 0.9523 0.04 (4) 0.9680 0.03 (1) 0.9536 0.03 (3) 0.9608 0.03 (2)
physio 0.8496 0.03 (6) 0.8964 0.02 (3) 0.8155 0.04 (7) 0.8893 0.03 (4) 0.9052 0.02 (1) 0.8995 0.02 (2) 0.8734 0.03 (5)
breasttissue 0.6072 0.08 (6) 0.6283 0.08 (2) 0.6202 0.09 (3) 0.6139 0.07 (5) 0.6161 0.08 (4) 0.6500 0.07 (1) 0.5652 0.08 (7)
bupa 0.6553 0.04 (6) 0.6779 0.05 (2) 0.6565 0.05 (5) 0.6640 0.05 (4) 0.6828 0.05 (1) 0.6776 0.05 (3) 0.6089 0.05 (7)
cleveland 0.2917 0.05 (3.5) 0.2869 0.05 (5) 0.2989 0.05 (2) 0.2998 0.05 (1) 0.2770 0.04 (6) 0.2917 0.05 (3.5) 0.2758 0.05 (7)
haberman 0.5448 0.05 (6) 0.5606 0.06 (5) 0.5404 0.05 (7) 0.5625 0.06 (4) 0.5819 0.07 (1) 0.5645 0.06 (3) 0.5677 0.07 (2)
hayes_roth 0.7335 0.08 (2) 0.7653 0.08 (1) 0.6615 0.09 (4) 0.5390 0.08 (6) 0.4691 0.10 (7) 0.6970 0.09 (3) 0.6558 0.10 (5)
monks 0.8757 0.04 (4) 0.9623 0.03 (1) 0.8265 0.04 (6) 0.8671 0.07 (5) 0.9045 0.05 (3) 0.9134 0.06 (2) 0.7957 0.08 (7)
newthyroid 0.3816 0.04 (7) 0.4305 0.04 (3) 0.3825 0.04 (6) 0.4224 0.04 (4) 0.4700 0.04 (1) 0.4646 0.04 (2) 0.4175 0.10 (5)
yeast 0.4700 0.05 (1) 0.4462 0.04 (3) 0.4198 0.05 (4) 0.4695 0.05 (2) 0.4194 0.03 (5) 0.4115 0.03 (6) 0.4018 0.03 (7)
spam 0.9154 0.01 (4) 0.9256 0.01 (1) 0.8968 0.02 (6) 0.9236 0.01 (2) 0.9206 0.01 (3) 0.9013 0.01 (5) 0.8808 0.01 (7)
lymphography 0.5011 0.13 (1) 0.4120 0.08 (3) 0.4825 0.13 (2) 0.3881 0.03 (6) 0.4008 0.03 (4) 0.3964 0.03 (5) 0.3561 0.04 (7)
movement_libras 0.6955 0.05 (2) 0.6914 0.06 (3) 0.6995 0.05 (1) 0.5758 0.05 (6) 0.5934 0.05 (5) 0.6325 0.06 (4) 0.4584 0.06 (7)
SAheart 0.6105 0.04 (5) 0.6279 0.05 (4) 0.5929 0.04 (7) 0.6322 0.04 (2) 0.6368 0.04 (1) 0.6290 0.05 (3) 0.6081 0.04 (6)
zoo 0.6541 0.12 (4) 0.7949 0.12 (1) 0.6618 0.12 (3) 0.7736 0.11 (2) 0.4358 0.10 (7) 0.5625 0.07 (6) 0.5627 0.07 (5)
Average rank 3.87 2.68 4.83 3.48 3.27 3.75 6.12
Table 7: Noise Level: . Each cell in the table shows the F-measure (higher values are better) from the times -fold cross-validation experiment with noise induced on the class labels. The values in parentheses are the relative rankings of the algorithms on the dataset in the corresponding row (lower ranks are better).
KFHE-e KFHE-l AdaBoost GBM S-GBM Bagging CART
mushroom 0.9939 0.00 (5) 0.9943 0.00 (4) 0.9964 0.00 (3) 0.9981 0.00 (2) 0.9984 0.00 (1) 0.9912 0.01 (7) 0.9914 0.00 (6)
iris 0.8457 0.07 (6) 0.9208 0.05 (4) 0.7999 0.06 (7) 0.9199 0.05 (5) 0.9560 0.03 (1) 0.9537 0.03 (2) 0.9369 0.04 (3)
glass 0.6253 0.08 (2) 0.6242 0.08 (3) 0.5804 0.09 (5) 0.5985 0.09 (4) 0.5692 0.07 (6) 0.6284 0.08 (1) 0.5329 0.09 (7)
car_eval 0.8347 0.03 (3) 0.8403 0.04 (2) 0.6342 0.04 (7) 0.8512 0.04 (1) 0.7853 0.04 (6) 0.8151 0.05 (4) 0.8124 0.04 (5)
cmc 0.5183 0.02 (4) 0.5167 0.02 (5) 0.4847 0.03 (7) 0.5202 0.02 (3) 0.5284 0.02 (2) 0.5285 0.02 (1) 0.5006 0.03 (6)
tvowel 0.8244 0.03 (1) 0.8198 0.03 (4) 0.7136 0.04 (7) 0.8226 0.03 (3) 0.8237 0.03 (2) 0.7897 0.03 (5) 0.7669 0.03 (6)
balance_scale 0.5856 0.04 (4) 0.5951 0.03 (2) 0.5442 0.03 (7) 0.5960 0.03 (1) 0.5892 0.03 (3) 0.5660 0.03 (5) 0.5471 0.04 (6)
flags 0.2893 0.06 (3) 0.2824 0.05 (4) 0.2934 0.05 (2) 0.2964 0.05 (1) 0.2609 0.04 (5) 0.2598 0.05 (6) 0.2243 0.04 (7)
german 0.6311 0.03 (5) 0.6613 0.03 (2) 0.6158 0.03 (7) 0.6496 0.03 (4) 0.6671 0.03 (1) 0.6535 0.03 (3) 0.6267 0.04 (6)
ilpd 0.5794 0.04 (5) 0.5836 0.04 (4) 0.5667 0.04 (7) 0.5843 0.05 (3) 0.5909 0.04 (1) 0.5848 0.04 (2) 0.5713 0.04 (6)
ionosphere 0.8458 0.04 (4) 0.8682 0.04 (3) 0.8260 0.04 (6) 0.8397 0.04 (5) 0.8731