1 Introduction
An ensemble classification model is composed of multiple individual base classifiers, also known as component classifiers, the outputs of which are aggregated into a single prediction. The classification accuracy of an ensemble model can be expected to exceed that of any of its individual base classifiers. The main motivation behind ensemble techniques is that a committee of experts working together on a problem is more likely to solve it accurately than a single expert working alone (kelleher2015fundamentals, ). Although many existing ensemble techniques (e.g. (Breiman1996, ; Friedman00greedyfunction, ; hastie2009multi, ; zhu2006multi, )) have been repeatedly shown in benchmark experiments to be effective (see (Narassiguin2016, ; Opitz:1999:PEM:3013545.3013549, )), current approaches still have limitations. For example, methods based on bagging, although robust, may not lead to models as accurate as those learned by more sophisticated methods such as those based on boosting (Narassiguin2016, ). Methods based on boosting, however, are sensitive to class-label noise and the presence of outliers in training datasets (Dietterich2000, ). To address the limitations of current multiclass classification ensemble algorithms, this paper presents a new perspective on ensemble model training, framing it as a state estimation problem that can be solved using a Kalman filter (kalman1960, ; maybeck1982stochastic, ). Although Kalman filters are most commonly used to solve problems associated with time series data, this is not the case in this work. Rather, this work exploits the data fusion property of the Kalman filter to combine individual multiclass component classifier models to construct an ensemble.
The new perspective views the ensemble model to be trained as an unknown static state to be estimated. A Kalman filter can estimate such a static state by combining multiple uncertain measurements of it, exploiting its data fusion property. In the new perspective the measurements are the individual component classifiers in the ensemble, and the uncertainties of these measurements are based on the classification errors of those component classifiers. The Kalman filter combines the component classifier models into an overall ensemble model. This new perspective on ensemble training provides a framework within which different algorithms can be formulated. This paper describes one such new algorithm, the Kalman Filter-based Heuristic Ensemble (KFHE). In an evaluation experiment KFHE is shown to outperform methods based on boosting while maintaining the robustness of methods based on bagging. The contributions of this paper are:

A new perspective on training multiclass ensemble classifiers, which views it as a state estimation problem and solves it using a Kalman filter (kalman1960, ; maybeck1982stochastic, ).

A new multiclass ensemble classification algorithm, the Kalman Filter-based Heuristic Ensemble (KFHE).

Extensive experiments comparing KFHE with state-of-the-art ensemble algorithms, demonstrating the effectiveness of KFHE in scenarios with both noise-free and noisy class-labels.
The remainder of this paper is structured as follows. Section 2 discusses previous work on multiclass ensemble classification algorithms and provides a brief introduction to the Kalman filter. Section 3 introduces the new Kalman filter-based perspective on building multiclass classification ensembles. The Kalman Filter-based Heuristic Ensemble (KFHE) method based on this perspective is described in Section 4. The setup of an experiment to evaluate the performance of KFHE and compare it to state-of-the-art approaches on a selection of datasets is described in Section 5, and a detailed discussion of the results of this experiment is presented in Section 6. Finally, Section 7 reflects on the newly proposed perspective and explores directions for future work.
2 Background
This section first reviews existing multiclass ensemble classification methods. Relevant aspects of the Kalman filter approach for state estimation, which serve as a basis for the explanation of KFHE, are then introduced.
2.1 Ensemble methods
The advent of ensemble approaches in machine learning in the early 1990s was due mainly to works by Hansen and Salamon (Hansen:1990:NNE:628297.628429, ) and Schapire (Schapire1990, ). Hansen and Salamon (Hansen:1990:NNE:628297.628429, ) showed that multiple classifiers could be combined to achieve better performance than any individual classifier. Schapire (Schapire1990, ) proved that the learnability of strong learners and weak learners is equivalent, and then showed how to boost weak learners to become strong learners. Since then many alternative and improved approaches to building ensembles have been introduced. Ensemble methods can still, however, be categorised into three fundamental types: bagging, boosting, and stacking.

Bagging (Breiman1996, ), or bootstrap aggregation, trains several base classifiers on bootstrap samples of a training dataset and combines the outputs of these base classifiers using simple aggregation such as majority voting. Training models on different samples of the training set introduces diversity into the ensemble, which is key to making ensembles work effectively. UnderBagging (UnderBagging:Barandela2003, ) is a variation of bagging addressing imbalanced datasets that performs undersampling before every bagging iteration, but keeps all minority class instances in every iteration. The Random Forest (Breiman2001rf, ) is an extension to bagging in which base classifiers (usually decision trees) are trained using a bootstrap sample of the dataset that has also been reduced to only a small random sample of the input space. The Rotational Forest (Rodriguez:2006:RFN:1159167.1159358, ) is another extension that attempts to build base classifiers that are simultaneously accurate and diverse. The input dataset is transformed by applying PCA (hastie01statisticallearning, ) to different subsets of the attributes of the dataset, and axis rotation is performed by combining the coefficient matrices found by PCA for each subset. This is repeated multiple times. Local Linear Forests (friedberg2018local, ) modify random forests by considering the random forest as an adaptive kernel method and combining it with local linear regression.

Boosting (zhu2006multi, ) approaches iteratively learn component classifiers such that each one specialises on specific types of training examples. Each component classifier is trained using a weighted sample from a training dataset such that at each iteration the ensemble emphasises training examples that were misclassified in the previous iteration. Since the introduction of the original boosting algorithm, AdaBoost (freund1995desicion, ), several new approaches to boosting have been proposed. In LogitBoost (friedman2000additive, )
, the logistic loss function is minimised while combining the sub-classifiers in a binary classification context. A linear programming approach to boosting, LPBoost (demiriz2002linear, ), was shown to be competitive with AdaBoost. This algorithm minimises the misclassification error and maximises the soft margin in the feature space generated by the predictions of the weak hypothesis components of the ensemble. A multiclass modification of binary-class AdaBoost was introduced in (freund1995desicion, ), and an improvement of it was proposed in (hastie2009multi, ). RotBoost (ZHANG20081524, ) is a direct extension of the rotational forest approach (Rodriguez:2006:RFN:1159167.1159358, ) to include boosting. The Gradient Boosting Machine (GBM) (Friedman00greedyfunction, ) is a sequential tree-based ensemble method, in which each tree corrects the errors of the previously trained trees. The Stochastic Gradient Boosting Machine (SGBM) (FRIEDMAN2002367, ) improves GBM by training the component trees on bootstrap samples.

AdaBoost is sensitive to noisy class labels and performs poorly as the level of noise increases (Freund2001, ). This is mainly due to the exponential loss function AdaBoost uses to optimise the ensemble. If a training datapoint has a noisy class-label, AdaBoost will increase its weight for the next iteration, and will keep increasing the weight of the datapoint in a vain attempt to classify it correctly. Therefore, given enough such noisy class-labelled datapoints, AdaBoost can learn classifiers with poor generalisation ability. Although the performance of bagging decreases in the presence of class-label noise, it does not do so as severely as that of AdaBoost (Dietterich2000, ).
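To make this failure mode concrete, the following toy calculation (a sketch, not code from the paper; the fixed per-round error `err` and the round count are illustrative assumptions) shows how AdaBoost's multiplicative weight update inflates the weight of a datapoint that is misclassified in every round:

```python
import math

# Under AdaBoost's exponential update, a datapoint misclassified every
# round has its (unnormalised) weight multiplied by exp(alpha) each time,
# where alpha = 0.5 * ln((1 - err) / err).
err = 0.3                                # assumed weighted error each round
alpha = 0.5 * math.log((1 - err) / err)  # classifier weight per round

w = 1.0
for _ in range(10):                      # 10 boosting rounds
    w *= math.exp(alpha)                 # always-misclassified noisy point
# After normalisation, this one point comes to dominate the sampling
# distribution, which is the failure mode described above.
```

After just ten rounds the weight has grown by more than an order of magnitude, so a handful of noisy points can dominate later training iterations.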
To overcome this problem with noisy class-labelled datasets, MadaBoost (Domingo:2000:MMA:648299.755176_madaboost, ) was proposed. MadaBoost changes the standard AdaBoost weight update rule by capping the maximum weight of a datapoint. Similarly, FilterBoost (NIPS2007_3321filterboost, ) optimises the log loss function, leading to a weight update rule which caps the upper bound of a datapoint's weight using a smooth function. BrownBoost (Freund2001, ) and Noise Detection Based AdaBoost (ND_AdaBoost) (CAO20124451, ) make AdaBoost more robust to class-label noise by explicitly identifying noisy examples and ignoring them. Robust Multiclass AdaBoost (Rob_MulAda) (SUN201687, ) is an extension of ND_AdaBoost for multiclass classification. VoteBoosting (SABZEVARI2018119, ) decides the weight of each datapoint during training based on the disagreement of the predictions of the component classifiers that exist at each iteration. For lower levels of class-label noise, the datapoints with higher disagreement rates are emphasised, whereas for higher levels of class-label noise, datapoints on which the component classifiers agree are emphasised, in an attempt to achieve robustness to class-label noise. A comprehensive review and analysis of the different boosting variations can be found in (zhou2012ensemble, ).
Stacking (WOLPERT1992241, ; Ting97stackedgeneralization:, ) is a two-stage process in which the outputs of a collection of first-stage base classifiers are combined by a second-stage classifier to produce a final output. Seewald (Seewald:2002:MSB:645531.656165, ) empirically showed that the extension to stacking by Ting and Witten (Ting97stackedgeneralization:, ) does not perform well in the multiclass context, and proposed StackingC to overcome this drawback. In (MENAHEM20094097_troika, ) the weaknesses of StackingC were highlighted and shown to occur due to increasingly skewed class distributions caused by the binarisation of the multiclass problem, and a three-layered improved stacking method for multiclass classification, Troika (MENAHEM20094097_troika, ), was proposed. The stacking approach to building ensembles has received much less research attention than approaches based on bagging and boosting.

2.2 The Kalman filter
The Kalman filter (kalman1960, ) is a mathematical tool for stochastic estimation of the state of a linear system based on noisy measurements. Let there be a system which evolves linearly over time, and assume that the state of the system, which is unobservable, has to be estimated at each time step, $t$. The state may be estimated in two ways. First, a linear model, which is used to update the state of the system from step $t-1$ to step $t$, can be used to get an a priori estimate of the state. This estimate will have a degree of uncertainty, as the linear model is unlikely to fully capture the true nature of the system. Estimating the state using this type of linear model is commonly known as a time update step. Second, an external sensor can provide a state estimate. This estimate will also have an associated uncertainty, referred to as measurement noise, introduced because of inaccuracies in the measurement process.
Given these two state estimates, and their related uncertainties, the Kalman filter combines the a priori estimate and the measurement to generate an a posteriori state estimate, such that the uncertainty of the a posteriori estimate is minimised. This combination of a sensor measurement with an a priori estimate is commonly known as the measurement update step. The process iterates using the a posteriori estimate calculated in a measurement update step as input to the time update step of the next iteration. A highlevel illustration of the Kalman filter is shown in Figure 1. More formally, the time update step in a Kalman filter can be defined as:
$\hat{x}_t^- = A \hat{x}_{t-1} + B u_t$ (1)
$P_t^- = A P_{t-1} A^T + Q$ (2)
where:

$\hat{x}_t^-$ is the a priori estimate at step $t$, given knowledge of the state at the previous step $t-1$

$\hat{x}_t$ is the a posteriori estimate at step $t$, which is found by combining the a priori estimate and the measurement

$A$ is the state transition matrix, which defines the linear relationship between $\hat{x}_{t-1}$ and $\hat{x}_t^-$

$u_t$ is the control input vector, containing inputs which change the state based on some external effect

$B$ is the control input matrix applied to the control input vector $u_t$

$P_t^-$ is the covariance matrix representing the uncertainty of the a priori estimate

$P_t$ is the covariance matrix representing the uncertainty of the a posteriori estimate at step $t$

$Q$ is the process noise covariance matrix, induced during the linear update
Similarly, the measurement update step can be defined as:
$K_t = P_t^- H^T (H P_t^- H^T + R)^{-1}$ (3)
$\hat{x}_t = \hat{x}_t^- + K_t (z_t - H \hat{x}_t^-)$ (4)
$P_t = (I - K_t H) P_t^-$ (5)
where

$z_t$ is the measurement of the system at time $t$

$R$ is the measurement noise covariance matrix

$H$ is a transformation matrix relating the state space to the measurement space (when they are the same space, $H$ can be the identity matrix)

$K_t$ is the Kalman gain, which drives the weighted combination of the measurement and the a priori state estimate

$I$ indicates the identity matrix
The Kalman filter iterates through the time update and the measurement update steps. In this work time steps are considered equidistant and discrete; hence, from this point, "time step" and "iteration" will be used interchangeably. At $t = 0$, initial estimates for $\hat{x}_0$ and $P_0$ are used. Next, the time update step is performed using Eq. (1) and (2) to get $\hat{x}_t^-$ and $P_t^-$ respectively. The measurement $z_t$ and its related uncertainty $R$ are then obtained from a sensor or other appropriate source. These are combined with the a priori estimate using the measurement update step to find $\hat{x}_t$ and $P_t$ using Eq. (3), (4) and (5), which are then used in the next iteration, $t+1$. A detailed explanation of Kalman filters can be found in (kalman1960, ; maybeck1982stochastic, ), and an intuitive description in (Welch:1995:IKF:897831, ).
It should be emphasised here that, although a Kalman filter is used and Kalman filters are most commonly used with time series data, the proposed method does not perform time series prediction. Rather the focus is on multiclass classification and the data fusion property of the Kalman filter is used to combine the individual multiclass classifiers in the ensemble. Also, the term “ensemble” in this work relates to multiclass ensemble classifiers, and should not be confused with Ensemble Kalman Filters (EnKF) (evensen2003ensemble, ).
Apart from their applications to time series data and sensor fusion, Kalman filters have been used previously in a small number of supervised and unsupervised machine learning applications. For example, (SISWANTORO2016112, ) improves the predictions of a neural network using a Kalman filter, although this method is essentially a post-processing of the neural network output. Properties of the Kalman filter have been combined with heuristics in population-based metaheuristic optimisation algorithms (TOSCANO20101955, ; Monson04thekalman, ), and used in an unsupervised context in clustering (PAKRASHI2016704, ; pakrashikhka_10.1007/9783319202945_39, ). To the best of the authors’ knowledge this is the first application of Kalman filters to training multiclass ensemble classifiers.

3 Training multiclass ensemble classifiers using a Kalman filter
This section introduces the new perspective on training multiclass ensemble classifiers using a Kalman filter. First, a toy example of static state estimation using a Kalman filter is presented, and then the new perspective is described.
3.1 A static state estimation problem: Estimating voltage level of a battery
Imagine that the exact voltage of a DC battery (which should remain constant) is unknown and needs to be estimated. A sensor is available to measure the voltage level of the battery. The measurements made by this sensor are unfortunately noisy, but the uncertainty associated with the measurements is known. This is a simple example of a static state estimation problem that can be solved by taking multiple noisy sensor measurements of the battery’s voltage, and combining these into a single accurate estimate using a Kalman filter.
The Kalman filter can be applied in this scenario as follows. As it is known that the voltage of the battery does not change, the state transition matrix, $A$, in Eq. (1) is the identity matrix; the control input matrix, $B$, in Eq. (2) is non-existent; and the process noise covariance matrix, $Q$, in Eq. (2) is considered to be zero. The voltage read by the sensor at a particular measurement, and the related uncertainty of the value due to the limited accuracy of the sensor, give $z_t$ and $R$ in Eq. (3) and (4) respectively. Given this information, the Kalman filter time update and measurement update steps can be performed to combine the current estimated voltage, $\hat{x}_t^-$, and the measurement, $z_t$, to get a new and better estimate of the voltage. The process can be repeated: at each step a new voltage measurement from the sensor is received, which is then combined with the current estimated voltage using the measurement update step.
Note that, after $t$ iterations, the estimated voltage is a combination of the $t$ sensor output values, where the Kalman gain, $K_t$ in Eq. (4) and (5), controls the influence of each measurement in the combination. Therefore, after $t$ iterations, the estimated voltage, $\hat{x}_t$, can be seen as an ensemble of the values received from the sensor, which are optimally combined. This same idea can be applied to combine noisy base classifiers into a more accurate ensemble model.
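The voltage example can be sketched in a few lines. This is an illustrative simulation, not code from the paper; the true voltage, noise level, and function name are assumptions for the example, and the scalar update follows Eq. (3)-(5) with identity state transition and measurement matrices and zero process noise:

```python
import random

def estimate_voltage(true_v=12.0, sensor_sd=0.5, n_steps=50, seed=42):
    """Fuse repeated noisy readings of a constant battery voltage with a
    scalar Kalman filter (identity time update, zero process noise)."""
    rng = random.Random(seed)
    x = rng.gauss(true_v, sensor_sd)   # initial estimate: first noisy reading
    p = sensor_sd ** 2                 # uncertainty of that initial estimate
    r = sensor_sd ** 2                 # known measurement noise variance
    for _ in range(n_steps):
        # Time update is the identity: the true voltage does not change.
        z = rng.gauss(true_v, sensor_sd)   # new noisy sensor measurement
        k = p / (p + r)                    # Kalman gain
        x = x + k * (z - x)                # a posteriori estimate
        p = (1.0 - k) * p                  # uncertainty shrinks every step
    return x, p
```

Each iteration weights the new reading by the Kalman gain, so after many steps the estimate is effectively an optimally weighted average of all readings, with the uncertainty `p` shrinking towards zero.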
3.2 Combining multiclass classifiers using the Kalman filter
A machine learning algorithm learns a hypothesis for a specific problem. Assume that all possible hypotheses form a hypothesis space, as described in (DietterichHSpace, ). (The terms hypothesis and hypothesis space are used to introduce the high-level idea in connection with (DietterichHSpace, ), but the terms model and model space will be used synonymously throughout this text.) Any point in the hypothesis space represents one hypothesis. For a specific problem, there is at least one ideal hypothesis within this hypothesis space which the learning algorithm tries to reach. Different hypotheses within the hypothesis space differ in their trainable parameters, and the machine learning algorithm modifies these parameters. Therefore, the training process can be seen as a search through the hypothesis space toward the ideal hypothesis.
The perspective presented in this paper views the ideal hypothesis as the static state to be estimated, and the hypothesis space as a state space. When an individual component classifier is trained, it can be seen as a point in the hypothesis space, and can therefore be considered an attempt to measure the ideal state, with a related uncertainty indicated by its training error. The Kalman filter can be used to estimate the ideal state by combining these multiple noisy measurements. The combination of these noisy measurements leads to an estimate of the state that is expected to be more accurate than the individual measurements, and so to an ensemble classification model that is more accurate than its component classifiers.
This is illustrated in Figure 2. The vertical axis is an abstract representation of the hypothesis space with each point along this axis representing a possible hypothesis. The star symbol on the vertical axis indicates the ideal hypothesis for a specific classification problem. The horizontal axis in Figure 2 represents training iterations proceeding from left to right. The circles are the estimates of the hypothesis at a time step (the combination of all models added to the ensemble to this point in the training process), and the plus symbols represent the measurement of the hypothesis at a time step (the last model added to the ensemble). The dashed and solid arrows connecting the state estimates indicate the combination of the measurement and the a priori estimate respectively. The goal of the process is to reach a hypothesis as close as possible to the ideal hypothesis (indicated by the horizontal line marked with a star) by combining multiple individual hypotheses using a Kalman filter.
To help with understanding the new perspective, the Kalman filter-based approach to ensemble training can be mapped directly back to the DC battery voltage estimation example described in Section 3.1. The ensemble model capturing the ideal hypothesis is equivalent to the actual voltage level of the DC battery. An individual component classifier is analogous to an output from the voltage sensor. The classification error of the model maps to the uncertainty related to the voltage sensor measurements. Just as the estimated voltage after several iterations can be thought of as an ensemble of sensor measurements in the battery voltage estimation case, the trained individual classifiers combined using the Kalman filter lead to an ensemble of classifier models.
4 Kalman Filter-based Heuristic Ensemble (KFHE)
This section provides a detailed description of the Kalman Filter-based Heuristic Ensemble (KFHE) algorithm, based on the new perspective proposed in Section 3. First, Section 4.1 presents an overview of the algorithm and connects it to the high-level concepts from Section 3. Sections 4.2, 4.3 and 4.4 then discuss the details of the algorithm.
4.1 Algorithm overview
In KFHE the Kalman filter used to estimate an ensemble classifier, as described in Section 3, is referred to as the model Kalman filter, abbreviated to kfm. To implement kfm, the following questions must be answered:

What should constitute a state?

How should the time update step be defined?

What should constitute a measurement?

How should measurement uncertainty be evaluated?
The kfm state estimates are essentially the trained component classifiers. A model specification (for example the rules encoded in a decision tree or the weight values in a neural network) cannot be used directly as a state within the Kalman filter framework. Instead, the predictions made by a component classifier for the instances in the training dataset are used as the representation of the state, as shown in Figure 4. This allows states to be combined using the equations in Section 2.2. This representation is explained in detail in Section 4.2.
Heuristics are used to address the remaining questions. The time update step is implemented as the identity function, as it can be assumed that the ideal state is static and does not change over time (as indicated by the horizontal line in Figure 2). The measurement is a function of the output of the multiclass classifier trained at the $t$-th iteration. This model is trained using a weighted sample from the overall training dataset. The classification error of the model trained at the $t$-th iteration, measured against its predictions for the full training set, is used as the uncertainty of the measurement.
A Kalman filter is then used to combine a measurement, which is the classification model at step $t$ represented as shown in Figure 4, with the a priori estimate to get an a posteriori estimate. The a posteriori state estimate at the $t$-th iteration is considered the ensemble classifier up to the $t$-th iteration. This a posteriori estimate is used in the next iteration, and the process continues until a stopping condition is met. As the uncertainties of the estimates are represented by classification errors, the process moves towards estimating states expected to yield lower classification errors.
The use of weighted samples from the training set to train component classifiers at each step of the kfm process gives rise to another question: how should the weights for the weighted sampling of the training dataset be decided? In KFHE the answer is through another Kalman filter, which is referred to as the weight Kalman filter and abbreviated to kfw. The kfw Kalman filter works very similarly to kfm, but estimates sampling weights for the training dataset instead of the overall model state. This is described in detail in Section 4.3.
The interactions between the model Kalman filter, kfm, and the weight Kalman filter, kfw, are illustrated in Figure 3. Essentially kfw provides weights for the measurement step in kfm, and kfm provides measurement errors back to kfw for its measurement step. The training process is summarised in Algorithm 1 and the following subsections describe the workings of kfm and kfw in detail.
4.2 The model Kalman filter: kfm
The model Kalman filter, kfm, estimates the ensemble classifier by combining component classifiers into a single ensemble classification model. This is a static estimation problem, as the state to be estimated, the ideal ensemble classifier, does not change over time. For this reason the time update step for kfm is the identity function, and the a posteriori estimate of iteration $t-1$ is directly transferred to the a priori estimate at iteration $t$.
The trained base classifiers of the ensemble (the measurements) and the a posteriori state estimates (ensemble classifiers) are not directly usable as states in the Kalman filter framework. Therefore a proxy numerical representation is required to perform the computations. The proxy representation of the state is shown in Figure 4, where each row represents a datapoint from the training set and the estimated scores for the classes for that datapoint. The class membership of a datapoint is determined by taking the class with the maximum score. For example, in Figure 4 the first datapoint is considered a member of the class to which its highest prediction score is assigned. This representation of a model is used as the state in the Kalman filter framework.
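A small numeric illustration of this proxy representation (the score values below are made up for the example):

```python
import numpy as np

# Hypothetical proxy state: one row of class scores per training datapoint.
# Class membership of each datapoint is the argmax over its row.
state = np.array([
    [0.7, 0.2, 0.1],   # datapoint 1 -> class 0
    [0.1, 0.6, 0.3],   # datapoint 2 -> class 1
    [0.3, 0.3, 0.4],   # datapoint 3 -> class 2
])
memberships = state.argmax(axis=1)

# A second model's predictions, in the same representation, can be combined
# with the first by simple elementwise arithmetic -- exactly the kind of
# operation the Kalman filter update equations require.
other = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.2, 0.6],
])
blended = 0.5 * state + 0.5 * other
```

Because states are plain score matrices, weighted averaging of two states is well defined, which is what makes the Kalman filter combination applicable.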
Hence, the time update equations for kfm are very simply defined as:
$\hat{x}_t^{m-} = \hat{x}_{t-1}^{m}$ (6)
$p_t^{m-} = p_{t-1}^{m}$ (7)
where

$\hat{x}_{t-1}^{m}$ is the a posteriori estimate from the previous iteration and $\hat{x}_t^{m-}$ represents the a priori estimate at the present iteration. These are the predictions of the ensemble model at the $t$-th iteration in the representation shown in Figure 4. So, for example, $\hat{x}_t^{m} = \langle \hat{x}_t^{m}(1), \ldots, \hat{x}_t^{m}(N) \rangle$, where $\hat{x}_t^{m}(i)$ denotes the prediction for the $i$-th of the $N$ training datapoints, and each $\hat{x}_t^{m}(i)$ is a vector of $c$ prediction scores, where $c$ is the number of classes in the prediction problem.

$p_{t-1}^{m}$ and $p_t^{m-}$ are the uncertainties related to $\hat{x}_{t-1}^{m}$ and $\hat{x}_t^{m-}$ respectively.
Eq. (6) is derived directly from Eq. (1) by setting $A$ to the identity matrix and assuming that $B u_t$ is non-existent (there is no control process involved in KFHE). $H$ in Eq. (3), (4), and (5) is set to the identity. Also, it is assumed that no process noise is induced, hence $Q$ in Eq. (2) is set to zero to get Eq. (7). The superscript $m$ throughout indicates that these parameters are related to the model Kalman filter, kfm, estimating the state $\hat{x}^{m}$.
The kfm measurement step is more interesting. At every $t$-th iteration a new classification model, $h_t$, is trained on a weighted sample of the training dataset. The sampling is done with replacement, with the same number of datapoints as in the original training dataset. The weights are designed to highlight the points which were misclassified previously, as is common in boosting algorithms (although the weight updates are performed using the other Kalman filter, kfw). The measurement is taken as the average of the previous prediction, $\hat{x}_{t-1}^{m}$, and the prediction of this $t$-th model, $h_t(D)$, as in Eq. (8). This effectively attempts to capture how much the model trained at the present iteration impacts the ensemble predictions up to iteration $t$. Therefore the measurement step and its related error for kfm become:
$z_t^{m} = \frac{\hat{x}_{t-1}^{m} + h_t(D)}{2}, \quad h_t = L(D, w_t)$ (8)
$r_t^{m} = err(z_t^{m}, y)$ (9)
$\hat{x}_t^{m} = \hat{x}_t^{m-} + K_t^{m} (z_t^{m} - \hat{x}_t^{m-})$ (10)
$K_t^{m} = \frac{p_t^{m-}}{p_t^{m-} + r_t^{m}}$ (11)
$p_t^{m} = (1 - K_t^{m})\, p_t^{m-}$ (12)
where:

$h_t = L(D, w_t)$ is a model trained on the dataset $D$, using the learning algorithm $L$, where the dataset is sampled using the weights $w_t$.

$h_t(D)$ indicates the predictions made by the trained model $h_t$ for the datapoints in the dataset $D$.

$z_t^{m}$ represents the measurement heuristic, the representation of which is as explained in Figure 4.

$r_t^{m}$ is the uncertainty related to $z_t^{m}$, and is a misclassification rate calculated by comparing the class predictions made by the current ensemble, $z_t^{m}$, with the ground truth classes, $y$.
The remaining steps of the Kalman filter process to compute the Kalman gain, the a posteriori state estimate, and the variance are as described for the standard Kalman filter framework, but are repeated in Eq. (11), (10) and (12) for completeness. Note that the uncertainty and the Kalman gain are scalars in the KFHE implementation, as the state to be estimated is one model and only one measurement is taken per iteration.

To initialise the kfm process, the initial learner is trained as $h_0 = L(D, w_0)$ and $\hat{x}_0^{m} = h_0(D)$, where $w_0$ is a uniform distribution. Also, $p_0^{m-}$ is set to a high value, indicating that the initial a priori estimates are uncertain. After initialisation, the iteration starts at $t = 1$. The goal of the training phase is to compute and store the learned models $h_t$ and the Kalman gain values $K_t^{m}$ for all $t$. To avoid measurements with large errors, if the measurement error is more than $(c-1)/c$, where $c$ is the number of classes, then the sampling weights, $w_t$, are reset to a uniform distribution, which is a similar modification to that used in the AdaBoost implementation in (adabag, ).
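One kfm iteration can be sketched as follows. This is an illustrative reading of Eq. (8)-(12), not the reference implementation; the function name is hypothetical, and the model's output is represented as a score matrix in the style of Figure 4:

```python
import numpy as np

def kfm_step(x_prior, p_prior, model_scores, y_true):
    """One sketch of a kfm measurement update, following Eq. (8)-(12).

    x_prior      : (n, c) a priori ensemble prediction scores
    p_prior      : scalar a priori uncertainty
    model_scores : (n, c) prediction scores of the model trained this iteration
    y_true       : (n,) ground-truth class indices
    """
    z = 0.5 * (x_prior + model_scores)          # measurement, Eq. (8)
    r = np.mean(z.argmax(axis=1) != y_true)     # misclassification rate, Eq. (9)
    k = p_prior / (p_prior + r)                 # scalar Kalman gain, Eq. (11)
    x_post = x_prior + k * (z - x_prior)        # a posteriori state, Eq. (10)
    p_post = (1.0 - k) * p_prior                # a posteriori uncertainty, Eq. (12)
    return x_post, p_post, k, r
```

A measurement with a low misclassification rate yields a gain close to 1 and pulls the ensemble state strongly towards it, while a noisy measurement is largely discounted.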
4.3 The weight Kalman filter: kfw
The previous description mentioned how a component learner depends on a vector of sampling weights, $w_t$, which is estimated using kfw. The purpose of $w_t$ is to give more weight to the datapoints which were not classified correctly in the previous iteration, to encourage specialisation. The implementation of kfw is very similar to that of kfm. In this case the state estimated by the Kalman filter is a vector of real numbers representing weights. The time update step in this case is also the identity function:
$\hat{w}_t^{-} = \hat{w}_{t-1}$ (13)
$p_t^{w-} = p_{t-1}^{w}$ (14)
To estimate the measurement of the weights the following equations are used:
$z_t^{w}(i) = f(\ell_t(i))$ (15)
$r_t^{w} = r_t^{m}$ (16)
where $\ell_t(i)$ indicates the misclassification of the $i$-th datapoint by the kfm measurement at iteration $t$.
This heuristic derives the measurement of kfw from the per-datapoint classification error of the measurement of kfm, as shown in Figure 3. In Eq. (15), the function $f$ can adjust the impact of misclassified datapoints on the weight vector. In the present work on KFHE, two options are explored: $f(x) = x$ and $f(x) = e^{x}$, where the second option places more emphasis on misclassified datapoints. We refer to the variant of KFHE using the first, linear definition of $f$ as KFHE-l and the variant using the second, exponential definition as KFHE-e.
A trivial heuristic is used in this step to compute the measurement error, $r_t^{w}$, by setting it to $r_t^{m}$ (Eq. (16)). This assumes that the measurement of the weights, $z_t^{w}$, has an error at most equal to the last measurement error of kfm; that is, it assumes the weights will lead to a model with an error no greater than the last measurement by kfm. The measurement update of kfw becomes:
$\hat{w}_t = \hat{w}_t^{-} + K_t^{w} (z_t^{w} - \hat{w}_t^{-})$ (17)
$K_t^{w} = \frac{p_t^{w-}}{p_t^{w-} + r_t^{w}}$ (18)
$p_t^{w} = (1 - K_t^{w})\, p_t^{w-}$ (19)
The superscript $w$ indicates that these parameters are related to kfw. Here $\hat{w}_t$ and $z_t^{w}$ are vectors, with $\hat{w}_t(i)$ and $z_t^{w}(i)$ representing the weight estimate and the weight measurement at the $t$-th iteration for the $i$-th datapoint.
The equations for kfw to compute the Kalman gain, $K_t^{w}$; the a posteriori state estimate for the weights, $\hat{w}_t$; and the variance, $p_t^{w}$, are shown in Eq. (18), (17) and (19). These are identical to those presented for kfm in Section 4.2 (except for the superscripts), but are included here for completeness.
Initially, $\hat{w}_0$ is set to have equal weights for every datapoint in the training set, and $p_0^{w-}$ is initialised in the same way as the corresponding kfm uncertainty. Note that under this implementation the calculation of the measurement error for kfw and this initialisation make the Kalman gain $K_t^{w}$ the same as $K_t^{m}$. No information from the kfw process needs to be stored to support predictions from the ensemble.
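As an illustration of the kfw measurement heuristic, the sketch below contrasts the linear and exponential choices of $f$. It is hypothetical: the per-datapoint loss definition, normalisation, and function name are assumptions made for the example, not necessarily the paper's exact formulation:

```python
import numpy as np

def kfw_measure(loss, variant="linear", eps=1e-12):
    """Sketch of a kfw-style weight measurement.

    'loss' is an assumed per-datapoint error signal in [0, 1] derived from
    the kfm measurement (e.g. 1 minus the score of the true class).
    'linear' applies f(x) = x (as in KFHE-l); 'exp' applies f(x) = e^x
    (as in KFHE-e), which emphasises misclassified datapoints."""
    loss = np.asarray(loss, dtype=float)
    f = loss if variant == "linear" else np.exp(loss)
    return (f + eps) / (f + eps).sum()   # normalised sampling weights
```

In both variants, datapoints with larger errors receive larger sampling weights, so the next component classifier is more likely to see the currently hard examples.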
4.4 Making predictions using KFHE
The goal of KFHE training is to calculate and store the trained base model, $h_t$, and the Kalman gain, $k_t$, for each iteration, $t$, of the model Kalman filter (kf-m) process, from the first to the last component classifier trained. Once this is done, generating predictions is straightforward. Given a new datapoint, an initial prediction is found using the initial model. Then Eq. (8) and (10) are iteratively applied to generate predictions from each model, which are combined using the stored Kalman gain values. The final value is taken as the ensemble prediction, and is a vector containing a prediction score for each class. Datapoints are classified as belonging to the class with the maximum score. Algorithm 2 summarises the prediction process for KFHE.
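The prediction loop can be sketched as below. This is a Python sketch rather than the authors' R code; the `predict_scores` method on the component models is a hypothetical interface standing in for whatever per-class scoring the base learners provide.

```python
import numpy as np

def kfhe_predict(x, models, gains):
    """Sketch of KFHE prediction: fold each component model's class-score
    vector into the running estimate using its stored Kalman gain.

    models : fitted component classifiers, each exposing a hypothetical
             predict_scores(x) method returning a per-class score vector
    gains  : the Kalman gain stored for each training iteration
    """
    estimate = models[0].predict_scores(x)        # start from the first model
    for model, k in zip(models[1:], gains[1:]):
        z = model.predict_scores(x)               # "measurement" of the scores
        estimate = estimate + k * (z - estimate)  # same update form as Eq. (10)
    return int(np.argmax(estimate))               # class with the highest score
```

Note that a gain close to 1 lets a component dominate the running estimate, while a gain close to 0 leaves the estimate almost unchanged, mirroring the training-time behaviour.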
5 Experiments
This section describes the datasets, algorithms, experimental setup, and evaluation processes used in a set of experiments designed to evaluate the effectiveness of the KFHE algorithm. Two variants of KFHE, KFHEe and KFHEl (as described in Section 4.3), are evaluated, and a set of state-of-the-art ensemble methods are used as benchmarks.
5.1 Datasets & performance measure
Thirty multiclass datasets (described in Table 1) from the UCI Machine Learning repository (Lichman:2013, ) are used. These datasets are frequently used in classifier benchmark experiments (Dietterich2000, ; ZHANG20081524, ; CAO20124451, ; SUN201687, ), cover diverse domains, have numbers of classes ranging from 2 to 15, and exhibit varying amounts of class imbalance.
dataset names  #datapoints  #dimensions  #classes 

mushroom  8124  22  2 
iris  150  5  3 
glass  214  10  6 
car_eval  1728  7  4 
cmc  1473  10  3 
tvowel  871  4  6 
balance_scale  625  5  3 
breasttissue  106  10  6 
german  1000  21  2 
ilpd  579  11  2 
ionosphere  351  34  2 
knowledge  403  6  4 
vertebral  310  7  2 
sonar  208  61  2 
diabetes  145  4  3 
skulls  150  5  5 
physio  464  37  3 
flags  194  30  8 
bupa  345  7  2 
cleveland  303  14  5 
haberman  306  4  2 
hayesroth  132  6  3 
monks  432  7  2 
newthyroid  432  7  3 
yeast  1484  9  10 
spam  4601  58  2 
lymphography  148  19  4 
movement_libras  360  91  15 
SAheart  462  10  2 
zoo  101  17  7 
To evaluate the performance of each model, the macro-averaged F1-score (kelleher2015fundamentals, ) was used. The F1-score in a binary classifier context indicates how precise as well as how robust a classifier model is, and it can be easily extended to a multiclass scenario. The macro-averaged F1-score is defined as:
$$\text{macro-}F_{1} = \frac{1}{c}\sum_{k=1}^{c}\frac{2 \times \mathrm{precision}_{k} \times \mathrm{recall}_{k}}{\mathrm{precision}_{k} + \mathrm{recall}_{k}}$$
where $\mathrm{precision}_{k}$ and $\mathrm{recall}_{k}$ are the precision and recall values for the $k$th class, and $c$ is the number of classes. The macro-averaged F1-score is appropriate for this experiment because the datasets used exhibit different levels of class imbalance.
5.2 Experimental setup
The state-of-the-art methods used as benchmarks are AdaBoost (zhu2006multi, ), Bagging (Breiman1996, ), Gradient Boosting Machine (GBM) (Friedman00greedyfunction, ) and Stochastic Gradient Boosting Machine (SGBM) (FRIEDMAN2002367, ). This set covers the different fundamental ensemble classifier types described in Section 2. For all algorithms, including KFHEe and KFHEl, the component learners are CART models (kelleher2015fundamentals, ). The performance of a single CART model is also included as a baseline to compare against the ensemble methods. The number of ensemble components is set to 100 for all algorithms (initial experiments showed that for all datasets there were no significant improvements in performance beyond 100 components).
All implementations and evaluations were performed in R (a version of KFHE is available at https://github.com/phoxis/kfhe). The AdaBoost and Bagging implementations were from the package adabag (adabag, ), and the GBM and SGBM implementations were from the package gbm (gbm, ). As multiclass datasets were used in this experiment, the multiclass variant of AdaBoost, AdaBoost.SAMME (zhu2006multi, ), was used (it will be referred to simply as AdaBoost in the remainder of the paper). For the KFHE experiments, training was stopped when the a posteriori variance reached 0, which can be interpreted as an indication that the state estimated by kf-m has no uncertainty.
The experiments were divided into two parts. First, to evaluate the effectiveness of KFHEe and KFHEl and to compare these to the state-of-the-art methods, the performance of all algorithms is assessed using the datasets listed in Table 1. Second, the robustness of the different algorithms to class-label noise is compared. For both sets of experiments, for each algorithm-dataset pair, a repeated k-fold cross-validation experiment was performed, and the mean of the F1 scores across the folds was measured.
For the second set of experiments, class-label noise was introduced synthetically into each of the datasets in Table 1. To induce noise, a fraction of the datapoints from the training set was sampled randomly and the class of each selected datapoint was randomly changed, following a uniform distribution, to a different one. For each dataset in Table 1, noisy variants with 5%, 10%, 15% and 20% noise were generated. For each of these noisy datasets, a repeated k-fold cross-validation experiment was performed. For each fold, the noisy class labels were used in training, but the F1 scores were computed with respect to the original unchanged dataset labels.
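The label-flipping procedure just described can be sketched as follows. This is a Python sketch (the published implementation is in R), and the name `add_label_noise` is illustrative.

```python
import numpy as np

def add_label_noise(y, n_classes, fraction, rng=None):
    """Flip the class label of a random `fraction` of datapoints to a
    different class chosen uniformly at random."""
    rng = np.random.default_rng(rng)
    y_noisy = np.array(y, copy=True)
    n_flip = int(round(fraction * len(y)))
    # sample the datapoints to corrupt, without replacement
    idx = rng.choice(len(y), size=n_flip, replace=False)
    for i in idx:
        # pick uniformly among the *other* classes, so every selected
        # datapoint is guaranteed to end up mislabelled
        choices = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(choices)
    return y_noisy
```

Excluding the original class from the draw ensures that a 20% noise setting really corrupts 20% of the labels, rather than occasionally re-assigning a point its own class.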
6 Results
The experiment results comparing the performance of KFHEe and KFHEl to the other methods are shown first. Next, the results of the experiments comparing the performance of the different methods in the presence of noisy class labels are presented. Statistical significance tests that analyse the differences between the proposed and other methods are also presented.
6.1 Performance comparison of the methods
The relative performance of each algorithm, based on the average F1 scores achieved in the cross-validation experiments, on each of the datasets was ranked (from 1 to 7, where 1 implies best performance). The first row of Table 2 (labelled 0%) shows the average rank of each algorithm across the datasets (detailed performance results for each algorithm on each dataset are shown in Table 4 in Appendix A). These average ranks are also visualised in the first column of Figure 5 (also labelled 0%).
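For illustration, average ranks of this kind can be computed from a matrix of per-dataset scores as in the sketch below (`average_ranks` is an illustrative name; tied scores share the average of their positions, which is why fractional ranks such as 1.5 appear in the detailed result tables).

```python
import numpy as np

def average_ranks(score_table):
    """score_table: rows = datasets, columns = algorithms (higher = better).
    Rank the algorithms on every dataset (rank 1 = best, ties averaged)
    and return each algorithm's mean rank across the datasets."""
    score_table = np.asarray(score_table, dtype=float)
    n_data, n_alg = score_table.shape
    ranks = np.empty_like(score_table)
    for i, row in enumerate(score_table):
        order = np.argsort(-row, kind="stable")       # best score first
        j = 0
        while j < n_alg:
            k = j
            # extend the block while scores are tied
            while k + 1 < n_alg and row[order[k + 1]] == row[order[j]]:
                k += 1
            avg_rank = (j + k) / 2 + 1                # 1-based average position
            ranks[i, order[j:k + 1]] = avg_rank
            j = k + 1
    return ranks.mean(axis=0)
```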
The average ranking shows that KFHEe attained the best average rank (2.78), AdaBoost was very close with an average rank of 2.98, followed by KFHEl with an average rank of 3.33. It is clear that KFHEe outperformed GBM, SGBM, Bagging and CART, and KFHEl also performs better overall than GBM, SGBM, Bagging and CART. It was concluded that KFHEl performed slightly less well than KFHEe and AdaBoost due to the lack of emphasis on misclassified points in the weight measurement step. In Section 6.3, a statistical significance test will be performed to uncover significant differences between methods on datasets with no class-label noise.
KFHEe  KFHEl  AdaBoost  GBM  SGBM  Bagging  CART  

0%  2.78  3.33  2.98  3.70  4.30  4.82  6.08 
5%  3.07  3.07  3.77  3.27  4.08  4.62  6.13 
10%  3.70  2.70  4.77  3.37  3.50  4.10  5.87 
15%  3.87  2.68  4.83  3.48  3.27  3.75  6.12 
20%  4.33  2.70  5.40  3.37  3.18  3.10  5.92 
The evolution of the key parameters of KFHE (the measurement error; the a posteriori variance; the Kalman gain; and the training misclassification rate of the kf-m component) with respect to the number of ensemble iterations, for a selection of datasets (knowledge, diabetes, car_eval and lymphography), is plotted in Figure 6. The plots show the results of the initial iterations, after which, for most of the datasets, the error reduces to 0. The plots for all of the datasets are given in Appendix D.
The plots in Figure 6 show that in all cases the a posteriori variance decreases monotonically, which can be interpreted as the system becoming more confident in the a posteriori estimate; as a result the Kalman gain values reduce and stabilise, implying less impact for subsequent measurements. This is because of the way the time update step was formulated in Section 4.2: no uncertainty is induced during the time update step, and no process noise is assumed. In effect, the steepness of the Kalman gain curve controls how much of the measurement is combined through Eq. (10) and (11). It is also interesting to note the similarity between the rate of change of the ensemble error rate and the Kalman gain: for most of the datasets they show a similar trend. The value of the Kalman gain indicates the fraction of the measurement which will be incorporated into the ensemble; a measurement with less error is incorporated more into the final model.
6.2 Performance for the noisy class-label case
The relative performance of each algorithm, based on the average F1 scores achieved in the cross-validation experiments, on each of the datasets was ranked (from 1 to 7, where 1 implies best performance). This was performed separately for datasets with 5%, 10%, 15% and 20% induced class-label noise. Table 2 shows the average rank of each algorithm for each level of noise (detailed performance results for each algorithm on each dataset are shown in Tables 5 to 8 in Appendix A). These average ranks are also visualised in Figure 5. For ease of reading, the vertical axis in Figure 5 is inverted to highlight that lower average ranks are better.
Out of the algorithms tested, the KFHEl algorithm performs most consistently in the presence of class-label noise. At the 5% noise level KFHEe and KFHEl had the same average rank, and as the noise level increases to 10%, 15% and 20%, KFHEl attains the best average rank over the datasets. Along with KFHEl, SGBM and Bagging also improve their relative ranking. As the fraction of mislabelled datapoints in the training set increased, the average rank of AdaBoost degraded sharply. The performance of AdaBoost and Bagging in the presence of noisy class labels is studied in (Dietterich2000, ), where a similar result was found. On the other hand, the relative ranks of GBM and CART remained consistently stable.
It should be noted that the degradation of performance with respect to class-label noise is more severe for AdaBoost than for KFHEe, although both of them use the exponential function to increase the weights of the misclassified datapoints. This is due to the smoothing effect in the KFHE algorithm, which makes KFHEe less sensitive to noise than AdaBoost. On the other hand, KFHEl does not use the exponential function in Eq. (15) for the weight measurement step, which makes it more robust to noise and allows it to achieve high performance across all noise levels.
Figure 7 shows the change in F1 score for each algorithm on the knowledge, diabetes, car_eval, and lymphography datasets (similar plots for all datasets are given in Appendix C) as the amount of class-label noise increases (note that, to highlight changes in performance, the vertical axes in these charts are scaled to narrow ranges of possible scores). These plots are derived from Tables 4 to 8. With few exceptions, the performances of KFHEl, GBM, SGBM and Bagging are not impacted by noise as much as the other approaches. Although KFHEe is generally better than the other approaches when there is no class-label noise present, as the induced noise increases the F1 score for KFHEe decreases, albeit less severely than in the case of AdaBoost.
6.3 Statistical significance testing
This section presents two types of statistical significance tests that compare the performance of the different algorithms tested. First, to assess the overall differences in performance, a multiple classifier comparison test was performed following the recommendations of (GARCIA20102044, ). Second, a comparison of each pair of algorithms in isolation is performed using the Wilcoxon signed-rank test (GARCIA20102044, ).
6.3.1 Multiple classifier comparison
To understand the overall effectiveness of the variants of KFHE (KFHEe and KFHEl), following the recommendations of García et al. (GARCIA20102044, ), a multiple classifier comparison significance test was performed (separate tests were performed on the performance of the algorithms at each noise level). First, a Friedman Aligned Rank test was performed. This indicated that, at all noise levels, the null hypothesis that the performance of all algorithms is similar can be rejected. To further investigate these differences, post-hoc pairwise Friedman Aligned Rank tests along with the Finner p-value adjustment (GARCIA20102044, ) were performed. Rank plots describing the results of the post-hoc tests are shown in Figure 8. When no class-label noise is present, the results indicate that KFHEe was significantly better than SGBM, Bagging and CART, and that KFHEl was also significantly better than SGBM, Bagging and CART. Although KFHEe attained a better average rank than AdaBoost, the null hypothesis could not be rejected, and so it cannot be determined that the performances of KFHEe and AdaBoost are significantly different. Similarly, KFHEl attained a worse average rank than AdaBoost, but the tests did not identify a statistically significant difference.
The results of the experiment for the datasets with class-label noise indicate that, as the noise increases, the relative performance of KFHEl improves, but the relative performance of KFHEe decreases. This is as expected, because of the weight measurement heuristics chosen for the two variants of KFHE, as explained in Section 4.3. KFHEl was found to be statistically significantly better than SGBM and Bagging at all but one of the noise levels. KFHEl was also found to be statistically significantly better than AdaBoost at the higher noise levels. Although the performance of KFHEe decreases with increasing class-label noise, it does not decrease as sharply as that of AdaBoost. The complete details of the tests are given in Table 9 in Appendix B.
Overall, these tests confirm that when no class-label noise is present KFHEe performs as well as AdaBoost and GBM, and significantly better than SGBM, Bagging and CART. KFHEe, however, is not as robust to class-label noise as the other approaches. KFHEl, on the other hand, is robust to noise and performs very well in all class-label noise settings.
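The Finner p-value adjustment used in the post-hoc tests above can be sketched as follows. This is an illustrative Python implementation of the standard step-down form, in which the i-th smallest of m p-values is adjusted as 1 − (1 − p)^(m/i) with monotonicity enforced; `finner_adjust` is an assumed name, not a function from the authors' toolchain.

```python
import numpy as np

def finner_adjust(pvals):
    """Finner step-down p-value adjustment for m pairwise comparisons."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                 # process p-values smallest first
    adjusted = np.empty(m)
    running_max = 0.0
    for i, idx in enumerate(order, start=1):
        value = 1.0 - (1.0 - p[idx]) ** (m / i)
        # enforce monotone non-decreasing adjusted values
        running_max = max(running_max, value)
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

Compared with a plain Bonferroni correction, this adjustment is less conservative while still controlling the family-wise error rate, which is why it is recommended in (GARCIA20102044, ).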
6.3.2 Isolated algorithm pairs comparison
To further understand how individual algorithm pairs compare with each other, ignoring the other algorithms, a two-tailed Wilcoxon signed-rank test for each pair of algorithms was performed. It must be emphasised that the Wilcoxon signed-rank test cannot be used to perform multiple classifier comparison without introducing Type I errors (rejecting a true null hypothesis), as it does not control the Family Wise Error Rate (FWER) (GARCIA20102044, ). Therefore, the p-values for each pair from this experiment should only be interpreted in isolation from any other algorithms. Table 3 shows the results of these tests for the datasets without any class-label noise (Tables (a) to (e) in Appendix B show the results for the noisy cases). The cells in the lower diagonal show the p-values of the Wilcoxon signed-rank test for the corresponding algorithm pair, and the cells in the upper diagonal show the pairwise win/loss/tie counts.
KFHEe  KFHEl  AdaBoost  GBM  SGBM  Bagging  CART  

KFHEe  (19/11/0)  (13/16/1)  (23/7/0)  (21/9/0)  (24/6/0)  (26/4/0)  
KFHEl  0.009519 ***  (12/18/0)  (16/14/0)  (20/10/0)  (25/5/0)  (26/4/0)  
AdaBoost  0.491795  0.028548 **  (19/11/0)  (20/10/0)  (22/8/0)  (25/5/0)  
GBM  0.003018 ***  0.144739  0.013515 **  (22/8/0)  (20/10/0)  (25/5/0)  
SGBM  0.001128 ***  0.004921 ***  0.002834 ***  0.003418 ***  (19/11/0)  (25/5/0)  
Bagging  0.000210 ***  0.000034 ***  0.000415 ***  0.004108 ***  0.336640  (25/4/1)  
CART  0.000010 ***  0.000006 ***  0.000019 ***  0.000055 ***  0.002057 ***  0.000016 *** 
The results in Table 3 show that, without class-label noise, when compared in isolation KFHEe performs significantly better than every other method except AdaBoost. In the noise-free case KFHEl performs significantly better than SGBM, Bagging and CART. Similarly, the test results at the different noise levels (described in Appendix B) show that as class-label noise increases, the performance of KFHEe becomes significantly better than that of AdaBoost, although it is worse than the other methods. When compared in isolation, KFHEl performs significantly better than almost all other methods at all noise levels.
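The win/loss/tie counts in the upper diagonal of Table 3 can be reproduced with a sketch like the following (`win_loss_tie` and the `eps` tolerance are illustrative; the p-values in the lower diagonal come from the Wilcoxon signed-rank test, for which an implementation such as scipy.stats.wilcoxon could be used).

```python
def win_loss_tie(scores_a, scores_b, eps=1e-9):
    """Per-dataset win/loss/tie counts for algorithm A against B.

    scores_a, scores_b : paired per-dataset scores (e.g. mean F1 per dataset)
    eps                : tolerance below which two scores count as a tie
    """
    wins = sum(a > b + eps for a, b in zip(scores_a, scores_b))
    losses = sum(b > a + eps for a, b in zip(scores_a, scores_b))
    ties = len(scores_a) - wins - losses
    return wins, losses, ties
```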
7 Conclusion and future work
This paper introduces a new perspective on training multiclass ensemble classification models. The ensemble classifier model is viewed as a state to be estimated, and this state is estimated using a Kalman filter. Unlike more common applications of Kalman filters to time series data, this work exploits the sensor fusion property of the Kalman filter to combine multiple individual multiclass classifiers to build a multiclass ensemble classifier algorithm. Based on this new perspective, a new multiclass ensemble classification algorithm, the Kalman Filter-based Heuristic Ensemble (KFHE), is proposed.
Detailed experiments on two variants of KFHE, KFHEe and KFHEl, were performed. KFHEe is more effective on non-noisy class labels, as it emphasises the misclassified training datapoints from one iteration of the training algorithm to the next, and KFHEl is more effective on noisy class labels as it does not emphasise misclassified training datapoints as much. Experiments show that KFHEe and KFHEl perform at least as well as, and in many cases better than, Bagging, SGBM, GBM and AdaBoost. For datasets with noisy class labels, KFHEl performed significantly better than all other methods across different levels of class-label noise. For these datasets KFHEe performed more poorly than KFHEl, GBM, and SGBM, but better than AdaBoost.
KFHE can be seen as a hybrid ensemble approach mixing the benefits of both bagging and boosting. Bagging weighs each component learner's vote equally, whereas boosting finds the weights with which the component learners are combined. KFHE does not find optimal combination weights analytically as AdaBoost does, but instead combines the classifiers based on how good the measurement is in a given iteration.
Given the new perspective, other implementations that expand upon KFHE can also be designed following the framework and methods described in Sections 3 and 4. In future, it would be interesting to pursue the following studies:

The effect when process noise and a linear time update step are introduced.

Multiple and different types of measurements can also be performed. That is, instead of having one component classifier model per iteration, more than one classifier model could be used. This is analogous to having multiple noisy sensors measuring the DC voltage level of the toy example presented in Section 3.1.

To further study the effects of other types of noise (class-wise label noise, noise in the input space, etc.), higher levels of noise induced in the class-label assignments, and performance on imbalanced class datasets.
Acknowledgements
This research was supported by Science Foundation Ireland (SFI) under Grant number SFI/12/RC/2289. The authors would like to thank Gevorg Poghosyan, PhD Research Student at the Insight Centre for Data Analytics, School of Computer Science, University College Dublin, for feedback and discussions which led to the state space representation in Figure 2. The authors would also like to thank the anonymous reviewers for their detailed and constructive comments, which helped to significantly improve the quality of the paper.
Appendix A Complete experiment results
KFHEe  KFHEl  AdaBoost  GBM  SGBM  Bagging  CART  

mushroom  1.0000 0.00 (1.5)  0.9968 0.00 (5)  1.0000 0.00 (1.5)  0.9997 0.00 (3)  0.9990 0.00 (4)  0.9941 0.00 (6.5)  0.9941 0.00 (6.5) 
iris  0.9433 0.03 (4)  0.9487 0.03 (1)  0.9448 0.03 (2)  0.9403 0.04 (5)  0.9437 0.03 (3)  0.9376 0.03 (6)  0.9298 0.03 (7) 
glass  0.7125 0.08 (3)  0.7153 0.07 (1)  0.7144 0.07 (2)  0.6666 0.08 (4)  0.5695 0.09 (6)  0.5965 0.09 (5)  0.5466 0.06 (7) 
car_eval  0.9653 0.02 (2)  0.9011 0.03 (4)  0.9665 0.02 (1)  0.9131 0.04 (3)  0.8236 0.04 (7)  0.8569 0.04 (5)  0.8546 0.04 (6) 
cmc  0.5222 0.02 (5)  0.5280 0.03 (2)  0.5038 0.02 (7)  0.5270 0.02 (4)  0.5275 0.02 (3)  0.5291 0.03 (1)  0.5187 0.03 (6) 
tvowel  0.8451 0.02 (2)  0.8283 0.03 (3)  0.8279 0.02 (4)  0.8458 0.03 (1)  0.8236 0.03 (5)  0.8004 0.03 (6)  0.7855 0.03 (7) 
balance_scale  0.6345 0.03 (1)  0.5984 0.01 (4)  0.6186 0.03 (2)  0.5935 0.02 (5)  0.6045 0.01 (3)  0.5861 0.02 (6)  0.5412 0.02 (7) 
flags  0.3059 0.05 (3)  0.2771 0.05 (4)  0.3187 0.06 (2)  0.3236 0.06 (1)  0.2602 0.03 (5)  0.2525 0.03 (6)  0.2439 0.03 (7) 
german  0.6907 0.03 (2)  0.6960 0.02 (1)  0.6837 0.03 (5)  0.6860 0.03 (3)  0.6852 0.03 (4)  0.6826 0.02 (6)  0.6550 0.03 (7) 
ilpd  0.6126 0.04 (2)  0.5797 0.04 (5)  0.6153 0.04 (1)  0.5809 0.04 (4)  0.5733 0.04 (6)  0.5675 0.04 (7)  0.5865 0.04 (3) 
ionosphere  0.9238 0.03 (2)  0.9157 0.03 (4)  0.9298 0.02 (1)  0.9179 0.03 (3)  0.9105 0.03 (5)  0.9004 0.03 (6)  0.8617 0.03 (7) 
knowledge  0.9354 0.03 (2)  0.9315 0.03 (3)  0.9524 0.02 (1)  0.9155 0.03 (6)  0.8925 0.04 (7)  0.9184 0.03 (4)  0.9160 0.03 (5) 
vertebral  0.8036 0.04 (4)  0.8090 0.04 (2)  0.8001 0.04 (6)  0.8030 0.04 (5)  0.8130 0.05 (1)  0.8042 0.04 (3)  0.7857 0.04 (7) 
sonar  0.8072 0.05 (2)  0.7836 0.06 (4)  0.8371 0.05 (1)  0.7893 0.06 (3)  0.7797 0.06 (5)  0.7766 0.06 (6)  0.7021 0.05 (7) 
skulls  0.2358 0.06 (5)  0.2380 0.06 (3)  0.2362 0.06 (4)  0.2514 0.08 (1)  0.2436 0.06 (2)  0.2300 0.06 (7)  0.2309 0.07 (6) 
diabetes  0.9558 0.03 (7)  0.9647 0.03 (6)  0.9658 0.03 (5)  0.9725 0.02 (2)  0.9722 0.02 (3)  0.9727 0.03 (1)  0.9710 0.03 (4) 
physio  0.9069 0.02 (4)  0.9109 0.03 (2)  0.9079 0.02 (3)  0.9046 0.02 (5)  0.9136 0.03 (1)  0.8959 0.03 (6)  0.8847 0.03 (7) 
breasttissue  0.6766 0.08 (1)  0.6711 0.08 (2)  0.6606 0.08 (4)  0.6605 0.07 (5)  0.6347 0.08 (6)  0.6653 0.08 (3)  0.6338 0.08 (7) 
bupa  0.7027 0.04 (2)  0.7114 0.04 (1)  0.6926 0.04 (6)  0.7018 0.04 (3)  0.6944 0.04 (5)  0.6954 0.04 (4)  0.6433 0.05 (7) 
cleveland  0.2975 0.05 (2)  0.2845 0.04 (5)  0.3058 0.04 (1)  0.2938 0.04 (3)  0.2865 0.04 (4)  0.2736 0.04 (7)  0.2766 0.04 (6) 
haberman  0.5504 0.05 (6)  0.5743 0.05 (5)  0.5465 0.05 (7)  0.5751 0.05 (4)  0.5996 0.04 (1)  0.5757 0.05 (3)  0.5772 0.05 (2) 
hayes_roth  0.8602 0.05 (1)  0.8491 0.05 (3)  0.8510 0.04 (2)  0.6094 0.08 (6)  0.5683 0.10 (7)  0.7418 0.10 (4)  0.6691 0.10 (5) 
monks  0.9997 0.00 (2)  0.9981 0.01 (3)  1.0000 0.00 (1)  0.9671 0.06 (4)  0.9114 0.06 (5)  0.9002 0.06 (6)  0.8178 0.09 (7) 
newthyroid  0.3973 0.04 (4)  0.3867 0.04 (6)  0.3972 0.04 (5)  0.3742 0.04 (7)  0.4162 0.04 (2)  0.4275 0.04 (1)  0.4087 0.11 (3) 
yeast  0.5339 0.05 (1)  0.4701 0.03 (4)  0.5209 0.05 (3)  0.5225 0.05 (2)  0.4359 0.02 (5)  0.4187 0.03 (6)  0.4069 0.03 (7) 
spam  0.9477 0.01 (2)  0.9256 0.01 (4)  0.9508 0.01 (1)  0.9309 0.01 (3)  0.9225 0.01 (5)  0.9029 0.01 (6)  0.8870 0.01 (7) 
lymphography  0.6733 0.19 (2)  0.5074 0.13 (3)  0.7089 0.18 (1)  0.4454 0.10 (4)  0.3973 0.03 (5)  0.3966 0.03 (6)  0.3704 0.04 (7) 
movement_libras  0.7772 0.04 (1)  0.7488 0.05 (3)  0.7679 0.04 (2)  0.6434 0.05 (5)  0.6190 0.05 (6)  0.6715 0.05 (4)  0.5176 0.05 (7) 
SAheart  0.6214 0.04 (6)  0.6408 0.04 (4)  0.6090 0.04 (7)  0.6436 0.04 (3)  0.6509 0.03 (1)  0.6466 0.04 (2)  0.6237 0.05 (5) 
zoo  0.8548 0.12 (2)  0.8490 0.11 (3)  0.8740 0.11 (1)  0.8450 0.10 (4)  0.5455 0.11 (7)  0.5922 0.09 (5)  0.5840 0.05 (6) 
Average rank  2.78  3.33  2.98  3.7  4.3  4.82  6.08 
Each cell in the table shows the mean and standard deviation of the F1 score (higher is better) from the cross-validation experiment for each algorithm on each of the datasets listed in Table 1. The values in parentheses are the relative rankings of the algorithms on the dataset in the corresponding row (lower ranks are better).
KFHEe  KFHEl  AdaBoost  GBM  SGBM  Bagging  CART  

mushroom  0.9972 0.00 (4)  0.9941 0.00 (6)  0.9993 0.00 (2)  0.9997 0.00 (1)  0.9990 0.00 (3)  0.9941 0.00 (6)  0.9941 0.00 (6) 
iris  0.9205 0.05 (7)  0.9413 0.03 (3)  0.9282 0.04 (6)  0.9359 0.03 (4)  0.9486 0.03 (1)  0.9436 0.03 (2)  0.9335 0.03 (5) 
glass  0.6818 0.09 (2)  0.6969 0.07 (1)  0.6597 0.08 (3)  0.6498 0.08 (4)  0.5532 0.08 (6)  0.5876 0.10 (5)  0.5367 0.06 (7) 
car_eval  0.8918 0.03 (1)  0.8639 0.03 (4)  0.8765 0.03 (3)  0.8820 0.04 (2)  0.8047 0.04 (7)  0.8451 0.03 (5)  0.8423 0.03 (6) 
cmc  0.5223 0.02 (5)  0.5285 0.02 (3)  0.5047 0.03 (7)  0.5303 0.02 (2)  0.5274 0.02 (4)  0.5330 0.02 (1)  0.5197 0.03 (6) 
tvowel  0.8346 0.03 (2)  0.8275 0.03 (4)  0.7903 0.03 (6)  0.8435 0.03 (1)  0.8329 0.03 (3)  0.7961 0.03 (5)  0.7805 0.04 (7) 
balance_scale  0.5989 0.03 (1)  0.5940 0.02 (3)  0.5917 0.03 (4)  0.5912 0.02 (5)  0.5982 0.02 (2)  0.5799 0.02 (6)  0.5418 0.02 (7) 
flags  0.3113 0.06 (3)  0.2988 0.05 (4)  0.3193 0.06 (1)  0.3185 0.06 (2)  0.2678 0.04 (5)  0.2518 0.03 (6)  0.2420 0.04 (7) 
german  0.6732 0.03 (2)  0.6765 0.03 (1)  0.6690 0.03 (4)  0.6695 0.03 (3)  0.6655 0.03 (5)  0.6608 0.03 (6)  0.6348 0.04 (7) 
ilpd  0.6130 0.04 (2)  0.5874 0.04 (3)  0.6220 0.04 (1)  0.5815 0.04 (4)  0.5755 0.04 (6)  0.5724 0.04 (7)  0.5759 0.04 (5) 
ionosphere  0.9093 0.03 (2)  0.9087 0.03 (4)  0.9090 0.03 (3)  0.9018 0.03 (6)  0.9103 0.03 (1)  0.9023 0.03 (5)  0.8507 0.04 (7) 
knowledge  0.9360 0.02 (2)  0.9300 0.02 (3)  0.9369 0.02 (1)  0.9188 0.03 (4)  0.8924 0.03 (7)  0.9177 0.03 (6)  0.9181 0.03 (5) 
vertebral  0.7928 0.05 (5)  0.8073 0.05 (3)  0.7740 0.05 (6)  0.8024 0.05 (4)  0.8163 0.04 (1.5)  0.8163 0.05 (1.5)  0.7736 0.05 (7) 
sonar  0.7900 0.05 (2)  0.7759 0.05 (3)  0.8116 0.05 (1)  0.7719 0.06 (4)  0.7708 0.05 (5)  0.7610 0.05 (6)  0.6867 0.06 (7) 
skulls  0.2462 0.07 (3)  0.2226 0.06 (6)  0.2275 0.06 (5)  0.2550 0.07 (1)  0.2510 0.07 (2)  0.2295 0.06 (4)  0.1935 0.06 (7) 
diabetes  0.9364 0.04 (6)  0.9731 0.03 (1)  0.9305 0.04 (7)  0.9722 0.02 (2)  0.9705 0.02 (5)  0.9710 0.03 (3)  0.9709 0.03 (4) 
physio  0.8781 0.03 (5)  0.9092 0.02 (2)  0.8712 0.03 (6)  0.8995 0.02 (3)  0.9113 0.02 (1)  0.8955 0.03 (4)  0.8658 0.03 (7) 
breasttissue  0.6557 0.07 (2)  0.6709 0.07 (1)  0.6511 0.08 (3)  0.6502 0.08 (4)  0.6285 0.08 (6)  0.6416 0.08 (5)  0.5927 0.08 (7) 
bupa  0.6852 0.04 (2)  0.6962 0.04 (1)  0.6625 0.05 (6)  0.6839 0.04 (4)  0.6846 0.04 (3)  0.6795 0.05 (5)  0.6309 0.05 (7) 
cleveland  0.2895 0.05 (4)  0.2906 0.05 (3)  0.3076 0.05 (1)  0.2922 0.05 (2)  0.2883 0.05 (5)  0.2793 0.04 (7)  0.2864 0.05 (6) 
haberman  0.5429 0.05 (6)  0.5554 0.06 (5)  0.5342 0.05 (7)  0.5665 0.06 (3)  0.5793 0.06 (1)  0.5643 0.06 (4)  0.5738 0.06 (2) 
hayes_roth  0.8022 0.07 (2)  0.8289 0.06 (1)  0.7815 0.07 (3)  0.5869 0.09 (6)  0.5208 0.09 (7)  0.7145 0.10 (4)  0.6695 0.10 (5) 
monks  0.9644 0.02 (2)  0.9985 0.00 (1)  0.9311 0.03 (5)  0.9473 0.06 (3)  0.9268 0.06 (6)  0.9379 0.06 (4)  0.8498 0.09 (7) 
newthyroid  0.3972 0.04 (5)  0.3962 0.04 (6)  0.4039 0.04 (4)  0.3960 0.04 (7)  0.4475 0.04 (1)  0.4395 0.03 (2)  0.4086 0.11 (3) 
yeast  0.4797 0.06 (2)  0.4354 0.03 (4)  0.4521 0.07 (3)  0.4829 0.05 (1)  0.4312 0.03 (5)  0.4176 0.02 (6)  0.4021 0.03 (7) 
spam  0.9325 0.01 (1)  0.9265 0.01 (4)  0.9311 0.01 (2)  0.9294 0.01 (3)  0.9219 0.01 (5)  0.9039 0.01 (6)  0.8866 0.01 (7) 
lymphography  0.6436 0.16 (2)  0.4896 0.12 (3)  0.6507 0.15 (1)  0.4402 0.09 (4)  0.3954 0.04 (5)  0.3941 0.04 (6)  0.3704 0.05 (7) 
movement_libras  0.7365 0.04 (1)  0.7138 0.05 (3)  0.7362 0.04 (2)  0.6064 0.06 (5)  0.5868 0.06 (6)  0.6517 0.06 (4)  0.5008 0.05 (7) 
SAheart  0.6120 0.04 (5)  0.6252 0.04 (4)  0.6020 0.04 (7)  0.6255 0.04 (3)  0.6446 0.03 (1)  0.6405 0.04 (2)  0.6106 0.05 (6) 
zoo  0.7765 0.11 (4)  0.8025 0.13 (2)  0.7841 0.12 (3)  0.8058 0.12 (1)  0.5566 0.11 (7)  0.5823 0.09 (5)  0.5704 0.07 (6) 
Average rank  3.07  3.07  3.77  3.27  4.08  4.62  6.13 
KFHEe  KFHEl  AdaBoost  GBM  SGBM  Bagging  CART  

mushroom  0.9942 0.00 (4)  0.9941 0.00 (6)  0.9970 0.00 (3)  0.9988 0.00 (1)  0.9987 0.00 (2)  0.9941 0.00 (6)  0.9941 0.00 (6) 
iris  0.8749 0.06 (6)  0.9384 0.04 (3)  0.8569 0.06 (7)  0.9319 0.05 (5)  0.9487 0.03 (1)  0.9433 0.03 (2)  0.9380 0.03 (4) 
glass  0.6850 0.09 (2)  0.6990 0.08 (1)  0.6801 0.07 (3)  0.6253 0.08 (4)  0.5658 0.08 (6)  0.6189 0.09 (5)  0.5641 0.10 (7) 
car_eval  0.8660 0.04 (1)  0.8621 0.03 (2)  0.7311 0.04 (7)  0.8411 0.05 (3)  0.7421 0.05 (6)  0.8111 0.04 (4)  0.7901 0.04 (5) 
cmc  0.5133 0.02 (6)  0.5232 0.02 (1)  0.4886 0.02 (7)  0.5220 0.02 (2)  0.5212 0.02 (4)  0.5217 0.02 (3)  0.5175 0.03 (5) 
tvowel  0.8264 0.03 (2)  0.8187 0.03 (4)  0.7747 0.04 (7)  0.8383 0.03 (1)  0.8238 0.03 (3)  0.7961 0.03 (5)  0.7791 0.03 (6) 
balance_scale  0.6016 0.03 (2)  0.5939 0.02 (3)  0.5929 0.03 (4)  0.5912 0.02 (5)  0.6024 0.02 (1)  0.5757 0.02 (6)  0.5332 0.02 (7) 
flags  0.2716 0.05 (4)  0.2544 0.03 (6)  0.2860 0.04 (2)  0.2885 0.06 (1)  0.2763 0.03 (3)  0.2577 0.03 (5)  0.2484 0.04 (7) 
german  0.6698 0.03 (4)  0.6786 0.03 (1)  0.6611 0.03 (6)  0.6695 0.03 (5)  0.6756 0.03 (2)  0.6706 0.03 (3)  0.6348 0.04 (7) 
ilpd  0.5836 0.04 (2)  0.5782 0.04 (3)  0.5849 0.05 (1)  0.5737 0.04 (4)  0.5696 0.04 (5)  0.5650 0.04 (7)  0.5694 0.05 (6) 
ionosphere  0.8922 0.04 (5)  0.9098 0.03 (1)  0.8867 0.04 (6)  0.9048 0.03 (4)  0.9095 0.03 (2)  0.9093 0.03 (3)  0.8486 0.04 (7) 
knowledge  0.9132 0.03 (4)  0.9300 0.02 (1)  0.9122 0.03 (5)  0.9191 0.03 (2)  0.8906 0.03 (7)  0.9111 0.02 (6)  0.9156 0.03 (3) 
vertebral  0.7904 0.05 (5)  0.7987 0.04 (3)  0.7749 0.04 (7)  0.7949 0.05 (4)  0.8099 0.05 (1)  0.8059 0.05 (2)  0.7838 0.05 (6) 
sonar  0.7818 0.05 (2)  0.7719 0.06 (3)  0.7947 0.05 (1)  0.7588 0.06 (5)  0.7702 0.06 (4)  0.7505 0.06 (6)  0.6677 0.06 (7) 
skulls  0.2396 0.06 (4)  0.2362 0.07 (5)  0.2275 0.05 (6)  0.2547 0.06 (3)  0.2586 0.06 (1)  0.2557 0.07 (2)  0.2247 0.07 (7) 
diabetes  0.9079 0.05 (7)  0.9529 0.04 (4)  0.9193 0.05 (6)  0.9567 0.04 (2)  0.9578 0.04 (1)  0.9545 0.04 (3)  0.9527 0.04 (5) 
physio  0.8736 0.03 (6)  0.9114 0.03 (1)  0.8458 0.04 (7)  0.8977 0.03 (4)  0.9100 0.03 (2)  0.8988 0.03 (3)  0.8822 0.03 (5) 
breasttissue  0.6136 0.08 (5)  0.6489 0.10 (2)  0.6150 0.09 (4)  0.6721 0.09 (1)  0.5737 0.10 (7)  0.6470 0.09 (3)  0.5890 0.07 (6) 
bupa  0.6737 0.05 (5)  0.6894 0.04 (3)  0.6670 0.05 (6)  0.6823 0.05 (4)  0.6901 0.04 (1)  0.6898 0.05 (2)  0.6241 0.05 (7) 
cleveland  0.2805 0.05 (4)  0.2849 0.05 (2)  0.2828 0.05 (3)  0.2889 0.06 (1)  0.2654 0.05 (6)  0.2608 0.04 (7)  0.2700 0.04 (5) 
haberman  0.5436 0.06 (6)  0.5729 0.06 (5)  0.5326 0.05 (7)  0.5819 0.05 (3)  0.5975 0.05 (1)  0.5858 0.06 (2)  0.5796 0.07 (4) 
hayes_roth  0.7349 0.07 (2)  0.7971 0.07 (1)  0.7291 0.09 (3)  0.5641 0.09 (6)  0.5330 0.11 (7)  0.7036 0.09 (4)  0.6459 0.08 (5) 
monks  0.9336 0.02 (2)  0.9835 0.02 (1)  0.8884 0.03 (6)  0.9058 0.08 (5)  0.9069 0.05 (4)  0.9162 0.05 (3)  0.8280 0.09 (7) 
newthyroid  0.3922 0.04 (7)  0.4255 0.04 (4)  0.4015 0.04 (6)  0.4178 0.04 (5)  0.4601 0.04 (3)  0.4641 0.04 (2)  0.4903 0.08 (1) 
yeast  0.5061 0.05 (1)  0.4508 0.03 (3)  0.4499 0.05 (4)  0.4756 0.05 (2)  0.4382 0.03 (5)  0.4099 0.03 (6)  0.3958 0.03 (7) 
spam  0.9215 0.01 (4)  0.9251 0.01 (2)  0.9128 0.01 (5)  0.9279 0.01 (1)  0.9231 0.01 (3)  0.8993 0.01 (6)  0.8848 0.01 (7) 
lymphography  0.5765 0.15 (1)  0.4878 0.12 (3)  0.5567 0.12 (2)  0.4466 0.10 (4)  0.3963 0.04 (5)  0.3924 0.03 (6)  0.3878 0.05 (7) 
movement_libras  0.7025 0.06 (1)  0.6890 0.05 (3)  0.6957 0.05 (2)  0.5678 0.05 (6)  0.5818 0.05 (5)  0.6224 0.05 (4)  0.4789 0.05 (7) 
SAheart  0.6259 0.04 (5)  0.6410 0.04 (3)  0.6137 0.05 (7)  0.6359 0.05 (4)  0.6503 0.04 (1)  0.6436 0.04 (2)  0.6141 0.05 (6) 
zoo  0.7423 0.11 (2)  0.7612 0.11 (1)  0.7238 0.10 (3)  0.6636 0.10 (4)  0.5182 0.08 (6)  0.5183 0.08 (5)  0.5104 0.06 (7) 
Average rank  3.70  2.70  4.77  3.37  3.50  4.10  5.87 
| Dataset | KFHEe | KFHEl | AdaBoost | GBM | SGBM | Bagging | CART |
|---|---|---|---|---|---|---|---|
| mushroom | 0.9941 ± 0.00 (4.5) | 0.9941 ± 0.00 (4.5) | 0.9967 ± 0.00 (3) | 0.9992 ± 0.00 (1) | 0.9990 ± 0.00 (2) | 0.9934 ± 0.00 (6.5) | 0.9934 ± 0.00 (6.5) |
| iris | 0.8619 ± 0.06 (6) | 0.9379 ± 0.04 (3) | 0.8438 ± 0.06 (7) | 0.9317 ± 0.05 (4) | 0.9499 ± 0.03 (1) | 0.9463 ± 0.04 (2) | 0.9299 ± 0.03 (5) |
| glass | 0.5976 ± 0.09 (2) | 0.6240 ± 0.08 (1) | 0.5901 ± 0.09 (4) | 0.5909 ± 0.09 (3) | 0.5076 ± 0.08 (6) | 0.5345 ± 0.08 (5) | 0.4721 ± 0.08 (7) |
| car_eval | 0.8374 ± 0.04 (3) | 0.8394 ± 0.04 (2) | 0.6708 ± 0.04 (7) | 0.8563 ± 0.04 (1) | 0.7599 ± 0.05 (5) | 0.8108 ± 0.05 (4) | 0.7577 ± 0.07 (6) |
| cmc | 0.5182 ± 0.03 (5) | 0.5199 ± 0.02 (4) | 0.4889 ± 0.03 (7) | 0.5221 ± 0.02 (3) | 0.5245 ± 0.03 (2) | 0.5258 ± 0.03 (1) | 0.4949 ± 0.04 (6) |
| tvowel | 0.8261 ± 0.03 (3) | 0.8208 ± 0.03 (4) | 0.7552 ± 0.03 (7) | 0.8274 ± 0.03 (2) | 0.8311 ± 0.03 (1) | 0.7924 ± 0.03 (5) | 0.7758 ± 0.03 (6) |
| balance_scale | 0.5908 ± 0.03 (3) | 0.5948 ± 0.03 (2) | 0.5808 ± 0.04 (5) | 0.5871 ± 0.02 (4) | 0.5949 ± 0.02 (1) | 0.5674 ± 0.02 (6) | 0.5326 ± 0.03 (7) |
| flags | 0.3032 ± 0.06 (1) | 0.2998 ± 0.05 (3) | 0.2958 ± 0.06 (4) | 0.3021 ± 0.06 (2) | 0.2463 ± 0.03 (5) | 0.2451 ± 0.03 (6) | 0.2352 ± 0.04 (7) |
| german | 0.6488 ± 0.03 (3) | 0.6522 ± 0.03 (1) | 0.6376 ± 0.03 (5) | 0.6491 ± 0.03 (2) | 0.6432 ± 0.03 (4) | 0.6275 ± 0.03 (6) | 0.6202 ± 0.04 (7) |
| ilpd | 0.5645 ± 0.04 (3) | 0.5699 ± 0.04 (1) | 0.5698 ± 0.04 (2) | 0.5592 ± 0.04 (4) | 0.5564 ± 0.04 (5) | 0.5557 ± 0.04 (6) | 0.5517 ± 0.04 (7) |
| ionosphere | 0.8572 ± 0.04 (5) | 0.8892 ± 0.04 (3) | 0.8416 ± 0.05 (6) | 0.8714 ± 0.04 (4) | 0.9025 ± 0.04 (1) | 0.8995 ± 0.04 (2) | 0.8043 ± 0.06 (7) |
| knowledge | 0.9050 ± 0.03 (4) | 0.9291 ± 0.03 (1) | 0.8915 ± 0.03 (6) | 0.9090 ± 0.03 (3) | 0.8835 ± 0.03 (7) | 0.9133 ± 0.03 (2) | 0.9006 ± 0.04 (5) |
| vertebral | 0.7463 ± 0.04 (5) | 0.7790 ± 0.05 (3) | 0.7275 ± 0.05 (6) | 0.7641 ± 0.05 (4) | 0.7997 ± 0.05 (1) | 0.7866 ± 0.05 (2) | 0.7267 ± 0.05 (7) |
| sonar | 0.7548 ± 0.06 (2) | 0.7451 ± 0.06 (5) | 0.7462 ± 0.06 (4) | 0.7330 ± 0.06 (6) | 0.7643 ± 0.06 (1) | 0.7485 ± 0.07 (3) | 0.6356 ± 0.08 (7) |
| skulls | 0.2545 ± 0.06 (3) | 0.2635 ± 0.06 (1) | 0.2587 ± 0.06 (2) | 0.2530 ± 0.05 (4.5) | 0.2408 ± 0.08 (6) | 0.2530 ± 0.06 (4.5) | 0.2183 ± 0.06 (7) |
| diabetes | 0.8668 ± 0.06 (6) | 0.9494 ± 0.04 (5) | 0.8662 ± 0.08 (7) | 0.9523 ± 0.04 (4) | 0.9680 ± 0.03 (1) | 0.9536 ± 0.03 (3) | 0.9608 ± 0.03 (2) |
| physio | 0.8496 ± 0.03 (6) | 0.8964 ± 0.02 (3) | 0.8155 ± 0.04 (7) | 0.8893 ± 0.03 (4) | 0.9052 ± 0.02 (1) | 0.8995 ± 0.02 (2) | 0.8734 ± 0.03 (5) |
| breasttissue | 0.6072 ± 0.08 (6) | 0.6283 ± 0.08 (2) | 0.6202 ± 0.09 (3) | 0.6139 ± 0.07 (5) | 0.6161 ± 0.08 (4) | 0.6500 ± 0.07 (1) | 0.5652 ± 0.08 (7) |
| bupa | 0.6553 ± 0.04 (6) | 0.6779 ± 0.05 (2) | 0.6565 ± 0.05 (5) | 0.6640 ± 0.05 (4) | 0.6828 ± 0.05 (1) | 0.6776 ± 0.05 (3) | 0.6089 ± 0.05 (7) |
| cleveland | 0.2917 ± 0.05 (3.5) | 0.2869 ± 0.05 (5) | 0.2989 ± 0.05 (2) | 0.2998 ± 0.05 (1) | 0.2770 ± 0.04 (6) | 0.2917 ± 0.05 (3.5) | 0.2758 ± 0.05 (7) |
| haberman | 0.5448 ± 0.05 (6) | 0.5606 ± 0.06 (5) | 0.5404 ± 0.05 (7) | 0.5625 ± 0.06 (4) | 0.5819 ± 0.07 (1) | 0.5645 ± 0.06 (3) | 0.5677 ± 0.07 (2) |
| hayes_roth | 0.7335 ± 0.08 (2) | 0.7653 ± 0.08 (1) | 0.6615 ± 0.09 (4) | 0.5390 ± 0.08 (6) | 0.4691 ± 0.10 (7) | 0.6970 ± 0.09 (3) | 0.6558 ± 0.10 (5) |
| monks | 0.8757 ± 0.04 (4) | 0.9623 ± 0.03 (1) | 0.8265 ± 0.04 (6) | 0.8671 ± 0.07 (5) | 0.9045 ± 0.05 (3) | 0.9134 ± 0.06 (2) | 0.7957 ± 0.08 (7) |
| newthyroid | 0.3816 ± 0.04 (7) | 0.4305 ± 0.04 (3) | 0.3825 ± 0.04 (6) | 0.4224 ± 0.04 (4) | 0.4700 ± 0.04 (1) | 0.4646 ± 0.04 (2) | 0.4175 ± 0.10 (5) |
| yeast | 0.4700 ± 0.05 (1) | 0.4462 ± 0.04 (3) | 0.4198 ± 0.05 (4) | 0.4695 ± 0.05 (2) | 0.4194 ± 0.03 (5) | 0.4115 ± 0.03 (6) | 0.4018 ± 0.03 (7) |
| spam | 0.9154 ± 0.01 (4) | 0.9256 ± 0.01 (1) | 0.8968 ± 0.02 (6) | 0.9236 ± 0.01 (2) | 0.9206 ± 0.01 (3) | 0.9013 ± 0.01 (5) | 0.8808 ± 0.01 (7) |
| lymphography | 0.5011 ± 0.13 (1) | 0.4120 ± 0.08 (3) | 0.4825 ± 0.13 (2) | 0.3881 ± 0.03 (6) | 0.4008 ± 0.03 (4) | 0.3964 ± 0.03 (5) | 0.3561 ± 0.04 (7) |
| movement_libras | 0.6955 ± 0.05 (2) | 0.6914 ± 0.06 (3) | 0.6995 ± 0.05 (1) | 0.5758 ± 0.05 (6) | 0.5934 ± 0.05 (5) | 0.6325 ± 0.06 (4) | 0.4584 ± 0.06 (7) |
| SAheart | 0.6105 ± 0.04 (5) | 0.6279 ± 0.05 (4) | 0.5929 ± 0.04 (7) | 0.6322 ± 0.04 (2) | 0.6368 ± 0.04 (1) | 0.6290 ± 0.05 (3) | 0.6081 ± 0.04 (6) |
| zoo | 0.6541 ± 0.12 (4) | 0.7949 ± 0.12 (1) | 0.6618 ± 0.12 (3) | 0.7736 ± 0.11 (2) | 0.4358 ± 0.10 (7) | 0.5625 ± 0.07 (6) | 0.5627 ± 0.07 (5) |
| Average rank | 3.87 | 2.68 | 4.83 | 3.48 | 3.27 | 3.75 | 6.12 |
| Dataset | KFHEe | KFHEl | AdaBoost | GBM | SGBM | Bagging | CART |
|---|---|---|---|---|---|---|---|
| mushroom | 0.9939 ± 0.00 (5) | 0.9943 ± 0.00 (4) | 0.9964 ± 0.00 (3) | 0.9981 ± 0.00 (2) | 0.9984 ± 0.00 (1) | 0.9912 ± 0.01 (7) | 0.9914 ± 0.00 (6) |
| iris | 0.8457 ± 0.07 (6) | 0.9208 ± 0.05 (4) | 0.7999 ± 0.06 (7) | 0.9199 ± 0.05 (5) | 0.9560 ± 0.03 (1) | 0.9537 ± 0.03 (2) | 0.9369 ± 0.04 (3) |
| glass | 0.6253 ± 0.08 (2) | 0.6242 ± 0.08 (3) | 0.5804 ± 0.09 (5) | 0.5985 ± 0.09 (4) | 0.5692 ± 0.07 (6) | 0.6284 ± 0.08 (1) | 0.5329 ± 0.09 (7) |
| car_eval | 0.8347 ± 0.03 (3) | 0.8403 ± 0.04 (2) | 0.6342 ± 0.04 (7) | 0.8512 ± 0.04 (1) | 0.7853 ± 0.04 (6) | 0.8151 ± 0.05 (4) | 0.8124 ± 0.04 (5) |
| cmc | 0.5183 ± 0.02 (4) | 0.5167 ± 0.02 (5) | 0.4847 ± 0.03 (7) | 0.5202 ± 0.02 (3) | 0.5284 ± 0.02 (2) | 0.5285 ± 0.02 (1) | 0.5006 ± 0.03 (6) |
| tvowel | 0.8244 ± 0.03 (1) | 0.8198 ± 0.03 (4) | 0.7136 ± 0.04 (7) | 0.8226 ± 0.03 (3) | 0.8237 ± 0.03 (2) | 0.7897 ± 0.03 (5) | 0.7669 ± 0.03 (6) |
| balance_scale | 0.5856 ± 0.04 (4) | 0.5951 ± 0.03 (2) | 0.5442 ± 0.03 (7) | 0.5960 ± 0.03 (1) | 0.5892 ± 0.03 (3) | 0.5660 ± 0.03 (5) | 0.5471 ± 0.04 (6) |
| flags | 0.2893 ± 0.06 (3) | 0.2824 ± 0.05 (4) | 0.2934 ± 0.05 (2) | 0.2964 ± 0.05 (1) | 0.2609 ± 0.04 (5) | 0.2598 ± 0.05 (6) | 0.2243 ± 0.04 (7) |
| german | 0.6311 ± 0.03 (5) | 0.6613 ± 0.03 (2) | 0.6158 ± 0.03 (7) | 0.6496 ± 0.03 (4) | 0.6671 ± 0.03 (1) | 0.6535 ± 0.03 (3) | 0.6267 ± 0.04 (6) |
| ilpd | 0.5794 ± 0.04 (5) | 0.5836 ± 0.04 (4) | 0.5667 ± 0.04 (7) | 0.5843 ± 0.05 (3) | 0.5909 ± 0.04 (1) | 0.5848 ± 0.04 (2) | 0.5713 ± 0.04 (6) |
| ionosphere | 0.8458 ± 0.04 (4) | 0.8682 ± 0.04 (3) | 0.8260 ± 0.04 (6) | 0.8397 ± 0.04 (5) | 0.8731 ± 0.04 (2) | 0.8786 ± 0.04 (1) | 0.8053 ± 0.06 (7) |
| knowledge | 0.8762 ± 0.04 (5) | 0.8970 ± 0.03 (1) | 0.8459