An Extensive Evaluation of Filtering Misclassified Instances in Supervised Classification Tasks

12/13/2013 ∙ by Michael R. Smith, et al. ∙ Brigham Young University

Removing or filtering outliers and mislabeled instances prior to training a learning algorithm has been shown to increase classification accuracy. A popular approach for handling outliers and mislabeled instances is to remove any instance that is misclassified by a learning algorithm. However, an examination of which learning algorithms to use for filtering, as well as their effects on multiple learning algorithms over a large set of data sets, has not been done. Previous work has generally been limited by the large computational requirements of such an experiment and has thus examined computationally inexpensive learning algorithms on a small number of data sets. In this paper, we examine 9 learning algorithms as filtering algorithms and examine the effects of filtering on those same 9 learning algorithms over a set of 54 data sets. In addition to using each learning algorithm individually as a filter, we also use the set of learning algorithms as an ensemble filter and use an adaptive algorithm that selects a subset of the learning algorithms for filtering for a specific task and learning algorithm. We find that, in most cases, using an ensemble of learning algorithms for filtering produces the greatest increase in classification accuracy. We also compare filtering with a majority voting ensemble. The voting ensemble significantly outperforms filtering unless high amounts of noise are present in the data set. Additionally, we find that a majority voting ensemble is robust to noise, as filtering the training data does not increase the classification accuracy of the voting ensemble.

1 Introduction

The goal of supervised machine learning is to infer an accurate generalizing function h from a set of input feature vectors X and a corresponding set of label vectors Y. The quality of the function inferred by a learning algorithm is dependent on the quality of the data used for training. Many real-world data sets are noisy, where the noise in a data set can be label noise and/or attribute noise [1]. The focus of this paper is on label noise. Noise arises from various sources such as subjectivity, human error, and sensor malfunctions. As such, it is important to take the possibility of label noise into account when inferring a model of the data. Much previous work has examined the effects of class noise and how to handle it. As many real-world data sets are inherently noisy, most learning algorithms are designed to tolerate a certain degree of noise by avoiding overfitting the training data. There are two general approaches for handling class noise: 1) creating learning algorithms that are robust to noise, such as the C4.5 algorithm for decision trees [2], and 2) preprocessing the data prior to inferring a model of the data, such as filtering [3, 4] or correcting [5] noisy instances. In this work, we specifically examine handling noise by filtering.

Previous work has generally examined filtering in a limited context, using a single or very few learning algorithms and/or a limited number of data sets. This may be due in part to the extra computational requirement of first filtering a data set and then inferring a model of the data from the filtered data set. As such, previous work has generally limited itself to investigating relatively fast learning algorithms such as decision trees [6] and nearest-neighbor algorithms [7, 8]. In addition, filtering for instance-based learning algorithms was motivated in part by the desire to reduce the number of instances that have to be stored and because instance-based learning algorithms are more sensitive to noise than other learning algorithms. Also, most previous work artificially added noise to the data sets to show that filtering, weighting, or cleaning the data set is beneficial. In this work, we examine filtering misclassified instances over a set of 54 data sets and 9 learning algorithms without adding artificial noise. Artificial noise was added in previous work to show that filtering, weighting, or cleaning provided significant improvements on noisy data sets. Within the context of the benefits of filtering established by previous work, we show how filtering affects data sets without artificially adding noise to them. This also avoids making assumptions about the generation of the noise, which may or may not be accurate. We also compare filtering with a voting ensemble that uses a diverse set of base classifiers.

The insights provided shed light on which learning algorithms are beneficial for filtering and which learning algorithms are the most robust to noise. Using a larger number of data sets allows for more statistical confidence in the results than if only a small number of data sets were used. We find that using an ensemble filter achieves significantly higher classification accuracy than using a single learning algorithm as a filter. We also find that, in general, a voting ensemble is robust to noise: a voting ensemble trained on unfiltered data achieves significantly higher classification accuracy than a single learning algorithm trained on filtered data. On data sets with higher percentages of inherently noisy instances, however, using an ensemble filter achieves higher classification accuracy than a voting ensemble for some learning algorithms. Surprisingly, training a voting ensemble on filtered training data significantly decreases classification accuracy compared to training a voting ensemble on unfiltered training data.

In the next section, we present previous work for handling noise in supervised classification problems. A mathematical motivation for filtering misclassified instances is presented in Section 3. We then present our experimental methodology in Section 4 followed by a presentation of the results in Section 5. In Section 6 we provide conclusions and directions for future work.

2 Related Work

As many real-world data sets are inherently noisy, most learning algorithms are designed to tolerate a certain degree of noise. Typically, learning algorithms are designed to be somewhat robust to noise by making a trade-off between the complexity of the inferred model and optimizing the inferred function on the training data to prevent overfitting. Some techniques to avoid overfitting include early stopping using a validation set, pruning (such as in the C4.5 algorithm for decision trees [2]), or regularization by adding a complexity penalty to the loss function [9]. Further, some learning algorithms have been adapted specifically to better handle label noise. For example, noisy instances are problematic for boosting algorithms [10, 11] where more weight is placed upon misclassified instances, which often include mislabeled and noisy instances. To address this, Servedio [12] presented a boosting algorithm that does not place too much weight on any single training instance. For support vector machines, Collobert et al. [13] use the ramp-loss function to place a bound on the maximum penalty for an instance that lies on the wrong side of the margin. Lawrence and Schölkopf [14] explicitly model the possibility that an instance is mislabeled using a generative model and then use expectation maximization to update the probability that an instance is mislabeled.

Preprocessing the data set is another approach that explicitly handles label noise. This can be done by removing noisy instances, weighting the instances, or correcting incorrect labels. All three approaches first attempt to identify which instances are noisy by various criteria. Filtering noisy instances has received much attention and has generally resulted in an increase in classification accuracy [15, 16]. One frequently used filtering technique removes any instance that is misclassified by a learning algorithm [3] or a set of learning algorithms [4]. Verbaeten and Van Assche [17] further pursued the idea of using an ensemble for filtering using ideas from boosting and bagging. Other approaches use learning algorithm heuristics to remove noisy instances. Segata et al. [18] remove instances that are too close to, or on the wrong side of, the decision surface generated by a support vector machine. Zeng and Martinez [19] remove instances that have a low probability of being labeled correctly while training a neural network, where the probability is calculated using the output from the neural network. Filtering has the potential downside of discarding useful instances. However, it is assumed that there are significantly more non-noisy instances and that throwing away a few correct instances along with the noisy instances will not have a negative impact on a large data set.

Weighting the instances in a training set has the benefit of not discarding any instances. Rebbapragada and Brodley [20] weight the instances using expectation maximization to cluster instances that belong to a pair of classes. The probabilities between classes for each instance are compiled and used to weight the influence of each instance. Smith and Martinez [21] examine weighting the instances based on their probability of being misclassified.

Similar to weighting the training instances, data cleaning does not discard any instances, but rather strives to correct the noise in the instances. As in filtering, the output from a learning algorithm has been used to clean the data. Automatic data enhancement [22] uses the output from a neural network to correct the label for training instances that have a low probability of being correctly labeled. Polishing [23, 5] trains a learning algorithm (in this case a decision tree) to predict the value for each attribute (including the class). The predicted (i.e. correct) attribute values for the instances that increase generalization accuracy on a validation set are used instead of the uncleaned attribute values.

We differ from the related work in that we do not add artificial noise to the data sets when we examine filtering. Thus, we avoid making any assumptions about the noise source and focus on the noise inherent in the data sets. We also examine the effects of filtering on a larger set of learning algorithms and data sets providing more significance to the generality of the results.

3 Modeling Class Noise in a Discriminative Model

Lawrence and Schölkopf [14] proposed to model a data set probabilistically using a generative model that models the noise process. They assume that the joint distribution over x, y, and ŷ, where x is the set of input features, y is the possibly noisy class label given in the training set, and ŷ is the actual, unknown class label, is factorized as shown in Figure 1a. However, since modeling the prior distribution of the unobserved random variable ŷ is not feasible, it is more practical to estimate the prior distribution of the observed label y with some assumptions about the class noise, as shown in Figure 1b.

Figure 1: Graphical model of the generative probabilistic model proposed by Lawrence and Schölkopf [14].

Here, we follow the premise of Lawrence and Schölkopf by explicitly modeling the possibility that an instance is misclassified. Rather than using a generative model, though, we use a discriminative model since we are focusing on classification tasks and do not require the full joint distribution. Also, discriminative models have been shown to yield better performance on classification tasks [24].

Figure 2: Graphical representation of a discriminative probabilistic model a) without and b) with the latent random variable ŷ representing the actual class label.

Let T be a training set composed of instances ⟨x_i, y_i⟩ drawn i.i.d. from the underlying data distribution D. Each instance is composed of an input vector x_i with a corresponding, possibly noisy, label vector y_i. Given the training data T, a learning algorithm generally seeks to find the most probable hypothesis h that maps each x_i to its label y_i. For supervised classification problems, most learning algorithms maximize p(y_i|x_i) for all instances in T. This is shown graphically in Figure 2a, where the probabilities are estimated using a discriminative approach such as a neural network or a decision tree to infer a hypothesis of the data. Using Bayes’ rule and decomposing T into its individual constituent instances, the maximum a posteriori hypothesis is:

h_MAP = argmax_{h ∈ H} p(h|T)
      = argmax_{h ∈ H} p(T|h) p(h)
      = argmax_{h ∈ H} ∏_{⟨x_i, y_i⟩ ∈ T} p(y_i|x_i, h) p(h)        (1)

In Equation 1, the MAP hypothesis is found by finding a global optimum where all instances are included in the optimization problem. However, noisy instances are often detrimental to finding the global optimum since they are not representative of the true (and unknown) underlying data distribution D. The possibility of label noise is not explicitly modeled in this form, which completely ignores the actual class label ŷ. Thus, label noise is generally handled by avoiding overfit such that more probable, simpler hypotheses are preferred (through the prior p(h)). The possibility of label noise can be modeled explicitly by including the latent random variable ŷ_i, where y_i is the observed, possibly noisy, label and ŷ_i is the actual, unobserved label. Thus, an instance is the triplet ⟨x_i, y_i, ŷ_i⟩ and a supervised learning algorithm seeks to maximize p(y_i, ŷ_i|x_i), modeled graphically in Figure 2b. Using the model in Figure 2b, the MAP hypothesis becomes:

h_MAP = argmax_{h ∈ H} ∏_{⟨x_i, y_i, ŷ_i⟩ ∈ T} p(y_i|ŷ_i) p(ŷ_i|x_i, h) p(h)        (2)

Equation 2 shows that for an instance ⟨x_i, y_i, ŷ_i⟩, the probability of an observed class label, p(y_i|ŷ_i), should be weighted by the probability of the actual class, p(ŷ_i|x_i, h). We now show a method to estimate p(ŷ_i|x_i).

For filtering as a preprocessing step, we want to calculate p(ŷ_i|x_i) and remove instances for which the actual class ŷ_i has a low probability of being the same as the observed label y_i. Using a discriminative model trained on T, we can calculate p(ŷ_i|x_i) as p(ŷ_i|x_i, h), where h is the hypothesis inferred from T. Since the actual class label ŷ_i is unknown, p(ŷ_i|x_i, h) is approximated with the class distribution that the inferred model produces for x_i, assuming that the relationship between the observed and actual labels is represented in T. In other words, the inferred discriminative model is able to model whether one class label is more likely than another class label given an observed noisy label; otherwise, all class labels are assumed to be equally likely given an observed label. Thus, p(ŷ_i|x_i, h) can be approximated by finding the class distribution for a given x_i from an inferred discriminative model. That is, after training a learning algorithm on T, the class distribution for an instance can be calculated based on the output from the learning algorithm. As shown in Equation 1, p(h|T) is found naturally through a derivation of Bayes’ law. The quantity p(y_i|x_i, h) is the likelihood of an instance given a hypothesis, which a learning algorithm tries to maximize for each instance. Further, the dependence on a single hypothesis h can be removed by summing over all possible hypotheses in H and multiplying each by p(h|T):

p(ŷ_i|x_i) = Σ_{h ∈ H} p(ŷ_i|x_i, h) p(h|T)        (3)

This formulation is infeasible, though, because 1) it is not practical (or possible) to sum over the set of all hypotheses, 2) calculating p(h|T) is non-trivial, and 3) not all learning algorithms produce a probability distribution. These limitations make probabilistic generative models attractive, such as the kernel Fisher discriminant algorithm [14]. However, for classification tasks, generative models generally have a higher asymptotic error than discriminative models [24].

4 Methodology

Figure 3: Dendrogram of the considered learning algorithms clustered using unsupervised metalearning based on their classifier output difference.

The first step in filtering is to determine p(ŷ_i|x_i) for each instance. Given that a number of different techniques could be employed to estimate p(ŷ_i|x_i), we conduct an extensive evaluation of filtering misclassified instances using a diverse set of learning algorithms. Here, diversity means that the learning algorithms do not produce the same classification for all of the instances, and it is determined using unsupervised meta-learning (UML) [25]. UML first uses Classifier Output Difference (COD) [26] to measure the diversity between learning algorithms: COD measures the distance between two learning algorithms as the probability that they make different predictions. UML then clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. We considered 20 learning algorithms from Weka with their default parameters [27]. The resulting dendrogram is shown in Figure 3, where the height of the line connecting two clusters corresponds to the distance (COD value) between them. A cut-point of 0.18 was chosen to create 9 clusters, and a representative algorithm from each cluster was used to create a diverse set of learning algorithms. The learning algorithms that were used are listed in Table 1.
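As a rough illustration of this clustering step, the sketch below estimates pairwise COD scores from cross-validated predictions and clusters the learners hierarchically. It is a minimal sketch, not the experimental setup of the paper: the scikit-learn classifiers, the cross_val_predict-based estimate of COD, and the function names are our own stand-ins for the 20 Weka algorithms and their implementations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def cod_matrix(learners, X, y, cv=10):
    """COD between two learners: the probability that they predict different labels."""
    preds = [cross_val_predict(clf, X, y, cv=cv) for clf in learners]
    n = len(learners)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.mean(preds[i] != preds[j])
    return d

# Stand-ins for the 20 Weka learners considered in the paper.
learners = [DecisionTreeClassifier(), GaussianNB(),
            KNeighborsClassifier(n_neighbors=5), RandomForestClassifier()]

# Given a data set (X, y):
# d = cod_matrix(learners, X, y)
# z = linkage(squareform(d), method="average")          # agglomerative clustering
# clusters = fcluster(z, t=0.18, criterion="distance")  # cut-point of 0.18 as above
```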

Learning Algorithms
* Multilayer Perceptron trained with Back Propagation (MLP)
* Decision Tree (C4.5) [2]
* Locally Weighted Learning (LWL)
* 5-Nearest Neighbors (5-NN)
* Nearest Neighbor with generalization (NNge)
* Naïve Bayes (NB)
* RIpple DOwn Rule learner (RIDOR)
* Random Forest (RandForest)
* Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
Table 1: Set of learning algorithms used for filtering.

We investigate filtering using the learning algorithms shown in Table 1. Since not all learning algorithms produce a probability distribution, the indicator function of whether an instance is misclassified is used in this paper in place of p(ŷ_i|x_i); thus, misclassified instances are removed. Each learning algorithm first filters misclassified instances and then infers a model of the data using the filtered data set. We also examine using an ensemble filter, removing instances that are misclassified by different percentages of the 9 learning algorithms. The ensemble filter more closely approximates p(ŷ_i|x_i) from Equation 3 since it sums over a set of learning algorithms (which in this case were chosen to be diverse and to represent a larger subset of the hypothesis space H), lessening the dependence on a single hypothesis h. For the ensemble filter, p(ŷ_i|x_i) is estimated using a subset G of learning algorithms:

p(ŷ_i|x_i) ≈ (1/|G|) Σ_{g_j ∈ G} p(ŷ_i|x_i, g_j(T))        (4)

where g_j(T) is the hypothesis produced by the learning algorithm g_j trained on the training set T. From Equation 3, p(h|T) is estimated as 1/|G| for the hypotheses generated by training the learning algorithms in G on T and as zero for all of the other hypotheses in H. Also, p(ŷ_i|x_i, g_j(T)) is estimated using the indicator function since not all learning algorithms produce a probability distribution over the output classes. Set up as such, the ensemble filter counts how many times an instance is misclassified by a set of learning algorithms. Brodley and Friedl [4] examined an ensemble of three learning algorithms on five data sets with artificially generated noise inserted into the data sets. In this paper, we examine an ensemble filter that removes instances misclassified by 50, 70, and 90 percent of the learning algorithms in the ensemble. One of the problems of using an ensemble filter is having to choose the percentage of learning algorithms that must misclassify an instance for it to be filtered. For the results, we use 5 by 10-fold cross-validation and report, for each data set, the accuracy from the percentage that produces the highest accuracy. This method highlights the impact of using an ensemble filter; in practice, a validation set would typically be used to determine the percentage.
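The ensemble filter itself reduces to a simple counting procedure. A minimal sketch is given below, assuming scikit-learn classifiers as stand-ins for the algorithms in Table 1; the function name and the threshold parameter are our own.

```python
import numpy as np
from sklearn.base import clone

def ensemble_filter(learners, X, y, threshold=0.7):
    """Remove instances misclassified by at least `threshold` of the learners.

    Each learner is trained on the full training set, mirroring the filtering
    variant that performed best in the experiments described in this paper.
    """
    misclassified = np.zeros(len(y))
    for clf in learners:
        model = clone(clf).fit(X, y)
        misclassified += (model.predict(X) != y)
    keep = misclassified / len(learners) < threshold
    return X[keep], y[keep]

# Usage: X_f, y_f = ensemble_filter(learners, X, y, threshold=0.7)
# A final model is then trained on the filtered data (X_f, y_f).
```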

In addition, we examine an adaptive filtering approach that iteratively builds a set of filtering learning algorithms: at each step, the candidate learning algorithm that produces the highest classification accuracy on a validation set when added to the current filter set is selected, as shown in Algorithm 1. The run(g, T, F) function trains a learning algorithm g on a data set T, using the filter set F to filter the instances, and returns the accuracy of the learning algorithm on a validation set. As with the ensemble filter, instances are removed if they are misclassified by a given percentage of the filtering learning algorithms. The idea is to choose an approximately optimal subset of learning algorithms through a greedy search of the candidate filtering algorithms. For the results, we again use 5 by 10-fold cross-validation and report the accuracy from the percentage that produces the highest accuracy for each data set.

1:  Let F be the filter set used for filtering and C be the set of candidate learning algorithms for F.
2:  Initialize F to the empty set: F ← ∅
3:  Initialize the current accuracy to the accuracy from an empty filter set: currAcc ← run(g, T, F). run(g, T, F) returns the accuracy from a learning algorithm g trained on a data set T filtered with F.
4:  while C ≠ ∅ do
5:     bestAcc ← currAcc; bestAlg ← null;
6:     for all c ∈ C do
7:        F′ ← F ∪ {c}; acc ← run(g, T, F′);
8:        if acc > bestAcc then
9:           bestAcc ← acc; bestAlg ← c;
10:        end if
11:     end for
12:     if bestAlg ≠ null then
13:        F ← F ∪ {bestAlg}; C ← C \ {bestAlg}; currAcc ← bestAcc;
14:     else
15:        break;
16:     end if
17:  end while
Algorithm 1 Adaptively constructing a filter set.
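For concreteness, a minimal sketch of Algorithm 1 in Python is given below. The helper run(...) is passed in as a function and is assumed to behave as described above (filter the training data with the given filter set, train the learning algorithm, and return validation accuracy); its name and signature are ours, not the paper's.

```python
def adaptive_filter_set(learner, candidates, X, y, run, threshold=0.7):
    """Greedy construction of a filter set (a sketch of Algorithm 1).

    `run(learner, X, y, filter_set, threshold)` is assumed to filter (X, y)
    with `filter_set`, train `learner` on the result, and return accuracy
    on a validation set.
    """
    filter_set = []                      # F, initially empty
    best_acc = run(learner, X, y, filter_set, threshold)
    remaining = list(candidates)         # C, the candidate filtering algorithms
    while remaining:
        best_alg, best_cand_acc = None, best_acc
        for cand in remaining:           # try adding each remaining candidate
            acc = run(learner, X, y, filter_set + [cand], threshold)
            if acc > best_cand_acc:
                best_cand_acc, best_alg = acc, cand
        if best_alg is None:             # no candidate improves accuracy: stop
            break
        filter_set.append(best_alg)
        remaining.remove(best_alg)
        best_acc = best_cand_acc
    return filter_set
```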

Each method for filtering is evaluated using 5 by 10-fold cross-validation (running 10-fold cross-validation 5 times, each time with a different seed to partition the data). We examine filtering using the 9 chosen learning algorithms on a set of 47 data sets from the UCI data repository and 7 non-UCI data sets [28, 29, 30, 31]. For filtering, we examine two methods for training the filtering algorithms: 1) training on the entire training set and removing the instances it misclassifies, and 2) using cross-validation on the training set and removing instances that are misclassified while held out in the validation fold. The number of folds for cross-validation on the training set was set to 2, 3, 4, and 5. Table 2 shows the data sets used in this study, organized according to the number of instances, number of attributes, and attribute type. The non-UCI data sets are in bold.
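A minimal sketch of the two filtering variants and of the 5 by 10-fold evaluation is shown below, using scikit-learn utilities as stand-ins for the Weka-based setup; in the actual experiments the filter is applied to the training folds of the outer cross-validation rather than once to the whole data set, so treat this only as an illustration of the mechanics.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_predict,
                                     cross_val_score)

def misclassified_mask(clf, X, y, n_folds=None):
    """True where `clf` misclassifies an instance.

    n_folds=None trains the filter on the entire training set (variant 1);
    an integer in {2, 3, 4, 5} uses cross-validation on the training set
    so instances are judged while held out (variant 2).
    """
    if n_folds is None:
        preds = clone(clf).fit(X, y).predict(X)
    else:
        preds = cross_val_predict(clone(clf), X, y, cv=n_folds)
    return preds != y

# 5 by 10-fold cross-validation for evaluating a learning algorithm:
# cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
# scores = cross_val_score(model, X, y, cv=cv)
```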

Table 2: Data sets used in this study, organized by number of instances (# Ins), number of attributes, and attribute type (categorical, numerical, or mixed). The non-UCI data sets are in bold.

Statistical significance between pairs of algorithms is determined using the Wilcoxon signed-ranks test as suggested by Demšar [32]. We emphasize the extensive nature of this evaluation:

  1. Filtering is examined on 9 diverse learning algorithms.

  2. 9 diverse learning algorithms are examined as misclassification filtering techniques.

  3. In addition to the single algorithm misclassification filters, an ensemble filter and an adaptive filter are examined.

  4. Each filtering method is examined on a set of 54 data sets using 5 by 10-fold cross-validation.

  5. Each filtering method is examined on the entire training set as well as using 2-, 3-, 4-, and 5-fold cross-validation.

5 Results

In this section, we present the results from filtering the 54 data sets using a biased filter (the same learning algorithm used to filter misclassified instances is used to infer a model of the data), an ensemble filter, and the adaptive filter. Except for the adaptive filter, we find that using cross-validation on the training set for filtering resulted in lower (and often significantly lower) accuracy than using the entire training set; as such, the following results for the biased filter and the ensemble filter are from using the entire training set for filtering rather than cross-validation. We first show how filtering affects each learning algorithm in Section 5.1. Next, we examine using a set of data set measures to determine when filtering is the most effective in Section 5.2. Our results suggest that using an ensemble filter produces the best results in all cases. In Section 5.3, we then compare filtering with a voting ensemble and show that a voting ensemble is preferable to filtering.

5.1 Filtering Results

The filtering results are summarized in Table 3, which shows the average classification accuracy for each learning algorithm and filtering algorithm pair. (The NNge learning algorithm did not finish running on two data sets, eye-movements and MAGIC telescope, and RIPPER did not finish on the lung cancer data set. In these cases, the data sets are omitted from the presented results; as such, NNge was evaluated on a set of 52 data sets and RIPPER was evaluated on a set of 53 data sets.) The values in bold represent those that are a statistically significant improvement over not filtering. The results of the statistical significance tests for each of the learning algorithms are provided in Tables 10-18 in A. The results are summarized below.

MLP C4.5 IB5 LWL NB NNge RF Rid RIP
Orig 81.74 80.80 79.91 72.80 76.94 80.14 82.28 79.90 79.76
Biased 81.72 80.75 79.53 70.91 75.88 80.34 82.14 79.02 79.87
Ensemble 83.40 81.61 80.85 73.48 78.92 82.21 82.93 80.57 81.26
Adaptive 82.38 80.63 80.01 73.44 78.48 81.33 81.87 80.00 80.43
Table 3: Summary of filtering using the same learning algorithm to filter misclassified instances and to infer a model of the data, an ensemble filter, and the adaptive filter. For all learning algorithms, the ensemble filter significantly increases the classification accuracy.

We find that using a biased filter does not significantly increase the classification accuracy for any of the learning algorithms and that it significantly decreases the classification accuracy for the LWL, naïve Bayes, Ridor, and RIPPER learning algorithms. These results suggest that simply removing the instances misclassified by a single learning algorithm is not sufficient. Bear in mind that these results were obtained without adding any artificial noise to the training set. In the case where artificial noise is added to the training set (as was commonly done in previous work), using a biased filter may result in an improvement in accuracy. However, most real-world scenarios do not artificially add noise to their data sets but are concerned with the inherent noise found within them.

For all of the learning algorithms, the ensemble filter significantly increases the classification accuracy over not filtering and over the other filtering techniques. An ensemble generally provides better predictive performance than any of its constituent learning algorithms [33] and generally yields better results when the underlying ensembled models are diverse [34]. Thus, by using a more powerful model, only the noisiest instances are removed. This provides empirical evidence that filtering instances with a low p(ŷ_i|x_i) that is not dependent on a single hypothesis is preferable to filtering instances where the probability of the class depends on a particular hypothesis, as outlined in Equation 3.

Surprisingly, the adaptive filter does not outperform the ensemble filter and, in one case, it does not even outperform training on unfiltered data. Perhaps this is because it overfits the training data, since the best accuracy is chosen on the training set. Adaptive filtering has significantly better results when cross-validation is used to filter misclassified instances, as opposed to removing misclassified instances that were also used to train the filtering algorithm. Even with the cross-validation results, however, the adaptive filter is not significantly better than the ensemble filter.

RF C4.5 Rid IB5 NNge MLP LWL RIP NB
p-val 0.045 0.035 0.019 0.018 0.006 0.004 0.004
Table 4: The p-values from the Wilcoxon signed-ranks statistical significance test comparing not filtering with an ensemble filter. The learning algorithms are ordered in descending order of p-value from left to right.
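For reference, the per-learning-algorithm comparisons in Table 4 can be reproduced in form (not in the actual values) with SciPy's Wilcoxon signed-ranks test over paired per-data-set accuracies; the accuracy arrays below are placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-data-set accuracies for one learning algorithm (not real values).
acc_unfiltered      = np.array([0.82, 0.75, 0.91, 0.68, 0.79])
acc_ensemble_filter = np.array([0.84, 0.77, 0.92, 0.70, 0.80])

stat, p_value = wilcoxon(acc_unfiltered, acc_ensemble_filter)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.3f}")
```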

Examining each learning algorithm individually, we find that some learning algorithms are more robust to noise than others. To determine which learning algorithms are more robust to noise, we compare the accuracy of the learning algorithms without filtering to the accuracy obtained using an ensemble filter. The p-values from the Wilcoxon signed-ranks statistical significance test are shown in Table 4, ordered left to right from the greatest p-value (least significant impact) to the least. We see that random forests and decision trees are the most robust to noise, as filtering has the least significant impact on their accuracy. This is not too surprising given that the C4.5 algorithm was designed to take noise into account and random forests are built using decision trees. Ridor and 5-nearest neighbors (IB5) are also relatively robust to noise, but still improve considerably with filtering. IB5 is more robust to noise because it considers the 5 nearest neighbors of an instance; if k were set to 1, filtering would have a greater effect on the accuracy. Filtering has the most significant effect on the accuracy of the last five learning algorithms: MLP, NNge, LWL, RIPPER, and naïve Bayes.

5.2 Analysis of When to Filter

Using only the inherent noise in a data set, the efficacy of filtering is limited and filtering can even be detrimental on some data sets. Thus, we examine the cases in which filtering significantly improves the classification accuracy. This investigation is similar to the recent work by Sáez et al. [35], who investigate creating a set of rules to understand when to filter using a 1-nearest neighbor learning algorithm. They use a set of data complexity measures from Ho and Basu [36]. The complexity measures are designed for binary classification problems, but we do not limit ourselves to binary classification problems. As such, we use the subset of the data complexity measures shown in Table 5 that have been extended to handle multi-class problems [37].

F2: Volume of overlap region: The overlap of the per-class bounding boxes, calculated for each attribute by normalizing the difference of the maximum and minimum values from each class.
F3: Max individual feature efficiency: For all of the features, the maximum ratio of the number of instances not in the overlapping region to the total number of instances.
F4: Collective feature efficiency: F3 only returns the ratio for the attribute that maximizes the ratio; F4 is a measure over all of the attributes.
N1: Fraction of points on class boundary: The fraction of instances in a data set that are connected to nearest neighbors with a different class in a spanning tree.
N2: Ratio of average intra/inter class NN distance: The average distance to the nearest intra-class neighbors divided by the average distance to the nearest inter-class neighbors.
N3: Error rate of 1NN classifier: Leave-one-out error estimate of 1NN.
T1: Fraction of maximum covering spheres: The normalized count of the number of clusters of instances containing a single class.
T2: Average number of points per dimension: Compares the number of instances to the number of features.
Table 5: List of complexity measures from Ho and Basu [36].

In addition, we also examine a set of hardness measures [38] shown in Table 6. The hardness measures are designed to determine and characterize instances that have a high likelihood of being misclassified. We examine using the data complexity measures and the hardness measures to create rules and/or a classifier that determines when to use filtering. We set up the classification problem similarly to Sáez et al.: filtering is labeled “TRUE” if it significantly improves the classification accuracy for a data set according to the Wilcoxon signed-ranks test. We also examine predicting the difference in accuracy between using and not using a filter. Unlike Sáez et al., we find that the data complexity measures and the hardness measures do not create a satisfactory classifier for determining when to filter. Granted, we examine more learning algorithms and do not artificially add noise to the data sets, which leaves few data sets where filtering significantly improves the classification accuracy. In the study by Sáez et al., 75% of the data sets had at least 5% noise added, providing more positive examples. Future work is required to determine when to use filtering on unmodified data sets. Based on our results, we would recommend always using an ensemble filter for all of the learning algorithms, as it significantly outperforms the other filtering techniques.

kDN k-Disagreeing Neighbors: The percentage of the k nearest neighbors (using Euclidean distance) of an instance that do not share its target class value.
DS Disjunct Size: The number of instances in a disjunct divided by the number of instances covered by the largest disjunct in a data set, in an unpruned decision tree inferred using C4.5 [2].
DCP Disjunct Class Percentage: The number of instances in a disjunct belonging to its class divided by the total number of instances in the disjunct, in a pruned decision tree.
TD Tree Depth: The depth of the leaf node that classifies an instance in an induced decision tree.
CL Class Likelihood: The probability that an instance belongs to its class given the input features.
CLD Class Likelihood Difference: The difference between the class likelihood of an instance and the maximum likelihood for all of the other classes.
MV Minority Value: The ratio of the number of instances sharing an instance's target class value to the number of instances in the majority class.
CB Class Balance: The difference between the ratio of the number of instances belonging to a class and the ratio the class would have if all classes were distributed equally.
Table 6: List of hardness measures from Smith et al. [38].
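As an example of how such measures are computed, a minimal sketch of the kDN measure from Table 6 is shown below; the use of scikit-learn's NearestNeighbors and the default k=5 are our own assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_disagreeing_neighbors(X, y, k=5):
    """kDN: fraction of an instance's k nearest neighbors with a different class."""
    # Ask for k+1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]          # drop the instance itself
    return np.mean(neighbor_labels != y[:, None], axis=1)

# Instances with a high kDN score lie near class boundaries or may be mislabeled.
```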

5.3 Voting Ensemble vs. Filtering

In this section, we compare the results of filtering using an ensemble filter with a voting ensemble. The voting ensemble uses the learning algorithms shown in Table 1, and the vote from each learning algorithm is equally weighted. Table 7 compares the voting ensemble with using an ensemble filter on each of the investigated learning algorithms, giving the average accuracy, the p-value, and the number of times that the accuracy of the voting ensemble is greater than, equal to, or less than that from using an ensemble filter. The results for each data set are provided in Table 19 in B. With no artificially generated noise, a voting ensemble achieves significantly higher classification accuracy than an ensemble filter for each of the examined learning algorithms. This is not too surprising considering that previous research has shown that ensemble methods address issues that are common to all non-ensemble learning algorithms [39] and that ensemble methods generally obtain greater accuracy than a single learning algorithm that makes up part of the ensemble [40]. Considering the computational requirements for training, using a voting ensemble for classification rather than filtering appears to be more beneficial.

Ensemble MLP C4.5 IB5 LWL NB
Acc 84.37 83.40 81.61 80.85 73.48 78.92
p-value 0.008
33,1,20 43,1,10 42,2,10 47,1,6 41,0,13
Ensemble NNge RF Rid RIP
Acc 84.37 81.59 82.93 80.57 80.76
p-value
44,2,8 39,0,15 47,1,6 44,1,9
Table 7: Summary of comparing a voting ensemble with filtering using an ensemble filter. The voting ensemble significantly improves the classification accuracy over using an ensemble filter for all of the examined learning algorithms.
Data set Ens MLP C4.5 IB5 LWL NB NNge RF Rid RIP Per
breastc 73.99 73.43 75.17 74.13 73.31 73.43 74.13 73.19 74.71 74.59 10.1
arrhyth 71.11 70.13 70.65 59.14 57.67 65.63 65.71 66.59 70.65 71.09 12.0
contact 76.67 83.33 83.33 76.39 76.39 76.39 80.56 80.56 79.17 77.78 12.5
lungCan 53.75 52.08 56.25 47.92 55.21 55.21 56.25 52.08 51.04 54.17 12.5
yeast 61.08 59.43 60.13 59.32 40.7 58.15 59.4 61.08 59.74 60.01 13.7
cm1_req 75.73 76.40 77.53 77.53 76.78 77.53 77.15 76.78 77.53 76.78 16.9
titanic 78.72 78.66 78.68 78.59 77.9 77.77 78.68 78.68 78.28 78.68 16.9
post-op 69.78 71.11 71.11 71.11 71.11 71.11 71.11 71.11 71.11 71.11 26.7
pri-tum 48.08 47.79 41.2 45.82 34.42 48.57 44.44 45.23 39.82 40.41 32.2
Acc 67.66 68.04 68.23 65.55 62.61 67.09 67.49 67.26 66.89 67.18
p-value 0.367 0.820 0.125 0.213 0.410 0.545 0.312 0.410 0.715
6,0,3 4,0,5 6,0,3 6,0,3 5,0,4 4,0,5 5,1,3 5,0,4 4,0,5
Table 8: Comparison of a voting ensemble against using an ensemble filter on the subset of data sets where more than 10% of the constituent instances are noisy. The accuracy of the voting ensemble (“Ens”) is in bold if it is greater than the accuracies from using an ensemble filter for the investigated learning algorithms. The accuracy from using an ensemble filter is in bold if it is higher than the accuracy from a voting ensemble. The column “Per” refers to the percentage of instances in the data set that are considered noisy.

Many previous studies [1, 14, 4, 17] have shown that when a large amount of artificial noise is added to a data set, filtering outperforms a voting ensemble. We examine which of the 54 data sets have a high percentage of noise using instance hardness [38] to identify suspected noisy instances. Instance hardness approximates the likelihood that an instance will be misclassified by evaluating how the instance is classified by a set of learning algorithms L, averaging over the learning algorithms the probability that each assigns to a class other than the instance's labeled class. The set L is composed of the learning algorithms shown in Table 1. We consider instances that have a probability greater than 0.9 of being misclassified to be noisy instances. Table 8 shows the accuracies from a voting ensemble and from the considered learning algorithms using an ensemble filter for the subset of data sets with more than 10% noisy instances. Examining these noisier data sets shows that the gains from using an ensemble filter are more noticeable. However, only 9 out of the 54 investigated data sets were identified as having more than 10% noisy instances. We ran a Wilcoxon signed-ranks test, but with such a small sample size it is difficult to establish the statistical significance of using the ensemble filter over a voting ensemble. Based on the small sample provided here, training a learning algorithm on a filtered data set is statistically equivalent to training a voting ensemble classifier. The computational complexity of training a voting ensemble is less than that of training an ensemble for filtering followed by training another learning algorithm on the filtered data set. A single learning algorithm trained on the filtered data set has the benefit that only one learning algorithm is queried for a novel instance. Future work will include discovering whether a smaller subset of learning algorithms for filtering approximates the ensemble filter in order to reduce the computational complexity.
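A minimal sketch of this noise-identification step is shown below, using the indicator approximation (the fraction of learning algorithms that misclassify an instance) rather than class-probability outputs, and cross-validated predictions from scikit-learn stand-ins; both choices are our assumptions, not the exact procedure from [38].

```python
import numpy as np
from sklearn.model_selection import cross_val_predict

def instance_hardness(learners, X, y, cv=10):
    """Fraction of the learning algorithms that misclassify each instance."""
    wrong = np.zeros(len(y))
    for clf in learners:
        wrong += (cross_val_predict(clf, X, y, cv=cv) != y)
    return wrong / len(learners)

# hardness = instance_hardness(learners, X, y)
# noisy = hardness > 0.9   # threshold used above to flag suspected noisy instances
```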

Examining the noisier data sets shows that filtering has a more significant effect on classification accuracy; however, the amount of noise is not the only factor that needs to be considered. For example, 32.2% of the instances in the primary-tumor data set are noisy, yet only one learning algorithm achieves a greater classification accuracy than the voting ensemble. On the other hand, the classification accuracy on the ar1 and ozone data sets for all of the considered learning algorithms trained on filtered data is greater than that of a voting ensemble, despite these data sets only having 3.3% and 0.5% noisy instances, respectively. Thus, there are other, currently unknown, data set features that affect when filtering is appropriate. Future work also includes discovering and examining data set features that are indicative of when filtering should be used.

Ens FEns 50 FEns 70 FEns 90 FEns Max

All
Accuracy 84.37 83.40 82.21 73.96 83.62
p-value
greater-equal-less 42,3,9 44,2,8 48,1,5 39,2,13

Noisy
Accuracy 67.66 67.00 67.47 60.52 67.93
p-value 0.102 0.455 0.049 0.633
greater-equal-less 7,0,2 5,0,4 6,0,3 5,0,4

Accuracy < 90%
Accuracy 78.49 77.19 75.74 66.02 77.44
p-value
greater-equal-less 31,0,6 30,1,6 34,0,3 28,0,9

Accuracy < 80%
Accuracy 74.70 73.08 71.41 61.49 73.41
p-value 0.002
greater-equal-less 24,0,3 22,0,5 24,0,3 21,0,6

Accuracy < 70%
Accuracy 64.65 61.99 60.41 51.04 62.25
p-value 0.001 0.002 0.009
greater-equal-less 10,0,1 10,0,1 10,0,1 10,0,1

Accuracy < 60%
Accuracy 58.44 55.81 53.49 42.56 56.12
p-value 0.016 0.016 0.016 0.016
greater-equal-less 6,0,0 6,0,0 6,0,0 6,0,0

Accuracy < 50%
Accuracy 50.92 48.22 48.07 38.61 49.16
p-value 0.250 0.250 0.250 0.250
greater-equal-less 2,0,0 2,0,0 2,0,0 2,0,0
Table 9: Comparison of a majority voting ensemble trained on unfiltered (Ens) and filtered data (FEns). The value after “FEns” represents the percentage of learning algorithms that have to misclassify an instance for it to be filtered from the training set, and “Max” uses the accuracy from the percentage that results in the greatest accuracy. The subset labels (e.g., “Accuracy < 90%”) group the data sets by their original accuracy averaged across the investigated learning algorithms. Training with unfiltered data is significantly better than training with filtered data for a voting ensemble.

We further investigate the robustness of the majority voting ensemble to noise by applying an ensemble filter to the training data for the voting ensemble. We find that a majority voting ensemble is significantly better without filtering. The summary results are shown in Table 9, and the full results for each data set can be found in Table 20 in B. Table 9 divides the data sets into subsets that have more than 10% noisy instances (“Noisy”) and subsets that have an original accuracy, averaged across the investigated learning algorithms, of less than 90%, 80%, 70%, 60%, and 50%. Even on harder data sets and data sets with more noisy instances, using unfiltered training data produces significantly higher classification accuracy for the voting ensemble. Thus, we find that a majority voting ensemble is more robust to noise than filtering in most cases. The strength of a voting ensemble comes from the diversity of the ensembled learning algorithms. However, the models inferred by the learning algorithms trained on the filtered training data are less diverse, since the diversity often comes from how a learning algorithm treats a noisy instance, lessening the power of the voting ensemble. As evidence, we examined a voting ensemble consisting of C4.5, random forest, and Ridor, which are three of the more similar learning algorithms according to unsupervised meta-learning (see Section 4). When trained on the filtered training data, this less diverse voting ensemble achieves a significantly lower average classification accuracy of 82.09% compared to 83.62% for the voting ensemble composed of the 9 examined learning algorithms.

6 Conclusions

In this paper, we presented an extensive empirical evaluation of misclassification filters on a set of 54 data sets and 9 diverse learning algorithms. As opposed to other work on filtering, we used a large set of data sets and learning algorithms and we did not artificially add noise to the data set. In previous work, noise was added to a data set to verify that the noise filtering method was effective and that filtering was more effective when more noise was present. However, the artificial noise may not be representative of the actual noise and the impact of filtering on an unmodified data set is not always clear.

We examined each learning algorithm individually as a filter as well as using all of the learning algorithms combined as an ensemble filter. We also presented an adaptive filtering algorithm that greedily searches the set of candidate learning algorithms for filtering for a specific data set and learning algorithm combination. We found that, without artificially adding label noise, using the same learning algorithm for filtering and for inferring a model of the data can be significantly detrimental and does not significantly increase the classification accuracy even when examining harder data sets. We also examined using a set of data set features to induce rules that indicate when to use filtering, but did not find a set of rules that significantly improved the results. Using an ensemble filter significantly improved the accuracy over not filtering and outperformed both the adaptive filtering method and using each learning algorithm individually as a filter for all of the investigated learning algorithms.

We also compared filtering with a voting ensemble and found that a voting ensemble achieves significantly higher classification accuracy than any of the other considered learning algorithms trained on filtered data. A majority voting ensemble trained on unfiltered data significantly outperforms a voting ensemble trained on filtered data. Thus, a voting ensemble exhibits robustness to noise in the training set and is preferable to filtering.

References

  • [1] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study of their impacts, Artificial Intelligence Review 22 (2004) 177–210.
  • [2] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, USA, 1993.
  • [3] D. L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man, and Cybernetics (2-3) (1972) 408–421.
  • [4] C. E. Brodley, M. A. Friedl, Identifying mislabeled training data, Journal of Artificial Intelligence Research 11 (1999) 131–167.
  • [5] C.-M. Teng, Combining noise correction with feature selection, in: Y. Kambayashi, M. K. Mohania, W. Wöß (Eds.), DaWaK, Vol. 2737 of Lecture Notes in Computer Science, Springer, 2003, pp. 340–349.
  • [6] G. H. John, Robust decision trees: Removing outliers from databases, in: Knowledge Discovery and Data Mining, 1995, pp. 174–179.
  • [7] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Transactions on Systems, Man, and Cybernetics 6 (1976) 448–452.
  • [8] D. R. Wilson, T. R. Martinez, Reduction techniques for instance-based learning algorithms, Machine Learning 38 (3) (2000) 257–286.
  • [9] C. M. Bishop, N. M. Nasrabadi, Pattern Recognition and Machine Learning, Vol. 1, Springer, New York, 2006.
  • [10] R. E. Schapire, The strength of weak learnability, Machine Learning 5 (1990) 197–227.
  • [11] Y. Freund, Boosting a weak learning algorithm by majority, in: Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990, pp. 202–216.
  • [12] R. A. Servedio, Smooth boosting and learning with malicious noise, Journal of Machine Learning Research 4 (2003) 633–648.
  • [13] R. Collobert, F. Sinz, J. Weston, L. Bottou, Trading convexity for scalability, in: Proceedings of the 23rd International Conference on Machine learning, 2006, pp. 201–208.
  • [14] N. D. Lawrence, B. Schölkopf, Estimating a kernel fisher discriminant in the presence of label noise, in: In Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 306–313.
  • [15] D. Gamberger, N. Lavrač, S. Džeroski, Noise detection and elimination in data preprocessing: Experiments in medical domains, Applied Artificial Intelligence 14 (2) (2000) 205–223.
  • [16] M. R. Smith, T. Martinez, Improving classification accuracy by identifying and removing instances that should be misclassified, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2011, pp. 2690–2697.
  • [17] S. Verbaeten, A. Van Assche, Ensemble methods for noise elimination in classification problems, in: Proceedings of the 4th international conference on Multiple classifier systems, MCS’03, Springer-Verlag, Berlin, Heidelberg, 2003, pp. 317–325.
  • [18] N. Segata, E. Blanzieri, P. Cunningham, A scalable noise reduction technique for large case-based systems, in: Proceedings of the 8th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development, 2009, pp. 328–342.
  • [19] X. Zeng, T. R. Martinez, A noise filtering method using neural networks, in: Proc. of the int. Workshop of Soft Comput. Techniques in Instrumentation, Measurement and Related Applications, 2003.
  • [20] U. Rebbapragada, C. Brodley, Class noise mitigation through instance weighting, in: Machine Learning: ECML 2007, Vol. 4701 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 708–715.
  • [21] M. R. Smith, T. Martinez, Reducing the effects of detrimental instances, in: Submission, 2013.
  • [22] X. Zeng, T. R. Martinez, An algorithm for correcting mislabeled data, Intelligent Data Analysis 5 (2001) 491–502.
  • [23] C.-M. Teng, Evaluating noise correction, in: PRICAI, 2000, pp. 188–198.
  • [24] A. Y. Ng, M. I. Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes, in: Advances in Neural Information Processing Systems 14, 2001, pp. 841–848.
  • [25] J. Lee, C. Giraud-Carrier, A metric for unsupervised metalearning, Intelligent Data Analysis 15 (6) (2011) 827–841.
  • [26] A. H. Peterson, T. R. Martinez, Estimating the potential for combining learning models, in: Proceedings of the ICML Workshop on Meta-Learning, 2005, pp. 68–75.
  • [27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The weka data mining software: an update, SIGKDD Explorations Newsletter 11 (1) (2009) 10–18.
  • [28] K. Thomson, R. J. McQueen, Machine learning applied to fourteen agricultural datasets, Tech. Rep. 96/18, The University of Waikato (September 1996).
  • [29] J. Salojärvi, K. Puolamäki, J. Simola, L. Kovanen, I. Kojo, S. Kaski, Inferring relevance from eye movements: Feature extraction, Tech. Rep. A82, Helsinki University of Technology (March 2005).
  • [30] J. Sayyad Shirabad, T. Menzies, The PROMISE Repository of Software Engineering Databases., School of Information Technology and Engineering, University of Ottawa, Canada (2005).
    URL http://promise.site.uottawa.ca/SERepository/
  • [31] G. Stiglic, P. Kokol, GEMLer: Gene expression machine learning repository, University of Maribor, Faculty of Health Sciences (2009).
    URL http://gemler.fzv.uni-mb.si/
  • [32] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
  • [33] R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45.
  • [34] L. I. Kuncheva, C. J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy., Machine Learning 51 (2) (2003) 181–207.
  • [35] J. A. Sáez, J. Luengo, F. Herrera, Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognition 46 (1) (2013) 355–364.
  • [36] T. K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 289–300.
  • [37] A. Orriols-Puig, N. Macià, E. Bernadó-Mansilla, T. K. Ho, Documentation for the data complexity library in c++, Tech. Rep. 2009001, La Salle - Universitat Ramon Llull (April 2009).
  • [38] M. R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Machine Learning (2013), in press. doi:10.1007/s10994-013-5422-z.
  • [39] T. G. Dietterich, Ensemble methods in machine learning, in: Multiple Classifier Systems, Vol. 1857 of Lecture Notes in Computer Science, Springer, 2000, pp. 1–15.
  • [40] D. W. Opitz, R. Maclin, Popular ensemble methods: An empirical study., Journal of Artificial Intelligence Research 11 (1999) 169–198.

Appendix A Statistical Significance Tables

This section provides the results from the statistical significance tests comparing no filtering against a biased filter, an ensemble filter, and the adaptive filter for the investigated learning algorithms. The results are in Tables 10-18. The p-values less than 0.05 are in bold, and “greater-equal-less” refers to the number of times that the algorithm listed in the row is greater than, equal to, or less than the algorithm listed in the column.

Orig Biased Ensemble Greedy
Accuracy 81.74 81.87 83.33 82.33
Orig p-values 1 0.771 1.000 0.953
greater-equal-less 0,54,0 25,2,27 16,2,36 18,3,33
Biased p-values 0.232 1 1 0.957
greater-equal-less 27,2,25 0,54,0 9,4,41 23,2,29
Ensemble p-values 1
greater-equal-less 36,2,16 41,4,9 0,54,0 40,1,13
Greedy p-values 0.048 0.044 1 1
greater-equal-less 33,3,18 29,2,23 13,1,40 0,54,0
Table 10: Pair-wise comparison of filtering for multilayer perceptrons trained with backpropagation.

Orig Biased Ensemble Greedy
Accuracy 80.80 80.83 81.59 80.56
Orig p-values 1 0.460 1.000 0.221
greater-equal-less 0,54,0 26,5,23 17,3,34 29,2,23
Biased p-values 0.544 1 0.999 0.271
greater-equal-less 23,5,26 0,54,0 17,5,32 29,1,24
Ensemble p-values 0.001 1
greater-equal-less 34,3,17 32,5,17 0,54,0 44,2,8
Greedy p-values 0.782 0.732 1 1
greater-equal-less 23,2,29 24,1,29 8,2,44 0,54,0
Table 11: Pair-wise comparison of filtering for decision trees.
Orig Biased Ensemble Greedy
Accuracy 79.91 79.40 80.83 79.91
Orig p-values 1 0.693 0.985 0.877
greater-equal-less 0,54,0 25,1,28 17,2,35 20,2,32
Biased p-values 0.310 1 1 0.999
greater-equal-less 28,1,25 0,54,0 5,4,45 17,1,36
Ensemble p-values 0.015 1
greater-equal-less 35,2,17 45,4,5 0,54,0 44,1,9
Greedy p-values 0.125 0.001 1 1
greater-equal-less 32,2,20 36,1,17 9,1,44 0,54,0
Table 12: Pair-wise comparison of filtering for 5-nearest neighbors.
Orig Biased Ensemble Greedy
Accuracy 72.80 70.91 73.48 73.44
Orig p-values 1 0.992 0.988
greater-equal-less 0,54,0 34,11,9 14,9,31 16,8,30
Biased p-values 0.999 1 1 1
greater-equal-less 9,11,34 0,54,0 3,12,39 9,10,35
Ensemble p-values 0.009 1 0.595
greater-equal-less 31,9,14 39,12,3 0,54,0 19,8,27
Greedy p-values 0.013 0.409 1
greater-equal-less 30,8,16 35,10,9 27,8,19 0,54,0
Table 13: Pair-wise comparison of filtering for locally weighted learning (LWL).
Orig Biased Ensemble Greedy
Accuracy 76.94 75.84 78.82 78.45
Orig p-values 1 0.001 1.000 0.985
greater-equal-less 0,54,0 38,0,16 17,4,33 24,1,29
Biased p-values 0.999 1 1 1
greater-equal-less 16,0,38 0,54,0 4,2,48 10,2,42
Ensemble p-values 1 0.012
greater-equal-less 33,4,17 48,2,4 0,54,0 32,4,18
Greedy p-values 0.016 0.988 1
greater-equal-less 29,1,24 42,2,10 18,4,32 0,54,0
Table 14: Pair-wise comparison of filtering for naïve Bayes.
Orig Biased Ensemble Greedy
Accuracy 80.62 80.30 82.18 81.32
Orig p-values 1 0.080 1.000 0.888
greater-equal-less 0,52,0 25,5,22 12,4,36 24,1,27
Biased p-values 0.921 1 1 0.992
greater-equal-less 22,5,25 0,52,0 12,2,38 18,2,32
Ensemble p-values 1
greater-equal-less 36,4,12 38,2,12 0,52,0 41,2,9
Greedy p-values 0.114 0.008 1 1
greater-equal-less 27,1,24 32,2,18 9,2,41 0,52,0
Table 15: Pair-wise comparison of filtering for NNge.
Orig Biased Ensemble Greedy
Accuracy 82.28 82.21 82.92 81.85
Orig p-values 1 0.408 0.981 0.022
greater-equal-less 0,54,0 28,2,24 23,1,30 35,2,17
Biased p-values 0.595 1 0.992 0.084
greater-equal-less 24,2,28 0,54,0 22,4,28 31,2,21
Ensemble p-values 0.020 0.009 1
greater-equal-less 30,1,23 28,4,22 0,54,0 46,1,7
Greedy p-values 0.979 0.918 1 1
greater-equal-less 17,2,35 21,2,31 7,1,46 0,54,0
Table 16: Pair-wise comparison of filtering for random forests.
Orig Biased Ensemble Greedy
Accuracy 79.90 79.16 80.56 79.96
Orig p-values 1 0.016 1.000 0.895
greater-equal-less 0,54,0 33,2,19 15,1,38 20,3,31
Biased p-values 0.985 1 1 0.998
greater-equal-less 19,2,33 0,54,0 7,3,44 17,2,35
Ensemble p-values 1
greater-equal-less 38,1,15 44,3,7 0,54,0 36,3,15
Greedy p-values 0.107 0.002 1.000 1
greater-equal-less 31,3,20 35,2,17 15,3,36 0,54,0
Table 17: Pair-wise comparison of filtering for Ridor.
Orig Biased Ensemble Greedy
Accuracy 80.34 79.98 81.25 80.41
Orig p-values 1 0.040 1 0.704
greater-equal-less 0,53,0 30,2,21 11,2,40 21,6,26
Biased p-values 0.961 1 1 0.989
greater-equal-less 21,2,30 0,53,0 8,1,44 19,4,30
Ensemble p-values 1
greater-equal-less 40,2,11 44,1,8 0,53,0 38,4,11
Greedy p-values 0.300 0.011 1 1
greater-equal-less 26,6,21 30,4,19 11,4,38 0,53,0
Table 18: Pair-wise comparison of filtering for RIPPER.

Appendix B Ensemble Results for Each Data Set

This section provides the results for each data set comparing a voting ensemble with filtering using an ensemble filter for each investigated learning algorithm, as well as filtering using an ensemble filter for a voting ensemble. The results comparing a voting ensemble with filtering for each investigated non-ensembled learning algorithm are shown in Table 19. The bold values represent the highest classification accuracy, and the rows highlighted in gray are the data sets where filtering with an ensemble filter increased the accuracy over the voting ensemble for all learning algorithms. The results comparing a voting ensemble with a filtered voting ensemble are shown in Table 20. Bold values in the “Ens” column indicate that the voting ensemble trained on unfiltered data achieves the higher accuracy, while bold values in the “FEns” columns indicate that the voting ensemble trained on filtered data achieves higher accuracy than the voting ensemble trained on unfiltered data.

Data set Ens MLP C4.5 IB5 LWL NB NNge RF Rid RIP Per
anneal 98.08 98.29 91.72 92.91 92.72 83.93 92.87 94.8 96.59 94.84 0.33
AP-BU 97.61 96.87 94.87 96.87 93.3 96.72 96.72 98.01 93.59 94.73 0.85
ar1 90.08 92.29 92.56 92.56 92.29 92.29 92.29 92.29 92.56 92.56 3.31
arrhyth 71.11 70.13 70.65 59.14 57.67 65.63 65.71 66.59 70.65 71.09 11.95
audiolo 78.94 78.61 76.99 62.54 47.05 73.01 72.42 73.6 71.24 73.89 7.08
autos 83.51 78.54 79.84 64.72 51.71 56.1 74.8 82.6 69.59 76.1 4.88
badges2 100 100 100 100 100 99.66 100 99.89 100 100 0.00
balance 88.45 90.35 78.67 89.65 60.59 89.97 82.56 82.77 79.68 79.09 4.16
breastc 73.99 73.43 75.17 74.13 73.31 73.43 74.13 73.19 74.71 74.59 10.14
breastw 96.88 97.00 95.14 96.76 92.61 95.95 95.99 96.57 95.61 95.8 1.72
bupa 71.3 71.5 66.47 62.71 60.29 59.03 65.31 69.28 67.44 68.02 2.61
carEval 96.7 98.82 92.09 92.77 70.02 85.22 94.21 92.46 95.72 87.15 0.00
chess 99.53 99.41 99.44 96.17 72.15 87.85 98.56 98.77 98.72 99.21 0.03
cm1_req 75.73 76.4 77.53 77.53 76.78 77.53 77.15 76.78 77.53 76.78 16.85
colic 85.33 86.41 85.78 82.97 81.52 83.33 84.42 85.69 84.42 85.96 4.35
contact 76.67 83.33 83.33 76.39 76.39 76.39 80.56 80.56 79.17 77.78 12.50
credita 86.64 85.7 85.99 86.62 85.51 81.64 85.6 86.04 85.85 86.09 4.35
creditg 75.64 75.07 73.17 73.37 70.03 74.8 73.33 74.2 71.97 72.6 5.20
derma 97.43 97.09 93.99 96.08 87.61 97.36 95.36 95.99 94.35 88.8 0.00
desh 74.32 70.78 69.55 65.84 71.6 62.14 67.49 74.49 71.6 71.6 7.41
ecoli 87.44 86.21 84.72 87.2 65.87 86.9 85.22 85.71 83.73 82.74 4.76
eucalyp 65.11 63.32 62.86 55.8 51.04 57.52 56.84 56.88 61.73 63.32 6.66
eye-mov 64.76 54.34 63.72 54.93 42.88 44.11 48.13 62.8 54.11 56.04 1.77
glass 74.02 66.04 68.07 66.51 52.18 53.58 71.5 74.92 68.85 68.85 5.14
heart-c 83.83 83.39 77.56 83.17 75.69 83.94 79.98 81.63 79.65 80.97 3.30
heart-h 82.93 83.22 81.63 84.69 80.16 84.47 81.07 81.75 82.54 81.63 5.44
heart-s 82.44 83.09 81.23 81.73 74.44 84.07 78.89 83.09 79.01 79.51 3.33
hepatit 83.1 84.3 81.51 84.95 79.35 85.59 83.23 84.09 79.14 80.22 3.87
hypo 99.43 94.32 99.58 93.3 95.39 95.51 98.75 99.08 99.33 99.45 0.05
iono 92.99 89.84 90.98 84.43 83 83.57 90.79 92.78 89.84 90.6 1.71
iris 95.33 96.00 94.67 95.33 94 95.56 95.33 94.22 93.56 92.67 0.67
labor 92.98 87.72 78.36 89.47 83.63 92.4 87.72 85.96 78.36 82.46 0.00
lungCan 53.75 52.08 56.25 47.92 55.21 55.21 56.25 52.08 51.04 54.17 12.50
lympho 83.24 83.56 77.48 83.33 75.68 81.98 77.48 80.63 78.38 78.15 2.03
MagicTe 86.27 86.09 85.58 83.98 76.27 76.08 82.88 86.43 84.77 85.29 2.90
nursery 98.91 98.78 97 98.1 88.96 90.21 96.97 97.97 95.67 96.73 0.02
ozone 97.01 97.12 97.12 97.12 97.12 97.12 97.12 97.13 97.12 97.12 0.51
pasture 86.11 77.78 78.7 68.52 87.04 75 80.56 76.85 74.07 68.52 2.78
pimaDia 77.06 76.56 76.61 75.17 73.26 75.78 75.22 76.22 75.3 75.26 6.77
post-op 69.78 71.11 71.11 71.11 71.11 71.11 71.11 71.11 71.11 71.11 26.67
pri-tum 48.08 47.79 41.2 45.82 34.42 48.57 44.44 45.23 39.82 40.41 32.15
segment 98.00 96.05 96.62 95.04 78.59 80.69 96.36 97.37 95.83 94.82 0.17
sick 98.45 96.93 98.51 96.3 96.55 94.82 96.86 98.16 98.1 98.03 0.13
sonar 81.92 81.89 72.92 82.53 74.84 68.27 71.63 79.49 73.24 79.01 0.00
soybean 94.32 94.05 91.7 90.14 56.95 92.83 93.02 92.53 90.41 91.85 1.46
spambas 94.95 91.79 92.8 90.33 78.28 82.24 92.24 94.73 92.07 92.71 0.54
T.A. 57.88 55.19 51.66 45.25 50.99 49.89 52.98 53.42 43.49 47.9 8.61
titanic 78.72 78.66 78.68 78.59 77.9 77.77 78.68 78.68 78.28 78.68 16.86
vote 95.82 95.86 95.71 92.8 95.63 90.96 95.4 96.4 94.18 95.63 1.61
vowel 95.54 92.83 75.81 93.13 35.05 63.54 87.12 94.48 75.93 71.57 0.00
wave 84.21 85.11 77.92 79.69 56.93 79.91 82.44 81.7 80.11 79.75 1.34
wine 97.53 97.75 93.26 95.88 90.26 97.57 96.25 97.57 91.01 92.51 0.00
yeast 61.08 59.43 60.13 59.32 40.7 58.15 59.4 61.08 59.74 60.01 13.68
zoo 95.25 95.38 92.41 94.72 85.48 94.72 94.72 91.75 90.43 86.8 1.98
Acc 84.37 83.40 81.61 80.85 73.48 78.92 81.59 82.94 80.57 80.76
Table 19: Comparison of the accuracy for each data set of a voting ensemble (Ens) with each investigated learning algorithm trained on data filtered by an ensemble filter. The column “Per” gives the percentage of instances that are misclassified by greater than or equal to 90% of the learning algorithms. The rows in gray represent the data sets where filtering with an ensemble filter increased the accuracy over the voting ensemble for all learning algorithms.
Data set Ens FEns 50 FEns 70 FEns 90 FEns Max
anneal.ORIG 98.08 97.57 96.26 86.15 97.57
AP-Breast-Uterus 97.61 97.69 97.44 96.15 97.69
ar1 90.08 90.08 90.41 92.40 92.40
arrhythmia 71.11 70.40 69.69 55.58 70.40
audiology 78.94 77.52 72.30 48.58 77.52
autos 83.51 82.15 73.56 49.66 82.15
badges2 100.00 100.00 100.00 100.00 100.00
balance-scale 88.45 86.46 85.28 71.42 86.46
breast-cancer 73.99 73.71 74.20 74.34 74.34
breast-w 96.88 96.71 96.62 93.45 96.71
bupa 71.30 70.84 68.35 60.52 70.84
carEval 96.70 95.51 91.81 70.02 95.51
chess-KRVKP 99.53 99.42 99.26 83.94 99.42
cm1-req 75.73 75.06 77.53 77.53 77.53
colic 85.33 85.54 85.98 81.52 85.98
contact-lenses 76.67 77.50 80.00 70.83 80.00
credit-a 86.64 86.26 86.03 85.51 86.26
credit-g 75.64 74.52 72.84 70.00 74.52
dermatology 97.43 97.43 97.27 91.58 97.43
desharnais 74.32 73.09 72.84 69.38 73.09
ecoli 87.44 87.74 86.90 64.88 87.74
eucalyptus 65.11 63.97 61.82 52.83 63.97
eye-movements 64.76 59.02 55.26 45.21 59.02
glass 74.02 62.52 61.59 49.53 62.52
heart-c 83.83 82.38 82.18 80.20 82.38
heart-h 82.93 82.86 83.40 82.04 83.40
heart-statlog 82.44 82.37 81.26 78.59 82.37
hepatitis 83.10 83.35 83.10 80.26 83.35
hypothyroid 99.43 99.37 98.17 94.04 99.37
ionosphere 92.99 92.82 91.34 84.67 92.82
iris 95.33 94.53 94.13 94.13 94.53
labor 92.98 91.58 88.07 82.11 91.58
lungCancer 53.75 51.25 53.13 38.75 53.13
lymphography 83.24 81.35 80.95 76.35 81.35
MagicTelescope 86.27 85.49 84.73 74.91 85.49
nursery 98.91 98.55 97.24 90.43 98.55
ozone 97.01 97.07 97.09 97.12 97.12
pasture 86.11 81.11 78.89 66.67 81.11
pimaDiabetes 77.06 76.46 75.81 73.91 76.46
post-opPatient 69.78 70.22 71.11 71.11 71.11
primary-tumor 48.08 45.19 43.01 38.47 45.19
segment 98.00 97.62 96.54 85.65 97.62
sick 98.45 98.32 98.17 97.16 98.32
sonar 81.92 81.44 80.10 73.75 81.44
soybean 94.32 93.85 93.12 67.88 93.85
spambase 94.95 94.78 94.17 84.43 94.78
teachingAssistant 57.88 54.44 47.15 39.60 54.44
titanic 78.72 78.65 78.00 77.60 78.65
vote 95.82 95.72 95.68 95.40 95.72
vowel 95.54 94.75 86.44 40.14 94.75
waveform-5000 84.21 84.26 81.77 63.42 84.26
wine 97.53 97.64 96.97 96.18 97.64
yeast 61.08 61.01 60.55 40.50 61.01
zoo 95.25 94.65 93.66 87.13 94.65
Ave 84.37 83.40 82.21 73.96 83.62
Table 20: Comparison of the accuracy of a majority voting ensemble trained on unfiltered data (Ens) and on filtered data (FEns). The value after “FEns” is the percentage of learning algorithms that have to misclassify an instance for it to be filtered from the training set; “Max” reports, for each data set, the accuracy from the percentage that results in the greatest accuracy. Training with unfiltered data is significantly better than training with filtered data. Bold values in the “Ens” column indicate that the majority voting ensemble trained on unfiltered data achieves the higher accuracy; bold values in the “FEns” columns indicate that using filtered training data results in greater classification accuracy.
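
As a final illustration, the hypothetical sketch below mirrors the setup behind the “FEns” columns: within each cross-validation fold only the training instances are filtered, and the majority voting ensemble is scored on the untouched test instances. The 10-fold protocol, the hard-voting scikit-learn ensemble, and the stand-in learners are assumptions rather than the exact experimental configuration.

    # Hypothetical sketch of how one FEns column in Table 20 could be produced:
    # filter only the training folds (by misclassification percentage, as in the
    # ensemble filter sketched earlier in this appendix) and score a hard-voting
    # ensemble on the untouched test folds.
    import numpy as np
    from sklearn.base import clone
    from sklearn.ensemble import VotingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def filtered_voting_accuracy(X, y, learners, threshold=0.5, n_splits=10):
        """learners is a list of (name, estimator) pairs; returns the mean
        accuracy of the voting ensemble trained on filtered training folds."""
        scores = []
        for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
            Xtr, ytr = X[train], y[train]
            # Fraction of filtering algorithms that misclassify each training instance.
            miss = np.zeros(len(ytr))
            for _, est in learners:
                miss += (cross_val_predict(clone(est), Xtr, ytr, cv=n_splits) != ytr)
            keep = miss / len(learners) < threshold
            vote = VotingClassifier(estimators=learners, voting='hard')
            vote.fit(Xtr[keep], ytr[keep])
            scores.append(vote.score(X[test], y[test]))
        return float(np.mean(scores))

    # Example stand-in ensemble (the paper uses nine Weka learners).
    learners = [('tree', DecisionTreeClassifier()),
                ('nb', GaussianNB()),
                ('knn', KNeighborsClassifier(n_neighbors=5))]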