The goal of supervised machine learning is to infer an accurate generalizing function from a set of input feature vectors and a corresponding set of label vectors. The quality of the function inferred by a learning algorithm depends on the quality of the data used for training. Many real-world data sets are noisy, where the noise in a data set can be label noise and/or attribute noise. The focus of this paper is on label noise. Noise arises from various sources, such as subjectivity, human error, and sensor malfunction. As such, it is important to take the possibility of label noise into account when inferring a model of the data. Much previous work has examined the effects of class noise and how to handle it. As many real-world data sets are inherently noisy, most learning algorithms are designed to tolerate a certain degree of noise by avoiding overfitting the training data. There are two general approaches for handling class noise: 1) creating learning algorithms that are robust to noise, such as the C4.5 algorithm for decision trees, and 2) preprocessing the data prior to inferring a model of the data, such as filtering [3, 4] or correcting noisy instances. In this work, we specifically examine handling noise by filtering.
Previous work has generally examined filtering in a limited context, using a single or very few learning algorithms and/or a limited number of data sets. This may be due in part to the extra computational requirement of first filtering a data set and then inferring a model of the data from the filtered data set. As such, previous work has generally limited itself to investigating relatively fast learning algorithms such as decision trees and nearest-neighbor algorithms [7, 8]. In addition, filtering for instance-based learning algorithms was motivated in part by the desire to reduce the number of instances that have to be stored and because instance-based learning algorithms are more sensitive to noise than other learning algorithms. Also, most previous work artificially added noise to the data sets to show that filtering, weighting, or cleaning the data set is beneficial. In this work, we examine filtering misclassified instances over a set of 54 data sets and 9 learning algorithms without adding artificial noise. Artificial noise was added in previous work to show that filtering, weighting, or cleaning provided significant improvements on noisy data sets. Within the context of the benefits of filtering established by that work, we show how filtering affects data sets without adding artificial noise. This also avoids making assumptions about the noise-generating process, which may or may not be accurate. We also compare filtering with a voting ensemble built from a diverse set of base classifiers.
The insights provided shed light on which learning algorithms are effective as filters and which are the most robust to noise. Using a larger number of data sets allows for more statistical confidence in the results than if only a small number of data sets were used. We find that using an ensemble filter achieves significantly higher classification accuracy than using a single learning algorithm as a filter. We also find that, in general, a voting ensemble is robust to noise and, trained on unfiltered data, achieves significantly higher classification accuracy than a single learning algorithm trained on filtered data. On data sets with higher percentages of inherent noisy instances, however, using an ensemble filter achieves higher classification accuracy than a voting ensemble for some learning algorithms. Surprisingly, training a voting ensemble on filtered training data significantly decreases classification accuracy compared to training it on unfiltered training data.
In the next section, we present previous work for handling noise in supervised classification problems. A mathematical motivation for filtering misclassified instances is presented in Section 3. We then present our experimental methodology in Section 4 followed by a presentation of the results in Section 5. In Section 6 we provide conclusions and directions for future work.
2 Related Work
As many real-world data sets are inherently noisy, most learning algorithms are designed to tolerate a certain degree of noise. Typically, learning algorithms achieve this robustness by trading off the complexity of the inferred model against optimizing the inferred function on the training data, in order to prevent overfitting. Techniques to avoid overfitting include early stopping using a validation set, pruning (as in the C4.5 algorithm for decision trees), and regularization by adding a complexity penalty to the loss function. Further, some learning algorithms have been adapted specifically to better handle label noise. For example, noisy instances are problematic for boosting algorithms [10, 11], where more weight is placed upon misclassified instances, which often include mislabeled and noisy instances. To address this, Servedio presented a boosting algorithm that does not place too much weight on any single training instance. For support vector machines, Collobert et al. use the ramp-loss function to place a bound on the maximum penalty for an instance that lies on the wrong side of the margin. Lawrence and Schölkopf explicitly model the possibility that an instance is mislabeled using a generative probabilistic model (discussed further in Section 3).
Preprocessing the data set is another approach that explicitly handles label noise. This can be done by removing noisy instances, weighting the instances, or correcting incorrect labels. All three approaches first attempt to identify which instances are noisy using various criteria. Filtering noisy instances has received much attention and has generally resulted in an increase in classification accuracy [15, 16]. One frequently used filtering technique removes any instance that is misclassified by a learning algorithm or set of learning algorithms. Verbaeten and Van Assche further pursued the idea of using an ensemble for filtering, drawing on ideas from boosting and bagging. Other approaches use learning algorithm heuristics to remove noisy instances. Segata et al. remove instances that are too close to, or on the wrong side of, the decision surface generated by a support vector machine. Zeng and Martinez remove instances that have a low probability of being labeled correctly while training a neural network, where the probability is calculated using the output from the network. Filtering has the potential downside of discarding useful instances. However, it is assumed that there are significantly more non-noisy instances and that throwing away a few correct instances along with the noisy ones will not have a negative impact on a large data set.
Weighting the instances in a training set has the benefit of not discarding any instances. Rebbapragada and Brodley weight the instances using expectation maximization to cluster instances that belong to a pair of classes. The probabilities between classes for each instance are compiled and used to weight the influence of each instance. Smith and Martinez examine weighting the instances based on their probability of being misclassified.
Similar to weighting the training instances, data cleaning does not discard any instances, but rather strives to correct the noise in them. As in filtering, the output from a learning algorithm has been used to clean the data. Automatic data enhancement uses the output from a neural network to correct the label of training instances that have a low probability of being correctly labeled. Polishing [23, 5] trains a learning algorithm (in this case a decision tree) to predict the value of each attribute (including the class). The predicted (i.e., corrected) attribute values are used in place of the uncleaned values for those instances where doing so increases generalization accuracy on a validation set.
We differ from the related work in that we do not add artificial noise to the data sets when we examine filtering. Thus, we avoid making any assumptions about the noise source and focus on the noise inherent in the data sets. We also examine the effects of filtering on a larger set of learning algorithms and data sets, lending greater generality to the results.
3 Modeling Class Noise in a Discriminative Model
Lawrence and Schölkopf proposed to model a data set probabilistically using a generative model that models the noise process. They assume that the joint distribution $p(x, \hat{y}, y)$ (where $x$ is the set of input features, $\hat{y}$ is the possibly noisy class label given in the training set, and $y$ is the actual unknown class label) is factorized as shown in Figure 1a. However, since modeling the prior distribution of the unobserved random variable $y$ is not feasible, it is more practical to estimate the prior distribution of $\hat{y}$ with some assumptions about the class noise, as shown in Figure 1b.
Here, we follow the premise of Lawrence and Schölkopf by explicitly modeling the possibility that an instance is misclassified. Rather than using a generative model, though, we use a discriminative model, since we focus on classification tasks and do not require the full joint distribution. Also, discriminative models have been shown to yield better performance on classification tasks.
Let $T$ be a training set composed of instances $\langle x_i, \hat{y}_i \rangle$ drawn i.i.d. from the underlying data distribution $\mathcal{D}$, where each instance is composed of an input vector $x_i$ with a corresponding possibly noisy label vector $\hat{y}_i$. Given the training data $T$, a learning algorithm generally seeks the most probable hypothesis $h$ that maps each $x_i$ to $\hat{y}_i$. For supervised classification problems, most learning algorithms maximize $p(\hat{y}_i \mid x_i, h)$ for all instances in $T$. This is shown graphically in Figure 2a, where the probabilities are estimated using a discriminative approach, such as a neural network or a decision tree, to infer a hypothesis of the data. Using Bayes’ rule and decomposing $T$ into its individual constituent instances, the maximum a posteriori hypothesis is:
In Equation 1, the MAP hypothesis is found by finding a global optimum where all instances are included in the optimization problem. However, noisy instances are often detrimental to finding the global optimum since they are not representative of the true (and unknown) underlying data distribution $\mathcal{D}$. The possibility of label noise is not explicitly modeled in this form, which completely ignores the actual class label $y_i$. Thus, label noise is generally handled by avoiding overfitting, such that more probable, simpler hypotheses (those with a higher prior $p(h)$) are preferred. The possibility of label noise can be modeled explicitly by including the latent random variable $y_i$ alongside the observed $x_i$ and $\hat{y}_i$. Thus, an instance is the triplet $\langle x_i, \hat{y}_i, y_i \rangle$, and a supervised learning algorithm seeks to maximize $p(\hat{y}_i, y_i \mid x_i, h)$, modeled graphically in Figure 2b. Using the model in Figure 2b, the MAP hypothesis becomes:
Equation 2 shows that, for an instance $\langle x_i, \hat{y}_i, y_i \rangle$, the probability of the observed class label ($p(\hat{y}_i \mid y_i)$) should be weighted by the probability of the actual class ($p(y_i \mid x_i, h)$). We now show a method to estimate this weighting probability.
For filtering as a preprocessing step, we want to calculate $p(y_i \mid x_i)$ and remove instances whose observed label has a low probability of being the actual label. Using a discriminative model trained on $T$, we can calculate this quantity as follows. Since $p(y_i \mid x_i)$ is unknown, it can be approximated as $p(\hat{y}_i \mid x_i)$, assuming that the class noise is represented in $T$. In other words, the inferred discriminative model is able to capture whether one class label is more likely than another given an observed noisy label. Otherwise, all class labels are assumed to be equally likely given an observed label. Thus, $p(y_i \mid x_i)$ can be approximated by finding the class distribution for a given $x_i$ from an inferred discriminative model. That is, after training a learning algorithm on $T$, the class distribution for an instance can be calculated based on the output from the learning algorithm. As shown in Equation 1, $p(\hat{y}_i \mid x_i, h)$ is found naturally through a derivation of Bayes’ law; it is the likelihood of an instance given a hypothesis, which a learning algorithm tries to maximize for each instance. Further, the dependence on a particular hypothesis can be removed by summing over all possible hypotheses in the hypothesis space $\mathcal{H}$ and multiplying each term by $p(h)$:

$p(\hat{y}_i \mid x_i) = \sum_{h \in \mathcal{H}} p(\hat{y}_i \mid x_i, h)\, p(h)$ (3)
This formulation is infeasible, though, because 1) it is not practical (or possible) to sum over the set of all hypotheses, 2) calculating $p(h)$ is non-trivial, and 3) not all learning algorithms produce a probability distribution. These limitations make probabilistic generative models, such as the kernel Fisher discriminant algorithm, attractive. However, for classification tasks, generative models generally have a higher asymptotic error than discriminative models.
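Although summing over all hypotheses is intractable, the idea can be approximated with a small committee of trained models, each given a uniform weight as a crude stand-in for $p(h)$. The following sketch uses hypothetical per-model class distributions for illustration:

```python
# Approximate p(y_hat | x) = sum_h p(y_hat | x, h) p(h) with a finite
# committee of trained models, each weighted uniformly by 1/|committee|.
# The per-model class distributions below are hypothetical.

def committee_label_probability(label, per_model_distributions):
    """Average probability the committee assigns to the observed label."""
    return (sum(dist[label] for dist in per_model_distributions)
            / len(per_model_distributions))

# Class distributions for one instance from three hypothetical models:
dists = [
    {0: 0.9, 1: 0.1},
    {0: 0.8, 1: 0.2},
    {0: 0.4, 1: 0.6},
]
p_label = committee_label_probability(0, dists)  # (0.9 + 0.8 + 0.4) / 3 = 0.7
```

An instance whose observed label receives a low committee probability would be a candidate for filtering.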
The first step for filtering is to determine $p(\hat{y}_i \mid x_i)$ for each instance. Given that a number of different techniques could be employed to estimate this quantity, we conduct an extensive evaluation of filtering misclassified instances using a diverse set of learning algorithms. Diversity here means that the learning algorithms do not produce the same classification for all of the instances, and it is determined using unsupervised meta-learning (UML). UML first uses Classifier Output Difference (COD) to measure the diversity between learning algorithms. COD measures the distance between two learning algorithms as the probability that they make different predictions. UML then clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. We considered 20 learning algorithms from Weka with their default parameters. The resulting dendrogram is shown in Figure 3, where the height of the line connecting two clusters corresponds to the distance (COD value) between them. A cut-point of 0.18 was chosen to create 9 clusters, and a representative algorithm from each cluster was used to create a diverse set of learning algorithms. The learning algorithms that were used are listed in Table 1.
Table 1: The diverse set of learning algorithms used in this study.
Multilayer Perceptron trained with Back Propagation (MLP)
Decision Tree (C4.5)
Locally Weighted Learning (LWL)
5-Nearest Neighbors (5-NN)
Nearest Neighbor with generalization (NNge)
Naïve Bayes (NB)
RIpple DOwn Rule learner (RIDOR)
Random Forest (RandForest)
Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
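To illustrate, Classifier Output Difference can be computed directly from the predictions two trained algorithms make on a common set of instances; the prediction vectors below are hypothetical, for illustration only:

```python
# Classifier Output Difference (COD): the probability that two learning
# algorithms make different predictions on the same instances.
# The prediction vectors here are hypothetical.

def cod(preds_a, preds_b):
    """Fraction of instances on which the two algorithms disagree."""
    assert len(preds_a) == len(preds_b)
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return disagreements / len(preds_a)

predictions = {
    "C4.5":       [0, 1, 1, 0, 1, 0],
    "RandForest": [0, 1, 1, 0, 0, 0],
    "NB":         [1, 1, 0, 0, 1, 1],
}

# Pairwise COD distances; similar algorithms (small COD) are merged early
# by hierarchical agglomerative clustering when building the dendrogram.
distances = {(a, b): cod(predictions[a], predictions[b])
             for a in predictions for b in predictions if a < b}
```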
We investigate filtering using the learning algorithms shown in Table 1. Since not all learning algorithms produce a probability distribution, the indicator function $\mathbb{1}(h(x_i) = \hat{y}_i)$ is used in this paper instead of $p(\hat{y}_i \mid x_i, h)$, thus removing misclassified instances. Each learning algorithm first filters misclassified instances and then infers a model of the data using the filtered data set. We also examine using an ensemble filter, which removes instances that are misclassified by different percentages of the 9 learning algorithms. The ensemble filter more closely approximates Equation 3, since it sums over a set of learning algorithms (which in this case were chosen to be diverse and to represent a larger subset of the hypothesis space $\mathcal{H}$), lessening the dependence on a single hypothesis $h$. For the ensemble filter, $p(\hat{y}_i \mid x_i)$ is estimated using a set $G$ of learning algorithms:

$p(\hat{y}_i \mid x_i) \approx \frac{1}{|G|} \sum_{g \in G} \mathbb{1}(h_g(x_i) = \hat{y}_i)$

where $h_g$ is the hypothesis from the learning algorithm $g$ trained on training set $T$. From Equation 3, $p(h)$ is estimated as $\frac{1}{|G|}$ for the hypotheses generated by training the learning algorithms in $G$ on $T$, and as zero for all other hypotheses in $\mathcal{H}$. Also, $p(\hat{y}_i \mid x_i, h)$ is estimated using the indicator function, since not all learning algorithms produce a probability distribution over the output classes. Set up as such, the ensemble filter counts how many times an instance is misclassified by a set of learning algorithms. Brodley and Friedl examined an ensemble of three learning algorithms on five data sets with artificially generated noise inserted into the data sets. In this paper, we examine an ensemble filter that removes instances misclassified by 50, 70, and 90 percent of the learning algorithms in the ensemble. One of the difficulties of using an ensemble filter is choosing the percentage of learning algorithms that must misclassify an instance for it to be filtered. For the results, we report the accuracy from the percentage that produces the highest accuracy, using 5 by 10-fold cross-validation to choose the best percentage for each data set. This method highlights the impact of using an ensemble filter; in practice, however, a validation set would be used to determine the percentage.
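A minimal sketch of the ensemble filter described above, using hypothetical labels and predictions: an instance is removed when at least the chosen fraction of the filtering algorithms misclassify it.

```python
# Ensemble misclassification filter: remove an instance when at least
# `threshold` (e.g. 0.5, 0.7, or 0.9) of the filtering algorithms misclassify
# it. `predictions[j][i]` is filtering algorithm j's prediction for instance i.
# Labels and predictions here are hypothetical.

def ensemble_filter(labels, predictions, threshold):
    """Return the indices of the instances kept after filtering."""
    n_algorithms = len(predictions)
    kept = []
    for i, label in enumerate(labels):
        misclassified = sum(1 for preds in predictions if preds[i] != label)
        if misclassified / n_algorithms < threshold:
            kept.append(i)
    return kept

labels = [0, 1, 0, 1]
predictions = [
    [0, 1, 1, 1],   # filter 1
    [0, 1, 1, 1],   # filter 2
    [0, 0, 1, 1],   # filter 3
]
kept = ensemble_filter(labels, predictions, 0.7)  # instance 2 is removed
```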
In addition, we examine an adaptive filtering approach (shown in Algorithm 1) that iteratively grows a set of filtering learning algorithms: at each step, it selects from a set of candidate learning algorithms the one that, when added to the filter set, produces the highest classification accuracy on a validation set. The evaluation function trains a learning algorithm on a data set, using the filter set to filter the instances, and returns the accuracy of the learning algorithm on a validation set. As with the ensemble filter, instances are removed when they are misclassified by a given percentage of the filtering learning algorithms. The idea is to choose a near-optimal subset of learning algorithms through a greedy search of the candidate filtering algorithms. For the results, we report the accuracy from the percentage that produces the highest accuracy, using 5 by 10-fold cross-validation to choose the best percentage for each data set.
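The greedy selection in Algorithm 1 can be sketched as follows; `evaluate` is a hypothetical stand-in for training the learner on data filtered by the current filter set and scoring it on a validation set:

```python
# Greedy adaptive filter selection (a sketch of Algorithm 1): repeatedly add
# the candidate filtering algorithm that most improves validation accuracy,
# stopping when no candidate improves on the current best.
# `evaluate(filter_set)` is assumed to return validation accuracy after
# training with the given filter set; it is hypothetical here.

def adaptive_filter_selection(candidates, evaluate):
    chosen = []
    best_accuracy = evaluate(chosen)
    improved = True
    while improved and candidates:
        improved = False
        scores = {c: evaluate(chosen + [c]) for c in candidates}
        best_candidate = max(scores, key=scores.get)
        if scores[best_candidate] > best_accuracy:
            chosen.append(best_candidate)
            candidates = [c for c in candidates if c != best_candidate]
            best_accuracy = scores[best_candidate]
            improved = True
    return chosen, best_accuracy
```

For example, with hypothetical validation accuracies {∅: 0.70, {A}: 0.75, {B}: 0.72, {A, B}: 0.74}, the search adds A and then stops, since also adding B would lower accuracy.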
Each method for filtering is evaluated using 5 by 10-fold cross-validation (running 10-fold cross-validation 5 times, each time with a different seed to partition the data). We examine filtering using the 9 chosen learning algorithms on a set of 47 data sets from the UCI data repository and 7 non-UCI data sets [28, 29, 30, 31]. For filtering, we examine two methods for training the filtering algorithms: 1) removing the instances that are misclassified when the filtering algorithms are trained on the entire training set, and 2) using cross-validation on the training set and removing the instances that are misclassified in the held-out folds. The number of folds for cross-validation on the training set was set to 2, 3, 4, and 5. Table 2 shows the data sets used in this study organized according to the number of instances, number of attributes, and attribute type. The non-UCI data sets are in bold.
Statistical significance between pairs of algorithms is determined using the Wilcoxon signed-ranks test, as suggested by Demšar. We emphasize the extensive nature of this evaluation:
Filtering is examined for 9 diverse learning algorithms.
The same 9 diverse learning algorithms are examined as misclassification filters.
In addition to the single algorithm misclassification filters, an ensemble filter and an adaptive filter are examined.
Each filtering method is examined on a set of 54 data sets using 5 by 10-fold cross-validation.
Each filtering method is examined on the entire training set as well as using 2-, 3-, 4-, and 5-fold cross-validation.
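The 5 by 10-fold cross-validation protocol above can be sketched as follows; only index generation is shown, and the fold-assignment scheme is illustrative:

```python
# 5x10-fold cross-validation: 10-fold CV repeated 5 times, each repetition
# partitioning the data with a different random seed.
import random

def repeated_kfold_indices(n_instances, n_folds=10, n_repeats=5):
    """Yield (repeat, fold, test_indices) for every fold of every repetition."""
    for repeat in range(n_repeats):
        rng = random.Random(repeat)       # a different seed per repetition
        order = list(range(n_instances))
        rng.shuffle(order)
        for fold in range(n_folds):
            yield repeat, fold, order[fold::n_folds]

folds = list(repeated_kfold_indices(20))  # 5 repetitions x 10 folds = 50 splits
```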
5 Results
In this section, we present the results of filtering the 54 data sets using a biased filter (where the same learning algorithm used to filter misclassified instances is also used to infer a model of the data), an ensemble filter, and the adaptive filter. Except for the adaptive filter, we find that using cross-validation on the training set for filtering resulted in lower (and often significantly lower) accuracy than using the entire training set; as such, the following results for the biased filter and the ensemble filter are from using the entire training set for filtering rather than cross-validation. We first show how filtering affects each learning algorithm in Section 5.1. Next, in Section 5.2, we examine using a set of data set measures to determine when filtering is the most effective. Our results suggest that using an ensemble filter produces the best results in all cases. In Section 5.3, we then compare filtering with a voting ensemble and show that a voting ensemble is preferable to filtering.
5.1 Filtering Results
The filtering results are summarized in Table 3, showing the average classification accuracy for each learning algorithm and filtering algorithm pair. (The NNge learning algorithm did not finish running on two data sets: eye-movements and Magic telescope. RIPPER did not finish on the lung cancer data set. In these cases, the data sets are omitted from the presented results; as such, NNge was evaluated on a set of 52 data sets and RIPPER on a set of 53 data sets.) The values in bold represent those that are a statistically significant improvement over not filtering. The results of the statistical significance tests for each of the learning algorithms are provided in Tables 10-18 in Appendix A. The results are summarized below.
We find that using a biased filter does not significantly increase the classification accuracy for any of the learning algorithms, and that it significantly decreases the classification accuracy for the LWL, naïve Bayes, Ridor, and RIPPER learning algorithms. These results suggest that simply removing the instances misclassified by a single learning algorithm is not sufficient. Bear in mind that these results reflect not adding any artificial noise to the training set. In the case where artificial noise is added to the training set (as was commonly done in previous work), using a biased filter may result in an improvement in accuracy. However, most real-world scenarios do not artificially add noise to their data sets but are concerned with the inherent noise found within them.
For all of the learning algorithms, the ensemble filter significantly increases the classification accuracy over not filtering and over the other filtering techniques. An ensemble generally provides better predictive performance than any of its constituent learning algorithms and generally yields better results when the underlying models are diverse. Thus, by using a more powerful model, only the noisiest instances are removed. This provides empirical evidence supporting the notion that filtering instances with a low $p(\hat{y}_i \mid x_i)$ that is not dependent on a single hypothesis is preferable to filtering instances where the probability of the class is dependent on a particular hypothesis, as outlined in Equation 3.
Surprisingly, the adaptive filter does not outperform the ensemble filter and, in one case, does not even outperform training on unfiltered data. Perhaps this is because it overfits the training data, since the best accuracy is chosen on the training set. Adaptive filtering has significantly better results when cross-validation is used to filter misclassified instances, as opposed to removing misclassified instances that were also used to train the filtering algorithm. Even using the cross-validation results, however, the adaptive filter is not significantly better than the ensemble filter.
Examining each learning algorithm individually, we find that some learning algorithms are more robust to noise than others. To determine which, we compare the accuracy of the learning algorithms without filtering to the accuracy obtained using an ensemble filter. The p-values from the Wilcoxon signed-ranks statistical significance test are shown in Table 4, ordered from greatest (least significant impact) to least reading from left to right. We see that random forests and decision trees are the most robust to noise, as filtering has the least significant impact on their accuracy. This is not too surprising, given that the C4.5 algorithm was designed to take noise into account and random forests are built using decision trees. Ridor and 5-nearest neighbors (5-NN) are also relatively robust to noise, but still improve considerably with filtering. 5-NN is more robust to noise since it considers the 5 nearest neighbors of an instance; if $k$ were set to 1, filtering would have a greater effect on the accuracy. Filtering has the most significant effect on the accuracy of the last five learning algorithms: MLP, NNge, LWL, RIPPER, and naïve Bayes.
5.2 Analysis of When to Filter
Using only the inherent noise in a data set, the efficacy of filtering is limited and can even be detrimental on some data sets. Thus, we examine the cases in which filtering significantly improves the classification accuracy. This investigation is similar to recent work by Sáez et al., who investigate creating a set of rules to determine when to filter for a 1-nearest neighbor learning algorithm. They use a set of data complexity measures from Ho and Basu. These complexity measures are designed for binary classification problems, but we do not limit ourselves to binary classification problems. As such, we use the subset of the data complexity measures shown in Table 5 that have been extended to handle multi-class problems.
Table 5: Data complexity measures.
F2: Volume of overlap region: the overlap of the per-class bounding boxes, calculated for each attribute by normalizing the difference of the maximum and minimum values from each class.
F3: Maximum individual feature efficiency: for all of the features, the maximum ratio of the number of instances not in the overlapping region to the total number of instances.
F4: Collective feature efficiency: F3 only returns the ratio for the attribute that maximizes it; F4 is a measure over all of the attributes.
N1: Fraction of points on class boundary: the fraction of instances in a data set that are connected in a spanning tree to nearest neighbors that have a different class.
N2: Ratio of average intra/inter class NN distance: the average distance to the nearest intra-class neighbors divided by the average distance to the nearest inter-class neighbors.
N3: Error rate of 1NN classifier: leave-one-out error estimate of the 1-nearest neighbor classifier.
T1: Fraction of maximum covering spheres: the normalized count of the number of clusters of instances containing a single class.
T2: Average number of points per dimension: compares the number of instances to the number of features.
In addition, we also examine a set of hardness measures, shown in Table 6. The hardness measures are designed to identify and characterize instances that have a high likelihood of being misclassified. We examine using the data complexity measures and the hardness measures to create rules and/or a classifier to determine when to use filtering. We set up the classification problem similarly to Sáez et al., where filtering is set to “TRUE” if it significantly improves the classification accuracy for a data set according to the Wilcoxon signed-ranks test. We also examine predicting the difference in accuracy between using and not using a filter. Unlike Sáez et al., we find that the data complexity measures and the hardness measures do not produce a satisfactory classifier for determining when to filter. Granted, we examine more learning algorithms and do not artificially add noise to the data sets, which leaves few data sets where filtering significantly improves the classification accuracy. In the study by Sáez et al., 75% of the data sets had at least 5% noise added, providing more positive examples. Further work is required to determine when to use filtering on unmodified data sets. Based on our results, we would recommend always using an ensemble filter, for all of the learning algorithms, as it significantly outperforms the other filtering techniques.
Table 6: Instance hardness measures.
kDN: k-Disagreeing Neighbors: the percentage of the k nearest neighbors (using Euclidean distance) of an instance that do not share its target class value.
DS: Disjunct Size: the number of instances in a disjunct divided by the number of instances covered by the largest disjunct in a data set, in an unpruned decision tree inferred using C4.5.
DCP: Disjunct Class Percentage: the number of instances in a disjunct belonging to its class divided by the total number of instances in the disjunct, in a pruned decision tree.
TD: Tree Depth: the depth of the leaf node that classifies an instance in an induced decision tree.
CL: Class Likelihood: the probability that an instance belongs to its class given its input features.
CLD: Class Likelihood Difference: the difference between the class likelihood of an instance and the maximum likelihood for all of the other classes.
MV: Minority Value: the ratio of the number of instances sharing the instance's target class value to the number of instances in the majority class.
CB: Class Balance: the difference between the ratio of the number of instances belonging to a class and the ratio expected if the classes were distributed equally.
5.3 Voting Ensemble vs. Filtering
In this section, we compare the results of filtering using an ensemble filter with a voting ensemble. The voting ensemble uses the learning algorithms shown in Table 1, and the vote from each learning algorithm is equally weighted. Table 7 compares the voting ensemble with using an ensemble filter for each of the investigated learning algorithms, giving the average accuracy, the p-value, and the number of times that the accuracy of the voting ensemble is greater than, equal to, or less than that of the ensemble filter. The results for each data set are provided in Table 19 in Appendix B. With no artificially generated noise, a voting ensemble achieves significantly higher classification accuracy than an ensemble filter for each of the examined learning algorithms. This is not too surprising, considering that previous research has shown that ensemble methods address issues that are common to all non-ensemble learning algorithms and that ensemble methods generally obtain greater accuracy than any single learning algorithm that makes up the ensemble. Considering the computational requirements for training, using a voting ensemble for classification rather than filtering appears to be more beneficial.
Many previous studies [1, 14, 4, 17] have shown that when a large amount of artificial noise is added to a data set, filtering outperforms a voting ensemble. We examine which of the 54 data sets have a high percentage of noise, using instance hardness to identify suspected noisy instances. Instance hardness approximates the likelihood that an instance will be misclassified by evaluating the classification of the instance by a set of learning algorithms $G$, composed of the learning algorithms shown in Table 1. Instances with a probability greater than 0.9 of being misclassified are considered noisy. Table 8 shows the accuracies from a voting ensemble and from the considered learning algorithms using an ensemble filter for the subset of data sets with more than 10% noisy instances. Examining these noisier data sets shows that the gains from using an ensemble filter are more noticeable. However, only 9 of the 54 investigated data sets were identified as having more than 10% noisy instances. We ran a Wilcoxon signed-ranks test, but with this small sample size it is difficult to establish the statistical significance of using the ensemble filter over the voting ensemble. Based on the small sample provided here, training a learning algorithm on a filtered data set is statistically equivalent to training a voting ensemble classifier. The computational complexity of training an ensemble is less than that of training an ensemble for filtering followed by training another learning algorithm on the filtered data set, but a single learning algorithm trained on the filtered data set has the benefit that only one learning algorithm must be queried for a novel instance. Future work includes discovering whether a smaller subset of learning algorithms for filtering approximates the ensemble filter, in order to reduce the computational complexity.
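Under the indicator-based approximation used throughout this paper, instance hardness reduces to the fraction of ensemble members that misclassify an instance; a minimal sketch with hypothetical predictions:

```python
# Instance hardness approximated as the fraction of learning algorithms in an
# ensemble that misclassify the instance. Instances with hardness above 0.9
# are treated as noisy. The ensemble predictions below are hypothetical.

def instance_hardness(label, ensemble_predictions):
    """Fraction of ensemble members whose prediction disagrees with the label."""
    misses = sum(1 for pred in ensemble_predictions if pred != label)
    return misses / len(ensemble_predictions)

# Predictions from a hypothetical 9-algorithm ensemble for one instance
# whose observed label is 0:
preds = [1, 1, 0, 1, 1, 1, 1, 1, 1]
hardness = instance_hardness(0, preds)   # 8 of 9 disagree -> about 0.89
is_noisy = hardness > 0.9                # just below the noise threshold
```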
Examining the noisier data sets shows that filtering has a more significant effect on classification accuracy; however, the amount of noise is not the only factor that needs to be considered. For example, 32.2% of the instances in the primary-tumor data set are noisy, yet only one learning algorithm achieves greater classification accuracy than the voting ensemble. On the other hand, the classification accuracy on the ar1 and ozone data sets for all of the considered learning algorithms trained on filtered data is greater than that of the voting ensemble, despite these data sets having only 3.3% and 0.5% noisy instances, respectively. Thus, there are other, currently unknown, data set features affecting when filtering is appropriate. Future work also includes discovering and examining data set features that are indicative of when filtering should be used.
We further investigate the robustness of the majority voting ensemble to noise by applying an ensemble filter to its training data. We find that a majority voting ensemble is significantly better without filtering. The summary results are shown in Table 9, and the full results for each data set can be found in Table B.18 in B. Table 9 divides the data sets into the subset with more than 10% noisy instances (“Noisy”) and the subsets with an original accuracy, averaged across the investigated learning algorithms, of less than 90%, 80%, 70%, 60%, and 50%. Even on the harder and noisier data sets, training on unfiltered data produces significantly higher classification accuracy for the voting ensemble. Thus, we find that a majority voting ensemble is more robust to noise than filtering in most cases. The strength of a voting ensemble comes from the diversity of the ensembled learning algorithms; since much of that diversity comes from how each learning algorithm treats a noisy instance, the models inferred from filtered training data are less diverse, lessening the power of the voting ensemble. As evidence, we examined a voting ensemble consisting of C4.5, random forest, and Ridor, three of the more similar learning algorithms according to unsupervised meta-learning (see Section 4). When trained on the filtered training data, this less diverse voting ensemble achieves a significantly lower average classification accuracy of 82.09%, compared to 83.62% from the voting ensemble composed of the 9 examined learning algorithms.
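The noise tolerance of majority voting can be seen from its mechanics. The sketch below is a minimal, hypothetical illustration (the function name and tie-breaking rule are our own choices), not the paper's implementation:

```python
# Hedged sketch of a majority voting ensemble's prediction step.
from collections import Counter

def majority_vote(predictions):
    """Majority vote over the labels predicted by the ensemble members.
    Ties break toward the label that first reached the top count."""
    return Counter(predictions).most_common(1)[0][0]

# A noisy training instance only sways the ensemble's prediction if it
# flips a majority of the (diverse) member models; members that are all
# trained on the same filtered data lose some of this protection.
print(majority_vote(["a", "a", "b", "a", "b"]))  # -> a
```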
In this paper, we presented an extensive empirical evaluation of misclassification filters on a set of 54 data sets and 9 diverse learning algorithms. In contrast to other work on filtering, we used a large set of data sets and learning algorithms, and we did not artificially add noise to the data sets. In previous work, noise was added to a data set to verify that the noise filtering method was effective and that filtering was more effective when more noise was present. However, artificial noise may not be representative of the actual noise, and the impact of filtering on an unmodified data set is not always clear.
We examined each learning algorithm individually as a filter, as well as all of the learning algorithms combined as an ensemble filter. We also presented an adaptive filtering algorithm that greedily searches the set of candidate learning algorithms for a filter suited to a specific data set and learning algorithm combination. We found that, without artificially adding label noise, using the same learning algorithm for filtering and for inferring a model of the data can be significantly detrimental, and does not significantly increase the classification accuracy even on harder data sets. We also examined using a set of data set features to induce rules that indicate when to use filtering, but did not find a set of rules that significantly improved the results. Using an ensemble filter significantly improved the accuracy over not filtering, and outperformed both the adaptive filtering method and each learning algorithm used individually as a filter, for all of the investigated learning algorithms.
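The ensemble filter evaluated here follows the general scheme of Brodley and Friedl [4]: remove a training instance if some quorum of the filter's member learners misclassifies it. The sketch below is a hedged illustration under our own assumed data layout; the function name and the two quorum rules shown ("majority" and "consensus") are standard variants, not necessarily the paper's exact configuration.

```python
# Hedged sketch of an ensemble (misclassification) filter.
def ensemble_filter(labels, member_predictions, scheme="majority"):
    """labels[i]: the given label of training instance i.
    member_predictions[j][i]: the label filter member j assigns to
    instance i (e.g. obtained via cross-validation).
    Returns the indices of the instances to KEEP.
    'majority':  drop i if more than half the members misclassify it.
    'consensus': drop i only if every member misclassifies it."""
    n_members = len(member_predictions)
    keep = []
    for i, y in enumerate(labels):
        errors = sum(1 for preds in member_predictions if preds[i] != y)
        if scheme == "majority" and errors > n_members / 2:
            continue
        if scheme == "consensus" and errors == n_members:
            continue
        keep.append(i)
    return keep

labels = ["a", "a", "b"]
member_predictions = [["a", "b", "b"],
                      ["a", "b", "b"],
                      ["a", "a", "b"]]
print(ensemble_filter(labels, member_predictions))               # -> [0, 2]
print(ensemble_filter(labels, member_predictions, "consensus"))  # -> [0, 1, 2]
```

A learning algorithm is then trained only on the kept instances; the consensus rule is the more conservative of the two, removing fewer instances.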
We also compared filtering with a voting ensemble and found that a voting ensemble achieves significantly higher classification accuracy than any of the considered learning algorithms trained on filtered data. Furthermore, a majority voting ensemble trained on unfiltered data significantly outperforms a voting ensemble trained on filtered data. Thus, a voting ensemble exhibits robustness to noise in the training set and is preferable to filtering.
-  X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study of their impacts, Artificial Intelligence Review 22 (2004) 177–210.
-  J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, USA, 1993.
-  D. L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man, and Cybernetics (2-3) (1972) 408–421.
-  C. E. Brodley, M. A. Friedl, Identifying mislabeled training data, Journal of Artificial Intelligence Research 11 (1999) 131–167.
-  C.-M. Teng, Combining noise correction with feature selection, in: Y. Kambayashi, M. K. Mohania, W. Wöß (Eds.), DaWaK, Vol. 2737 of Lecture Notes in Computer Science, Springer, 2003, pp. 340–349.
-  G. H. John, Robust decision trees: Removing outliers from databases, in: Knowledge Discovery and Data Mining, 1995, pp. 174–179.
-  I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Transactions on Systems, Man, and Cybernetics 6 (1976) 448–452.
-  D. R. Wilson, T. R. Martinez, Reduction techniques for instance-based learning algorithms, Machine Learning 38 (3) (2000) 257–286.
-  C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
-  R. E. Schapire, The strength of weak learnability, Machine Learning 5 (1990) 197–227.
-  Y. Freund, Boosting a weak learning algorithm by majority, in: Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990, pp. 202–216.
-  R. A. Servedio, Smooth boosting and learning with malicious noise, Journal of Machine Learning Research 4 (2003) 633–648.
-  R. Collobert, F. Sinz, J. Weston, L. Bottou, Trading convexity for scalability, in: Proceedings of the 23rd International Conference on Machine learning, 2006, pp. 201–208.
-  N. D. Lawrence, B. Schölkopf, Estimating a kernel fisher discriminant in the presence of label noise, in: In Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 306–313.
-  D. Gamberger, N. Lavrač, S. Džeroski, Noise detection and elimination in data preprocessing: Experiments in medical domains, Applied Artificial Intelligence 14 (2) (2000) 205–223.
-  M. R. Smith, T. Martinez, Improving classification accuracy by identifying and removing instances that should be misclassified, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2011, pp. 2690–2697.
-  S. Verbaeten, A. Van Assche, Ensemble methods for noise elimination in classification problems, in: Proceedings of the 4th international conference on Multiple classifier systems, MCS’03, Springer-Verlag, Berlin, Heidelberg, 2003, pp. 317–325.
-  N. Segata, E. Blanzieri, P. Cunningham, A scalable noise reduction technique for large case-based systems, in: Proceedings of the 8th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development, 2009, pp. 328–342.
-  X. Zeng, T. R. Martinez, A noise filtering method using neural networks, in: Proc. of the int. Workshop of Soft Comput. Techniques in Instrumentation, Measurement and Related Applications, 2003.
-  U. Rebbapragada, C. Brodley, Class noise mitigation through instance weighting, in: Machine Learning: ECML 2007, Vol. 4701 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2007, pp. 708–715.
-  M. R. Smith, T. Martinez, Reducing the effects of detrimental instances, in: Submission, 2013.
-  X. Zeng, T. R. Martinez, An algorithm for correcting mislabeled data, Intelligent Data Analysis 5 (2001) 491–502.
-  C.-M. Teng, Evaluating noise correction, in: PRICAI, 2000, pp. 188–198.
-  J. Lee, C. Giraud-Carrier, A metric for unsupervised metalearning, Intelligent Data Analysis 15 (6) (2011) 827–841.
-  A. H. Peterson, T. R. Martinez, Estimating the potential for combining learning models, in: Proceedings of the ICML Workshop on Meta-Learning, 2005, pp. 68–75.
-  M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The weka data mining software: an update, SIGKDD Explorations Newsletter 11 (1) (2009) 10–18.
-  K. Thomson, R. J. McQueen, Machine learning applied to fourteen agricultural datasets, Tech. Rep. 96/18, The University of Waikato (September 1996).
-  J. Salojärvi, K. Puolamäki, J. Simola, L. Kovanen, I. Kojo, S. Kaski, Inferring relevance from eye movements: Feature extraction, Tech. Rep. A82, Helsinki University of Technology (March 2005).
-  J. Sayyad Shirabad, T. Menzies, The PROMISE Repository of Software Engineering Databases, School of Information Technology and Engineering, University of Ottawa, Canada (2005).
-  G. Stiglic, P. Kokol, GEMLer: Gene expression machine learning repository, University of Maribor, Faculty of Health Sciences (2009).
-  J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
-  R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45.
-  L. I. Kuncheva, C. J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy., Machine Learning 51 (2) (2003) 181–207.
-  J. A. Sáez, J. Luengo, F. Herrera, Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognition 46 (1) (2013) 355–364.
-  T. K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 289–300.
-  A. Orriols-Puig, N. Macià, E. Bernadó-Mansilla, T. K. Ho, Documentation for the data complexity library in c++, Tech. Rep. 2009001, La Salle - Universitat Ramon Llull (April 2009).
-  M. R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Machine Learning (2013), in press. doi:10.1007/s10994-013-5422-z.
-  T. G. Dietterich, Ensemble methods in machine learning, in: Multiple Classifier Systems, Vol. 1857 of Lecture Notes in Computer Science, Springer, 2000, pp. 1–15.
-  D. W. Opitz, R. Maclin, Popular ensemble methods: An empirical study., Journal of Artificial Intelligence Research 11 (1999) 169–198.
Appendix A Statistical Significance Tables
This section provides the results from the statistical significance tests comparing not filtering with filtering using a biased filter, an ensemble filter, and the adaptive filter for the investigated learning algorithms. The results are in Tables 10–18. The p-values less than 0.05 are in bold, and “greater-equal-less” refers to the number of times that the algorithm listed in the row is greater than, equal to, or less than the algorithm listed in the column.
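The "greater-equal-less" entries are simple per-data-set win/tie/loss tallies between the two compared algorithms. A minimal sketch of that tally (function and variable names are our own):

```python
# Hedged sketch of the "greater-equal-less" tally reported alongside
# the significance tests: per-data-set wins, ties, and losses of the
# row algorithm against the column algorithm.
def greater_equal_less(row_accs, col_accs):
    """row_accs[k], col_accs[k]: accuracies of the two algorithms on data set k."""
    g = sum(1 for a, b in zip(row_accs, col_accs) if a > b)
    e = sum(1 for a, b in zip(row_accs, col_accs) if a == b)
    l = sum(1 for a, b in zip(row_accs, col_accs) if a < b)
    return g, e, l

print(greater_equal_less([0.9, 0.8, 0.7], [0.8, 0.8, 0.9]))  # -> (1, 1, 1)
```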
Pair-wise comparison of filtering for multilayer perceptrons trained with backpropagation.
Appendix B Ensemble Results for Each Data Set
This section provides the results for each data set comparing a voting ensemble with filtering using an ensemble filter, both for each investigated learning algorithm and for the voting ensemble itself. The results comparing a voting ensemble with filtering for each investigated non-ensembled learning algorithm are shown in Table 19. The bold values represent the highest classification accuracy, and the rows highlighted in gray are the data sets where filtering with an ensemble filter increased the accuracy over the voting ensemble for all learning algorithms. The results comparing a voting ensemble with a filtered voting ensemble are shown in Table 20. A bold value in the “Ens” column indicates that the voting ensemble trained on unfiltered data achieves the higher accuracy, while a bold value in an “FEns” column indicates that the voting ensemble trained on filtered data achieves higher accuracy than the voting ensemble trained on unfiltered data.
|Data set||Ens||FEns 50||FEns 70||FEns 90||FEns Max|