1 Introduction
The goal of supervised machine learning is to induce an accurate generalizing function (hypothesis)
from a set of input feature vectors
and a corresponding set of of label vectors that maps given a set of training instances . The quality of the induced function by a learning algorithm is dependent on the values of the learning algorithm’s hyperparameters and the quality of the training data . It is wellknown that realworld data sets are often noisy. It is also known that no learning algorithm or hyperparameter setting is best for all data sets (no free lunch theorem (Wolpert, 1996)). Thus, it is important to consider both the quality of the data and the learning algorithm with its associated hyperparameters when inducing a model of the data.Prior work has shown that hyperparameter optimization (Bergstra & Bengio, 2012) and improving the quality of the training data (i.e., correcting (Kubica & Moore, 2003), weighting (Rebbapragada & Brodley, 2007), or filtering (Smith & Martinez, 2011) suspected noisy instances) can significantly improve the generalization of an induced model. However, searching the hyperparameter space and improving the quality of the training data have generally been examined in isolation. In this paper, we compare the effects of hyperparameter optimization and improving the quality of the training data through filtering. We estimate the potential benefit of filtering and hyperparameter optimization by choosing the subset of training instances/hyperparameters that produce the highest 10fold crossvalidation accuracy for each data set. Maximizing the 10fold cross validation accuracy, in a sense, overfits the data (although none of the data points in the validation set are used for training). However, maximizing the 10fold crossvalidation accuracy provides more perspective on the magnitude of the potential improvement provided by each method. The results could then be analyzed to determine algorithmically which subset of the training data and/or hyperparameters to use for a given task and learning algorithm.
For filtering, we use an ensemble filter (Brodley & Friedl, 1999) as well as an adaptive ensemble filter. In an ensemble filter, an instance is removed if it is misclassified by of the learning algorithms in the ensemble. The adaptive ensemble filter is built by greedily searching for learning algorithms from a set of candidate learning algorithms that results in the highest crossvalidation accuracy on the entire data set when it is added to the ensemble filter. For hyperparameter optimization, we use random hyperparameter optimization (Bergstra & Bengio, 2012). Given the same amount of time constraints, the authors showed that a random search of the hyperparameters out performs a grid search.
We find that filtering has the potential for a significantly higher increase in classification accuracy than hyperparameter optimization. On a set of 52 data sets using 6 learning algorithms with default hyperparameters, the average classification accuracy increases from 79.15% to 87.52% when the training data is filtered. Hyperparameter optimization for the 6 learning algorithms only increases the average accuracy to 81.61%. The significant improvement in accuracy caused by filtering demonstrates the magnitude of the potential benefit that filtering could have on classification accuracy. These results provide motivation for further research into developing algorithms that improve the quality of the training data.
2 Preliminaries
In this section, we establish the preliminaries and notation that will be used to discuss hyperparameter optimization and the quality of the training data. Given that in most cases, all that is known about a task is contained in the set of training instances, at least initially, the instances in a data set are generally considered equally. Therefore, with most machine learning algorithms, one is concerned with maximizing , where is a hypothesis or function mapping input feature vectors to their corresponding label vectors , and is a training set. Generally, there is some overfit avoidance for a learning algorithm to maximize while also minimizing the expected loss on validation data. The goodness of the induced hypothesis
is then characterized by its empirical error for a specified loss function
on a validation set :In practice, is induced by a learning algorithm trained on with hyperparameters , i.e., .
Characterizing the success of a learning algorithm at the data set level (e.g., accuracy or precision) optimizes over the entire training set and marginalizes the impact of a single training instance on an induced model. Some instances can be more beneficial than other instances for inducing a model of the data and some instances can even be detrimental. By detrimental instances
, we mean instances that have a negative impact on the model induced by a learning algorithm. For example, outliers or mislabeled instances are not as beneficial as border instances and are detrimental in many cases. In addition, other instances can be detrimental for inducing a model of the data even if they are labeled correctly and are not outliers. Formally, a detrimental instance
is an instance that, when it is used for training, increases the empirical error:The effects of training with detrimental instances is demonstrated in the hypothetical twodimensional data set shown in Figure 1
. Instances A and B represent detrimental instances. The solid line represents the “actual” classification boundary and the dashed line represents a potential induced classification boundary. Instances A and B adversely affect the induced classification boundary because they “pull” the classification boundary and cause several other instances to be misclassified that otherwise would have been classified correctly even though the induced classification boundary (the dashed line) may maximize
.Mathematically, the effect of each instance on the induced hypothesis can be found through a decomposition of Bayes’ theorem:
(1) 
Despite having a mechanism to avoid overfitting detrimental instances (often denoted as ), the presence of detrimental instances still affects the induced model for most learning algorithms. Detrimental instances have the most significant impact during the early stages of training where it is difficult to identify detrimental instances and undo the negative effects caused by them (Elman, 1993).
In the sections that follow, we discuss how a) hyperparameter optimization and b) improving the quality of the data handle detrimental instances.
2.1 Hyperparameter Optimization
The quality of an induced model by a learning algorithm depends in part on the learning algorithm’s hyperparameters. With hyperparameter optimization, the hyperparameter space is searched to minimize the empirical error on :
(2) 
The hyperparameters can have a significant effect on suppressing detrimental instances and inducing better models in general. For example, the loss function in a support vector machine could be set to use the ramploss function which limits the penalty on instances that are too far from the decision boundary
(Collobert et al., 2006). Suppressing the effects of the detrimental instances improves the induced model, but does not change the fact that detrimental instances still affect the model. This is shown graphically in Figure 2a using the same hypothetical 2dimensional data set in Figure 1. The original induced hypothesis without hyperparameter optimization is shown as the gray dashed line. The new induced hypothesis is represented as the bold dashed line. The effect of instance A and instance B may be reduced but instance A and instance B still affect the induced hypothesis. As shown mathematically in Equation 1, each instance is still considered during the learning process.a  b 
2.2 Improving the Training Data Quality
The quality of an induced model also depends on the quality of the training data. Low quality training data results in lower quality induced models. The quality of each training instance is generally not considered when searching for the most probable hypothesis given the training data other than overfit avoidance. Thus, the learning process could also seek to improve the quality of the training data–such as searching for a subset of the training data that results in lower empirical error:
where is a subset of and is the power set of . The effects of training on only a subset of the training data is shown in Figure 2b. The detrimental instances A and B are removed from the data set and obviously have no effect on the induced model. Mathematically, it is readily apparent that the removed instances have no effect on the induced model as they are not included in the product in Equation 1.
3 Identifying Detrimental Instances
Recognizing that some instances are detrimental for inducing a classification function leads to the question of how to identify which instances are detrimental. Identifying detrimental instances is a nontrivial task. Fully searching the space of subsets of training instances generates subsets of training instances where is the number of training instances. Even for small data sets, it is computationally infeasible to induce models to determine which instances are detrimental. There is no known way to determine how a single instance will affect the induced classification function without inducing a classification function with the investigated instance removed from the training set.
As previously discussed, most learning algorithms induce the most probable classification function given the training data. For classification problems, we are interested in maximizing the probability of a class label given a set of input features:
. Using the chain rule,
from Equation 1 can be substituted with :Each instance contributes to as the quantity . Previous work in noise handling has shown that class noise is more detrimental than attribute noise (Zhu & Wu, 2004; Nettleton et al., 2010). Thus, searching for detrimental instances that have a low is a natural place to start.
Recently, Smith et al. (2013) investigated the presence of instances that are likely to be misclassified in commonly used data sets as well as their characteristics. We follow their procedure of using a set of diverse learning algorithms to estimate . The dependence of on can be lessened by summing over the space of all possible hypotheses:
(3) 
Practically, to sum over , one would have to sum over the complete set of hypotheses, or, since , over the complete set of learning algorithms and hyperparameters associated with each algorithm. This, of course, is not feasible. In practice, can be estimated by restricting attention to a diverse set of representative algorithms (and hyperparameters). Also, it is important to estimate because if all hypotheses were equally likely, then all instances would have the same under the no free lunch theorem (Wolpert, 1996). A natural way to approximate the unknown distribution is to weight a set of representative learning algorithms, and their associated hyperparameters, , a priori with an equal, nonzero probability while treating all other learning algorithms as having zero probability. Given such a set of learning algorithms, we can then approximate Equation 3 to the following:
(4) 
where is approximated as and is the j learning algorithm from . The distribution is estimated using the indicator function with 5 by 10fold crossvalidation (running 10fold cross validation 5 times, each time with a different seed to partition the data) since not all learning algorithms produce probabilistic outputs.
To get a good representation of , and hence a reasonable estimate of , we select a diverse set of learning algorithms using unsupervised metalearning (UML) (Lee & GiraudCarrier, 2011). UML uses Classifier Output Difference (COD) (Peterson & Martinez, 2005) to measure the diversity between learning algorithms. COD measures the distance between two learning algorithms as the probability that the learning algorithms make different predictions. UML then clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. Here, we considered 20 commonly used learning algorithms with their default hyperparameters as set in Weka (Hall et al., 2009). The resulting dendrogram is shown in Figure 3, where the height of the line connecting two clusters corresponds to the distance (COD value) between them. A cutpoint of 0.18 was chosen and a representative algorithm from each cluster was used to create as shown in Table 1.
Learning Algorithms  

*  Multilayer Perceptron trained with Back 
Propagation (MLP)  
*  Decision Tree (C4.5) 
*  Locally Weighted Learning (LWL) 
*  5Nearest Neighbors (5NN) 
*  Nearest Neighbor with generalization (NNge) 
*  Naïve Bayes (NB) 
*  RIpple DOwn Rule learner (RIDOR) 
*  Random Forest (RandForest) 
*  Repeated Incremental Pruning to Produce Error 
Reduction (RIPPER) 
In addition to using the entire set of learning algorithms in the ensemble filter (the filter), we also dynamically create the ensemble filter from using a greedy algorithm. This allows us to find a specific set of learning algorithms that are best for filtering a given data set and learning algorithm combination. Algorithm 1 outlines our approach. The adaptive ensemble filter is constructed by iteratively adding the learning algorithm from that produces the highest crossvalidation classification accuracy when is added to the ensemble filter. Because we are using the probability that an instance will be misclassified rather than a binary yes/no, we also use a threshold to determine which instances to examine as being detrimental. Instances with a less than are discarded from the training set. A constant threshold value for is set to filter the instances in for all iterations. The baseline accuracy for the adaptive approach is the accuracy of the learning algorithm without filtering (line 3). The search stops once adding one of the remaining learning algorithms to the ensemble filter does not increase accuracy, or all of the learning algorithms in have been added to the ensemble filter.
The adaptive filtering approach overfits the data since the cross validation accuracy is maximized (all detrimental instances are included for evaluation). This allows us to find those instances that are actually detrimental in order to examine the effects that they can have on an induced model. Of course, this is not feasible in practical settings, but provides insight into the potential improvement gained from improving the quality of the training data.
4 Filtering Versus HyperParameter Optimization
In this section, we empirically compare the potential effects of filtering detrimental instances with those of hyperparameter optimization. When comparing two learning algorithms, statistical significance is determined using the Wilcoxon signedranks test (Demšar, 2006) with an alpha value of 0.05. In addition to reporting the average classification accuracy, we also report two relative measures–the average percentage of reduction in error and the average percentage of reduction in accuracy, calculated as:
where returns the baseline accuracy on data set and returns the accuracy on data set of the algorithm that we are interested in (i.e., filtering or hyperparameter optimization). The percent reduction in error captures the fact that increasing the accuracy from 98% to 99% is more difficult than increasing the accuracy from 50% to 51%, which is not expressed in the average accuracy. Likewise, the percent reduction in accuracy captures the relative difference in loss of accuracy for the investigated method. We also show the number of times that is greater than, equal to, or less than (count
). Together, the count, the percent reduction in error, and the percent reduction in accuracy show the potential benefit of the using an algorithm and also the variance of an algorithm. For example, an algorithm has more variance than another if it increases the percent reduction in error but also increases the percent reduction in accuracy compared to an algorithm that may not increase the percent reduction in error as much but also does not increase the percent reduction in accuracy. The average accuracy for each algorithm on a data set is determined using 5 by 10fold crossvalidation.
Hyperparameter optimization uses 10 random searches of the hyperparameter space. Random hyperparameter selection was chosen based on the work by (Bergstra & Bengio, 2012). The premise of random hyperparameter optimization is that most machine learning algorithms have very few hyperparameters that considerably affect the final model while most of the other hyperparameters have little to no effect on the final model. Random search provides a greater variety of the hyperparameters that considerably affect the model, thus allowing for better selection of these hyperparameters. Given the same amount of time constraints, random search has been shown to out perform a grid search. For reproducibility, the exact process of hyperparameter optimization for the learning algorithms is provided in the supplementary material. The accuracy from the hyperparameters that resulted in the highest crossvalidation accuracy for each data set is reported.
For filtering using the ensemble (filter) and adaptive filtering, we use thresholds of 0.5, 0.3, and 0.1 to identify detrimental instances. The filter estimates using all of the learning algorithms in the set (Table 1). The adaptive filter greedily constructs an ensemble to estimate for a specific data set/learning algorithm combination. The accuracy from the that produced the highest accuracy for each data set is reported.
To show the effect of filtering detrimental instances and hyperparameter optimization on an induced model, we examine filtering and hyperparameter optimization in six commonly used learning algorithms: MLP, C4.5, IBk, NB, Random Forest, and RIPPER. The LWL, NNge, and Ridor learning algorithms are not used for analysis because they do not scale well with the larger data sets–not finishing due to memory overflow or large amounts of running time with many hyperparameter settings.^{1}^{1}1For the data sets on which the learning algorithms did finish, the effects of hyperparameter optimization and filtering on LWL, NNge, and Ridor are consistent with the other learning algorithms.
MLP  C4.5  IBk  NB  RF  RIP  

Orig  80.74  80.11  79.03  75.68  81.59  77.83 
filter  83.53  82.57  82.90  78.51  83.79  81.37 
%red  16.01  12.43  18.28  14.92  14.65  14.96 
%red  0.64  1.23  1.08  0.47  0.70  0.64 
count  44,1,7  45,1,6  44,2,6  42,0,10  38,3,11  50,0,2 
HPO  83.14  81.93  82.15  79.63  82.75  80.04 
%red  20.24  12.73  20.02  22.77  19.81  12.59 
%red  2.83  1.11  0.32  0.67  2.13  0.48 
count  47,0,5  39,0,13  41,2,9  42,1,9  37,2,13  47,1,4 
The results comparing the accuracy of the default hyperparameters set in weka (Hall et al., 2009) with the filter and hyperparameter optimization (HPO) are shown in Table 2. The increases in accuracy that are statistically significant are in bold. Not surprisingly, both the filter and hyperparameter optimization significantly increase the classification accuracy for all of the investigated learning algorithms. The magnitude of the increase in average accuracy is similar for both approaches. However, hyperparameter optimization shows more variance than filtering as demonstrated by larger percent reduction in error and in a larger percent reduction in accuracy. The filter also generally increases the accuracy on more data sets than hyperparameter optimization.
MLP  C4.5  IBk  NB  RF  RIP  

HPO  83.14  81.93  82.15  79.63  82.75  80.04 
F  83.53  82.57  82.90  78.51  83.79  81.37 
%red  10.31  9.43  8.26  13.14  12.33  11.98 
%red  2.11  2.45  1.94  5.65  1.64  2.57 
count  27,3,22  33,4,15  30,2,20  22,2,28  27,1,24  34,1,17 
F  84.50  83.08  83.83  81.05  85.22  82.85 
%red  11.26  9.24  9.39  8.33  12.34  12.60 
%red  0.39  0.13  0.89  0.39  0.06  0.20 
count  35,4,13  39,5,8  43,2,7  43,2,7  45,1,6  45,5,2 
A  88.24  87.39  90.08  81.91  90.17  87.34 
%red  37.87  35.22  48.41  22.93  47.02  37.98 
%red  1.41  0.47  0.41  4.69  1.47  1.42 
count  45,0,7  48,2,2  51,0,1  34,0,18  46,0,6  48,0,4 
A  85.78  84.45  84.87  82.25  86.57  84.24 
%red  19.23  17.67  16.26  14.66  22.67  21.43 
%red  0.00  0.00  0.90  0.23  0.00  NA 
count  50,1,1  47,3,2  49,1,2  49,0,3  50,1,1  50,1,0 
Table 3 compares the hyperparameter optimized learning algorithms with filtering. F and F represent using the filter where the learning algorithms in have default hyperparameter settings and where the hyperparameters of the learning algorithms in have been optimized. Likewise, A and A represent the results from the adaptive filtering algorithm when the ensemble filter is built from with and without hyperparameter optimization. Compared with hyperparameter optimization, the filter without hyperparameter optimization significantly increases the classification accuracy for the C4.5, random forest and RIPPER learning algorithms. However, if the filter is composed of hyperparameter optimized learning algorithms, then the increase in accuracy is significant for all of the considered learning algorithms. The accuracy increases on more data sets for all of the examined learning algorithms when is composed of hyperparameter optimized learning algorithms. Thus, in combination, hyperparameter optimization can result in more significant gains in classification accuracy and exhibits less variance.
The adaptive filter also significantly improves classification accuracy over hyperparameter optimization and provides a greater increase in accuracy than the filter with hyperparameter optimization. The results from the adaptive filters show the potential gain in filtering if a filtering algorithm could more accurately identify detrimental instances for each data set and learning algorithm combination. With and without hyperparameter optimization, the adaptive filter results in significant gains in generalization accuracy for each learning algorithm. These results show the impact of choosing an appropriate subset of the training data and provide motivation for improving filtering algorithms. Building an adaptive filter from the set of learning algorithms without hyperparameter optimization provides higher average classification accuracy but also exhibits more variance. However, the adaptive filter composed of hyperparameter optimized learning algorithms increases the accuracy on more data sets than the adaptive filter composed of learning algorithms with their default hyperparameters for the MLP, naïve Bayes, random forest, and RIPPER learning algorithms.
ALL  MLP  C4.5  IB5  NB  RF  RIP  

None  5.36  2.69  2.95  3.08  5.64  5.77  1.60 
MLP  18.33  16.67  15.77  20.00  25.26  23.72  16.36 
C4.5  17.17  17.82  15.26  22.82  14.49  13.33  20.74 
IB5  12.59  11.92  14.23  1.28  10.00  17.18  16.89 
LWL  6.12  3.59  3.85  4.36  23.72  3.33  3.59 
NB  7.84  5.77  6.54  8.08  5.13  10.26  4.92 
NNge  19.49  26.67  21.15  21.03  11.15  24.74  23.40 
RF  21.14  22.95  26.54  23.33  15.77  15.13  24.20 
Rid  14.69  14.87  16.79  18.33  11.92  16.54  12.77 
RIP  8.89  7.82  7.69  8.85  13.08  7.44  4.39 
There is no one learning algorithm that is the optimal filter for all learning algorithms and/or data sets. Table 4 shows the frequency for which a learning algorithm with default hyperparameters was selected for filtering by the adaptive filter. The greatest percentage of cases a learning algorithm is selected for filtering for each learning algorithm is in bold. The column “ALL” refers to the average from all of the learning algorithms as the base learner. No instances are filtered in 5.36% of the cases. Thus, filtering to some extent increases the classification accuracy in about 95% of the cases. Furthermore, the random forest, NNge, MLP, and C4.5 learning algorithms are the most commonly chosen algorithms for filtering. However, no one learning algorithm is selected in more than 27% of the cases. The filtering algorithm that is most appropriate is dependent on the data set and the learning algorithm. Previously, Sáez et al. (2013) examined when filtering is most beneficial for the nearest neighbor classifier. They found that the efficacy of noise filtering is dependent on the characteristics of the data set and provided a rule set to determine when filtering will be most beneficial. Future work includes better understanding the efficacy of noise filtering for each learning algorithm and determining which filtering approach to use for a given data set.
5 Related Work
To our knowledge, this is the first work that examines both filtering and hyperparameter optimization in handling detrimental instances and compares their effects. Our work was motivated in part by Smith et al. (2013) who examined the existence of and characterization of instances that are hard to classify correctly. They found that a significant number of instances are hard to classify correctly and that the hardness of each instance is dependent on its relationship with the other instances in the training set as well as the learning algorithm used to induce a model of the data. Thus, there is a need for improving the way detrimental instances are handled during training as they affect the classification of the other instances in the data set.
Other previous work has examined improving the quality of the training data and hyperparameter optimization in isolation. Frénay & Verleysen (2014) provide a survey of the current approaches for dealing with label noise in classification problems which often take the approach of improving the quality of the data. Improving the quality of the training data has typically fallen into three approaches: filtering, cleaning, and instance weighting. Each technique within an approach differs in how detrimental instances are identified. A common technique for filtering removes instances from a data set that are misclassified by a learning algorithm (Tomek, 1976) or an ensemble of learning algorithms (Brodley & Friedl, 1999). Removing the training instances that are suspected to be noise and/or outliers prior to training has the advantage that they do not influence the induced model and generally increase classification accuracy (Smith & Martinez, 2011)
. Training on a subset of the training data has also been observed to increase accuracy in active learning
(Schohn & Cohn, 2000).A negative sideeffect of filtering is that beneficial instances can also be discarded and produce a worse model than if all of the training data had been used (Smith & Martinez, 2014a). Rather than discarding the instances from a training set, one approach seeks to clean or correct instances that are noisy or possibly corrupted (Kubica & Moore, 2003). However, this could artificially corrupt valid instances. Weighting, on the other hand, weights suspected detrimental instances rather than discarding them (Rebbapragada & Brodley, 2007; Smith & Martinez, 2014b). Weighting the instances allows for an instance to be considered on a continuum of detrimentality rather than making a binary decision of an instance’s detrimentality. These approaches have the advantage that data is not discarded which is especially beneficial when data is scarce.
Much of the previous work has artificially inserted noise and/or corrupted instances into the initial data set to determine how well an approach would work in the presence of noisy or mislabeled instances. In some cases, a given approach only has a significant impact when there are large degrees of artificial noise. We show that correctly labeled, nonnoisy instances can also be detrimental for inducing a model of the data and that properly handling them can result in significant gains in classification accuracy. Thus, in contrast to much previous work, we did not artificially corrupt a data set to create detrimental instances. Rather, we sought to identify the detrimental instances that are already contained in a data set.
The grid search and manual search are the most common types of hyperparameter optimization techniques in machine learning. A combination of the two approaches are commonly used (Larochelle et al., 2007). Bayesian optimization has also been used to search the hyperparameter space (Snoek et al., 2012; Hutter et al., 2011) as an alternative approach. Further, Bergstra & Bengio (2012) proposed to use a random search of the hyperparameter space as discussed previously.
6 Conclusions
In this paper we compared the potential benefits of filtering with hyperparameter optimization both mathematically and empirically. Mathematically, hyperparameter optimization may reduce the effects of detrimental instances on an induced model but the detrimental instances are still considered in the learning process. Filtering, on the other hand, fully removes the detrimental instances–completely eliminating the effects of the detrimental instances on the induced model.
Empirically, we estimated the potential benefits of each method by maximizing the 10fold crossvalidation accuracy of each data set. For the adaptive filter, we also chose a set of learning algorithms for an ensemble filter by maximizing the 10fold crossvalidation accuracy. Both filtering and hyperparameter optimization significantly increase the classification accuracy for all of the considered learning algorithms. On average, a learning algorithm increases in classification accuracy from 79.15% to 87.52% by removing the detrimental instances on the observed data sets using the adaptive filter. On the other hand, hyperparameter optimization only increases the average classification accuracy to 81.61%. Filtering with hyperparameter optimized learning algorithms produces more stable results (i.e. consistent increases in classification accuracy and low decreases in classification accuracy) than filtering or hyperparameter optimization in isolation.
One of the reasons that instance subset selection is overlooked is due to the fact that there is a large computational requirement to observe the relationship of an instance on the other instances in a data set. As such, determining which instances to filter is nontrivial. We hope that the results presented in this paper will provide motivation for furthering the work in improving the quality of the training data.
References
 Bergstra & Bengio (2012) Bergstra, James and Bengio, Yoshua. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

Brodley & Friedl (1999)
Brodley, Carla E. and Friedl, Mark A.
Identifying mislabeled training data.
Journal of Artificial Intelligence Research
, 11:131–167, 1999.  Collobert et al. (2006) Collobert, Ronan, Sinz, Fabian, Weston, Jason, and Bottou, Léon. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine learning, pp. 201–208, 2006.
 Demšar (2006) Demšar, Janez. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

Elman (1993)
Elman, Jeffery L.
Learning and development in neural networks: The importance of starting small.
Cognition, 48:71–99, 1993.  Frénay & Verleysen (2014) Frénay, Benoit and Verleysen, Michel. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, pp. in press, 25 pages, 2014.
 Hall et al. (2009) Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. The weka data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
 Hutter et al. (2011) Hutter, Frank, Hoos, Holger H., and LeytonBrown, Kevin. Sequential modelbased optimization for general algorithm configuration. In Proceedings of the International Learning and Intelligent Optimization Conference, pp. 507–523, 2011.
 Kubica & Moore (2003) Kubica, Jeremy and Moore, Andrew. Probabilistic noise identification and data cleaning. In Proceedings of the 3rd IEEE International Conference on Data Mining, pp. 131–138, 2003.
 Larochelle et al. (2007) Larochelle, Hugo, Erhan, Dumitru, Courville, Aaron, Bergstra, James, and Bengio, Yoshua. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pp. 473–480, 2007.
 Lee & GiraudCarrier (2011) Lee, Jun and GiraudCarrier, Christophe. A metric for unsupervised metalearning. Intelligent Data Analysis, 15(6):827–841, 2011.

Nettleton et al. (2010)
Nettleton, David F., OrriolsPuig, Albert, and Fornells, Albert.
A study of the effect of different types of noise on the precision of supervised learning techniques.
Artificial Intelligence Review, 33(4):275–306, 2010.  Peterson & Martinez (2005) Peterson, Adam H. and Martinez, Tony R. Estimating the potential for combining learning models. In Proceedings of the ICML Workshop on MetaLearning, pp. 68–75, 2005.
 Rebbapragada & Brodley (2007) Rebbapragada, Umaa and Brodley, Carla E. Class noise mitigation through instance weighting. In Proceedings of the 18th European Conference on Machine Learning, pp. 708–715, 2007.
 Sáez et al. (2013) Sáez, José A., Luengo, Julián, and Herrera, Francisco. Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification. Pattern Recognition, 46(1):355–364, 2013.
 Schohn & Cohn (2000) Schohn, Greg and Cohn, David. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, pp. 839–846, 2000.
 Smith & Martinez (2011) Smith, Michael R. and Martinez, Tony. Improving classification accuracy by identifying and removing instances that should be misclassified. In Proceedings of the IEEE International Joint Conference on Neural Networks, pp. 2690–2697, 2011.
 Smith & Martinez (2014a) Smith, Michael R. and Martinez, Tony. An extensive evaluation of filtering misclassified instances in supervised classification tasks. In submission, 2014a. URL http://arxiv.org/abs/1312.3970.
 Smith & Martinez (2014b) Smith, Michael R. and Martinez, Tony. Becoming more robust to label noise with classifier diversity. In Submission, 2014b. URL http://arxiv.org/abs/1403.1893.
 Smith et al. (2013) Smith, Michael R., Martinez, Tony, and GiraudCarrier, Christophe. An instance level analysis of data complexity. Machine Learning, pp. in press, 32 pages, 2013. doi: 10.1007/s109940135422z.
 Snoek et al. (2012) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2960–2968. 2012.
 Tomek (1976) Tomek, Ivan. An experiment with the edited nearestneighbor rule. IEEE Transactions on Systems, Man, and Cybernetics, 6:448–452, 1976.
 Wolpert (1996) Wolpert, David H. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.
 Zhu & Wu (2004) Zhu, Xingquan and Wu, Xindong. Class noise vs. attribute noise: a quantitative study of their impacts. Artificial Intelligence Review, 22:177–210, November 2004.