Nested cross-validation when selecting classifiers is overzealous for most practical applications

09/25/2018 ∙ by Jacques Wainer, et al. ∙ University of Campinas ∙ University of East Anglia

When selecting a classification algorithm to be applied to a particular problem, one has to simultaneously select the best algorithm for that dataset and the best set of hyperparameters for the chosen model. The usual approach is to apply a nested cross-validation procedure; hyperparameter selection is performed in the inner cross-validation, while the outer cross-validation computes an unbiased estimate of the expected accuracy of the algorithm with cross-validation based hyperparameter tuning. The alternative approach, which we shall call “flat cross-validation”, uses a single cross-validation step both to select the optimal hyperparameter values and to provide an estimate of the expected accuracy of the algorithm which, while biased, may nevertheless still be used to select the best learning algorithm. We tested both procedures using 12 different algorithms on 115 real-life binary datasets and conclude that using the less computationally expensive flat cross-validation procedure will generally result in the selection of an algorithm that is, for all practical purposes, of similar quality to that selected via nested cross-validation, provided the learning algorithms have relatively few hyperparameters to be optimised.




1 Introduction

A practitioner who builds a classification model has to select the best algorithm for that particular problem. There are hundreds of classification algorithms described in the literature, such as k-nearest neighbour [1], SVM [2], neural networks [3], naïve Bayes [4], gradient boosting machines [5], and so on. Although there are sometimes theoretical and/or empirical reasons to prefer a particular algorithm over another when tackling a particular problem, our current understanding of machine learning does not allow us to predict a priori whether one algorithm will perform better than another. Furthermore, the so-called “no-free-lunch” theorems even suggest that no algorithm can outperform all others for all problems [6]. Therefore, for most difficult tasks, one should benefit from trying many competing algorithms to discover which gives the best performance. However, most algorithms have one or more hyperparameters that must be set externally; for example, the k-nearest neighbour method has (usually) one hyperparameter, $k$, whereas random forest has at least two, the number of trees to be constructed and the number of features considered at each split. Unfortunately, selecting an algorithm and tuning its hyperparameters are dependent steps: an algorithm may perform very well for a problem when using a particular set of hyperparameters, but may perform worse than other algorithms with a different, sub-optimal, set of hyperparameters. One will therefore want to choose both the algorithm and its hyperparameters in such a way as to maximize its expected performance on future data.

Choosing an appropriate classifier (model) and optimising the hyperparameters are most often performed by optimising a cross-validation [7] estimate of generalisation performance. The most basic form of cross-validation, known as k-fold cross-validation, partitions the available data into $k$ disjoint chunks of approximately equal size. In each iteration a training set is formed from a different combination of $k-1$ chunks, with the remaining chunk used as the test set; a model is then fit to the training set and its performance evaluated using the test set. The average of the performance metric on the test set in each iteration is then used as an estimate of the generalisation performance of a model fit to all of the available data. There are two common procedures for selecting the best algorithm and tuning the hyperparameters via cross-validation. One is called nested cross-validation, also known as double cross-validation [7]; the other appears to have no standard name, so we will call it flat cross-validation:

Flat cross-validation: The hyperparameters of each model are tuned to optimise a cross-validation based estimate of generalisation performance. The cross-validation performance estimate, evaluated for those optimal hyperparameter values, is then used to select the best model to use in operation. This approach is computationally inexpensive; however, an optimistic bias is introduced into the performance estimate, as it has been directly optimised in tuning the hyperparameters [8]. Unless this bias is commensurate for all of the candidate models, the re-use of the hyperparameter optimisation criterion as a model selection criterion may result in a sub-optimal choice of model, potentially selecting a model that is particularly susceptible to this bias, rather than a model with genuinely higher performance.

Nested cross-validation: An outer cross-validation procedure is performed to provide a performance estimate used to select the optimal model. In each fold of the outer cross-validation, the hyperparameters of the model are tuned independently to optimise an inner cross-validation estimate of generalisation performance. The outer cross-validation is then essentially estimating the performance of a method for fitting a model, namely cross-validation based hyperparameter tuning. This eliminates the bias introduced by the flat cross-validation procedure, as the test data in each iteration of the outer cross-validation has not been used to optimise the performance of the model in any way, and may therefore provide a more reliable criterion for choosing the best model. The computational expense of nested cross-validation, however, is substantially higher.
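As a point of reference for the k-fold partitioning that both procedures build on, the following is a minimal sketch in Python. The function name `k_fold_pairs`, its parameters, and the interleaved chunking are illustrative assumptions, not part of the original study; it works on indices only.

```python
import random

def k_fold_pairs(n_items, k, seed=0):
    """Return k (train_idx, test_idx) pairs over range(n_items).

    Hypothetical helper for illustration only.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Chunk c takes every k-th shuffled index, so chunk sizes differ by at most 1.
    chunks = [idx[c::k] for c in range(k)]
    pairs = []
    for c in range(k):
        test = chunks[c]
        # The training set is the union of the other k-1 chunks.
        train = [i for j, chunk in enumerate(chunks) if j != c for i in chunk]
        pairs.append((train, test))
    return pairs

pairs = k_fold_pairs(10, 5)
```

Each index appears in exactly one test chunk, and every training set is disjoint from its test chunk.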

The aim of this study is to perform an empirical evaluation to determine whether the additional computational expense of the nested cross-validation procedure is generally justified by providing a more reliable means of choosing the best model and statistically superior performance. While no empirical study can possibly cover every learning task or apply to all learning systems, a thorough study using a large suite of benchmark datasets and a set of learning systems representative of those used in applied work remains the best available guidance for practitioners.

1.1 Estimating the Generalisation Performance of a Model

Let us denote by $acc(a, \theta, T, S)$ the accuracy of algorithm $a$ when trained on data $T$ with hyperparameters $\theta$ and tested on data $S$. Let us assume that a data set $D$ is an i.i.d. sample from some underlying distribution $\mathcal{D}$. The best algorithm for the data set $D$ is the algorithm that, when trained on the whole data set $D$ with the optimal values for the hyperparameters, will have the highest expected accuracy for future data.

The expected accuracy for future data $g$ drawn from $\mathcal{D}$ (for algorithm $a$ trained on $D$ with hyperparameters $\theta$) is:

$$\mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a, \theta, D, g) \right]$$

Given a set of candidate classification algorithms, $\mathcal{A}$, the best algorithm is then:

$$a^{*} = \arg\max_{a \in \mathcal{A}} \ \mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a, \theta^{*}_{a}, D, g) \right]$$

where $\theta^{*}_{a}$ denotes the best set of hyperparameters for algorithm $a$, that is:

$$\theta^{*}_{a} = \arg\max_{\theta} \ \mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a, \theta, D, g) \right] \tag{2}$$

Both nested and flat cross-validation procedures estimate the expected performance of the classifier, with optimal hyperparameter settings,

$$\mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a, \theta^{*}_{a}, D, g) \right] \tag{3}$$

and then select the algorithm, $a \in \mathcal{A}$, having the highest estimate. Let us denote as $flat(a, D)$ the flat cross-validation estimate of the term in Equation 3, and as $nested(a, D)$ the nested cross-validation estimate. The two estimates, $flat(a, D)$ and $nested(a, D)$, will generally take different numeric values (Section 1.2).

The nested cross-validation (CV) procedure is considered the more appropriate because $nested(a, D)$ is an unbiased estimate of the expectation in Equation 3, whereas $flat(a, D)$ has a positive bias [8]; that is, on average, $flat(a, D)$ will have higher, overly optimistic values than $nested(a, D)$. This arises as the data used in performance evaluation are also used indirectly in tuning the hyperparameters. Thus the algorithm selected by the nested procedure is considered the more “correct” because the estimates of that procedure are unbiased relative to the true expected accuracy. However, the nested procedure has a much higher computational cost than the flat procedure. Therefore one may want to use the flat procedure, even at some risk of not selecting the best algorithm, where the computational expense is prohibitive. Notice that for the purpose of algorithm selection the positive bias of the flat procedure is not itself a problem, provided the algorithm ranked highest by the nested procedure is the same as that ranked highest by the flat procedure, as will be the case if the degree of bias is approximately the same for all classifiers.

Let us assume that $a^{n}_{D}$ is the algorithm selected by the nested CV procedure, and $a^{f}_{D}$ the algorithm selected using flat CV for data set $D$. Clearly if $a^{n}_{D} = a^{f}_{D}$ in a very high percentage of cases, then one may choose the less expensive procedure, at some slight risk. In the cases where the two procedures do not agree on the best algorithm, we will compute the accuracy gain of the nested procedure selection relative to the flat selection procedure, or in other words, the difference in the expected accuracy on future data of the nested selection and the flat selection.

Let $\theta^{*}_{n}$ be the optimal hyperparameters for the $a^{n}$ algorithm, and $\theta^{*}_{f}$ those of the $a^{f}$ algorithm, as computed by expression 2. Let us define the accuracy gain of using the nested CV procedure on dataset $D$ as:

$$ag(D) = \mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a^{n}, \theta^{*}_{n}, D, g) \right] - \mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a^{f}, \theta^{*}_{f}, D, g) \right]$$

where we left implicit the subscript $D$ for the $a^{n}$ and $a^{f}$ symbols, for the sake of clarity. Of course, one cannot determine the true value of the accuracy gain, but as we will discuss in Section 2.1, we will be able to estimate it.
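The optimistic bias of the flat estimate can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not from the paper: every hyperparameter setting has the same true accuracy (0.7), and the accuracy estimates of different settings are independent binomial proportions on 100 test cases (real cross-validation estimates are correlated, since they share data). The function name and defaults are hypothetical.

```python
import random

def simulate_selection_bias(n_settings=10, n_test=100, true_acc=0.7,
                            n_trials=1000, seed=0):
    """Compare the mean of the best-of-many estimate with a single estimate."""
    rng = random.Random(seed)
    total_best = 0.0
    total_single = 0.0
    for _ in range(n_trials):
        # One noisy accuracy estimate per hyperparameter setting.
        estimates = [
            sum(rng.random() < true_acc for _ in range(n_test)) / n_test
            for _ in range(n_settings)
        ]
        total_best += max(estimates)    # value reported after picking the best setting
        total_single += estimates[0]    # an estimate that was not used for selection
    return total_best / n_trials, total_single / n_trials

mean_best, mean_single = simulate_selection_bias()
```

Under these assumptions the single estimate averages close to the true accuracy, while the maximum over settings averages noticeably higher, even though no setting is actually better than another.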

1.2 Flat and nested CV estimates

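Both nested and flat CV procedures rely on using cross-validation to estimate an expectation of the form $\mathbb{E}_{g \sim \mathcal{D}}\left[ acc(a, \theta, D, g) \right]$ when $\mathcal{D}$ is not known. Given a set of data $D$, cross-validation defines a set of pairs of sets $tr_i$ and $te_i$, where $tr_i$ is called a training set and $te_i$ is called a test set or sometimes a validation set, and where:

$$tr_i \cup te_i \subseteq D, \qquad tr_i \cap te_i = \emptyset$$

Common cross-validation procedures include: k-fold, bootstrap, leave-one-out, hold-out, and so on.

Given a particular cross-validation procedure (which, given $D$, defines the sets $tr_i$ and $te_i$ and the number $n$ of such pairs), the cross-validation estimate for the expected accuracy of the classifier (for a particular algorithm $a$ and hyperparameters $\theta$) is calculated as:

$$cv(a, \theta, D) = \frac{1}{n} \sum_{i=1}^{n} acc(a, \theta, tr_i, te_i)$$

The flat CV estimate of the term in Equation 3 will select $\hat{\theta}_{a}$ as the value of $\theta$ that maximizes $cv(a, \theta, D)$:

$$\hat{\theta}_{a} = \arg\max_{\theta} \ cv(a, \theta, D)$$

and then use $cv(a, \hat{\theta}_{a}, D)$ to estimate it, that is:

$$flat(a, D) = cv(a, \hat{\theta}_{a}, D) = \max_{\theta} \ cv(a, \theta, D)$$

In nested CV, each training set $tr_i$ (of the outer cross-validation) is further subdivided into $m$ pairs of sets of data $tr_{ij}$ and $te_{ij}$, where again:

$$tr_{ij} \cup te_{ij} \subseteq tr_i, \qquad tr_{ij} \cap te_{ij} = \emptyset$$

The nested cross-validation procedure will select the best hyperparameter for each training set $tr_i$ as:

$$\hat{\theta}_{a,i} = \arg\max_{\theta} \ \frac{1}{m} \sum_{j=1}^{m} acc(a, \theta, tr_{ij}, te_{ij})$$

The nested CV estimate of the expected accuracy for future data is:

$$nested(a, D) = \frac{1}{n} \sum_{i=1}^{n} acc(a, \hat{\theta}_{a,i}, tr_i, te_i)$$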

Figure 1 gives an implementation of the flat CV as a Python program and Figure 2 provides the corresponding implementation for the nested CV procedure where the following functions are assumed:

  • createCV(data,...) creates a list of pairs (train,test) from the data. Other parameters may include, for example, k, if a k-fold CV procedure is used, or the proportion of cases in the training set, if a hold-out procedure is used.

  • createGrid() creates the list of hyperparameter tuples to be tested, forming a regular grid.

  • classtrain(train,theta) returns the classifier trained on data train with hyperparameters set to theta.

  • accuracy(model,test) returns the accuracy (or any other quality measure) for the classifier model when run on data test.

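def flat(data, …):
    cv_lst = createCV(data, …)
    accmax = 0
    for theta in createGrid(…):
        acc = 0  # summed accuracy of theta over all folds
        for train, test in cv_lst:
            model = classtrain(train, theta)
            acc = acc + accuracy(model, test)
        if acc > accmax:
            accmax = acc
    return accmax / len(cv_lst)  # mean accuracy of the best theta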
Fig. 1: Implementation of the flat cross-validation procedure as a Python program.
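def nested(data, …):
    cv_lst = createCV(data, …)
    accfinal = 0
    for tr_o, te_o in cv_lst:
        accmax = 0  # best inner-CV score for this outer training set
        for theta in createGrid(…):
            cv_inner_lst = createCV(tr_o, …)
            acc = 0
            for tr_i, te_i in cv_inner_lst:
                model = classtrain(tr_i, theta)
                acc = acc + accuracy(model, te_i)
            if acc > accmax:
                accmax = acc
                thetamax = theta
        # refit on the whole outer training set with the tuned hyperparameters
        model2 = classtrain(tr_o, thetamax)
        accfinal = accfinal + accuracy(model2, te_o)
    return accfinal / len(cv_lst)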
Fig. 2: Implementation of the nested cross-validation procedure as a Python program.
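The two figures leave the helper functions unspecified. The following self-contained sketch fills them in with toy stand-ins so that both procedures can actually be run. Everything about the helpers is an assumption made for illustration: a one-dimensional k-nearest-neighbour classifier, a grid of four neighbour counts, plain interleaved 5-fold splits, and synthetic two-class Gaussian data. It is not the implementation used in the experiments.

```python
import random

def create_cv(data, k=5):
    """k-fold (train, test) pairs; data is a list of (x, label) tuples."""
    chunks = [data[c::k] for c in range(k)]
    return [([row for j, ch in enumerate(chunks) if j != c for row in ch], chunks[c])
            for c in range(k)]

def create_grid():
    """Hypothetical grid: number of neighbours for a 1-D k-NN classifier."""
    return [1, 3, 5, 7]

def class_train(train, theta):
    """'Training' k-NN only stores the data and the hyperparameter."""
    return (train, theta)

def accuracy(model, test):
    train, n_neighbours = model
    hits = 0
    for x, label in test:
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:n_neighbours]
        votes = sum(lab for _, lab in nearest)          # labels are 0 or 1
        predicted = 1 if 2 * votes > len(nearest) else 0
        hits += predicted == label
    return hits / len(test)

def cv_accuracy(data, theta, k=5):
    folds = create_cv(data, k)
    return sum(accuracy(class_train(tr, theta), te) for tr, te in folds) / len(folds)

def flat(data, k=5):
    """Best CV accuracy over the grid, plus the hyperparameter achieving it."""
    best_theta = max(create_grid(), key=lambda theta: cv_accuracy(data, theta, k))
    return cv_accuracy(data, best_theta, k), best_theta

def nested(data, k=5):
    """Outer-CV accuracy of 'tune by inner CV, then refit on the outer training set'."""
    outer = create_cv(data, k)
    total = 0.0
    for tr_o, te_o in outer:
        _, theta_i = flat(tr_o, k)      # inner tuning only ever sees tr_o
        total += accuracy(class_train(tr_o, theta_i), te_o)
    return total / len(outer)

# Synthetic data: two overlapping classes on the real line (an assumption).
rng = random.Random(0)
data = []
for _ in range(100):
    label = rng.randint(0, 1)
    data.append((rng.gauss(label, 1.0), label))

flat_estimate, theta_flat = flat(data)
nested_estimate = nested(data)
```

Here `nested` reuses `flat` for the inner tuning step, which makes explicit that nested CV evaluates the whole "tune, then fit" procedure rather than a single fixed hyperparameter value.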

Note that the nested CV procedure does not calculate a single best set of hyperparameter values; each training set of the outer cross-validation ($tr_i$) may select a different set of “optimal” hyperparameters. In practice, the hyperparameter values of the classifier used in operation are found by cross-validation using all available data. These are, of course, the hyperparameter values determined by the flat cross-validation procedure. Essentially nested cross-validation estimates the performance of the full procedure used to generate the final model, including hyperparameter tuning. This provides an unbiased estimator for choosing the best classifier system, but does not affect the operational hyperparameter values.

The research presented here evaluated the mean accuracy gain of the nested CV procedure over flat-CV, by estimating its value over 115 real-life datasets, for 12 different classification algorithms. We show that the expected accuracy gain is very small, and we argue that the gain is of negligible practical consequence for most applications. That is, in the majority of cases, either the selection of the flat and nested procedures coincide, or where they differ the algorithms selected by each approach are so close in terms of expected accuracy that this difference can be considered irrelevant, provided the algorithms have relatively few tunable hyperparameters (as this strongly influences the bias of the flat-CV procedure).

2 Data and Methods

2.1 Experimental procedure

In this section, we set out in general terms the experimental procedure followed by this research. We performed 6 repetitions of a 50% split of each data set into train and test subsets, each with the same proportion of patterns belonging to each class. For each dataset $D$, $T_{D,r}$ is the training subset for repetition $r$ and $G_{D,r}$ is the corresponding test subset. For each train set, $T_{D,r}$, we computed the expected accuracy using a 5-fold-within-5-fold nested-CV procedure ($nested(a, T_{D,r})$) and using a 5-fold flat-CV procedure ($flat(a, T_{D,r})$) for 12 different classification algorithms (the classification algorithms are discussed in Section 2.3). The flat-CV procedure also determines the best selection of hyperparameters ($\hat{\theta}_{a,D,r}$) for each algorithm $a$, for each $T_{D,r}$. (Following the nested cross-validation procedure, the selected model is re-trained on all of the available data, with 5-fold cross-validation based tuning of the hyperparameter values, which will of course give the same hyperparameter values as those already determined from the flat cross-validation trials.)

Let us define $a^{f}_{D,r}$ as the algorithm selected by the flat procedure on $T_{D,r}$, and $a^{n}_{D,r}$ as the algorithm selected by the nested procedure. The future accuracy of an algorithm $a$ on repetition $r$ for data set $D$ is the accuracy of the algorithm when trained on $T_{D,r}$ with the best hyperparameters selected by the flat procedure and tested on $G_{D,r}$. In particular we are interested in the future accuracy of the algorithms selected by the nested procedure, $a^{n}_{D,r}$, and by the flat procedure, $a^{f}_{D,r}$, and will define $fa^{n}_{D,r}$ as the future accuracy of the nested selection (for data set $D$ and round $r$); similarly $fa^{f}_{D,r}$ is the future accuracy of the flat selection. Formally:

$$fa^{n}_{D,r} = acc\left(a^{n}_{D,r},\ \hat{\theta}_{a^{n}_{D,r},D,r},\ T_{D,r},\ G_{D,r}\right)$$

$$fa^{f}_{D,r} = acc\left(a^{f}_{D,r},\ \hat{\theta}_{a^{f}_{D,r},D,r},\ T_{D,r},\ G_{D,r}\right)$$

The accuracy gain of using the nested procedure instead of the flat procedure is the difference between the future accuracy of the nested selection and the future accuracy of the flat selection,

$$ag_{D,r} = fa^{n}_{D,r} - fa^{f}_{D,r} \tag{8}$$

Finally, the accuracy gain of a data set is the average of the accuracy gains for the six rounds for that data set:

$$ag_{D} = \frac{1}{6} \sum_{r=1}^{6} ag_{D,r} \tag{9}$$
Since the nested procedure is considered the “more correct” one, it should select the “more correct” algorithm, and thus it is more likely that the future accuracy of the nested selection would be higher than that of the flat selection. Thus, in general one would expect a positive accuracy gain.
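The per-data-set accuracy gain described above (the mean, over repetitions, of the difference in future accuracy between the nested and flat selections) can be sketched as follows. The function name and the six accuracy values are made up for illustration and do not come from the study.

```python
def accuracy_gain(future_acc_nested, future_acc_flat):
    """Mean over repetitions of (nested-selection accuracy - flat-selection accuracy)."""
    if len(future_acc_nested) != len(future_acc_flat):
        raise ValueError("one value per repetition is required for each procedure")
    gains = [n - f for n, f in zip(future_acc_nested, future_acc_flat)]
    return sum(gains) / len(gains)

# Made-up future accuracies for six repetitions of one data set.
nested_sel = [0.90, 0.88, 0.91, 0.87, 0.90, 0.89]
flat_sel = [0.90, 0.86, 0.91, 0.88, 0.90, 0.89]
gain = accuracy_gain(nested_sel, flat_sel)
```

In this example the two procedures pick the same algorithm in four repetitions (gain 0), the nested selection is better once (+0.02) and worse once (-0.01), so the data-set gain is 0.01/6.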

To show that the less costly flat procedure achieves similar results (in future accuracy) to the nested procedure, we must show that the accuracy gains over all data sets are small. Unfortunately there is no standard way of showing that an “aggregated” accuracy gain is small. A null hypothesis test will only determine if the aggregated accuracy gain is significantly different from 0, but a) even if it is significantly different from 0, that difference may not be sufficiently large to be of practical significance, and b) if the accuracy gain is not significantly different from 0, that does not establish that it actually is small, unless the statistical power of the test is high.

Thus, to provide evidence that the accuracy gains are small, we define a threshold of irrelevance $\tau_{D}$ for each data set $D$, which is the change in accuracy one should consider as irrelevant or of no “practical significance”. Below we discuss our proposal for this threshold. Given $\tau_{D}$ we want to show that the accuracy gain $ag_{D}$ of the data set satisfies:

$$|ag_{D}| < \tau_{D}$$

We use the Wilcoxon signed rank test (a paired non-parametric test) to show that the median of the set $\{|ag_{D}|\}$ for all data sets $D$ is smaller and significantly different than the median of the set $\{\tau_{D}\}$.

We also report the mean and the 95% confidence interval of $|ag_{D}| - \tau_{D}$, so the reader may gain a sense of the magnitude of the differences. The confidence interval was calculated using bootstrap with 5000 rounds.
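A bootstrap confidence interval for a mean, with 5000 rounds, can be sketched as below. The text does not say which bootstrap variant was used, so the percentile method here is an assumption, as are the function name and the made-up differences. The Wilcoxon signed rank test is not reproduced in this sketch.

```python
import random

def bootstrap_mean_ci(values, n_rounds=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_rounds)
    )
    lo_idx = int((1 - level) / 2 * n_rounds)
    hi_idx = int((1 + level) / 2 * n_rounds) - 1
    return means[lo_idx], means[hi_idx]

# Made-up differences (accuracy gain minus threshold), one per data set.
diffs = [-0.010, -0.002, 0.001, -0.006, -0.004, -0.003, 0.000, -0.008]
low, high = bootstrap_mean_ci(diffs)
```

If the resulting interval lies entirely below zero, the mean difference is negative at the chosen confidence level, which is how the intervals in the results tables are read.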

The idea for a threshold of irrelevance is based on unavoidable errors in the accuracy estimate; unavoidable because they depend on random factors, such as the sampling of the data to form training and test sets. The threshold depends both on the data set and the algorithm. If the data set is small, one expects larger changes in accuracy when different train and test splits are used, or when comparing the estimated accuracy with the real accuracy on future, unseen data. If the algorithm overfits or underfits the data, one would also expect larger differences in accuracy under those different conditions.

Our proposal for the irrelevance threshold is based on the idea that the nested procedure estimate of the future accuracy is only an estimate of the actual generalisation performance. Differences between the estimate and the measured accuracy for some unseen data may indicate how sensitive the combination of data set and algorithm is to these unavoidable variations. We define $\delta_{a,D,r}$ as the difference between the nested estimate of future accuracy and the measured future accuracy for a particular algorithm $a$, data set $D$ and repetition $r$. Formally:

$$\delta_{a,D,r} = nested(a, T_{D,r}) - acc\left(a,\ \hat{\theta}_{a,D,r},\ T_{D,r},\ G_{D,r}\right)$$

The threshold of irrelevance for a data set $D$ and round $r$, $\tau_{D,r}$, is the minimum between $|\delta_{a^{n}_{D,r},D,r}|$ and $|\delta_{a^{f}_{D,r},D,r}|$:

$$\tau_{D,r} = \min\left( |\delta_{a^{n}_{D,r},D,r}|,\ |\delta_{a^{f}_{D,r},D,r}| \right) \tag{12}$$

The idea is that the threshold of irrelevance for a data set and a round is the smallest of the errors between estimated and measured future accuracy for the two “important/best” algorithms for that data set and for that round, $a^{n}_{D,r}$ and $a^{f}_{D,r}$. The reason to take the minimum is to achieve a more restrictive definition of irrelevance.

The final threshold for the data set $D$ is the average of $\tau_{D,r}$ for all repetitions:

$$\tau_{D} = \frac{1}{6} \sum_{r=1}^{6} \tau_{D,r} \tag{13}$$
Finally, it is interesting to understand the role of the repetition in this experimental procedure. Repetitions are seen as different experiments to compute the accuracy gain of the nested procedure versus the flat procedure. Each repetition may select different algorithms in the nested and in the flat procedures. The goal of the experiment/repetition is to compute the accuracy gain (Equation 8) and the irrelevance threshold (Equation 12). Only then are the accuracy gain and irrelevance thresholds aggregated across repetitions on the same data set (Equations 9 and 13).
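The threshold of irrelevance for one data set can be sketched as follows: for each repetition, take the smaller of the two gaps between the nested-CV estimate and the measured future accuracy (one gap for the nested-selected algorithm, one for the flat-selected algorithm), then average over repetitions. Using absolute gaps is an assumption of this sketch, and the function name and numbers are made up for illustration.

```python
def irrelevance_threshold(nested_est_n, future_acc_n, nested_est_f, future_acc_f):
    """Mean over repetitions of the smaller absolute estimate-vs-measured gap.

    Arguments are per-repetition lists: nested-CV estimate and measured future
    accuracy for the nested-selected (n) and flat-selected (f) algorithms.
    """
    per_round = [
        min(abs(en - an), abs(ef - af))
        for en, an, ef, af in zip(nested_est_n, future_acc_n,
                                  nested_est_f, future_acc_f)
    ]
    return sum(per_round) / len(per_round)

# Two made-up repetitions: the gaps are (0.03, 0.02) and (0.01, 0.03).
tau = irrelevance_threshold(
    nested_est_n=[0.90, 0.88], future_acc_n=[0.87, 0.89],
    nested_est_f=[0.91, 0.88], future_acc_f=[0.89, 0.85],
)
```

The per-repetition thresholds in this example are 0.02 and 0.01, giving a data-set threshold of 0.015.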

This form of analysis is inspired by the nested cross-validation procedure, which only aggregates the data on the different folds/hold-out subsets to compute the final measure of interest, the expected accuracy. The two measures of interest in this research are the accuracy gain and the threshold of irrelevance, and only at that level are the results averaged across repetitions. Appendix A discusses different ways of using the repetitions and presents the corresponding results. Appendix A also presents a different definition of the threshold of irrelevance for a data set, and the corresponding results.

2.2 Scenarios

In this paper we are interested in answering two questions regarding the nested and flat procedures. The first question is whether one needs to use a nested procedure to select the best among three very good algorithms for classification: random forest (rf), SVM with RBF kernel (svmRadial), and gradient boosting machine (gbm). There is some independent evidence to suggest that these three algorithms are likely the best classification algorithms in general. Fernández-Delgado et al. [9] do not test gradient boosting machines, and find that random forest and SVM with RBF kernel are the two best families of algorithms. Wainer [10] does test gradient boosting machines, and finds that those three form the best three families of classification algorithms. As we will discuss in Section 3, this research finds that random forest is the algorithm with lowest mean rank, followed by the SVM with an RBF kernel, followed by gradient boosting machines. Thus, practitioners who have limited time to select the best classification algorithm should restrict themselves to these three algorithms. The first question we will address is whether, when selecting among rf, svmRadial, and gbm, one can avoid the nested procedure and use a flat procedure instead. In this scenario, called top3, we restrict the analysis to only those three algorithms.

The second question is whether the nested procedure is necessary when a wider range of classifiers are being compared. In this case, we tested 12 different families of classifiers (the algorithms are discussed in Section 2.3). We call this the full scenario.

2.3 Datasets and classification algorithms

We used the suite of data sets collected from the UCI public repository and processed by Fernández-Delgado et al. [9] and further processed by Wainer [10], such that all data sets are binary classification tasks. For the 9 datasets with more than 10000 data points, we applied the procedures (nested and flat CV) on only a random subset of 5000 data points (from each subset). For each subset, we applied 12 different classification algorithms. The algorithms and their abbreviations are as follows:

  • bst: A boosting of linear classifiers.

  • gbm: Gradient boosting machines, a boosting of short decision trees [5].

  • knn: The k-nearest neighbours classifier [1].

  • lvq: Learning vector quantization.

  • nb: The naïve Bayes classifier [4].

  • nnet: A 1-hidden layer neural network with sigmoid transfer function [3].

  • rf: Random forest, a bagging of decision trees [12].

  • rknn: A bagging of knn classifiers on a random subset of the original features.

  • sda: An L1 regularized linear discriminant classifier [13].

  • svmLinear: An SVM with linear kernel [2].

  • svmPoly: An SVM with polynomial kernel [2].

  • svmRadial: An SVM with RBF kernel [2].

Details of the particular implementations of these algorithms and of the hyperparameter search grid are described in [10].

2.4 Reproducibility

The data sets used in the paper are available at as described in [10]. The program to run the different procedures and the different classifiers, the results of the multiple runs, and the R program to perform the statistical analysis described in this paper are available at .

3 Results

Table I lists the mean rankings of the algorithms, according to the nested CV estimate of their accuracies, over all repetitions and over all data sets.

algorithm mean rank
rf 3.4
svmRadial 3.6
gbm 4.0
nnet 4.8
rknn 5.3
svmPoly 5.3
knn 5.4
svmLinear 6.1
sda 6.6
lvq 6.7
nb 7.9
bst 8.7
TABLE I: Ranking of the algorithms based on the mean rank for each repetition.
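A mean-rank table of this kind can be computed as in the sketch below: within each run (one repetition of one data set) the algorithms are ranked by estimated accuracy, with rank 1 for the highest, and the ranks are then averaged over runs. How ties were handled in the study is not stated; giving tied algorithms the average of their positions is an assumption here, and the function names and accuracy values are made up.

```python
def ranks_desc(scores):
    """Rank algorithms by score, 1 = best; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the block of algorithms tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # average of the 1-based positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = avg
        i = j + 1
    return ranks

def mean_ranks(runs):
    """runs: list of {algorithm: estimated accuracy}; returns {algorithm: mean rank}."""
    totals = {}
    for scores in runs:
        for name, rank in ranks_desc(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(runs) for name, total in totals.items()}

# Made-up estimates for two runs and three algorithms.
runs = [
    {"rf": 0.91, "svmRadial": 0.90, "gbm": 0.88},
    {"rf": 0.85, "svmRadial": 0.87, "gbm": 0.85},
]
result = mean_ranks(runs)
```

In the second run, rf and gbm tie for positions 2 and 3 and therefore each receive rank 2.5.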

The results of the top 3 agree with the order in [10].

3.1 Results for the top 3 and full scenarios

Figure 3 displays the accuracy gain and the thresholds of irrelevance for the top3 scenario (random forest, SVM with RBF kernel, and gradient boosting machines). The first panel shows the distributions of the absolute value of the accuracy gain and of the data set thresholds of irrelevance. The second panel plots each measure of accuracy gain (on the vertical axis) against the corresponding threshold of irrelevance (on the horizontal axis). Notice that most points lie below the line, which shows that in most cases the threshold of irrelevance is higher than the corresponding accuracy gain. Figure 4 displays the corresponding distributions and comparisons of the accuracy gain and threshold of irrelevance for the full scenario.

Table II displays the results of the statistical analysis for the top 3 and full scenarios. “Same choice” is the proportion of times the algorithm selected using flat CV agreed with that selected using nested CV. The column “p.value” is the p-value of the one-sided Wilcoxon signed rank test between the accuracy gain and the irrelevance threshold. The “mean” column is the mean of the difference between the accuracy gain and the irrelevance threshold, and it is negative as expected. The “low CI” and “high CI” columns are the lower and upper limits of the 95% confidence interval for the mean.

For the top 3 scenario, the flat procedure selects the same algorithm as the nested procedure in 71% of the cases (a random choice would give a figure of 33%). The p-value is below 0.05, which shows that the accuracy gains for the nested procedure are significantly smaller than the corresponding thresholds of irrelevance. A similar conclusion can be reached by inspecting the confidence interval of the mean, which does not include 0. Therefore, using either the p-value or the confidence interval, one can claim that the accuracy gain is statistically significantly less than the corresponding irrelevance threshold (at the 5% significance level). This supports our claim that there is no practical difference on average between using the nested or the flat procedure to select among random forest, SVM with RBF kernel, and gradient boosting machines. For the full scenario, the agreement rate between flat and nested is 62% (against 8% if the decision were random); again the p-value is below 0.05 and the confidence interval does not include 0. Therefore, again, one can be confident that on average the accuracy gain is below the corresponding threshold of irrelevance.

scenario   same choice   p.value   mean     low CI   high CI
top 3      71%           0.001     -0.004   -0.007   -0.002
full       62%           3.6e-06   -0.004   -0.007   -0.003
TABLE II: The results for the selection of the top 3 and full scenarios. The column “Same choice” is the proportion of times the selection using flat CV agreed with the selection using nested CV. “p.value” is the p-value of the one-sided Wilcoxon signed rank test of the accuracy gain and the corresponding threshold. “Mean” is the mean value of the difference between the accuracy gain and the irrelevance threshold, and “low CI” and “high CI” are the limits of the 95% confidence interval of that mean.
Fig. 3: The distribution and comparison of accuracy gain and irrelevance threshold - top 3 scenario
Fig. 4: The distribution and comparison of accuracy gain and irrelevance threshold - full scenario

4 Discussion

This paper makes two claims:

  • Nested CV procedures are probably not needed when selecting from among random forest, SVM with Gaussian kernel, and gradient boosting machines (which are on average the three best classification algorithms for the suite of data sets used in this research).

  • Nested CV procedures are probably not needed when selecting from among any set of classifier algorithms (provided they have only a limited number of hyper-parameters that must be tuned as we will discuss below).

The first claim was explicitly tested on 115 data sets and thus, to generalize it, the reader must believe that the 115 datasets are an unbiased sample of the data sets a practitioner will face in the future. We discuss the limits of such generalization below. The second claim carries another risk for generalization, namely that the full set of 12 classifiers is a good sample of the sets of classifiers that will be considered in future applications.

Wainer [10] discusses some of the limits to the generalization of conclusions obtained from the set of 115 data sets to any future data set and we briefly summarise them here. The data sets tested in this research were only of medium size (up to 100,000 data points), only binary data sets were used, and none of them are derived from text classification problems (with high dimensionality and high sparsity). It is not immediately obvious how the number of dimensions, sparsity, or the fact that there are more than two classes could have a substantial impact on the claims made in this research. Data set size could be an issue, however, as the bias introduced by the flat cross-validation procedure generally decreases as the size of the dataset increases.

Table III reports the statistical tests for the two scenarios, only for the 32 data sets with 2000 data points or more. For the larger data sets only, the strength of the evidence (the p-value or the range of the CI) in favour of the practical equivalence of the nested and flat procedures diminishes, as expected, given that there are fewer datasets/measures used in the significance test. However, the effect size (the mean of the difference between the accuracy gain and the threshold of irrelevance) remains very similar to that in Table II, and the proportion of cases with the same choice for both procedures increases for the larger data sets. We believe that one can safely make the claim of the practical equivalence of the nested and flat procedures even for larger datasets than have been tested in this research.

scenario   same choice   p.value   mean     low CI   high CI
top 3      80%           0.0016    -0.005   -0.014   -0.002
full       71%           0.0108    -0.004   -0.015   -0.001
TABLE III: The results for 32 data sets with at least 2000 data points.

The second generalization, that our analysis of all 12 classification algorithms is a sample of any selection choice the practitioner will face in the future, has at least one severe limit. All algorithms tested in this research had a small number of hyperparameters, from 1 to at most 3 (for the gbm). Cawley and Talbot [8] show an interesting example of an LS-SVM classifier with an ARD (automatic relevance determination) kernel, in which each original data dimension has its own hyperparameter (of an RBF kernel). In principle, an LS-SVM with ARD subsumes the standard RBF LS-SVM, and thus it should not have an expected error higher than the classical RBF LS-SVM. But Cawley and Talbot [8] show that although the LS-SVM with ARD kernel achieves a lower expected error when using the flat CV estimate of the error, when using the nested procedure the classical RBF LS-SVM has a statistically significantly lower error in 7 out of 13 data sets tested, while the ARD LS-SVM is statistically better in only one of those 13 data sets [8, Table 2]. In this case, because the ARD LS-SVM has so many more hyperparameters than the RBF LS-SVM, the flat procedure will likely overfit the data. Thus, in the case of algorithms with very different numbers of hyperparameters (such as ARD based algorithms or deep networks) we have less confidence in the practical equivalence results between the nested and flat procedures.

Appendix A shows that the conclusions reached by this paper do not strongly depend on the method of analysis: two other methods of analysis result in the same conclusions. Appendix A also shows that the results remain even when a different definition of the threshold of irrelevance is used. Appendix B shows that one should not go a step further and skip the selection of the algorithm altogether: in that case the mean accuracy gain is significantly larger than the threshold of irrelevance.

The results in this paper are only applicable to practitioners, that is, to users whose goal is to select the likely best classification algorithm to solve a particular problem. Our results cannot be applied by a scientist whose goal is to provide evidence that one classification algorithm is better than another. Our claim of practical equivalence applies only to the best ranked algorithm for both procedures; we do not claim that the two procedures agree significantly on the full ranking. For example, Table IV lists the rank of the 12 algorithms when using the flat procedure estimate to order them. The table should be compared to Table I. The order of the algorithms is very different; in particular, using the flat estimate, the gbm would be classified as the best algorithm, while using the nested CV estimate it is ranked third. Given that the gbm has one hyperparameter more than svmRadial or rf, we believe that this improvement in the ranking could be due to the overfitting in model selection described above [8].

algorithm    mean flat-CV rank
gbm          3.0
svmRadial    3.2
rf           4.0
nnet         4.1
rknn         4.2
svmPoly      5.2
knn          5.3
lvq          6.4
svmLinear    6.4
sda          7.0
nb           8.4
bst          8.6
TABLE IV: Ranking of the algorithms based on the mean rank for each subset ordered by the flat CV estimate of the expected error.
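As a rough illustration of how a mean-rank table of this kind can be computed, the sketch below ranks algorithms within each subset by an accuracy estimate and then averages the ranks. The accuracy values are made up for the example, and giving tied algorithms the average of their ranks is an assumption, not something stated in the paper.

```python
from statistics import mean

def rank_descending(scores):
    """Rank algorithms by score (1 = highest); tied algorithms share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of algorithms tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def mean_ranks(per_subset_scores):
    """Average each algorithm's rank over all subsets."""
    per_subset_ranks = [rank_descending(s) for s in per_subset_scores]
    algorithms = per_subset_scores[0].keys()
    return {a: mean(r[a] for r in per_subset_ranks) for a in algorithms}

# Hypothetical flat CV accuracy estimates for three subsets (not the paper's data).
flat_cv = [
    {"gbm": 0.91, "svmRadial": 0.90, "rf": 0.88},
    {"gbm": 0.84, "svmRadial": 0.86, "rf": 0.84},
    {"gbm": 0.79, "svmRadial": 0.75, "rf": 0.77},
]
table = mean_ranks(flat_cv)
for name, rank in sorted(table.items(), key=lambda item: item[1]):
    print(f"{name:10s} {rank:.1f}")
```

Replacing the flat CV estimates with nested CV estimates in the same function would give the counterpart ranking to compare against.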

5 Conclusion

There is very strong evidence that when selecting among a random forest, an SVM with Gaussian kernel, and a gradient boosting machine (the three best algorithms on average for the 115 real life datasets tested) one can generally use the flat cross-validation procedure both to search for the best hyperparameters and to select the best algorithm itself. Our analysis shows that the algorithm selected by the flat procedure will, on average, perform as well as the one that would be selected by the nested cross-validation procedure, for most practical purposes. There is also some indication that the conclusions remain valid even for data sets larger than the ones tested.

There is also strong evidence that in any selection process, regardless of the algorithms that are being selected, provided they all have a low number of hyperparameters, one can use the flat cross-validation procedure to simultaneously select the algorithm and the hyperparameters, and again, for all practical purposes, that algorithm would perform as well as the algorithm selected using nested cross-validation.
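A minimal sketch of the flat procedure follows, using only the Python standard library. It is an illustration under stated assumptions rather than the experimental setup of this paper: the two toy learners (`fit_knn`, `fit_stump`), their hyperparameter grids, the synthetic one-dimensional data, and the 5-fold split are all hypothetical.

```python
import random

def fit_knn(train, k):
    """Toy k-nearest-neighbour classifier on 1-D points (x, label)."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return 1 if 2 * sum(y for _, y in nearest) > len(nearest) else 0
    return predict

def fit_stump(train, shift):
    """Toy threshold classifier: midpoint of the class means, moved by `shift`."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2 + shift
    return lambda x: 1 if x > threshold else 0

def cv_accuracy(data, fit, param, n_folds=5):
    """Mean accuracy over a single n_folds cross-validation."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    accs = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        predict = fit(train, param)
        accs.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accs) / n_folds

def flat_select(data, candidates):
    """One CV pass picks the hyperparameter AND the algorithm with the best estimate."""
    results = [
        (cv_accuracy(data, fit, param), name, param)
        for name, (fit, grid) in candidates.items()
        for param in grid
    ]
    return max(results, key=lambda r: r[0])

rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(30)]
data += [(rng.gauss(1.5, 1.0), 1) for _ in range(30)]
rng.shuffle(data)

candidates = {
    "knn": (fit_knn, [1, 3, 5, 7]),
    "stump": (fit_stump, [-0.25, 0.0, 0.25]),
}
best_acc, best_algo, best_param = flat_select(data, candidates)
print(best_algo, best_param, round(best_acc, 3))
```

The value `best_acc` is the optimistically biased flat estimate. A nested procedure would instead run the hyperparameter search inside each training fold of an outer cross-validation, at a correspondingly higher computational cost.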


References

  • [1] B. V. Dasarathy (Ed.), Nearest Neighbour (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, 1991.
  • [2] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
  • [3] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
  • [4] D. J. Hand, K. Yu, Idiot’s Bayes — not so stupid after all?, International Statistical Review 69 (3) (2001) 385–398.
  • [5] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics (2001) 1189–1232.
  • [6] D. H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Computation 8 (7) (1996) 1341–1390.
  • [7] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society. Series B (Methodological) (1974) 111–147.
  • [8] G. Cawley, N. Talbot, On over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research 11 (2010) 2079–2107.
  • [9] M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems?, Journal of Machine Learning Research 15 (2014) 3133–3181.
  • [10] J. Wainer, Comparison of 14 different families of classification algorithms on 115 binary datasets, ArXiv e-prints 1606.00930.
  • [11] T. Kohonen, Learning vector quantization, Springer, 1995.
  • [12] T. K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832–844.
  • [13] M. Ahdesmäki, K. Strimmer, et al., Feature selection in omics prediction problems using CAT scores and false nondiscovery rate control, The Annals of Applied Statistics 4 (1) (2010) 503–519.

Appendix A Other analysis methods and irrelevance threshold

As discussed, the analysis method in this paper assumes that each repetition is an independent experiment, and the repetitions are only aggregated at the last step, to compute the accuracy gain and the threshold of irrelevance for a data set. But there are some alternatives to that analysis method. The first alternative is to consider the repetitions as a way of obtaining multiple estimates for each of the accuracy measures. Thus, all measured accuracies are first averaged across the six repetitions and only then used in the procedure, that is:

The flat and nested selections for each data set ( and ) would be selected using and (in contrast to the method used which selects and for each repetition). Then equations 7 and 8 would be


Similarly, the thresholds of irrelevance are not defined for each repetition but only for each data set:

The second alternative is to consider each repetition as an independent experiment, on a par with the data sets themselves. The results for each data set are only aggregated at the last level, when considering the p-value of the Wilcoxon test that compares with 0. In this second alternative, we would perform the Wilcoxon test to compare to 0.
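The difference between selecting within each repetition (the method used in the paper) and selecting once after averaging across repetitions (the first alternative) can be sketched as follows. The accuracy estimates are invented for one hypothetical data set with six repetitions, and the function names are placeholders.

```python
from statistics import mean

def select(estimates):
    """Algorithm with the highest estimated accuracy."""
    return max(estimates, key=estimates.get)

def per_repetition_choices(repetitions):
    """Method used in the paper: one selection per repetition."""
    return [select(rep) for rep in repetitions]

def averaged_choice(repetitions):
    """First alternative: average each estimate over the repetitions, then select once."""
    algorithms = repetitions[0].keys()
    averaged = {a: mean(rep[a] for rep in repetitions) for a in algorithms}
    return select(averaged)

# Hypothetical flat CV estimates for one data set, six repetitions.
flat_estimates = [
    {"rf": 0.90, "svmRadial": 0.89, "gbm": 0.88},
    {"rf": 0.87, "svmRadial": 0.90, "gbm": 0.89},
    {"rf": 0.88, "svmRadial": 0.89, "gbm": 0.90},
    {"rf": 0.86, "svmRadial": 0.91, "gbm": 0.90},
    {"rf": 0.89, "svmRadial": 0.88, "gbm": 0.90},
    {"rf": 0.88, "svmRadial": 0.90, "gbm": 0.89},
]

choices = per_repetition_choices(flat_estimates)
single = averaged_choice(flat_estimates)
print("per repetition:", choices)
print("after averaging:", single)
```

In this made-up example the per-repetition selections vary between algorithms, while averaging first yields a single selection per data set. The same two functions would be applied to the nested estimates to obtain the nested selections.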

Finally, there is another measure that can play the role of the irrelevance threshold. When we discussed the irrelevance threshold we mentioned unavoidable error or variance, and we chose the mean difference between the nested estimate of accuracy and the true measure of accuracy on the test set. But the standard deviation is a common way of measuring error, so we could use it instead of the mean difference of two accuracies. There are three measures of accuracy for each repetition:

, and . We define the threshold for each data set as the smallest of the three standard deviations of measured accuracies across the six repetitions:

The results of both the first and second alternative analysis methods are reported in Table V. The third block of results uses the minimum of the standard deviations as the threshold of irrelevance (with the paper's original method of analysis).
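As a small illustration of the standard-deviation threshold, the sketch below takes three series of six accuracies each (one value per repetition) and returns the smallest of the three standard deviations. The series names and values are placeholders, and the use of the sample standard deviation (divisor n − 1) is an assumption.

```python
from statistics import stdev

def sd_threshold(*accuracy_series):
    """Smallest sample standard deviation among the given accuracy series."""
    return min(stdev(series) for series in accuracy_series)

# Hypothetical accuracies for one data set, six repetitions per measure.
measure_a = [0.90, 0.88, 0.91, 0.87, 0.89, 0.92]
measure_b = [0.89, 0.89, 0.90, 0.88, 0.89, 0.90]
measure_c = [0.86, 0.90, 0.88, 0.91, 0.85, 0.89]

threshold = sd_threshold(measure_a, measure_b, measure_c)
print(f"threshold of irrelevance for this data set: {threshold:.4f}")
```

Taking the minimum makes the threshold conservative: an accuracy gain has to be smaller than the least variable of the three measures to count as irrelevant.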

scenario    same choice    p.value     mean      low CI    high CI
first alternative analysis
top 3       78%            6.32e-06    -0.005    -0.015    -0.002
full        72%            0.0005      -0.005    -0.015    -0.002
second alternative analysis
top 3       78%            3.05e-13    -0.002    -0.002    -0.001
full        72%            2.67e-08    -0.001    -0.002    -0.001
other definition of irrelevance
top 3       80%            5.94e-06    -0.002    -0.003    -0.001
full        71%            1e-06       -0.002    -0.003    -0.001
TABLE V: Results for the two alternative analysis methods and the alternative definition of the threshold of irrelevance

The two different methods of analysis and the other definition of the irrelevance threshold yield results that are consistent with the claims of the paper.

Appendix B Should one select the algorithm at all?

Given that our research shows an unexpected result, that flat CV is acceptable as a method to select classification algorithms, contrary to the common practice in Machine Learning, we decided to explore another unexpected possibility: whether the selection of algorithms is really necessary, or whether one should just use random forests, which was the best ranked algorithm in the experiments. We compared the decision of using only rf against the nested procedure. The results are in Table VI.

scenario    same choice    p.value    mean     low CI    high CI
full        28%            1          0.011    0.006     0.024
TABLE VI: Results for only choosing random forest as the classifier against selecting the best using the nested procedure.

The results show that the accuracy gain is certainly above the threshold of irrelevance, and thus selecting the algorithm results in an expected accuracy gain of practical consequence. One cannot assume that random forest is likely to result in a classifier within the irrelevance threshold of the best option, and thus avoid the search altogether.
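The comparison in this appendix can be sketched as follows. The sketch takes the accuracy gain to be the test accuracy of the nested selection minus that of the fixed choice, which is an assumption about the definition given in the main text. The record layout and all numbers are hypothetical.

```python
def compare_to_fixed(records, fixed="rf"):
    """Return (same-choice rate, mean accuracy gain of nested selection over `fixed`)."""
    same = [r["nested_choice"] == fixed for r in records]
    gains = [r["test_acc"][r["nested_choice"]] - r["test_acc"][fixed] for r in records]
    return sum(same) / len(same), sum(gains) / len(gains)

# Hypothetical results for four data sets (not the paper's data).
records = [
    {"nested_choice": "rf",        "test_acc": {"rf": 0.90, "svmRadial": 0.89, "gbm": 0.88}},
    {"nested_choice": "svmRadial", "test_acc": {"rf": 0.85, "svmRadial": 0.88, "gbm": 0.86}},
    {"nested_choice": "gbm",       "test_acc": {"rf": 0.80, "svmRadial": 0.79, "gbm": 0.82}},
    {"nested_choice": "svmRadial", "test_acc": {"rf": 0.92, "svmRadial": 0.91, "gbm": 0.90}},
]

same_rate, mean_gain = compare_to_fixed(records)
print(f"same choice: {same_rate:.0%}, mean accuracy gain: {mean_gain:.3f}")
```

The mean gain computed this way would then be compared with the threshold of irrelevance for the same data sets.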