Fernández-Delgado et al. (2014)
evaluated 179 different implementations of classification algorithms (from 17 different families of algoritms) on 121 public datasets. We believe that such highly empirical research are very important both for researchers in machine learning and specially for practitioners. For researchers, this form of research allows them to focus their efforts on more likely useful endeavors. For example, if a researcher is interested in developing algorithms for very large classification problems, it is probably more useful to develop a big-data random forest (which is the family of classification algoritms with best performance according toFernández-Delgado et al. (2014)
) than to do it for Bayesian networks (mainly naive Bayes) or even Nearest Neighbors methods, which perform worse than random forests.
For practitioners, this form of research is even more important. Practitioners in machine learning will have limited resources, time, and expertise to test many different classification algorithms on their problem, and this form of research will allow them to focus on the most likely useful algorithms.
Despite its importance and breath, we believe that Fernández-Delgado et al. (2014) had some “imperfections” which we address in this research. The “imperfections” are discussed below as the extensions we carried in this paper:
We used the same set of datasets, but we transform them so that the problems are all binary. Many classification algorithms are defined only to binary problems, for example the SVM family, logistic regression, among others. Of course there are meta-extensions of such algorithms to multi-class problems, for example, one-vs-one, one-vs-all, error correcting output coding (ECOC) (Dietterich and Bakiri, 1995), stacked generalization (Wolpert, 1992), pairwise coupling (Hastie et al., 1998), among others. Also, there are alternative formulations for specific binary classifiers to deal with multiclass, for example, Franc et al. (2002) for SVM, Engel (1988) for logistic regression, and so on.
For the algorithms that are intrinsically binary, the application to multiclass problems poses two problems. The first is that one has to decide on which meta-extension to use, or if one should use the specific multiclass formulation of the algorithm. In some cases, one meta-extension is implemented by default, and the user must be aware that this decision was already made for him/her. For example, the libSVM default approach is one-vs-one. But a second, more subtle problem is the search for hyperparameters: it is very common that each combination of hyperparameters are tested only once for all classifiers in the one-vs-one solution. That is, allclassifiers in a one-vs-one solution has the same set of hyperparameters, and that may cause a decrease in accuracy in comparison to the case in which classifier is allow to choose its one set of hyperparameter. Thus, on multiclass problems, those intrinsically binary algorithms could be facing many disadvantages.
On the issue of binary classifiers, Fernández-Delgado et al. (2014) did not include in their comparisons the gradient boosting machine (gbm) algorithm, considered a very competitive algorithm for classification problems because, as reported in Zajac (2016), the implementation did not work in multiclass problems. We included gbm in our comparison.
We reduced the number of classifiers to only a few classes/algorithms and not different implementations of the same algorithm. Fernández-Delgado et al. (2014) compared an impressing 179 different classification programs, but it was unclear how many were just different implementations of the same algorithm, and how many were variations within the same “family” of algorithms. We believe that for practitioner and research communities, it is more useful to have an understanding of how different families of algorithms rank in relation to each other. For the practitioner, which should have more limited access to the different algorithms, this knowledge allow them to order which algorithms should be applied first to their particular problem.
In fact, Fernández-Delgado et al. (2014) also perform an analysis of their results based on the algorithm’s “family”, but they have difficulty of extracting useful information from this analysis, since in most cases, different “implementations” in the same family have widely different results. In one analysis, they rank the families by the worse performing member, which does not provide useful information. But in the end, their conclusions are mainly centered on the families of classifiers, because that is the most useful level of knowledge. From the abstract of the paper:
A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
We performed a more careful search for hyperparameters for each classification algorithm. Given Fernández-Delgado et al. (2014)
daunting task of testing 179 programs, they had to rely on default values for the hyperparameters which may lead to suboptimal results. Since we are working with significantly fewer algorithms, we could spend more time selecting reasonable ranges of values for the hyperparameters. In particular we tried to limit the number of training steps for the hyperparameter search in 24 so that no algorithm would have an advantage of having more degrees of freedom to adjust to the data (but in section2.4 we discuss that some algorithms may allow testing many values of the hyperparameter with just one training and testing step).
Besides computing when two algorithms are significantly different in their error rates, in the standard null-hypothesis significance tests (NHST), we also use a Bayesian analysis to discover when the differences of two algorithms has no “practical consequences.” As far as we know, this is the first time a Bayesian ANOVA is used to compare classifiers, which we believe is an important direction in analysis of results in empirical machine learning. But more significantly is the use of “limits of practical significance”, that is, the definition of thresholds below which the differences are irrelevant from practical purposes, which goes beyond an more standard “significance test” analysis currently used in the machine learning literature.
Fernández-Delgado et al. (2014) follow the standard null hypothesis significant test in analyzing their result, but even within this framework, their analysis is not as rigorous as it should have been. The NIST standard for comparing many classifiers across different datasets was proposed by Demsar (2006) and it is discussed in section 3.1. In particular, when testing the different algorithms for statistical significant differences, the Demsar procedure requires one to use the Nemenyi test, which is a nonparametric test that performs the appropriate multiple comparison p-value correction. But Fernández-Delgado et al. (2014)
used a paired t-test (a parametric test) between the first ranked and the following 9 top ranked algorithms, apparently without multiple comparions corrections. Very likely, given the large number of comparisons need to contrast all 179 algorithms, very few, if any of the pairwise comparisons would have been flagged as significant by the Nemenyi test.
In this paper we followed the full Demsar procedure to analyse the results, but we also used the Bayesian ANOVA to verify when the differences between the algoritms is not only “statistically significant”, but also “of practical significance”.
Fernández-Delgado et al. (2014) first two ranked programs are two different implementation of the same algorithm - parallel implementation of random forest (first ranked) and a non-parallel implementation (second ranked). The authors state that in some sense, the difference between the two results should not be “important” (they use the term “significant”) and indeed the statistical significance analysis shows that the difference was not statistically significant, but neither was the next 6 ranked algorithms (in comparison to the top performing).
We studied the computational costs of running a “standard” implementation of the different algorithms. With information regarding the computational cost a practitioner may want to balance execution time and expected accuracy. Furthermore, this information may encourage the practitioner choose other implementations that the ones tested, and researcher to develop faster implementations of the best ranked algorithms.
2 Data and Methods
2.1 Experimental procedure
In general terms the experimental procedure followed by this research is the following.
Each dataset (the datasets are discussed in section 2.2) is divided in half into two subsets and , each with the same proportion of classes. For each subset , we used a a 5-fold cross validation procedure to select the best set of hyperparameters for the procedure , (). Then we trained the whole subset using the procedure with hyperparameters and computed the error rate on the subset , (where and ). We call the error rate of algorithm when learning on the subset , with hyperparameters , when testing on the subset as .
We started with the 121 datasets collected from the UCI site, processed, and converted by the authors of Fernández-Delgado et al. (2014) into a unified format. The datasets is derived from the 165 available at UCI website in March 2013, where 56 were excluded by the authors of Fernández-Delgado et al. (2014). For the remaining 121, Fernández-Delgado et al. (2014)
We downloaded the datasets preprocessed by the authors of Fernández-Delgado et al. (2014) in November 2014. We performed the following transformations:
65 of the datasets were multiclass problems. We converted them to a binary classification problem by ordering the classes by their names, and alternatively assigning the original class to the positive and negative classes. The datasets have different proportions between the positive and negative classes, approximately normally distributed with mean 0.6 and standard deviation of 0.17.
19 of the datasets were separated into different training and test sets and on the November 2014 data, the test set was not standardized. We standardized the test set (separately from the train set) and joined both sets to create a single dataset.
we removed the 6 datasets with less than 100 datapoints,
the 9 datasets with more than 10.000 datapoints (more data points for subset) we searched the hyperparameters on a subset of 5000 datapoints. The final training was done with the whole subset, and the testing with the other full subset.
Thus, in this research we tested the algorithms in 121 - 6 (very small datasets removed) = 115 datasets.
Details of each dataset are described in A.
2.3 Classification algorithms
We used 14 classification algorithms from very different families. As discussed in the Introduction, we argued that one of the possible criticisms to the Fernández-Delgado et al. (2014) paper is that the authors do not distinguish different algorithms from different implementations of that algorithm.
Although we do not have a clear or formal definition of what are “families of algorithms” we believe that we have a sufficiently diverse collection of algorithms.
We chose not to use algorithms that do not require hyperparameters, such as Linear Discriminant Analysis (LDA) and logistic regression. Hyperparameters allow the algorithm to better adjust to the data details, and so we believe LDA and logistic regressions would be in disadvantages to the other algorithm. We added regularized versions of these algorithms, which contain at least one hyperparameters.
Thus, glmnet (L1 and L2 regularized logistic regression (Zou and Hastie, 2005)) and sda a L1-regularized LDA are two mainly linear algorithms. We would also add bst a boosting of linear models.
Among the distance based classifiers, knn is the usual k-nearest neighbor, and rknn is a bagging of knn, where each base learner is a knn on a random sample of the original features. lvq, or learning vector quantization (Kohonen, 1995) is a cluster plus distance, or dictionary based classification: a set of “prototypes,” or clusters, or “codebook” is discovered in the data, many for each class, and new data is classified based on the distance to these prototypes.
Neural network based classifiers include nnet a common 1-hidden layer logistic network, and elm or extreme learning machines (Huang et al., 2006). Extreme learning machines are a 1-hidden layer neural network, where the weights from the input to the hidden layer are set at random, and only the second set of weights are learned (usually via least square optimization).
We only tested the naive Bayes nb as a Bayesian network based algorithm.
We divided the SVM family into the linear SVM (svmLinear), the polynomial SVM (svmPoly) and the RBF SVM (svmRadial).
Finally we included one implementation of random forests (rf) and one implementation of gradient boosting machine classifiers (gbm) (Friedman, 2001).
The implementation information of the algorithms are listed below.
Boosting of linear classifiers. We used the R implementation in the package bst (Wang, 2014)
Extreme learning machines Implementation: package elmNN (Gosso, 2012))
Gradient boosting machines. Implementation: package gbm (Ridgeway, 2013)
Elastic net logistic regression classifier. Implementation : package glmnet (Friedman et al., 2010))
k-nearest neighbors classifier. Implementation: package class (Venables and Ripley, 2002).
Learning vector quantization. Implementation: package class (Venables and Ripley, 2002))
A 1-hidden layer neural network with sigmoid transfer function. Implementation: package nnet (Venables and Ripley, 2002)
Random forest. Implementation: package randomForest (Liaw and Wiener, 2002)
A bagging of knn classifiers on a random subset of the original features. Implementation: package rknn (Li, 2015)
A L1 regularized linear discriminant classifier. Implementation: package sparseLDA (Clemmensen, 2012)
A SVM with linear kernel. package e1071 (Meyer et al., 2014))
A SVM with polynomial kernel. package e1071 (Meyer et al., 2014))
A SVM with RBF kernel. package e1071 (Meyer et al., 2014))
2.4 Hyperparameters ranges
The RandomForest classifier is a particularly convenient algorithm to discuss the grid search on hyperparameters. Most implementation of rf use two or three hyperparameters: the mtry, the number of trees, and possible some limit of complexity of the trees, either the minimum size of a terminal node, or the maximum profundity, or a maximum number of terminal nodes. We did not set a range of possible values for the hyperparameters that limit the complexity of the trees. Mtry is the number of random features that will be used to construct a particular tree. There is a general suggestion (we do not know the source or other papers that tested this suggestion) that this number should be the square root of the number of features of the dataset. If nfeat is the number of features of the dataset, we set the possible values to . Also, the range is limited to at most nfeat.
The number of trees is what we will call a free hyperparameter, that is, an hyper-parameter can be tested for many values but it only need one training step. One can train a random forest with trees, but at testing time, some implementations can return the classification of each tree on the test data. Thus one can test the accuracy of a random forest with trees, just by computing the more frequent classification of the first trees (or any random sample of trees). Thus, to select this hyperparameter, it is not necessary to create a grid with multiple training or testing. So, for the random forest, ntree is a free hyperparameter, that will be tested from 500 to 3000 by 500. But we also put another limit on the number of trees, half of the number data points in the subset.
The number of repetitions or boosts in a boosting procedure is also a free parameter. The last class of free hyperparameter refer to an implementation of classification algorithms that calculate all the possible values of a parameter that causes changes in the learned function. The only relevant algorithm for this research is the elastic-net regularized logistic regression implemented by the package glmnet (Friedman et al., 2010)), which computes all the values (or the path as it is called) of the regularization parameter . Hastie et al. (2004) discuss a complete path algorithm for SVM (for the C hyperparameter) but we did not use this implementation in this paper.
We list the range of hyperparameters for each of the algorithms tested, where nfeat is the number of features in the data and ndat is the number of data points.
Hyperparameters: shrinkage = . Free hyperparameter, number of boosts, from 100 to 3000 by 200, at most ndat.
Hyperparameter: number of hidden units = at most 24 values equally spaced between and ndat
Hyperparameters: interaction-depth = 1..5, shrinkage=. number of boosts is a free hyperparameter, tested from 50 to 500 with increments of 20 to at most ndat.
Hyperparameters , 8 values equally spaced between 0 and 1. Free hyperparameter , 20 values geometrically spaced between to .
Hyperparameter k:= 1, and at most 23 random values from 3 to ndat.
Hyperparameter: size of the codebook = at most 8 values equally spaced between and nfeat
Hyperparameters: usekernel = true, false , fL =
Hyperparameter: number of hidden units = at most 8 values equally spaced between and nfeat, decay = .
Hyperparameter mtry = up to a value of nfeat. Number of trees is a free hyperparameters, tested from 500 to 3000 by 500 up to ndat.
Hyperparameters: mtry = 4 values equally spaced between 2 and nfeat, k= 1, and at most 23 random values from 3 to ndat. The number of classifiers is a free hyperparameter from 5 to 100, in steps of 20.
Hyperparameter: = 8 values geometrically spaced from to
. Hyperparameter: C = .
Hyperparameter C as in the linear kernel and degree from 2 to 5.
Hyperparameters C as in the linear SVM and .
The data described above, the R programs that tested the 14 algorithms, the results of running there algorithms, the R programs used to analyse these results and generate the tables and figures in this paper, and the saved interactions of the MCMC algorithm are available at https://figshare.com/s/d0b30e4ee58b3c9323fb.
3 Statistical procedures
3.1 Demsar procedure
We will follow the procedure proposed by Demsar (2006). The procedure suggests one should first apply a Friedman test (which can be seen as a non-parametric version of the repeated measure ANOVA test) to determine if there is sufficient evidence that the error rate measures for each procedure are not samples from the same distribution. If the p-value is below 0.05 (for a 95% confidence) than one can claim that it is unlikely that the error rates are all “the same” and one can proceed to compare each algorithm to the others. When comparing all algorithms among themselves, which is the case here, Demsar proposed the Nemenyi test, which will compute the p-value of all pairwise comparisons. Again, a p-value below 0.05 indicates that that comparison is statistically significant, that is, it is unlikely that two sets fo error rates are samples from the same distribution.
3.2 Bayesian comparison of multiple groups
A standard null hypothesis significant test assumes the null hypothesis, usually that the two samples came form the same population and computes the probability (p-value) of two samples from the same population having as large a difference in mean (or median) as the one encountered in the data. If the p-value is not low, one cannot claim that the null hypothesis is true and that the two data sets came from the same population ( and thus all observed differences are due to “luck”). Failing to disprove the null hypothesis because the p-value is too high is not the same as proving the null hypothesis.
Equivalence tests are a form of “proving” a weaker version of the usual null hypothesis. Equivalence tests assume as the null hypothesis that the difference between the mean (or median) of the two sets is above a certain threshold, and if the p-value is low enough, one can claim that this null hypothesis is false, and thus that the difference between the means is smaller than the threshold. Equivalence tests are useful when that threshold is a limit of practical irrelevance, that is, a limit below which changes in the mean of two groups are of no practical consequence. Of course, this limit of irrelevance is very problem dependent. Section 3.3 will discuss our proposal for such a limit.
A different approach to prove that the differences are not important is to use Bayesian methods. Bayesian methods will compute the (posterior) distribution of probability for some set of measures, given the prior hypothesis on those measures. Thus a Bayesian method can compute the posterior distribution of the difference of two means. The area of the distribution of the difference that falls within the limits of irrelevance is the probability that the difference is of no practical importance. In Bayesian statistics, the limit of irrelevance is called Region of Practical Equivalence (ROPE).
The standard Bayesian ANOVA, as described in Kruschke (2014) is based on normal distributions. We are interested in a 2-factor ANOVA, where one of the factors is the classification algorithm , and the other factor is the dataset. We assume that there is no interaction component, that is, we are not interested in determining particularities of each algorithm on each dataset - we are making a claim that the datasets used in this research are a sample of real world problems, and we would like to make general statements about the algorithms.
Let us denote as the error rate for the algorithm on dataset , then the usual hierarchical model is (Kruschke, 2014):
is the uniform distribution withand as the low and high limits; is the normal distribution with mean and standard deviation ; and Gamma(
) is the Gamma distribution with modeand standard deviation - note that this is not the usual parametrization of the Gamma distribution. is the standard deviation of the data, and , the mean of that data.
We are interested in the joint posterior probability. From that one can compute the relevant pairwise differences and in particular, how much of the probability mass fall within the region of irrelevant differences. More specifically, if , then the simulation approach we will follow generates a set for where is the number of chains in the MCMC simulation. We compute characteristics of distribution of from the .
Finally, one commonly used robust variation of the model above is to substitute the normal distribution in Equation 2a for a student-t distribution, with low degree of freedom, that is:
is the exponential distribution with rate. We also run the robust version of the model, as discussed in E.
3.3 Threshold of irrelevance
We propose two different forms of defining the threshold of irrelevance for differences in error rates. We will compute the two thresholds and use the lowest of the two measures as the threshold of irrelevance.
The first proposal is to compare the two measures of error for each dataset and for each classification algorithm. In Equation 1 they were the two terms and , that is, the error of classifier when learning on training subset and tested on the subset , and the dual of that. The difference
can be seen as the change on the error rate that one would expect from applying the classifier on two different samples ( and ) from the same population (). We will then compute the median of for all datasets and for all classification algorithms that are among the three best (as calculated by ) for each dataset. That is, we are considering as a threshold of irrelevance, the median of the change one should expect from using a good classifier (among the top three for that dataset) on two different samples of the same dataset.
The second proposal compares the estimate of the errorcomputed from the 5-fold CV procedure on the subset which selects the best hyperparameters with the measure of the error itself. We have not made explicit the steps in the 5-fold CV, but intuitively, for each combination of hyperparameters values , it computes the average of the error of training in 4 folds and testing on the remaining one. Let us call it - that is the CV error computed within the subset itself. We select the combination of hyperparameters that have the lowest . But this CV error is an estimate of the future error of the classifier (trained with that combination of hyperparameters). On the other hand is a “future” error of the classifier - the training set is slightly different from the 5-fold CV since it includes the whole subset while for each fold it included only 4/5 of , and the testing set is totally new. The difference
can be seen as the change in error one would expect from applying the classifier on a slightly larger training set and testing it on completely new data. We will compute the median of for all datasets and for all classification algorithms that are among the three best (as calculated by ) for each dataset.
In both these proposals, we are defining the threshold of irrelevance based on a “futility” point of view, and not based on theoretical considerations. One knows that given a different sample from the same distribution or given new data to test, all algorithms will have different error rates.
3.4 Computational costs
We compute to measures of computational costs to run the algorithms. The first one is the 1-train-test, which measures the total time to train the algorithm on a subset and to test it in the subset . Thus, that is the time to train the algorithm and to run it on two equally sized data.
But all algorithms must search for the appropriate hyperparameters. Thus also compute the total time to search for the correct hyperparameter (using, as discusses above a 5-fold CV). But different algorithms may have a different gird size of tentative hyperparameters (since as discussed for some algorithms some of range of hyperparameter may depend on characteristics of the dataset). Thus we divide the total time to search for the best hyperparameter by the number of hyperparameter combinations tested. We call it the per hyperparameter time.
Since the execution time varies greatly on different datasets, we will use the mean rank of each execution time to rank the algorithms, and use the Demsar procedure to determine which execution times are significantly different than the others (from a statistical sense). We have no intuition on what could be considered an irrelevant change in either execution times, so we will not perform the Bayesian ANOVA analysis.
The program ran on different cores of a cluster of different machines, so there is no guarantee that the machines have the same hardware specification. But we ran all the 14 algorithms for a particular subset on the same program. Thus the unit of distribution is all the 14 algorithms searching for the best combination of hyperparameters on a 5-fold CV of half of a dataset, then training on the full half dataset with the best selection of the hyperparameters and finally testing the classifier on the other half of the dataset.
Therefore, for the timing analysis we do not average the two measures from the subsets to obtain the measure per dataset. Instead we perform the Demsar statistical analysis using the subsets as subject indicator.
4.1 Error rates of the different algorithms
Table 1 list the mean ranking of all algorithms, the number of times each algorithm was among the top for a dataset. The best performing algorithm, in terms of mean rank across all subsets was the random forest, followed by SVM with Gaussian kernels, followed by gradient boosting machines. The three worst performing algorithms in terms of mean rank were boosting of linear classifiers naive Bayes and L1-regularized LDA. We make no claim that these three algorithms are intrinsically “bad”. It is possible that our choice of hyperparameters was outside the range of more useful values, or the implementation used was particularly not robust. In particular both nb and sda did not run at all for 20 subsets, which may explain partially their poor performance.
The ranking of the algorithm is dense, that is, all best performing algorithm receive rank 1, all second best algorithms receive rank 2, and so on. Also for the ranking, we rounded the error rates to 3 significant digits, so two algorithms have the same rank if their error rates have a difference of less than 0.0005. We do not use the rounding for the calculations of the irrelevance threshold.
Figure 1 displays the heat map of the rank distribution.
4.2 Demsar procedure
Table 2 lists the p-value of the pairwise comparisons for all classifiers. The p-values below 0.05 indicate that the differences between the corresponding classifiers is statistically significant.
The Demsar analysis shows that there is no statistical significance difference between the top 3 classification algorithms – the p-values of the pairwise comparisons among rf, svmRadial, and gbm are all well above 0.05. nnet, the next algorithm in the rank, is statistically significantly different from each of the first three. As discussed, the failure to prove that the three top algorithms are not statistically different, does not prove that they are equivalent. We will need the Bayesian ANOVA for that.
4.3 Irrelevance thresholds
The median and for all datasets, and for the three best algorithms for each dataset are:
Thus, the two measures are comparable, a little over 1%, and we will use the lower among the two as out threshold of irrelevance (0.0112). The fact that the median of is smaller than the other is somewhat surprising, since measures the error of learning (and testing) with two different samples of the data population, while is the difference of learning on basically the same subset, but testing in two different samples of different sizes from the same data population.
4.4 Bayesian ANOVA analysis
Table 3 displays the pairwise probability that the differences of the classifiers’ error rates is within our limits of irrelevance (from -0.0112 o 0.0112).
The comparison between the top three algorithms shows that the differences are with high probability (0.83 to 0.64) within our range of irrelevance, that is, there is no “practical difference” in the error rates of the three algorithms. In particular there is a stronger argument to be made that the best algorithm, rf is equivalent for practical purposed to the svmRadial. The claim of equivalence between gbm and the other two is less strong, in particular, to our surprise, in relation to the second best svmRadial.
4.5 Computational costs of the algorithms
|alg||mean rank 1-t-t||alg||mean rank p/h|
Table 4 shows the mean rank for the 1-train-test and per hyperparameter times. The top 6 algorithms in term of accuracy are indeed computationally costly in terms of training and testing, and they correspond to the 7 worse 1-train-test times. The result for the nb is surprising. Also surprising, to the author at least, is how more costly is rf in comparison to the two other “equivalent” classifiers, svmRadial and gbm. Table 5 is the Demsar procedure p-values for the pairwise comparisons of the 1-train-test cost of the top 6 algorithms (B contains the full p-value table). Notice that the difference between svmRadial and gbm is not significant; the difference between rf and svmRadial and gbm are significant.
Table 6 lists the p-values of the pairwise comparisons of the top 6 algorithms. B contains the full p-value table. svmRadial is significantly faster than all other algorithm in the per hyperparameter measure.
The per hyperparameter cost is surprising low for the svmRadial, which is a welcome result since the hyperparameter grid for the svmRadial does not depend on characteristics of the dataset, and therefore one will need to test many combinations, regardless of the dataset.
On the other hand the large per hyperparameter cost of rf could be a problem if one cannot use the free hyperparameter “trick” we described. Not all implementations of rf allow access to each tree decision111The predict method of random forest in sklearn, for example, does not allow access to each three individual decision., and if one has to explicitly train for different number of trees, the total number of hyperparameters tested times the high per hyperparameter cost may make the random forest solution less viable.
An important part of these results is the definition of an irrelevance threshold, a difference in error rates between to algorithms which is likely to be irrelevant from practical purposes. We claim that 0.0112 is median change of error rate one should expect when a classifier is trained and tested on a fully different set of data from the same population. This is a very strong claim, for which we have no theoretical backing, but which is empirically derived from the 115 datasets used in this research. And if one agrees that these datasets are a random sample of real life datasets that a practitioner will “find in the wild” (more on this issue below), that one must accept that this conclusion is likely generalizable for all data.
In general terms, one should not loose too much sleep trying to improve a classifier by less than 1% since it is likely that one will loose (or gain) that much when new data is used.
We can provide some independent evidence that the change in error rate with new data is around 1%. Kaggle competitions are evaluated against two hold out test sets, the public and the private. Usually the public dataset is around 25 to 30% of the private dataset (so the private dataset is not totally new data, but 70 to 75% new data. Thus the difference of private and public accuracy scores are an independent lower bound estimate of the change in accuracy for new data. Unfortunately very few of the Kaggle competitions are measured using accuracy, but one of them the “cats and dogs222https://www.kaggle.com/c/dogs-vs-cats” did. The median of the difference between the public and private test scores, for the top 50 entries on the public score is 0.0068, which is 60% of our threshold of irrelevance.
If one is reluctant to accept our 1% as a limit of irrelevance when comparing classifiers, Table 8 shows the probability that the difference between the accuracy rate is below 0.0056, that is half of our irrelevance threshold, for the six best algorithms. Even with this extremely rigorous limit of irrelevance, there is a some probability that rf, svmRadial, and gbm would perform at similar levels. In particular, there is still a 50% probability that rf and svmRadial will be equivalent for practical purposes. .
Another important point of discussion is the Bayesian analysis itself. We followed a more traditional modeling of the Bayesian ANOVA, where the priors for the means follow a Gaussian distribution (Equations2e and 2f). But error rates does not follow a Gaussian distribution, first because they are limited to a 0-1 range. Second, the whole point of the Demsar procedure to compare classifiers on different datasets is to use a non-parametric test, one that does not rely on the assumption of normality of the data. We are not making an assumption that error rates are normally distributed, but we are making an assumption that the priors for the coefficients of the dependency on the dataset (given an algorithm) and on the algorithm (given a dataset) are drawn from a Gaussian distribution. C discusses that those assumptions are somewhat reasonable, but in the future, the community should probably define a better model for the Bayesian analysis of error rates. D shows that there was convergence for the Monte Carlo simulations that generated the results in Table 3. Finally, we also ran the robust form of the model. Table 8 shows the probability that the differences between the top six algorithms are within the limit of practical relevance. Notice that one can be 100% sure of the equivalence of the top three algorithms using the robust model. Also notice that even nnet which was significantly different from the top three, is with high probability equivalent to at least svmRadial and gbm. There is no contradiction in both statements: the NHST claims that the difference is “real” while the Bayesian ANOVA claims that that difference (although real) it is not important. But nevertheless, the robust model seems to make much stronger claims of equivalence than the non-robust model, and as a precaution, we preferred the latter model. The stronger results of the robust model can be explained by shrinkage (Kruschke, 2014)
; since it allows more outliers, the mean of each distribution of the algorithms’ coefficient would be “free” to move closer together, and thus the higher probabilities that the differences are below the threshold of irrelevance.
E displays the full table for the pairwise probabilities that the differences between the algorithms are within the limit of irrelevance, and also discusses the convergency and model checking of the robust model.
Our results are in general compatible with those in Fernández-Delgado et al. (2014). Random forest is the best ranking algorithm in both experiments; gradient boosting machines which was not included in Fernández-Delgado et al. (2014) performs well, as reported in various blogs, RBF SVM also performs well. The divergence starts with the next best classifications algorithms: Fernández-Delgado et al. (2014) lists polynomial kernel SVM svmPoly and extreme learning machines elm as the next best families, but in our case svmPoly was equivalent to 1-hidden layer neural nets nnet, and a bagging of knn rknn. The differences between these algorithms was also below our practical limit of relevance. elm did perform worse than nnet and rknn.
5.1 Limits on this research
One limit of this research is that its conclusions should only be generalized to datasets “similar” to the ones used. In particular, our datasets did not include very large, or very sparse, or datasets with much more features than datapoints (as it is common in some bioinformatics applications). Furthermore, our experiments were performed on binary classification tasks. Given these restrictions on the data, if one can assume that the datasets in UCI repository are samples of “real life” problems, our results should be generalizable.
A second limit of this research is based on our decisions regarding the hyperparameter search for each algorithm. There is very little research on the range of possible or useful values of hyperparameters for any of the algorithms discussed, so our decisions are debatable. And if our choices were particularly bad, an algorithm may have been unfairly ranked in the comparisons.
The reader should be aware of the limited usefulness for the timing results. We used “standard” implementations of the algorithms in R. All three implementations were written in some compiled language and linked to R, but it may be that only libSvm, which is the base of the SVM R implementation, has been under current development, and the execution time of the SVM may be due to that. There has been more recent implementations of random forest (Wright and Ziegler, 2015) and gradient boosting machines (Distributed (Deep) Machine Learning Community, 2016) which claim both a faster execution time and shorter memory footprint. On the other hand, incremental and online solvers for SVM (Bordes et al., 2005; Shalev-Shwartz et al., 2011) may further tip the scale in favor of SVM.
5.2 Future research
Given the result that rf, svmRadial, and gbm are likely the best performing algorithms, it is interesting to explore further variations and different implementations of these algorithms. We discussed that more modern implementations the algorithms may alter the timing results. But variations of the algorithms may also alter the ranking and the apparent equivalence of the three algorithms. Variations of Random Forest, such as Rotation Forest (Rodriguez et al., 2006), Extremely Randomized Forest (Geurts et al., 2006), random forest of conditional inference trees (Hothorn et al., 2006; Strobl et al., 2007), should be compared with the standard algorithm. Least square SVM (Suykens and Vandewalle, 1999) should be compared with the standard SVM, and different models of boosting such as AdaBoost, LogiBoost, BrownBoost, should be compared with gbm.
There has been a large number of published research on different methods and algorithms for hyperparameter selection in RBF SVM, but almost no research in hyperparameter selection for Random Forests and Gradient Boosting Machines. Hyperparameter selection is a very important and computationally expensive step in selecting an algorithm for a particular problem, and further understanding of how to improve this process is needed, specially for those two algorithms.
We have shown that random forests, RBF SVM, and gradient boosting machines are classification algorithm that most likely will result in the highest accuracy and that it is likely that there will be no important difference in error rate among these three algorithms, specially between random forest and RBF SVM. In terms of training and testing execution times, SVM with a RBF kernel is faster then the two other competitors.
We believe that this paper also makes important methodological contributions for machine learning research, mainly in the definition of a threshold of irrelevance, below which, changes in error rate should be considered as of no practical significance. We argued that this threshold should be 0.0112. Another important methodological contribution is the use of a Bayesian Analysis of Variance method to verify that the differences among the three top performing algorithms, was very likely smaller than the 0.0112 threshold of practical irrelevance, and we showed that the Bayesian model used is appropriate to be used in comparisons of error rates among different algorithms.
Bordes et al. (2005)
Bordes, A., Ertekin, S., Weston, J., Bottou, L., 2005. Fast kernel classifiers with online and active learning. The Journal of Machine Learning Research 6, 1579–1619.
- Brooks and Gelman (1998) Brooks, S., Gelman, A., 1998. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7 (4), 434–455.
Clemmensen, L., 2012. sparseLDA: Sparse Discriminant Analysis. R package
- Demsar (2006) Demsar, J., 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, 1–30.
Dietterich and Bakiri (1995)
Dietterich, T., Bakiri, G., 1995. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 263–286.
Distributed (Deep) Machine Learning Community (2016)
Distributed (Deep) Machine Learning Community, 2016. xgboost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more.https://github.com/dmlc/xgboost.
- Engel (1988) Engel, J., 1988. Polytomous logistic regression. Statistica Neerlandica 42 (4), 233–252.
- Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, 3133–3181.
- Franc et al. (2002)
- Friedman et al. (2010) Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 (1), 1.
- Friedman (2001) Friedman, J. H., 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Gelman et al. (1996)
Gelman, A., Meng, X.-L., Stern, H., 1996. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 733–760.
- Geurts et al. (2006) Geurts, P., Ernst, D., Wehenkel, L., 2006. Extremely randomized trees. Machine Learning 63 (1), 3–42.
Gosso, A., 2012. elmNN: Implementation of ELM (Extreme Learning Machine)
algorithm for SLFN (Single Hidden Layer Feedforward Neural Networks). R
package version 1.0.
- Hastie et al. (2004) Hastie, T., Rosset, S., Tibshirani, R., Zhu, J., 2004. The entire regularization path for the support vector machine. The Journal of Machine Learning Research 5, 1391–1415.
- Hastie et al. (1998) Hastie, T., Tibshirani, R., et al., 1998. Classification by pairwise coupling. The Annals of Statistics 26 (2), 451–471.
- Hothorn et al. (2006) Hothorn, T., Hornik, K., Zeileis, A., 2006. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15 (3), 651–674.
- Huang et al. (2006) Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70 (1), 489–501.
- Kohonen (1995) Kohonen, T., 1995. Learning Vector Quantization. Springer.
- Kruschke (2014) Kruschke, J., 2014. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
Li, S., 2015. rknn: Random KNN Classification and Regression. R package version
- Liaw and Wiener (2002) Liaw, A., Wiener, M., 2002. Classification and regression by randomforest. R News 2 (3), 18–22.
Meyer et al. (2014)
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., 2014.
e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R
package version 1.6-4.
Ridgeway, G., 2013. gbm: Generalized Boosted Regression Models. R package
- Rodriguez et al. (2006) Rodriguez, J. J., Kuncheva, L. I., Alonso, C. J., 2006. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (10), 1619–1630.
- Shalev-Shwartz et al. (2011) Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A., 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming 127 (1), 3–30.
- Strobl et al. (2007) Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T., 2007. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8 (25).
- Suykens and Vandewalle (1999) Suykens, J. A., Vandewalle, J., 1999. Least squares support vector machine classifiers. Neural Processing Letters 9 (3), 293–300.
Venables and Ripley (2002)
Venables, W. N., Ripley, B. D., 2002. Modern Applied Statistics with S, 4th
Edition. Springer, New York, iSBN 0-387-95457-0.
Wang, Z., 2014. bst: Gradient Boosting. R package version 0.3-4.
- Weihs et al. (2005) Weihs, C., Ligges, U., Luebke, K., Raabe, N., 2005. klar analyzing german business cycles. In: Baier, D., Decker, R., Schmidt-Thieme, L. (Eds.), Data Analysis and Decision Support. Springer-Verlag, Berlin, pp. 335–343.
- Wolpert (1992) Wolpert, D. H., 1992. Stacked generalization. Neural Networks 5 (2), 241–259.
- Wright and Ziegler (2015) Wright, M. N., Ziegler, A., 2015. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv:1508.04409.
- Zajac (2016) Zajac, Z., 2016. What is better: gradient-boosted trees, or a random forest? http://fastml.com/what-is-better-gradient-boosted-trees-or-random-forest/.
- Zou and Hastie (2005) Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67 (2), 301–320.
Appendix A Datasets
The table below list the characteristics of all datasets, ordered by size. The name of the dataset is the same as the ones used in Fernández-Delgado et al. (2014). The size refers to one half of the dataset. The nfeat column is the number of features of the datasets; ndat the number of data; prop the proportion of data of the positive class. Notes has the following values:
Appendix B Full p-value tables for the pairwise comparison of the 1-train-test and per hyperparameter costs
Appendix C Bayesian model verification
Figure 2 plots the histogram of the error rates across the 115 datasets for the different algoritms, superimposed with the best fit Gaussian. One can argue that the distribution of error rates can reasonably be considered as normal, which would match the assumptions of the equation 2e. Similarly, Figure 3 is the distribution of error rates for 20 random datasets, and again one argue that the assumption in equation 2f is reasonable.
A more formal model verification is the procedure of posterior predictive check proposed by Gelman et al. (1996). The procedure calculates how unlikely is the true data when compared to data generated by the model, given the posterior distribution of the hyperparameters of the model. The data (true and generated) are summarized by the discrepancy (Gelman et al., 1996).
Figure 4 shows the relation between the discrepancies of the true data and that of the generated data, for 1667 of the points generated by the MCMC algorithm, following the posterior of the parameters of the model. The probability of that the real data (or data with even more extreme discrepancies) were generated by the model is the proportion of the points above the line. As one can see, that probability is around 0.5, and the real data is clearly among the most likely of the data generated by the model. Thus, under this criterion, the model is appropriate. But we must also point out that 3% of the data generated was negative, which is at least esthetically unpleasant.
Appendix D Convergence of the MCMC
There was no attempts to optimize the number of simulations of the MCMC algorithm. We used JAGS, with 5000 burnin steps and 5000 adaptative steps. Finally we ran 100000 total interactions on 4 separated chains.
Below is the Gelman and Rubin diagnostic (Brooks and Gelman, 1998) which compares the variance within and between chains, as reported by the function gelman.diag from the R package coda, where the variables saved are the ones in Equation 2b: b0 is and b1[a] are the for each algorithm, b2[d] are the for each dataset and ySigma is from Equation 2a, a1SD and a2SD are and from 2g and 2h. Not all lines that refer to the b2[d] variables are shown, but they all have the same values.
Potential scale reduction factors: Point est. Upper C.I. b0 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b1 1 1 b2 1 1 b2 1 1 b2 1 1 ... b2 1 1 b2 1 1 b2 1 1 ySigma 1 1 a1SD 1 1 a2SD 1 1 Multivariate psrf 1
Values between 1 and 1.1 suggest convergence of the interactions.
The effective sizes of the interactions shows that there is no problem of high autocorrelation. Again not all values for the b2[d] variables are shown.
b0 b1 b1 b1 b1 b1 b1 b1 100000.00 101017.13 98356.57 99052.19 99172.48 99820.25 99350.88 100262.05 b1 b1 b1 b1 b1 b1 b1 b2 100541.69 98992.43 98427.81 100000.00 97006.48 97467.65 94816.02 100000.00 b2 b2 b2 b2 b2 b2 b2 b2 99043.72 97966.91 99380.08 100000.00 99530.18 101053.74 100087.01 100000.00 ... b2 b2 ySigma a1SD a2SD 100204.87 100216.75 55353.48 16108.44 51977.11
Appendix E Results with the robust Bayesian model
This section displays the results of the Bayesian analysis based on the robust model. Table 12 is the full probability table from the Bayesian ANOVA.
We cannot perform the verification of the model using posterior predictive check because the discrepancy needs the variance of the data. Under the robust model, the variance of the data depends also on the degrees of freedom of the student-t distribution, and in the case of the robust simulations is 1.12, and the variance of the student-t distribution is not defined for degrees of freedom below 2.