1 Introduction
Many ML algorithms for classification tasks can be found in the literature. Although high predictive accuracy is the most frequently used measure to evaluate these algorithms, in many applications easy interpretation of the induced models is also an important requirement. Good predictive performance and model interpretability are both found in one of the most successful families of classification algorithms: decision tree (DT) induction algorithms Maimon:2014.
When applied to a dataset, these algorithms induce a model represented by a set of rules in a tree-like structure (as illustrated in Figure 1). This structure elucidates how the induced model predicts the class of a new instance more clearly than many other model representations, such as an artificial neural network (ANN) Haykin:2007 or a support vector machine (SVM) Abe:2005. As a result, DT induction algorithms are among the most frequently used ML algorithms for classification tasks Wu:2009; Jankowski:2014.
DT induction algorithms have several other advantages over many ML algorithms, such as robustness to noise, tolerance against missing information, handling of irrelevant and redundant predictive attribute values, and low computational cost Maimon:2014. Their importance is demonstrated by the wide range of well-known algorithms proposed in the literature, such as Breiman et al.'s CART Breiman:1984 and Quinlan's C4.5 algorithm Quinlan:1993, as well as hybrid variants of them, like NBTree Kohavi:1996, LMT Landwehr:2005 and CTree Hothorn:2006.
Similarly to most ML algorithms, DT induction algorithms have hyperparameters whose values must be set. Due to the high number of possible configurations, and their large influence on the predictive performance of the induced models, hyperparameter tuning is often warranted Bergstra:2011; Pilat:2013; Massimo:2016; Padierna:2017. The tuning task is usually investigated for “black-box” algorithms, such as ANNs and SVMs, but rarely for DTs. There are some prior studies investigating the evolutionary design of new DT induction algorithms Barros:2012; Barros:2015, but only a few on hyperparameter tuning for them Reif:2011; Molina:2012; Reif:2014.
This paper investigates the effects of hyperparameter tuning on the predictive performance of DT induction algorithms, as well as the impact individual hyperparameters have on the final predictive performance of the induced models. To this end, three DT induction algorithms were chosen as case studies: two of the most popular algorithms in Machine Learning Wu:2009, namely J48, a WEKA Witten:2005 implementation of Quinlan's C4.5 Quinlan:1993, and Breiman et al.'s CART algorithm Breiman:1984; and the CTree algorithm Hothorn:2006, a more recent implementation that embeds statistical tests to define whether a split must occur, similar to CHAID Kass:1980.
A total of six hyperparameter tuning techniques (following different learning biases) were selected: a simple random search (RS); three commonly used metaheuristics, namely a genetic algorithm (GA) Goldberg:1989, particle swarm optimization (PSO) Kennedy:1995 and an estimation of distribution algorithm (EDA) Hauschild:2011; sequential model-based optimization (SMBO) Snoek:2012; and Iterated F-race (irace) Birattari:2010. These techniques are described in the next sections. Experiments were carried out with a large number of heterogeneous datasets, and the experimental results obtained by these optimization techniques are compared with those obtained using the default hyperparameter values recommended for C4.5, CART and CTree. In many situations, the analysis of the global effect of a single hyperparameter, or of interactions between different hyperparameters, may provide valuable insights. Hence, we also assess the relative importance of DT hyperparameters, measured using a recent functional ANOVA framework Hutter:2014.
In all, the main contributions of this study are:

A large-scale comparison of different hyperparameter tuning techniques for DT induction algorithms;

A comprehensive analysis of the effect of hyperparameters on the predictive performance of the induced models, and of the relationships between them;
The current study also extends a previous investigation Mantovani:2016. This extended version reviews previous studies performing hyperparameter tuning of C4.5 (J48); includes two additional tree algorithms, CART and CTree; includes two state-of-the-art optimization techniques in the experiments (SMBO and irace); presents a more detailed methodology (with all implementation choices); and improves the experimental analysis, mainly the experiments considering the relative importance of the DT algorithm hyperparameters. All the code generated in this study is available to reproduce our analysis, and to extend it to other classifiers. All experiments are also available on OpenML
Vanschoren:2014. The remainder of this paper is structured as follows: Section 2 covers related work on hyperparameter tuning of DT induction algorithms, and Section 3 introduces hyperparameter tuning in more detail. Section 4 describes our experimental methodology and the setup of the tuning techniques used, after which Section 5 analyses the results. Section 6 validates the results from this study. Finally, Section 7 summarizes our findings and highlights future avenues of research.
2 Related work
A large number of ML studies investigate the effect of hyperparameter tuning on the predictive performance of classification algorithms. Most of them deal with the tuning of “black-box” algorithms, such as SVMs Gomes:2012 and ANNs Bergstra:2012, or ensemble algorithms, such as RF Reif:2012; Huang:2016 and boosting trees Eggensperger:2015; Wang:2015. They often tune the hyperparameters by using simple techniques, such as PS Eitrich:2006 and RS Bergstra:2012, but also more sophisticated ones, such as metaheuristics Padierna:2017; Gomes:2012; GasconMoreno:2011; Nakamura:2014; Ridd:2014, SMBO Bergstra:2011; Bardenet:2013, racing algorithms Lang:2013; Miranda:2014 and meta-learning (MtL) Feurer:2015. However, when considering DT induction algorithms, there are far fewer studies available.
Recent work has also used metaheuristics to design new DT induction algorithms by combining components of existing ones Barros:2015; Podgorelec:2015. The algorithms created are restricted by the existing components and, since both the algorithm and its hyperparameters have to be optimized, the search space and computational cost are much larger. Since this study focuses on hyperparameter tuning, this section does not cover DT induction algorithm design.
2.1 C4.5/J48 hyperparameter tuning
Table 1 summarizes studies performing hyperparameter tuning for the C4.5/J48 DT induction algorithm. For each study, the table presents which hyperparameters were investigated (following the J48 nomenclature also presented in Table 4; the original J48 nomenclature may be checked at http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html), which tuning techniques were explored, and the number and source of datasets used in the experiments. Empty fields in the table mean that the procedures used in that specific study could not be completely identified.
Reference | Hyperparameters (C, M, N, O, R, B, S, A, J, U) | Tuning technique | Number of datasets
Schauerhuber et al. (2008) Schauerhuber:2008 | | GS | 18 (UCI)
Sureka & Indukuri (2008) Sureka:2008 | | GA |
Stiglic et al. (2012) Stiglic:2012 | | VTJ48 | 71 (UCI)
Lin & Chen (2012) Lin:2012 | | SS | 23 (UCI)
Ma (2012) Ma:2012 | | GP | 70 (UCI)
Auto-WEKA Thornton:2013; Kotthoff:2016 | | SMBO | 21
Molina et al. (2012) Molina:2012 | | GS | 14
Sun & Pfahringer (2013) Sun:2013 | | PSO | 466
Reif et al. (2014) Reif:2014 | | GS | 54 (UCI)
Sabharwal et al. (2016) Sabharwal:2016 | | DAUP | 2 artificial + 4 real-world
Tantithamthavorn et al. (2016) Tantithamthavorn:2016 | | caret | 18
Delgado et al. (2014) Delgado:2014 | | | 121 (UCI)
Wainberg et al. (2016) Wainberg:2016 | | |
Schauerhuber et al. Schauerhuber:2008 presented a benchmark of four different open-source DT induction algorithm implementations, one of them being J48. The authors assessed the performance of the algorithms on 18 classification datasets from the UCI repository, and tuned two hyperparameters: the pruning confidence (C) and the minimum number of instances per leaf (M).
Sureka & Indukuri Sureka:2008 used a GA (see Section 3.3) to recommend an algorithm and its best hyperparameter values for a problem. They used a binary representation to encode a wide hyperparameter space covering Bayes-, rule-, network- and tree-based algorithms, including J48. However, the authors do not provide further information about which hyperparameters, ranges, datasets or evaluation procedures were used to assess the hyperparameter settings. Experiments showed that the algorithm can find good solutions, but requires massive computational resources to evaluate all possible models.
Stiglic et al. Stiglic:2012 presented a study tuning VTJ48, i.e., J48 with predefined visual boundaries. They developed a new adapted binary search technique to tune four J48 hyperparameters: the pruning confidence (C), the minimum number of instances per leaf (M), the use of binary splits (B) and subtree raising (S). Experimental results on UCI Bache:2013 and bioinformatics datasets demonstrated a significant increase in the accuracy of visually tuned DTs when compared with defaults. In contrast to classical ML datasets, the gains were higher on the bioinformatics datasets.
Lin & Chen Lin:2012 proposed a novel SS-based algorithm to acquire optimal hyperparameter settings and to select a subset of features that results in better classification performance. Experiments with UCI datasets demonstrated that the hyperparameter settings for the C4.5 algorithm obtained by the new approach, when tuning the ‘C’ and ‘M’ hyperparameters, were better than those obtained by the baselines (defaults, a simple GA and a greedy combination of them). When feature selection is considered, classification accuracy rates increased on most datasets.
Ma Ma:2012 leveraged the GP algorithm to optimize hyperparameters of some ML algorithms (including C4.5 and its hyperparameters ‘C’ and ‘M’) for UCI classification and regression datasets. GPs were compared with the GS and RS methods (see Section 3.1). GPs found solutions faster than both baselines, with comparably high performance. However, compared specifically to RS, GPs seem to be better for more complex problems, while RS is sufficient for simpler ones.
Sabharwal et al. Sabharwal:2016 proposed a method to sequentially allocate small data batches to selected ML classifiers. The method, called “Data Allocation using Upper Bounds” (DAUP), tries to project an optimistic upper bound on the accuracy obtained by a classifier on the full dataset, using recent evaluations of this classifier on small data batches. Experiments evaluated the technique on classification datasets and a large number of algorithms with different hyperparameters, including C4.5 and its ‘C’ and ‘M’ hyperparameters. The proposed method was able to select near-optimal classifiers at a very low computational cost compared to fully training all the classifiers.
In Tantithamthavorn et al. Tantithamthavorn:2016, the authors investigated the performance of prediction models when tuning hyperparameters using “caret” caret:2016 (https://cran.r-project.org/web/packages/caret/index.html), a ML tool. A set of ML algorithms, including J48 and its ‘C’ hyperparameter, was tuned on proprietary and public datasets. In a comparison with the caret defaults using the AUC (area under the ROC curve) measure, the tuning produced better results.
Wainberg et al. Wainberg:2016 reproduced the benchmark experiments described in Delgado:2014. They evaluated classifiers from different learning groups on UCI datasets. The hyperparameters of the J48 algorithm were manually tuned.
Other studies used hyperparameter tuning methods to generate MtL systems Molina:2012; Reif:2014; Thornton:2013; Kotthoff:2016; Sun:2013. These studies search the hyperparameter spaces to describe the behavior of ML algorithms on a set of problems, and later recommend hyperparameter values for new problems. For example, Molina et al. Molina:2012 tuned two hyperparameters of the J48 algorithm (‘C’ and ‘M’) in a case study with educational datasets, using GS. They also used a set of meta-features to recommend the most promising <algorithm, hyperparameters> pairs for each problem. The proposed approach, however, did not improve the performance of the DTs over defaults.
Sun & Pfahringer Sun:2013 also used hyperparameter tuning in the context of MtL. The authors proposed a new meta-learner for algorithm recommendation, and a feature generator to construct the datasets used in the experiments. They searched the hyperparameter spaces of several ML algorithms, one of them C4.5 with its ‘B’ hyperparameter. The PSO technique (see Section 3.4) was used to generate a meta-database for a recommendation experiment. Similarly, Reif et al. Reif:2014 implemented an open-source MtL system to predict the accuracies of target classifiers, one of them the C5.0 algorithm (a version of C4.5), with its pruning confidence (C) tuned by GS.
A special case of hyperparameter tuning is the CASH (combined algorithm selection and hyperparameter optimization) approach, introduced by Thornton:2013 as the Auto-WEKA framework (http://www.cs.ubc.ca/labs/beta/Projects/autoweka/) and updated recently in Kotthoff:2016. Auto-WEKA applies SMBO (see Section 3.2) to select an algorithm and its hyperparameters for new problems from a wide set of ML algorithms (including J48). In addition to the previously mentioned hyperparameters (C, M, B and S), Auto-WEKA also searches over the following hyperparameter values: whether to collapse the tree (O), the use of Laplace smoothing (A), the use of MDL correction for the information gain criterion (J), and the generation of unpruned trees (U).
2.2 CART hyperparameter tuning
Table 2 summarizes previous studies on hyperparameter tuning for the CART algorithm. For each study, the table presents which hyperparameters and tuning techniques were explored, and the number and source of datasets used in the experiments.
Reference | Hyperparameters (cp, minsplit, minbucket, maxdepth, weights_leaf, maxleafs, maxfeatures) | Tuning technique | Number of datasets
Schauerhuber et al. (2008) Schauerhuber:2008 | | GS | 18 (UCI)
Sun & Pfahringer (2013) Sun:2013 | | PSO | 466
Bermudez-Chacon et al. (2015) Chacon:2015 | | RS, SH, PD | 29 (UCI) + 7 (other)
Auto-sklearn Feurer:2015; Feurer:2015B | | SMBO | 140 (OpenML)
Levesque et al. (2016) Levesque:2016 | | SMBO | 18 (UCI)
Tantithamthavorn et al. (2016) Tantithamthavorn:2016 | | caret | 18 (various)
Delgado et al. (2014) Delgado:2014 | | | 121 (UCI)
Wainberg et al. (2016) Wainberg:2016 | | |
In Schauerhuber et al. Schauerhuber:2008, the authors added CART/rpart to their benchmark analysis, manually tuning only the complexity parameter ‘cp’. Sun & Pfahringer Sun:2013 investigated the tuning of CART hyperparameters, in particular its ‘minsplit’ hyperparameter, over 466 datasets (some of which are artificially generated) using PSO. This hyperparameter controls the minimum number of instances necessary for a split to be attempted. The hyperparameter settings assessed during the search were used to feed a meta-learning system. In Tantithamthavorn et al. Tantithamthavorn:2016, the authors did a similar study, but focused on the complexity parameter ‘cp’.
In Bermudez-Chacon et al. Chacon:2015, the authors presented a hierarchical model selection framework that automatically selects the best ML algorithm for a particular dataset, optimizing its hyperparameter values. Algorithms and hyperparameters are organized in a hierarchy, and an iterative process makes the recommendation. The optimization technique used for tuning is a component of the framework, with three choices available: the RS, SH and PD optimization methods. The framework encapsulates a long list of algorithms, including CART and some of its hyperparameters: ‘minsplit’; the minimum number of instances in a leaf (‘minbucket’); the maximum depth of any node of the final tree (‘maxdepth’); weighted values for leaf nodes (‘weights_leaf’); the maximum number of leaves (‘maxleafs’); and the maximum number of dataset features used in the trees (‘maxfeatures’).
In Feurer et al. Feurer:2015; Feurer:2015B, the authors used the SMBO approach to select and tune algorithms from the “scikit-learn” framework (http://scikit-learn.org/), hence Auto-sklearn (https://github.com/automl/autosklearn). The only DT induction algorithm covered there is CART. CART with some manually selected hyperparameters was also experimentally investigated in Delgado:2014; Wainberg:2016.
Levesque et al. Levesque:2016 investigated the use of hyperparameter tuning and ensemble learning for tuning CART hyperparameters when the models induced by CART were part of an ensemble, using SMBO. Four hyperparameters were tuned in the process: ‘minsplit’, ‘minbucket’, ‘maxdepth’ and ‘maxleaf’. The tuning resulted in a significant improvement in generalization accuracy when compared with the Single Best Model Ensemble and Greedy Ensemble Construction techniques.
2.3 CTree hyperparameter tuning
Table 3 summarizes previous studies on hyperparameter tuning for the CTree algorithm Hothorn:2006. For each study, the table presents which hyperparameters were investigated, which tuning techniques were explored, and the number (and source) of datasets used in the experiments. Studies in the table with no technique specified used a manual selection process.
Reference | Hyperparameters (mincriterion, minsplit, minbucket, stump, mtry, maxdepth) | Tuning technique | Number of datasets
Schauerhuber et al. (2008) Schauerhuber:2008 | | | 18 (UCI)
Delgado et al. (2014) Delgado:2014 | | | 121 (UCI)
Wainberg et al. (2016) Wainberg:2016 | | |
Sarda-Espinoza et al. (2017) Sarda:2017 | | GS | 4 (private)
Schauerhuber et al. Schauerhuber:2008 also included the CTree algorithm in their benchmark study, where only the ‘mincriterion’ hyperparameter was manually tuned for 18 UCI datasets. This hyperparameter defines the value of the test statistic (1 − p-value) that must be exceeded for a split to occur.
A CTree implementation is also explored in the benchmark studies presented by Delgado:2014; Wainberg:2016. Two hyperparameters were tuned manually: ‘mincriterion’ and the maximum tree depth (‘maxdepth’). Experiments were performed with a total of 121 heterogeneous UCI datasets.
Sarda-Espinoza et al. Sarda:2017 applied conditional trees to extract relevant knowledge from electrical motors’ data. The final models were obtained after tuning two hyperparameters via GS: ‘mincriterion’ and ‘maxdepth’. The resulting models were applied to four different private datasets.
2.4 Literature Overview
The literature review indicates that hyperparameter tuning for DT induction algorithms could be more deeply explored. We found eleven studies investigating some form of tuning for the J48 algorithm, six for CART, and only three for the CTree algorithm. These studies neither investigated the tuning task itself nor adopted a consistent procedure to assess candidate hyperparameter settings while searching the hyperparameter space:

some studies used hyperparameter sweeps;

other studies used simple CV resamplings;

a few studies used nested-CV procedures, but only used an inner holdout and did not repeat their experiments with different seeds (given the stochastic nature of the tuning algorithms that are often used, experimenting with different seeds for the random generator is desirable); and

some studies did not even describe which experimental methodology was used.
Regarding the search space, most studies concerning C4.5/J48, CART and CTree hyperparameter tuning investigated only a small subset of the hyperparameter search spaces (as shown in Tables 1, 2 and 3). Furthermore, most of the studies performed the tuning manually, used simple hyperparameter tuning techniques, or searched the hyperparameter spaces to generate meta-information for MtL and CASH systems.
This paper overcomes these limitations by investigating several techniques for DT hyperparameter tuning, using a reproducible and consistent experimental methodology. It presents a comparative analysis for each of the investigated algorithms (C4.5, CART and CTree), and analyzes the importance of, and relationships between, many hyperparameters of DT induction algorithms.
3 Hyperparameter tuning
Many applications of ML algorithms to classification tasks use the default hyperparameter values suggested by ML tools, even though several studies have shown that their predictive performance largely depends on using the right hyperparameter values Feurer:2015; Thornton:2013; Feurer:2015B. In early works, these values were tuned according to previous experience or by trial and error. Depending on the training time available, finding a good set of values manually may be subjective and time-consuming. In order to overcome this problem, optimization techniques are often employed to automatically look for a suitable set of hyperparameter settings Bergstra:2011; Bardenet:2013.
The hyperparameter tuning process is usually treated as a black-box optimization problem whose objective function is associated with the predictive performance of the model induced by a ML algorithm, formally defined as follows:
Let Λ = Λ1 × Λ2 × ... × Λk be the hyperparameter space of an algorithm a ∈ A, where A is the set of ML algorithms. Each Λi represents the set of possible values for the i-th hyperparameter of a (i ∈ {1, ..., k}), and can usually be defined by a set of constraints. Additionally, let 𝒟 be a set of datasets, where D ∈ 𝒟 is a dataset from 𝒟. The function f(a, D, λ) measures the predictive performance of the model induced by the algorithm a ∈ A on the dataset D ∈ 𝒟 given a hyperparameter configuration λ = (λ1, λ2, ..., λk) ∈ Λ. Without loss of generality, higher values of f(a, D, λ) mean higher predictive performance.
Given a ∈ A, D ∈ 𝒟 and Λ, together with the previous definitions, the goal of a hyperparameter tuning task is to find λ* such that

    λ* = arg max_{λ ∈ Λ} f(a, D, λ)    (1)
The optimization of the hyperparameter values can be based on any performance measure f, which can even be defined by multi-objective criteria. Further aspects can make the tuning more difficult, such as:

hyperparameter configurations that lead to a model with high predictive performance for a given dataset may not lead to high predictive performance for other datasets;

hyperparameter values often depend on each other (as in the case of SVMs BenHur:2010); hence, tuning hyperparameters independently may not lead to a good set of hyperparameter values;

the evaluation of a specific hyperparameter configuration, not to mention of many configurations, can be very time-consuming.
In the last decades, population-based optimization techniques have been successfully used for the hyperparameter tuning of classification algorithms Bergstra:2011; Bardenet:2013. When applied to tuning, these techniques (iteratively) build a population P of hyperparameter settings, computing f(a, D, λ) for each λ ∈ P. By doing so, they can simultaneously explore different regions of the search space. There are various population-based hyperparameter tuning strategies, which differ in how they update P at each iteration. Some of them are briefly described next.
3.1 Random Search
RS is a simple technique that performs random trials in a search space. Its use can reduce the computational cost when a large number of possible settings is being investigated Andradottir:2015. Usually, RS performs its search iteratively for a predefined number of iterations, and the population P is extended (updated) by a randomly generated hyperparameter setting at each i-th iteration of the hyperparameter tuning process. RS has obtained efficient results in the optimization of DL algorithms Bergstra:2012; Bardenet:2013.
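To make this loop concrete, the sketch below runs a random search over a toy two-dimensional space loosely modeled on CART's ‘cp’ and ‘minsplit’ hyperparameters. The fitness function is a synthetic stand-in for the inner-CV performance f(a, D, λ); the ranges and the location of the optimum are illustrative only, and the paper's actual experiments use the R packages listed in Table 5.

```python
import random

# Synthetic stand-in for the inner-CV performance f(a, D, lambda) of a DT
# induced with configuration (cp, minsplit); ranges and optimum are made up.
def fitness(cp, minsplit):
    return 1.0 - (cp - 0.05) ** 2 - ((minsplit - 20) / 100.0) ** 2

def random_search(budget, seed=0):
    rng = random.Random(seed)
    best, best_fit = None, float("-inf")
    for _ in range(budget):
        # Each iteration extends the population with one random setting.
        cand = (rng.uniform(0.0, 0.3), rng.randint(2, 50))
        fit = fitness(*cand)
        if fit > best_fit:
            best, best_fit = cand, fit
    return best, best_fit

best, fit = random_search(900)  # 900 evaluations: the budget used in the paper
```

Because each trial is independent, RS needs no bookkeeping beyond the best setting found, which is what makes it a cheap but surprisingly strong baseline.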
3.2 Sequential Model Based Optimization
SMBO Snoek:2012; Brochu:2010 is a sequential method that starts with a small initial population P_0 which, at each new iteration i, is extended by a new hyperparameter configuration λ', such that the expected value of f(a, D, λ') is maximal according to an induced meta-model f̂ approximating f on the current population P_{i-1}. In the experiments reported in Bergstra:2011; Snoek:2012; Bergstra:2013B, SMBO performed better than GS and RS, and matched or outperformed state-of-the-art techniques in several hyperparameter optimization tasks.
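The sketch below conveys the idea on a toy one-dimensional problem. The ‘surrogate’ here is a deliberately crude nearest-neighbour predictor with a distance-based exploration bonus, standing in for the Gaussian-process or RF meta-models used by real SMBO implementations such as mlrMBO; all names and constants are illustrative.

```python
import random

# Toy 1-D objective standing in for the inner-CV performance f(a, D, lambda)
# of some DT configuration; the optimum at x = 0.25 is arbitrary.
def fitness(x):
    return 1.0 - (x - 0.25) ** 2

def smbo(budget, seed=0):
    rng = random.Random(seed)
    # Small initial population of already-evaluated configurations.
    pop = [(x, fitness(x)) for x in (0.0, 0.5, 1.0)]
    for _ in range(budget):
        def acquisition(x):
            # Crude surrogate: value of the nearest evaluated neighbour plus
            # a bonus for unexplored regions (stands in for model uncertainty).
            d, y = min((abs(x - xi), yi) for xi, yi in pop)
            return y + 0.5 * d
        candidates = [rng.random() for _ in range(100)]
        x_new = max(candidates, key=acquisition)
        pop.append((x_new, fitness(x_new)))  # evaluate only the chosen point
    return max(pop, key=lambda p: p[1])

x_best, y_best = smbo(30)
```

The key design choice is that the expensive objective is called once per iteration, on the single candidate the surrogate deems most promising, while the cheap acquisition function is evaluated many times.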
3.3 Genetic Algorithm
Bio-inspired techniques based on natural processes, such as GAs, have also been largely used for hyperparameter tuning Gomes:2012; Friedrichs:2005; Kalos:2005. In these techniques, the initial population P_0, generated randomly or according to background knowledge, is changed at each iteration according to operators based on natural selection and evolution.
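A minimal sketch of such a GA is shown below, with truncation selection, one-point crossover and Gaussian mutation; the fitness function is again a synthetic stand-in for inner-CV performance, and all operator settings are illustrative rather than the ones used by the GA R package in the experiments.

```python
import random

# Toy fitness standing in for the inner-CV performance of a DT configuration
# encoded as a real-valued vector; the optimum at 0.3 per gene is arbitrary.
def fitness(ind):
    return -sum((g - 0.3) ** 2 for g in ind)

def ga(pop_size=20, genes=2, gens=40, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]             # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, genes)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:                  # Gaussian mutation
                i = rng.randrange(genes)
                child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children                    # elitist replacement
    return max(pop, key=fitness)

best = ga()
```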
3.4 Particle Swarm Optimization
PSO is a bio-inspired technique relying on the swarming and flocking behaviors of animals Simon:2013. In PSO, each particle is associated with its position x ∈ Λ in the search space, a velocity v, and the best position b it has found so far. During the iterations, the movement of each particle is changed according to its best position found so far, as well as the best position found by the entire swarm (recorded throughout the computation).
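The update rule can be sketched as follows, with the usual inertia plus personal-best and swarm-best attraction terms; the objective is a synthetic stand-in for inner-CV performance, and the coefficients (0.7, 1.5, 1.5) are common textbook values, not the settings of the pso R package used in the experiments.

```python
import random

# Toy objective standing in for the inner-CV performance of a DT configuration.
def fitness(x):
    return -sum((xi - 0.3) ** 2 for xi in x)

def pso(n_particles=15, dim=2, iters=60, seed=0):
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    P = [x[:] for x in X]                        # per-particle best positions
    g = max(P, key=fitness)[:]                   # best position of the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull towards personal and swarm bests.
                V[i][d] = (0.7 * V[i][d]
                           + 1.5 * r1 * (P[i][d] - X[i][d])
                           + 1.5 * r2 * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            if fitness(X[i]) > fitness(P[i]):
                P[i] = X[i][:]
                if fitness(P[i]) > fitness(g):
                    g = P[i][:]
    return g

best = pso()
```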
3.5 Estimation of Distribution Algorithm
EDAs Hauschild:2011 lie on the boundary between GAs and SMBO, combining the advantages of both approaches: the search is guided by iteratively updating an explicit probabilistic model of promising candidate solutions. In other words, the implicit crossover and mutation operators used in GAs are replaced by an explicit probabilistic model M.
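A minimal sketch with one common choice of model, a univariate Gaussian per dimension, is shown below; real EDAs such as the copula-based ones in the copulaedas R package used in the experiments employ richer models, and the objective here is again a synthetic stand-in.

```python
import random, statistics

# Toy objective standing in for the inner-CV performance of a DT configuration.
def fitness(x):
    return -sum((xi - 0.3) ** 2 for xi in x)

def eda(pop_size=40, dim=2, gens=30, seed=0):
    rng = random.Random(seed)
    # Explicit probabilistic model M: one Gaussian per dimension.
    mu, sigma = [0.5] * dim, [0.3] * dim
    best = None
    for _ in range(gens):
        pop = [[rng.gauss(mu[d], sigma[d]) for d in range(dim)]
               for _ in range(pop_size)]
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]             # promising candidate solutions
        if best is None or fitness(elite[0]) > fitness(best):
            best = elite[0]
        # Re-estimate the model from the elite; this replaces the implicit
        # crossover and mutation operators of a GA.
        for d in range(dim):
            vals = [ind[d] for ind in elite]
            mu[d] = statistics.mean(vals)
            sigma[d] = max(statistics.stdev(vals), 1e-3)
    return best

best = eda()
```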
3.6 Iterated F-race
The irace Birattari:2010 technique was designed for algorithm configuration and optimization problems Lang:2013; Miranda:2014, and is based on ‘racing’. A race starts with an initial population P_0, and iteratively selects the most promising candidates considering the hyperparameter distributions, comparing them by means of statistical tests. Candidate configurations that are statistically worse than at least one other candidate configuration are discarded from the race. Based on the surviving candidates, the distributions are updated. This process is repeated until a stopping criterion is reached.
4 Experimental methodology
The nested CV Cawley:2010; Krstajic:2014 experimental methodology employed is illustrated by Figure 2. For each dataset, the data are split into 10 outer folds: the training folds are used by the tuning techniques to find good hyperparameter settings, while the test fold is used to assess the ‘optimal’ solution found. Internally, the tuning techniques split each set of training folds into 3 inner folds to measure the fitness value of each new hyperparameter setting. At the end of the process, a set of optimization paths, settings, and their predictive performances is returned. During the experiments, all the tuning techniques were run on the same data partitions, with the same seeds, to allow their comparison. In Krstajic:2014, the authors argued that there is no study suggesting the ideal number of folds for the outer and inner CV loops; here, the same outer value used in the original paper was adopted, while 3 inner folds were used due to time constraints and the size of the datasets used in the experiments. The next subsections detail the subcomponents used in the tuning task.
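The fold bookkeeping behind this methodology can be sketched as follows, using the paper's 10 outer and 3 inner folds. This is an index-level illustration only (the experiments themselves rely on mlr's resampling implementations), and it omits the stratification by class used in the paper.

```python
import random

# Nested CV sketch: 10 outer folds for final assessment; the tuner sees only
# the outer-training data, which it re-splits into 3 inner folds for fitness.
def kfold(indices, k, seed):
    idx = indices[:]
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]          # k disjoint folds

def nested_cv(n_examples, outer_k=10, inner_k=3, seed=42):
    outer = kfold(list(range(n_examples)), outer_k, seed)
    plan = []
    for i, test_fold in enumerate(outer):
        train = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = kfold(train, inner_k, seed)       # folds seen only by the tuner
        plan.append((train, test_fold, inner))
    return plan

plan = nested_cv(300)
```

The crucial property is that every test fold is disjoint from the data the tuner ever touches, so the outer performance estimate is not biased by the hyperparameter search.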
4.1 Hyperparameter spaces
The experiments were performed considering the hyperparameter tuning of three DT induction algorithms: the J48 algorithm, a WEKA (http://www.cs.waikato.ac.nz/ml/weka/) Witten:2005 implementation of the C4.5 algorithm; the rpart implementation of the CART algorithm Breiman:1984; and the CTree algorithm Hothorn:2006. These algorithms were selected due to their wide acceptance and use in many ML applications Maimon:2014; Jankowski:2014; Barros:2012. The first two are among the most used algorithms in Machine Learning, especially by non-expert users Wu:2009, and the third is a more recent implementation that uses statistical tests for splits, like the classical CHAID algorithm Kass:1980. The corresponding hyperparameter spaces investigated are described in Table 4.
Algo | Symbol | Hyperparameter | Range | Type | Default | Conditions
J48 | C | pruning confidence | | real | 0.25 | R = False
J48 | M | minimum number of instances in a leaf | | integer | 2 | -
J48 | N | number of folds for reduced-error pruning | | integer | 3 | R = True
J48 | O | do not collapse the tree | {False, True} | logical | False | -
J48 | R | use reduced-error pruning | {False, True} | logical | False | -
J48 | B | use binary splits only | {False, True} | logical | False | -
J48 | S | do not perform subtree raising | {False, True} | logical | False | -
J48 | A | Laplace smoothing for predicted probabilities | {False, True} | logical | False | -
J48 | J | do not use MDL correction for info gain on numeric attributes | {False, True} | logical | False | -
CART | cp | complexity parameter | | real | | -
CART | minsplit | minimum number of instances in a node for a split to be attempted | | integer | | -
CART | minbucket | minimum number of instances in a leaf | | integer | | -
CART | maxdepth | maximum depth of any node of the final tree | | integer | | -
CART | usesurrogate | how to use surrogates in the splitting process | | factor | | -
CART | surrogatestyle | controls the selection of the best surrogate | | factor | | -
CTree | mincriterion | the value of the test statistic (1 − p-value) that must be exceeded for a split to occur | | real | 0.95 | -
CTree | minsplit | minimum sum of weights in a node for a split to occur | | integer | 20 | -
CTree | minbucket | minimum sum of weights in a leaf | | integer | 7 | -
CTree | mtry | number of input variables randomly sampled as candidates at each node (for random-forest-like algorithms) | | real | 0 | -
CTree | maxdepth | maximum depth of any node of the final tree | | integer | no restriction | -
CTree | stump | whether a stump (a tree with three nodes only) is to be computed | {False, True} | logical | False | -
Originally, J48 has ten tunable hyperparameters (see http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html): all those presented in Table 4 plus the hyperparameter ‘U’, which enables the induction of unpruned trees. Since pruned trees seek the most interpretable models without loss of predictive performance, this hyperparameter was removed from the experiments, and only pruned trees were considered. For CTree, all the statistically dependent hyperparameters were left out, since their effects have been previously studied and the default choices were robust for a wide range of problems Hothorn:2006; thus, only the non-statistically-dependent hyperparameters were selected. Regarding CART, all the tunable hyperparameters in rpart were selected.
For each hyperparameter, Table 4 shows the allowed range of values, the default values provided by the corresponding packages, and the constraints for setting new values. The hyperparameter ranges were the same used in Reif et al. Reif:2011. The range of the pruning confidence (C) hyperparameter was adapted from Reif et al. Reif:2014, because the algorithm internally controls the parameter values and does not allow some values near the boundaries of the range.
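The Conditions column of Table 4 makes parts of the J48 space hierarchical: ‘C’ is only active when R = False, and ‘N’ only when R = True. A sampler over such a conditional space can be sketched as below; the encoding and the numeric ranges are placeholders for illustration, not the exact bounds used in the experiments.

```python
import random

# Illustrative encoding of part of the J48 space from Table 4, including the
# conditional hyperparameters; ranges here are placeholders, not the paper's.
SPACE = {
    "M": ("int", 1, 50),
    "R": ("bool",),
    "C": ("real", 0.01, 0.49, {"R": False}),   # pruning confidence
    "N": ("int", 2, 10, {"R": True}),          # folds for reduced-error pruning
}

def sample(space, rng):
    config = {}
    # Sample unconditional hyperparameters first, then conditional ones.
    for name, spec in sorted(space.items(), key=lambda kv: len(kv[1]) == 4):
        cond = spec[3] if len(spec) == 4 else {}
        if any(config.get(k) != v for k, v in cond.items()):
            continue                           # condition not met: inactive
        if spec[0] == "bool":
            config[name] = rng.random() < 0.5
        elif spec[0] == "int":
            config[name] = rng.randint(spec[1], spec[2])
        else:
            config[name] = rng.uniform(spec[1], spec[2])
    return config

cfg = sample(SPACE, random.Random(1))
```

Every tuning technique in Section 3 can reuse such a sampler, which keeps inactive hyperparameters out of the candidate configurations entirely.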
4.2 Datasets
The experiments were carried out using public datasets from the OpenML Vanschoren:2014 website (http://www.openml.org/), a free scientific platform for the standardization of ML experiments, collaboration and sharing of empirical results. Binary and multiclass classification datasets were selected, varying the number of attributes (D) and of examples (N). In all selected datasets, each class (C) has a minimum number of examples, to allow the use of the stratified CV methodology. All datasets, with their main characteristics, are presented in Tables 7 and 8 in Appendix B.
4.3 Hyperparameter tuning techniques
Six hyperparameter tuning techniques were investigated:

three different metaheuristics: a ga Goldberg:1989 , pso Kennedy:1995 and an eda Hauschild:2011 . These techniques are often used for hyperparameter tuning of ml classification algorithms in general GasconMoreno:2011 ; Lin:2008 ; Yang:2013 ;

a simple rs technique: suggested in Bergstra:2012 as a good alternative for hyperparameter tuning replacing gs technique;

irace Birattari:2010 : a racing technique designed for algorithm configuration problems; and

a smbo Snoek:2012 technique: a state-of-the-art optimization technique that employs statistical and/or machine learning models to predict distributions over labels, allowing a more direct and faster optimization.
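Among these, rs is the simplest to express in code. A minimal sketch of the idea in Python follows; the search space below is a hypothetical stand-in for the J48 space of Table 4, and `evaluate` is a placeholder for the inner cross-validation loop:

```python
import random

# Hypothetical, simplified stand-in for the J48 search space of Table 4.
SPACE = {
    "C": lambda: random.uniform(0.01, 0.5),    # pruning confidence
    "M": lambda: random.randint(1, 50),        # min instances per leaf
    "B": lambda: random.choice([True, False]), # binary splits only
}

def random_search(evaluate, budget=900, seed=42):
    """Draw `budget` random settings and keep the best one."""
    random.seed(seed)
    best_setting, best_score = None, float("-inf")
    for _ in range(budget):
        setting = {name: sample() for name, sample in SPACE.items()}
        score = evaluate(setting)  # e.g. mean balanced accuracy on inner CV
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score
```

In the experiments this role is played by mlr's random search; the sketch only illustrates the sample-and-keep-best logic shared by the budget-bounded techniques.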
Element | Method | R package
HP tuning techniques | Random Search | mlr
 | Genetic Algorithm | GA
 | Particle Swarm Optimization | pso
 | Estimation of Distribution Algorithm | copulaedas
 | Sequential Model-Based Optimization | mlrMBO
 | Iterated F-race | irace
Decision Trees | J48 algorithm | RWeka
 | CART algorithm | rpart
 | CTree algorithm | party
Inner resampling | 3-fold cross-validation | mlr
Outer resampling | 10-fold cross-validation | mlr
Optimized measure | Balanced per-class accuracy | mlr
Evaluation measures | Balanced per-class accuracy, optimization paths | mlr
Budget | 900 iterations |
Repetitions | 30 times with different seeds (seeds = ) |
Baseline | Default values (DF) | RWeka, rpart, party
Table 5 summarizes the choices made to instantiate the hyperparameter tuning techniques. Most of the experiments were implemented using the mlr R package (https://github.com/mlr-org/mlr) mlr:2016 (measures, resampling strategies, main tuning processes and the rs technique). The ga, pso and eda metaheuristics were implemented using the GA (https://github.com/luca-scr/GA) Scrucca:2013, pso (https://cran.r-project.org/web/packages/pso/index.html) Bendtsen:2012, and copulaedas (https://github.com/yasserglez/copulaedas) GonzalezFernandez:2014 R packages, respectively. The J48, cart and ctree algorithms were used via the RWeka (https://cran.r-project.org/web/packages/RWeka/index.html) Hornik:2009, rpart (https://cran.r-project.org/web/packages/rpart/index.html) rpart:2014 and party (https://cran.r-project.org/web/packages/party/index.html) Hothorn:2006 packages, respectively, wrapped into the mlr package. The smbo technique was implemented using the mlrMBO R package (https://github.com/mlr-org/mlrMBO) mlrMBO:2017, with its rf surrogate models provided by the randomForest R package (https://cran.r-project.org/web/packages/randomForest/index.html) rf:2002. The irace technique was implemented using the irace R package (http://iridia.ulb.ac.be/irace/) irace:2016.
Since the experiments involve a large number of datasets with different characteristics, many datasets may have unbalanced classes. Thus, the same predictive performance measure used as the fitness value during optimization, bac Brodersen:2010, is also used for model evaluation.
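Balanced per-class accuracy is the unweighted mean of the per-class recalls, which prevents large classes from dominating the score. A minimal sketch of the measure (not the mlr implementation):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall: robust to class imbalance."""
    hits = defaultdict(int)    # correct predictions per class
    totals = defaultdict(int)  # instances per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)
```

For example, a constant classifier on an 80/20 dataset reaches 0.8 plain accuracy but only 0.5 balanced accuracy, since the minority-class recall is zero.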
When tuning occurs in real scenarios, time is an important aspect to be considered. Sometimes the tuning process may take many hours to find good settings for a single dataset Reif:2012 ; Ridd:2014 . Thus, this work investigates whether it is possible to find the same good settings faster by using a reduced number of evaluations (budget). Based on previous results and analyses Mantovani:2016 , a budget size of evaluations was adopted in the experiments (the budget size choice is discussed in more detail in Section 6).
Since all techniques are stochastic, each one was executed times for each dataset using different seed values. This gives a total of (repetitions) (outer folds) (budget) hyperparameter settings generated during the search process for each dataset. In addition, the default hyperparameter values provided by the ‘RWeka’, ‘rpart’ and ‘party’ packages were used as baselines for the experimental comparisons.
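The nested resampling just described, tuning in an inner loop and assessment in an outer loop, can be sketched as follows. Here `tune` stands for any of the six techniques, and the fold split is a simplified, non-stratified stand-in for mlr's resampling:

```python
def k_folds(n, k):
    """Split indices 0..n-1 into k contiguous folds (no stratification)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nested_cv(n, tune, outer_k=10, inner_k=3):
    """For each outer fold: tune on the inner folds, assess on the held-out fold."""
    results = []
    for test_idx in k_folds(n, outer_k):
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        inner = k_folds(len(train_idx), inner_k)  # inner 3-fold CV for fitness
        setting = tune(train_idx, inner)          # search within the budget
        results.append((setting, test_idx))       # assess the best setting
    return results
```

The outer loop yields one recommended setting and one unbiased performance estimate per fold, matching the 10 x 3 scheme in Table 5.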
Technique | Parameter | Value
RS | stopping criterion | budget size
PSO | number of particles | 10
 | maximum number of iterations | 90
 | stopping criterion | budget size
 | algorithm implementation | SPSO2007 Clerc:2012
EDA | number of individuals | 10
 | maximum number of iterations | 90
 | stopping criterion | budget size
 | EDA implementation | GCEDA
 | copula function | normal
 | margin function | truncnorm
GA | number of individuals | 10
 | maximum number of iterations | 90
 | stopping criterion | budget size
 | selection operator | proportional selection with linear scaling
 | crossover operator | local arithmetic crossover
 | crossover probability | 0.8
 | mutation operator | random mutation
 | mutation probability | 0.05
 | elitism rate | 0.05
SMBO | points in the initial design | 10
 | initial design method | Random LHS
 | surrogate model | Random Forest
 | stopping criterion | budget size
 | infill criterion | expected improvement
Irace | number of instances for resampling | 100
 | stopping criterion | budget size
As this paper evaluates different tuning techniques, to avoid the influence of their own hyperparameter values on their performances, the authors decided to use their default values. Each tuning technique has a different set of hyperparameters, which are specific to each technique's paradigm. In the smbo, irace and pso cases, the use of the defaults has been shown to be robust enough, saving time and resources mlrMBO:2017 ; irace:2016 ; Bigiarini:2013 . For eda and ga (and evolutionary methods in general) there are no standard values for their parameters Mills:2015 . So, to keep the comparisons fair, the default parameter values provided by the corresponding R packages were used. All of these values may be seen in Table 6.
The tuning techniques start from an initial population of random hyperparameter settings and share the same stopping criterion: the budget size. The ga, pso and eda techniques use a real-valued encoding for the individuals/particles; thus, they were adapted to handle discrete and Boolean hyperparameters. All of them were executed sequentially in the same cluster environment. Every job generated was executed in a dedicated core with no concurrency, scheduled by the cluster system.
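The decoding of real-valued individuals/particles into mixed hyperparameter types can be illustrated as below; the bounds and hyperparameter names are illustrative only, not the exact scheme used by the GA, pso and copulaedas packages:

```python
def decode(vector):
    """Map a real-valued individual in [0,1]^3 to mixed J48-like hyperparameters.

    Bounds and names are hypothetical, for illustration only.
    """
    c_raw, m_raw, b_raw = vector
    return {
        "C": 0.01 + c_raw * (0.5 - 0.01),  # continuous: linear rescaling
        "M": 1 + int(round(m_raw * 49)),   # integer: round to the nearest step
        "B": b_raw >= 0.5,                 # Boolean: threshold at 0.5
    }
```

The metaheuristics then evolve the raw vectors as usual, and only the decoded settings are passed to the dt induction algorithm.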
4.4 Repositories for the coding used in this study
The code for the implementations used in this study is publicly available at:

Tuning procedures: https://github.com/rgmantovani/HpTuning;

Graphical analysis: https://github.com/rgmantovani/TuningAnalysis.
Instructions to run each project may be found directly in the corresponding repositories. The experimental results are also available as an openml study (https://www.openml.org/s/50), where all datasets, classification tasks, algorithms/flows and run results for this paper can be listed and downloaded.
5 Experimental results
The next subsections present the main experimental results regarding the dt implementations.
5.1 Performance analysis regarding J48 algorithm
Figure 3 presents the results obtained by the tuning techniques when applied to the J48 dt induction algorithm. Subfigure 3(a) shows the average bac values obtained by the tuning techniques and by the defaults over all datasets. The datasets on the x-axis are placed in decreasing order of their predictive performance with default hyperparameter values (the corresponding dataset names may be seen in Tables 7 and 8 in B).
For each dataset, the name of the tuning technique that resulted in the best predictive performance is shown above the x-axis. The Wilcoxon paired test was applied to assess the statistical significance of the results obtained by this best technique when compared to the results using default values. The test was applied to the solutions obtained from the repetitions (with ). An upward green triangle () on the x-axis identifies datasets where statistically significant improvements were detected after applying the hyperparameter tuning technique. Conversely, a red downward triangle () indicates that the use of defaults was statistically better than the use of the tuning techniques.
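For reference, the Wilcoxon signed-rank statistic behind these comparisons can be sketched as follows; this is a simplified version using the normal approximation and assuming no tied difference magnitudes (the actual analysis relied on a standard statistical implementation):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation, no tie handling).

    Returns (W+, two-sided p-value); a simplified sketch of the test used here.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # rank absolute differences (1 = smallest); assumes distinct magnitudes
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

When one technique consistently outperforms the other across the repetitions, W+ approaches its maximum n(n+1)/2 and the p-value becomes very small.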
A first look at the results shows that, with few exceptions, all tuning techniques have similar performance, since most of the curves overlap. In general, the difference in predictive performance with respect to the default values is small; larger improvements may be seen only in a small subset of datasets. When the Wilcoxon paired test is applied comparing the defaults with the best tuning technique, it shows that, overall, tuned trees were statistically significantly better than those induced with default values in () datasets. In most of these situations, the irace, pso or smbo techniques produced the best results. Default values were significantly better in of the cases, and the remaining situations () did not present statistically significant differences (the approaches tied).
Subfigure 3(b) shows the average size of the final induced J48 dts, measured as the number of nodes in the final tree model. It is important to mention that the interpretability of a tree depends mostly on its size: larger trees are usually more difficult to understand than smaller ones. Regarding the J48 dt size, in most cases, default values (dotted black line) induced larger trees than the hyperparameter settings suggested by the tuning techniques. This held even when default values were the best option with statistical significance. For most of the multiclass tasks with many classes (the datasets furthest to the right in the charts), the tuned trees were also smaller than those induced using default values. Even though small in terms of performance, the improvements were also significant.
The peaks of improvement due to hyperparameter tuning were reached when the dts induced using default values were much smaller than those obtained using hyperparameter tuning. This occurred for the datasets with ids = {}. When comparing the tuning techniques among themselves, significant differences appear only in these datasets. The soft computing techniques tend to produce smaller trees than the smbo and rs techniques.
To compare the default settings with the solutions found during the tuning process, and to obtain useful insights regarding the effectiveness of the defaults, the distributions of the J48 hyperparameter values found by the tuning techniques are presented in Figure 4 (all the hyperparameters were already shown in Table 4). The numerical default values are represented by vertical dashed lines. In the J48 tuning scenario, the largest contrast may be noticed in the ‘R’ subplot: most of the obtained solutions presented ‘R=FALSE’, which disables the reduced error pruning option and the hyperparameter ‘N’ (as the default setting does). The obtained M values also tend to be close to the default value () in most of the cases. The other Boolean hyperparameters seem not to influence the predictive performances reached during the optimization process, since they present a very uniform distribution. Overall, the only hyperparameter that may contribute to generating solutions different from the default values is the confidence pruning hyperparameter (‘C’), as indicated by Subfigure 4(a).
5.2 Performance analysis regarding CART algorithm
Figure 6 presents the graphical analysis of the cart results. Unlike J48, cart was more affected by hyperparameter tuning: in most of the datasets analyzed, () of the cases, the use of tuned values improved the predictive performance with statistical significance when compared with the use of default values. It must be observed that irace and smbo were the best optimization techniques regarding just the predictive performance of the induced models. Default values were better than tuned ones in of the cases. In the remaining datasets, there was no statistically significant improvement when using optimized values.
Regarding the size of the cart dts, whenever defaults were statistically better, the trees induced with them had sizes similar to or smaller than the tuned ones. However, in most of the cases, tuned hyperparameter settings induced trees that were statistically better, but much larger, than those created using default values. Even though the ‘default’ trees were simpler, they were unable to classify most of the problems properly. The comparison among the tuning techniques showed results different from those obtained for the J48 algorithm: the tuning techniques led to the induction of dts with similar sizes. However, the dts induced when irace was used were slightly larger, and with better predictive performance, than those induced using the other optimization techniques.
The cart hyperparameters' distributions found by the tuning techniques can be seen in Figure 6. Unlike J48, the tuned cart trees were obtained from values substantially different from the defaults. This is more evident for the numerical hyperparameters, as shown in Subfigures 6(a) to (d). The ‘cp’, ‘minbucket’ and ‘minsplit’ values tend to be smaller than the default values. For ‘maxdepth’, a wide range of values is tried, indicating a possible dependence on the input problem (dataset). However, the categorical hyperparameters' distributions, shown in Subfigures 6(e) and (f), are very uniform, indicating that their choices may not influence the final predictive performance.
5.3 Performance analysis regarding CTree algorithm
The results obtained in the experiments with ctree are illustrated in Figure 8. Most of the tuning techniques presented similar results, with the exception of ga (the green line), which was clearly worse than all the other techniques regarding predictive performance. Unlike the two previous case studies, ctree's predictive performance was less influenced by hyperparameter tuning. Default values generated the best models in of the datasets. Tuned values improved the predictive performance of the induced trees in () of the datasets. For the remaining , there was no statistical difference between the use of default values and the values produced by the tuning techniques.
Considering the size of the induced trees, the tuning techniques did not generate trees consistently larger or smaller than those induced using default values. There are just a few exceptions, for dataset ids = {79, 57}, where the tuned trees are visibly larger but improved the predictive performance. Comparing the tuning techniques among themselves, irace and pso were the best techniques considering just the predictive performance of the models, followed by the smbo technique.
Figure 8 presents the ctree hyperparameter values’ distributions found during the tuning process. Similarly to the cart scenario, all the numerical hyperparameters presented values different from the default values: some of them produced values smaller than default values (‘minbucket’, ‘minsplit’); another was similar to the default value (‘mtry’); and all the others varied in a wide range of values (‘maxdepth’, ‘mincriterion’). The categorical hyperparameter ‘stump’, which enables the induction of a tree with just one level, is mostly set as stump = FALSE, like the default setting, having no real impact on the performance differences.
5.4 Statistical comparisons between techniques
The Friedman test Demvsar:2006 , with significance levels at and , was also used to compare the hyperparameter tuning techniques, evaluating the statistical significance of the experimental results. The null hypothesis states that all classifiers induced with the hyperparameter settings found by the tuning techniques, and the classifier induced with default values, are equivalent with respect to predictive bac performance. When the null hypothesis was rejected, the Nemenyi post-hoc test was applied, which states that the performances of two techniques are significantly different if their average ranks differ by at least a critical difference (cd) value.
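The cd value itself follows directly from the number of compared techniques k and datasets N: CD = q_alpha * sqrt(k(k+1)/(6N)). A sketch follows; the q_alpha constant used in the test is taken as the tabulated studentized-range value for k = 7 (six tuners plus defaults) at alpha = 0.05, roughly 2.949, and should be treated as an assumed input here:

```python
import math

def nemenyi_cd(k, n, q_alpha):
    """Critical difference for the Nemenyi post-hoc test: two average ranks
    differ significantly if they differ by at least this value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

As expected, the cd shrinks as more datasets are added, so rank differences become easier to declare significant.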
Figure 9 presents the cd diagrams for the three dt induction algorithms. Techniques are connected when there are no statistically significant differences between them. Considering , Subfigure 9(a) depicts the comparison in the J48 scenario. One may note that there is no statistically significant difference between the two best techniques: irace and pso. Also, the models induced with default hyperparameter values did not obtain statistically better results than irace, pso, smbo and rs, while eda and ga obtained statistically inferior performances.
For the cart algorithm (Subfigure 9(b)), the best-ranked technique over all datasets was irace, followed by rs, with no statistically significant difference between them. dts induced with default hyperparameter values obtained the worst performance, being statistically comparable only with ga and eda.
The cd diagrams for the ctree results are shown in Subfigures 9(e) and 9(f). The default hyperparameter values were ranked first, followed by the irace, pso and smbo techniques; however, there are no statistical differences between them. The rs and eda techniques compose the second block: they do not present statistical differences between themselves, but do in relation to the first group. Finally, the ga technique was statistically worse than all the other techniques.
It is worth mentioning that irace was the best tuning technique for all the algorithms. Although the statistical test did not show significant differences between irace and pso (J48, ctree), or between irace and rs (cart), irace is clearly the preferred technique, presenting the lowest average ranking. When a larger value was used (with cd ), there were no changes in the J48 and ctree scenarios. However, regarding the cart performances, irace statistically outperformed all the other techniques, as can be seen in Subfigure 9(d).
5.5 When to perform tuning?
A set of data complexity measures Orriols:2010 ; Garcia:2016 was used to characterize the datasets and to provide patterns that could explain when it is better to use tuned or default values. From the thirteen measures used, three were able to relate their values to the J48 hyperparameter tuning bac performances:

Fisher’s discriminant ratio (f1), f1 [0,+)  selects the attribute that best discriminates the classes: the higher the value, the stronger the indication that at least one of the dataset attributes is able to linearly separate data from different classes;

Collective feature efficiency (f4), f4 [0,+1]  considers the discriminative power of all the dataset’s attributes;

Fraction of points lying on the class boundary (n1), n1 [0,+1]  estimates the complexity of the correct hypothesis underlying the data. Higher values indicate the need for more complex boundaries to separate data.
Two of these measures (f1 and n1) try to identify the existence of at least one dataset attribute that may linearly separate the classes, while f4 attempts to provide information taking into account all the attributes available in the dataset. Considering them, some simple rules could be observed: hyperparameter tuning is commonly recommended for multiclass problems with several classes (), for datasets with a Fisher's discriminant ratio close to zero (), and when the average number of instances in the class boundary is . In cases of high collective feature efficiency (), default hyperparameter values induce good models.
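For illustration, in the two-class case Fisher's discriminant ratio of one attribute is (mu1 - mu2)^2 / (sigma1^2 + sigma2^2), and f1 takes the best attribute. A minimal sketch (two-class only, unlike the generalized version used by the complexity-measure packages):

```python
from statistics import mean, pvariance

def fisher_ratio(values_by_class):
    """Two-class Fisher's discriminant ratio for a single attribute:
    (mu1 - mu2)^2 / (var1 + var2). Higher = easier linear separation."""
    a, b = values_by_class
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))

def f1_measure(attributes):
    """f1 takes the best (maximum) ratio over all attributes."""
    return max(fisher_ratio(attr) for attr in attributes)
```

A well-separated attribute dominates the maximum, which is why a single good attribute is enough to push f1 far from zero.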
For cart, in addition to n1, two other measures were important:

The maximum individual attribute efficiency (f3), f3 [0,+1]  indicates the presence of attributes whose values do not overlap between classes;

The nonlinearity of the one-nearest neighbor classifier (n4), n4 [0,+1]  this measure creates a test set by linear interpolation with random coefficients between pairs of randomly selected instances of the same class, and then returns the test error of the 1NN classifier.
Two of these measures (n1, n4) evaluate class separability, while f3 measures the overlap between the feature values of different classes. Defaults were suggested for a few problems: when more than points were placed on the boundaries, there was at least one attribute with a maximum individual efficiency bigger than , and a linear classifier performed quite well (). Thus, the analysis suggests that hyperparameter tuning is recommended especially for multiclass problems, and for those without a clear linear decision boundary to separate the data instances (i.e., the more complex ones).
Regarding ctree, a different set of measures was considered:

Average intra/inter class nearest neighbor distances (n2), n2 [0,+)  the ratio of the average intra-class and inter-class distances used by a kNN algorithm to classify data examples. Low values indicate that examples from the same class lie closely in the feature space, while high values indicate that examples from the same class are dispersed;

Training error of a linear classifier (l2), l2 [0,+1]  the predictive performance of a linear classifier for the training data. The lower the value, the closer the problem is to be linearly separable.
The measures n2 and l2 are also related to the separability of the problem classes. Tuning is usually recommended when data from the same class are dispersed (), and when a linear classifier is not able to classify examples with a training error (hard problems). For the other situations, default values are recommended.
5.6 Runtime analysis
Running time is also an important aspect to be considered when performing experimental analyses. Figures 11 to 13 show the average tuning, training and testing times spent by the techniques when performing the hyperparameter tuning of the dt induction algorithms.
Tuning and testing times are related to the optimization process. The former measures the time required by the techniques to find good hyperparameter settings within the defined budget. The latter measures the time required to assess the hyperparameter settings recommended by the tuning techniques (illustrated by the outer loop of Figure 2). The training time measures the time required to induce dts with the suggested hyperparameters using all the instances of a dataset. The idea is to reproduce how the models would perform in a practical scenario.
The values on the y-axis of the figures are in seconds, but were scaled with a transformation due to their discrepancy. Each curve with a different color represents a tuning technique. Since there is no tuning with defaults, there is no black dotted curve in the tuning subcharts.
5.6.1 J48 runtime
Figure 11 presents the runtime analysis for J48. Considering the tuning time, the metaheuristics (pso, ga, eda) are the fastest tuning techniques: being population-based, they benefit from the population coding structure to speed up convergence towards a common solution. rs and irace are in the middle. While the former simply samples the space at random, the latter statistically compares many candidates over several rounds, which may explain why they require more running time than the population-based techniques.
Finally, the smbo technique presented the highest optimization/tuning time. The main reason lies in its inner subprocesses: after evaluating the initial points, the technique fits a rf regression model on the available data. Next, it queries the model to propose a new candidate hyperparameter solution using an acquisition function (or infill criterion). This function searches for the point in the hyperspace that yields the best infill value (the expected improvement), which is then added to the model for the next iteration. Inspection of the technique's executions showed that these steps are its main bottleneck, reflected directly in the final runtime.
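The expected improvement criterion mentioned above has a closed form under a Gaussian surrogate prediction with mean mu and standard deviation sigma. A generic sketch for maximization (mlrMBO's actual implementation differs in details):

```python
import math

def expected_improvement(mu, sigma, best):
    """EI of a candidate with surrogate prediction (mu, sigma),
    for maximization with current best observed value `best`."""
    if sigma == 0:
        return 0.0
    z = (mu - best) / sigma
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))            # Phi(z)
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)     # phi(z)
    return (mu - best) * cdf + sigma * pdf
```

EI balances exploitation (high mu) against exploration (high sigma); maximizing it over the hyperspace is precisely the inner search that makes each smbo iteration expensive.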
The test runtime scale is too small, so, in practice, there are no significant differences in the processing costs of the optimization techniques. Usually, tuned trees are assessed faster than those induced using default values, because the tuning techniques induce smaller trees than default hyperparameter values do (see Figure 3). Regarding training costs, training with default settings is faster than with tuned hyperparameter values. This may be due to the Boolean hyperparameters: they enable/disable some transformations that require extra time to handle the data, and with default hyperparameter settings all of these transformations are disabled.
5.6.2 CART runtime
Figure 12 presents the same analysis for the cart algorithm. In general, the running time results for cart provide insights similar to those obtained from the J48 results. smbo was again the technique with the highest processing cost, i.e., it required the most time to consume the budget of possible evaluations (as previously discussed). The other techniques have similar cost curves, with values oscillating depending on the dataset characteristics. As for J48, irace and rs required more time than the metaheuristics.
When assessing the hyperparameter settings by testing the induced dts, models induced with default hyperparameter values required more time than those induced with the recommended tuned settings. This occurred every time the dts induced with default values presented predictive performance statistically better than the models induced with tuned hyperparameter settings. Regarding the training time, hyperparameter-tuned dts required more time for model induction. Since default hyperparameter values generated smaller trees, the test instances need to traverse fewer internal nodes to be labeled with one of the classes.
5.6.3 CTree runtime
Figure 13 presents the running time analysis for the ctree algorithm. As in the previous scenarios, the smbo technique was the most time-consuming technique to evaluate the defined budget. The other techniques presented similar behavior, varying slightly depending on the problem under optimization. There are at least five datasets where all the techniques spent a long time optimizing the hyperparameters: they may be observed at dataset ids = {57, 64, 73, 74, 75}. All of them are multiclass classification tasks with at least classes, suggesting that ctree may have difficulties solving classification tasks with many classes.
Training models with default values required less time than with hyperparameter-tuned solutions. By default, ctree does not apply any random selection of the input features during training (). All the other numerical hyperparameters tend to present values smaller than the defaults, in theory producing smaller trees. However, this is not observed in practice: the tree sizes are very similar (tuned vs default), so the ‘mtry’ values might explain the difference. Regarding testing, the runtime scale is too small, so there are no real differences when assessing the settings found by tuning.
5.7 Convergence of the tuning techniques
Regarding the convergence of the tuning techniques, the boxplots in Figure 13 show the minimum, maximum and three quartiles of the number of evaluations assessed until the best solution was reached. The y-axis shows the number of evaluations, while the x-axis indicates the tuning techniques. Even with a budget of iterations, all tuning techniques required at most steps in the three case studies. Except for irace, which required the largest number of candidates to converge, it is still possible to say that most of the good hyperparameter settings were reached within the first iterations for cart and J48 (as already observed in Mantovani:2016 ). The exception here is the ctree algorithm, since it required more iterations than J48 and cart. Looking back at the ctree tuning results, default values provided the best solution in almost 40% of the datasets, and the difficulty of finding good hyperparameter settings that outperform them is reflected in Figure 13(c).
The boxplots in Figure 13 also suggest that irace requires more evaluations than the rs technique. Looking in detail, irace is based on three steps: (1) sampling new hyperparameter configurations according to a particular distribution (the distributions are independent for each hyperparameter); (2) selecting the best set of configurations by means of racing; and (3) updating the sampling distributions towards the optimal region Lopez:2016 .
The race procedure starts with a finite set of candidates and, at each step, discards the hyperparameter settings that perform statistically worse than at least one other. The process continues with the survivors. In the first iteration, this initial set of candidates is generated from the hyperparameter distributions. The authors of Lopez:2016 emphasize that the first elimination process is fundamental, so there is a number of instances () that must be seen before performing the statistical tests. Thereafter, new statistical comparisons are performed after each new instances are assessed. By default, irace suggests and (as detailed in Table 6). These values were defined after being tuned and studied for different optimization scenarios Perez:2014 .
Internally, the technique estimates its racing hyperparameters based on the budget and the target hyperspace. The number of races () depends on the number of hyperparameters, while each race has its own budget (), limited by the iteration index and the number of evaluations still available (for further details, please consult the irace manual irace:2011 ). Thus, irace works in such a way that the number of candidate settings decreases with the number of iterations, which means more evaluations per configuration are performed in late iterations.
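The overall racing logic can be sketched as below. Note this is a strong simplification: irace's statistical elimination (Friedman/t-tests over the surviving candidates) is replaced here with a hypothetical mean-gap rule, purely for illustration:

```python
def race(candidates, evaluate, instances, first_test=5, step=1, gap=0.05):
    """Evaluate surviving candidates instance by instance and drop those whose
    running mean falls more than `gap` behind the leader.
    A toy stand-in for irace's statistical elimination."""
    scores = {c: [] for c in candidates}
    for seen, inst in enumerate(instances, start=1):
        for c in list(scores):
            scores[c].append(evaluate(c, inst))
        # start testing only after `first_test` instances, then every `step`
        if seen >= first_test and (seen - first_test) % step == 0:
            means = {c: sum(s) / len(s) for c, s in scores.items()}
            best = max(means.values())
            scores = {c: s for c, s in scores.items() if means[c] >= best - gap}
    return scores  # survivors with their observed scores
```

As in irace, weak candidates stop consuming evaluations early, so the budget concentrates on the promising region of the hyperspace.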
Therefore, this difference in the number of evaluations is better explained by the default value of , which increases the minimum number of evaluations required by the technique. The inner racing hyperparameters also have an influence, since they control the number of races, requiring more statistical tests (and evaluations) in late iterations. However, even evaluating more hyperparameter candidates than the rs technique, irace does not require additional time (as may be seen in Figures 11 to 13, except for some datasets and the J48 algorithm). Moreover, it might be covering different regions of the hyperspace, as indicated by the results illustrated in Figure 9.
Considering just the number of hyperparameter settings assessed during the search, although the runtime analysis showed that smbo is the most costly technique, it was able to find good solutions assessing a smaller number of candidates than irace (the technique that resulted in dts with the best predictive performance). This occurred for all the algorithms, suggesting that with a different stopping criterion (early convergence), smbo could also be a reliable choice.
The pso technique was able to find good hyperparameter solutions in the J48 and cart scenarios with fewer than iterations. Based on the statistical results in Figure 9, pso was often among the best techniques in all three scenarios; in some cases, depending on the statistical test, it was not statistically different from the best technique (irace). Thus, it may be a good alternative for quickly obtaining good solutions.
5.8 Hyperparameters’ importance analysis
Statistical analysis was also used to understand how different hyperparameters affect each other and the performance of the dt induction algorithms. An approach to evaluate how the hyperparameters affect the performance of the induced models under different tuning techniques is the fANOVA (functional ANOVA) framework (https://github.com/automl/fanova) introduced in Hutter:2014 . In that paper, the authors present a linear-time algorithm for computing marginal predictions and quantify the importance of single hyperparameters and of interactions between them. The key idea is to build random forests of regression trees that predict the performance of hyperparameter settings and to apply the variance decomposition framework directly to the trees in these forests.
In the original article, the authors ran fANOVA on smbo hyperparameter settings over several scenarios, but never with more than hyperparameter settings. Here, a single execution of irace generates evaluations; thus, experiments using all the techniques would have a high computational cost. Since irace was the best technique overall, it was used to provide the hyperparameter settings for this analysis. In the experiments, hyperparameter settings from repetitions were used, and more memory was allocated to the fANOVA code.
Figure LABEL:fig:fanova_params shows the results for the dt induction algorithms. In the figure, the x-axis shows all the datasets, while the y-axis presents the hyperparameter importances according to fANOVA. The larger the importance of a hyperparameter (or pair of hyperparameters) for inducing trees in a dataset, the darker its corresponding square (importances are scaled between zero and one).
In the figure, any single hyperparameter (or combination of hyperparameters) whose contribution to the performance of the final models fell below a minimum threshold was removed. Applying this filter substantially reduced the number of hyperparameters in focus, but even so, most of the rows in the heatmap are almost white (light red). This analysis shows that most of the combinations contribute little to the performance of the induced dts.
In Subfigure LABEL:fig:fanova_params(a), fANOVA indicates that most of the J48 performances were influenced by the values of the M hyperparameter, either alone or in combination with another hyperparameter (R, N, C). For cart, the ‘minbucket’ and ‘minsplit’ hyperparameters are mainly responsible for the performance of the induced dts, as may be seen in Subfigure LABEL:fig:fanova_params(b).
For ctree, seven of the fANOVA jobs produced errors when executing; in these situations, a white column is shown in the heatmap. Regarding the analysis, the ‘minbucket’ and ‘minsplit’ hyperparameters are again the most important, similarly to cart. On the other hand, their marginal predictions are weaker, which reinforces previous findings describing ctree as less sensitive to tuning.
These findings reinforce what was discussed in the previous subsection: although each analysis may point to a different most important hyperparameter, the same subset of hyperparameters seems to influence the final performance of the induced dts.
6 Threats to Validity
In the design of an empirical study, methodological choices may affect the experimental results. Next, the threats that may affect the results of this study are discussed.
6.1 Construct validity
The datasets used in the experiments were selected to cover a wide range of classification tasks with different characteristics. They were used in their original versions, i.e., no preprocessing was required, since dts are able to handle missing information and attributes of different types. The only restriction adopted ensures that every class in a dataset has enough observations for stratification with the outer folds to be applied. Of course, other datasets may be added to expand the data collection, provided they obey this stratification criterion. However, the authors believe that adding datasets would not substantially change the overall behavior of tuning on the algorithms investigated.
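The stratification requirement can be sketched as follows. This is a simplified illustration of the criterion; the fold-assignment scheme and function name are ours, not the paper's implementation:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each instance index to one of k folds, keeping class
    proportions roughly equal; requires at least k instances per class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    if any(len(idxs) < k for idxs in by_class.values()):
        raise ValueError("every class needs at least k instances")
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        # Deal each class's instances round-robin across the folds.
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

labels = ["a"] * 6 + ["b"] * 3
folds = stratified_folds(labels, 3)  # each fold gets two "a" and one "b"
```

A dataset with a class smaller than k raises an error, which mirrors the selection criterion used to build the data collection.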
Regarding the dt induction algorithms, cart and J48 are among the most popular algorithms used in data mining Wu:2009 . The ctree algorithm works similarly to the traditional CHAID algorithm, using statistical tests, but provides a more recent implementation which handles different types of data attributes (CHAID handles only categorical attributes). Experiments focused on these algorithms due to the interpretability of their induced models and their widespread use. All of them generate simple models, are robust in specific domains, and allow non-expert users to understand how a classification decision is made. The same experimental methodology and analyses can be applied to any other ml algorithm.
Since a wide variety of datasets compose the data collection, some of them may be imbalanced. Thus, the bac performance measure Brodersen:2010 was used as the fitness function during the optimization process, so class distributions are taken into account when assessing a candidate solution. The same performance measure is used to evaluate the final solutions returned by the tuning techniques. Other predictive performance measures could yield different results, depending on how they deal with data imbalance.
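As a reminder of what bac computes, a minimal implementation (mean per-class recall; the function name is ours, and libraries such as mlr and scikit-learn provide equivalent measures):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class weighs equally
    regardless of how many instances it has."""
    classes = set(y_true)
    recalls = []
    for c in classes:
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total = sum(1 for t in y_true if t == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# 9-to-1 imbalanced problem: a majority-class guesser gets 90%
# plain accuracy but only 50% balanced accuracy.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
```

This is exactly why bac is a safer fitness function than plain accuracy when some datasets in the collection are imbalanced.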
The experimental methodology described in Section 4 considers the tuning techniques that have been used in the related literature Feurer:2015 ; Sureka:2008 ; Sun:2013 ; Kotthoff:2016 . The exceptions are the eda and irace techniques, which have recently been explored for hyperparameter tuning of other ml algorithms, such as svm Padierna:2017 ; Miranda:2014 . Since there is a lack of studies investigating these techniques for dts (see Section 2.4), they were added to the experimental setup.
6.2 Internal validity
Krstajic et al. Krstajic:2014 compared different resampling strategies for selecting and assessing the predictive performance of regression/classification models induced by ml algorithms. Cawley & Talbot Cawley:2010 also discuss overfitting in evaluation methodologies when assessing ml algorithms. They describe a so-called “unbiased performance evaluation methodology”, which correctly accounts for any overfitting that may occur during model selection. The internal protocol described by the authors performs model selection independently within each fold of the resampling procedure. In fact, most current studies on hyperparameter tuning have adopted nested cross-validation (cv), including important autoML tools, like Auto-WEKA (http://www.cs.ubc.ca/labs/beta/Projects/autoweka/) Thornton:2013 ; Kotthoff:2016 and Auto-sklearn (https://github.com/automl/autosklearn) Feurer:2015 ; Feurer:2015B . Since this paper aims to assess dt induction algorithms optimized by hyperparameter tuning techniques, the nested cv methodology is the most suitable choice and was adopted in the experiments.
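The nested cv protocol can be sketched as follows. This is a self-contained skeleton; the placeholder `evaluate` function and the hypothetical candidate settings stand in for actually inducing and scoring dts (the synthetic scores are ours, not the paper's results):

```python
import random

def nested_cv(data, settings, k_outer=3, k_inner=2, seed=0):
    """Nested CV skeleton: hyperparameter settings are selected only on
    the inner folds of each outer training split, so the outer test fold
    never leaks into model selection."""
    rng = random.Random(seed)
    # Placeholder fitness: stands in for inducing a dt with `setting`
    # on `train` and measuring bac on `test` (synthetic scores here).
    evaluate = lambda setting, train, test: rng.random()
    data = list(data)
    outer = [data[i::k_outer] for i in range(k_outer)]
    scores = []
    for i, test_fold in enumerate(outer):
        train = [x for j, fold in enumerate(outer) if j != i for x in fold]
        inner = [train[j::k_inner] for j in range(k_inner)]

        def inner_score(s):  # mean score of setting s over the inner folds
            return sum(
                evaluate(s, [x for m, g in enumerate(inner) if m != n for x in g], g)
                for n, g in enumerate(inner)) / k_inner

        best = max(settings, key=inner_score)            # model selection
        scores.append(evaluate(best, train, test_fold))  # unbiased estimate
    return sum(scores) / k_outer

score = nested_cv(range(12), ["C=0.25", "C=0.1", "C=0.5"])
```

The key property is structural: the outer test fold is touched only once, after the inner loop has already committed to a setting, which is what makes the outer estimate unbiased with respect to model selection.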
In the experiments carried out for this study, all the default settings provided by the implementations of the tuning techniques were used. Most of these default values have been evaluated in benchmark studies and reported to provide good predictive performance mlrMBO:2017 ; Perez:2014 , while others (like pso’s) have been shown to be robust on a large number of datasets. For eda and ga, there is no standard choice of parameter values Mills:2015 , and even after adapting both to handle our mixed hyperparameter spaces properly, they performed poorly. This suggests that a fine-tuning of their parameters would be needed. Since this would considerably increase the cost of the experiments by adding a new tuning level (the tuning of the tuning techniques), and most of the techniques performed well with default values, this additional tuning was not assessed in this study.
The use of a larger budget for dt tuning was investigated in Mantovani:2016 . The experimental results suggested that all the considered techniques required only a fraction of that budget to converge. Convergence here means that the tuning techniques could no longer meaningfully improve their predictive performance before the budget was consumed. In fact, in most cases, the tuning reached its maximum performance after relatively few steps; the budget size adopted here was therefore deemed sufficient. Results obtained with this budget showed that the exploration of the hyperparameter spaces led to statistically significant improvements in most cases.
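The convergence criterion described above (stop once improvements stay below a minimum threshold for several consecutive steps) can be sketched as follows; the threshold, patience and trace values are illustrative, not the paper's:

```python
def run_with_early_stop(scores_per_step, min_improvement, patience):
    """Consume a tuning budget step by step, stopping once the best
    score has not improved by more than `min_improvement` for
    `patience` consecutive steps."""
    best, stale = float("-inf"), 0
    for step, score in enumerate(scores_per_step, start=1):
        if score > best + min_improvement:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return best, step  # converged before the budget ran out
    return best, len(scores_per_step)

# Synthetic trace: improvements flatten quickly, so the run stops early.
trace = [0.60, 0.70, 0.74, 0.745, 0.746, 0.746, 0.747, 0.80]
best, steps_used = run_with_early_stop(trace, min_improvement=0.01, patience=3)
```

A criterion like this is what allows a smaller budget to be justified empirically: if runs routinely stop well before the budget is exhausted, the budget is large enough.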
6.3 External validity
Section 5.4 presented statistical comparisons between tuning techniques. In Demvsar:2006 , Demšar discusses the issue of statistical tests for comparing several techniques on multiple datasets, reviewing several statistical methodologies. The method proposed as most suitable is the non-parametric analogue of ANOVA, i.e., the Friedman test, along with the corresponding Nemenyi post-hoc test. The Friedman test ranks all the methods separately for each dataset and uses the average ranks to test whether all techniques are equivalent. If they are not, the Nemenyi test performs all the pairwise comparisons between the techniques and identifies the presence of significant differences. Thus, the Friedman ranking test followed by the Nemenyi post-hoc test was used to evaluate the experimental results of this study.
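The first stage of this procedure, computing average ranks across datasets, can be sketched as follows. This is a simplified version that ignores ties; real Friedman implementations assign average ranks to tied scores:

```python
def average_ranks(results):
    """results[d][t] = performance of technique t on dataset d (higher
    is better). Returns the mean rank of each technique across
    datasets, as used by the Friedman test (rank 1 = best)."""
    n_tech = len(results[0])
    ranks = [0.0] * n_tech
    for row in results:
        order = sorted(range(n_tech), key=lambda t: -row[t])
        for rank, t in enumerate(order, start=1):
            ranks[t] += rank
    return [r / len(results) for r in ranks]

# Three techniques on four datasets: the first wins everywhere.
results = [[0.90, 0.80, 0.70],
           [0.80, 0.60, 0.70],
           [0.95, 0.90, 0.85],
           [0.70, 0.60, 0.50]]
ranks = average_ranks(results)
```

The Friedman statistic is then computed from these average ranks, and the Nemenyi test compares pairs of techniques by checking whether their rank difference exceeds a critical distance.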
Some recent studies have raised the concern that the Friedman-Nemenyi test produces overlapping groups Tantithamthavorn:2017 , recommending instead the Scott-Knott Effect Size Difference (ESD) test, which produces non-overlapping groups. Using the Scott-Knott ESD test, under its assumptions, the analysis of the experimental results did not change. Its main effect was to generate clean, non-overlapping groups, whereas the Friedman test requires a CD diagram to interpret the results. In general, there is no silver bullet, and each test has its pros and cons.
The budget size adopted can directly influence the performance of the metaheuristics, especially ga and eda. In Hauschild:2011 the authors recommend using a sufficiently large population to build a reliable eda model, a suggestion followed in Mantovani:2016 . In this extended version, the budget size was reduced, supported by prior analyses, and the tuning techniques were adapted to work with the reduced number of evaluations. Increasing the population size would also increase both the number of iterations and the budget size. However, it has already been experimentally shown that a small number of evaluations provides good predictive performance values Mantovani:2016 . It is important to highlight that even with a small population the pso technique reached robust results in a wide variety of tasks for the three dt algorithms investigated. At this point, the poor performance values obtained by ga and eda can be considered a limitation: they do not search the space properly under this budget restriction.
7 Conclusions
This paper investigated the effects of hyperparameter tuning on the predictive performance of dt induction algorithms, as well as the impact the hyperparameters have on the performance of the induced models. For this purpose, three dt implementations were chosen as case studies: two of the most popular algorithms in ml, J48 and cart, and ctree, a more recent implementation similar to the classical CHAID algorithm. An experimental analysis of the sensitivity of their hyperparameters was also presented. Experiments were carried out with public openml datasets and six different tuning techniques. The performance of dts induced using these techniques was also compared with that of dts generated with the default hyperparameter values (provided by the corresponding R packages). The main findings are summarized below.
7.1 Tuning of J48
In general, hyperparameter tuning for J48 produced modest improvements when compared to the RWeka default values: the trees induced with tuned hyperparameter settings reached performances similar to those obtained with the defaults. Statistically significant improvements were detected in only one-third of the datasets, often those datasets where the default values produced very shallow trees.
The J48 Boolean hyperparameters enable or disable data transformation processes. In the default settings, all of these hyperparameters are disabled; enabling them requires more time to induce and assess trees (as can be noted in the runtime analysis and charts in Section 5.6). Furthermore, the relative hyperparameter importance results (via the fANOVA analysis) showed that these Boolean hyperparameters are irrelevant for most datasets. Only a subset of the hyperparameters (R, C, N, M) contributes actively to the performance of the final dts.
Most related studies that performed some tuning of J48 tried different values of the complexity parameter (C), but none of them tried hyperparameter tuning using reduced-error pruning, i.e., enabling ‘R’ and changing ‘N’ values. The use of the ‘R’ and ‘N’ options may be a solution when tuning only ‘C’ does not sufficiently improve performance (as indicated by the fANOVA analysis).
None of the related work used the irace technique: they focused on smbo, pso or other tuning techniques. smbo is often used with an early stopping criterion (a budget), since it is the slowest technique; however, it typically converged after relatively few iterations. If good solutions must be obtained quickly, pso might be recommended. However, for the J48 algorithm, the best technique in terms of performance was irace: it was better ranked, evaluated more candidates, and did not consume much runtime.
The J48 default hyperparameter values were good for a significant number of datasets. This behavior may be explained by the fact that the defaults used by RWeka were chosen as the values performing best overall on the uci ml repository Bache:2013 datasets.
7.2 Tuning of CART
Surprisingly, cart was much more sensitive to hyperparameter tuning than J48. Statistically significant improvements were reached in two-thirds of the datasets, most of them with a high performance gain. Most of the hyperparameters control the number of instances in nodes/leaves used for splitting, and thus directly affect the size and depth of the trees. The experimental analyses showed that the default settings induced shallow, small trees for most of the problems, and these trees did not obtain good predictive performance. Where the defaults did grow large trees, their performance was similar to the optimized performance. In general, cart’s default hyperparameter values induced trees that are on average smaller than those produced by J48 under default settings. Another possible explanation for cart’s poor default performance is that the J48 defaults were pre-tuned on uci datasets while the cart defaults were not.
Our relative importance analysis indicated that hyperparameters such as ‘minsplit’ and ‘minbucket’ are mainly responsible for the performance of the final trees. In the related literature, only two of the five works investigated the tuning of both, and they used only rs and smbo as tuning techniques. Experiments showed that, for cart hyperparameter tuning, the irace technique significantly outperformed all the other ones. It evaluated a higher number of candidates during the search, and its running time was comparable to that of the metaheuristics. Thus, irace would be a good choice and might be further explored in future research.
7.3 Tuning of CTree
The tuning of ctree was a new contribution of this study: none of the related works had evaluated more than two of its hyperparameters. The algorithm proved to be the least sensitive to the hyperparameter tuning process, setting up a third case distinct from the previous two. Statistically significant improvements were observed in just a quarter of the datasets, and the default values were statistically better in some of the situations.
Similarly to cart, most of its hyperparameters control the number of data examples in a node required for splitting (but through a statistical approach); consequently, they control the size and depth of the induced trees. During the optimization, the tuning techniques found a wide range of hyperparameter values that differ from the default settings (usually smaller). However, the tree sizes did not show any visible difference, with the irace, pso and smbo curves almost overlapping for all datasets. This suggests that, unlike for J48 and cart, some characteristic other than tree size influences the final predictive performance.
The hyperparameter importance analysis also indicated that only a few of the hyperparameters studied are responsible for the predictive performance of the final trees. Experiments also showed that irace would be the best hyperparameter tuning technique, being better ranked than the other techniques and presenting a running time comparable to that of the other metaheuristics.
7.4 General scenario
In this analysis, we hypothesized that dataset complexity could explain when to use each tuning approach: the more complex (difficult to classify) a dataset is, the more a dt algorithm should benefit from hyperparameter tuning. Thus, to understand when to use each approach, and to be able to recommend whether to tune the hyperparameters or use the default values, each dataset was described by a set of complexity measures, which indicate how difficult the dataset is for a classification task.
We observed that hyperparameter tuning provides the best results for datasets with many classes and with non-linear decision boundaries. On the other hand, the defaults seem adequate for simple classification problems with higher separability between the classes.
Considering the algorithms investigated in this study, each one presented a different behavior under tuning. In general, the default hyperparameter values are suitable for a large range of datasets, but no fixed setting would be suitable for all data classification tasks. This justifies and motivates the development of recommender systems able to suggest the most appropriate hyperparameter setting for a new problem.
7.5 Future Work
Our findings also point to some future research directions. The data complexity characteristics provided useful insight into the situations in which tuning or defaults should be used; however, it should be possible to make more accurate suggestions by exploring further concepts from the metalearning field.
It would obviously also be interesting to explore other ml algorithms and their hyperparameters: not only dt induction algorithms, but classifiers from different learning paradigms. The code developed in this study, which is publicly available, is easily extensible and may be adapted to cover a wider range of algorithms. The same can be said for the analysis.
All the collected hyperparameter information might be leveraged in a recommendation framework that suggests hyperparameter settings. When integrated with openml, such a framework could have great scientific (and societal) impact. The authors have already begun work in this direction.
Acknowledgments
The authors would like to thank CAPES and CNPq (Brazilian agencies) for their financial support, and especially grants #2012/23114-9, #2013/07375-0 and #2015/03986-0 from the São Paulo Research Foundation (FAPESP).
EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle Control Technologies – The project is supported by the Hungarian Government and co-financed by the European Social Fund.
References
 (1) Lior Rokach and Oded Maimon. Data Mining With Decision Trees: Theory and Applications. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2nd edition, 2014.
 (2) Simon Haykin. Neural Networks: A Comprehensive Foundation (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2007.
 (3) Shigeo Abe. Support Vector Machines for Pattern Classification. Springer London, Secaucus, NJ, USA, 2005.
 (4) Xindong Wu and Vipin Kumar. The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC, 1st edition, 2009.
 (5) Dariusz Jankowski and Konrad Jackowski. Evolutionary algorithm for decision tree induction. In Khalid Saeed and Václav Snášel, editors, Computer Information Systems and Industrial Management, volume 8838 of Lecture Notes in Computer Science, pages 23–32. Springer Berlin Heidelberg, 2014.
 (6) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 2005.
 (7) L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman & Hall (Wadsworth, Inc.), 1984.
 (8) J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

 (9) Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Second International Conference on Knowledge Discovery and Data Mining, pages 202–207, 1996.
 (10) Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 95(1-2):161–205, 2005.
 (11) Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.
 (12) James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2546–2554. Curran Associates, Inc., 2011.
 (13) Martin Pilát and Roman Neruda. Multiobjectivization and Surrogate Modelling for Neural Network Hyperparameters Tuning, pages 61–66. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
 (14) Carlo M. Massimo, Nicolò Navarin, and Alessandro Sperduti. HyperParameter Tuning for Graph Kernels via Multiple Kernel Learning, pages 214–223. Springer International Publishing, Cham, 2016.
 (15) Luis Carlos Padierna, Martín Carpio, Alfonso Rojas, Héctor Puga, Rosario Baltazar, and Héctor Fraire. HyperParameter Tuning for Support Vector Machines by Estimation of Distribution Algorithms, pages 787–800. Springer International Publishing, Cham, 2017.
 (16) R.C. Barros, M.P. Basgalupp, A.C.P.L.F. de Carvalho, and A.A. Freitas. A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(3):291–312, May 2012.
 (17) Rodrigo C. Barros, André C. P. L. F. de Carvalho, and Alex Alves Freitas. Automatic Design of DecisionTree Induction Algorithms. Springer Briefs in Computer Science. Springer, 2015.

 (18) Matthias Reif, Faisal Shafait, and Andreas Dengel. Prediction of classifier training time including parameter optimization. In Joscha Bach and Stefan Edelkamp, editors, KI 2011: Advances in Artificial Intelligence, volume 7006 of Lecture Notes in Computer Science, pages 260–271. Springer Berlin Heidelberg, 2011.
 (19) M. M. Molina, J. M. Luna, C. Romero, and S. Ventura. Meta-learning approach for automatic parameter tuning: A case study with educational datasets. In Proceedings of the 5th International Conference on Educational Data Mining, EDM 2012, pages 180–183, 2012.
 (20) Matthias Reif, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel. Automatic classifier selection for nonexperts. Pattern Analysis and Applications, 17(1):83–96, 2014.
 (21) Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
 (22) Gordon V. Kass. An exploratory technique for investigating large quantities of categorical data applied statistics. Applied Statistics, 30(2):119–127, 1980.
 (23) D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
 (24) James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 1942 – 1948, Perth, Australia, 1995.

 (25) Mark Hauschild and Martin Pelikan. An introduction and survey of estimation of distribution algorithms. Swarm and Evolutionary Computation, 1(3):111–128, 2011.
 (26) Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimization of machine learning algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
 (27) Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle. F-Race and Iterated F-Race: An Overview, pages 311–336. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
 (28) Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 754–762, 2014.
 (29) Rafael Gomes Mantovani, Tomás Horváth, Ricardo Cerri, Joaquin Vanschoren, and André C. P. L. F. de Carvalho. Hyper-parameter tuning of a decision tree induction algorithm. In 5th Brazilian Conference on Intelligent Systems, BRACIS 2016, Recife, Brazil, October 9-12, 2016, pages 37–42. IEEE Computer Society, 2016.
 (30) Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explor. Newsl., 15(2):49–60, 2014.
 (31) Taciana A. F. Gomes, Ricardo B. C. Prudêncio, Carlos Soares, André L. D. Rossi, and André C. P. L. F. de Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3–13, 2012.
 (32) James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. J. Mach. Learn. Res., 13:281–305, March 2012.
 (33) Matthias Reif, Faisal Shafait, and Andreas Dengel. Metalearning for evolutionary parameter optimization of classifiers. Machine Learning, 87:357–380, 2012.
 (34) Barbara F.F. Huang and Paul C. Boutros. The parameter sensitivity of random forests. BMC Bioinformatics, 17(1):331, 2016.
 (35) Katharina Eggensperger, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 1114–1120. AAAI Press, 2015.

 (36) Lidan Wang, Minwei Feng, Bowen Zhou, Bing Xiang, and Sridhar Mahadevan. Efficient hyper-parameter optimization for NLP applications. In Lluís Màrquez, Chris Callison-Burch, Jian Su, Daniele Pighin, and Yuval Marton, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2112–2117. The Association for Computational Linguistics, 2015.
 (37) Tatjana Eitrich and Bruno Lang. Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of Computational and Applied Mathematics, 196(2):425–436, 2006.
 (38) J. Gascón-Moreno, S. Salcedo-Sanz, E. G. Ortiz-García, L. Carro-Calvo, B. Saavedra-Moreno, and J. A. Portilla-Figueras. A binary-encoded tabu-list genetic algorithm for fast support vector regression hyper-parameters tuning. In International Conference on Intelligent Systems Design and Applications, pages 1253–1257, 2011.
 (39) Munehiro Nakamura, Atsushi Otsuka, and Haruhiko Kimura. Automatic selection of classification algorithms for non-experts using meta-features. China-USA Business Review, 13(3):199–205, 2014.
 (40) Parker Ridd and Christophe Giraud-Carrier. Using metalearning to predict when parameter optimization is likely to improve classification accuracy. In Joaquin Vanschoren, Pavel Brazdil, Carlos Soares, and Lars Kotthoff, editors, Meta-learning and Algorithm Selection Workshop at ECAI 2014, pages 18–23, August 2014.
 (41) Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michèle Sebag. Collaborative hyperparameter tuning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 199–207. JMLR Workshop and Conference Proceedings, 2013.
 (42) M. Lang, H. Kotthaus, P. Marwedel, C. Weihs, J. Rahnenführer, and B. Bischl. Automatic model selection for highdimensional survival analysis. Journal of Statistical Computation and Simulation, 85(1):62–76, 2015.
 (43) P.B.C. Miranda, R.M. Silva, and R.B. Prudêncio. Finetuning of support vector machine parameters using racing algorithms. In Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2014, pages 325–330, 2014.
 (44) Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 1128–1135. AAAI Press, 2015.
 (45) Vili Podgorelec, Saso Karakatic, Rodrigo C. Barros, and Márcio P. Basgalupp. Evolving balanced decision trees with a multipopulation genetic algorithm. In IEEE Congress on Evolutionary Computation, CEC 2015, Sendai, Japan, May 2528, 2015, pages 54–61. IEEE, 2015.
 (46) Michael Schauerhuber, Achim Zeileis, David Meyer, and Kurt Hornik. Benchmarking Open-Source Tree Learners in R/RWeka, pages 389–396. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
 (47) Ashish Sureka and Kishore Varma Indukuri. Using Genetic Algorithms for Parameter Optimization in Building Predictive Data Mining Models, pages 260–271. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
 (48) Gregor Stiglic, Simon Kocbek, Igor Pernek, and Peter Kokol. Comprehensive decision tree models in bioinformatics. PLOS ONE, 7(3):1–13, 03 2012.
 (49) Shih-Wei Lin and Shih-Chieh Chen. Parameter determination and feature selection for C4.5 algorithm using scatter search approach. Soft Computing, 16(1):63–75, January 2012.
 (50) J. Ma. Parameter Tuning Using Gaussian Processes. Master’s thesis, University of Waikato, New Zealand, 2012.
 (51) C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD-2013, pages 847–855, 2013.
 (52) Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, and Kevin Leyton-Brown. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 17:1–5, 2016.
 (53) Quan Sun and Bernhard Pfahringer. Pairwise metarules for better metalearningbased algorithm ranking. Mach. Learn., 93(1):141–161, oct 2013.
 (54) Ashish Sabharwal, Horst Samulowitz, and Gerald Tesauro. Selecting nearoptimal learners via incremental data allocation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2007–2015. AAAI Press, 2016.
 (55) Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. Automated parameter optimization of classification techniques for defect prediction models. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 321–332, New York, NY, USA, 2016. ACM.
 (56) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
 (57) Michael Wainberg, Babak Alipanahi, and Brendan J. Frey. Are random forests truly the best classifiers? Journal of Machine Learning Research, 17(110):1–5, 2016.
 (58) K. Bache and M. Lichman. UCI machine learning repository, 2013.
 (59) Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, and Can Candan. caret: Classification and Regression Training, 2016. R package version 6.071.
 (60) Róger Bermúdez-Chacón, Gaston H. Gonnet, and Kevin Smith. Automatic problem-specific hyperparameter optimization and model selection for supervised machine learning: Technical Report. Technical report, Zürich, 2015.
 (61) Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc., 2015.
 (62) Julien-Charles Lévesque, Christian Gagné, and Robert Sabourin. Bayesian hyperparameter optimization for ensemble learning. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 437–446, Arlington, Virginia, United States, 2016. AUAI Press.
 (63) Alexis Sardá-Espinosa, Subanatarajan Subbiah, and Thomas Bartz-Beielstein. Conditional inference trees for knowledge extraction from motor health condition data. Engineering Applications of Artificial Intelligence, 62:26–37, 2017.
 (64) Asa Ben-Hur and Jason Weston. A user’s guide to support vector machines. In Data Mining Techniques for the Life Sciences, volume 609 of Methods in Molecular Biology, pages 223–239. Humana Press, 2010.
 (65) Sigrun Andradottir. A review of random search methods. In Michael C Fu, editor, Handbook of Simulation Optimization, volume 216 of International Series in Operations Research & Management Science, pages 277–292. Springer New York, 2015.
 (66) Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
 (67) J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. 30th Intern. Conf. on Machine Learning, pages 1–9, 2013.
 (68) Frauke Friedrichs and Christian Igel. Evolutionary tuning of multiple SVM parameters. Neurocomput., 64:107–117, 2005.
 (69) Alex Kalos. Automated neural network structure determination via discrete particle swarm optimization (for nonlinear time series models). In Proceedings of the 5th WSEAS International Conference on Simulation, Modelling and Optimization, SMO’05, pages 325–331. World Scientific and Engineering Academy and Society (WSEAS), 2005.
 (70) Dan Simon. Evolutionary Optimization Algorithms. Wiley, first edition, 2013.
 (71) Gavin C. Cawley and Nicola L. C. Talbot. On overfitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 11:2079–2107, 2010.
 (72) Damjan Krstajic, Ljubomir J. Buturovic, David E. Leahy, and Simon Thomas. Crossvalidation pitfalls when selecting and assessing regression and classification models. Journal of Cheminformatics, 6(1):1–15, 2014.
 (73) Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen, and Zne-Jung Lee. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35(4):1817–1824, 2008.
 (74) Xin-She Yang, Zhihua Cui, Renbin Xiao, Amir Hossein Gandomi, and Mehmet Karamanoglu. Swarm Intelligence and Bio-Inspired Computation: Theory and Applications. Elsevier Science Publishers B. V., 1st edition, 2013.
 (75) Bernd Bischl, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. mlr: Machine learning in R. Journal of Machine Learning Research, 17(170):1–5, 2016.
 (76) Luca Scrucca. GA: A package for genetic algorithms in R. Journal of Statistical Software, 53(1):1–37, 2013.
 (77) Claus Bendtsen. pso: Particle Swarm Optimization, 2012. R package version 1.0.3.
 (78) Yasser Gonzalez-Fernandez and Marta Soto. copulaedas: An R package for estimation of distribution algorithms based on copulas. Journal of Statistical Software, 58(9):1–34, 2014.
 (79) Kurt Hornik, Christian Buchta, and Achim Zeileis. Open-source machine learning: R meets Weka. Computational Statistics, 24(2):225–232, 2009.
 (80) Terry Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursive Partitioning and Regression Trees, 2015. R package version 4.1-10.
 (81) Bernd Bischl, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions.
 (82) Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.
 (83) Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43–58, 2016.

 (84) Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. The balanced accuracy and its posterior distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE Computer Society, 2010.
 (85) Maurice Clerc. Standard particle swarm optimisation. 15 pages, September 2012.
 (86) Mauricio Zambrano-Bigiarini, Maurice Clerc, and Rodrigo Rojas. Standard particle swarm optimisation 2011 at CEC-2013: A baseline for future PSO improvements. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2013, Cancun, Mexico, June 20–23, 2013, pages 2337–2344. IEEE, 2013.
 (87) K. L. Mills, J. J. Filliben, and A. L. Haines. Determining relative importance and effective settings for genetic algorithm control parameters. Evol. Comput., 23(2):309–342, June 2015.
 (88) Janez Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
 (89) A. Orriols-Puig, N. Macia, and T. K. Ho. Documentation for the data complexity library in C++. Technical report, La Salle - Universitat Ramon Llull, Barcelona, Spain, 2010.
 (90) Luís P.F. Garcia, André C.P.L.F. de Carvalho, and Ana C. Lorena. Noise detection in the metalearning level. Neurocomputing, 176:14–25, 2016.
 (91) Manuel LópezIbáñez, Jérémie DuboisLacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43 – 58, 2016.
 (92) Leslie Pérez Cáceres, Manuel LópezIbáñez, and Thomas Stützle. An Analysis of Parameters of irace, pages 37–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.
 (93) Manuel LópezIbáñez, Jérémie DuboisLacoste, Thomas Stützle, and Mauro Birattari. The irace package, iterated race for automatic algorithm configuration. Technical Report TR/IRIDIA/2011004, IRIDIA, Université Libre de Bruxelles, Belgium, 2011.
 (94) Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. An empirical comparison of model validation techniques for defect prediction models. IEEE Transactions on Software Engineering, 43(1):1–18, 2017.
Appendix A List of abbreviations used in the paper
Appendix B List of OpenML datasets used for experiments
Nro  OpenML name  OpenML did  D  N  C  %MajC
(Nro = row number; OpenML did = OpenML dataset id; D = number of predictive attributes; N = number of examples; C = number of classes; %MajC = proportion of examples in the majority class.)
1  acute-inflammations  1455  6  120  2  0.58
2  analcatdata_authorship  458  70  841  4  0.38
3  analcatdata_boxing1  448  3  120  2  0.65
4  analcatdata_boxing2  444  3  132  2  0.54
5  analcatdata_creditscore  461  6  100  2  0.73
6  analcatdata_dmft  469  4  797  6  0.19
7  analcatdata_germangss  475  5  400  4  0.25
8  analcatdata_lawsuit  450  4  264  2  0.93
9  appendicitis  1456  7  106  2  0.80
10  artificial-characters  1459  7  10218  10  0.14
11  autoUniv-au1-1000  1547  20  1000  2  0.74
12  autoUniv-au4-2500  1548  100  2500  3  0.47
13  autoUniv-au6-1000  1555  40  1000  8  0.24
14  autoUniv-au6-750  1549  40  750  8  0.22
15  autoUniv-au6-400  1551  40  400  8  0.28
16  autoUniv-au7-1100  1552  12  1100  5  0.28
17  autoUniv-au7-700  1553  12  700  3  0.35
18  autoUniv-au7-500  1554  12  500  5  0.38
19  backache  463  31  180  2  0.86
20  balance-scale  11  4  625  3  0.46
21  banana  1460  2  5300  2  0.55
22  bank-marketing  1461  16  45211  2  0.88
23  banknote-authentication  1462  4  1372  2  0.56
24  blood-transfusion-service-center  1464  4  748  2  0.76
25  breast-w  15  9  699  2  0.66
26  breast-tissue  1465  9  106  6  0.21
27  liver-disorders  8  6  345  2  0.58
28  car  21  6  1728  4  0.70
29  cardiotocography v.2 (version 2)  1560  35  2126  3  0.78
30  climate-model-simulation-crashes  1467  20  540  2  0.91
31  cloud  210  6  108  4  0.30
32  cmc  23  9  1473  3  0.43
33  sonar  40  60  208  2  0.53
34  vowel  307  13  990  11  0.09
35  dermatology  35  34  366  6  0.31
36  fertility  1473  9  100  2  0.88
37  first-order-theorem-proving  1475  51  6118  6  0.42
38  solar-flare  173  12  1389  6  0.29
39  haberman  43  3  306  2  0.74
40  hayes-roth  329  4  160  3  0.41
41  heart-c  49  13  303  5  0.54
42  heart-h  51  13  294  2  0.64
43  heart-long-beach  1512  13  200  5  0.28
44  heart-h v.3 (version 3)  1565  13  294  5  0.64
45  hepatitis  55  19  155  2  0.79
46  hill-valley  1479  100  1212  2  0.50
47  colic  25  27  300  2  0.64
48  ilpd  1480  10  583  2  0.71
49  ionosphere  59  33  351  2  0.64
50  iris  61  4  150  3  0.33
51  kr-vs-kp  3  36  3196  2  0.52
52  LED-display-domain-7digit  40496  7  500  10  0.11
53  lsvt  1484  310  126  2  0.67
54  mammography  310  5  961  2  0.54
55  meta  566  21  528  24  0.04
56  mfeat-fourier  14  76  2000  10  0.10
57  micromass  1514  1300  360  10  0.10
58  molecular-biology_promoters  164  57  106  2  0.50
59  splice  46  62  3190  3  0.52
60  monks-problems-1  333  6  556  2  0.50
61  monks-problems-2  334  6  601  2  0.66
62  monks-problems-3  335  6  554  2  0.52
63  libras-move v.2  40736  90  360  15  0.07
64  mfeat-factors  12  217  2000  10  0.10
65  mushroom  24  21  8124  2  0.52
66  nursery (v.3)  1568  9  12958  4  0.33
67  optdigits  28  62  5620  10  0.10
68  ozone-level-8hr  1487  72  2534  2  0.94
69  ozone_level v.2  40735  72  2536  2  0.97
70  page-blocks  30  10  5473  5  0.90
71  parkinsons  1488  22  195  2  0.75
72  phoneme  1489  5  5404  2  0.71
73  one-hundred-plants-margin  1491  65  1600  100  0.01
74  one-hundred-plants-shape  1492  65  1600  100  0.01
75  one-hundred-plants-texture  1493  65  1599  100  0.01
76  wall-robot-navigation v.3 (version 3)  1526  4  5456  4  0.40
77  sa-heart  1498  9  462  2  0.65
78  seeds  1499  7  210  3  0.33
79  semeion  1501  257  1593  10  0.10
80  credit-g  31  20  1000  2  0.70
81  heart-statlog  53  13  270  2  0.56
82  segment  36  18  2310  7  0.14
83  satellite_image v.2  40734  36  2859  6  0.30
84  vehicle  54  18  846  4  0.26
85  steel-plates-fault  1504  33  1941  2  0.65
86  tae  48  5  151  3  0.34
87  texture  40499  40  5500  11  0.09
88  thoracic-surgery  1506  16  470  2  0.85
89  thyroid-allbp  40474  26  2800  5  0.58
90  thyroid-allhyper  40475  26  2800  5  0.58
91  user-knowledge  1508  6  403  5  0.32
92  vertebra-column  1523  6  310  3  0.48
93  wine  187  14  178  3  0.39
94  yeast (version v.7)  40733  8  1484  4  0.36
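The %MajC column reports the fraction of examples that belong to the most frequent class of each dataset. A minimal sketch of how such a value is computed (the helper below is illustrative, not code from the paper):

```python
from collections import Counter

def majority_class_proportion(labels):
    """Fraction of examples in the most frequent class (the %MajC column)."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Balanced three-class data such as iris (50 examples per class):
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(round(majority_class_proportion(labels), 2))  # 0.33
```

Values close to 1/C indicate balanced class distributions (e.g., iris, 0.33 with C = 3), while values near 1 indicate strong class imbalance (e.g., analcatdata_lawsuit, 0.93).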