An empirical study on hyperparameter tuning of decision trees

12/05/2018 ∙ Rafael Gomes Mantovani et al. ∙ Eötvös Loránd University, State University of Londrina, Universidade de São Paulo, TU Eindhoven, UFSCar

Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations, and their complex interactions, it is common to use optimization techniques to find settings that lead to high predictive accuracy. However, we lack insight into how to efficiently explore this vast space of configurations: which are the best optimization techniques, how should we use them, and how significant is their effect on predictive or runtime performance? This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning on three Decision Tree induction algorithms, CART, C4.5 and CTree. These algorithms were selected because they are based on similar principles, have shown high predictive performance in several previous studies and induce interpretable classification models. Additionally, they contain many interacting hyperparameters to be adjusted. Experiments were carried out with different tuning strategies to induce models and evaluate the relevance of hyperparameters using 94 classification datasets from OpenML. Experimental results indicate that hyperparameter tuning provides statistically significant improvements for C4.5 and CTree in only one-third of the datasets, and in most of the datasets for CART. Different tree algorithms may present different tuning scenarios, but in general, the tuning techniques required relatively few iterations to find accurate solutions. Furthermore, the best technique for all the algorithms was Irace. Finally, we find that tuning a specific small subset of hyperparameters accounts for most of the achievable predictive performance.


1 Introduction

Many ml algorithms for classification tasks can be found in the literature. Although high predictive accuracy is the most frequently used measure to evaluate these algorithms, in many applications easy interpretation of the induced models is also an important requirement. Good predictive performance and model interpretability are combined in one of the most successful families of classification algorithms: dt induction algorithms Maimon:2014 .

When applied to a dataset, these algorithms induce a model represented by a set of rules in a tree-like structure (as illustrated in Figure 1). This structure elucidates how the induced model predicts the class of a new instance, more clearly than many other model representations, such as an ann Haykin:2007 or svm Abe:2005 . As a result, dt induction algorithms are among the most frequently used ml algorithms for classification tasks  Wu:2009 ; Jankowski:2014 .

Figure 1: Example of a decision tree. An unlabeled instance is passed down the tree, which iteratively tests the most promising attribute until a leaf is reached; the class at that leaf is then recommended. Adapted from Tan:2005 .

dt induction algorithms have several other advantages over many ml algorithms, such as robustness to noise, tolerance against missing information, handling of irrelevant and redundant predictive attribute values, and low computational cost Maimon:2014 . Their importance is demonstrated by the wide range of well-known algorithms proposed in the literature, such as Breiman et al.'s cart Breiman:1984 and Quinlan's C4.5 algorithm Quinlan:1993 , as well as some hybrid-variants of them, like nbtree Kohavi:1996 , lmt Landwehr:2005 and ctree Hothorn:2006 .

Like most ml algorithms, dt induction algorithms have hyperparameters whose values must be set. Due to the high number of possible configurations, and their large influence on the predictive performance of the induced models, hyperparameter tuning is often warranted Bergstra:2011 ; Pilat:2013 ; Massimo:2016 ; Padierna:2017 . The tuning task is usually investigated for “black-box” algorithms, such as ann and svm, but much less often for dt induction algorithms. There are some prior studies investigating the evolutionary design of new dt induction algorithms Barros:2012 ; Barros:2015 , but only a few on hyperparameter tuning for them Reif:2011 ; Molina:2012 ; Reif:2014 .

This paper investigates the effects of hyperparameter tuning on the predictive performance of dt induction algorithms, as well as the impact each hyperparameter has on the final predictive performance of the induced models. For this purpose, three dt induction algorithms were chosen as case studies: two of the most popular algorithms in Machine Learning Wu:2009 , the J48 algorithm, a WEKA Witten:2005 implementation of Quinlan's C4.5 Quinlan:1993 , and Breiman et al.'s cart algorithm Breiman:1984 ; and the ctree algorithm Hothorn:2006 , a more recent implementation that embeds statistical tests to decide whether a split should occur (similar to CHAID Kass:1980 ).

A total of six different hyperparameter tuning techniques (following different learning biases) were selected: a simple rs; three commonly used meta-heuristics, namely ga Goldberg:1989 , pso Kennedy:1995 and eda Hauschild:2011 ; smbo Snoek:2012 ; and irace Birattari:2010 (these techniques are described in the following sections). Experiments were carried out with a large number of heterogeneous datasets, and the experimental results obtained by these optimization techniques are compared with those obtained using the default hyperparameter values recommended for C4.5, CART and CTree.

In many situations, the analysis of the global effect of a single hyperparameter, or interactions between different hyperparameters, may provide valuable insights. Hence, we also assess the relative importance of dt hyperparameters, measured using a recent functional ANOVA framework Hutter:2014 .

In all, the main contributions of this study are:

  • Large-scale comparison of different hyperparameter tuning techniques for dt induction algorithms;

  • Comprehensive analysis of the effect of hyperparameters on the predictive performance of the induced models and the relationship between them;

The current study also extends a previous investigation Mantovani:2016 . This extended version reviews previous studies performing hyperparameter tuning of C4.5 (J48); includes two additional tree algorithms, cart and ctree; includes two state-of-the-art optimization techniques in the experiments (smbo and irace); presents a more detailed methodology (with all implementation choices); and improves the experimental analysis, mainly the experiments concerning the relative importance of the dt algorithms' hyperparameters. All the code generated in this study is available to reproduce our analysis and to extend it to other classifiers. All experiments are also available on OpenML Vanschoren:2014 .

The remainder of this paper is structured as follows: Section 2 covers related work on hyperparameter tuning of dt induction algorithms, and Section 3 introduces hyperparameter tuning in more detail. Section 4 describes our experimental methodology, and the setup of the tuning techniques used, after which Section 5 analyses the results. Section 6 validates the results from this study. Finally, Section 7 summarizes our findings and highlights future avenues of research.

2 Related work

A large number of ML studies investigate the effect of hyperparameter tuning on the predictive performance of classification algorithms. Most of them deal with the tuning of “black-box” algorithms, such as svm Gomes:2012 and ann Bergstra:2012 ; or ensemble algorithms, such as rf Reif:2012 ; Huang:2016 and Boosting Trees Eggensperger:2015 ; Wang:2015 . They often tune the hyperparameters by using simple techniques, such as ps Eitrich:2006 and rs Bergstra:2012 , but also more sophisticated ones, such as meta-heuristics Padierna:2017 ; Gomes:2012 ; Gascon-Moreno:2011 ; Nakamura:2014 ; Ridd:2014 , smbo Bergstra:2011 ; Bardenet:2013 , racing algorithms Lang:2013 ; Miranda:2014 and mtl Feurer:2015 . However, when considering dt induction algorithms, there are far fewer studies available.

Recent work has also used meta-heuristics to design new dt induction algorithms by combining components of existing ones Barros:2015 ; Podgorelec:2015 . The algorithms created are restricted to the existing components and, since both the algorithm structure and its hyperparameters have to be optimized, the search space and computational cost are much larger. Since this study focuses on hyperparameter tuning, this section does not cover dt induction algorithm design.

2.1 C4.5/J48 hyperparameter tuning

Table 1 summarizes studies performing hyperparameter tuning for the C4.5/J48 dt induction algorithm. For each study, the table presents which hyperparameters were investigated (following the J48 nomenclature also presented in Table 4; the original J48 nomenclature may be checked at http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html), which tuning techniques were explored, and the number and source of datasets used in the experiments. Empty fields in the table mean that the procedures used in that specific study could not be completely identified.

Reference Hyperparameter Tuning Number of
C M N O R B S A J U Technique Datasets
Schauerhuber et al. (2008) Schauerhuber:2008 GS 18 (uci)
Sureka & Indukuri (2008) Sureka:2008 ga
Stiglic et al. (2012) Stiglic:2012 vtj48 71 (uci)
Lin & Chen (2012) Lin:2012 SS 23 (uci)
Ma (2012) Ma:2012 gp 70 (uci)
Auto-WEKA Thornton:2013 ; Kotthoff:2016 smbo 21
Molina et al. (2012) Molina:2012 gs 14
Sun & Pfahringer (2013) Sun:2013 pso 466
Reif et al. (2014) Reif:2014 gs 54 (uci)
Sabharwal et al. (2016) Sabharwal:2016 DAUP 2 artificial
4 real-world
Tantithamthavorn et al. (2016) Tantithamthavorn:2016 caret 18
Delgado et al. (2014) Delgado:2014 121 (uci)
Wainberg et al. (2016) Wainberg:2016
Table 1: Some properties of the related studies that performed C4.5 (J48) hyperparameter tuning. The hyperparameter abbreviations are explained in the text alongside each reference.

Schauerhuber et al. Schauerhuber:2008 presented a benchmark of four different open-source dt induction algorithm implementations, one of them being J48. In this study, the authors assessed the algorithms' performance on classification datasets from the uci repository, tuning two hyperparameters: the pruning confidence (C) and the minimum number of instances per leaf (M).

Sureka & Indukuri Sureka:2008 used a ga (see Section 3.3) to recommend an algorithm and its best hyperparameter values for a given problem. They used a binary representation to encode a wide hyperparameter space covering Bayes-, rule-, network- and tree-based algorithms, including J48. However, the authors do not provide further information about which hyperparameters, ranges, datasets or evaluation procedures were used to assess the hyperparameter settings. Their experiments showed that the algorithm can find good solutions, but requires massive computational resources to evaluate all possible models.

Stiglic et al. Stiglic:2012 presented a study tuning vtj48, i.e., J48 with predefined visual boundaries. They developed a new adapted binary search technique to tune four J48 hyperparameters: the pruning confidence (C), the minimum number of instances per leaf (M), the use of binary splits (B) and subtree raising (S). Experimental results on uci Bache:2013 and bioinformatics datasets demonstrated a significant increase in accuracy of visually tuned dts when compared with defaults. The gains were higher on the bioinformatics datasets than on the classical ml datasets.

Lin & Chen Lin:2012 proposed a novel ss-based algorithm to acquire optimal hyperparameter settings and to select a subset of features that results in better classification performance. Experiments with uci datasets demonstrated that the hyperparameter settings for the C4.5 algorithm obtained by the new approach, when tuning the 'C' and 'M' hyperparameters, were better than those obtained by the baselines (defaults, a simple ga and a greedy combination of them). When feature selection is also considered, classification accuracy rates increase on most datasets.

Ma Ma:2012 leveraged the gp algorithm to optimize hyperparameters of some ml algorithms (including C4.5 and its hyperparameters 'C' and 'M') for uci classification and regression datasets. gps were compared with the gs and rs methods (see Section 3.1) and found solutions faster than both baselines, with comparably high performance. However, compared specifically with rs, gps seem better suited to more complex problems, while rs is sufficient for simpler ones.

Sabharwal et al. Sabharwal:2016 proposed a method to sequentially allocate small data batches to selected ml classifiers. The method, called "Data Allocation using Upper Bounds" (DAUP), tries to project an optimistic upper bound of the accuracy a classifier would obtain on the full dataset, using recent evaluations of this classifier on small data batches. Experiments evaluated the technique on classification datasets and several algorithms with different hyperparameters, including C4.5 with its 'C' and 'M' hyperparameters. The proposed method was able to select near-optimal classifiers with a very low computational cost compared to fully training all classifiers.

In Tantithamthavorn et al. Tantithamthavorn:2016 , the authors investigated the performance of prediction models when tuning hyperparameters with "caret" (https://cran.r-project.org/web/packages/caret/index.html) caret:2016 , an ml tool. A set of ml algorithms, including J48 and its 'C' hyperparameter, was tuned on proprietary and public datasets. In a comparison with the caret defaults using the AUC (area under the ROC curve) measure, tuning produced better results.

Wainberg et. al. Wainberg:2016 reproduced the benchmark experiments described in Delgado:2014 . They evaluated classifiers from different learning groups on datasets from uci. The hyperparameters of the J48 algorithm were manually tuned.

Other studies used hyperparameter tuning methods to generate mtl systems Molina:2012 ; Reif:2014 ; Thornton:2013 ; Kotthoff:2016 ; Sun:2013 . These studies search the hyperparameter spaces to describe the behavior of ml algorithms on a set of problems, and later recommend hyperparameter values for new problems. For example, Molina et al. Molina:2012 tuned two hyperparameters of the J48 algorithm ('C' and 'M') in a case study with educational datasets, using gs. They also used a set of meta-features to recommend the most promising <algorithm, hyperparameters> pairs for each problem. The proposed approach, however, did not outperform dts induced with default values.

Sun & Pfahringer Sun:2013 also used hyperparameter tuning in the context of mtl. The authors proposed a new meta-learner for algorithm recommendation and a feature generator to construct the datasets used in their experiments. They searched the hyperparameter spaces of several ml algorithms, one of them C4.5 with its 'B' hyperparameter; the pso technique (see Section 3.4) was used to generate a meta-database for a recommendation experiment. Similarly, Reif et al. Reif:2014 implemented an open-source mtl system to predict the accuracies of target classifiers, one of them the C5.0 algorithm (a version of C4.5), with its pruning confidence (C) tuned by gs.

A special case of hyperparameter tuning is the cash problem, introduced by Thornton:2013 with the Auto-WEKA framework (http://www.cs.ubc.ca/labs/beta/Projects/autoweka/) and updated recently in Kotthoff:2016 . Auto-WEKA applies smbo (see Section 3.2) to select an algorithm and its hyperparameters for new problems, based on a wide set of ml algorithms (including J48). In addition to the previously mentioned hyperparameters (C, M, B and S), Auto-WEKA also searches over the following ones: whether to collapse the tree (O), the use of Laplace smoothing (A), the use of MDL correction for the info gain criterion (J) and the generation of unpruned trees (U).

2.2 CART hyperparameter tuning

Table 2 summarizes previous studies on hyperparameter tuning for the cart algorithm. For each study, the table presents which hyperparameters, tuning techniques, and the number and source of datasets explored in the experiments.

Reference Hyperparameter Tuning Number of
cp min min max weights max max Technique Datasets
split bucket depth leaf leaf feat
Schauerhuber et al. (2008) Schauerhuber:2008 GS 18 (uci)
Sun & Pfahringer (2013) Sun:2013 pso 466
Bermudez-Chacon et al. (2015) Chacon:2015 rs 29 (uci)
sh 7 (other)
pd
Auto-skLearn Feurer:2015 ; Feurer:2015B smbo 140 (openml)
Levesque et al. (2016) Levesque:2016 smbo 18 (uci)
Tantithamthavorn et al. (2016) Tantithamthavorn:2016 caret 18 (various)
Delgado et al. (2014) Delgado:2014 121 (uci)
Wainberg et al. (2016) Wainberg:2016
Table 2: Summary of previous studies on cart hyperparameter tuning. The hyperparameter nomenclature adopted is explained in the text alongside each reference.

In Schauerhuber et al. Schauerhuber:2008 , the authors added cart/rpart to their benchmark analysis, manually tuning only the complexity parameter 'cp'. Sun & Pfahringer Sun:2013 investigated the tuning of cart hyperparameters, in particular the minsplit hyperparameter, over a large set of datasets (some of them artificially generated) using pso; this hyperparameter controls the minimum number of instances required for a split to be attempted. The hyperparameter settings assessed during the search were used to feed a meta-learning system. In Tantithamthavorn et al. Tantithamthavorn:2016 , the authors performed a similar study, but focused on the complexity parameter 'cp'.

In Bermudez-Chacon et al. Chacon:2015 , the authors presented a hierarchical model selection framework that automatically selects the best ml algorithm for a particular dataset, optimizing its hyperparameter values. Algorithms and hyperparameters are organized in a hierarchy, and an iterative process produces the recommendation. The optimization technique used for tuning is a component of the framework, with three choices available: the rs, sh and pd optimization methods. The framework encapsulates a long list of algorithms, including cart and some of its hyperparameters: 'minsplit'; the minimum number of instances in a leaf ('minbucket'); the maximum depth of any node of the final tree ('maxdepth'); the weights assigned to leaf nodes ('weights_leaf'); the maximum number of leaves ('maxleafs'); and the maximum number of dataset features used in the trees ('maxfeatures').

In Feurer et al. Feurer:2015 ; Feurer:2015B , the authors used the smbo approach to select and tune algorithms from the scikit-learn framework (http://scikit-learn.org/), hence Auto-skLearn (https://github.com/automl/auto-sklearn). The only dt induction algorithm covered there is cart. cart with some manually selected hyperparameters was also experimentally investigated in Delgado:2014 ; Wainberg:2016 .

Levesque et al. Levesque:2016 investigated the use of hyperparameter tuning and ensemble learning, tuning cart hyperparameters with smbo when the induced models were part of an ensemble. Four hyperparameters were tuned in the process: 'minsplit', 'minbucket', 'maxdepth' and 'maxleaf'. The tuning resulted in a significant improvement in generalization accuracy when compared with the Single Best Model Ensemble and Greedy Ensemble Construction techniques.

2.3 CTree hyperparameter tuning

Table 3 summarizes previous studies on hyperparameter tuning for the ctree algorithm Hothorn:2006 . For each study, the table presents which hyperparameters were investigated, which tuning techniques were explored, and the number (and source) of datasets used in the experiments. Studies in the table with no technique specified used a manual selection process.

Reference Hyperparameter Tuning Number of
min min min stump mtry max Technique Datasets
criterion split bucket depth
Schauerhuber et al. (2008) Schauerhuber:2008 18 (uci)
Delgado et al. (2014) Delgado:2014 121 (uci)
Wainberg et al. (2016) Wainberg:2016
Sarda-Espinoza et al. (2017) Sarda:2017 GS 4 (private)
Table 3: Summary of previous studies on ctree hyperparameter tuning. The hyperparameter nomenclature adopted is explained in the text alongside each reference.

Schauerhuber et al. Schauerhuber:2008 also included the ctree algorithm in their benchmark study, where only the 'mincriterion' hyperparameter was manually tuned for 18 uci datasets. This hyperparameter defines the value of the test statistic (1 - p-value) that must be exceeded for a split to occur.

A ctree implementation is also explored in the benchmark studies presented by Delgado:2014 ; Wainberg:2016 . Two hyperparameters were tuned manually: the 'mincriterion' and the maximum tree depth ('maxdepth'). Experiments were performed on a large set of heterogeneous uci datasets.

Sarda-Espinoza et. al. Sarda:2017 applied conditional trees to extract relevant knowledge from electrical motors’ data. The final models were obtained after tuning two hyperparameters via gs: ‘mincriterion’ and ‘maxdepth’. The resulting models were applied to four different private datasets.

2.4 Literature Overview

The literature review indicates that hyperparameter tuning for dt induction algorithms has not yet been deeply explored. We found eleven studies investigating some form of tuning for the J48 algorithm, six for cart and only three for the ctree algorithm. These studies neither investigated the tuning task itself nor adopted a consistent procedure to assess candidate hyperparameter settings while searching the hyperparameter space:

  • some studies used hyperparameter sweeps;

  • some other studies used simple cv resamplings;

  • a few studies used nested cv procedures, but only with an inner holdout, and they did not repeat their experiments with different seeds (given the stochastic nature of the commonly used tuning algorithms, experimenting with different random generator seeds is desirable); and

  • some studies did not even describe which experimental methodology was used.

Regarding the search space, most studies concerning C4.5/J48, cart and ctree hyperparameter tuning investigated only a small subset of the hyperparameter search spaces (as shown in Tables 1, 2 and 3). Furthermore, most of the studies performed the tuning manually, used simple hyperparameter tuning techniques, or searched the hyperparameter spaces only to generate meta-information for mtl and cash systems.

This paper overcomes these limitations by investigating several techniques for dt hyperparameter tuning, using a reproducible and consistent experimental methodology. It presents a comparative analysis for each of the investigated algorithms (C4.5, cart and ctree), and analyzes the importance and relationships between many hyperparameters of dt induction algorithms.

3 Hyperparameter tuning

Many applications of ml algorithms to classification tasks use hyperparameter default values suggested by ml tools, even though several studies have shown that their predictive performance mostly depends on using the right hyperparameter values Feurer:2015 ; Thornton:2013 ; Feurer:2015B . In early works, these values were tuned according to previous experiences or by trial and error. Depending on the training time available, finding a good set of values manually may be subjective and time-consuming. In order to overcome this problem, optimization techniques are often employed to automatically look for a suitable set of hyperparameter settings Bergstra:2011 ; Bardenet:2013 .

The hyperparameter tuning process is usually treated as a black-box optimization problem whose objective function is associated with the predictive performance of the model induced by a ml algorithm, formally defined as follows:

Let $\mathcal{H} = \mathcal{H}_1 \times \mathcal{H}_2 \times \dots \times \mathcal{H}_k$ be the hyperparameter space of an algorithm $a \in \mathcal{A}$, where $\mathcal{A}$ is the set of ml algorithms. Each $\mathcal{H}_i$ represents the set of possible values for the $i$-th hyperparameter of $a$ ($i \in \{1, \dots, k\}$) and can usually be defined by a set of constraints. Additionally, let $\mathcal{D}$ be a set of datasets, where $\mathbf{D} \in \mathcal{D}$ is a dataset from $\mathcal{D}$. The function $f: \mathcal{A} \times \mathcal{D} \times \mathcal{H} \rightarrow \mathbb{R}$ measures the predictive performance of the model induced by the algorithm $a$ on the dataset $\mathbf{D}$ given a hyperparameter configuration $\mathbf{h} = (h_1, \dots, h_k) \in \mathcal{H}$. Without loss of generality, higher values of $f(a, \mathbf{D}, \mathbf{h})$ mean higher predictive performance.

Given $a \in \mathcal{A}$, $\mathcal{H}$ and $\mathbf{D} \in \mathcal{D}$, together with the previous definitions, the goal of a hyperparameter tuning task is to find $\mathbf{h}^{\star} = (h_1^{\star}, \dots, h_k^{\star})$ such that

$\mathbf{h}^{\star} = \underset{\mathbf{h} \in \mathcal{H}}{\arg\max}\ f(a, \mathbf{D}, \mathbf{h})$   (1)
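In practice, $f$ is estimated by cross-validation. As a concrete illustration, the sketch below (in R, assuming the mlr and rpart packages that are also used later in Section 4) evaluates a single configuration $\mathbf{h}$ of cart on a dataset by 3-fold cross-validated balanced accuracy, i.e., it performs one evaluation of $f(a, \mathbf{D}, \mathbf{h})$; the dataset and the specific values of cp and minsplit are illustrative placeholders only.

    library(mlr)     # tasks, resampling and the bac measure
    library(rpart)   # cart implementation playing the role of algorithm 'a'

    # One evaluation of f(a, D, h): cross-validated balanced accuracy of the model
    # induced by cart on dataset D under hyperparameter configuration h.
    evaluate_configuration <- function(data, target, h) {
      task    <- makeClassifTask(data = data, target = target)       # dataset D
      learner <- makeLearner("classif.rpart", par.vals = h)          # algorithm a with configuration h
      cv      <- makeResampleDesc("CV", iters = 3, stratify = TRUE)  # 3-fold stratified cv
      resample(learner, task, cv, measures = bac, show.info = FALSE)$aggr
    }

    # A candidate configuration h = (cp, minsplit); the values are arbitrary examples.
    h <- list(cp = 0.005, minsplit = 10)
    evaluate_configuration(iris, "Species", h)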

The optimization of the hyperparameter values can be based on any performance measure $f$, which can even be defined by multi-objective criteria. Further aspects can make the tuning more difficult, such as:

  • hyperparameter configurations that lead to a model with high predictive performance for a given dataset may not lead to high predictive performance for other datasets;

  • hyperparameter values often depend on each other (as in the case of svm BenHur:2010 ); hence, tuning hyperparameters independently may not lead to a good set of hyperparameter values;

  • the evaluation of a specific hyperparameter configuration, not to mention many configurations, can be subjective and very time-consuming.

In recent decades, population-based optimization techniques have been successfully used for hyperparameter tuning of classification algorithms Bergstra:2011 ; Bardenet:2013 . When applied to tuning, these techniques iteratively build a population $\mathcal{P} \subset \mathcal{H}$ of hyperparameter settings, computing $f(a, \mathbf{D}, \mathbf{h})$ for each $\mathbf{h} \in \mathcal{P}$. By doing so, they can simultaneously explore different regions of the search space. There are various population-based hyperparameter tuning strategies, which differ in how they update $\mathcal{P}$ at each iteration. Some of them are briefly described next.

3.1 Random Search

rs is a simple technique that performs random trials in a search space. Its use can reduce the computational cost when a large number of possible settings is being investigated Andradottir:2015 . Usually, rs performs its search over a predefined number of iterations, and $\mathcal{P}$ is extended (updated) with a randomly generated hyperparameter setting at each ($j$-th) iteration of the hyperparameter tuning process. rs has obtained good results in hyperparameter optimization for dl algorithms Bergstra:2012 ; Bardenet:2013 .
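A minimal sketch of rs over part of the J48 space from Table 4, using the mlr package adopted in Section 4 ("classif.J48" wraps RWeka's J48); the bounds below are illustrative placeholders, not the exact ranges used in the paper.

    library(mlr)   # tuning infrastructure; "classif.J48" requires the RWeka package

    # Illustrative search space; bounds are placeholders, not the paper's exact ranges.
    par_set <- makeParamSet(
      makeNumericParam("C", lower = 0.01, upper = 0.49),  # pruning confidence
      makeIntegerParam("M", lower = 1, upper = 50),       # minimum instances per leaf
      makeLogicalParam("B")                               # binary splits only
    )

    ctrl  <- makeTuneControlRandom(maxit = 900)           # budget of 900 random trials
    inner <- makeResampleDesc("CV", iters = 3, stratify = TRUE)
    task  <- makeClassifTask(data = iris, target = "Species")   # illustrative dataset

    res <- tuneParams(makeLearner("classif.J48"), task, inner,
                      measures = bac, par.set = par_set, control = ctrl)
    res$x   # best hyperparameter setting found
    res$y   # its cross-validated balanced accuracy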

3.2 Sequential Model Based Optimization

smbo Snoek:2012 ; Brochu:2010 is a sequential method that starts with a small initial population $\mathcal{P}_0$ which, at each new iteration $j > 0$, is extended by a new hyperparameter configuration $\mathbf{h}'$, such that the expected value of $f(a, \mathbf{D}, \mathbf{h}')$ is maximal according to an induced meta-model $\hat{f}$ approximating $f$ on the current population $\mathcal{P}_{j-1}$. In the experiments reported in Bergstra:2011 ; Snoek:2012 ; Bergstra:2013B , smbo performed better than gs and rs and matched or outperformed state-of-the-art techniques in several hyperparameter optimization tasks.
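A sketch of smbo with the mlrMBO package used in Section 4, under the following assumptions: the objective is the cross-validated balanced accuracy of cart, the search space is an illustrative two-hyperparameter subset of Table 4, and the surrogate is a random forest regressor as in the paper's setup; the function names come from mlrMBO/smoof/ParamHelpers, but version-specific details (e.g., the infill-criterion constructor) may differ.

    library(mlrMBO)   # attaches mlr, ParamHelpers and smoof

    # Objective: cross-validated balanced accuracy of cart for a configuration x.
    obj <- makeSingleObjectiveFunction(
      name = "cart.tuning",
      fn = function(x) {
        lrn  <- makeLearner("classif.rpart", par.vals = x)
        task <- makeClassifTask(data = iris, target = "Species")  # illustrative dataset
        resample(lrn, task, makeResampleDesc("CV", iters = 3),
                 measures = bac, show.info = FALSE)$aggr
      },
      par.set = makeParamSet(
        makeNumericParam("cp", lower = 0.0001, upper = 0.1),      # illustrative bounds
        makeIntegerParam("minsplit", lower = 2, upper = 50)
      ),
      has.simple.signature = FALSE,   # fn receives a named list, not a plain vector
      minimize = FALSE                # higher balanced accuracy is better
    )

    ctrl <- makeMBOControl()
    ctrl <- setMBOControlTermination(ctrl, iters = 90)              # evaluations after the initial design
    ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI()) # expected improvement
    design    <- generateDesign(n = 10, par.set = getParamSet(obj)) # initial population
    surrogate <- makeLearner("regr.randomForest")                   # rf surrogate, as in the paper

    res <- mbo(obj, design = design, learner = surrogate, control = ctrl)
    res$x; res$y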

3.3 Genetic Algorithm

Bio-inspired techniques based on natural processes, such as ga, have also been widely used for hyperparameter tuning Gomes:2012 ; Friedrichs:2005 ; Kalos:2005 . In these techniques, the initial population $\mathcal{P}_0$, generated randomly or according to background knowledge, is modified at each iteration by operators based on natural selection and evolution.
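A minimal sketch of how a real-valued ga (the GA package used in Section 4) can drive this search: each individual is a real vector that is decoded into a cart configuration before evaluation, mirroring the adaptation for discrete hyperparameters described in Section 4.3; the bounds and decoded hyperparameters are illustrative.

    library(GA)    # genetic algorithm implementation used in Section 4
    library(mlr)

    task  <- makeClassifTask(data = iris, target = "Species")   # illustrative dataset
    inner <- makeResampleDesc("CV", iters = 3)

    # Decode a real-valued individual into a cart configuration (rounding the
    # integer-valued hyperparameters) and return its cross-validated bac as fitness.
    fitness <- function(ind) {
      h   <- list(cp = ind[1], minsplit = round(ind[2]), maxdepth = round(ind[3]))
      lrn <- makeLearner("classif.rpart", par.vals = h)
      as.numeric(resample(lrn, task, inner, measures = bac, show.info = FALSE)$aggr)
    }

    # Population size and number of generations follow Table 6 (10 x 90 = 900 evaluations).
    res <- ga(type = "real-valued", fitness = fitness,
              lower = c(0.0001, 2, 1), upper = c(0.1, 50, 30),
              popSize = 10, maxiter = 90)
    res@solution   # best real-valued individual (decode as above to get the configuration)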

3.4 Particle Swarm Optimization

pso is a bio-inspired technique relying on the swarming and flocking behaviors of animals Simon:2013 . In pso, each particle is associated with a position $\mathbf{h}_i$ in the search space $\mathcal{H}$, a velocity $\mathbf{v}_i$ and the best position it has found so far. During the iterations, the movement of each particle is changed according to both its own best-found position and the best position found so far by the entire swarm (recorded throughout the computation).
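A comparable sketch with the pso package used in Section 4: psoptim minimizes its objective, so the cross-validated bac is negated; swarm size and iteration count follow Table 6, and the two-dimensional cart search space is again only illustrative.

    library(pso)
    library(mlr)

    task  <- makeClassifTask(data = iris, target = "Species")   # illustrative dataset
    inner <- makeResampleDesc("CV", iters = 3)

    # psoptim minimizes, so return the negative cross-validated balanced accuracy.
    obj <- function(p) {
      h   <- list(cp = p[1], minsplit = round(p[2]))
      lrn <- makeLearner("classif.rpart", par.vals = h)
      -as.numeric(resample(lrn, task, inner, measures = bac, show.info = FALSE)$aggr)
    }

    res <- psoptim(par = rep(NA, 2), fn = obj,
                   lower = c(0.0001, 2), upper = c(0.1, 50),
                   control = list(s = 10, maxit = 90))   # 10 particles x 90 iterations = 900 evaluations
    res$par      # best position found (cp, minsplit before rounding)
    -res$value   # its balanced accuracy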

3.5 Estimation of Distribution Algorithm

eda Hauschild:2011 lies on the boundary between ga and smbo, combining the advantages of both approaches: the search is guided by iteratively updating an explicit probabilistic model of promising candidate solutions. In other words, the implicit crossover and mutation operators used in ga are replaced by an explicit probabilistic model.

3.6 Iterated F-Race

The irace Birattari:2010 technique was designed for algorithm configuration and optimization problems Lang:2013 ; Miranda:2014 and is based on 'racing'. A race starts with an initial population $\mathcal{P}_0$ and iteratively selects the most promising candidates, considering the hyperparameter distributions and comparing them by statistical tests. Configurations that are statistically worse than at least one other candidate configuration are discarded from the race. Based on the surviving candidates, the distributions are updated. This process is repeated until a stopping criterion is reached.

4 Experimental methodology

The nested cv Cawley:2010 ; Krstajic:2014 experimental methodology employed is illustrated in Figure 2. For each dataset, the data are split into 10 outer folds: the training folds are used by the tuning techniques to find good hyperparameter settings, while the test fold is used to assess the ‘optimal’ solution found. Internally, the tuning techniques split the training folds into 3 inner folds to measure the fitness value of each new hyperparameter setting. At the end of the process, a set of optimization paths, settings and their predictive performances is returned. During the experiments, all tuning techniques were run on the same data partitions, with the same seeds and data, to allow their comparison. The authors of Krstajic:2014 argued that there is no study suggesting the number of folds for the outer and inner cv loops; here, the number of outer folds used in that paper (10) was kept, while, due to time constraints and the size of the datasets used in the experiments, 3 inner folds were adopted. The next subsections detail the sub-components used in the tuning task.

Figure 2: Experimental methodology used to adjust dt hyperparameters. The tuning is conducted via nested cross-validation: 3-fold cv for computing the fitness values (inner loop) and 10-fold cv for assessing performance (outer loop). The outputs are the hyperparameter settings, the predicted performances and the optimization paths of each technique.
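A sketch of how this nested scheme can be assembled with mlr (the package used for the experiments): a tuner runs on the inner 3-fold cv of each training set and the selected setting is then assessed on the corresponding outer fold; the learner, search space and dataset below are illustrative.

    library(mlr)

    task <- makeClassifTask(data = iris, target = "Species")    # illustrative dataset

    # Inner loop: candidate settings are evaluated by 3-fold cv (the fitness value).
    par_set <- makeParamSet(
      makeNumericParam("cp", lower = 0.0001, upper = 0.1),      # illustrative bounds
      makeIntegerParam("minsplit", lower = 2, upper = 50)
    )
    inner <- makeResampleDesc("CV", iters = 3, stratify = TRUE)
    ctrl  <- makeTuneControlRandom(maxit = 900)                 # budget from Table 5
    tuned <- makeTuneWrapper(makeLearner("classif.rpart"), resampling = inner,
                             par.set = par_set, control = ctrl, measures = bac)

    # Outer loop: each of the 10 folds assesses the setting selected on its training folds.
    outer <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
    res   <- resample(tuned, task, outer, measures = bac, extract = getTuneResult)

    res$aggr      # outer-cv balanced accuracy of the tuned trees
    res$extract   # per-fold TuneResult objects: chosen settings and optimization paths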

4.1 Hyperparameter spaces

The experiments were performed considering the hyperparameter tuning of three dt induction algorithms: the 'J48' algorithm, a WEKA (http://www.cs.waikato.ac.nz/ml/weka/) Witten:2005 implementation of the C4.5 algorithm; the rpart implementation of the cart algorithm Breiman:1984 ; and the ctree algorithm Hothorn:2006 . These algorithms were selected due to their wide acceptance and use in many ml applications Maimon:2014 ; Jankowski:2014 ; Barros:2012 . The first two are among the algorithms most used in Machine Learning, especially by non-expert users Wu:2009 , and the third is a more recent implementation that uses statistical tests for splits, like the classical CHAID algorithm Kass:1980 . The corresponding hyperparameter spaces investigated are described in Table 4.

Algo Symbol hyperparameter Range Type Default Conditions
J48 C pruning confidence real 0.25 R = False
J48 M minimum number of instances in a leaf integer 2 -
J48 N number of folds for reduced integer 3 R = True
error pruning
J48 O do not collapse the tree {False,True} logical False -
J48 R use reduced error pruning {False,True} logical False -
J48 B use binary splits only {False,True} logical False -
J48 S do not perform subtree raising {False,True} logical False -
J48 A Laplace smoothing for predicted {False,True} logical False -
probabilities
J48 J do not use MDL correction for {False,True} logical False -
info gain on numeric attributes
CART cp complexity parameter real -
CART minsplit minimum number of instances in a integer -
node for a split to be attempted
CART minbucket minimum number of instances in a leaf integer -
CART maxdepth maximum depth of any node of integer -
the final tree
CART usesurrogate how to use surrogates in the splitting factor -
process
CART surrogatestyle controls the selection of the best factor -
surrogate
CTree mincriterion the value of the test statistic real 0.95 -
(1 - p-value) to be exceeded for
a split to occur
CTree minsplit minimum sum of weights in a integer 20 -
node for a split occurrence
CTree minbucket minimum sum of weights in a leaf integer 7 -
CTree mtry number of input variables randomly real 0 -
sampled as candidates at each node

for random forest like algorithms

CTree maxdepth maximum depth of any node of integer no restriction -
the final tree
CTree stump a stump (a tree with three nodes {False,True} logical False -
only) is to be computed
Table 4: Decision Tree hyperparameter spaces explored in the experiments. The J48 nomenclature is based on the RWeka package, the cart terms are based on the rpart package, and the CTree terms on the party package.

Originally, J48 has ten tunable hyperparameters (http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html): all those presented in Table 4 plus the hyperparameter 'U', which enables the induction of unpruned trees. Since pruning seeks the most interpretable model without loss of predictive performance, this hyperparameter was removed from the experiments and only pruned trees were considered. For ctree, the statistically dependent hyperparameters were left out, since their effects had been studied previously and the default choices are robust for a wide range of problems Hothorn:2006 ; thus, only the non-statistically dependent hyperparameters were selected. Regarding cart, all the tunable hyperparameters available in rpart were selected.

For each hyperparameter, Table 4 shows the allowed range of values, the default values provided by the corresponding packages, and the constraints for setting new values. The hyperparameter ranges were the same as those used in Reif et al. Reif:2011 . The range of the pruning confidence (C) hyperparameter was adapted from Reif et al. Reif:2014 , because the algorithm internally controls this parameter and does not allow values too near zero or the upper end of the range.
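For reference, the sketch below shows how these hyperparameters are passed to the three implementations listed in Table 4 (RWeka, rpart and party); the non-default values are arbitrary examples, not recommended settings.

    library(RWeka)   # J48
    library(rpart)   # cart
    library(party)   # ctree

    # J48: pruning confidence (C, default 0.25), minimum instances per leaf (M, default 2)
    # and binary splits (B) via Weka_control.
    j48_model <- J48(Species ~ ., data = iris,
                     control = Weka_control(C = 0.10, M = 5, B = TRUE))

    # cart: complexity parameter, split/leaf minima and maximum depth via rpart.control.
    cart_model <- rpart(Species ~ ., data = iris,
                        control = rpart.control(cp = 0.005, minsplit = 10,
                                                minbucket = 3, maxdepth = 15))

    # ctree: test-statistic threshold (mincriterion, default 0.95) and structural
    # limits via ctree_control (party uses the argument name 'controls').
    ctree_model <- ctree(Species ~ ., data = iris,
                         controls = ctree_control(mincriterion = 0.90, minsplit = 20L,
                                                  minbucket = 7L, maxdepth = 4L))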

4.2 Datasets

The experiments were carried out using public datasets from the openml website Vanschoren:2014 (http://www.openml.org/), a free scientific platform for the standardization of ml experiments, collaboration and sharing of empirical results. Binary and multiclass classification datasets were selected, varying in the number of attributes (D) and examples (N). All selected datasets have a minimum number of examples per class (C) to allow the use of the stratified cv methodology. All datasets, with their main characteristics, are presented in Tables 7 and 8 in Appendix B.

4.3 Hyperparameter tuning techniques

Six hyperparameter tuning techniques were investigated:

  • three different meta-heuristics: a ga Goldberg:1989 , pso Kennedy:1995 and an eda Hauschild:2011 . These techniques are often used for hyperparameter tuning of ml classification algorithms in general Gascon-Moreno:2011 ; Lin:2008 ; Yang:2013 ;

  • a simple rs technique, suggested in Bergstra:2012 as a good replacement for the gs technique in hyperparameter tuning;

  • irace Birattari:2010 : a racing technique designed for algorithm configuration problems; and

  • a smbo Snoek:2012 technique: a state-of-the-art optimization technique that employs statistical and/or machine learning models to predict distributions over labels, allowing a more direct and faster optimization.

Element Method R package
HP-tuning techniques Random Search mlr
Genetic Algorithm GA
Particle Swarm Optimization PSO
Estimation of Distribution Algorithm copulaedas
Sequential Model Based Optimization mlrMBO
Iterated F-race irace
Decision Trees J48 algorithm RWeka
CART algorithm rpart
CTree algorithm party
Inner resampling 3-fold cross-validation mlr
Outer resampling 10-fold cross-validation mlr
Optimized measure Balanced per class accuracy mlr
Evaluation measure Balanced per class accuracy, mlr
Optimization paths
Budget 900 iterations
Repetitions 30 times with different seeds -
seeds = -
Baseline Default values (DF) RWeka
rpart
party
Table 5: Setup of the hyperparameter tuning experiments.

Table 5 summarizes the choices made to instantiate the hyperparameter tuning techniques. Most of the experiments were implemented using the mlr R package (https://github.com/mlr-org/mlr) mlr:2016 (measures, resampling strategies, the main tuning processes and the rs technique). The ga, pso and eda meta-heuristics were implemented using the GA (https://github.com/luca-scr/GA) Scrucca:2013 , pso (https://cran.r-project.org/web/packages/pso/index.html) Bendtsen:2012 and copulaedas (https://github.com/yasserglez/copulaedas) Gonzalez-Fernandez:2014 R packages, respectively. The J48, cart and ctree algorithms were implemented using the RWeka (https://cran.r-project.org/web/packages/RWeka/index.html) Hornik:2009 , rpart (https://cran.r-project.org/web/packages/rpart/index.html) rpart:2014 and party (https://cran.r-project.org/web/packages/party/index.html) Hothorn:2006 packages, respectively, wrapped into the mlr package. The smbo technique was implemented using the mlrMBO (https://github.com/mlr-org/mlrMBO) mlrMBO:2017 R package, with its rf surrogate models implemented by the randomForest (https://cran.r-project.org/web/packages/randomForest/index.html) rf:2002 R package. The irace technique was implemented using the irace (http://iridia.ulb.ac.be/irace/) irace:2016 R package.

Since the experiments involve a large number of datasets with different characteristics, many of them with unbalanced classes, the same predictive performance measure used as the fitness value during the optimization, bac Brodersen:2010 , is also used for model evaluation.
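Following Brodersen:2010, bac can be computed as the mean of the per-class recalls; a package-free sketch of this definition (which is what makes the measure robust to class imbalance) is shown below.

    # Balanced per-class accuracy: mean of the per-class recalls, i.e., the diagonal
    # of the confusion matrix divided by the true class counts.
    balanced_accuracy <- function(truth, predicted) {
      cm <- table(truth, predicted)
      mean(diag(cm) / rowSums(cm))
    }

    # Example: a majority-class predictor on a 90/10 problem scores only 0.5.
    truth <- factor(c(rep("a", 90), rep("b", 10)))
    pred  <- factor(rep("a", 100), levels = levels(truth))
    balanced_accuracy(truth, pred)   # 0.5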

When tuning is performed in real scenarios, the time required is an important aspect to be considered: the tuning process may take many hours to find good settings for a single dataset Reif:2012 ; Ridd:2014 . Thus, this work investigates whether it is possible to find good settings with a reduced number of evaluations (budget). Based on previous results and analyses Mantovani:2016 , a budget size of 900 evaluations was adopted in the experiments (the choice of budget size is discussed in more detail in Section 6).

Since all techniques are stochastic, each one was executed 30 times for each dataset, using a different seed value in each repetition. This gives a total of 30 (repetitions) x 10 (outer folds) x 900 (budget) hyperparameter settings generated during the search process for a single dataset. In addition, the default hyperparameter values provided by the 'RWeka', 'rpart' and 'party' packages were used as baselines for the experimental comparisons.

Technique Parameter Value
RS stopping criteria budget size
PSO number of particles 10
maximum number of iterations 90
stopping criteria budget size
algorithm implementation SPSO2007 Clerc:2012
EDA number of individuals 10
maximum number of iterations 90
stopping criteria budget size
EDA implementation GCEDA
copula function normal
margin function truncnorm
GA number of individuals 10
maximum number of iterations 90
stopping criteria budget size
selection operator proportional selection with linear scaling
crossover operator local arithmetic crossover
crossover probability 0.8
mutation operator random mutation
mutation probability 0.05
elitism rate 0.05
SMBO points in the initial design 10
initial design method Random LHS
surrogate model Random Forest
stopping criteria budget size
infill criteria expected improvement
Irace number of instances for resampling 100
stopping criteria budget size
Table 6: Parameters of the hyperparameter tuning techniques. Except for the budget-dependent parameters, all values are the defaults provided by each R package implementation.

As this paper evaluates different tuning techniques, the authors decided to use the techniques' own default parameter values, to avoid the influence of these values on the comparison. Each tuning technique has a different set of parameters, specific to its paradigm. In the smbo, irace and pso cases, the defaults have been shown to be robust enough to save time and resources mlrMBO:2017 ; irace:2016 ; Bigiarini:2013 . For eda and ga (and evolutionary methods in general) there are no standard values for their parameters Mills:2015 . So, to keep the comparisons fair, the default parameter values provided by the corresponding R packages were used. All of these values are shown in Table 6.

All tuning techniques start from an initial population of random hyperparameter settings and share the same stopping criterion: the budget size. The ga, pso and eda techniques use a real-valued codification for the individuals/particles and were therefore adapted to handle discrete and Boolean hyperparameters. All of them were executed sequentially in the same cluster environment. Every job generated was executed on a dedicated core with no concurrency, scheduled by the cluster system.

4.4 Repositories for the code used in this study

The code for the implementations used in this study is publicly available at:

Instructions to run each project can be found directly at the corresponding websites. The experimental results are also available as an openml study (https://www.openml.org/s/50), where all datasets, classification tasks, algorithms/flows and run results for this paper can be listed and downloaded.

5 Experimental results

The next subsections present the main experimental results for the three dt implementations.

5.1 Performance analysis regarding J48 algorithm

Figure 3 presents the results obtained by the tuning techniques when applied to the J48 dt induction algorithm. Sub-figure 3(a) shows the average bac values obtained by the tuning techniques and by the defaults over all datasets. The datasets on the x-axis are placed in decreasing order of the predictive performance obtained with default hyperparameter values (the corresponding dataset names may be seen in Tables 7 and 8 in Appendix B).

(a) Average balanced per class accuracy performance.
(b) Average tree size. X-axis values are in scale.
Figure 3: Hyperparameter tuning results for the J48 algorithm.
(a) C
(b) M
(c) N
(d) R
(e) O
(f) B
(g) A
(h) S
(i) J
Figure 4: Distribution of the J48 hyperparameters found by the tuning techniques.

For each dataset, the name of the tuning technique that resulted in the best predictive performance is shown above the x-axis. The Wilcoxon paired test was applied to assess the statistical significance of the results obtained by this best technique when compared with the results using default values; the test was applied to the solutions obtained in the 30 repetitions. An upward green triangle on the x-axis identifies datasets where statistically significant improvements were detected after applying the hyperparameter tuning technique; conversely, a downward red triangle indicates that the use of defaults was statistically better than the use of tuning techniques.
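This per-dataset comparison can be reproduced with base R's paired Wilcoxon signed-rank test; the sketch below assumes two vectors with the 30 repetition-wise bac values of the best tuning technique and of the default setting for a single dataset (the vectors are synthetic placeholders, included only to make the call runnable).

    # Paired Wilcoxon test: best tuning technique vs. defaults on one dataset,
    # paired over the 30 repetitions. The data below are synthetic placeholders.
    set.seed(1)
    bac_tuned    <- runif(30, 0.80, 0.90)
    bac_defaults <- runif(30, 0.78, 0.88)

    test <- wilcox.test(bac_tuned, bac_defaults, paired = TRUE)
    test$p.value   # below the chosen significance level -> triangle marker in Figure 3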

A first look at the results shows that all tuning techniques reach similar performances, with few exceptions, since most of the curves overlap. In general, there is only a small difference in predictive performance with respect to the default values; larger improvements can be seen only in a small subset of datasets. When the Wilcoxon paired test is applied to compare the defaults with the best tuning technique, it shows that, overall, tuned trees were statistically significantly better than trees built with default values in roughly one-third of the datasets. In most of these cases, the irace, pso or smbo techniques produced the best results. Default values were significantly better in only a few cases, and the remaining datasets did not present statistically significant differences (the approaches tied).

Sub-figure 3(b) shows the average size of the final induced J48 dts, measured as the number of nodes in the tree. It is important to mention that the interpretability of a tree depends largely on its size: larger trees are usually more difficult to understand than smaller ones. Regarding J48 tree size, in most cases default values (dotted black line) induced larger trees than the hyperparameter settings suggested by the tuning techniques; this was also true whenever default values were the best option with statistical significance. For most of the multi-class tasks with many classes (the datasets towards the right of the charts), the tuned trees were also smaller than those induced using default values. Even when small in terms of performance, the improvements were statistically significant.

Looking at the peaks of improvement due to hyperparameter tuning, they were reached when the dts induced using default values were much smaller than those obtained using hyperparameter tuning; this occurred for a small group of datasets. When comparing the tuning techniques among themselves, significant differences only appear on these datasets. The soft computing techniques tend to produce smaller trees than the smbo and rs techniques.

To compare the default settings with the solutions found during the tuning process, and also to gain insight into the effectiveness of the defaults, the distributions of the J48 hyperparameter values found by the tuning techniques are presented in Figure 4 (all hyperparameters were already listed in Table 4). The numerical default values are represented by vertical dashed lines. In the J48 tuning scenario, the largest contrast can be noticed in the 'R' sub-plot: most of the obtained solutions set 'R = FALSE', which disables the reduced error pruning option and the hyperparameter 'N' (as the default setting does). The 'M' values obtained also tend to stay close to the default value (2) in most cases. The other Boolean hyperparameters seem not to influence the predictive performance reached during the optimization process, since they present very uniform distributions. Overall, the only hyperparameter that consistently contributes to solutions different from the default values is the pruning confidence ('C'), as indicated by Sub-figure 4(a).

5.2 Performance analysis regarding CART algorithm

(a) Average balanced per class accuracy performance.
(b) Average tree size. Results are presented in scale.
(a) cp
(b) minbucket
(c) maxdepth
(d) minsplit
(e) usesurrogate
(f) surrogatestyle
Figure 5: Hyperparameter tuning results for the cart algorithm.
Figure 6: cart hyperparameters’ distributions found by the tuning techniques.

Figures 5 and 6 present a graphical analysis of the cart results. Unlike J48, cart was strongly affected by hyperparameter tuning: in most of the datasets analyzed, the use of tuned values improved the predictive performance with statistical significance when compared with the use of default values. irace and smbo were the best optimization techniques with respect to the predictive performance of the induced models. Default values were better than tuned ones in only a few cases; for the remaining datasets, there was no statistically significant improvement from using optimized values.

Regarding the size of the cart dts, whenever defaults were statistically better, the trees induced by them had sizes similar to or smaller than the tuned ones. However, in most cases, tuned hyperparameter settings induced trees that were statistically better and much larger than those created using default values: although the 'default' trees were simpler, they were unable to classify most of the problems properly. The comparison among the tuning techniques showed results different from those obtained for the J48 algorithm: the techniques led to dts with similar sizes, but the dts induced when irace was used were slightly larger and had better predictive performance than those induced using the other optimization techniques.

The cart hyperparameter distributions found by the tuning techniques are shown in Figure 6. Unlike J48, the tuned cart trees were obtained with values substantially different from the defaults. This is most evident for the numerical hyperparameters, as shown in Sub-figures 6(a) to (d): the 'cp', 'minbucket' and 'minsplit' values tend to be smaller than the defaults, while for 'maxdepth' a wide range of values is tried, indicating a possible dependence on the input problem (dataset). The categorical hyperparameters' distributions, shown in Sub-figures 6(e) and (f), are very uniform, indicating that their choice may not influence the final predictive performance.

5.3 Performance analysis regarding CTree algorithm

The results obtained in the experiments with ctree are illustrated in Figures 7 and 8. Most of the tuning techniques presented similar results, with the exception of ga (the green line), which was clearly worse than all the other techniques regarding predictive performance. Unlike the two previous case studies, ctree's predictive performance was less influenced by hyperparameter tuning: default values generated the best models in several datasets, tuned values significantly improved the predictive performance of the induced trees in roughly one-third of them, and for the remaining datasets there was no statistical difference between the use of default values and the values produced by the tuning techniques.

Considering the size of the induced trees, the tuning techniques did not generate trees notably larger or smaller than those induced using default values. There are just a few exceptions, for dataset ids = {79, 57}, where the tuned trees are visibly larger but improved the predictive performance. Comparing the tuning techniques among themselves, irace and pso were the best techniques considering just the predictive performance of the models, followed by the smbo technique.

Figure 8 presents the ctree hyperparameter values’ distributions found during the tuning process. Similarly to the cart scenario, all the numerical hyperparameters presented values different from the default values: some of them produced values smaller than default values (‘minbucket’, ‘minsplit’); another was similar to the default value (‘mtry’); and all the others varied in a wide range of values (‘maxdepth’, ‘mincriterion’). The categorical hyperparameter ‘stump’, which enables the induction of a tree with just one level, is mostly set as stump = FALSE, like the default setting, having no real impact on the performance differences.

(a) Average balanced per class accuracy performance.
(b) Average tree size. Results are presented in scale.
(a) stump
(b) minbucket
(c) maxdepth
(d) minsplit
(e) mincriterion
(f) mtry
Figure 7: Hyperparameter tuning results for the ctree algorithm.
Figure 8: ctree hyperparameters’ distributions found by tuning techniques.

5.4 Statistical comparisons between techniques

The Friedman test Demvsar:2006 , with two different significance levels, was also used to compare the hyperparameter tuning techniques and evaluate the statistical significance of the experimental results. The null hypothesis states that all classifiers induced with the hyperparameter settings found by the tuning techniques, as well as the classifier induced with default values, are equivalent with respect to predictive bac performance. When the null hypothesis was rejected, the Nemenyi post-hoc test was applied, under which the performances of two techniques are significantly different if their corresponding average ranks differ by at least a critical difference (cd) value.
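A sketch of this procedure in base R: friedman.test is applied to the datasets-by-techniques matrix of bac values, and the Nemenyi critical difference is computed from the studentized range distribution as in Demšar (2006); the performance matrix below is a synthetic placeholder.

    # perf: datasets in rows, techniques in columns (bac values); synthetic placeholder.
    set.seed(1)
    techniques <- c("Irace", "PSO", "SMBO", "RS", "GA", "EDA", "defaults")
    perf <- matrix(runif(94 * 7, 0.6, 0.9), nrow = 94, ncol = 7,
                   dimnames = list(NULL, techniques))

    friedman.test(perf)   # H0: all techniques are equivalent

    # Nemenyi critical difference (Demsar, 2006): average ranks that differ by more
    # than cd are significantly different. k = number of techniques, N = datasets.
    nemenyi_cd <- function(k, N, alpha = 0.05) {
      q <- qtukey(1 - alpha, nmeans = k, df = Inf) / sqrt(2)
      q * sqrt(k * (k + 1) / (6 * N))
    }
    nemenyi_cd(k = 7, N = 94)

    # Average rank of each technique (rank 1 = highest bac on a dataset).
    colMeans(t(apply(-perf, 1, rank)))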

Figure 9 presents the cd diagrams for the three dt induction algorithms; techniques are connected when there are no statistically significant differences between them. Considering the first significance level, Sub-figure 9(a) depicts the comparison in the J48 scenario. One may note that there are no statistically significant differences between the two best-ranked techniques, irace and pso. Moreover, the models induced with default hyperparameter values did not obtain statistically better results than irace, pso, smbo and rs, while eda and ga obtained statistically inferior performances.

For the cart algorithm (Sub-figure 9(c)), the best-ranked technique over all datasets was irace, followed by rs, with no statistically significant difference between them. dts induced with default hyperparameter values obtained the worst performance, being statistically comparable only with ga and eda.

The cd diagrams for the ctree results are shown in Sub-figures 9(e) and 9(f). The default hyperparameter values were ranked first, followed by the irace, pso and smbo techniques; however, there are no statistically significant differences among them. The rs and eda techniques compose a second block: they do not present statistical differences between each other, but do with respect to the first group of techniques. Finally, the ga technique was statistically worse than all the other techniques.

It is worth mentioning that irace was the best tuning technique for all the algorithms. Although the statistical test did not show significant differences between irace and pso (J48, ctree), or between irace and rs (cart), irace is clearly the preferred technique, presenting the lowest average ranking. When the second, larger significance level was used, there were no changes in the J48 and ctree scenarios; however, regarding the cart performances, irace statistically outperformed all the other techniques, as can be seen in Sub-figure 9(d).

(a) J48 CD diagram at the first significance level. (b) J48 CD diagram at the second (larger) significance level. (c) CART CD diagram at the first significance level. (d) CART CD diagram at the second (larger) significance level. (e) ctree CD diagram at the first significance level. (f) ctree CD diagram at the second (larger) significance level.
Figure 9: Comparison of the bac values of the hyperparameter tuning techniques according to the Nemenyi test. Each diagram shows the average ranks (1 to 7) of Irace, PSO, SMBO, RS, GA, EDA and the default values; groups of techniques that are not significantly different are connected. Left charts show results at the first significance level, while right charts show comparisons at the second, larger one.

5.5 When to perform tuning?

A set of data complexity Orriols:2010 ; Garcia:2016 measures was used to characterize the datasets, and provide patterns that could explain when it is better to use tuned or default values. From the thirteen measures used, three were able to relate their values with the J48 hyperparameter tuning bac performances:

  • Fischer’s discriminant ratio (f1), f1 [0,+) - selects the attribute that best discriminates the classes: the higher the value, the higher the indicative that at least one of the dataset attributes is able to linearly separate data from different classes;

  • Collective feature efficiency (f4), f4 [0,+1] - considers the discriminative power of all the dataset’s attributes;

  • Fraction of points lying on the class boundary (n1), n1 [0,+1] - estimates the complexity of the correct hypothesis underlying the data. Higher values indicate the need for more complex boundaries to separate data.

Two of these measures (f1 and n1) try to identify the existence of at least one dataset attribute that can linearly separate the classes, while f4 attempts to provide this information taking into account all the attributes available in the dataset. Considering them, some simple rules could be observed: hyperparameter tuning is commonly recommended for multiclass problems with several classes, for datasets with a Fisher's discriminant ratio close to zero, and when the average number of instances on the class boundary is above a certain threshold. In cases with high collective feature efficiency, default hyperparameter values induce good models.
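As an illustration of the kind of measure involved, the sketch below implements the classical two-class formulation of Fisher's discriminant ratio (f1), computed per attribute and maximized over attributes; this is one common formulation and not necessarily the exact implementation used in the paper's experiments.

    # Two-class Fisher's discriminant ratio per attribute:
    #   (mu1 - mu2)^2 / (var1 + var2)
    # f1 of a dataset is the maximum over attributes; higher values indicate that
    # some attribute alone nearly separates the two classes linearly.
    fisher_f1 <- function(X, y) {
      y <- droplevels(factor(y))
      stopifnot(nlevels(y) == 2)
      cls <- levels(y)
      per_attr <- apply(X, 2, function(x) {
        m <- tapply(x, y, mean)
        v <- tapply(x, y, var)
        (m[cls[1]] - m[cls[2]])^2 / (v[cls[1]] + v[cls[2]])
      })
      max(per_attr)
    }

    # Example: versicolor vs. virginica in iris (the petal attributes dominate).
    sub <- subset(iris, Species != "setosa")
    fisher_f1(sub[, 1:4], sub$Species)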

For cart, in addition to n1, two other measures were important:

  • The maximum individual attribute efficiency (f3), f3 ∈ [0, 1] - indicates the presence of attributes whose values do not overlap between classes;

  • The non-linearity of the one-nearest neighbor classifier (n4), n4 ∈ [0, 1] - this measure creates a test set by linear interpolation, with random coefficients, between pairs of randomly selected instances of the same class, and then returns the test error of the 1-NN classifier on this set.

Two of these measures (n1, n4) evaluate class separability, while f3 measures the overlap between classes in the feature space. Defaults were suggested for a few problems: those where a large fraction of the points lay on the class boundary, there was at least one attribute with a high maximum individual efficiency, and a linear classifier performed quite well. Thus, the analysis suggests that hyperparameter tuning is recommended especially for multiclass problems and for those without a clear linear decision boundary separating the data instances (i.e., the more complex ones).

Regarding ctree, a different set of measures was considered:

  • Average intra/inter-class nearest neighbor distance (n2), n2 ∈ [0, +∞) - the ratio between the average intra-class and inter-class distances used by a k-NN algorithm to classify data examples. Low values indicate that examples from the same class lie close together in the feature space, while high values indicate that examples from the same class are dispersed;

  • Training error of a linear classifier (l2), l2 ∈ [0, 1] - the predictive performance of a linear classifier on the training data. The lower the value, the closer the problem is to being linearly separable.

The measures n2 and l2 are also related to the separability of the problem classes. Tuning is usually recommended when data from the same class are dispersed and when a linear classifier cannot classify the examples with a low training error (hard problems). For the other situations, default values are recommended.

5.6 Runtime analysis

Running time is also an important aspect to be considered in experimental analyses. Figures 10 to 12 show the average tuning, training and testing times spent by the techniques when performing the hyperparameter tuning of the dt induction algorithms.

Tuning and testing times are related to the optimization process. The former measures the time required by the techniques to find good hyperparameter settings within the time budget. The latter measures the time required to assess the hyperparameter settings recommended by the tuning techniques (illustrated by the outer loop of Figure 2). The training time measures the time required to induce dts with the suggested hyperparameters using all of the datasets' instances. The idea is to reproduce how the models would perform in a practical scenario.
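A minimal sketch of how these three quantities can be measured in R follows; the toy grid search over rpart's 'minsplit' merely stands in for a real tuning technique, and the split sizes are arbitrary.

```r
library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Tuning time: a toy grid search stands in for a real tuning technique
tuning_time <- system.time({
  grid <- c(5, 10, 20, 40)
  acc  <- sapply(grid, function(ms) {
    fit <- rpart(Species ~ ., train, control = rpart.control(minsplit = ms))
    mean(predict(fit, train, type = "class") == train$Species)
  })
  best_ms <- grid[which.max(acc)]
})["elapsed"]

# Testing time: assess the recommended setting on held-out data (outer loop)
testing_time <- system.time({
  fit   <- rpart(Species ~ ., train, control = rpart.control(minsplit = best_ms))
  preds <- predict(fit, test, type = "class")
})["elapsed"]

# Training time: induce the final model with the suggested setting on all instances
training_time <- system.time(
  rpart(Species ~ ., iris, control = rpart.control(minsplit = best_ms))
)["elapsed"]

c(tuning = tuning_time, testing = testing_time, training = training_time)
```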

The values on the y-axis of the figures are in seconds but were rescaled due to their discrepancy. Each curve with a different color represents a tuning technique. Since there is no tuning with defaults, there is no black dotted curve in the tuning sub-charts.

Figure 10: Average processing time required for the tuning, training and test phases of the J48 algorithm.

5.6.1 J48 runtime

Figure 10 presents the runtime analysis for J48. Considering the tuning time, the meta-heuristics (pso, ga, eda) are the fastest tuning techniques: as population-based techniques, they evaluate a population of candidates per iteration and tend to converge quickly to a common solution. RS and irace are in the middle. While the former simply searches the space at random, the latter statistically compares many candidates over several rounds, which may explain why they require more running time than the population-based techniques.

Figure 11: Average processing time required for the tuning, training and test phases of the cart algorithm.

Finally, the smbo technique presented the highest optimization/tuning time. The main reason is its inner sub-processes. After evaluating the initial points, the technique fits an rf regression model on the available data. Next, it queries this model to propose a new candidate hyperparameter solution using an acquisition function (or infill criterion). This function searches for the point in the hyperparameter space that yields the best infill value (the expected improvement), which is then evaluated and added to the model's data for the next iteration. Inspecting the technique's executions showed that these steps are its main bottleneck, which is reflected directly in the final runtime.
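As a rough illustration of this loop (not the mlrMBO implementation used in the experiments), the sketch below optimizes a toy one-dimensional objective with a random-forest surrogate, using the spread of the per-tree predictions as an uncertainty estimate for the expected improvement; the objective, the candidate sampling and all constants are assumptions made only for the example.

```r
library(randomForest)

set.seed(1)

# Toy objective standing in for a cross-validated performance measure
objective <- function(x) (x - 0.3)^2 + rnorm(1, sd = 0.01)

# Initial design
X <- data.frame(x = runif(10))
y <- sapply(X$x, objective)

for (iter in 1:20) {
  surrogate <- randomForest(X, y)                         # 1. fit surrogate on evaluated points
  cand      <- data.frame(x = runif(500))                 # 2. sample candidate settings
  pred      <- predict(surrogate, cand, predict.all = TRUE)
  mu        <- pred$aggregate                             # mean prediction per candidate
  sigma     <- apply(pred$individual, 1, sd) + 1e-9       # per-tree spread as uncertainty
  best      <- min(y)
  z         <- (best - mu) / sigma
  ei        <- (best - mu) * pnorm(z) + sigma * dnorm(z)  # 3. expected improvement (minimization)
  x_next    <- cand[which.max(ei), , drop = FALSE]        # 4. best infill point
  y         <- c(y, objective(x_next$x))                  # 5. evaluate and augment the data
  X         <- rbind(X, x_next)
}

X[which.min(y), ]   # best hyperparameter value found
```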

The test runtime scale is very small, so, in practice, there are no significant differences in the processing costs of the optimization techniques. Usually, tuned trees are assessed faster than those induced using default values, because the tuning techniques induce smaller trees than the default hyperparameter values do (see Figure 3). Regarding training costs, training with default settings is faster than with tuned hyperparameter values. This may be due to the Boolean hyperparameters: they enable/disable some transformations that require additional time to handle the data, and with the default hyperparameter settings all of these transformations are disabled.

5.6.2 CART runtime

Figure 11 presents the same analysis for the cart algorithm. In general, the running time results for cart provide insights similar to those obtained for J48. smbo was again the technique with the highest processing cost, i.e., it required the most time to consume the budget of evaluations (as previously discussed). The other techniques have similar cost curves, with values oscillating depending on the dataset characteristics. As for J48, irace and rs required more time than the meta-heuristics.

When assessing the hyperparameter settings by testing the induced dts, models induced with default hyperparameter values required more time to be assessed than those induced with the recommended tuned settings. This occurred every time dts induced with default values presented a predictive performance statistically better than models induced with tuned hyperparameter settings. Regarding the training time, hyperparameter-tuned dts took more time to induce. Since default hyperparameter values generally generated smaller trees, the test instances need to traverse fewer internal nodes before being labeled with one of the classes.

5.6.3 CTree runtime

Figure 12 presents the running time analysis for the ctree algorithm. Similarly to the previous algorithms, the smbo technique was the most time-consuming technique to evaluate the defined budget. The other techniques presented similar behavior, varying slightly depending on the problem under optimization. There are at least five datasets where all the techniques spent a long time optimizing the hyperparameters: they can be observed at dataset ids = {57, 64, 73, 74, 75}. All of them are multiclass classification tasks with many classes, suggesting that ctree may have difficulties with classification tasks involving many classes.

Training models with default values required less time than using the tuned hyperparameter solutions. By default, ctree does not apply any random selection of the input features during training. All the other numerical hyperparameters tend to be tuned to values smaller than the defaults, which, in theory, should produce smaller trees. However, this is not seen in practice: tree sizes are very similar (tuned vs. default), and the 'mtry' values might explain the difference in training time. Regarding the test phase, the runtime scale is very small, so there are no real differences when evaluating the settings found by tuning. An illustrative ctree configuration is shown below.
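For illustration, the hyperparameters mentioned above are set through party's ctree_control; the values below are the package defaults, shown only as an example of where a tuning technique would intervene.

```r
library(party)

# Illustrative ctree configuration using the package defaults.
# mtry = 0 disables random selection of input variables at each split.
ctrl <- ctree_control(mincriterion = 0.95,  # 1 - p-value required to perform a split
                      minsplit     = 20,    # minimum instances in a node to attempt a split
                      minbucket    = 7,     # minimum instances in a terminal node
                      mtry         = 0)     # 0 = consider all input variables

fit <- ctree(Species ~ ., data = iris, controls = ctrl)
plot(fit)
```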

Figure 12: Average processing time required for the tuning, training and test phases of the ctree algorithm.

5.7 Convergence of the tuning techniques

Regarding the convergence of the tuning techniques, the boxplots in Figure 13 show the minimum, maximum and three quartiles of the number of evaluations performed until the best solution was reached. The y-axis shows the number of evaluations, while the x-axis indicates the tuning techniques. Even with the full budget available, all tuning techniques converged well before exhausting it in the three case studies. Except for irace, which required the largest number of candidates to converge, most of the good hyperparameter settings were reached within the first iterations for cart and J48 (as already observed in Mantovani:2016 ).

The exception here is the ctree algorithm, which required more iterations than J48 and cart. Looking back at the tuning results, default values provided the best solution in almost 40% of the datasets, and the difficulty of finding good hyperparameter settings that outperform them is reflected in Figure 13(c).

(a) Evaluations required to tune J48 trees.
(b) Evaluations required to tune cart trees.
(c) Evaluations required to tune ctree.
Figure 13: Number of evaluations used by the tuning techniques to reach their best hyperparameter solutions.

The boxplots in Figure 13 also suggest that irace requires more evaluations than the rs technique. Looking at the details, irace is based on three steps: (1) sampling new hyperparameter configurations according to a particular distribution (the distributions are independent for each hyperparameter); (2) selecting the best set of configurations by means of racing; and (3) updating the sampling distributions towards the optimal region Lopez:2016 .

The race procedure starts with a finite set of candidates and, at each step, discards hyperparameter settings that perform statistically worse than at least one other; the process then continues with the survivors. In the first iteration, this initial set of candidates is generated from the hyperparameter distributions. The authors of Lopez:2016 emphasize that the first elimination is fundamental, so a minimum number of instances must be seen before the statistical tests are performed, and new statistical comparisons are carried out after each batch of additional instances is assessed. The default values for these parameters (detailed in Table 6) were defined after being tuned and studied on different optimization scenarios Perez:2014 . A simplified sketch of this elimination scheme is shown below.
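The sketch below illustrates the racing elimination in a simplified form (it is not irace's implementation: a paired t-test replaces the Friedman-based comparison, and no new candidates are sampled between races); the candidate list, evaluation function and instance set are toy assumptions.

```r
# Simplified racing loop: evaluate the surviving candidates on one new instance
# at a time and discard those statistically worse than the current best.
race <- function(candidates, evaluate, instances, t_first = 5, alpha = 0.05) {
  results <- matrix(NA, nrow = 0, ncol = length(candidates))
  for (i in seq_along(instances)) {
    row     <- sapply(candidates, function(cfg) evaluate(cfg, instances[[i]]))
    results <- rbind(results, row)
    if (i >= t_first && length(candidates) > 1) {        # only test after t_first instances
      best <- which.min(colMeans(results))
      keep <- sapply(seq_along(candidates), function(j) {
        if (j == best) return(TRUE)
        p <- t.test(results[, j], results[, best], paired = TRUE)$p.value
        !(p < alpha && mean(results[, j]) > mean(results[, best]))
      })
      candidates <- candidates[keep]
      results    <- results[, keep, drop = FALSE]
    }
  }
  candidates[[which.min(colMeans(results))]]             # surviving best configuration
}

# Example usage with toy configurations and a noisy instance-wise cost
set.seed(1)
cands   <- list(a = 0.2, b = 0.5, c = 0.9)
eval_fn <- function(cfg, inst) (cfg - 0.3)^2 + rnorm(1, sd = 0.05)
race(cands, eval_fn, instances = as.list(1:20))
```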

Internally, the technique estimates its racing parameters based on the budget and the target hyperparameter space. The number of races depends on the number of hyperparameters, while each race has its own budget, limited by the iteration index and the number of evaluations still available (for further details, please consult irace's manual irace:2011 ). Thus, irace works in such a way that the number of candidate settings decreases with the number of iterations, which means more evaluations per configuration are performed in later iterations.

Therefore, this difference in the number of evaluations is better explained by irace's default parameter values, which increase the minimum number of evaluations required by the technique. The inner racing parameters also have an influence, since they control the number of races, requiring more statistical tests (and evaluations) in later iterations. However, even though it evaluates more hyperparameter candidates than the rs technique, irace does not require additional time (as can be seen in Figures 10 to 12, except for some datasets with the J48 algorithm). Moreover, it might be covering different regions of the hyperparameter space, which is suggested by the results illustrated in Figure 9.

Considering just the number of hyperparameter settings assessed during the search: although the runtime analysis showed that smbo is the most costly technique, it was able to find good solutions while assessing fewer candidates than irace (the technique that resulted in dts with the best predictive performance). This occurred for all the algorithms, suggesting that with a different stopping criterion (early convergence) smbo could also be a reliable choice.

The pso technique was able to find good hyperparameter solutions in the J48 and cart scenarios within relatively few iterations. Based on the statistical results in Figure 9, pso was often among the best techniques in all three scenarios, and in some cases, depending on the statistical test, it was not statistically different from the best technique (irace). Thus, it may be a good alternative for quickly obtaining good solutions.
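As an illustration of this usage, the pso package employed in the experiments exposes an optim-like interface; the call below tunes two cart hyperparameters on a toy objective. The cross-validated error here is only a stand-in for the bac-based fitness used in the paper, and all bounds and control values are illustrative assumptions.

```r
library(pso)
library(rpart)

set.seed(1)

# Toy objective: 5-fold CV error of an rpart tree as a function of (minsplit, cp)
cv_error <- function(par) {
  minsplit <- round(par[1]); cp <- par[2]
  folds <- sample(rep(1:5, length.out = nrow(iris)))
  errs <- sapply(1:5, function(k) {
    fit <- rpart(Species ~ ., iris[folds != k, ],
                 control = rpart.control(minsplit = minsplit, cp = cp))
    mean(predict(fit, iris[folds == k, ], type = "class") != iris$Species[folds == k])
  })
  mean(errs)
}

res <- psoptim(par = c(20, 0.01), fn = cv_error,
               lower = c(2, 0.0001), upper = c(50, 0.1),
               control = list(maxit = 30, s = 10))  # small swarm, few iterations
res$par   # hyperparameter values found
```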

5.8 Hyperparameters’ importance analysis

Statistical analysis was also used to understand how different hyperparameters affect each other and the performance of the dt induction algorithms. An approach to evaluate how the hyperparameters affect the performance of the induced models under different tuning techniques is fANOVA, the functional ANOVA framework (https://github.com/automl/fanova) introduced in Hutter:2014 . In that paper, the authors present a linear-time algorithm for computing marginal predictions and use it to quantify the importance of single hyperparameters and of interactions between them. The key idea is to fit random forests of regression trees that predict the performance of hyperparameter settings and to apply a variance decomposition directly to the trees in these forests.
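The sketch below gives a rough, simplified illustration of this idea (it is not the fANOVA implementation itself): a random forest is fitted on (configuration, performance) pairs, and the importance of each hyperparameter is approximated by the variance of its marginal prediction, averaging out the other hyperparameters, relative to the total prediction variance. The synthetic data and all names are assumptions for the example.

```r
library(randomForest)

# Synthetic stand-in for tuning data: performance depends mainly on h1
set.seed(1)
configs <- data.frame(h1 = runif(300), h2 = runif(300), h3 = runif(300))
perf    <- 1 - (configs$h1 - 0.5)^2 + rnorm(300, sd = 0.02)

forest <- randomForest(configs, perf)

# Crude fANOVA-style importance: variance of the marginal prediction of one
# hyperparameter (averaging over the others) over the total prediction variance.
marginal_importance <- function(forest, data, hp, grid_size = 20) {
  grid <- seq(min(data[[hp]]), max(data[[hp]]), length.out = grid_size)
  marginal <- sapply(grid, function(v) {
    d <- data; d[[hp]] <- v
    mean(predict(forest, d))
  })
  var(marginal) / var(predict(forest, data))
}

sapply(names(configs), function(h) marginal_importance(forest, configs, h))
```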

In the source article, the authors ran fANOVA on smbo hyperparameter settings over several scenarios, but never with as many settings as are generated here: a single execution of irace already produces many evaluations, so running the analysis for all techniques would have a high computational cost. Since irace was the best technique overall, it was used to provide the hyperparameter settings for this analysis. In the experiments, the hyperparameter settings from the repetitions were used and more memory was allocated to the fANOVA code.

Figure LABEL:fig:fanova_params shows the results for the dt induction algorithms. In the figure, the x-axis shows all datasets, while the y-axis presents the hyperparameter importances according to fANOVA. The larger the importance of a hyperparameter (or pair of hyperparameters), scaled between zero and one, the darker its corresponding square, i.e., the more important the hyperparameter is for inducing trees on that dataset.

In the figure, any single hyperparameter (or combination of hyperparameters) whose contribution to the performance of the final models fell below a minimum threshold was removed. Applying this filter substantially reduced the number of hyperparameters in focus, but even so, most of the rows in the heatmap are almost white (light red). This analysis shows that most of the combinations contribute little to the performance of the induced dts.

In Sub-figure LABEL:fig:fanova_params(a), fANOVA indicates that most of the J48 performance was influenced by the M hyperparameter: when not alone, in combination with another hyperparameter (R, N, C). For cart, the 'minbucket' and 'minsplit' hyperparameters are the main ones responsible for the performance of the induced dts, as can be seen in Sub-figure LABEL:fig:fanova_params(b).

For ctree, seven of the fANOVA jobs produced errors when executed; in these situations, a white column is shown in the heatmap. Regarding the analysis, the hyperparameters 'minbucket' and 'minsplit' are the most important, similarly to cart's chart. On the other hand, their marginal contributions are weaker, which reinforces previous findings describing ctree as less sensitive to tuning.

These findings reinforce what was discussed in the previous subsection: although each analysis may point out a different most important hyperparameter, the same small subset of hyperparameters seems to influence the final performance of the induced dts.

6 Threats to Validity

In an empirical study design, methodological choices may impact the results obtained in the experiments. Next, the threats that may impact the results from this study are discussed.

6.1 Construct validity

The datasets used in the experiments were selected to cover a wide range of classification tasks with different characteristics. They were used in their original versions, i.e., no preprocessing was required, since dts are able to handle missing information and attributes of different types. The only restriction adopted was that all classes in a dataset must have a minimum number of observations, so that stratification over the outer folds can be applied. Of course, other datasets could be added to the collection, provided they obey this stratification criterion. However, the authors believe that adding datasets would not substantially change the overall behavior of tuning on the algorithms investigated.

Regarding the dt induction algorithms, cart and J48 are among the most popular algorithms used in data mining Wu:2009 . The ctree algorithm works similarly to the traditional CHAID algorithm, using statistical tests, but provides a more recent implementation that handles different types of data attributes (CHAID handles only categorical attributes). The experiments focused on these algorithms due to the interpretability of their induced models and their widespread use. All of them generate simple models, are robust in specific domains, and allow non-expert users to understand how the classification decision is made. The same experimental methodology and analyses can be applied to any other ml algorithm.

Since a wide variety of datasets compose the data collection, some of them may be imbalanced. Thus, the bac performance measure Brodersen:2010 was used as the fitness function during the optimization process, so that class distributions are taken into account when assessing a candidate solution. The same performance measure is used to evaluate the final solutions returned by the tuning techniques. Other predictive performance measures could generate different results, depending on how they deal with data imbalance. As a reference, a minimal implementation of bac is sketched below.
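For reference, bac is the mean of the per-class recalls, so each class contributes equally regardless of its frequency; a minimal sketch:

```r
# Balanced accuracy: average of the per-class recalls
balanced_accuracy <- function(truth, predicted) {
  classes <- levels(factor(truth))
  recalls <- sapply(classes, function(cl)
    sum(predicted == cl & truth == cl) / sum(truth == cl))
  mean(recalls)
}

# Example on an imbalanced toy vector: plain accuracy is 0.95,
# but errors on the minority class pull bac down to 0.75.
truth <- factor(c(rep("a", 90), rep("b", 10)))
pred  <- factor(c(rep("a", 95), rep("b", 5)), levels = levels(truth))
balanced_accuracy(truth, pred)
```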

The experimental methodology described in Section 4 considers the tuning techniques that have been used in related literature Feurer:2015 ; Sureka:2008 ; Sun:2013 ; Kotthoff:2016 . The exceptions are the eda and irace techniques, which have been explored recently for hyperparameter tuning of other ml algorithms, like svm Padierna:2017 ; Miranda:2014 . Since there is a lack of studies investigating these techniques for dt (see Section 2.4), they were added to the experimental setup.

6.2 Internal validity

Krstajic et al. Krstajic:2014 compared different resampling strategies for selecting and assessing the predictive performance of regression/classification models induced by ml algorithms. Cawley & Talbot Cawley:2010 also discuss overfitting in the evaluation methodologies used to assess ml algorithms. They describe a so-called "unbiased performance evaluation methodology", which correctly accounts for any overfitting that may occur during model selection: the internal protocol performs model selection independently within each fold of the resampling procedure. In fact, most current studies on hyperparameter tuning have adopted nested cv, including important autoML tools such as Auto-WEKA (http://www.cs.ubc.ca/labs/beta/Projects/autoweka/) Thornton:2013 ; Kotthoff:2016 and Auto-sklearn (https://github.com/automl/auto-sklearn) Feurer:2015 ; Feurer:2015B . Since this paper aims to assess dt induction algorithms optimized by hyperparameter tuning techniques, the nested cv methodology is the best choice and was adopted in the experiments. A compact sketch of the nested resampling structure is given below.
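The sketch below illustrates the nested structure (outer folds for unbiased assessment, inner folds for selecting a hyperparameter value) using rpart and a single hyperparameter; the fold counts, the grid, and the use of plain (non-stratified) folds are simplifications for the example.

```r
library(rpart)

set.seed(42)

nested_cv <- function(data, outer_k = 3, inner_k = 3, grid = c(5, 10, 20, 40)) {
  outer_folds <- sample(rep(1:outer_k, length.out = nrow(data)))
  sapply(1:outer_k, function(o) {
    train <- data[outer_folds != o, ]
    test  <- data[outer_folds == o, ]

    # Inner loop: choose 'minsplit' by inner CV on the outer-training data only
    inner_folds <- sample(rep(1:inner_k, length.out = nrow(train)))
    inner_acc <- sapply(grid, function(ms) {
      mean(sapply(1:inner_k, function(i) {
        fit <- rpart(Species ~ ., train[inner_folds != i, ],
                     control = rpart.control(minsplit = ms))
        mean(predict(fit, train[inner_folds == i, ], type = "class") ==
             train$Species[inner_folds == i])
      }))
    })
    best_ms <- grid[which.max(inner_acc)]

    # Outer loop: unbiased assessment of the selected setting
    fit <- rpart(Species ~ ., train, control = rpart.control(minsplit = best_ms))
    mean(predict(fit, test, type = "class") == test$Species)
  })
}

nested_cv(iris)   # one accuracy estimate per outer fold
```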

In the experiments carried out for this study, all the default settings provided by the implementations of the tuning techniques were used. Most of these default values have been evaluated in benchmark studies and reported to provide good predictive performance mlrMBO:2017 ; Perez:2014 , while others (like pso's) were shown to be robust across a large number of datasets. For eda and ga there is no standard choice of parameter values Mills:2015 , and even after being adapted to handle our mixed hyperparameter spaces properly, they performed poorly. This suggests that a fine-tuning of their parameters would be needed. Since this would considerably increase the cost of the experiments by adding a new tuning level (the tuning of the tuning techniques), and most of the techniques performed well with default values, this additional tuning was not assessed in this study.

The use of a larger budget for dt tuning was investigated in Mantovani:2016 . The experimental results suggested that all the considered techniques converged long before the budget was consumed; convergence here means that the tuning techniques could not further improve their predictive performance by more than a negligible amount before the budget was exhausted. In most cases, the tuning reached its maximum performance after relatively few steps. Thus, the budget size adopted here was deemed sufficient. Results obtained with this budget showed that the exploration of the hyperparameter spaces led to statistically significant improvements in most cases.

6.3 External validity

Section 5.4 presented statistical comparisons between the tuning techniques. In Demvsar:2006 , Demšar discusses the issue of statistically comparing several techniques on multiple datasets, reviewing several statistical methodologies. The method proposed as most suitable is the non-parametric analog of ANOVA, i.e., the Friedman test, along with the corresponding Nemenyi post-hoc test. The Friedman test ranks all the methods separately for each dataset and uses the average ranks to test whether all techniques are equivalent. In case of differences, the Nemenyi test performs all pairwise comparisons between the techniques and identifies the presence of significant differences. Thus, the Friedman ranking test followed by the Nemenyi post-hoc test was used to evaluate the experimental results of this study. A minimal sketch of this procedure is shown below.
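A minimal sketch of this procedure in base R: the Friedman test is available as friedman.test(), and the Nemenyi critical difference can be computed from the average ranks following Demšar's formula CD = q_alpha * sqrt(k(k+1)/(6N)). The performance matrix below is random toy data, and the q_alpha value is assumed to be the tabulated one for seven techniques at the 5% level (it should be looked up in Demšar's table for other settings).

```r
# perf: a datasets x techniques matrix of bac values (toy random data here)
set.seed(1)
perf <- matrix(runif(94 * 7), nrow = 94, ncol = 7,
               dimnames = list(NULL, c("defaults", "RS", "GA", "PSO",
                                       "EDA", "SMBO", "Irace")))

# Friedman test: blocks (datasets) in rows, groups (techniques) in columns
friedman.test(perf)

# Average ranks (rank 1 = best, i.e. highest bac) and the Nemenyi critical difference
avg_ranks  <- colMeans(t(apply(-perf, 1, rank)))
nemenyi_cd <- function(k, N, q_alpha) q_alpha * sqrt(k * (k + 1) / (6 * N))
cd <- nemenyi_cd(k = ncol(perf), N = nrow(perf), q_alpha = 2.949)  # q for k = 7, alpha = 0.05

sort(avg_ranks)  # techniques whose average ranks differ by more than `cd` differ significantly
cd
```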

Some recent studies have raised the concern that the Friedman-Nemenyi test produces overlapping groups Tantithamthavorn:2017 , recommending instead the Scott-Knott Effect Size Difference (ESD) test, which produces non-overlapping groups. Using the Scott-Knott ESD test, under its assumptions, the analysis of the experimental results did not change. Its main effect was to generate cleaner groups, whereas with the Friedman test a CD diagram is required to interpret the results. In general, there is no silver bullet, and each test has its pros and cons.

The budget size adopted can directly influence the performance of the meta-heuristics, especially ga and eda. In Hauschild:2011 , the authors recommend using a sufficiently large population to build a reliable eda model, a suggestion followed in Mantovani:2016 . In this extended version, the budget size was reduced, supported by prior analyses, and the tuning techniques were adapted to work with the reduced number of evaluations. Increasing the population size would also increase both the number of iterations and the budget size. However, it has already been experimentally shown that a small number of evaluations provides good predictive performance values Mantovani:2016 . It is important to highlight that, even with a small population, the pso technique reached robust results in a wide variety of tasks for the three dt algorithms investigated. At this point, the poor performance of ga and eda can be considered a limitation: they do not search the space properly under this budget restriction.

7 Conclusions

This paper investigated the effects of hyperparameter tuning on the predictive performance of dt induction algorithms, as well as the impact the hyperparameters have on the performance of the induced models. For this purpose, three different dt implementations were chosen as case studies: two of the most popular algorithms in ml, J48 and cart, and the ctree algorithm, a more recent implementation similar to the classical CHAID algorithm. An experimental analysis of the sensitivity of their hyperparameters was also presented. Experiments were carried out with public openml datasets and six different tuning techniques. The performance of the dts induced using these techniques was also compared with that of dts generated with the default hyperparameter values (provided by the corresponding R packages). The main findings are summarized below.

7.1 Tuning of J48

In general, hyperparameter tuning for J48 produced modest improvements when compared to the RWeka default values: the trees induced with tuned hyperparameter settings reached performances similar to those obtained by defaults. Statistically significant improvements were detected in only one-third of the datasets, often those datasets where the default values produced very shallow trees.

The J48 Boolean hyperparameters are responsible for enabling/disabling some data transformation processes. In the default settings, all of these hyperparameters are disabled, so enabling them requires more time to induce and assess trees (as noted in the runtime analysis and charts in Section 5.6). Furthermore, the relative hyperparameter importance results (via the fANOVA analysis) showed that these Boolean hyperparameters are irrelevant for most datasets: only a subset of hyperparameters (R, C, N, M) contributes actively to the performance of the final dts.

Most of the related studies that performed some tuning of J48 tried different values for the pruning confidence (C), but none of them tried hyperparameter tuning using reduced error pruning, i.e., enabling 'R' and changing the 'N' values. The use of the 'R' and 'N' options may be a solution when tuning only 'C' does not sufficiently improve performance (as indicated by the fANOVA analysis); an illustrative RWeka call is shown below.
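For illustration, these options are exposed through RWeka's Weka_control; the calls below are examples only (the specific values are not the tuned settings, and RWeka requires a working Java installation).

```r
library(RWeka)

# Default-style configuration: pruning confidence (C) and minimum number of
# instances per leaf (M); values are illustrative.
fit_default <- J48(Species ~ ., data = iris,
                   control = Weka_control(C = 0.25, M = 2))

# Reduced-error pruning instead: enable R and set the number of folds N
# (the pruning confidence C is not used when R is enabled).
fit_rep <- J48(Species ~ ., data = iris,
               control = Weka_control(R = TRUE, N = 5, M = 2))

summary(fit_rep)
```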

None of the related works used the irace technique: they focused on smbo, pso or other tuning techniques. smbo is often used with an early stopping criterion (a budget), since it is the slowest technique; however, it typically converged after relatively few iterations. If good solutions must be obtained quickly, pso can be recommended. However, for the J48 algorithm, the best technique in terms of performance is irace: it was ranked best, evaluated more candidates, and did not consume much runtime.

The J48 default hyperparameter values were good for a significant number of datasets. This behavior may be explained by the fact that the defaults used by RWeka were chosen as the best overall values on the uci ml repository Bache:2013 datasets.

7.2 Tuning of CART

Surprisingly, cart was much more sensitive to hyperparameter tuning than J48. Statistically significant improvements were reached in two thirds of the datasets, most of them with a high performance gain. Most of the hyperparameters control the number of instances in nodes/leaves required for splitting, and these hyperparameters directly affect the size and depth of the trees. The experimental analyses showed that the default settings induced shallow, small trees for most of the problems, and these trees did not obtain good predictive performance. Where the defaults did grow large trees, the performance was similar to the optimized one. In general, cart's default hyperparameter values induced trees that are, on average, smaller than those produced by J48 under default settings. Another reason that may explain cart's poor default performance is that the J48 defaults were pre-tuned on uci datasets while the cart ones were not.

Our relative importance analysis indicated that hyperparameters such as 'minsplit' and 'minbucket' are the most responsible for the performance of the final trees (an illustrative rpart call is shown below). In the related literature, only two of the five works investigated the tuning of both, and even then they used rs and smbo as tuning techniques. The experiments showed that, for cart hyperparameter tuning, the irace technique significantly outperformed all the others (especially at the larger significance level). It evaluated a larger number of candidates during the search, and its running time was comparable to that of the meta-heuristics. Thus, irace is a good choice and might be further explored in future research.
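For illustration, these hyperparameters are set through rpart.control; the call below simply makes rpart's default values explicit to show where a tuning technique would intervene.

```r
library(rpart)

# rpart's default control values, written out explicitly; a tuning technique
# would search over these (in this study, 'minsplit' and 'minbucket' mattered most).
ctrl <- rpart.control(minsplit  = 20,    # min. observations in a node to attempt a split
                      minbucket = 7,     # min. observations in any terminal node
                      cp        = 0.01,  # complexity parameter used for pruning
                      maxdepth  = 30)    # maximum tree depth

fit <- rpart(Species ~ ., data = iris, control = ctrl)
printcp(fit)
```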

7.3 Tuning of CTree

The tuning of ctree is a new contribution of this study: none of the related works evaluated more than two of its hyperparameters. The algorithm proved to be the least sensitive to the hyperparameter tuning process, setting up a third case distinct from the previous two. Statistically significant improvements were observed in just a quarter of the datasets, and the default values were statistically better in a portion of the situations.

Similarly to cart, most of its hyperparameters control the number of data examples in a node required for splitting (but via a statistical approach); consequently, they control the size and depth of the induced trees. During the optimization, the tuning techniques found a wide range of hyperparameter values that differ from the default settings (usually smaller values). However, the tree sizes did not show any visible difference, with the irace, pso and smbo curves almost overlapping for all the datasets. This suggests that, differently from J48 and cart, a characteristic other than tree size influences the final predictive performance.

The hyperparameter importance analysis also indicated that only a few of the hyperparameters studied are responsible for the predictive performance of the final trees. The experiments also showed that irace would be the best hyperparameter tuning technique, being ranked better than the other tuning techniques and presenting a running time comparable to that of the meta-heuristics.

7.4 General scenario

In this analysis, we hypothesized that dataset complexity could explain when to use each tuning approach: the more complex (difficult to classify) a dataset is, the more a dt algorithm should benefit from hyperparameter tuning. Thus, to understand when to use each approach, and to be able to recommend when to tune the hyperparameters or use the default values, each dataset was described by a set of complexity measures, which indicate how difficult the dataset is for a classification task.

We observed that hyperparameter tuning provides the best results for datasets with many classes and with non-linear decision boundaries. On the other hand, defaults seem to be adequate for simple classification problems, where there is higher separability between the classes.

Considering the algorithms investigated in this study, each one presented a different behavior under tuning. In general, the default hyperparameter values are suitable for a large range of datasets, but no fixed setting is suitable for all data classification tasks. This justifies and motivates the development of recommender systems able to suggest the most appropriate hyperparameter setting for a new problem.

7.5 Future Work

Our findings also point to some future research directions. The data complexity characteristics provided useful insight into the situations in which tuning or defaults should be used; however, more accurate recommendations could be obtained by exploring further concepts from the meta-learning field.

It would obviously also be interesting to explore other ML algorithms and their hyperparameters: not only dt induction algorithms, but many classifiers from different learning paradigms. The code developed in this study, which is publicly available, is easily extendable and may be adapted to cover a wider range of algorithms. The same can be said for the analysis.

All collected hyperparameter information might be leveraged in a recommendation framework to suggest hyperparameter settings. When integrated with openml, this framework could have great scientific (and societal) impact. The authors have already begun work in this direction.

Acknowledgments

The authors would like to thank CAPES and CNPq (Brazilian agencies) for their financial support, and especially acknowledge grants #2012/23114-9, #2013/07375-0 and #2015/03986-0 from the São Paulo Research Foundation (FAPESP).

EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle Control Technologies – The Project is supported by the Hungarian Government and co-financed by the European Social Fund.

References

  • (1) Lior Rokach and Oded Maimon. Data Mining With Decision Trees: Theory and Applications. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2nd edition, 2014.
  • (2) Simon Haykin. Neural Networks: A Comprehensive Foundation (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2007.
  • (3) Shigeo Abe. Support Vector Machines for Pattern Classification. Springer London, Secaucus, NJ, USA, 2005.
  • (4) Xindong Wu and Vipin Kumar. The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC, 1st edition, 2009.
  • (5) Dariusz Jankowski and Konrad Jackowski. Evolutionary algorithm for decision tree induction. In Khalid Saeed and Václav Snášel, editors, Computer Information Systems and Industrial Management, volume 8838 of Lecture Notes in Computer Science, pages 23–32. Springer Berlin Heidelberg, 2014.
  • (6) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining, (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1 edition, 2005.
  • (7) L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman & Hall (Wadsworth, Inc.), 1984.
  • (8) J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
  • (9) Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Second International Conference on Knowledge Discovery and Data Mining, pages 202–207, 1996.
  • (10) Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 95(1-2):161–205, 2005.
  • (11) Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.
  • (12) James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2546–2554. Curran Associates, Inc., 2011.
  • (13) Martin Pilát and Roman Neruda. Multi-objectivization and Surrogate Modelling for Neural Network Hyper-parameters Tuning, pages 61–66. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
  • (14) Carlo M. Massimo, Nicolò Navarin, and Alessandro Sperduti. Hyper-Parameter Tuning for Graph Kernels via Multiple Kernel Learning, pages 214–223. Springer International Publishing, Cham, 2016.
  • (15) Luis Carlos Padierna, Martín Carpio, Alfonso Rojas, Héctor Puga, Rosario Baltazar, and Héctor Fraire. Hyper-Parameter Tuning for Support Vector Machines by Estimation of Distribution Algorithms, pages 787–800. Springer International Publishing, Cham, 2017.
  • (16) R.C. Barros, M.P. Basgalupp, A.C.P.L.F. de Carvalho, and A.A. Freitas. A survey of evolutionary algorithms for decision-tree induction. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 42(3):291–312, May 2012.
  • (17) Rodrigo C. Barros, André C. P. L. F. de Carvalho, and Alex Alves Freitas. Automatic Design of Decision-Tree Induction Algorithms. Springer Briefs in Computer Science. Springer, 2015.
  • (18) Matthias Reif, Faisal Shafait, and Andreas Dengel. Prediction of classifier training time including parameter optimization. In Joscha Bach and Stefan Edelkamp, editors, KI 2011: Advances in Artificial Intelligence, volume 7006 of Lecture Notes in Computer Science, pages 260–271. Springer Berlin Heidelberg, 2011.
  • (19) M. M. Molina, J. M. Luna, C. Romero, and S. Ventura. Meta-learning approach for automatic parameter tuning: A case study with educational datasets. In Proceedings of the 5th International Conference on Educational Data Mining, EDM 2012, pages 180–183, 2012.
  • (20) Matthias Reif, Faisal Shafait, Markus Goldstein, Thomas Breuel, and Andreas Dengel. Automatic classifier selection for non-experts. Pattern Analysis and Applications, 17(1):83–96, 2014.
  • (21) Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
  • (22) Gordon V. Kass. An exploratory technique for investigating large quantities of categorical data applied statistics. Applied Statistics, 30(2):119–127, 1980.
  • (23) D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, 1989.
  • (24) James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 1942 – 1948, Perth, Australia, 1995.
  • (25) Mark Hauschild and Martin Pelikan. An introduction and survey of estimation of distribution algorithms. Swarm and Evolutionary Computation, 1(3):111–128, 2011.
  • (26) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
  • (27) Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle. F-Race and Iterated F-Race: An Overview, pages 311–336. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
  • (28) Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 754–762, 2014.
  • (29) Rafael Gomes Mantovani, Tomás Horváth, Ricardo Cerri, Joaquin Vanschoren, and André C. P. L. F. de Carvalho. Hyper-parameter tuning of a decision tree induction algorithm. In 5th Brazilian Conference on Intelligent Systems, BRACIS 2016, Recife, Brazil, October 9-12, 2016, pages 37–42. IEEE Computer Society, 2016.
  • (30) Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: Networked science in machine learning. SIGKDD Explor. Newsl., 15(2):49–60, 2014.
  • (31) Taciana A. F. Gomes, Ricardo B. C. Prudêncio, Carlos Soares, André L. D. Rossi, and nd André C. P. L. F. de Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3–13, 2012.
  • (32) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, March 2012.
  • (33) Matthias Reif, Faisal Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87:357–380, 2012.
  • (34) Barbara F.F. Huang and Paul C. Boutros. The parameter sensitivity of random forests. BMC Bioinformatics, 17(1):331, 2016.
  • (35) Katharina Eggensperger, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 1114–1120. AAAI Press, 2015.
  • (36) Lidan Wang, Minwei Feng, Bowen Zhou, Bing Xiang, and Sridhar Mahadevan. Efficient hyper-parameter optimization for NLP applications. In Lluís Màrquez, Chris Callison-Burch, Jian Su, Daniele Pighin, and Yuval Marton, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2112–2117. The Association for Computational Linguistics, 2015.
  • (37) Tatjana Eitrich and Bruno Lang. Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of Comp. and Applied Mathematics, 196(2):425–436, 2006.
  • (38) J. Gascón-Moreno, S. Salcedo-Sanz, E. G. Ortiz-García, L. Carro-Calvo, B. Saavedra-Moreno, and J. A. Portilla-Figueras. A binary-encoded tabu-list genetic algorithm for fast support vector regression hyper-parameters tuning. In International Conference on Intelligent Systems Design and Applications, pages 1253–1257, 2011.
  • (39) Munehiro Nakamura, Atsushi Otsuka, and Haruhiko Kimura. Automatic selection of classification algorithms for non-experts using meta-features. China-USA Business Review, 13(3):199–205, 2014.
  • (40) Parker Ridd and Christophe Giraud-Carrier. Using metalearning to predict when parameter optimization is likely to improve classification accuracy. In Joaquin Vanschoren, Pavel Brazdil, Carlos Soares, and Lars Kotthoff, editors, Meta-learning and Algorithm Selection Workshop at ECAI 2014, pages 18–23, August 2014.
  • (41) Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michèle Sebag. Collaborative hyperparameter tuning. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 199–207. JMLR Workshop and Conference Proceedings, 2013.
  • (42) M. Lang, H. Kotthaus, P. Marwedel, C. Weihs, J. Rahnenführer, and B. Bischl. Automatic model selection for high-dimensional survival analysis. Journal of Statistical Computation and Simulation, 85(1):62–76, 2015.
  • (43) P.B.C. Miranda, R.M. Silva, and R.B. Prudêncio. Fine-tuning of support vector machine parameters using racing algorithms. In Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2014, pages 325–330, 2014.
  • (44) Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 1128–1135. AAAI Press, 2015.
  • (45) Vili Podgorelec, Saso Karakatic, Rodrigo C. Barros, and Márcio P. Basgalupp. Evolving balanced decision trees with a multi-population genetic algorithm. In IEEE Congress on Evolutionary Computation, CEC 2015, Sendai, Japan, May 25-28, 2015, pages 54–61. IEEE, 2015.
  • (46) Michael Schauerhuber, Achim Zeileis, David Meyer, and Kurt Hornik. Benchmarking Open-Source Tree Learners in R/RWeka, pages 389–396. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
  • (47) Ashish Sureka and Kishore Varma Indukuri. Using Genetic Algorithms for Parameter Optimization in Building Predictive Data Mining Models, pages 260–271. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
  • (48) Gregor Stiglic, Simon Kocbek, Igor Pernek, and Peter Kokol. Comprehensive decision tree models in bioinformatics. PLOS ONE, 7(3):1–13, 03 2012.
  • (49) Shih-Wei Lin and Shih-Chieh Chen. Parameter determination and feature selection for c4.5 algorithm using scatter search approach. Soft Computing, 16(1):63–75, jan 2012.
  • (50) J. Ma. Parameter Tuning Using Gaussian Processes. Master’s thesis, University of Waikato, New Zealand, 2012.
  • (51) C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD-2013, pages 847–855, 2013.
  • (52) Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, and Kevin Leyton-Brown. Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka. Journal of Machine Learning Research, 17:1–5, 2016.
  • (53) Quan Sun and Bernhard Pfahringer. Pairwise meta-rules for better meta-learning-based algorithm ranking. Mach. Learn., 93(1):141–161, oct 2013.
  • (54) Ashish Sabharwal, Horst Samulowitz, and Gerald Tesauro. Selecting near-optimal learners via incremental data allocation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2007–2015. AAAI Press, 2016.
  • (55) Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. Automated parameter optimization of classification techniques for defect prediction models. In Proceedings of the 38th International Conference on Software Engineering, ICSE ’16, pages 321–332, New York, NY, USA, 2016. ACM.
  • (56) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
  • (57) Michael Wainberg, Babak Alipanahi, and Brendan J. Frey. Are random forests truly the best classifiers? Journal of Machine Learning Research, 17(110):1–5, 2016.
  • (58) K. Bache and M. Lichman. UCI machine learning repository, 2013.
  • (59) Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, and Can Candan. caret: Classification and Regression Training, 2016. R package version 6.0-71.
  • (60) Róger Bermúdez-Chacón, Gaston H. Gonnet, and Kevin Smith. Automatic problem-specific hyperparameter optimization and model selection for supervised machine learning: Technical Report. Technical report, Zürich, 2015.
  • (61) Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2944–2952. Curran Associates, Inc., 2015.
  • (62) Julien-Charles Lévesque, Christian Gagné, and Robert Sabourin. Bayesian hyperparameter optimization for ensemble learning. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 437–446, Arlington, Virginia, United States, 2016. AUAI Press.
  • (63) Alexis Sardá-Espinosa, Subanatarajan Subbiah, and Thomas Bartz-Beielstein. Conditional inference trees for knowledge extraction from motor health condition data. Engineering Applications of Artificial Intelligence, 62:26 – 37, 2017.
  • (64) Asa Ben-Hur and Jason Weston. A user’s guide to support vector machines. In Data Mining Techniques for the Life Sciences, volume 609 of Methods in Molecular Biology, pages 223–239. Humana Press, 2010.
  • (65) Sigrun Andradottir. A review of random search methods. In Michael C Fu, editor, Handbook of Simulation Optimization, volume 216 of International Series in Operations Research & Management Science, pages 277–292. Springer New York, 2015.
  • (66) Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
  • (67) J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. 30th Intern. Conf. on Machine Learning, pages 1–9, 2013.
  • (68) Frauke Friedrichs and Christian Igel. Evolutionary tuning of multiple svm parameters. Neurocomput., 64:107–117, 2005.
  • (69) Alex Kalos. Automated neural network structure determination via discrete particle swarm optimization (for non-linear time series models). In Proceedings of the 5th WSEAS International Conference on Simulation, Modelling and Optimization, SMO’05, pages 325–331. World Scientific and Engineering Academy and Society (WSEAS), 2005.
  • (70) Dan Simon. Evolutionary Optimization Algorithms. Wiley, first edition, 2013.
  • (71) Gavin C. Cawley and Nicola L. C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 11:2079–2107, 2010.
  • (72) Damjan Krstajic, Ljubomir J. Buturovic, David E. Leahy, and Simon Thomas. Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of Cheminformatics, 6(1):1–15, 2014.
  • (73) Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen, and Zne-Jung Lee. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35(4):1817–1824, 2008.
  • (74) Xin-She Yang, Zhihua Cui, Renbin Xiao, Amir Hossein Gandomi, and Mehmet Karamanoglu. Swarm Intelligence and Bio-Inspired Computation: Theory and Applications. Elsevier Science Publishers B. V., 1st edition, 2013.
  • (75) Bernd Bischl, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. mlr: Machine learning in r. Journal of Machine Learning Research, 17(170):1–5, 2016.
  • (76) Luca Scrucca. Ga: A package for genetic algorithms in r. Journal of Statistical Software, 53(1):1–37, 2013.
  • (77) Claus Bendtsen. pso: Particle Swarm Optimization, 2012. R package version 1.0.3.
  • (78) Yasser Gonzalez-Fernandez and Marta Soto. copulaedas: An R package for estimation of distribution algorithms based on copulas. Journal of Statistical Software, 58(9):1–34, 2014.
  • (79) Kurt Hornik, Christian Buchta, and Achim Zeileis. Open-source machine learning: R meets Weka. Computational Statistics, 24(2):225–232, 2009.
  • (80) Terry Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursive Partitioning and Regression Trees, 2015. R package version 4.1-10.
  • (81) Bernd Bischl, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions.
  • (82) Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.
  • (83) Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stüetzle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43 – 58, 2016.
  • (84) Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. The balanced accuracy and its posterior distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE Computer Society, 2010.
  • (85) Maurice Clerc. Standard particle swarm optimisation. 15 pages, September 2012.
  • (86) Mauricio Zambrano-Bigiarini, Maurice Clerc, and Rodrigo Rojas. Standard particle swarm optimisation 2011 at CEC-2013: A baseline for future PSO improvements. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2013, Cancun, Mexico, June 20-23, 2013, pages 2337–2344. IEEE, 2013.
  • (87) K. L. Mills, J. J. Filliben, and A. L. Haines. Determining relative importance and effective settings for genetic algorithm control parameters. Evol. Comput., 23(2):309–342, June 2015.
  • (88) Janez Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
  • (89) A. Orriols-Puig, N. Macia, and T. K. Ho. Documentation for the data complexity library in c++. Technical report, La Salle - Universitat Ramon Llull, Barcelona, Spain, 2010.
  • (90) Luís P.F. Garcia, André C.P.L.F. de Carvalho, and Ana C. Lorena. Noise detection in the meta-learning level. Neurocomputing, 176:14–25, 2016.
  • (91) Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Leslie Pérez Cáceres, Mauro Birattari, and Thomas Stützle. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43 – 58, 2016.
  • (92) Leslie Pérez Cáceres, Manuel López-Ibáñez, and Thomas Stützle. An Analysis of Parameters of irace, pages 37–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.
  • (93) Manuel López-Ibáñez, Jérémie Dubois-Lacoste, Thomas Stützle, and Mauro Birattari. The irace package, iterated race for automatic algorithm configuration. Technical Report TR/IRIDIA/2011-004, IRIDIA, Université Libre de Bruxelles, Belgium, 2011.
  • (94) Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, and Kenichi Matsumoto. An empirical comparison of model validation techniques for defect prediction models. IEEE Transactions on Software Engineering, 43(1):1–18, 2017.

Appendix A List of abbreviations used in the paper


Appendix B List of OpenML datasets used for experiments

Nro OpenML name OpenML did D N C %MajC
1 acute-inflammations 1455 6 120 2 0.58
2 analcatdata_authorship 458 70 841 4 0.38
3 analcatdata_boxing1 448 3 120 2 0.65
4 analcatdata_boxing2 444 3 132 2 0.54
5 analcatdata_creditscore 461 6 100 2 0.73
6 analcatdata_dmft 469 4 797 6 0.19
7 analcatdata_germangss 475 5 400 4 0.25
8 analcatdata_lawsuit 450 4 264 2 0.93
9 appendicitis 1456 7 106 2 0.80
10 artificial-characters 1459 7 10218 10 0.14
11 autoUniv-au1-1000 1547 20 1000 2 0.74
12 autoUniv-au4-2500 1548 100 2500 3 0.47
13 autoUniv-au6-1000 1555 40 1000 8 0.24
14 autoUniv-au6-750 1549 40 750 8 0.22
15 autoUniv-au6-400 1551 40 400 8 0.28
16 autoUniv-au7-1100 1552 12 1100 5 0.28
17 autoUniv-au7-700 1553 12 700 3 0.35
18 autoUniv-au7-500 1554 12 500 5 0.38
19 backache 463 31 180 2 0.86
20 balance-scale 11 4 625 3 0.46
21 banana 1460 2 5300 2 0.55
22 bank-marketing 1461 16 45211 2 0.88
23 banknote-authentication 1462 4 1372 2 0.56
24 blood-transfusion-service-center 1464 4 748 2 0.76
25 breast-w 15 9 699 2 0.66
26 breast-tissue 1465 9 106 6 0.21
27 liver-disorders 8 6 345 2 0.58
28 car 21 6 1728 4 0.70
29 cardiotocography v.2 (version 2) 1560 35 2126 3 0.78
30 climate-model-simulation-crashes 1467 20 540 2 0.91
31 cloud 210 6 108 4 0.30
32 cmc 23 9 1473 3 0.43
33 sonar 40 60 208 2 0.53
34 vowel 307 13 990 11 0.09
35 dermatology 35 34 366 6 0.31
36 fertility 1473 9 100 2 0.88
37 first-order-theorem-proving 1475 51 6118 6 0.42
38 solar-flare 173 12 1389 6 0.29
39 haberman 43 3 306 2 0.74
40 hayes-roth 329 4 160 3 0.41
41 heart-c 49 13 303 5 0.54
42 heart-h 51 13 294 2 0.64
43 heart-long-beach 1512 13 200 5 0.28
44 heart-h v.3 (version 3) 1565 13 294 5 0.64
45 hepatitis 55 19 155 2 0.79
46 hill-valley 1479 100 1212 2 0.50
47 colic 25 27 300 2 0.64

Table 7: (Multi-class) classification OpenML datasets (1 to 47) used in experiments. For each dataset it is shown: the OpenML dataset name and id, the number of attributes (D), the number of examples (N), the number of classes (C), and the percentage of examples belonging to the majority class (%MajC).
Nro OpenML name OpenML did D N C %MajC
48 ilpd 1480 10 583 2 0.71
49 ionosphere 59 33 351 2 0.64
50 iris 61 4 150 3 0.33
51 kr-vc-kp 3 36 3196 2 0.52
52 LED-display-domain-7digit 40496 7 500 10 0.11
53 lsvt 1484 310 126 2 0.67
54 mammography 310 5 961 2 0.54
55 meta 566 21 528 24 0.04
56 mfeat-fourier 14 76 2000 10 0.10
57 micro-mass 1514 1300 360 10 0.10
58 molecular-biology-promoters 164 57 106 2 0.50
59 splice 46 62 3190 3 0.52
60 monks-problems-1 333 6 556 2 0.50
61 monks-problems-2 334 6 601 2 0.66
62 monks-problems-3 335 6 554 2 0.52
63 libras-move v.2 40736 90 360 15 0.07
64 mfeat-factors 12 217 2000 10 0.10
65 mushroom 24 21 8124 2 0.52
66 nursery (v.3) 1568 9 12958 4 0.33
67 optdigits 28 62 5620 10 0.10
68 ozone-level-8hr 1487 72 2534 2 0.94
69 ozone_level v.2 40735 72 2536 2 0.97
70 page-blocks 30 10 5473 5 0.90
71 parkinsons 1488 22 195 2 0.75
72 phoneme 1489 5 5404 2 0.71
73 one-hundred-plants-margin 1491 65 1600 100 0.01
74 one-hundred-plants-shape 1492 65 1600 100 0.01
75 one-hundred-plants-texture 1493 65 1599 100 0.01
76 wall-robot-navigation v.3 (version 3) 1526 4 5456 4 0.40
77 sa-heart 1498 9 462 2 0.65
78 seeds 1499 7 210 3 0.33
79 semeion 1501 257 1593 10 0.10
80 credit-g 31 20 1000 2 0.70
81 heart-statlog 53 13 270 2 0.56
82 segment 36 18 2310 7 0.14
83 satellite_image v.2 40734 36 2859 6 0.30
84 vehicle 54 18 846 4 0.26
85 steel-plates-fault 1504 33 1941 2 0.65
86 tae 48 5 151 3 0.34
87 texture 40499 40 5500 11 0.09
88 thoracic-surgery 1506 16 470 2 0.85
89 thyroid-allbp 40474 26 2800 5 0.58
90 thyroid-allhyper 40475 26 2800 5 0.58
91 user-knowledge 1508 6 403 5 0.32
92 vertebra-column 1523 6 310 3 0.48
93 wine 187 14 178 3 0.39
94 yeast (version v.7) 40733 8 1484 4 0.36

Table 8: (Multi-class) classification OpenML datasets (48 to 94) used in experiments. For each dataset it is shown: the OpenML dataset name and id, the number of attributes (D), the number of examples (N), the number of classes (C), and the percentage of examples belonging to the majority class (%MajC).