A meta-learning recommender system for hyperparameter tuning: predicting when tuning improves SVM classifiers

06/04/2019, by Rafael Gomes Mantovani et al. (Universidade de São Paulo, UNESP, UTFPR, TU Eindhoven)

For many machine learning algorithms, predictive performance is critically affected by the hyperparameter values used to train them. However, tuning these hyperparameters can come at a high computational cost, especially on larger datasets, while the tuned settings do not always significantly outperform the default values. This paper proposes a recommender system based on meta-learning to identify exactly when it is better to use default values and when to tune hyperparameters for each new dataset. In addition, an in-depth analysis is performed to understand what the meta-models take into account in their decisions, providing useful insights. An extensive analysis of different categories of meta-features, meta-learners, and setups across 156 datasets is performed. Results show that it is possible to accurately predict when tuning will significantly improve the performance of the induced models. The proposed system reduces the time spent on optimization processes, without reducing the predictive performance of the induced models (when compared with the ones obtained using tuned hyperparameters). We also explain the decision-making process of the meta-learners in terms of linear separability-based hypotheses. Although this analysis is focused on the tuning of Support Vector Machines, it can also be applied to other algorithms, as shown in experiments performed with decision trees.

1 Introduction

Many ml algorithms, among them svm Vapnik (1995), have been successfully used in a wide variety of problems. svm are kernel-based algorithms that perform non-linear classification using a hyperspace transformation, i.e., they map data inputs into a high-dimensional feature space where the problem is possibly linearly separable. Like most ml algorithms, svm are sensitive to their hp values, which directly affect their predictive performance and depend on the data under analysis. The predictive performance of svm is mostly affected by the values of four hp: the kernel function (k), its width (γ) or polynomial degree (d), and the regularization constant (C). Hence, finding suitable svm hp is a frequently studied problem Horn et al. (2016); Padierna et al. (2017). svm hp tuning is commonly modeled as a black-box optimization problem whose objective function is associated with the predictive performance of the svm induced model. Many optimization techniques have been proposed in the literature for this problem, varying from a simple gs to the state-of-the-art smbo technique Snoek et al. (2012). In Bergstra and Bengio (2012), Bergstra & Bengio showed theoretically and empirically that rs is a better alternative than gs and is able to find good hp settings when performing hp tuning. Mantovani et al. Mantovani et al. (2015a) also compared rs with meta-heuristics to tune svm hp. A large number of empirical experiments showed that rs generates models with predictive performance as effective as those obtained by meta-heuristics.

However, regardless of the optimization technique, hyperparameter tuning usually has a high computational cost, particularly for large datasets, with no guarantee that a model with high predictive performance will be obtained. During tuning, a large number of hp settings usually needs to be assessed before a “good” solution is found, requiring the induction of several models and multiplying the learning cost by the number of settings evaluated. Besides, several aspects, such as the complexity of a dataset, can influence the tuning cost.

When computational resources are limited, a commonly adopted alternative is to use the default hp values suggested by ml tools. Previous works have pointed out that, for some datasets, hp tuning of svm is not necessary Ridd and Giraud-Carrier (2014). Using default values largely reduces the overall computational cost but, depending on the dataset, can result in models whose predictive performance is significantly worse than that of models produced by hp tuning. The ideal situation would be to recommend the best alternative, default or tuned hp values, for each new dataset.

In this paper, we propose a recommender system to predict, when applying svm to a new dataset, whether it is better to perform hp tuning or it is sufficient to use default hp values. This system, based on mtl Brazdil et al. (2009), is able to reduce the overall cost of tuning without significant loss in predictive performance. Another important novelty in this study is a descriptive analysis of how the recommendation occurs. Although the recommender system is proposed for the hp tuning of svm, it can also be used for other ml algorithms. To illustrate this aspect, we present an example where the recommender system is used for hp tuning of a dt induction algorithm.

The proposed recommender system can also be categorized as an automl solution Feurer et al. (2015), since it aims to relieve the user from the repetitive and time-consuming tuning task, automating the process through mtl. The automl area is relatively new, and there are still many questions to be addressed. This fact, and the emerging attention it has attracted from important research groups Feurer et al. (2015); Kotthoff et al. (2016) and large companies (e.g., Google Cloud AutoML - https://cloud.google.com/automl/ and Microsoft Custom Vision - https://www.customvision.ai), highlights the importance of new studies in this area. An essential aspect for the success of automl systems is to provide an automatic and robust tuning system, which also emphasizes the relevance of the problem investigated in this paper.

In summary, the main contributions of this study are:

  • the development of a modular and extensible mtl framework to predict when default hp values provide accurate models, saving computational time that would be wasted on optimization with no significant improvement;

  • a comparison of the effectiveness of different sets of meta-features and preprocessing methods for meta-learning, not previously investigated;

  • reproducibility of the experiments and analyses: all the code and experimental results are made available to reproduce the experiments and analyses and to allow further investigations (the code is available in GitHub repositories, while the experimental results are available on openml Vanschoren et al. (2014) study pages; these links are provided in Table 8 in Subsection 4.7).

It is important to mention that we considered the proposed framework for predictive tasks, in particular, supervised classification tasks using svm. However, the issues investigated in this paper can be easily extended to other tasks (such as regression) and other ml algorithms (a note on the generalization of the proposal is presented in Section 5.8).

This paper is structured as follows: Section 2 presents the basic mtl concepts used in our approach. Section 3 defines the hp tuning problem and presents a concise survey of prior work combining svm with mtl. The complete experimental methodology followed to obtain the results is presented in Section 4. Results are discussed in Section 5, while final considerations and conclusions are presented in Section 6.

2 Background on Meta-Learning

Several ml algorithms have been proposed for prediction tasks. However, since each algorithm has its own inductive bias, some of them can be more appropriate for a particular dataset. When applying a ml algorithm to a dataset, a higher predictive performance can be obtained if an algorithm whose bias is more adequate to the dataset is used. The recommendation of the most adequate ml algorithm for a new dataset is investigated in a research area known as mtl Brazdil et al. (2009).

mtl has been largely used for algorithm selection Ali and Smith-Miles (2006), and for ranking Soares and Brazdil (2006) and prediction Reif et al. (2014) of the predictive performance of ml algorithms. It investigates how to learn from previous ml experiments. According to Brazdil et al. Brazdil et al. (2009), meta-learning can be used to improve the learning mechanism itself after each training process. In mtl, the process of using a learning algorithm to induce a model for a dataset is called base-learning. At the meta-level, likely useful information extracted from this process (meta-features) is used to induce a meta-model. This meta-model can recommend the most promising learning algorithm, a set of the best learning algorithms or a ranking of learning algorithms according to their estimated predictive performance for a new dataset. The knowledge extracted during this process is called meta-knowledge. The meta-features extracted from each dataset are a critical aspect. They must be sufficient to describe the main aspects necessary to distinguish the predictive performance obtained by different learning algorithms when applied to this dataset. As a result, they should allow the induction of a meta-model with good predictive performance. According to Vilalta et al. (2004), three different sets of measures can be applied to extract meta-features:

  • Simple, Statistical and Information-theoretic meta-features Brazdil and Henery (1994): these consist of simple measures about the input dataset, such as the number of attributes, examples and classes, skewness, kurtosis and entropy. They are the most explored subset of meta-features in the literature Feurer et al. (2015); Gomes et al. (2012); Miranda et al. (2012); Reif et al. (2012, 2014); Soares et al. (2004);

  • Model-based meta-features Bensusan et al. (2000): these are properties of a model induced by a ml algorithm for the dataset at hand. For instance, if a decision tree induction algorithm is applied to the dataset, statistics about nodes, leaves and branches can be used to describe the dataset. They have also been used frequently in the literature Reif et al. (2012, 2014);

  • Landmarking Pfahringer et al. (2000): the predictive performance obtained by models induced by simple learning algorithms, called landmarkers, is used to characterize a dataset. These measures were explored in studies such as Feurer et al. (2015); Reif et al. (2014).

Recently, new sets of measures have been proposed and explored in the literature:

  • Data complexity Ho and Basu (2002): this is a set of measures which analyze the complexity of a problem considering the overlap in the attribute values, the separability of the classes, and geometrical/topological properties. They have been explored in Garcia et al. (2016); and

  • Complex networks Morais and Prati (2013): measures based on complex network properties are extracted from a network built with the data instances. These measures can only be extracted from numerical data. Thus, preprocessing procedures are required for their extraction. They were explored in Garcia et al. (2016).
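
To make these measure categories more concrete, the sketch below computes a few simple, statistical and information-theoretic meta-features in R. It is only an illustrative sketch of the idea, not the extraction tool used in our experiments; the function name and the selected measures are illustrative assumptions.

```r
# Illustrative sketch (not the authors' extraction tool): a few simple,
# statistical and information-theoretic meta-features for a toy dataset.
library(e1071)   # provides skewness() and kurtosis()

simple_meta_features <- function(data, target_col) {
  y <- as.factor(data[[target_col]])
  x <- data[, setdiff(names(data), target_col), drop = FALSE]
  num <- x[, sapply(x, is.numeric), drop = FALSE]
  p <- table(y) / length(y)                       # class proportions
  list(
    n_examples    = nrow(x),                      # simple
    n_attributes  = ncol(x),                      # simple
    n_classes     = nlevels(y),                   # simple
    classes_min   = min(table(y)),                # simple (minority class size)
    mean_skewness = mean(sapply(num, skewness)),  # statistical
    mean_kurtosis = mean(sapply(num, kurtosis)),  # statistical
    class_entropy = -sum(p * log2(p))             # information-theoretic
  )
}

simple_meta_features(iris, "Species")
```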

3 Meta-learning for Hyperparameter tuning

As previously mentioned, there is a large number of studies investigating the use of mtl to automate one or more steps in the application of ml algorithms for data analysis tasks. These studies can be roughly grouped into the following approaches, according to what mtl does:

  • it recommends hp settings;

  • it predicts training runtime;

  • it recommends initial values for hp optimization;

  • it estimates predictive performance for an hp setting;

  • it predicts hp tuning improvement/necessity.

Table 1 summarizes a comprehensive list of studies that either embedded or used mtl to cope with the svm hp tuning problem. Next, these works are described in more detail.

3.1 Recommendation of HP settings

The first approach considered hp settings as independent algorithm configurations and predicted the best setting based on characteristics of the dataset under analysis. In this approach, the hp settings are predicted without actually evaluating the model on the new dataset Soares et al. (2004). In Soares et al. Soares et al. (2004) and Soares & Brazdil Soares and Brazdil (2006), the authors predicted the width (γ) of the svm Gaussian kernel for regression problems. A finite set of γ values was investigated and the predictive performance was assessed using 10-fold cv and the nmse evaluation measure. The recommendation of γ values for new datasets used a knn meta-learner.

Ali & Smith-Miles Ali and Smith-Miles (2006) presented a similar study, but selected one among five different svm kernel functions for 112 classification datasets. They assessed model predictive performance for different hp settings using a 10-fold cv procedure and the simple acc measure. Miranda & Prudêncio Miranda and Prudêncio (2013) proposed another mtl approach, called at Leite et al. (2012), to select the kernel width (γ) and the soft margin (C). Experiments performed on 60 classification datasets assessed the settings using a single 10-fold cv and the acc measure.

Lorena et al. Lorena et al. (2018) proposed a set of complexity meta-features for regression problems. One of the case studies evaluated was the svm hp tuning problem. The authors generated a finite grid of γ, C and ε (margin of tolerance for regression svm) values, assessing them with a single 10-fold cv and the nmse measure on 39 regression problems. The recommendation of hp values for new unseen datasets was performed by a knn distance-based meta-learner.

3.2 Prediction of Training Runtime

Other works investigated the use of mtl to estimate the training time of classification algorithms under different hp settings. In Reif et al. Reif et al. (2011), the authors predicted the training time of several classifiers, including svm. They defined a discrete grid of hp settings, assessing these settings on 123 classification datasets considering the pmcc and the nae performance measures. In Priya et al. Priya et al. (2012), the authors conducted a similar study, but used a ga to optimize the parameters and perform meta-feature selection for six meta-learners. Experiments were carried out over 78 classification datasets, assessing hp settings using a 5-fold cv and the mad evaluation measure.

3.3 Recommendation of initial values for HP optimization

mtl has also been used to speed up the optimization of hp values for classification algorithms Feurer et al. (2015); Gomes et al. (2012); Miranda et al. (2012); Reif et al. (2012). In Gomes et al. Gomes et al. (2012), mtl is used to recommend hp settings as initial search values for the pso and ts optimization techniques. Experiments were conducted on 40 regression datasets, adjusting the γ and C hp to reduce the nmse value. A knn meta-learner was used to recommend the initial search values.

Reif et al. Reif et al. (2012) and Miranda et al. Miranda et al. (2012) investigated, respectively, the use of ga and different versions of pso for the same task. In Miranda et al. Miranda et al. (2014), the authors used multi-objective optimization to optimize the hp, considering both the predictive performance and the number of support vectors. These studies used the simple accuracy measure and 10-fold cv to optimize hp values.

The same approach is explored in a tool to automate the use of ml algorithms, Auto-sklearn Feurer et al. (2015). In this tool, mtl is used to recommend hp settings for the initial population of the smbo optimization technique. The authors explored all the available svm hp on 140 openml classification datasets. It is the first and perhaps the only related work that uses nested-cv to assess hp settings. Each setting was assessed in terms of the simple acc measure.

3.4 Estimation of predictive performance for an HP setting

A more recent approach uses mtl to estimate the performance of ml algorithms considering their hp values. In Reif et al. Reif et al. (2014), the authors evaluated different ml algorithms, including svm, on 54 datasets and used the performance predictions to develop a mtl system for automatic algorithm selection.

Wistuba et al. Wistuba et al. (2018) adapted the acquisition function of surrogate models using an optimized meta-model. They evaluated several svm hp configurations with a holdout procedure over 109 datasets and used the meta-knowledge to predict the performance of new hp settings on new datasets. The authors also proposed a new taf that extended the original proposal by predicting the predictive performance of hp settings for surrogate models.

Eggensperger et al. Eggensperger et al. (2018) proposed a benchmarking approach based on “surrogate scenarios”, which extracts meta-knowledge from hp optimization and algorithm configuration problems and approximates the performance surface with regression models. One of the meta-datasets explored in the experimental setup contains a set of svm hp settings assessed on the MNIST dataset. These settings were obtained by executing a simple rs method and three optimizers: roar Hutter et al. (2009), irace López-Ibáñez et al. (2016), and pils.

Reference | Year | Approach | Tuning technique(s) | Meta-learner | Datasets | Source | Evaluation procedure | Evaluation measure
Soares et al. Soares et al. (2004) | 2004 | Recommends hp settings | gs | knn | 42 | UCI, METAL | 10-CV | NMSE
Soares & Brazdil Soares and Brazdil (2006) | 2006 | Recommends hp settings | gs | knn | 42 | UCI, METAL | 10-CV | NMSE
Ali & Smith-Miles Ali and Smith-Miles (2006) | 2006 | Recommends hp settings | gs | C5.0 | 112 | UCI, KDC | 10-CV | Acc
Miranda & Prudêncio Miranda and Prudêncio (2013) | 2013 | Recommends hp settings | gs | at | 60 | UCI | 10-CV | Acc
Lorena et al. Lorena et al. (2018) | 2018 | Recommends hp settings | gs | knn | 39 | UCI | 10-CV | NMSE
Reif et al. Reif et al. (2011) | 2011 | Predicts training runtime | gs | svm | 123 | UCI | - | PMCC, NAE
Priya et al. Priya et al. (2012) | 2012 | Predicts training runtime | ga | J48, svm, Bagging, knn, NB, JRip | 78 | UCI | 5-CV | MAD
Gomes et al. Gomes et al. (2012) | 2012 | Recommends initial values for hp optimization | gs, pso, ts | knn | 40 | WEKA | 10-CV | NMSE
Miranda et al. Miranda et al. (2012) | 2012 | Recommends initial values for hp optimization | gs, pso | knn | 40 | UCI, WEKA | 10-CV | Acc
Reif et al. Reif et al. (2012) | 2012 | Recommends initial values for hp optimization | gs, ga | knn | 102 | UCI, Statlib | 10-CV | Acc
Miranda et al. Miranda et al. (2014) | 2014 | Recommends initial values for hp optimization | gs, pso | knn | 100 | UCI | 10-CV | Acc
Auto-sklearn Feurer et al. (2015) | 2015 | Recommends initial values for hp optimization | gs, smbo | knn | 140 | OpenML | Nested-CV | Acc
Reif et al. Reif et al. (2014) | 2014 | Estimates predictive performance for a hp setting | gs | svm | 54 | UCI, Statlib | 10-CV | RMSE, PMCC
Wistuba et al. Wistuba et al. (2018) | 2018 | Estimates predictive performance for a hp setting | gs | gp | 109 | UCI, WEKA | Holdout | -
Eggensperger et al. Eggensperger et al. (2018) | 2018 | Estimates predictive performance for a hp setting | roar, irace, rs, pils | rf | 11 | AClib | Holdout | RMSE, scrr
Ridd & Giraud-Carrier Ridd and Giraud-Carrier (2014) | 2014 | Predicts hp tuning improvement | pso | J48, RF, svm | 326 | UCI, WEKA | - | AUC
Mantovani et al. Mantovani et al. (2015b) | 2015 | Predicts hp tuning necessity | gs, rs, eda, ga, pso | J48, SVM, LR, NB, knn, RF | 143 | UCI | Nested-CV | BAC
Sanders & Giraud-Carrier Sanders and Giraud-Carrier (2017) | 2017 | Predicts hp tuning improvement | ga | mlp | 229 | OpenML | 10-CV | AUC
Table 1: Summary of related studies applying mtl to the svm hp tuning problem. Fields without information in the related study are marked with a hyphen.

3.5 Prediction of HP tuning improvement/necessity

Although the studies mentioned in this section are the most related to our current work regarding the proposed modeling, they have different goals. While Ridd & Giraud-Carrier Ridd and Giraud-Carrier (2014) and Sanders & Giraud-Carrier Sanders and Giraud-Carrier (2017) are concerned with predicting the tuning improvement, Mantovani et al. Mantovani et al. (2015b) and the present study aim to predict when hp tuning is necessary.

Ridd & Giraud-Carrier Ridd and Giraud-Carrier (2014) investigated a cash problem. They carried out experiments using the pso technique to search the hyperspace of this cash problem on 326 binary classification datasets. Their mtl-based method predicts whether hp tuning would lead to a considerable increase in accuracy considering a pool of algorithms, including svm. Even though this is one of the first studies in this direction, some drawbacks can be pointed out:

  • the proposed method does not identify which algorithm and corresponding hp values the user should run to achieve an improved performance;

  • there is no guarantee that the training and testing data are mutually exclusive (i.e., that meta-examples do not overlap);

  • the rule used to label the meta-examples is defined empirically, based on thresholds on the accuracy difference between default and tuned hp values;

  • all the datasets are binary classification problems; and

  • it is not possible to reproduce the experiments, especially the base-level tuning, since most of the details are not explained and the code is not available.

Sanders & Giraud-Carrier Sanders and Giraud-Carrier (2017) used a ga technique for the hp tuning of three different ml algorithms, including svm. Their experimental results with 229 openml classification datasets showed that tuning almost always yielded significant improvements compared to default hp values. Thus, they focused on the regression task of predicting how much improvement can be expected from tuning the hp compared to using default values. They also addressed this task using mtl. However, their study presents some limitations, such as:

  • the optimization process of the svm hyperparameters was computationally costly and did not finish for most of the datasets;

  • the meta-learner was not able to predict hyperparameter tuning improvements for svm in those datasets whose tuning process finished;

  • there is no guarantee that the generated meta-examples are different from each other (intersection between training and test data), since OpenML stores different versions of the same dataset. This could lead to biased results; and

  • experiments are not reproducible since most of the details are not explained, and the code is not available.

Mantovani et al. Mantovani et al. (2015b) proposed a mtl recommender system to predict when svm hp tuning is necessary, i.e., when tuning is likely to improve the generalization power of the models. The meta-dataset was created by extracting characteristics based on simple and data complexity measures from 143 classification datasets. At the base level, different meta-heuristics (PSO, GA and EDA) were used to tune the SVM hp using a nested-cv resampling strategy. An ensemble of meta-models achieved the best predictive performance, assessed by the F-score, using simple meta-features. Besides these promising results, this study presents some shortcomings, such as:

  • the best predictive performance at the meta-level is moderately low;

  • when the method recommends tuning, the meta-heuristic which would lead to the best performance is not recommended;

  • only two default hp settings were investigated. In general, users try more than two settings before tuning;

  • there is no evidence that this method and the results can be generalized to other ml algorithms.

The main differences between the proposed approach and the most related works are shown in Table 2. It is important to note that, although these are the most similar studies we have found in the literature, they address different problems. Furthermore, the meta-datasets generated by each study were also different, since they were generated using different datasets, target algorithms, meta-features, and labeling rules. Because of these particularities, a direct comparison of these studies is not feasible. The only design choice we could reuse from them is the set of meta-features: in fact, the meta-features used by Ridd & Giraud-Carrier Ridd and Giraud-Carrier (2014) and Sanders & Giraud-Carrier Sanders and Giraud-Carrier (2017) are included in our experimental setup (Section 4.4).

Based on the literature, we realized that there is room for improvement in predicting the need for hp tuning of ml algorithms and in better understanding this meta-learning process. Our present work attempts to fill this gap by yielding meta-models with high predictive performance and by explaining the reasons behind their decisions. To do so, we comprehensively and systematically evaluated different categories of meta-features, preprocessing tasks, such as meta-feature selection and data balancing, and different default hp values.

Study | Goal prediction | Task | Tuning setup (base-level) | Labeling rule | Target algorithm | Meta-features
Ridd & Giraud-Carrier Ridd and Giraud-Carrier (2014) | improv. | class | Not detailed | Accuracy threshold | CASH (20 algs.) | Simple, Statistical, Landmarkers, Model-based
Mantovani et al. Mantovani et al. (2015b) | tuning necessity | class | Nested-CVs: holdout (inner), 10-CV (outer), BAC (fitness) | Confidence interval | SVM | Simple, Data complexity
Sanders & Giraud-Carrier Sanders and Giraud-Carrier (2017) | improv. | regr | 10-CV (single), AUC (fitness) | Confidence interval | CART, MLP, SVM | Simple, Statistical, Landmarkers, Model-based
Current study | tuning necessity | class | Nested-CVs: 3-fold (inner), 10-fold (outer), BAC (fitness) | Wilcoxon test | SVM, J48 | Many (see Section 4.4)
Table 2: The studies most related to our current approach. In the "Goal prediction" column, “improv.” means improvement prediction. In the "Task" column, “class” denotes classification, while “regr” denotes regression.

3.6 Summary of Literature Overview

The literature review carried out by the authors found a large increase in the use of mtl for tasks related to svm hp tuning. Several related works were found, but only three of them specifically investigated when hp tuning is necessary or how much improvement it brings (see Section 3.5). Overall, the following aspects were observed:

  • fourteen of the studies created the meta-knowledge using gs to tune the hp;

  • most of the studies also evaluated the resultant models with a single cv procedure and the simple acc evaluation measure;

  • half of the studies used around one hundred datasets or fewer. In Ridd and Giraud-Carrier (2014), the authors used more than 300 datasets, but all of them were binary classification problems;

  • all investigated a small number of categories to generate meta-features;

  • nine of the studies used only knn as meta-learner;

  • three of the studies applied meta-feature selection techniques to the meta-features;

  • two of the studies provided the complete resources necessary for the reproducibility of experiments;

  • none of the studies addressed all of the previous issues together.

In order to provide new insights into how the use of mtl in the svm hp tuning process can affect its predictive performance, this paper extends previous works by exploring:

  • Meta-features produced by measures from different categories;

  • Use of different learning algorithms as meta-learners;

  • Adoption of a reproducible and rigorous experimental methodology at base and meta-learning levels; and

  • Assessment of the use of meta-feature selection techniques to evaluate and select meta-features.

One of the main contributions of this paper is the analysis of the meta-model predictions to identify when it is better to use default or tuned hp values for the svm, and which meta-features have a major role in this identification.

4 Experimental methodology

In this paper, experiments were carried out using mtl to predict whether hyperparameter tuning can significantly improve svm induced models when compared with the performance provided by their default hyperparameter values (the e1071 package, the LibSVM Chang and Lin (2011) interface for the R environment, was used to implement svm). The framework treats the recommendation problem as a binary classification task and is formally defined as follows:

Let D be the dataset collection. Each dataset d_i ∈ D is described by a vector of meta-features mf_i ∈ MF, with MF the set of all known meta-features. Additionally, let L be a statistical labeling rule based on the prior evaluations obtained with tuned (P_tuned) and default (P_def) hyperparameter settings. Given a significance level α, L maps the prior performances to a binary classification task: L_α(P_tuned, P_def) ∈ {Tuning, Defaults}. Thus, we can train a meta-learner f to predict whether optimization will lead to a significant improvement on a new dataset d_new, i.e.:

f(mf_new) ∈ {Tuning, Defaults}    (1)

Figure 1 shows the general framework graphically, linking the two learning levels: the base level, where the hyperparameter tuning process is performed for the different datasets d_i ∈ D; and the meta level, where the meta-features mf_i are extracted from these datasets, the meta-examples are labeled according to the tuning experiments (L_α(P_tuned, P_def)), and the recommendation for a new unseen dataset d_new is made (f(mf_new)). The following subsections describe each of its components in detail.
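
To illustrate how the trained meta-learner is used in practice, the sketch below (in R) shows the recommendation step for a new dataset. It is a hypothetical sketch: extract_meta_features() stands in for the meta-feature extraction of Section 4.4, meta_model for a previously trained meta-learner, and e1071::tune.svm (a grid search over an illustrative grid) is used only as a stand-in for the random search tuning adopted at the base level.

```r
# Hypothetical sketch of the recommendation step for a new dataset.
# `extract_meta_features()` and `meta_model` are assumed to exist.
library(randomForest)
library(e1071)

recommend_and_run <- function(meta_model, new_data, target) {
  mf <- extract_meta_features(new_data, target)   # 1 x p data.frame of meta-features
  decision <- predict(meta_model, newdata = mf)   # "Tuning" or "Defaults"
  form <- as.formula(paste(target, "~ ."))

  if (decision == "Tuning") {
    # tune C and gamma (illustrative grid; the paper uses random search instead)
    tuned <- tune.svm(form, data = new_data, kernel = "radial",
                      cost = 2^(-5:5), gamma = 2^(-5:5))
    tuned$best.model
  } else {
    # keep the LibSVM defaults: cost = 1, gamma = 1/(number of features)
    svm(form, data = new_data, kernel = "radial")
  }
}
```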

Figure 1: Meta-learning system to predict whether hyperparameter tuning is required (adapted from Mantovani et al. (2015b)). In the figure, ”mf” means meta-feature.

4.1 OpenML classification datasets

The experiments used datasets from openml Vanschoren et al. (2014), a free scientific platform for the standardization of experiments and the sharing of empirical results. openml supports reproducibility, since any researcher can access and use the same data for benchmarking purposes. A total of 156 binary and multiclass classification datasets from different application domains were selected for the experiments (Item 1 in Figure 1). From all the available and active datasets, those meeting the following criteria were selected:

  • number of features does not exceed ;

  • number of instances between and ;

  • must not be a reduced, modified or binarized version of the original classification problem (more details about dataset versions can be found in the openml paper Vanschoren et al. (2014) and in the documentation page: https://docs.openml.org/#data);

  • must not be an adaptation of a regression dataset;

  • all the classes must have at least 10 examples, enabling the use of stratified 10-fold cv resampling.

These criteria are meant to ensure a proper evaluation: the first two avoid datasets that are too small or so large that they cause memory problems; the next two avoid datasets that are too similar to each other (preventing data leakage in our evaluation); and the last one allows the use of stratified 10-fold cv resampling, given the high probability of dealing with imbalanced datasets. We also excluded datasets already used in our related work on defining optimized defaults, resulting in 156 datasets to be used in our meta-dataset. All datasets meeting these criteria and their main characteristics are presented on the study page at OpenML (https://www.openml.org/s/52/data).

In order to be suitable for svm, the datasets were preprocessed: constant and identifier attributes were removed; logical attributes were converted into binary values {0, 1}; missing values were imputed by the median for numerical attributes and by a new category for categorical ones; all categorical attributes were converted with 1-N encoding; and all attributes were normalized to mean 0 and standard deviation 1. The OpenML Casalicchio et al. (2017) package (https://github.com/openml/openml-r) was used to obtain and select datasets from the openml website, while functions from the mlr Bischl et al. (2016) package (https://github.com/mlr-org/mlr) were used to preprocess them.
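
The sketch below illustrates this preprocessing pipeline with OpenML and mlr functions. It is a simplified approximation of the steps above (the exact pipeline used in our code repositories may differ in details), and the dataset id, the target name and the choice of standardization are only illustrative assumptions.

```r
# A minimal sketch of the dataset preprocessing described above.
library(OpenML)
library(mlr)

dataset <- getOMLDataSet(data.id = 61)$data      # e.g., OpenML 'iris' (id 61)

dataset <- removeConstantFeatures(dataset)       # drop constant / id-like columns
imp <- impute(dataset,
              classes = list(numeric = imputeMedian(),
                             factor  = imputeConstant("missing")))
dataset <- imp$data                              # median / new-category imputation
dataset <- createDummyFeatures(dataset, target = "class")   # 1-N encoding
dataset <- normalizeFeatures(dataset, target = "class",
                             method = "standardize")        # mean 0, sd 1
```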

4.2 SVM hyperparameter space

The svm hyperparameter space used in the experiments is presented in Table 3. For each hyperparameter, the table shows its symbol, name, type, range/options, the scale transformation applied, the default value provided by LibSVM Chang and Lin (2011), and whether it was tuned. Here, only the rbf kernel is considered, since it achieves good performance in general, can handle nonlinear decision boundaries, and has fewer numerical difficulties than other kernel functions (e.g., the values of the polynomial kernel may go to infinity) Hsu et al. (2007). For C and γ, the selected ranges cover the hyperspace investigated in Ridd and Giraud-Carrier (2014). The LibSVM default values are C = 1 and γ = 1/N, where N is the number of features of the dataset under analysis (LibSVM default values can be consulted at https://www.csie.ntu.edu.tw/~cjlin/libsvm/).

Symbol | Hyperparameter | Type | Range/Options | Scale | Default | Tuned
k | kernel | categorical | {RBF} | - | RBF | no
C | cost | real | as in Ridd and Giraud-Carrier (2014) | log | 1 | yes
γ | width of the kernel | real | as in Ridd and Giraud-Carrier (2014) | log | 1/N | yes
Table 3: SVM hyperparameter space used in the experiments. For each hyperparameter, the table shows its symbol, name, type, range/options, scale transformation applied, default value and whether it was tuned.
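
As a simple illustration of the difference between default and explicitly set values, the sketch below trains an svm with the e1071 package (which wraps LibSVM) using its defaults and then with one candidate setting sampled on a log scale; the candidate values are arbitrary placeholders, not the tuned ones.

```r
# Sketch: LibSVM defaults versus an explicit hyperparameter setting with e1071.
library(e1071)

data(iris)
p <- ncol(iris) - 1                              # number of predictors

# Default values: RBF kernel, cost = 1, gamma = 1/p (e1071/LibSVM defaults)
default_model <- svm(Species ~ ., data = iris, kernel = "radial")

# An explicit candidate setting sampled on a log2 scale, as done during tuning
candidate  <- list(cost = 2^3, gamma = 2^-4)     # illustrative values only
tuned_like <- svm(Species ~ ., data = iris, kernel = "radial",
                  cost = candidate$cost, gamma = candidate$gamma)
```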

4.3 Hyperparameter tuning process

The hyperparameter tuning process is depicted in Figure 1 (Item 2). Based on the defined hyperspace, the svm hyperparameters were adjusted with a rs technique for all the selected datasets. The tuning process was carried out using nested cv resamplings Krstajic et al. (2014), an “unbiased performance evaluation methodology” that correctly accounts for any overfitting that may occur during model selection (here, the hyperparameter tuning). In fact, most of the important state-of-the-art studies, including the Auto-WEKA Kotthoff et al. (2016); Thornton et al. (2013) (http://www.cs.ubc.ca/labs/beta/Projects/autoweka/) and Auto-sklearn Feurer et al. (2015) (https://github.com/automl/auto-sklearn) tools, have been using the nested cv methodology for hyperparameter selection and assessment. Thus, nested cvs were also adopted in this study. The number of outer folds was set to 10, as in Krstajic et al. (2014). Due to runtime constraints, the number of inner folds was set to 3.

A budget with a maximum of 300 evaluations per (inner) fold was considered. A comparative experiment using different budget sizes for svm was presented in Mantovani et al. (2015a). The results suggested that only a few iterations are required to reach good solutions in the near-optimal hyperspace region: in most of the cases, tuning reached good performance values after relatively few steps. Among the techniques used by the authors, rs was able to find near-optimum hyperparameter settings just like the more complex tuning techniques did. Overall, the techniques did not show statistical differences regarding performance, and rs presented a lower runtime than the population-based techniques (these findings are in line with what was previously described in Bergstra and Bengio (2012)).

Hence, the tuning setup detailed in Table 4 generates, for a single dataset, a total of 10 (outer folds) × 3 (inner folds) × 300 (budget) hp settings per seed during the search process. The tuning jobs were parallelized in a cluster facility provided by our university (http://www.cemeai.icmc.usp.br/Euler/index.html) and took four months to be completed.
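
A minimal sketch of this nested resampling setup with the mlr package is shown below. The parameter bounds are illustrative (the actual ranges follow Ridd and Giraud-Carrier (2014)), and the iris task is only a placeholder for one of the 156 openml datasets.

```r
# Sketch of the nested CV tuning setup: random search (budget 300) over C and
# gamma on a log2 scale, 3-fold inner CV optimizing balanced accuracy, and
# 10-fold outer CV for assessment.
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")

par_set <- makeParamSet(
  makeNumericParam("cost",  lower = -15, upper = 15, trafo = function(x) 2^x),
  makeNumericParam("gamma", lower = -15, upper = 15, trafo = function(x) 2^x)
)  # bounds are illustrative placeholders

ctrl  <- makeTuneControlRandom(maxit = 300)            # random search budget
inner <- makeResampleDesc("CV", iters = 3, stratify = TRUE)
outer <- makeResampleDesc("CV", iters = 10, stratify = TRUE)

learner <- makeTuneWrapper(makeLearner("classif.svm", kernel = "radial"),
                           resampling = inner, par.set = par_set,
                           control = ctrl, measures = bac)

res <- resample(learner, task, resampling = outer, measures = bac)
```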

Element Method R package
Tuning techniques Random Search mlr
Base Algorithm Support Vector Machines e1071
Outer resampling 10-fold cross-validation mlr
Inner resampling 3-fold cross-validation mlr
Optimized measure Balanced per class accuracy mlr
Evaluation measure Balanced per class accuracy, mlr
Optimization paths
Budget 300 iterations
Repetitions times with different seeds -
seeds = -
Baselines LibSVM defaults e1071
optimized defaults
Table 4: Hyperparameter base level learning experimental setup.

4.4 Meta-features

The meta-datasets used in the experiments were generated from ‘meta-features’ describing each dataset (Figure 1 - Item 3). These meta-features were extracted by applying a set of measures to the original datasets in order to obtain potentially relevant characteristics from them. A tool was developed to extract the meta-features and can be found on GitHub (https://github.com/rgmantovani/MfeatExtractor), as presented in Table 8. We extracted a set of 80 meta-features from the different categories described in Section 2. This set includes all the meta-features explored by the studies described in Subsection 3.5. The exact number of meta-features from each category can be seen in Table 5. A complete description of them may be found in Tables 10 and 11 (A).

Acronym Category #N Description

SM Simple 17 Simple measures
ST Statistical 7 Statistical measures
IN Information-theoretic 8 Information theory measures
MB Model-based (trees) 17 Features extracted from decision tree models
LM Landmarking 8 The performance of some ML algorithms
DC Data Complexity 14 Measures that analyze the complexity of a problem
CN Complex Networks 9 Measures based on complex networks
Total 80
Table 5: Meta-feature categories used in the experiments.

4.5 Meta-targets

The last piece of meta-information is the meta-target, whose value indicates whether hp tuning significantly improved the predictive performance of the svm model when compared with the use of default values. Since the hp tuning experiments cover many diverse datasets, several of them may be imbalanced. Hence, the bac measure Brodersen et al. (2010) was used as the fitness value during tuning, as well as for the final model assessment at the base-level learning (these performance values are assessed by bac using a nested-cv resampling method).

The so-called “meta-label rule” (Item 4 - Figure 1) applies the Wilcoxon paired test to compare the solutions achieved by the rs technique with the default hp settings. Given a dataset and a significance level α, if the hp tuned solutions were significantly better than those provided by the defaults, the corresponding meta-example is labeled as ‘Tuning’; otherwise, it receives the ‘Default’ label.

When performing the Wilcoxon test, three different values of α were considered, corresponding to the 90%, 95% and 99% confidence levels and resulting in three meta-datasets with different class distributions (Item 5 - Figure 1). The significance level α influences how strict the recommender system is when evaluating whether tuning improved the models’ performance compared to the use of default hp values. The smaller the significance level, the stricter the rule, i.e., there must be greater confidence that the tuned hyperparameter values improve the performance of the induced models. It may also result in different labels for the same meta-example when different α values are evaluated. The initial experimental designs only compared the LibSVM suggested default values with the hp tuned solutions. The resulting meta-datasets presented a high imbalance rate, with the “Tuning” class prevailing, and it was difficult to induce a meta-model with high predictive performance using such highly imbalanced data. An alternative to deal with this problem was to also consider the optimized default hp values proposed in Mantovani et al. (2015c). These optimized default values were obtained by optimizing a common set of hp values, able to induce models with high predictive performance, over a group of datasets.
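
The sketch below illustrates the meta-label rule for a single dataset: the per-fold bac values obtained with the tuned setting and with the (best) default setting are compared with a paired Wilcoxon test at a given α. The fold values are placeholders, and the one-sided alternative reflects our reading that tuning must be significantly better than the defaults.

```r
# Sketch of the meta-target labeling rule for one dataset (per-fold BAC values
# below are illustrative placeholders, not real experimental results).
bac_tuned   <- c(0.91, 0.88, 0.90, 0.93, 0.89, 0.92, 0.90, 0.91, 0.94, 0.88)
bac_default <- c(0.86, 0.87, 0.85, 0.90, 0.88, 0.84, 0.86, 0.89, 0.90, 0.85)

label_meta_example <- function(perf_tuned, perf_default, alpha = 0.05) {
  test <- wilcox.test(perf_tuned, perf_default,
                      paired = TRUE, alternative = "greater")
  # 'Tuning' only if tuning is significantly better at level alpha
  if (!is.na(test$p.value) && test$p.value < alpha) "Tuning" else "Defaults"
}

label_meta_example(bac_tuned, bac_default, alpha = 0.05)  # stricter: alpha = 0.01
```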

Figure 2: Average balanced per class accuracies comparing LibSVM default (libsvm.defaults), Multiple optimized default hyperparameter settings (multiple.defaults) and Random Search tuning technique (random.search) when defining the meta-target of each meta-example.

Figure 2 illustrates the benefits of using multiple default settings: LibSVM and optimized default values. In this figure, the x-axis identifies the datasets by their OpenML ids, listing them in decreasing order of the balanced per class accuracy (y-axis) obtained with the LibSVM default hyperparameter values. The figure shows three different curves:

  • libsvm.defaults: a black dotted line representing the averaged performance values obtained using LibSVM default hyperparameter values. It represents the choice of a user using LibSVM defaults;

  • random.search: a green line representing the averaged performance values obtained using the rs technique for tuning. It represents the choice of always tuning svm hyperparameters; and

  • multiple.defaults: a red line representing the best choice considering the LibSVM and optimized defaults hyperparameter values. It represents our approach, exploring multiple default values.

By looking at the difference between the black and green lines, it is possible to observe that models tuned with rs outperformed models using the default settings provided by LibSVM for around 2/3 of the datasets. However, when we consider multiple default settings (the best setting between the LibSVM and optimized values), identified by the red curve, their performance values were close to the performance obtained with tuned values. Thus, the meta-target labeling rule considered the difference between the predictive performance with tuned hyperparameters and the best predictive performance with multiple default hp values. A side effect of using multiple default hp values is a more class-balanced meta-dataset, increasing the proportion of meta-examples labeled as “Default”. As a result, the imbalance rate (majority class size divided by minority class size) of the meta-datasets was reduced.

Table 6 presents, for each resulting meta-dataset: the α value used to generate the labels, the number of meta-examples, the number of meta-features, and the class distribution. It is important to observe that none of these datasets were used in the related previous study that produced the optimized default hp settings Mantovani et al. (2015c).

In our experimental setup, the null hypothesis of the statistical meta-label rule states that there is no significant difference between tuned and default svm hp settings. Since we are concerned about preventing hp tuning when it is not necessary, a type I error is defined as labeling a meta-example as “Tuning” when its label is, in fact, “Default”. Therefore, the lower the α, the higher the confidence that the improvement achieved by the tuned values is not due to chance. On the other hand, the higher the α, the lower the requirement that the performance gain obtained by the tuning process is significant compared to the default values.

Since we are controlling the error of labeling a meta-example as “Tuning”, smaller α values will lead to a greater number of “Default” meta-examples. Conversely, the greater the α value, the greater the number of meta-examples labeled as “Tuning”. As can be seen in Table 6, the 90% confidence level implies more instances with the meta-target “Tuning” than the 99% level. In summary, if predictive performance is more critical, the user should set the significance level as high as possible; on the other hand, if the user is concerned about computational cost, the significance level should be set to smaller values. An example of this effect can be seen in Figure 2, where the blue dots represent all the datasets where defaults should be used, i.e., where tuning is not statistically significantly better.

Meta-dataset Meta Meta Class Distribution
examples features Tuning Default
SVM_90 156 80 102 54
SVM_95 156 80 98 58
SVM_99 156 80 94 62
Table 6: Meta-datasets generated from experiments with svm.

4.6 Experimental Setup

Seven classification algorithms were used as meta-learners (Item 7 - Figure 1): svm, cart, rf, knn, nb, lr and gp. These algorithms were chosen because they follow different learning paradigms, with different learning biases. All seven algorithms were applied to the meta-datasets using a cross-validation resampling strategy, repeated with different seeds (for reproducibility). All the meta-datasets presented in Table 6 are binary classification problems. Thus, the meta-learners’ predictions were assessed using the auc performance measure, a more robust metric than bac for binary problems. Moreover, auc also enables us to evaluate the influence of different threshold values on the predictions. Three options were also investigated at the meta-level:

  • Meta-feature selection: since each meta-example is described by many meta-features, it may be the case that just a small subset of them is necessary to induce meta-models with high accuracy. Thus, a sfs feature selection option was added to the meta-learning experimental setup. The sfs method starts from an empty set of meta-features and, at each step, adds the meta-feature that most increases the performance measure. It stops when a minimum required improvement value is not reached. Internally, it also performs a stratified 3-fold cv, assessing the resulting models according to the auc measure;

  • Tuning: since the hyperparameter values of the meta-learners may also affect their performance, tuning of the meta-learners was also considered in the experimental setup. A simple rs technique was performed with a budget of 300 evaluations, and the resulting models were assessed through an inner stratified 3-fold cv and the auc measure. Table 12 (Appendix B) shows the hyperspace considered for tuning the meta-learners.

  • Data balancing: even using the optimized default hp values, the classes in the meta-datasets were imbalanced. Thus, to reduce this imbalance, the smote Chawla et al. (2002) technique was used in the experiments.

Some of the algorithm implementations selected as meta-learners use a data scaling process by default. This is the case for the svm, knn and gp meta-learners. A preliminary experiment showed that removing this option considerably decreases their predictive performance, while it does not affect the other algorithms. When data scaling was applied to all algorithms, the performance values of the rf, cart, nb and lr meta-learners decreased. Thus, data scaling was not considered as an option, and the algorithms used their default procedures, with which they obtained their best performance values. Two baselines were also adopted for comparison: a meta-dataset composed only of simple meta-features and another composed only of data complexity ones. Both categories of meta-features were investigated by the related studies listed in Section 3.5.
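
The sketch below outlines how these three options can be set up with mlr at the meta-level. The meta_data object, the chosen learners and the tuning space are illustrative assumptions (the actual hyperspaces are listed in Table 12), and only one option would be enabled at a time in our setup.

```r
# Sketch of the meta-level options with mlr. `meta_data` is a hypothetical
# data.frame with 80 numeric meta-features plus a binary 'target' column.
library(mlr)

meta_task <- makeClassifTask(data = meta_data, target = "target")

# (1) data balancing with SMOTE (oversampling rate = 2)
meta_task_bal <- smote(meta_task, rate = 2)

# (2) meta-feature selection via sequential forward search (SFS)
fs_ctrl <- makeFeatSelControlSequential(method = "sfs", alpha = 0.01)
sel <- selectFeatures(makeLearner("classif.kknn", predict.type = "prob"),
                      meta_task_bal,
                      resampling = makeResampleDesc("CV", iters = 3),
                      measures = auc, control = fs_ctrl)

# (3) hyperparameter tuning of a meta-learner (random search, budget 300)
rf_ps   <- makeParamSet(makeIntegerParam("ntree", lower = 100, upper = 2000))
rf_tune <- makeTuneWrapper(makeLearner("classif.randomForest", predict.type = "prob"),
                           resampling = makeResampleDesc("CV", iters = 3),
                           par.set = rf_ps,
                           control = makeTuneControlRandom(maxit = 300),
                           measures = auc)

res <- resample(rf_tune, meta_task_bal,
                resampling = makeResampleDesc("CV", iters = 10), measures = auc)
```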

Element Method R package
Meta-learner svm e1071
cart rpart
rf randomForest
knn kknn
nb e1071
lr gbm
gp kernlab
Resampling -fold cv mlr
Meta-feature Selection Sequential Forward Search - mlr
inner 3-CV - measure auc
Tuning rs mlr
budget = 300
inner 3-CV - measure auc
Data Balancing smote mlr
oversampling rate = 2
Repetitions times with different seeds -
from the interval -
Evaluation measures auc mlr
predictions (prob)
Baselines Simple meta-features -
Data complexity meta-features -
Table 7: Meta-learning experimental setup.

4.7 Repositories for the coding used in this study

Details of the base-level tuning and meta-learning experiments are publicly available in the openml Studies (ids 52 and 58, respectively). On the corresponding pages, all datasets, classification tasks, algorithms/flows and results are listed and available for reproducibility. The code used for the hp tuning process (HpTuning), for extracting meta-features (MfeatExtractor), for running the meta-learning experiments (mtlSuite), and for performing the graphical analyses (MtlAnalysis) is hosted at GitHub. All of these repositories are also listed in Table 8.

Task/Experiment Website/Repository
Hyperparameter tuning code https://github.com/rgmantovani/HpTuning
Hyperparameter tuning results https://www.openml.org/s/52
Meta-feature extraction https://github.com/rgmantovani/MfeatExtractor
Meta-learning code https://github.com/rgmantovani/mtlSuite
Meta-learning results https://www.openml.org/s/58
Graphical Analysis https://github.com/rgmantovani/MtlAnalysis
Table 8: Repositories with tools developed by the authors and results generated by experiments.

5 Results and Discussion

The main experimental results are described in the next subsections. First, we present an overview of the predictive performance of the meta-models when predicting whether it is worth performing svm hp tuning. Next, different experimental setups and preprocessing techniques, such as dimensionality reduction, are evaluated. Finally, the predictions and the meta-knowledge produced by the meta-models are analyzed.

5.1 Average performance

Figure 3 summarizes the predictive performance of the different meta-learners for three different sets of meta-features, namely: all, complex and simple. The first contains all available meta-features, the second (complex) contains only data complexity measures as meta-features, and the third (simple) consists of simple and general meta-features.

In Figure 3a, the x-axis shows the meta-learners while the y-axis shows their predictive performance assessed by the auc averaged over the repetitions. In addition, the figure shows the impact of the different alpha (α) levels used in the Wilcoxon test for the definition of the meta-target labels. The Wilcoxon paired test was applied to assess the statistical significance of the differences in predictive performance obtained by the meta-models with all meta-features when compared to the second best approach.

An upward green triangle (▲) on the x-axis identifies situations where using all the meta-features was statistically better. On the other hand, red downward triangles (▼) show results where one of the alternative approaches was significantly better. In the remaining cases, the predictive performances of the meta-models were equivalent.

(a) Meta-learners’ average auc performance on the svm meta-datasets. The black dotted line at 0.5 represents the predictive performance of the ZeroR and Random meta-models.

(b) Comparison of the auc values of the induced meta-models according to the Friedman-Nemenyi test (CD = 3.002). Groups of algorithms that are not significantly different are connected.
Figure 3: auc performance values obtained by all meta-learners considering different meta-feature categories. Results are averaged over the repetitions.

The best results were obtained by the rf meta-learner using the data complexity (complex) meta-features, which achieved the highest auc values for all α levels. These meta-models were also statistically better than those obtained by the other approaches. The rf meta-learner using all the meta-features also generated models with high auc values.

When the value of α in the meta-label rule is reduced, the predictive performances using data complexity and all available meta-features tend to show similar distributions. The meta-learners obtained their best auc values with the highest α value. Overall, varying the α value did not substantially change the predictive performance of the evaluated algorithms. In fact, few meta-examples had their meta-targets modified by the meta-rule when different values of α were used. Thus, the predictions in the different scenarios are mostly the same and the performances remained similar.

Regarding predictive performance, rf, svm, gp and knn induced accurate meta-models for all three meta-dataset variations. Even lr, depending on the meta-features used to represent the recommendation problem, achieved reasonable auc values. For comparison purposes, it is important to mention that both the Random and ZeroR (a classifier that simply predicts the majority class) baselines obtained an auc of 0.5 in all these meta-datasets (the auc performance values were assessed using the implementations provided by the mlr R package).

The Friedman test Demšar (2006) was used to assess the statistical significance of the differences among the meta-learners. In the comparisons, we considered the algorithms’ performance across the combinations of meta-datasets and meta-feature categories. The null hypothesis states that all the meta-learners are equivalent regarding auc performance. When the null hypothesis is rejected, the Nemenyi post-hoc test is applied to indicate when two different techniques are significantly different.
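
For illustration, the sketch below applies the Friedman test to a matrix of auc values in which rows are meta-dataset/meta-feature combinations and columns are meta-learners; all values are placeholders, not our experimental results. A Nemenyi post-hoc analysis (e.g., with the PMCMRplus package) can then be used to build the cd diagram.

```r
# Sketch of the statistical comparison at the meta-level (placeholder values).
auc_matrix <- rbind(
  c(RF = 0.82, GP = 0.79, SVM = 0.78, KNN = 0.76, CART = 0.70, LR = 0.68, NB = 0.66),
  c(RF = 0.85, GP = 0.80, SVM = 0.81, KNN = 0.77, CART = 0.72, LR = 0.70, NB = 0.65),
  c(RF = 0.83, GP = 0.81, SVM = 0.79, KNN = 0.75, CART = 0.71, LR = 0.69, NB = 0.67)
)

friedman.test(auc_matrix)                        # H0: all meta-learners perform equally
colMeans(t(apply(-auc_matrix, 1, rank)))         # average ranks (lower = better)
```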

Figure 3b presents the resulting cd diagram, in which algorithms are connected when there is no significant difference between them. The top-ranked meta-learner was rf, followed by gp, svm and knn. These four did not present statistically significant differences among themselves, but mostly did when compared with the simpler algorithms: cart, lr and nb. Even though it was not statistically better than all the other choices, rf was always ranked at the top, regardless of the meta-dataset and meta-features.

Although the best result was obtained using dc meta-features (“complex”), most of the meta-learners achieved their highest auc performance values exploring all the available meta-features. Thus, since we want to analyze the influence of different categories of meta-features when inducing meta-models, and given the possibility of selecting different subsets from all the categories, we decided to explore all of them in the next analysis.

5.2 Evaluating different setups

Due to the large differences among the meta-learners’ results, three different setups were also evaluated, in order to improve their predictive performance and enable a comprehensive analysis of the investigated alternatives:

  • featsel - meta-feature selection via sfs Bischl et al. (2016);

  • tuned - hp tuning of the meta-learners using a simple rs technique;

  • smote - dataset balancing with smote Chawla et al. (2002).

They were compared with the original meta-data with no additional process (none), which is the baseline for these analyses. These setups were not all enabled at the same time to avoid overfitting, since the meta-datasets have only 156 meta-examples and, depending on the combination, three levels of cv would be needed to assess the models. For example, if meta-feature selection and hp tuning were both enabled at the meta-level, one cv would be used for meta-feature selection, one for tuning and another to assess the resulting models.

(a) Average auc performance values. The black dotted line at 0.5 represents the predictive performance of the ZeroR and Random meta-models.
(b) Average auc ranking values.

(c) Comparison of the auc values of the induced meta-models according to the Friedman-Nemenyi test. Groups of algorithms that are not significantly different are connected.
Figure 4: auc performance values obtained by all meta-learners considering different experimental setups. Results are averaged over repetitions.

Figure 4 summarizes the main aspects of these experimental results. Figure 4a shows the average auc values for each experimental setup, considering all the meta-learners and the α levels. The nb and lr meta-learners do not have any tunable hp; thus, their results are missing for the tuned setups (with and without smote). Similarly to Figure 3a, the statistical analysis is also presented: every time an upward green triangle is placed on the x-axis, the raw meta-data (none) generated results statistically better than the best of the evaluated experimental setups. On the other hand, red triangles indicate when tuning, meta-feature selection or smote could statistically improve the predictive performance of the meta-models. In the remaining cases, the meta-models were equivalent.

Despite the different setups evaluated, rf is still the best meta-learner in all scenarios, followed by the svm and gp versions using smote. Depending on the experimental setup, knn and lr also presented good predictive performance. Regarding the hp tuning (tuned) of the meta-learners, only knn showed slight improvements, for all the alpha values. Using just smote improved the results of the svm, gp and cart meta-learners. In general, it produced small improvements, but most of them were statistically significant. When used together with tuning or meta-feature selection, it affected the algorithms in different ways: for svm and gp the performance improved; for lr, nb and knn there was no benefit; and the other algorithms were not affected by its use. The small gain obtained with smote may be due to the fact that the data imbalance had already been reduced by using the optimized defaults when defining the meta-targets.

Using meta-feature selection (featsel) deteriorated the performance of the svm, rf, gp and cart meta-learners. On the other hand, it clearly improved the knn, lr and nb performances in most cases. knn benefited from using a subset of meta-features that maximizes the importance of the more relevant ones. For nb and lr, selecting a subset of the attributes reduced the presence of noisy and irrelevant attributes. Furthermore, it is important to observe that the meta-models induced with the selected features presented the highest standard deviation among the setups (light area along the curve). A possible reason is that different subsets are selected every time meta-feature selection is performed, in each repetition.

Additionally, Figure 4b presents a ranking with all the combinations of meta-learners and experimental setups. On the x-axis, they are presented in ascending order according to their average ranking over the three scenarios (α values), shown on the y-axis. The redder the squares, the lower the ranking, i.e., the better the results.

As previously reported, rf with no additional option was the best-ranked method, followed by its smoted versions, with the svm, gp and knn versions in the next positions. The Friedman test was also used to assess the statistical significance of the meta-learners when using different experimental setups on the different meta-datasets. Figure 4(c) shows the resulting cd diagram. The results are quite similar to those reported in Figure 4(b): rf obtained the best average ranking and is statistically better than most of the meta-learners, except for svm and gp.

Since none of these setups improved on the maximum auc values achieved so far, and the rf meta-learners remained the best ranked, the next subsections analyze the relative importance of the meta-features according to the final rf induced models.

5.3 Importance of meta-features

From the induced rf meta-models, the relative importance of the meta-features was computed based on the Gini impurity index used to calculate the node splits Breiman (2001). Figure 5 shows the average relative importance of the meta-features obtained from the rf meta-models, considering all the meta-features and the intermediate alpha value (the middle case). At the x-axis, meta-features are presented in decreasing order of their average relative importance. From this point on, whenever a specific meta-feature is mentioned, its name is presented with a prefix indicating its category (according to Table 5).

Figure 5: Average meta-features relative importance obtained from rf meta-models. The names of the meta-features in the x-axis follow the acronyms presented in Tables 10 and 11 in A.

Since no meta-feature obtained a negative relative importance, none was discarded when building the meta-models. This also shows that a large number of meta-features were relevant for inducing the meta-models, a possible reason why meta-feature selection produced worse results for most of the meta-learners.
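As an illustration of how such importance values can be obtained, the following minimal R sketch (assumed meta-dataset name meta_df with a factor column meta_target; not the authors' code) grows an rf meta-model and extracts the Gini-based importance of each meta-feature.

library(randomForest)

# meta_df: assumed meta-dataset; meta_target in {"Defaults", "Tuning"}
rf_meta <- randomForest(meta_target ~ ., data = meta_df, ntree = 500)

# MeanDecreaseGini: impurity-based importance accumulated over all node splits
gini_imp <- importance(rf_meta, type = 2)
head(sort(gini_imp[, "MeanDecreaseGini"], decreasing = TRUE), 10)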

The most important meta-feature was a landmarking meta-feature, “LM.stump_sd”, which describes the standard deviation of the number of examples correctly classified by a decision stump; it captures the complexity of the problem with a very simple landmarker. The second most important was a simple meta-feature, “SM.classes_min”, which measures the minority class size. The third was also a simple meta-feature, “SM.classes_sd”, which describes the standard deviation of the number of examples per class. Together, these meta-features strongly indicate that, for rf, the most important meta-features are related to class imbalance. A rule extracted from a model induced by rf states that if the dataset is imbalanced, it is better to use default hp values for svm. The other important meta-features were:

  • “IN.nClEnt” and “IN.mutInf”: these are information-theoretic meta-features. While the first describes the class entropy for a normalized base-level dataset, the second measures the mutual information, i.e., the reduction of uncertainty about one random feature given knowledge of another;

  • “CN.betweenness”: betweenness centrality is a meta-feature derived from complex networks that measures, for a set of vertices and edges, the average number of shortest paths that traverse them. The value is small for simple datasets and high for complex ones;

  • “DC.l1” and “DC.t2”: these are data complexity meta-features. While the first measures the minimum of an error function for a linear classifier, the second measures the average number of points per dimension. These features are related to class separability (l1) and to the geometry of the problem’s dimensionality (t2);

  • “SM.dimension”: this is a simple meta-feature that measures the relation between the number of examples and the number of attributes in a dataset;

  • “CN.maxComp”: this is another complex-network meta-feature, measuring the maximum number of connected components in a graph. If a dataset presents high class overlap, the graph will present a large number of disconnected components, since connections between different classes are pruned.

Among the most important, there are meta-features from different categories (simple, data complexity, complex networks and information theory). Complex-network measures describe data complexity in terms of graphs and indicate how sparse the classes are. Data complexity meta-features try to extract information related to class separability. The stump meta-feature works along the same lines, trying to identify the complexity of the problem through simple landmarking. The information-theoretic meta-features indirectly check how informative the dataset attributes are for solving the classification problem.
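For concreteness, a few of the simple meta-features mentioned above could be computed as in the sketch below (the helper name and the convention that the class is the last column are assumptions, not the authors' extractor).

# df: any base-level dataset whose last column is the class attribute
simple_meta_features <- function(df) {
  y   <- df[[ncol(df)]]
  cls <- table(y)
  c(SM.samples     = nrow(df),
    SM.attributes  = ncol(df) - 1,
    SM.dimension   = nrow(df) / (ncol(df) - 1),  # examples per attribute
    SM.classes_min = min(cls),                   # size of the minority class
    SM.classes_sd  = sd(as.numeric(cls)))        # spread of the class sizes
}

simple_meta_features(iris)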

Although summarized rules cannot be obtained from rf meta-models, the analysis of meta-feature importance provides some useful information. For instance, dataset characteristics such as data balance, class sizes, complexity and linearity were considered relevant to recommend when hp tuning is required.

5.4 Linearity Hypothesis

The previous sections, in particular the rf meta-analysis, suggest that linearity is a key aspect to decide between the recommendation of default or tuned hp values. Experimental results indicate that default hp values might be good for classification tasks with high linear separability. As a consequence, tuning would be required for tasks with complex decision surfaces, where svm would need to find irregular decision boundaries.

(a) Performance differences between svm and a linear classifier in all the base-level datasets.
(b) Average relative importance of the meta-features obtained from rf meta-models. The names of the meta-features in the x-axis follow the acronyms presented in Tables 10 and 11 in A.
Figure 6: Linearity hypothesis results considering relative landmarking meta-features.

In order to investigate this hypothesis, a linear classifier was also evaluated on all the available datasets using the same base-level experimental setup described in Table 4. If the linearity hypothesis holds, the performance difference between svm and the linear classifier on meta-examples labeled as “Defaults” should be smaller than or equal to that on meta-examples labeled as “Tuning”.

Figure 6(a) shows the performance differences obtained on all the datasets at the base-level. Datasets at the x-axis are split according to their meta-target labels: “Tuning” on the left side, in black, and “Defaults” on the right side, in red. Despite some outliers, the performance differences for the “Tuning” meta-examples are in general much higher than those for the “Defaults” meta-examples. Thus, the observed patterns support the linearity hypothesis.

In Leite et al. (2012), the authors proposed a set of “rl” meta-features based on the pairwise performance differences of simple landmarking algorithms. This data characterization schema was used to train meta-learners based on the at algorithm. The patterns observed in Figure 6(a) follow the same principle, suggesting a new alternative to characterize base-level datasets. Following this proposal, new relative landmarking meta-features were generated based on five landmarking algorithms: knn, nb, lr, svm and ds. These new meta-features are described in Table 11 in Appendix A.
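A minimal sketch of how such relative landmarking meta-features could be computed is given below; the accuracy-based evaluation, the helper name and the use of a linear-kernel svm as the linear landmarker are assumptions, and the knn landmarker assumes numeric attributes.

library(e1071)   # svm, naiveBayes
library(class)   # knn
library(rpart)   # decision stump

# train/test: splits of a base-level dataset; target: name of the class column
rel_landmarking <- function(train, test, target) {
  form <- as.formula(paste(target, "~ ."))
  acc  <- function(pred) mean(pred == test[[target]])

  p_svm <- acc(predict(svm(form, data = train), test))
  p_lm  <- acc(predict(svm(form, data = train, kernel = "linear"), test))
  p_nb  <- acc(predict(naiveBayes(form, data = train), test))
  p_ds  <- acc(predict(rpart(form, data = train,
                             control = rpart.control(maxdepth = 1)),
                       test, type = "class"))
  p_nn  <- acc(knn(train[, names(train) != target],
                   test[,  names(test)  != target],
                   cl = train[[target]], k = 1))

  c(RL.diff.svm.lm   = p_svm - p_lm,
    RL.diff.nn.lm    = p_nn  - p_lm,
    RL.diff.nb.lm    = p_nb  - p_lm,
    RL.diff.stump.lm = p_ds  - p_lm)
}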

The same rf meta-analysis described in Section 5.3 was performed after adding the relative landmarking meta-features to the meta-datasets. These experiments indicate how useful the new meta-features are for the recommendation problem. Figure 6(b) shows the relative importance values of the meta-features averaged over the repetitions. The relative importance of the new meta-features is highlighted in red, while that of the simple landmarking meta-features is shown in blue.

Two of the relative landmarking meta-features are placed among the top-10 most important meta-features, RL.diff.nn.lm and RL.diff.svm.lm, and two other measures, including RL.diff.stump.lm, are in the top-20; all of them depend directly on the linear classifier performance. It is also worth noting that the simple landmarking meta-features performed, in general, worse than the relative landmarking ones. These relative importance results provide evidence that the linearity hypothesis holds and that at least one characteristic defining the need for svm hp tuning is the linearity of the base-level classification task.

5.5 Overall comparison

Given the potential shown by the relative landmarking meta-features, they were experimentally evaluated in combination with the meta-features previously identified as most important. Complex network (cnet) meta-features were included because they were ranked among the most important descriptors (as shown in Subsection 5.3). Simple and data complexity (complex) meta-features were the other two sets evaluated, as in the related studies listed in Section 3.5.

(a) Average AUC values when considering relative landmarking meta-features with all the other meta-features’ categories.
(b) Average AUC values for the best overall experimental setup with the simple and the data complexity baselines presented in Section 5.1.
Figure 7: Evaluating the previous experimental setups adding relative landmarking meta-features. The results are the average of runs.

Figure 7 presents a comparison between the main experimental setups considering the addition of the relative landmarking (relativeLand) meta-features. The left chart of Figure 7(a) shows the auc performance values obtained for each of the original setups, while the chart on the right presents the setup performances when relative landmarking meta-features were included. The figure shows that the use of relative landmarking meta-features improved all the setups in which they were included. For rf, at least three different setups exceeded the auc performance obtained in the initial experiments. The setup combining simple and relative landmarking meta-features induced the best meta-models for rf, svm and gp. The knn and lr meta-learners obtained their best predictive performance using data complexity and relative landmarking meta-features, while cart and nb did so with the “relativeLand” set alone.

Figure 7(b) compares the best setup from Figure 7(a), “simpleRelativeLand”, which uses both simple and relative landmarking meta-features, with the baselines presented in Section 5.1, which use the “simple” and data complexity (“complex”) meta-features often explored in related studies (see Table 1).

In this figure, the x-axis shows the different meta-learners, while the y-axis shows their predictive performance assessed by the auc averaged over the repetitions. The Wilcoxon paired test with α = 0.05 was applied to assess the statistical significance of these results. An upward green triangle at the x-axis identifies situations where the use of “simpleRelativeLand” was statistically better than using the baselines, while a red downward triangle indicates when a baseline was significantly better. In the remaining cases, the predictive performances of the meta-models were equivalent.
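For reference, the paired Wilcoxon comparison used above corresponds to the following small R example (the auc vectors are illustrative values, not results from the experiments).

# auc_a / auc_b: paired auc values of two setups over the same repetitions
auc_a <- c(0.80, 0.81, 0.82, 0.83, 0.84)
auc_b <- c(0.790, 0.795, 0.800, 0.805, 0.810)
wilcox.test(auc_a, auc_b, paired = TRUE)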

Overall, the meta-models induced with “simpleRelativeLand” meta-features were significantly better than those induced with baseline meta-features for most of the meta-learners: rf, svm, knn and cart obtained superior auc values. Furthermore, the best meta-learner (RF) also significantly outperformed our previous results. The baselines produced the best meta-models for only two algorithms: nb and lr. For the gp algorithm, the different setups did not present any statistically significant difference.

5.6 Analysis of the predictions

A more in-depth analysis of the meta-learners’ predictions can help to understand their behavior. Figure 8 shows the misclassifications of the meta-learners considering their best experimental setups. The top chart (Figure 8(a)) shows all the individual predictions, with the x-axis listing all the meta-examples and the y-axis the meta-learners. In this figure, “Defaults” labels are shown in black and “Tuning” labels in gray. The top line in the y-axis, “Truth”, shows the true labels of the meta-examples, which are ordered according to these labels. The bottom line (“*”) shows red points for meta-examples misclassified by all meta-learners.

(a) Meta-learners’ individual predictions.
(b) Meta-learners’ misclassification rates.
Figure 8: Meta-learners’ predictions considering the experimental setups which obtained the best auc values.

In the svm hp tuning recommendation task, “Defaults” is defined as the positive class and “Tuning” as the negative class. Therefore, a fn is a wrong recommendation to perform hp tuning on svm, and a fp is a wrong recommendation to use default hp values. While a reduction in fn can decrease the computational cost, a reduction in fp can improve predictive performance.
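The reading of fp and fn adopted here can be illustrated with a toy confusion matrix (the labels below are made up for illustration only).

# "Defaults" is the positive class, "Tuning" the negative class
truth <- factor(c("Defaults", "Defaults", "Tuning", "Tuning", "Defaults", "Tuning"))
pred  <- factor(c("Defaults", "Tuning",   "Tuning", "Defaults", "Defaults", "Tuning"))

cm <- table(Predicted = pred, Truth = truth)
fn <- cm["Tuning",   "Defaults"]  # tuning recommended when defaults suffice -> wasted runtime
fp <- cm["Defaults", "Tuning"]    # defaults recommended when tuning was needed -> lost accuracy
cm; c(FN = fn, FP = fp)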

Algorithms following different learning biases present different prediction patterns and this can be observed in Figure  8. Usually, most meta-examples are correctly classified (a better performance than the baselines). Besides, the following patterns can be observed:

  • knn and gp minimize the fn rate, correctly classifying most of the meta-examples labeled as “Defaults”. However, they misclassified many examples from the “Tuning” class, penalizing the overall performance of the recommender system;

  • svm, cart and lr minimized the fp rate, correctly classifying most of the meta-examples requiring tuning. However, they tended to assign meta-examples to the majority class.

A more balanced scenario is provided by the rf meta-models, which presented the best predictive performance. Although it was not the best algorithm for each class individually, it was the best when the two classes were considered.

Nro Name id D N C P Def (sd) Tun (sd) Label
17 jEdit_4.0_4.2 1073 8 274 2 0.96 0.73 (0.01) 0.73 (0.01) Defaults
36 banknote-authentication 1462 4 1372 2 0.80 0.99 (0.01) 0.99 (0.01) Tuning
78 autoUniv-au7-500 1554 12 500 5 0.22 0.29 (0.01) 0.31 (0.01) Tuning
97 optdigits 28 62 5620 10 0.97 0.99 (0.01) 0.99 (0.01) Tuning
Table 9: Misclassified datasets by all the meta-learners. For each dataset it is shown: the meta-example number (Nro); the OpenML dataset name (Name) and id (id); the number of attributes (D), examples (N) and classes (C); the proportion between the number of examples from minority and majority classes (P); the performance values obtained by defaults (Def) and tuned (Tun) hp settings assessed by bac; and the truth label (Label).

Table 9 lists the datasets misclassified by all the meta-learners, as indicated in Figure 8(a). Two of these meta-examples (one labeled “Defaults” and one “Tuning”) were correctly labeled by the statistical meta-rule, and therefore their misclassification may be due to the limited descriptive ability of the meta-features or to noise in the meta-dataset. The other two meta-examples were both labeled as “Tuning”, but the statistical difference detected is very small in terms of performance, which may indicate a limitation of the current meta-target rule criteria.

5.7 Projecting performances at base-level

This section assesses the impact of the choices made by the meta-learners at the base-level. It also analyzes and discusses the reduction in runtime when using the proposed meta-learning recommender system. Figure 9 shows the predictive performance of svm at the base-level using the approach (“Tuning” or “Defaults”) selected by the meta-learners to define the hp values. The best meta-learners identified in the previous sections were compared with three simple baselines: a model that always recommends hp tuning (Tuning), a model that always recommends the use of defaults (Defaults), and a model that provides random recommendations (Random).

Figure 9(a) shows a scatter plot with the projected bac performance and runtime values averaged over all the base-level datasets. Always performing hp tuning achieved the highest average bac value but was also the most expensive approach. On the other hand, always using default hp values is the fastest approach, but yields the lowest average bac value. The proposed meta-models lie above the Random and Defaults baselines, performing close to the Tuning baseline but with lower average runtime costs.
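The projection itself is straightforward; the sketch below (assumed column and object names, not the authors' code) picks, for each dataset, the bac and runtime of whichever strategy the meta-learner recommends, and does the same for the Tuning and Defaults baselines.

# results: assumed data frame with one row per base-level dataset and columns
# bac_tuned, bac_default, time_tuned, time_default; recommendation: predicted labels
project_base_level <- function(results, recommendation) {
  tune_it <- recommendation == "Tuning"
  data.frame(bac     = ifelse(tune_it, results$bac_tuned,  results$bac_default),
             runtime = ifelse(tune_it, results$time_tuned, results$time_default))
}

colMeans(project_base_level(results, rf_predictions))                   # a meta-learner
colMeans(project_base_level(results, rep("Tuning",   nrow(results))))   # always tune
colMeans(project_base_level(results, rep("Defaults", nrow(results))))   # always defaults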

(a) Average bac and runtime for the svm base-level data.

[CD diagram comparing SVM, Tuning, RF, CART, LR, Defaults, Random, NB, KNN and GP.]
(b) cd diagram comparing the bac values of the meta-learners at the base-level according to the Friedman-Nemenyi test.
Figure 9: Performance of the meta-learners projected into the svm hyperparameter tuning problem (base-level).

The Friedman test was also used to assess the statistical significance of the base-level predictions. The null hypothesis is that all the meta-learners and baselines are equivalent regarding the average predictive bac performance. When the null hypothesis is rejected, the Nemenyi post-hoc test is applied to indicate which techniques are significantly different. Figure 9(b) presents the cd diagram, where techniques are connected when there is no significant difference between them.

Overall, the approach that always uses default hp values (Defaults) is ranked last, followed by the Random baseline. Almost all meta-learners are significantly better than both and are equivalent to Tuning, which always performs hp tuning. Although the rf meta-model was considered the best at the meta-level, it was not ranked first here: the svm meta-model was. This occurred because svm most often recommended the use of tuned settings, which was reflected in its performance at the base-level. Since there are many datasets at the base-level, the overall gain is diluted among them. Even so, the meta-learners considerably reduced the computational costs related to tuning while maintaining a high predictive performance.

Besides, it can be observed in Figure 9(a) that tuning the svm hyperparameters for a single dataset takes, on average, two days. The most costly datasets (with a high number of features or examples) took almost 10 days to finish all the 10 tuning repetitions (seeds), even parallelizing the jobs in a high-performance cluster. Regarding meta-feature extraction, the same datasets took at most 10 minutes, mainly because of some mathematical operations used by the dc meta-features; for most meta-features, the time required to characterize a dataset is of the order of seconds. Thus, during the prediction phase with the induced meta-model, the computational cost of extracting the characteristics of a new dataset is irrelevant compared to the computational cost of the tuning process. We consider this an important argument in favor of using our system in practical scenarios.

5.8 A note on the generalization

Although the main focus of the paper is to investigate the svm hyperparameter tuning problem, we also conducted experiments for predicting the need for tuning dt. These experiments aim to provide more evidence that the proposed method can be generalized: once the meta-knowledge is extracted, the system is able to induce meta-models for any supervised learning algorithm. In particular, we investigated the hp tuning of the J48 algorithm, a WEKA implementation (http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html) of Quinlan’s C4.5 dt induction algorithm Quinlan (1993), one of the most popular ml algorithms. The tree models were induced using the RWeka package (https://cran.r-project.org/web/packages/RWeka/index.html). The meta-datasets for the J48 algorithm were generated based on hp tuning results obtained from the datasets reported in Mantovani et al. (2016). We also expanded the meta-knowledge by performing additional experiments to cover the same datasets explored with svm. In total, we obtained hp tuning results for the J48 algorithm on 165 openml datasets (see Table 14). The tuning processes followed the experimental setup described in Table 4, with some differences:

  • the J48 hyperparameter space has nine hyperparameters (the J48 hyperparameter space is presented in C);

  • a larger budget of evaluations was adopted in the experiments with trees than for svm, because of the greater search space of J48;

  • the tuned hyperparameter results were compared with those obtained from the J48 RWeka/WEKA default settings.

Table 14 in C presents the main characteristics of the meta-datasets generated from the J48 tuning experiments. The class distribution columns (Tuning and Default) indicate a different hyperparameter profile (the term “hyperparameter profile” refers here to how sensitive an algorithm is to hp tuning) compared to that observed in the experiments with svm: most of the meta-examples are labelled as “Default”, i.e., tuning did not statistically improve the algorithm’s performance in two thirds of the datasets. Here, we present results only for the meta-dataset using α = 0.05 for the statistical labeling rule; results obtained with the other values were quite similar.
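For reference, the J48 default and tuned settings compared in these experiments can be reproduced along the lines of the following RWeka sketch (the tuned values shown are illustrative only, not those found by the tuning runs).

library(RWeka)

j48_default <- J48(Species ~ ., data = iris)    # RWeka/WEKA defaults: C = 0.25, M = 2
j48_tuned   <- J48(Species ~ ., data = iris,
                   control = Weka_control(C = 0.10, M = 5, B = TRUE))

evaluate_Weka_classifier(j48_default, numFolds = 10)   # 10-fold CV estimate
evaluate_Weka_classifier(j48_tuned,   numFolds = 10)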

(a) AUC values considering different categories of meta-features.
(b) AUC values considering different experimental setups.

[CD = 3.002. CD diagram comparing RF, KNN, GP, CART, NB, LR and SVM.]
(c) CD diagram comparing different meta-learners in different categories of meta-features.

[CD = 2.122. CD diagram comparing SVM, RF, GP, LR, CART, NB and KNN.]
(d) CD diagram comparing different meta-learners in different experimental setups.
Figure 10: Meta-learners’ average AUC results on the J48 meta-dataset labelled with α = 0.05, and CD diagrams comparing the meta-learners according to the Friedman-Nemenyi test. Results are averaged over the runs.

The predictive performance of the meta-learners considering different categories of meta-features is summarized in Figure 10(a). The best results were obtained with the rf, svm and gp meta-learners, which achieved their best predictive performance when using “all” the available meta-features. Part of this performance may be due to the fact that predictions tend toward a specific class (Defaults), since these meta-datasets are imbalanced. Overall, the best meta-model was obtained by the svm algorithm. However, when considering all the possible scenarios (different alpha values), there was no statistical difference among the top-ranked meta-learners: rf, knn, gp, svm and lr (see Figure 10(c)). Furthermore, in general, the complete set of meta-features provided the best results for most algorithms.

Figure 10(b) shows the average auc values considering different experimental setups. The nb and lr algorithms do not have any tunable hyperparameters; consequently, their results for the “tuned” setups are missing from the chart. In general, compared with the auc values obtained on the original meta-dataset, improvements were obtained only for cart and knn when applying smote. For the other algorithms, the best meta-learners were still induced without any additional process. The statistical comparisons highlight the performance of the svm, rf and gp algorithms (see Figure 10(d)).

We also evaluated the potential of the rl meta-features in the J48 tuning recommendation problem. However, differently from what was reported in Section 5.4, they worsened most of the meta-models. Linearity is not a key aspect of the J48 tuning problem, which further reinforces the results obtained for svm (see Section 5.4).

Reproducing the rf analysis for the J48 tuning problem (see Section 5.3), the most important meta-feature was the data complexity measure “DC.f4”, which describes the collective attribute efficiency of a dataset. The second was a simple meta-feature, “SM.abs_cor”, which measures the linear relationship between two attributes, averaged over all pairs of attributes in the dataset. The top-3 is completed by another data complexity meta-feature, “DC.f3”, which describes the maximum individual attribute efficiency. The two dc meta-features measure the discriminative power of the dataset’s attributes, while the absolute correlation verifies whether the information provided by the attributes is redundant. These most important meta-features suggest that if a dataset has representative attributes, default hp values are robust enough to solve it; otherwise, J48 tuning is recommended.
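The “SM.abs_cor” idea discussed above amounts to the mean absolute pairwise correlation between numeric attributes, as in the small sketch below (helper name assumed).

# x: the numeric attributes of a base-level dataset
abs_cor <- function(x) {
  cm <- abs(cor(x))
  mean(cm[upper.tri(cm)])
}
abs_cor(iris[, 1:4])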

(a) Average bac and runtime for the J48 base-level data.

[CD = 1.0546. CD diagram comparing RF, NB, LR, KNN, CART, Defaults, GP, Random, SVM and Tuning.]
(b) cd diagram comparing the bac values of the meta-learners at the base-level according to the Friedman-Nemenyi test.
Figure 11: Performance of the Meta-learners projected in the J48 hyperparameter tuning problem (base-level).

Figure 11 shows the J48 meta-level predictions projected onto all the base-level datasets. Most, but not all, of the meta-examples are labeled as “Defaults”. The induced meta-models depicted in Figure 11(a) are above the “Defaults” baseline, but relatively close to the Random and Tuning ones. The average bac values of all the approaches are very close, even considering all the baselines, as can be noted from the scale of the y-axis. This is explained by the improvements obtained in the base-level tuning processes being relatively small compared with those obtained with svm.

However, all the meta-models have a lower average runtime than the Random and Tuning baselines. Most of the meta-models are better ranked than the baselines but, overall, there are no statistical differences among the evaluated approaches (Figure 11(b)). Even so, it is important to highlight that the meta-learners could also considerably reduce the computational cost related to tuning while keeping the predictive performance across the dataset collection.

6 Conclusions

This paper proposed and experimentally investigated a mtl framework to predict when to perform svm hp tuning. To do so, 156 different datasets publicly available at openml were used. The predictive performance of svm induced with tuned hp values (by a simple rs) and with default hp values was compared and used to design a recommender system. The default values were those provided by the e1071 R package and the optimized common settings from Mantovani et al. (2015c). Different experimental setups were analyzed with different sets of meta-features. The main findings are summarized next.

6.1 Tuning prediction

The main issue investigated in this paper was whether it is possible to accurately predict when hp tuning improves the predictive performance of svm, compared to the use of default hp values. If so, this can reduce the processing cost of applying svm to a new dataset. According to the experimental results, using rf and svm as meta-learners, this prediction can be made with high auc values.

Three significance levels were used with the Wilcoxon test to define the meta-target, which indicates whether it is better to use hp tuning or default hp values. Different sets of meta-features were evaluated, and rf meta-models using all the available meta-features obtained the best results, regardless of the value considered. However, the complex set of meta-features also resulted in high predictive performance for most of the investigated meta-learners. Different experimental setups were evaluated at the meta-level as well, but improvements were observed only in a few cases, when smote or meta-feature selection was used. Thus, the best overall setup was to use the raw meta-data and meta-learners with default hp values.

An analysis of the rf meta-models shows that most meta-features actively contributed to the predictions, which explains the decrease in performance when using meta-feature selection with most of the algorithms. Among the meta-features ranked as most important, there are meta-features from different sets, each describing a different characteristic of the problem, such as data imbalance, linearity and complexity.

This paper also investigated the hypothesis that the level of linear separability could be an important meta-feature for the recommender system. Meta-features based on relative landmarking were used to measure the degree of linear separability. In the experiments performed, these meta-features were shown to play an important role in the recommender system predictions. Three meta-learners achieved their best auc performance values using a combination of simple and relative landmarking meta-features.

Using two different default hp settings maximized the number of default wins, reducing the imbalance rate of the meta-datasets. In addition to presenting the best predictive performance, the rf meta-models, through the frequency of meta-features in their trees, provided useful information on when default settings are suitable.

We also performed experiments for the J48 tuning recommendation problem, aiming to show the generalization ability of the mtl recommender system. The results showed that, differently from svm, where linearity was important to recommend the use of default settings, the most important meta-features for J48 suggest that if a dataset has representative attributes, default settings are robust enough to solve it.

In fact, our extensive experiments suggest that the guideline depends on the algorithm used to induce the meta-model. If we use a white box algorithm, such as rf, we can use the meta-features in the root of the trees (and nodes close to the root) to explain when to tune. The high predictive performance of the rf algorithm indicates that the induced models were able to find a good hypothesis for situations where tuning is necessary, in both cases (for svm and J48).

6.2 Linking findings with the literature

Two of the related studies in the literature Mantovani et al. (2015b); Ridd and Giraud-Carrier (2014) used meta-models based on decision trees to interpret their predictions. However, in the current results, cart trees were among the worst meta-learners across all the experimental setups analyzed. Thus, the meta-analysis performed in this study was based on the rf meta-learner, extracting the average Gini importance of the meta-features from the inner rf trees.

Meta-feature selection was also evaluated in Ridd and Giraud-Carrier (2014). The authors explored a cfs method and reported the “nn” meta-feature (the predictive performance of the 1-NN algorithm; see Tables 10 and 11 in Appendix A) as the most important. However, the results reported here were not improved by meta-feature selection; in fact, it decreased the performance of most meta-models, as shown in Figure 3(a). Meta-feature selection was also tried with filter methods, but the results were even worse than using sfs, and for this reason they are not reported in this study. In addition, the “nn” meta-feature did not appear among the top-20 most important meta-features computed by rf.

Our experimental results show that using meta-features from different categories improved the predictive performance of the meta-learners under different setups. The most important meta-features were “SM.classes_min” and “LM.stump_sd” (as shown in Figure 5 in Subsection 5.3). In Mantovani et al. (2015b), which only used meta-features from the simple and data complexity sets, “SM.classes_max” and “SM.attributes” were reported as the most important meta-features. The first describes the percentage of examples in the majority class; the second is the number of predictive attributes in a dataset. Although these studies disagree about the relative order of importance of these meta-features, both extract information related to the same characteristics: data complexity and dimensionality.

The hp tuning investigation in Sanders and Giraud-Carrier (2017) assumed that “tuning is always necessary”, and therefore focused on predicting the improvement as a regression task. The authors obtained hp tuning results for less than half of the datasets (111/229) at the base-level when dealing with svm. In addition, their meta-models were not able to predict the hp improvement for svm, not providing any valid conclusion about the problem.

6.3 Main difficulties

During the experiments, there were several difficulties in generating the meta-knowledge. The process itself is computationally expensive, since many tuning tasks must be run and evaluated over a wide range of classification tasks. Initially, a larger number of datasets was selected, but some of them were extremely expensive, computationally speaking, for either hp tuning or the extraction of meta-features. A walltime limit (in hours) was imposed to remove these high-cost datasets.

The class imbalance at the meta-level was another problem faced in the experiments. To deal with it, the optimized default hp settings Mantovani et al. (2015c) were added to the meta-dataset, increasing the number of meta-examples with the “Defaults” meta-target. Even with the addition of the relative landmarking set, some meta-examples were never correctly classified, which points out the need to define specific meta-features for some mtl problems.

6.4 Future work

Some findings from this study also open up future research directions. The proposed mtl recommender system could be extended to other ml algorithms, such as neural networks, other decision tree induction algorithms and ensemble-based techniques. It could also be used to support hp tuning decisions in different tasks, such as pre-processing, regression and clustering. It could even be used to decide whether to tune hp for more than one task, in a pipeline or simultaneously.

It would also be promising to investigate the need for new meta-features to characterize data with quality problems, for instance imbalance measures, given their influence on the quality of the induced meta-models. Besides, a multicriteria objective function could replace the current meta-label rule, weighing predictive performance, memory and runtime. Another possibility would be to explore the use of ensembles as meta-models, given the complementary behavior of some of the algorithms studied here as meta-learners.

The code used in this study is publicly available, easily extendable and may be adapted to cover several other ml algorithms. The same can be said of the analysis scripts, also available for reproducibility. All the experimental results generated are available at the corresponding openml study web pages, from which they can be integrated and reused in different mtl systems. This framework is expected to be integrated into openml, so that the scientific community can use it.

Acknowledgments

The authors would like to thank CAPES and CNPq (Brazilian agencies) for their financial support, and especially the São Paulo Research Foundation (FAPESP) for grants #2012/23114-9, #2015/03986-0 and #2018/14819-5.

References


  • Ali and Smith-Miles (2006) Ali, S., Smith-Miles, K.A.. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing 2006;70(13):173–186.
  • Bensusan et al. (2000) Bensusan, H., Giraud-Carrier, C., Kennedy, C.. A higher-order approach to meta-learning. In: Proceedings of the ECML - Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination. 2000. p. 109–118.
  • Bergstra and Bengio (2012) Bergstra, J., Bengio, Y.. Random search for hyper-parameter optimization. J Mach Learn Res 2012;13:281–305.
  • Bischl et al. (2016) Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M.. mlr: Machine learning in r. Journal of Machine Learning Research 2016;17(170):1–5.
  • Brazdil et al. (2009) Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.. Metalearning: Applications to Data Mining. 2nd ed. Springer Verlag, 2009.
  • Brazdil and Henery (1994) Brazdil, P.B., Henery, R.J.. Analysis of results. In: Michie, D., Spiegelhalter, D.J., Taylor, C.C., Campbell, J., editors. Machine learning, neural and statistical classification. Ellis Horwood; 1994. p. 175–212.
  • Breiman (2001) Breiman, L.. Random forests. Machine Learning 2001;45(1):5–32.
  • Brodersen et al. (2010) Brodersen, K.H., Ong, C.S., Stephan, K.E., Buhmann, J.M.. The balanced accuracy and its posterior distribution. In: Proceedings of the 2010 20th International Conference on Pattern Recognition. IEEE Computer Society; 2010. p. 3121–3124.

  • Casalicchio et al. (2017) Casalicchio, G., Bossek, J., Lang, M., Kirchhoff, D., Kerschke, P., Hofner, B., Seibold, H., Vanschoren, J., Bischl, B.. OpenML: An R package to connect to the machine learning platform OpenML. Computational Statistics 2017;:1–15.
  • Chang and Lin (2011) Chang, C.C., Lin, C.J.. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2011;2:27:1–27:27.
  • Chawla et al. (2002) Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.. SMOTE: synthetic minority over-sampling technique. J Artif Int Res 2002;16(1):321–357.
  • Demšar (2006) Demšar, J.. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 2006;7:1–30.
  • Eggensperger et al. (2018) Eggensperger, K., Lindauer, M., Hoos, H.H., Hutter, F., Leyton-Brown, K.. Efficient benchmarking of algorithm configurators via model-based surrogates. Machine Learning 2018;107(1):15–41.
  • Feurer et al. (2015) Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.. Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., editors. Advances in Neural Information Processing Systems 28. Curran Associates, Inc.; 2015. p. 2944–2952.
  • Garcia et al. (2016) Garcia, L.P.F., de Carvalho, A.C.P.L.F., Lorena, A.C.. Noise detection in the meta-learning level. Neurocomputing 2016;176:14–25.
  • Gomes et al. (2012) Gomes, T.A.F., Prudêncio, R.B.C., Soares, C., Rossi, A.L.D., de Carvalho, A.C.P.L.F.. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing 2012;75(1):3–13.
  • Ho and Basu (2002) Ho, T.K., Basu, M.. Complexity measures of supervised classification problems. Pattern Analysis and Machine Intelligence, IEEE Transactions on 2002;24(3):289–300.
  • Horn et al. (2016) Horn, D., Demircioğlu, A., Bischl, B., Glasmachers, T., Weihs, C.. A comparative study on large scale kernelized support vector machines. Advances in Data Analysis and Classification 2016;:1–17.
  • Hsu et al. (2007) Hsu, C.W., Chang, C.C., Lin, C.J.. A Practical Guide to Support Vector Classification. Department of Computer Science - National Taiwan University; Taipei, Taiwan; 2007. .
  • Hutter et al. (2009) Hutter, F., Hoos, H., Leyton-Brown, K., Stützle, T.. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research 2009;(36):267–306.
  • Kotthoff et al. (2016) Kotthoff, L., Thornton, C., Hoos, H.H., Hutter, F., Leyton-Brown, K.. Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka. Journal of Machine Learning Research 2016;17:1–5.
  • Krstajic et al. (2014) Krstajic, D., Buturovic, L.J., Leahy, D.E., Thomas, S.. Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of cheminformatics 2014;6(1):10+.
  • Leite et al. (2012) Leite, R., Brazdil, P., Vanschoren, J.. Selecting classification algorithms with active testing. In: Proceedings of the 2012 Conference on Machine Learning and Data Mining (MLDM 2012). 2012. p. 117–131.
  • López-Ibáñez et al. (2016) López-Ibáñez, M., Dubois-Lacoste, J., Cáceres, L.P., Birattari, M., Stützle, T.. The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives 2016;3:43 – 58.
  • Lorena et al. (2018) Lorena, A.C., Maciel, A.I., de Miranda, P.B.C., Costa, I.G., Prudêncio, R.B.C.. Data complexity meta-features for regression problems. Machine Learning 2018;107(1):209–246.
  • Mantovani et al. (2016) Mantovani, R.G., Horváth, T., Cerri, R., Vanschoren, J., de Carvalho, A.C.P.L.F.. Hyper-parameter tuning of a decision tree induction algorithm. In: 5th Brazilian Conference on Intelligent Systems, BRACIS 2016, Recife, Brazil, October 9-12, 2016. IEEE Computer Society; 2016. p. 37–42. URL: http://dx.doi.org/10.1109/BRACIS.2016.018. doi:10.1109/BRACIS.2016.018.
  • Mantovani et al. (2015a) Mantovani, R.G., Rossi, A.L.D., Vanschoren, J., Bischl, B., de Carvalho, A.C.P.L.F.. Effectiveness of random search in SVM hyper-parameter tuning. In: 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 12-17, 2015. IEEE; 2015a. p. 1–8.
  • Mantovani et al. (2015b) Mantovani, R.G., Rossi, A.L.D., Vanschoren, J., Bischl, B., Carvalho, A.C.P.L.F.. To tune or not to tune: Recommending when to adjust SVM hyper-parameters via meta-learning. In: 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 12-17, 2015. IEEE; 2015b. p. 1–8.
  • Mantovani et al. (2015c) Mantovani, R.G., Rossi, A.L.D., Vanschoren, J., Carvalho, A.C.P.L.F.. Meta-learning recommendation of default hyper-parameter values for svms in classification tasks. In: Vanschoren, J., Brazdil, P., Giraud-Carrier, C.G., Kotthoff, L., editors. Proceedings of the 2015 International Workshop on Meta-Learning and Algorithm Selection co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2015 (ECMLPKDD 2015), Porto, Portugal, September 7th, 2015. CEUR-WS.org; volume 1455 of CEUR Workshop Proceedings; 2015c. p. 80–92.
  • Miranda and Prudêncio (2013) Miranda, P., Prudêncio, R.. Active testing for SVM parameter selection. In: Neural Networks (IJCNN), The 2013 International Joint Conference on. 2013. p. 1–8.
  • Miranda et al. (2014) Miranda, P., Silva, R., Prudêncio, R.. Fine-tuning of support vector machine parameters using racing algorithms. In: Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2014. 2014. p. 325–330.
  • Miranda et al. (2012) Miranda, P.B.C., Prudêncio, R.B.C., Carvalho, A.C.P.L.F., Soares, C.. An experimental study of the combination of meta-learning with particle swarm algorithms for svm parameter selection. Lecture Notes in Computer Science 2012;7335 LNCS(PART 3):562–575.
  • Morais and Prati (2013) Morais, G., Prati, R.C.. Complex network measures for data set characterization. In: Brazilian Conference on Intelligent Systems, BRACIS 2013, Fortaleza, CE, Brazil, 19-24 October, 2013. IEEE Computer Society; 2013. p. 12–18.
  • Padierna et al. (2017) Padierna, L.C., Carpio, M., Rojas, A., Puga, H., Baltazar, R., Fraire, H.. Hyper-Parameter Tuning for Support Vector Machines by Estimation of Distribution Algorithms; Cham: Springer International Publishing. p. 787–800.
  • Pfahringer et al. (2000) Pfahringer, B., Bensusan, H., Giraud-Carrier, C.G.. Meta-learning by landmarking various learning algorithms. In: Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2000. p. 743–750.
  • Priya et al. (2012) Priya, R., De Souza, B.F., Rossi, A.L.D., Carvalho, A.C.P.L.F.. Using genetic algorithms to improve prediction of execution times of ML tasks. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). volume 7208 LNAI; 2012. p. 196–207.
  • Quinlan (1993) Quinlan, J.R.. C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
  • Reif et al. (2011) Reif, M., Shafait, F., Dengel, A.. Prediction of classifier training time including parameter optimization. In: Proceedings of the 34th Annual German conference on Advances in artificial intelligence. Springer-Verlag; KI’11; 2011. p. 260–271.
  • Reif et al. (2012) Reif, M., Shafait, F., Dengel, A.. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning 2012;87:357–380.
  • Reif et al. (2014) Reif, M., Shafait, F., Goldstein, M., Breuel, T., Dengel, A.. Automatic classifier selection for non-experts. Pattern Analysis and Applications 2014;17(1):83–96.
  • Ridd and Giraud-Carrier (2014) Ridd, P., Giraud-Carrier, C.. Using metalearning to predict when parameter optimization is likely to improve classification accuracy. In: Vanschoren, J., Brazdil, P., Soares, C., Kotthoff, L., editors. Meta-learning and Algorithm Selection Workshop at ECAI 2014. 2014. p. 18–23.
  • Sanders and Giraud-Carrier (2017) Sanders, S., Giraud-Carrier, C.G.. Informing the use of hyperparameter optimization through metalearning. In: Raghavan, V., Aluru, S., Karypis, G., Miele, L., Wu, X., editors. 2017 IEEE International Conference on Data Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017. IEEE Computer Society; 2017. p. 1051–1056.
  • Snoek et al. (2012) Snoek, J., Larochelle, H., Adams, R.P.. Practical bayesian optimization of machine learning algorithms. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K., editors. Advances in Neural Information Processing Systems 25. Curran Associates, Inc.; 2012. p. 2951–2959.
  • Soares and Brazdil (2006) Soares, C., Brazdil, P.B.. Selecting parameters of svm using meta-learning and kernel matrix-based meta-features. In: Proceedings of the 2006 ACM symposium on Applied computing. ACM Press; SAC’06; 2006. p. 564–568.
  • Soares et al. (2004) Soares, C., Brazdil, P.B., Kuba, P.. A meta-learning method to select the kernel width in support vector regression. Machine Learning 2004;54(3):195–209.
  • Thornton et al. (2013) Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Proc. of KDD-2013. 2013. p. 847–855.
  • Vanschoren et al. (2014) Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.. Openml: Networked science in machine learning. SIGKDD Explor Newsl 2014;15(2):49–60.
  • Vapnik (1995) Vapnik, V.. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
  • Vilalta et al. (2004) Vilalta, R., Giraud-Carrier, C.G., Brazdil, P., Soares, C.. Using meta-learning to support data mining. International Journal of Computer Science & Applications 2004;1(1):31–45.
  • Wistuba et al. (2018) Wistuba, M., Schilling, N., Schmidt-Thieme, L.. Scalable gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning 2018;107(1):43–78.

Appendix A List of Meta-features used in experiments

Type Acronym Description
sm classes Number of classes
attributes Number of attributes
numeric Number of numerical attributes
nominal Number of nominal attributes
samples Number of examples
dimension samples/attributes
numRate numeric/attributes
nomRate nominal/attributes
symbols (min, max, mean, sd, sum) Distributions of categories in attributes
classes (min, max, mean, sd) Classes distributions
st sks Skewness
sksP Skewness for normalized dataset
kts Kurtosis
ktsP Kurtosis for normalized datasets
absC Correlation between attributes
canC Canonical correlation between matrices
frac Fraction of canonical correlation
in clEnt Class entropy
nClEnt Class entropy for normalized dataset
atrEnt Mean entropy of attributes
nAtrEnt Mean entropy of attributes for normalized dataset
jEnt Joint entropy
mutInf Mutual information
eqAtr clEnt/mutInf
noiSig (atrEnt - mutInf)/MutInf
mb (Trees) nodes Number of nodes
leaves Number of leaves
nodeAtr Number of nodes per attribute
nodeIns Number of nodes per instance
leafCor leaves/samples
lev (min, max, mean, sd) Distributions of levels of depth
bran (min, max, mean, sd) Distributions of levels of branches
att (min, max, mean, sd) Distributions of attributes used
lm nb Naive Bayes accuracy
stump (min, max, mean, sd) Distribution of decision stumps
stMinGain Minimum gain ratio of decision stumps
stRand Random gain ratio of decision stumps
nn 1-Nearest Neighbor accuracy
Table 10: Meta-features used in experiments - part 1. For each meta-features it is shown: its type, acronym and description. Extended from Garcia et al. (2016).
Type Acronym Description
dc f1 Maximum Fisher’s discriminant ratio
f1v Directional-vector maximum Fisher’s discriminant ratio
f2 Overlap of the per-class bounding boxes
f3 Maximum feature efficiency
f4 Collective feature efficiency
l1 Minimized sum of the error distance of a linear classifier
l2 Training error of a linear classifier
l3 Nonlinearity of a linear classifier
n1 Fraction of points on the class boundary
n2 Ratio of average intra/inter-class NN distance
n3 leave-one-out error rate of the 1-NN classifier
n4 Nonlinearity of the 1-NN classifier
t1 Fraction of maximum covering spheres
t2 Average number of points per dimension
cn edges Number of edges
degree Average degree of the network
density Average density of the network
maxComp Maximum number of components
closeness Closeness centrality
betweenness Betweenness centrality
clsCoef Clustering Coefficient
hubs Hub score
avgPath Average path length
rl diff.svm.lm performance(SVM) - performance(Linear)
diff.svm.nb performance(SVM) - performance(NB)
diff.svm.stump performance(SVM) - performance(Decision Stump)
diff.svm.nn performance(SVM) - performance(1-NN)
diff.nn.lm performance(1-NN) - performance(Linear)
diff.nn.stump performance(1-NN) - performance(Decision Stump)
diff.nn.nb performance(1-NN) - performance(NB)
diff.nb.stump performance(NB) - performance(Decision Stump)
diff.nb.lm performance(NB) - performance(Linear)
diff.stump.lm performance(Decision Stump) - performance(Linear)
Table 11: Meta-features used in experiments - part 2. For each meta-features it is shown: its type, acronym and description. Extended from Garcia et al. (2016).

Appendix B Hyperparameter space of the meta-learners used in experiments

Algo Symbol hyperparameter Range Type Default Package
CART cp complexity parameter real rpart
minsplit minimum number of instances in a node for a split to be attempted integer
minbucket minimum number of instances in a leaf integer
maxdepth maximum depth of any node of the final tree integer
GP sigma width of the Gaussian kernel real - kernlab
SVM k kernel Gaussian - - e1071
C regularized constant real 1
width of the Gaussian kernel real
RF ntree number of trees integer 500 randomForest
nodesize minimum node size of the decision trees integer 1
KNN k number of nearest neighbors integer 7 kknn
NB - - - - - e1071
LR - - - - - gbm
Table 12: Meta-learner’s hyperparameter spaces explored in the experiments. The nomenclature follows their respective R packages. The nb and lr classifiers do not have any hyperparameter for tuning.

Appendix C J48 hyperparameter space and meta-datasets used in experiments from Section 5.8

Symbol Hyperparameter Range Type Default Conditions
C pruning confidence real 0.25 R = False
M minimum number of instances in a leaf integer 2 -
N number of folds for reduced error pruning integer 3 R = True
O do not collapse the tree {False,True} logical False -
R use reduced error pruning {False,True} logical False -
B use binary splits only {False,True} logical False -
S do not perform subtree raising {False,True} logical False -
A Laplace smoothing for predicted probabilities {False,True} logical False -
J do not use MDL correction for info gain on numeric attributes {False,True} logical False -
Table 13: J48 hyperparameter space explored in experiments. The nomenclature is based on the RWeka package. Table adapted from Mantovani et al. (2016).
Meta-dataset Meta Meta Class Distribution
examples features Tuning Default
J48_90 165 80 63 102
J48_95 165 80 57 108
J48_99 165 80 52 113
Table 14: Meta-datasets generated for J48 experiments.

Appendix D List of abbreviations used in the paper
