1 Introduction
The recent substantial progress in machine learning (ML) has led to a growing demand for hands-free ML systems that can support developers and ML novices in efficiently creating new ML applications. Since different datasets require different ML pipelines, this demand has given rise to the young area of automated machine learning (AutoML [1]). Popular AutoML systems, such as AutoWEKA [2], hyperoptsklearn [3], Autosklearn [4], TPOT [5] and AutoKeras [6], perform a combined optimization across different preprocessors, classifiers, hyperparameter settings, etc., thereby substantially reducing the effort for users.
Our work is motivated by the first and second ChaLearn AutoML challenges [7], which evaluated AutoML tools in a systematic way under rigid time and memory constraints. Concretely, the AutoML tools were required to deliver predictions in up to minutes, which demands an efficient AutoML system that can quickly adapt to a dataset at hand. Performing well in such a setting would allow for an efficient development of new ML applications on-the-fly in daily business. We managed to win both challenges with Autosklearn [4] and PoSH Autosklearn [8], relying mostly on metalearning and robust resource management.
While AutoML relieves the user from making low-level design decisions (e.g. which model to use), AutoML itself opens a myriad of high-level design decisions (e.g. which model selection strategy to use [9]). Whereas our submissions to the AutoML challenges were mostly hand-designed, in this work we go one step further by automating AutoML itself. Specifically, our contributions are:

We explain practical details which allow Autosklearn to handle iterative algorithms more efficiently.

We extend Autosklearn’s choice of model selection strategies to include ones optimized for high-throughput evaluation of ML pipelines.

We introduce both the practical approach and the theory behind building better portfolios for the metalearning component of Autosklearn.

We propose a metalearning technique based on algorithm selection to automatically choose the best AutoML pipeline for a given dataset, further robustifying AutoML itself.

We conduct a large experimental evaluation, comparing against Autosklearn (1.0) and benchmarking each contribution separately.
The paper is structured as follows: First, we formally describe the AutoML problem in Section 2 and then discuss the background and basis of this work, Autosklearn (1.0) and PoSH Autosklearn in Section 3. In the following four sections, we describe in detail the improvements we propose for the next generation, Autosklearn (2.0). For each improvement, we state the practical and theoretical motivation, mention the most closely related work (with further related work deferred to Appendix A), and discuss the methodological changes we make over Autosklearn (1.0). We conclude each improvement section with a preview of empirical results highlighting the benefit of each change; we defer the detailed explanation of the experimental setup to Section 8.1.
2 Problem Statement: Automated Machine Learning
Let be a distribution of datasets from which we can sample an individual dataset’s distribution . The AutoML problem is to generate a trained pipeline , hyperparameterized by that automatically produces predictions for samples from the distribution minimizing the generalization error:
(1) 
Since a dataset can only be observed through a set of independent observations , we can only empirically approximate the generalization error on sample data:
(2) 
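As an illustration of the empirical approximation above, the following minimal sketch averages a loss over a finite set of observations. The names `empirical_error` and `zero_one_loss`, and the choice of the zero-one loss, are assumptions made for this example and are not taken from any particular AutoML system.

```python
from typing import Any, Callable, Sequence, Tuple


def zero_one_loss(prediction: Any, target: Any) -> float:
    """Return 0 for a correct prediction and 1 otherwise (illustrative loss)."""
    return 0.0 if prediction == target else 1.0


def empirical_error(
    model: Callable[[Any], Any],
    observations: Sequence[Tuple[Any, Any]],
    loss: Callable[[Any, Any], float] = zero_one_loss,
) -> float:
    """Average the loss of `model` over a finite sample of (x, y) pairs."""
    if not observations:
        raise ValueError("at least one observation is required")
    return sum(loss(model(x), y) for x, y in observations) / len(observations)
```

Any other loss with the same two-argument signature could be passed in place of the zero-one loss.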
AutoML systems automatically search for the best :
(3) 
where denotes that was trained on the th training fold . Assuming that an AutoML system can select via both, the algorithm and its hyperparameter settings, this definition using is equivalent to the definition of the CASH problem [2, 4].
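A fold-based estimate of this kind can be sketched as follows. This is a simplified illustration under stated assumptions: folds are contiguous and unshuffled, the zero-one loss is used, and `fit` is a hypothetical callable that maps a list of training observations to a prediction function.

```python
from typing import Any, Callable, List, Sequence, Tuple

Observation = Tuple[Any, Any]


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k contiguous, non-overlapping validation folds."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    return [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]


def cross_validated_error(
    fit: Callable[[List[Observation]], Callable[[Any], Any]],
    observations: Sequence[Observation],
    k: int,
) -> float:
    """Average the validation error over k folds (zero-one loss)."""
    n = len(observations)
    fold_errors = []
    for val_idx in k_fold_indices(n, k):
        held_out = set(val_idx)
        train = [observations[i] for i in range(n) if i not in held_out]
        model = fit(train)  # the model is trained on the k-th training fold only
        mistakes = sum(model(observations[i][0]) != observations[i][1] for i in val_idx)
        fold_errors.append(mistakes / len(val_idx))
    return sum(fold_errors) / k
```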
2.1 Time-bounded AutoML
In practice, users are not only interested in eventually obtaining an optimal pipeline, but also have constraints on how much time and compute resources they are willing to invest. We denote the time it takes to evaluate as and the overall optimization budget by . Our goal is to find
(5) 
where the sum is over all pipelines evaluated, explicitly honouring the optimization budget .
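The budget constraint can be illustrated with a simple search loop that stops proposing new pipelines once the accumulated evaluation time reaches the budget. This is a sketch only: `time_bounded_search`, `evaluate` and the injectable `clock` are hypothetical names, and the loop only checks the budget between evaluations, whereas a practical system would also have to interrupt an evaluation that is still running.

```python
import time
from typing import Any, Callable, Iterable, Optional, Tuple


def time_bounded_search(
    candidates: Iterable[Any],
    evaluate: Callable[[Any], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[Any], float]:
    """Evaluate candidates in order until the budget is used up.

    Returns the candidate with the lowest estimated error seen so far,
    together with that error.
    """
    start = clock()
    best, best_error = None, float("inf")
    for candidate in candidates:
        if clock() - start >= budget_seconds:
            break  # budget exhausted: do not start another evaluation
        error = evaluate(candidate)
        if error < best_error:
            best, best_error = candidate, error
    return best, best_error
```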
2.2 Generalization of AutoML
Ultimately, a well performing and robust optimization policy of an AutoML system should not only perform well on a single dataset but on the entire distribution over datasets . Therefore, the metaproblem of AutoML can be formalized as minimizing the generalization error over this distribution of datasets:
(6) 
which in turn can again only be approximated by a finite set of metatrain datasets (each with a finite set of observations):
(7) 
Having set up the problem statement, we can use this to further formalize our goals. We will introduce a novel system for designing from data in Section 7 and extend this to a function which automatically suggests an AutoML optimization policy for a new dataset.
3 Background on Autosklearn
AutoML systems following the CASH formalism are driven by a sufficiently large and flexible pipeline configuration space and an efficient method to search this space. Additionally, to speed up this procedure, information gained on other datasets can be used to kick-start or guide the search procedure (i.e. metalearning). Finally, one can also combine the models trained during the search phase in a post hoc ensembling step. In the following, we give details on the background of these four components, describe how we implemented them in Autosklearn (1.0) and how we extended them for the second ChaLearn AutoML challenge.
3.1 Configuration Space
Analogously to AutoWEKA [2], AutoKeras [6] and AutoPyTorch [10], Autosklearn is also built around an existing ML library, namely scikitlearn [11], forming the backbone of our system. The configuration space allows ML pipelines consisting of three steps: data preprocessing, feature preprocessing and an estimator. A pipeline can consist of multiple data preprocessing steps (e.g., imputing missing data and normalizing data), one feature preprocessing step (e.g., a principal component analysis) and one estimator (e.g., gradient boosting). Our configuration space is hierarchically organized in a tree structure and contains continuous (e.g., the learning rate), categorical (e.g., type of estimator) and conditional hyperparameters (e.g., the learning rate of an estimator). For classification, the space of possible ML pipelines currently spans across classifiers, feature preprocessing methods and numerous data preprocessing methods, adding up to hyperparameters for the latest release.^{1}
^{1} We used software version , however, we note that the software version does not align with the method version.
3.2 Bayesian Optimization (BO)
BO [12] is the driving force of Autosklearn. It is an optimization procedure designed for sample efficiency in settings where the evaluation time of configurations dominates the search procedure. BO is based on two mechanisms: 1) fitting a probabilistic model mapping hyperparameters to their loss value and 2) optimizing an acquisition function to choose the next hyperparameter setting to evaluate. BO iterates these two steps and evaluates the selected setting. A common choice for the internal model is a Gaussian process [13], which performs best in low-dimensional problems with continuous hyperparameters. For Autosklearn we use random forests [14], since they have been shown to perform well for high-dimensional and structured optimization problems, like the CASH problem [15, 2].
3.3 Meta-Learning
AutoML systems often solve similar problems over and over again while starting from scratch for every task. Metalearning techniques can exploit experience gained on previous optimization tasks and equip the optimization process with this knowledge. For Autosklearn (1.0), we used a case-based reasoning system [16, 17], dubbed k-nearest datasets (KND). For a new dataset, this procedure warmstarts BO with the best known ML pipelines found on the k nearest datasets, where the distance between datasets is defined as the distance on metafeatures describing these datasets. Metalearning for efficient AutoML is an active area of research and we refer to a recent literature review [18].
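The warmstarting idea can be sketched as a nearest-neighbour lookup in metafeature space. The sketch below rests on several assumptions: the L1 distance is used, metafeatures are taken to be already scaled to comparable ranges, and the names `k_nearest_datasets` and `warmstart_pipelines` are hypothetical.

```python
from typing import Dict, List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two metafeature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_nearest_datasets(
    new_meta: Sequence[float],
    known_meta: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Names of the k known datasets closest to the new dataset."""
    ranked = sorted(known_meta, key=lambda name: l1_distance(new_meta, known_meta[name]))
    return ranked[:k]


def warmstart_pipelines(
    new_meta: Sequence[float],
    known_meta: Dict[str, Sequence[float]],
    best_pipeline: Dict[str, str],
    k: int,
) -> List[str]:
    """Best known pipeline of each of the k nearest datasets, nearest first."""
    return [best_pipeline[name] for name in k_nearest_datasets(new_meta, known_meta, k)]
```

The returned list would serve as the initial configurations handed to the optimizer before it starts proposing its own.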
3.4 Ensembles
While searching for the best ML pipeline, AutoML systems train numerous such ML pipelines, but would traditionally only use the single best one. An easy remedy is to combine these post hoc into an ensemble to further improve performance and reduce overfitting [19, 4]. Autosklearn uses ensemble selection [20] to continuously output ensembles during the training process. Ensemble selection greedily adds ML pipelines to an ensemble to minimize the error and can therefore be used with any loss function without adapting the training procedure of the base models or the hyperparameter optimization.
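The greedy procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in any particular system: it assumes selection with replacement, a uniform average of the members' validation predictions, and the mean squared error as default loss; `ensemble_selection` is a hypothetical name.

```python
from typing import Callable, Dict, List, Sequence


def mean_squared_error(predictions: Sequence[float], targets: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def ensemble_selection(
    predictions: Dict[str, Sequence[float]],
    targets: Sequence[float],
    ensemble_size: int,
    loss: Callable[[Sequence[float], Sequence[float]], float] = mean_squared_error,
) -> List[str]:
    """Greedily add (with replacement) the model that most reduces the ensemble loss.

    `predictions` maps a model name to its predictions on a validation set.
    The ensemble prediction is the plain average of its members' predictions.
    """
    chosen: List[str] = []
    running_sum = [0.0] * len(targets)
    for _ in range(ensemble_size):
        best_name, best_loss = None, float("inf")
        for name, preds in predictions.items():
            size = len(chosen) + 1
            averaged = [(s + p) / size for s, p in zip(running_sum, preds)]
            current = loss(averaged, targets)
            if current < best_loss:
                best_name, best_loss = name, current
        chosen.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, predictions[best_name])]
    return chosen
```

Because only stored predictions are needed, the base models do not have to be retrained, and the number of times a model appears in the returned list acts as its ensemble weight.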
3.5 Second AutoML Challenge
The goal of the challenge was to design an AutoML system that, without any human intervention, operates under strict time and memory limits on unknown binary classification datasets on the Codalab platform [7]. A submission had between seconds, 2 CPUs and 16GB memory to produce predictions, which were evaluated and ranked according to area under the curve; the submission with the best average rank won the competition. The winning submission was a predecessor of Autosklearn (2.0), which we dubbed PoSH Autosklearn, short for Portfolio and Successive Halving [8]. Our main changes with respect to Autosklearn (1.0) were the usage of successive halving instead of regular holdout to evaluate more machine learning pipelines, a reduced search space to complement successive halving by mostly containing iterative models, and a static portfolio instead of the KND technique to avoid the computation of metafeatures. Our work picks up on the ideas of PoSH Autosklearn, describes them in more detail and provides a thorough evaluation, while also presenting a novel approach to AutoML that was motivated by the manual design decisions we had to make for the competition.
4 Improvement 0: Practical Considerations
The performance of an AutoML system not only relies on efficient hyperparameter optimization and model selection strategies, but also on practical considerations. Here, we will describe design decisions we applied for all further experiments since they in general improve performance.
4.1 Early Stopping and Retrieving Intermittent Results
Estimating the generalization error of a pipeline practically requires restricting the CPU time per evaluation to prevent a single, very long algorithm run from stalling the optimization procedure [2, 4]. If an algorithm run exceeds the assigned time limit, it is terminated and the worst possible generalization error is assigned. If the time limit is set too low, a majority of the algorithms do not return a result and thus provide very scarce information for the optimization procedure. A time limit that is too high, however, may also fail to yield meaningful results, since all time may be spent on long-running, underperforming pipelines. To mitigate this risk, for algorithms that can be trained iteratively (e.g., gradient boosting and linear models trained with stochastic gradient descent) we implemented two measures. Firstly, we allow a pipeline to stop training based on a heuristic at any time, i.e. early stopping, which prevents overfitting. Secondly, we make use of intermittent results retrieval, e.g., saving the results at checkpoints spaced at geometrically increasing iteration numbers, thereby ensuring that every evaluation returns a performance and thus yields information for the optimizer. With this, our AutoML tool can robustly tackle large datasets without the need to fine-tune the number of iterations depending on the time limit.
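Intermittent results retrieval with geometrically spaced checkpoints can be sketched as below. The starting point of one iteration, the growth factor of two, and the callables `step`, `score` and `time_left` are assumptions chosen for illustration.

```python
from typing import Callable, List, Optional


def geometric_checkpoints(max_iterations: int, first: int = 1, factor: int = 2) -> List[int]:
    """Iteration counts first, first*factor, ... capped by max_iterations."""
    if first < 1 or factor <= 1:
        raise ValueError("need first >= 1 and factor > 1")
    checkpoints = []
    current = first
    while current < max_iterations:
        checkpoints.append(current)
        current *= factor
    checkpoints.append(max_iterations)
    return checkpoints


def train_with_checkpoints(
    step: Callable[[], None],
    score: Callable[[], float],
    max_iterations: int,
    time_left: Callable[[], float],
) -> Optional[float]:
    """Train iteratively and return the score stored at the last finished checkpoint.

    If the time limit is hit between two checkpoints, the most recent stored
    score is returned instead of discarding the whole run.
    """
    last_score = None
    done = 0
    for checkpoint in geometric_checkpoints(max_iterations):
        while done < checkpoint:
            if time_left() <= 0:
                return last_score
            step()
            done += 1
        last_score = score()
    return last_score
```

With this schedule the number of stored results grows only logarithmically in the number of iterations, so the overhead of scoring stays small.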
4.2 Search Space
The first search space of Autosklearn consisted of hyperparameters, whereas the latest release has grown to even hyperparameters. Current hyperparameter optimization algorithms can cope with such spaces, given enough time, but in this work we consider a heavily time-bounded setting. Therefore, we reduced our space to hyperparameters, only including iterative models to benefit from early stopping and intermittent results retrieval.
4.3 Preview of Experimental Results
4.3.1 Do we need intermittent results retrieval?
Ideally, we want all iterative algorithms to converge on a dataset, i.e. allow them to run for as many iterations as required until the early-stopping mechanism terminates training. However, on large datasets this might be infeasible, so one would need to carefully set the number of iterations such that the algorithm terminates within the given time limit, or tune the number of iterations as part of the configuration space. The left plot in Figure 1 shows the substantial improvement gained from intermittent results retrieval. While the impact on small datasets is negligible (dotted line), on large datasets this is crucial (dashed line).
4.3.2 Do we need the full search space?
The larger the search space, the greater the flexibility to adapt a method to a new dataset. However, a strict time limit might prohibit a thorough search in a large search space. Therefore, we studied two pruned versions of our search space: 1) reducing the classifier space to only contain models that can be fitted iteratively and 2) further reducing the preprocessing space to only contain necessary preprocessing steps, such as imputation of missing values and one-hot encoding of categorical values.
Focusing on a subset of classifiers that always return a result reduces the chance of wasteful timeouts, which motivates the first point. This subset mostly contains tree-based methods, which often inherently contain forms of feature selection, which leads us to the second point above. In Figure 1, we compare our AutoML system using different configuration spaces. Again, we see that the impact of this decision is most evident on large datasets. We provide the reduced search space in Appendix C.
5 Improvement 1: Model Selection Strategy
A key component of any efficient AutoML system is its model selection strategy, which addresses the two following problems: 1) how to approximate the generalization error of a single ML pipeline and 2) how many resources to allocate for each pipeline evaluation. In this section, we discuss different combinations of these to increase the flexibility of Autosklearn (2.0) for different use cases.
5.1 Assessing the Performance of a Model
Given a training set , the goal is to best approximate the generalization error to 1) provide a precise signal for the optimization procedure and 2) based on this to select in the end. We usually compute the validation loss, which is obtained by splitting the training data into two smaller, disjoint sets and , following the common train-validation-test protocol [21, 22]. The two most common ways to assess the performance of a model are holdout and K-fold cross-validation [23, 24, 25, 9, 26, 22]. We expect the holdout strategy to be a better choice for large datasets where the holdout set is representative of the test set, and where it is computationally wasteful to apply cross-validation. Consequently, we expect cross-validation to yield the best results for small datasets, where its computational overhead does not play a role, and where only the use of all available samples can result in a reliable estimate of the generalization error.^{2}
^{2} Different model selection strategies could be ignored from an optimization point of view, where the goal is to optimize performance given a loss function, as is often done in the research fields of metalearning and hyperparameter optimization. However, for AutoML systems this is highly relevant as we are not interested in the optimization performance (of some subpart) of these systems, but the final generalization performance when applied to new data.
5.2 Allocating Resources to Choose the Best Model
Considering that the available resources are limited, it is important to trade off the time spent assessing the performance of each ML pipeline versus the number of pipelines to evaluate. Currently, Autosklearn (1.0) implements a conceptually simple approach and evaluates each pipeline under the same resource limitations and on the same budget (e.g., number of iterations using iterative algorithms). The recent bandit strategy successive halving (SH) [27, 28] employs the concept of assigning higher budgets to more promising pipelines when evaluating them; the budgets can, e.g., be the number of iterations in gradient boosting, the number of epochs in neural networks or the number of data points. Given a minimal and maximal budget per ML pipeline, SH starts by training a fixed number of ML pipelines for the smallest budget. Then, it iteratively selects of the pipelines with lowest generalization error, multiplies their budget by , and reevaluates. This process is continued until only a single ML pipeline is left or the maximal budget is spent.
While SH itself uses random search to propose new pipelines , we follow recent work combining SH with BO [29]. BO iteratively suggests new ML pipelines , which we evaluate on the lowest budget until a fixed number of pipelines has been evaluated. Then, we run SH as described above. We build the model for BO on the highest available budget where we have observed the performance of pipelines.
SH potentially provides large speedups, but it could also cut away good configurations that need a higher budget to perform best too aggressively. Thus, we expect SH to work best for large datasets, for which there is not enough time to train many ML pipelines for the full budget, but for which training an ML pipeline on a small budget already yields a good indication of the generalization error.
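The SH procedure described above can be sketched in a few lines. The reduction factor is called `eta` here, with a default of 2 chosen purely for illustration; `successive_halving` and `evaluate` are hypothetical names, and the sketch uses a fixed list of candidates instead of proposals from BO.

```python
from typing import Any, Callable, Sequence


def successive_halving(
    configs: Sequence[Any],
    evaluate: Callable[[Any, float], float],
    min_budget: float,
    max_budget: float,
    eta: int = 2,
) -> Any:
    """Return the configuration that survives successive halving.

    `evaluate(config, budget)` returns an estimated error (lower is better).
    In each round the best 1/eta of the configurations are kept and
    re-evaluated on an eta-times larger budget.
    """
    survivors = list(configs)
    budget = min_budget
    while True:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        if len(ranked) == 1 or budget >= max_budget:
            return ranked[0]
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget = min(budget * eta, max_budget)
```

In the toy usage below, a configuration that needs a large budget to perform well is discarded in the first round, which illustrates the risk of cutting too aggressively.

```python
# (name, error at full convergence, penalty that shrinks with the budget)
configs = [("A", 0.05, 0.4), ("B", 0.2, 0.0), ("C", 0.3, 0.0), ("D", 0.4, 0.0)]
error = lambda c, budget: c[1] + c[2] / budget
winner_sh = successive_halving(configs, error, min_budget=1, max_budget=4)
winner_full = successive_halving(configs, error, min_budget=4, max_budget=4)
```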
5.3 Preview of Experimental Results
Choosing the correct evaluation strategy not only depends on the characteristics of the dataset at hand, but also on the given time limit. While there exist general recommendations, we observed in practice that this is a crucial design decision that drastically impacts performance. To highlight this effect, in Figure 2 we show exemplary results comparing holdout, 3CV, 5CV, 10CV with and without SH on different optimization budgets and datasets. We give details on the SH hyperparameters in Appendix C.
The top row shows results obtained using the same optimization budget of 10 minutes on two different datasets. While holdout without SH is best on dataset robert (top left), the same strategy performs worst on dataset fabert (top right). Also, on robert, SH performs slightly worse, in contrast to fabert, where SH performs better on average. The bottom row shows how the given time limit impacts the performance. Using a quite restrictive optimization budget of 10 minutes (bottom left), SH with holdout, which aggressively cuts ML pipelines on lower budgets, performs best on average. With a higher optimization budget (bottom right), the overall results improve, but holdout is no longer the best option and 3CV performs best.
6 Improvement 2: Portfolio Building
Finding the optimal solution to the optimization problem from Eq. (5) requires searching a large space of possible solutions as efficiently as possible. BO is built to work under exactly these conditions; however, it starts from scratch for every new problem. A better solution would be to warmstart BO with ML pipelines that are expected to work well, such as KND described in Section 3.3. However, we found this solution to introduce new problems: First, it is time consuming since it requires computing metafeatures describing a new dataset, where good metafeatures are often quite expensive to compute. Second, it adds complexity to the system as the computation of the metafeatures must also be done with a time and memory limit. Third, a lot of metafeatures are not defined with respect to categorical data and missing values, making them hard to apply for most datasets. Fourth, it is not immediately clear which metafeatures work best for which problem. Fifth, in the KND approach mentioned in Section 3.3, there is no mechanism to guarantee that we do not execute redundant ML pipelines. Therefore, here we propose a meta-feature-free approach which does not warmstart with a set of configurations specific to a new dataset, but which uses a portfolio: a set of complementary configurations that covers as many diverse datasets as possible and minimizes the risk of failure when facing a new task.
Portfolios were introduced for hard combinatorial optimization problems, where the runtime between different algorithms varies drastically and allocating time shares to multiple algorithms instead of allocating all available time to a single one reduces the average cost for solving a problem [30, 31]. Algorithm portfolios were introduced to ML with the goal of reducing the required time to perform model selection compared to running all ML pipelines under consideration [32, 33, 34]. Portfolios of ML pipelines can be superior to BO for hyperparameter optimization [35] or BO with a model that takes past performance data into account [36]. They can also be applied when there is simply no time to perform full hyperparameter optimization [8], which is our main motivation.
6.1 Approach
To improve the efficiency of Autosklearn (2.0) in its early phase and to obtain results if there is no time for thorough hyperparameter optimization, we build a portfolio consisting of high-performing and complementary ML pipelines to perform well on as many datasets as possible. All pipelines in the portfolio are simply evaluated one after the other, replacing an initial design or pipelines proposed by a global optimization algorithm.
We outline the proposed process in Algorithm 1, which is motivated by the Hydra algorithm [37, 38]. First, we initialize our portfolio to the empty set (Line 2). Then, we repeat the following procedure until reaches a predefined limit: from a set of candidate ML pipelines , we greedily add a candidate to that reduces the estimated generalization error over all metatrain datasets most (Line 4), and then remove the from (Line 5).
We define the estimated generalization error of across all metatrain datasets as
(8) 
which is the estimated generalization error of selecting the ML pipeline according to the model selection strategy , where is a function which trains different , compares them with respect to their estimated generalization error and returns the best one as described in the previous section, see Appendix B for further details.
In contrast to Hydra, we first run BO on each metadataset and use the best found solution as a candidate. Then, we evaluate each of these candidates on each metatrain dataset in to obtain a performance matrix which we use as a lookup table to construct the portfolio. To build a portfolio across datasets, we need to take into account that the generalization errors for different datasets live on different scales [39]. Thus, before taking averages, we transform them to the simple regret scaled between zero and one for each dataset [36, 40]. We compute the statistics for zeroone scaling by taking the results of all model selection strategies into account (i.e., we use the lowest observed test loss and the largest observed test loss for each metatrain dataset).
For each metatrain dataset , as mentioned before, we split the training set into two smaller disjoint sets and . We usually train models using , use to choose a ML pipeline from the portfolio by means of the model selection strategy, and judge the portfolio quality by the generalization loss of on . However, if we instead select the ML pipeline on the test set , we obtain a submodular algorithm which we detail in Section 6.2. Therefore, we follow this approach in practice, but we emphasize that this only affects the offline phase; for a new dataset, our algorithm of course does not access the test set.
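The greedy construction on top of the performance lookup table can be sketched as follows. The sketch assumes that the table already holds regrets scaled to the range from zero to one for each dataset, and that the portfolio's error on a dataset is the smallest regret among its members, i.e. the variant in which the pipeline is selected on the test set. The name `greedy_portfolio` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence


def greedy_portfolio(
    regret: Dict[str, Sequence[float]],
    portfolio_size: int,
) -> List[str]:
    """Greedily pick pipelines that minimise the mean per-dataset best regret.

    `regret[name][d]` is the normalised regret of candidate `name` on
    metatrain dataset `d`; all candidates cover the same datasets.
    """
    n_datasets = len(next(iter(regret.values())))
    remaining = dict(regret)
    portfolio: List[str] = []
    best_so_far = [float("inf")] * n_datasets  # error of the empty portfolio

    def error_with(name: str) -> float:
        # Mean error if `name` were added to the current portfolio.
        return sum(min(b, r) for b, r in zip(best_so_far, remaining[name])) / n_datasets

    while remaining and len(portfolio) < portfolio_size:
        choice = min(remaining, key=error_with)
        best_so_far = [min(b, r) for b, r in zip(best_so_far, remaining.pop(choice))]
        portfolio.append(choice)
    return portfolio
```

Since each step is a lookup in the precomputed table, building a portfolio requires no additional pipeline evaluations.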
6.2 Theoretical Properties of the Greedy Algorithm
Besides the practical advantages of the proposed greedy algorithm mentioned above, the worst-case performance of the portfolio is also bounded.
Proposition 1
Minimizing the test loss of a portfolio on a set of datasets , when choosing a ML pipeline from for using holdout or crossvalidation based on its performance on , is equivalent to the sensor placement problem for minimizing detection time [41].
We detail this equivalence in Appendix B. Thereby, we can apply existing results for the sensor placement problem to our problem and can conclude that the greedy portfolio building algorithm choosing on proposed in Section 6.1 is submodular and monotone. Using the test set of the metatrain datasets to construct a portfolio is perfectly fine as long as we do not use the metatest datasets .
This finding has several important implications. First, we can directly apply the proof from Krause et al. [41] that the so-called penalty function (maximum estimated generalization error minus the observed estimated generalization error) is submodular and monotone to our problem setup. Since non-negative linear combinations of submodular functions are also submodular [42], the penalty function for all metatrain datasets is also submodular. Second, we know that the problem of finding an optimal portfolio is NP-hard [43, 41]. Third, the reduction of regret achieved by the greedy algorithm is at least , meaning that we reduce our regret to at most 37% of what the best possible portfolio would achieve [43, 42]. A generalization of this result given by Krause and Golovin [42] also allows reducing the regret to 1% of what the best possible portfolio of size would achieve by extending the portfolio to size . This means that we can find a close-to-optimal portfolio on the metatrain datasets . Under the assumption that we apply the portfolio to datasets from the same distribution of datasets, we have a strong set of default ML pipelines. Fourth, we can apply other strategies for the sensor set placement in our problem setting, such as mixed integer programming strategies; however, these do not scale to portfolio sizes of a dozen ML pipelines [41]. The same proof and consequences apply if we select an ML pipeline based on an intermediate step in a learning curve or use cross-validation instead of holdout. We describe the properties of the greedy algorithm when using SH, and when choosing an algorithm on the validation set, in Appendix B.
6.3 Preview of Experimental Results
We introduced portfolio-based warmstarting to avoid computing metafeatures for a new dataset. However, portfolios also work inherently differently from KND. While KND aimed at using only well-performing configurations, a portfolio is built such that there is at least one configuration that works well, which also provides a different form of initial design for BO. Here, we study the performance of the learned portfolio and compare it against Autosklearn (1.0)’s default metalearning strategy using configurations. Additionally, we also study how pure BO would perform. We give results in Table I. For the new AutoML hyperparameter we chose to allow two full iterations of SH with our hyperparameter setting of SH. Unsurprisingly, warmstarting improves the performance on all datasets, often by a large margin. Although the KND approach mostly does not perform statistically worse, the portfolio approach achieves a better average performance while being conceptually simpler and theoretically motivated.
             10 minutes              60 minutes
             BO      KND     Port    BO      KND     Port
Holdout      4.31    3.40    3.48    2.95    2.84    2.76
SH, Holdout  4.01    3.51    3.43    2.91    2.74    2.66
3CV          6.82    5.87    5.78    5.39    5.17    5.20
SH, 3CV      6.50    6.00    5.76    5.43    5.21    4.97
5CV          9.73    8.66    9.12    7.83    7.46    7.62
SH, 5CV      9.58    8.43    8.93    7.85    7.43    7.41
10CV         17.37   15.82   15.70   16.15   15.07   17.23
SH, 10CV     16.79   15.72   15.65   15.74   14.98   15.25
7 Improvement 3: Automated Policy Selection
The goal of AutoML is to yield state-of-the-art performance without requiring the user to make low-level decisions, e.g., which model and hyperparameter configurations to apply. However, some high-level design decisions remain and thus AutoML systems suffer from a problem similar to the one they are trying to solve. We consider the case where an AutoML system can be run with different optimization policies (e.g., model selection strategies) and study how to further automate AutoML using algorithm selection. In practice, we extend the formulation introduced in Eq. 7 to not contain a fixed policy , but to contain a selector :
(9) 
In the remainder of this section, we describe how to construct such a selector.
7.1 Design Decisions in AutoML
Optimization strategies in AutoML itself are often heavily hyperparameterized. In our case, we deem the model selection strategy (see Section 5) as the most important design decision of an AutoML system. This decision depends on both the given dataset and the available resources. As there is also an interaction between the model selection strategy and the optimal portfolio , we consider here that the optimization policy is parameterized by a combination of model selection strategy and a portfolio optimized for this strategy: .
7.2 Automated Algorithm Selection of AutoML Policies
We introduce a new layer on top of AutoML systems that automatically selects a policy for a new dataset. We show an overview of this system in Figure 3 which consists of a training (TR1–TR6) and a testing stage (TE1–TE4). In brief, in training steps TR1–TR3, we obtain a performance matrix of size , where is a set of candidate ML pipelines, and is the number of representative metatrain datasets. This matrix is used to build policies in training step TR4, e.g., including portfolios greedily built, see Section 6. In steps TR5 and TR6, we compute metafeatures and use them to train a selector which will be used in the online test phase.
For a new dataset , we first compute metafeatures describing (TE1) and use the selector from step TR6 to automatically select an appropriate policy for based on the metafeatures (TE2). This will relieve users from making this decision on their own. Given a policy, we then apply the AutoML system using this policy to (TE3). Finally, we return the best found pipeline based on the training set of (TE4.1). Optionally, we can then compute the loss of on the test set of (TE4.2); we emphasize that this would be the only time we ever access the test set of .
7.2.1 MetaFeatures
To train our selector and to select a policy, we use metafeatures [44, 18] describing all metatrain datasets (TR4) and new datasets (TE1). To avoid the problems discussed in Section 6 we only use very simple and robust metafeatures, which can be reliably computed in linear time for every dataset: 1) the number of data points, 2) the number of features and 3) the number of classes. In our experiments we will show that even with only these trivial and cheap metafeatures we can substantially improve over a static policy.
7.2.2 Constructing the Selector
To construct the meta selection model (TR6), we follow the selector design of HydraMIP [45]: for each pair of AutoML policies, we fit a random forest to predict whether policy outperforms policy given the current dataset’s metafeatures. Since the misclassification loss depends on the difference of the losses of the two policies (i.e. the regret when choosing the wrong policy), we weight each metaobservation by their loss difference. To make errors comparable across different datasets [39], we scale the individual error values for each dataset across all policies to be between zero and one wrt the minimal and maximal observed loss. At test time (TE2), we query all pairwise models for the given metafeatures, and use voting to choose a policy . We will refer to this strategy as Selector.
To obtain an estimate of the generalization error of a policy on a dataset we run the proposed AutoML system. In order to not overestimate the performance of on a dataset , dataset must not be part of the metadata for constructing the portfolio. To overcome this issue, we perform an inner 5fold crossvalidation and build each on four fifths of the datasets and evaluate it on the final fifth of training datasets. As we have access to the performance matrix we introduced in the previous section, constructing these additional portfolios for crossvalidation comes at little cost.
To improve the performance of the selection system, we applied random search [46] to optimize the selector’s hyperparameters (i.e. the hyperparameters of its random forests), minimizing the error of the selector computed on out-of-bag samples [47]. Hyperparameters are shared between all pairwise models to avoid factorial growth of the number of hyperparameters with the number of new model selection strategies.
7.2.3 Backup strategy
Since random forests struggle to extrapolate, our selector may not generalize well to datasets outside of the metadatasets, and we therefore use a fallback measure to avoid failures. Such failures can be harmful if a new dataset is much larger than any dataset in the metadataset and the selector proposes a policy that would time out without any solution. More specifically, if there is no dataset in the metatrain datasets that has higher or equal values for each metafeature (i.e. that dominates the metafeatures of the new dataset), our system falls back to holdout with SH.
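The dominance check behind the fallback can be written in a few lines. The sketch below is an assumption-laden illustration: the metafeature dictionaries, the stub selector and the policy names are hypothetical, and only the rule itself (fall back unless some metatrain dataset has greater or equal values for every metafeature) is taken from the text.

```python
def is_dominated(new_mf, metatrain_mfs):
    """True if some metatrain dataset is >= the new one in every metafeature."""
    return any(
        all(train_mf[name] >= value for name, value in new_mf.items())
        for train_mf in metatrain_mfs
    )


def choose_policy(new_mf, metatrain_mfs, selector, fallback="holdout with SH"):
    """Use the learned selector only inside the region covered by metatrain."""
    if is_dominated(new_mf, metatrain_mfs):
        return selector(new_mf)
    return fallback


# Hypothetical metafeatures of two metatrain datasets.
metatrain_mfs = [
    {"n_data_points": 1000, "n_features": 20, "n_classes": 2},
    {"n_data_points": 200, "n_features": 100, "n_classes": 10},
]


def stub_selector(mf):
    """Stand-in for the trained selector."""
    return "5-fold CV"


covered = {"n_data_points": 800, "n_features": 10, "n_classes": 2}
uncovered = {"n_data_points": 800, "n_features": 50, "n_classes": 2}
print(choose_policy(covered, metatrain_mfs, stub_selector))
print(choose_policy(uncovered, metatrain_mfs, stub_selector))
```

Note that the second dataset triggers the fallback although each of its metafeatures is individually within the range seen during metatraining: no single metatrain dataset dominates it in all metafeatures at once.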
7.3 Preview of Experimental Results
To study the performance of the selector, we compare four different selector strategies: 1) a random policy for each dataset and each repetition (Random), 2) the policy that is best on average with respect to balanced error rate, determined for each repetition with 5-fold cross-validation on metatrain (Single Best), 3) our trained selector (Selector) and 4) the optimal policy (Oracle), which marks the lowest possible error that can theoretically be achieved by a policy.
Footnote 3: The oracle performance is not necessarily zero, because even evaluating the best policy on a dataset can exhibit overfitting compared to the single best model we use to normalize data.
In Table II we report quantitative results for short (10 minutes) and long (60 minutes) optimization budgets. As expected, the random strategy performs worst and also yields the highest variance across repetitions. Choosing the policy that is best on average performs substantially better, but still worse than using a selector.
When turning to the ranking shown in Figure 4, we observe that the random policy is competitive with the single best policy (in contrast to the results shown in Table II). This is because some policies, especially cross-validation with a high number of folds, fail to produce results on a few datasets and thus get the worst possible error, but work best on the majority of other, smaller datasets. The random strategy can select these policies and therefore achieves a rank similar to that of the single best policy. In contrast, our proposed selection approach does not suffer from this issue and outperforms both baseline methods after the first few minutes.
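To make the baselines concrete, the sketch below computes the Single Best, Oracle and expected Random losses from a small loss matrix. All numbers and policy names are made up for illustration. As a simplification, the single best policy is chosen and evaluated on the same datasets, whereas the setup described above determines it on metatrain.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def baseline_losses(loss_matrix):
    """loss_matrix maps dataset -> {policy: normalized loss}."""
    datasets = sorted(loss_matrix)
    policies = sorted(loss_matrix[datasets[0]])
    avg = {p: mean(loss_matrix[d][p] for d in datasets) for p in policies}
    single_best = min(policies, key=avg.get)
    return {
        "single_best_policy": single_best,
        "single_best": avg[single_best],
        # Oracle: the best policy is picked separately for every dataset.
        "oracle": mean(min(loss_matrix[d].values()) for d in datasets),
        # Random: expected loss when drawing a policy uniformly at random.
        "random": mean(avg.values()),
    }


# Hypothetical losses: 10-fold CV is best on two datasets but fails on the
# large one and receives the worst possible normalized error there.
loss_matrix = {
    "small_1": {"holdout": 0.5, "cv10": 0.25},
    "large": {"holdout": 0.25, "cv10": 1.0},
    "small_2": {"holdout": 0.5, "cv10": 0.25},
}
result = baseline_losses(loss_matrix)
print(result)
```

In this toy matrix the failing policy is best on most datasets and would therefore rank well, while its average loss is dominated by the single failure, which mirrors the discrepancy between the rank-based and the loss-based view discussed above.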
regret  
10min  60min  
Selector  3.09  2.66 
Single Best  4.84  
Oracle  
Random 
STD  

10min  60min 
Table II: Averaged normalized balanced error rate. We report the aggregated performance and standard deviation across repetitions and datasets of our AutoML system using different selectors and the optimal Oracle performance. We boldface the best mean value (per optimization budget), and underline results that are not statistically different according to a Wilcoxon signed-rank test. We report the average standard deviation across repetitions of the experiment.
8 Evaluation
To thoroughly assess the impact of our proposed improvements, we now study the performance of Autosklearn (2.0) and compare it to ablated variants of itself and Autosklearn (1.0). We first describe the experimental setup in Section 8.1, conduct a large-scale ablation study in Section 8.2 and then compare Autosklearn (2.0) against different versions of Autosklearn (1.0) in Section 8.3.
8.1 Experimental Setup
So far, AutoML systems were designed without any optimization budget or with a single, fixed optimization budget in mind (see Equation 5). Our system takes the optimization budget into account when constructing the portfolio. When choosing an AutoML system using metalearning, we select a strategy and a portfolio based on both the optimization budget and the dataset metafeatures. We will study two optimization budgets: a short, 10 minute optimization budget and a long, 60 minute optimization budget as in the original Autosklearn paper. To have a single metric for binary classification, multiclass classification and unbalanced datasets, we report the balanced error rate, following the first AutoML challenge [7]. As different datasets can live on different scales, we apply a linear transformation to obtain comparable values. Concretely, we obtain the minimal and maximal error obtained by executing Autosklearn (2.0) without portfolios and ensembles, but with all available model selection strategies per dataset, and rescale by subtracting the minimal error and dividing by the difference between the maximal and minimal error [40]. With this transformation, we obtain a normalized error which can be interpreted as the regret of our method.
Footnote 4: The OBOE AutoML system [48] is a potential exception that takes the optimization budget into consideration, but the experiments in [48] were only conducted for a single optimization budget, not demonstrating that the system adapts itself to multiple optimization budgets.
As discussed in Section 4, we also limit the time and memory for each machine learning pipeline. For the time limit we allow each pipeline at most a fixed fraction of the optimization budget, while for the memory we allow the pipeline 4GB before forcefully terminating the execution.
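The metric and the rescaling described in this section can be sketched as follows. The balanced error rate is implemented here under its standard definition (one minus the mean per-class recall), which is an assumption insofar as the text does not spell the formula out; the labels and the minimal and maximal errors are made-up values.

```python
def balanced_error_rate(y_true, y_pred):
    """One minus the mean per-class recall (standard definition, assumed)."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return 1.0 - sum(recalls) / len(recalls)


def normalized_error(error, min_error, max_error):
    """Rescale an error to [0, 1] using a dataset's reference min and max."""
    return (error - min_error) / (max_error - min_error)


# Unbalanced toy problem: four examples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
ber = balanced_error_rate(y_true, y_pred)  # recalls: 3/4 and 1/2

# Hypothetical reference errors for this dataset.
regret = normalized_error(ber, min_error=0.25, max_error=0.75)
print(ber, regret)
```

A normalized error of zero then means matching the best reference run on that dataset, which is why the value can be read as a regret.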
8.1.1 Datasets
We require two disjoint sets of datasets for our setup: (i) the metatrain datasets, on which we build portfolios and our selector, and (ii) the test datasets, on which we evaluate our method. The distribution of both sets ideally spans a wide variety of problem domains and dataset characteristics. For the test datasets, we rely on the datasets selected for the AutoML benchmark proposed in [49], which consists of datasets for comparing classifiers [50] and datasets from the AutoML challenges [7].
We collected the metatrain datasets based on OpenML [51] using the OpenMLPython API [52]. To obtain a representative set, we considered all datasets on OpenML whose number of samples lies between a fixed lower and upper bound and which have at least two attributes. Next, we dropped all datasets that are sparse or contain time attributes or string type attributes, as the test set does not contain any such datasets. Then, we automatically dropped synthetic datasets and subsampled clusters of highly similar datasets. Finally, we manually checked for overlap with the test datasets and ended up with a total of 209 training datasets, which we used to design our method.
We show the distribution of the datasets in Figure 5, where green points and orange crosses distinguish the two sets of datasets. We can see that the metatrain datasets span the underlying distribution of the test datasets quite well, but that several test datasets lie outside of the metatrain distribution; these are marked with a black cross, and for them our AutoML system selected a backup strategy (see Section 7.2.3). We give the full list of metatrain and test datasets in Appendix D.
For all datasets we use a single holdout test set, which is defined by the corresponding OpenML task. The remaining data form the training data of our AutoML systems, which handle further splits for model selection themselves based on the chosen model selection strategy.
8.1.2 Metadata Generation
For each optimization budget we created four performance matrices, see Section 6. Each matrix refers to one way of assessing the generalization error of a model: holdout, 3-fold CV, 5-fold CV or 10-fold CV. To obtain each matrix, we did the following. For each metatrain dataset, we used combined algorithm selection and hyperparameter optimization to find a customized ML pipeline. In practice, we ran SMAC [15, 53] three times for the prescribed optimization budget and picked the best resulting ML pipeline on the test split of that dataset. Then, we ran the cross-product of all ML pipelines and datasets to obtain the performance matrix.
8.1.3 Other Experimental Details
We always report results averaged across 10 repetitions to account for randomness and report the mean and standard deviation over these repetitions. To check whether performance differences are significant, where possible, we ran the Wilcoxon signed-rank test as a statistical hypothesis test [54]. In addition, we plot the average rank as follows. For each dataset, we draw one run per method (out of 10 repetitions) and rank these draws according to performance, using the average rank in case of ties. We repeat this sampling multiple times to obtain the average rank on a dataset, before averaging these into the total average.
We obtained all previous results without ensemble selection to focus on the individual improvements. From now on, all results include ensemble selection (and we construct ensembles of a fixed size, selecting models with replacement).
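The average-rank procedure described above (draw one run per method, rank the draws with ties sharing the average rank, repeat, then average) can be sketched for a single dataset as follows. The number of samples, the seed and the error values are arbitrary choices for this illustration.

```python
import random


def average_ranks(values):
    """Rank values ascending (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def sampled_average_rank(runs, n_samples, seed=0):
    """runs maps method -> list of errors, one per repetition, on one dataset."""
    rng = random.Random(seed)
    methods = sorted(runs)
    totals = {m: 0.0 for m in methods}
    for _ in range(n_samples):
        draw = [rng.choice(runs[m]) for m in methods]  # one run per method
        for method, rank in zip(methods, average_ranks(draw)):
            totals[method] += rank
    return {m: totals[m] / n_samples for m in methods}


# Hypothetical errors of two methods over two repetitions each.
runs = {"A": [0.10, 0.12], "B": [0.20, 0.25]}
ranks = sampled_average_rank(runs, n_samples=100)
print(ranks)
```

The per-dataset average ranks obtained this way would then be averaged over all datasets to yield the total average rank.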
All experiments were conducted on a compute cluster with machines equipped with 2 Intel Xeon Gold 6242 CPUs with 2.8GHz (32 cores) and 128 GB RAM, running Ubuntu 18.04.3. We provide scripts for reproducing all our experimental results at https://github.com/mfeurer/ASKL2.0_experiments and we provide an implementation within Autosklearn at https://automl.github.io/autosklearn/master/.
8.2 Ablation Study
Now, we study the contribution of each of our improvements in an ablation study. We iteratively disable one component and compare the performance to the full system. These components are (1) using only a subset of the model selection strategies, (2) warmstarting BO with a portfolio and (3) using a selector to choose a model selection strategy.
8.2.1 Do we need different model selection strategies?
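Table III

| Strategies | Selector | 10 min mean | 10 min std | 60 min mean | 60 min std |
|---|---|---|---|---|---|
| All | selector | 2.27 | 0.16 | 1.88 | 0.12 |
| All | random | 6.04 | 1.93 | 5.49 | 1.85 |
| All | oracle | 1.15 | 0.07 | 0.92 | 0.05 |
| Only Holdout | selector | 2.61 | 0.12 | 2.22 | 0.18 |
| Only Holdout | random | 2.67 | 0.12 | 2.22 | 0.13 |
| Only Holdout | oracle | 2.20 | 0.09 | 1.83 | 0.13 |
| Only CV | selector | 4.76 | 0.12 | 4.36 | 0.06 |
| Only CV | random | 7.08 | 0.76 | 6.35 | 0.88 |
| Only CV | oracle | 3.91 | 0.03 | 3.64 | 0.07 |
| Full budget | selector | 2.26 | 0.13 | 1.85 | 0.13 |
| Full budget | random | 6.17 | 1.50 | 5.59 | 1.51 |
| Full budget | oracle | 1.52 | 0.05 | 1.12 | 0.07 |
| Only SH | selector | 2.26 | 0.15 | 1.80 | 0.09 |
| Only SH | random | 5.31 | 2.01 | 4.70 | 1.92 |
| Only SH | oracle | 1.39 | 0.09 | 1.11 | 0.07 |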
We now examine whether we need the different model selection strategies discussed in Section 5. For this, we build selectors on different subsets of the available model selection strategies: Only Holdout consists of holdout with and without SH; Only CV comprises 3-fold CV, 5-fold CV and 10-fold CV, all of them with and without SH; Full budget contains both holdout and cross-validation and assigns each pipeline evaluation the same budget; while Only SH uses successive halving to assign budgets.
In Table III, the performance of the oracle selector shows how good a set of model selection strategies could be if we could build a perfect selector. It turns out that both Only Holdout and Only CV have a much worse oracle performance than All, with the oracle performance of Only CV being even worse than the performance of the learned selector for All. Looking at the two budget allocation strategies, it turns out that using either of them alone (Full budget or Only SH) would be slightly preferable in terms of performance with a selector. However, the oracle performance of both is worse than that of All which shows that there is some complementarity in them which cannot yet be exploited by the selector.
While these results question the usefulness of choosing from all model selection strategies, we believe this points to the research question whether we can learn on the metatrain datasets which model selection strategies to include in the set of strategies to choose from. Also, with an evergrowing availability of metatrain datasets and continued research on robust selectors, we expect this flexibility to eventually yield improved performance.
8.2.2 Do we need portfolios?
Now we study the impact of the portfolio. For this study, we completely remove the portfolio from our AutoML system, meaning that we only run BO and construct ensembles – both for creating the data we train our selector on and for reporting performance. We compare this reduced system against full Autosklearn (2.0) in Table IV.
Comparing the performance of the AutoML system with and without portfolios (columns 1 and 3), there is a clear drop in performance, showing the benefit of using portfolios in our system. To demonstrate that the selector indeed helps and that we do not only measure the impact of warm-starting with a portfolio, we also show the performance of the single best selector (columns 2 and 4), which is always worse than our learned selector.
Portfolio  No portfolio  

min  Selector  Single best  Selector  Single best 
mean  1.88  
std 
8.2.3 Do we need selection at all?
Next, we examine how much performance we gain by having a selector to decide between different AutoML strategies based on metafeatures and how to construct this selector. We compare the performance of the full system using a learned selector to using (1) a single, static learned strategy (single best) and (2) the selector without a fallback mechanism for out-of-distribution datasets. As a baseline, we provide results for a random selector and the oracle selector; we give all results in Table V. All results show the performance of using a portfolio and then running BO and building ensembles for the remaining time.
An important metric in the field of algorithm selection is how much of the gap between the single best and the oracle performance one can close. We see that indeed the selector described in Section 7 is able to close most of this gap, demonstrating that there is value in using three simple metafeatures to decide on the model selection strategy.
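The gap-closing metric mentioned above is a simple ratio, sketched below. The three loss values are purely illustrative and are not taken from Table V.

```python
def gap_closed(single_best, selector, oracle):
    """Fraction of the single-best-to-oracle gap that the selector closes.

    1.0 means the selector matches the oracle, 0.0 means it is no better
    than the single best policy, and negative values mean it is worse.
    """
    return (single_best - selector) / (single_best - oracle)


# Illustrative losses only (hypothetical values).
fraction = gap_closed(single_best=4.0, selector=2.5, oracle=1.0)
print(f"The selector closes {fraction:.0%} of the gap.")
```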
To study how many resources we need to spend on generating training data for our selector, we consider three approaches: (P) only using the portfolio performance, (P+BO) actually running the portfolio and BO for 10 and 60 minutes, respectively, and (P+BO+E) additionally also constructing ensembles, which yields the most accurate metadata. Running BO on all 209 datasets (P+BO) is by far more expensive than the table lookups (P); building an ensemble (P+BO+E) adds only several seconds to minutes on top compared to (P+BO).
For both optimization budgets, using P+BO+E yields the best results with the selector, closely followed by P+BO, see Table V. The cheapest method, P, yields the worst results, showing that it is worth investing resources into computing good metadata. Looking at the single best, surprisingly, performance gets worse when using seemingly better metadata. This is due to a few selection strategies failing on large datasets: when only looking at portfolios, the robust holdout strategy is selected as the single best model selection strategy. However, when also considering BO, there is a greater risk to overfit and thus a cross-validation variant performs best on average on the metadatasets; unfortunately, this variant fails on some test datasets due to violating resource limitations. For such cases our fallback mechanism is quite important.
10 Min  60 Min  
oracle  
random  
trained on  P  P+BO  P+BO+E  P  P+BO  P+BO+E 
single best  
selector  2.27  1.88  
selector w/o fallback 
Finally, we also take a closer look at the impact of the fallback mechanism to verify that our improvements are not solely due to this component. We observe that the performance drops for five out of six of the selectors when we do not include this fallback mechanism, but that the selector still outperforms the single best. The rather stark performance degradation compared to the regular selector can mostly be explained by a few huge datasets. Based on these observations, we suggest research into an adaptive fallback strategy which can change the model selection strategy during the execution of the AutoML system so that a selector can be used on out-of-distribution datasets. We conclude that using a selector is very beneficial, and that using a fallback strategy to cope with out-of-distribution datasets can substantially improve performance.
8.3 Autosklearn (1.0) vs. Autosklearn (2.0)
Table VI

| | System | 10 min mean | 10 min std | 60 min mean | 60 min std |
|---|---|---|---|---|---|
| (1) | Autosklearn (2.0) | 2.27 | 0.16 | 1.88 | 0.12 |
| (2) | Autosklearn (1.0) | 11.76 | 0.09 | 8.59 | 0.13 |
| (3) | Autosklearn (1.0), no KND | 7.68 | 0.72 | 3.31 | 0.34 |
| (4) | Autosklearn (1.0), RS | 7.56 | 1.77 | 3.79 | 0.86 |
| (5) | Autosklearn (1.0), only iterative | 11.82 | 0.10 | 7.29 | 0.14 |
| (6) | Autosklearn (1.0), no KND, only iterative | 7.89 | 0.71 | 4.04 | 1.00 |
| (7) | Autosklearn (1.0), RS, only iterative | 8.06 | 0.93 | 3.81 | 0.65 |
In this section, we demonstrate the superior performance of Autosklearn (2.0) against the previous version, Autosklearn (1.0). In Table VI and Figure 6 we compare the performance of Autosklearn (2.0) to six different setups of Autosklearn (1.0), including the full and the reduced search space, using only BO without KND, and using random search.
Looking at the first two rows in Table VI, we see that Autosklearn (2.0) achieves the lowest error for both optimization budgets, being significantly better for one of the two settings. Most notably, Autosklearn (2.0) reduces the relative error substantially for both budgets, by up to a factor of five. The large difference between Autosklearn (1.0) and Autosklearn (2.0) is mainly a result of Autosklearn (2.0) being very robust by avoiding ML pipelines that cannot be trained within the given time limit, while the KND approach in Autosklearn (1.0) does not avoid this failure mode. For Autosklearn (1.0), using only BO (rows 3 and 6) or random search (rows 4 and 7) therefore results in better performance than using KND. It turns out that these results are skewed by three large datasets (task IDs 189873, 189874, 75193) on which the KND initialization of Autosklearn (1.0) only suggests ML pipelines that time out or hit the memory limit and thus exhaust the optimization budget for the full search space. Not taking these three datasets into account, Autosklearn (1.0) using BO and metalearning improves over the versions without metalearning. Our new AutoML system does not suffer from this problem as it a) selects SH to avoid spending too much time on unpromising ML pipelines and b) can return predictions and results even if an ML pipeline was not evaluated for the full budget or converged early.
Figure 6 provides another view on the results, presenting average ranks (where failures obtain less weight compared to the averaged performance). Under this view, Autosklearn (1.0) using KND and BO achieves a much better rank than the methods without metalearning. We emphasize that despite the quite different relative performances of the Autosklearn (1.0) variants under this different evaluation setup, Autosklearn (2.0) still clearly yields the best results.
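The size of the improvement over Autosklearn (1.0) can be recomputed from the mean normalized errors in rows (1) and (2) of Table VI. The helper names below are illustrative; the four numbers are the ones reported in the table.

```python
def relative_reduction(old, new):
    """Fraction by which the error shrinks when going from old to new."""
    return 1.0 - new / old


def improvement_factor(old, new):
    """How many times smaller the new error is."""
    return old / new


# Mean normalized errors from Table VI, rows (1) and (2).
table_vi = {
    "10min": {"askl2": 2.27, "askl1": 11.76},
    "60min": {"askl2": 1.88, "askl1": 8.59},
}

summary = {
    budget: (
        relative_reduction(v["askl1"], v["askl2"]),
        improvement_factor(v["askl1"], v["askl2"]),
    )
    for budget, v in table_vi.items()
}
for budget, (reduction, factor) in summary.items():
    print(f"{budget}: reduction {reduction:.1%}, factor {factor:.2f}")
```

With these values the factor is roughly 5.2 for the 10 minute budget and roughly 4.6 for the 60 minute budget, consistent with a reduction by up to about a factor of five.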
9 Discussion and Conclusion
Autosklearn (2.0) constitutes the next generation of our AutoML system Autosklearn, aiming to provide a truly handsfree system which, given a new task and resource limitations, automatically chooses the best setup. We proposed three improvements for faster and more efficient AutoML: (i) we show the necessity of different model selection strategies to work well on various datasets, (ii) to get strong results quickly we propose to use portfolios, which can be built offline and thus reduce startup costs and (iii) to close the design space we opened up for AutoML we propose to automatically select the best configuration of our system.
We conducted a large-scale study based on separate sets of metadatasets and test datasets and obtained substantially improved performance compared to Autosklearn (1.0), reducing the regret by up to a factor of five and achieving a lower loss after 10 minutes than Autosklearn (1.0) after 60 minutes. Our ablation study showed that using a selector to choose the model selection strategy has the greatest impact on performance and allows Autosklearn (2.0) to run robustly on new, unseen datasets. In future work, we would like to compare against other AutoML systems, and the AutoML benchmark from Gijsbers et al. [49], from which we already obtained the datasets, would be a natural candidate.
Footnote 5: However, running the AutoML benchmark from [49] in a fair way is not trivial, since it only offers a parallel setting. We focused on other research questions first and have not yet engineered our system to exploit parallel resources. Nevertheless, Autosklearn (1.0) already performed well on that benchmark and we therefore expect the improvements of Autosklearn (2.0) to transfer to this setting as well.
Our system also introduces some shortcomings since it is constructed towards a single optimization budget, a single metric and a single search space. Although all of these, along with the metatraining datasets, could be provided by a user to automatically build a customized version of Autosklearn (2.0), it would be interesting to see whether we can learn how to transfer a specific AutoML system to different optimization budgets and metrics. Also, there still remain several handpicked hyperparameters on the level of the AutoML system, which we would like to automate in future work, too, for example by automatically learning the portfolio size, learning more hyperhyperparameters of the different model selection strategies (for example of SH) and learning which parts of the search space to use. Finally, building the training data is currently quite expensive. Even though this has to be done only once, it will be interesting to see whether we can take shortcuts here, for example by using a joint ranking model [55].
Acknowledgments
The authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 39/9631 FUGG (bwForCluster NEMO). This work has partly been supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721. This work was supported by the German Research Foundation (DFG) under Emmy Noether grant HU 1900/21. Robert Bosch GmbH is acknowledged for financial support. Katharina Eggensperger acknowledges funding by the State Graduate Funding Program of Baden-Württemberg.
References
 [1] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automatic Machine Learning: Methods, Systems, Challenges, ser. SSCML. Springer, 2019.
 [2] C. Thornton, F. Hutter, H. Hoos, and K. LeytonBrown, “AutoWEKA: combined selection and hyperparameter optimization of classification algorithms,” in Proc. of KDD’13, 2013, pp. 847–855.
 [3] B. Komer, J. Bergstra, and C. Eliasmith, “Hyperoptsklearn: Automatic hyperparameter configuration for scikitlearn,” in ICML Workshop on AutoML, 2014.
 [4] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Proc. of NeurIPS’15, 2015, pp. 2962–2970.

 [5] R. Olson, N. Bartley, R. Urbanowicz, and J. Moore, “Evaluation of a Treebased Pipeline Optimization Tool for Automating Data Science,” in Proc. of GECCO’16, 2016, pp. 485–492.
 [6] H. Jin, Q. Song, and X. Hu, “AutoKeras: An efficient neural architecture search system,” in Proc. of KDD’19, 2019, pp. 1946–1956.
 [7] I. Guyon, L. SunHosoya, M. Boullé, H. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag, A. Statnikov, W. Tu, and E. Viegas, “Analysis of the AutoML Challenge Series 20152018,” in Automatic Machine Learning: Methods, Systems, Challenges, ser. SSCML, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds. Springer, 2019, ch. 10, pp. 177–219.
 [8] M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter, “Practical automated machine learning for the automl challenge 2018,” in ICML 2018 AutoML Workshop, 2018.
 [9] I. Guyon, A. Saffari, G. Dror, and G. Cawley, “Model selection: Beyond the Bayesian/Frequentist divide,” JMLR, vol. 11, pp. 61–87, 2010.
 [10] H. Mendoza, A. Klein, M. Feurer, J. Springenberg, M. Urban, M. Burkart, M. Dippel, M. Lindauer, and F. Hutter, “Towards automaticallytuned deep neural networks,” in Automatic Machine Learning: Methods, Systems, Challenges, ser. SSCML, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds. Springer, 2019, pp. 135–149.
 [11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikitlearn: Machine learning in Python,” JMLR, vol. 12, pp. 2825–2830, 2011.
 [12] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
 [13] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
 [14] L. Breiman, “Random forests,” MLJ, vol. 45, pp. 5–32, 2001.
 [15] F. Hutter, H. Hoos, and K. LeytonBrown, “Sequential modelbased optimization for general algorithm configuration,” in Proc. of LION’11, 2011, pp. 507–523.
 [16] M. Reif, F. Shafait, and A. Dengel, “Metalearning for evolutionary parameter optimization of classifiers,” Machine Learning, vol. 87, pp. 357–380, 2012.
 [17] M. Feurer, J. Springenberg, and F. Hutter, “Initializing Bayesian hyperparameter optimization via metalearning,” in Proc. of AAAI’15, 2015, pp. 1128–1135.
 [18] J. Vanschoren, “Metalearning,” in Automatic Machine Learning: Methods, Systems, Challenges, ser. SSCML, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds. Springer, 2019, pp. 35–61.
 [19] H. Escalante, M. Montes, and E. Sucar, “Particle Swarm Model Selection,” JMLR, vol. 10, pp. 405–440, 2009.
 [20] R. Caruana, A. NiculescuMizil, G. Crew, and A. Ksikes, “Ensemble selection from libraries of models,” in Proc. of ICML’04, 2004.

 [21] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
 [22] S. Raschka, “Model evaluation, model selection, and algorithm selection in machine learning,” arXiv:1811.12808 [stat.ML], 2018.
 [23] R. J. Henery, “Methods for comparison,” in Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994, ch. 7, pp. 107–124.
 [24] R. Kohavi and G. John, “Automatic Parameter Selection by Minimizing Estimated Error,” in Proc. of ICML’95, 1995, pp. 304–312.
 [25] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. SpringerVerlag, 2001.

 [26] B. Bischl, O. Mersmann, H. Trautmann, and C. Weihs, “Resampling methods for metamodel validation with recommendations for evolutionary computation,” Evolutionary Computation, vol. 20, no. 2, pp. 249–275, 2012.
 [27] K. Jamieson and A. Talwalkar, “Nonstochastic best arm identification and hyperparameter optimization,” in Proc. of AISTATS’16, 2016.
 [28] Z. Karnin, T. Koren, and O. Somekh, “Almost optimal exploration in multiarmed bandits,” in Proc. of ICML’13, 2013, pp. 1238–1246.
 [29] S. Falkner, A. Klein, and F. Hutter, “BOHB: Robust and Efficient Hyperparameter Optimization at Scale,” in Proc. of ICML’18, 2018, pp. 1437–1446.
 [30] B. Huberman, R. Lukose, and T. Hogg, “An economic approach to hard computational problems,” Science, vol. 275, pp. 51–54, 1997.
 [31] C. Gomes and B. Selman, “Algorithm portfolios,” AIJ, vol. 126, no. 12, pp. 43–62, 2001.
 [32] P. Brazdil and C. Soares, “A comparison of ranking methods for classification algorithm selection,” in Proc. of ECML’00, 2000, pp. 63–74.
 [33] C. Soares and P. Brazdil, “Zoomed ranking: Selection of classification algorithms based on relevant performance information,” in Proc. of PKDD’00, 2000, pp. 126–135.
 [34] P. Brazdil, C. Soares, and R. Pereira, “Reducing rankings of classifiers by eliminating redundant classifiers,” in Proc. of EPAI’01, 2001, pp. 14–21.
 [35] M. Wistuba, N. Schilling, and L. SchmidtThieme, “Sequential ModelFree Hyperparameter Tuning,” in Proc. of ICDM ’15, 2015, pp. 1033–1038.
 [36] ——, “Learning hyperparameter optimization initializations,” in Proc. of DSAA’15, 2015, pp. 1–10.
 [37] L. Xu, H. Hoos, and K. LeytonBrown, “Hydra: Automatically configuring algorithms for portfoliobased selection,” in Proc. of AAAI’10, 2010, pp. 210–216.
 [38] M. Lindauer, H. Hoos, K. LeytonBrown, and T. Schaub, “Automatic construction of parallel portfolios via algorithm configuration,” AIJ, vol. 244, pp. 272–290, 2017.
 [39] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag, “Collaborative hyperparameter tuning,” in Proc. of ICML’13, 2013, pp. 199–207.
 [40] M. Wistuba, N. Schilling, and L. SchmidtThieme, “Scalable Gaussian processbased transfer surrogates for hyperparameter optimization,” Machine Learning, vol. 107, no. 1, pp. 43–78, 2018.
 [41] A. Krause, J. Leskovec, C. Guestrin, J. VanBriesen, and C. Faloutsos, “Efficient sensor placement optimization for securing large water distribution networks,” JWRPM, vol. 134, pp. 516–526, 2008.
 [42] A. Krause and D. Golovin, “Submodular function maximization,” in Tractability: Practical Approaches to Hard Problems, L. Bordeaux, Y. Hamadi, and P. Kohli, Eds. Cambridge University Press, 2014, pp. 71–104.
 [43] G. Nemhauser, L. Wolsey, and M. Fisher, “An analysis of approximations for maximizing submodular set functions,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.
 [44] P. Brazdil, C. GiraudCarrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining, 1st ed. Springer Publishing Company, Incorporated, 2008.
 [45] L. Xu, F. Hutter, H. Hoos, and K. LeytonBrown, “HydraMIP: Automated algorithm configuration and selection for mixed integer programming,” in Proc. of RCRA workshop at IJCAI, 2011.
 [46] J. Bergstra and Y. Bengio, “Random search for hyperparameter optimization,” JMLR, vol. 13, pp. 281–305, 2012.
 [47] P. Probst, M. N. Wright, and A.L. Boulesteix, “Hyperparameters and tuning strategies for random forest,” WIREs Data Mining and Knowledge Discovery, vol. 9, no. 3, p. e1301, 2019.
 [48] C. Yang, Y. Akimoto, D. W. Kim, and M. Udell, “OBOE: Collaborative filtering for AutoML initialization,” in Proc. of KDD’19, 2019.

 [49] P. Gijsbers, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren, “An open source automl benchmark,” in ICML 2019 AutoML Workshop, 2019.
 [50] B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. Mantovani, J. van Rijn, and J. Vanschoren, “Openml benchmarking suites,” arXiv, vol. 1708.0373v2, pp. 1–6, Sep. 2019.
 [51] J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo, “OpenML: Networked science in machine learning,” SIGKDD, vol. 15, no. 2, pp. 49–60, 2014.
 [52] M. Feurer, J. van Rijn, A. Kadra, P. Gijsbers, N. Mallik, S. Ravi, A. Müller, J. Vanschoren, and F. Hutter, “Openmlpython: an extensible python api for openml,” arXiv:1911.02490 [cs.LG], 2019.
 [53] M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, and F. Hutter, “Smac v3: Algorithm configuration in python,” https://github.com/automl/SMAC3, 2017.
 [54] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” JMLR, vol. 7, pp. 1–30, 2006.
 [55] A. Tornede, M. Wever, and E. Hüllermeier, “Extreme algorithm selection with dyadic feature representation,” arXiv:2001.10741 [cs.LG], 2020.
 [56] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, and E. Viegas, “Design of the 2015 chalearn automl challenge,” in Proc. of IJCNN’15. IEEE, 2015, pp. 1–8.
 [57] A. Kalousis and M. Hilario, “Representational Issues in MetaLearning,” in Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 313–320.
 [58] R. Kohavi, “A study of crossvalidation and bootstrap for accuracy estimation and model selection,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence  Volume 2, ser. IJCAI’95, 1995, pp. 1137–1143.
 [59] F. Mohr, M. Wever, and E. Hüllermeier, “MLPlan: Automated machine learning via hierarchical planning,” Machine Learning, vol. 107, no. 810, pp. 1495–1515, 2018.
 [60] O. Maron and A. Moore, “The racing algorithm: Model selection for lazy learners,” Artificial Intelligence Review, vol. 11, no. 15, pp. 193–225, 1997.
 [61] A. Zheng and M. Bilenko, “Lazy Paired HyperParameter Tuning,” in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, F. Rossi, Ed., 2013, pp. 1924–1931.
 [62] T. Krueger, D. Panknin, and M. Braun, “Fast crossvalidation via sequential testing,” JMLR, 2015.
 [63] A. Anderson, S. Dubois, A. Cuestainfante, and K. Veeramachaneni, “Sample, Estimate, Tune: Scaling Bayesian AutoTuning of Data Science Pipelines,” in 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2017, pp. 361–372.
 [64] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, “Hyperband: A novel banditbased approach to hyperparameter optimization,” JMLR, vol. 18, no. 185, pp. 1–52, 2018.
 [65] I. Tsamardinos, E. Greasidou, and G. Borboudakis, “Bootstrapping the outofsample predictions for efficient and accurate crossvalidation,” Machine Learning, vol. 107, no. 12, pp. 1895–1922, 2018.
 [66] K. SmithMiles, “Crossdisciplinary perspectives on metalearning for algorithm selection,” ACM, vol. 41, no. 1, 2008.
 [67] L. Kotthoff, “Algorithm selection for combinatorial search problems: A survey,” AI Magazine, vol. 35, no. 3, pp. 48–60, 2014.
 [68] P. Kerschke, H. Hoos, F. Neumann, and H. Trautmann, “Automated algorithm selection: Survey and perspectives,” Evolutionary Computation, vol. 27, no. 1, pp. 3–45, 2019.
 [69] F. Pfisterer, J. van Rijn, P. Probst, A. Müller, and B. Bischl, “Learning multiple defaults for machine learning algorithms,” arXiv:1811.09409 [stat.ML] , 2018.
 [70] J. Seipp, S. Sievers, M. Helmert, and F. Hutter, “Automatic configuration of sequential planning portfolios,” in Proc. of AAAI’15, 2015.
 [71] R. Leite, P. Brazdil, and J. Vanschoren, “Selecting classification algorithms with active testing,” in Proc. of MLDM, 2013, pp. 117–131.
 [72] N. Fusi, R. Sheth, and M. Elibol, “Probabilistic matrix factorization for automated machine learning,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 3348–3357.
 [73] A. Klein, S. Falkner, J. Springenberg, and F. Hutter, “Learning curve prediction with Bayesian neural networks,” in Proc. of ICLR’17, 2017.
 [74] M. Lindauer, H. Hoos, F. Hutter, and T. Schaub, “Autofolio: An automatically configured algorithm selector,” JAIR, vol. 53, pp. 745–778, 2015.
 [75] F. Hutter, H. Hoos, K. LeytonBrown, and T. Stützle, “ParamILS: An automatic algorithm configuration framework,” JAIR, vol. 36, pp. 267–306, 2009.
 [76] R. Caruana, A. Munson, and A. NiculescuMizil, “Getting the most out of ensemble selection,” in Proc. of ICDM’06, 2006, pp. 828–833.
 [77] A. NiculescuMizil, C. Perlich, G. Swirszcz, V. Sindhwani, Y. Liu, P. Melville, D. Wang, J. Xiao, J. Hu, M. Singh, W. Shang, and Y. Zhu, “Winning the kdd cup orange challenge with ensemble selection,” in Proceedings of KDDCup 2009 Competition, G. Dror, M. Boullé, I. Guyon, V. Lemaire, and D. Vogel, Eds., vol. 7, 2009, pp. 23–34.
 [78] R. Olson and J. Moore, TPOT: A TreeBased Pipeline Optimization Tool for Automating Machine Learning, ser. SSCML. Springer, 2019, pp. 151–160.
 [79] M. Wistuba, N. Schilling, and L. SchmidtThieme, “Automatic Frankensteining: Creating Complex Ensembles Autonomously,” in Proceedings of the 2017 SIAM International Conference on Data Mining, 2017, pp. 741–749.
 [80] B. Chen, H. Wu, W. Mo, I. Chattopadhyay, and H. Lipson, “Autostacker: A Compositional Evolutionary Learning System,” in Proc. of GECCO’18, 2018, pp. 402–409.
 [81] Y. Zhang, M. Bahadori, H. Su, and J. Sun, “FLASH: Fast Bayesian Optimization for Data Analytic Pipelines,” in Proc. of KDD’16, 2016, pp. 2065–2074.
 [82] H. Rakotoarison, M. Schoenauer, and M. Sebag, “Automated machine learning with MonteCarlo tree search,” in Proceedings of the TwentyEighth International Joint Conference on Artificial Intelligence, IJCAI19. International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 3296–3303.
 [83] A. Alaa and M. van der Schaar, “AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning,” in Proc. of ICML’18, 2018, pp. 139–148.
 [84] L. Parmentier, O. Nicol, L. Jourdan, and M. Kessaci, “Tpotsh: A faster optimization algorithm to solve the automl problem on large datasets,” in 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019, pp. 471–478.
 [85] H. Mendoza, A. Klein, M. Feurer, J. Springenberg, and F. Hutter, “Towards automaticallytuned neural networks,” in ICML 2016 AutoML Workshop, 2016.
 [86] J. Krarup and P. Pruzan, “The simple plant location problem: Survey and synthesis,” European Journal of Operations Research, vol. 12, pp. 36–81, 1983.
 [87] S. van der Walt, S. C. Colbert, and G. Varoquaux, “The numpy array: A structure for efficient numerical computation,” Computing in Science Engineering, vol. 13, no. 2, pp. 22–30, 2011.
 [88] P. Virtanen, R. Gommers, T. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. Harris, A. Archibald, A. Ribeiro, F. Pedregosa, P. van Mulbregt, and S. . . Contributors, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020.
 [89] Wes McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, Stéfan van der Walt and Jarrod Millman, Eds., 2010, pp. 56 – 61.
 [90] J. Hunter, “Matplotlib: A 2d graphics environment,” Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
 [91] Proc. of KDD’19, 2019.
 [92] Proc. of ICML’18, 2018.
 [93] Proc. of ICML’13, 2013.
 [94] Proc. of AAAI’15, 2015.
Appendix A Related work
In this section we provide additional related work on the improvements we presented in the main paper. First, we will provide further literature on model selection strategies. Second, we give more details on existing work on portfolios of ML pipelines. Third, we give pointers to literature overviews of algorithm selection before discussing other existing AutoML systems in detail.
a.1 Related work on model selection strategies
Automatically choosing a model selection strategy to assess the performance of an ML pipeline for hyperparameter optimization has not previously been tackled, and only Guyon et al. [56] acknowledge the lack of such an approach. The influence of the model selection strategy on the validation and test performance is well known [57], and researchers have studied its impact [58]. Following up on this, the OpenML platform [51] stores which validation strategies were used for an experiment, but so far no work has operationalized this information. Recently, Mohr et al. [59] noted that the choice of the model selection strategy affects the final test performance of an AutoML system, but they also made only a general recommendation.
Our method, in general, is not specific to holdout, crossvalidation or successive halving and could generalize to any other method assessing the performance of a model [26, 9, 22] or allocating resources to the evaluation of a model [60, 15, 61, 62, 63, 64, 65]. While these are important areas of research, we focus here on the most commonly used methods and leave studying these extensions for future work.
a.2 Related work on Portfolios
Portfolios have a long history [30, 31] of leveraging the complementary strengths of algorithms (or hyperparameter settings) and have had applications in different subfields of AI [66, 67, 68].
Algorithm portfolios were introduced to machine learning by the name of algorithm ranking with the goal of reducing the required time to perform model selection compared to running all algorithms under consideration [32, 33], ignoring redundant ones [34]. ML portfolios can be superior to hyperparameter optimization with Bayesian optimization [35], Bayesian optimization with a model which takes past performance data into account [36] or can be applied when there is simply no time to perform full hyperparameter optimization [8]. Furthermore, such a portfoliobased modelfree optimization is both easier to implement than regular Bayesian optimization and metafeature based solutions, and the portfolio can be shared easily across researchers and practitioners without the necessity of sharing metadata [36, 35, 69] or additional hyperparameter optimization software.
The efficient creation of algorithm portfolios is an active area of research with the Greedy Algorithm being a popular choice [37, 45, 70, 35, 8] due to its simplicity. Wistuba et al. [35] first proposed the use of the Greedy Algorithm for pipelines of machine learning portfolios, minimizing the average rank on meta-datasets for a single machine learning algorithm. Later, they extended their work to update the members of a portfolio in a round-robin fashion, this time using the average normalized misclassification error as a loss function and relying on a Gaussian process model [36]. The loss function of the first method can suffer from irrelevant alternatives, while the second method does not guarantee that well-performing algorithms are executed early on, which could be harmful under time constraints. In work parallel to our submission to the second AutoML challenge, Pfisterer et al. [69] also suggested using a set of default values to simplify hyperparameter optimization. They argued that constructing an optimal portfolio of hyperparameter settings is a generalization of the maximum coverage problem and proposed two solutions based on Mixed Integer Programming and the Greedy Algorithm, the latter of which we also use as the base of our algorithm. The main difference of our work is that we demonstrate the usefulness of portfolios for high-dimensional configuration spaces of AutoML systems under strict time limits and that we give concrete worst-case performance guarantees.
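To make the greedy construction concrete, the following is a minimal sketch under simplifying assumptions, not the exact procedure used in Autosklearn. It assumes a precomputed matrix of losses (`losses[d][p]` for candidate pipeline `p` on meta-dataset `d`), credits each dataset with the best member of the portfolio, and uses raw rather than normalized losses. The function name and the toy numbers are hypothetical.

```python
def greedy_portfolio(losses, size):
    """Greedily select `size` pipelines minimizing the mean best loss.

    losses[d][p] is the (hypothetical, precomputed) loss of candidate
    pipeline p on meta-dataset d. Returns pipeline indices in the order
    in which they were added to the portfolio.
    """
    n_datasets = len(losses)
    n_pipelines = len(losses[0])
    portfolio = []
    # Best loss reached so far on each dataset by the current portfolio.
    best = [float("inf")] * n_datasets

    for _ in range(min(size, n_pipelines)):

        def mean_loss_with(p):
            # Mean portfolio loss if pipeline p were added next.
            return sum(min(b, row[p]) for b, row in zip(best, losses)) / n_datasets

        candidates = [p for p in range(n_pipelines) if p not in portfolio]
        choice = min(candidates, key=mean_loss_with)
        portfolio.append(choice)
        best = [min(b, row[choice]) for b, row in zip(best, losses)]

    return portfolio


# Toy example: 3 meta-datasets (rows) and 3 candidate pipelines (columns).
toy_losses = [
    [0.1, 0.5, 0.3],
    [0.9, 0.2, 0.3],
    [0.8, 0.2, 0.4],
]
print(greedy_portfolio(toy_losses, 2))
```

Because members are appended in order of marginal improvement, the pipeline that helps most on average is placed first, which matters when the portfolio is executed under a time limit.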
Extending these portfolio strategies which are learned offline, there are online portfolios which can select from a fixed set of machine learning pipelines, taking previous evaluations into account [71, 35, 35, 72, 48]. However, such methods cannot be directly combined with all resampling strategies as they require the definition of a special model for extrapolating learning curves [73, 29] and also introduce additional complexity into AutoML systems.
a.3 Related Work on Algorithm Selection
Treating the choice of model selection strategy as an algorithm selection problem allows us to apply methods from the field of algorithm selection [66, 67, 68] and we can in future work reuse existing techniques besides pairwise classification [45]. An especially promising candidate is AutoFolio [74], an AutoAI system which automatically constructs a selector for a given algorithm selection problem using algorithm configuration [75].
a.4 Related Work on AutoML systems
AutoML systems have recently gained traction in the research community and there exist a multitude of approaches with many of them being either available as supplementary material or open source software.
To the best of our knowledge, the first AutoML system which tunes both hyperparameters and chooses algorithms was an ensemble method [20]. The system randomly produces 2000 classifiers from a wide range of ML algorithms and constructs a posthoc ensemble. It was later robustified [76] and employed in a winning KDD challenge [77].
The first AutoML system to jointly optimize the whole pipeline is Particle Swarm Model Selection. Later systems started employing modelbased global optimization algorithms, such as AutoWEKA [2] and Hyperoptsklearn [3]. We extended this approach using metalearning and including ensembles in Autosklearn [4].
Relieving the limitation of a fixed search space, the treebased pipeline optimization tool (TPOT [78]) uses a pipeline grammar and grammatical evolution to construct ML pipelines of arbitrary length.
Instead of a single layer of ML algorithms followed by an ensembling mechanism, Wistuba et al. [79] proposed twolayer stacking, applying AutoML to the outputs of an AutoML system. AutoStacker went one step further, directly optimizing for a twolayer AutoML system [80].
Another strain of work on AutoML systems aims at more efficient optimization. FLASH [81] proposed a pruning mechanism to reduce the pipeline space to search through, MOSAIC [82] uses MonteCarlo Tree search to efficiently search the treestructured space of ML pipelines and MLPLAN uses hierarchical tasknetworks and a randomized depthfirst search [59]. AutoPrognosis [83] splits the optimization problem of the full pipeline into smaller optimization problems which can then be tackled by Gaussian processbased BO. TPOTSH [84], inspired by our submission to the second AutoML challenge, uses successive halving to speed up TPOT on large datasets.
Finally, while the AutoML tools discussed so far focus on “traditional” machine learning, there is also work on creating AutoML systems that can leverage recent advancements in deep learning. AutoNet extended the AutoWEKA approach to deep neural networks [85] and AutoKeras employs Neural Architecture Search to find well-performing neural networks [6].
Of course, there are also many techniques related to AutoML which are not used in one of the AutoML systems discussed in this section and we refer to Hutter et al. [1] for an overview of the field of Automated Machine Learning and Brazdil et al. [44] for an overview on metalearning research which predates the work on AutoML.
Appendix B Details on Greedy Portfolio Construction
b.1 Holdout as a Model Selection Strategy
In the main paper we have only defined , but not how it practically works. For holdout, is defined as:
(10) 
, while for cross-validation we can plug in the definition for from Section 2 of the main paper. Successive Halving is a bit more involved and cannot be written in a single equation, but would require pseudocode.
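As an illustration of what a holdout estimate computes in practice, the following sketch fits a model on one part of the training data and averages a loss over the held-out part. It is a generic formulation under stated assumptions, not a reconstruction of the equation above: the split fraction, the function names and the `fit`/`loss` callables are all hypothetical.

```python
import random


def holdout_loss(fit, loss, data, val_fraction=0.33, seed=0):
    """Fit on one part of `data` and return the mean loss on the held-out part.

    data: list of (x, y) pairs.
    fit:  callable taking a list of (x, y) pairs and returning a predictor x -> y_hat.
    loss: callable (y_hat, y) -> float.
    The split fraction is an illustrative choice, not a value from the paper.
    """
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    val, train = rows[:n_val], rows[n_val:]
    model = fit(train)
    return sum(loss(model(x), y) for x, y in val) / len(val)


def fit_majority(train):
    """Toy learner: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def zero_one(y_hat, y):
    """Misclassification loss for a single prediction."""
    return float(y_hat != y)


toy_data = [(i, 1) for i in range(10)]
print(holdout_loss(fit_majority, zero_one, toy_data))
```

Cross-validation repeats the same computation over several train/validation splits and averages the results.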
b.2 Theoretical properties of the greedy algorithm
b.2.1 Definitions
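Definition 1
(Discrete derivative, from Krause & Golovin [42]) For a set function $f: 2^V \rightarrow \mathbb{R}$, $S \subseteq V$ and $e \in V$, let $\Delta_f(e \mid S) := f(S \cup \{e\}) - f(S)$ be the discrete derivative of $f$ at $S$ with respect to $e$.
Definition 2
(Submodularity, from Krause & Golovin [42]): A function $f: 2^V \rightarrow \mathbb{R}$ is submodular if for every $A \subseteq B \subseteq V$ and $e \in V \setminus B$ it holds that $\Delta_f(e \mid A) \geq \Delta_f(e \mid B)$.
Definition 3
(Monotonicity, from Krause & Golovin [42]): A function $f: 2^V \rightarrow \mathbb{R}$ is monotone if for every $A \subseteq B \subseteq V$, $f(A) \leq f(B)$.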
b.2.2 Choosing on the test set
In this section we give a proof of Proposition 1 from the main paper:
Proposition 2
Minimizing the test loss of a portfolio on a set of datasets , when choosing a ML pipeline from for based on performance on , is equivalent to the sensor placement problem for minimizing detection time [41].
Following Krause et al. [41], sensor set placement aims at maximizing a so-called penalty reduction , where are intrusion scenarios following a probability distribution with being a specific intrusion. is a sensor placement, a subset of all possible locations where sensors are actually placed. Penalty reduction is defined as the reduction of the penalty when choosing compared to the maximum penalty possible on scenario : . In the simplest case where action is taken upon intrusion detection, the penalty is equal to the detection time (). The detection time of a sensor placement is simply defined as the minimum of the detection times of its individual members: .
In our setting, we need to make the following replacements to see that the problems are equivalent:

Intrusion scenarios : datasets ,

Possible sensor locations : set of candidate ML pipelines of our algorithm ,

Detection time on intrusion scenario : test performance on dataset ,

Detection time of a sensor placement : test loss of applying portfolio on dataset :

Penalty function : loss function , in our case, the penalty is equal to the loss.

Penalty reduction for an intrusion scenario : the penalty reduction for successfully applying a portfolio to dataset : .
b.2.3 Choosing on the validation set
We demonstrate that choosing an ML pipeline from the portfolio via holdout (i.e. a validation set) and reporting its test performance is neither submodular nor monotone by a simple example. To simplify notation we argue in terms of performance instead of penalty reduction, which is equivalent.
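As a hedged illustration, the following sketch checks Definitions 2 and 3 by brute force on a tiny portfolio. The (validation, test) numbers are made up for this purpose and are not the example from the original paper. Selecting by validation performance violates both properties, while an oracle that selects on the test set satisfies both.

```python
from itertools import combinations

# Hypothetical (validation, test) performances of three pipelines; higher is better.
PERF = {"a": (5, 9), "b": (7, 3), "c": (9, 8)}


def select_by_validation(portfolio):
    """Test performance of the member with the best validation performance."""
    if not portfolio:
        return 0
    best = max(portfolio, key=lambda p: PERF[p][0])
    return PERF[best][1]


def select_by_test(portfolio):
    """Oracle variant: test performance of the member that is best on the test set."""
    return max((PERF[p][1] for p in portfolio), default=0)


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def discrete_derivative(f, s, e):
    """Gain of adding element e to set s, i.e. f(s + e) - f(s)."""
    return f(s | {e}) - f(s)


def is_monotone(f, ground):
    """Check f(A) <= f(B) for all A subset of B subset of ground."""
    return all(f(a) <= f(b) for b in subsets(ground) for a in subsets(b))


def is_submodular(f, ground):
    """Check diminishing returns for all A subset of B and e outside B."""
    ground = frozenset(ground)
    return all(
        discrete_derivative(f, a, e) >= discrete_derivative(f, b, e)
        for b in subsets(ground)
        for a in subsets(b)
        for e in ground - b
    )


print(is_monotone(select_by_validation, PERF), is_submodular(select_by_validation, PERF))
print(is_monotone(select_by_test, PERF), is_submodular(select_by_test, PERF))
```

In this toy setting, adding pipeline b to the portfolio {a} lowers the reported test performance from 9 to 3, which breaks monotonicity. Adding c to {a} changes performance by -1, whereas adding c to {a, b} changes it by +5, which breaks submodularity.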
b.2.4 Successive Halving
As in the previous subsection, we use a simple example to demonstrate that selecting an algorithm via the successive halving model selection strategy is neither submodular nor monotone. To simplify notation we argue in terms of performance instead of penalty reduction, which is equivalent.
Let and , where each tuple is a learning curve of validation, test performance tuples. For , we eliminate entries 2 and 3 from in the first iteration of successive halving (while we advance entries 1 and 4), and we eliminate entry 1 from . After the second stage, the performances are and , and the discrete derivatives and which violates Definition 2. The fact that the discrete derivative is negative violates Definition 3 because .
b.2.5 Further equalities
In addition, our problem can also be phrased as a facility location problem [86] and statements about the facility location problem can be applied to our problem setup as well.
Appendix C Implementation Details
c.1 Software
c.2 Configuration Space
We give the configuration space we use in Autosklearn (2.0) in Table VII.
| Name | Domain | Default | Log |
|---|---|---|---|
| Classifier | (Extra Trees, Gradient Boosting, Passive Aggressive, Random Forest, Linear Model (SGD)) | Random Forest | |
| Extra Trees: Bootstrap | (True, False) | False | |
| Extra Trees: Criterion | (Gini, Entropy) | Gini | |
| Extra Trees: Max Features | | 0.5 | No |
| Extra Trees: Min Samples Leaf | | 1 | No |
| Extra Trees: Min Samples Split | | 2 | No |
| Gradient Boosting: Early Stopping | (Off, Train, Valid) | Off | |
| Gradient Boosting: Regularization | | 1e-10 | Yes |
| Gradient Boosting: Learning Rate | | 0.1 | Yes |
| Gradient Boosting: Max Leaf Nodes | | 31 | Yes |
| Gradient Boosting: Min Samples Leaf | | 20 | Yes |
| Gradient Boosting: #Iter No Change | | 10 | No |
| Gradient Boosting: Validation Fraction | | 0.1 | No |
| Passive Aggressive: C | | 1 | Yes |
| Passive Aggressive: Average | (False, True) | False | |
| Passive Aggressive: Loss | (Hinge, Squared Hinge) | Hinge | |
| Passive Aggressive: Tolerance | | 0.0001 | Yes |
| Random Forest: Bootstrap | (True, False) | True | |
| Random Forest: Criterion | (Gini, Entropy) | Gini | |
| Random Forest: Max Features | | 0.5 | No |
| Random Forest: Min Samples Leaf | | 1 | No |
| Random Forest: Min Samples Split | | 2 | No |
| Sgd: | | 0.0001 | Yes |
| Sgd: Average | (False, True) | False | |
| Sgd: | | 0.0001 | Yes |
| Sgd: | | 0.01 | Yes |
| Sgd: Ratio | | 0.15 | Yes |
| Sgd: Learning Rate | (Optimal, Invscaling, Constant) | Invscaling | |
| Sgd: Loss | (Hinge, Log, Modified Huber, Squared Hinge, Perceptron) | Log | |
| Sgd: Penalty | (l1, l2, Elastic Net) | l2 | |
| Sgd: Power t | | 0.5 | No |
| Sgd: Tolerance | | 0.0001 | Yes |
| Balancing | (None, Weighting) | None | |
| Categorical Encoding: Choice | (None, One Hot Encoding) | One Hot Encoding | |
| Category Coalescence: Choice | (Minority Coalescer, No Coalescence) | Minority Coalescer | |
| Minority Coalescer: Minimum percentage samples | | 0.01 | Yes |
| Imputation (numerical only) | (Mean, Median, Most Frequent) | Mean | |
| Rescaling (numerical only) | (Min/Max, None, Normalize, Quantile, Standardize, Robust) | Standardize | |
| Quantile Transformer: N Quantiles | | 1000 | No |
| Quantile Transformer: Output Distribution | (Uniform, Normal) | Uniform | |
| Robust Scaler: Q Max | | 0.75 | No |
| Robust Scaler: Q Min | | 0.25 | No |
c.3 Successive Halving hyperparameters
We used the same hyperparameters for all experiments. First, we set to . Next, we had to choose the minimal and maximal budgets assigned to each algorithm. For the tree-based methods we chose to go from to , while for the linear models (SGD and passive aggressive) we chose as the minimal budget and as the maximal budget. Further tuning these hyperparameters would be an interesting but expensive way forward.
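To show how a reduction factor together with a minimal and a maximal budget determines a successive halving run, the following sketch computes the number of surviving configurations and the budget at each stage. The function name and the values in the example are placeholders chosen for illustration. They are not the settings used in our experiments, and the budget unit (for example trees or iterations) is likewise an assumption.

```python
def successive_halving_schedule(n_configs, min_budget, max_budget, eta):
    """Return a list of (n_configs, budget) pairs, one per stage.

    At each stage only the best 1/eta of the configurations advance,
    and each survivor receives eta times the previous budget, capped
    at max_budget. At least one configuration always survives.
    """
    schedule = []
    n, budget = n_configs, min_budget
    while True:
        schedule.append((n, budget))
        if budget >= max_budget:
            break
        n = max(1, n // eta)
        budget = min(max_budget, budget * eta)
    return schedule


# Placeholder values: 16 configurations, budgets from 32 to 512, eta = 4.
print(successive_halving_schedule(16, 32, 512, 4))
```

With these placeholder values the schedule has three stages: 16 configurations at budget 32, 4 at budget 128, and 1 at budget 512.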
Appendix D Datasets
We give the name, OpenML dataset ID, OpenML task ID and the size of all datasets we used in Table VIII.