Meta-Surrogate Benchmarking for Hyperparameter Optimization

05/30/2019 ∙ by Aaron Klein, et al. ∙ Amazon, University of Freiburg

Despite the recent progress in hyperparameter optimization (HPO), available benchmarks that resemble real-world scenarios consist of a few and very large problem instances that are expensive to solve. This blocks researchers and practitioners not only from systematically running large-scale comparisons that are needed to draw statistically significant results but also from reproducing experiments that were conducted before. This work proposes a method to alleviate these issues by means of a meta-surrogate model for HPO tasks trained on off-line generated data. The model combines a probabilistic encoder with a multi-task model such that it can generate inexpensive and realistic tasks of the class of problems of interest. We demonstrate that benchmarking HPO methods on samples of the generative model allows us to draw more coherent and statistically significant conclusions that can be reached orders of magnitude faster than using the original tasks. We provide evidence of our findings for various HPO methods on a wide class of problems.




1 Introduction

Automated Machine Learning (AutoML) (Hutter et al., 2018) is an emerging field that studies the progressive automation of machine learning. A core part of an AutoML system is the hyperparameter optimization (HPO) of a machine learning algorithm. It has already shown promising results by outperforming human experts in finding better hyperparameters (Snoek et al., 2012), and thereby, for example, substantially improved AlphaGo (Chen et al., 2018).

Despite recent progress (see e. g. the review by Feurer and Hutter (2018)), during the phases of developing and evaluating new HPO methods one frequently faces the following problems:

  • Evaluating the objective function is often expensive in terms of wall-clock time; e.g., the evaluation of a single hyperparameter configuration may take several hours or days. This renders extensive HPO or repeated runs of HPO methods computationally infeasible.

  • Even though repositories of datasets such as OpenML (Vanschoren et al., 2014) provide thousands of datasets, a large fraction cannot meaningfully be used for HPO since they are too small or too easy (in the sense that even simple methods achieve top performance). Hence, useful available datasets are scarce, making it hard to produce a comprehensive evaluation of how well an HPO method will generalize across tasks.

Figure 1: Common pitfalls in the evaluation of HPO methods: we compare two different HPO methods for optimizing the hyperparameters of XGBoost on three UCI regression datasets (see Appendix B for more datasets). The small number of tasks makes it hard to draw any conclusions, since the ranking between the methods varies between the tasks. Furthermore, a full run might take several hours, which makes it prohibitively expensive to average across a large number of runs.

Due to these two problems researchers can only carry out a limited number of comparisons within a reasonable computational budget. This delays the progress of the field, as it may not be possible to draw statistically significant conclusions about the performance of different HPO methods. See Figure 1 for an illustrative experiment of the HPO of XGBoost (Chen and Guestrin, 2016). It is well known that Bayesian optimization with Gaussian processes (BO-GP) (Shahriari et al., 2016) outperforms naive random search (RS) in terms of number of function evaluations on most HPO problems. While we show clear evidence for this in Appendix B on a larger set of datasets, this conclusion cannot be reached when optimizing on the three "unluckily" picked datasets in Figure 1. Surprisingly, the community has not paid much attention to this issue of proper benchmarking, which is a key step required not only to generate new scientific knowledge but also to foster reproducibility.

Figure 2: The three blue bars on the left show the total wall-clock time of executing 20 independent runs of BO-GP, RS and BOHAMIANN (see Section 5) with 100 function evaluations each for the HPO of a feed forward neural network on MNIST. The orange bars show the same for optimizing a task sampled from our proposed meta-model, where benchmarking is orders of magnitude cheaper in terms of wall-clock time than the original benchmarks; the computational time is then almost exclusively spent on the optimizer overhead (hence the larger bars for BO-GP and BOHAMIANN compared to RS).

In this work we present a generative meta-model that, conditioned on off-line generated data, allows us to sample an unlimited number of new tasks that share properties with the original ones. There are several advantages to this approach. First, the new problem instances are inexpensive to evaluate as they are generated in a parametric form, which drastically reduces the resources needed to compare HPO methods, bounded only by the optimizer’s computational overhead. See Figure 2 for an example. Second, there is no limit on the number of tasks that can be generated, which helps to draw statistically more reliable conclusions. Third, the shape and properties of the tasks are not predefined but learned using a few real tasks of an HPO problem. While the global properties of the initial tasks are preserved in the samples, the generative model allows the exploration of instances with diverse local properties, making comparisons more robust and reliable (see Appendix D for some example tasks).

In light of the recent call for more reproducibility, we are convinced that our meta-surrogate benchmarks enable more reproducible research in AutoML: First of all, these cheap-to-evaluate surrogate benchmarks allow researchers to reproduce experiments or perform many repeats of their own experiments without relying on tremendous computational resources. Second, based on our proposed method, we provide a more thorough benchmarking protocol that reduces the risk of extensively tuning an optimization method on single tasks. Third, surrogate benchmarks in general are less dependent on hardware and technical details, such as complicated training routines or preprocessing strategies.

2 Related Work

The use of meta-models that learn across tasks has been investigated by others before. To warm-start HPO on new tasks from previously optimized tasks, Swersky et al. (2013) extended Bayesian optimization to the multi-task setting by using a Gaussian process that also models the correlation between tasks. Instead of a Gaussian process, Springenberg et al. (2016) used a Bayesian neural network inside multi-task Bayesian optimization which learns an embedding of tasks during optimization. Similarly, Perrone et al. (2018) used Bayesian linear regression, where the basis functions are learned by a neural network, to warm-start the optimization from previous tasks.

Feurer et al. (2015b) used a set of dataset statistics as meta-features to measure the similarity between tasks, such that hyperparameter configurations that were superior on previously optimized similar tasks can be evaluated during the initial design before the actual optimization procedure starts. This technique is also applied inside the auto-sklearn framework (Feurer et al., 2015a). In a similar vein, Fusi et al. (2018) proposed to use a probabilistic matrix factorization approach to exploit knowledge gathered on previously seen tasks. van Rijn and Hutter (2018) evaluated random hyperparameter configurations on a large range of tasks to learn priors for a support vector machine, random forest and Adaboost. The idea of using a latent variable to represent correlation among multiple outputs of a Gaussian process has been exploited by Dai et al. (2017).

In the context of benchmarking HPO methods, HPOlib (Eggensperger et al., 2013) is a benchmarking library that provides a fixed and rather small set of problems that have been used to compare several Bayesian optimization tools. In earlier work, Eggensperger et al. (2015) also used surrogates to speed up the empirical benchmarking of HPO methods. Similar to our work, these surrogates are trained on data generated in an off-line step. Afterwards, function evaluations require only prediction of the surrogate model instead of actually running the benchmark. However, these surrogates only mimic one particular task and do not allow for generating new tasks as presented in this work. Recently, tabular benchmarks were introduced for neural architecture search (Ying et al., 2019) and hyperparameter optimization (Klein and Hutter, 2019), which first perform an exhaustive search of a discrete benchmark problem to store all results in a database and then replace expensive function evaluations by efficient table lookups. While this does not introduce any bias due to a model (see Section 6 for a more detailed discussion), tabular benchmarks are only applicable for problems with few, discrete hyperparameters. Related to our work, but for benchmarking general blackbox optimization methods, is the COCO platform (Hansen et al., 2016b). However, compared to our approach, it is based on handcrafted synthetic functions that do not resemble real world HPO problems.

3 Benchmarking HPO Methods with Generative Models

We now describe the generative meta-model to create HPO tasks. First we give a formal definition of benchmarking HPO methods across tasks sampled from an unknown distribution and then describe how we can approximate this distribution with our newly proposed meta-model.

3.1 Problem Definition

We denote by $t_1, \dots, t_M$ a set of related objectives/tasks with the same input domain $\mathcal{X}$. We assume that each $t_i$, for $i = 1, \dots, M$, is an instantiation of an unknown distribution of tasks $t_i \sim p(t)$. Every task $t$ has an associated objective function $f_t : \mathcal{X} \rightarrow \mathbb{R}$, where $x \in \mathcal{X}$ represents a hyperparameter configuration, and we assume that we can observe $f_t$ only through noise: $y_t \sim \mathcal{N}(f_t(x), \sigma_t^2)$.

Let us denote by $r(\alpha, t)$ the performance of an optimization method $\alpha$ on a task $t$; for instance, a common example for $r$ is the regret of the best observed solution (called incumbent). To compare two different methods $\alpha_A$ and $\alpha_B$, the standard practice is to compare $r(\alpha_A, t_i)$ with $r(\alpha_B, t_i)$ on a set of hand-picked tasks $t_i \in \{t_1, \dots, t_M\}$. However, to draw statistically more significant conclusions, we would ideally like to integrate over all tasks:

$$S_{p(t)}(\alpha) = \int r(\alpha, t)\, p(t)\, dt \qquad (1)$$

Unfortunately, the above integral is intractable as $p(t)$ is unknown. The main contribution of this paper is to approximate $p(t)$ with a generative meta-model $\hat{p}(t \mid \mathcal{D})$ based on some off-line generated data $\mathcal{D}$. This enables us to sample an arbitrary number of tasks $t_i \sim \hat{p}(t \mid \mathcal{D})$ in order to perform a Monte-Carlo approximation of Equation 1.
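The Monte-Carlo approximation of Equation 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_task` is a hypothetical stand-in for drawing a task from the meta-model (here a toy family of 1-d quadratics), and `random_search_regret` is a toy performance measure.

```python
import random

def sample_task(rng):
    """Hypothetical stand-in for sampling a task: a 1-d quadratic with a random optimum."""
    optimum = rng.uniform(0.0, 1.0)
    # The minimum value is 0, so f(x) is also the regret at x.
    return lambda x: (x - optimum) ** 2

def random_search_regret(task, rng, budget=20):
    """Toy performance measure: regret of the incumbent found by random search."""
    return min(task(rng.uniform(0.0, 1.0)) for _ in range(budget))

def monte_carlo_score(method, n_tasks, seed=0):
    """Approximate the expected performance over tasks by a sample average."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        task = sample_task(rng)
        total += method(task, rng)
    return total / n_tasks

score = monte_carlo_score(random_search_regret, n_tasks=1000)
print(f"estimated average regret: {score:.5f}")
```

Because sampled tasks are cheap to evaluate, the number of tasks in the average can be made large enough that the estimate is dominated by the method's behaviour rather than by the choice of tasks.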

3.2 Meta-Model for Task Generation

In order to reason across tasks, we define a probabilistic encoder $p(h_t \mid \mathcal{D})$ that learns a latent representation $h_t \in \mathbb{R}^Q$ of a task $t$.

More precisely, we use Bayesian GP-LVM (Titsias and Lawrence, 2010), which assumes that the target values that belong to the task $t$, stacked into a vector $y_t$, follow the generative process:

$$y_t = g(h_t) + \epsilon, \qquad g \sim \mathcal{GP}(0, k), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) \qquad (2)$$

where $k$ is the covariance function of the GP. By assuming that the latent variable $h_t$ has an uninformative prior $h_t \sim \mathcal{N}(0, I)$, the latent embedding of each task is inferred as the posterior distribution $p(h_t \mid \mathcal{D})$. The exact formulation of the posterior distribution is intractable, but following the variational inference presented in Titsias and Lawrence (2010), we can estimate a variational posterior distribution $q(h_t) = \mathcal{N}(m_t, \Sigma_t) \approx p(h_t \mid \mathcal{D})$ for each task $t$.

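Similar to Multi-Task Bayesian Optimization (Swersky et al., 2013; Springenberg et al., 2016), we define a probabilistic model for the objective function $p(y_t \mid x, h_t)$ across tasks, which gets as an additional input a task embedding based on our independently trained probabilistic encoder. Following Springenberg et al. (2016), we use a Bayesian neural network with $M$ weight vectors $\{\theta_1, \dots, \theta_M\}$ to model

$$p(y_t \mid x, h_t, \mathcal{D}) = \int p(y_t \mid x, h_t, \theta)\, p(\theta \mid \mathcal{D})\, d\theta \approx \frac{1}{M} \sum_{i=1}^{M} p(y_t \mid x, h_t, \theta_i) \qquad (3)$$

where $\theta_i \sim p(\theta \mid \mathcal{D})$ is sampled from the posterior of the neural network weights.

By approximating $p(y_t \mid x, h_t, \mathcal{D}) = \mathcal{N}(\mu(x, h_t), \sigma^2(x, h_t))$ to be Gaussian, we can compute the predictive mean and variance by (Springenberg et al., 2016):

$$\mu(x, h_t) = \frac{1}{M} \sum_{i=1}^{M} \hat{\mu}(x, h_t \mid \theta_i), \qquad \sigma^2(x, h_t) = \frac{1}{M} \sum_{i=1}^{M} \left[ \left( \hat{\mu}(x, h_t \mid \theta_i) - \mu(x, h_t) \right)^2 + \hat{\sigma}^2_{\theta_i} \right]$$

where $\hat{\mu}(x, h_t \mid \theta_i)$ and $\hat{\sigma}^2_{\theta_i}$ are the output of a single neural network with parameters $\theta_i$. (Note that we model homoscedastic noise; because of that, $\hat{\sigma}^2_{\theta_i}$ does not depend on the input.) To get a set of weights $\theta_1, \dots, \theta_M$, we use stochastic gradient Hamiltonian Monte-Carlo (Chen et al., 2014) to sample $\theta_i \sim p(\theta \mid \mathcal{D})$ from:

$$\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{H} \frac{1}{|\mathcal{D}|} \sum_{n=1}^{N} \sum_{j=1}^{H} \log p(y_n \mid x_n, h_{nj}, \theta)$$

with $|\mathcal{D}| = N$ and $H$ the number of samples we draw from the latent space $h_{nj} \sim q(h_{t_n})$.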
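As a small worked example of the Gaussian approximation, the sketch below combines the outputs of several weight samples into a single predictive mean and variance. It assumes the usual moment matching for an equally weighted mixture of Gaussians; the function name and the example numbers are hypothetical.

```python
def predictive_moments(means, noise_vars):
    """Combine per-sample network outputs into one Gaussian approximation.

    means[i]      -- mean predicted by the network with the i-th weight sample
    noise_vars[i] -- homoscedastic noise variance of that network
    """
    m = len(means)
    mu = sum(means) / m
    # Variance = spread of the sampled means plus the average noise variance.
    var = sum((mu_i - mu) ** 2 + s2_i for mu_i, s2_i in zip(means, noise_vars)) / m
    return mu, var

mu, var = predictive_moments([0.9, 1.0, 1.1], [0.01, 0.01, 0.01])
print(mu, var)
```

The first term of the variance captures disagreement between weight samples (model uncertainty), the second the observation noise.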

3.3 Sampling New Tasks

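In order to generate a new task $t_\star \sim \hat{p}(t \mid \mathcal{D})$, we need the associated objective function $f_{t_\star}$ in a parametric form such that we can evaluate it later on any $x \in \mathcal{X}$.

Given the meta-model above, we perform the following steps: (i) we sample a new latent task vector $h_{t_\star} \sim q(h_t)$; (ii) given $h_{t_\star}$, we pick a random $\theta_i$ from the set of weights $\{\theta_1, \dots, \theta_M\}$ of our Bayesian neural network and set the new task to be $f_{t_\star}(x) = \hat{\mu}(x, h_{t_\star} \mid \theta_i)$.

Note that using $f_{t_\star}(x)$ makes our new task unrealistically smooth. Instead, we can emulate the typical noise appearing in HPO benchmarks by returning $y_{t_\star}(x) \sim \mathcal{N}(\hat{\mu}(x, h_{t_\star} \mid \theta_i), \hat{\sigma}^2_{\theta_i})$, which can be done at an insignificant cost.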
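The two sampling steps can be sketched as below. This is only an illustration under assumptions: `network_mean` is a hypothetical stand-in for the mean output of one network sample, the latent distribution is taken to be a diagonal Gaussian, and the noise level is a fixed constant.

```python
import math
import random

def network_mean(x, h, theta):
    """Hypothetical stand-in for the mean output of one network sample."""
    return theta[0] * math.sin(3.0 * x + h[0]) + theta[1] * h[1] * x

def sample_task(weight_samples, latent_mean, latent_std, noise_std, rng):
    """Return (noiseless, noisy) objective functions for a newly sampled task."""
    # (i) sample a latent task vector from a diagonal Gaussian
    h = [rng.gauss(m, s) for m, s in zip(latent_mean, latent_std)]
    # (ii) pick one weight sample at random and keep it fixed for this task
    theta = rng.choice(weight_samples)

    def noiseless(x):
        return network_mean(x, h, theta)

    def noisy(x):
        # emulate observation noise by perturbing the mean prediction
        return noiseless(x) + rng.gauss(0.0, noise_std)

    return noiseless, noisy

rng = random.Random(0)
weights = [(1.0, 0.5), (0.8, -0.3), (1.2, 0.1)]
f, f_noisy = sample_task(weights, [0.0, 0.0], [1.0, 1.0], 0.05, rng)
print(f(0.3), f_noisy(0.3))
```

Once `h` and `theta` are fixed, the task is an ordinary deterministic function of `x`, so evaluating it costs a single forward pass.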

4 Profet

Figure 3: Latent space representations of two different problem classes. Left: Representation of eleven pairs of datasets generated by partitioning eleven datasets from the fully connected networks benchmark detailed in Section 4.1. Pairs of tasks are represented with the same colour. The means of the tasks are represented with different markers. The ellipses represent 4 standard deviations around the mean of the tasks. Right: Latent space learned for a model where the input tasks are generated by training a support vector machine on subsets of a target dataset (approximated by a random forest surrogate from Klein et al. (2017a)). One can see that our probabilistic encoder learns a meaningful embedding of different tasks.

We now present our probabilistic data-efficient experimentation tool, called Profet, a benchmarking suite for HPO methods. The following section first describes how we collected the data to train our meta-model based on three typical HPO problem classes. We then explain how we generated different tasks for each problem class from our meta-model. As described above, we provide a noisy and a noiseless version of each task. Last, we discuss two ways that are commonly used in the literature to assess and aggregate the performance of HPO methods across tasks. To reproduce our experiments, as well as to benchmark and develop new HPO methods, an open-source implementation of Profet is available here:

4.1 Data Collection

We consider three different HPO problems, two for classification and one for regression, with varying input dimensions. For classification we considered a support vector machine (SVM) and a feed forward neural network (FC-Net) on 16 OpenML (Vanschoren et al., 2014) tasks each. We used gradient boosting (XGBoost; we used the implementation from Chen and Guestrin (2016)) for regression on 11 different UCI datasets (Lichman, 2013). For further details about the datasets and the configuration spaces see Appendix A. To make sure that our meta-model learns a descriptive representation, we need solid coverage of the whole input space. For that we drew pseudo-randomly generated configurations from a Sobol grid (Sobol, 1967).

Details of our meta-model are described in Appendix F. We show some qualitative examples of our probabilistic encoder in Section 5.1. We can also apply the same machinery to model the cost in terms of computation time for evaluating a hyperparameter configuration to use time rather than function evaluations as budget. This enables future work to benchmark or develop HPO methods that explicitly take the cost into account (e. g. EIperSec by Snoek et al. (2012)).

4.2 Performance Assessment

To assess the performance of an HPO method aggregated over tasks, we consider two different ways commonly used in the literature. First, we measure the runtime that an HPO method needs to find a configuration that achieves a performance that is equal to or lower than a certain target value on a given task (Hansen et al., 2016a). Here we define runtime either in terms of function evaluations or estimated wall-clock time predicted by our meta-model. Using a fixed target approach allows us to make quantitative statements, such as: method A is, on average, twice as fast as method B. See Hansen et al. (2016a) for a more detailed discussion. We average across target values of different complexity by evaluating the Sobol grid from above on each generated task. We use the corresponding function values as targets, which, by the same argument as described in Section 4.1, provides good coverage of the error surface. To aggregate the runtime we use the empirical cumulative distribution function (ECDF) (Moré and Wild, 2009), which, intuitively, shows for each budget on the x-axis the fraction of solved task and target pairs on the y-axis (see Figure 5 left for an example).
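The ECDF over runtimes can be computed in a few lines. The sketch below is illustrative only; the function name, the convention of using `None` for unreached targets, and the example data are assumptions.

```python
def ecdf(runtimes, budgets):
    """Fraction of (task, target) pairs solved within each budget.

    runtimes -- one entry per (task, target) pair: the number of evaluations
                needed to reach the target, or None if it was never reached
    """
    n = len(runtimes)
    return [
        sum(1 for r in runtimes if r is not None and r <= b) / n
        for b in budgets
    ]

runtimes = [3, 10, None, 7, 25, 10]
curve = ecdf(runtimes, budgets=[1, 5, 10, 50])
print(curve)
```

Pairs whose target is never reached keep the curve below 1, which is why an ECDF can distinguish a method that is fast on easy targets from one that eventually solves hard ones.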

Another common way to compare different HPO methods is to compute the average ranking score in every iteration and for every task (Bardenet et al., 2013). We follow the procedure described by Feurer et al. (2015b) and compute the average ranking score as follows: assuming we run $K$ different HPO methods $R$ times for each task, we draw a bootstrap sample of 1000 runs out of the $R^K$ possible combinations. For each of these samples, we compute the average fractional ranking (ties are broken by the average of the ordinal ranks) after each iteration. At the end, all the assigned ranks are further averaged over all tasks. Note that averaged ranks are a relative performance measurement and can worsen for one method if another method improves (see Figure 5 right for an example).
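The bootstrap ranking for a single task and a single iteration can be sketched as follows. This is a simplified illustration rather than the exact procedure of Feurer et al. (2015b); the function names and the data layout are assumptions.

```python
import random

def fractional_ranks(values):
    """Rank values (lower is better); tied values share the average of their ordinal ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0  # average of ordinal ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def average_rank(errors, n_bootstrap=1000, seed=0):
    """errors[k][r] is the error of method k in run r at a fixed iteration."""
    rng = random.Random(seed)
    totals = [0.0] * len(errors)
    for _ in range(n_bootstrap):
        # draw one run per method, i.e. one of the possible combinations
        sample = [rng.choice(runs) for runs in errors]
        for k, rank in enumerate(fractional_ranks(sample)):
            totals[k] += rank
    return [t / n_bootstrap for t in totals]

print(average_rank([[0.10, 0.12, 0.11], [0.30, 0.09, 0.25]]))
```

Repeating this for every iteration and averaging the result over tasks gives ranking curves of the kind described above.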

5 Experiments

In this section we present: (i) some qualitative insights into our meta-model by showing how it is able to coherently represent a set of tasks in its latent space, (ii) an illustration of why Profet helps to obtain statistically meaningful results and (iii) a comparison of various methods from the literature on our new benchmark suite. In particular, we show results for the following state-of-the-art Bayesian optimization (BO) methods for HPO as well as two popular evolutionary algorithms for general continuous black-box optimization:

  • BO with Gaussian processes (BO-GP) (Jones et al., 1998). We used expected improvement as acquisition function and marginalize over the Gaussian process’ hyperparameters as described by Snoek et al. (2012).

  • SMAC (Hutter et al., 2011), a variant of BO that uses random forests to model the objective function and stochastic local search to optimize expected improvement.
    We use the implementation from

  • The BO method TPE by Bergstra et al. (2011), which models the density of good and bad configurations in the input space with kernel density estimators. We used the implementation provided by the Hyperopt package (Komer et al., 2014).

  • BO with Bayesian neural networks (BOHAMIANN) as described by Springenberg et al. (2016). To avoid introducing any bias, we used a different architecture with fewer parameters (3 layers, 50 units in each) than we used for our meta-model (see Section 3).

  • Differential Evolution (DE) (Storn and Price, 1997) (we used our own implementation) with rand1 strategy for the mutation operators and a population size of 10.

  • Covariance Matrix Adaption Evolution Strategy (CMA-ES) by Hansen (2006) where we used the implementation from

  • Random Search (RS) (Bergstra and Bengio, 2012) which samples configurations uniformly at random.

For BO-GP, BOHAMIANN and RS we used the implementation provided by the RoBO package (Klein et al., 2017b). We provide more details for every method in Appendix E.
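Several of the BO methods above select the next configuration by maximizing expected improvement. For a Gaussian prediction this acquisition function has a well-known closed form; the sketch below is a generic illustration for minimization, and the function name and arguments are hypothetical rather than taken from any of the packages mentioned.

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected improvement for minimization under a Gaussian prediction N(mu, sigma^2).

    best -- lowest function value observed so far (the incumbent's value)
    """
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf

print(expected_improvement(mu=0.2, sigma=0.1, best=0.25))
```

The first term rewards configurations predicted to beat the incumbent, the second rewards predictive uncertainty, which is what trades off exploitation against exploration.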

5.1 Tasks Representation in the Latent Space

We demonstrate the interpretability of the learned latent representations of tasks in two examples. For the first experiment we used the fully connected network benchmark described in Section 4.1. To visualize that our meta-model learns a meaningful latent space, we doubled 11 out of the 18 original tasks to train the model by splitting each one of them randomly in two of the same size. Thereby, we guarantee that there are pairs of tasks that are similar to each other. In Figure 3 (left), each color represents the partition of the original task and each ellipse represents the mean and four times the standard deviation of the latent task representations. One can see that the closest neighbour of each task is the other task that belongs to the same original task.

The second experiment targets multi-fidelity experiments that arise when a machine learning model needs to be trained on a very large dataset and approximate versions of the target objective are generated by considering subsamples of different sizes. For this experiment we used the SVM surrogate for different dataset subsets from Klein et al. (2017a). The surrogate consists of a random forest trained on a grid of hyperparameter configurations of a SVM evaluated on different subsets of the training data. In particular, we defined the following subsets: as tasks and sampled 100 configurations per task to train our meta-model. Note that we only provide the observed targets and not the subset size to our model. Figure 3 (right) shows the latent space of the trained meta-model: the latent representation of the model captures that similar data subsets are also close in the latent space. In particular, the first latent dimension coherently captures the sample size, which is learned using exclusively the correlation between the datasets and with no further information about their size.

5.2 Benchmarking with Profet

Figure 4: Heatmaps of the p-values of the pairwise comparisons across the methods in the three scenarios using a Mann-Whitney U test. Small p-values should be interpreted as the test finding evidence that the method in the column improves over the method in the row. Using tasks from our meta-model instead leads to results that are very close to using the large set of original tasks from the original distribution. Left: results with 1000 real tasks. Middle: subset of only 9 real tasks. Right: results with 1000 tasks generated from our meta-model.

Comparing HPO methods using a small number of instances affects our ability to properly perform statistical tests. To illustrate this we consider a distribution of tasks that are variations of the Forrester function $f(x) = (\alpha x - 2)^2 \sin(\beta x - 4)$ for parameters $\alpha$ and $\beta$. We generated 1000 tasks by uniformly sampling random $\alpha$ and $\beta$ in $[0, 1]$ and compared six HPO methods: RS, DE, TPE, SMAC, BOHAMIANN and BO-GP (we left CMA-ES out because the Python version does not support 1-dimensional optimization problems).
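A family of Forrester-like tasks can be generated with a few lines of code. The sketch below assumes the parametrization $f(x) = (\alpha x - 2)^2 \sin(\beta x - 4)$, which reduces to the standard Forrester function for $\alpha = 6$ and $\beta = 12$, and assumes both parameters are drawn uniformly from $[0, 1]$; the helper name is hypothetical.

```python
import math
import random

def make_forrester_task(alpha, beta):
    """One task of the assumed family f(x) = (alpha*x - 2)^2 * sin(beta*x - 4)."""
    return lambda x: (alpha * x - 2.0) ** 2 * math.sin(beta * x - 4.0)

rng = random.Random(0)
tasks = [
    make_forrester_task(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
    for _ in range(1000)
]
print(tasks[0](0.5))
```

Each element of `tasks` is an independent 1-dimensional objective, so an HPO method can be run on all of them in a loop.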

Figure 4 (left) shows the p-values of all pairwise comparisons with the null hypothesis ‘the method in the column achieves a higher error after 50 function evaluations, averaged over 20 runs, than the method in the row’ for the Mann-Whitney U test. Squares in the figure with a p-value smaller than 0.05 are comparisons in which, with 95% confidence, we have evidence that the method in the column is better than the method in the row (we have evidence to reject the null hypothesis). To reproduce a realistic setting where one has access to only a small set of tasks, we picked 9 out of the 1000 tasks randomly. Now, in order to acquire a comparable number of samples to perform a statistical test, we performed 2220 runs of each method on every task, and then computed the average of groups of 20 runs, such that we obtained 999 samples per method to compute the statistical test. One can see in Figure 4 (middle) that, although the results are statistically significant, they are misleading: for example, BOHAMIANN is dominating all other methods (except BO-GP), whereas it is significantly worse than all other methods if we consider all 1000 tasks.
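For readers unfamiliar with the test, a one-sided Mann-Whitney U test can be sketched as follows. This is a simplified version that uses the normal approximation without tie correction, so it is meant as an illustration and not as a replacement for a statistics library; the function name is hypothetical.

```python
import math

def mann_whitney_u_less(x, y):
    """One-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U, p); a small p is evidence that values in x tend to be
    smaller than values in y.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs with x_i > y_j; ties count one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean = n1 * n2 / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / std
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z <= z)
    return u, p

errors_column = [0.10 + 0.01 * i for i in range(20)]  # hypothetical errors of one method
errors_row = [0.50 + 0.01 * i for i in range(20)]     # hypothetical errors of another
u, p = mann_whitney_u_less(errors_column, errors_row)
print(u, p)
```

With the errors of the column method passed as `x` and those of the row method as `y`, a p-value below 0.05 corresponds to one of the significant squares in the heatmap.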

To solve this issue and obtain more information from the same limited subset of 9 tasks, we use Profet. We first train the meta-model on the same 9 selected tasks and then use it to generate 1000 new surrogate tasks (see Appendix C for a visualization). Next, we use these tasks to run the comparison of the HPO methods. Results are shown in Figure 4 (right). The heatmap of statistical comparisons reaches very similar conclusions to those obtained with the original 1000 tasks, contrary to what happened when we did the comparisons with 9 tasks only (i. e. p-values are closer to the original ones). We conclude that using samples from the meta-model (generated based on a subset of tasks) allows us to draw conclusions that are more in line with experiments on the full set of tasks than running directly on the subset of tasks.

5.3 Comparing State-of-the-art HPO Methods

Figure 5: Comparison of various HPO methods on 1000 tasks of the noiseless SVM benchmark. See Appendix D for the results on all benchmark problems. Left: the ECDF for the runtime. Right: the ranking of each method averaged across all tasks.

We conducted 20 independent runs for each method on each task of all three problem classes described in Section 4.1 with a different random seed. Each method had a budget of 200 function evaluations per task, except for BO-GP and BOHAMIANN, where, due to their computational overhead, we were only able to perform 100 function evaluations. Note that conducting this kind of comparison on the original benchmarks would have been prohibitively expensive. In Figure 5 we show the ECDF curves and the average ranking for the noiseless version of the SVM benchmark. The results for all other benchmarks are shown in Appendix E. We can make the following observations:

  • Given enough budget, all methods are able to outperform RS. BO approaches can exploit their internal model such that they start to outperform RS earlier than evolutionary algorithms (DE, CMA-ES). Thereby, more sophisticated models, such as Gaussian processes or Bayesian neural networks are more sample efficient than somewhat simpler methods, e. g. random forests or kernel density estimators.

  • The performance of BO methods that model the objective function (BO-GP, BOHAMIANN, SMAC) instead of just the distribution of the input space (TPE) decays if we evaluate the function through noise. Also evolutionary algorithms struggle with noise.

  • Standard BO (BO-GP) performs best on these benchmarks, but its performance decays rapidly with the number of dimensions.

  • The runner-up is BOHAMIANN, which works slightly worse than BO-GP but seems to suffer less from noisy function values. Note that this result can only be achieved by using Profet, as we could not have evaluated with and without noise on the original datasets.

  • Given a sufficient budget, DE starts to outperform CMA-ES as well as BO with simpler (and cheaper) models of the objective function (SMAC, TPE), making it a competitive baseline particularly for higher dimensional benchmarks.

6 Discussion and Future Work

We presented Profet, a new tool for benchmarking HPO algorithms. The key idea is to use a generative meta-model, trained on offline generated data, to produce new tasks, possibly perturbed by noise. The new tasks retain the properties of the original ones but can be evaluated inexpensively, which represents a major advance to speed up comparisons of HPO methods. In a battery of experiments we have illustrated the representation power of Profet and its utility when comparing HPO methods in families of problems where only a few tasks are available. While in this work we have focused on HPO methods, the same idea can be generalized to other optimization problems.

Besides these strong benefits, there are certain drawbacks of our proposed method that we would like to explicitly mention. First, since we encode new tasks based on a machine learning model, our approach is based on the assumptions that come with this surrogate model. Second, while we show in Section 5 empirical evidence that conclusions based on our proposed method are virtually identical to the ones based on the original tasks, there are no theoretical guarantees that results translate one-to-one to the original benchmarks. Nevertheless, we believe that Profet sets the ground for further research in this direction to provide much more realistic use-cases than commonly used synthetic functions, e. g. Branin, such that future work on HPO can rapidly perform reliable experiments during development and only execute the final evaluation on expensive real benchmarks. Ultimately, we think this is an important step towards more reproducibility, which is paramount in such an empirically driven field as AutoML.

A possible extension of Profet would be to consider multi-fidelity benchmarks (Klein et al., 2017a; Kandasamy et al., 2017; Klein et al., 2017c) where cheap, but approximate fidelities of the objective function are available, e. g. learning curves or dataset subsets. Furthermore, since Profet also provides gradient information it could serve as a training distribution for learning-to-learn approaches (Chen et al., 2017; Volpp et al., 2019).


  • Bardenet et al. (2013) Bardenet, R., Brendel, M., Kégl, B., and Sebag, M. (2013). Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning (ICML’13).
  • Bergstra et al. (2011) Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NIPS’11).
  • Bergstra and Bengio (2012) Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research.
  • Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML’14).
  • Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
  • Chen et al. (2017) Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. (2017). Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning (ICML’17).
  • Chen et al. (2018) Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrittwieser, J., Silver, D., and de Freitas, N. (2018). Bayesian optimization in AlphaGo. arXiv:1812.06855 [cs.LG].
  • Dai et al. (2017) Dai, Z., Álvarez, M. A., and Lawrence, N. (2017). Efficient modeling of latent information in supervised learning using Gaussian processes. In Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NIPS’17).
  • Eggensperger et al. (2013) Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., and Leyton-Brown, K. (2013). Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization (BayesOpt’13).
  • Eggensperger et al. (2015) Eggensperger, K., Hutter, F., Hoos, H., and Leyton-Brown, K. (2015). Efficient benchmarking of hyperparameter optimizers via surrogates. In Proceedings of the 29th National Conference on Artificial Intelligence (AAAI’15).
  • Feurer and Hutter (2018) Feurer, M. and Hutter, F. (2018). Hyperparameter optimization. In Automatic Machine Learning: Methods, Systems, Challenges. Springer.
  • Feurer et al. (2015a) Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015a). Efficient and robust automated machine learning. In Proceedings of the 28th International Conference on Advances in Neural Information Processing Systems (NIPS’15).
  • Feurer et al. (2015b) Feurer, M., Springenberg, T., and Hutter, F. (2015b). Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the 29th National Conference on Artificial Intelligence (AAAI’15).
  • Fusi et al. (2018) Fusi, N., Sheth, R., and Elibol, M. (2018). Probabilistic matrix factorization for automated machine learning. In Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NIPS’18).
  • GPy (2012) GPy (since 2012). GPy: A Gaussian process framework in Python.
  • Hansen (2006) Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation. Advances on estimation of distribution algorithms. Springer Berlin Heidelberg.
  • Hansen et al. (2016a) Hansen, N., Auger, A., Brockhoff, D., Tusar, D., and Tusar, T. (2016a). COCO: performance assessment. arXiv:1605.03560 [cs.NE].
  • Hansen et al. (2016b) Hansen, N., Auger, A., Mersmann, O., Tušar, T., and Brockhoff, D. (2016b). COCO: A platform for comparing continuous optimizers in a black-box setting. arXiv:1603.08785 [cs.AI].
  • Hutter et al. (2011) Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION’11).
  • Hutter et al. (2018) Hutter, F., Kotthoff, L., and Vanschoren, J., editors (2018). Automatic Machine Learning: Methods, Systems, Challenges. Springer.
  • Jones et al. (1998) Jones, D., Schonlau, M., and Welch, W. (1998). Efficient global optimization of expensive black box functions. Journal of Global Optimization.
  • Kandasamy et al. (2017) Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. (2017). Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning (ICML’17).
  • Klein et al. (2017a) Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. (2017a). Fast Bayesian hyperparameter optimization on large datasets. Electronic Journal of Statistics.
  • Klein et al. (2017b) Klein, A., Falkner, S., Mansur, N., and Hutter, F. (2017b). RoBO: A flexible and robust Bayesian optimization framework in Python. In NIPS Workshop on Bayesian Optimization (BayesOpt’17).
  • Klein et al. (2017c) Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. (2017c). Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations (ICLR’17).
  • Klein and Hutter (2019) Klein, A. and Hutter, F. (2019). Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv:1905.04970 [cs.LG].
  • Komer et al. (2014) Komer, B., Bergstra, J., and Eliasmith, C. (2014). Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn. In ICML 2014 AutoML Workshop.
  • Lichman (2013) Lichman, M. (2013). UCI machine learning repository.
  • Melis et al. (2018) Melis, G., Dyer, C., and Blunsom, P. (2018). On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR’18).
  • Moré and Wild (2009) Moré, J. J. and Wild, S. M. (2009). Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization.
  • Perrone et al. (2018) Perrone, V., Jenatton, R., Seeger, M., and Archambeau, C. (2018). Scalable hyperparameter transfer learning. In Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NIPS’18).
  • Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R., and de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE.
  • Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NIPS’12).
  • Sobol (1967) Sobol, I. M. (1967). Distribution of points in a cube and approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics.
  • Springenberg et al. (2016) Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. (2016). Bayesian optimization with robust Bayesian neural networks. In Proceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NIPS’16).
  • Storn and Price (1997) Storn, R. and Price, K. (1997). Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization.
  • Swersky et al. (2013) Swersky, K., Snoek, J., and Adams, R. (2013). Multi-task Bayesian optimization. In Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems (NIPS’13).
  • Titsias and Lawrence (2010) Titsias, M. and Lawrence, N. (2010). Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS’10).
  • van Rijn and Hutter (2018) van Rijn, J. and Hutter, F. (2018). Hyperparameter importance across datasets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18).
  • Vanschoren et al. (2014) Vanschoren, J., van Rijn, J., Bischl, B., and Torgo, L. (2014). OpenML: Networked science in machine learning. SIGKDD Explorations.
  • Volpp et al. (2019) Volpp, M., Fröhlich, L., Doerr, A., Hutter, F., and Daniel, C. (2019). Meta-learning acquisition functions for Bayesian optimization. arXiv:1904.02642 [stat.ML].
  • Ying et al. (2019) Ying, C., Klein, A., Real, E., Christiansen, E., Murphy, K., and Hutter, F. (2019). NAS-Bench-101: Towards reproducible neural architecture search. arXiv:1902.09635 [cs.LG].

Appendix A Classification Benchmarks

In Table 1 we list all OpenML datasets that we used to generate the Meta-SVM and Meta-FCNet benchmarks, and in Table 2 the UCI datasets that we used for the Meta-XGBoost benchmark. The ranges of the hyperparameters for all benchmarks are given in Table 3. Figure 6 shows the empirical cumulative distribution over the observed target values based on the Sobol grid for all tasks.

Name               OpenML Task ID   Number of features   Number of datapoints
kr-vs-kp           3                37                   3196
covertype          2118             55                   110393
letter             236              17                   20000
higgs              75101            29                   98050
optdigits          258              65                   5620
electricity        336              9                    45312
magic telescope    75112            12                   19020
nomao              146595           119                  34465
gas-drift          146590           129                  13910
mfeat-pixel        250              241                  2000
car                251              7                    1728
churn              167079           101                  1212
dna                167202           181                  3186
vehicle small      283              19                   846
vehicle            75191            101                  98528
MNIST              3573             785                  50000
Table 1: OpenML datasets we used for the FC-Net and SVM classification benchmarks.
Name                         Number of features   Number of datapoints
boston housing               13                   506
concrete                     9                    1030
parkinsons telemonitoring    26                   5875
combined cycle power plant   4                    9568
energy                       8                    768
naval propulsion             16                   11934
protein structure            9                    45730
yacht-hydrodynamics          7                    308
winequality-red              12                   4898
slice localization           386                  53500
Table 2: UCI regression datasets we used for the XGBoost benchmark. All datasets can be found at
Benchmark   Hyperparameter           Range   Log scale
FC-Net      learning rate
FC-Net      batch size
FC-Net      units layer 1
FC-Net      units layer 2
FC-Net      dropout rate layer 1             -
FC-Net      dropout rate layer 2             -
XGBoost     learning rate
XGBoost     gamma                            -
XGBoost     L1 regularization
XGBoost     L2 regularization
XGBoost     number of estimators             -
XGBoost     subsampling                      -
XGBoost     max. depth                       -
XGBoost     min. child weight                -
Table 3: Hyperparameter configuration space of the support vector machine (SVM), fully connected neural network (FC-Net) and the gradient tree boosting (XGBoost) benchmark.
Figure 6: The empirical cumulative distribution plots of all observed target values for all tasks.
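The empirical cumulative distribution used in Figure 6 can be computed directly from the observed target values of a task. The following is a minimal sketch, not the code used for the paper; the function name `ecdf` and the example values are hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F with F(t) = fraction of observed values <= t."""
    sorted_values = sorted(values)
    n = len(sorted_values)
    if n == 0:
        raise ValueError("need at least one observed value")

    def cdf(threshold):
        # bisect_right returns how many sorted values are <= threshold.
        return bisect_right(sorted_values, threshold) / n

    return cdf


# Hypothetical target values (e.g. validation errors) observed on one task.
observed = [0.30, 0.12, 0.25, 0.12, 0.50]
F = ecdf(observed)
print(F(0.12), F(0.25), F(1.0))
```

Evaluating `F` on a grid of thresholds gives one curve per task, which is the kind of plot the figure caption describes.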

Appendix B Comparison Random Search vs. Bayesian Optimization on XGBoost

For completeness, Figure 7 compares random search (RS) and Bayesian optimization with Gaussian processes (BO-GP) on several UCI regression datasets. Out of the 10 datasets, BO-GP performs better than RS on 7, worse on one, and ties on 2. It hence performs better overall than RS, which is in line with the results obtained from our meta-model. However, if we looked only at the first three datasets (Boston-Housing, PowerPlant and Concrete), it would be much harder to draw strong conclusions.

Figure 7: Comparison of Bayesian optimization with Gaussian processes (GP-BO) and random search (RS) for optimizing the hyperparameters of XGBoost.

Appendix C Details about the Forrester benchmark

Figure 8 shows the original 9 tasks (left), their representation in the latent space of the model (middle) and an example of 10 newly generated tasks (right) that resemble the original ones.

Figure 8: Visualizing the concept of our meta-model on the one-dimensional Forrester function. Left: 9 different tasks (solid lines) coming from the same distribution. Middle: We use a probabilistic encoder to learn a two-dimensional latent space for the task embedding. Right: Given our encoder and the multi-task model we can generate new tasks (dashed lines) that, based on the collected data, resemble the original tasks.

Appendix D Samples for the Meta-SVM benchmark

In Figure 9 and Figure 10 we show additional randomly sampled functions, with and without noise, respectively. One can see that, while the general characteristics of the original objective function, i. e. being bowl-shaped around the lower right corner, remain, the local structure changes across samples.

Figure 9: Noisy samples from our meta-model for the SVM benchmark
Figure 10: Noiseless samples from our meta-model for the SVM benchmark

Appendix E Comparison of HPO Methods

We now describe the specific details of each optimizer in turn.

Random search (RS) (Bergstra and Bengio, 2012). We defined a uniform distribution over the input space and in each iteration randomly sampled a datapoint from this distribution.
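As an illustration of the random search baseline, the sketch below draws each point uniformly from a box-constrained input space and keeps the best one seen. The function name, the `bounds` format and the toy objective are assumptions made for this example, not the paper's implementation.

```python
import random


def random_search(objective, bounds, n_iters, seed=0):
    """Minimize `objective` by uniform sampling inside `bounds`.

    bounds: list of (low, high) pairs, one per input dimension.
    Returns the best point found and its function value.
    """
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n_iters):
        # One independent uniform draw per dimension.
        x = [rng.uniform(low, high) for low, high in bounds]
        y = objective(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


# Toy objective standing in for an HPO task (hypothetical).
def sphere(x):
    return sum(v * v for v in x)


print(random_search(sphere, [(-1.0, 1.0), (-1.0, 1.0)], n_iters=100))
```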

Differential Evolution (DE) (Storn and Price, 1997) maintains a population of data points and generates new candidate points by mutating random points from the population. We set the probability for mutation and crossover to 0.5. The population size was 10 and we sampled new candidate points based on the ’rand/1/bin’ strategy.
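The ’rand/1/bin’ strategy can be sketched as follows. This is a simplified illustration under several assumptions: the value 0.5 for "mutation" is interpreted as the differential weight, candidates are clipped to the bounds, and population members are replaced immediately rather than after a full generation. All names are hypothetical.

```python
import random


def differential_evolution(objective, bounds, n_generations, pop_size=10,
                           mutation=0.5, crossover=0.5, seed=0):
    """Minimize `objective` with a basic rand/1/bin differential evolution."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [objective(x) for x in pop]
    for _ in range(n_generations):
        for i in range(pop_size):
            # rand/1: three distinct random members, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            # bin: binomial crossover; one coordinate always comes from the mutant.
            forced = rng.randrange(dim)
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < crossover:
                    v = pop[a][d] + mutation * (pop[b][d] - pop[c][d])
                    v = min(max(v, lo), hi)  # clip to the box (assumption)
                else:
                    v = pop[i][d]
                trial.append(v)
            f_trial = objective(trial)
            # Greedy selection: keep the trial only if it is not worse.
            if f_trial <= fitness[i]:
                pop[i], fitness[i] = trial, f_trial
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


def sphere(x):
    return sum(v * v for v in x)


print(differential_evolution(sphere, [(-5.0, 5.0)] * 2, n_generations=20))
```

Because selection is greedy, the best value in the population can never get worse from one generation to the next.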

Tree Parzen Estimator (TPE) (Bergstra et al., 2011) is a Bayesian optimization method that uses kernel density estimators (KDEs) to model the probability of ’good’ points in the input space, which achieve a function value lower than a certain value, and ’bad’ points, which achieve a function value higher than that value. TPE computes the acquisition function as the ratio between the likelihoods of the two KDEs, which is equivalent to expected improvement. We used the default settings provided by the hyperopt package.
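To make the density-ratio idea concrete, here is a one-dimensional sketch with fixed-bandwidth Gaussian KDEs. It only illustrates the principle; hyperopt's actual estimator is more elaborate, and the split fraction `gamma`, the `bandwidth` and all function names are assumptions chosen for this example.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate over `points`."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_scores(xs, ys, candidates, gamma=0.25, bandwidth=0.1):
    """Score candidates by the ratio 'good' density / 'bad' density."""
    if len(xs) < 2:
        raise ValueError("need at least two observations")
    order = sorted(range(len(ys)), key=ys.__getitem__)
    # The best gamma-fraction of observations are 'good', the rest 'bad'.
    n_good = min(len(ys) - 1, max(1, math.ceil(gamma * len(ys))))
    good = gaussian_kde([xs[i] for i in order[:n_good]], bandwidth)
    bad = gaussian_kde([xs[i] for i in order[n_good:]], bandwidth)
    # Guard against division by zero far away from all 'bad' points.
    return [good(c) / max(bad(c), 1e-12) for c in candidates]


# Hypothetical observations of a function with its minimum at x = 0.3.
xs = [i / 10 for i in range(11)]
ys = [(x - 0.3) ** 2 for x in xs]
print(tpe_scores(xs, ys, candidates=[0.3, 0.9]))
```

The candidate with the highest ratio would be evaluated next; here the point near the observed minimum scores higher than the distant one.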

SMAC (Hutter et al., 2011) is also a Bayesian optimization method that uses random forests to model the objective function and stochastic local search to optimize the acquisition function. We followed the SMAC default and set the number of trees for the random forest to 10.

CMA-ES (Hansen, 2006) is an evolutionary strategy that models a population as a multivariate normal distribution. We used the open source pycma package. We set the initial standard deviation to 0.6.

Gaussian process based Bayesian optimization (BO-GP) as described by Snoek et al. (2012). We used expected improvement as the acquisition function and optimized it with an adapted random search strategy which, given a maximum number of allowed points, first samples 70% of them uniformly at random and the rest from a Gaussian with a fixed variance around the best observed point. While other methods such as gradient ascent techniques or continuous global optimization methods could also be used, we found this to work faster and more robustly. We marginalized the acquisition function over the Gaussian process hyperparameters (Snoek et al., 2012) and used the emcee package to sample hyperparameter configurations from the marginal log-likelihood. We used a Matérn 5/2 kernel for the Gaussian process.
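The adapted random search for the acquisition function can be sketched as below. The 70/30 split follows the description above; the standard deviation `std` (as a fraction of each dimension's range), the clipping to the bounds and all names are assumptions, since the text only states that the variance is fixed.

```python
import random


def sample_candidates(bounds, incumbent, n_points,
                      uniform_fraction=0.7, std=0.1, seed=0):
    """Draw candidates: a uniform part plus a Gaussian part around the incumbent."""
    rng = random.Random(seed)
    n_uniform = int(round(uniform_fraction * n_points))
    candidates = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(n_uniform)]
    for _ in range(n_points - n_uniform):
        point = []
        for x, (lo, hi) in zip(incumbent, bounds):
            # Fixed-variance Gaussian centred on the best observed point.
            v = rng.gauss(x, std * (hi - lo))
            point.append(min(max(v, lo), hi))  # clip to the box (assumption)
        candidates.append(point)
    return candidates


def maximize_acquisition(acquisition, bounds, incumbent, n_points=1000, seed=0):
    """Return the sampled candidate with the highest acquisition value."""
    candidates = sample_candidates(bounds, incumbent, n_points, seed=seed)
    return max(candidates, key=acquisition)


# Hypothetical acquisition function peaking at (0.2, 0.8).
def toy_acquisition(x):
    return -((x[0] - 0.2) ** 2 + (x[1] - 0.8) ** 2)


print(maximize_acquisition(toy_acquisition, [(0.0, 1.0)] * 2, incumbent=[0.5, 0.5]))
```

The uniform part keeps the search global, while the Gaussian part refines the region around the current best point.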

BOHAMIANN (Springenberg et al., 2016) uses a Bayesian neural network inside Bayesian optimization, where the weights are sampled with stochastic gradient Hamiltonian Monte-Carlo (Chen et al., 2014). We used a step length of 1E-2 for the MCMC sampler and increased the number of burn-in steps by a factor of 100 times the number of observed data points. In each iteration we sampled 100 weight vectors over 10000 MCMC steps. We used the same random search method to optimize the acquisition function as for BO-GP.

All methods started from a uniformly sampled point and we estimated the incumbent after each function evaluation as the point with the lowest observed function value.
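The incumbent estimate described above is a running minimum over the evaluation history. A minimal sketch, with a hypothetical function name and made-up example data:

```python
def incumbent_trajectory(evaluations):
    """Return the incumbent (x, y) after each function evaluation.

    evaluations: list of (x, y) pairs in the order they were observed.
    """
    trajectory = []
    best = None
    for x, y in evaluations:
        # The incumbent changes only when a strictly lower value is observed.
        if best is None or y < best[1]:
            best = (x, y)
        trajectory.append(best)
    return trajectory


# Hypothetical evaluation history of one optimizer run.
history = [([0.1], 0.9), ([0.4], 0.3), ([0.7], 0.5), ([0.5], 0.2)]
print([y for _, y in incumbent_trajectory(history)])
```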

In Figure 11 and Table 4 we show the aggregated results based on the runtime and the ranking for all methods on all three benchmarks. We also show in Figure 11 the p-values of the Mann-Whitney U test between all methods. For a detailed analysis of the results see Section 5.3 in the main paper.
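For readers unfamiliar with the Mann-Whitney U test, the sketch below computes the U statistic and a two-sided p-value using the normal approximation. It omits tie and continuity corrections, so a statistics library would be the practical choice for real analyses; the function name and the sample data are hypothetical, and the sketch does not claim to reproduce how the p-values in Figure 11 were obtained.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic of sample_a, two-sided p-value).

    Uses the normal approximation without tie or continuity correction.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # U counts pairs where the value from a exceeds the value from b;
    # tied pairs count one half.
    u_a = sum((a > b) + 0.5 * (a == b) for a in sample_a for b in sample_b)
    mean_u = n_a * n_b / 2.0
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mean_u) / std_u
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_a, p_value


# Hypothetical final performance values of two optimizers across runs.
method_1 = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.08, 0.14]
method_2 = [0.20, 0.22, 0.19, 0.25, 0.21, 0.23, 0.18, 0.24, 0.26, 0.27]
print(mann_whitney_u(method_1, method_2))
```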

Figure 11: Comparison of various methods on all three HPO problems. From top to bottom: 2-dimensional support vector machine, 6-dimensional feed-forward neural network and 8-dimensional gradient boosting. The two columns on the left show the ECDF and ranking for the noiseless version of each HPO problem (same for the noisy version).
Table 4: Top: Each element of the table shows the averaged runtime after 100 function evaluations for each method-benchmark pair. Bottom: Same but for the ranking of the methods. Benchmarks in both parts: Meta-SVM (noiseless), Meta-SVM (noise), Meta-FCNet (noiseless), Meta-FCNet (noise), Meta-XGBoost (noiseless), Meta-XGBoost (noise).

Appendix F Details of the Meta-Model

The neural network architecture for our meta-model consisted of fully connected layers with

units each and tanh activation functions. The step length for the MCMC sampler was set to

and we used the first 50000 steps as burn-in. For the probabilistic encoder, we used Bayesian GP-LVM (Titsias and Lawrence, 2010), in the implementation from GPy (2012), with a Matérn 5/2 kernel to learn a latent space for the task description.