Automatic prior selection for meta Bayesian optimization with a case study on tuning deep neural network optimizers

09/16/2021 ∙ by Zi Wang, et al. ∙ Google

The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.


1 Introduction

The careful tuning of a variety of meta-parameters, such as optimizer parameters and model hyperparameters, has become a basic necessity for deep learning (Bergstra et al., 2011; Feurer et al., 2015). Such tuning requires extensive experimentation, retraining models repeatedly with different configurations, and can be challenging at realistic budgets because the tuning landscape is typically non-stationary, noisy and ill-behaved. Tuning has become sufficiently costly that finding more efficient and effective tuning procedures has the potential to save a substantial amount of resources or, alternatively, improve the accuracy of the final models at a given budget.

Some hyperparameters might show up again and again across a large number of tuning problems. In particular, we tend to use the same optimization algorithm across many different problems. For example, the Adam optimizer, with a learning rate and additional hyperparameters β1, β2 and ε, is used across many deep learning applications and requires careful tuning (Nado et al., 2021). Thus, if we have access to the performance of different optimizer-specific hyperparameters on different model training tasks, we may be able to transfer knowledge among those tasks. This kind of meta-level learning is common among practitioners themselves: when faced with a new tuning problem, we might first try reusing hyperparameter settings that worked well on another (ideally similar) problem. The underlying assumption is that hyperparameters should perform similarly across tasks. In this work, we aim to formalize this assumption and automate optimizer hyperparameter tuning by leveraging knowledge from previous experiments. Although our experiments consider optimizer parameter tuning as a practically important sub-problem of hyperparameter tuning for deep neural networks, our method applies to any hyperparameters that are common across multiple tasks.

Bayesian optimization (BO) has become a popular methodology for optimizing the hyperparameters of machine learning models (Snoek et al., 2012; Bergstra et al., 2011) and represents the state-of-the-art (Turner et al., 2021). BO involves specifying a probabilistic model over the function to be optimized and using this model to reason about the location of the optimum. The optimization proceeds by iteratively updating the model with new data and using the posterior distribution to reason about where to evaluate next, trading off exploration and exploitation. The model is typically specified with only a-priori assumptions of smoothness, for example using a Gaussian process with a smooth covariance function. Even if the model is well-specified, BO can be slow to converge due to the generality of the prior assumptions. This is wasteful for problems that are repeated often or share considerable structure with previous experiments.

One natural option is to cast our problem as meta Bayesian optimization, where the goal is to learn to optimize a black-box function by generalizing from past experience with other similar functions. Indeed, quite a few meta BO methods exist in the literature, but they are unsuitable for our scenario, where we envision potentially thousands of related tasks within, e.g., the context of a hyperparameter tuning service. Existing meta BO methods either scale cubically in the number of evaluations and tasks (Swersky et al., 2013; Bardenet et al., 2013) (see §4.3 for more details), impose a restrictive set of assumptions on the available data (Wang et al., 2018b; Swersky et al., 2013) to obtain efficient solutions, or make assumptions about the availability of Gaussian process (GP) parameters (Volpp et al., 2020) or descriptive task-level features (Brazdil et al., 1994; Bardenet et al., 2013; Yogatama and Mann, 2014).

To address these issues, we introduce HyperBO: a meta BO method that builds upon Wang et al. (2018b) with a relatively simple assumption: all the related functions being optimized are samples from the same Gaussian process prior distribution over functions. Concretely, HyperBO assumes the functions are conditionally independent given the hyperparameters, mean function, and covariance function of the GP. Compared to Wang et al. (2018b), HyperBO does not impose any strict conditions on data or model structures, and a special case of HyperBO retains regret bounds similar to those of Wang et al. (2018b). From a computational perspective, HyperBO scales linearly in the number of tasks during training, and does not depend on the number of tasks when deployed. HyperBO does not impose any assumptions about the conditions under which data is collected, and thus can be used with large offline datasets or a few trajectories of BayesOpt. Practitioners with fewer resources can also benefit from using data collected elsewhere.

To evaluate HyperBO, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as on a protein sequence dataset. We compared HyperBO to several hyperparameter tuning baselines in the sequential BO setting. Our results showed that optimizers using hyperparameters suggested by our method obtain better-performing models while requiring at least 3 times fewer function evaluations than the other baselines.

Our main contributions are two-fold: (a) a practical meta BO approach that makes minimal assumptions; and (b) a large multi-task hyperparameter tuning dataset that not only benefits our method but also serves as an ideal benchmark for testing future multi-task or meta-learning BO methods.[1]

[1] We are working on open-sourcing the code base and dataset. The dataset is collected based on an open-sourced code base (Gilmer et al., 2021).

2 Related work

There is a rich literature of innovative methodologies to improve the efficiency of BO given related tasks or additional context. Here we discuss the most closely related work and explain why these approaches do not solve the specific scenario we envision. Specifically, our goal is a methodology that is scalable enough to share information across thousands of tasks, each with potentially hundreds of observations, such as in the context of a large BO service or library.

Several methods, including the one that HyperBO extends, refer to themselves as “meta-BO” (Wang et al., 2018b; Volpp et al., 2020). However, in this work we use the term more generally to refer to the class of BO methods that use data from existing tasks to optimize a new task. Since standard BO is itself a learning process, it is consistent to call such methods meta BO: they learn how to learn. Under this viewpoint, multi-task BO (Swersky et al., 2013; Poloczek et al., 2017; Yogatama and Mann, 2014) and transfer learning BO using contextual GPs (Krause and Ong, 2011; Bardenet et al., 2013; Poloczek et al., 2016) are both meta BO approaches. Some meta BO methods have also been studied for hyperparameter tuning tasks in machine learning (Feurer et al., 2015).

While both multi-task and contextual BO rely heavily on the assumption that tasks are related, HyperBO assumes all tasks are independent (after conditioning on the GP). Both multi-task and contextual BO methods scale cubically in both the number of tasks and observations in each task, meaning that they cannot gracefully handle tens of tasks with thousands of data points each without heavy approximations. When assuming that all inputs are equal across tasks, multi-task BO can be sped up using a Kronecker decomposition of the kernel to a task kernel and an input kernel which can be inverted separately; a similar assumption is made by Wang et al. (2018b). In comparison, HyperBO scales linearly in the number of tasks (see §4.3).

End-to-end learning (Chen et al., 2017; Volpp et al., 2020) is another popular meta BO approach for hyperparameter tuning; it learns a strategy to suggest new query points based on the past history of BO. One limitation of such approaches is that the total number of BO iterations must be determined a-priori. Furthermore, because the suggestion strategy is trained as one large end-to-end model, we lose the interpretability of intermediate steps that GPs and acquisition functions provide.

Our proposed idea directly builds upon Wang et al. (2018b) and Kim et al. (2017, 2019). We resolve their issues with optimizing over a continuous space rather than a discrete set, and their limitation of requiring the same set of inputs across tasks. Kim et al. (2017, 2019) first proposed estimating a multivariate Gaussian that models values of search strategies in robot manipulation tasks; it was therefore natural not to consider continuous inputs in their context. The main contributions of Wang et al. (2018b) were the regret bounds for Kim et al. (2017, 2019), whose method was then identified as meta BO without knowledge of the mean or kernel of the GP. For both finite discrete search spaces and continuous ones, Wang et al. (2018b) requires observations on the same set of inputs across tasks, an assumption that HyperBO does not require. Nevertheless, HyperBO still inherits the same regret bound as Wang et al. (2018b) for special cases where the same-inputs assumption is satisfied.

3 Problem formulation

We consider the standard black-box function optimization scenario: given a real-valued function f defined over a compact, hyper-rectangular space X, and given observations of similar functions f_1, …, f_N, we seek an x ∈ X optimizing f. We inherit our problem formulation from Wang et al. (2018b), but we relax impractical assumptions on data availability (we do not require all observations to be made on the same inputs across tasks) and model restrictions.

Assumptions and the goal.

Concretely, we assume that there exists a Gaussian process GP(μ, k) with unknown mean function μ and kernel k. Let N be the number of tasks and let M_i be the number of observations we have for the i-th task. Conditioned on independent function samples f_i ~ GP(μ, k) and inputs x_{i,j} ∈ X, we observe evaluations y_{i,j} perturbed by i.i.d. additive Gaussian noise 𝒩(0, σ²). Taken together, the sub-datasets D_{f_i} = {(x_{i,j}, y_{i,j})}_{j∈[M_i]} define the collection D_N = {D_{f_i}}_{i∈[N]}. Finally, our goal is to maximize a new function f independently sampled from the same GP; that is, to solve arg max_{x∈X} f(x).

An example.

In our optimizer hyperparameter tuning application, a task corresponds to finding the best optimizer hyperparameters to train a particular neural net model on a particular task dataset,[2] e.g. training a specific ResNet (He et al., 2016) on ImageNet (Russakovsky et al., 2015). Notice that we do not assume that the mean function μ, kernel k and noise variance σ² are given. This is consistent with the reality of solving real-world black-box optimization problems, including hyperparameter tuning tasks in deep learning. Essentially, we are trying to learn those unknown functions and parameters from data. However, in practice, searching in functional spaces to find the right mean or kernel is a daunting task. Hence, for practical concerns, a well-defined search space over mean and kernel functions is required. More details on this can be found in §4.1.

[2] Technically, we also consider different batch sizes to be different tasks.

Metrics.

For simplicity, throughout this paper we focus on the setting where the target function f can only be optimized by iteratively choosing where to evaluate, and defer batch evaluation setups to Sec. 6. As we run BO on the target function f for T iterations, we accumulate a set of observations D_f = {(x_t, y_t)}_{t∈[T]}. We evaluate the quality of the optimization using the simple regret metric R_T = max_{x∈X} f(x) − f(x̂), where x̂ is the final recommendation at the end of the optimization process. There are various ways of setting x̂ based on the observations D_f; we use the input that achieved the best evaluation, x̂ = x_τ with τ = arg max_{t∈[T]} y_t.

Bayesian viewpoint.

As mentioned above, the observed functions and the evaluation target are assumed to be independent draws from the same GP. This assumption is consistent with a hierarchical Bayes interpretation (Fig. 1), where all observed functions are independent conditioned on the GP. Notice that in BO each selected input depends on all previous observations; for simplicity, however, we describe only the generative model of the hierarchical GP.

More specifically, we assume that the overall setting of the hyperparameter optimization task is defined by a parameter θ, from which the mean function μ and kernel k are drawn: μ, k ~ p(μ, k | θ). The independent function samples f_1, …, f_N are themselves draws from GP(μ, k). The generative story is as follows, with a code sketch of this sampling process given after the list:

  • Draw the GP parameter θ from p(θ) and the observation noise parameter σ² from p(σ²).

  • Draw the mean function μ and kernel function k from p(μ, k | θ).

  • For each task i from 1 to N,

    • Draw a function f_i from GP(μ, k).

    • For each data point j from 1 to M_i,

      • Given input x_{i,j}, draw the observation y_{i,j} ~ 𝒩(f_i(x_{i,j}), σ²).
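
To make this generative story concrete, the sketch below samples such a dataset on random inputs with NumPy. The constant mean, squared-exponential kernel and the specific parameter values are illustrative assumptions only; θ and σ² are fixed here rather than drawn from hyper-priors.

    import numpy as np

    def rbf_kernel(xa, xb, amplitude, length_scale):
        # Squared-exponential kernel; an illustrative choice of functional form.
        sq_dists = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
        return amplitude ** 2 * np.exp(-0.5 * sq_dists / length_scale ** 2)

    rng = np.random.default_rng(0)
    # Draw (here: fix) the GP parameter theta and the observation noise variance.
    theta = dict(constant_mean=0.5, amplitude=1.0, length_scale=0.3)
    noise_var = 1e-3
    # Deterministic mean/kernel given theta (the Dirac-delta simplification discussed below).
    mean_fn = lambda x: np.full(len(x), theta["constant_mean"])
    kernel_fn = lambda xa, xb: rbf_kernel(xa, xb, theta["amplitude"], theta["length_scale"])

    num_tasks, points_per_task, dim = 5, 20, 4
    dataset = []
    for i in range(num_tasks):
        # Draw a function sample f_i ~ GP(mean_fn, kernel_fn) on this task's inputs.
        x = rng.uniform(size=(points_per_task, dim))
        f = rng.multivariate_normal(mean_fn(x),
                                    kernel_fn(x, x) + 1e-10 * np.eye(points_per_task))
        # Observe y_{i,j} ~ N(f_i(x_{i,j}), noise_var).
        y = f + rng.normal(scale=np.sqrt(noise_var), size=points_per_task)
        dataset.append((x, y))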

Figure 1: Graphical model for a hierarchical Gaussian process.

We simplify this hierarchical setting by defining p(μ, k | θ) to be a sum of Dirac delta functions: both the mean function μ and the kernel k are deterministic functions parameterized by θ. Thus, we can infer the GP parameter θ and noise σ² from their posterior and obtain an informed prediction for the target function f. In other words, we learn about the target function f from observations on the other conditionally i.i.d. function samples f_1, …, f_N. We forgo a fully Bayesian approach that samples from the posterior over θ at every BO iteration, although our method, HyperBO, can be viewed as a type-II maximum likelihood approximation of such a Bayesian solution.

Notations.

Let [n] denote the set {1, …, n}. For conciseness, we write the evaluation of a function f on a vector x = [x_j]_{j∈[n]} as f(x) = [f(x_j)]_{j∈[n]}. Similarly, for two vectors x, x', we write the corresponding kernel matrix as k(x, x') = [k(x_j, x'_l)]_{j∈[n], l∈[n']}, and shorten k(x) = k(x, x). We denote a (multivariate) Gaussian distribution with mean u and variance Σ by 𝒩(u, Σ), and a Gaussian process (GP) with mean function μ and covariance function k by GP(μ, k). Let σ² be the noise variance in observations. Given a set of observations D = {(x_t, y_t)}_{t∈[T]} and GP(μ, k), we denote the corresponding conditional GP distribution as GP(μ_D, k_D). Recall that the conditional distribution GP(μ_D, k_D) is given for any x, x' ∈ X by

(1)  μ_D(x) = μ(x) + k(x, x_{1:T})(k(x_{1:T}) + σ²I)^{-1}(y_{1:T} − μ(x_{1:T})),
(2)  k_D(x, x') = k(x, x') − k(x, x_{1:T})(k(x_{1:T}) + σ²I)^{-1} k(x_{1:T}, x'),

where we set x_{1:T} = [x_t]_{t∈[T]} and y_{1:T} = [y_t]_{t∈[T]}.
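
A minimal NumPy sketch of the conditional mean and covariance in Eqs. (1)-(2); mean_fn, kernel_fn and noise_var stand for whichever mean, kernel and noise variance are in use (e.g. the learned ones from §4).

    import numpy as np

    def gp_posterior(x_query, x_obs, y_obs, mean_fn, kernel_fn, noise_var):
        # Eqs. (1)-(2): condition GP(mean_fn, kernel_fn) on observations (x_obs, y_obs).
        gram = kernel_fn(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
        k_qo = kernel_fn(x_query, x_obs)
        # Solve against the Gram matrix rather than forming an explicit inverse.
        alpha = np.linalg.solve(gram, y_obs - mean_fn(x_obs))
        post_mean = mean_fn(x_query) + k_qo @ alpha
        post_cov = kernel_fn(x_query, x_query) - k_qo @ np.linalg.solve(gram, k_qo.T)
        return post_mean, post_cov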

4 Our method

1:function HyperBO(D_N)
2:     μ̂, k̂, σ̂² ← Train(D_N)
3:     D_f ← ∅
4:     for t = 1, …, T do
5:         x_t ← arg max_{x∈X} α(x; GP(μ̂, k̂), D_f)
6:         Observe y_t at x_t
7:         D_f ← D_f ∪ {(x_t, y_t)}
8:     end for
9:     return D_f
10:end function
Algorithm 1 HyperBO with acquisition function α.

As shown in Alg. 1, our approach trains the GP hyperparameters on a representative set of datasets and then fixes them for the duration of the optimization procedure; we refer to this approach as HyperBO. HyperBO runs in two steps. First, we learn a GP model GP(μ̂, k̂) to approximate the ground-truth (unknown) GP that generated the dataset D_N. Then, we do standard BO to optimize a new function f with the learned GP. The initial learning process (Alg. 1, line 2) is the critical difference between HyperBO and standard BO algorithms, as well as the key contribution of this paper.

Based on the Bayesian graphical model interpretation (Fig. 1), our goal is to obtain a point estimate of the parameter θ. Given this estimate, we can then estimate the mean function μ̂ and the kernel k̂, which define our learned model GP(μ̂, k̂). During the BO iterations (Alg. 1, lines 4-8), we update the conditional GP but do not re-estimate the GP mean and kernel. By separating the data used for the conditional GP update from the data used for GP parameter training, we minimize the computational cost while still maintaining good performance both theoretically and empirically. Moreover, we avoid the BO chicken-and-egg dilemma (Wang et al., 2018b) in which the search strategy is trained on data collected during the BO process while those data points are selected by the search strategy itself.
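
As a rough Python illustration of Alg. 1 over a finite candidate set (reusing the gp_posterior sketch from §3, and treating train_gp, acquisition and objective as placeholders for §4.1-§4.2, §5.2 and the function being tuned), the loop below learns the mean and kernel once and afterwards only updates the conditional GP:

    import numpy as np

    def hyperbo(train_gp, training_data, acquisition, objective, candidates,
                num_iters, noise_var):
        # Line 2 of Alg. 1: learn the mean/kernel once from the multi-task dataset D_N;
        # they are never re-estimated inside the BO loop below.
        mean_fn, kernel_fn = train_gp(training_data)
        x_obs, y_obs = [], []
        for _ in range(num_iters):
            if not x_obs:
                idx = 0  # first query; any initialization scheme could be used here
            else:
                # Conditional GP update only (gp_posterior is the sketch from §3).
                mu, cov = gp_posterior(candidates, np.stack(x_obs), np.array(y_obs),
                                       mean_fn, kernel_fn, noise_var)
                sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None) + noise_var)
                # Line 5: maximize the acquisition function over the candidate set.
                idx = int(np.argmax(acquisition(mu, sigma, np.array(y_obs))))
            # Lines 6-7: observe the function at x_t and grow D_f.
            x_obs.append(candidates[idx])
            y_obs.append(objective(candidates[idx]))
        return x_obs, y_obs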

Next, we introduce our GP training strategy based on two types of objectives: the marginal data likelihood (Sec. 4.1) and a distance between estimates and model predictions (Sec. 4.2). In Sec. 4.4 we analyze the theoretical implications for regret bounds of a special case of our approach.

4.1 Marginal likelihood

A straightforward way to train a GP is by optimizing the log marginal likelihood over the GP's hyperparameters, also known as type II maximum likelihood approximation (Rasmussen and Williams, 2006). In our case, we derive the data likelihood for observations from multiple functions that are assumed to be given, which is a key difference from regular GP or BO setups. The log marginal likelihood for our method is

(3)  log p(D_N | μ, k, σ²) = Σ_{i=1}^{N} log 𝒩(y_i ; μ(x_i), k(x_i) + σ²I),

where 𝒩(·; u, Σ) denotes a multivariate Gaussian density, and x_i = [x_{i,j}]_{j∈[M_i]} and y_i = [y_{i,j}]_{j∈[M_i]} concatenate the inputs and observations of sub-dataset D_{f_i}.

Our solution to the choice of mean function, kernel function and noise variance then becomes

(4)  μ̂, k̂, σ̂² = arg max_{μ, k, σ²} log p(D_N | μ, k, σ²).

For the mean function μ and the kernel k, this optimization is done in functional space. While methods exist to search over functional structures (Kemp and Tenenbaum, 2008; Malkomes and Garnett, 2018), we opt instead for a simple search strategy within a small group of functional structures for the mean and kernel. For each combination of mean/kernel structures or functional classes, we optimize their parameterization and the noise variance to eventually solve Eq. 4. Details of how we defined the search space can be found in §5.
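
As a sketch of Eqs. (3)-(4), reusing rbf_kernel and the sampled dataset from §3; the constant-mean, squared-exponential parameterization is an illustrative assumption rather than the search space used in §5.

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.optimize import minimize

    def neg_log_marginal_likelihood(params, dataset):
        # Eq. (3): negative sum of per-task GP log likelihoods.
        # `dataset` is a list of (x_i, y_i) sub-datasets, e.g. the one sampled in §3.
        constant_mean, amplitude, length_scale, noise_var = params
        nll = 0.0
        for x, y in dataset:
            mean = np.full(len(x), constant_mean)
            cov = (rbf_kernel(x, x, amplitude, length_scale)
                   + noise_var * np.eye(len(x)))
            nll -= multivariate_normal.logpdf(y, mean=mean, cov=cov)
        return nll

    # Eq. (4): maximize the likelihood (minimize the NLL) over the shared parameters.
    fit = minimize(neg_log_marginal_likelihood, x0=np.array([0.0, 1.0, 1.0, 1e-2]),
                   args=(dataset,), method="L-BFGS-B",
                   bounds=[(None, None), (1e-3, None), (1e-3, None), (1e-6, None)])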

4.2 Distance between estimates and model predictions

Although the marginal likelihood is a straightforward objective to optimize, it may not be straightforward to interpret how high a likelihood is high enough for us to stop our search for a decent model. In contrast, we can directly estimate the sample mean and covariance, and the distance between those estimates and the model predictions can serve as an indicator of how good the model is. We will show in §4.4 that a distance objective may also lead to better theoretical properties.

Here we consider a special case of the dataset where part of it has matching inputs across some sampled functions. More formally, suppose we have a matching dataset {(x_j, [y_{i,j}]_{i∈[N]})}_{j∈[M]}, where M is a positive integer and every task i has an observation at each matching input x_j ∈ X. Empirically, such a dataset can be constructed by querying a set of functions {f_i}_{i∈[N]} at the same set of input locations x = [x_j]_{j∈[M]} to obtain an observation matrix Y = [y_{i,j}]_{i∈[N], j∈[M]}.

By the definition of our model GP(μ, k), the vector of function queries f_i(x) = [f_i(x_j)]_{j∈[M]} is distributed according to a multivariate Gaussian distribution 𝒩(μ(x), k(x)). With our observation model, the observations of each task satisfy y_i = [y_{i,j}]_{j∈[M]} ~ 𝒩(μ(x), k(x) + σ²I) for some unknown mean function μ and kernel k.

However, given that we have access to all observations Y, we can estimate the mean on inputs x as μ̃ = (1/N) Y^T 1_N and the covariance as K̃ = (1/N)(Y − 1_N μ̃^T)^T (Y − 1_N μ̃^T); here 1_N is a column vector of size N filled with 1s. We use a biased estimate of the covariance to be consistent with the corresponding maximum likelihood estimator in Eq. 4.[3] Notice that the estimated covariance includes the variance of the observation noise in its diagonal terms.

[3] One may choose to re-scale the learned kernel to obtain an unbiased estimate.

For any distance function between the estimate 𝒩(μ̃, K̃) and the model prediction 𝒩(μ(x), k(x) + σ²I), we obtain an objective to minimize. While there are different measures of distributional discrepancy, we adopt the KL divergence. Let u = μ(x) and Σ = k(x) + σ²I. The KL divergence is defined as

(5)  D_KL(𝒩(μ̃, K̃) ‖ 𝒩(u, Σ)) = ½ ( tr(Σ^{-1} K̃) + (u − μ̃)^T Σ^{-1} (u − μ̃) − M + log det Σ − log det K̃ ),

and we can estimate the mean, kernel and noise variance to be

μ̂, k̂, σ̂² = arg min_{μ, k, σ²} D_KL(𝒩(μ̃, K̃) ‖ 𝒩(μ(x), k(x) + σ²I)).

While it is difficult to gauge how high a probability density must be to indicate a good model, Eq. 5 is a distance that goes to 0 as the difference between the two distributions shrinks. One may choose to do early stopping or model selection based on how close Eq. 5 is to 0. From information theory, we also know that the KL divergence in Eq. 5 describes the number of extra bits (or nats) needed to encode the estimated multivariate normal 𝒩(μ̃, K̃) with a code based on the model prediction. Overall we found the KL divergence in Eq. 5 relatively more interpretable than the marginal likelihood in Eq. 4.

The KL divergence in Eq. 5 introduces a different optimization landscape than the marginal likelihood in Eq. 4, and it makes use of the matching dataset in a way that the marginal likelihood cannot. In fact, matching inputs are only implicit in the marginal likelihood of Eq. 4: all inputs are passed into the mean/kernel functions, so Eq. 4 cannot be informed that some inputs are shared across tasks. As shown in §5, the KL divergence in Eq. 5 interestingly led to better results in our experiments.
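
A sketch of the empirical estimates and the KL objective in Eq. (5), again with the illustrative mean/kernel parameterization (rbf_kernel from §3); Y is the N-by-M matrix of observations at the matching inputs x_match.

    import numpy as np

    def kl_objective(params, x_match, Y):
        # Empirical (biased) estimates of the mean and covariance at the matching inputs.
        n_tasks, m = Y.shape
        mu_tilde = Y.mean(axis=0)                   # sample mean, shape (m,)
        centered = Y - mu_tilde
        k_tilde = centered.T @ centered / n_tasks   # sample covariance, shape (m, m)
        # Model prediction N(u, S) at the matching inputs.
        constant_mean, amplitude, length_scale, noise_var = params
        u = np.full(m, constant_mean)
        S = rbf_kernel(x_match, x_match, amplitude, length_scale) + noise_var * np.eye(m)
        # Eq. (5): KL( N(mu_tilde, k_tilde) || N(u, S) ).
        # If k_tilde is singular (fewer tasks than matching points), the pseudo KL
        # divergence of Appendix A would be needed instead.
        S_inv = np.linalg.inv(S)
        diff = u - mu_tilde
        _, logdet_S = np.linalg.slogdet(S)
        _, logdet_k = np.linalg.slogdet(k_tilde)
        return 0.5 * (np.trace(S_inv @ k_tilde) + diff @ S_inv @ diff - m
                      + logdet_S - logdet_k)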

4.3 Computational complexity

The marginal likelihood in Eq. 3 naturally decomposes into a sum of GP data likelihood terms, one per sub-dataset D_{f_i}. The time complexity to compute Eq. 3 is O(NM³), where N is the number of sub-datasets and M is the maximum number of data points among these sub-datasets. Notice that our method scales linearly in the number of tasks N, in contrast to the cubic scaling of multi-task or contextual BO methods (Swersky et al., 2013; Bardenet et al., 2013; Poloczek et al., 2016; Yogatama and Mann, 2014). The only cubic cost of HyperBO is in the number of data points within each sub-dataset.

To train a GP with T' optimization steps on Eq. 3, the time complexity is O(T'NM³). The distance regularizers introduced in §4.2 require estimating the mean and covariance, which takes O(NM²) for the matrix multiplications. The KL divergence in Eq. 5 has complexity O(M³) to compute and O(T'M³) to optimize.

If there is a better probabilistic model than a GP that fits the data with less compute time, we can easily swap it in and reduce the GP's contribution to the complexity of Eq. 3. For example, if we approximate the GP with a linear model on random features (Rahimi et al., 2007), the per-task cost of Eq. 3 becomes linear in the number of data points (and cubic only in the number of random features). Another example is to train Eq. 3 with stochastic optimization methods, sub-sampling sub-datasets in each step so that the per-step cost depends on the mini-batch size rather than the total number of tasks; the overall cost then scales with the number of optimization epochs.

4.4 Theoretical analyses

While it is nontrivial to prove regret bounds for general scenarios without strict assumptions, it is straightforward to show a regret bound for our method with the objective of Eq. 5 in the matching-dataset case where BO runs on a finite set of inputs.

Proposition 1.

For any mean estimate μ̃ and covariance estimate K̃ obtained on the matching inputs x, there exists a Gaussian process GP(μ, k) with noise variance σ² such that μ(x) = μ̃ and k(x) + σ²I = K̃.

Proposition 1 is easy to show. We can train a simple memory-based model for the mean function μ and kernel k: the model stores each element of the vector μ̃ and the matrix K̃ at the corresponding locations of the inputs in x, and when making a prediction at a new input, it simply retrieves the values of the closest element in x. Given Proposition 1, a regret bound follows (Wang et al., 2018b).

Theorem 2.

With probability at least 1 − δ, the simple regret after T iterations of Alg. 1 with special cases of either GP-UCB or PI satisfies

(6)

where .

We describe the proof and the special cases of GP-UCB and PI in Appendix B. Theorem 2 shows that the regret bound always has a linear dependency on the observation noise σ. This is expected because, in practice, we select the best observation rather than the best function value (before observing a noisy version of it) to compute the simple regret. Another reason is that we learn the noise parameter jointly with the kernel, which is clear in Eq. 5; hence, when computing acquisition functions, the noise is always included in the predicted variance.

Intuitively, the more sub-datasets we have, the larger N is, the better we are able to estimate the GP model, and the closer the regret bound is to the case where the GP model is assumed known. Interestingly, the number of BO iterations T makes the regret smaller in the second term but larger in the first term of Eq. 6. Usually, as we get more observations, we gain more information about the maximizer and are able to optimize the function better. However, as we accumulate observations on the new function, the GP conditional predictions have more freedom to deviate from the ground truth (see Lemma 1 of Wang et al. (2018b)). As a result, we become less and less confident about our predictions, which is eventually reflected in a looser regret upper bound.

It is tempting to prove similar bounds for more general settings where inputs are not the same across all sub-datasets and BO happens in a continuous space. Though the only prerequisite is to show that the difference between the learned mean/kernel and the ground-truth mean/kernel is small, this is as difficult as showing that we can find a model with bounded generalization error across the entire continuous input space of an arbitrary function. Instead of making unrealistic assumptions just to satisfy such a prerequisite, we leave the regret bound for general settings as an open question.

5 Experiments

Our goal in this paper is to provide a practical approach for hyperparameter optimization when we are given data on a range of tasks over the same search space. To analyze the effectiveness of our proposal, we take the optimizer hyperparameter tuning problem in deep learning as a case study. Our implementation of HyperBO is based on JAX (Bradbury et al., 2018).[4]

[4] We are working on open-sourcing our code as well as trained GP models.

For empirical validation, we first collected a dataset composed of hyperparameter evaluations on various deep neural network training tasks. The tasks included optimizing deep models on image, text, and other datasets (see more details in Sec. 5.1). We then compared our method to several competitive baselines in realistic hyperparameter tuning scenarios in deep neural net optimizers to understand HyperBO’s properties better.

To reduce ambiguity, we distinguish between the datasets that individual neural networks are trained on and the dataset we collected, which contains optimizer hyperparameter points with their validation errors (and other metrics). We call the former (e.g. MNIST, CIFAR10) task datasets and the latter the tuning dataset. The tuning dataset is what we described as dataset D_N in §3.

5.1 Hyperparameter tuning dataset

In order to collect our hyperparameter tuning dataset, the PD1 Neural Net Tuning Dataset, we defined a set of 24 neural network tuning tasks[5] and a single, broad search space for Nesterov momentum. Each task is defined by a task dataset (e.g. ImageNet), a specific neural network model (e.g. ResNet50), and a batch size. Tab. 1 shows all the tasks that we consider in the tuning dataset. We used an existing code base (Gilmer et al., 2021) for neural network model training.

[5] The batch size 1024 ResNet50 ImageNet task only has 100 hyperparameter points because we abandoned it when scaling up data collection in order to save compute resources. It is used in training, but not evaluation.

For each task, we trained the model on the task dataset repeatedly using Nesterov momentum (Nesterov, 1983; Sutskever et al., 2013) with the task's minibatch size, drawing different hyperparameter settings from the 4-dimensional search space detailed in Tab. 2. We tuned the base learning rate η0 on a log scale, the momentum β (with 1 − β on a log scale), and the polynomial learning rate decay schedule's power p and decay steps fraction λ. We used a polynomial decay schedule with the following form:

(7)

where t is the training step and T is the total number of training steps for the task.
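
As a hedged illustration of such a schedule, the sketch below assumes a standard polynomial decay of the base learning rate toward zero over the first λT steps with power p, held constant afterwards; the exact form of Eq. 7 may differ in detail.

    def polynomial_decay_lr(t, total_steps, base_lr, power, decay_steps_fraction):
        # Assumed form: polynomial decay over the first decay_steps_fraction * total_steps
        # steps, constant afterwards.
        decay_steps = decay_steps_fraction * total_steps
        progress = min(t, decay_steps) / decay_steps
        return base_lr * (1.0 - progress) ** power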

We collected two types of data: matched and unmatched data. Matched data used the same set of uniformly-sampled hyperparameter points across all tasks and unmatched data sampled new points for each task. All other training pipeline hyperparameters were fixed to hand-selected, task-specific default values. All of our tasks are classification problems, so they all used the same training loss, although occasionally task-specific regularization terms were added. For each trial (training run for a single hyperparameter point), we recorded validation error (both cross entropy error and misclassification rate). In many cases, poor optimizer hyperparameter choices can cause training to diverge. We detected divergent training when the training cost became NaN and then marked the trial but did not discard it. Please see the Appendix, supplementary material, and code (Onomous, 2021) for additional details about the tasks and training procedure. The different tuning tasks vary in difficulty and numbers of data points, but generally there are roughly 500 matched datapoints and 1500 unmatched datapoints per tuning task. For unmatched data only, we attempted to generate roughly similar numbers of non-divergent points across tasks, so tasks with a higher probability of sampling a hyperparameter point that causes training to diverge will tend to have more trials.

Task dataset            Model                Batch sizes
CIFAR10                 Wide ResNet          {256, 2048}
CIFAR100                Wide ResNet          {256, 2048}
Fashion MNIST           Max pool CNN ReLU    {256, 2048}
Fashion MNIST           Max pool CNN tanh    {256, 2048}
Fashion MNIST           Simple CNN           {256, 2048}
ImageNet                ResNet50             {512, 1024, 2048}
LM1B                    Transformer          {2048}
MNIST                   Max pool CNN ReLU    {256, 2048}
MNIST                   Max pool CNN tanh    {256, 2048}
MNIST                   Simple CNN           {256, 2048}
SVHN (no extra)         Wide ResNet          {256, 1024}
WMT15 German-English    xformer              {64}
uniref50                Transformer          {128}
Table 1: Tasks

Hyperparameter               Scaling
base learning rate η0        Log
power p                      Linear
1 − β (momentum)             Log
decay steps fraction λ       Linear
Table 2: 4-dimensional input search space (see Eq. 7)

5.2 Description of all compared methods

Our method HyperBO has several variants, using different acquisition functions and different training objectives. In §5, unless otherwise mentioned, we used a thresholded probability of improvement (PI) as the acquisition function in line 5 of Alg. 1: the probability that a candidate improves on the best observation so far by at least the threshold.[6]

[6] We empirically evaluated a variety of acquisition functions, but found PI thresholded at 0.1 to be surprisingly effective. Because we model the observations as log error rate, this actually trades off exploration and exploitation: with larger error rates it seeks relatively more substantial improvements than with small error rates. The 5 acquisition functions we tested are PI with a 0.1 threshold, expected improvement, and UCB with coefficients 2, 3 and 4. More results can be found in Appendix C.2.
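
Under this reading (observations modeled in a transformed space where lower is better, with an improvement threshold of 0.1), a thresholded PI acquisition can be sketched as follows; this is an illustrative interpretation rather than the exact definition used in our code.

    import numpy as np
    from scipy.stats import norm

    def thresholded_pi(post_mean, post_std, y_obs, threshold=0.1):
        # Probability that a candidate beats the incumbent by at least `threshold`,
        # assuming smaller observed values are better.
        incumbent = np.min(y_obs)
        return norm.cdf((incumbent - threshold - post_mean) / post_std)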

  • H* NLL: HyperBO with PI as the acquisition function and negative log marginal likelihood as the objective function.

  • H* KL: HyperBO with PI as the acquisition function and KL divergence on matching datapoints as the objective function.

These two settings of HyperBO are relatively representative of the performance of variants of HyperBO. We provide more comparisons over acquisition functions and other objective functions in Appendix C.

Our first set of baselines include those that do not use information from training tasks:

  • Rand: Random search in the corresponding scaled space in Tab. 2.

  • STBO: Single task BO with a constant mean function, a Matern32 kernel and the PI acquisition function (same as above). At every BO iteration, STBO optimizes the GP hyperparameters via the marginal likelihood on the data of the test task. This implementation corresponds to a basic off-the-shelf BO setup.

  • STBOH: Single task GP-UCB (coefficient=1.8) with a constant mean, a Matern52 kernel and hand-tuned priors on the hyper-parameters, including the UCB coefficient. Specifically, the log amplitude follows Normal(-1, 1), the log length scale (one per input parameter) follows Normal(0, 1), and the log observation noise variance follows Normal(-6, 3). The hyperparameters are post-processed by tensorflow-probability's SoftClip bijector to constrain the values between the 1st and 99th quantiles. These prior distributions were manually tuned to obtain reasonable convergence rates on 24 analytic functions in COCO (Hansen et al., 2021). The GP parameters are then optimized via maximum marginal likelihood at every BO iteration.

For multi-task BO baselines, we included scalable methods that replace the GP with a regression model that can be trained using SGD and thus scales linearly in the number of observations. Following the multi-task setup of Springenberg et al. (2016), we jointly trained a 5-dimensional embedding of each task, which was then added to the input of the following two models.

  • MIMO: We trained an ensemble of feedforward neural networks with shared subnetworks (Havasi et al., 2020). We used 1 shared dense layer of size 10 and 2 unshared layers of size 10, with tanh activations based on (Snoek et al., 2015, Figure 2). The network has one output unit with a linear activation and another with an activation constraining it to be positive, corresponding respectively to the mean and standard deviation parameters of a normal distribution. We trained for 1000 epochs using the Adam optimizer with batch size 64.

  • RFGP: We used the open-source implementation of approximate GPs by Liu et al. (2020). We trained for 1000 epochs using the Adam optimizer with batch size 64.

All methods share the same input and output warping. The inputs are warped according to the scaling column of Tab. 2, and the output warping is applied to the validation error rate, which we model in log space (see §5.2).
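
A sketch of these warpings, assuming the log-scaled inputs in Tab. 2 are mapped through a log transform and the output is the log validation error rate (per the footnote in §5.2); the exact transforms used in the experiments may differ.

    import numpy as np

    def warp_inputs(hparams, log_scaled_names):
        # hparams: dict of raw hyperparameter values; log_scaled_names: hyperparameters
        # marked as log-scaled in Tab. 2.
        return np.array([np.log(v) if name in log_scaled_names else v
                         for name, v in hparams.items()])

    def warp_output(validation_error_rate, eps=1e-10):
        # Assumed output warping: model the log of the validation error rate.
        return np.log(max(validation_error_rate, eps))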

5.3 Results on offline optimizer hyperparameter tuning tasks

Many tasks in §5.1 consume substantial compute resources and time, which makes it infeasible to perform a wide variety of experiments to analyze the characteristics of BO methods. Hence we adopt an offline approximation, which runs BO only on the finite set of points contained in each tuning sub-dataset. In §5.4, we show some BO comparisons in the online setting.

In all the experiments in this section, we ran offline BO on the data from the test task, starting from zero initial data for this task. Each method was repeated 5 times with different random seeds to initialize its model. We ran all methods without de-duplication to best simulate online BO settings. We evaluate methods on the regret on error rate, which denotes the simple regret on the finite set of data points in each tuning sub-dataset.

5.3.1 Holding out relevant tasks

We first conducted experiments in a setting where a new task dataset is presented, and a BO method is trying to tune the optimizer hyperparameters for a selected model on that task dataset. A training dataset for meta BO is composed of at most 18 tuning sub-datasets on training tasks that do not involve the same task dataset as the test task. All methods then proceed to solve the test task on the new task dataset.

Figure 2: Performance profiles for outperforming the median of best error rates at the (a) 25th BO iteration, (b) 50th BO iteration and (c) 100th BO iteration.

Fig. 2 shows performance profiles of the BO methods described in §5.2. The performance profiles show the fraction of all test tasks on which each method outperforms a baseline criterion at each BO iteration.[7] We chose the criterion to be the median of the best error rates achieved by all methods at 3 different BO iterations: the 25th, 50th or 100th. The larger the fraction of tasks at each BO iteration, the better the method. Under all 3 criteria, MIMO outperforms the other methods in the first 10 to 20 BO iterations, but its lead is soon overtaken by HyperBO (H* NLL and H* KL) and STBOH. The HyperBO methods gain a similar if not larger fraction than the best alternative, STBOH, throughout the BO iterations. Fig. 2 (c) uses the most stringent performance criterion and shows that HyperBO with the KL objective outperforms HyperBO with the NLL objective by a small margin in this set of experiments; both HyperBO variants do considerably better than the others.

[7] We show performance relative to a baseline because of varying scales across tasks.

Figure 3: The leftmost panel summarizes the BO convergence of all methods: the median and 20/80 percentiles of the regrets on error rates over 115 BO runs (23 tasks, each with 5 repeats using different random seeds). We also show violin plots of its two vertical slices at the 50th and 100th iterations, where the white dot is the median and the black line spans the 20/80 percentiles. Overall, the HyperBO methods H* NLL and H* KL achieve the lowest regret on error rate on the majority of tasks.

Fig. 3 illustrates the BO convergence curves of all competing methods, together with the vertical slices at the 50th and 100th iterations. RFGP and STBO both fall far behind random search. STBO trains the GP on the data that the GP suggests to query, which creates a feedback loop that can be harmful for data acquisition. Optimizing the marginal data likelihood on at most 100 datapoints may in fact not lead to a better model than random initialization (see Tab. 5 in §6). Surprisingly, RFGP, though equipped with the tuning dataset and initially reaching some good values, performed similarly to STBO in the end. Clearly, the contextual information learned by RFGP did not generalize to a new task. On the other hand, MIMO is able to obtain a slightly better error rate than STBOH.

Fig. 2 and Fig. 3 both show that learning the GP prior from data, as HyperBO does, performs much better than other meta BO methods, and that it is a more principled and effective way to obtain the GP prior than hand-tuning. As a reference, Tab. 3 shows the task-wise best validation error rates obtained by the top 5 methods in 100 BO iterations.

Rand STBOH MIMO H* NLL H* KL
WMT XFormer 64
Uniref50 Transformer 128
LM1B Transformer 2048
SVHN WRN 1024
SVHN WRN 256
ImageNet ResNet50 256
ImageNet ResNet50 512
MNIST CNNPoolTanh 2048
MNIST CNNPoolTanh 256
MNIST CNNPoolReLU 2048
MNIST CNNPoolReLU 256
MNIST CNNReLU 2048
MNIST CNNReLU 256
Fashion CNNPoolTanh 2048
Fashion CNNPoolTanh 256
Fashion CNNPoolReLU 2048
Fashion CNNPoolReLU 256
Fashion CNNReLU 2048
Fashion CNNReLU 256
CIFAR100 WRN 2048
CIFAR100 WRN 256
CIFAR10 WRN 2048
CIFAR10 WRN 256
Table 3: The mean and standard error of the best validation error rates for each test task in the offline optimizer hyperparameter tuning experiments. Meta BO methods, including MIMO and the HyperBO variants (H* NLL and H* KL), have access to training tasks that do not share the same task dataset as the test task. We show results of the top 5 methods, and we highlight the lowest error rates in bold.

To more precisely quantify HyperBO's advantage, we also computed how much faster HyperBO reaches a better error rate than the best alternatives, which can differ from task to task. We found that, on a large fraction of tasks, H* NLL is at least 2.86 times faster than the best non-HyperBO alternatives, while H* KL is at least 3.26 times faster than the best non-HyperBO alternatives. Moreover, on a large fraction of tasks, H* NLL is at least 7.74 times faster than random search, and H* KL is at least 6.07 times faster than random search.

5.3.2 Effect of number of training tasks

Figure 4: Aggregated BO results on 23 tasks (all in Table 1 except ImageNet ResNet50 2048, which has insufficient data), using models trained on 3 to 23 training tasks. Note that the models are never trained on data from the test task that we run BO on. If the number of training tasks is less than 23, we first remove the tasks that involve the same task dataset as the test task and then remove others randomly until we reach the designated number of training tasks. The top left shows the median and 20/80 percentiles of the regret on the best validation error rate for each method. The rest are violin plots showing the regret for MIMO, H* NLL and H* KL, where white dots indicate the median and black lines the 20/80 percentiles.

We now investigate the impact of the number of training tasks on the performance of meta BO methods. In Fig. 4 we show the BO simple regrets on tasks from Table 1 (except ImageNet ResNet50 2048) using meta BO models trained on different numbers of training tasks. To analyze the performance of all methods on less-related tasks, we first remove training tasks that have the same task dataset as the current test task, and then remove randomly selected training tasks from the rest.

HyperBO variants were able to reduce the simple regret as more training tasks were given. Interestingly, both H* NLL and H* KL are already slightly better than Rand and STBOH when starting with only 3 training tasks. There are reasonable fluctuations in the results, but overall the regret trends downward as the number of training tasks increases. MIMO also reduced its regret when the number of tasks increased from 8 to 18. RFGP, however, fails to learn from the training tasks, possibly because it did not learn good task embeddings for its GP regression model.

5.3.3 Effect of number of data points in training tasks

Figure 5: Aggregated BO results on 23 tasks (all in Table 1 except ImageNet ResNet50 2048, which has insufficient data), using models trained on different fractions of the data in each task. Note that the models are never trained on data from the test task that we run BO on. The top left shows the median and 20/80 percentiles of the simple regret in log scale. The rest of the figures are simple regret violin plots for MIMO and H* NLL.

One remaining question is how M_i from §3, the number of data points in each training task, affects the performance of meta BO methods. We analyze the impact of M_i by removing a portion of the data we have access to for each task, sweeping over several percentages of remaining data. The remaining datapoints are selected uniformly at random, which breaks the structure of the matching data. Hence we do not include H* KL in this comparison, as H* KL only makes use of matching data.

Fig. 5 shows how the simple regret changes as the fraction of training data grows. For smaller fractions of training data, we observe a clear trend that more data leads to lower regret for both H* NLL and MIMO, and relatively no change for RFGP. We also found that the performance of HyperBO (H* NLL) does not change much once the fraction of training data becomes large. However, MIMO and RFGP suffer significantly from additional data at the largest fractions. It is not entirely clear why MIMO and RFGP behave this way. One conjecture is that neural network based Bayesian linear regression models may become overconfident once the amount of data reaches a certain threshold, which means much less exploration when those models are used for BO.

5.3.4 Training on all but one task

We also studied the case where meta BO approaches have access both to training tasks that do not use the same task dataset and to training tasks that use the same task dataset but different model configurations. This is especially common in architecture search: we aim to find the best model, and we are tuning the optimizer hyperparameters for a new machine learning model given tuning data from other models on the same task dataset.

For this section only, we added a new baseline, MAF: we refer to the meta BO method from Volpp et al. (2020) as MAF (Meta Acquisition Function) to avoid confusion. MAF uses reinforcement learning to learn an acquisition function, modeled by a neural network, over a set of transfer learning tasks. All MAF results were generated using the code from Volpp et al. (2020); see App. C.3 for experimental details. As MAF takes significantly longer to run than HyperBO and the other methods, we only include its results in this section.

Figure 6: Aggregated leave-one-out BO convergence results on 23 tasks, each with 5 repeats using different random seeds. The leftmost panel shows the median and 20/80 percentiles of the regrets on error rates. We also show violin plots of its two vertical slices at the 50th and 100th iterations, where the white dot is the median and the black line spans the 20/80 percentiles.

We carried out a series of leave-one-out experiments, where we picked one task as the BO test function and let meta BO methods train on the remaining tasks. In Fig. 6, we aggregated results from all 23 tasks to show the trend of how each method performs.

The conclusions are similar to those from §5.3.1. As expected, STBO, without any tricks to avoid the pitfalls of vanilla BO, did not show very good results. We inspected its learned GP, which mimicked a Dirac delta function: flat almost everywhere except at a few locations. Hence it became very confident that it had landed at a good spot and lost its ability to explore.

STBOH, on the other hand, achieved very competitive results because it uses hand-tuned priors on all of its GP parameters. STBOH hence represents meta BO where the meta learning is performed by experts with years of experience. All of our meta BO methods here, however, train for at most a few hours. As part of the goals of meta learning, we would like to show that it is possible for meta BO methods to exceed, or at least match, STBOH.

Both HyperBO variants obtained better results than the hand-tuned STBOH. Especially in the first few BO iterations, HyperBO locates much better hyperparameters than all other methods.

Tab. 4 presents the mean and standard error of the best validation error rates achieved in 100 BO iterations on the 23 tasks. HyperBO and its variants achieved the best performance on 20 out of 23 tasks. In Fig. 7, we show the optimization curves for 4 individual tasks that are considered most difficult because few similar task datasets are present in their training data. On all 4 of these difficult tasks, HyperBO identified good hyperparameters much sooner than its competitors.

Figure 7: Leave-one-out log regret mean and standard deviation results on ImageNet ResNet50 512, LM1B Transformer 2048, WMT XFormer 64 and Uniref50 Transformer 128. All methods were repeated 5 times with different random seeds to initialize their models. In LM1B Transformer 2048, H* NLL and H* KL disappeared around 60 to 80 BO iterations because they reached 0 regret.
Rand STBOH MIMO MAF H* NLL H* KL
WMT XFormer 64
Uniref50 Transformer 128
LM1B Transformer 2048
SVHN WRN 1024
SVHN WRN 256
ImageNet ResNet50 256
ImageNet ResNet50 512
MNIST CNNPoolTanh 2048
MNIST CNNPoolTanh 256
MNIST CNNPoolReLU 2048
MNIST CNNPoolReLU 256
MNIST CNNReLU 2048
MNIST CNNReLU 256
Fashion CNNPoolTanh 2048
Fashion CNNPoolTanh 256
Fashion CNNPoolReLU 2048
Fashion CNNPoolReLU 256
Fashion CNNReLU 2048
Fashion CNNReLU 256
CIFAR100 WRN 2048
CIFAR100 WRN 256
CIFAR10 WRN 2048
CIFAR10 WRN 256
Table 4: The mean and standard error of the best validation error rates for each test task in the offline leave-one-out experiments. We show results of the top 6 methods, and we highlight the lowest error rates in bold.

5.4 Results on online optimizer hyperparameter tuning tasks

Figure 8: Results of running BO methods in the online setting on 9 different tasks. The image-based tasks all use the best validation error rate as the objective, while the text-based tasks (LM1B, Uniref50 and WMT) use the best validation cross-entropy loss. HyperBO methods achieved better results in 7 out of 9 tasks.

Finally, we look into the online BO setting where we optimize over the full hypercube. In the online setting, some combinations of hyperparameters may be infeasible to evaluate. For example, an overly big learning rate may lead to divergence in gradients, in which case we do not obtain a valid model. To address this, we pre-process the function values so that infeasible evaluations map to a fixed worst value, while bad (but feasible) evaluations approach that value asymptotically. More precisely, for each sub-dataset we applied to each successful evaluation a monotone mapping parameterized by the median of that sub-dataset's values.

In this section, we set the HyperBO variants and STBO to share exactly the same GP-UCB acquisition function as STBOH, MIMO and RFGP, with the same UCB coefficient for all methods. The variants of HyperBO are as follows:

  • H* NLL: HyperBO with UCB as the acquisition function and negative log marginal likelihood (NLL) as the objective function.

  • H* NLLKL: HyperBO with UCB as the acquisition function and NLL plus 10 times KL divergence on matching datapoints as the objective function. See §A for more details.

In Fig. 8, we include online tuning results for selected tasks due to limited compute resources. We noticed that some methods, e.g. STBO and MIMO, find it very difficult to recover from a “bad” datapoint, partly because predictions from these models are strongly tied to the initial observations. For example, STBO may overfit to an initial bad value and conclude that the entire search space is bad. Nevertheless, in 7 out of 9 tasks, the HyperBO methods performed best among all methods compared.

6 Discussion

In this work, we focused on the question of how to make use of multi-task data to enable better Bayesian optimization. For our investigation, we made simplifications such as sequential evaluations and a shared search space across tasks. Our method also relies on an important assumption: functions of all tasks are i.i.d. samples from the same GP. In this section, we explore how reasonable the i.i.d. assumption is and discuss extensions to our work that would enable even more flexible uses.

Assumption on i.i.d. GP samples.

To get a better idea of how much our assumptions help GP training, we compare the NLLs on the 23 tasks from §5.1 for models obtained via 3 scenarios:

  (a) No training: a randomly initialized model with no training;

  (b) Single task: models trained on 100 randomly selected data points of the test task;

  (c) H*: models trained on 18 irrelevant tasks selected in §5.3.2.

Here case (c) corresponds to the method HyperBO uses for training a GP, and case (b) corresponds to the model STBO can obtain with 100 initial observations. In Tab. 5, we show the NLLs of these 3 methods on all tasks[8] and the NLLs on the test task. Note that the held-out tasks for some test tasks are the same because of the hold-out rules in §5.3.1.

[8] All tasks include ImageNet ResNet50 2048, but it is excluded from the test tasks in Tab. 5 because it has much fewer data points than the others.

Comparing the NLLs of the test tasks for models without training and models trained via the marginal likelihood as in STBO, it is perhaps surprising that training on a subset of data points from the test task's sub-dataset not only fails to lower the NLL on the entire sub-dataset, but even makes it worse on 20 out of 23 test tasks. Optimizing the NLL on part of a sub-dataset leads to severe over-fitting. We observe the same pattern for the NLLs on all tasks: single-task training leads to higher NLLs than no training for all models trained on different sub-datasets.

Our method H*, on the other hand, consistently achieves lower NLLs on both the test task and all tasks. Although it is not entirely clear how a better NLL of the GP relates to better BO results, achieving lower NLLs typically means that the model fits the dataset better. Hence, under the assumption of typical BO methods, the test function should look like a sample from our model, so lower NLLs help satisfy that assumption. By enhancing it with our assumption of i.i.d. GP samples, Tab. 5 shows that we are able to obtain models with a much better fit to the data.

NLL of the test task only NLL of all tasks (Pseudo) KL
Test task No training Single task H* Single task H* Single task H*
WMT XFormer 64
Uniref50 Transformer 128
LM1B Transformer 2048
SVHN WRN 1024
SVHN WRN 256
ImageNet ResNet50 256
ImageNet ResNet50 512
MNIST CNNPoolTanh 2048
MNIST CNNPoolTanh 256
MNIST CNNPoolReLU 2048
MNIST CNNPoolReLU 256
MNIST CNNReLU 2048
MNIST CNNReLU 256
Fashion CNNPoolTanh 2048
Fashion CNNPoolTanh 256
Fashion CNNPoolReLU 2048
Fashion CNNPoolReLU 256
Fashion CNNReLU 2048
Fashion CNNReLU 256
CIFAR100 WRN 2048
CIFAR100 WRN 256
CIFAR10 WRN 2048
CIFAR10 WRN 256
Table 5: NLLs on 23 tasks and (pseudo) KL divergences on matching datasets with trained and randomly initialized GP models. The NLL of randomly initialized model (No training) on all tasks is . The KL value of randomly initialized model (No training) is . Training on a subset of a sub-dataset in the test task (Single task) often leads to much worse marginal likelihood on the entire sub-dataset. Training on irrelevant tasks (H*) achieves much lower (pseudo) KLs on matching datasets and lower NLLs for both the test task only and all tasks.

We also computed the (pseudo) KL divergence across all matching datasets in the last columns of Tab. 5. See Appendix A for a comprehensive analysis of the pseudo KL divergence for degenerate multivariate Gaussians. Note that the pseudo KL divergence can be negative; we use the pseudo KL divergence whenever the matching dataset requires it. Again, single-task training leads to unstable (pseudo) KL values, sometimes even higher than without training. On the contrary, training with H* leads to much more stable and lower KL values. This indicates that the model learned to predict similarly to the sample mean/covariance estimates, which Theorem 2 shows helps the selection of BO query points.

Batch evaluation.

For simplicity, in this paper we did not consider batch evaluation and focused only on the prior selection dimension of the challenges in BO. However, it is straightforward to adopt any batch BO method in conjunction with HyperBO to support obtaining observations in parallel. For example, we can directly use the batch methods of Snoek et al. (2012), Kathuria et al. (2016) or Wang et al. (2017) to replace line 5 of Alg. 1.

High-dimensional and large scale data.

Similar to batch BO, our method can be naturally combined with most high-dimensional and large-scale BO methods to offer more capabilities. In these cases, a probabilistic model different from a vanilla GP is typically adopted. In line 2 of Alg. 1, it is straightforward to adapt our method to instead optimize the cumulative marginal likelihood in Eq. 4 for the new model. Our meta-learning idea in fact also benefits high-dimensional and large-scale BO methods by helping them better identify their critical special structures, e.g. low-dimensional embeddings (Wang et al., 2016), cylindrical kernels (Oh et al., 2018) or additive Mondrian kernels (Wang et al., 2018a).

Different search spaces.

Roughly speaking, there are two circumstances for different search spaces. Case I is that tasks share the same search variables, but the search ranges for some variables differ; for example, each function f_i may be defined on its own hyper-rectangle X_i. In this case, our solution still applies by simply setting a union search space X = ∪_{i∈[N]} X_i for learning, and using the designated search space of the new task for optimization.

Case II is more complicated: the search space of each function f_i may have dimensions whose meanings differ from those of another task's search space. This paper does not provide a solution for this scenario. Further research will be needed to reduce Case II to Case I, which could then be immediately combined with HyperBO.

7 Conclusion

We proposed HyperBO: a novel meta BO approach that supports practical applications involving continuous inputs queried at possibly non-aligned locations across tasks. HyperBO uses a simple yet effective idea that is easy to implement and efficient to run. We evaluated HyperBO on real-world optimizer tuning tasks for large models, and the results demonstrated its superior performance over state-of-the-art competing methods.

References

  • Bardenet et al. (2013) Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In ICML, 2013.
  • Bergstra et al. (2011) James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, 2011.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Brazdil et al. (1994) Pavel Brazdil, Joāo Gama, and Bob Henery. Characterizing the applicability of classification algorithms using meta-level learning. In ECML, 1994.
  • Chen et al. (2017) Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. In ICML, 2017.
  • Feurer et al. (2015) Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In NeurIPS, 2015.
  • Gilmer et al. (2021) Justin M. Gilmer, George E. Dahl, and Zachary Nado. init2winit: a jax codebase for initialization, optimization, and tuning research, 2021. URL http://github.com/google/init2winit.
  • Hansen et al. (2021) Nikolaus Hansen, Anne Auger, Raymond Ros, Olaf Mersmann, Tea Tušar, and Dimo Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software, 36(1):114–144, 2021. URL https://arxiv.org/pdf/1603.08785.pdf.
  • Havasi et al. (2020) Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M Dai, and Dustin Tran. Training independent subnetworks for robust prediction. arXiv preprint arXiv:2010.06610, 2020.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Kathuria et al. (2016) Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In NeurIPS, 2016.
  • Kemp and Tenenbaum (2008) Charles Kemp and Joshua B Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687–10692, 2008.
  • Kim et al. (2017) Beomjoon Kim, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning to guide task and motion planning using score-space representation. In ICRA, 2017.
  • Kim et al. (2019) Beomjoon Kim, Zi Wang, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning to guide task and motion planning using score-space representation. The International Journal of Robotics Research, 38(7):793–812, 2019.
  • Krause and Ong (2011) Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In NeurIPS, 2011.
  • Liu et al. (2020) Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. arXiv preprint arXiv:2006.10108, 2020.
  • Malkomes and Garnett (2018) Gustavo Malkomes and Roman Garnett. Automating Bayesian optimization with Bayesian optimization. Advances in Neural Information Processing Systems, 31:5984–5994, 2018.
  • Nado et al. (2021) Zachary Nado, Justin Gilmer, Christopher J. Shallue, Rohan Anil, and George E. Dahl. A large batch optimizer reality check: Traditional, generic optimizers suffice across batch sizes. CoRR, abs/2102.06356, 2021. URL https://arxiv.org/abs/2102.06356.
  • Nesterov (1983) Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  • Oh et al. (2018) ChangYong Oh, Efstratios Gavves, and Max Welling. Bock: Bayesian optimization with cylindrical kernels. In ICML, 2018.
  • Onomous (2021) Anne Onomous. Anonymized, 2021. URL http://github.com/ANONYMOUS.
  • Poloczek et al. (2016) Matthias Poloczek, Jialei Wang, and Peter I Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference (WSC). IEEE, 2016.
  • Poloczek et al. (2017) Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. In NeurIPS, 2017.
  • Rahimi et al. (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS, 2007.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. The MIT Press, 2006.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
  • Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180. PMLR, 2015.
  • Springenberg et al. (2016) Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. Advances in Neural Information Processing Systems, 29:4134–4142, 2016.
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • Swersky et al. (2013) Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In NeurIPS, 2013.
  • Turner et al. (2021) Ryan Turner, David Eriksson, Michael McCourt, Juha Kiili, Eero Laaksonen, Zhen Xu, and Isabelle Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. arXiv, abs/2104.10201, 2021.
  • Volpp et al. (2020) Michael Volpp, Lukas P Fröhlich, Kirsten Fischer, Andreas Doerr, Stefan Falkner, Frank Hutter, and Christian Daniel. Meta-learning acquisition functions for transfer learning in Bayesian optimization. In International Conference on Learning Representations (ICLR), 2020.
  • Wang et al. (2017) Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In ICML, 2017.
  • Wang et al. (2018a) Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In AISTATS, 2018a.
  • Wang et al. (2018b) Zi Wang, Beomjoon Kim, and Leslie Pack Kaelbling. Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior. In NeurIPS, 2018b.
  • Wang et al. (2016) Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.
  • Yogatama and Mann (2014) Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AISTATS, 2014.

Appendix A Objective functions

In §4, we presented NLL and KL divergence as objectives. Below we derive the KL divergence between a regular multivariate Gaussian and a degenerate multivariate Gaussian, which is the case for most of our matching-data settings in §5.1: the number of matching data points is greater than the number of training tasks. At the end of this section, we introduce a new objective function combining NLL and KL, which can be interpreted as MAP estimation with a data-dependent prior.

KL divergence for a degenerate multivariate Gaussian

Eq. 5 of §4.2 gives the KL divergence between two Gaussians in the non-degenerate case. In practice, when we minimize Eq. 5, we can simply remove the constants and do the following

\[
\min_{\mu, \Sigma} \;\; \operatorname{tr}\big(\Sigma^{-1}\tilde{\Sigma}\big) + (\tilde{\mu} - \mu)^\top \Sigma^{-1} (\tilde{\mu} - \mu) + \log\det\Sigma. \tag{8}
\]

Here the variables we care about, the GP parameters, only appear through the model's mean vector $\mu$ and covariance matrix $\Sigma$ over the matching data; $\tilde{\mu}$ and $\tilde{\Sigma}$ denote the fixed sample mean and covariance estimates on the same data. Even if the sample mean and covariance estimate is degenerate, the optimization objective stays the same, as reflected by the derivation below.
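
Before turning to the degenerate case, the following is a minimal numerical sketch of the constant-free objective in Eq. 8, assuming the model's mean vector and covariance matrix on the matching data have already been computed; the function name, the jitter term, and the Cholesky-based implementation are choices made only for this illustration.

```python
import numpy as np

def kl_objective(mu, cov, mu_tilde, cov_tilde, jitter=1e-6):
    """Constant-free KL objective of Eq. 8:
    tr(cov^{-1} cov_tilde) + (mu_tilde - mu)^T cov^{-1} (mu_tilde - mu) + log det(cov).

    mu, cov             : model mean vector / covariance matrix on the matching data.
    mu_tilde, cov_tilde : sample mean / covariance estimates (may be degenerate).
    """
    cov = cov + jitter * np.eye(len(mu))             # numerical stability
    chol = np.linalg.cholesky(cov)
    solve = lambda b: np.linalg.solve(chol.T, np.linalg.solve(chol, b))
    diff = mu_tilde - mu
    trace_term = np.trace(solve(cov_tilde))
    quad_term = diff @ solve(diff)
    logdet_term = 2.0 * np.sum(np.log(np.diag(chol)))
    return trace_term + quad_term + logdet_term
```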

If $\tilde{\Sigma}$ is degenerate, its base measure is at most $L$-dimensional rather than $M$-dimensional, since there exists a full-rank matrix $Q \in \mathbb{R}^{M \times L}$ such that $\tilde{\Sigma} = QQ^\top$ ($L < M$). Here $M$ is the number of matching data points, $N$ is the number of training tasks, and $L \le N$ is the rank of $\tilde{\Sigma}$ and $Q$. The KL divergence is not well-defined in this case because the base measure of $\mathcal{N}(\tilde{\mu}, \tilde{\Sigma})$ differs from that of $\mathcal{N}(\mu, \Sigma)$, given that $\Sigma$ is full-rank. However, it is still possible to derive a pseudo KL divergence, as shown below.

Let the degenerate Gaussian be