JUMBO: Scalable Multi-task Bayesian Optimization using Offline Data

06/02/2021 ∙ by Kourosh Hakhamaneshi, et al. ∙ 0

The goal of Multi-task Bayesian Optimization (MBO) is to minimize the number of queries required to accurately optimize a target black-box function, given access to offline evaluations of other auxiliary functions. When offline datasets are large, the scalability of prior approaches comes at the expense of expressivity and inference quality. We propose JUMBO, an MBO algorithm that sidesteps these limitations by querying additional data based on a combination of acquisition signals derived from training two Gaussian Processes (GPs): a cold-GP operating directly in the input domain and a warm-GP that operates in the feature space of a deep neural network pretrained using the offline data. Such a decomposition can dynamically control the reliability of information derived from the online and offline data, and the use of pretrained neural networks permits scalability to large offline datasets. Theoretically, we derive regret bounds for JUMBO and show that it achieves no-regret under conditions analogous to GP-UCB (Srinivas et al., 2010). Empirically, we demonstrate significant performance improvements over existing approaches on two real-world optimization problems: hyper-parameter optimization and automated circuit design.




1 Introduction

Many domains in science and engineering involve the optimization of an unknown black-box function. Such functions can be expensive to evaluate, due to costs such as time and money. Bayesian optimization (BO) is a popular framework for such problems as it seeks to minimize the number of function evaluations required for optimizing a target black-box function shahriari2015taking; frazier2018tutorial. In real-world scenarios, however, we often have access to offline evaluations of one or more auxiliary black-box functions related to the target function. For example, one might be interested in finding the optimal hyperparameters of a machine learning model for a given problem and may have access to an offline dataset from previous runs of training the same model on a different dataset for various configurations. Multi-task Bayesian optimization (MBO) is an optimization paradigm that extends BO to exploit such additional sources of information from related black-box functions for efficient optimization.


Early works in MBO employ multi-task Gaussian Processes (GPs) with inter-task kernels to capture the correlations between the auxiliary and target functions swersky2013multi; williams2007multi; poloczek2016warm. Multi-task GPs, however, fail to scale to large offline datasets. More recent works have proposed combining neural networks (NNs) with probabilistic models to improve scalability. For example, MT-BOHAMIANN springenberg2016bayesian uses Bayesian NNs (BNNs) neal2012bayesian as surrogate models for MBO. Its performance, however, depends on the quality of the inference procedure. In contrast, MT-ABLR perrone2018scalable uses a deterministic NN followed by a Bayesian Linear Regression (BLR) layer at the output to achieve scalability while permitting exact inference. However, the use of a linear kernel can limit the expressiveness of the posterior.

We propose JUMBO, an MBO algorithm that sidesteps the limitations in expressivity and tractability of prior approaches. In JUMBO, we first train a NN on the auxiliary data to learn a feature space, akin to MT-ABLR but without the BLR restriction on the output layer. Thereafter, we train two GPs simultaneously for online data acquisition via BO: a warm-GP on the feature space learned by the NN and a cold-GP on the raw input space. The acquisition function in JUMBO combines the individual acquisition functions of both the GPs. It uses the warm-GP to reduce the search space by filtering out poor points. The remaining candidates are scored by the acquisition function for the cold-GP to account for imperfections in learning the feature space of the warm-GP. The use of GPs in the entire framework ensures tractability in posterior inference and updates.

Theoretically, we show that JUMBO is a no-regret algorithm under conditions analogous to those used for analyzing GP-UCB srinivas2010gaussian. In practice, we observe significant improvements over the closest baseline on two real-world applications: transferring prior knowledge in hyper-parameter optimization and automated circuit design.

2 Background

We are interested in maximizing a target black-box function $f: \mathcal{X} \to \mathbb{R}$ defined over a discrete or compact set $\mathcal{X} \subseteq \mathbb{R}^d$. We assume only query access to $f$. For every query point $x$, we receive a noisy observation $y = f(x) + \epsilon$. Here, we assume $\epsilon$ is zero-mean Gaussian noise, i.e., $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, where $\sigma_n$ is the noise standard deviation. Our strategy for optimizing $f$ will be to learn a probabilistic model for regressing the inputs $x$ to $y$ using the available data and using that model to guide the acquisition of additional data for updating the model. In particular, we will be interested in using Gaussian Process regression models within a Bayesian Optimization framework, as described below.

2.1 Gaussian Process (GP) Regression

A Gaussian Process (GP) is defined as a set of random variables such that any finite subset of them follows a multivariate normal distribution. A GP can be used to define a prior distribution over the unknown function $f$, which can be converted to a posterior distribution once we observe additional data. Formally, a GP prior is defined by a mean function $\mu_0: \mathcal{X} \to \mathbb{R}$ and a valid kernel function $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. A kernel function is valid if it is symmetric and the Gram matrix $\mathbf{K}$ is positive semi-definite. Intuitively, the entries of the kernel matrix $\mathbf{K}_{ij} = \kappa(x_i, x_j)$ measure the similarity between any two points $x_i$ and $x_j$. Given points $X = \{x_1, \dots, x_n\}$, the distribution of the function evaluations $\mathbf{f} = [f(x_1), \dots, f(x_n)]$ in a GP prior follows a normal distribution, such that $\mathbf{f} \mid X \sim \mathcal{N}(\boldsymbol{\mu}_0, \mathbf{K})$, where $\boldsymbol{\mu}_0 = [\mu_0(x_1), \dots, \mu_0(x_n)]$ and $\mathbf{K}$ is a covariance matrix. For simplicity, we will henceforth assume $\mu_0$ to be a zero mean function.

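Given a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, let $X$ and $\mathbf{y}$ denote the inputs and their noisy observations. Since the observation model is also assumed to be Gaussian, the posterior over $f$ at a test set of points $X^*$ will follow a multivariate normal distribution with the following mean and covariance:

$$\mu(X^*) = \mathbf{K}(X^*, X)\,[\mathbf{K}(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{y},$$
$$\Sigma(X^*) = \mathbf{K}(X^*, X^*) - \mathbf{K}(X^*, X)\,[\mathbf{K}(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{K}(X, X^*),$$

where $\mathbf{K}(A, B)$ denotes the matrix of kernel evaluations between the points in $A$ and $B$, and $\sigma_n$ is the noise standard deviation.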

Because posterior computation requires inverting an $n \times n$ matrix, at a cost that is cubic in the number of training points $n$, standard GPs can be computationally prohibitive for modeling large datasets. We direct the reader to rasmussen2003gaussian for an overview of GPs.
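As a concrete illustration of the zero-mean GP posterior described above, the following is a minimal NumPy sketch. The RBF kernel, its lengthscale, the toy data, and the function names `rbf_kernel` and `gp_posterior` are all assumptions made for this example; they are not taken from the JUMBO codebase.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_std=0.1, kernel=rbf_kernel):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    k_xx = kernel(x_train, x_train) + noise_std**2 * np.eye(len(x_train))
    k_sx = kernel(x_test, x_train)
    k_ss = kernel(x_test, x_test)
    chol = np.linalg.cholesky(k_xx)  # the O(n^3) step that limits scalability
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_sx.T)
    mean = k_sx @ alpha
    cov = k_ss - v.T @ v
    return mean, cov

# Toy data (hypothetical): three noisy-free samples of sin(3x) on [0, 1].
x_train = np.array([[0.0], [0.5], [1.0]])
y_train = np.sin(3.0 * x_train[:, 0])
x_test = np.array([[0.25], [0.5]])
mean, cov = gp_posterior(x_train, y_train, x_test, noise_std=1e-3)
```

The Cholesky factorization replaces the explicit matrix inverse, which is the usual numerically stable choice; its cost still grows cubically with the number of training points.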

2.2 Bayesian Optimization (BO)

Bayesian Optimization (BO) is a class of sequential algorithms for sample-efficient optimization of expensive black-box functions frazier2018tutorial; shahriari2015taking. A BO algorithm typically runs for a fixed number of rounds $T$. At every round $t$, the algorithm selects a query point $x_t$ and observes a noisy function value $y_t$. To select $x_t$, the algorithm first infers the posterior distribution over functions via a probabilistic model (e.g., Gaussian Processes). Thereafter, $x_t$ is chosen to optimize an uncertainty-aware acquisition function that balances exploration and exploitation. For example, a popular acquisition function is the Upper Confidence Bound (UCB), which prefers points that have high expected value (exploitation) and high uncertainty (exploration). With the new point $(x_t, y_t)$, the posterior distribution can be updated and the whole process is repeated in the next round.
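To make the exploration and exploitation trade-off of UCB concrete, here is a small sketch using the GP-UCB form, posterior mean plus $\sqrt{\beta}$ times posterior standard deviation. The posterior values below are made-up numbers for four candidate points, and `beta` is a free parameter in this example.

```python
import numpy as np

def ucb(mean, std, beta):
    """Upper Confidence Bound: posterior mean plus sqrt(beta) * posterior std."""
    return mean + np.sqrt(beta) * std

# Hypothetical posterior summaries at four candidate points.
mean = np.array([0.2, 0.5, 0.4, 0.1])
std = np.array([0.05, 0.05, 0.30, 0.60])

greedy_pick = int(np.argmax(ucb(mean, std, beta=0.0)))  # pure exploitation
ucb_pick = int(np.argmax(ucb(mean, std, beta=4.0)))     # uncertainty-aware
```

With `beta=0` the rule picks the candidate with the highest mean (index 1); with `beta=4` it picks the most uncertain candidate (index 3), whose upper bound is highest.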

At round $t$, we define the instantaneous regret as $r_t = f(x^*) - f(x_t)$, where $x^*$ is the global optimum and $x_t$ maximizes the acquisition function. Similarly, we can define the cumulative regret at round $T$ as the sum of instantaneous regrets, $R_T = \sum_{t=1}^{T} r_t$. A desired property of any BO algorithm is to be no-regret, meaning that the cumulative regret is sub-linear in $T$ as $T \to \infty$, i.e., $\lim_{T \to \infty} R_T / T = 0$.
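The regret quantities can be computed directly when the true function values are known, as in a benchmark. The sketch below also computes the simple regret (gap between the optimum and the best value found so far), which is the quantity reported in the experiments; the helper name `regrets` and the example values are assumptions for illustration.

```python
import numpy as np

def regrets(f_opt, f_queried):
    """Instantaneous, cumulative, and simple regret for a maximization problem."""
    f_queried = np.asarray(f_queried, dtype=float)
    instantaneous = f_opt - f_queried
    cumulative = np.cumsum(instantaneous)
    simple = f_opt - np.maximum.accumulate(f_queried)
    return instantaneous, cumulative, simple

# Hypothetical noiseless function values at four successive queries.
inst, cum, simple = regrets(1.0, [0.2, 0.7, 0.5, 1.0])
average_regret = cum[-1] / len(cum)  # should tend to 0 for a no-regret algorithm
```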

2.3 Multi-task Bayesian Optimization (MBO)

The setting we focus on in this work is a variant of BO called Multi-Task Bayesian Optimization (MBO). Here, we assume $K$ auxiliary real-valued black-box functions $\{f_1, \dots, f_K\}$, each having the same domain $\mathcal{X}$ as the target function $f$ swersky2013multi; springenberg2016bayesian. For each function $f_k$, we have an offline dataset $\mathcal{D}_k$ consisting of pairs of input points $x$ and the corresponding function evaluations $f_k(x)$. If these auxiliary functions are related to the target function, then we can transfer knowledge from the offline data to improve the sample-efficiency for optimizing $f$. In certain applications, we might also have access to offline data from $f$ itself. However, in practice, $f$ is typically expensive to query and its offline dataset will be very small.

We discuss some prominent works in MBO that are most closely related to our proposed approach below. See Section 5 for further discussion about other relevant work.

Multi-task BO swersky2013multi is an early approach that employs a custom kernel within a multi-task GP williams2007multi to model the relationship between the auxiliary and target functions. Similar to standard GPs, multi-task GPs fail to scale for large offline datasets.

On the other hand, parametric models such as neural networks (NNs) can effectively scale to larger datasets but do not natively quantify uncertainty. Hybrid methods such as DNGO snoek2015scalable achieve scalability for (single-task) BO through the use of a feed-forward deep NN followed by Bayesian Linear Regression (BLR) bishop2006pattern. The NN is trained on the existing data via a simple regression loss (e.g., mean squared error). Once trained, the NN parameters are frozen and the output layer is replaced by BLR for the BO routine. This step can be understood as applying a GP with a linear kernel to the output features of the NN (i.e., $\kappa(x, x') = g_\phi(x)^\top g_\phi(x')$, where $g_\phi$ is the NN function with parameters $\phi$). For BLR, the computational complexity of posterior inference is linear w.r.t. the number of data points, and thus DNGO can scale to large offline datasets.
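The equivalence between BLR on NN features and a GP with a linear kernel can be checked numerically. In this sketch, random features stand in for the outputs of a frozen NN, and an isotropic Gaussian weight prior with variance `prior_var` is assumed; neither choice comes from the DNGO paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4                       # n data points, m output features
phi = rng.normal(size=(n, m))      # stand-in for frozen NN features (assumption)
y = phi @ rng.normal(size=m) + 0.1 * rng.normal(size=n)
phi_test = rng.normal(size=(3, m))
noise_var, prior_var = 0.1**2, 1.0

# BLR in weight space: only an m x m system is solved, so cost is linear in n.
a = phi.T @ phi + (noise_var / prior_var) * np.eye(m)
blr_mean = phi_test @ np.linalg.solve(a, phi.T @ y)
blr_var = noise_var * np.einsum("ij,ij->i", phi_test, np.linalg.solve(a, phi_test.T).T)

# GP in function space with the linear kernel prior_var * phi(x)^T phi(x'):
# an n x n system is solved, so cost is cubic in n.
k_xx = prior_var * phi @ phi.T + noise_var * np.eye(n)
k_sx = prior_var * phi_test @ phi.T
gp_mean = k_sx @ np.linalg.solve(k_xx, y)
gp_var = prior_var * np.sum(phi_test**2, axis=1) - np.einsum(
    "ij,ij->i", k_sx, np.linalg.solve(k_xx, k_sx.T).T
)
```

Both routes give the same predictive mean and variance for the latent function; the weight-space route is what makes BLR cheap when the number of data points is large and the feature dimension is small.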

MT-ABLR perrone2018scalable extends DNGO to multi-task settings by training a single NN to learn a shared representation followed by task-specific BLR layers (i.e. predicting , and based on inputs). The learning objective corresponds to maximizing the sum of the marginal log-likelihoods for each task: . The main task is included in the last index, is the Bayesian Linear layer weights for task with prior , and are the hyper-prior parameters, and is the observed data from task . Learning by directly maximizing the marginal likelihood improves the performance of DNGO while maintaining the computational scalability of its posterior inference in the case of large offline data. However, both DNGO and MT-ABLR implicitly assume the existence of a feature space in which the target function can be expressed as a linear combination of features. This can be a restrictive assumption; furthermore, there is no guarantee that such a feature space can be learned from finite data.

MT-BOHAMIANN springenberg2016bayesian addresses the limited expressivity of prior approaches by employing Bayesian NNs to specify the posterior over and feed the NN with input and additional learned task-specific embeddings for task . While allowing for a principled treatment of uncertainties, fully Bayesian NNs are computationally expensive to train and their performance depends on the approximation quality of stochastic gradient HMC methods used for posterior inference.

3 Scalable MBO via JUMBO

In the previous section, we observed that prior MBO algorithms make trade-offs in either posterior expressivity or inference quality in order to scale to large offline datasets. Our goal is to show that these trade-offs can be significantly mitigated. Consequently, the design of our proposed MBO framework, which we refer to as JUMBO, will be guided by the following desiderata: (1) scalability to large offline datasets (e.g., via NNs); (2) exact and computationally tractable posterior updates (e.g., via GPs); (3) flexible and expressive posteriors (e.g., via non-linear kernels).

3.1 Regression Model

Figure 1: JUMBO. During the pretraining phase, we learn a NN mapping (orange) for the warm-GP. The next query based on (purple) will be the point that has a high score based on the acquisition function of both warm and cold GP (blue).

The regression model in JUMBO is composed of two GPs: a warm-GP and a cold-GP. As shown in Figure 1, both GPs are trained to model the target function but operate in different input spaces, as we describe next.

The warm-GP (with its own hyperparameters) operates on a feature representation of the input space derived from the offline dataset. To learn this feature space, we train a multi-headed feed-forward NN to minimize the mean squared error for each auxiliary task, akin to DNGO snoek2015scalable. Thereafter, in contrast to both DNGO and ABLR, we do not train separate output BLR modules. Rather, we directly train the warm-GP on the output of the NN using the data acquired from the target function. Note that for training the warm-GP, we can use any non-linear kernel, which results in an expressive posterior that allows for exact and tractable inference using closed-form expressions.

Additionally, we can encounter scenarios where some of the auxiliary functions are insufficient in reducing the uncertainty in inferring the target function. In such scenarios, relying solely on the warm-GP can significantly hurt performance. Therefore, we additionally initialize the cold-GP (with its own hyperparameters) directly on the input space.

If we also have access to offline data from the target function, the hyperparameters of the warm and cold GPs can also be pre-trained jointly along with the neural network parameters. The overall pre-training objective (Eq. 1) then includes, for each GP, the negative marginal log-likelihood of that GP on the offline target data.
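For reference, the negative marginal log-likelihood of a zero-mean GP has a closed form, sketched below with NumPy. The RBF kernel, the fixed `feature_map` (a hypothetical stand-in for the pretrained NN), the toy data, and all hyperparameter values are assumptions for this example; how the terms are weighted in the paper's Eq. 1 is not reproduced here.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_nll(inputs, y, lengthscale, noise_std):
    """Negative marginal log-likelihood of a zero-mean GP with an RBF kernel."""
    n = len(y)
    k = rbf_kernel(inputs, inputs, lengthscale) + noise_std**2 * np.eye(n)
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    # 0.5 * y^T K^-1 y + 0.5 * log|K| + (n / 2) * log(2 * pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(chol))) + 0.5 * n * np.log(2 * np.pi)

def feature_map(x):
    """Hypothetical stand-in for the pretrained NN feature extractor."""
    return np.hstack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(20, 1))   # toy offline target inputs
y = np.sin(2 * np.pi * x[:, 0]) + 0.05 * rng.normal(size=20)

nll_cold = gp_nll(x, y, lengthscale=0.3, noise_std=0.05)               # raw inputs
nll_warm = gp_nll(feature_map(x), y, lengthscale=1.0, noise_std=0.05)  # NN features
total_gp_loss = nll_warm + nll_cold
```

In a full implementation these terms would be minimized by gradient descent over the kernel hyperparameters (and, for the warm-GP, the NN parameters), for example with an automatic differentiation library.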

3.2 Acquisition Procedure

After the offline pre-training of JUMBO's regression model, we can use it for online data acquisition in a standard BO loop. The key design choice here is the acquisition function, which we describe next. At round $t$, let $\alpha^{w}_t(x)$ and $\alpha^{c}_t(x)$ be the single-task acquisition functions (e.g., UCB) of the warm and cold GPs, respectively, after observing $t-1$ data points. Next, we define the acquisition function for JUMBO as a convex combination of the individual acquisition functions by employing an interpolation coefficient $\lambda_t(x) \in [0, 1]$. Formally,

$$\alpha_t(x) = \lambda_t(x)\,\alpha^{c}_t(x) + (1 - \lambda_t(x))\,\alpha^{w}_t(x). \qquad (2)$$

Our guiding intuition for the acquisition function in JUMBO is that we are most interested in querying points which are scored highly by the acquisition functions of both GPs.

To this end, note that Eq. 2 essentially captures that points with $\lambda_t(x) \approx 0$ will follow $\alpha^{w}_t$ and points with $\lambda_t(x) \approx 1$ follow $\alpha^{c}_t$. We can also choose $\lambda_t(x)$ to depend on $\alpha^{w}_t(x)$. By choosing $\lambda_t(x)$ to be close to $1$ for points whose warm-GP acquisition value is close to its maximum, we can ensure that we acquire points that have high acquisition scores as per both $\alpha^{w}_t$ and $\alpha^{c}_t$. Next, we will discuss some theoretical results that shed more light on the design of $\lambda_t$.
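The combination rule can be sketched in a few lines. This example assumes a hard (0/1) interpolation coefficient that switches to the cold-GP score wherever the warm-GP score is within `threshold` of its maximum, which matches the description of the warm-GP filtering out poor points and the cold-GP scoring the remaining candidates. The thresholding rule, the function name, and the scores are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def jumbo_acquisition(acq_warm, acq_cold, threshold):
    """Convex combination of two acquisition vectors with an indicator coefficient.

    Assumption: the coefficient is 1 where the warm-GP score is within
    `threshold` of its maximum (cold-GP score is used there) and 0 elsewhere
    (warm-GP score is used there).
    """
    acq_warm = np.asarray(acq_warm, dtype=float)
    acq_cold = np.asarray(acq_cold, dtype=float)
    lam = (acq_warm >= acq_warm.max() - threshold).astype(float)
    return lam * acq_cold + (1.0 - lam) * acq_warm, lam

# Hypothetical acquisition scores at four candidate points.
acq_warm = [0.2, 0.9, 1.0, 0.1]
acq_cold = [0.9, 0.6, 0.3, 0.1]

combined, lam = jumbo_acquisition(acq_warm, acq_cold, threshold=0.15)
pick_jumbo = int(np.argmax(combined))
pick_cold_only = int(np.argmax(jumbo_acquisition(acq_warm, acq_cold, np.inf)[0]))
pick_warm_only = int(np.argmax(acq_warm))
```

Here the cold-GP alone would pick index 0 and the warm-GP alone index 2, while the combined rule picks index 1, which scores well under both. An infinite threshold makes the coefficient 1 everywhere, so the rule falls back to the cold-GP acquisition.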

3.3 Theoretical Analysis

Here, we will formally derive the regret bound for JUMBO and provide insights on the conditions under which JUMBO outperforms GP-UCB srinivas2010gaussian. For this analysis, we will use Upper Confidence Bound (UCB) as our acquisition function for the warm and cold GPs. To do so, we utilize the notion of Maximum Information Gain (MIG).

Definition 1 (Maximum Information Gain srinivas2010gaussian).

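Let $f \sim \mathcal{GP}(0, \kappa)$, $f: \mathcal{X} \to \mathbb{R}$. Consider any $A \subseteq \mathcal{X}$ and let $\tilde{A} = \{x_1, \dots, x_n\} \subseteq A$ be a finite subset. Let $y_{\tilde{A}} = \{y_i\}_{i=1}^{n}$ be noisy observations such that $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, and let $f_{\tilde{A}}$ collect the corresponding function values $f(x_i)$. Let $I$ denote the Shannon mutual information.

The MIG of set $A$ after $n$ evaluations is the maximum mutual information between the function values and observations among all choices of $n$ points in $A$. Formally,

$$\Psi_n(A) = \max_{\tilde{A} \subseteq A,\ |\tilde{A}| = n} I(y_{\tilde{A}}; f_{\tilde{A}}).$$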

This quantity depends on kernel parameters and the set , and also serves as an important tool for characterizing the difficulty of a GP-bandit. For a given kernel, it can be shown that where for discrete and for the continuous case srinivas2010gaussian. For example for Radial Basis kernel . For brevity, we focus on settings where is discrete. Further results and analysis are deferred to Appendix A.
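For a GP with Gaussian observation noise, the mutual information between the function values at a finite set of points and their noisy observations has the closed form $\frac{1}{2}\log\det(I + \sigma^{-2}\mathbf{K})$ (Srinivas et al., 2010), so the MIG of a small discrete set can be found by brute force. The sketch below does this for two hypothetical 1D candidate sets; the RBF kernel, lengthscale, noise level, and set sizes are assumptions for the example.

```python
import itertools
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def information_gain(k_ss, noise_std):
    """I(y_S; f_S) = 0.5 * log det(I + K_S / noise_std^2) for a GP."""
    n = len(k_ss)
    _, logdet = np.linalg.slogdet(np.eye(n) + np.asarray(k_ss) / noise_std**2)
    return 0.5 * logdet

def max_information_gain(points, n, kernel, noise_std):
    """Brute-force MIG over all size-n subsets of a small discrete set."""
    best = 0.0
    for idx in itertools.combinations(range(len(points)), n):
        sub = points[list(idx)]
        best = max(best, information_gain(kernel(sub, sub), noise_std))
    return best

wide = np.linspace(0.0, 1.0, 8)[:, None]     # candidates spread over [0, 1]
narrow = np.linspace(0.0, 0.1, 8)[:, None]   # candidates squeezed into [0, 0.1]
mig_wide = max_information_gain(wide, 3, rbf_kernel, noise_std=0.1)
mig_narrow = max_information_gain(narrow, 3, rbf_kernel, noise_std=0.1)
```

The squeezed set yields a smaller MIG because its points are strongly correlated under the kernel, so each additional observation is less informative. This is the sense in which a set with lower MIG is easier for a GP to learn.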

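For GP-UCB srinivas2010gaussian, it has been shown that for any $\delta \in (0, 1)$, if $f \sim \mathcal{GP}(0, \kappa)$ (i.e., the GP assigns non-zero probability to the target function $f$), then the cumulative regret $R_T$ after $T$ rounds will be bounded with probability at least $1 - \delta$:

$$R_T \le \sqrt{C_1\, T\, \beta_T\, \Psi_T(\mathcal{X})} \qquad (3)$$

with $C_1 = 8 / \log(1 + \sigma_n^{-2})$ and $\beta_t = 2 \log\left(|\mathcal{X}|\, t^2 \pi^2 / (6\delta)\right)$, where $\Psi_T(\mathcal{X})$ is the MIG of $\mathcal{X}$ after $T$ evaluations.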
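As a numeric illustration, the constants of the finite-domain GP-UCB bound from Srinivas et al. (2010), $\beta_t = 2\log(|\mathcal{X}| t^2 \pi^2 / (6\delta))$ and $C_1 = 8/\log(1 + \sigma^{-2})$, can be evaluated as follows. The MIG value passed in (`mig_T`) is a made-up number, since it depends on the kernel and the domain.

```python
import math

def beta_t(t, domain_size, delta):
    """Confidence parameter of GP-UCB for a finite domain (Srinivas et al., 2010)."""
    return 2.0 * math.log(domain_size * t**2 * math.pi**2 / (6.0 * delta))

def c1(noise_std):
    """Constant C_1 = 8 / log(1 + noise_std^-2)."""
    return 8.0 / math.log(1.0 + noise_std**-2)

def gp_ucb_regret_bound(T, domain_size, delta, noise_std, mig_T):
    """sqrt(C_1 * T * beta_T * MIG_T); mig_T must be supplied by the caller."""
    return math.sqrt(c1(noise_std) * T * beta_t(T, domain_size, delta) * mig_T)

# Hypothetical setting: 100 candidate points, 100 rounds, assumed MIG of 5.
bound = gp_ucb_regret_bound(T=100, domain_size=100, delta=0.1, noise_std=0.1, mig_T=5.0)
```

Since $\beta_T$ grows only logarithmically in $T$, the bound is sub-linear in $T$ whenever the MIG grows slowly enough, which is what makes GP-UCB no-regret.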

Recall that the pretrained NN is a mapping from the input space to the feature space. We will further make the following modeling assumptions to ensure that the target black-box function is a sample from both the cold and warm GPs.

Assumption 1.


Assumption 2.

Let denote the NN parameters obtained via pretraining (Eq. 1). Then, there exists a function such that .

Theorem 1.

Let and be some arbitrary partitioning of the input domain . Define the interpolation coefficient as an indicator . Then under Assumptions 1 and 2, JUMBO is no-regret.

Specifically, let be the number of rounds such that the JUMBO queries for points . Then, for any , running JUMBO for iterations results in a sequence of candidates for which the following holds with probability at least :


where , , and is the set of output features for .

Figure 2: The effect of the pre-trained NN on . In the desirable case, gets significantly compressed to .

Based on the regret bound in Eq. 4, we can conclude that if the partitioning is chosen such that and , then JUMBO has a tighter bound than GP-UCB. The first condition implies that the second term in Eq. 4 is negligible and intuitively means that will only need a few samples to infer the posterior of defined on , making BO more sample efficient. The second condition implies that the which in turn makes the regret bound of JUMBO tighter than GP-UCB. Note that cannot be made arbitrarily small, since (and therefore ) will get larger which conflicts with the first condition.

Figure 2 provides an illustrative example. If the learned feature space compresses set to a smaller set , then can infer the posterior of with only a few samples in (because MIG is lower). Such will likely emerge when tasks share high-level features with one another. In the appendix, we have included an empirical analysis to show that is indeed operating on a compressed space . Consequently, if is reflective of promising regions consisting of near-optimal points i.e. for some , BO will be able to quickly discard points from subset and acquire most of its points from .

(a) Target (red) and auxiliary (blue) task
(b) Correlation between target and auxiliary tasks
(c) Posterior of , , their UCB and JUMBO’s acquisition function
Figure 3: Dynamics of JUMBO after observing 6 data points. (a) The two functions have different optima. (b) The tasks are related. (c) Iteration 4 of BO with our proposed model, from top to bottom: (1) GP modeling input to objective using () samples; (2) GP modeling input to objective using () samples; (3) UCB acquisition function for ; (4) UCB acquisition function for ; (5) JUMBO's acquisition function, which compromises between the optima of the two.

3.4 Choice of interpolation coefficient

The above discussion suggests that the partitioning should ideally consist of near-optimal points. In practice, we do not know and hence, we rely on our surrogate model to define . Here, is the optimal value of and the acquisition threshold is a hyper-parameter used for defining near-optimal points w.r.t. . At one extreme, corresponds to the case where (i.e. the GP-UCB routine) and the other extreme corresponds to case with .

Figure 3 illustrates a synthetic 1D example of how JUMBO obtains the next query point. Figure 3(a) shows the main objective (red) and the auxiliary task (blue). They share a periodic structure but have different optima. Figure 3(b) shows the correlation between the two.

Applying GP-UCB srinivas2010gaussian will require a considerable number of samples to learn the periodic structure and the optimal solution. However, in JUMBO, as shown in Figure 3(c), the warm-GP, trained on () samples, can learn the periodic structure using only 6 samples, while the posterior of the cold-GP has not yet learned this structure.

It can also be noted from Figure 3(c) that JUMBO's acquisition function follows the cold-GP's acquisition function wherever the warm-GP's acquisition value is close to its maximum. Therefore, the next query point (marked with a star) has a high score based on both acquisition functions. We summarize JUMBO in Algorithm 1.

Input: Offline auxiliary dataset , Offline target dataset (optional; default: empty set), Threshold
Output: Sequence of solution candidates maximizing target function
Initialize NN , , .
Pretrain NN params jointly with and hyper-params using and as per Eq. 1.
Initialize , .
for round to do
    Set . Set . Set .
    Pick .
    Obtain noisy observation for .
    Update and .
    Update and .
end for
Algorithm 1 JUMBO
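The online loop of Algorithm 1 can be sketched end to end on a toy 1D problem. This is a simplified illustration under several assumptions, not the authors' implementation: the pretrained NN is replaced by a fixed hypothetical `feature_map`, GP hyperparameters are held fixed instead of being updated each round, the domain is a discrete grid, and the interpolation coefficient is a hard indicator on the warm-GP UCB score.

```python
import numpy as np

def rbf(a, b, ls):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / ls**2)

def gp_ucb_scores(z_train, y_train, z_all, ls, noise_std, beta):
    """UCB scores of a zero-mean, unit-variance RBF GP at every row of z_all."""
    k = rbf(z_train, z_train, ls) + noise_std**2 * np.eye(len(z_train))
    k_s = rbf(z_all, z_train, ls)
    mean = k_s @ np.linalg.solve(k, y_train)
    var = 1.0 - np.sum(k_s * np.linalg.solve(k, k_s.T).T, axis=1)
    return mean + np.sqrt(beta * np.maximum(var, 0.0))

def feature_map(x):
    """Hypothetical stand-in for the NN pretrained on offline auxiliary data."""
    return np.hstack([np.sin(6 * np.pi * x), np.cos(6 * np.pi * x)])

def target(x):
    """Toy target: periodic structure plus a slowly varying trend."""
    return np.sin(6 * np.pi * x[:, 0]) - (x[:, 0] - 0.6) ** 2

def jumbo_loop(rounds=15, threshold=0.2, noise_std=0.05, beta=4.0, seed=0):
    rng = np.random.default_rng(seed)
    domain = np.linspace(0.0, 1.0, 201)[:, None]   # discrete candidate set
    feats = feature_map(domain)
    queried = [int(rng.integers(len(domain)))]      # one random initial query
    ys = [target(domain[queried])[0] + noise_std * rng.normal()]
    for _ in range(rounds):
        idx, y = np.array(queried), np.array(ys)
        # Warm-GP works on features, cold-GP on raw inputs; same observations.
        acq_warm = gp_ucb_scores(feats[idx], y, feats, 1.0, noise_std, beta)
        acq_cold = gp_ucb_scores(domain[idx], y, domain, 0.1, noise_std, beta)
        # Assumed rule: use the cold score where the warm score is near its max.
        lam = (acq_warm >= acq_warm.max() - threshold).astype(float)
        acq = lam * acq_cold + (1.0 - lam) * acq_warm
        nxt = int(np.argmax(acq))
        queried.append(nxt)
        ys.append(target(domain[[nxt]])[0] + noise_std * rng.normal())
    return domain[queried, 0], np.array(ys)

xs, ys = jumbo_loop()
```

Refitting both GPs from scratch each round keeps the sketch short; an actual implementation would typically update the posteriors incrementally and re-optimize hyperparameters as described in the algorithm.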

4 Experiments

We are interested in investigating the following questions: (1) How does JUMBO perform on benchmark real-world black-box optimization problems relative to baselines? (2) How does the choice of threshold impact the performance of JUMBO? (3) Is it necessary to have a non-linear mapping on the features learned from the offline dataset, or is a BLR layer sufficient?

Our codebase is based on BoTorch balandat2020botorch, a Python library for Bayesian optimization that extends PyTorch paszke2019pytorch. It is provided in the Supplementary Materials, with additional details in the Appendix.

4.1 Application: Hyperparameter optimization

Datasets. We consider the task of optimizing hyperparameters for fully-connected NN architectures on 4 regression benchmarks from HPOBench klein2019tabular: Protein Structure rana2013physicochemical, Parkinsons Telemonitoring tsanas2010enhanced, Naval Propulsion coraddu2016machine, and Slice Localization graf20112d. HPOBench provides a look-up-table-based API for querying the validation error of all possible hyper-parameter configurations for a given regression task. These configurations are specified via 9 hyperparameters, that include continuous, categorical, and integer valued variables.

The objective we wish to minimize is the validation error of a regression task after 100 epochs of training. For this purpose, we consider an offline dataset that consists of validation errors for some randomly chosen configurations after 3 epochs on a given dataset. The target task is to optimize this error after 100 epochs. In klein2019tabular, the authors show that this problem is non-trivial, as there is only a small correlation between epochs 3 and 100 for top-1% configurations across all datasets of interest.

Figure 4: The regret of MBO algorithms on Protein, Parkinsons, Naval, and Slice datasets. Standard errors are measured across 10 random seeds.

Evaluation protocol. We validate the performance of JUMBO against the following baselines with a UCB acquisition function srinivas2010gaussian:

  • GP-UCB srinivas2010gaussian (i.e. cold-GP only) trains a GP from scratch, disregarding the offline data completely. Equivalently, it can be interpreted as JUMBO with in Eq. 2 and .

  • MT-BOHAMIANN springenberg2016bayesian trains a BNN on all tasks jointly via SGHMC (Section 2.3).

  • MT-ABLR perrone2018scalable trains a shared NN followed by task-specific BLR layers (Section 2.3).

  • GCP salinas2020quantile uses Gaussian Copula Processes to jointly model the offline and online data.

  • Offline DKL (i.e. warm-GP only) is our proposed extension to Deep Kernel Learning, where we train a single GP online in the latent space of a NN pretrained on (See Section 5 for details). Equivalently, it can be interpreted as JUMBO with in Eq. 2.

Results. We run JUMBO (with ) and all baselines for 50 rounds and 5 random seeds each and measure the simple regret per iteration. The regret curves are shown in Figure 4. We find that JUMBO achieves lower regret than the previous state-of-the-art algorithms for MBO in almost all cases. We believe the slightly worse performance on the Slice dataset relative to other baselines is due to the extremely low top-1% correlation between epoch 3 and epoch 100 on this dataset as compared to others (see Figure 10 in klein2019tabular), which could result in a suboptimal search space partitioning obtained via the warm-GP. For all other datasets, we find JUMBO to be the best performing method. Notably, on the Protein dataset, JUMBO is always able to find the global optimum, unlike the other approaches.

Figure 5: (a) Circuit Design results (b) The optimal acquisition threshold (=0.05) is far from the GP-UCB and Offline DKL extremes. (c) The Non-linear mapping is a crucial piece of JUMBO’s algorithm.

4.2 Application: Automated Circuit Design

Next, we consider a real-world use case in optimizing circuit design configurations for a suitable performance metric, e.g., power, noise, etc. In practice, designers are interested in performing layout simulations for measuring the performance metric on any design configuration. These simulations are however expensive to run; designers instead often turn to schematic simulations which return inexpensive proxy metrics correlated with the target metric.

In our circuit environment, the circuit configurations are represented by an 8-dimensional vector, with elements taking continuous values between 0 and 1. The offline dataset consists of 1000 pairs of circuit configurations and a scalar goodness score based on the schematic simulations. We consider the same baselines as before: GP-UCB srinivas2010gaussian, MT-BOHAMIANN springenberg2016bayesian, MT-ABLR perrone2018scalable, Offline DKL, and GCP salinas2020quantile. We also consider two other baselines: MF-GP-UCB kandasamy2019multi, which extends the GP-UCB baseline to a multi-fidelity setting, as schematic score values can be interpreted as a low-fidelity approximation of layout scores, and BOX-GP-UCB perrone2019learning, which confines the search space to a hyper-cube over the promising region based on the offline data. We ran each algorithm with for 100 iterations and measured simple regret against iteration. As reflected in the regret curves in Figure 5(a), JUMBO outperforms the other algorithms.

4.3 Effect of acquisition threshold

The threshold is a key design hyperparameter for defining the acquisition function for JUMBO. In Figure 5(b), we illustrate the effect of different choices of the threshold on the performance of the algorithm for the circuit design setup. At one extreme of the threshold, JUMBO reduces to GP-UCB (i.e. cold-GP only), and at the other it reduces to Offline DKL (i.e. warm-GP only). As we can see, the lowest regret is achieved for a threshold of 0.05. We also note that there is a wide range of thresholds for which JUMBO performs well relative to other baselines, suggesting a good degree of robustness and less need for tuning in practice.

4.4 Ablation: BLR with JUMBO’s acquisition function

A key difference between JUMBO and ABLR perrone2018scalable is replacing the BLR layer with a GP. To test whether there is any merit in having a GP, we ran an experiment on the Protein dataset and replaced the GP with a BLR in JUMBO's procedure. Figure 5(c) shows that JUMBO with a GP significantly outperforms JUMBO with a BLR layer.

5 Related Work

Transfer Learning in Bayesian Optimization: Utilizing prior information for applying transfer learning to improve Bayesian optimization has been explored in several prior papers. Early work of swersky2013multi focuses on the design of multi-task kernels for modeling task correlations poloczek2016warm. These models tend to suffer from a lack of scalability; wistuba2018scalable; feurer2018scalable show that this challenge can be partially mitigated by training an ensemble of task-specific GPs that scale linearly with the number of tasks but still suffer from cubic complexity in the number of observations for each task. To address scalability and robust treatment of uncertainty, several approaches have been proposed salinas2020quantile; springenberg2016bayesian; perrone2018scalable. salinas2020quantile employs a Gaussian Copula to learn a joint prior on hyper-parameters based on offline tasks, and then utilizes a GP on the online task to adapt to the target function. springenberg2016bayesian uses BNNs as surrogates for MBO; however, since training BNNs is computationally intensive, perrone2018scalable proposes to use a deterministic NN followed by a BLR layer at the output to achieve scalability.

Some other prior works exploit certain assumptions between the source and target data. For example, shilton2017regret; golovin2017google assume an ordering of the tasks and use this information to train GPs to model residuals between the target and auxiliary tasks. feurer2015initializing; wistuba2015learning assume the existence of a similarity measure between prior and target data, which may not be easy to define for problems other than hyper-parameter optimization. A simpler idea is to use prior data to confine the search space to promising regions perrone2019learning. However, this relies heavily on whether the confined region includes the optimal solution to the target task. Another line of work studies utilizing prior optimization runs to meta-learn acquisition functions volpp2019meta. This idea can be utilized in addition to our method and is not a competing direction.

Multi-fidelity Black-box Optimization (MFBO): In multi-fidelity scenarios we can query for noisy approximations to the target function relatively cheaply. For example, in hyperparameter optimization, we can query for cheap proxies to the performance of a configuration on a smaller subset of the training data petrak2000fast, early stopping li2017hyperband, or by predicting learning curves domhan2015speeding; klein2017fast. We direct the reader to Section 1.4 in hutter2019automated for a comprehensive survey on MFBO. Such methods, similar to MF-GP-UCB kandasamy2019multi (Section 4.2), are typically constrained to scenarios where such low fidelities are explicitly available and make strong continuity assumptions between the low fidelities and the target function.

Deep Kernel Learning (DKL): Commonly used GP kernels (e.g. RBF, Matern) can only capture simple correlations between points a priori. DKL huang2015scalable; calandra2016manifold addresses this issue by learning a latent representation via NN that can be fed to a standard kernel at the output. snoek2015scalable employs linear kernels at the output of a pre-trained NN while huang2015scalable extends it to use non-linear kernels. The warm-GP in JUMBO can be understood as a DKL surrogate model trained using offline data from auxiliary tasks.

6 Conclusion

We proposed JUMBO, a no-regret algorithm that employs a careful hybrid of neural networks and Gaussian Processes and a novel acquisition procedure for scalable and sample-efficient Multi-task Bayesian Optimization. We derived JUMBO's theoretical regret bound and empirically showed that it outperforms other competing approaches on a set of real-world black-box optimization problems.


Appendix A Proofs of Theoretical Results

We will present proofs for Theorem 1 and an additional result in Theorem 5 that extends the no-regret guarantees of JUMBO to continuous domains. Our theoretical derivations will build on prior results from [srinivas2010gaussian] and [kandasamy2019multi].

A.1 Theorem 1

Let and denote the posterior mean and standard deviation of at the end of round after observing . Similarly, we will use and to denote the posterior mean and standard deviation of at the end of round after observing .

Lemma 2.

Pick and set where , (e.g. ). Define and . Then,


Fix and . Based on Assumption 1, conditioned on , . Similarly, Assumption 2 implies that conditioned on , . Let be the event that and the event that . From proof of Lemma 5.1 in [srinivas2010gaussian] we know that given a normal distribution , . Using and , . Similarly, . Using union bound, we have:
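The Gaussian tail bound borrowed from the proof of Lemma 5.1 in [srinivas2010gaussian] states that for a standard normal variable $z$ and any $c > 0$, $\Pr(z > c) \le \frac{1}{2} e^{-c^2/2}$. The short check below compares the exact tail probability with this bound at a few values of $c$; the function names are chosen for this example only.

```python
import math

def gaussian_upper_tail(c):
    """Exact Pr(z > c) for z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def tail_bound(c):
    """Upper bound 0.5 * exp(-c^2 / 2), valid for c > 0."""
    return 0.5 * math.exp(-0.5 * c * c)

checks = [(c, gaussian_upper_tail(c), tail_bound(c)) for c in (0.5, 1.0, 2.0, 3.0)]
bound_holds = all(tail <= bound for _, tail, bound in checks)
```

Applying the bound to both tails with $c = \beta_t^{1/2}$ gives a failure probability of at most $e^{-\beta_t/2}$ per point and round, which is what the union bound then sums over.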

By union bound, we have:

The event in this Lemma is just and the proof is concluded. ∎

Next, we state two lemmas from prior work.

Lemma 3.

If , then is bounded by .


See Lemma 5.2 in [srinivas2010gaussian]. It employs the results of Lemma 2 to prove the statement. ∎

Lemma 4.


Let denote the posterior variance of a GP after observations, and let . Assume that we have queried at points of which points are in . Then .


See Lemma 8 in [kandasamy2019multi]. ∎

Proof for Theorem 1


From Lemma 3, we have:


Summing over instantaneous regrets for rounds, we get:


Eq. 7 follows from the monotonicity of . Eq. 8 follows from the definition of in Lemma 2, and the last inequality in Eq. 9 follows from Lemma 4.

Finally, from the Cauchy-Schwarz inequality, we know that . Combining with Eq. 9, we obtain the result in Theorem 1. ∎

a.2 Extension to Continuous Domains

We will now derive regret bounds for the general case where is a d-dimensional compact and convex set with . This will critically require an additional Lipschitz continuity assumption on .

Theorem 5.

Suppose that kernels and are such that the derivatives of and sample paths are bounded with high probability. Precisely, for some constants ,


Pick , and set . Then, running JUMBO for iterations results in a sequence of candidates for which the following holds with probability at least :

where .

To start the proof, we first show that we have confidence on all the points visited by the algorithm.

Lemma 6.

Pick and set , where , . Define and . Then,

holds with probability of at least .


Fix and . Similar to Lemma 3, . Since , using the union bound for concludes the statement. ∎

For the purpose of analysis, we define a discretization set , so that the results derived earlier can be re-applied to bound the regret in the continuous case. To enable this approach we will use conditions on -Lipschitz continuity to obtain a valid confidence interval on the optimal solution . Similar to [srinivas2010gaussian], let us choose a discretization of size (i.e. uniformly spaced points per dimension in ) such that for all the closest point to in , , has a distance less than some threshold. Formally, .
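The following is a minimal sketch of such a discretization. It assumes, purely for illustration, that the domain is the hypercube [0, r]^d and that the grid includes both endpoints in every dimension; the names `nearest_grid_point`, `tau`, and `r` are hypothetical. With tau >= 2 points per dimension the grid spacing is r / (tau - 1), so the nearest grid point is within r / (2 (tau - 1)) <= r / tau per coordinate, and within r d / tau in L1 distance.

```python
import math

def nearest_grid_point(x, tau, r=1.0):
    """Snap x in [0, r]^d to the closest point of a uniform grid with
    tau >= 2 points per dimension (endpoints included)."""
    step = r / (tau - 1)
    snapped = []
    for xi in x:
        k = math.floor(xi / step + 0.5)   # index of the nearest grid coordinate
        k = min(max(k, 0), tau - 1)       # stay inside the grid
        snapped.append(k * step)
    return snapped

def l1_distance(a, b):
    """L1 distance between two points given as equal-length sequences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

x = [0.13, 0.58, 0.99]
tau = 11
x_snapped = nearest_grid_point(x, tau)
print(x_snapped, l1_distance(x, x_snapped), "bound:", len(x) / tau)
```

The grid is never materialized: it has tau^d points, which is why it serves only as an analysis device and not as part of the algorithm.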

Lemma 7.

Pick and set , where , . Then, for all , the regret is bounded as follows:


with probability of at least .


In light of Lemma 6, the proof follows directly from Lemma 5.8 in [srinivas2010gaussian]. ∎

Proof of Theorem 5


From Eq. 9 in the proof of Theorem 1, we have shown that:

Therefore, using Cauchy-Schwarz:

Since , Theorem 5 follows from Lemma 7. ∎

Appendix B Implementation Details

b.1 Pre-training Details

Figure 6 illustrates the skeleton of the architecture that was used for all experiments. The input configuration is fed to a multi-layer perceptron of layers with hidden units. Then, optionally, a dropout layer is applied to the output and the result is fed to another non-linear layer with outputs. The latent features are then mapped to the output with a linear layer. All activations are .

Figure 6: NN architecture during pre-training. The first block consists of layers of hidden units with activations. It is followed by a dropout layer and then a single-layer perceptron that produces the features. Thereafter, the latent features are mapped linearly to the output.

For HPO experiments, we have used . For circuit experiments we used , and dropout rate of . These hyper-parameters were chosen based on random search by observing the prediction accuracy of the pre-training model on the auxiliary validation dataset which was 20% of the overall dataset.
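The forward pass of this architecture can be sketched in plain Python as follows. The layer sizes, dropout rate, number of outputs, weight initialization, and the tanh activation are all placeholder assumptions, since the actual values are not recoverable from the text above; `PretrainNet` and its methods are hypothetical names.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a dense layer (illustrative init)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def dense(x, layer, activation=None):
    """Affine map followed by an optional elementwise activation."""
    w, b = layer
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [activation(v) for v in out] if activation else out

def dropout(x, rate, rng, training):
    """Inverted dropout: active only while training."""
    if not training or rate == 0.0:
        return x
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]

class PretrainNet:
    def __init__(self, n_in, n_hidden, n_layers, n_features, n_out, rate, seed=0):
        self.rng = random.Random(seed)
        sizes = [n_in] + [n_hidden] * n_layers
        self.mlp = [init_layer(a, b, self.rng) for a, b in zip(sizes, sizes[1:])]
        self.feature_layer = init_layer(n_hidden, n_features, self.rng)
        self.head = init_layer(n_features, n_out, self.rng)
        self.rate = rate

    def features(self, x, training=False):
        """MLP block -> dropout -> non-linear layer producing latent features."""
        for layer in self.mlp:
            x = dense(x, layer, math.tanh)      # tanh is an assumption
        x = dropout(x, self.rate, self.rng, training)
        return dense(x, self.feature_layer, math.tanh)

    def forward(self, x, training=False):
        """Latent features mapped linearly to the output."""
        return dense(self.features(x, training), self.head)

# All sizes below are placeholders, not the values used in the experiments.
net = PretrainNet(n_in=4, n_hidden=16, n_layers=3, n_features=8, n_out=2, rate=0.1)
z = net.features([0.1, -0.3, 0.7, 0.0])
y = net.forward([0.1, -0.3, 0.7, 0.0])
print(len(z), len(y))
```

After pre-training, only the `features` output would be needed by a GP that operates in the latent space; the linear head exists to fit the offline targets.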

b.2 Details of training the Gaussian Process hyper-parameters

For both the warm-GP and the cold-GP, we consider a Matern kernel (i.e. with where ). The length scale and observation noise are optimized in every iteration of BO by taking 100 gradient steps on the likelihood of the observations with the Adam optimizer, using a learning rate of .
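For reference, a Matern kernel can be evaluated as in the sketch below. The smoothness parameter is not recoverable from the text above, so the example assumes the common choice of 5/2; `matern52`, `length_scale`, and `variance` are illustrative names, and the hyper-parameter optimization itself is not shown.

```python
import math

def matern52(x1, x2, length_scale=1.0, variance=1.0):
    """Matern kernel with smoothness 5/2 (an assumption, see the text above):
    k(r) = variance * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * r / length_scale
    and r the Euclidean distance between x1 and x2."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    s = math.sqrt(5.0) * r / length_scale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)

# Correlation decays with distance; a larger length scale slows the decay.
for dist in (0.0, 0.5, 1.0, 2.0):
    print(dist, matern52([0.0], [dist]), matern52([0.0], [dist], length_scale=2.0))
```

In the setting described above, the same kernel form would be applied to raw inputs for the cold-GP and to latent NN features for the warm-GP.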

b.3 Acquisition Function Details

For all experiments, we used the Upper Confidence Bound with the exploration-exploitation hyper-parameter at round set as where is the total iteration budget. This way we favor exploration initially and gradually drift toward exploitation as we approach the end of the budget. To optimize the acquisition function, we use the derivative-free algorithm CMA-ES [hansen2016cma].
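The effect of a decaying exploration weight can be illustrated with the sketch below. The exact schedule is not recoverable from the text above, so `beta_schedule` uses a hypothetical linear decay with made-up endpoints, and `ucb` follows the usual GP-UCB convention of scaling the standard deviation by the square root of the weight. Neither is claimed to match the schedule used in the experiments.

```python
import math

def beta_schedule(t, budget, beta_start=4.0, beta_end=0.25):
    """Hypothetical schedule: linear interpolation from beta_start at t = 0
    to beta_end at t = budget."""
    frac = min(max(t / budget, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)

def ucb(mean, std, beta):
    """Upper Confidence Bound: posterior mean plus a scaled standard deviation."""
    return mean + math.sqrt(beta) * std

# Two illustrative candidates: (posterior mean, posterior std).
candidates = {"well_explored": (1.0, 0.1), "uncertain": (0.5, 0.5)}

def pick(t, budget):
    beta = beta_schedule(t, budget)
    return max(candidates, key=lambda name: ucb(*candidates[name], beta))

print(pick(0, 50), pick(50, 50))
```

Early in the budget the large weight favors the uncertain candidate; at the end the small weight favors the candidate with the higher posterior mean.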

b.4 Dealing with Categorical Variables in HPO

We handle categorical and integer-valued variables in BO similar to [garrido2020dealing]. In particular, we used as the kernel where is a deterministic transformation that maps the continuous optimization variable to a representation space that adheres to a meaningful distance measurement. For example, for categorical parameters, it converts a continuous input to a one-hot encoding corresponding to a choice for that parameter, and for integer-valued variables, it converts the continuous variable to the closest integer value. Similarly, for the pre-training phase, we also train using
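A transformation of this kind can be sketched as follows. The sketch assumes that each categorical parameter occupies one continuous coordinate per choice and is mapped to a one-hot vector at the largest coordinate, and that integer parameters are rounded half-up; the `spec` format and the name `transform` are hypothetical.

```python
import math

def transform(x, spec):
    """Map a continuous vector to the representation passed to the kernel.

    spec is a list of (kind, size) pairs consumed left to right:
      ("real", 1): coordinate is passed through unchanged
      ("int", 1):  coordinate is rounded to the closest integer
      ("cat", k):  k coordinates become a one-hot vector at their argmax
    """
    out, i = [], 0
    for kind, size in spec:
        block = x[i:i + size]
        if kind == "cat":
            winner = max(range(size), key=lambda j: block[j])
            out.extend(1.0 if j == winner else 0.0 for j in range(size))
        elif kind == "int":
            out.append(float(math.floor(block[0] + 0.5)))
        else:
            out.append(block[0])
        i += size
    return out

spec = [("real", 1), ("int", 1), ("cat", 3)]
a = transform([0.37, 2.6, 0.2, 0.9, 0.4], spec)
b = transform([0.37, 3.4, 0.1, 0.5, 0.3], spec)
print(a, b)
```

Both inputs map to the same representation, so a kernel evaluated on the transformed points treats them as identical configurations, which is the behavior the transformation is meant to enforce.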


Appendix C Experimental Evidence

All the experiments were done on a quad-core desktop.

c.1 Space compression through the pretrained NN

In this experiment we studied the latent space of a NN fed with uniformly sampled inputs for circuit design and found that 75% of the data variance is preserved in only 4 dimensions (with ), suggesting that the warm-GP operates in a compressed space.

Figure 7: Explained Variance of latent dimensions

c.2 Tabular Experimental Results

In this section we present a quantitative comparison between JUMBO and the best-performing prior work for each experimental case. Table 1 also includes BOHB as another relevant multi-fidelity baseline.

BOHB combines the successive halving approach introduced in HyperBand [li2017hyperband] with a probabilistic model that captures the density of good configurations in the input space. Unlike the other methods, BOHB employs a fixed budget and utilizes the information beyond epoch 3. It runs multiple hyperparameter configurations in parallel and terminates a subset of them after every few epochs, based on their current validation error, until the budget is exhausted.
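The successive halving component on its own can be sketched as below. This is a sequential simplification that omits BOHB's density model and parallel execution; the `evaluate` function is a made-up stand-in for training a configuration for a given number of epochs, and `eta` and `min_epochs` are illustrative parameters.

```python
def successive_halving(configs, evaluate, min_epochs=1, eta=3):
    """Keep the best 1/eta of the configurations after each rung and give the
    survivors eta times more epochs, until a single configuration remains."""
    survivors, epochs = list(configs), min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs))
        survivors = ranked[:max(1, len(survivors) // eta)]
        epochs *= eta
    return survivors[0]

def evaluate(config, epochs):
    """Hypothetical validation error: improves with more epochs and is lowest
    for a configuration value of 0.3."""
    return (config - 0.3) ** 2 + 1.0 / epochs

configs = [0.05, 0.1, 0.2, 0.3, 0.45, 0.6, 0.7, 0.8, 0.95]
best = successive_halving(configs, evaluate)
print(best)
```

With nine configurations and eta = 3, the rungs keep three and then one configuration, so most of the budget is spent on the few configurations that look promising early.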

| Method | Protein () | Parkinsons () | Naval () | Slice () |
|---|---|---|---|---|
| GP-UCB | 1.98 | 4.93 | 1.4 | 0.77 |
| MT-BOHAMIANN | 6.71 | 2.13 | 2 | 0.84 |
| MT-ABLR | 13.52 | 4.91 | 2.3 | 1.42 |
| OfflineDKL | 1.40 | 2.67 | 4.9 | 10.67 |
| BOHB | 6.38 | 3.16 | 2.1 | 0.23 |
| GCP | 7.50 | 3.15 | 3.3 | 0.46 |
| JUMBO (ours) | 0 | 0.23 | 0.7 | 0.73 |

Table 1: Comparison of simple regret for HPO. Lower is better. On average JUMBO’s simple regret at convergence is 45% better than the state-of-the-art MBO baseline in each experiment.
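For readers unfamiliar with the metric, simple regret can be computed as in the sketch below. It assumes a minimization problem (e.g. validation error) whose optimum is known, as is typically the case for tabular benchmarks; the function name and the example values are illustrative and are not taken from the experiments.

```python
def simple_regret(observed_values, optimum):
    """Simple regret after each query: best value found so far minus the
    known optimum (minimization convention)."""
    best_so_far, curve = float("inf"), []
    for value in observed_values:
        best_so_far = min(best_so_far, value)
        curve.append(best_so_far - optimum)
    return curve

# Illustrative sequence of observed objective values, with optimum 0.2.
curve = simple_regret([0.5, 0.3, 0.4, 0.2], optimum=0.2)
print(curve)
```

The curve is non-increasing by construction, and a value of 0 means the optimizer has found the best configuration available.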