## 1 Introduction

Many problems in machine learning are associated with finding the optimum of functions that are either explicitly unknown or expensive or slow to evaluate. A popular approach to such problems is Bayesian optimisation (BO) which formulates a sequential search based on a probabilistic model. Underpinning the search are two components, a probabilistic model of the objective function referred to as the

*surrogate model*and an

*acquisition function*

describing the strategy of the sequential gathering of samples. A popular approach is to place a Gaussian process (GP) prior over the surrogate model. This is beneficial as it permits the formulation of rich, yet descriptive, distributions over functions such that prior knowledge of the objective can easily be incorporated. The deterministic acquisition function, then, identifies where the objective should be evaluated in the next stage of the sequence given the current beliefs about the objective. The BO loop alternates between fitting the surrogate model to the currently observed data and evaluating the objective function at the location where the acquisition is maximised. The acquisition function can be viewed as a transformation of the surrogate’s estimate of the objective function to a measure of

*utility*in observation with respect to locating the optimum.

Each step of the BO loop is tasked with finding locations in the input domain that should be included into the surrogate model. Importantly, we want these locations to be as informative as possible about the location of the optimum to aid the search problem, as we have a limited evaluation budget. However, due to the greedy sequential structure of the BO loop, we can sometimes add observations which are detrimental to the search problem as a whole. As an example, consider a function with several local minima. A traditional BO loop over such a function is likely to first locate one of the minima during exploration and then switch to a local exploration of this structure. Eventually we will have sufficiently explored the local structure and wish to go back to a global exploration behaviour. Global exploration become challenging, however, if the length scale used to explore (and model the structure of) the identified local minimum is fundamentally different from the length scale between the extrema. Another example, shown in Figure 1, is where the objective function has nonsmooth or complicated local structure that the surrogate struggles to model while there is a global structure on a different scale. In a traditional setting, it becomes challenging to model both these structures as the surrogate needs to accommodate for the observations in the posterior belief about the function, making future search either local or uninformed.

The underlying problem in both scenarios is the discrepancy in the objective of supervised learning and surrogate modelling for sequential decision making. In the former, the goal is to find a model which describes every point in the domain equally well. In contrast, the latter’s predictions are never considered directly. Instead, the acquisition translates them to a belief about the utility of a location for finding the optimum. The surrogate, thus, only needs to describe the path towards and the location of the optimum well. The specification of a well-suited surrogate capturing the informative variations in the objective function is a critical challenge in BO

movckus1975bayesian.One approach is to introduce a noise model to the surrogate that aims to explain away uninformative parts of the function. However, while a homoscedastic noise model makes a too simplistic global noise assumption, a hetroscedastic GP is challenging to learn from small amounts of data. Another approach is to assume a hierarchical model that warps the input space such that it fits the surrogate model snoek2014input; Oh2018BOCKB; calandra2016manifold

. For function inputs that are smoothly warped, such as machine-learning model hyperparameters set on a logarithmic scale, this has proven to be a successful approach

snoek2014input. While transforming the space allows for modelling non-stationary functions, it increases the influence of the possibly only local structure rather than reducing it. This can be detrimental in the presence of uninformative or nonsmooth, local structure.Acknowledging the challenge of model specification, an approach which includes this as a part of the optimization was presented in malkomes2018automating. Using a compositional kernel grammar from duvenaud2013structure to induce a space of GP models to choose from. Although this and other model selection procedures malkomes2016bayesian,duvenaud2013structure,grosse2012exploiting,gardner2017discovering themselves have shown promise, however, in addition to the computational overhead the procedures are still reliant on the existence of suitable models in this space. It remains challenging to handle cases where the objective function contains structure that is both hard to specify a-priori and that is unhelpful in guiding the search to the optimum. In the cases of uninformative, local structure, a search for a prior that models those structures well can again be detrimental for the search.

In this paper we propose an approach to address the discrepancy between the objective of fitting the surrogate model and finding the optimum of the function. We adopt the idea of a Latent Gaussian Process regression model wang12_gauss_proces_regres_with_heter,yousefi_unsupervised_2016,bodin_latent_2017 for the surrogate model. By introducing a latent space orthogonal to the domain of the function, we can modulate the relevance of each data point individually. Using this formulation allows us to control the influence of each observation, effectively reducing influence of locations believed to be detrimental for the search while highlighting important structures.

## 2 Latent Gaussian process surrogate models

In a supervised learning setting, predictive distributions of a model take local and global effects into account. Away from the data, these considerations can lead to uninformed predictions as all possible local behaviours have to remain plausible. In contrast, it is not the goal in Bayesian optimization to model the objective function as faithfully as possible. Instead, predictions should propagate the information needed for the decision where to evaluate next.

In this paper we focus on modelling trends that are informative for the search problem while reducing the influence of uninformative variations. In a non-parametric setting this means that our model must have the ability to reduce the influence of uninformative observations. This requirement is closely related to the noise assumptions in supervised learning problems, where conditionally independent additive likelihoods can explain away local disturbance.

Let be the latent objective function and let be observed data with .
We aggregate the observations as with the matrix containing the stacked observations and containing the respective noiseless observations .
In this work, we consider the *modulated objective function* where we augment with a latent space such that can be recovered via the marginalization,

(1) |

We denote the unobserved variable associated with as , their concatenation as and place a GP prior over the function and an isotropic Gaussian prior over the latent variable .

The choice of the kernel governs how the latent dimensions can be used by the modulated model to explain variations in data. Imposing additive structure , implies that is itself a GP. A (degenerate) bias kernel

acting only on the diagonal of the covariance matrix yields a GP with a homoscedastic Gaussian likelihood. While this is a common modelling choice in supervised learning, a homoscedastic assumption can be problematic for surrogate models as objective functions tend to have different local behaviours in different parts of the input space, leading to uninformed predictive uncertainties. An extension to heteroscedastic likelihoods is also problematic as the low number of observations makes training impractical.

Instead of additive structure, we choose to be compatible with and place a single stationary kernel on the complete space . Common choices for stationary kernels in Bayesian optimization include the squared exponential kernel and the Matérn family of kernels snoek_practical_2012. This choice of kernel allows our model to reduce the relevance of individual observations (and thus explain them) by moving them away from other data in . Figure 1 shows a comparison of posteriors of our proposed method with noiseless and noisy GPs. In special cases, our model behaves similarly to common model choices: (1) Setting all is equivalent to noiseless GP regression over . (2) Constraining the to a constant pairwise distance resembles homoscedastic GP regression. (3) Moving all orthogonally along their own dimension yields heteroscedastic regression.

A high-dimensional latent space

allows the modulated objective function many degrees of freedom to explain away data and conform to the GP prior yielding uninformed models. In supervised learning, this trade-off can be performed by placing priors on the amplitude of the effects which should be explained in the output space via the noise variance. This is often problematic for Bayesian optimization because amplitudes in the output space are unknown. Our stationary joint kernel allows us to formulate this trade-off in the input space, which is often well-understood. A low dimensionality of

together with a small prior variance for the latent variables imposes more structure. Points close together in -space can never be explained independently by moving them too far apart in -space with respect to a kernel length scale .## 3 Approximate inference

When performing inference in the model we want to be robust with respect to the hyper-parameter settings in the surrogate model. To achieve this we perform full marginalisation via Markov Chain Monte Carlo (MCMC) algorithms shahriari_taking_2016. In addition to the model hyper-parameters we also need to address the latent variable

. In our model, the jointly dependent grows with the number of observations which makes MCMC posterior samples increasingly challenging to obtain as more observations are being gathered. Since we know that the choice of kernel parameters is critical, we retain inference using MCMC for the kernel parameters and approximate the latent variables via variational inference. The variational bound is given by(2) |

where we use a fully factorized mean field approximation depending on variational parameters and . This bound is optimized using the Adam optimizer kingma_adam_2014 to obtain a variational belief for the latent variables . While the KL-divergence can be evaluated in closed form, the expected log-likelihood of the data is intractable due to the marginalization of the hyperparameters. We approximate the expectation over via sampling and the inner marginalization with MCMC using slice sampling snoek_practical_2012.

Running independent MCMC for every iteration of the optimization is costly. However, since the belief about does not change much between iterations, we can continue on the same Markov chain between bound iterations and significantly increase the performance. We draw one sample from per MCMC sample and initialize the MCMC algorithm with a burn-in of 200 samples and 100 initial posterior samples to initialize . For every following iteration of the optimizer, we use no burn-in and draw 10 joint samples.

#### Informative predictions

In the modulated objective function , the latent space

is used to explain effects which are not informative for decision making. We therefore evaluate the acquisition function on the GP at the hyperplane

, so that(3) |

We approximate the marginalizations of the kernel parameters and the latent variables via MCMC on with joint sampling from . Given a sample of and , the predictive distribution is a standard GP marginal and can be computed in closed form. Note that marginalizing the prior belief of would yield a posterior belief about the (unmodulated) objective function following Equation 1. This is not desirable as the marginalization would reintroduce effects which have been explained as local disturbance.

## 4 Experiments

We now demonstrate the benefit of our surrogate latent variable extension via experiments on standard BO benchmarks with quantitative comparisons to baseline methods. In addition we propose a heuristic to add uncertainty to the utility estimate.

#### Baselines and metric

We compare using our proposed latent variable extension against not using the extension on a popular GP model setup for BO. In addition, we compare the latent variable extension against other methods of handling challenging structure in the objective function, namely (i) a noiseless GP, (ii) a GP with homoscedastic noise and (iii) a non-stationary, Warped GP snoek2014input.

We provide two quantitative metrics.
Firstly, we follow standard practice to compare across datasets and provide the *mean gap* estimated over 20 runs malkomes2018automating.
The gap measure is defined as

(4) |

where is the minimum function value among the first initial random points, is the best function value found within the evaluation budget and is the function’s true optimum. Methods are judged to have very similar or equivalent performance to the best performing if not significantly different, determined by an one-sided paired Wilcoxon signed rank test at 5% significance malkomes2018automating

. Secondly, as the nonparameteric Wilcoxon signed rank test requires the individual runs, we report mean and standard deviation on the simple regret, the difference between

and .We use the Matérn 5/2 kernel for all surrogates and the model-marginalised expected improvement acquisition function evaluated via slice sampling snoek2012practical. The inference for our latent variable extension also includes variational approximations, as described in Section 3. For the maximization of the expected utility with respect to input location, we use -cover sampling, as in de2012exponential, for all models where the expected utility in the experiments is the model-marginalised expected improvement. For further details, we refer to the supplementary material.

#### Benchmark datasets

We perform the comparisons on function benchmarks from sigopt-web-page,scikit-optimize using the default domains provided by the benchmark detailed in the supplementary material. A number of problems are available in different dimensionalities which are denoted in with the results. We note that in experiments in other work, domains are sometimes restricted, which changes the characteristics and difficulty of the benchmark problem. In addition, problems are marked with the descriptive properties given in [evalset]sigopt-web-page that can reflect the relative difficulty of the task.

#### Priors on the latent input variables

The prior can be parameterized in relation to the relative scale of the characteristics to be ignored. We specify the function prior over the product space using a kernel with common parameters for and . Thus, the standard deviation of the prior relates directly to distances in -direction.

When domain specific knowledge is available, may be specified at an appropriate scale. However, we often do not have access to such knowledge. In all our experiments we adopt a hierarchical prior approach whereby is sampled uniformly from a small candidate set at each evaluation. Specifically, we set and where , the length of the diagonal of the unit Q-hypercube. A choice of corresponds to a noiseless GP without latent covariates. We found that this approach performed well empirically and is applied consistently across all our experiments where not otherwise specified.

### 4.1 Latent Gaussian process surrogate model

We first evaluate the benefit of extending the surrogate model to include a latent input variable that will alter the relevance of the observed data as illustrated in Figure 1. Recall that we seek to provide a more informed belief about the function far from data to produce sensible sampling decisions via the acquisition function.

#### Surrogate function modelling in the presence of nonsmooth structure

Figure 2 shows the behaviour of the latent input model on the Holder Table benchmark function from sigopt-web-page. We show two scenarios, one with and one without an added nonsmooth component in form of saw-tooth and square wave functions. The challenge is to solve the optimization adequately using a GP surrogate despite the nonsmooth, uninformative structure that is difficult to model with a GP. While showing comparatively little variance in the simple regret measure on the default Holder Table benchmark, a noiseless GP’s variance increases substantially for the corrupted version, as the surrogate is influenced heavily by the nonsmooth component.

Adding a homoscedastic noise model (GP+HS) helps during early iterations, however, the BO loop is unable to locate and exploit minima as informative trends are explained as noise instead and the mean regret plateaus. Warped GPs show similar behaviour in terms of performance. In contrast, our surrogate (GP+LM) shows good performance for both the smooth and nonsmooth settings resulting in both an improved mean regret but also a low variance in the resulting regret. While the regret drops slower compared to GP+HS in the beginning due to GP+LM tendency to explore more in the start, given enough observations, a good solution is found reliably.

Benchmark | Evals | Dim | Properties | GP | Warped GP | GP + HS | GP + LM |

Hartmann | 50 | 6 | boring | 0.959 | 0.554 | 0.881 | 0.935 |

Griewank | 50 | 2 | oscillatory | 0.930 | 0.499 | 0.690 | 0.903 |

Shubert | 50 | 2 | oscillatory | 0.504 | 0.160 | 0.307 | 0.483 |

Ackley | 50 | 2 | complicated, oscillatory | 0.930 | 0.250 | 0.890 | 0.958 |

Cross In Tray | 50 | 2 | complicated, oscillatory | 0.908 | 0.411 | 0.901 | 0.953 |

Holder table | 50 | 2 | complicated, oscillatory | 0.937 | 0.860 | 0.907 | 0.994 |

Corrupted Holder Table | 50 | 2 | complicated, oscillatory | 0.779 | 0.788 | 0.820 | 0.898 |

Branin01 | 100 | 2 | none | 1.000 | 1.000 | 1.000 | |

Branin02 | 100 | 2 | none | 0.991 | 0.964 | 0.966 | |

Beale | 100 | 2 | boring | 0.981 | 0.982 | 0.981 | |

Hartmann | 100 | 6 | boring | 0.987 | 0.947 | 0.982 | |

Griewank | 100 | 2 | oscillatory | 0.966 | 0.841 | 0.940 | |

Levy | 100 | 2 | oscillatory | 0.997 | 0.999 | 0.999 | |

Shubert | 100 | 2 | oscillatory | 0.578 | 0.470 | 0.789 | |

Deflected Corrugated Spring | 100 | 10 | oscillatory | 0.351 | 0.840 | 0.721 | |

Weierstrass | 100 | 8 | complicated | 0.586 | 0.693 | 0.677 | |

Cross In Tray | 100 | 2 | complicated, oscillatory | 1.000 | 0.995 | 1.000 | |

Holder Table | 100 | 2 | complicated, oscillatory | 0.974 | 0.967 | 1.000 | |

Corrupted Holder Table | 100 | 2 | complicated, oscillatory | 0.866 | 0.886 | 0.919 | |

Ackley | 100 | 2 | complicated, oscillatory | 0.975 | 0.916 | 0.974 | |

Ackley | 100 | 6 | complicated, oscillatory | 0.465 | 0.788 | 0.828 | |

NN Boston | 100 | 9 | unknown | 0.699 | 0.711 | 0.750 | |

NN Climate Model Crashes | 100 | 9 | unknown | 0.723 | 0.840 | 0.805 |

#### Evaluation on benchmark suite

Table 1 presents results across a wide range of benchmark functions consisting of the SigOpt benchmark suite sigopt-web-page and two additional hyper parameter tuning benchmarks scikit-optimize. The benchmark functions have challenging properties such as complicated and oscillatory local structure and high dimensionality. In general, the GP+HS and Warped GP tend to be outperformed by the noiseless GP surrogate and GP+LM reliably shows comparable or improved performance. On some benchmarks known to be difficult functions, the difference is particularly large, such as Ackley 6D, Deflected Corrugated Spring 10D and Shubert 2D. The Ackley 6D, exhibiting the largest increase in performance by our latent variable extension, is described as

“technically oscillatory, but with such a short wavelength that its behavior probably seems more like noise”

sigopt-web-page.### 4.2 Limiting the overconfidence of surrogate models

Experiments above have shown that the use of GP+LM increases robustness against complicated and uninformative local structure in the objective function. Such structure can cause a surrogate to be overconfident about the function at particular locations and underconfident about the function as whole as it can require extreme parameter choices such as very short length scales to be explained. This can be particularly detrimental for the search due to the acquisition function’s role of balancing efficient exploration of the domain with the exploitation of current knowledge of the underlying function shahriari2015taking. With an overconfidence around the currently predicted minimum this trade-off can easily be tilted too heavily towards exploitation. We address this problem further by investigating the application of a heuristic to reduce the impact of such overconfidence and provide an empirical evaluation of its performance. The heuristic is easy to implement in any surrogate-based search setting and can offer complementary benefit to GP+LM.

#### Limiting the precision of the location selection

We limit the effect of overconfidence by injecting noise into the location selection. First, we find the location with highest utility with high precision, such as via an auxiliary optimizer shahriari2015taking or a sufficiently dense grid. We then perturb this location choice with uniform noise,

(5) |

and use as the next query location. This implies that we are indifferent towards sample locations within a small region around the location proposed by the acquisition function as a result of the surrogate belief. We refer to this heuristic as the -construction as we define as the side length of hypercubes that fit densely in the search space, the size of the indifference regions. This parameterization is independent of the input dimensionality and can easily be interpreted with respect to grid sampling. There are two special cases: yields random search while disables the heuristic.

#### Parameterization

We need to select the , which for a given input dimensionality implies each hypercube’s side to be if the domain itself is a unit hypercube. Similarly to the case of for the latent variable extension, it can be difficult to parameterize without specific domain knowledge about the precision of the function to optimize. For this reason, we sample at every BO iteration, where corresponds to using the expected utility maximum directly.

#### Illustration of overconfident estimates of utility

In Figure 3 we provide two examples of using the heuristic with constant and sampled uniformly from the set denoted as . We compare using the added heuristic to, with the same auxiliary optimizer, directly accepting the maximum utility location as the query location. We include BO on a grid as well as random search for reference. This experiment is performed using standard noiseless GP surrogates.

The two functions, Powell Triple Log 12D and Deflected Corrugated Spring 10D, show different properties. For the former, BO using an auxiliary optimizer significantly outperforms random sampling while for the latter, random sampling performs better than BO and choosing the maximum utility location. This can have its explanation in that the former function is comparably slowly varying while the latter is oscillatory. Grid evaluation performs poorly in these high dimensional problems due to the curse of dimensionality. Notably, by only adding the proposed heuristic

to the GP which uses the auxiliary optimizer, good solutions can be found reliably in both cases.Benchmark | Evals | Dim | Properties | GP | GP | GP + LM | GP + LM |
---|---|---|---|---|---|---|---|

Branin01 | 50 | 2 | none | 0.00 0.00 | 0.00 0.00 | 0.00 0.00 | 0.00 0.00 |

Powell Triple Log | 50 | 12 | none | 3.83 0.80 | 3.96 0.53 | 4.20 0.62 | 4.06 0.71 |

Hartmann | 50 | 6 | boring | 0.13 0.12 | 0.17 0.31 | 0.19 0.38 | 0.18 0.28 |

Griewank | 50 | 2 | oscillatory | 0.08 0.18 | 0.06 0.05 | 0.07 0.06 | 0.06 0.04 |

Shubert | 50 | 2 | oscillatory | 84.7 64.9 | 103 56.4 | 90.9 61.1 | 89.2 60.8 |

Cosine Mixture | 50 | 10 | oscillatory | 0.83 0.38 | 0.65 0.18 | 0.76 0.36 | 0.82 0.15 |

Drop-Wave | 50 | 10 | oscillatory | 0.92 0.03 | 0.86 0.07 | 0.92 0.04 | 0.84 0.08 |

Deflected Corrugated Spring | 50 | 10 | oscillatory | 3.92 1.80 | 1.53 0.68 | 2.99 1.32 | 1.11 0.39 |

Weierstrass | 50 | 8 | complicated | 7.10 1.64 | 6.50 1.56 | 6.87 1.20 | 6.33 1.63 |

Holder Table | 50 | 2 | complicated, oscillatory | 0.73 1.21 | 0.56 1.15 | 0.02 0.03 | 0.19 0.66 |

Ackley | 50 | 6 | complicated, oscillatory | 11.3 6.91 | 11.4 7.44 | 6.60 5.18 | 3.82 0.78 |

NN Boston | 50 | 9 | unknown | 0.03 0.01 | 0.02 0.01 | 0.03 0.01 | 0.02 0.01 |

NN Climate Model Crashes | 50 | 9 | unknown | 0.05 0.05 | 0.04 0.03 | 0.03 0.01 | 0.04 0.01 |

#### Empirical evaluation

We expand the experiment above across a diverse range of functions and show that the heuristic can be combined with GP+LM surrogate models to achieve the benefits of both contributions. Table 2 details these results showing the average regret at the end of 50 observations. We can confirm that the addition of the heuristic to a GP surrogate (GP ) improves over the GP surrogate without the heuristic (GP ) in the majority of cases. Similarly, adding the heuristic to GP+LM improves performance in most cases. The best overall performance is achieved by combining GP+LM with the -construction. The most notable performance increase is achieved for high-dimensional problems with complicated or oscillatory behaviour.

## 5 Conclusion

We have presented an approach to Bayesian optimisation which utilises a Gaussian process surrogate model with an additional latent input variable extension. The latent variable allows us to influence the function evaluations in an informed manner to handle structure in the objective function detrimental for finding the optimum. We show experimentally how our approach is able to reliably solve benchmarks with challenging local structure. Importantly our approach can be included into any Gaussian process based surrogate model used for Bayesian Optimization.

## 6 Acknowledgements

This project was supported by Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership, EPSRC CAMERA project (EP/M023281/1), the Royal Society, and the Hans Werthén Fund at The Royal Swedish Academy of Engineering Sciences. We would like to thank Zhenwen Dai at Amazon for useful discussions.

Comments

There are no comments yet.