How Bayesian should Bayesian Optimisation be?
Bayesian optimisation (BO) uses probabilistic surrogate models - usually Gaussian processes (GPs) - for the optimisation of expensive black-box functions. At each BO iteration, the GP hyperparameters are fit to previously-evaluated data by maximising the marginal likelihood. However, this fails to account for uncertainty in the hyperparameters themselves, leading to overconfident model predictions. This uncertainty can be accounted for by taking the Bayesian approach of marginalising out the model hyperparameters. We investigate whether a fully-Bayesian treatment of the Gaussian process hyperparameters in BO (FBBO) leads to improved optimisation performance. Since an analytic approach is intractable, we compare FBBO using three approximate inference schemes to the maximum likelihood approach, using the Expected Improvement (EI) and Upper Confidence Bound (UCB) acquisition functions paired with ARD and isotropic Matern kernels, across 15 well-known benchmark problems for 4 observational noise settings. FBBO using EI with an ARD kernel leads to the best performance in the noise-free setting, with much less difference between combinations of BO components when the noise is increased. FBBO leads to over-exploration with UCB, but is not detrimental with EI. Therefore, we recommend FBBO using EI with an ARD kernel as the default choice for BO.
Bayesian optimisation (BO) is a popular sequential approach for optimising costly or time-consuming black-box functions that have no derivative information or closed form (Brochu et al., 2010; Snoek et al., 2012). It is a surrogate model-based approach that employs a probabilistic model built with previous evaluations. BO comprises two main steps, which are repeated until budget exhaustion or convergence. Firstly, a probabilistic surrogate model is constructed, which is typically a Gaussian process (GP) because of its strength in function approximation and uncertainty quantification (Rasmussen and Williams, 2006; Shahriari et al., 2016; Frazier, 2018). Secondly, an acquisition function is optimised to select the next location to expensively evaluate. Acquisition functions combine the surrogate model's predictions and associated uncertainty to strike a balance between exploiting areas of design space with good predicted values and exploring locations with high uncertainty in their predicted values.
During the first step of a BO iteration, the surrogate model must be learned from the data. The predominant strategy in BO is to find the model hyperparameters that maximise the marginal likelihood, also known as the model evidence. This point-based estimate is known as the maximum likelihood (ML) estimate, or, if we also take into account some prior belief about the model's hyperparameters, the maximum a posteriori (MAP) estimate. These are normally found via gradient-based optimisation (MacKay, 1999; GPy, 2012; Balandat et al., 2020). However, the marginal likelihood landscape may contain multiple optima of similar quality, as well as flat, ridge-like structures on which gradient-based optimisers may get stuck (Warnes and Ripley, 1987); a common partial remedy is to take the best of multiple optimisations started from randomly chosen initial parameters. Moreover, the ML or MAP estimate does not take into account any uncertainty that exists about the true hyperparameters, leading to naturally overconfident predictions.
Viewing this from a Bayesian perspective tells us that we need to marginalise out the hyperparameters of the model; that is, every possible hyperparameter choice should be weighted by how well its corresponding model explains the data. Then, we can use the predictions of these weighted models to take into account the uncertainty in the hyperparameters. In all but the simplest of cases, this requires calculation of an intractable integral, so in practice approximations to the integral are made using methods such as variational inference (Jordan et al., 1999)
and Markov Chain Monte Carlo (MCMC) (Hastings, 1970; Metropolis et al., 1953; Duane et al., 1987). Several works have performed a fully-Bayesian treatment of the hyperparameters in BO, and some advocate for it to become the prevailing strategy (Osborne, 2010; Snoek et al., 2012). Yet most works that apply a fully-Bayesian approach (Benassi et al., 2011; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017) adopt it simply because it is the principled choice, without further justification. Therefore, in this work, we investigate whether a fully-Bayesian treatment of the surrogate model's hyperparameters leads to improved performance in BO. Specifically, we investigate the performance of BO using two acquisition functions (Expected Improvement (EI) and Upper Confidence Bound (UCB)), for two different Gaussian process kernel types (isotropic and ARD), and under four different levels of observation noise. These comparisons are made for the traditional MAP approach and three approximate inference schemes for a fully-Bayesian treatment.
Our main contributions can be summarised as follows:
We provide the first empirical study of the effect of a fully-Bayesian treatment of the hyperparameters on BO performance.
We evaluate different combinations of acquisition function, GP kernel type, and inference method on fifteen well-known test functions over a range of dimensions (2 to 10) and for a range of noise levels.
We show empirically that using the EI with an ARD kernel and fully-Bayesian inference using MCMC leads to superior BO performance in the noise-free setting.
We show that a fully-Bayesian treatment of the hyperparameters always leads to (even more) over-exploration with UCB. However, for EI a fully-Bayesian treatment only increases exploration on higher-dimensional functions with increased observational noise levels.
We begin in Section 2 by reviewing BO and how to perform fully-Bayesian BO. In Section 3 we review GPs, paying particular attention to the hyperparameter learning, and follow this up in Section 4 by reviewing the approximate inference schemes used in this work. An extensive experimental evaluation is carried out in Section 5, along with a discussion of the results. We finish with concluding remarks in Section 6.
Bayesian optimisation (BO), also known as Efficient Global Optimisation, is a surrogate-assisted global search strategy that sequentially samples the problem domain at locations likely to contain the global optimum. It takes into account both the predictions of a probabilistic surrogate model, typically a Gaussian process (GP), and its corresponding uncertainty (Jones et al., 1998). It was first proposed by Kushner (1964), and improved and popularised by both Močkus et al. (1978) and Jones et al. (1998). See (Brochu et al., 2010; Shahriari et al., 2016; Frazier, 2018) for comprehensive reviews of BO. Without loss of generality, we can define the problem of finding a global minimum of an unknown, potentially noise-corrupted objective function as
defined on a compact domain. In BO it is assumed that the objective function is black-box: it has no (known) closed form and no derivative information is available. However, we are able to access the results of its evaluations at any location in the domain. BO is particularly effective in cases where the evaluation budget is limited due to function evaluations being expensive in terms of time and/or money. In this case we wish to optimise in either as few function evaluations as possible, or as well as possible for a given budget.
Algorithm 1 inputs: the number of initial samples, and the budget on the number of expensive evaluations.
Algorithm 1 outlines the standard Bayesian optimisation procedure. It starts (line 2) by generating initial sample locations with a space-filling algorithm, such as Latin hypercube sampling (McKay et al., 1979). These are expensively evaluated with the objective function. Then, at each BO iteration, a GP model is usually trained (Snoek et al., 2012) by maximising the log marginal likelihood with respect to the model hyperparameters
, which are usually parameters of the GP kernel, such as a length scale, together with the noise variance assumed to be corrupting the observed values; see Section 3.3
. The marginal likelihood may also be multiplied by a prior probability of the parameters, expressing a priori beliefs about them. Maximisation of the resulting product obtains the maximum a posteriori (MAP) estimate of the model hyperparameters (line 5). The choice of where to evaluate next in BO is determined by an acquisition function, also known as an infill criterion, which balances the exploitation of good regions of the design space found thus far with the exploration of regions where the predictive uncertainty is high. Acquisition functions are discussed further in the next section. The location to be expensively evaluated next is selected by maximising the acquisition function (line 6); it is then expensively evaluated, the training data is augmented, and the process is repeated until budget exhaustion.
Acquisition functions are measures of quality that enable us to decide which location is the most promising, and thus where we should expend our next expensive evaluation. They are based on the predictive distribution of the surrogate model, in which the dependence on the observed expensive evaluations is summarised. Acquisition functions usually depend on both the posterior mean prediction and its associated uncertainty captured by the predictive variance. Two of the most popular acquisition functions are the Expected Improvement (EI) (Jones et al., 1998) and Upper Confidence Bound (UCB) (Srinivas et al., 2010). EI measures the expected positive improvement over the best function value observed so far:
which can be expressed analytically as (Jones et al., 1998):
where the argument is the predicted improvement normalised by its corresponding uncertainty, and φ and Φ are the Gaussian probability density and cumulative distribution functions respectively. EI is known to often be too exploitative, resulting in optimisation runs that can converge prematurely to a local minimum (Berk et al., 2019). Various works have tried to curtail this behaviour by increasing the amount of exploration. Berk et al. (2019), for example, do this by averaging over realisations drawn from the surrogate model's posterior distribution instead of using only the mean prediction. Chen et al. (2021), like other authors (Sóbester et al., 2005; Feng et al., 2015), equate the two terms in EI with exploitation and exploration and up-weight the latter accordingly. However, as shown by De Ath et al. (2021), only certain weight combinations allow this version of EI to be monotonic in both the predicted improvement and the uncertainty; otherwise it can prefer inferior solutions with a worse predicted value.
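The analytic form of EI can be sketched in a few lines of numpy/scipy. This is a sketch for minimisation; the function name and the explicit guard for zero predictive uncertainty are our own additions:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for minimisation: E[max(f_best - f(x), 0)] with f(x) ~ N(mu, sigma^2)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        s = (f_best - mu) / sigma  # predicted improvement normalised by uncertainty
        ei = (f_best - mu) * norm.cdf(s) + sigma * norm.pdf(s)
    # with zero uncertainty, EI collapses to the (clipped) predicted improvement
    return np.where(sigma > 0.0, ei, np.maximum(f_best - mu, 0.0))
```

Note that the first term rewards a good predicted value (exploitation) while the second rewards high uncertainty (exploration), which is the decomposition exploited by the re-weighting schemes discussed above.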
UCB is an optimistic strategy that is the weighted sum of the surrogate model’s posterior mean prediction and its associated uncertainty:
where the weight usually depends on the number of function evaluations performed thus far and explicitly controls the exploration-exploitation trade-off. Setting it is non-trivial: too small a value and UCB will get stuck in local optima; too large a value and UCB will become too exploratory, taking too many evaluations to converge. The convergence proofs for UCB of Srinivas et al. (2010) rely on a particular schedule in which the weight increases proportionally to the logarithm of the number of evaluations, although this scheme has been shown to be over-exploratory for many practical problems (De Ath et al., 2021).
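A minimal sketch of UCB for minimisation (where the lower confidence bound is minimised), paired with a generic logarithmically-growing weight schedule in the spirit of, but not with the exact constants of, Srinivas et al. (2010):

```python
import numpy as np

def beta_schedule(t, delta=0.1):
    # A generic logarithmically-growing schedule; the constants here are
    # illustrative only, not those of Srinivas et al. (2010).
    return 2.0 * np.log(t**2 * np.pi**2 / (6.0 * delta))

def lcb(mu, sigma, beta):
    # For minimisation, the next point minimises the lower confidence bound.
    return mu - np.sqrt(beta) * sigma
```

Because beta grows with t, later iterations place more weight on the uncertainty term, which is the source of the over-exploration discussed above.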
Other acquisition functions have also been proposed, such as those of De Ath et al. (2021), Probability of Improvement (Kushner, 1964), Knowledge Gradient (Scott et al., 2011), and various information-theoretic approaches (Hennig and Schuler, 2012; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017; Ru et al., 2018). However, in this work we focus on EI and UCB due to their popularity.
Interestingly, even though fully-Bayesian approaches to Gaussian process (GP) modelling have been proposed in the literature for several decades (O'Hagan, 1978; Handcock and Stein, 1993), the vast majority of BO works follow Algorithm 1: they use a MAP estimate of the GP hyperparameters at each iteration. However, there are some exceptions. Osborne (2010) developed a Bayesian approach for global optimisation along with a novel acquisition function, and showed that their fully-Bayesian approach outperformed the standard MAP approach. The hugely influential tutorial of Snoek et al. (2012) advocates a fully-Bayesian treatment of the GP hyperparameters and shows that it is sometimes superior to MAP estimation. Other works (Benassi et al., 2011; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017) have also taken a fully-Bayesian approach, but used it to illustrate the effectiveness of their proposed acquisition functions rather than specifically recommending it.
In order to carry out a fully-Bayesian treatment of the hyperparameters in BO, we need to marginalise out the hyperparameters of the surrogate model; that is, the acquisition function is averaged, weighted by the posterior probability of the hyperparameters (Snoek et al., 2012):
where is the surrogate model’s posterior hyperparameter distribution. The integral appearing in (5) is generally intractable, but it can be approximated via Monte Carlo integration:
where the samples are drawn from the hyperparameter posterior. These samples can be drawn via an approximate inference method such as Hamiltonian Monte Carlo or variational inference, both of which are discussed further in Section 4. We note that the MAP estimate of the hyperparameters can be regarded as approximating the posterior by a delta function that places all the probability mass at the MAP estimate. In using the estimated integrated acquisition function (6), the uncertainty in the surrogate model hyperparameters is explicitly taken into account and may therefore be expected to lead to the acquisition of better locations.
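The Monte Carlo approximation in (6) amounts to a simple average over hyperparameter samples. A sketch, where `acq` stands for any acquisition function evaluated under a fixed hyperparameter vector (the names here are our own):

```python
import numpy as np

def integrated_acquisition(x, theta_samples, acq):
    """Monte Carlo estimate of the integrated acquisition function:
    average acq(x; theta) over posterior hyperparameter samples."""
    return np.mean([acq(x, theta) for theta in theta_samples], axis=0)
```

The MAP strategy is recovered as the special case where `theta_samples` contains a single sample, the MAP estimate.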
A Gaussian process (GP) defines a prior distribution over functions, such that any finite number of function values are jointly distributed as a multivariate Gaussian (Rasmussen and Williams, 2006). In GP regression we aim to learn a mapping from a collection of inputs to their corresponding outputs, where the outputs are often noisy realisations of the underlying function we wish to model. Assuming that the observations are corrupted with additive Gaussian noise, the observation model is a Gaussian centred on the latent function value. In GP regression we place a multivariate Gaussian prior over the latent variables:
with a covariance matrix whose elements are given by the covariance function evaluated at pairs of training inputs, with associated hyperparameters. The covariance function is positive semidefinite and models the covariance between any pair of locations. For notational simplicity, and without loss of generality, we take the mean function of the GP to be zero; De Ath et al. (2020) discuss BO performance with different choices of mean function.
The covariance function, also known as a kernel (function), encodes prior beliefs about the characteristics of the modelled function, such as its smoothness. Kernels are typically stationary, meaning that they are a function of the distance between the two inputs. One of the most frequently used kernels is the squared exponential (SE) kernel:
with a length-scale and a signal variance defining its characteristic input scale and output-scale respectively. Using a SE kernel for each input dimension with a separate length-scale results in the SE automatic relevance determination (ARD) kernel:
It has been argued (Stein, 1999; Snoek et al., 2012) that the SE kernel makes too strong smoothness assumptions for the realistic modelling of physical processes. Popular alternatives include the Matérn family of covariance functions (Stein, 1999). Here we use the Matérn kernel with ν = 5/2, as recommended by Snoek et al. (2012):
where is a modified Bessel function and is the squared distance between and scaled by the length-scales which again allows ARD suppression of irrelevant dimensions. Further information on GP kernels can be found in (Rasmussen and Williams, 2006; Duvenaud, 2014).
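For ν = 5/2 the Bessel-function form simplifies to a closed expression, which can be sketched in numpy as follows (function and argument names are our own; the length-scales are per-dimension, recovering the isotropic kernel when they are all tied to a single value):

```python
import numpy as np

def matern52_ard(X1, X2, lengthscales, outputscale):
    """Matern-5/2 kernel with ARD length-scales:
    k(r) = s^2 (1 + sqrt(5) r + 5 r^2 / 3) exp(-sqrt(5) r),
    where r is the Euclidean distance after per-dimension scaling."""
    A = np.asarray(X1, float) / lengthscales
    B = np.asarray(X2, float) / lengthscales
    # squared scaled distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    r = np.sqrt(np.maximum(d2, 0.0))
    return outputscale * (1.0 + np.sqrt(5.0) * r + 5.0 * d2 / 3.0) * np.exp(-np.sqrt(5.0) * r)
```

A large length-scale in a given dimension flattens the kernel's response to that dimension, which is the mechanism by which ARD suppresses irrelevant inputs.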
Given some noisy observations at locations and a covariance function , predictions about the underlying function
can be made using the GP model. Assuming a Gaussian noise model, the joint distribution of the observed training values and the function value at a test location is
where the elements of the covariance vector are the covariances between the test location and each training input. Hereafter, the observation noise is incorporated into the hyperparameter vector so that it represents both the kernel hyperparameters and the likelihood noise. We also drop the explicit dependence of the kernel on its hyperparameters for ease of exposition.
Conditioning the joint distribution (11) on the observations yields the predictive distribution of
as a Gaussian distribution:
with mean and variance
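These predictive equations correspond to the standard Cholesky-based computation (Rasmussen and Williams, 2006, Algorithm 2.1). A sketch for the zero-mean case, returning the latent-function variance (names are our own; `kernel` is any covariance function such as the Matérn-5/2 above):

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, noise):
    """Zero-mean GP regression: posterior mean and latent variance at Xs."""
    K = kernel(X, X) + noise * np.eye(len(X))   # noisy training covariance
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    Ks = kernel(X, Xs)
    mu = Ks.T @ alpha                            # posterior mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(kernel(Xs, Xs)) - np.sum(v**2, axis=0)  # posterior variance
    return mu, var
```

At a training input with near-zero noise, the posterior mean interpolates the observation and the variance collapses towards zero, which is the behaviour acquisition functions rely on.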
However, before the GP can be used to make predictions the kernel hyperparameters and the observational noise must be inferred.
One of the most useful properties of GPs is that we are able to calculate the marginal likelihood , otherwise known as the model evidence (MacKay, 1992a), of data for a particular model defined by a set of hyperparameters. The marginal likelihood may be found by direct integration of the product of the likelihood function of the latent variables and the GP prior. For numerical reasons, the log marginal likelihood is normally used instead:
The first term in (15) is the data-fit term, measuring how well the model predicts the observed targets, while the second corresponds to a complexity penalty that depends only on the covariance function and the inputs, and the third is a normalisation constant.
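The three terms can be read off directly from a Cholesky-based implementation; a sketch (the Cholesky factor gives the log determinant via the sum of the log diagonal):

```python
import numpy as np

def log_marginal_likelihood(K, y, noise):
    """log p(y) = -0.5 y' Kn^{-1} y - 0.5 log|Kn| - (n/2) log(2 pi),
    with Kn = K + noise * I: data fit + complexity penalty + constant."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))     # -0.5 log|Kn|
    const = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + const
```

ML estimation maximises this quantity over the hyperparameters that enter through K and the noise term; adding the log prior to it gives the MAP objective described below.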
The predominant strategy in BO for inferring the hyperparameters is to find the value of that maximises the log marginal likelihood (15). In doing so, we find the maximum likelihood (ML) parameters . Predictions can then be made by using in (13) and (14).
However, we often have some prior beliefs about the hyperparameters. For example, if our observations are standardised to have zero mean and unit variance, then it would be extremely unlikely for the likelihood noise variance to be larger than one, because then the model would explain the majority of the observations as noise. These beliefs may be encoded in a prior distribution, and can be taken into account in the likelihood evaluation by multiplying the marginal likelihood by the prior distribution:
In practice, the logarithm of the prior is added to (15) to arrive at the log posterior distribution, which is then maximised to find the maximum a posteriori (MAP) parameters. Note here that if we a priori believe that all hyperparameter configurations are equally likely, then the MAP estimate coincides with the ML estimate.
The optimal hyperparameters, whether ML or MAP, are normally found by gradient-based optimisation (MacKay, 1999; GPy, 2012; Balandat et al., 2020) with multiple restarts, using L-BFGS-B (Byrd et al., 1995). These restarts are needed because the landscape may contain multiple local maxima (Rasmussen and Williams, 2006). However, the gradient-based optimisation can be sensitive to the starting locations because hyperparameters that are weakly identified can give rise to flat, ridge-like structures (Warnes and Ripley, 1987).
Figure 1 shows nine gradient-based optimisation runs on the log marginal likelihood (15) as a function of the length-scale and output-scale. Training data was generated by drawing a realisation from a GP with a Matérn-5/2 kernel with fixed hyperparameters. The figure shows the paths taken and the final positions (red circles) of gradient-based optimisations starting at the locations shown as white circles. Note that the remaining hyperparameter was held fixed. Only two of the optimisation runs ended up at a location close to the maximum, with the other runs failing to get close. The three runs that started on the left-hand side of the figure illustrate the flat, ridge-like structures that can occur: in all three cases the optimiser got stuck in a suboptimal region of almost zero gradient.
As illustrated, point-based estimates of the hyperparameters are subject to high variability, because they may become stuck in local optima or on vast plateaux, and they do not take into account the uncertainty in the hyperparameters themselves. Multiple restarts are often sufficient to avoid local optima and plateaux, but accounting for hyperparameter uncertainty requires a fully-Bayesian formulation. Given a prior expressing beliefs about the hyperparameters, the posterior distribution is given by:
As noted above, with the posterior distribution on hand, the acquisition function weighted by the posterior probability of the hyperparameters can be optimised to find the next location to be expensively evaluated (cf. (5
)). However, it is also often of interest to find the posterior predictive distribution of the function at a test location:
The left-hand term of the integrand is a Gaussian (12) and the right-hand term is the hyperparameter posterior. As illustrated in Figure 2, performing a fully-Bayesian treatment of the hyperparameters in GP regression leads to increased uncertainty in the model predictions. This is because we also account for the uncertainty in the hyperparameters, rather than assuming that the MAP estimate is correct. Similarly, a fully-Bayesian treatment of Bayesian optimisation accounts for the uncertainty in the hyperparameters when optimising the acquisition function.
However, as is frequently the case with Bayesian inference, the integral necessary to evaluate this predictive distribution is intractable, and we therefore turn to approximate methods.
Practical fully-Bayesian optimisation requires averaging with respect to the GP hyperparameter distribution. In particular, full account of the model uncertainty is taken by using the averaged acquisition function (5). Since this cannot be done analytically, the averages can be approximated either by simulating draws directly from the posterior, or by approximating the posterior and drawing samples from the approximation; see (6). The former is primarily carried out by Markov Chain Monte Carlo (MCMC) (Metropolis et al., 1953; Hastings, 1970) methods, and the latter by density approximation methods such as Laplace's method, Expectation Propagation (Minka, 2001), and variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008). Compared to MCMC, density approximation methods tend to be much faster and easier to scale to larger amounts of data (Blei et al., 2017), but lack the guarantee of producing asymptotically exact samples from the target density (Robert and Casella, 2005). Here, we consider Hamiltonian Monte Carlo and variational inference as representative methods of the two main approximate inference styles.
Monte Carlo estimation in the context of GP modelling has been carried out by a number of authors (Rasmussen, 1996; Neal, 1997; Murray and Adams, 2010; Filippone et al., 2013; Lalchand and Rasmussen, 2020). We particularly note Lalchand and Rasmussen (2020), who compared the quality of GP models using Hamiltonian Monte Carlo and variational inference with ML estimates.
Hamiltonian or Hybrid Monte Carlo (HMC) (Duane et al., 1987) is an MCMC method which constructs a Markov chain exploring the target probability density, ensuring that locations are visited in proportion to their probability. HMC is a promising method for approximate inference in cases where gradient information is available (Rasmussen, 1996). It is preferred to the more traditional Metropolis-Hastings (MH) (Hastings, 1970) algorithm because it is able to rapidly explore regions of high probability by introducing auxiliary momentum variables that are associated with the parameters of the target density. New proposals for where to move next are generated by simulating Hamiltonian dynamics for a predefined number of steps, with the dynamics themselves simulated using a leap-frog symplectic integrator. At each iteration of the algorithm, the gradients of the log target density are required for each of the leap-frog steps. In the context of approximate inference for GP regression, this results in the need to invert the kernel matrix at every step, hence making HMC a costly procedure to carry out. See (Neal, 2011, 1997) for tutorials on HMC and its application to GPs.
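The leap-frog integrator at the heart of HMC is itself only a few lines. A sketch with a unit mass matrix, where `grad_logp` is the gradient of the log target density (names are our own):

```python
import numpy as np

def leapfrog(q, p, grad_logp, step, n_steps):
    """Leap-frog integration of Hamiltonian dynamics with unit mass matrix:
    half momentum kick, alternating full drifts/kicks, final half kick."""
    q, p = np.array(q, float), np.array(p, float)
    p = p + 0.5 * step * grad_logp(q)
    for _ in range(n_steps - 1):
        q = q + step * p
        p = p + step * grad_logp(q)
    q = q + step * p
    p = p + 0.5 * step * grad_logp(q)
    return q, p
```

Because the integrator is symplectic, the Hamiltonian (negative log target plus kinetic energy) is approximately conserved along the trajectory, which is what gives HMC its high acceptance rates; each call to `grad_logp` is where the kernel-matrix inversion cost arises in the GP setting.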
Here, we use a self-tuning variant of HMC known as the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014), in which both the path length and integration step size are automatically tuned. This avoids the additional overhead in manually tuning both parameters each time inference is performed with different data and is thus of particular benefit in BO where the quantity of data grows at each iteration.
Variational Inference (VI) seeks an approximate density that is as similar as possible to a given target probability density (Blei et al., 2017). More formally, we first assume a parametrisable family of approximate densities intended to capture features of the target density. Then, we try to find the member of that family that minimises the Kullback-Leibler (KL) divergence to the target density, for the given data:
Finally, we can approximate the posterior with the fitted density and draw arbitrarily many samples from it to perform approximate inference.
There are many choices for the family of approximations, although one of the key ideas behind VI is to choose a family that is expressive enough to model the target density well, but simple enough to allow for efficient optimisation (Blei et al., 2017). A few examples include mean-field (MFVI) and full-rank (FRVI) Gaussian approximations, as well as the more recent, expressive, and computationally expensive normalising flows (Rezende and Mohamed, 2015).
In this work, we focus on MFVI and FRVI because these have been empirically shown to be suitable for approximating the GP hyperparameter posterior distribution (Lalchand and Rasmussen, 2020), while still being relatively cheap computationally. The mean-field approximation defines the approximating density as a product of independent densities; here we choose a product of normal densities, one for each hyperparameter:
where is a vector of unconstrained variational parameters, and is the number of parameters in the target distribution. Of course, because this is a diagonal approximation to the true posterior, the mean-field approximation will not capture correlation between parameters. In contrast to this, the full-rank Gaussian approximation allows for cross-covariance terms to be modelled explicitly. To ensure that the learned covariance is always positive semidefinite, the covariance matrix is written in terms of its Cholesky factorisation, . This results in an approximating distribution defined as
where is the vector of unconstrained variational parameters.
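Sampling from the two approximating families makes the difference concrete. A sketch in numpy (the parameterisations are our own, with the mean-field scales stored on the log scale for unconstrained optimisation):

```python
import numpy as np

def sample_mean_field(mu, log_sigma, n, seed=None):
    """Mean-field Gaussian: independent normals, one per hyperparameter."""
    rng = np.random.default_rng(seed)
    return mu + np.exp(log_sigma) * rng.standard_normal((n, len(mu)))

def sample_full_rank(mu, L, n, seed=None):
    """Full-rank Gaussian: z = mu + L eps with L the Cholesky factor,
    so cross-covariances between hyperparameters are captured."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n, len(mu)))
    return mu + eps @ L.T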
VI seeks to minimise the KL divergence between the target density and the approximating distribution. It can be shown (Blei et al., 2017) that the KL divergence is minimised by maximising the variational free energy, or evidence lower bound (ELBO),
Both terms in the ELBO are readily computed, although the model-specific computations required to find its gradient with respect to the variational parameters for gradient-based optimisation may be cumbersome. More recently, however, the ubiquity of automatic differentiation has led to the development of scalable VI algorithms, namely automatic differentiation VI (ADVI) (Kucukelbir et al., 2017). ADVI first transforms the inference problem into one over unconstrained real values by, for example, taking the logarithm of parameters that must be kept strictly positive. It then recasts the gradient of the ELBO as an expectation, allowing the use of Monte Carlo methods to approximate the gradient where needed. Lastly, it reparameterises the remaining gradients in terms of a standard Gaussian, again allowing the expectation of a gradient to be calculated (easier) instead of the gradient of an expectation (hard to impossible); this is known as the reparameterisation trick (Kingma and Welling, 2014).
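A toy sketch of the reparameterisation trick for a single Gaussian parameter: writing z = mu + sigma * eps moves the randomness into eps, so the gradient of an expectation with respect to mu becomes an expectation of a gradient, which is straightforward to estimate by Monte Carlo (names are our own):

```python
import numpy as np

def reparam_grad_mu(mu, sigma, df, n=100000, seed=0):
    """Estimate d/dmu E[f(z)] for z = mu + sigma * eps, eps ~ N(0, 1),
    via the reparameterisation trick: the gradient equals E[f'(z)]."""
    eps = np.random.default_rng(seed).standard_normal(n)
    return np.mean(df(mu + sigma * eps))
```

For f(z) = z^2, the true expectation is mu^2 + sigma^2, whose gradient with respect to mu is 2 mu, and the Monte Carlo estimate converges to this value.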
Table 1 (excerpt): GoldsteinPrice, d = 2; StyblinskiTang, d = 5, 7, 10.
We now compare the performance of a fully-Bayesian treatment of the hyperparameters in BO (denoted FBBO) with the standard approach of using the MAP solution (denoted MAP). The EI and UCB acquisition functions (Section 2.1) are compared for both MAP and FBBO using direct MCMC, MFVI and FRVI across the 15 well-known benchmark functions listed in Table 1 (function formulae can be found at http://www.sfu.ca/~ssurjano/optimization.html). We investigate several scenarios to compare FBBO with MAP. Specifically, we consider the noise-free case, where the function of interest is assumed not to be significantly corrupted by noise; for this the noise level is fixed at zero. We also look at the case where the function is corrupted by additive Gaussian noise for three different levels of noise. Note that, because the functions in Table 1
are noise-free, we add stochastically generated noise to their evaluations to simulate a noisy setting. We modify the functions to have Gaussian additive noise with a standard deviation that is proportional to the range of possible function values. Concretely, we estimate the range by evaluating Latin hypercube samples (LHS) across the function domain, finding the maximum of these samples, and subtracting the known minimum of the function. Therefore, for a given noise level, each function evaluation is a draw from a Gaussian distribution centred on the true function value. We investigate three noise levels. We also compare FBBO and MAP using ARD and isotropic kernels, i.e. using one length-scale for each dimension or one length-scale for all dimensions of the problem. One might expect, given the reduction in the number of hyperparameters in the isotropic case, that the hyperparameter posterior distribution would be less complex and therefore less likely to benefit from a fully-Bayesian treatment. Overall, this results in 8 sets of experiments across the test functions: the noise-free setting and three different amounts of noise, repeated for the ARD and isotropic kernels.
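A sketch of this noise-level construction (using uniform random samples in place of LHS for brevity; the function and argument names are our own):

```python
import numpy as np

def estimated_noise_sigma(f, lb, ub, f_min, noise_pct, n_samples=1000, seed=None):
    """Estimate the additive-noise standard deviation as a fraction of the
    function's estimated range over its domain [lb, ub]."""
    rng = np.random.default_rng(seed)
    X = lb + (ub - lb) * rng.random((n_samples, len(lb)))  # uniform, not LHS
    f_max = max(f(x) for x in X)
    return noise_pct * (f_max - f_min)  # sigma proportional to estimated range
```

Each noisy evaluation is then the true function value plus a draw from a zero-mean Gaussian with this standard deviation.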
A zero-mean GP with a Matérn-5/2 kernel (10) was used for all experiments. At each BO iteration, input variables were rescaled to reside in the unit hypercube, and observations were rescaled to have zero mean and unit variance prior to GP inference. Relatively uninformative priors were used, based on BoTorch recommendations (Balandat et al., 2020), for the three types of hyperparameters, where
each prior is a Gamma distribution with appropriate concentration and rate parameters. Models were initially trained on observations generated by maximin LHS, and the optimisation then proceeded for a further budget of function evaluations. The trade-off between exploitation and exploration in UCB was set using Theorem 1 in (Srinivas et al., 2010), which implies that the trade-off parameter increases logarithmically with the number of iterations. Each optimisation run was repeated
times from a different set of LHS samples, and the same sets of initial locations were used for all methods to enable statistical comparison. An odd number of repeats was chosen to allow the median fitness value to be calculated without rounding. For MAP estimation, the GP hyperparameters were estimated by maximising the log posterior using a multi-start strategy with L-BFGS-B (Byrd et al., 1995) and multiple restarts. In all the approximate fully-Bayesian methods the acquisition functions were averaged over posterior samples (6). HMC/MCMC sampling inference (Section 4.1) was carried out using PyMC3 (Salvatier et al., 2016), with multiple chains, discarding the initial samples as burn-in and thinning the remainder. We note that this is significantly more thorough than previous works: Hernández-Lobato et al. (2014) and Wang and Jegelka (2017) only performed inference at intervals of several BO iterations, using MCMC with a shorter burn-in and lighter thinning; in the non-MCMC BO iterations, the authors reused the latest set of samples. When carrying out variational inference (Section 4.2), optimisation of the ELBO (22) was undertaken for a maximum number of steps or until convergence. Finally, GP models were built using GPyTorch (Gardner et al., 2018) and the resulting acquisition functions were optimised using BoTorch (Balandat et al., 2020). Full experimental results are available in the supplementary material, and all code, initial locations and optimisation runs are available online (https://github.com/georgedeath/how-bayesian-should-BO-be).
Here, we report performance in terms of the logarithm of the simple regret, i.e. the difference between the best-seen function evaluation up to iteration t and the true minimum value of the objective.
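As a concrete illustration, the log simple regret at each iteration can be computed from the sequence of function evaluations (a minimal sketch; `f_min` denotes the known true minimum of the benchmark problem):

```python
import numpy as np

def log_simple_regret(evaluations, f_min):
    """Log simple regret per iteration: the running best-seen evaluation
    minus the true minimum, on a log scale (minimisation)."""
    best_seen = np.minimum.accumulate(evaluations)  # running best value
    return np.log(best_seen - f_min)

# Toy usage: the regret shrinks only when a better evaluation arrives.
evals = np.array([3.0, 2.5, 2.7, 1.1])
print(log_simple_regret(evals, f_min=1.0))
```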
Each combination of acquisition function (EI and UCB) and kernel (isotropic and ARD) was evaluated on the test problems shown in Table 1, for each of the four observational noise levels.
Figure 3 shows, for the EI and UCB acquisition functions and all noise levels, the number of times BO with each inference method was the best-performing according to a one-sided, paired Wilcoxon signed-rank test (Knowles et al., 2006) with Holm-Bonferroni correction (Holm, 1979). As can be seen from the plot, for EI with the isotropic kernel, MAP outperforms the other inference methods. Given that there are only three hyperparameters regardless of the problem dimensionality, this is not wholly surprising. It is likely that the hyperparameter posterior distribution (18) quickly becomes sharply unimodal, particularly given the lack of freedom in the parameters due to the single length scale. This matches the observations of Rasmussen and Williams (2006), who note that as the amount of data increases, as it does in BO, one often finds a local optimum that is orders of magnitude more probable than any other and, indeed, that it becomes more sharply peaked.
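The Holm-Bonferroni step-down procedure used for the multiple-comparison correction can be sketched as follows (a minimal implementation operating on raw p-values; the significance level `alpha=0.05` is illustrative, as the paper's exact value is not reproduced here):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a rejection decision for each hypothesis under Holm's method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value against alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

# Toy usage: only the smallest p-value survives the correction here.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

Holm's method controls the family-wise error rate while being uniformly more powerful than the plain Bonferroni correction, which is why it is the standard choice for this kind of many-problem comparison.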
Conversely, for EI with an ARD kernel, MCMC outperforms MAP. With ARD, the number of hyperparameters grows with the problem dimensionality, so there is much more freedom to allow for different, but equally likely, explanations of the data. This corresponds to a much more diffuse, and potentially more multimodal, posterior density. We note that this does not conflict with Rasmussen and Williams's observation, because the curse of dimensionality means that much more data would be required for one mode to become dominant and sharply peaked. Figure 4 shows some illustrative convergence plots using EI with an isotropic (upper) and ARD kernel (lower).
Figure 3 also shows that, for EI, BO with MCMC inference (top row, red bars) was often statistically significantly better than both MFVI and FRVI (orange and green bars respectively) on the majority of test problems, kernels, and noise levels. In fact, when MCMC was equal to or better than MAP, the ordering of the inference methods was always MCMC ≥ MAP > FRVI > MFVI. This suggests that the posterior hyperparameter distribution is not well approximated by either VI method. Figure 1 lends weight to this interpretation because the posterior distribution was boomerang-shaped, which cannot be well approximated by either the mean field approximation, in which each parameter is independent, or the full-rank approximation, which is constrained to be Gaussian. Note that the greater flexibility of FRVI allows it to perform better than MFVI. Due to the superiority of MCMC over the variational methods, we compare only MAP and MCMC for the remainder of this work.
Figure 5 summarises the performance of MAP vs. MCMC for each combination of acquisition function and kernel (columns), and test problem (rows). As can be seen from the figure, when using UCB, MCMC-based inference is almost always worse. We suspect that the increased uncertainty of the full posterior (MCMC) leads to increased exploration with an already exploratory acquisition function, which ultimately hampers convergence. Conversely, using EI with an isotropic kernel, neither MAP nor MCMC is consistently superior. MCMC is almost always better with ARD when using EI, probably due to the better representation of the posterior hyperparameter distribution, which allows better prediction and exploration of the function being optimised.
In order to investigate whether UCB and EI really are more exploratory when using MCMC than when using MAP, we calculated the Euclidean distance between consecutively evaluated locations in each optimisation run, and found that this is indeed the case (see the supplementary material for plots of the distances). With UCB, the distances between consecutive locations are smallest for MAP and increase substantially for the other inference methods. Distances between consecutively evaluated locations for EI, however, did not follow a similar trend: in the noise-free setting, there was practically no difference in the behaviour of EI with MAP or MCMC. Conversely, as the noise level increases, the approximate inference-based methods become more exploratory than MAP, particularly on the higher-dimensional problems. This indicates that MAP estimates are sufficient for the lower-dimensional problems, where the hyperparameter posterior distribution quickly becomes sharply peaked. EI is known to be too exploitative, and, as discussed in Section 2.1, several works have used ad hoc schemes to increase the amount of exploration it performs. Therefore, we argue that if the EI acquisition function is being used, then accounting for hyperparameter uncertainty via a fully-Bayesian treatment is essential, particularly in higher dimensions. Doing so in a principled, Bayesian way incorporates additional exploration when it is needed, without recourse to ad hoc rules.
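The exploration measure described above, the Euclidean distance between consecutively evaluated locations, is straightforward to compute from the sequence of queried points (a minimal sketch; `X` is assumed to hold the evaluated locations in query order):

```python
import numpy as np

def consecutive_distances(X):
    """Euclidean distance between each pair of consecutively evaluated points.

    X has shape (t, d): t evaluated locations in d dimensions; returns t-1
    distances. Larger values indicate more exploratory behaviour.
    """
    return np.linalg.norm(np.diff(X, axis=0), axis=1)

# Toy usage: three queries tracing two unit steps on the unit square.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(consecutive_distances(X))
```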
Finally, we compare the different combinations of acquisition function, inference method, and kernel type to give some general advice to practitioners as to which of these BO components should be used. As shown in the supplementary material, EI with a fully-Bayesian treatment of the hyperparameters and an ARD kernel leads to the best performance in the noise-free setting, with EI vastly outperforming UCB for all combinations of components. However, as the noise level increases, the picture becomes less clear. UCB improves its performance in comparison to EI and the methods are roughly equal for the different noise levels (supplementary material, Figure 3). EI’s deteriorating performance at higher noise levels may indicate that the surrogate model is poorer at representing the noisy function, leading to insufficient exploration. This is in contrast to the more exploratory UCB, which over-explores in the noise-free case (Berk et al., 2019). It would be interesting to investigate BO strategies for optimisation in the presence of noise. It is clear that a fully-Bayesian treatment of the hyperparameters should be avoided with the UCB acquisition function because the increased level of exploration led to poor performance for both kernel types across all levels of noise.
We investigated how a fully-Bayesian treatment of the Gaussian process hyperparameters affects optimisation performance. With noise-free evaluations, EI with MCMC inference and an ARD kernel was the best combination of the evaluated BO components. However, as the observational noise level increased, there was less to differentiate between the components. In the case of EI, MCMC was found to be more effective with an ARD kernel than with an isotropic one. We attribute this to the additional flexibility to model the complex and possibly multimodal hyperparameter posterior that is afforded by a kernel with a separate length scale for each dimension. MCMC is generally superior to the variational methods (MFVI and FRVI) because the marginal posterior is often sufficiently complex that it is poorly modelled by either the (necessarily unimodal) mean field or full-rank Gaussian approximations.
We do not recommend the fully-Bayesian approach with UCB because the additional hyperparameter uncertainty leads to even greater exploration with the already over-exploratory UCB acquisition function. However, this is not the case for EI and we, therefore, recommend the fully-Bayesian treatment of the hyperparameters in BO using MCMC because it allows for a principled way to increase exploration without any ad hoc enhancements to EI.
Other important future directions also focus on the modelling ability of the surrogate. Particular aspects are the non-stationarity induced as the optimiser converges (e.g. Snoek et al., 2014) and improving the modelling of functions with degenerate features, such as discontinuities, using deep GPs (Damianou and Lawrence, 2013; Salimbeni et al., 2019).