    # What do you Mean? The Role of the Mean Function in Bayesian Optimisation

Bayesian optimisation is a popular approach for optimising expensive black-box functions. The next location to be evaluated is selected via maximising an acquisition function that balances exploitation and exploration. Gaussian processes, the surrogate models of choice in Bayesian optimisation, are often used with a constant prior mean function equal to the arithmetic mean of the observed function values. We show that the rate of convergence can depend sensitively on the choice of mean function. We empirically investigate 8 mean functions (constant functions equal to the arithmetic mean, minimum, median and maximum of the observed function evaluations, linear, quadratic polynomials, random forests and RBF networks), using 10 synthetic test problems and two real-world problems, and using the Expected Improvement and Upper Confidence Bound acquisition functions. We find that for design dimensions >5 using a constant mean function equal to the worst observed quality value is consistently the best choice on the synthetic problems considered. We argue that this worst-observed-quality function promotes exploitation leading to more rapid convergence. However, for the real-world tasks the more complex mean functions capable of modelling the fitness landscape may be effective, although there is no clearly optimal choice.


## 1. Introduction

Bayesian optimisation (BO) is a popular approach for optimising expensive (in terms of time and/or money) black-box functions that have no closed-form expression or derivative information (Snoek et al., 2012; Shahriari et al., 2016). It is a surrogate-based modelling approach that employs a probabilistic model built with previous function evaluations. The Gaussian process (GP) model typically used in BO provides a posterior predictive distribution that models the target function in question and quantifies the amount of predictive uncertainty. A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006). It can be fully specified by its mean function and kernel function (also known as a covariance function) (Rasmussen and Williams, 2006). The kernel and mean function may be regarded as specifying a Bayesian prior on the functions from which the data are generated. The kernel function describes the structure, such as the smoothness and amplitude, of the functions that can be modelled, while the mean function specifies the prior expected value of the function at any location (Shahriari et al., 2016).

In BO, the location that maximises an acquisition function (or infill criterion) is chosen as the next location to be expensively evaluated. Acquisition functions combine the surrogate model’s predictions and the uncertainty about its prediction to strike a balance between myopically exploiting areas of design space that are predicted to yield good-quality solutions and exploring regions that have high predicted uncertainty.

In the literature, many acquisition functions have been proposed (Kushner, 1964; Močkus et al., 1978; Srinivas et al., 2010; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017; De Ath et al., 2019), and, in general, no one strategy has been shown to be all-conquering due to the no free lunch theorem (Wolpert and Macready, 1997). However, recent works have shown that purely exploiting the surrogate model becomes a more effective strategy as the dimensionality of the problem increases (De Ath et al., 2019; Rehbach et al., 2020). Similarly, the role of the kernel function in BO has been investigated (Acar, 2013; Mukhopadhyay et al., 2017; Lindauer et al., 2019; Palar and Shimoyama, 2019; Palar et al., 2020). One of the most popular kernels is the radial basis function kernel (also known as the squared exponential kernel) (Snoek et al., 2012). However, it is generally regarded as being too smooth for real-world functions (Stein, 2012; Rasmussen and Williams, 2006), and the Matérn family of kernels is often preferred.

Contrastingly, little attention has been paid to the role of the mean function in BO, with general practice being to use a constant value of zero (Wang and Jegelka, 2017; Hernández-Lobato et al., 2014; De Ath et al., 2019), although the constant value can also be inferred from the data (Shahriari et al., 2016). In general regression tasks, other mean functions have been considered, such as polynomials (Blight and Ott, 1975; Kennedy and O'Hagan, 2001; Williamson et al., 2015) and, more recently, non-parametric methods such as neural networks (Iwata and Ghahramani, 2017; Fortuin et al., 2019). In light of the lack of previous work into the role of the mean function in BO, we investigate the effect of different mean functions in BO in terms of both the convergence rate of the optimisation and the quality of the best found solution. Specifically, we compare the performance of using different constant values, linear and quadratic functions, as well as random forests and radial basis function networks.

Our main contributions can be summarised as follows:

• We provide the first empirical study of the effect of using different Gaussian process mean functions in Bayesian optimisation.

• We evaluate eight mean functions on ten well-known, synthetic test problems and two real-world applications. This assessment is on a range of design dimensions (2 to 10) and for two popular acquisition functions.

• We show empirically, and explain in terms of the exploration versus exploitation trade-off, that choosing the mean function to be the constant function equal to the worst evaluation of the objective function seen so far is consistently no worse than, and often superior to, other choices of mean function.

We begin in Section 2 by briefly reviewing Bayesian optimisation. In Section 3 we review Gaussian processes, paying particular attention to the mean function, and introduce the various mean functions we evaluate in this work. Extensive empirical experimentation is carried out on well-known test problems and two real-world applications in Section 4. We finish with concluding remarks in Section 5.

## 2. Bayesian Optimisation

Bayesian optimisation (BO), also known as Efficient Global Optimisation (EGO), is a surrogate-assisted global search strategy that sequentially samples design space at likely locations of the global optimum, taking into account both the surrogate model’s prediction and the associated prediction uncertainty (Jones et al., 1998). See (Shahriari et al., 2016; Brochu et al., 2010; Frazier, 2018) for comprehensive reviews of BO. Without loss of generality, we can define the problem of finding a global minimum of an unknown objective function as

(1)  $\min_{x \in \mathcal{X}} f(x),$

where $\mathcal{X}$ is the feasible design space of interest. We assume that $f$ is a black-box function, i.e. it has no simple closed form, but that we can access the results of its evaluations at any location $x \in \mathcal{X}$, although evaluating $f$ is expensive, so that the number of evaluations required to locate the global optimum should be minimised.

Algorithm 1 outlines the BO procedure. It starts (line 2) with a space-filling design, typically Latin hypercube sampling (McKay et al., 2000), of the feasible space. These samples are then expensively evaluated with the function $f$, and a training dataset $\mathcal{D}$ is constructed (line 5). Then, at each iteration of the sequential algorithm, a regression model, usually a Gaussian process (GP), is constructed and trained (line 7) using the current training data. The choice of where next to expensively evaluate is determined by maximising an acquisition function (or infill criterion) $\alpha(x)$, which balances the exploitation of regions of design space that are predicted to yield good-quality solutions and the exploration of regions of space where the predictive uncertainty is high. The design maximising $\alpha(x)$ is expensively evaluated and the training data are subsequently augmented (lines 8 to 10). This process is then repeated until the budget has been expended.
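The procedure above can be sketched as a short loop; the helper names (`sample_lhs`, `fit_surrogate`, `maximise_acq`) are hypothetical placeholders for the components discussed in the text, not part of any particular library:

```python
def bayes_opt(f, sample_lhs, fit_surrogate, maximise_acq, n_init, budget):
    """Minimal sketch of the BO loop in Algorithm 1 (helper names are hypothetical)."""
    X = list(sample_lhs(n_init))            # line 2: space-filling initial design
    y = [f(x) for x in X]                   # lines 3-5: expensive initial evaluations
    while len(X) < budget:
        model = fit_surrogate(X, y)         # line 7: e.g. train a GP on current data
        x_next = maximise_acq(model, X, y)  # line 8: maximise the acquisition function
        X.append(x_next)                    # lines 9-10: evaluate and augment
        y.append(f(x_next))
    i_best = min(range(len(y)), key=y.__getitem__)
    return X[i_best], y[i_best]
```

Any surrogate/acquisition pair with these call signatures can be dropped in, which is how the mean-function variants studied here are compared on an otherwise identical pipeline.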

Two of the most popular acquisition functions are Expected Improvement (EI) (Močkus et al., 1978) and Upper Confidence Bound (UCB) (Srinivas et al., 2010). EI measures the positive predicted improvement over the best solution evaluated thus far, $f^\star = \min_n f_n$:

(2)  $\alpha_{EI}(x) = \sigma(x)\left(s\,\Phi(s) + \phi(s)\right),$

where $s = (f^\star - \mu(x))/\sigma(x)$ is the predicted improvement at $x$ normalised by the uncertainty, and $\Phi(\cdot)$ and $\phi(\cdot)$ are the Gaussian cumulative distribution and probability density functions respectively. UCB is a weighted sum of the mean prediction and its associated uncertainty:

(3)  $\alpha_{UCB}(x) = \sqrt{\beta_t}\,\sigma(x) - \mu(x),$

where $\beta_t$ is a weight that depends on the number of function evaluations and explicitly controls the exploitation vs. exploration trade-off. Note that both EI and UCB are presented here in the form used for minimisation. While other acquisition functions have been proposed, such as probability of improvement (Kushner, 1964) and entropy-based methods such as predictive entropy search (Hernández-Lobato et al., 2014) and max-value entropy search (Wang and Jegelka, 2017), we limit our investigation to the commonly used EI and UCB to allow the focus to remain on the mean functions themselves.
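Under the Gaussian posterior, both acquisition functions reduce to a few lines. The sketch below assumes scalar predictions and writes UCB in a minimisation form (weighted uncertainty minus predicted mean):

```python
import math

def norm_pdf(s):
    """Standard normal probability density."""
    return math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)

def norm_cdf(s):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(s / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """EI for minimisation, Eq. (2); s is the normalised predicted improvement."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)   # degenerate case: no predictive uncertainty
    s = (f_best - mu) / sigma
    return sigma * (s * norm_cdf(s) + norm_pdf(s))

def ucb(mu, sigma, beta):
    """UCB for minimisation: trade off low predicted mean against high uncertainty."""
    return math.sqrt(beta) * sigma - mu
```

Note that EI is large either when the predicted mean is below the incumbent or when the uncertainty is large, which is exactly the lever the prior mean function pulls far from the data.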

## 3. Gaussian Processes

Gaussian processes are a common choice of surrogate model due to their strengths in uncertainty quantification and function approximation (Shahriari et al., 2016; Rasmussen and Williams, 2006). A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006). We can define a Gaussian process prior over $f$ to be $f(x) \sim \mathcal{GP}(m(x), \kappa(x, x' \mid \theta))$, where $m(x)$ is the mean function and $\kappa(x, x' \mid \theta)$ is the kernel function (covariance function) with hyperparameters $\theta$. Given data $\mathcal{D} = \{(x_n, f_n = f(x_n))\}_{n=1}^N$ consisting of evaluations of $f$ at sampled locations $x_n$, the posterior distribution of $f$ at a given location $x$ is Gaussian:

(4)  $p(f \mid x, \mathcal{D}, \theta) = \mathcal{N}\!\left(f \mid \mu(x), \sigma^2(x)\right)$

with posterior mean and variance

(5)  $\mu(x \mid \mathcal{D}, \theta) = m(x) + \kappa(x, X)\,K^{-1}(f - m)$

(6)  $\sigma^2(x \mid \mathcal{D}, \theta) = \kappa(x, x) - \kappa(x, X)^\top K^{-1} \kappa(X, x).$

Figure 1. GP models with identical training data (blue) and kernel hyperparameters, along with different constant mean functions. The mean function used in each GP is shown (grey, dashed) and corresponds to the value of the smallest (left), arithmetic mean (centre), and largest (right) seen function values. The lower row shows the corresponding EI and its maximiser (red dashed). Note how the min and max models lead EI to prefer exploring and exploiting respectively.

Here $X = (x_1, x_2, \ldots, x_N)^\top$ is the matrix of design locations, $f = (f_1, f_2, \ldots, f_N)^\top$ is the corresponding vector of true function evaluations and $m = (m(x_1), m(x_2), \ldots, m(x_N))^\top$ is the vector comprised of the mean function evaluated at the design locations. The kernel matrix $K$ is given by $K_{ij} = \kappa(x_i, x_j)$ and $\kappa(x, X)$ is given by $(\kappa(x, x_1), \ldots, \kappa(x, x_N))$. In this work we use an isotropic Matérn $\nu = 5/2$ kernel:

(7)  $\kappa(x, x') = \sigma_f^2 \left(1 + \sqrt{5}\,r + \tfrac{5}{3} r^2\right) \exp\!\left(-\sqrt{5}\,r\right),$

where $r = \lVert x - x' \rVert / \ell$, with length scale $\ell$ and signal variance $\sigma_f^2$, as recommended for modelling realistic functions (Snoek et al., 2012). The kernel's hyperparameters $\theta$ are learnt via maximising the log marginal likelihood (up to a constant):

(8)  $\log p(f \mid X, \theta) = -\tfrac{1}{2}\log\lvert K \rvert - \tfrac{1}{2}(f - m)^\top K^{-1} (f - m)$

using a multi-restart strategy (Shahriari et al., 2016) with L-BFGS-B (Byrd et al., 1995). Henceforth, we drop the explicit dependencies on the data and the kernel hyperparameters for notational simplicity.
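Equations (5) and (6) transcribe directly into code. This is a minimal sketch with an explicit prior mean; for brevity it uses a squared-exponential kernel as a stand-in for the Matérn kernel used in the experiments:

```python
import numpy as np

def sqexp_kernel(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel (stand-in for the paper's Matern 5/2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(x, X, f, mean_fn, kernel=sqexp_kernel):
    """Posterior mean and variance, Eqs (5)-(6), for a single test point x."""
    x = np.atleast_2d(x)
    K = kernel(X, X) + 1e-10 * np.eye(len(X))   # jitter for numerical stability
    k_x = kernel(x, X).ravel()
    # alpha = K^{-1}(f - m); note the prior mean is subtracted from the targets
    alpha = np.linalg.solve(K, f - np.array([mean_fn(xi) for xi in X]))
    mu = mean_fn(x.ravel()) + k_x @ alpha
    var = kernel(x, x)[0, 0] - k_x @ np.linalg.solve(K, k_x)
    return mu, var
```

Far from the data, `k_x` vanishes and the prediction collapses onto `mean_fn(x)` with the prior variance, which is precisely the reversion behaviour discussed in Section 3.1.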

### 3.1. Mean Functions

It is well known that the posterior prediction (4) of the GP reverts to the prior as the distance of $x$ from the observed data increases, i.e. as $\min_n \lVert x - x_n \rVert \to \infty$. In particular, the posterior mean reverts to the prior mean, i.e. $\mu(x) \to m(x)$, and the posterior variance approaches the prior variance, $\sigma^2(x) \to \sigma_f^2$ for the Matérn kernel. The upper row of Figure 1 illustrates this effect for three different constant mean values: the best, average (arithmetic mean), and worst function values observed thus far. Note how the predicted values of the GP tend to the mean function (dashed) as the distance from the nearest evaluated location (blue) increases. The figure also shows another important, and often overlooked, aspect of the mean function: its effect on the acquisition function (lower row). In this case three different locations maximise EI for the three mean functions, and it is not known a priori which location is preferable.

Practitioners of BO usually standardise the observations before fitting the GP at each iteration, i.e. they subtract the mean of the observed function values and divide by their standard deviation, after which the mean function is taken as the constant function equal to zero; that is, the effective mean function is the constant function equal to the arithmetic mean of the observed function values. In addition, the prior variance of the GP is matched to the observed variance of the function values. Although this standardisation is not usually discussed in the literature, it is commonplace in standard BO libraries, e.g. GPyOpt (authors, 2016), BoTorch (Balandat et al., 2019) and Spearmint. It is, however, unclear whether this is the best choice for BO or whether a different mean function may be preferable. We now introduce the mean functions that will be evaluated in this work.

Here, we consider the mean function to be a weighted sum of basis functions $h(x) = (h_1(x), \ldots, h_J(x))^\top$ with corresponding weights $w$ (Rasmussen and Williams, 2006):

(9)  $m(x) = h(x)^\top w.$

A constant mean function with value $c$, for example, can be written as $h(x) = 1$ with $w = c$. In addition to the standard constant function equal to the arithmetic mean of the data, $m_{\text{mean}}$, we consider three other constant values: the best and worst seen observations at each iteration, i.e. $m_{\min} = \min_n f_n$ and $m_{\max} = \max_n f_n$, and the median observed value, $m_{\text{median}}$. In comparison to using the data mean, using $m_{\min}$ leads to acquisition functions becoming more exploratory, as illustrated in Figure 1 (left panel). This is because locations far away from previously evaluated solutions will have predicted means equal to $m_{\min}$ with large predicted uncertainty, leading to large values of, for example, EI and UCB. Conversely, using $m_{\max}$ will, as illustrated in Figure 1 (right panel), lead to increased exploitation, because regions far from the data have large uncertainty but poor predicted values and hence small acquisition function values. Interestingly, the effect of using $m_{\min}$ or $m_{\max}$ will change over the course of the optimisation. When there are relatively few function evaluations, $m_{\min}$ and $m_{\max}$ will be approximately equal to $m_{\text{mean}}$. However, as the number of expensive function evaluations increases and the optimisation converges towards the (estimated) optimum, $m_{\text{mean}}$ and particularly $m_{\text{median}}$ will tend to approach $m_{\min}$, thus leading to increased exploration.
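The four constant choices can be read straight off the observed values; a trivial sketch:

```python
import statistics

def constant_mean(f_vals, kind):
    """Constant prior mean value computed from the observations seen so far.

    'min' uses the best-seen value (promotes exploration); 'max' the worst-seen
    value (promotes exploitation), as illustrated in Figure 1.
    """
    return {
        "arithmetic": statistics.fmean(f_vals),
        "median": statistics.median(f_vals),
        "min": min(f_vals),
        "max": max(f_vals),
    }[kind]
```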

We also consider linear and quadratic mean functions. Linear mean functions are defined as $h(x) = (1, x_1, \ldots, x_d)^\top$, with corresponding weights $w \in \mathbb{R}^{d+1}$ and where $x_i$ refers to the $i$th element of $x$. Quadratic mean functions are defined similarly, with polynomial terms up to degree 2 and with weights $w \in \mathbb{R}^J$, where $J$ is the number of polynomial basis functions. To avoid overfitting to the data, these regressions are typically trained via a regularised least-squares approximation, i.e. ridge regression (also known as Tikhonov regularisation). The optimal regularised weights $w^\star$ are estimated by solving

(10)  $w^\star = \operatorname*{arg\,min}_w \; \lVert f - Hw \rVert^2 + \lambda \lVert w \rVert^2,$

where $H = (h(x_1), \ldots, h(x_N))^\top$ is the design matrix and $\lambda \geq 0$ controls the amount of regularisation. The regularised least-squares estimator is

(11)  $w^\star = (H^\top H + \lambda I)^{-1} H^\top f.$

In this work, the regularisation parameter $\lambda$ was chosen via five-fold cross-validation over a fixed set of candidate values.
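Equation (11) in code; the design-matrix builder below is a sketch for the linear basis (the quadratic case simply appends the higher-order terms):

```python
import numpy as np

def linear_design(X):
    """Design matrix H with rows h(x) = (1, x_1, ..., x_d)."""
    return np.hstack([np.ones((len(X), 1)), X])

def ridge_weights(H, f, lam):
    """Regularised least-squares weights, Eq. (11), via a linear solve."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ f)
```

Solving the regularised normal equations with `np.linalg.solve` avoids forming the explicit inverse in (11), which is the numerically preferable route.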

Another choice of basis functions are radial basis functions (RBFs), which have the property that each basis function depends only on the Euclidean distance from a fixed centre $c_j$ (Bishop, 2006). These are known as RBF networks and can be thought of as either linear neural networks using RBF activation functions (Orr, 1996) or as finite-dimensional Gaussian processes (Bishop, 2006). A commonly used set of basis functions, and the ones used in this work, are the Gaussian RBFs $h_j(x) = \exp\!\left(-\lVert x - c_j \rVert^2 / (2\ell^2)\right)$. While any set of locations can be used as the centres $\{c_j\}$, we place a Gaussian RBF at each of the previously evaluated locations, i.e. $c_j = x_j$, resulting in $N$ basis functions. Similarly to the linear and quadratic mean functions, the regularisation parameter $\lambda$ and length scale $\ell$ were chosen via five-fold cross-validation. Values of $\lambda$ were selected in the same range as for the linear and quadratic mean functions, and the values of $\ell$ were selected from a fixed candidate set. A more fine-grained selection of $\ell$ values was chosen following a preliminary investigation, which revealed the modelling error to be more sensitive to changes in $\ell$ than in $\lambda$. We note here that the use of regularisation is particularly important when placing an RBF on each evaluated location, because the RBF network will otherwise be able to perfectly interpolate the data. This would lead to $f - m = 0$ and therefore (8) would reduce to $-\tfrac{1}{2}\log\lvert K \rvert$, which can be maximised by either $\sigma_f \to 0$ or $\ell \to \infty$ in (7). This results in the posterior variance estimates being over-confident and thus having small variance everywhere, with predictions determined by the mean function.
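An RBF-network mean with one Gaussian basis per evaluated location, fitted with the same ridge penalty as above; note how a vanishing $\lambda$ lets the network interpolate the data exactly, the degenerate case just described:

```python
import numpy as np

def rbf_design(X, centres, ell):
    """Gaussian RBF basis h_j(x) = exp(-||x - c_j||^2 / (2 ell^2))."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def fit_rbf_mean(X, f, ell, lam):
    """Fit the weights via Eq. (11), with one centre at each evaluated location."""
    H = rbf_design(X, X, ell)
    w = np.linalg.solve(H.T @ H + lam * np.eye(len(X)), H.T @ f)
    return lambda x: (rbf_design(np.atleast_2d(x), X, ell) @ w).item()
```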

Lastly, we include a non-parametric regressor, extremely randomised trees, better known as Extra-Trees (ET; Geurts et al., 2006), a variant of Random Forests (RF; Breiman, 2001). RFs are ensembles of classification or regression trees that are each trained on a different, randomly chosen subset of the data. Unlike RFs, which attempt to split the data at each cut-point of a tree optimally, the ET method selects the cut-point as the best from a small set of randomly chosen cut-points; this additional randomisation results in a smoother regression in comparison to RFs. Given that ETs use randomised cut-points, they typically use all the training data in each tree. However, to counteract the overfitting that this would produce, we allow each item in the training set to be resampled, instead of using all elements of the set for each tree. Note that, while not presented as such here, RFs can also be interpreted as a kernel method (Scornet, 2016; Davies and Ghahramani, 2014).
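As a sketch, scikit-learn's `ExtraTreesRegressor` supports this configuration (randomised cut-points, with bootstrap resampling enabled); whether this matches the paper's exact hyperparameter settings is an assumption on our part:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fit_et_mean(X, f, n_trees=100, seed=0):
    """Extra-Trees mean function: randomised cut-points, with bootstrap
    resampling enabled to counteract overfitting, as described above."""
    model = ExtraTreesRegressor(n_estimators=n_trees, bootstrap=True,
                                random_state=seed)
    model.fit(X, f)
    return lambda x: model.predict(np.atleast_2d(x))[0]
```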

## 4. Experimental Evaluation

We now investigate the performance of the mean functions discussed in Section 3.1 using the EI and UCB acquisition functions (Section 2) on ten well-known benchmark functions with a range of dimensionalities and landscape properties, and on two real-world applications. Full results of all experimental evaluations are available in the supplementary material. The mean functions to be evaluated are the constant functions equal to the arithmetic mean, median, minimum and maximum of the observations, labelled Arithmetic, Median, Min, and Max respectively, as well as the Linear, Quadratic, Extra-Trees (RandomForest), and RBF network-based (RBF) mean functions.

A Gaussian process surrogate model with an isotropic Matérn kernel (7) was used in all experiments. The Bayesian optimisation runs themselves were carried out as in Algorithm 1, with the additional step of fitting a mean function before training the GP (Algorithm 1, line 7) at each iteration. All test problems evaluated in this work were scaled to a common hypercube, and observations were standardised at each BO iteration, prior to mean function fitting. The models were initially trained on observations generated by maximin Latin hypercube sampling (McKay et al., 2000), and each optimisation run was repeated 51 times with different initialisations. The same set of initial observations was used for each of the mean functions to enable statistical comparisons. The hyperparameters of the GP were optimised by maximising the marginal log likelihood (8) with L-BFGS-B (Byrd et al., 1995) using 10 restarts. Following common practice, maximisation of the acquisition functions was carried out via multi-start optimisation; details of the full procedure can be found in (Balandat et al., 2019). The trade-off between exploitation and exploration in UCB, $\beta_t$, is set according to Theorem 1 in (Srinivas et al., 2010), and increases logarithmically with the number of function evaluations. The Bayesian optimisation pipeline and mean functions were implemented with BoTorch (Balandat et al., 2019), and code is available online to recreate all experiments, as well as the LHS initialisations used and full optimisation runs.

Optimisation quality is measured with the simple regret $R_t$, which is the difference between the true minimum value $f(x^\star)$ and the best value found so far after $t$ evaluations:

(12)  $R_t = \lvert f(x^\star) - \min\{f_1, f_2, \ldots, f_t\} \rvert.$
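Equation (12) directly:

```python
def simple_regret(f_min_true, f_evals):
    """Simple regret R_t, Eq. (12): gap between the true minimum and best-so-far."""
    return abs(f_min_true - min(f_evals))
```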

### 4.1. Synthetic Experiments

The mean functions were evaluated on the ten popular synthetic benchmark functions listed in Table 1, with a fixed budget of function evaluations that included the initial LHS samples. These functions were selected due to their different dimensionalities and landscape properties, such as the presence of multiple local or global minima (Ackley, Eggholder, Hartmann6, GoldsteinPrice, StyblinskiTang, Shekel), deep valley-like regions (Branin, Rosenbrock, SixHumpCamel), and steep ridges and drops (Michalewicz).

Table 2 shows the median regret over the 51 repeated experiments, together with the median absolute deviation from the median (MAD), for the mean functions using the EI acquisition function. Due to space constraints, the corresponding table for UCB is included in the supplementary material. The method with the lowest (best) median regret on each function is highlighted in dark grey, and those highlighted in light grey are statistically equivalent to the best method according to a one-sided paired Wilcoxon signed-rank test (Knowles et al., 2006) with Holm-Bonferroni correction (Holm, 1979).

Figure 2. Illustrative convergence plots for eight benchmark problems using the EI acquisition function. Each plot shows the median regret, with shading representing the interquartile range across the 51 runs and the dashed vertical line indicating the end of the initial LHS phase.

Figure 3. Illustrative convergence plots for eight benchmark problems using the UCB acquisition function. Each plot shows the median regret, with shading representing the interquartile range across the 51 runs and the dashed vertical line indicating the end of the initial LHS phase.

The convergence of the various mean functions on illustrative test problems is shown for the EI (Figure 2) and UCB (Figure 3) acquisition functions. Convergence plots for the Branin and GoldsteinPrice test problems were visually similar to Eggholder and SixHumpCamel respectively; they are available in the supplementary material. As one might expect, because points are naturally less distant from one another, the choice of mean function has less impact in two dimensions. Interestingly, though, optimisation runs with the UCB algorithm on the two-dimensional problems achieve lower regret with the constant mean functions than with the others evaluated.

Perhaps surprisingly, the non-constant mean functions, Linear, Quadratic, RandomForest and RBF, appear to offer no advantage over the constant mean functions despite their ability to model the large scale optimisation landscape. The Quadratic model, in two dimensions where there are only three parameters to be fitted, appears to be well suited to the SixHumpCamel function which is roughly bowl-shaped (albeit with quartic terms); however, this appears to be an exceptional case.

In higher dimensions with the EI acquisition function, using the worst observation value as the constant mean function (Max) consistently provides the lowest regret on the test functions evaluated. This is consistent with recent work (De Ath et al., 2019; Rehbach et al., 2020) showing that being more exploitative in higher dimensions is preferable to most other strategies. However, for the UCB acquisition function this is not the case and no mean function is consistently best. We suspect that this is because the value of $\beta_t$ is so large that the UCB function (3) is always dominated by the exploratory term $\sqrt{\beta_t}\,\sigma(x)$, so that the mean function has relatively little influence. This is in contrast to EI, which has been shown (De Ath et al., 2019) to be far more exploitative than UCB.

The standard choice of using a constant mean function equal to the arithmetic mean of the observations (Arithmetic) is, in higher dimensions, only statistically equivalent to the best-performing method on one of the five test functions for both EI and UCB. This result calls into question the efficacy of the common practice of using this constant mean function in all Bayesian optimisation tasks. Based on these results, we suggest that an increase in performance may be obtained by using the Max mean function with EI. Although there does not appear to be such a clear-cut answer as to which mean function should be used in conjunction with the UCB acquisition function, we posit that this is less important because, based on these optimisation results, one would prefer the performance of EI over UCB in general.

### 4.2. Active Learning for Robot Pushing

Like (Wang and Jegelka, 2017; Jiang et al., 2019; De Ath et al., 2019, 2020), we optimise the control parameters for two active learning robot pushing problems (Wang et al., 2018); see (De Ath et al., 2019) for a diagrammatic outline of the problems. In the push4 problem, a robot should push an object towards an unknown target location and is constrained such that it can only travel in the initial direction of the object. Once the robot has finished pushing, it receives feedback in the form of the object-target distance. The robot is parametrised by its initial location, the orientation of its pushing hand and how long it pushes for. This can therefore be cast as a minimisation problem in which the four parameter values are optimised with respect to the object's final distance from the target. The object's initial location is fixed to the centre of the problem domain (Wang and Jegelka, 2017) and the target location is changed for each of the 51 optimisation runs, with these kept the same across mean functions. Thus, the optimisation performance is considered over problem instances rather than over initialisations of a single instance.

In the second problem, push8, two robots push their respective objects towards two unknown targets, with their movements constrained so that they always travel in the direction of their object's initial location. The parameters controlling the robots can be optimised to minimise the summed final object-target distances. Like push4, initial object locations were fixed across problem instances and the targets' locations were chosen randomly, with a constraint enforcing that both objects could cover the targets without overlapping. This means, however, that in some problem instances it may not be possible for both robots to push their objects to their respective targets because they will block each other. Thus, for push8 we report the final summed object-target distances rather than the regret, because the global optimum is not known.

Figure 4. Convergence plots for the robot pushing problems using EI and UCB. Each plot shows the median regret, with shading representing the interquartile range of the 51 runs.

Figure 4 shows the convergence plots using the various mean functions with the EI and UCB acquisition functions. Tabulated results are available in the supplementary material. RandomForest, one of the worst-performing mean functions on the synthetic functions, had substantially lower regret than the other mean functions when using UCB on both problems and when using EI on the four-dimensional push4. However, in the eight-dimensional push8, all mean functions were statistically equivalent when using EI, apart from Min, Linear and RBF, which were worse. When using UCB, the RandomForest mean function was statistically better than all other mean functions, and it achieved results comparable to using EI.

It might be suspected that the superior ability of the Random Forest (and RBF on some problems) to represent the inherently difficult landscape features of these problems would account for the better performance of RandomForest. The landscape for these problems has sharp changes (high gradients) in the function, e.g. when a change of the robot’s starting location results in the object no longer being pushed towards the target, as well as plateaux where changes in certain parameter values have little effect, e.g. if the amount of pushing time results in the robot not reaching the object. However, we investigated the mean prediction error of each of the combined mean plus Gaussian process models trained on the first 100 expensively evaluated locations by calculating the normalised root mean squared prediction error at 1000 locations (chosen by Latin Hypercube Sampling). As shown in the supplementary material, this indicates that, in fact, the RF model has a comparatively poor prediction error. By contrast, the RBF mean function yields the most accurate model of the overall landscape, but it only shows better performance for push4 using UCB. These prediction errors were evaluated over the entire domain and, therefore, it is possible that the RF mean function is sufficiently superior in the vicinity of the optimum to allow the more rapid convergence seen here.

Interestingly, the performance of the Max mean function with EI on the synthetic test problems is not reflected on these two more real-world problems. However, both the Arithmetic and Max mean functions are statistically equivalent on both push4 and push8, with Max having lower median regret and MAD than Arithmetic. Nonetheless, it remains unclear why the RF mean function gives best performance in three of these four cases.

### 4.3. Pipe Shape Optimisation

Lastly, we evaluate the mean functions on a real-world computational fluid dynamics (CFD) optimisation problem. The goal of the PitzDaily CFD problem (Daniels et al., 2018) is to minimise the pressure loss between a pipe's entrance (inflow) and exit (outflow) by optimising the shape of the pipe's lower wall. The loss is evaluated by generating a CFD mesh and simulating the two-dimensional flow using OpenFOAM (Jasak et al., 2007), making each function evaluation expensive. The decision variables in the problem are the control points of a Catmull-Clark subdivision curve that represents the lower wall's shape; see (Daniels et al., 2018) for a pictorial representation of these. In this work we use five two-dimensional control points, resulting in a 10-dimensional decision vector. The control points are constrained to lie in a polygon, rather than in a hypercube as for all previous problems, and, therefore, we draw the initial samples uniformly from within the constrained region rather than using LHS. Similarly, we use CMA-ES (Hansen, 2009) to optimise the acquisition functions and penalise locations that violate the constraints.

Figure 5. Convergence plots for the PitzDaily test problem using EI and UCB. Each plot shows the median regret, with shading representing the interquartile range of the 51 runs.

Convergence plots of the flow loss for the mean functions with EI and UCB are shown in Figure 5. The Arithmetic, Min and Max constant mean functions, when using EI, are all statistically equivalent and the best-performing. It is interesting to note the contrasting performance between the RandomForest mean function on this and the robot pushing tasks.

As shown in Figure 5, using the UCB acquisition function leads to the constant mean and Linear mean functions all having very similar median flow losses, with their inter-quartile ranges rapidly decreasing. This effect can also be seen towards the end of the optimisation runs with EI. Inspection of the solutions (control points) with the lowest flow losses revealed that they all had distinct values but that they led to very similar subdivision curves. This implies that they all represented essentially the same inner wall shape, and thus indicates the presence of either one large, valley-like global optimum or many global optima. We suggest that this may be the actual minimum flow loss achievable for this problem. All mean functions, in combination with both EI and UCB, were able to successfully discover solutions with a flow loss lower than that found by a local, gradient-based method in (Nilsson et al., 2014) that used approximately 500 function evaluations. This highlights the strength of Bayesian optimisation in general, because the convergence rates shown in Figure 5 are far more rapid for the majority of mean functions, and better solutions are realised than with the local, adjoint method.

## 5. Conclusion

We have investigated the effect of different mean functions in Bayesian optimisation using the expected improvement and upper confidence bound acquisition functions. This was assessed by performing BO on ten synthetic functions and two real-world problems. The constant mean function Max, equal to the worst function value observed so far, consistently out-performed the other mean functions on the synthetic functions in higher dimensions, and was statistically equivalent to the best-performing mean function on nine of the ten functions. We suggest that this is because it promotes exploitation, which can lead to rapid convergence in higher dimensions (De Ath et al., 2019; Rehbach et al., 2020), with exploration implicitly provided by the necessarily inaccurate surrogate model. On the two real-world problems this trend did not continue, but Max was still statistically equivalent to the commonly-used constant mean equal to the mean of the observations. We therefore recommend using the Max mean function in conjunction with expected improvement: at worst it gives the same performance as a zero mean, and it generally improves performance in higher dimensions.
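To make the Max mean function concrete, the following is a minimal sketch of a GP posterior mean with a constant prior mean set to the worst (largest, when minimising) observed value. The fixed RBF hyperparameters and function names are illustrative assumptions; in the experiments the hyperparameters are fitted to the data:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior_mean(X, y, Xstar, mean_fn=np.max, noise=1e-6):
    """GP posterior mean under a constant prior mean m = mean_fn(y).
    With mean_fn=np.max, m equals the worst observed value when minimising."""
    m = mean_fn(y)
    K = rbf_kernel(X, X) + noise * np.eye(len(X))   # training covariance + jitter
    Ks = rbf_kernel(Xstar, X)                       # cross-covariance to test points
    alpha = np.linalg.solve(K, y - m)               # K^{-1} (y - m)
    return m + Ks @ alpha
```

Far from the data the cross-covariance terms vanish and the prediction reverts to the worst observed value, so unexplored regions look unpromising and the acquisition function favours exploitation near good observations.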

Interestingly, the inconsistency in mean function performance between the synthetic and real-world problems may point to a larger issue in BO: synthetic benchmarks do not always exhibit the same types of functional landscapes as real-world problems. In future work we would like to characterise a function’s landscape during the optimisation procedure and adaptively select the best-performing components of the BO pipeline, e.g. mean function, kernel and acquisition function, to suit the problem structure.

This work focused on learning a mean function independently of the training of the GP. In further work we would like to jointly learn the parameters of the mean function, i.e. the weights in (9), alongside the hyperparameters of the GP itself. The efficacy of BO depends crucially on the surrogate model’s ability to accurately predict the function’s value in unvisited locations. We therefore look forward to evaluating a fully-Bayesian approach that marginalises over the mean function parameters and kernel hyperparameters. Although the Monte Carlo sampling required to evaluate the resulting acquisition functions may be substantial, an important question is whether fully-Bayesian models can significantly improve the convergence of Bayesian optimisation.
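A joint scheme of this kind could, for example, minimise the negative log marginal likelihood over the mean weights and kernel hyperparameters together. The sketch below shows one way to do this under the assumption of an RBF kernel and a linear mean; the synthetic data, parameterisation and jitter value are illustrative, not the paper’s method:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y):
    """Negative log marginal likelihood of a GP whose linear mean weights
    and RBF kernel hyperparameters are optimised jointly."""
    d = X.shape[1]
    w = params[:d + 1]                    # mean weights: intercept + one slope per dim
    log_ls, log_var = params[d + 1:]      # log-lengthscale, log-signal-variance
    m = w[0] + X @ w[1:]                  # linear prior mean at the training inputs
    r = y - m
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(log_var) * np.exp(-0.5 * d2 / np.exp(log_ls) ** 2) + 1e-6 * np.eye(len(X))
    try:
        L = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return 1e25                       # reject numerically unstable hyperparameters
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return 0.5 * r @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

# Jointly fit the mean weights and kernel hyperparameters on toy data
# drawn from a noisy linear function (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + 0.01 * rng.standard_normal(20)
x0 = np.zeros(5)                          # [w0, w1, w2, log_ls, log_var]
res = minimize(neg_log_marginal_likelihood, x0, args=(X, y), method="L-BFGS-B")
```

A fully-Bayesian treatment would instead place priors on these parameters and marginalise over them, e.g. by MCMC, at correspondingly greater cost.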

###### Acknowledgements.
This work was supported by Innovate UK [grant number Grant #3].

## References

• E. Acar (2013) Effects of the correlation model, the trend model, and the number of training points on the accuracy of kriging metamodels. Expert Systems 30 (5), pp. 418–428. Cited by: §1.
• The GPyOpt authors (2016) GPyOpt: a Bayesian optimization framework in Python. Cited by: §3.1.
• M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy (2019) BoTorch: programmable Bayesian optimization in PyTorch. External Links: 1910.06403 Cited by: §3.1, §4.
• C. M. Bishop (2006) Pattern recognition and machine learning. Springer, Berlin, Heidelberg. Cited by: §3.1.
• B. J. N. Blight and L. Ott (1975) A Bayesian approach to model inadequacy for polynomial regression. Biometrika 62 (1), pp. 79–88. Cited by: §1.
• L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32. Cited by: §3.1.
• E. Brochu, V. M. Cora, and N. De Freitas (2010) A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. External Links: 1012.2599 Cited by: §2.
• R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu (1995) A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16 (5), pp. 1190–1208. Cited by: §3, §4.
• S. J. Daniels, A. A. M. Rahat, R. M. Everson, G. R. Tabor, and J. E. Fieldsend (2018) A suite of computationally expensive shape optimisation problems using computational fluid dynamics. In Parallel Problem Solving from Nature – PPSN XV, pp. 296–307. External Links: ISBN 978-3-319-99259-4 Cited by: §4.3.
• A. Davies and Z. Ghahramani (2014) The random forest kernel and other kernels for big data from random partitions. External Links: 1402.4293 Cited by: §3.1.
• G. De Ath, R. M. Everson, J. E. Fieldsend, and A. A. M. Rahat (2020) ε-shotgun: ε-greedy batch Bayesian optimisation. External Links: 2002.01873 Cited by: §4.2.
• G. De Ath, R. M. Everson, A. A. M. Rahat, and J. E. Fieldsend (2019) Greed is good: exploration and exploitation trade-offs in Bayesian optimisation. External Links: 1911.12809 Cited by: §1, §1, §4.1, §4.2, §5.
• V. Fortuin, H. Strathmann, and G. Rätsch (2019) Meta-learning mean functions for Gaussian processes. External Links: 1901.08098 Cited by: §1.
• P. I. Frazier (2018) A tutorial on bayesian optimization. External Links: 1807.02811 Cited by: §2.
• P. Geurts, D. Ernst, and L. Wehenkel (2006) Extremely randomized trees. Machine Learning 63 (1), pp. 3–42. Cited by: §3.1.
• N. Hansen (2009) Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers, pp. 2389–2396. Cited by: §4.3.
• J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani (2014) Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pp. 918–926. Cited by: §1, §1, §2.
• S. Holm (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (2), pp. 65–70. Cited by: §4.1.
• T. Iwata and Z. Ghahramani (2017) Improving output uncertainty estimation and generalization in deep learning via neural network Gaussian processes. External Links: 1707.05922 Cited by: §1.
• H. Jasak, A. Jemcov, and U. Kingdom (2007) OpenFOAM: a C++ library for complex physics simulations. In International Workshop on Coupled Methods in Numerical Dynamics, IUC, pp. 1–20. Cited by: §4.3.
• S. Jiang, H. Chai, J. González, and R. Garnett (2019) Efficient nonmyopic Bayesian optimization and quadrature. External Links: 1909.04568 Cited by: §4.2.
• D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (4), pp. 455–492. External Links: ISSN 1573-2916 Cited by: §2.
• M. C. Kennedy and A. O’Hagan (2001) Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (3), pp. 425–464. Cited by: §1.
• J. D. Knowles, L. Thiele, and E. Zitzler (2006) A tutorial on the performance assessment of stochastic multiobjective optimizers. Technical report Technical Report TIK214, Computer Engineering and Networks Laboratory, ETH Zurich, Zurich, Switzerland. Cited by: §4.1.
• H. J. Kushner (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal Basic Engineering 86 (1), pp. 97–106. Cited by: §1, §2.
• M. Lindauer, M. Feurer, K. Eggensperger, A. Biedenkapp, and F. Hutter (2019) Towards assessing the impact of bayesian optimization’s own hyperparameters. External Links: 1908.06674 Cited by: §1.
• M. D. McKay, R. J. Beckman, and W. J. Conover (2000) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 42 (1), pp. 55–61. Cited by: §2, §4.
• J. Močkus, V. Tiešis, and A. Žilinskas (1978) The application of Bayesian methods for seeking the extremum. Towards Global Optimization 2 (1), pp. 117–129. Cited by: §1, §2.
• T. Mukhopadhyay, S. Chakraborty, S. Dey, S. Adhikari, and R. Chowdhury (2017) A critical assessment of kriging model variants for high-fidelity uncertainty quantification in dynamics of composite shells. Archives of Computational Methods in Engineering 24 (3), pp. 495–518. Cited by: §1.
• U. Nilsson, D. Lindblad, and O. Petit (2014) Description of adjointShapeOptimizationFoam and how to implement new objective functions. Technical report Chalmers University of Technology, Gothenburg, Sweden. Cited by: §4.3.
• M. J. L. Orr (1996) Introduction to radial basis function networks. Technical report University of Edinburgh. Cited by: §3.1.
• P. S. Palar, L. R. Zuhal, T. Chugh, and A. Rahat (2020) On the impact of covariance functions in multi-objective Bayesian optimization for engineering design. In AIAA Scitech 2020 Forum, Cited by: §1.
• P. S. Palar and K. Shimoyama (2019) Efficient global optimization with ensemble and selection of kernel functions for engineering design. Structural and Multidisciplinary Optimization 59 (1), pp. 93–116. Cited by: §1.
• C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning. The MIT Press, Boston, MA. External Links: ISBN 0-262-18253-X Cited by: §1, §1, §3.1, §3.
• F. Rehbach, M. Zaefferer, B. Naujoks, and T. Bartz-Beielstein (2020) Expected improvement versus predicted value in surrogate-based optimization. External Links: 2001.02957 Cited by: §1, §4.1, §5.
• E. Scornet (2016) Random forests and kernel methods. IEEE Transactions on Information Theory 62 (3), pp. 1485–1500. Cited by: §3.1.
• B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1, §1, §2, §3, §3.
• J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959. Cited by: §1, §1, §3.
• N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022. Cited by: §1, §2, §4.
• M. L. Stein (2012) Interpolation of spatial data: some theory for kriging. Springer Science & Business Media. Cited by: §1.
• Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez (2018) Active model learning and diverse action sampling for task and motion planning. In Proceedings of the International Conference on Intelligent Robots and Systems, pp. 4107–4114. Cited by: §4.2.
• Z. Wang and S. Jegelka (2017) Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 3627–3635. Cited by: §1, §1, §2, §4.2.
• D. Williamson, A. T. Blaker, C. Hampton, and J. Salter (2015) Identifying and removing structural biases in climate models with history matching. Climate dynamics 45 (5), pp. 1299–1324. Cited by: §1.
• D. H. Wolpert and W. G. Macready (1997) No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1 (1), pp. 67–82. Cited by: §1.