ε-shotgun: ε-greedy Batch Bayesian Optimisation

02/05/2020 ∙ by George De Ath, et al. ∙ University of Exeter, Swansea University

Bayesian optimisation is a popular, surrogate model-based approach for optimising expensive black-box functions. Given a surrogate model, the next location to expensively evaluate is chosen via maximisation of a cheap-to-query acquisition function. We present an ϵ-greedy procedure for Bayesian optimisation in batch settings in which the black-box function can be evaluated multiple times in parallel. Our ϵ-shotgun algorithm leverages the model's prediction, uncertainty, and the approximated rate of change of the landscape to determine the spread of batch solutions to be distributed around a putative location. The initial target location is selected either in an exploitative fashion on the mean prediction, or – with probability ϵ – from elsewhere in the design space. This results in locations that are more densely sampled in regions where the function is changing rapidly and in locations predicted to be good (i.e. close to predicted optima), with more scattered samples in regions where the function is flatter and/or of poorer quality. We empirically evaluate the ϵ-shotgun methods on a range of synthetic functions and two real-world problems, finding that they perform at least as well as state-of-the-art batch methods and in many cases exceed their performance.


1. Introduction

Global optimisation of non-convex and black-box functions is a common task in many real-world problems. These include hyperparameter tuning of machine learning algorithms (Snoek et al., 2012), drug discovery (Hernández-Lobato et al., 2017), analog circuit design (Lyu et al., 2018), mechanical engineering design (Chugh et al., 2017; Daniels et al., 2019; Rahat et al., 2018) and general algorithm configuration (Hutter et al., 2011). Bayesian optimisation (BO) has become a popular approach for optimising expensive, black-box functions that have no closed-form expression or derivative information (Snoek et al., 2012; Shahriari et al., 2016). It employs a probabilistic surrogate model of the function, built from the available function evaluations. The location at which the function is next expensively evaluated is chosen as the location that maximises an acquisition function (or infill criterion) that balances exploration and exploitation.

In real-world problems it is often possible to run multiple experiments in parallel by using modern hardware capabilities to expensively evaluate several locations at once. When optimising machine learning algorithms, for example, multiple model configurations can be evaluated in parallel across many processor cores on one or multiple machines (Chen et al., 2018; Kandasamy et al., 2018). Consequently, this has led to the development of batch (or parallel) BO algorithms, which use acquisition functions to select locations to be evaluated at each iteration. Clearly, a strictly serial evaluation makes the best overall use of the available CPU time because each new location to be evaluated is selected with the maximum available information. Parallel evaluation, however, holds the promise of substantially reducing the wall-clock time to locate the optimum.

The selection of a good set of locations to evaluate at each batch iteration is a non-trivial problem. In sequential BO, techniques which favour greedy exploitation of the surrogate model have been shown to be preferable to the more traditional acquisition functions (Rehbach et al., 2020; De Ath et al., 2019). De Ath et al. (2019), for example, show that using an ϵ-greedy strategy of exploiting the surrogate model the majority of the time and, with probability ϵ (where ϵ ≪ 1), randomly selecting a location to explore yields superior optimisation performance on a variety of synthetic and real-world problems. Consequently, in this work we investigate ϵ-greedy methods in the batch BO setting.

We present ϵ-shotgun, a novel approach to batch BO, which uses an ϵ-greedy strategy for selecting the first location x' in a batch, and then samples the remaining q − 1 points from a normal distribution centred on x', with a scale parameter determined by the surrogate model's posterior mean and variance at x' and the magnitude of the gradient in the vicinity of x'. This embodies maximum exploitation of the surrogate model the majority of the time by virtue of the ϵ-greedy choice of x'. The remaining locations may be exploratory or exploitative depending on the characteristics of the local landscape. Larger regions of decision space will be sampled when x' is surrounded by a relatively flat landscape, while denser sampling will occur where x' is in a locally steeper region, such as the landscape around a local (or global) optimum.

Our contributions can be summarised as follows:

  • We present ϵ-shotgun, a new batch Bayesian optimisation approach based on the ϵ-greedy strategy of exploiting the surrogate model.

  • We empirically compare a range of state-of-the-art batch Bayesian optimisers across a variety of synthetic test problems and two real-world applications.

  • We empirically show that the ϵ-shotgun approaches are equal to or better than several state-of-the-art batch BO methods on a wide range of problems.

We begin in Section 2 by briefly reviewing Bayesian optimisation along with Gaussian processes (the surrogate model generally used in BO) and common acquisition functions. Batch BO and the algorithm archetypes used for selecting batch locations are then reviewed in Section 2.2, which leads to the proposed approach in Section 3. Empirical evaluations on well-known test problems and two real-world applications are presented in Section 4. We finish with concluding remarks in Section 5.

2. Bayesian Optimisation

Our goal is to minimise a black-box function f, defined on a compact domain X ⊂ R^d. The function itself is unknown, but we have access to the results of its evaluations at any location x ∈ X. We are particularly interested in cases where evaluations are expensive, in terms of time and/or money, and we therefore seek to minimise f either in as few evaluations as possible, or as well as possible within a fixed budget.

2.1. Sequential Bayesian Optimisation

Bayesian optimisation (BO), also known as Efficient Global Optimisation, is a global search strategy that sequentially samples design space at locations that are likely to contain the global optimum, taking into account the predictions of the surrogate model and their associated uncertainty (Jones et al., 1998). It starts by generating initial sample locations with a space-filling algorithm, typically Latin hypercube sampling (McKay et al., 2000), and expensively evaluates them with the function f. This collected set of observations forms the dataset D with which the surrogate model is initially trained. Following model training, and at each iteration of BO, the next location for expensive evaluation is selected according to an acquisition function (or infill criterion). These usually combine the surrogate model's prediction and prediction uncertainty over the design space to balance the exploitation of promising solutions (those with good predicted values) against solutions with high uncertainty. The location x' maximising this criterion is used as the next point to be expensively evaluated. The dataset is augmented with x' and f(x') and the process is repeated until the budget is exhausted. The value of the global minimum is then estimated as the best function evaluation seen during the optimisation run, i.e. f* = min_i f(x_i).
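The sequential loop just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: `fit_model` and `propose` are placeholder arguments standing in for the surrogate fit and the acquisition maximisation, and a plain uniform design stands in for Latin hypercube sampling.

```python
import numpy as np

def bayes_opt_loop(f, bounds, n_init, budget, fit_model, propose):
    """Generic sequential BO loop (sketch). `fit_model(X, y)` trains a
    surrogate; `propose(model, bounds)` maximises an acquisition function."""
    dim = bounds.shape[0]
    # Space-filling initial design (plain uniform here in place of LHS).
    X = np.random.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, dim))
    y = np.array([f(x) for x in X])
    while len(y) < budget:
        model = fit_model(X, y)          # train surrogate on data so far
        x_next = propose(model, bounds)  # next location to evaluate
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))      # expensive evaluation
    best = np.argmin(y)
    return X[best], y[best]              # estimate of the minimiser and f*
```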

2.1.1. Gaussian Processes

Gaussian processes (GPs) are a popular and versatile choice of surrogate model for f, due to their strengths in function approximation and uncertainty quantification (Rasmussen and Williams, 2006). A GP is a collection of random variables, any finite number of which are jointly Gaussian distributed. A GP prior over f can be defined as f(x) ∼ GP(m(x), κ(x, x̃ | θ)), where m(·) is the mean function, κ(·, · | θ) is the kernel function (also known as a covariance function) and θ are the hyperparameters of the kernel. Given data D consisting of f evaluated at t sampled locations, the posterior estimate of f at a location x* is a Gaussian distribution:

(1)   f(x*) | D, θ ∼ N(μ(x*), σ²(x*))

with mean and variance

(2)   μ(x*) = κ(x*, X) K⁻¹ f
(3)   σ²(x*) = κ(x*, x*) − κ(x*, X) K⁻¹ κ(X, x*)

where X ∈ R^{t×d} is the matrix with one input location in each row and f = (f(x_1), …, f(x_t))ᵀ is the corresponding vector of true function evaluations. The matrix K ∈ R^{t×t} contains the kernel evaluated at each pair of observations, K_{ij} = κ(x_i, x_j), and κ(x*, X) is the t-dimensional vector whose elements are κ(x*, x_i). Kernel hyperparameters θ are learnt via maximising the log likelihood:

(4)   log L(θ | D) = −½ fᵀ K⁻¹ f − ½ log |K| − (t/2) log 2π

For notational simplicity, we drop explicit dependencies on the data and kernel hyperparameters from now on.
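Equations (2)–(3) translate directly into code. The following sketch computes the posterior mean and variance on a set of query points; a squared-exponential kernel stands in for whichever covariance function is used, and all names are illustrative rather than from the paper's implementation.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0, amp=1.0):
    """Squared-exponential kernel (a stand-in for any valid choice)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amp * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, f, Xstar, kernel=sq_exp_kernel, jitter=1e-9):
    """Posterior mean and variance, Eqs. (2)-(3):
    mu = k(x*,X) K^-1 f,  s2 = k(x*,x*) - k(x*,X) K^-1 k(X,x*)."""
    K = kernel(X, X) + jitter * np.eye(len(X))   # jitter for stability
    Ks = kernel(Xstar, X)                        # cross-covariance
    Kss = kernel(Xstar, Xstar)
    alpha = np.linalg.solve(K, f)
    mu = Ks @ alpha
    v = np.linalg.solve(K, Ks.T)
    s2 = np.diag(Kss) - np.sum(Ks * v.T, axis=1)
    return mu, s2
```

At the training locations themselves the posterior mean interpolates the observations and the variance collapses towards zero, as the equations require.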

2.1.2. Acquisition Functions

An acquisition function α(x) is used to measure the anticipated quality of expensively evaluating f at any given location x: the location that maximises the acquisition function is chosen as the next location for expensive evaluation. While this strategy may appear merely to transfer the problem of optimising f to that of maximising α(x), the acquisition function is cheap to evaluate, so the location of its global optimum can be cheaply estimated, for example with an evolutionary algorithm.

Acquisition functions attempt, either implicitly or explicitly, to balance the trade-off between maximally exploiting the surrogate model, i.e. selecting a location with the best predicted value, and maximally exploring the model, i.e. selecting the location with the most uncertainty. Perhaps the two most widespread acquisition functions are Expected Improvement (EI) (Jones et al., 1998) and the Upper Confidence Bound (UCB) (Srinivas et al., 2010). EI measures the positive expected improvement over the best solution observed so far, and UCB is a weighted sum of the surrogate model's mean prediction μ(x) and uncertainty σ(x). Both were shown (De Ath et al., 2019) to be monotonic with respect to improvements in μ and increases in σ, so that the solutions maximising them belong to the Pareto set of locations which maximally trade off exploitation (minimising μ) and exploration (maximising σ).

Recently, ϵ-greedy approaches have been successfully used as acquisition functions (De Ath et al., 2019). These select a maximally exploitative solution with probability 1 − ϵ and a random solution with probability ϵ. De Ath et al. (2019) present two methods for selecting the random solution: either uniformly from the feasible space X, or from the approximate Pareto set of solutions of the surrogate model's mean prediction and variance. They showed that ϵ-greedy approaches are particularly effective on higher-dimensional problems, and that performing pure exploitation (i.e. ϵ = 0) is competitive with the best-performing methods. This result was recently confirmed by Rehbach et al. (2020), who empirically show that solely using the surrogate model's predicted value performs better than EI on most problems with a dimensionality of 5 or more.
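For concreteness, EI (for minimisation) and the ϵ-greedy selection rule can be sketched as follows over a discrete candidate set. Function names are ours, and the discretisation is a simplification: in practice the acquisition function is optimised over the continuous domain.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for minimisation: E[max(f_best - f(x), 0)] under the GP posterior."""
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def eps_greedy_choice(candidates, mu, eps, rng):
    """With probability 1 - eps exploit the best mean prediction;
    otherwise pick a uniformly random candidate (random-selection flavour)."""
    if rng.random() < eps:
        return candidates[rng.integers(len(candidates))]
    return candidates[np.argmin(mu)]
```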

2.2. Batch Bayesian Optimisation

In batch Bayesian optimisation (BBO) the goal is to select a batch of q promising locations to expensively evaluate in parallel. One of the earliest BBO approaches, the qEI method of Ginsbourger et al. (2008), generalised the sequential EI acquisition function to a batch setting in which all batch locations are jointly estimated. However, it is not analytically tractable to compute qEI, even for small batch sizes (González et al., 2016a). Although a fast approximation to qEI does exist (Chevalier and Ginsbourger, 2013), it is not faster than naive Monte Carlo approximations for larger batch sizes. More recently, Wang et al. (2016) proposed a more efficient algorithm to estimate the gradient of qEI, but the approach still results in having to optimise in a (q × d)-dimensional space for each set of batch locations. Two other methods that jointly optimise the batch of locations, parallel predictive entropy search (Shah and Ghahramani, 2015) and the parallel knowledge gradient method (Wu and Frazier, 2016), have also been shown to scale poorly as batch size increases (Daxberger and Low, 2017).

Consequently, iteratively selecting the batch sample locations has become the prevailing methodology. One such strategy attempts to ensure that different locations are selected for the batch by, for each of the q locations, sampling a realisation from the surrogate model posterior (Thompson sampling) and minimising it (Kandasamy et al., 2018). However, this relies on there being sufficient uncertainty in the model to allow the realisations to have different minima (De Palma et al., 2019). De Palma et al. (2019) instead proposed sampling from a distribution of acquisition functions, or rather from the distribution of hyperparameters that control the acquisition function's behaviour, such as the trade-off between exploration and exploitation in UCB.
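A discretised sketch of Thompson-sampling batch selection follows: q joint posterior samples are drawn over a candidate set and each realisation is minimised. This is illustrative only (names are ours); Kandasamy et al. sample and optimise continuous realisations rather than a fixed grid.

```python
import numpy as np

def thompson_batch(X, y, q, Xcand, posterior, rng):
    """Batch selection by Thompson sampling (sketch).
    `posterior(X, y, Xcand)` returns (mean, covariance) on the candidates."""
    mu, cov = posterior(X, y, Xcand)
    batch = []
    for _ in range(q):
        sample = rng.multivariate_normal(mu, cov)  # one GP realisation
        batch.append(Xcand[np.argmin(sample)])     # its minimiser joins batch
    return np.array(batch)
```

As the surrounding text notes, if the posterior covariance is too small every realisation shares the same minimiser and the whole batch collapses onto one location.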

Instead of relying on the stochasticity of either the surrogate model or the acquisition function's hyperparameters, another group of methods penalises the regions from which a batch point has already been selected, so that nearby locations are less likely (or unable) to be selected. A well-known heuristic to achieve this is to hallucinate the results of pending evaluations (Azimi et al., 2010; Ginsbourger et al., 2010; Desautels et al., 2014). In this set of methods, the first batch location is selected by optimising an acquisition function, and subsequent locations are chosen by incorporating the predicted outcomes of the already-selected batch locations into the surrogate model and optimising the acquisition function over the new model. The popular Kriging Believer method (Ginsbourger et al., 2010) uses the surrogate's mean prediction as the hallucinated value, which reduces the model's posterior uncertainty to zero at the hallucinated locations without affecting the posterior mean.
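The Kriging Believer idea can be sketched as follows, with the surrogate fit and acquisition maximisation left abstract. All names are illustrative; `fit_gp` and `acq_argmax` are assumed interfaces, not part of any particular library.

```python
import numpy as np

def kriging_believer_batch(X, y, q, fit_gp, acq_argmax):
    """Kriging Believer (sketch): hallucinate each selected point's outcome
    with the posterior mean, refit, and repeat until q points are chosen.
    `fit_gp(X, y)` returns a model with .predict(x) -> (mu, var);
    `acq_argmax(model)` maximises the acquisition function."""
    Xh, yh = X.copy(), y.copy()
    batch = []
    for _ in range(q):
        model = fit_gp(Xh, yh)
        x_new = acq_argmax(model)
        mu, _ = model.predict(x_new)   # hallucinated observation
        Xh = np.vstack([Xh, x_new])
        yh = np.append(yh, mu)         # posterior mean stands in for f(x_new)
        batch.append(x_new)
    return np.array(batch)
```

Because the hallucinated value equals the posterior mean, refitting leaves the mean unchanged but zeroes the variance at the selected points, discouraging the acquisition function from re-selecting them.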

An alternative to penalising the surrogate model via hallucination is to penalise an acquisition function in a region around the selected batch points (González et al., 2016a; Alvi et al., 2019). In these methods, the first point in a batch is selected via maximisation of a sequential acquisition function α(x), e.g. EI; the subsequent locations are chosen by iteratively maximising a penalised version of the sequential acquisition function:

(5)   α'(x) = α(x) ∏_j ψ(x; x_j)

where ψ(x; x_j) are local penalisers centred at the already-selected batch locations x_j. These penalise a region around x_j with decreasing penalisation as the distance from x_j increases; for example, González et al. (2016a) use a squared exponential function. The length scale over which the penalisation is significant is set by

(6)   r_j = (μ(x_j) − f* + γ σ(x_j)) / L

where f* is equal to the global minimum of the function, L is a valid Lipschitz constant expressing how rapidly f can change with x, and γ weights the importance of the uncertainty about f(x_j).

In practice, the true value of f* is unknown and therefore the best value seen so far, min_i f(x_i), is used in its place. It can be shown (González et al., 2016a) that L = max_{x∈X} ‖∇f(x)‖ is a valid Lipschitz constant, and González et al. (2016a) approximate this using the surrogate model's mean prediction, L ≈ max_{x∈X} ‖∇μ(x)‖, resulting in a global estimate of the largest gradient in the model that is fixed for all selected batch locations. Alvi et al. (2019) argue that this under-penalises flatter regions of space in which the estimated gradient of the function is much smaller, and they therefore calculate a different value of L for each selected batch location, estimating it within a kernel length-scale of each location. González et al. (2016a) set γ = 0 and focus only on the difference between the predicted value at x_j and the global optimum, whereas Alvi et al. (2019) let γ = 1 to also include the prediction uncertainty. This penalisation radius shrinks as the predicted value at x_j approaches the global minimum, and also as the largest local (or global) gradient of the model increases.
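The radius in (6) and a crude Lipschitz estimate over a grid can be written as follows. This is an illustrative sketch, not either reference implementation: setting γ = 0 recovers the LP-style radius and γ = 1 the PLAyBOOK-style one, and the grid-based maximum is a simple stand-in for properly optimising the gradient norm.

```python
import numpy as np

def penalisation_radius(mu_x, sigma_x, f_best, L, gamma=1.0):
    """Radius of significant penalisation, Eq. (6):
    r = (mu(x_j) - f* + gamma * sigma(x_j)) / L."""
    return (mu_x - f_best + gamma * sigma_x) / L

def lipschitz_from_mean(grad_mu, X_grid):
    """Lipschitz estimate: largest gradient norm of the posterior mean
    over a set of locations (a crude stand-in for an optimiser)."""
    return max(np.linalg.norm(grad_mu(x)) for x in X_grid)
```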

Motivated by the success of the sequential exploitative and ϵ-greedy approaches (De Ath et al., 2019; Rehbach et al., 2020), we invert the local penalisation strategy and instead present a method that samples from within the region that would usually be penalised (6). We empirically show that this approach out-performs recent BBO methods on a range of synthetic functions and two real-world problems.

3. ϵ-Shotgun BBO

Motivated by the recent success of ϵ-greedy methods, we extend the two sequential ϵ-greedy methods of De Ath et al. (2019) to the batch setting. We use an ϵ-greedy acquisition function to generate the first batch location x' and then sample the remaining q − 1 locations from a normal distribution centred on x', with a standard deviation given by:

(7)   σ_s = (μ(x') − f* + σ(x')) / L̃

Sampling in this manner creates a scattered set of batch points around x', akin to a shotgun blast, whose approximate spread is determined by the amount of model uncertainty at x', its predicted value relative to the best seen function evaluation, and the steepest gradient within its vicinity.

Inputs:
q: Batch size
ϵ: Proportion of the time to explore
ℓ: Kernel length scale
1: if unif(0, 1) < ϵ then
2:     if using Pareto front selection then
3:         P̃ ← approximate Pareto set of (μ(x), σ(x))
4:         x' ← random choice from P̃
5:     else
6:         x' ← uniform random sample from X
7: else
8:     x' ← argmin_x μ(x)
9: L̃ ← max ‖∇μ(x)‖ over a hypercube centred on x'   ▷ Largest local gradient
10: f* ← min_i f(x_i)   ▷ Best seen function value
11: σ_s ← (μ(x') − f* + σ(x')) / L̃
12: B ← {x'}
13: while |B| < q do
14:     x ∼ N(x', σ_s² I)
15:     if x ∈ X then
16:         B ← B ∪ {x}
17: return B
Algorithm 1. ϵ-shotgun query point selection for BBO

We describe two alternative strategies employing this idea, which are summarised in Algorithm 1. The first method, which we call ϵ-shotgun with Pareto front selection (ϵs-PF), selects with probability 1 − ϵ the location with the most promising mean prediction from the surrogate model (line 8). In the remaining cases it selects a random element from the approximate Pareto set of locations trading off μ(x) and σ(x), which is found using an evolutionary multi-objective optimiser (lines 3 and 4).

Following the selection of x', ϵs-PF samples the remaining q − 1 locations from a normal distribution centred on x' with a standard deviation equal to the radius (7) of penalisation used in (Alvi et al., 2019) (lines 9 to 16). We conservatively estimate the global optimum to be the best function evaluation seen so far, i.e. f* = min_i f(x_i). The localised Lipschitz constant, similarly to Alvi et al. (2019), is estimated as L̃ = max_{x∈H} ‖∇μ(x)‖, where H is a hypercube centred on x' with side lengths of twice the length-scale of the surrogate model's kernel. This allows the local gradient to influence the size of the sampling region. If the local gradient is small, then it is beneficial to sample over a wide region to learn more about the structure of f. Conversely, a steeper local gradient indicates that the modelled function is changing rapidly, and therefore sampling more densely (due to a larger L̃, and hence smaller σ_s) is required to accurately model f and guide the search.
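Lines 9–16 of Algorithm 1 amount to rejection-sampling from a normal distribution restricted to the feasible space. A minimal sketch (our names, not the reference implementation; `L_local` is assumed to have been estimated as described above):

```python
import numpy as np

def shotgun_batch(x_prime, mu_xp, sigma_xp, f_best, L_local, q, bounds, rng):
    """Sample the remaining q-1 batch points around x' (Algorithm 1, sketch).
    Spread follows Eq. (7): sigma_s = (mu(x') - f* + sigma(x')) / L_local,
    rejecting draws that fall outside the feasible space."""
    sigma_s = (mu_xp - f_best + sigma_xp) / L_local
    batch = [x_prime]
    while len(batch) < q:
        x = rng.normal(loc=x_prime, scale=sigma_s)   # isotropic draw
        if np.all(x >= bounds[:, 0]) and np.all(x <= bounds[:, 1]):
            batch.append(x)                          # keep feasible samples
    return np.array(batch)
```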

Figure 1 shows three example locations of x' for a surrogate model and their corresponding probability density functions from which samples would be drawn. The blue circle is located at the minimum of the modelled function, and its corresponding sampling radius is relatively small because μ(x') − f* is small, the local gradient is small, and the predicted uncertainty is relatively large, resulting in a fairly sharp distribution to sample from. The red circle corresponds to a location with a larger sampling radius than the blue point because, while the difference between the modelled function and f*, and the predicted uncertainty, are similar to those at the blue location, the local gradient is smaller. Lastly, the black location has a similar model uncertainty to the red location, but a much larger μ(x') − f*, resulting in a much wider pdf, even when taking into account the larger local gradient.

Figure 1. x' selection example. The upper panel shows the predicted mean and uncertainty (green) of an unknown function sampled at four locations (red crosses). The lower panel shows the pdfs of N(x', σ_s²), centred on the three correspondingly coloured locations of x' in the upper figure.

The second strategy, ϵ-shotgun with random selection (ϵs-RS), is identical to ϵs-PF except that, with probability ϵ, it selects x' at a random location in the entire feasible space (line 6), instead of a location from the approximate Pareto set. It might be expected that ϵs-PF would outperform ϵs-RS, because it selects locations that are more informative to the optimisation process, being non-dominated with respect to their predicted value and uncertainty. However, since De Ath et al. (2019) show that sequential ϵ-greedy methods based on selection from the Pareto set have only marginally better performance than purely random selection, we include ϵs-RS to assess whether this also holds in the batch setting.

4. Experimental Evaluation

We investigate the performance of the two proposed methods, ϵs-PF and ϵs-RS, on ten well-known benchmark functions of differing dimensionality and on two real-world applications: an active learning problem for robot pushing and pipe shape optimisation. Full results of all experimental evaluations are available in the supplementary material.

Following the reported success of purely exploitative methods (De Ath et al., 2019; Rehbach et al., 2020), we also compare ϵs-PF and ϵs-RS to the purely exploitative ϵ-shotgun method without any random point selection (i.e. ϵ = 0). We also compare the batch ϵ-greedy methods to five BBO methods representative of the different styles of batch optimisation discussed in Section 2.2. Two are acquisition function-based penalisation methods: the popular Local Penalisation (LP) method (González et al., 2016a), which uses soft penalisation, and the more recent PLAyBOOK (Alvi et al., 2019), which uses hard local penalisation. Kriging Believer (KB) penalises by hallucinating the already-selected batch points (Ginsbourger et al., 2010), and the Thompson sampling (TS) method of Kandasamy et al. (2018) minimises realisations of the modelled function drawn from the surrogate. Lastly, we include qEI (Ginsbourger et al., 2008), which jointly estimates the locations of the batch members. All methods were implemented in Python using the same packages (implementation available: https://github.com/XXXX/XX), apart from the local penalisers of LP and PLAyBOOK, which used the PLAyBOOK implementation (https://github.com/a5a/asynchronous-BO).

A zero-mean Gaussian process surrogate model with an isotropic Matérn kernel was used in all experiments. This kernel was selected due to its widespread usage and recommended use for modelling realistic functions (Snoek et al., 2012). The models were initially trained on observations generated by maximin Latin hypercube sampling (McKay et al., 2000), with each optimisation run repeated 51 times with different initialisations. The same sets of initial locations were common across all methods to enable statistical comparison. At each iteration, before batch point selection, the hyperparameters of the GP were optimised by maximising the log likelihood (4) with L-BFGS-B (Byrd et al., 1995) using 10 restarts (GPy, since 2012).

The LP, PLAyBOOK and KB methods all used the EI acquisition function. For each location selected by LP and PLAyBOOK, we followed the authors' guidelines (Alvi et al., 2019) and uniformly sampled the acquisition function, selecting the best location after locally optimising (with L-BFGS-B) the best 5 samples. For the other methods, a fixed budget of acquisition function evaluations was used in conjunction with L-BFGS-B for the lower-dimensional functions, and for the higher-dimensional functions we used CMA-ES with the standard bi-population strategy (Hansen, 2009) and (up to) 9 restarts. The approximate Pareto set of non-dominated locations (in terms of μ and σ) was found using NSGA-II (Deb et al., 2001) with standard settings for the population size, mutation and crossover rates, and distribution indices. For both ϵs-PF and ϵs-RS we took ϵ = 0.1.

4.1. Synthetic Experiments

Name                                     | d  | Name                             | d
WangFreitas (Wang and de Freitas, 2014)  | 1  | logSixHumpCamel†                 | 2
Branin†                                  | 2  | modHartman6†                     | 6
BraninForrester (Forrester et al., 2008) | 2  | logGSobol (González et al., 2016a) | 10
Cosines (González et al., 2016b)         | 2  | logRosenbrock†                   | 10
logGoldsteinPrice†                       | 2  | logStyblinskiTang†               | 10

Table 1. Synthetic functions used and their dimensionality d. Formulae can be found as cited or, for those labelled with †, at http://www.sfu.ca/~ssurjano/optimization.html.

The methods were evaluated on the 10 synthetic benchmark functions in Table 1 with a range of batch sizes and a fixed budget of 200 function evaluations. Table 2 shows, for a batch size of 10, the median difference (over the 51 repeated experiments) between the estimated optimum f* and the true optimum, as well as the median absolute deviation from the median (MAD), a robust measure of dispersion. The method with the minimum median for each function is highlighted in red, and those that are statistically equivalent to the best method, according to a one-sided, paired Wilcoxon signed-rank test (Knowles et al., 2006) with Holm-Bonferroni correction (Holm, 1979), are shown in blue. Note that tabulated results for all batch sizes are available in the supplementary material.

Methods compared: LP, PLAyBOOK, KB, qEI, TS, ϵs-PF (ϵ = 0.1), ϵs-RS (ϵ = 0.1), on WangFreitas (1), BraninForrester (2), Branin (2), Cosines (2), logGoldsteinPrice (2), logSixHumpCamel (2), modHartman6 (6), logGSobol (10), logRosenbrock (10) and logStyblinskiTang (10). [The numerical median and MAD entries of this table are not recoverable from this extraction; see the original paper.]
Table 2. Optimisation results with a batch size of 10. Median absolute distance from the optimum (left) and median absolute deviation from the median (MAD, right) after 20 batches (200 function evaluations) across the 51 runs. The method with the lowest median performance is shown in red, with those with statistically equivalent performance shown in blue.

Figure 2 shows the convergence plots of the various algorithms on six test problems for three batch sizes. As might be expected, qEI tends to perform worse as q increases, which we suspect is linked to the dimensionality of the qEI acquisition function being q × d. For the 10-dimensional functions and the larger batch sizes, this requires global optimisation in a space of a hundred or more dimensions, a far from trivial task. The Thompson sampling method (TS) relies upon there being sufficient stochasticity in the surrogate model to select batch locations that are well distributed in space. If there is too much or too little variation in the locations selected, as appears to be the case in these results, the batch will be selected either in locations with poor mean values (too much variation) or all at the same location (too little variation). Similar performance results for TS are also shown in (Alvi et al., 2019).

Figure 2. Illustrative convergence plots for six benchmark problems and three batch sizes. Each plot shows the median difference between the best function value seen f* and the true optimum, with shading representing the interquartile range across the 51 runs.

As shown in the convergence plots and Table 2, the ϵ-shotgun batch algorithms ϵs-PF and ϵs-RS both performed well across the range of synthetic problems for all batch sizes. The purely exploitative variant (ϵ = 0), which always samples at and around the surrogate's best mean prediction, also performed well across the majority of synthetic functions and was statistically equivalent to the ϵ-greedy variants on most functions and batch sizes. This indicates that fully exploiting the model at each iteration and learning about the best mean prediction's local landscape (via sampling its local neighbourhood) is a sound strategy, and mirrors the findings, that being greedy is good, of Rehbach et al. (2020) and De Ath et al. (2019) in the sequential setting.

Interestingly, on the modHartman6 function in particular (Figure 2, lower-left), a batch size of 20 led to better median performance than a batch size of 5, even though there were 4 times fewer batches (10 instead of 40) and therefore the surrogate model was fitted far fewer times. This indicates that the model poorly estimated the underlying function, thus misleading the optimisation process. In general, however, the expected trend prevails: an increase in q led to a worsening of the median performance as well as a decrease in the rate of convergence.

The acquisition penalisation-based methods, LP and PLAyBOOK, performed similarly, with LP slightly ahead of PLAyBOOK. The dominating factor setting the penalisation radii (6) in both methods is the Lipschitz constant, estimated as the largest value of ‖∇μ(x)‖ over the whole problem domain for LP and locally for PLAyBOOK. Since the global Lipschitz constant will always be at least as large as a local one, it is perhaps unsurprising that LP performs better: a larger constant corresponds to a smaller radius of penalisation, meaning that the batch points will be, on average, closer together and therefore more similar to those of the better-performing ϵ-shotgun methods.

Convergence plots for ϵs-PF and ϵs-RS have a well-defined step-like appearance for several test functions, which is particularly visible in the plots with larger batch sizes. This is a consequence of the batch selection process, because the first location in the batch minimises the surrogate model's mean function (recall Algorithm 1, line 8). It does, however, imply that the sequential strategy is driving the optimisation process, as the subsequent evaluations in the batch generally show little improvement over f(x'). This also means that the locations sampled around x' are useful because they improve the surrogate model's accuracy, allowing the surrogate's mean prediction to drive the optimisation.

Figure 3. Synthetic function optimisation summary. Symbols correspond to the proportion of times that a method is best or statistically equivalent to the best method across the 10 synthetic functions, for each batch size.

Figure 3 summarises the performance of each of the evaluated methods for each batch size. We note that the ϵ-shotgun methods are consistently the best or statistically indistinguishable from the best-performing methods across the set of benchmark functions and all batch sizes. Interestingly, the older Kriging Believer (Ginsbourger et al., 2010), which penalises the surrogate model's variance around selected batch points, performed better than the newer, acquisition-based penalisers, particularly for larger batch sizes. Its increase in relative performance may be related to the particular acquisition function used because, as shown in (De Ath et al., 2019), EI weights improvements over the current best value much more highly than increases in variance. This may lead to the variance penalisation in KB having a smaller radius of effect than the penalisation in EI-space of the LP and PLAyBOOK methods, resulting in KB sampling locations closer together, in a fashion more similar to the ϵ-shotgun methods.

As shown in Figure 3, for the ϵ-shotgun algorithms there is little to differentiate overall between selecting a location at random from the Pareto front (ϵs-PF) and uniformly across the feasible space (ϵs-RS). However, ϵs-PF appears to be marginally better on lower-dimensional functions, most likely because the surrogate model better describes the overall structure of the modelled function. Conversely, ϵs-RS is slightly better on higher-dimensional functions, because the modelled function will naturally be of poorer quality and therefore relying solely on it, without sufficient stochasticity, could hinder the optimisation process.

Pure exploitation, i.e. ϵ = 0, leads to state-of-the-art performance for many problems and dimensionalities. However, for problems in which a large amount of exploration is needed in order to locate a deceptive optimum, methods with little exploration are unable to escape local minima or do not expend enough of their optimisation budget exploring the landscape. This is particularly apparent on the WangFreitas problem (Wang and de Freitas, 2014), which has a large, shallow local minimum and a narrow, deep global minimum surrounded by plateaus; see (Wang and de Freitas, 2014) for a plot. Figure 4 compares the performance of the ϵ-shotgun methods for different ϵ on this problem. Note how ϵs-RS (green) is able to more consistently find the global optimum for smaller values of ϵ because, unlike ϵs-PF (red), it is not constrained to select only non-dominated areas of decision space. ϵs-PF, on the other hand, is consistently misled by the surrogate model's incorrect estimation of the function and therefore fails to correctly optimise it even for larger values of ϵ.

Figure 4. Distribution of the difference between the best value found and the true optimum after 200 function evaluations, taken over 51 runs, for ε-RS (green) and ε-PF (red, hatched), for different values of ε (horizontal axis), on the WangFreitas test problem at a fixed batch size.

4.2. Active Learning for Robot Pushing

Following (Wang and Jegelka, 2017; Jiang et al., 2019; De Ath et al., 2019), we optimise the control parameters for two active learning robot pushing problems (Wang et al., 2018); see (De Ath et al., 2019) for diagrams. In the first problem, push4, a robot is required to push an object towards an unknown target location. It receives the object-target distance once it has finished pushing. Its movement is constrained such that it can only travel in the direction of the object’s initial location. The parameters to be optimised are the robot’s initial location, the orientation of its hand and how long it travels for. Optimising the values of the four parameters to reduce the final object-target distance can thus be cast as a minimisation problem. The object’s initial location is always set to the centre of the domain (Wang and Jegelka, 2017; De Ath et al., 2019), but the target location is changed in each optimisation run, with the same set of targets used by every method to ensure a fair comparison. The performance of an optimisation algorithm is therefore averaged over problem instances, rather than over the same function with different initialisations as with the synthetic functions previously.

Similarly, in the second problem push8, two robots push their own objects towards unknown targets, with the complication that they may block each other’s path. The 8 parameters controlling both robots can be optimised to minimise the summed final object-target distances. Initial object locations were fixed and target positions were generated randomly, while ensuring that each pair of target positions allowed each object to touch its target without the objects overlapping. However, in some problem instances it is not possible for both robots to push their objects to their targets because the objects may be positioned such that the robots need to cross each other’s paths. In order to report the difference between the best found value and the true optimum, we estimate the optimum of each problem instance by randomly sampling sets of robot parameters in the feasible space and locally optimising the best 100 of these using L-BFGS-B. We therefore report the difference between the algorithm’s best value and this estimated global optimum.
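The estimation of each instance’s global optimum can be sketched as below. Note that the number of initial random samples is elided in the text, so `n_samples` here is a placeholder; the local refinement of the best samples with L-BFGS-B follows the description above.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_optimum(f, bounds, n_samples=10000, n_refine=100, seed=0):
    """Estimate a problem instance's global minimum: uniformly sample the
    feasible space, then locally refine the best n_refine samples with
    L-BFGS-B. Sample counts are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds).T
    X = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    fX = np.apply_along_axis(f, 1, X)
    # Refine the n_refine best random samples with a bounded local optimiser.
    starts = X[np.argsort(fX)[:n_refine]]
    results = [minimize(f, x0, method="L-BFGS-B", bounds=bounds)
               for x0 in starts]
    return min(r.fun for r in results)
```

Taking the minimum over many refined starts guards against the local optimiser converging to a poor basin on multimodal instances.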

Figure 5. Convergence plots for the robot pushing problems (rows) over three batch sizes (columns). Each plot shows the median difference from the estimated optimum, with shading representing the interquartile range across the 51 runs.

Figure 5 shows the convergence of the evaluated methods for the three batch sizes. In the four-dimensional push4 problem, the ε-shotgun methods have statistically equivalent performance to one another, but outperform the other methods at the two larger batch sizes. KB also has statistically similar performance to the exploitative ε-shotgun methods at the largest batch size, echoing its efficacy on the synthetic problems, where it also improved comparatively with increasing batch size.

In the harder push8 problem, all methods were statistically equivalent at the smallest batch size, while TS and KB were worse at the intermediate size; they were also worse, along with LP, at the largest. The ε-shotgun methods consistently had the lowest median fitnesses, although other techniques were statistically equivalent to them. Interestingly, and echoing the results on the synthetic modHartman6 function, they had lower median fitness values at the larger batch sizes than at the smaller ones.

4.3. Pipe Shape Optimisation

We also evaluated the BBO methods on a real-world computational fluid dynamics (CFD) design problem. The PitzDaily test problem (Daniels et al., 2018) involves reducing the pressure loss along a pipe with different inflow and outflow diameters by optimising the pipe’s internal shape. The optimisation aims to find the shape of the lower wall of the pipe that minimises the pressure loss. The loss is evaluated by running a CFD mesh generation and partial differential equation simulation of the two-dimensional flow. Each function evaluation takes between 60 and 90 seconds, depending on mesh complexity.

The pipe’s lower wall geometry is represented by a Catmull-Clark subdivision curve, whose control points comprise the decision variables. We use 5 control points, resulting in a 10-dimensional decision vector. The control points are constrained to lie within a polygon, and the initial locations used in the optimisation runs are uniformly sampled within this constrained domain. Similarly, for the batch optimisation methods themselves, we take the naive approach of rejection sampling when optimising the acquisition functions for KB, qEI and the ε-shotgun approaches; for LP, PLAyBOOK and TS we do not consider locations that violate the constraints.
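The rejection-sampling constraint handling can be sketched as follows for a single two-dimensional control point: candidates are drawn uniformly in a bounding box, those outside the feasible polygon are discarded, and the acquisition function is maximised over the survivors. The point-in-polygon test and acquisition function here are illustrative stand-ins, not the implementation used in the experiments.

```python
import numpy as np

def in_polygon(p, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def rejection_sample(acq, poly, bbox, n, rng):
    """Maximise an acquisition function over a polygonal feasible region by
    uniform sampling in the bounding box and discarding infeasible points."""
    (lx, ly), (hx, hy) = bbox
    pts = rng.uniform([lx, ly], [hx, hy], size=(n, 2))
    feasible = np.array([p for p in pts if in_polygon(p, poly)])
    return feasible[np.argmax([acq(p) for p in feasible])]
```

This naive scheme wastes samples when the polygon occupies a small fraction of the bounding box, which is why it is described as naive in the text.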

Figure 6. Convergence plots for the PitzDaily problem for three batch sizes. Each plot shows the median best-seen flow loss, with shading representing the interquartile range across the 51 runs.

Convergence plots of the flow loss for the three batch sizes are shown in Figure 6. The Kriging Believer (KB) consistently optimised the problem well, although the ε-shotgun methods and LP were statistically equivalent to it at the smaller batch sizes. The rate at which KB reaches its best flow losses, however, was superior to the other methods on the PitzDaily problem; at the largest batch size, KB comes close to its best flow losses after only 5 batches. We note that all methods were able to discover pipe shape configurations with flow losses better than those found by an adjoint (local, gradient-based) optimisation method (Nilsson et al., 2014), although TS, qEI and PLAyBOOK were not able to reach a median flow loss lower than the adjoint solution for all batch sizes.

5. Conclusions and Future Work

Our novel ε-shotgun method, which uses an acquisition function to select the first batch location and samples the remaining locations around it, is both conceptually simple and computationally efficient because, unlike many other batch methods, only one global optimisation run is needed to select the batch locations. The method is competitive with state-of-the-art BBO algorithms and better than them in several cases. We attribute this to the exploitative nature of the first batch point together with the benefit of learning an accurate function model from the remainder of the batch.

Pure exploitation (ε = 0) led to good performance on the majority of problems because the surrogate model poorly estimates the true function, particularly on higher-dimensional functions, and this in itself induces enough exploration. However, in the case of degenerate functions, e.g. WangFreitas, the surrogate model is too poor to optimise well, requiring a larger ε to promote exploration. We have found that a small, non-zero value of ε works well in such cases.

Future research questions revolve around how best to select the locations of the batch points. Although not described in detail here, we also investigated an alternative approach that always selects the first batch location as the surrogate model’s best mean prediction and divides the remaining samples into two groups. One group, to which each location is assigned with probability 1 − ε, was used identically to the greedy shotgun approach of sampling around the first batch location, while the second group (probability ε) was randomly sampled in decision space or on the approximated Pareto front. This approach gave similar results to the ε-shotgun methods.

One possible extension is to use Latin hypercube sampling instead of drawing random samples, to better spread out the batch locations and make the best use of each function evaluation. In addition, it may be beneficial to tailor the sampling covariance to sample less or more densely in directions that are flatter or steeper in decision space.
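As a concrete illustration of the Latin hypercube suggestion, the sketch below generates n points in the unit hypercube with exactly one sample in each of the n equal-width strata along every dimension, which spreads a batch more evenly than independent uniform draws.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """n points in [0, 1]^d with one jittered sample per stratum per axis."""
    samples = np.empty((n, d))
    for j in range(d):
        # Permute the strata independently for each dimension so the
        # per-axis structure does not align into a diagonal.
        samples[:, j] = (rng.permutation(n) + rng.uniform(size=n)) / n
    return samples
```

Scattering the shotgun samples via such a design, scaled and shifted around the first batch point, would replace the independent Gaussian draws while preserving the intended spread.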

Lastly, function evaluations may take different amounts of time depending on their location and on the computational hardware, resulting in some evaluations finishing before others. Current research therefore focuses on extending the method to asynchronous BBO.

Acknowledgements.
This work was supported by Innovate UK grant number 104400.

References

  • A. Alvi, B. Ru, J. Calliess, S. Roberts, and M. A. Osborne (2019) Asynchronous batch Bayesian optimisation with improved local penalisation. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 253–262. Cited by: §2.2, §2.2, §3, §4.1, §4, §4.
  • J. Azimi, A. Fern, and X. Z. Fern (2010) Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems, pp. 109–117. Cited by: §2.2.
  • R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu (1995) A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16 (5), pp. 1190–1208. Cited by: §4.
  • Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas (2018) Bayesian optimization in AlphaGo. External Links: 1812.06855 Cited by: §1.
  • C. Chevalier and D. Ginsbourger (2013) Fast computation of the multi-points expected improvement with applications in batch selection. In International Conference on Learning and Intelligent Optimization, pp. 59–69. Cited by: §2.2.
  • T. Chugh, K. Sindhya, K. Miettinen, Yaochu Jin, T. Kratky, and P. Makkonen (2017) Surrogate-assisted evolutionary multiobjective shape optimization of an air intake ventilation system. In IEEE Congress on Evolutionary Computation, pp. 1541–1548. Cited by: §1.
  • S. J. Daniels, A. A. M. Rahat, G. R. Tabor, J. E. Fieldsend, and R. M. Everson (2019) Automated shape optimisation of a plane asymmetric diffuser using combined computational fluid dynamic simulations and multi-objective Bayesian methodology. International Journal of Computational Fluid Dynamics 33 (6–7), pp. 256–271. Cited by: §1.
  • S. J. Daniels, A. A. M. Rahat, R. M. Everson, G. R. Tabor, and J. E. Fieldsend (2018) A suite of computationally expensive shape optimisation problems using computational fluid dynamics. In Parallel Problem Solving from Nature – PPSN XV, pp. 296–307. Cited by: §4.3.
  • E. A. Daxberger and B. K. H. Low (2017) Distributed batch Gaussian process optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 951–960. Cited by: §2.2.
  • G. De Ath, R. M. Everson, A. A. M. Rahat, and J. E. Fieldsend (2019) Greed is good: exploration and exploitation trade-offs in bayesian optimisation. External Links: 1911.12809 Cited by: §1, §2.1.2, §2.1.2, §2.2, §3, §3, §4.1, §4.1, §4.2, §4.
  • A. De Palma, C. Mendler-Dünner, T. Parnell, A. Anghel, and H. Pozidis (2019) Sampling acquisition functions for batch Bayesian optimization. External Links: 1903.09434 Cited by: §2.2.
  • K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197. Cited by: §4.
  • T. Desautels, A. Krause, and J. W. Burdick (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, pp. 4053–4103. Cited by: §2.2.
  • A. I. J. Forrester, A. Sobester, and A. J. Keane (2008) Engineering design via surrogate modelling - a practical guide. Wiley. External Links: ISBN 978-0-470-06068-1 Cited by: Table 1.
  • D. Ginsbourger, R. Le Riche, and L. Carraro (2008) A multi-points criterion for deterministic parallel global optimization based on Gaussian processes. External Links: Link Cited by: §2.2, §4.
  • D. Ginsbourger, R. Le Riche, and L. Carraro (2010) Kriging is well-suited to parallelize optimization. In Computational intelligence in expensive optimization problems, pp. 131–162. Cited by: §2.2, §4.1, §4.
  • J. González, Z. Dai, P. Hennig, and N. Lawrence (2016a) Batch Bayesian optimization via local penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Vol. 51, pp. 648–657. Cited by: §2.2, §2.2, §2.2, Table 1, §4.
  • J. González, M. Osborne, and N. Lawrence (2016b) GLASSES: Relieving the myopia of Bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Vol. 51, pp. 790–799. Cited by: Table 1.
  • GPy (since 2012) GPy: a Gaussian process framework in Python. Note: http://github.com/SheffieldML/GPy Cited by: §4.
  • N. Hansen (2009) Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 2389–2396. Cited by: §4.
  • J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik (2017) Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1470–1479. Cited by: §1.
  • S. Holm (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (2), pp. 65–70. Cited by: §4.1.
  • F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, C. A. C. Coello (Ed.), pp. 507–523. Cited by: §1.
  • S. Jiang, H. Chai, J. González, and R. Garnett (2019) Efficient nonmyopic Bayesian optimization and quadrature. External Links: 1909.04568 Cited by: §4.2.
  • D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (4), pp. 455–492. Cited by: §2.1.2, §2.1.
  • K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Poczos (2018) Parallelised Bayesian optimisation via Thompson sampling. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Vol. 84, pp. 133–142. Cited by: §1, §2.2, §4.
  • J. D. Knowles, L. Thiele, and E. Zitzler (2006) A tutorial on the performance assessment of stochastic multiobjective optimizers. Technical report TIK214, Computer Engineering and Networks Laboratory, ETH Zurich, Zurich, Switzerland. Cited by: §4.1.
  • W. Lyu, F. Yang, C. Yan, D. Zhou, and X. Zeng (2018) Batch Bayesian optimization via multi-objective acquisition ensemble for automated analog circuit design. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 3306–3314. Cited by: §1.
  • M. D. McKay, R. J. Beckman, and W. J. Conover (2000) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 42 (1), pp. 55–61. Cited by: §2.1, §4.
  • U. Nilsson, D. Lindblad, and O. Petit (2014) Description of adjointShapeOptimizationFoam and how to implement new objective functions. Technical report Chalmers University of Technology, Gothenburg, Sweden. Cited by: §4.3.
  • A. A. M. Rahat, C. Wang, R. M. Everson, and J. E. Fieldsend (2018) Data-driven multi-objective optimisation of coal-fired boiler combustion systems. Applied Energy 229, pp. 446–458. Cited by: §1.
  • C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning. The MIT Press, Boston, MA. External Links: ISBN 0-262-18253-X Cited by: §2.1.1.
  • F. Rehbach, M. Zaefferer, B. Naujoks, and T. Bartz-Beielstein (2020) Expected improvement versus predicted value in surrogate-based optimization. External Links: 2001.02957 Cited by: §1, §2.1.2, §2.2, §4.1, §4.
  • A. Shah and Z. Ghahramani (2015) Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pp. 3330–3338. Cited by: §2.2.
  • B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959. Cited by: §1, §4.
  • N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022. Cited by: §2.1.2.
  • J. Wang, S. C. Clark, E. Liu, and P. I. Frazier (2016) Parallel Bayesian global optimization of expensive functions. External Links: 1602.05149 Cited by: §2.2.
  • Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez (2018) Active model learning and diverse action sampling for task and motion planning. In Proceedings of the International Conference on Intelligent Robots and Systems, pp. 4107–4114. Cited by: §4.2.
  • Z. Wang and S. Jegelka (2017) Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 3627–3635. Cited by: §4.2.
  • Z. Wang and N. de Freitas (2014) Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. External Links: 1406.7758 Cited by: §4.1, Table 1.
  • J. Wu and P. Frazier (2016) The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 3126–3134. Cited by: §2.2.