## 1 Introduction

Since their beginnings about half a century ago [26, 49, 31], Bayesian optimization algorithms have been increasingly used for derivative-free global minimization of expensive to evaluate functions. Typically assuming a continuous objective function , single-objective Bayesian optimization algorithms consist in sequentially evaluating at promising points under the assumption that is a sample realization (path or trajectory) of a random field . Such algorithms are especially popular in the case where evaluating requires heavy high-fidelity numerical simulations (or computer experiments, see notably [33, 40, 41, 24]), where

stands for some design parameters to be optimized over. Such expensive simulations are classically encountered in the resolution of partial differential equations from physical sciences, engineering and beyond

[13]. In recent years, Bayesian optimization also has attracted a lot of interest from the machine learning community

[27, 6, 34, 43], be it to optimize simulation-based objective functions [28, 45, 38]or even to estimate tuning parameters of machine learning algorithms themselves

[3, 4, 42]. In both communities, a Gaussian random field (or Gaussian Process, GP) model is often used for , so that prior information on is taken into account through a trend function and a covariance kernel . Once and are specified, possibly up to some parameters to be inferred based on data, the considered GP model can be used as an instrument to locate the next evaluation point(s) via so-called infill sampling criteria, also referred to as acquisition functions or simply as criteria. While a number of Bayesian optimization criteria have been proposed in the literature (see, e.g., [23, 15, 46, 43, 9] and references therein), we concentrate here essentially on the Expected Improvement (EI) criterion [30, 24] and on variations thereof, with a focus on its use in synchronous batch-sequential optimization. Denoting by points were is assumed to have already been evaluated and by a batch of candidate points where to evaluate next, the multipoint EI is defined as(1) |

where refers to the conditional expectation knowing the event . One way of calculating such criterion is to rely on Monte Carlo simulations. Figure 1

illustrates both what the criterion means and how to approach it by simulations, relying on three samples from the multivariate Gaussian distribution underlying Equation (

1). Our main focus here, in contrast, is on deriving Equation (1) in closed form, studying the criterion’s differentiability, and ultimately calculating and efficiently approximating its gradient in order to perform efficient batch optimization using derivative-based deterministic search.Now, for , it is well known that EI can be expressed in closed form as a function of and as follows

(2) |

where (defined for ) and

are the cumulative distribution function and probability density function of the standard Gaussian distribution, respectively.

When deriving Equation (2), Equation (1) happens (hence for ) to involve a first order moment of the truncated univariate Gaussian distribution. As shown in [7] and developed further here, it turns out that Equation (1) can be expanded in a similar way in the multipoint case () relying on moments of truncated Gaussian vectors. This is essential for the open challenges tackled here of efficiently calculating and optimizing the multipoint criterion of Equation (1).

The applied motivation for having batch-sequential EI algorithms is strong, as distributing evaluations of Bayesian optimization algorithms over several computing units allows significantly reducing wall-clock time and with the fast popularization of clouds, clusters and GPUs in recent years it is becoming always more commonplace to launch several calculations in parallel. Even at a slightly inflated price and scripting effort, reducing the total time off is often a primary goal in order to deliver conclusions involving heavy experiments, be they numerical or laboratory experiments, in studies subject to hard time limitations. Obviously, given its practical importance, the question of parallelizing EI algorithms and alike by selecting points per iteration has been already tackled in a number of works from various disciplinary horizons (including notably [36, 1, 12, 8, 19]). Here we essentially focus on approaches relying on the maximization of Equation (1) and related multipoint criteria. The multipoint EI of Equation (1) has been defined in [30, 41] and first calculated in closed form for the case in [17]. For the case , a Monte Carlo scheme and some sub-optimal batch selection strategies were proposed. Further work on Monte Carlo simulations for multipoint EI estimation can be found in [22, 18]; besides this, stochastic simulation ideas have been explored in [14] for maximizing this multipoint EI criterion via a stochastic gradient algorithm, an approach recently investigated in [47]. Meanwhile, a closed-form formula for the multipoint EI relying on combinations of -and -dimensional Gaussian cumulative distribution functions was obtained in [7], a formula which applicability in reasonable time is however restricted to moderate (say ) in the current situation. Building upon [7], [29] recently calculated the gradient of the multipoint EI criterion in closed form and obtained some first experimental results on (non-stochastic) gradient-based multipoint EI maximization.

Our aim in the present paper is to present a set of novel analytical and numerical results pertaining to the calculation, the computation, and the maximization of the multipoint EI criterion. As most of these novel results apply to a broader class of criteria, we first present in Section 2 a generalization of the multipoint EI that allows accounting for noise in conditioning observations and also exponentiating the improvement. This generalized criterion is calculated using moments of truncated Gaussian vectors in the flavour of [7]. The obtained formula is then revisited in the standard case (noise-free with an exponent set to ), leading to a numerical approximation of the multipoint EI with arbitrary precision and very significantly reduced computation time. Next, the -dimensional maximization of the multipoint EI criterion is discussed in Section 3, where the differentiability of the generalized criterion is studied, its analytical gradient is calculated, and further numerical approaches for fast gradient approximations with controllable accuracy are presented. Finally, Section 4 is dedicated to numerical experiments where, in particular, a multistart derivative-based multipoint EI maximization algorithm highlighting the benefits of the considered methodological principles and the proposed fast approximations is tested and compared to baseline strategies.

## 2 Criteria in parallel Bayesian optimization

### 2.1 General definition of Expected Improvement

Throughout this section the objective function may be observed noise-free or in noise, meaning that at some arbitrary iteration the observed value may be or where

is a realization of a zero mean Gaussian random variable with known (or estimated and plugged-in) variance.

is assumed to be one realization of a random field , where has a Gaussian random field (GRF) distribution conditionally to events of the form (with conditioning on in the noisy case, see for instance [35]). This setup naturally includes the case where is a GRF, but also the so-called Universal Kriging settings where is the sum of a trend with an improper prior and a GRF [33, 32]. Note that in noisy cases the ’s are generally assumed to be independent (although the case of ’s forming a Gaussian vector is tractable), but more essentially they are assumed independent of .In batch-sequential Bayesian Optimization we are interested in computing sampling criteria depending on new points . At any step of corresponding synchronous parallel algorithms, the next batch of points is then defined by globally maximizing over all possible batches:

(3) |

Values of such criteria typically depend on through the conditional distribution , simplifying to in the noiseless context. Conditional mean and covariance functions are analytically formulated via the so-called *kriging equations*, see e.g. [39].
Working out these criteria thus generally boils down to Gaussian vector calculus, which may become intricate and quite cumbersome to implement as (or , in noisy settings) increases.
Our considered generalized version of the multipoint EI criterion, that allows accounting for a Gaussian noise in the conditioning observations and also for an exponentiation in the definition of the improvement, is defined as:

(4) |

where , and . This form gathers several sampling criteria notably including , both in noiseless and noisy settings, and also a multipoint version of the generalized EI of [41]

. In addition, the obtained results apply to batch-sequential versions of the Expected Quantile Improvement

[35] (EQI) and variations thereof, by a simply change of process from to the quantile process. We will show in proposition 2 that such generalized multipoint EI criteria can be formulated as a sum of moments of truncated Gaussian vectors. In the next subsection, in order to get a closed form for the generalized EI we first define these moments and derive some first analytical formulas, that might also be of relevance in further contexts.### 2.2 Preliminaries on moments of truncated Gaussian distribution

We fix
and in noisy settings or in noiseless settings.

###### Definition 1.

Let be a Gaussian vector with mean and covariance matrix , where is the cone of positive definite matrices of . For all positive integer , we define the function on by

(5) |

where the inequality is to be interpreted component-wise.

If is composed of values of a GRF at a batch of locations , we use the notation . We obtain the moments of a truncated Gaussian distribution by an extension of Tallis’ technique [44] to any order, presented in the following proposition:

###### Proposition 1.

The function defined by

(6) |

where is the cumulative distribution function of the centered

-variate normal distribution, is infinitely differentiable, and the moments

are given by:(7) |

The proof of this Proposition is given in appendix B.1

and relies on calculating the moment generating function

. Even if an analytical formula can be obtained at any order of differentiation , the complexity of derivatives in equation (7) increases rapidly. We give below the results for equals 1 and 2.#### Case

Differentiating with respect to yields:

where is the gradient of (see appendix A.1 for an analytical derivation). Taking in the previous equation gives

(8) |

where is the column of . It is shown in appendix A.1 that computing each of the components of requires to compute a multivariate CDF of the normal distribution in dimension . The number of calls to this function for computing the first moment of the truncated Gaussian distribution is thus of .

#### Case

Similarly, differentiating twice with respect to yields

(9) |

For readability, the detailed formula of , the Hessian matrix of , is sent to Appendix A.2. The number of calls to the multivariate normal CDF is of .

### 2.3 Analytic formulas for generalized q-Ei

The previous results obtained for the moments of the truncated normal distribution turn out to be of interest for computing the generalized introduced in Equation (4), as shown by the following proposition.

###### Proposition 2.

For , the criterion defined by (4) exists for all and can be written as a sum of moments of truncated normal distributions

(10) |

with a vector of size defined, noting , by

,
,
,
.

Moreover, in the noiseless case the random vector becomes deterministic conditionally to . Denoting by the (smallest) index of the minimal observation, i.e. , and writing the vector of the last components of , Equation (10) is simplified to:

(11) |

###### Remark 1.

In the rest of the article we also use the following compact notation for the dimensional vector :

(12) |

where is a matrix implicitly defined by of proposition 2.

The proof of Proposition 2 is relegated to Appendix C for conciseness. Equation (10) highlights that the computation of the generalized in noisy settings is challenging since it involves computing different moments, each requiring calls to the multivariate normal CDF in a dimension close to . Even for and moderate , the linear dependence in the number of observations makes the use of this criterion challenging in application. Regarding the noiseless criterion, the computation of moments is more affordable, at least for moderate , but one has to keep in mind that the ultimate goal here is to perform global maximization of the considered criteria. It is thus important to bring further calculation speed-ups in order to perform this optimization in a reasonable time compared to the evaluation time of the objective function , assumed expensive to evaluate. The next section discusses these matters and proposes faster formulas to compute both and its gradient.

## 3 Computing and optimizing the criteria

### 3.1 Generalities

Maximizing the expressions given in Equation (10) (noisy settings) or (11) (noiseless settings) is difficult. These maximizations are performed with respect to a batch of points , and are thus optimization problems in dimension . In this space, the objective function to be maximized, is not convex in general and has the interesting property that the points in the batch can be permuted without changing the value of ; i.e. for any permutation of . With this property, one can reduce the measure of the search domain by , e.g. by imposing that the first coordinate of the points in the batch are in ascending order. We will restrict our attention here to the use of multi-start gradient based local optimization algorithms acting on the whole input domain , that do not exploit the structure of the problem but do not seem to be affected by this, at least with the chosen settings regarding the starting designs. Our contribution here will be to propose a faster formula for computing the first moments previously presented, as well as their derivatives. This will yield an easier computation of both the generalized EI and its -dimensional gradient. Besides, a second approximate but faster formula to further reduce the calculation time of the gradient will be introduced.

### 3.2 Gradient of the generalized q-Ei

In this section, we extend the analytical gradient calculation of the performed in [29] to the case of the generalized noisy and noise-free , and provide in turn a more concise formula. Again, the presented formulas rely on results on moments of truncated Gaussian distributions.

###### Proposition 3.

Let be a batch such that the conditional covariance matrix

is positive definite and the functions and

are differentiable at each point
.
These derivatives are written
and respectively.
In this setup, the function of
Equation (10) is differentiable
and its derivative
with respect to the j
coordinate of the point is

(13) | ||||

where , and is the Kronecker symbol. The derivatives and are calculated in Appendix B.2.

This new expansion of the gradient of the generalized EI as a sum of derivatives of first order moments will prove to be very useful thanks to formulas presented next.

### 3.3 Fast numerical estimation of first order moments and their derivatives

Let us now focus on the practical implementation of the closed-form formula of Equation (10). We take and note in noisy settings and in noiseless settings. As mentioned before, the computation of the noisy or noiseless (see, Eqs. (10),(11)) requires calls to the CDF of the and variate normal distribution, and . These CDFs are here computed using the Fortran algorithms of [16] wrapped in the mnormt R package [2]. A quick look at Eqs. (8),(10) suggests that the noisy requires evaluations of and evaluations of . For the noiseless case (see, Equation (11)), the number of calls are divided by . In both cases, a slight improvement can be obtained by noticing a symmetry which reduces the number of calls from (resp. in the noiseless case) to (resp. ). This symmetry is justified in Appendix E.

Despite this improvement, and even in the classical noiseless case, the number of calls is still proportional to . We now give new efficient and trustworthy expansion that enables a fast and reliable approximation of first order moments of truncated Gaussian vectors by reducing this number of calls to .

###### Proposition 4.

Let , and let be a Gaussian random vector with mean vector and covariance matrix . Then we have

(14) |

###### Proof.

The obtained formula simply uses the approximation of a moment
with finite differences of the moment generating function.
We showed here that instead of fully computing the moment generating function, we can expand the simpler *tangent* function . For conciseness, we name here the use of this formula “tangent moment method”. This formula thus enables approximating the first order moment at the cost of only two calls to . Hence,
from Equation (11), computing a noiseless
can be performed at the cost of calls to .
Besides, a similar approach can be applied to approximate the gradient of through faster computations of and , as shown next:

###### Proposition 5.

The following equations hold:

(15) | ||||

(16) | ||||

where is the Hessian matrix of (see Appendix A.2 for details).

As before, these formulas enable reducing the number of calls to the multivariate CDF by an order . For the computation of this number goes from to . For computing its -dimensional gradient, it goes from to . The latter complexity suggests restricting to moderate values of in applications. In the next section we present further results that enable further reducing the complexity for numerically estimating the gradient.

### 3.4 A slightly biased but fast proxy of the gradient

The key idea to obtain further computational savings is summarized in this section. We first strategically decompose the gradient of moments as a sum of two terms.

###### Proposition 6.

Let us consider a Gaussian multivariate random field from to . For , let us denote by and the mean and the covariance matrix of . Let and assume that is positive definite. Also, assume that the functions , and are differentiable at . Then the following decomposition holds for .

(17) |

###### Proof.

is continuous at , so there exists a neightborhood of such that for all , is positive definite. Let us define on :

Applying equation (22) of appendix B.1, for all and , is a moment generated by differentiation of the following function:

(18) |

The analytical form of equation (18) and the assumed differentiability at ensure existence of partial derivatives of at . So to conclude,

∎

The latter decomposition can be interpreted as follows: infinitesimal variations of around modify the moments in two ways. First, it modifies the distribution of , second it changes the distribution of the truncation . For the particular case of , we propose to neglect this second variation. Applying this approximation to (11) gives for ,

(19) |

where the last step is obtained by mean square differentiability of the process , with an event constant with respect to , see Appendix D. We can observe that this approximation makes a summation term disappear. The computation of this formula requires evaluations of -variate Gaussian CDF. Indeed, Equation (19) indicates that each component of the gradient vector can be considered as a moment of a truncated Gaussian vector, so we can apply the results of section 2. In particular, when , applying Proposition 4, two Gaussian CDF calls are needed for each of the components, leading to evaluations. Besides, from Equation (14), the second CDF call does not depend on , which implies that this term is common for every dimension. Thus the gradient of Equation (19) finally comes with CDF evaluations instead of . For a full gradient with respect to all points of the batch, we then need CDF evaluations – a substantial improvement compared to the obtained in [29] and the obtained in the previous section. The complexities for computing moments, and its gradients, expressed in terms of number of calls to the function, are summarized in Table 1. These new computational savings come at the price of a non-exact gradient calculation. A first numerical validation is represented in Figure 2. On this example, we observe small () relative errors between the exact and approximate gradient of dimension (the biggest difference vector has a norm of 0.13, compared to an exact gradient norm of 13.1). We also observe that the relative error appears to be typically smaller with higher , which is promising for maximisations. However, this apparently trustful but non-exact calculation naturally raises the question of the impact of such an approximation on the performances of gradient-based maximization algorithms. As we will see in the next section, this proxy gradient turned out to enable quite competitive maximization performances based on numerical experiments.

Number of CDF evaluations | ||||||

Total | ||||||

analytic | ||||||

tangent moment | ||||||

analytic | ||||||

tangent moment | ||||||

analytic | ||||||

tangent moment | ||||||

proxy | ||||||

analytic | ||||||

tangent moment | ||||||

proxy |

## 4 Application

The goal of this section is to illustrate the usability of the proposed gradient-based maximization schemes and in particular the improvements brought by the fast formulas detailed in the previous sections. The relevance of using sequential sampling strategies based on the maximization has already been investigated before (see, [7, 47, 29]) and all these articles pointed out the importance of calculation speed which often limits the use of based strategies to moderate . We do not aim again at proving the performance of based sequential strategies. Instead we aim at illustrating the gain, in computation time, brought by the fast formulas and show that using the approximate gradient obtained in Equation (19) does not impair the ability to find batches with (close to) maximal .

### 4.1 Objective function and pure calculation speed

The objective function is the so-called Borehole function [21]. It has been previously used for testing methods using a surrogate model [48, 20]. The function computes a rate of water flow, , through a borehole. The problem is described by input variables, , , , , , , , and is given below

(20) |

Here, the objective function is obtained by rescaling on the input domain . An analytical study of variations shows that there is a unique global minimum at , with .

Before using sequential strategies to minimize , we look at empirical computation times for evaluating and its gradient as a function of the batch size . For the computations, the so-called “analytic” method relies on the state of the art formulas of [7, 29] with a number of calls to the multivariate normal cdf of respectively and . The “tangent moment” method uses our formula for moment calculation to yield and its gradient (see, Equations (14) and (15),(16)). Finally, for computing the gradient only, the “proxy” method relies on Equation (19).

Figure 3 exhibits computation times averaged over batches drawn uniformly. The Gaussian process model is based on an initial design of points drawn from a optimum-LHS procedure [25]. We use the tensor product covariance function and estimate the hyperparameters by maximum likelihood using the DiceKriging R package [11]. Figure 3 shows significant computational savings. For instance with , one gradient computation takes respectively , and using respectively the proxy, the tangent moment and analytic methods. Since the complexity for computing a gradient with the proxy is of against and for the two other methods, the computational savings of the proxy tend to increase with . It should also be noted that these savings will be larger with decreasing domain dimension . If we look at computations, the tangent moment method is times faster than the analytic one when and times faster when ; thanks to an complexity against .

### 4.2 Experimental setup: sequential minimization strategies

We now perform a total of minimizations of , each using an initial design of experiments of points drawn from an optimum-LHS procedure with a different seed. Three different batch-sequential strategies are investigated.

The first one – serving as a benchmark – is a variation of the “Constant Liar Mix” heuristic

[7, 47] where, at each iteration, the batch of size is chosen among several batches obtained from the Constant Liar heuristic [17] with different lie levels. We use lie levels fixed to the current maximum observation the current minimum observation, and the quantiles of the conditional distribution of the point selected in the batch. A total of batches are proposed at each iteration and the CL-mix heuristic picks the one with maximum .The two other strategies considered here rely on pure maximization using a multistart BFGS algorithm with a stopping criterion of precision 2.2 (parameter control$factr of the R function optim [37]). The gradients involved in the optimization are computed either with the tangent moment formula or the proxy. For the gradient-based maximization, we use a total of starting batches obtained, again, using a Constant Liar heuristic with random lies sampled from the conditional distribution at the selected point. Finally we use two different batch sizes. When we run a total of iterations and when we run iterations. The hyperparameters of the GP model are re-estimated at each iteration after having incorporated the new observations.

### 4.3 First q-EI maximization

We first compare the performances, in terms of , of the multistart BFGS algorithm when the proxy gradient and the tangent moment methods are used. Table 2 compares the results at iteration for these two methods and the CL-mix strategy. The results are averaged over the 50 initial designs.

tangent moment | 12.45 (22.6 s) | 15.35 (700.2 s) |
---|---|---|

proxy | 12.46 (14.3 s) | 15.35 (127.0 s) |

CL-mix | 11.80 (7.7 s) | 14.34 (15.6 s) |

As expected, the CL-mix heuristic yields batches with lower than the strategies directly maximizing . Also, for both and , the two based methods have the same performance, which stresses out the relevance of the proxy method since the latter is about 1.6 times faster when and 5.5 faster when .

### 4.4 Several q-EI maximization steps

We now compare the performances of the different maximization approaches after multiple batch evaluations. Figure 4 displays the average regret as a function of the iteration number (first row) and the total computation time (i.e. the time to evaluate and find the next batch to evaluate) assuming respectively that the computation time of is seconds (i.e. instantaneous), two minutes and one hour (rows respectively).

Looking at the performances as a function of the iteration number (first row on Figure 4), the CL-mix heuristic, which samples a batch with lower at each step, leads in average to a slower convergence than the two other methods, for both and . In contrast, the two strategies based on maximization have similar performances.

However, these conclusions do not hold when the regret is plotted as a function of the total computation time (rows on Figure 4). First, when the computation time of is null (row ) it is clear that -based sequential strategies are not adapted since they are too expensive. In this case, the CL-mix heuristic performs better and some other optimization strategies which are not metamodel-based would probably be more relevant. Second, when is moderately expensive (i.e. minutes), the proxy method and CL-mix have comparable performances when , but the proxy outperforms when . Besides, the proxy shows a much faster convergence than the tangent moment method when . The use of based strategies thus becomes relevant when is larger than a few minutes, if the proxy is used. Finally, when is equal to one hour, the use of based strategies is particularly recommended. In that case the relative improvement of the proxy compared to the tangent moment method tends to naturally vanish because of the long computation time of . When is extremely expensive to compute, using the proxy is thus not essential. However, since it does not impair the ability to find a batch with large we still recommend to use it, especially when is large.

## Conclusion

In this article we provide a closed-form expression of generalized -points Expected Improvement criterion for batch-sequential Bayesian global optimization. An interpretation based on moments of truncated Gaussian vectors yields fast formulas with arbitrary precision. Furthermore an new approximation for the gradient is shown to be even faster while preserving ability to find batches close to maximal . As the use of these strategies was previously considered cumbersome from a dozen of batch points, these formulas happen to be of particular interest to run based batch-sequential strategies for larger batch sizes. We show that these methods are implementable and efficient on a classic -dimensional test case. Additionally, some of the intermediate results established here might be of interest for other research questions involving moments of truncated Gaussian vectors and their gradients. Perspectives include deriving second order derivatives of and fast numerical estimates thereof. Also, we aim at improving the sampling of initial batches in multistart derivative-based maximization.

Acknowledgements: Part of this work has been conducted within the frame of the ReDice consortium, gathering industrial (CEA, EDF, IFPEN, IRSN, Renault) and academic (École des Mines de Saint-Étienne, INRIA, Universität Bern) partners around advanced methods for computer experiments.

## Appendix A Differentiating multivariate Gaussian CDF

We consider the CDF dimension . We use the convention .

### a.1 Gradient

Using the following identity, derived from conditional distributions of a Gaussian vector,

with and , we reformulate the integral of the Gaussian CDF:

Here indexed minus symbols, e.g. in , refer to exclusions of a line or a column.

Finally we have

(21) |

### a.2 Hessian

As for the computation of the gradient, we write

with and