Introduction
The vast majority of optimization problems encountered in practical applications involve objective functions whose complexity prohibits the use of classical gradient-based algorithms such as Newton or conjugate-gradient methods [12, 8]. In fact, so little is known about the internal workings of these systems that practitioners have no choice but to treat them as "black boxes", for which a high evaluation cost compounds the issue of opacity, making optimization a daunting endeavor.

To solve such optimization problems while keeping the number of black-box evaluations at a minimum, one possibility is to use an iterative approach. In this arena, Bayesian optimization is the undisputed champion [2, 23, 7]. Its success is largely due to the fact that Bayesian optimization a) incorporates prior belief one may have about the black-box objective function, and b) explores the input space carefully, compromising between exploration and exploitation before each function evaluation.
A key component in Bayesian optimization lies in the choice of acquisition function, which is the ultimate decider as it commands where to next query the objective function. Acquisition functions come in many shapes and forms, ranging from traditional improvement-based [14, 16, 12] and optimistic policies [24, 13] to more recent information-based strategies [27, 8, 9]. The latter differ from the former in that they account for the importance of the output relative to the input (usually through the entropy of the posterior distribution over the unknown minimizer), leading to significant gains when the objective function is noisy and highly multimodal.
Despite superior empirical performance, information-based acquisition functions suffer from several shortcomings, including slow evaluation caused by heavy sampling requirements, laborious implementation, intractability in high dimensions, and limited choice of Gaussian process kernels [9, 10, 28]. The only exception of which we are aware is the FITBO algorithm [20], in which efficiency and flexibility come at the cost of a decline in performance, the latter being just "on par" with traditional improvement-based and optimistic policies.
While it is clear that incorporating information about the output space in an acquisition function has significant merit, doing so while eliminating the above limitations calls for a radically different approach. Inspired by the theory of importance sampling [18] and rare-event quantification [17, 22], we propose to equip acquisition functions with the likelihood ratio, a quantity that accounts for the importance of the output relative to the input, and which appears frequently in uncertainty quantification and experimental design problems [3, 11]. The likelihood ratio acts as a sampling weight and guides the search algorithm towards regions of the input space where the magnitude of the objective function is unusually large, which has the potential of accelerating convergence by a significant amount.
Formulation of the problem
We consider the problem of finding a global minimizer of a function \(f : \mathcal{X} \to \mathbb{R}\) over a compact set \(\mathcal{X} \subset \mathbb{R}^d\). This is usually written as

(1) \(\mathbf{x}^* = \operatorname*{arg\,min}_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}).\)

We assume that the objective function \(f\) is unknown and therefore treat it as a black box. In words, this means that \(f\) has no simple closed form, and neither do its gradients. This allows for the possibility of \(f\) being nonlinear and non-convex, although we do require that \(f\) be Lipschitz continuous to avoid pathological cases [2]. The objective function can, however, be evaluated at any arbitrary query point \(\mathbf{x} \in \mathcal{X}\), each evaluation producing a potentially noise-corrupted output \(y\). In this work, we model uncertainty in observations with additive Gaussian noise:

(2) \(y = f(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2).\)
In practice, the function \(f\) can be a machine-learning algorithm (with \(\mathbf{x}\) the hyperparameters), a large-scale computer simulation of a physical system (with \(\mathbf{x}\) the physical parameters), or a field experiment (with \(\mathbf{x}\) the experimental parameters). Evaluating \(f\) can thus be very costly, so to solve the minimization problem each query point must be selected very meticulously.

A Brief Review of Bayesian Optimization
Bayesian optimization [2, 23] is a sequential approach which, starting from an initial dataset of input–output pairs, iteratively probes the input space and, with each point visited, attempts to construct a surrogate model for the objective function. At each iteration, the “next best point” to visit is determined by minimizing an acquisition function which serves as a guide for the algorithm as it explores the input space. After a specified number of iterations, the algorithm uses the surrogate model it has constructed to make a final recommendation for what it believes is the true minimizer of the objective function (SI Appendix, Algorithm S1).
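The loop just described can be sketched in a few dozen lines. Everything below is an illustrative stand-in rather than the authors' implementation: a fixed-lengthscale RBF surrogate, a confidence-bound acquisition rule minimized by grid search, and a toy one-dimensional objective.

```python
import numpy as np

def rbf(A, B, ell=0.2):
    """RBF kernel matrix between row-stacked inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xq, noise=1e-5):
    """Posterior mean/variance of a zero-mean GP at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Kq = rbf(Xq, X)
    mu = Kq @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Kq, np.linalg.solve(K, Kq.T))
    return mu, np.maximum(var, 1e-12)

def bayes_opt(f, n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(3, 1))   # initial design
    y = np.array([f(x) for x in X])
    grid = np.linspace(0.0, 1.0, 201)[:, None]  # candidate queries
    for _ in range(n_iter):
        mu, var = gp_posterior(X, y, grid)
        acq = mu - kappa * np.sqrt(var)      # confidence-bound rule
        x_next = grid[np.argmin(acq)]        # "next best point"
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()          # final recommendation

x_best, y_best = bayes_opt(lambda x: (x[0] - 0.3) ** 2)
```

With a budget of 15 queries the recommendation lands near the true minimizer of the toy objective at 0.3.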
The two key issues in Bayesian optimization are the choice of surrogate model and the choice of acquisition function. The former is important because it represents our belief about what the objective function looks like given the data collected by the algorithm; the latter is important because it guides the algorithm in its exploration of the input space. For Bayesian optimization to provide any sort of advantage over a brute-force approach, the costs of constructing the surrogate model and optimizing the acquisition function must be small compared to that of evaluating the black-box function \(f\).
Model Selection in Bayesian Optimization
Many different surrogate models have been used in Bayesian optimization, with various levels of success [23]. In this work, we use a nonparametric Bayesian approach based on Gaussian process (GP) regression [19]. This choice is appropriate for Bayesian optimization because Gaussian processes a) are agnostic to the details of the black box, and b) provide a way to quantify uncertainty associated with noisy observations. In what follows, we denote by \(\mu(\mathbf{x})\) and \(\sigma^2(\mathbf{x})\) the posterior mean and variance of the GP, respectively, and use a radial-basis-function (RBF) kernel with automatic relevance determination (SI Appendix, Section S1). Other choices of kernels are possible (e.g., linear kernel or Matérn kernel), but we will see that the RBF kernel has several desirable properties that will come in handy when designing our algorithm.
Acquisition Functions
The acquisition function is at the core of the Bayesian optimization algorithm, as it alone determines the points at which to query the objective function. The role of the acquisition function is to find a good compromise between exploration (i.e., visiting regions where uncertainty is high) and exploitation (i.e., visiting regions where the surrogate model predicts small values). In this work, we consider three of the most celebrated acquisition functions, namely, the probability of improvement (PI)

(3) \(\mathrm{PI}(\mathbf{x}) = \Phi(\lambda(\mathbf{x})),\)

the expected improvement (EI)

(4) \(\mathrm{EI}(\mathbf{x}) = \sigma(\mathbf{x}) \left[ \lambda(\mathbf{x})\, \Phi(\lambda(\mathbf{x})) + \phi(\lambda(\mathbf{x})) \right],\)

and the lower confidence bound (LCB)

(5) \(\mathrm{LCB}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\, \sigma(\mathbf{x}),\)

where \(\lambda(\mathbf{x}) = (y^\dagger - \mu(\mathbf{x}) - \xi)/\sigma(\mathbf{x})\), \(y^\dagger\) denotes the current best observation, \(\Phi\) and \(\phi\) are the cumulative distribution and probability density functions (pdf) of the standard normal distribution, respectively, and \(\xi\) and \(\kappa\) are positive parameters that control the trade-off between exploration and exploitation [12, 24].

The popularity of PI, EI and LCB can be largely explained by the facts that a) implementation is straightforward, b) evaluation is inexpensive, and c) gradients are readily accessible, opening the door to gradient-based optimizers. The combination of these features makes the search for the next best point considerably more efficient than otherwise.
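These closed forms are indeed cheap to evaluate given the posterior mean and standard deviation at a point. A minimal transcription follows; the function and parameter names are our own, and the standard normal CDF is implemented via the error function.

```python
import math

def norm_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def acquisitions(mu, sigma, y_best, xi=0.0, kappa=2.0):
    """Closed-form PI, EI, LCB at a point with posterior mean mu, std sigma."""
    lam = (y_best - mu - xi) / sigma                      # normalized improvement
    pi = norm_cdf(lam)                                    # probability of improvement
    ei = sigma * (lam * norm_cdf(lam) + norm_pdf(lam))    # expected improvement
    lcb = mu - kappa * sigma                              # lower confidence bound
    return pi, ei, lcb
```

Gradients with respect to mu and sigma follow from the chain rule applied to these same expressions.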
It is important to note that the rationale behind LCB is to "repurpose" a purely explorative acquisition function which reduces uncertainty globally, \(\sigma(\mathbf{x})\), into one suitable for Bayesian optimization which is more aggressive towards minima. This is done by appending the posterior mean \(\mu(\mathbf{x})\) and introducing the trade-off parameter \(\kappa\), as in Eq. 5. Following the same logic, we repurpose the Integrated Variance Reduction (IVR),

(6) \(\mathrm{IVR}(\mathbf{x}) = \frac{1}{\sigma^2(\mathbf{x})} \int \mathrm{cov}^2(\mathbf{x}, \mathbf{x}')\, \mathrm{d}\mathbf{x}',\)

into

(7) \(\mathrm{IVR\text{-}BO}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\, \sigma^2(\mathbf{x})\, \mathrm{IVR}(\mathbf{x}).\)
In the above, \(\mathrm{cov}(\mathbf{x}, \mathbf{x}')\) denotes the posterior covariance between \(\mathbf{x}\) and \(\mathbf{x}'\) [21]. The formula for IVR involves an integral over the input space, but for the RBF kernel this integral can be computed analytically, and the same is true of its gradients (SI Appendix, Section S2). Therefore, IVR-BO retains the three important features discussed earlier for PI, EI and LCB.
Bayesian Optimization with Output-Weighted Importance Sampling
The Role of the Likelihood Ratio
In the context of experimental design, Sapsis [22] showed that using an acquisition function that accounts for the importance of the output relative to the input dramatically improves the performance of the sequential-search algorithm. For a surrogate model with posterior mean \(\mu(\mathbf{x})\), the importance of the output relative to the input is quantified by the likelihood ratio

(8) \(w(\mathbf{x}) = \frac{p_{\mathbf{x}}(\mathbf{x})}{p_\mu(\mu(\mathbf{x}))},\)

where \(p_{\mathbf{x}}\) denotes the pdf of the input and \(p_\mu\) the pdf of the posterior mean conditioned on the input.
The likelihood ratio is important in cases where some points are more important than others in determining the value of the output. It assigns more weight to a point which has low probability of being observed but has a large impact on the magnitude of the output, than to a point which is likely to be observed but whose impact on the output is small (i.e., the output remains close to its mean). In other words, it favors points for which the magnitude of the output is unusually large over points associated with frequent, average output values.
It is thus no surprise that the first use of the likelihood ratio in an acquisition function was in the context of rare-event quantification. Specifically, Sapsis [22] introduced the Q criterion

(9a) \(Q(\tilde{\mathbf{x}}) = \int \tilde{\sigma}^2(\mathbf{x}; \tilde{\mathbf{x}})\, w(\mathbf{x})\, \mathrm{d}\mathbf{x}\)

(9b) \(\phantom{Q(\tilde{\mathbf{x}})} = \int \tilde{\sigma}^2(\mathbf{x}; \tilde{\mathbf{x}})\, \frac{p_{\mathbf{x}}(\mathbf{x})}{p_\mu(\mu(\mathbf{x}))}\, \mathrm{d}\mathbf{x},\)

where \(\tilde{\sigma}^2(\mathbf{x}; \tilde{\mathbf{x}})\) is the predictive variance at \(\mathbf{x}\) had \(\tilde{\mathbf{x}}\) been observed. The Q criterion has been shown to outperform many of the purely explorative acquisition functions traditionally used in experimental design, precisely because of the use of the likelihood ratio. Therefore, there is reason to believe that gains of similar proportion can be achieved in Bayesian optimization.
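The likelihood ratio itself can be estimated with a few lines of code: sample the input, push the samples through the surrogate's posterior mean, and estimate the output pdf by kernel density estimation. In the sketch below, the surrogate mean `mu_fn(x) = x**2` and the uniform input density on \([-1, 1]\) are illustrative stand-ins, not quantities from the paper.

```python
import numpy as np

def kde_pdf(samples, query, bw=None):
    """1D Gaussian kernel density estimate evaluated at query points."""
    samples = np.asarray(samples)
    if bw is None:  # Silverman's rule of thumb for the bandwidth
        bw = 1.06 * samples.std() * len(samples) ** (-0.2)
    z = (query[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z**2).sum(1) / (len(samples) * bw * np.sqrt(2 * np.pi))

def likelihood_ratio(mu_fn, x_query, n_mc=20000, seed=0):
    """w(x) = p_x(x) / p_mu(mu(x)) for a uniform input on [-1, 1]."""
    rng = np.random.default_rng(seed)
    x_mc = rng.uniform(-1.0, 1.0, n_mc)      # Monte Carlo samples of the input
    mu_mc = mu_fn(x_mc)                      # push through the posterior mean
    p_x = 0.5                                # uniform pdf on [-1, 1]
    p_mu = kde_pdf(mu_mc, mu_fn(np.asarray(x_query)))
    return p_x / p_mu

# Rare, large-magnitude outputs (near x = 1) receive more weight than
# frequent, near-average outputs (near x = 0).
w = likelihood_ratio(lambda x: x**2, [0.05, 0.95])
```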
In Bayesian optimization, the likelihood ratio can be beneficial in at least two ways. First, it should improve performance in situations in which the global minimum of the objective function "lives" in the (heavy) left tail of the output pdf \(p_\mu\). Second, the likelihood ratio makes it possible to distill any prior knowledge one may have about the distribution of the input. For "vanilla" Bayesian optimization, it is natural to use a uniform prior for \(p_{\mathbf{x}}\) because in general any point is as good as any other. But if the optimization problem arises in the context of uncertainty quantification of a physical system (where the input space comes equipped with a generally non-uniform probability distribution), or if one has prior beliefs about where the global minimizers may be located, then use of a non-trivial \(p_{\mathbf{x}}\) in the likelihood ratio can considerably improve performance of the search algorithm.

Likelihood-Weighted Acquisition Functions
When the likelihood ratio appears in an acquisition function, we refer to the latter as a "likelihood-weighted" (LW) acquisition function. The only LW acquisition function we have seen so far is the Q criterion, and it is a purely explorative one. To construct LW acquisition functions suitable for Bayesian optimization, we proceed methodically.
We first note that we could repurpose the Q criterion as we did IVR. However, computing \(\tilde{\sigma}^2\) and its gradient is cumbersome, so instead we prove that the Q criterion is strictly equivalent to

(10) \(\mathrm{IVR\text{-}LW}(\mathbf{x}) = \frac{1}{\sigma^2(\mathbf{x})} \int \mathrm{cov}^2(\mathbf{x}, \mathbf{x}')\, w(\mathbf{x}')\, \mathrm{d}\mathbf{x}',\)

which can be viewed as the LW counterpart of IVR defined in Eq. 6 (SI Appendix, Section S3). We can then construct our first LW acquisition function for Bayesian optimization by repurposing IVR-LW as

(11) \(\mathrm{IVR\text{-}LWBO}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\, \sigma^2(\mathbf{x})\, \mathrm{IVR\text{-}LW}(\mathbf{x}).\)
We also introduce

(12) \(\mathrm{LCB\text{-}LW}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\, \sigma(\mathbf{x})\, w(\mathbf{x})\)

as the likelihood-weighted counterpart to LCB. Here, it is natural to incorporate the likelihood ratio in a product with the explorative term \(\sigma(\mathbf{x})\). This is in the same spirit as Eq. 9a, where \(w(\mathbf{x})\) naturally appears as a sampling weight for the posterior variance. By contrast, there is no principled way to incorporate \(w(\mathbf{x})\) in PI and EI. These two functions will therefore be used as baselines.
Approximation of the Likelihood Ratio
We must ensure that introduction of the likelihood ratio does not compromise our ability to compute the acquisition functions and their gradients efficiently.
We first note that to evaluate the likelihood ratio, we must estimate the conditional pdf of the posterior mean, \(p_\mu\), typically at each iteration. This is done by computing \(\mu(\mathbf{x})\) for a large number of input points and applying kernel density estimation (KDE) to the resulting samples. Fortunately, KDE is to be performed in the (one-dimensional) output space, allowing use of fast FFT-based algorithms which scale linearly with the number of samples [4].

Yet, the issue remains that in IVR-LW(BO), the likelihood ratio appears in an integral over the input space. To avoid resorting to Monte Carlo integration, we approximate \(w(\mathbf{x})\) with a Gaussian mixture model (GMM):

(13) \(w(\mathbf{x}) \approx \sum_{i=1}^{n_{\mathrm{GMM}}} \alpha_i\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_i,\, \boldsymbol{\Sigma}_i).\)
The GMM approximation has three key advantages. First, when combined with the RBF kernel, the problematic integrals and their gradients become analytic (SI Appendix, Section S3). This is important because it makes the approach tractable in high dimensions, a situation with which other sampling-based acquisition functions are known to struggle. Second, the GMM approximation imposes no restriction on the nature of \(p_{\mathbf{x}}\) and \(p_\mu\). This is an improvement over the approach of Sapsis [22], in which a second-order polynomial approximation led to the requirement that \(p_{\mathbf{x}}\) be Gaussian or uniform. Third, it allows analytical computation of the gradients of LCB-LW.
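One way to obtain such a mixture in practice (a sketch, not the paper's implementation) is to resample candidate points with probability proportional to the likelihood ratio and fit the mixture by expectation-maximization. The one-dimensional EM routine and the two-bump weight below are purely illustrative.

```python
import numpy as np

def fit_gmm_1d(x, n_comp=2, n_iter=200):
    """Fit a 1D Gaussian mixture by expectation-maximization."""
    mu = np.linspace(x.min(), x.max(), n_comp)   # deterministic spread init
    var = np.full(n_comp, x.var())
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        r = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        n_k = r.sum(0)
        w = n_k / len(x)
        mu = (r * x[:, None]).sum(0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(0) / n_k
    return w, mu, var

# Unnormalized stand-in for the likelihood ratio, with bumps at -2 and +2
grid = np.linspace(-5.0, 5.0, 2001)
w_grid = np.exp(-0.5 * (grid + 2) ** 2) + np.exp(-0.5 * (grid - 2) ** 2)

rng = np.random.default_rng(1)
samples = rng.choice(grid, size=4000, p=w_grid / w_grid.sum())
weights, means, variances = fit_gmm_1d(samples)
```

The fitted means recover the two bumps, mirroring how a small number of mixture components can capture the important regions of the weight.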
We illustrate the benefits of using the likelihood ratio with the 2D Ackley and Michalewicz functions, two test functions notoriously challenging for optimization algorithms. (Analytical expressions are given in the SI Appendix, Section S4.) For these functions, Fig. 1 makes it visually clear that the likelihood ratio gives more emphasis to the region where the global minimum is located (i.e., the center region for the Ackley function, and the slightly off-center region for the Michalewicz function). Figure 1 also shows that the likelihood ratio can be approximated satisfactorily with a small number of Gaussian mixtures, a key prerequisite for preserving the efficiency of the algorithm.
The number of Gaussian mixtures \(n_{\mathrm{GMM}}\) to be used in Eq. 13 is at the discretion of the user. In this work, it is kept constant throughout the search, although the option exists to modify it "on the fly", either according to a predefined schedule or by selecting the value of \(n_{\mathrm{GMM}}\) that minimizes the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) at each iteration [26]. While promising, this approach is left for future investigation. (Further details about implementation may be found in the SI Appendix, Section S5.)
Results
To demonstrate the benefits of using the likelihood ratio, we conduct a series of numerical experiments with the acquisition functions introduced above (SI Appendix, Section S6). Specifically, we compare EI, PI, LCB, LCB-LW, IVR, IVR-BO, IVR-LW and IVR-LWBO. We include IVR and IVR-LW in this list because they provide insight about the behavior of IVR-BO and IVR-LWBO, respectively, in the limit of large \(\kappa\).
Benchmark of Test Functions
We begin with a benchmark of six test functions commonly used to evaluate the performance of optimization algorithms (SI Appendix, Section S4). The Branin and 6D Hartmann functions are moderately difficult, while the other four are particularly challenging: the Ackley function features numerous local minima found in a nearly flat region surrounding the global minimum; the Bukin function is non-differentiable, with a steep ridge filled with local minima; and the Michalewicz functions have steep valleys and ridges, with a global minimum accounting for a tiny fraction of the search space.
For each test function, the input space is rescaled to the unit hypercube in order to facilitate training of the GP hyperparameters. The noise variance \(\sigma_\varepsilon^2\) is specified a priori and appropriately rescaled to account for the variance of the objective function. A small number of initial points is used, the exact value depending on the dimensionality of the function. In each case the initial points are selected by Latin hypercube sampling (LHS).
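Latin hypercube sampling stratifies each coordinate so that each of the \(n\) equal-width bins receives exactly one point. A minimal numpy sketch (our own, not the authors' code):

```python
import numpy as np

def latin_hypercube(n, d, seed=0):
    """n points in [0, 1)^d with one point per axis-aligned stratum."""
    rng = np.random.default_rng(seed)
    # one random point inside each of the n equal-width bins, per dimension
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    for j in range(d):
        # shuffle the bin assignments independently in each dimension
        u[:, j] = u[rng.permutation(n), j]
    return u

pts = latin_hypercube(8, 2)
```

Compared with plain uniform sampling, this guarantees that every marginal is evenly covered even for very small designs, which matters when only a handful of initial black-box evaluations can be afforded.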
Figure 2 shows that the LW acquisition functions systematically and substantially outperform their unweighted counterparts in identifying the location and objective value of the global minimum. The only exception is the Branin function, for which the likelihood ratio appears to have no positive effect. For a possible explanation, consider that the output pdf of the Branin function has a very light left tail (SI Appendix, Fig. S1). This is in stark contrast with the other objective functions, suggesting that a heavy left tail is a prerequisite for LW acquisition functions to be competitive.
Figure 2 confirms that use of the likelihood ratio allows the algorithm to explore the search space more efficiently by focusing on regions where the magnitude of the objective function is thought to be unusually large. For a clear manifestation of this, consider the left panel of Fig. 2c, where the three LW acquisition functions dramatically outclass the competition, a consequence of the fact that the likelihood ratio helps the algorithm target the ridge of the Bukin function much more rapidly and thoroughly than otherwise.
Figures 2e and 2f demonstrate the superiority of our approach in high dimensions. Overall, Fig. 2 suggests that the best-performing acquisition functions are LCB-LW and IVR-LWBO, solidifying the utility of the likelihood ratio in Bayesian optimization. Figure 2 also makes clear that there is no "warm-up" period for the LW acquisition functions. That the latter "zero in" much faster than the competition is invaluable, since the power of Bayesian optimization lies in keeping the number of black-box evaluations at a minimum.
We have investigated how computation of the likelihood ratio affects algorithm efficiency (SI Appendix, Section S7). We have found that the benign increase in runtime a) can be largely mitigated if sampling of the posterior mean is done frugally, and b) is inconsequential when each blackbox query takes days or weeks, as in most practical applications.
Computation of Precursors for Extreme Events in Dynamical Systems
For a realworld application, we consider the problem of predicting the occurrence of extreme events in dynamical systems. This is a topic worthy of investigation because extreme events (e.g., earthquakes, road accidents, wildfires) have the potential to cause significant damage to people, infrastructure and nature [1]. From a prediction standpoint, the central issue is to identify precursors, i.e., those states of the system which are most likely to lead to an extreme event in the near future. But searching for precursors is no easy task because extreme events often arise in highly complex dynamical systems, which adds to the issue of low frequency of occurrence. To combat this, one can use Bayesian optimization to parsimoniously probe the state space of the system and thus identify “dangerous” regions with as little data as possible.
Formally, the dynamical system is treated as a black box which assigns to an initial condition \(\mathbf{x}_0\) a measure of dangerousness, e.g.,

(14) \(F(\mathbf{x}_0) = \max_{t \in [0, \tau]}\, q\!\left(S_t(\mathbf{x}_0)\right).\)

Here, \(t\) denotes the time variable, \(S_t\) the flow map of the system (i.e., the dynamics of the black box), \(q\) the observable of interest, and \(\tau\) the time horizon over which prediction is to be performed. In words, \(F\) records the maximum value attained by the observable during the time interval \([0, \tau]\) given initial condition \(\mathbf{x}_0\). The role of Bayesian optimization is to search for those \(\mathbf{x}_0\) that give rise to large values of \(F\), indicative of an extreme event occurring within the next \(\tau\) time units.
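A hedged sketch of such a dangerousness map: integrate a stand-in dynamical system (here a Van der Pol oscillator, chosen purely for illustration and unrelated to the system studied below) with a fourth-order Runge–Kutta scheme, and record the maximum of the observable over the horizon.

```python
import numpy as np

def rk4_step(f, x, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def dangerousness(x0, rhs, observable, tau=10.0, dt=0.01):
    """F(x0): maximum of the observable along the trajectory over [0, tau]."""
    x = np.asarray(x0, dtype=float)
    g_max = observable(x)
    for _ in range(int(tau / dt)):
        x = rk4_step(rhs, x, dt)
        g_max = max(g_max, observable(x))
    return g_max

# Stand-in black box: Van der Pol oscillator, observable |x1|
rhs = lambda x: np.array([x[1], 2.0 * (1.0 - x[0] ** 2) * x[1] - x[0]])
F = dangerousness([0.1, 0.0], rhs, lambda x: abs(x[0]))
```

In the paper's setting, each such evaluation requires simulating the (expensive) dynamics, which is exactly why the number of queries must be kept small.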
In practice, the algorithm does not explore the whole space of initial conditions, as this would allow sampling of unrealistic initial conditions. Instead, one may view extreme events as excursions from a "background" attractor [6], for which a low-dimensional representation can be constructed by principal component analysis (PCA). Searching for precursors in the PCA subspace makes Bayesian optimization computationally tractable. Another advantage is that the PCA subspace comes equipped with the Gaussian prior \(p_{\mathbf{x}}(\mathbf{x}) = \mathcal{N}(\mathbf{0}, \boldsymbol{\Lambda})\), where the diagonal matrix \(\boldsymbol{\Lambda}\) contains the PCA eigenvalues.
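The construction of the PCA subspace and its Gaussian prior can be sketched as follows; the snapshot matrix and its dimensions are synthetic stand-ins, not data from the system studied below.

```python
import numpy as np

def pca_subspace(snapshots, k=2):
    """Leading-k principal components of mean-centered snapshots (rows = states)."""
    Xc = snapshots - snapshots.mean(0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (len(snapshots) - 1)    # PCA eigenvalues (variances)
    return Vt[:k], eigvals[:k]

def gaussian_prior_logpdf(a, eigvals):
    """log N(a; 0, diag(eigvals)) for a point a in PCA coordinates."""
    return -0.5 * np.sum(a**2 / eigvals + np.log(2.0 * np.pi * eigvals))

# Synthetic "attractor" data with two dominant directions of variance
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])
components, lam = pca_subspace(data, k=2)
```

Search then proceeds in the coordinates `a`, with `gaussian_prior_logpdf` supplying the non-uniform input density that enters the likelihood ratio.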
We consider the dynamical system introduced in Ref. [5] in the context of extreme-event prediction in turbulent flow. (Governing equations are given in the SI Appendix, Section S8.) Figure 3a shows that the system features successive "cycles", during which a trajectory initiated close to the origin spirals away towards a fixed point, only to find itself swiftly repelled from the plane. After some "hovering about", the trajectory ultimately heads back to the origin, and the cycle repeats itself. Here, extreme events correspond to "bursts" in one of the coordinates, as shown in Fig. 3b.
To identify precursors for these bursts, we apply Bayesian optimization to the function \(F\) defined in Eq. 14. The background attractor is approximated with the leading two principal components, which roughly span the plane. We specify the prediction horizon \(\tau\) and the noise variance \(\sigma_\varepsilon^2\), and use two Gaussian mixtures for the GMM fit. The search algorithm is initialized with three points sampled from an LHS design. For the search space, we require that the initial condition lie no further than four PCA standard deviations from the origin in any direction.
Figure 4 shows that here again the LW acquisition functions trump their unweighted cousins by a significant margin. The two best-performing acquisition functions are again LCB-LW and IVR-LWBO, with the former predicting an objective value (Fig. 4) within a few percent of the actual observation (Fig. 3b). We note that use of the Gaussian prior for \(p_{\mathbf{x}}\) may lead to even greater gains when the dynamical system features multiple dangerous regions, with some more "exotic" than others.
Discussion
We have established the overwhelmingly positive effect of the likelihood ratio, a quantity that accounts for the importance of the output relative to the input, on the performance of Bayesian optimization algorithms. We have shown that use of the likelihood ratio in an acquisition function can dramatically improve algorithm efficiency, with faster convergence seen in a number of synthetic test functions.
The significance of our approach is threefold. First, it is the first of its kind to leverage information about the output space without resorting to tedious entropy calculations, facilitating implementation in an existing Bayesian optimization code. Second, in addition to output information, the likelihood ratio encapsulates any prior knowledge one may have about the distribution of the input, which has significant implications in uncertainty quantification and experimental design, where the input space often comes equipped with a non-uniform probability distribution. Third, the likelihood ratio can be satisfactorily approximated with a Gaussian mixture model, making it possible to compute any "likelihood-weighted" acquisition function (and its gradients) analytically for a range of Gaussian process kernels. Analytical tractability allows for the possibility of the input space being high-dimensional, providing a significant advantage over entropy-based acquisition functions, which are known to struggle in such situations.
While we have successfully applied the proposed method to the problem of extreme-event prediction in complex dynamical systems, it is important to note that our approach can be applied to any system that can be viewed as a black box. This includes problems related to optimal path planning of autonomous vehicles, hyperparameter training of deep-learning algorithms, scheduling and planning of operational tasks for revenue maximization, and drug synthesis and discovery.
References
[1] (2006) Extreme events in nature and society. Springer Verlag, New York.
[2] (2010) A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
[3] (1995) Bayesian experimental design: a review. Statistical Science 10, pp. 273–304.
[4] (1994) Fast implementations of nonparametric curve estimators. Journal of Computational and Graphical Statistics 3, pp. 35–56.
[5] (2016) Dynamical indicators for the prediction of bursting phenomena in high-dimensional systems. Physical Review E 94, pp. 032212.
[6] (2017) A variational approach to probing extreme events in turbulent dynamical systems. Science Advances 3, pp. e1701533.
[7] (2018) A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
[8] (2012) Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13, pp. 1809–1837.
[9] (2014) Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pp. 918–926.
[10] (2015) Output-space predictive entropy search for flexible global optimization. In NIPS Workshop on Bayesian Optimization.
[11] (2016) Sequential Bayesian optimal experimental design via approximate dynamic programming. arXiv preprint arXiv:1604.08320.
[12] (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, pp. 455–492.
[13] (2012) On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pp. 592–600.
[14] (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering 86, pp. 97–106.
[15] (2013) Differentiating Gaussian processes. Note: http://mlg.eng.cam.ac.uk/mchutchon/DifferentiatingGPs.pdf
[16] (1978) The application of Bayesian methods for seeking the extremum. Towards Global Optimization 2, pp. 117–129.
[17] (2018) Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. Proceedings of the National Academy of Sciences 115, pp. 11138–11143.
[18] (2013) Monte Carlo theory, methods and examples.
[19] (2006) Gaussian processes for machine learning. MIT Press, Cambridge, MA.
[20] (2017) Fast information-theoretic Bayesian optimisation. arXiv preprint arXiv:1711.00673.
[21] (1989) Design and analysis of computer experiments. Statistical Science 4, pp. 409–423.
[22] (2020) Output-weighted optimal sampling for Bayesian regression and rare event statistics using few samples. Proceedings of the Royal Society A 476, pp. 20190834.
[23] (2015) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104, pp. 148–175.
[24] (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995.
[25] (2013) Virtual library of simulation experiments: test functions and datasets. Note: http://www.sfu.ca/~ssurjano
[26] (2016) Python data science handbook: essential tools for working with data. O'Reilly Media.
[27] (2009) An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization 44, pp. 509–534.
[28] (2017) Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 3627–3635.
Appendix S1 Review of Gaussian process regression
A Gaussian process is completely specified by its mean function \(m(\mathbf{x})\) and covariance function \(k(\mathbf{x}, \mathbf{x}')\). For a dataset \(\mathcal{D}\) of input–output pairs (written in matrix form as \(\{\mathbf{X}, \mathbf{y}\}\)) and a Gaussian process with constant mean \(m_0\), the random process conditioned on \(\mathcal{D}\) follows a normal distribution with posterior mean and variance

(S15a) \(\mu(\mathbf{x}) = m_0 + \mathbf{k}(\mathbf{x})^\top \mathbf{K}^{-1} (\mathbf{y} - m_0 \mathbf{1})\)

(S15b) \(\sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}(\mathbf{x})^\top \mathbf{K}^{-1} \mathbf{k}(\mathbf{x}),\)

respectively, where \(\mathbf{k}(\mathbf{x})\) denotes the vector of covariances between \(\mathbf{x}\) and the training inputs \(\mathbf{X}\), and \(\mathbf{K} = k(\mathbf{X}, \mathbf{X}) + \sigma_\varepsilon^2 \mathbf{I}\). Equation S15a can be used to predict the value of the surrogate model at any point \(\mathbf{x}\), and Eq. S15b to quantify uncertainty in prediction at that point [19].
In GP regression, the choice of covariance function is crucial, and in this work we use the radial-basis-function (RBF) kernel with automatic relevance determination (ARD):

(S16) \(k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left[-\tfrac{1}{2}\, (\mathbf{x} - \mathbf{x}')^\top \boldsymbol{\Theta}^{-2} (\mathbf{x} - \mathbf{x}')\right],\)

where \(\boldsymbol{\Theta}\) is a diagonal matrix containing the lengthscales for each dimension. The advantage of the RBF kernel will become clear in the next sections.
Exact inference in Gaussian process regression requires inverting the matrix \(\mathbf{K}\), typically at each iteration. This is usually done by Cholesky decomposition, whose cost scales like \(\mathcal{O}(n^3)\), with \(n\) the number of observations [19]. (A quadratic cost can be achieved if the parameters of the covariance function are fixed.) Although a cubic scaling may seem daunting, it is important to note that in Bayesian optimization the number of observations (i.e., function evaluations) rarely exceeds a few dozen (or perhaps a few hundred), as an unreasonably large number of observations would defeat the whole purpose of the algorithm.
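A minimal sketch of GP prediction with an RBF-ARD kernel and a single Cholesky factorization, assuming a zero prior mean and hypothetical data (our own illustration, not the authors' code); the cubic cost mentioned above is confined to the factorization, while each prediction only involves triangular solves.

```python
import numpy as np

def rbf_ard(A, B, lengthscales, sigma_f=1.0):
    """RBF kernel with per-dimension lengthscales (ARD)."""
    d = (A[:, None, :] - B[None, :, :]) / lengthscales
    return sigma_f**2 * np.exp(-0.5 * (d**2).sum(-1))

def gp_predict(X, y, Xq, lengthscales, noise=1e-8):
    """GP posterior mean/variance via Cholesky decomposition of K."""
    K = rbf_ard(X, X, lengthscales) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Kq = rbf_ard(Xq, X, lengthscales)
    V = np.linalg.solve(L, Kq.T)                     # triangular solve
    mu = Kq @ alpha                                  # posterior mean
    var = 1.0 - (V**2).sum(0)                        # posterior variance (sigma_f = 1)
    return mu, np.maximum(var, 0.0)

X = np.array([[0.1, 0.2], [0.5, 0.9], [0.8, 0.3]])
y = np.array([1.0, -2.0, 0.5])
mu, var = gp_predict(X, y, X, lengthscales=np.array([0.3, 0.3]))
```

With near-zero noise the posterior mean interpolates the training data and the posterior variance collapses at the observed points, as Eqs. S15a and S15b dictate.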
Appendix S2 Analytical Expressions for the IVR Acquisition Function with RBF Kernel
We begin by rewriting the formula for IVR using the GP expression for the posterior covariance:
(S17a)  
(S17b) 
If we introduce
(S18) 
then Eq. S17b can be rewritten as
(S19) 
This helps us realize that to compute IVR and its gradients, we only need a mechanism to compute Eq. S18 and its gradients, regardless of the choice of GP kernel.
For the RBF kernel
(S20) 
we have
(S21a)  
and  
(S21b) 
For further details, we refer the reader to Ref. [15].
Appendix S3 Equivalence Between the IVRLW Acquisition Function and the Q Criterion
We first note that the optimization problem
(S22) 
is strictly equivalent to
(S23) 
as the additional term in Eq. S23 does not depend on the optimization variable \(\tilde{\mathbf{x}}\). Rewriting the Q criterion in this form has the advantage of making it clear that the next best point is to be chosen so that the total likelihood-weighted variance is reduced maximally.
Assuming that the GP hyperparameters are not updated upon addition of the “ghost point” , we have
(S24) 
where , and we have defined . But can be written as
(S25) 
Therefore, the Cholesky decomposition of is simply , with
(S26) 
where we have Choleskydecomposed as . With this in hand, we compute
(S27) 
which leads to
(S28) 
We then substitute Eq. S28 into Eq. S24, and after simplification obtain
(S29) 
which is nothing more than the square of the covariance between and multiplying the inverse of . Thus, we have established that minimizing the Q criterion,
(S30) 
is strictly equivalent to maximizing IVRLW,
(S31) 
This proof of equivalence has significant implications from the standpoint of implementation. To see this, consider that when the likelihood ratio is approximated with a Gaussian mixture model, the Q criterion can be written as a linear combination of \(n_{\mathrm{GMM}}\) terms, with the \(i\)th one given by
(S32a)  
(S32b) 
The advantage of rewriting Eq. S32a as Eq. S32b is that, to evaluate the Q criterion at \(\tilde{\mathbf{x}}\) for a given GP kernel, we only need a mechanism to compute
(S33) 
since the first term on the righthand side of Eq. S32b does not depend on . The problem arises with the gradients of , as the dependence of the latter on is rather complicated. The variable appears in the inverse of and in a like term, the product of which is composed with the trace operator. By contrast, the terms involved in the computation of IVRLW have the form
(S34a)  
(S34b)  
(S34c) 
for which the dependence on is simpler. (Notice the similarities between Eqs. S34c and S19.) It follows that to evaluate and its gradients, we only need a mechanism to compute and its gradients.
For the RBF kernel, it is straightforward to show that
(S35a)  
and  
(S35b) 
For further details, we refer the reader to Ref. [15].
Appendix S4 Analytical Expressions for Benchmark Test Functions
The analytical expressions for the synthetic test functions considered in Results are given below. Further details may be found in Ref. [25]. For these test functions, Fig. S5 shows the conditional pdfs of the output for uniformly distributed input, estimated by Monte Carlo sampling.
Ackley function:
(S36)  \(f(\mathbf{x}) = -a \exp\!\left( -b \sqrt{\tfrac{1}{d} \sum_{i=1}^{d} x_i^2} \right) - \exp\!\left( \tfrac{1}{d} \sum_{i=1}^{d} \cos(c\, x_i) \right) + a + \exp(1)\)
where \(d\) denotes the dimensionality of the function, and \(a = 20\), \(b = 0.2\), and \(c = 2\pi\).
Branin function:
(S37)  \(f(\mathbf{x}) = a \left( x_2 - b x_1^2 + c x_1 - r \right)^2 + s (1 - t) \cos(x_1) + s\)
where \(a = 1\), \(b = 5.1/(4\pi^2)\), \(c = 5/\pi\), \(r = 6\), \(s = 10\), and \(t = 1/(8\pi)\).
Bukin function:
(S38)  \(f(\mathbf{x}) = 100 \sqrt{\left| x_2 - 0.01 x_1^2 \right|} + 0.01 \left| x_1 + 10 \right|\)
Michalewicz function:
(S39)  \(f(\mathbf{x}) = -\sum_{i=1}^{d} \sin(x_i) \sin^{2m}\!\left( \dfrac{i\, x_i^2}{\pi} \right)\)
where \(d\) denotes the dimensionality of the function, and \(m\) controls the steepness of the valleys and ridges. In this work we use \(m = 10\), making optimization extremely challenging.
6D Hartmann function:
(S40)  \(f(\mathbf{x}) = -\sum_{i=1}^{4} \alpha_i \exp\!\left( -\sum_{j=1}^{6} A_{ij} \left( x_j - P_{ij} \right)^2 \right)\)
where
(S41a)  \(\boldsymbol{\alpha} = (1.0,\, 1.2,\, 3.0,\, 3.2)^{\mathsf{T}}\)
(S41b)  \(\mathbf{A} = \begin{bmatrix} 10 & 3 & 17 & 3.5 & 1.7 & 8 \\ 0.05 & 10 & 17 & 0.1 & 8 & 14 \\ 3 & 3.5 & 1.7 & 10 & 17 & 8 \\ 17 & 8 & 0.05 & 10 & 0.1 & 14 \end{bmatrix}\)
(S41c)  \(\mathbf{P} = 10^{-4} \begin{bmatrix} 1312 & 1696 & 5569 & 124 & 8283 & 5886 \\ 2329 & 4135 & 8307 & 3736 & 1004 & 9991 \\ 2348 & 1451 & 3522 & 2883 & 3047 & 6650 \\ 4047 & 8828 & 8732 & 5743 & 1091 & 381 \end{bmatrix}\)
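Under the standard parameter choices above, the test functions are a few lines each. The implementations below are our own sketch of the conventional forms from Ref. [25] (not code from the accompanying package); the known minima serve as a check.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2 * np.pi):
    """Ackley function; global minimum f = 0 at the origin."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def branin(x):
    """Branin function; global minimum f ~ 0.397887, attained e.g. at (pi, 2.275)."""
    x1, x2 = x
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

def michalewicz(x, m=10):
    """Michalewicz function with steepness parameter m; 2D minimum f ~ -1.8013."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return -np.sum(np.sin(x) * np.sin(i * x**2 / np.pi) ** (2 * m))
```

Evaluating each function at its documented minimizer is a cheap regression test for any benchmarking harness built on top of these definitions.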
Appendix S5 Implementation
We summarize our algorithm for computation of likelihood-weighted acquisition functions in Algorithm S2. We also note that we have striven to adhere to conventional notation, at the expense of making it immediately clear whether each acquisition function should be minimized or maximized. The summary presented in Table S1 should dispel any ambiguity.
Although several open-source libraries are available for Bayesian optimization, we developed our own in-house package because it allows for more flexibility. Our code is written in Python and uses GPy (https://sheffieldml.github.io/GPy) for Gaussian process regression. We use KDEpy (https://kdepy.readthedocs.io) to estimate the output pdf, and scikit-learn (https://scikit-learn.org) for the GMM fit. By default, the search for the next best point is conducted in the normalized input space, as we have found that this facilitates convergence of the optimization algorithm. For the latter, we use the scipy (https://www.scipy.org) implementation of the L-BFGS-B algorithm, although the TNC and SLSQP algorithms are also supported.
Apart from the obvious fact that it provides an implementation of LW acquisition functions, our package has the following advantages:

- it contains much less code than other packages, making it suitable for experimentation and research;
- it uses parallelization (via joblib, https://joblib.readthedocs.io) to optimize the acquisition function and compute the final recommendation;
- it offers the possibility of benchmarking multiple metrics in the same run, in parallel; and
- it makes it simple to extend the analytical formulas for IVR and its variants to kernels other than RBF.
We have released the package on GitHub under the name gpsearch ([Link will be inserted here]). We note that implementing an LW acquisition function on top of an existing package (e.g., Emukit, https://amzn.github.io/emukit) requires little effort.
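The search for the next best point described above can be sketched as a multi-start L-BFGS-B loop. This is a simplified illustration with our own function names, not the gpsearch API, and the quadratic "acquisition function" is a stand-in for a real one.

```python
import numpy as np
from scipy.optimize import minimize

def propose_next_point(acq, bounds, n_restarts=10, seed=0):
    """Minimize an acquisition function with L-BFGS-B from several random starts."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    best_x, best_val = None, np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(lo, hi)  # random restart inside the box
        res = minimize(acq, x0, method="L-BFGS-B", bounds=bounds)
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x, best_val

# Stand-in acquisition: a simple quadratic with minimum at (0.5, -0.2)
acq = lambda x: np.sum((x - np.array([0.5, -0.2])) ** 2)
x_next, val = propose_next_point(acq, bounds=[(-1, 1), (-1, 1)])
```

Random restarts guard against the multimodality of realistic acquisition surfaces; in a full implementation each restart can be dispatched to a separate worker, which is the role joblib plays in the package.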
Appendix S6 Experimental Protocol
For each example considered in Results, we indicate the number of Gaussian mixtures used in the GMM approximation. We use one fixed value of the exploration parameter for EI and PI, and another for LCB, IVR, and their respective variants. (Although not considered in this work, use of a schedule for the exploration parameter has the potential of favorably affecting the algorithm [24].) We have made no attempt to optimize these parameters, as this issue is beyond the scope of the present work. Unlike others, we do not fix the noise variance, but rather let the algorithm learn it from the data.
When the location \(\mathbf{x}^*\) of the global minimum and the corresponding function value \(f(\mathbf{x}^*)\) are known, we report the simple regret
(S42)  \(r(n) = f(\bar{\mathbf{x}}_n) - f(\mathbf{x}^*)\)
where \(\bar{\mathbf{x}}_n\) denotes the optimizer recommendation at iteration \(n\), and \(n\) is the index of the current iteration [28]. Because several of the functions we consider are highly multimodal and oscillatory, we also report the distance
(S43)  \(\ell(n) = \| \bar{\mathbf{x}}_n - \mathbf{x}^* \|\)
as in Ref. [20]. (When the function has multiple global minimizers, we compute \(\ell(n)\) for each minimizer and report the smallest value.) When the global minimum of the objective function is not known a priori, we use Eq. S42 with the best available estimate of \(f(\mathbf{x}^*)\), which we complement with the observation regret
(S44)  \(r_{\mathrm{obs}}(n) = \min_{i \le n} y_i - f(\mathbf{x}^*)\)
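The metrics above can be computed from an optimization history in a few lines. This is a minimal sketch with our own function names, where `f_opt` denotes the known (or estimated) value of the global minimum.

```python
import numpy as np

def simple_regret(f_rec, f_opt):
    """Regret of the optimizer recommendation at each iteration (cf. Eq. S42)."""
    return np.asarray(f_rec, dtype=float) - f_opt

def distance_to_minimizer(x_rec, x_opt):
    """Euclidean distance between recommendation and global minimizer (cf. Eq. S43)."""
    return np.linalg.norm(np.asarray(x_rec, dtype=float) - np.asarray(x_opt, dtype=float), axis=-1)

def observation_regret(y_obs, f_opt):
    """Regret of the best observation made up to each iteration (cf. Eq. S44)."""
    return np.minimum.accumulate(np.asarray(y_obs, dtype=float)) - f_opt
```

Note that the observation regret is monotonically non-increasing by construction, whereas the simple regret need not be, since the recommendation can temporarily worsen as the surrogate model is refined.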
For each example considered, we run 100 Bayesian experiments, each differing in the choice of initial points, and report the median and median absolute deviation for the metrics introduced above. (The shaded error bands in Figs. 2 and 4 indicate a quarter of the latter.) Experiments were performed on a computer cluster equipped with 40 Intel Xeon E5-2630 v4 processors clocked at 2.2 GHz.
Appendix S7 Comparison of Runtime for Likelihood-Weighted and Unweighted Acquisition Functions
To investigate how computation of the likelihood ratio affects the overall efficiency of the algorithm, we proceed as follows. For the Ackley function, we consider an initial dataset composed of ten LHS input–output pairs and, for the eight acquisition functions considered in Results, record the time required to perform a single iteration of the Bayesian algorithm. During that time interval, the following operations are performed: computation of the likelihood ratio (for LW acquisition functions only), minimization of the acquisition function over the input space, query of the objective function at the next best point, and update of the surrogate model.
Since the focus is on the likelihood ratio, we investigate the effect of the following parameters on runtime: the number of samples drawn from the posterior mean and used in the KDE, the number of Gaussian mixtures used in the GMM fit, and the dimensionality of the objective function. The baseline case uses the parameter values employed to generate Fig. 2a in the main body. For each parameter, we perform 100 experiments (each with a different LHS initialization, followed by a single Bayesian iteration) and report the median runtime (in seconds) for EI, PI, IVR(BO), IVR-LW(BO), and LCB(LW).
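A timing protocol of this kind can be sketched with the standard library alone. The snippet below is illustrative only (our own names, with a dummy workload standing in for one actual Bayesian iteration).

```python
import time
from statistics import median

def median_runtime(step, n_trials=20):
    """Run `step` n_trials times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(n_trials):
        t0 = time.perf_counter()
        step()
        times.append(time.perf_counter() - t0)
    return median(times)

# Dummy stand-in for one Bayesian iteration (sampling, acquisition optimization, ...)
def dummy_iteration():
    sum(i * i for i in range(10_000))

t_med = median_runtime(dummy_iteration)
```

Reporting the median rather than the mean makes the measurement robust to occasional scheduler-induced outliers, which matters on a shared cluster.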
Figure S6 shows that computation of the likelihood ratio has a mildly adverse effect on algorithm efficiency, with the main culprit being the sampling of the posterior mean. The left panel in Fig. S6 shows a relatively strong dependence of the runtime on the number of samples. We have verified that for the LW acquisition functions, very little time is spent in the KDE part of the algorithm: as discussed in the main body, one-dimensional FFT-based KDE scales linearly with the number of samples. Rather, most of the iteration time is spent generating the samples of the posterior mean. Little can be done to avoid this issue, except for using a reasonably small number of samples. For a modest sample count, the left panel in Fig. S6 shows that the runtime for LW acquisition functions is not dramatically larger than that for unweighted acquisition functions. We note that for relatively low dimensions, reducing the number of samples has virtually no effect on the accuracy of the KDE. As for the number of Gaussian mixtures and the dimensionality, the center and right panels in Fig. S6 suggest that runtime scales linearly with both, which is unproblematic from the standpoint of efficiency.
As a final word, we note that the runtimes for LW acquisition functions remain on the same order of magnitude as those for unweighted acquisition functions. In practical applications, evaluation of the black-box objective function may take days or even weeks, so whether computation and optimization of the acquisition function takes one second or ten is inconsequential.
Appendix S8 Governing Equations for Dynamical System with Extreme Events
The governing equations for the dynamical system considered in Results are given by
(S45a)  
(S45b)  
(S45c) 
with parameter values as given in Ref. [5], to which we refer the reader for further details.
Acquisition Function    Equation    Rule
PI                      3           maximize
EI                      4           maximize
LCB(LW)                 5, 12       minimize
IVR(LW)                 6, 10       maximize
IVR-BO(LW)              7, 11       minimize
Q                       9a          minimize