Bayesian Optimization with Output-Weighted Importance Sampling

04/22/2020
by   Antoine Blanchard, et al.
MIT
0

In Bayesian optimization, accounting for the importance of the output relative to the input is a crucial yet perilous exercise, as it can considerably improve the final result but often involves inaccurate and cumbersome entropy estimations. We approach the problem from a radically new perspective and, inspired by the theory of importance sampling and extreme events, advocate the use of the likelihood ratio to guide the search algorithm towards regions where the magnitude of the objective function is unusually large. The likelihood ratio acts as a sampling weight and can be approximated in a way that makes the approach tractable in high dimensions. The "likelihood-weighted" acquisition functions introduced in this work are found to outperform their unweighted counterparts substantially in a number of applications.

READ FULL TEXT VIEW PDF
02/23/2020

Weighting Is Worth the Wait: Bayesian Optimization with Importance Sampling

Many contemporary machine learning models require extensive tuning of hy...
06/22/2020

Output-Weighted Importance Sampling for Bayesian Experimental Design and Uncertainty Quantification

We introduce a class of acquisition functions for sample selection that ...
06/01/2011

AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks

Stochastic sampling algorithms, while an attractive alternative to exact...
10/14/2019

A unified view of likelihood ratio and reparameterization gradients and an optimal importance sampling scheme

Reparameterization (RP) and likelihood ratio (LR) gradient estimators ar...
02/19/2021

Output-Weighted Sampling for Multi-Armed Bandits with Extreme Payoffs

We present a new type of acquisition functions for online decision makin...
06/04/2018

Efficiency of adaptive importance sampling

The sampling policy of stage t, formally expressed as a probability dens...
03/27/2013

An Empirical Analysis of Likelihood-Weighting Simulation on a Large, Multiply-Connected Belief Network

We analyzed the convergence properties of likelihood- weighting algorith...

Introduction

The vast majority of optimization problems encountered in practical applications involve objective functions whose level of complexity prohibits use of classical optimization algorithms such as Newton methods, conjugate gradient methods, or evolutionary algorithms

[12, 8]. In fact, so little is known about the internal workings of those systems that practitioners have no choice but to treat them as “black boxes”, for which a high evaluation cost adds to the issue of opacity, making optimization a daunting endeavor.

To solve such optimization problems while keeping the number of black-box evaluations at a minimum, one possibility is to use an iterative approach. In this arena, Bayesian optimization is the undisputed champion [2, 23, 7]. Its success is largely due to the fact that Bayesian optimization a) incorporates prior belief one may have about the black-box objective function, and b) explores the input space carefully, compromising between exploration and exploitation before each function evaluation.

A key component in Bayesian optimization lies in the choice of acquisition function, which is the ultimate decider as it commands where to next query the objective function. Acquisition functions come in many shapes and forms, ranging from traditional improvement-based [14, 16, 12] and optimistic policies [24, 13] to more recent information-based strategies [27, 8, 9]. The latter differ from the former in that they account for the importance of the output relative to the input (usually through the entropy of the posterior distribution over the unknown minimizer), leading to significant gains when the objective function is noisy and highly multi-modal.

Despite superior empirical performance, information-based acquisition functions suffer from several shortcomings, including slow evaluation caused by heavy sampling requirements, laborious implementation, intractability in high dimensions, and limited choice of Gaussian process kernels [9, 10, 28]. The only exception of which we are aware is the FITBO algorithm [20], in which efficiency and flexibility come at the cost of a decline in performance, the latter being just “on par” with traditional improvement-based and optimistic policies.

While it is clear that incorporating information about the output space in an acquisition function has significant merit, doing so while eliminating the above limitations calls for a radically different approach. Inspired by the theory of importance sampling [18] and rare-event quantification [17, 22], we propose to equip acquisition functions with the likelihood ratio, a quantity that accounts for the importance of the output relative to the input, and which appears frequently in uncertainty quantification and experimental design problems [3, 11]. The likelihood ratio acts as a sampling weight and guides the search algorithm towards regions of the input space where the magnitude of the objective function is unusually large, which has the potential of accelerating convergence by a significant amount.

Formulation of the problem

We consider the problem of finding a global minimizer of a function over a compact set . This is usually written as

(1)

We assume that the objective function is unknown and therefore treat it as a black box. In words, this means that has no simple closed form, and neither do its gradients. This allows for the possibility of being nonlinear and non-convex, although we do require that be Lipschitz-continuous to avoid pathological cases [2]. The objective function can, however, be evaluated at any arbitrary query point , each evaluation producing a potentially noise-corrupted output . In this work, we model uncertainty in observations with additive Gaussian noise:

(2)

In practice, the function

can be a machine-learning algorithm (with

the hyper-parameters), a large-scale computer simulation of a physical system (with the physical parameters), or a field experiment (with the experimental parameters). Evaluating can thus be very costly, so to solve the minimization problem each query point must be selected very meticulously.

A Brief Review of Bayesian Optimization

Bayesian optimization [2, 23] is a sequential approach which, starting from an initial dataset of input–output pairs, iteratively probes the input space and, with each point visited, attempts to construct a surrogate model for the objective function. At each iteration, the “next best point” to visit is determined by minimizing an acquisition function which serves as a guide for the algorithm as it explores the input space. After a specified number of iterations, the algorithm uses the surrogate model it has constructed to make a final recommendation for what it believes is the true minimizer of the objective function (SI Appendix, Algorithm S1).

The two key issues in Bayesian optimization are the choice of surrogate model and the choice of acquisition function . The former is important because it represents our belief about what the objective function looks like given the data collected by the algorithm; the latter is important because it guides the algorithm in its exploration of the input space. For Bayesian optimization to provide any sort of advantage over a brute-force approach, the costs of constructing the surrogate model and optimizing the acquisition function must be small compared to that of evaluating the black-box function .

Model Selection in Bayesian Optimization

Many different surrogate models have been used in Bayesian optimization, with various levels of success [23]. In this work, we use a non-parametric Bayesian approach based on Gaussian process (GP) regression [19]. This choice is appropriate for Bayesian optimization because Gaussian processes a) are agnostic to the details of the black box, and b) provide a way to quantify uncertainty associated with noisy observations. In what follows, we denote by and

the posterior mean and variance of the GP process, respectively, and use a radial-basis-function (RBF) kernel with automatic relevance determination (SI Appendix, Section S1). Other choices of kernels are possible (e.g., linear kernel or Matérn kernel), but we will see that the RBF function has several desirable properties that will come in handy when designing our algorithm.

Acquisition Functions

The acquisition function is at the core of the Bayesian optimization algorithm, as it solely determines the points at which to query the objective function. The role of the acquisition function is to find a good compromise between exploration (i.e., visiting regions where uncertainty is high) and exploitation (i.e., visiting regions where the surrogate model predicts small values). In this work, we consider three of the most celebrated acquisition functions, namely, the probability of improvement (PI)

(3)

the expected improvement (EI)

(4)

and the lower confidence bound (LCB)

(5)

where , denotes the current best observation, and

are the cumulative and probability density functions (pdf) of the standard normal distribution, respectively, and

and are positive parameters that control the trade-off between exploration and exploitation [12, 24].

The popularity of PI, EI and LCB can be largely explained by the facts that a) implementation is straightforward, b) evaluation is inexpensive, and c) gradients are readily accessible, opening the door for gradient-based optimizers. The combination of these features makes the search for the next best point considerably more efficient than otherwise.

It is important to note that the rationale behind LCB is to “repurpose” a purely explorative acquisition function which reduces uncertainty globally, , into one suitable for Bayesian optimization which is more agressive towards minima. This is done by appending and introducing the trade-off parameter as in Eq. S28. Following the same logic, we repurpose the Integrated Variance Reduction (IVR),

(6)

into

(7)

In the above, denotes the posterior covariance between and [21]. The formula for IVR involves an integral over the input space, but for the RBF kernel this integral can be computed analytically, and the same is true of its gradients (SI Appendix, Section S2). Therefore, IVR-BO retains the three important features discussed earlier for PI, EI and LCB.

Bayesian Optimization with Output-Weighted Importance Sampling

The Role of the Likelihood Ratio

In the context of experimental design, Sapsis [22] showed that using an acquisition function that accounts for the importance of the output relative to the input dramatically improves the performance of the sequential-search algorithm. For a surrogate model , the importance of the output relative to the input is quantified by the likelihood ratio

(8)

where denotes the pdf of the input and the pdf of the posterior mean conditioned on the input.

The likelihood ratio is important in cases where some points are more important than others in determining the value of the output. It assigns more weight to a point which has low probability of being observed but has a large impact on the magnitude of output, than to a point which is likely to be observed but whose impact on the output is small (i.e., remains close to its mean). In other words, it favors points for which the magnitude of the output is unusually large over points associated with frequent, average output values.

It is thus no surprise that the first use of the likelihood ratio in an acquisition function was in the context of rare-event quantification. Specifically, Sapsis [22] introduced the Q criterion

(9a)
(9b)

where is the predictive variance at had been observed. The Q criterion has been shown to outperform many of the purely explorative acquisition functions traditionally used in experimental design because of precisely the use of the likelihood ratio. Therefore, there is reason to believe that gains of similar proportion can be achieved in Bayesian optimization.

In Bayesian optimization, the likelihood ratio can be beneficial in at least two ways. First, it should improve performance in situations in which the global minimum of the objective function “lives” in the (heavy) left tail of the output pdf . Second, the likelihood ratio makes it possible to distill any prior knowledge one may have about the distribution of the input. For “vanilla” Bayesian optimization, it is natural to use a uniform prior for because in general any point

is as good as any other. But if the optimization problem arises in the context of uncertainty quantification of a physical system (where the input space comes equipped with a generally non-uniform probability distribution), or if one has prior beliefs about where the global minimizers may be located, then use of a non-trivial

in the likelihood ratio can considerably improve performance of the search algorithm.

Likelihood-Weighted Acquisition Functions

When the likelihood ratio appears in an acquisition function, we refer to the latter as a “likelihood-weighted” (LW) acquisition function. The only LW acquisition function we have seen so far is the Q criterion, and it is a purely explorative one. To construct LW acquisition functions suitable for Bayesian optimization, we proceed methodically.

We first note that we could repurpose the Q criterion as we did IVR. However, computing and its gradient is cumbersome, so instead we prove that is strictly equivalent to

(10)

which can be viewed as the LW counterpart of IVR defined in Eq. 6 (SI Appendix, Section S3). We can then construct our first LW acquisition function for Bayesian optimization by repurposing IVR-LW as

(11)

We also introduce

(12)

as the likelihood-weighted counterpart to LCB. Here, it is natural to incorporate the likelihood ratio in a product with the explorative term . This is in the same spirit as Eq. 9a where naturally appears as a sampling weight for the posterior variance. By contrast, there is no principled way to incorporate in PI and EI. These two functions will therefore be used as baselines.

Approximation of the Likelihood Ratio

We must ensure that introduction of the likelihood ratio does not compromise our ability to compute the acquisition functions and their gradients efficiently.

We first note that to evaluate the likelihood ratio, we must estimate the conditional pdf of the posterior mean , typically at each iteration. This is done by computing

for a large number of input points and applying kernel density estimation (KDE) to the resulting samples. Fortunately, KDE is to be performed in the (one-dimensional) output space, allowing use of fast FFT-based algorithms which scale linearly with the number of samples

[4].

Yet, the issue remains that in IVR-LW(BO), the likelihood ratio appears in an integral over the input space. To avoid resorting to Monte Carlo integration, we approximate

with a Gaussian mixture model (GMM):

(13)

The GMM approximation has three key advantages. First, when combined with the RBF kernel, the problematic integrals and their gradients become analytic (SI Appendix, Section S3). This is important because it makes the approach tractable in high dimensions, a situation with which other sampling-based acquisition functions are known to struggle. Second, the GMM approximation imposes no restriction on the nature of and . This is an improvement over the approach of Sapsis [22] whose approximation of by a second-order polynomial led to the requirement that be Gaussian or uniform. Third, it allows analytical computation of the gradients of LCB-LW.

We illustrate the benefits of using the likelihood ratio in the 2-D Ackley and Michalewicz functions, two test functions notoriously challenging for optimization algorithms. (Analytical expressions are given in the SI Appendix, Section S4.) For these functions, Fig. 1 makes it visually clear that the likelihood ratio gives more emphasis to the region where the global minimum is located (i.e., the center region for the Ackley function, and the slightly off-center region for the Michalewicz function). Figure 1 also shows that can be approximated satisfactorily with a small number of Gaussian mixtures, a key pre-requisite for preserving algorithm efficiency.

(a) 2-D Ackley function
(b) 2-D Michalewicz function
Figure 1: Contour plots of the objective function (left panel), the likelihood ratio with uniform (center panel), and the GMM approximation (right panel). We used for the Ackley function and for the Michalewicz function.

The number of Gaussian mixtures to be used in Eq. 13 is at the discretion of the user. In this work, it is kept constant throughout the search, although the option exists to modify it “on the fly”, either according to a pre-defined schedule or by selecting the value of that minimizes the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) at each iteration [26]. While promising, this approach is left for future investigation. (Further details about implementation may be found in the SI Appendix, Section S5.)

Results

To demonstrate the benefits of using the likelihood ratio, we conduct a series of numerical experiments with the acquisition functions introduced above (SI Appendix, Section S6). Specifically, we compare EI, PI, LCB(-LW), IVR(-BO), and IVR-LW(BO). We include IVR and IVR-LW in this list because they provide insight about the behavior of IVR-BO and IVR-LWBO, respectively, in the limit of large .

Benchmark of Test Functions

We begin with the benchmark of six test functions commonly used to evaluate performance of optimization algorithms (SI Appendix, Section S4). The Branin and 6-D Hartmann functions are moderately difficult, while the other four are particularly challenging: the Ackley function features numerous local minima found in a nearly flat region surrounding the global minimum; the Bukin function is non-differentiable with a steep ridge filled with local minima; and the Michalewicz functions have steep valleys and ridges with a global minimum accounting for a tiny fraction of the search space.

For each test function, the input space is rescaled to the unit hypercube in order to facilitate training of the GP hyper-parameters. The noise variance is specified as and appropriately rescaled to account for the variance of the objective function. We use for the 2-D functions and otherwise. In each case the initial points are selected by Latin Hypercube Sampling (LHS).

For , Fig. 2 shows that the LW acquisition functions systematically and substantially outperform their unweighted counterparts in identifying the location and objective value of the global minimum. The only exception is with the Branin function, for which the likelihood ratio appears to have no positive effect. For a possible explanation, consider that the output pdf of the Branin function has a very light left tail (SI Appendix, Fig. S1). This is stark contrast with the other objective functions, suggesting that a heavy left tail is a pre-requisite for LW acquisition functions to be competitive.

Figure 2 confirms that use of the likelihood ratio allows the algorithm to explore the search space more efficiently by focusing on regions where the magnitude of the objective function is thought to be unusually large. For a clear manifestation of this, consider the left panel in Fig. (c)c, where the three LW acquisition functions dramatically outclass the competition, a consequence of the fact that the likelihood ratio helps the algorithm target the ridge of the Bukin function much more rapidly and thoroughly than otherwise.

Figures (e)e and (f)f demonstrate superiority of our approach in high dimensions. Overall, Fig. 2 suggests that the best-performing acquisition functions are LCB-LW and IVR-LWBO, solidifying the utility of the likelihood ratio in Bayesian optimization. Figure 2 also makes clear that there is no “warm-up” period for the LW acquisition functions. That the latter “zero in” much faster than the competition is invaluable since the power of Bayesian optimization lies in keeping the number of black-box evaluations at a minimum.

We have investigated how computation of the likelihood ratio affects algorithm efficiency (SI Appendix, Section S7). We have found that the benign increase in runtime a) can be largely mitigated if sampling of the posterior mean is done frugally, and b) is inconsequential when each black-box query takes days or weeks, as in most practical applications.

(a) 2-D Ackley function
(b) Branin function
(c) Bukin function
(d) 2-D Michalewicz function
(e) 6-D Hartmann function
(f) 10-D Michalewicz function
Figure 2: Performance of EI, PI, IVR(-BO), IVR-LW(BO) and LCB(-LW) for six benchmark test functions (left: distance to global minimizer; right: simple regret).

Computation of Precursors for Extreme Events in Dynamical Systems

For a real-world application, we consider the problem of predicting the occurrence of extreme events in dynamical systems. This is a topic worthy of investigation because extreme events (e.g., earthquakes, road accidents, wildfires) have the potential to cause significant damage to people, infrastructure and nature [1]. From a prediction standpoint, the central issue is to identify precursors, i.e., those states of the system which are most likely to lead to an extreme event in the near future. But searching for precursors is no easy task because extreme events often arise in highly complex dynamical systems, which adds to the issue of low frequency of occurrence. To combat this, one can use Bayesian optimization to parsimoniously probe the state space of the system and thus identify “dangerous” regions with as little data as possible.

Formally, the dynamical system is treated as a black box, which assigns to an initial condition a measure of dangerousness, e.g.,

(14)

Here, denotes the time variable, the flow map of the system (i.e., the dynamics of the black box), the observable of interest, and the time horizon over which prediction is to be performed. In words, records the maximum value attained by the observable during the time interval given initial condition . The role of Bayesian optimization is to search for those that give rise to large values of indicative of an extreme event occurring within the next time units.

In practice, not the whole space of initial conditions is explored by the algorithm, as this would allow sampling of unrealistic initial conditions. Instead, one may view extreme events as excursions from a “background” attractor [6]

, for which a low-dimensional representation can be constructed by principal component analysis (PCA). Searching for precursors in the PCA subspace makes Bayesian optimization computationally tractable. Another advantage is that the PCA subspace comes equipped with the Gaussian prior

, where the diagonal matrix

contains the PCA eigenvalues.

We consider the dynamical system introduced in Ref. [5] in the context of extreme-event prediction in turbulent flow. (Governing equations are given in the SI Appendix, Section S8.) Figure (a)a shows that the system features successive “cycles” during which a trajectory initiated close to the origin spirals away towards the point , only to find itself swiftly repelled from the plane. After some “hovering about”, the trajectory ultimately heads back to the origin, and the cycle repeats itself. Here, extreme events correspond to “bursts” in the coordinate, as shown in Fig. (b)b.

(a)
(b)
Figure 3: (a) Phase-space view of trajectory with extreme events; (b) time series for the coordinate.

To identify precursors for these bursts, we apply Bayesian optimization to the function with observable . The background attractor is approximated with the leading two principal components, which roughly span the plane. We use for the prediction horizon, for the noise variance, and two Gaussian mixtures for the GMM fit. The search algorithm is initialized with three points sampled from an LHS design. For the search space, we require that

lie no further than four PCA standard deviations in any direction.

Figure 4 shows that here again the LW acquisition functions trumps their unweighted cousins by a significant margin. The two best-performing acquisition functions are again LCB-LW and IVR-LWBO, with the former predicting an objective value (about in Fig. 4) within a few percent of the observation (about in Fig. (b)b). We note that use of the Gaussian prior for may lead to even greater gains when the dynamical system features multiple dangerous regions, with some more “exotic” than others.

Figure 4: Performance of EI, PI, IVR(-BO), IVR-LW(BO) and LCB(-LW) for computation of extreme-event precursor (left: simple regret; right: observation regret).

Discussion

We have established the overwhelmingly positive effect of the likelihood ratio, a quantity that accounts for the importance of the output relative to the input, on the performance of Bayesian optimization algorithms. We have shown that use of the likelihood ratio in an acquisition function can dramatically improve algorithm efficiency, with faster convergence seen in a number of synthetic test functions.

The significance of our approach is manifold. First, it is the first of its kind to leverage information about the output space without resorting to tedious entropy calculations, facilitating implementation in an existing Bayesian optimization code. Second, in addition to output information, the likelihood ratio encapsulates any prior knowledge one may have about the distribution of the input, which has significant implications in uncertainty quantification and experimental design where the input space often comes equipped with a non-uniform probability distribution. Third, the likelihood ratio can be satisfactorily approximated with a Gaussian mixture model, making it possible to compute any “likelihood-weighted” acquisition function (and its gradients) analytically for a range of Gaussian process kernels. Analytical tractability allows for the possibility of the input space being high-dimensional, providing a significant advantage over entropy-based acquisition functions, known to struggle in such situations.

While we have successfully applied the proposed method to the problem of extreme-event prediction in complex dynamical systems, it is important to note that our approach can be applied to any

system that can be viewed as a black box. This includes problems related to optimal path planning of autonomous vehicles, hyper-parameter training of deep-learning algorithms, scheduling and planning of operational tasks for revenue maximization, and drug synthesis and discovery.

References

  • [1] S. Albeverio, V. Jentsch, and H. Kantz (2006) Extreme events in nature and society. Springer Verlag, New York. Cited by: Computation of Precursors for Extreme Events in Dynamical Systems.
  • [2] E. Brochu, V. M. Cora, and N. De Freitas (2010)

    A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning

    .
    arXiv preprint arXiv:1012.2599. Cited by: Introduction, A Brief Review of Bayesian Optimization, Formulation of the problem.
  • [3] K. Chaloner and I. Verdinelli (1995) Bayesian experimental design: a review. Statistical Science 10, pp. 273–304. Cited by: Introduction.
  • [4] J. Fan and J. S. Marron (1994) Fast implementations of nonparametric curve estimators. Journal of Computational and Graphical Statistics 3, pp. 35–56. Cited by: Approximation of the Likelihood Ratio.
  • [5] M. Farazmand and T. P. Sapsis (2016) Dynamical indicators for the prediction of bursting phenomena in high-dimensional systems. Physical Review E 94, pp. 032212. Cited by: Appendix S8, Computation of Precursors for Extreme Events in Dynamical Systems.
  • [6] M. Farazmand and T. P. Sapsis (2017) A variational approach to probing extreme events in turbulent dynamical systems. Science Advances 3, pp. e1701533. Cited by: Computation of Precursors for Extreme Events in Dynamical Systems.
  • [7] P. I. Frazier (2018) A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811. Cited by: Introduction.
  • [8] P. Hennig and C. J. Schuler (2012) Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13, pp. 1809–1837. Cited by: Introduction, Introduction.
  • [9] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani (2014) Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926. Cited by: Introduction, Introduction.
  • [10] M. W. Hoffman and Z. Ghahramani (2015) Output-space predictive entropy search for flexible global optimization. In NIPS workshop on Bayesian Optimization, Cited by: Introduction.
  • [11] X. Huan and Y. M. Marzouk (2016) Sequential Bayesian optimal experimental design via approximate dynamic programming. arXiv preprint arXiv:1604.08320. Cited by: Introduction.
  • [12] D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, pp. 455–492. Cited by: Introduction, Introduction, Acquisition Functions.
  • [13] E. Kaufmann, O. Cappé, and A. Garivier (2012) On Bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, pp. 592–600. Cited by: Introduction.
  • [14] H. J. Kushner (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Fluids Engineering 86, pp. 97–106. Cited by: Introduction.
  • [15] A. McHutchon (2013) Differentiating Gaussian processes. Note: http://mlg.eng.cam.ac.uk/mchutchon/DifferentiatingGPs.pdf Cited by: Appendix S2, Appendix S3.
  • [16] J. Mockus, V. Tiesis, and A. Zilinskas (1978) The application of Bayesian methods for seeking the extremum. Towards Global Optimization 2, pp. 117–129. Cited by: Introduction.
  • [17] M. A. Mohamad and T. P. Sapsis (2018) Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. Proceedings of the National Academy of Sciences 115, pp. 11138–11143. Cited by: Introduction.
  • [18] A. B. Owen (2013) Monte Carlo theory, methods and examples. Cited by: Introduction.
  • [19] C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning. MIT Press, Cambridge, MA. Cited by: Appendix S1, Appendix S1, Model Selection in Bayesian Optimization.
  • [20] B. Ru, M. McLeod, D. Granziol, and M. A. Osborne (2017) Fast information-theoretic Bayesian optimisation. arXiv preprint arXiv:1711.00673. Cited by: Appendix S6, Introduction.
  • [21] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn (1989) Design and analysis of computer experiments. Statistical Science 4, pp. 409–423. Cited by: Acquisition Functions.
  • [22] T. P. Sapsis (2020) Output-weighted optimal sampling for Bayesian regression and rare event statistics using few samples. Proceedings of the Royal Society A 476, pp. 20190834. Cited by: Introduction, The Role of the Likelihood Ratio, The Role of the Likelihood Ratio, Approximation of the Likelihood Ratio.
  • [23] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas (2015) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, pp. 148–175. Cited by: Introduction, A Brief Review of Bayesian Optimization, Model Selection in Bayesian Optimization.
  • [24] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995. Cited by: Appendix S6, Introduction, Acquisition Functions.
  • [25] S. Surjanovic and D. Bingham (2013) Virtual library of simulation experiments: test functions and datasets. Note: http://www.sfu.ca/~ssurjano Cited by: Appendix S4.
  • [26] J. VanderPlas (2016)

    Python data science handbook: essential tools for working with data

    .
    O’Reilly Media. Cited by: Approximation of the Likelihood Ratio.
  • [27] J. Villemonteix, E. Vazquez, and E. Walter (2009) An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization 44, pp. 509–534. Cited by: Introduction.
  • [28] Z. Wang and S. Jegelka (2017) Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 3627–3635. Cited by: Appendix S6, Introduction.

Appendix S1 Review of Gaussian process regression

A Gaussian process is completely specified by its mean function and covariance function . For a dataset of input–output pairs (written in matrix form as ) and a Gaussian process with constant mean , the random process conditioned on follows a normal distribution with posterior mean and variance

(S15a)
(S15b)

respectively, where . Equation S15a can be used to predict the value of the surrogate model at any point , and Eq. S15b to quantify uncertainty in prediction at that point [19].

In GP regression, the choice of covariance function is crucial, and in this work we use the radial-basis-function (RBF) kernel with automatic relevance determination (ARD):

(S16)

where is a diagonal matrix containing the lengthscales for each dimension. The advantage of the RBF kernel will become clear in the next sections.

Exact inference in Gaussian process regression requires inverting the matrix , typically at each iteration. This is usually done by Cholesky decomposition whose cost scales like , with the number of observations [19]. (A cost of can be achieved if the parameters of the covariance function are fixed.) Although an scaling may seem daunting, it is important to note that in Bayesian optimization the number of observations (i.e., function evaluations) rarely exceeds a few dozens (or perhaps a few hundreds), as an unreasonably large number of observations would defeat the whole purpose of the algorithm.

Appendix S2 Analytical Expressions for the IVR Acquisition Function with RBF Kernel

We begin by rewriting the formula for IVR using the GP expression for the posterior covariance:

(S17a)
(S17b)

If we introduce

(S18)

then Eq. S17b can be rewritten as

(S19)

This helps us realize that to compute IVR and its gradients, we only need a mechanism to compute Eq. S18 and its gradients, regardless of the choice of GP kernel.

For the RBF kernel

(S20)

we have

(S21a)
and
(S21b)

For further details, we refer the reader to Ref. [15].

Appendix S3 Equivalence Between the IVR-LW Acquisition Function and the Q Criterion

We first note that the optimization problem

(S22)

is strictly equivalent to

(S23)

as the term involving in Eq. S23 does not depend on the optimization variable . Rewriting the Q criterion in this form has the advantage of making it clear that the next best point is to be chosen such that the total likelihood-weighted variance is reduced maximally.

Assuming that the GP hyper-parameters are not updated upon addition of the “ghost point” , we have

(S24)

where , and we have defined . But can be written as

(S25)

Therefore, the Cholesky decomposition of is simply , with

(S26)

where we have Cholesky-decomposed as . With this in hand, we compute

(S27)

which leads to

(S28)

We then substitute Eq. S28 into Eq. S24, and after simplification obtain

(S29)

which is nothing more than the square of the covariance between and multiplying the inverse of . Thus, we have established that minimizing the Q criterion,

(S30)

is strictly equivalent to maximizing IVR-LW,

(S31)

This proof of equivalence has significant implications from the standpoint of implementation. To see this, consider that when the likelihood ratio is approximated with a Gaussian mixture model, the Q criterion can be written as a linear combination of terms, with the th one given by

(S32a)
(S32b)

The advantage of rewriting Eq. S32a as Eq. S32b is that to evaluate the Q criterion at for a given GP kernel, we only need a mechanism to compute

(S33)

since the first term on the right-hand side of Eq. S32b does not depend on . The problem arises with the gradients of , as the dependence of the latter on is rather complicated. The variable appears in the inverse of and in a -like term, the product of which is composed with the trace operator. By contrast, the terms involved in the computation of IVR-LW have the form

(S34a)
(S34b)
(S34c)

for which the dependence on is simpler. (Notice the similarities between Eqs. S34c and S19.) It follows that to evaluate and its gradients, we only need a mechanism to compute and its gradients.

For the RBF kernel, it is straightforward to show that

(S35a)
and
(S35b)

For further details, we refer the reader to Ref. [15].

Appendix S4 Analytical Expressions for Benchmark Test Functions

The analytical expressions for the synthetic test functions considered in Results are given below. Further details may be found in Ref. [25]. For these test functions, Fig. S5

shows the conditional pdfs of the output for uniformly distributed input. We used

samples for the 2-D functions, and samples for the 6-D Hartmann and 10-D Michalewicz functions.

Ackley function:

(S36)

where denotes the dimensionality of the function, and , , and .

Branin function:

(S37)

where , , , , , and .

Bukin function:

(S38)

Michalewicz:

(S39)

where denotes the dimensionality of the function, and controls the steepness of the valleys and ridges. In this work we use , making optimization extremely challenging.

6-D Hartmann function:

(S40)

where

(S41a)
(S41b)
(S41c)

Appendix S5 Implementation

We summarize our algorithm for computation of likelihood-weighted acquisition functions in Algorithm S2. We also note that we have strived to adhere to conventional notation, at the expense of making it clear whether the acquisition function should be minimized or maximized. The summary presented in Table S1 should dissipate any ambiguity.

Although several open-source libraries are available for Bayesian optimization, we develop our own in-house package because it allows for more flexibility. Our code is written in Python and uses

GPy111https://sheffieldml.github.io/GPy for Gaussian process regression. We use KDEpy222https://kdepy.readthedocs.io to estimate the output pdf , and scikit-learn333https://scikit-learn.org for the GMM fit. By default, the search for the next best point is conducted in the normalized output space, as we have found that it facilitates convergence of the optimization algorithm. For the latter, we use the scipy444https://www.scipy.org implementation of the l-bfgs-b algorithm, although the tnc and slsqp algorithms are also supported.

Apart from the obvious fact that it provides implementation of LW acquisition functions, our package has the following advantages:

  • [noitemsep,topsep=0pt]

  • it has much less code than other packages, making it suitable for experimentation and research;

  • it uses parallelization (using joblib555https://joblib.readthedocs.io) to optimize the acquisition function and compute the final recommendation;

  • it offers the possibility of benchmarking multiple metrics in the same run, in parallel; and

  • it makes it simple to extend the analytical formulas for IVR and its variants to other kernels beside RBF.

We have released the package on GitHub under the name gpsearch666[Link will be inserted here]. We note that implementing an LW acquisition function on top of an existing package (e.g., Emukit777https://amzn.github.io/emukit) requires little effort.

Appendix S6 Experimental protocol

For each example considered in Results, we indicate the number of Gaussian mixtures used in the GMM approximation. We use for EI and PI, and for LCB, IVR and their respective variants. (Although not considered in this work, use of a schedule for has the potential of favorably affecting the algorithm [24].) We have made no attempt to optimize these parameters, as this issue is beyond the scope of the present work. Unlike others, we do not set the noise variance , but rather let the algorithm learn it from the data.

When the location of the global minimum and corresponding function value are known, we report the simple regret

(S42)

where denotes the optimizer recommendation at iteration , and is the index of the current iteration [28]. Because several of the functions we are to consider are highly multimodal and oscillatory, we also report the distance

(S43)

as in Ref. [20]. (When the function has multiple global minimizers, we compute for each minimizer and report the smallest value.) When the global minimum of the objective function is not known a priori, we use Eq. S42 with , which we complement with the observation regret

(S44)

For each example considered, we run 100 Bayesian experiments, each differing in the choice of initial points, and report the median and median absolute deviation for the metrics introduced above. (The shaded error bands in Figs. 2 and 4 indicate a quarter of the latter.) Experiments were performed on a computer cluster equipped with 40 Intel Xeon E5-2630v4 processors clocking at 2.2 GHz.

Appendix S7 Comparison of Runtime for Likelihood-Weighted and Unweighted Acquisition Functions

To investigate how computation of the likelihood ratio affects the overall efficiency of the algorithm, we proceed as follows. For the Ackley function, we consider an initial dataset composed of ten LHS input–output pairs and, for the eight acquisition functions considered in Results, record the time required to perform a single iteration of the Bayesian algorithm. During that time interval, the following operations are performed: computation of the likelihood ratio (for LW acquisition functions only), minimization of the acquisition function over the input space, query of the objective function at the next best point, and update of the surrogate model.

Since the focus is on the likelihood ratio, we investigate the effect of the following parameters on runtime: number of samples drawn from the posterior mean used in KDE, number of Gaussian mixtures used in the GMM fit, and dimensionality of the objective function . The baseline case uses , , and . (These are the parameters used to generate Fig. 2a in the main body.) For each parameter, we perform 100 experiments (each with a different LHS initialization, followed by a single Bayesian iteration) and report the median runtime (in seconds) for EI, PI, IVR(-BO), IVR-LW(BO), and LCB(-LW).

Figure S6 shows that computation of the likelihood ratio has a mildly adverse effect on algorithm efficiency, with the main culprit being the sampling of the posterior mean. The left panel in Fig. S6 shows a relatively strong dependence of the runtime on . We have verified that for the LW acquisition functions, very little time is spent in the KDE part of the algorithm—as discussed in the main body, one-dimensional FFT-based KDE scales linearly with the number of samples. Rather, most of the iteration time is spent in generating the samples of the posterior mean . Little can be done to avoid this issue, except for using a reasonably small number of samples. For , the left panel in Fig. S6 shows that runtime for LW acquisition functions is not dramatically larger than that for unweighted acquisition functions. We note that for relatively low dimensions, reducing the number of samples has virtually no effect on the accuracy of the KDE. For , the center and right panels in Fig. S6 suggest that runtime scales linearly with the number of Gaussian mixtures and dimensionality, which is unproblematic from the standpoint of efficiency.

As a final word, we note that the runtimes for LW acquisition functions remain on the same order of magnitude as that for unweighted acquisition functions. In practical applications, evaluation of the black-box objective function may take days or even weeks, so whether computation and optimization of the acquisition function takes one or ten seconds is inconsequential.

Appendix S8 Governing Equations for Dynamical System with Extreme Events

The governing equations for the dynamical system consider in Results are given by

(S45a)
(S45b)
(S45c)

with parameters , , and . For further details we refer the reader to Ref. [5].

(a) 2-D Ackley function
(b) Branin function
(c) Bukin function
(d) 2-D Michalewicz function
(e) 6-D Hartmann function
(f) 10-D Michalewicz function
Figure S5: Conditional pdf with uniform input for the benchmark test functions considered in this work.
Figure S6: Effect of , , and , on single-iteration runtime for the Ackley function.
Acquisition Function Equation Rule
PI 3 maximize
EI 4 maximize
LCB(-LW) 5, 12 minimize
IVR(-LW) 6, 10 maximize
IVR-BO(LW) 7, 11 minimize
Q 9a minimize
Table S1: Summary of the acquisition functions considered in this work.
1:  Input: Number of iterations
2:  Initialize: Surrogate model on initial dataset of input–output pairs
3:  for  to  do
4:     Select next best point by minimizing acquisition function :
5:     Evaluate objective function at and record
6:     Augment dataset:
7:     Update surrogate model
8:  end for
9:  Return: Final recommendation from surrogate model
Algorithm S1 Bayesian optimization
1:  Input: Posterior mean , input pdf , functional form for LW acquisition function , number of Gaussian mixtures
2:  do 
3:   Sample posterior mean
4:   Estimate by KDE
5:   Compute
6:   Fit GMM to
7:  end do 
8:  Return: and gradients in analytic form
Algorithm S2 Likelihood-weighted acquisition function