 # Uncertainty Quantification for Bayesian Optimization

Bayesian optimization is a class of global optimization techniques. It regards the underlying objective function as a realization of a Gaussian process. Although the outputs of Bayesian optimization are random according to the Gaussian process assumption, quantification of this uncertainty is rarely studied in the literature. In this work, we propose a novel approach to assess the output uncertainty of Bayesian optimization algorithms, by constructing confidence regions of the maximum point or value of the objective function. These regions can be computed efficiently, and their confidence levels are guaranteed by newly developed uniform error bounds for sequential Gaussian process regression. Our theory provides a unified uncertainty quantification framework for all existing sequential sampling policies and stopping criteria.


## 1 Introduction

The empirical and data-driven nature of the field of data science makes uncertainty quantification one of the central questions that must be addressed in order to guide and safeguard decision making. In this work, we focus on Bayesian optimization, which is effective in solving global optimization problems for complex black-box functions. Our objective is to quantify the uncertainty of Bayesian optimization outputs. Such uncertainty comes from the Gaussian process prior, random inputs, and the stopping time. A closed-form expression of the output uncertainty is usually intractable because of the complicated sampling scheme and stopping criterion.

### 1.1 Problem of interest

Let $f$ be an underlying deterministic continuous function over $\Omega$, a compact subset of $\mathbb{R}^p$. The goal of global optimization is to find the maximum of $f$, denoted by $\max_{x\in\Omega} f(x)$, or the point $x^*$ which satisfies $f(x^*) = \max_{x\in\Omega} f(x)$. In many scenarios, objective functions can be expensive to evaluate. For example, an $f$ defined by a complex computer model may take a long time to run. Bayesian optimization is a powerful technique for this type of problem, and has been widely used in areas including engineering system design (Forrester et al., 2008; Jones et al., 1998; Mockus et al., 1978), materials and drug design (Frazier and Wang, 2016; Negoescu et al., 2011; Solomou et al., 2018), chemistry (Häse et al., 2018), deep neural networks (Diaz et al., 2017; Klein et al., 2017), and reinforcement learning (Marco et al., 2017; Wilson et al., 2014).

In Bayesian optimization, $f$ is treated as a realization of a stochastic process, denoted by $Z$. Usually, one assumes that $Z$ is a Gaussian process. Every Bayesian optimization algorithm defines a sequential sampling procedure, which successively generates new input points based on the acquired function evaluations over all previous input points. Usually, the next input point is determined by maximizing an acquisition function. Examples of acquisition functions include probability of improvement (Kushner, 1964), expected improvement (Huang et al., 2006; Jones et al., 1998; Mockus et al., 1978; Picheny et al., 2013), Gaussian process upper confidence bound (Azimi et al., 2010; Contal et al., 2013; Desautels et al., 2014; Srinivas et al., 2010), predictive entropy search (Hernández-Lobato et al., 2014), entropy search portfolio (Shahriari et al., 2014), and knowledge gradient (Scott et al., 2011; Wu and Frazier, 2016; Wu et al., 2017). We refer to Frazier (2018) for an introduction to popular Bayesian optimization methods.

Although Bayesian optimization has received considerable attention and numerous techniques have emerged in recent years, how to quantify the uncertainty of the outputs of a Bayesian optimization algorithm is rarely discussed in the literature. Since we assume that $f$ is a random realization of $Z$, $\max_{x\in\Omega} f(x)$ and $x^*$ should also be random. However, the highly nontrivial distributions of $\max_{x\in\Omega} f(x)$ and $x^*$ make uncertainty quantification rather challenging. Monte Carlo approaches can be employed to compute the posterior distributions of $\max_{x\in\Omega} f(x)$ and $x^*$, but they are usually computationally expensive, especially when a large number of observations are available.

It is worth noting that uncertainty quantification typically differs from convergence analysis of algorithms. In Bayesian optimization, the latter topic has been studied more often. See, for instance, (Bect et al., 2019; Bull, 2011; Calvin, 2005, 1997; Ryzhov, 2016; Vazquez and Bect, 2010; Yarotsky, 2013). These analyses do not naturally produce uncertainty quantification techniques.

### 1.2 Our results

Inspired by a recent theoretical advance in Gaussian process regression (Wang et al., 2019), we develop efficient methods to construct confidence regions of $\max_{x\in\Omega} f(x)$ and $x^*$ for Bayesian optimization algorithms.

**Uniform confidence intervals for Gaussian process regression.** We find sharp uniform error bounds for the Gaussian process regression predictor, given by Theorem 1. To the best of our knowledge, this is the first theoretical result of this kind. Compared with the traditional point-wise predictive standard deviation of Gaussian process regression, denoted by $\sigma(x)$, our uniform bound is only inflated by a factor proportional to $\log^{1/2}(e\sigma/\sigma(x))$, where $\sigma$ is the prior standard deviation. This bound leads to a uniform confidence upper limit of the reconstructed function, given by Algorithm 1. Subsequently, we construct confidence regions of $\max_{x\in\Omega} f(x)$ and $x^*$ based on the confidence upper limit, and show the theoretical guarantees of their confidence level in Corollary 2. The results in this part work for fixed designs and will serve as a fundamental tool in the development of the general results under sequential sampling schemes.

**Uncertainty quantification under sequential samplings.** There exist various Bayesian optimization approaches, based on different sequential sampling schemes. Our uncertainty quantification method does not rely on the specific formulae or strategies. We introduce a rather general framework called abstract Bayesian optimization (ABO), which covers all existing methods in an abstract sense. We show that, using the collected data of any instance algorithm of ABO, Algorithm 1 again gives a confidence upper limit with the same theoretical properties. Although sequential sampling is more complicated and takes many different forms, our theory shows that uncertainty quantification done in the same way as under fixed designs achieves the same confidence level.

## 2 Optimization with Gaussian process regression under fixed designs

In Bayesian optimization, we evaluate $f$ over a set of input points, denoted by $X$. We call these the design points, because they can be chosen according to our will. There are two categories of strategies for choosing design points. We can choose all the design points before we evaluate $f$ at any of them. Such a design set is called a fixed design. An alternative strategy is called sequential sampling, in which the design points are not fully determined at the beginning. Instead, points are added sequentially, guided by the information from the previous input points and the corresponding acquired function values.

In Bayesian optimization, sequential samplings are more popular, because such approaches can utilize the information from the previous responses and choose new design points in regions that are more likely to contain the maximum points. Before investigating the more important sequential sampling schemes, we shall first consider fixed designs in this section, because the latter situation is simpler and will serve as an important intermediate step toward the general problem in Section 4.

### 2.1 Preliminaries

In this work, we suppose that the Gaussian process $Z$ has mean zero, variance $\sigma^2$ and correlation function $\Psi$, i.e., $\mathrm{Cov}(Z(x), Z(x')) = \sigma^2 \Psi(x - x')$ with $\Psi(0) = 1$. Under certain regularity conditions, Bochner's theorem (Wendland, 2004) suggests that the Fourier transform (with a specific choice of the constant factor) of $\Psi$, denoted by $\tilde\Psi$, is a probability density function and satisfies the inversion formula $\Psi(x) = \int_{\mathbb{R}^p} e^{i\omega^T x}\tilde\Psi(\omega)\,d\omega$. We call $\tilde\Psi$ the spectral density of $\Psi$. Some popular choices of correlation functions and their spectral densities are discussed in Section 2.4.

Suppose the set of design points $X = (x_1,\ldots,x_n)^T$ is given. Then $f$ can be reconstructed via Gaussian process regression. Let $Y = (Z(x_1),\ldots,Z(x_n))^T$ be the vector of evaluations of the Gaussian process at the design points. The following results are well-known and can be found in Rasmussen and Williams (2006). For any untried point $x \in \Omega$, conditional on $Y$, $Z(x)$ follows a normal distribution. The conditional mean and variance of $Z(x)$ are

$$\mu(x) := \mathbb{E}[Z(x)\mid Y] = r^T(x)K^{-1}Y, \qquad (1)$$

$$\sigma^2(x) := \mathrm{Var}[Z(x)\mid Y] = \sigma^2\bigl(1 - r^T(x)K^{-1}r(x)\bigr), \qquad (2)$$

where $r(x) = (\Psi(x - x_1),\ldots,\Psi(x - x_n))^T$ and $K = (\Psi(x_j - x_k))_{1\le j,k\le n}$. Since we assume that $f$ is a realization of $Z$, $\mu$ can serve as a reconstruction of $f$.
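As a concrete illustration, the conditional mean and variance in (1)-(2) amount to a few lines of linear algebra. The sketch below is a minimal illustration (not the authors' code), assuming a Gaussian correlation function with a hypothetical length-scale parameter `theta`:

```python
import numpy as np

def gaussian_corr(d, theta=1.0):
    """Gaussian correlation Psi(d) = exp(-(d/theta)^2); theta is a
    hypothetical length-scale parameter used only for this sketch."""
    return np.exp(-(d / theta) ** 2)

def gp_regression(X, Y, x, sigma2=1.0, theta=1.0):
    """Conditional mean and variance of Z(x) given Y, as in (1)-(2).

    X: (n, p) design points; Y: (n,) evaluations; x: (p,) untried point."""
    # K_{jk} = Psi(x_j - x_k) and r_i(x) = Psi(x - x_i)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = gaussian_corr(dists, theta)
    r = gaussian_corr(np.linalg.norm(X - x, axis=-1), theta)
    mu = r @ np.linalg.solve(K, Y)                    # Eq. (1)
    var = sigma2 * (1.0 - r @ np.linalg.solve(K, r))  # Eq. (2)
    return float(mu), float(max(var, 0.0))
```

At a design point, $\mu$ interpolates the observed value and $\sigma^2$ vanishes, consistent with $\mu$ serving as a reconstruction of $f$.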

### 2.2 Uniform error bound

Although the conditional distribution of $Z(x)$ is simple, as shown in (1)-(2), those of $\max_{x\in\Omega} Z(x)$ and $\operatorname{argmax}_{x\in\Omega} Z(x)$ are highly non-trivial because they are nonlinear functionals of $Z$. In this work, we construct confidence regions for the maximum points and values using a uniform error bound for Gaussian process regression, given in Theorem 1. We will use the notation $a \vee b := \max(a, b)$. Also, we shall use the convention $0/0 = 0$ in all statements in this article related to error bounds.

We need the following condition. For a vector $\omega = (\omega_1,\ldots,\omega_p)^T$, define its $\ell_1$-norm as $\|\omega\|_1 = \sum_{i=1}^p |\omega_i|$.

###### Condition 1.

The correlation function $\Psi$ has a spectral density, denoted by $\tilde\Psi$, and

$$A_0 = \int_{\mathbb{R}^p} \|\omega\|_1 \tilde\Psi(\omega)\,d\omega < +\infty. \qquad (3)$$
###### Theorem 1.

Suppose Condition 1 holds. Let

$$M = \sup_{x\in\Omega}\frac{Z(x) - \mu(x)}{\sigma(x)\log^{1/2}(e\sigma/\sigma(x))},$$

where $\mu(x)$ and $\sigma^2(x)$ are given in (1) and (2), respectively. Then the following statements are true.

1. $\mathbb{E}M \le C_0\sqrt{p(1 \vee \log(A_0 D_\Omega))}$, where $C_0$ is a universal constant, $A_0$ is as in Condition 1, and $D_\Omega$ is the Euclidean diameter of $\Omega$.

2. For any $t > 0$, $P(M - \mathbb{E}M > t) \le e^{-t^2/2}$.

###### Remark 1.

The $\ell_1$-norm in (3) can be replaced by the usual Euclidean norm. However, we use the former here because the resulting integrals usually have explicit expressions. See Section 2.4 for details.

In practice, Part 2 of Theorem 1 is hard to use directly because $\mathbb{E}M$ is difficult to calculate accurately. Instead, we can replace $\mathbb{E}M$ by its upper bound in Part 1 of Theorem 1. We state such a result in Corollary 1; its proof is trivial.

###### Corollary 1.

Under the conditions and notation of Theorem 1, for any constant $C$ such that $C \ge C_0$, we have

$$P\Bigl(M - C\sqrt{p(1 \vee \log(A_0 D_\Omega))} > t\Bigr) \le e^{-t^2/2},$$

for any $t > 0$, where the constants $C_0$ and $A_0$ are the same as those in Theorem 1.

To use Theorem 1, we need to determine the universal constant $C_0$ and the moment $A_0$ of the spectral density. We shall discuss the calculation of $A_0$ in Section 2.4. According to our numerical simulations in Section 5 and Section F of the Supplementary material, we give a recommended value of $C_0$ for use in practice.

### 2.3 Uncertainty quantification

In light of Theorem 1, we can construct a confidence upper limit of $f$. Algorithm 1 describes how to compute the confidence upper limit of $f$ at a given untried point $x$. For notational simplicity, we regard the dimension $p$, the variance $\sigma^2$, the moment $A_0$ and the universal constant $C_0$ as global variables so that Algorithm 1 has access to all of them.

Based on the UpperCL function in Algorithm 1, we can construct a confidence region for $x^*$ and a confidence interval for $\max_{x\in\Omega} f(x)$. These regions do not have explicit expressions. However, they can be approximated by calling UpperCL with many different $x$'s. Let $t > 0$. The confidence region for $x^*$ is defined as

$$CR_t := \{x \in \Omega : \mathrm{UpperCL}(x, t, X, Y) \ge \max_{1\le i\le n} f(x_i)\}. \qquad (4)$$

The confidence interval for $\max_{x\in\Omega} f(x)$ is defined as

$$CI_t := \Bigl[\max_{1\le i\le n} f(x_i),\; \max_{x\in\Omega} \mathrm{UpperCL}(x, t, X, Y)\Bigr]. \qquad (5)$$

It is worth noting that the probability in Corollary 1 is not a posterior probability. Therefore, the regions given by (4) and (5) should be regarded as frequentist confidence regions under the Gaussian process model, rather than Bayesian credible regions. This frequentist nature has an alternative interpretation, shown in Corollary 2, which simply translates Corollary 1 from the language of stochastic processes to a deterministic function approximation setting that fits the Bayesian optimization framework better. It shows that $CR_t$ in (4) and $CI_t$ in (5) are confidence regions of $x^*$ and $\max_{x\in\Omega} f(x)$ with confidence level $1 - e^{-t^2/2}$, respectively. In particular, to obtain a 95% confidence region, we use $t = \sqrt{2\log 20} \approx 2.448$.

###### Corollary 2.

Let $C(\Omega)$ be the space of continuous functions on $\Omega$ and $P_Z$ be the law of $Z$. Then there exists a set $A \subset C(\Omega)$ so that $P_Z(A) \ge 1 - e^{-t^2/2}$ and, for any $f \in A$, its maximum point $x^*$ is contained in $CR_t$ defined in (4), and $\max_{x\in\Omega} f(x)$ is contained in $CI_t$ defined in (5).

In practice, the shape of $CR_t$ can be highly irregular, and representing the region can be challenging. If $\Omega$ is of one or two dimensions, we can choose a fine mesh over $\Omega$ and call UpperCL for each mesh grid point $x$. In a general situation, we suggest calling UpperCL with randomly chosen points $x$ and using the $k$-nearest neighbors algorithm to represent $CR_t$.
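To illustrate the mesh-based approximation of the confidence region, the sketch below assumes that UpperCL takes the form suggested by Corollary 1, namely $\mu(x) + \bigl(C_0\sqrt{p(1\vee\log(A_0 D_\Omega))} + t\bigr)\,\sigma(x)\log^{1/2}(e\sigma/\sigma(x))$. Since Algorithm 1 is not reproduced in this text, that form, as well as the values of `C0`, `A0` and the kernel length-scale `theta`, are illustrative assumptions:

```python
import numpy as np

def _predict(X, Y, x, sigma2, theta):
    # One-dimensional Gaussian-correlation GP regression, Eqs. (1)-(2)
    corr = lambda d: np.exp(-(d / theta) ** 2)
    K = corr(np.abs(X[:, None] - X[None, :]))
    r = corr(np.abs(X - x))
    mu = r @ np.linalg.solve(K, Y)
    var = sigma2 * (1.0 - r @ np.linalg.solve(K, r))
    return mu, float(np.clip(var, 1e-12 * sigma2, sigma2))

def upper_cl(x, t, X, Y, sigma2=1.0, theta=0.3, C0=1.0, A0=1.0, D=1.0, p=1):
    # Assumed form of Algorithm 1's UpperCL, read off from Corollary 1:
    # mu(x) + (C0*sqrt(p*(1 v log(A0*D))) + t) * sigma(x) * log^{1/2}(e*sigma/sigma(x))
    mu, var = _predict(X, Y, x, sigma2, theta)
    s, sig = np.sqrt(var), np.sqrt(sigma2)
    inflation = s * np.sqrt(np.log(np.e * sig / s))
    bound = C0 * np.sqrt(p * max(1.0, np.log(A0 * D)))
    return mu + (bound + t) * inflation

# Approximate CR_t on a fine mesh over Omega = [0, 1], as in Eq. (4)
f = lambda x: -(x - 0.6) ** 2          # toy objective (illustration only)
X = np.array([0.1, 0.4, 0.7, 0.9])     # fixed design points
Y = f(X)
t = np.sqrt(2 * np.log(20))            # level 1 - exp(-t^2/2) = 95%
mesh = np.linspace(0.0, 1.0, 201)
best = Y.max()
CR = mesh[[upper_cl(x, t, X, Y) >= best for x in mesh]]
```

Here `CR` collects the mesh points retained by (4); in higher dimensions, the mesh would be replaced by random points and the $k$-nearest neighbors representation described above.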

### 2.4 Calculating A0

For an arbitrary correlation function $\Psi$, calculation of $A_0$ in (3) can be challenging. Fortunately, for the two most popular correlation functions in one dimension, namely the Gaussian and the Matérn correlation functions (Rasmussen and Williams, 2006; Santner et al., 2003), $A_0$ can be calculated in closed form. The results are summarized in Table 1.

For multi-dimensional problems, a common practice is to use product correlation functions. Specifically, suppose $\Psi_1,\ldots,\Psi_p$ are one-dimensional correlation functions. Then their product $\Psi(x) = \prod_{j=1}^p \Psi_j(x_j)$ forms a $p$-dimensional correlation function, where $x = (x_1,\ldots,x_p)^T$. If a product correlation function is used, the calculation of $A_0$ is easy. It follows from the elementary properties of the Fourier transform that $\tilde\Psi(\omega) = \prod_{j=1}^p \tilde\Psi_j(\omega_j)$. Let $\omega_j$ be a random variable with probability density function $\tilde\Psi_j$. Then $A_0 = \sum_{j=1}^p \mathbb{E}|\omega_j|$, i.e., the value of $A_0$ corresponding to a product correlation function is the sum of those given by the marginal correlation functions. If each $\Psi_j$ is either a Gaussian or a Matérn correlation function, the corresponding terms can be read from Table 1.

## 3 Sequential samplings

We now turn to sequential samplings. In Section 3.1, we will review some existing methods in Bayesian optimization. In Section 3.2, we will propose a general Bayesian optimization framework called abstract Bayesian optimization (ABO), which, in an abstract sense, covers all existing methods. Our uncertainty quantification methodology proposed in Section 4 works for all instance algorithms under ABO.

### 3.1 Existing methods

In general, a Bayesian optimization algorithm defines a sequential sampling scheme which determines the next input point by maximizing an acquisition function $a(x \mid X_n, Y_n)$, where $X_n$ consists of the previous input points and $Y_n$ consists of the corresponding outputs. The acquisition function can be either deterministic or random given $(X_n, Y_n)$. A general Bayesian optimization procedure is shown in Algorithm 2.

A number of acquisition functions are proposed in the literature, for example:

1. Expected improvement (EI) (Jones et al., 1998; Mockus et al., 1978), with $a_{\mathrm{EI}}(x) = \mathbb{E}[(Z(x) - y_n^*)\mathbb{1}\{Z(x) > y_n^*\} \mid X_n, Y_n]$, where $\mathbb{1}$ is the indicator function and $y_n^* = \max_{1\le i\le n} f(x_i)$.

2. Gaussian process upper confidence bound (Srinivas et al., 2010), with $a_{\mathrm{UCB}}(x) = \mu_n(x) + \beta_n \sigma_n(x)$, where $\beta_n$ is a parameter, and $\mu_n$ and $\sigma_n$ are as in (7) and (8), respectively.

3. Predictive entropy search (Hernández-Lobato et al., 2014), with $a_{\mathrm{PES}}(x)$ computed from an approximate simulation via spectral sampling (Lázaro-Gredilla et al., 2010; Rahimi and Recht, 2008) from the posterior Gaussian process.

Among the above acquisition functions, $a_{\mathrm{EI}}$ and $a_{\mathrm{UCB}}$ are deterministic functions of $(X_n, Y_n)$, whereas $a_{\mathrm{PES}}$ is random because it depends on a random sample from the posterior Gaussian process. We refer to Shahriari et al. (2016) for general discussions of popular methods in Bayesian optimization.
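The first two acquisition functions admit simple closed forms given the posterior mean and standard deviation at a point. The sketch below uses the standard EI closed form $(\mu - y^*)\Phi(z) + \sigma\varphi(z)$ with $z = (\mu - y^*)/\sigma$ (Jones et al., 1998); it is an illustration, not the authors' implementation:

```python
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def expected_improvement(mu, sigma, y_best):
    """EI in closed form: E[(Z(x) - y*) 1{Z(x) > y*}] for Z(x) ~ N(mu, sigma^2)."""
    if sigma <= 0:
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm_cdf(z) + sigma * norm_pdf(z)

def ucb(mu, sigma, beta=2.0):
    """GP upper confidence bound: mu_n(x) + beta_n * sigma_n(x)."""
    return mu + beta * sigma
```

Maximizing either function over candidate points yields the next design point in Algorithm 2.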

In practice, one also needs to determine when to stop Algorithm 2. Usually, decisions are made in consideration of the budget and the accuracy requirement. For instance, practitioners can stop Algorithm 2 after finishing a fixed number of iterations (Frazier, 2018) or when no further significant improvement of function values can be made (Acerbi and Ji, 2017). Although stopping criteria play no role in the analysis of the algorithms' asymptotic behaviors, they can greatly affect the output uncertainty.

### 3.2 Abstract Bayesian optimization

We now propose a general framework of Bayesian optimization on which our uncertainty quantification methodology relies. This framework does not involve any specific formulae or criteria. Instead, we extract from the existing methods the following three basic concepts: information, policy and stopping rule. These concepts and their minimum requirements are described as follows.

1. Information. Information consists of previous input and output points. At the beginning, the information is empty, denoted by $D_0 = \emptyset$. As we evaluate the objective function at an increasing number of input points, the set of information grows.

2. Policy. A policy is a sequential sampling strategy which determines the next set of input data based on the current information. A policy may be random. However, its random-sampling mechanism should not depend on unobserved data. The mathematical formulae of a policy may vary at different stages of the process. For example, one can choose initial designs and subsequent sampling points with different strategies. A policy can sample one point or a batch of points at a time.

3. Stopping rule. A Bayesian optimization algorithm should stop at some point. After each sampling-evaluation iteration, a stopping criterion is checked to determine whether to terminate the algorithm. A stopping decision should depend only on the current information and/or prespecified values such as the computational budget, and should not depend on unobserved data.

Algorithm 3 describes the procedure of abstract Bayesian optimization (ABO). Algorithm 3 is said to be abstract because it accepts a policy p and a stopping rule t as its inputs. Once p and t are specified, we obtain a concrete algorithm, called an instance algorithm of ABO. As before, we denote the objective function by $f$. For notational simplicity, we denote $f(X) = (f(x_1),\ldots,f(x_m))^T$ for a set of input points $X = (x_1,\ldots,x_m)^T$. The policy p returns the new input point(s), and the stopping rule t returns a decision to continue or to stop; the algorithm stops when t returns the stop decision.
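The ABO loop can be sketched abstractly: the policy and the stopping rule are arbitrary callables that see only observed data. The toy random-search policy below is purely illustrative and is not one of the Bayesian policies discussed above:

```python
import numpy as np

def abo(f, policy, stop, x0):
    """Abstract Bayesian optimization (a sketch of Algorithm 3).

    `policy` maps the current information (X, Y) to the next input point;
    `stop` maps it to True/False. Both see only observed data."""
    X, Y = [], []                 # information D_0 is empty
    x = x0                        # initial point supplied by the user/policy
    while True:
        X.append(x)
        Y.append(f(x))            # evaluate the objective
        if stop(X, Y):            # stopping rule checks current information only
            break
        x = policy(X, Y)          # next input from current information
    i = int(np.argmax(Y))
    return X[i], Y[i]

# A toy instance: random-search policy with a fixed-budget stopping rule.
rng = np.random.default_rng(0)
policy = lambda X, Y: float(rng.uniform(0.0, 1.0))
stop = lambda X, Y: len(Y) >= 20
x_best, y_best = abo(lambda x: -(x - 0.3) ** 2, policy, stop, 0.5)
```

Swapping in an acquisition-maximizing policy or an improvement-based stopping rule yields the concrete methods of Section 3.1.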

Clearly, all existing methods discussed in Section 3.1 are instance algorithms of ABO. Although Algorithm 3 is called abstract Bayesian optimization, it does not involve any specific optimization strategies. In fact, this framework also covers algorithms with other purposes, such as root finding.

## 4 Uncertainty quantification for sequential samplings

In this section, we extend our uncertainty quantification methodology for Bayesian optimization under fixed designs to that under sequential samplings.

### 4.1 Theory

To facilitate our mathematical analysis, we now state the ABO framework in a rigorous manner. Recall that we assume that $f$ is a realization of a Gaussian process $Z$. From this Bayesian point of view, we shall not differentiate $f$ and $Z$ in this section.

Denote the vectors of input and output points in the $n$th iteration as $X_n$ and $Y_n$, respectively. Denote $X_{1:n} = (X_1,\ldots,X_n)$ and $Y_{1:n} = (Y_1,\ldots,Y_n)$. Then, we can write the information we obtain after the $n$th iteration as $D_n = (X_{1:n}, Y_{1:n})$. Because $X_n$ and $Y_n$ are random, the information is associated with the $\sigma$-algebra $\mathcal{F}_n$, defined as the $\sigma$-algebra generated by $D_n$. When the algorithm just starts, no information is gained and we set $D_0 = \emptyset$. The empty $D_0$ is associated with the trivial $\sigma$-algebra $\mathcal{F}_0$, which consists of only the empty set and the entire probability space. Let $T$ be the number of iterations when the algorithm stops. Then an ABO algorithm must satisfy the following conditions.

1. Conditional on $\mathcal{F}_{n-1}$, $X_n$ and $Z$ are mutually independent, for $n = 1, 2, \ldots$.

2. $T$ is a stopping time with respect to the filtration $\{\mathcal{F}_n\}_{n\ge 0}$. We further require $T < +\infty$ almost surely to ensure a meaningful Bayesian optimization procedure.

For every policy p and stopping time $T$, the pair $(\mathrm{p}, T)$ defines an instance algorithm of ABO. We shall establish a generic theory that bounds the uniform prediction error of any instance algorithm of ABO. Given $n$, we denote

$$X_{1:n} = (x_1,\ldots,x_{m_n}), \qquad (6)$$

where each $x_i$ corresponds to one data point and $m_n$ is the number of sampled points after $n$ iterations of the algorithm. Inspired by the Gaussian process regression method (1)-(2), we define

$$\mu_n(x) := r_n^T(x)K_n^{-1}Y_{1:n}, \qquad (7)$$

$$\sigma_n^2(x) := \sigma^2\bigl(1 - r_n^T(x)K_n^{-1}r_n(x)\bigr), \qquad (8)$$

where $r_n(x) = (\Psi(x - x_1),\ldots,\Psi(x - x_{m_n}))^T$ and $K_n = (\Psi(x_j - x_k))_{1\le j,k\le m_n}$.

The main objective of Theorem 2 is to quantify the uncertainty of the reconstruction $\mu_T$. Note that $Z - \mu_T$ is generally not a Gaussian process. Nonetheless, an error bound similar to that in Corollary 1 is still valid.

###### Theorem 2.

Suppose Condition 1 holds. Given $n$, let

$$M_n = \sup_{x\in\Omega}\frac{Z(x) - \mu_n(x)}{\sigma_n(x)\log^{1/2}(e\sigma/\sigma_n(x))},$$

where $\mu_n(x)$ and $\sigma_n^2(x)$ are given in (7) and (8), respectively. Then for any $t > 0$,

$$P\Bigl(M_T - C\sqrt{p(1 \vee \log(A_0 D_\Omega))} > t\Bigr) \le e^{-t^2/2}, \qquad (9)$$

where $C$ and $A_0$ are the same as in Corollary 1.

The probability bound (9) has a major advantage: the constant $C$ is independent of the specific instance algorithm, and it can be chosen the same as that for fixed designs. As we shall calibrate $C_0$ with numerical simulations (see Section 5 and Section F of the Supplementary material), Theorem 2 suggests that we only need to simulate for fixed-design problems, and the resulting constant can be used for the uncertainty quantification of all past and possible future Bayesian optimization algorithms.

### 4.2 Methodology

The conclusion of Theorem 2 is similar to that of Corollary 1. Consequently, we can again apply Algorithm 1 to get a uniform confidence upper limit of the underlying function. Given $t > 0$, similar to (4) and (5), we define

$$CR_t^{\mathrm{seq}} := \{x \in \Omega : \mathrm{UpperCL}(x, t, X_{1:T}, Y_{1:T}) \ge \max_{1\le i\le m_T} f(x_i)\}, \qquad (10)$$

$$CI_t^{\mathrm{seq}} := \Bigl[\max_{1\le i\le m_T} f(x_i),\; \max_{x\in\Omega} \mathrm{UpperCL}(x, t, X_{1:T}, Y_{1:T})\Bigr], \qquad (11)$$

where $X_{1:T}$ is defined in (6) with $n$ replaced by the stopping time $T$. Then Theorem 2 implies that, under the condition that $f$ is a realization of $Z$, $CR_t^{\mathrm{seq}}$ and $CI_t^{\mathrm{seq}}$ are confidence regions of $x^*$ and $\max_{x\in\Omega} f(x)$, respectively, with a simultaneous confidence level $1 - e^{-t^2/2}$.

Analogous to Corollary 2, we can restate Theorem 2 in a deterministic setting as Corollary 3. In this situation, we have to restrict ourselves to deterministic ABO algorithms, in the sense that the policy p is a deterministic map, as in the first two examples in Section 3.1.

###### Corollary 3.

Let $C(\Omega)$ be the space of continuous functions on $\Omega$ and $P_Z$ be the law of $Z$. Given a deterministic ABO algorithm $(\mathrm{p}, T)$, there exists a set $A \subset C(\Omega)$ so that $P_Z(A) \ge 1 - e^{-t^2/2}$ and, for any $f \in A$, its maximum point $x^*$ is contained in $CR_t^{\mathrm{seq}}$ defined in (10), and $\max_{x\in\Omega} f(x)$ is contained in $CI_t^{\mathrm{seq}}$ defined in (11).

## 5 Calibrating C0 via simulation studies

To use (10) and (11), we need to specify the constant $C$ in Theorem 2, which relies on the universal constant $C_0$ in Theorem 1. In this work, we identify $C_0$ by numerical simulations. The details are presented in Section F of the Supplementary material. Here we outline the main conclusions of our simulation studies.

Our main conclusions are: 1) a single default value of $C_0$ is a robust choice for most cases; 2) for cases with Gaussian correlation functions, among other settings, this default choice may lead to very conservative confidence regions. We suggest practitioners first use the default value to obtain robust confidence regions. When users believe that this robust confidence region is too conservative, they can use the value in Table 2 or 3 corresponding to their specific setting, or run numerical studies similar to those in Section F of the Supplementary material to calibrate their own $C_0$.

## 6 Numerical experiments

We compare the performance of the proposed confidence interval (11) with the naive point-wise bound of Gaussian process regression. The nominal confidence levels are 95% for both methods. Other technical details are given in Section G of the Supplementary material. The comparison results are shown in Figure 1, which reports the coverage rates and the widths of the confidence intervals under different smoothness settings. From the left plot in Figure 1, we find that the coverage rate of the proposed interval is almost 100% for all the experiments, while the naive bound has a considerably lower coverage rate. Thus the proposed method is conservative while the naive one is permissive. Such a result shows that using the naive method may be risky in practice. The coverage results support our theory and the conclusions made in Sections 4-5. As shown by the right plot in Figure 1, the widths of the proposed intervals are about five times those of the naive bound. The ratio decreases as the number of iterations increases. The inflation in the width of confidence intervals is the cost of gaining confidence.

## 7 Discussion

In this work we propose a novel methodology to construct confidence regions, with theoretical guarantees, for the outputs given by any Bayesian optimization algorithm. To the best of our knowledge, this is the first result of this kind. The confidence regions may be conservative, because they are constructed from probability inequalities that may not be tight. Given that naive methods may be highly permissive, the proposed method can be useful when a conservative approach is preferred, such as in reliability assessments. To improve the power of the proposed method, one needs to seek more accurate inequalities in future work.

## References

• Acerbi and Ji (2017) Acerbi, L. and Ji, W. (2017). Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. In Advances in Neural Information Processing Systems, pages 1836–1846.
• Adler and Taylor (2009) Adler, R. J. and Taylor, J. E. (2009). Random Fields and Geometry. Springer Science & Business Media.
• Azimi et al. (2010) Azimi, J., Fern, A., and Fern, X. Z. (2010). Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems, pages 109–117.
• Bect et al. (2019) Bect, J., Bachoc, F., and Ginsbourger, D. (2019). A supermartingale approach to Gaussian process based sequential design of experiments. Bernoulli. To appear.
• Bull (2011) Bull, A. D. (2011). Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904.
• Calvin (2005) Calvin, J. (2005). One-dimensional global optimization for observations with noise. Computers & Mathematics with Applications, 50(1-2):157–169.
• Calvin (1997) Calvin, J. M. (1997). Average performance of a class of adaptive algorithms for global optimization. The Annals of Applied Probability, 7(3):711–730.
• Contal et al. (2013) Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer.
• Desautels et al. (2014) Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923.
• Diaz et al. (2017) Diaz, G. I., Fokoue-Nkoutche, A., Nannicini, G., and Samulowitz, H. (2017). An effective algorithm for hyperparameter optimization of neural networks. IBM Journal of Research and Development, 61(4/5):9–1.
• Forrester et al. (2008) Forrester, A., Sobester, A., and Keane, A. (2008). Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley & Sons.
• Frazier (2018) Frazier, P. I. (2018). A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
• Frazier and Wang (2016) Frazier, P. I. and Wang, J. (2016). Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pages 45–75.
• Häse et al. (2018) Häse, F., Roch, L. M., Kreisbeck, C., and Aspuru-Guzik, A. (2018). Phoenics: A Bayesian optimizer for chemistry. ACS Central Science, 4(9):1134–1145.
• Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926.
• Huang et al. (2006) Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. (2006). Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441–466.
• Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492.
• Klein et al. (2017) Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. (2017). Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528–536.
• Kushner (1964) Kushner, H. J. (1964). A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106.
• Lázaro-Gredilla et al. (2010) Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. (2010). Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865–1881.
• Marco et al. (2017) Marco, A., Berkenkamp, F., Hennig, P., Schoellig, A. P., Krause, A., Schaal, S., and Trimpe, S. (2017). Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization. In 2017 IEEE International Conference on Robotics and Automation, pages 1557–1563.
• Mockus et al. (1978) Mockus, J., Tiesis, V., and Zilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129).
• Negoescu et al. (2011) Negoescu, D. M., Frazier, P. I., and Powell, W. B. (2011). The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363.
• Niederreiter (1992) Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, volume 63. SIAM.
• Picheny et al. (2013) Picheny, V., Ginsbourger, D., Richet, Y., and Caplin, G. (2013). Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2–13.
• Pollard (1990) Pollard, D. (1990). Empirical processes: Theory and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–86. JSTOR.
• Rahimi and Recht (2008) Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184.
• Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
• Ryzhov (2016) Ryzhov, I. O. (2016). On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528.
• Santner et al. (2003) Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer Science & Business Media.
• Scott et al. (2011) Scott, W., Frazier, P., and Powell, W. (2011). The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026.
• Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
• Shahriari et al. (2014) Shahriari, B., Wang, Z., Hoffman, M. W., Bouchard-Côté, A., and de Freitas, N. (2014). An entropy search portfolio for Bayesian optimization. Stat, 1050:18.
• Solomou et al. (2018) Solomou, A., Zhao, G., Boluki, S., Joy, J. K., Qian, X., Karaman, I., Arróyave, R., and Lagoudas, D. C. (2018). Multi-objective Bayesian materials discovery: Application on the discovery of precipitation strengthened niti shape memory alloys through micromechanical modeling. Materials & Design, 160:810–827.
• Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In the 27th International Conference on Machine Learning.
• Stocki (2005) Stocki, R. (2005). A method to improve design reliability using optimal latin hypercube sampling. Computer Assisted Mechanics and Engineering Sciences, 12(4):393.
• van der Vaart and Wellner (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
• Vazquez and Bect (2010) Vazquez, E. and Bect, J. (2010). Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and inference, 140(11):3088–3095.
• Wang et al. (2019) Wang, W., Tuo, R., and Wu, C. J. (2019). On prediction properties of kriging: Uniform error bounds and robustness. Journal of the American Statistical Association, (just-accepted):1–38.
• Wendland (2004) Wendland, H. (2004). Scattered Data Approximation, volume 17. Cambridge University Press.
• Wilson et al. (2014) Wilson, A., Fern, A., and Tadepalli, P. (2014). Using trajectory data to improve Bayesian optimization for reinforcement learning. Journal of Machine Learning Research, 15(1):253–282.
• Wu and Frazier (2016) Wu, J. and Frazier, P. (2016). The parallel knowledge gradient method for batch Bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126–3134.
• Wu et al. (2017) Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. (2017). Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278.
• Yarotsky (2013) Yarotsky, D. (2013). Univariate interpolation by exponential functions and Gaussian RBFs for generic sets of nodes. Journal of Approximation Theory, 166:163–175.

## Appendix A Inequalities for Gaussian processes

In this section, we review some inequalities on the maximum of a Gaussian process. Let $G$ be a separable zero-mean Gaussian process indexed by a set $\Gamma$. Define the metric $d_g$ on $\Gamma$ by

$$d_g(x_1,x_2)=\sqrt{\mathbb{E}\big(G(x_1)-G(x_2)\big)^2}.$$

The $\epsilon$-covering number of the metric space $(\Gamma,d_g)$, denoted as $N(\epsilon,\Gamma,d_g)$, is the minimum integer $N$ such that there exist $N$ distinct balls in $(\Gamma,d_g)$ with radius $\epsilon$ whose union covers $\Gamma$. Let $D$ be the diameter of $\Gamma$ with respect to the metric $d_g$. The supremum of a Gaussian process is closely tied to a quantity called the entropy integral, defined as

$$\int_0^{D/2}\sqrt{\log N(\epsilon,\Gamma,d_g)}\,d\epsilon. \tag{12}$$

For detailed discussion of entropy integral, we refer to Adler and Taylor (2009).
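As a concrete illustration (not from the original text), the entropy integral (12) can be evaluated numerically for a simple process. The sketch below assumes a process on $\Gamma=[0,1]$ whose canonical metric is $d_g(x_1,x_2)=|x_1-x_2|^{1/2}$ (e.g., Brownian motion); a $d_g$-ball of radius $\epsilon$ is then a Euclidean interval of length $2\epsilon^2$, so $N(\epsilon,\Gamma,d_g)\le\lceil 1/(2\epsilon^2)\rceil$. The specific metric and grid resolution are assumptions for illustration.

```python
import math

# Canonical metric of Brownian motion on [0, 1]: d_g(x1, x2) = |x1 - x2|**0.5.
# A d_g-ball of radius eps covers a Euclidean interval of length 2 * eps**2,
# so ceil(1 / (2 * eps**2)) balls suffice to cover [0, 1].
def covering_number(eps: float) -> int:
    return max(1, math.ceil(1.0 / (2.0 * eps ** 2)))

def entropy_integral(diam: float, n_steps: int = 10_000) -> float:
    """Evaluate int_0^{D/2} sqrt(log N(eps, Gamma, d_g)) d(eps) by the midpoint rule."""
    h = (diam / 2.0) / n_steps
    total = 0.0
    for i in range(n_steps):
        eps = (i + 0.5) * h
        total += math.sqrt(math.log(covering_number(eps))) * h
    return total

# The diameter of [0, 1] under d_g is 1, so we integrate over (0, 1/2].
print(f"entropy integral ≈ {entropy_integral(1.0):.4f}")
```

The integrand grows only like $\sqrt{\log(1/\epsilon)}$ near zero, so the integral is finite, which is exactly what makes bounds such as (13) below non-vacuous.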

Lemma 1, which is Theorem 1.3.3 of Adler and Taylor (2009), provides an upper bound on the expectation of the maximum of a Gaussian process.

###### Lemma 1.

Let $G$ be a separable zero-mean Gaussian process on a $d_g$-compact set $\Gamma$, where $d_g$ is the metric defined above, and let $N(\epsilon,\Gamma,d_g)$ be the $\epsilon$-covering number. Then there exists a universal constant $\eta$ such that

$$\mathbb{E}\Big(\sup_{x\in\Gamma}G(x)\Big)\le\eta\int_0^{D/2}\sqrt{\log N(\epsilon,\Gamma,d_g)}\,d\epsilon. \tag{13}$$

Lemma 2, which is Theorem 2.1.1 of Adler and Taylor (2009), presents a concentration inequality.

###### Lemma 2.

Let $G$ be a separable zero-mean Gaussian process on a $d_g$-compact set $\Gamma$. Then for all $u>0$,

$$\mathbb{P}\Big(\sup_{x\in\Gamma}G(x)-\mathbb{E}\big(\sup_{x\in\Gamma}G(x)\big)>u\Big)\le e^{-u^2/(2\sigma_\Gamma^2)}, \tag{14}$$

where $\sigma_\Gamma^2=\sup_{x\in\Gamma}\mathbb{E}G(x)^2$.
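A quick Monte Carlo sanity check of the concentration inequality (14) can be run on a discretized process. The sketch below is illustrative only: it draws sample paths of a zero-mean Gaussian process with an assumed squared-exponential covariance on a grid (a choice not made in the paper), estimates $\mathbb{E}\sup G$, and compares the empirical tail with the bound $e^{-u^2/(2\sigma_\Gamma^2)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean GP on a grid of [0, 1] with squared-exponential covariance
# (an assumed example); here sigma_Gamma^2 = sup_x E G(x)^2 = 1.
grid = np.linspace(0.0, 1.0, 50)
cov = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * 0.2 ** 2))
cov += 1e-10 * np.eye(len(grid))  # jitter for numerical stability

n_paths = 20_000
paths = rng.multivariate_normal(np.zeros(len(grid)), cov, size=n_paths)
sups = paths.max(axis=1)

e_sup = sups.mean()       # Monte Carlo estimate of E sup G
sigma_gamma_sq = 1.0      # supremum of the marginal variances

for u in (0.5, 1.0, 1.5, 2.0):
    empirical = (sups - e_sup > u).mean()
    bound = np.exp(-u ** 2 / (2 * sigma_gamma_sq))
    print(f"u={u:.1f}  empirical tail={empirical:.4f}  bound={bound:.4f}")
```

Since a finite grid is a special case of a separable process, the empirical tail should sit below the bound (up to Monte Carlo error) for every $u$.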

Theorem 3 is a slightly strengthened version of Theorem 1 of Wang et al. (2019). Its proof, given in Section E, is based on Lemmas 1-2 and some machinery from scattered data approximation (Wendland, 2004).

###### Theorem 3.

Suppose Condition 1 holds. Let $\mu$ and $\sigma^2$ be as in (1) and (2), respectively, and let $D_\Omega$ be the Euclidean diameter of $\Omega$. Then for any $u>0$ and any closed deterministic subset $A\subseteq\Omega$, with probability at least $1-e^{-u^2/(2\sigma_A^2)}$, the kriging prediction error has the upper bound

$$\sup_{x\in A}\big(Z(x)-\mu(x)\big)\le\eta_1\sigma_A\sqrt{p\big(1\vee\log(A_0D_\Omega)\big)}\sqrt{\log(e\sigma/\sigma_A)}+u, \tag{15}$$

where $A_0$ is defined in Condition 1, $\eta_1$ is a universal constant, and $\sigma_A=\sup_{x\in A}\sigma(x)$.
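To see the quantities in Theorem 3 in action, here is a minimal sketch under assumed choices (Gaussian correlation, hypothetical design points and length-scale, none taken from the paper): it draws one GP path, computes the kriging mean $\mu(x)$ and standard deviation $\sigma(x)$ from a few observations, and reports the standardized sup error $\sup_x |Z(x)-\mu(x)|/\sigma(x)$, the quantity that Theorem 3-type bounds control.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed Gaussian correlation function (illustration only).
def corr(a, b, ls=0.2):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls ** 2))

x_all = np.linspace(0.0, 1.0, 200)
K_all = corr(x_all, x_all) + 1e-10 * np.eye(len(x_all))

# One realization of the GP, observed at a handful of design points.
z = rng.multivariate_normal(np.zeros(len(x_all)), K_all)
idx = np.array([10, 60, 110, 160, 190])
X, y = x_all[idx], z[idx]

K = corr(X, X) + 1e-10 * np.eye(len(X))
R = corr(x_all, X)                       # rows are r^T(x) for each grid point
w = np.linalg.solve(K, y)
mu = R @ w                               # kriging mean mu(x)
var = 1.0 - np.einsum('ij,ij->i', R, np.linalg.solve(K, R.T).T)
sd = np.sqrt(np.clip(var, 1e-12, None))  # kriging sd sigma(x)

# Standardized prediction error over the whole grid.
print("sup |Z - mu| / sigma =", np.max(np.abs(z - mu) / np.maximum(sd, 1e-6)))
```

At the design points the kriging predictor interpolates the path, so the error vanishes there; elsewhere the standardized error is of moderate size, consistent with the logarithmic inflation factor in (15).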

## Appendix B Proof of Theorem 1

We prove Theorem 1 by partitioning $\Omega$ into subregions and applying Theorem 3 on each of them. Let $\Omega_i=\{x\in\Omega:\sigma e^{-i}\le\sigma(x)<\sigma e^{-(i-1)}\}$ for $i=1,2,\ldots$. Let $\sigma_i=\sup_{x\in\Omega_i}\sigma(x)$.

Take $\eta_2=\sqrt{2}e\eta_1$. By Theorem 3, we have

$$\begin{aligned}
&\mathbb{P}\left(\sup_{x\in\Omega}\frac{Z(x)-\mu(x)}{\sigma(x)\log^{1/2}(e\sigma/\sigma(x))}>\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+u\right)\\
&\quad\le\sum_{i=1}^{\infty}\mathbb{P}\left(\sup_{x\in\Omega_i}\frac{Z(x)-\mu(x)}{\sigma(x)\log^{1/2}(e\sigma/\sigma(x))}>\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+u\right)\\
&\quad\le\sum_{i=1}^{\infty}\mathbb{P}\left(\sup_{x\in\Omega_i}\big(Z(x)-\mu(x)\big)>\Big(\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+u\Big)\sigma e^{-i}\sqrt{i}\right)\\
&\quad\le\sum_{i=1}^{\infty}\mathbb{P}\left(\sup_{x\in\Omega_i}\big(Z(x)-\mu(x)\big)>\Big(\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+u\Big)\sigma_i\log^{1/2}(e\sigma/\sigma_i)\big/(\sqrt{2}e)\right)\\
&\quad\le\sum_{i=1}^{\infty}\exp\big\{-u^2\log(e\sigma/\sigma_i)/(4e^2)\big\}\\
&\quad\le\sum_{i=1}^{\infty}\exp\big\{-iu^2/(4e^2)\big\}=\frac{\exp\{-u^2/(4e^2)\}}{1-\exp\{-u^2/(4e^2)\}},
\end{aligned}$$

which, together with the fact that $M\ge 0$, where $M:=\sup_{x\in\Omega}(Z(x)-\mu(x))/(\sigma(x)\log^{1/2}(e\sigma/\sigma(x)))$, implies the following upper bound of $\mathbb{E}M$:

$$\begin{aligned}
\mathbb{E}M&=\int_0^{\infty}\mathbb{P}(M>x)\,dx\\
&\le\left(\int_0^{\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+1}+\int_{\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+1}^{\infty}\right)\mathbb{P}(M>x)\,dx\\
&\le\eta_2\sqrt{p(1\vee\log(A_0D_\Omega))}+1+\int_1^{\infty}\frac{2\exp\{-x^2/(4e^2)\}}{1-\exp\{-x^2/(4e^2)\}}\,dx\\
&\le C_0\sqrt{p(1\vee\log(A_0D_\Omega))}.
\end{aligned}$$

To assess the tail probability, we note that $(Z(x)-\mu(x))/(\sigma(x)\log^{1/2}(e\sigma/\sigma(x)))$ is also a Gaussian process with mean zero. Thus by Lemma 2, we have

$$\mathbb{P}(M-\mathbb{E}M>t)\le e^{-t^2/(2\sigma_M^2)},$$

where

$$\sigma_M^2=\sup_{x\in\Omega}\frac{\mathbb{E}(Z(x)-\mu(x))^2}{\sigma(x)^2\log(e\sigma/\sigma(x))}\le 1.$$

Hence, the proof is complete.

## Appendix C Independence in sequential Gaussian process modeling

The proof of Theorem 2 relies on certain independence properties of sequential Gaussian process modeling, shown in Lemmas 3-4. First we introduce some notation. For an arbitrary function $f$, a correlation function $\Psi$, and a set of points $X=(x_1,\ldots,x_n)$, define $r(x)=(\Psi(x-x_1),\ldots,\Psi(x-x_n))^T$, $K=(\Psi(x_i-x_j))_{1\le i,j\le n}$, and

$$I_{\Psi,X}f(x)=r^T(x)K^{-1}f(X), \tag{16}$$

where $f(X)=(f(x_1),\ldots,f(x_n))^T$. For notational convenience, we define $I_{\Psi,\varnothing}f=0$.
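To make the operator in (16) concrete, here is a minimal sketch (with an assumed Gaussian correlation function $\Psi$ and hypothetical design points, chosen for illustration) showing that $I_{\Psi,X}f$ is a kernel interpolant: it reproduces $f$ at the design points, $I_{\Psi,X}f(x_i)=f(x_i)$.

```python
import numpy as np

# Assumed Gaussian correlation function Psi (illustration only).
def psi(t, ls=0.3):
    return np.exp(-t ** 2 / (2 * ls ** 2))

def interpolate(f, X, x_new):
    """I_{Psi,X} f (x) = r^T(x) K^{-1} f(X), as in (16)."""
    K = psi(X[:, None] - X[None, :]) + 1e-12 * np.eye(len(X))
    r = psi(x_new[:, None] - X[None, :])       # rows are r^T(x)
    return r @ np.linalg.solve(K, f(X))

f = np.sin
X = np.linspace(0.0, 1.0, 6)
x_new = np.linspace(0.0, 1.0, 101)
approx = interpolate(f, X, x_new)

# The interpolant matches f on X (up to the tiny jitter term).
print("max error at design points:",
      np.max(np.abs(interpolate(f, X, X) - f(X))))
```

The reproducing property used here, that $I_{\Psi,X}$ leaves functions in the span of the kernel translates at $X$ unchanged, is exactly the identity $I_{\Psi,X}I_{\Psi,X'}=I_{\Psi,X'}$ invoked in the proof of Lemma 3.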

###### Lemma 3.

Let $Z$ be a stationary Gaussian process with mean zero and correlation function $\Psi$. For two sets of scattered points $X'\subseteq X$, we have

$$Z-I_{\Psi,X'}Z=(Z-I_{\Psi,X}Z)+I_{\Psi,X}(Z-I_{\Psi,X'}Z). \tag{17}$$

In addition, if $X$ and $X'$ are deterministic sets, then the residual $Z-I_{\Psi,X}Z$ and the vector of observed data $Z(X)$ are mutually independent, as a Gaussian process and a Gaussian vector, respectively.

###### Proof.

It is easily seen that $I_{\Psi,X}$ and $I_{\Psi,X'}$ are linear operators satisfying $I_{\Psi,X}I_{\Psi,X'}=I_{\Psi,X'}$, which implies (17).

The residual $Z-I_{\Psi,X}Z$ is a Gaussian process because $I_{\Psi,X}$ is linear. The independence between the Gaussian process $Z-I_{\Psi,X}Z$ and the vector $Z(X)$ can be proven by calculating the covariance: for any $x'$,

$$\mathrm{Cov}\big(Z(x')-I_{\Psi,X}Z(x'),\,Z(X)\big)=\mathrm{Cov}\big(Z(x')-r^T(x')K^{-1}Z(X),\,Z(X)\big)=r^T(x')-r^T(x')=0,$$

which completes the proof. ∎
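The covariance computation above can be checked numerically. With $\mathrm{Cov}(Z(x'),Z(X))=r(x')$ and $\mathrm{Cov}(Z(X),Z(X))=K$, the covariance between the residual at any $x'$ and the data vector is $r^T(x')-r^T(x')K^{-1}K=0$ entrywise. A short sketch with an assumed Gaussian correlation and hypothetical points:

```python
import numpy as np

# Assumed Gaussian correlation function (illustration only).
def psi(t, ls=0.25):
    return np.exp(-t ** 2 / (2 * ls ** 2))

X = np.array([0.1, 0.4, 0.7, 0.9])   # hypothetical design points
x_prime = 0.55                        # arbitrary prediction point

K = psi(X[:, None] - X[None, :])      # Cov(Z(X), Z(X))
r = psi(x_prime - X)                  # Cov(Z(x'), Z(X))

# Cov(Z(x') - r^T K^{-1} Z(X), Z(X)) = r^T - r^T K^{-1} K = 0, entrywise.
cov_residual_data = r - r @ np.linalg.solve(K, K)
print("covariance of residual with data:", cov_residual_data)
```

Up to floating-point rounding, every entry is zero, which for jointly Gaussian quantities is exactly independence.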

###### Lemma 4.

For any instance algorithm of ABO, the following statements are true.

1. Conditional on $\mathcal{F}_n$ and $X_{n+1}$, the residual process $Z-\mu_{n+1}$ is independent of $\mathcal{F}_{n+1}$.

2. Conditional on $\mathcal{F}_n$, the residual process $Z-\mu_n$ is a Gaussian process with the same law as $Z'-I_{\Psi,X_{1:n}}Z'$, where $Z'$ is an independent copy of $Z$.

###### Proof.

We use induction on $n$. For $n=1$, the desired results are direct consequences of Lemma 3, because the design set is fixed under the conditioning.

Now suppose that we have already proven the assertion for $n$ and want to conclude it for $n+1$. First, we invoke the decomposition given by Lemma 3 to obtain

$$Z'-I_{\Psi,X_{1:n}}Z'=(Z'-I_{\Psi,X_{1:(n+1)}}Z')+I_{\Psi,X_{1:(n+1)}}(Z'-I_{\Psi,X_{1:n}}Z'). \tag{18}$$

Because $\mu_n=I_{\Psi,X_{1:n}}Z$ and $\mu_{n+1}=I_{\Psi,X_{1:(n+1)}}Z$, we also have

$$Z-\mu_n=(Z-\mu_{n+1})+I_{\Psi,X_{1:(n+1)}}(Z-\mu_n). \tag{19}$$

By the inductive hypothesis, $Z-\mu_n$ has the same law as $Z'-I_{\Psi,X_{1:n}}Z'$ conditional on $\mathcal{F}_n$. Our assumption that $X_{n+1}$ is independent of $Z-\mu_n$ conditional on $\mathcal{F}_n$ implies that this equality in law holds conditional on $X_{n+1}$ as well. Thus,

$$Z-\mu_n\overset{d}{=}Z'-I_{\Psi,X_{1:n}}Z'\;\big|\;\mathcal{F}_n,X_{n+1}.$$

Clearly, this equality in distribution is preserved under the action of $I_{\Psi,X_{1:(n+1)}}$ on both sides, which implies

$$\big(Z-\mu_n,\;I_{\Psi,X_{1:(n+1)}}(Z-\mu_n)\big)\overset{d}{=}\big(Z'-I_{\Psi,X_{1:n}}Z',\;I_{\Psi,X_{1:(n+1)}}(Z'-I_{\Psi,X_{1:n}}Z')\big)\;\big|\;\mathcal{F}_n,X_{n+1}.$$

Incorporating the above equation with (18) and (19) yields

$$\big(Z-\mu_{n+1},\;Z-\mu_n\big)\overset{d}{=}\big(Z'-I_{\Psi,X_{1:(n+1)}}Z',\;Z'-I_{\Psi,X_{1:n}}Z'\big)\;\big|\;\mathcal{F}_n,X_{n+1}. \tag{20}$$

Now we consider the vectors $V=(Z-\mu_n)(X_{n+1})$ and $V'=(Z'-I_{\Psi,X_{1:n}}Z')(X_{n+1})$. Then (20) implies

$$\big(Z-\mu_{n+1},\;V\big)\overset{d}{=}\big(Z'-I_{\Psi,X_{1:(n+1)}}Z',\;V'\big)\;\big|\;\mathcal{F}_n,X_{n+1}. \tag{21}$$

Because $V'$ is a function of the observed data $Z'(X_{1:(n+1)})$, we can apply Lemma 3 to obtain that, conditional on $\mathcal{F}_n$ and $X_{n+1}$, $Z'-I_{\Psi,X_{1:(n+1)}}Z'$ is independent of $V'$, which, together with (21), implies that $Z-\mu_{n+1}$ and $V$ are independent conditional on $\mathcal{F}_n$ and $X_{n+1}$. Because $\mathcal{F}_{n+1}$ is measurable with respect to the $\sigma$-algebra generated by $\mathcal{F}_n$, $X_{n+1}$, and $V$, we obtain that