# A hierarchical expected improvement method for Bayesian optimization

Expected improvement (EI) is one of the most popular Bayesian optimization (BO) methods, due to its closed-form acquisition function which allows for efficient optimization. However, one key drawback of EI is that it is overly greedy; this results in suboptimal solutions even for large sample sizes. To address this, we propose a new hierarchical EI (HEI) framework, which makes use of a hierarchical Gaussian process model. HEI preserves a closed-form acquisition function, and corrects the over-greediness of EI by encouraging exploration of the optimization space. Under certain prior specifications, we prove the global convergence of HEI over a broad objective function space, and derive global convergence rates under smoothness assumptions on the objective function. We then introduce several hyperparameter estimation methods, which allow HEI to mimic a fully Bayesian optimization procedure while avoiding expensive Markov-chain Monte Carlo sampling. Numerical experiments show the improvement of HEI over existing BO methods, for synthetic functions as well as a semiconductor manufacturing optimization problem.

## Authors

• 7 publications
• 19 publications
• 12 publications
• ### Bayesian Optimization of Composite Functions

We consider optimization of composite objective functions, i.e., of the ...
06/04/2019 ∙ by Raul Astudillo, et al. ∙ 6

• ### Bayesian optimization of the PC algorithm for learning Gaussian Bayesian networks

The PC algorithm is a popular method for learning the structure of Gauss...
06/28/2018 ∙ by Irene Córdoba, et al. ∙ 0

• ### On a New Improvement-Based Acquisition Function for Bayesian Optimization

Bayesian optimization (BO) is a popular algorithm for solving challengin...
08/21/2018 ∙ by Umberto Noè, et al. ∙ 0

• ### Efficient batch-sequential Bayesian optimization with moments of truncated Gaussian vectors

We deal with the efficient parallelization of Bayesian global optimizati...
09/09/2016 ∙ by Sébastien Marmin, et al. ∙ 0

• ### Convergence rates of efficient global optimization algorithms

Efficient global optimization is the problem of minimizing an unknown fu...
01/18/2011 ∙ by Adam D. Bull, et al. ∙ 0

• ### Distributionally Ambiguous Optimization Techniques in Batch Bayesian Optimization

We propose a novel, theoretically-grounded, acquisition function for bat...
07/13/2017 ∙ by Nikitas Rontsis, et al. ∙ 0

• ### Bayesian Optimization of Hyperparameters when the Marginal Likelihood is Estimated by MCMC

Bayesian models often involve a small set of hyperparameters determined ...
04/21/2020 ∙ by Oskar Gustafsson, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Bayesian optimization (BO) provides a principled way for solving the black-box optimization problem:

 x∗=argminx∈Ωf(x). (1)

Here, are the input variables, is the feasible domain, and the objective function is assumed to be black-box and expensive to evaluate. The key idea in BO is to view as a random realization from a stochastic process, which captures prior beliefs on the objective function. Using this model, BO sequentially queries at points which maximize the acquisition function – the expected utility of a new point given observed data. BO has wide applicability in real-world problems, ranging from rocket engine design (mak2018efficient), nanowire yield optimization (dasgupta2008statistical)

, and neural network training

(bergstra2012random).

Many existing BO methods vary in their choice of (i) the stochastic model on , and (ii) the utility function for sequential sampling. For (i), the most popular stochastic model by far is the Gaussian process (GP) model (santner2003design). Under a GP model, several well-known BO methods have been derived using different utility functions for (ii). These include the expected improvement (EI) method (mockus1978application; jones1998efficient), the upper confidence bound (UCB) method (Srinivas:2010:GPO:3104322.3104451), and the Knowledge Gradient method (Frazier:2008:KPS:1461633.1461641; scott2011correlated). Of these, EI is arguably the most popular method, since it admits a simple closed-form acquisition function, which can be efficiently optimized to yield subsequent query points on . EI has been subsequently developed for a variety of black-box optimization problems, including multi-fidelity optimization (zhang2018variable), constrained optimization (feliot2017bayesian), and parallel/batch-sequential optimization (marmin2015differentiating).

Despite the popularity of EI, it does have notable limitations. One such limitation is that it is too greedy (qin2017improving): EI focuses nearly all sampling efforts near the optima of the fitted GP model, and does not sufficiently explore other regions. In terms of the exploration-exploitation trade-off (kearns2002near), EI over-exploits the fitted model on , and under-explores the optimization space . Because of this, EI often gets stuck in local optima and fails to converge to a global optimum (bull2011convergence). There have some recent efforts on remedying this greediness of EI. snoek2012practical proposed a fully Bayesian EI, where all GP parameters are sampled using Markov chain Monte Carlo (MCMC); this incorporates parameter uncertainty within EI and encourages exploration. chen2017sequential proposed a variation of EI under an additive Bayesian model, which encourages exploration by increasing model uncertainty. Both methods, however, require expensive MCMC sampling; this sampling can take hours of computation to optimize the next query point, which may exceeds the evaluation cost of ! Such methods diminish a key advantage of EI: efficient queries via a closed-form criterion.

To address this, we propose a hierarchical EI (HEI) framework which corrects the greediness of EI while preserving a closed-form criterion. The key idea is a hierarchical GP model for (handcock1993bayesian), with hierarchical priors on process parameters. Under this model, we show that HEI has a closed-form acquisition function which encourages further exploration. We then prove that, under certain prior specifications, HEI converges to a global optimum over a broad function space for . This addresses the over-greediness of EI, which can fail to find any global optimum even for smooth . We further derive global convergence rates for HEI under smoothness assumptions on .

We note that a simpler version of HEI, called the Student EI (SEI), was proposed earlier in benassi2011robust. Our HEI has important novelties over SEI: the HEI incorporates uncertainty on process nonstationarity, has provable global convergence and convergence rates for optimization, and can mimic a fully Bayesian optimization procedure. Numerical experiments show that HEI considerably outperforms existing BO methods, whereas the SEI yields only comparable (or worse) performance to existing methods.

The paper is organized as follows: Section 2 reviews the GP model and the EI method. Section 3 presents the HEI method. Section 4 proves the global convergence for HEI and its associated convergence rates. Section 5 provides methodological developments on hyperparameter specification and basis selection. Sections 6 and 7 compare HEI with existing methods for synthetic functions and in a semiconductor manufacturing problem, respectively. Concluding remarks are given in Section 8.

## 2 Background and Motivation

We first introduce the GP model, then review the EI method and its deficiencies, which motivates the proposed HEI method.

Gaussian Process. Assume the function follows the Gaussian process model:

 f(x)=μ(x)+Z(x),μ(x)=p⊤(x)β,Z(x)∼GP(0,σ2K). (2)

Here, consists of basis functions for the mean function , denotes its corresponding coefficients, and

denotes a stationary GP with mean zero, process variance

, and correlation function . The model (2) is known as the universal kriging model in geostatistics (wackernagel1995multivariate). When there is no trend, i.e., , this model reduces to the so-called ordinary kriging model.

Suppose function values have been observed at inputs , yielding data . Let

be the vector of observed function values,

be the correlation vector between the unobserved response and observed responses , be the correlation matrix for observed points, and be the model matrix for observed points. Then, the posterior distribution of at an unobserved input has the closed form (santner2003design):

 [f(x)∣∣Dn]∼N(^fn(x),σ2s2n(x)) . (3)

Here, is the posterior mean, is the posterior variance, where , and

. These expressions can be equivalently viewed as the best linear unbiased estimator of

and its variance (jones1998efficient).

Of course, the process variance is also unknown in practice and needs to be estimated from data. A common approach is to estimate using its maximum likelihood estimator (MLE):

 ^σ2n=1n(yn−Pn^βn)⊤K−1n(yn−Pn^βn).

One can then plug-in into (3) to estimate the posterior distribution .

Expected Improvement. The idea behind EI (jones1998efficient) is as follows. Let be the current best objective value, and let be the improvement utility function. Given data , the expected improvement acquisition function becomes:

 EIn(x)=Ef|Dn(y∗n−f(x))+. (4)

For an unobserved point , the criterion can be interpreted as the expected improvement to the current best objective value, if the next query is at point . Under the posterior distribution (3) with plug-in estimate , has the closed-form expression:

 EIn(x)=In(x)Φ(In(x)^σnsn(x))Exploitation+^σnsn(x)ϕ(In(x)^σnsn(x))Exploration. (5)

Here, and

denote the probability density function (p.d.f.) and cumulative density function (c.d.f) of the standard normal distribution, respectively, and

.

After we obtain the acquisition function (5), the next query point is obtained by maximizing . The acquisition function (5) implicitly encodes a tradeoff between exploration of the feasible region and exploitation near the current best solution. The first term in (5) encourages exploitation, by assigning larger values for points with smaller predicted values ; the second term in (5) encourages exploration, by assigning greater values for points with larger estimated posterior variance .

However, one drawback of EI is that it fails to capture the full uncertainty of model parameters within the acquisition function . This results in an over-exploitation of the fitted GP model for optimization, which explains why EI can fail to find any global optimum . This over-greediness has been noted in several recent works (bull2011convergence; qin2017improving). In particular, Theorem 3 of bull2011convergence showed that, for a common class of correlation functions for (see Assumption 1 later), there always exists some smooth function (within a function space , defined later in Section 4) such that EI fails to find any global optimum with a positive probability. This is stated formally below:

###### Proposition 1

Suppose Assumption 1 holds with . Let be the points generated by maximizing in (5). Suppose initial points are sampled according to some probability measure over . Then, for any , there exist some and some constant such that

 PF(limn→∞y∗n−minx∈Ωf(x)≥δ)>1−ϵ.

## 3 Hierarchical Expected Improvement

To overcome this, we propose a hierarchical EI framework which provides a richer quantification of uncertainty within the acquisition function. The key ingredient in HEI is a hierarchical GP model on . Assume the universal kriging model (2), with hierarchical priors on parameters :

 [β] ∝1,[σ2]∼IG(a,b). (6)

In words, the coefficients are assigned a flat improper (i.e., non-informative) prior over , and the process variance is assigned a conjugate inverse-Gamma prior with shape and scale parameters and , respectively. The idea is to leverage this hierarchical structure on model parameters to account for estimation uncertainty, while preserving a closed-form criterion. The next lemma provides the posterior distribution of under such a hierarchical model.

###### Lemma 2

Assume the hierarchical model (2) and (6), with . Given data , we have

where , , , and is a

-dimensional non-central t-distribution with degrees of freedom

, location vector and scale matrix . Furthermore, the posterior distribution of is

 (7)

The proof of Lemma 2 follows from Chapter 4.4 in santner2003design. Lemma 2 shows that under the hierarchical model (2) and (6), the posterior distribution of is now t-distributed, with closed-form expressions for its location and scale parameters and .

Comparing the predictive distributions in (7) and (3), there are several differences which highlight the increased uncertainty from the hierarchical model. First, the new posterior (7) is now t-distributed, whereas the earlier posterior (3) is normally distributed. This suggests that the hierarchical model imposes heavier tails on the predictive distribution, which increases uncertainty. Second, the variance term in (7) can be decomposed as:

 ~σ2n=(2b+n^σ2n)/(2a+(n−q))>n/(2a+(n−q))⋅^σ2n. (8)

For (which is satisfied via a weakly informative prior on ), is larger than the MLE , which increases predictive uncertainty.

Similar to the EI criterion (4), we define the HEI acquisition function as:

 HEIn(x)=Ef|Dn(y∗n−f(x))+,

where the conditional expectation over is under the hierarchical GP model. The theorem below gives a closed-form expression for :

###### Theorem 3

Assume the hierarchical model (2) and (6), with . Then:

 (9)

where , , and , denote the p.d.f. and c.d.f. of a Student’s t-distribution with degrees of freedom, respectively.

Theorem 3 shows that the HEI criterion preserves the desirable properties of the original EI criterion (5): it has an easily-computable, closed-form expression, which allows for efficient optimization of the next query point. This new criterion also has an interpretable exploration-exploitation trade-off: the first term encourages exploitation near the current best solution , and the second term encourages exploration of regions with high predictive variance.

More importantly, the differences between the HEI criterion (9) and the EI criterion (5) show how our method addresses the over-greediness of the latter. There are three notable differences. First, the HEI exploration term depends on the t-p.d.f. , whereas the EI exploration term depends on the normal p.d.f. . Since the former has heavier tails, the HEI exploration term is inflated, which encourages exploration. Second, the larger variance term (see (8)) also inflates the HEI exploration term and encourages exploration. Third, the HEI contains an additional adjustment factor in its exploration term. Since this factor is larger than 1, HEI again encourages exploration. This adjustment is most prominent for small sample sizes, since the factor as sample size . All three differences correct the over-exploitation of EI via a principled hierarchical Bayesian framework.

We also note several important differences between the proposed HEI and the SEI in benassi2011robust. First, the SEI considers a stationary GP model, with constant mean , while the proposed criterion considers a broader non-stationary GP model with mean function , which accounts for uncertainty on coefficients . This allows HEI to incorporate uncertainty on GP nonstationarity, which encourages more exploration in sequential sampling. Second, we prove next the global convergence of HEI (and its convergence rates) under certain prior specifications, which directly addresses the overgreediness of EI. To our knowledge, such results do not exist for the SEI. Lastly, we develop (in Section 5) hyperparameter estimation methods which allow HEI to efficiently mimic a fully Bayesian optimization procedure. Because of this, HEI performs considerably better than existing BO methods, whereas the SEI gives only comparable performance.

## 4 Convergence Analysis of HEI

We now show that HEI indeed finds a global optimum over a broad function class for . We first present this global convergence result (and its associated convergence rate) for Matérn-type correlation functions, then provide an improved convergence rate for more smooth correlation functions.

Let us first adopt the following form for the kernel :

 Kθ(x,z):=C{x1−z1θ1,…,xd−zdθd},

where is a stationary correlation function with and length-scale parameters . From this, we can then define a function space – the reproducing kernel Hilbert space (RKHS, wendland2004scattered) – for the objective function . Given kernel (which is symmetric and positive definite), define the linear space

 Fθ(Ω)={N∑i=1αiKθ(⋅,xi):N∈N+,xi∈Ω,αi∈R},

and equip this space with the bilinear form

The RKHS of kernel is defined as the closure of under , with its inner product induced by .

Next, we make the following two regularity assumptions. The first is a smoothness assumption on the correlation function .

###### Assumption 1

is continuous, integrable, and satisfies:

 |C(x)−Qr(x)|=O(∥x∥2ν2(log∥x∥2)2α)as∥x∥2→0,

for some constants and . Here, and is the -th order Taylor approximation of

. Furthermore, its Fourier transform

is isotropic, radially non-increasing, and satisfies either: as

or    for any .

As noted in bull2011convergence, the choice of as the Matérn correlation function (cressie1991statistics) satisfies Assumption 1.

For the scale parameters , HEI uses maximum a posterior (MAP) estimation under a prior , and updates these parameter estimates after each sampled point. The second assumption is a regularity condition on this MAP estimator.

###### Assumption 2

Given data and prior , let be the MAP of . For any , we have

 θL≤~θn≤θUfor some constants θL,θU∈Rd+. (10)

In our implementation, we use a flat prior on over the compact space .

The following theorem shows that, under specific prior settings, the proposed HEI method rectifies the poor convergence of EI.

###### Theorem 4

Suppose Assumptions 1 and 2 hold, and assume is a constant in and for the hyperparameters in (6). Let be the points generated by maximizing in (9), with iterative plug-in MAP estimates . Then, for any and any initial points, we have:

The proof of this theorem is given in Appendix A.2. The key idea is to upper bound the optimality gap by , which is a generalization of the power function used in the function approximation literature (wendland2004scattered). Then, with the condition , which prevents the variance estimate from collapsing to , we can apply approximation bounds on to obtain the desired global convergence result.

Theorem 4 shows that HEI indeed finds to a global optimum for all in the RKHS , which remedies the overgreediness of EI from Proposition 1. When is the Matérn correlation with smoothness parameter , the RKHS consists of functions with continuous derivatives of order (santner2003design). Under these conditions, HEI achieves a global convergence rate of for , and for .

At first glance, the prior specification in Theorem 4 may appear strange, since the hyperparameter depends on the sample size . However, such data-size-dependent

priors have been studied extensively in the context of high-dimensional Bayesian linear regression, particularly in its connection to optimal minimax estimation (see, e.g.,

castillo2015bayesian). The data-size-dependent prior in Theorem 4 can be interpreted in a similar way: the hyperparameter condition is sufficient in encouraging enough exploration, so that HEI converges to a global optimum for all in the RKHS .

One potential drawback of the global convergence rate in Theorem 4 is that it grows exponentially in dimension . Suppose . With , HEI achieves a rate of in dimensions, but this rate deteriorates to in dimensions! This is the well-known curse-of-dimensionality (bellman2015adaptive). One way to provide relief from this curse is to assume further smoothness on ; this strategy is used extensively in the Quasi-Monte Carlo literature (dick2013high; mak2017projected). We adopt a similar approach below to derive an improved global convergence rate for HEI which is less affected by dimensionality.

In addition to Assumption 2, we will require the following two assumptions. Assumption 3 is on the smoothness of .

###### Assumption 3

The correlation function is a radial function of the form

 C{x1−z1θ1,…,xd−zdθd}=g⎛⎜⎝ ⎷d∑i=1(xi−ziθi)2⎞⎟⎠.

Moreover, the function satisfies

 |g(l)(x)|≤l!Mlfor all l≥l0 and x≥0,

where is the -th derivative of , and is a fixed constant. Furthermore, its Fourier transform is isotropic, radially non-increasing, and as , it satisfies either:

 ˆg(x)=O(x−2λ−d)for any λ>0orˆg(x)=Θ(x−ν−d),

where is a positive constant.

This assumption imposes greater smoothness on than Assumption 1, since the Matérn correlation with smoothness parameter (which satisfies Assumption 1) can be shown to violate Assumption 3. One correlation which satisfies Assumption 3 is the Gaussian correlation, which is much more smooth than the Matérn correlation. The second assumption is a regularity condition on the boundary of the domain .

###### Assumption 4

Domain is a Lipschitz domain, i.e., for any

, there exist a hyperplane

of dimension through , a Lipschitz continuous function , and positive constants , such that

 Ω∩X={z+yn:z∈Br(x)∩H, −h
 and∂Ω∩X={z+yn:z∈Br(x)∩H,η(z)=y},

where is a unit vector normal to , , and

Under these two additional assumptions, the following theorem gives an improved global convergence rate which is less affected by dimensionality.

###### Theorem 5

Suppose Assumptions 2, 3 and 4 hold, and assume is a constant in and for the hyperparameters in (6). Let be the points generated by maximizing in (9), with iterative plug-in MAP estimates . Then, for any and any initial points, we have:

 y∗n−minx∈Ωf(x)=O(n−1).

The proof of this theorem is provided in Appendix A.3. Theorem 5 shows that, by imposing greater smoothness on the objective function (via smoothness conditions on correlation ), HEI enjoys a much improved rate of , one which is less affected by dimension . For example, when is the Gaussian correlation, its RKHS consists of functions with continuous derivatives of any order (minh2010some), which is clearly more restrictive than the RKHS of the Matérn correlation from Theorem 4. By trading off on function smoothness, the convergence rate of HEI improves from to .

## 5 Methodological Developments

Next, we discuss two methodological developments for HEI, concerning hyperparameter specifications and order selection for basis functions. We then provide a full algorithm statement for HEI.

### 5.1 Hyperparameter Specification

We first present several plausible specifications for the hyperparameters in the hierarchical prior in (6), and discuss why certain specifications may yield better BO performance over others.

(i) Weakly Informative. Consider first a weakly informative specification of hyperparameters , which provides weak information on the variable parameter . Following gelman2006prior, we set for some small choice of , e.g., . The limiting case of yields the non-informative Jeffreys prior for variance parameters.

While weakly informative (and non-informative) priors are widely used in Bayesian analysis, such priors may result in poor optimization performance for HEI (as will be shown in Section 6). One reason is that, oftentimes, only a small sample size can be afforded on the black-box function , since each evaluation is expensive. One can perhaps address this with a carefully elicited subjective prior, but such priors can be difficult to formulate when the objective is black-box. We present next two specifications which may offer improved optimization performance in practice.

(ii) Empirical Bayes. Consider next an empirical Bayes (EB, carlin2010bayes) approach, which uses the observed data to estimate the hyperparameters by maximizing the marginal likelihood:

 p(yn;a,b)=∫L(β,σ2;yn)π(β)π(σ2;a,b)dβdσ2,

where is the likelihood function of the GP model (2) (see santner2003design for the full expression). Using these estimated hyperparameters, EB provides a close approximation to a fully Bayesian approach – the “gold standard” approach yielding a full quantification of uncertainty. For BO, EB estimates of hyperparameters allow HEI to closely mimic a fully Bayesian optimization procedure (the “gold standard”), while avoiding expensive MCMC sampling via a closed-form criterion.

Unfortunately, the proposition below shows that a direct application of EB for HEI gives unbounded hyperparameter estimates:

###### Proposition 6

The marginal likelihood for the hierarchical GP model with (6) is given by:

 p(yn;a,b)=det(GnKn)−12baΓ(a)Γ(a+(n−q)/2)(b+wn)a+n−q2, (11)

where . The maximization of (11) is unbounded for .

The proof of Proposition 6 is provided in Appendix A.4. To address the issue of unboundedness, one can perform a modification on EB, called the marginal maximum a posterior (MMAP, doucet2002marginal

to the marginal likelihood:

 ~p(yn;a,b)=p(yn;a,b)π(a,b).

The MMAP approach of hyperparameter estimation has been used for efficient analysis of large-scale Bayesian networks

(JMLR:v14:liu13b). The next proposition shows that the MMAP yields finite estimates:

###### Proposition 7

Assume independent hyperpriors and , where and are the shape and scale parameters, respectively. Then the maximization of is always finite for .

The proof of Proposition 7 is provided in Appendix A.5. As we show later, this MMAP approach can greatly outperform the weakly informative approach for HEI, since it better approximates a fully Bayesian optimization procedure.

(iii) DSD. Lastly, consider the so-called “data-size-dependent” (DSD) specification. Recall from Theorem 4 that the data-size-dependent condition is sufficient for global convergence. To reflect this, the DSD specification assumes the shape parameter to be constant, and the scale parameter to grow at the same order as the sample size , i.e., .

To mimic a fully Bayesian EI, we can again use MMAP to estimate hyperparameters from data. Suppose initial data is collected from design points (which we take to be space-filling, see Section 5.3). Then and can be estimated as:

 (a∗,κ∗)=argmaxa,κ>0{p(ynini;a,κn% ini)⋅πΓ(a;ζ,ι)},

where

denotes the p.d.f. of a Gamma distribution with shape and scale parameters

. Using these estimated parameters, subsequent points are then queried using HEI with and , where is the current sample size. One appealing property of this DSD specification is that it ensures HEI converges to a global optimum (Theorem 4).

### 5.2 Order Selection for Basis Functions

In our implementation, we take the basis functions to be complete polynomials up to a certain order. For different order choices, this results in different polynomial models for the mean function, e.g., for model (no trend); (linear trend) for model ; for model (second-order trend). Choosing a model with high polynomial order can reduce bias, but can also cause inflated variance due to overfitting. A model with a high order also requires more initial points, which may not be feasible in some situations. Of course, one can choose to use other basis functions (e.g., orthogonal polynomials; xiu2010numerical) depending on the application at hand.

We adopt the BIC selection criterion (schwarz1978estimating) to select the model with “best” order to use within HEI. Let be the fitted model with complete polynomials of maximum order . Denote the likelihood of model as (this expression can be found in santner2003design). Given initial data , the BIC selects the model with order:

 l∗=argminl{−2logL(M(l))+qllog(nini)}, (12)

where denotes the number of basis functions in model . Having selected this optimal order, subsequent samples are then obtained using HEI with mean function following this polynomial order.

### 5.3 Algorithm Statement

Algorithm 1 provides the detailed steps for HEI. First, initial data is collected on a space-filling design, such as the maximin Latin hypercube design (MmLHD, santner2003design). Here, the number of initial points is set at , as recommended in loeppky2009choosing. Next, the polynomial model order for HEI is selected using (12), and the hyperparameters and are estimated from data (if necessary). Finally, sequential function queries are collected by maximizing the proposed HEI criterion, until a sample size budget is reached.

## 6 Simulation Studies

We now investigate the numerical performance of HEI in comparison to existing BO methods, for a suite of test optimization functions. We consider the following four test functions, taken from simulationlib:

• Branin (2-dimensional function on domain ):

 f(x)=(x2−5.1/(4π2)⋅x21+5/π⋅x1−6)2+10(1−1/(8π))cos(x1)+10,
• Three-Hump Camel (2-dimensional function on domain ):

 f(x)=2x21−1.05x41+x61/6+x1x2+x22,
• Six-Hump Camel (2-dimensional function on domain ):

 f(x)=(4−2.1x21+x41/3)x21+x1x2+(−4+4x22)x22,
• Levy Function (6-dimensional function on domain ):

 f(x)=sin2(πω1)+5∑i=1(ωi−1)2[1+10sin2(πωi+1)]+(ω6−1)2[1+sin2(2πω6)],

where for .

The simulation set-up is as follows. We compare the proposed HEI method under different hyperparameter specifications (HEI-Weak, HEI-MMAP, HEI-DSD), with the EI method under ordinary kriging (EI-OK), the EI method under universal kriging (EI-UK), the Student EI (SEI) method with hyperparameters as recommended in benassi2011robust, and the UCB method under ordinary kriging (UCB-OK) with default exploration parameter . For HEI-Weak, hyperparameters are set as ; for HEI-MMAP and HEI-DSD, hyperparameters are set as . All methods use the Matérn correlation with smoothness parameter , and are run with a total of function evaluations. Here, the kriging model is fitted using the R package kergp (kergp). All results are averaged over 10 replications.

Figures 1(a)1(c) and 1(d) show the log-optimality gap against the number of samples for the first three functions, and Figure 1(e) shows the optimality gap for the Levy function. We see that the three HEI methods outperform the three existing EI methods: the optimality gap for the latter methods stagnates for larger sample sizes, whereas the former enjoys steady improvements as increases. This shows that the proposed method indeed corrects the over-greediness of EI. Furthermore, of the HEI methods, HEI-MMAP and HEI-DSD appear to greatly outperform HEI-Weak. This is in line with the earlier observation that weakly informative priors may yield poor optimization for HEI; the MMAP and DSD specifications give better performance by mimicking a fully Bayesian optimization procedure. The steady improvement of HEI-DSD also supports the data-size-dependent prior condition needed for global convergence in Theorem 4.

Figure 1(b) shows the sampled points from HEI-DSD and UCB-OK for one run of the Branin function. The points for HEI-Weak and HEI-MMAP are quite similar to HEI-DSD, and the points for EI-OK, EI-UK and SEI are quite similar to UCB-OK, so we only plot one of each for easy visualization. We see that HEI indeed encourages more exploration in optimization: it successfully finds all three global optima for , whereas existing methods cluster points near only one optimum. The need to identify multiple global optima often arises in multiobjective optimization. For example, a company may wish to offer multiple product lines to suit different customer preferences (mak2019analysis). For such problems, HEI can provide more practical solutions over existing methods.

Lastly, we compare the performance of HEI with the SEI method (benassi2011robust). From Figure 1, the SEI achieves only comparable performance with EI-OK (which is in line with the results reported in benassi2011robust), and is one of the worst-performing methods. This shows that HEI, by (i) incorporating uncertainty on GP non-stationarity and (ii) mimicking a fully Bayesian EI via hyperparameter estimation, indeed yields considerable improvements. These novel developments play a key role in the excellent numerical performance of the proposed method.

## 7 Process optimization in semiconductor manufacturing

We now investigate the performance of HEI in a process optimization problem in semiconductor wafer manufacturing. In semiconductor manufacturing (jin2012sequential), thin silicon wafers undergo a series of refinement stages. Of these, thermal processing is one of the most important stage, since it facilitates necessary chemical reactions and allows for surface oxidation (singh2000rapid). Figure 2(a) visualizes a typical thermal processing procedure: a laser beam is moved radially in and out across the wafer, while the wafer itself is rotated at a constant speed. There are two objectives here. First, the wafer should be heated to a target temperature to facilitate the desired chemical reactions. Second, temperature fluctuations over the wafer surface should be made as small as possible, to reduce unwanted strains and improve wafer fabrication (brunner2013characterization). The goal is to find an “optimal” setting of the manufacturing process which achieves these two objectives.

We consider five control parameters: wafer thickness, rotation speed, laser period, laser radius, and laser power (a full specification is given in Table 1). The heating is performed over 60 seconds, and a target temperature of F is desired over this timeframe. We use the following objective function:

 f(x):=60∑t=1maxs∈S|Tt(s;x)−T∗|. (13)

Here, denotes a spatial location on the wafer domain , denotes the heating time (in seconds), and denotes the wafer temperature at location and time , using control setting . Note that incorporates both objectives of the study: wafer temperatures close to results in smaller values of , and the same is true when is stable over .

Clearly, each evaluation of is expensive, since it requires a full run of wafer heating process. We will simulate each run using COMSOL Multiphysics (comsol)

, a reliable finite-element analysis software for solving complex systems of partial differential equations (PDEs). COMSOL models the incident heat flux from the moving laser as a spatially distributed heat source on the surface, then computes the transient thermal response by solving the coupled heat transfer and surface-to-ambient radiation PDEs. Figure 2(b) visualizes the simulation output from COMSOL: the average, maximum, and minimum temperature over the wafer domain at every time step. Experiments are performed on a desktop computer with quad-core Intel I7-8700K processors, and take around

minutes per run.

Figure 2(c) shows the objective value for HEI-MMAP and HEI-DSD (the best-performing HEI methods from simulations), and for the UCB-OK and SEI methods. We see that the HEI-MMAP and HEI-DSD methods both achieve good performance in terms of low objective values, whereas UCB-OK and SEI perform noticeably worse.

Figure 3 shows the average, maximum, and minimum temperature over the wafer surface, as a function of time. For HEI-DSD and HEI-MMAP, the average temperature quickly hits 600 F, with a slight temperature oscillation over the wafer. For SEI, the average temperature reaches the target temperature slowly, but the temperature fluctuation is much higher than for HEI-DSD and HEI-MMAP. For UCB-OK, the average temperature does not even reach the target temperature. Clearly, the two proposed HEI methods return much better manufacturing settings compared to the two existing methods.

## 8 Conclusion

In this paper, we present a hierarchical expected improvement (HEI) framework for Bayesian optimization of a black-box objective . The motivation behind HEI is the greediness of the popular expected improvement (EI) method, which over-exploits the underlying fitted GP model and can fail to converge to a global optimum even for smooth functions . HEI addresses this via the use of a hierarchical GP model on , which accounts for uncertainty in the fitted model. One advantage of HEI is that it preserves a closed-form acquisition function, which allows for efficient optimization even for high dimensions. Under certain prior specifications, we prove the global convergence of HEI over a large function class for , and derive global convergence rates under smoothness assumptions on . We then introduce several hyperparameter specifications which allow HEI to efficiently approximate a fully Bayesian optimization procedure. In numerical experiments, HEI provides improved optimization performance over existing BO methods, for both simulations and a real-world process optimization problem in semiconductor manufacturing.

## Appendix A Proofs

### a.1 Proof of Theorem 3

By Theorem 1, the posterior distribution follows a non-central t-distribution:

 [f(x)∣∣Dn]∼T(2a+n−q,^fn(x),~σ2ns2n(x)).

Let . The density function of then takes the following form:

 g(f;νn,^fn,~σn,sn)=Γ((νn+1)/2)~σnsn√νnπ⋅Γ(νn/2)(1+(f−^fn)2νn~σ2ns2n)−(νn+1)/2.

Using this density function, the HEI criterion can then be simplified as:

 HEIn(x) =Ef|Dn(y∗n−f(x))+=∫y∗n−∞[(y∗n−^fn)+(^fn−f)]g(f)df =(y∗n−^fn)Φνn(y∗n−^fn~σnsn)+∫y∗n−∞(^fn−f)g(f)df. (14)

The second term in (A.1) can be further simplified as:

 ∫y∗n−∞(^fn−f)g(f)df = −~σnsn2∫(y∗n−^fn~σnsn)2−∞Γ((νn+1)/2)√νnπ⋅Γ(νn/2)(1+tνn)−νn+12dt = ~σnsnνnνn−1Γ((νn+1)/2)√νnπ⋅Γ(νn/2)(1+(y∗n−^fn)2νn(~σnsn)2)−νn−12 = √νn~σnsnΓ((νn−1)/2)(νn−2)√π⋅Γ((νn−2)/2)⎛⎝1+1νn(y∗n−^fn~σnsn)2⎞⎠−νn−12 = √νnνn−2~σnsnϕνn−2(y∗n−^fn√νn/(νn−2)~σnsn).

Therefore, we prove the claim.

### a.2 Proof of Theorem 4

The proof of Theorem 4 requires the following three lemmas. The first lemma provides an upper bound for the RKHS norm of function for changing scale parameters:

###### Lemma 8

If , then for all , and

 ∥f∥2Hθ′(Ω)≤(∏di=1θi/θ′i)∥f∥2Hθ(Ω).

The RKHS norm of , , can be written as:

 ∥f∥2Hθ=∫ξ|ˆf(ξ)|2ˆKθ(ξ)dξ.

The Fourier transform of kernel can be further decomposed as

 ˆKθ(ξ)=ˆC(∑di=1(ξi/θi)2)1/2∏di=1θi.

Suppose Assumption 1 holds, i.e., is isotropic and radially non-increasing. Then

 ˆKθ′(ξ)=∏di=1(θ′i/θi)ˆKθ((θ′1/θ1)ξ1,…,(θ′d/θd)ξd)≥~CˆKθ(ξ),

where . Given , we obtain

 ∥f∥2Hθ′(Ω)=∫|ˆf|2ˆKθ′≤∫|ˆf|2~C⋅ˆKθ=~C−1∥f∥2Hθ(Ω),

which proves the desired result. The following two lemmas describe the posterior distribution of in terms of . For simplicity, we denote for