In this paper, we consider the problem of global optimization of expensive functions, i.e., functions which require large computational costs to evaluate. For physical and computational experiments, these functions represent the relationship between input and output variables, and may require days or even weeks to evaluate at a single input setting. One example is the high-pressure mixing and combustion processes in liquid rocket engines, which requires numerically solving a large, coupled system of partial differential equations; seeOefelein and Yang (1998). Even when computation is parallelized over thousands of processing cores, a comprehensive simulation of a single injector may take months to complete. An important problem about expensive functions is how to optimize the output/response by choosing appropriate settings of the input variables. This problem can be challenging for two reasons. First, it is not feasible to conduct extensive runs of function evaluations to find the optimal input settings, since each function evaluation is expensive. It is thus desirable to identify the optimal input settings with as few runs as possible. The second challenge comes from the complicated nature of the functional relationship. They are usually regarded as “black-boxes”, because there is no explicit relationship between the input and output. Although various local optimization methods are available when the derivatives of the functions are known or can be easily obtained, see Boyd and Vandenberghe (2004), such methods are not applicable in the present scenario.
In the literature, a widely used practice for global optimization of expensive functions is to sequentially select input settings for function evaluations based on some criterions. The approach consists of two steps. First, it constructs a surrogate model to approximate the true function based on all the observed function outputs. The advantage of employing surrogate model is that it can provide predictions at any input settings with much cheaper computation. Second, it identifies a new input setting for function evaluation according to some surrogate model based selection criteria. With this approach, it is feasible to approximate the global optimizer of the true function via the surrogate model optimization. The commonly used surrogate models are the kriging models (Jones et al., 1998; Jones, 2001) and models based on radial basis functions (Gutmann, 2001; Regis and Shoemaker, 2007). Chen et al. (2011, 2017)
proposed to construct the surrogate models via overcomlpete pre-specified basis functions. In addition, another type of optimization approach is statistical global optimization which chooses the next point based on a probability improvement criterion, like P-algorithm(Torn and Žilinskas, 1989). Gutmann (2001) and Žilinskas (2010) have showed the equivalence of the P-algorithm and the surrogate approach proposed in Gutmann (2001) under certain conditions. For more details along these lines, see a review in Žilinskas and Zhigljavsky (2016).
The primary objective of this paper is to propose a novel global optimization framework for optimizing expensive functions. Our approach is motivated by Regis and Shoemaker (2007), in which they utilize Radial Basis Functions (RBF) to build a deterministic surrogate model and guide the selection of the next explored point based on the predicted response and some distance criteria. The rationale of using RBFs is that they can capture the nonlinear trend of functions. However, the RBFs they used are pre-determined and lack the flexibility of modeling. Also, it is less efficient to perform function evaluation from their surrogate model, because they use RBFs in a deterministic way without providing prediction uncertainties. Although a distance criterion is used to avoid getting trapped at local optima, it does not incorporate the information in response values for the surrogate models. To make better use of all information in the data, we propose to construct surrogate model with RBFs that are chosen adaptively based on the updated outputs, and to select new points based on surrogate models with quantified uncertainties.
There are other approaches for global optimization of expensive functions in the literature. Jones et al. (1998) propose a global optimization scheme by constructing a surrogate model with the kriging method. Our approach is different in that they make strong assumptions on the correlation structure between explored points while ours does not. A detailed review related to the kriging model in global optimization can be found in Jones (2001). Chen et al. (2017) propose a global optimization scheme that builds a mean prediction model with linear basis functions selected from a dictionary of functions, and then imposes a Bayesian structure over the mean model to quantify the uncertainty of the prediction. Our approach is also different from Chen et al. (2017). Instead of using a predetermined discrete function dictionary with a large number of linear functions, we use a moderate number of RBFs that can be adaptively updated based on observed data.
The paper is organized as follows. In Section 2, we give a mathematical formulation of the global optimization problem, and provide a review of the RBFs. In Section 3, we present the proposed global optimization framework that utilizes adaptive RBF-based Bayesian surrogate model. In Section 4, we present simulation studies to validate and compare our proposed method with the methods by Regis and Shoemaker (2007) and Jones et al. (1998). Further simulation results for 3- and 4-dimensional functions are given in Section 5. A modification of the proposed method to avoid getting trapped in local optima is presented in Section 6.1. In addition, we study the effect of the grid size which is used as the candidate points in the proposed method, and also come out a grid-free approach. Concluding remarks and future research directions are given in Section 7.
2 Problem Formulation and Review of RBFs
Suppose is an expensive function of interest, where and is a -dimensional convex domain in . The objective is to identify an optimal input setting that maximizes
Because it is not practical to evaluate over to search the global maximizer due to the huge computational cost, a well-established practice is to sequentially select a few input settings for function evaluation using a two-step strategy. Suppose a set of function evaluations are taken. In step 1, a surrogate model is constructed and the resulting model approximation is denoted by Unlike the true function the surrogate model is much cheaper to build and evaluate. Therefore it is feasible to predict function values over all In step 2, the next input setting is selected for function evaluation via certain criterion based on the surrogate model from step 1. Steps 1 and 2 iterate until the total computational budget is met. The best point among all the chosen input settings, , can be treated as an approximation to the true optimal point . By adopting this two-step strategy, we will present in Section 3
our proposed framework in detail. Note that the surrogate construction may not necessarily be an interpolator of the observed points, i.e.,Because our goal is optimization, the surrogate is used to predict the location of the optimal points, rather than to approximate the response with high accuracy (Chen et al., 2011). Thus we want to capture the trend of the true response surface quickly and to serve this purpose, our surrogate model does not have to meet the interpolation requirement.
In the remaining part of this section, we give a brief review of the RBFs, which will be used in the proposed framework for the surrogate model construction. In the literature, the RBF is popularly deployed in applied mathematics and neural networks. SeeBuhmann (2003) and Bishop (2006). Several commonly used functions are: (1) Gaussian functions: (2) generalized multi-quadratic functions: with (3) generalized inverse multi-quadratic functions: with (4) thin plate spline functions: where is the center of the function, and is a pre-specified constant which varies with the chosen function. In our work, we will focus on the Gaussian RBFs. The Gaussian RBFs have two types of parameters: the center parameter that determines the location of the RBFs, and the scale parameter that measures the degree of fluctuation of the function. One advantage of using the Gaussian RBFs over other basis functions is that it can capture different trends of response by choosing different centers and scales. For example, a larger indicates a more concentrated change in the surface, and vice versa.
3 General Global Optimization Framework
In this section, we propose a global optimization framework that utilizes adaptive RBF-based surrogate model via uncertainty quantification. In Section 3.1, we propose a novel hierarchical normal mixture Bayesian surrogate model with RBFs to approximate the true function, where the model coefficients are sparsely represented to avoid over-fitting, and the parameters of the RBFs are adaptively updated each time a new point is explored. This allows us to predict the function value at any given candidate point. In Section 3.2, we propose a model-guided selection criterion and based on the posterior samples, a sample version of the expected improvement criterion is adopted. A new point can then be selected to identify a more promising area of global maximizer. A summary of the algorithm and some discussions will be presented in Section 3.3.
3.1 Normal Mixture Surrogate Model with RBFs
Suppose we observe explored points and its function values . Without loss of generality, we assume because otherwise we can approximate ’s instead of ’s, where is the sample mean of , i.e., . We propose to construct a surrogate model by a summation of Gaussian RBFs and an error term :
Here, an error term is used to model the discrepancy between the model approximation constructed by the RBFs and the true function We assume that
are independent normal distributions with mean 0 and vairance. Note that if the center parameters ’s and the scale parameters ’s are known and fixed, then the surrogate model in (2
) is exactly the same as linear regression.
3.1.1 Prior Distributions
Because both ’s and ’s are unknown, the proposed modeling approach can handle highly nonlinear functions. A uniform prior over a rectangular region is used for
where i.e., the smallest hyper-rectangle to cover the current explored points, and it is adaptively changed with the addition of new explored points, see Andrieu et al. (2001). A gamma prior is used for the scale parameters
where and are common to all ’s.
We also impose a hierarchical structure on the coefficients ’s. Define a latent variable to indicate whether a certain basis function is important or not: indicates that the th basis is important, while indicates the opposite. Specifically, we set with small and with relatively large , where
can be interpreted as a variance ratio. This hierarchical setting is first employed in the Stochastic Search Variable Selection (SSVS) scheme byGeorge and McCulloch (1993) and is also used for uncertainty quantification studies in Chen et al. (2017). Indeed, it is one type of the “g-prior” (see Zellner (1986)) for avoiding over-fitting. Now the mixture normal prior of the model coefficient can be written as follows:
with if and if and a binomial prior for the latent variable
Note that the choice of plays an important role in the posterior sampling and control the model complexity. We also impose an inverse-gamma prior for the residual variance
3.1.2 Posterior Sampling
The posterior distribution defined in (8) is computationally intractable. Markov Chain Monte Carlo (MCMC) method is utilized to solve this problem, see Andrieu et al. (2001) and Koutsourelakis (2009). That is, we use the MCMC method to generate the posterior samples from . Thus we sequentially sample , , and by fixing the other components and the data . Under certain conditions, we can guarantee that these samples can be treated as the posterior samples of . Here the MCMC method iterates the following two steps:
Sample by fixing , , and .
Sample and by fixing , , , and .
Specifically, we use the Gibbs sampler to generate the posterior samples for the parameters and the Metropolis-Hasting algorithm to obtain the posterior samples for the parameters and because there is no explicit formula for the posterior distributions of and .
Start with the posterior distributions for Denote and . Then, the samples of can be generated by
The samples of can be generated by
For the samples of , it would be simple to sample sequentially conditional on the other components, and can be generated by
with and is with ;
with and is with Here the notation
represents the vector of all’s except
Now we turn to the parameters and First, consider the sampling procedure for . Instead of directly sampling the vector , we suggest sampling sequentially from
where denotes the vector of all ’s except We use the Metropolis-Hasting algorithm to generate posterior samples for Specifically, at a new step , we set the proposed density to be a mixture of two densities, and a temporary sample can be obtained from the whole domain with uniform probability, or it can be a perturbation of the current iteration within its local neighborhood, i.e.,
Then we accept this temporary sample with the acceptance rate
Similarly, we can use the Metropolis-Hasting algorithm to generate samples of At step we choose a temporary as a perturbation of the current sample by the proposed density
And we accept such sample with the acceptance rate
can be obtained, where is the total number of iterations. After discarding the first say 40% samples, the remaining samples can be treated as the posterior samples of , , , and from . Thus the posterior sample for model approximation at a candidate explored point can be calculated by
Then the function prediction can then be calculated as the average of ’s, and the prediction uncertainties can be calculated as the sample variance of ’s.
Finally, we note that the mean value of the posterior density of in (12) is which is a biased estimator of with a nugget value Hence, this estimate of can be regarded as a ridge-type regression estimate. It is deployed to prevent the model coefficients from being too large. Its use can lead to a more stable surrogate model.
3.1.3 Tuning Parameters
A remaining issue in the Bayesian computation is the tuning of the hyper-parameters, which is critical for the model performance. For the hyper-parameters related to the RBF, we adopt the settings in Andrieu et al. (2001) and Koutsourelakis (2009). Specifically, for the proposed density of the RBF centers in (16), we set For the prior of the RBF scales in (4), we set and for the proposed density of in (18), we set For the hyper-parameters related to model coefficients and residuals, we follow the settings in Chipman et al. (1997). Specifically, for , we suggest to set , where i.e., the largest change in and . For the prior of the indicator variable , we set i.e., the probability of selecting a variable is 50%. For the hyper parameter and in (7), we set and
to be the 99% quantile of the inverse gamma prior that is close toConsider the variance ratio . Usually we choose a large positive value for , e.g., . From our experience, we fix in the first simulation example. However, it should be problem-dependent and can be changed for different optimization problems.
3.2 A Point Selection Criterion
In this section, we discuss how to select new explored points based on the uncertainty of the response prediction for exploring uncertain regions. The ideal selection criterion should perfectly balance between exploration and exploitation properties to efficiently identify the optimal points within the given search budget. Here a sample version of the Expected Improvement criterion is adopted.
The EI criterion, initially proposed by Mockus et al. (1978), is used to select points close to the global maxima based on a chosen surrogate model. Using this criterion, an explored point is selected to maximize the expected improvement over the best observed response
where is the maximum of the observed model outputs. It is pointed out in Jones et al. (1998) that under the Gaussian assumption of has the following closed form expression:
By examining the terms, we see that the expected improvement is large for those having either (i) a predicted value at that is much larger than the maximum of outputs obtained so far, i.e., or (ii) having much uncertainty about the value of i.e., when is large.
In our scenario, since the proposed surrogate model does not satisfy the Gaussian assumption, there is no analytical form for and thus it is not practical to calculate directly. Instead, we calculate the Sampled Expected Improvement (SEI) as suggested in Chipman et al. (2012) and Chen et al. (2017), i.e., to estimate based on the posterior samples of
where is the th posterior sample by (19), and is the total number of posterior samples. Unlike in the Gaussian case, the SEI value in (22) may not be expressed as a weighted sum of the improvement term and the prediction uncertainty term. From its definition, only the prediction posterior samples that are larger than the current best value, , are taken in the summation. Thus SEI first identifies the possible “improvement”’ area, , and then sums over these terms.
A new explored point at step is then selected to maximize the SEI criterion ,
where is the current explored point set. Since the SEI criterion does not have a closed form, it may not be easy to identify the next explored point by solving (23). Thus to identify the next point, we may limit the search over a pre-specified grid, , instead of over the whole region, .
3.3 The Proposed Algorithm and Remarks
In the first part of this section, we will present a summary of the algorithm and the flexible usage of the proposed adaptive RBF-based global optimization framework. For abbreviation, we will refer to the proposed method as BaRBF, where “Ba” stands for “Bayesian adaptive”. In the second part, we will compare our method with the baseline method proposed in Regis and Shoemaker (2007).
Algorithm 1 summarizes the proposed global optimization method. In the beginning, the initial design is chosen as a space-filling design. In this paper, the maximin Latin hypercube design (Morris and Mitchell, 1995) is used. Then the main body of Algorithm 1 is to iterate between the two steps for the surrogate model construction in Section 3.1 and the point selection criterion in Section 3.2.
|Algorithm 1 Global Optimization Algorithm|
|1:||Choose a small set of initial explored points using a maximin|
|Latin hypercube design, and evaluate on|
|3:||Construct a Bayesian surrogate model|
|as in Section 3.1 based on|
|4:||Calculate the SEI in (22), and|
|select a new explored point via selection criterion in (23) over|
|6:||Update and evaluate|
|8:||Return the current best optimal point, and|
|the corresponding function value, .|
Note that the proposed BaRBF can be flexibly used in different scenarios. For example, when the number of available function evaluations is small to moderate, there may not be enough observations to estimate all the parameters. In this case, we only need to update some part of RBF parameters, say the scale parameter by setting all the scale parameters and need not update the parameters. The choice of whether to update all parameters or part of them can be decided based on the magnitude of the model residuals at the initial stage. If updating all parameters leads to relative large model residuals, then we can fix certain parameters instead. The formulas of the posterior distribution in (8)-(18) need some minor changes accordingly if certain RBF parameters are fixed. For the above example, one only needs to set in eq. (8) to be the same , and update only one in (18), and does not need to update the ’s in (15) and (16).
For the remaining part of this section, we will compare our BaRBF with the Global metric stochastic RBF (G-MSRBF) algorithm proposed by Regis and Shoemaker (2007) from a theoretical perspective. The G-MSRBF method will be regarded as the baseline method from now on. First we give a brief review. The G-MSRBF employs a surrogate model using RBFs,
The RBF parameters in (24) are pre-specified, i.e., the RBF centers are set at the explored points , and is pre-calculated at the initial stage of optimization. The model coefficients in (24) are estimated by solving a deterministic linear system of equation where And their point selection criterion
is a weighted average of the scaled response prediction with
and the maximin distance criterion with
where The can take values in periodically. For example, if at time then at the next time Then a new point is selected to maximize , and finally the global maximizer is also estimated by .
Although both methods use RBFs, there are two main differences. First, the surrogate model is different. BaRBF uses a Bayesian surrogate model that provides not only predictions but also its uncertainties, while the G-MSRBF utilizes a deterministic surrogate model that only provides predictions. Because our proposed surrogate model is similar to ridge regression, the approximation of response is more robust and smooth compared to the interpolation surrogate model in G-MSRBF. The second difference lies in the choice of the selection criterion for new explored points. In our method, we utilize the expected improvement criterionwhich can be regarded as a soft-thresholding version of . As previously discussed, thresholding the prediction makes it easier to identify global optima. In addition, under the Gaussian prediction, EI criterion contains the measure of the prediction improvement and the model uncertainty. Here the part of the prediction improvement can be treated as exploitation and the uncertainty part is used to explore the search space. In G-MSRBF, the global exploration is based on the maximin distance criterion, , and the is used for local refining. The weighted average of these two criteria is adopted with pre-specific weight pattern. Simulation studies will be presented in Sections 4 and 5 to further understand and compare the empirical performance of the two methods.
4 Simulation Study
To assess the performance of BaRBF, we compare it with G-MSRBF, which is regarded as the baseline method. In G-MSRBF, the candidate set for the next point selection is chosen in each iteration. To make a fair comparison, the candidate points are fixed as a pre-specified grid, , in the experimental region and both methods will be implemented over the same grid.
In addition to the G-MSRBF, we also consider another global optimization approach based on Gaussian process surrogate model for comparisons. Jones et al. (1998)
proposed the efficient global optimization (EGO) approach by using the Gaussian process for surrogate construction. EGO starts from an initial point set. After evaluating the response values of the initial design points, a numerical optimization approach, like meta-heuristic algorithm, is used to obtain the MLE of the parameters in the Gaussian process and then the corresponding surrogate model is obtained. Since the Gaussian process prediction follows a normal distribution, the EI criterion in Eq. (21) is used to identify the next explored point over the feasible point set, , and then the surrogate model is updated. Iterate these two steps until a stopping criterion is met. Usually the stopping criterion can be the number of explored points or the maximum value of the EI criterion over the unexplored points.
First we start with a 2D global optimization example, whose objective function is quit smooth. Then we consider another 2D objective function which is not smooth and has multiple global optima.
4.1 2D Branin function
We consider the standard 2D test function “Branin function”, which has been widely used in the global optimization literature, e.g. Jones et al. (1998). The scaled version of “Branin function” we use here is defined as follows,
where and To simplify our code, we further restrict this function on the evenly spaced grid . The contour plot of the Branin function over the pre-specified grid points is given in Figure 1, where there will be two local maxima and one global maximum on with the maximum value 1.0473. In BaRBF, we first measure the prediction uncertainties for all grid points in and then identify the next explored point via (23) from the set .
The objective is to find that maximizes in (28) with as few evaluations as possible. At each iteration of the algorithm, the current optimal point, , and its function value are recorded together with all the explored points. We randomly choose a small set of initial explored points using a maximin Latin hypercube design (Santner et al., 2013). All three methods start with the same set of ’s. Each time the surrogate model is updated by incorporating the value of a new explored point. Then we calculate and update the value. For each algorithm, new explored points are selected and evaluated sequentially until the total number of explored points reaches This process is repeated 60 times, and the average performances are reported and compared for the two methods.
For fair comparison among BaRBF and G-MSRBF, we set the initial sampler of the RBF parameters in BaRBF to be the same as the fixed RBF parameters in G-MSRBF. Specifically, we use Algorithm 1 in Fasshauer and Zhang (2007) to select an optimal value of in G-MSRBF that minimizes a cost function that collects the errors for a sequence of partial fits to the data. The center parameters ’s are set as the explored points ’s.
From many simulation trials, we found out that, for the Branin test function, updating all parameters in BaRBF will lead to relatively large model residuals that do not converge. This might be caused by the small number of function evaluations. Thus we only update one scale parameter with all and fix the center parameter ’s at the explored points. We iterate the MCMC 10,000 times. Also, we discard the first 40% of the samples, and take 1 out of every 5 samples in the remaining 60% of the samples, in order to obtain stable and less correlated posterior samples for model fitting. In order to implement BaRBF, two important tuning parameters need to be pre-specified. The first one is the value of in the coefficient prior. Here we set . Since the weighted selection criterion is adopted to select the next explored point, the parameter in the weight is fixed as 5 after we have tested several different possibilities.
In this subsection, we first illustrate the proposed BaRBF with one particular simulation sample. Figure 2 plots the contours of the surrogate model in BaRBF and the locations of the explored points using BaRBF with 21, 26, 31, 36, 41 for one simulation sample. Figure 2(a) shows the initial status of BaRBF. The initial design is a 16-run maximin Latin hypercube indicated by 16 green squares. The next explored point chosen by the selection criterion, i.e. the 17th point, is indicated by the black circle in the lower right corner. In Figure 2(b), the five additional points (17th to 21st) are indicated by the five blue squares. These five points are divided into three sets, one closer to the global maximum, the other closer to the other two local maximums. As in Figure 2(a), the next explored point, i.e., the 22nd point, is indicated by the black circle. In Figure 2(c), all the 21 points from Figure 2(b) are indicated by green squares, the additional five points by blue squares, and the next explored point by black circle. Then the same symbols are used in Figure 2(d) to (f) to demonstrate the progression of points for and 41. Amazingly, except the initial design points, all the explored points are located closer to the three maxima, none for exploring bad regions. Finally the global maximum is identified in point 39 as shown by the black square in Figure 2(f). In summary these contour plots show that BaRBF efficiently explores the experimental space and quickly approaches the optimal points.
|Quantile||Quantile||true optimal values|
We report the performance of BaRBF and G-MSRBF based on 60 replications by randomly generating the initial LHD designs. The purpose is to see whether BaRBF provides a more efficient search path to identify the global maximum compared with G-MSRBF, for the same number of function evaluations. The numerical results are summarized in Table 1
. First BaRBF has the higher mean value, 1.0443, than that of G-MSRBF, 1.0425, and is more stable because of its smaller standard deviation. Based on the sample quantiles of the optimal values identified by both approaches, there is a detectable difference at the 5% quantile values. We found out that for several cases, the best points identified by G-MSRBF are not close to the true optimal point and after checking the corresponding search processes, G-MSRBF did not efficiently explore the experimental space by properly choosing the next points. On the other hand, G-MSRBF performs better than BaRBF for the first quartile Q1. In addition, among the 60 replications, BaRBF can identify the true optimal values 26 times, which is higher than 20 times for G-MSRBF. Overall BaRBF has a better performance. Here we plot in Figure3 the mean value as well as the 5% and 95% quantile curves of the current optimal values with respect to the number of iterations for both methods. The mean curves in the two plots are very similar. More meaningful is the comparison of the two 5% quantile curves. For G-MSRBF, the curve moves up quickly until =13; then it gets stuck (flat) until about = 23. By comparison, the 5% quantile curve for BaRBF moves up quickly until = 18. By then, the band between the upper and lower quantile curves is very narrow and continues to shrink. The corresponding band for G-MSRBF does not shrink even to the end ( = 30). In fact it remains very wide when = 26. This figure gives a more informative comparison than the numerical results in Table 1. It clearly shows the better performance of BaRBF over G-MSRBF.
When we compare the results with EGO, it seems that EGO works perfectly in this Branin example because EGO can quickly identify the global maximum point in each replications. The possible reason should be that since the Branin function can be treated as a smooth function, it can be fitted quit well by the Gaussian process model and thus the EI criterion in EGO can rapidly guide the search process to the target point. For our BaRBF, we do have a error assumption in the surrogate model and the model fitting may not be as well as Gaussian process model. In addition, SEI is computed as the sample expectation of the improvement function without any distributed assumption. Thus if the Gaussian assumption is satisfied, it is not surprised that EI can be more efficient in getting the global optimal point. In fact, when we monitor the search process of BaRBF, sometimes it may stay in a local area for a while. This should be related to that the exploration effect of the SEI criterion does not function well.
4.2 2D Ronkkonen function
In addition, we consider another 2D objective function in Rönkkönen et al. (2011), i.e.,
where for , , and and the experimental region is . This objective function has been used as a test function in Chipman et al. (2012). Here we also restrict the function on the evenly spaced grid . The contour plot of this Ronkkonen function over the pre-specified grid points is given in Figure 4, where there are 12 local maximums and 4 global maximal points with the maximum value, 0.4777. Because of its multiple local and global optimal points, this Ronkkonen function is not as smooth as the Branin function.
In this example, the initial point sets are the same as those in Section 4.1, and the total number of explored points is set as , i.e., based on 16 initial points, the search algorithm iterates 30 times by sequentially adding 30 points. Then the best value among the 46 explored points is reported. Here the goal is to identify one of the four global maximum points. The average performances of BaRBF, G-MSRBF and EGO over 60 replications are summarized in Table 2.
|Quantile||Quantile||true optimal values|
First, G-MSRBF performs worst in terms of the frequencies of reaching one of the four global maximum points (see the last column of Table 2). Then we focus on comparing the performance between BaRBF and EGO. From Table 2, the mean of the best function value found by BaRBF is 0.4775 with standard deviation 4.6850e-4, while the corresponding values for EGO are 0.4526 and 0.0344 respectively. We also compute the sample quantiles of the best values found by both methods. Table 2 shows that BaRBF touches the global maximum at the sample quantile, while EGO reaches 0.4777 at the quantile. In addition, for the BaRBF, the frequency of reaching a global optimum is 43/60, while the frequency for the EGO is 18/60. The mean value of 60 replicates, and the and quantile curves of the current optimal values with respect to the number of iterations for EGO and BaRBF are shown in Figure 5. The mean curve of the BaRBF moves up quickly to the global optimal value, while the curve for EGO moves up more slowly. In addition, the quantile curve for EGO does not get much improvement before adding explored points, i.e., , while the same curve for BaRBF moves up quickly and gets close to the global optimal value after adding 15 points, i.e., . Another attractive feature for BaRBF is the extremely low standard deviation (see the std column), which may suggest that the BaRBF can perform stably over repeated implementations. In summary, the BaRBF outperforms the EGO in this example.
Chipman et al. (2012) have pointed that when the test function has multiple global optima, the EI criterion might lead the search process to jump out of the neighborhood of one global optimum to another one and cannot stick around one global optimum. This may be explained by the fact that the EI criterion tries to minimize the prediction uncertainties among different global optima. The quantile curve for EGO in Figure 5 seems to support this point. For the BaRBF, by using SEI, it can quickly locate one neighborhood of a global maximum and then identify the best value. In addition, since this Ronkkonen function is not as smooth as the Branin function, the normality assumption and the interpolation property of the Gaussian process may not give advantages for the surrogate construction. For the BaRBF, the surrogate model consists of additive radial basis functions and their parameters are simultaneously adjusted via the proposed Bayesian approach. Thus our surrogate construction approach may be more advantageous for non-smooth test functions.
5 Simulation Studies for Three and Four Dimensions
In addition to these two different 2D test functions, we consider 3D and 4D examples to illustrate that the proposed BaRBF can deal with higher dimension cases and also compare its performances with EGO and G-MSRBF.
5.1 3D Ronkkonen Function
The Ronkkonen function defined in Rönkkönen et al. (2011) is quite general and can be extended to different dimensions. Here we consider a 3D Ronkkonen function which is generalized from the 2D function in Section 4.2 and is shown below,
where for , , and ; ; . The experimental region considered here is . According to Rönkkönen et al. (2011), there are local maximum points and 27 of them are the global maximum points.
We divide the experimental region into grid points by setting the grid size as 0.04. For this grid, there is only one global maximum point with the function value 0.3584. We start by choosing the initial maximin LHD with 50 points for the 3-dimensional case. After the initial stage, we run BaRBF for 50 iterations. To implement BaRBF, we follow the same RBF set-up in Section 4.1. That is we set the scale parameter with all and fix the center parameter ’s at the explored points. For the tuning variable , we fix as 15. The performance of the BaRBF is measured by the maximum function value identified within 100 explored points and is summarized based on 20 replications by independently re-generating the initial design points. For the comparison purpose, we also implement EGO and G-MSRBF.
Table 3 is a summary of the maximum values obtained by the three approaches. G-MSRBF performs worst in this case because G-MSRBF cannot identify any true global maximum value within 20 replications. Between BaRBF and EGO, BaRBF has better performance in the average maximum function values and the frequency of reaching the global maximum point, but both approaches share similar values in the first and third quartiles, Q1 and Q3, and the median value. Because there are few replications, the maximum value identified by EGO is less than 0.34. A possible reason should be similar to what was stated in Section 4.2, namely, the Ronkkonen function is not a smooth function and EGO may jump around different local modes. Finally, the std value for BaRBF is much lower than that for the others. This is similar to what we observe in Table 2 for the 2D case and has a similar implication on the stability of the BaRBF.
|true optimal values|
5.2 4D Hartmann Function
In this section, we study the performance for the 4-dimensional Hartmann function. This function has been studied in Picheny et al. (2013) and is defined as
where ; and There are multiple maxima for this function over the experimental region . Here we divide the experimental region int grid points by setting the grid size as 0.05. The maximum function value over these points is 3.1218.
The proposed method, BaRBF, with SEI selection criterion is used to identify the true optimal point over the pre-specified grid for this 4D function. For the parameters in radial basis function, the scale parameter is set as and the center parameters, ’s, are fixed at the explored points. In addition, the variance, , in the mixture normal priors of the coefficients is fixed as 10.
We start by choosing the initial maximin LHD with 50 points for the 4-dimensional case. After the initial stage, we run BaRBF for 50 iterations. Thus there are in total 100 explored points in this example. To track the performance of the proposed method, we record the maximum function values in each iteration, i.e., . That was repeated 20 times by randomly generating the initial maximin LHD. We repeat the same procedure for G-MSRBF and EGO. A summary of the maximum values obtained by the three approaches are shown in Tables 4. For this 4-dimensional Hartmann function, BaRBF outperforms G-MSRBF in terms of the frequency, the mean of the optimal values and a smaller standard deviation, and as shown in Table 4, for BaRBF, the middle 50% of values between Q1 and Q3 is extremely tiny and smaller than that for G-MSRBF. However, EGO has the best performance in this case. We think it should be related to the target function because this Hartmann function should be a smooth function and thus it is fever to the EGO approach.
|true optimal values|
In this session, several issues are studied. First we modify the proposed algorithm by adding a step to force the search process to jump out from a local optimal area. Then we study the effects of grid size on the performance of the proposed method. Finally, in order to reduce the computational burden for high-dimensional optimization problems, a grid-free version of the proposed method is introduced and a higher dimension example is illustrated.
6.1 A Modified Version of BaRBF
From tracing the search process of the BaRBF in the Branin function example, we found out that sometimes the BaRBF gets stuck in a local area and cannot leave the area for a while. To overcome this potential weakness, we have the following modification. To jump out of this local area, we add an additional step, called the escape step, by monitoring the search process of the current best value. That is, we record the number of consecutive non-improvement iterations, i.e., iterations for which the current best function value cannot get improved from the new explored point. Denote this number by . Once exceeds a pre-specified number, , it indicates that the BaRBF is stuck in a local area. Instead of continuing the search, we add some additional points to explore the experiment region. The additional point is chosen based on the maximin-distance criterion in the region. That is, given the current explored points, we find the point such that the union set of this point with the current explored points has the maximal value of the minimal distance between any two points in this union set. The purpose is to put additional points in the unexplored area as far away as possible from the existing points. We continue adding points until we obtain a better function value or until we add points. Then we will return to the original search procedure. A similar idea has been adopted in Regis and Shoemaker (2007) to detect if the search algorithm converges to a local optimum. We refer to this modificcation as M-BaRBF.
|Quantile||Quantile||true optimal values|
We implement the M-BaRBF for the Branin function by setting . For the same initial points sets and other tuning parameters, the average performance of the 60 replicates for M-BaRBF is shown in Table 5. Compare with the results for BaRBF in Table 1. Except for the 5% sample quantile value, the M-BaRBF performs better in the mean value and median value. In addition, the frequency to reach the global optimum is 38/60 which is much higher than 29/60 for the BaRBF. The Q1 value for M-BaRBF is significantly higher than that for BaRBF and the median value touches the global maximum of the Branin function.
6.2 The Effects of the Grid Size
In the numerical examples, we suggest to pre-specify the grid size first when we implement our algorithm. This size may be chosen based on the prior knowledge. In the numerical examples in Section 4, the grid size is fixed as 0.04 or 0.05. Suppose we can consider different grid sizes for a given optimization problem. We will illustrate the effects of the grid sizes on the performance.
We revisit the 2D Branin function example in Section 4.1. Instead of setting the grid size as 0.04, we choose the finer size 0.02 and divide the region into grid points which still cover the original grid, . Based on this finer grid, there are still two local maxima and the global optimal point is located at . Since the number of grid points is now about four times that of the original, we take more iterations, , for the proposed BaRBF. Then based on the same initial points and tuning parameters, the results with 60 replications are summarized in Table 6. In this table, in addition to the results with 120 iterations, we also report the results with 30 iterations for comparison purpose.
|Quantile||Quantile||true optimal values|
Compare the performances of BaRBF and BaRBF(120) in Tables 1 and 6 respectively. First, the BaRBF(120) is implemented over the finer grid and it has higher frequency, , to reach the global optimal point. In addition, the quantiles and the median value of the best solutions of BaRBF(120) are 1.0471, 1.0471 and 1.0473 respectively which are significantly higher than the corresponding values shown in Table 1. Obviously BaRBF(120) has the higher mean value 1.0472 and a smaller standard deviation. This may be related to the fact that there are more candidate points and larger number of iterations. Thus BaRBF can identify better function values due to finer grid and can still explore the experimental space because of a larger number of iterations. To support this guess, we also report the summary of the BaRBF with finer grid and 30 iterations, denoted by BaRBF(30). The corresponding , sample quantiles and the median value are still better than the corresponding values shown in Table 1. But the frequency for obtaining the global optimal point is only 17/60. It means that 30 iterations may not be large enough for BaRBF to explore the whole region and the search process may get stuck in some local areas. Thus we need to have more iterations to increase the probability to jump out of these local areas. Overall we can conclude that when we have a finer grid, a larger number of iterations should be necessary.
6.3 A Grid-free Method
When we demonstrate the performance of the proposed BaRBF in Section 4, we mention that we need to pre-specify a grid as the explored candidate point set before implementing the BaRBF. For example, we set the grid size as 0.04 for the 2 and 3 dimension cases. Suppose we consider the application in parameter tuning. In practice, the precision of each variable should be limited and due to this grid set-up, we can easily select the next explored point based on the SEI criterion; otherwise it should be treated as another non-differentialable optimization problem. However, the problem for the grid set is the curse of dimensionality. For example, when we consider a 7-dimension region,. Suppose we choose the grid size as 0.2. Then the number of grid points is points, which is huge. To keep such huge number of grid points in our process would slow down the computational speed. Since the grid point set is used as the candidate set for selecting the next explored point, we would propose another modified BaRBF without pre-specifying grid point set.
The idea is similar to the G-MSRBF in Regis and Shoemaker (2007). Instead of fixing the grid points for the candidate point set, we randomly generate the candidate points in each iterations. Regis and Shoemaker (2007) have mentioned two conditions for this point generation procedure. One is the conditionally independent sampling condition and another one is the dense condition, that is, any point in
can be chosen as a candidate point with a positive probability. Since our approach satisfies the first condition, sample the candidate points from the uniform distribution over the experimental regionto satisfy the dense condition to guarantee the convergence property, i.e., when goes to infinity, the global optimal point can be identified by such search process with high probability. Thus in our algorithm, we add one more step to generate the candidate point set from the uniform distribution over the region and then choose the next explored points by applying the SEI criterion to . After updating the , we would continue the selection process by re-generating the new candidate points.
To demonstrate this new modified BaRBF, named the grid-free BaRBF, the Rastrigin function (Pohlheim, 2015) is taken as the objective function, i.e.,
This Rastrigin function has multiple local maxima and the locations of the local maximal points are uniformly distributed. The global optimal point is with maximum value 0. Here we choose , that is, we consider a 8-dimensional problem. At first the 80 initial points are chosen from a Latin hypercube design over the experimental region,