Gaussian process model based optimization, also known as Bayesian optimization (BO), is an effective approach to optimize black-box functions. By black-box functions, we refer to objective functions that have no explicit forms and can only be observed at arbitrary inputs. We further assume that observations are noisy and expensive to evaluate. The black-box function optimization can be formalized as
where ( is the design space assumed compact) is the design choice and. BO has been successfully applied in many problems. These include inventory problems where the decision-maker uses a simulation model to determine the best inventory level (the decision variable ) to maintain, in order to maximize the expected profit (the response et al. 2012), where the response in (1) typically represents the training error with respect to a set of hyperparameters
. Other popular application areas of BO include sensor networks, robotics, advertising, recommendation and reinforcement learning (seeShahriari et al. (2016) for a survey).
BO utilizes a GP model (also known as a kriging model) that is built with observations at sequentially selected design points. This model is essentially a statistical approximation of the baseline response and is considered as a type of metamodel. Metamodels summarize available information to predict the performance of unobserved points. Therefore, they are often used to guide optimizations for expensive responses. Compared with other metamodels, the GP model is very flexible with fewer assumptions required on the response structure and ‘more resistant to overfitting than general interpolation methods’ (such as the neural network)(Ankenman et al. 2010). Furthermore, the GP model provides an analytical form of its prediction uncertainty which can be used to determine the accuracy of the prediction and importantly, also help determine new design points to improve the model fitting. Typically, in BO, the next design point is chosen to maximize an acquisition function, which measures the utility of candidate points, computed with the GP model. After selection, the new observation is obtained and the GP model is updated accordingly.
BO can be categorized with respect to the acquisition functions adopted. Popular choices include Expected improvement (EI) (Jones et al. 1998), GP upper confidences bound (Srinivas et al. 2009), stepwise uncertainty reduction (Picheny 2015), knowledge gradient (Frazier et al. 2009) and entropy search (Hennig and Schuler 2012) (see Frazier (2018), Shahriari et al. (2016) for reviews). Among these approaches, EI is easy to perform and perhaps the most widely-used due to its ability to automatically balance between exploitation (searching around the current best solution) and exploration (searching less-explored regions). EI was originally proposed by Jones et al. (1998) in the efficient global optimization (EGO) algorithm to optimize deterministic functions. This acquisition function and algorithm were later adapted for stochastic simulations with homogeneous noise (Huang et al. 2006). More recently, the Two Stage Sequential Optimization (TSSO) (Quan et al. 2013) and the extended-TSSO (Pedrielli et al. 2020)
were proposed for the more general heteroscedastic noise case, where the noise levels at different input levels are not the same. These algorithms consist of two stages, namely the searching stage and the allocation stage, where the searching stage focuses on selecting the next design point based on the EI criterion, and the allocation stage focuses on allocating additional replications of function evaluations to existing design points to reduce the noise levels. Similar to these works, we adopt the EI criterion here as the acquisition function in our algorithm.
Despite numerous successful applications, BO has a few limitations that can hinder its broader usage. To illustrate them, we use an example of an agent-based simulation used in maritime safety. Here, real-time decisions on the trajectory for a vessel, characterized by four parameters including two turning angles and two traveling speeds, are required to minimize its probability of conflict with other vessels when sailing through a region (see detailed description in Section 6.3). In the simulated high-congestion region, a small turn of the vessel can result in a very different conflict environment, thus, resulting in a multi-modal response that can change dramatically across the design space. This raises a few difficulties for the traditional BO approach. First, to build an accurate GP model is challenging for such a non-homogeneous response, unless a large number of design points are used. This, in turn, brings a strain on the simulation resources and the computational cost of fitting the GP model as it increases cubically with the number of design points(Quiñonero-Candela and Rasmussen 2005). For such real-time decision problems, the large computational burden can diminish the advantage of BO. Hence, fast approximations of the GP model are required in such situations. Second, even with a proper GP model, the optimization can be hard. Although EI ensures global convergence under mild conditions (Bull 2011), it tends to over-exploit local optimal regions (Ranjan et al. 2011), which can result in excessive efforts expensed around the current best before opportunities are given to explore other regions where better solutions may exist. With a finite budget, this can cause the algorithm to return a sub-optimal solution, especially if the response has many local optimums like the one in our maritime safety example.
To address these challenges, in this paper, we develop a new BO algorithm, the Combined Global and Local search for Optimization algorithm (CGLO), which combines an iterative global and local search processes to optimize responses with heteroscedastic noise. CGLO leverages on a recently proposed additive GP model, the AGLGP model (Meng and Ng 2015) and exploits the model’s local and global structure to more efficiently guide the search. In this approach, the model fitting is faster and enables more flexibility in capturing the multi-modal response surface. Furthermore, the built-in mechanism in the switching criteria of the algorithm helps to avoid the over-exploitation of any single region.
While several fast approximation methods have been proposed such as sparse approximations (Quiñonero-Candela and Rasmussen 2005) and local kriging, none is able to adequately capture both global trends and local fluctuations of many nonstationary responses. Combined models (Snelson and Ghahramani 2007, Ba et al. 2012), on the other hand, are able to better capture these global and local trends, but they still assume similar local trends throughout the design space, thus restricting their modeling flexibility. Different from these works, the AGLGP model explicitly models the local residual structures independently and non-identically to enable this flexibility.
Our main contributions in this paper can be summarized as follows:
We provide the full development of the AGLGP model and propose a faster approach to build the model;
We adapt the EI criterion with the AGLGP model to develop a new algorithm with global and local search steps;
We provide the asymptotic convergence result for the proposed CGLO algorithm and test its finite-time performance with several examples.
The rest of this paper is organized as follows. Section 2 provides an overview of our CGLO algorithm and its key components. Next, Section 3 develops in detail the first key component, the AGLGP model. Subsequently, Section 4 describes the CGLO algorithm and Section 5 states its convergence results. Section 6 provides numerical tests of CGLO including some test functions and a maritime navigational safety application. Finally, section 7 concludes the paper. The proofs of all lemmas and theorems and all the appendices are provided in the online supplementary material.
2 Overview of CGLO Algorithm
The key idea of CGLO is to leverage on the local and global structure of the AGLGP model to direct a global exploration across the space with its global component, and to build the detailed AGLGP model in local regions to search more efficiently and accurately in these regions (without the need of a detailed estimation of the full GP model). The algorithm iterates between this global and local search levels, providing a cycle of learning between them to improve the quality of the search without overburdening the cost of estimating the model, and a switching criterion is used to decide when the CGLO switches back to the global level to avoid getting trapped in a local optimal. We illustrate the key ideas of CGLO with a simple example below.
Consider an example where the function (see Figure 1) is to be minimized. Here we see that the function has a global trend and also noisy local variations. If we directly use a full GP-EI algorithm, the search can be misleading at the start (with a small number of points) and potentially trapped in one of the many local optimums. Moreover, the full GP model becomes computationally expensive as the design points increase as the search continues. Here instead, we first fit the AGLGP model. This model form consists of a global term capturing the global trend and separate local models capturing the remaining process in each local region. In the global model, to capture the global picture, the information of the design points are summarized by the inducing points and local fluctuations are smoothed out (see red dash-dotted line in Figure 1). To capture the local residuals, the whole design space is divided into non-overlapping regions and local models are fitted separately to the residuals in each region (see the yellow dotted line in Figure 1 for Region 2). As the inducing points (to build the global model) and design points in each local region (to build the local model) are much fewer than the total number of design points, the computational cost to fit the model is greatly reduced.
The CGLO algorithm then leverages these model features in two iterative steps: a global step and a local step. The global step focuses on identifying promising regions and is guided by the global model, as it is less likely to be influenced by local optimums and able to better focus on regions where the global trends are low. With an initial model fit, we are able to identify Region 2 as a promising region. A local model in this region is then incorporated with the global model to form an overall model (see the purple line in Figure 1). This overall AGLGP model is used to guide a more refined in-depth search within this selected promising region to identify optimums within this region. A switching criterion is then used to determine the effort to expense in this region before reverting back to the global step. The idea behind this switch criterion is to allow sufficient resources in the local step to identify promising local minimums but to avoid over expense on negligible local improvements. This switch thus helps CGLO jump out of the current exploiting regions to reconsider the other candidate regions where better solutions may exist and avoid being trapped in one sub-optimal region. After each local step, the global model is updated with the new observations to improve the directed search in the global step. The proposed procedure then alternates between the global and local searches to more efficiently explore and exploit the entire region.
From this example, we see that there are two main components to CGLO:
A fast AGLGP model that jointly models the global trend and the local fluctuations.
An algorithm that leverages on this model to develop global and local search steps.
In the following two sections, we will develop in detail these two components.
3 AGLGP Model Development
The AGLGP model has been shown to work well for multi-modal responses and those that vary significantly across the design space (Meng and Ng 2015). In this section, we first review the basics of the AGLGP model in Section 3.1. In Section 3.2, we describe a fast estimation approach for the AGLGP model that is more suitable for CGLO. It reduces the complexity of traditional BO from to , where , and are the number of design points, the number of inducing points and the number of design points in each local region, respectively (Meng and Ng 2015).
3.1 AGLGP Model Review
The basic idea of AGLGP model can be seen from Figure 1. It models the response with a global term (the red dash-dotted line) and independent local terms (the yellow dotted line), which captures the local residual process. Built by the inducing points (the blue stars in Figure 1), the global model has long lengthscale to capture the global trend while the separate local models are developed with shorter lengthscales. The following parameters in Table 1 are used to develop the AGLGP model.
|;||The global inducing points ; The latent global ’observations’ at|
|The th local region where for and|
|;||The design points in region ; The response observations at|
|The total number of design points:|
The vector of all design points; The response observations at
The AGLGP model assumes the stochastic simulation response at can be represented as:
Here, the process mean of the response consists of a global trend process and local residual processes in region . These processes are assumed to be piecewise independent. is assumed to be a Fully Independent Training Conditional (FITC) GP approximation (Quiñonero-Candela and Rasmussen 2005) with mean and covariance function , while is assumed to be a stationary GP in local region with mean 0 and covariance function , where
and are the GP model variances of and , respectively; and are the correlation function of and , respectively. A widely applied correlation function is the Gaussian correlation: , where is the -th entry of . To sufficiently consider the nonstationarity of the function, in different local models, the covariance structure can be different. Furthremore, in the AGLGP model, and additional constraint is added: , i.e., the global model has longer lengthscale parameter than the local models. This will ensure a smoother global model for the global trend. In addition, is a normal noise independent from and .
As a latent process, only depicts a general and rough global picture of the true function, and it is assumed to be a deterministic model. Suppose and their global values are known. The global predictor at any given point can then be written as
where , is covariance matrix at with entry . We thus have the global predictor at all design points: . The residuals is then modeled by another stochastic GP model . This local model captures the local biases of the global model in each local region and the inherent stochastic noise in the system. As we focus on stochastic simulations, is the sample mean of the replications taken at each design point. Given local regions , the local predictor at any given point is given by
where and is the covariance matrix within a local region whose element . . is the local residuals in region at points .
From (3) and (4), the overall AGLGP predictor can be written as . This predictor, however, can only be obtained when the latent global evaluations is known, which is impractical since typically only the overall response observations are available. The overall AGLGP predictor and predictive variance for is obtained by integrating out as (we write and as and for short):
Denote the th design point in as . Here, is covariance matrix with element , is covariance matrix with element , , ( is a vector of all 0, and .
In order to develop the AGLGP model (5), two important steps have to be taken. First, the design space has to be divided into local regions for the local models (see the boundaries between the regions in Figure 1). Second, a smaller set of inducing points (the blue stars in Figure 1) has to be determined from the large set of design points to fit the global model. To divide the design space, Meng and Ng (2015) suggested to first divide the existing design points into groups with the
-means clustering technique and then build the separation boundaries between each pair of groups with the support vector machine (SVM) method. The regions enclosed by these boundaries are treated as local regions. Such a scheme helps to better approximate the assumption of independence of the local processes across regions as SVM tries to provide the largest separation between the grouped design points. In practice, to save the computing effort expensed for SVM (which is about
, the same as fitting a full GP model), the boundary between two clusters can also be chosen as the bisecting hyperplane of the line segment whose ends are the two clusters’ centers and this is used in our numerical test. The selection of inducing points is essentially to further divide the design points in the same local region to smaller clusters, such that the points in the same cluster are ‘similar’ and can be represented by a single inducing point. The simplest way to do this is to adopt the traditional clustering techniques, like the
-means. These techniques, however, are mostly unsupervised learning methods in that they don’t consider the response values (or labels). For the AGLGP model, the response values are available at the design points and considering these values when deciding the inducing points can better summarize the points in the same cluster. To achieve this, the original AGLGP work(Meng and Ng 2015) proposed a new method to first group the points in each local region by their responses, i.e., the points in each group have close values. Then, the points in each group are further clustered by their values. The details of the methods to partition the space and select the inducing points proposed in the original AGLGP work are provided in Appendix A. We highlight here that although these suggested methods are used here, the CGLO algorithm is not restricted to these methods. Various other techniques for clustering, like the model-based clustering scheme (Fraley and Raftery 2002, Peng et al. 2018b), partitioning, such as random division or partitioning with Mondrian process (Roy et al. 2008), and inducing points selection, such as those used in the sparse GP model approximations (Quiñonero-Candela and Rasmussen 2005), can also be applied.
Before closing this section, we highlight that although the overall AGLGP predictor (5) provides an overall prediction, it does not provide separate predictions for the global trend and the local residues. As we shall see in the following sections, it is desirable to develop separate predictions for these two, like (3) and (4), to integrate them into CGLO so that separate search processes in the global and local levels can be conducted. This motivates us to further develop a two-stage AGLGP model in Section 3.2.
3.2 Development of a Fast Two-Stage Model Fitting Procedure for the AGLGP Model
Although the AGLGP model is developed upon a global and local model structure, estimating the overall likelihood provides only an overall prediction (see (5)). This is useful for model inferences and predictions. However, to leverage the individual components to facilitate the global and local searches, separate predictions for the two different levels are required. Here, we develop a two-stage AGLGP model fitting procedure which separates the predictions and is computationally faster.
The two-stage approach will first construct the global model after integrating out only the latent variable . The predicted residuals for the local model are then obtained. The parameters of the local model are then obtained from the likelihood of the local model with the predicted residuals. The two stage predictive mean and predictive variance for can be derived as
where . It can also be shown that the two-stage predictor is an unbiased predictor given the parameters. Details of the two-stage development and estimation are described in Appendix B. In real applications, multiple evaluations are conducted at each design points. The observation and noise variance in (7) (10) are then replaced with the corresponding sample mean and variance.
Overall, this two-stage approach is computationally faster than the one-stage approach (Meng and Ng 2015) as it optimizes the global and local parameters independently (see Appendix B). The one-stage estimation approach takes into account the interaction between the global model and the local model, and there are items in the Hessian matrix that needs to be calculated. In our proposed two-stage estimation approach, only a Hessian matrix of the second-order derivative of the log-likelihood function needs to be calculated for the global model, and the local model estimation requires additional calculation of Hessian matrices. Hence, in the iterative maximization of the likelihood function, the two-stage estimation will require less than half the effort to calculate its Hessian matrix in each iteration. Therefore, in practice, the one-stage approach requires a higher-dimensional likelihood optimization problem, which is more difficult to solve and computationally more expensive. In contrast, the two-stage approach proposed here is cheaper by separating the global and local parameters estimations and can guide the search in CGLO with separate global and local models more efficiently.
4 Combined Global and Local Search for Optimization
This section introduces the CGLO algorithm built on the global and local features of the AGLGP model in each iteration. CGLO is an iterative algorithm, where each iteration is comprised of two search steps and an Allocation Step. Among the two search steps, the Global Search Step finds promising regions with the global model while the Local Search Step exploits within these promising regions to find the optimal point with the overall model. The Allocation Step then focuses on driving down noise levels to improve the estimations in both steps. Further, a switching criterion is used to decide when to switch back to the Global Search Step from the Local Search Step and advance iterations to avoid over-exploitation within each iteration. An overview of CGLO is outlined in Algorithm 1 (a flow chart of CGLO is provided in Appendix D).
Algorithm 1 (CGLO)
Step 1: (Initialization) Fit an initial AGLGP model with space-filling design strategy and use cross-validation technique to ensure it is valid.
Repeat Steps 23 until the budget is exhausted.
Step 2.1: (Global Search Step) Select a point among , a set of candidate points for the global search, with the global model searching criterion. Identify the current promising region , where .
Step 2.2: (Local Search Step) Search for the next evaluation point with a local search criterion based on the overall model fitted in , and then evaluate this point with replications. This step is repeated until the switching criterion is satisfied.
Step 3: (Allocation Step) Allocate additional evaluations to the selected design points.
Step 4: (Return Optimal Solution) Report the best point (with the optimal sample mean ).
The key notations for CGLO are listed in Table 2 (the steps where they appear are provided in the brackets). They are classified into two groups. The first group includes the ‘inputs’ determined by the users to run the algorithm, and the second group includes the ‘parameters’ that will be dynamically determined by CGLO.
|the number of local regions|
|the total budget (total number of affordable simulation replications)|
|The candidate points set for the global search (Step 2.1)|
|the parameter controlling the penalty in the gEI function (Step 2.1)|
|the number of replications allocated for a newly selected point (Step 2.2)|
|the budget consumed to improve local estimate in iteration (Step 3)|
|the best point in the Global Search Step (Step 2.1)|
|the number of neighbor design points of (Step 2.1)|
|the most promising region in iteration (Step 2.1)|
|the number of design points selected in iteration (Step 2.2)|
|the total number of design points selected in the first iterations (Step 3)|
|the total number of design points in (Step 3)|
|the total number of replications received by in the first iterations (Step 3)|
|the number of replications added by the OCBA technique in the Allocation Step (Step 3)|
|the sample mean of (Step 3)|
|the point with the lowest sample mean in the current promising region (Step 3)|
|the sample mean of (Step 3)|
|The budget consumed to improve global estimate in iteration (Step 3)|
We first discuss how to decide two key inputs, and . The remaining inputs will be discussed later when we introduce Steps 2 and 3 in detail in Sections 4.1 to 4.3.
and local regions: It is worthwhile to note that in Algorithm 1, the local regions are decided based on the initial design points and are kept fixed over the iterations. This removes further computations that would be required in the redivision of regions after each iteration. Moreover, in the fitting of a good reliable GP model, a reasonable number of design points is required, and thus, continued division of the regions can lead to smaller and smaller numbers of points in some local regions, which results in less reliable local models. This will be more severe if the budget is constrained and the total number of affordable design points is limited, which is common in situations where BO has been applied and proved to be competitive. We, therefore, consider fixed local regions in this work to solve our motivating navigational safety problem where the decision is required within a very short period of time. Nonetheless, when the budget is large, CGLO can be modified to update the local regions after Step 3 in each iteration. We provide a modified approach to accommodate the updating in Appendix C. For the fixed local regions case, we suggest that the number of local regions should be selected such that the average number of design points in each local region is sufficient to fit a GP model. A common recommendation is to fit a GP model with design points in a -dimensional space (Jones et al. 1998, Jalali et al. 2017). In sequential design problems, the number of initial fitting points can be smaller than as more design points will be selected in the following iterations to improve the model fitting. For example, Ankenman et al. (2010) used initial design points in a one-dimensional queueing system. In CGLO, we use initial design points in each local region on average. Therefore, can be decided as follows. If the total number of initial design points () is fixed, set ; otherwise, the recommendations to adopt in Meng and Ng (2015) can be applied. When increases, can be selected to increase with , but it should be cautioned that this will increase the budget expensed for the initial design and this should be balanced with the remaining budget for the search.
: This is a common parameter needed to be set for stochastic BO algorithms. Following the recommendation of Pedrielli et al. (2020), this can be decided when performing the cross-validation test with a small value and then increase it step by step until the cross-validation test is passed.
4.1 Global Search Step
This step is to quickly locate a promising region to concentrate the local search. The ‘promising regions’ are regions that have lower global trends or have not been sufficiently searched (regions with the potential to find better solutions than those searched so far). Some popular acquisition functions are designed to select points in such regions by assigning larger acquisition function values to points with lower prediction mean or larger predictive variance (less explored). However, as the global model is based on the inducing points, the predictive variance (8) essentially reflects the spread of the inducing points instead of the actual design points. In particular, fewer inducing points from a region may not necessarily result from fewer design points. Instead, many design points with similar response values may be clustered together, and these points are summarized by a single inducing point. For such regions, there is little need to further explore. Hence, to balance the global search, we propose a global EI (gEI) criterion that incorporates the traditional EI function computed with the global model and a penalty term on the density of the design points nearby:
where is the lowest global prediction at the inducing points, is the number of neighbor design points of and is a user-defined steepness parameter controlling how fast the penalty term decreases. In (11), is normally distributed with variance and a mean of the following structure: if ; if ; if . Here, and are lower and upper bounds set for the predictive mean to avoid severe underestimation or overestimation caused by a limited number of design points or replications. These bounds do not need to be tight. If the response has natural bounds, and can be set as these values; Otherwise, they can be set as some reasonable constants (such a structure is similarly adopted in Sun et al. (2014)). The first term of the product in (11) is the EI function representing how much we expect a new point to be better than the current best solution of the global model. The second term is a penalty designed diminishing with the number of design points around . The neighbors of are defined as , where is the minimum distance between two inducing points used to define the neighborhood size, and is the number of design points lying in .
This step optimizes over a finite global candidate points set , in contrast to the whole continuous domain , which helps accelerate the optimization for this criterion. The rationale behind this is that in the Global Search Step, our purpose is to identify the best region and this does not require the accuracy of a search for a single best point. Therefore, we apply a finite candidate points set here, chosen with space-filling strategies from (to ensure that all local regions can potentially be selected, should cover every region). From a practical viewpoint, this saves the computing budget for a more detailed search in the Local Search Step where the exact design points are selected. This idea of finite candidate points is also applied in other global optimization algorithms for searching criteria to efficiently manage computing budgets (Regis and Shoemaker 2007).
4.2 Local Search Step
The local search is conducted in region , where , the best point in the global search step, belongs. This step searches more intensely within this promising region for a better solution. In order to achieve this, we apply the criterion (Quan et al. 2013) to sequentially select the next design point,
where is the prediction at the best sampled points in , and is normally distributed with variance given by spatial prediction uncertainty and a mean with the following structure: , if ; if ; if . Here the overall AGLGP model is used as we require a more comprehensive and accurate search within this region for a better minimum point. We note that in , the random noise is ignored in the predictive variance . The rationale behind this is to make and values for already selected design points become zero such that they will not be reselected (as the Allocation Step will by design reduce the stochastic noise). This enables the search to focus on new points with low predicted responses and new points in the less explored areas of the local region. The point in that maximizes the function will be simulated with replications. The Local Search Step is sequentially repeated until the switching criterion is satisfied.
Before introducing the switching criterion, we note that optimizing mEI (or other acquisition functions similarly) over continuous space requires an additional non-trivial optimization, which is a common issue for BO. The approaches to implement this include: (i). Use global optimization algorithms such as DIRECT (Jones et al. 1993)
, and heuristics like GA. (ii). In each step, randomly choose a discretization from the continuous space and choose the best one from this discretization (Jalali, 2017). In general, the first approach will require longer computational time, but has high accuracy. The second approach is faster and more economical from a practical viewpoint. In our experiments, we adopt the second approach and each time we optimize the mEI function in a local region, we use a fine discretization chosen with the Latin Hypercube strategy and maximize mEI on this discretization.
In the local search step, the local model and are updated with the new observations sequentially. As we do not want to overly search a local region with the risk of getting trapped by local optima (i.e. spend excessive budget on negligible local improvements), we want CGLO to switch from local search to global search to locate potentially better regions when a local region is reasonably searched. Here, we adopt a ‘quality-based’ switching criterion, which is commonly used for algorithm involving transitions between global and local phases or restart schemes (Regis and Shoemaker 2007, Xu et al. 2010, Zabinsky et al. 2010).
As the best local region is decided by the value, a straightforward indicator to switch back to the global step to re-think the best region is when the values at points in the current best region become smaller than a threshold (indicating greater potentials in other regions). Recall that is defined to evaluate the global expected improvement of the region, which in other words evaluates the ‘quality’ of solution improvements when searching in that region. Here, we propose a quality-based criterion to terminate the local step when is smaller than some threshold :
Recall that the function (11) has two components: the expected improvement and the density penalty. A small value for either one can result in a small value and the switching criterion may be satisfied. A small expected improvement (first term in (11)) is likely to result from a low target value in (11), i.e., the current best has been largely improved in this local region. This indicates that a reasonably good or near-local-optimal solution has already been found. In addition, if the density penalty (second term in (11)) is small, many points are observed in a small area, indicating the search is trapped in a small region, which is potentially a local optimal region. Hence, we would like to design the switching criterion such that when a reasonable solution or local optimal is found within a local region, opportunities are given to search in other potential regions with a switch back to the Global Search Step.
In practice, we choose the threshold as , which is the maximal value of global candidate points from other regions. We argue that this is a reasonable choice of to meet our requirement for the quality-based criterion. At the beginning of the Local Search Step, has the best value among points in . As we search more points in the local region and better and better solutions are found, decreases and a reasonably good solution can be found as explained above. When becomes smaller than this , this indicates that another local region with better ‘quality’ solutions is available. This prompts CGLO to switch back to the global step. We also note that even if this ‘reasonably good’ solution found is not a local optimal or there exist multiple local optima in the current region and only one is found before switching back, CGLO by design can return back to this region for an even more detailed search in the future. This helps to balance the local search expense when the budget is finite and constrained, and further ensures global convergence, in contrast to some one-shot algorithms with local convergence guarantee, which never revisit or consider other local regions that have not been selected yet (Xu et al. 2010).
The CGLO algorithm can also be designed to incorporate other types of switching criteria such as an ‘effort-based’ criterion where a limit on the number of new design points for each local step is set. Once the effort is exhausted, CGLO switches back to the global step. In practice, this ‘effort-based’ criterion may further prevent over-exploitation and can be used in combination with the ‘quality-based’ criterion, such that CGLO switches back to the global step when either of these two criteria is met.
4.3 Allocation Step
The Allocation Step is dedicated to distributing additional simulations among sampled points to improve the accuracy of the noisy simulation evaluations. Specifically, it distributes for the following two purposes. First, to improve the global optimal estimate and ensure algorithm convergence, a minimum number of replications are required for each design point in every iteration. Second, to improve the local model fit and the optimal estimate, the noise of the observations should be drived down with more budget.
Denote as the total number of replications received by design point in the first iterations. To achieve the first purpose, we state the following assumption on the allocation rule:
Suppose the noise variance among the input space is bounded, i.e., . There exists a sequence such that , as and that , . It follows that for every of the design points in iteration .
This assumption essentially requires sufficient budget to be allocated to each design point (specifically, at least to each design point) in the first iterations. This rule helps to ensure global convergence of the algorithm (as will be shown in Section 5). To satisfy this assumption, at the beginning of the Allocation step, we check for every design point and assign additional replications to the points where . We refer to the budget consumed here for the first purpose in the Allocation Step as . As the first purpose is to improve global estimates and convergence, the focus is taken on the design points in the entire design space, i.e., every design point in should satisfy the condition in Assumption 1.
To address the second purpose, i.e., better fit of the local models, additional replications are distributed to design points in , as it is the most promising region in the current iteration. The maximum number of replications for this purpose is denoted as , which is a user-defined input. OCBA provides an efficient way of allocating budget to identify the sampled point with the best response by balancing the response and noise level observed for each point. Denote the sample mean and variance at as and , respectively. OCBA introduces the following allocation rule to maximize the Approximate Probability of Correct Selection asymptotically (Chen et al. 2000):
where is the best sample mean in , is the number of simulations allocated to , and is the number of design points in . As mentioned before, is user-defined and since we focus here on illustrating the application of AGLGP model in optimization, we assign constant (the same value as ) over iterations for simplicity. More sophisticated adaptive budget allocation strategies can be applied to improve the algorithm (Chen et al. 2006, Peng et al. 2018a, Pedrielli et al. 2020, Peng et al. 2018b).
In summary, in the Allocation Step, a check is first made on all existing design points to ensure at least replications have been allocated to each point. If not, add replications to meet . Then, for design points in the current local region, allocate replications according to the OCBA allocation rule.
5 Convergence of CGLO algorithm
We next state the convergence results for CGLO. The proofs for all lemmas and the theorem are provided in the Appendices E, F and G. The proof is based on the following general assumptions. (a) The underlying objective function is bounded.
(b) The hyperparameters (, ) for the AGLGP model are bounded away from zero.
(c) The noise variances are known.
The EGO algorithm (Jones et al. 1998) employs Assumptions 2(a) and 2(b) to ensure the bounded objective and the efficiency of the GP model. Known noises are assumed in Assumption 2(c) for analytical tractability. The convergence is only analyzed empirically (Kleijnen et al. 2012, Pedrielli et al. 2020) if is unknown, to the best of our knowledge.
Recall that in the setting of CGLO, we propose predetermining or dynamically changing the number of local regions (see Appendix C). In both cases, the number of regions will be finite. With this in mind, we show CGLO algorithm convergence in three steps. We first show the design points selected are everywhere dense asymptotically. This is achieved by showing dense points in each local region (in Lemma 1) and showing each region can be selected as the promising region infinitely often (in Lemma 2). After that, with the guarantee in Assumption 1, we show not only the design points are dense, but the number of replications in each point is infinite. The convergence will then follow (in Theorem 1).
Under Assumption 2, the design points in local region are dense if this region is given infinitely large budget.
Under Assumption 2, local region will be selected infinitely often as the promising region.
Under Assumptions 1 and 2, almost surely as , where is the current best value and is the true global optimal solution.
6 Numerical Results
In this section, we illustrate the proposed CGLO algorithm and compare it with other metamodel optimization algorithms in terms of both efficiency and computational time. For all the examples in this section, we apply . We provide the selected algorithm parameters in Appendix H of the online supplementary material.
6.1 One-dimensional test function (illustration of algorithm)
We applied the proposed algorithm to the illustrating example in Figure 1 with the noise variance to further illustrate how the algorithm works. We initialize 12 design points with a Latin Hypercube design and thus set (see details about setting in Section 4), as seen in Figure 2. As shown, the global optimal solution is -10.1316 at and another sub-optimal solution is -9.5799 at . Hence, Regions 2 and 3 are more promising regions with lower response that should be focused on. Here, we set . Two iterations of the algorithm were executed here for illustration.
Figure 2 shows the estimated global model with the initial design points and we can see that it correctly identifies the overall global trend. As the global observations of inducing points are not observable, we only indicate the location of inducing points in Figure 2. As shown, there are 3 local regions, each with two, one and two inducing points respectively. Based on the global model and inducing points, the Global Search Step selects Region 2 as the promising region (the location of is shown as the red in Figure 2).
In the first round of local search, the point with the highest is evaluated with replications, and the clusters and the inducing points are re-generated in Region 2. With this additional point, however, the value at does not significantly decrease and the switching criterion is not met. Hence, the local search in Region 2 continues. After 4 new design points are added in Region 2, a few local optimal or near-optimal points are found (see Figure 3) and the switching criterion is satisfied. After that, CGLO stops the Local Search Step and executes the Allocation Step where 20 additional replications are allocated to the design points and then returns to the global search.
After the first iteration, a total of 16 observations are obtained as seen in Figure 3. The global model is updated, and as seen, the global model is now able to better capture the overall trend in Region 2, leaving Regions 1 and 3 less explored. As Region 3 has a lower trend between [0.9,1], the global search selects Region 3 as the promising region in the next iteration. CGLO then follows the same local search as iteration 1 and adds 3 more design points in Region 3 (Figure 4).
In the subsequent iterations, the search swings mainly between region 2 and region 3 and stops at region 3 with an estimated relative error (based on the estimated optimum and true optimum) of less than 1% after 9 iterations. In this example, we see that the algorithm is able to identify promising regions efficiently with the global search and it is also able to search more intensely for the optimal solution in that region by the local search.
6.2 Comparative studies with other optimization algorithms
In this section, we compare the CGLO algorithm with other BO algorithms. The test function is from Sun et al. (2014)
The test function is multimodal. The global best is and the second best is . We can observe that the difference between the global best and the second best is quite small. To further increase the difficulty, we add a normal noise to this function: the mean is 0 and the variance is .
To assess CGLO’s effectiveness in addressing the challenges mentioned in Section 1, we compare it with three alternative BO algorithms, TSSO (Quan et al. 2013)
, Minimal Quantile criteria (MQ)(Jalali et al. 2017), and Expected Quantile Improvement (EQI) (Picheny et al. 2013), which are current BO methods for simulations with heteroscedastic noises. Among them, TSSO uses the OCBA allocation scheme and we select the online allocation scheme for EQI. We also note that knowledge Gradient (KG) (Frazier et al. 2009) is another competitive GP-based algorithm. However, KG requires much longer decision time to select the design points (about 50 to 100 times that of TSSO and CGLO) (Jalali et al. 2017). In this work, we are motivated by a real-time decision problem where the decision time is limited and in this comparative study, we use a relatively short fixed wall clock time of 1400 seconds for the comparison. Hence, KG is not included here. In each comparison, all algorithms start off with the same 40 LHS design points (and thus for CGLO) with 20 replications at each design point. We set . Figure 5 shows the averaged optimal value estimated by each algorithm over 30 macroreplications.
From Figure 5, we observe that CGLO converges much faster to the optimal value than TSSO, MQ, and EQI.
We do however note that although CGLO numerically converges faster than TSSO and EQI, both TSSO and EQI both are able to catch up when a large amount of CPU time is expensed. MQ, on the other hand, does not converge and this is likely because it is stuck in one of the many local optima of this problem.
To further compare these algorithms, Table 3 presents their averaged performances (over 30 macroreplications) with different simulation replications budget (5,000 and 10,000 replications, respectively). Specifically, we consider the averaged distance between the observed optimal solution and the true optimal solution and the average distance to the true optimal value for each of the algorithms. As seen, CGLO performs significantly better than TSSO, MQ, and EQI at in terms of optimal value and optimal solutions when the budget is limited to only 5000 replications. With these limited replications, we see from the average of that TSSO, MQ, and EQI can only find solutions in suboptimal regions while CGLO can provide solutions near the global optimal. This shows that the global and local natures of CGLO help it jump out of suboptimal regions to the global optimal region much faster than the traditional BO methods. When the simulation replication budget goes up to 10,000, there is no significant difference amongst CGLO, EQI, and TSSO. From this example, we see that the CGLO algorithm, even with the AGLGP model approximation, is able to achieve the level of accuracy similar to the TSSO and EQI in terms of optimal value and distance to the optimal solution. MQ, however, is still unable to locate the optimum, even with 10000 replications.
|No. of replications||average||std||average||std||average||std||average||std|
stands for better results in t-test at significance level
stands for better results in t-test at significance level
6.3 A Navigational Safety Problem
We next test our CGLO algorithm on a maritime safety problem. We apply it to the agent-based simulator developed for the Safe Sea Traffic Assistant () in (Pedrielli et al. 2016), which looks 10 minutes ahead with an agent-based simulation model (ABM). For an own vessel travelling through a heavy traffic region, this ABM tries to predict any potential conflict with the other vessels based on the current and simulated environment. If a potential conflict of high risk is detected on a pre-specified trajectory of an own vessel from A to B (see Figure 6), an optimizer is then called to find an alternative trajectory that minimizes the probability of conflict within this period.
Although looks at several objectives (including the probability of conflict, deviation from original trajectory) while searching for an optimal alternative route, we focus on the key safety objective of the probability of conflict in this example. As this system focuses on heavy traffic regions, this response can be highly non stationary (as a small turn can result in a very different conflict environment). Figure 7(a) shows this response on one of the trajectory parameters for a vessel with 90 vessels in its proximity. The variance of this response also differs quite a bit in the negative and positive polar angles of the turn.
In this system, an alternative trajectory is then defined by its deviation from the pre-specified trajectory. Specifically, the alternative trajectory is defined by a three-point trajectory of , which is characterized by the deviation angles and the traveling speeds on each leg for and for , see Figure 6. For practical reasons, the range for and is between 1 and 15 [knots]. With positive and negative values on the angles indicating the deviations to the left or right from the original trajectory, are bounded between to and constrained on the same polarity.
Due to the stochastic nature in the movement of surrounding vessels, to evaluate the probability of conflict of a candidate alternative trajectory, multiple replications of the simulator should be run to obtain the estimate. With the high nonstationarity of this response, the AGLGP model can be an appropriate metamodel for the response, and the CGLO algorithm an effective one to locate an optimal alternative trajectory.
As this objective ranges between [0,1], the following logistic transformation is applied, mapping the probability to the real line:
With the transformation, the non stationarity of the response is still high (see Figure 7(b)).
As the is a real-time system, determining an alternative trajectory needs to be fast when a conflict is detected. In this system, a 10-minute look-ahead period is applied, and a maximum of 5 minutes is allowed to obtain an optimal solution. Fortunately, the simulator runs relatively fast, taking 0.006 seconds for one evaluation. In this example, relatively many evaluations can be afforded, but may still be insufficient to comprehensively search the entire four-dimensional space. We test CGLO with Random Search (RS) and TSSO in this study. Here, MQ is excluded due to its inferior performance (as seen in Section 6.2) and EQI is excluded due to its relatively long decision time. As observed both in our numerical runs and also in Jalali et al. (2017), EQI takes about 3 times longer than TSSO to arrive at a decision and 6 times longer than CGLO, resulting in much fewer number of points that can simulated during the short time period given to make the decision.
We start the system with a crowded scenario based on historical data gathered from the Strait of Singapore. The system is run until a potential conflict is detected. At this point, we run CGLO, RS, and TSSO to obtain the best optimal alternative trajectory. The time for the search is set at 5 minutes before termination, at which point, the final best solution is determined. To evaluate the solution obtained from each approach, the ‘optimal’ solution was obtained from an extensive enumerative evaluation, which discretizes the solution space into grids in both positive and negative polarity for and , and the speeds and in grids and gives an estimation of minimum probability of at . The ‘optimal’ solution was then compared with the solutions from the three approaches. Table 4 presents numbers of design points and the distances between the ‘optimal’ solution and the observed best solution by the each approach .
As the simulation runs pretty fast, the random search can search at 2000 design points and identifies the optimal area that is close to the true optimal location. Although none of the approaches managed to locate the optimal solution in the limited time, as seen from Table 4, both RS and CGLO perform quite well. CGLO is able to locate a safer solution (with a 50% lower probability of conflict) than RS.
Even though the TSSO works efficiently for expensive simulation optimization, a large part of the time is spent on model estimation for this problem, and only 7,325 replications are simulated on 256 design points. With limited design points, the TSSO only identifies a local optimal solution at the positive polarity. With the fast approximated AGLGP model and the combined search structure, the CGLO can approximately select 16,264 replications on 611 design points and find the optimal solution better with a more extensive search.
A more extensive comparison is conducted from an additional 30 detected conflicts, and the averages obtained across the 30 conflict scenarios are shown in Table 5.
Here we see more conclusively that CGLO is able to find safer solutions with the global information of the response surface. TSSO, on the other hand, fails to work as efficiently for this problem and ends up with some local optima, similarly to the results in Section 6.2. These results thus show that there is a large potential for CGLO to be incorporated into the system to provide much faster evaluations than TSSO (and hence, providing a more extensive search), and safer solutions (in terms of probability of conflict) than RS.
This paper proposes CGLO, a combined global and local search approach. The scheme first identifies promising regions with a global step and then focuses more deeply in these promising regions with the local step. We propose new EI-based search criteria to more effectively suit the global and Local Search Steps’ purposes, a built-in switching mechanism to switch between the steps to avoid overexploitation, and derive new allocation strategies to improve the metamodel fit and optimal solution estimation of the algorithm. Furthermore, we ensure the global convergence of the algorithm and study its performance on a test function and a practical example in maritime safety. With this proposed algorithm, we address some challenges faced by traditional BO approaches, especially when the baseline responses are highly multimodal. Firstly, the fast global approximation together with the local models of the proposed two-stage AGLGP is efficient and allows flexible model fit in different regions. Secondly, the global and local nature of the algorithm helps CGLO jump out of suboptimal regions towards global optimal faster. These improvements are very important for real-time decision processes, like in the navigational safety problem, where the response is highly multi-modal and response time is crucial.
The current work can be extended in several directions. First, in CGLO, we currently apply a constant budget allocation scheme in the Allocation Step to distribute replications to points in the local regions. The algorithm can be improved with adaptive schemes that dynamically decides the allocation budget for better evaluation in each iteration. Second, in the Local Search Step, some alternative locally convergent approaches, such as pattern search, can be explored. As these methods are designed to perform well and fast in small regions, they can be more efficient than a GP metamodel approach in small local regions. Third, the local search in CGLO can be parallelized and adapted in the different local regions to further improve its efficiency.
Wang is the corresponding author. A preliminary version of this paper was published in the Proceedings of the 2016 Winter Simulation Conference. Ng’s work was partially supported by Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant MOE2015-T2-2-148. Wang also acknowledges the support from Shenzhen Humanities and Social Sciences Key Research Bases, at Southern University of Science and Technology, College of Business (Shenzhen, China).
- Stochastic kriging for simulation metamodeling. Operations research 58 (2), pp. 371–382. Cited by: §1, §4.
- Composite gaussian process models for emulating expensive functions. The Annals of Applied Statistics 6 (4), pp. 1838–1860. Cited by: §1.
- Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research 12 (Oct), pp. 2879–2904. Cited by: §1.
- Efficient dynamic simulation allocation in ordinal optimization. IEEE Transactions on Automatic Control 51 (12), pp. 2005–2009. Cited by: §4.3.
- Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems 10 (3), pp. 251–270. Cited by: §4.3.
- Model-based clustering, discriminant analysis, and density estimation. Journal of the American statistical Association 97 (458), pp. 611–631. Cited by: §3.1.
- A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811. Cited by: §1.
- The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing 21 (4), pp. 599–613. Cited by: §1, §6.2.
- Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13 (Jun), pp. 1809–1837. Cited by: §1.
- Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of global optimization 34 (3), pp. 441–466. Cited by: §1.
- Comparison of kriging-based algorithms for simulation optimization with heterogeneous noise. European Journal of Operational Research 261 (1), pp. 279–301. Cited by: §4, §6.2, §6.3.
- Lipschitzian optimization without the lipschitz constant. Journal of optimization Theory and Applications 79 (1), pp. 157–181. Cited by: §4.2.
- Efficient global optimization of expensive black-box functions. Journal of Global optimization 13 (4), pp. 455–492. Cited by: §1, §4, §5.
- Expected improvement in efficient global optimization through bootstrapped kriging. Journal of global optimization 54 (1), pp. 59–73. Cited by: §5.
- An additive global and local gaussian process model for large data sets. In Winter Simulation Conference (WSC), 2015, pp. 505–516. Cited by: §1, §3.1, §3.2, §3, §4.
- An extended two-stage sequential optimization approach: properties and performance. European Journal of Operational Research. Cited by: §1, §4.3, §4, §5.
- A real time simulation optimization framework for vessel collision avoidance and the case of singapore strait. IEEE Intelligent Transportation Systems Trasactions, to appear. Cited by: §6.3.
- Ranking and selection as stochastic control. IEEE Transactions on Automatic Control 63 (8), pp. 2359–2373. Cited by: §4.3.
- Efficient simulation sampling allocation using multi-fidelity models. IEEE Transactions on Automatic Control. Cited by: §3.1, §4.3.
- Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics 55 (1), pp. 2–13. Cited by: §6.2.
- Multiobjective optimization using gaussian process emulators via stepwise uncertainty reduction. Statistics and Computing 25 (6), pp. 1265–1280. Cited by: §1.
- Simulation optimization via kriging: a sequential search using expected improvement with computing budget constraints. IIE Transactions 45 (7), pp. 763–780. Cited by: §1, §4.2, §6.2.
- A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research 6 (Dec), pp. 1939–1959. Cited by: §1, §1, §3.1, §3.1.
- A computationally stable approach to gaussian process interpolation of deterministic computer simulation data. Technometrics 53 (4), pp. 366–378. Cited by: §1.
A stochastic radial basis function method for the global optimization of expensive functions. INFORMS Journal on Computing 19 (4), pp. 497–509. Cited by: §4.1, §4.2.
- The mondrian process.. In NIPS, pp. 1377–1384. Cited by: §3.1.
- Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1, §1.
- Local and global sparse gaussian process approximations. In Artificial Intelligence and Statistics, pp. 524–531. Cited by: §1.
- Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1.
- Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995. Cited by: §1.
- Balancing Exploitation and Exploration in Discrete Optimization via Simulation Through a Gaussian Process-Based Search. Operations Research 62 (6), pp. 1416–1438. External Links: Cited by: §4.1, §6.2.
- Industrial strength compass: a comprehensive algorithm and software for optimization via simulation. ACM Transactions on Modeling and Computer Simulation (TOMACS) 20 (1), pp. 3. Cited by: §4.2, §4.2.
- Stopping and restarting strategy for stochastic sequential search in global optimization. Journal of Global Optimization 46 (2), pp. 273–286. Cited by: §4.2.