In multi-objective design optimization, the objective function evaluations are generally computationally costly, mainly due to the long convergence times of simulation models. A simple and common remedy to this problem is to use a statistical model learned from previous evaluations as the fitness function, instead of the ’true’ objective function. This method is also known as Bayesian Global Optimization (BGO) Mockus1978 . In BGO, a Gaussian Process (GP) model is used as a statistical model. In each iteration, the algorithm evaluates a new solution and updates the Gaussian Process model. A new solution is chosen by the score of an infill criterion, given a statistical model. For multi-objective problems, the family of these algorithms is called Multi-objective Bayesian global optimization (MOBGO). Compared to evolutionary multi-objective optimization algorithms (EMOAs), MOBGO requires only a small budget of function evaluations for achieving a similar result with respect to hypervolume indicator, and it has already been used in real world applications to solve expensive evaluation problems kyang2015cec . According to the authors’ knowledge, BGO was used for the first time in the context of airfoil optimization in laniewski2010development , and then applied in the field of biogas plant controllers gaida2014dynamic , detection in water quality management zaefferer2013case
, machine learning algorithm configurationkoch2015efficientnoise , and structural design optimization shimoyama2013updating .
In the context of Bayesian global optimization, an infill or pre-selection criterion is used to evaluate how promising a new point is. In single-objective optimization, the Expected Improvement (EI) is widely used as the infill criterion, and it was first introduced by Mockus et al. Mockus1978
in 1978. The EI exploits both the Kriging prediction and the variance in order to give a quantitative measure of a solution’s improvement. Later, the EI became more popular due to the work of Jones et aljones1998efficient . In MOBGO, a commonly used criterion is Expected Hypervolume Improvement (EHVI), which is a straightforward generalization of the EI and was proposed by Emmerich emmerich2005single in 2005. Compared with other criteria, EHVI leads to an excellent convergence to – and coverage of – the true Pareto front couckuyt2014fast ; kyang2016cec . Nevertheless, the calculation of EHVI itself so far has been time consuming shimoyama2013kriging ; zaefferer2013case ; wagner2010expected ; koch2015efficientnoise , even in the 2-D case222In this paper, 2-D and 3-D represent two and three objective functions, respectively.. Moreover, EHVI has to be computed many times by an optimizer to search for a promising solution in every iteration. For these reasons, a fast algorithm for computing EHVI is needed.
The first method suggested for EHVI calculation was Monte Carlo integration and it was first proposed by Emmerich inemmerich2005single ; emmerich2006single . This method is simple and straightforward. However, the accuracy of EHVI highly depends on the number of the iterations. The first exact EHVI calculation algorithm in the 2-D case was derived in emmerich2011hypervolume , with time complexity . Here, is the number of non-dominated points in the archive. The EHVI calculation algorithm in emmerich2011hypervolume partitions an objective space into boxes and then calculates the EHVI by summing all the EHVI values of each box. Couckuyt et al. couckuyt2014fast introduced an exact EHVI calculation algorithm (CDD13) for by representing a non-dominated space with three types of boxes, where represents the number of objective functions. The method in couckuyt2014fast was also practically much faster than those discussed in emmerich2011hypervolume , though a detailed complexity analysis was missing. Hupkens et al. hupkens2015faster reduced the time complexity to and in the 2-D and 3-D cases, respectively. The algorithms in hupkens2015faster improved the algorithms in emmerich2011hypervolume in two ways: 1) only summing the EHVI values of each box in a non-dominated space; 2) reusing some intermediate integrations during the EHVI calculation. The algorithms in hupkens2015faster further improved the practical efficiency of EHVI on test data in comparison to couckuyt2014fast . Recently, Emmerich et al. Michael2016book proposed an asymptotically optimal algorithm in the bi-objective case with time complexity . More recently, Yang et al. proposed an asymptotically optimal algorithm in the 3-D case with time complexity yang2017computing . The algorithm, KMAC333KMAC stands for the authors’ given names of the EHVI exact calculation algorithm. in yang2017computing , partitions a non-dominated space by slices linearly and re-derives the EHVI calculation formulas. However, a generalization of this technique to more than three dimensions, as well as the empirical testing of MOBGO algorithms on benchmark optimization problems, are still missing so far.
This paper mainly contributes to extending the state-of-the-art EHVI calculation methods into the higher dimensional case, as well as adding benchmark comparisons on common multi-objective optimization benchmarks. The paper is structured as follows: Section 2 introduces the nomenclature, Kriging, and the framework of multi-objective Bayesian optimization; Section 3 provides some fundamental definitions used in this paper; Section 4 describes how to partition an integration space into (hyper)boxes efficiently, and how to calculate EHVI based on this partitioning method; Section 5 shows experimental results of speed comparison and MOBGO based algorithms’ performance on 10 well-known scientific benchmarks; Section 6 draws the main conclusions and discusses some potential topics for further research.
2 Multi-objective Bayesian Global Optimization
A multi-objective optimization (MOO) problem is an optimization problem that involves multiple objective functions and it can be formulated as:
where is the number of objective functions,
are the objective functions, and a decision vectoris in an -dimensional space.
The following table summarizes the notations used in this paper.
|Dimension of the search space|
|Dimension of the objective space|
|Mean values of predictive distribution|
|Standard deviations of predictive distribution|
|A Pareto-front approximation set|
|,||Number of the non-dominated points in|
|The vectors in , where|
|Integration slices in a -dimensional space|
|Number of integration slice in -dimensional case|
|Lower bounds of integration slices|
|Upper bounds of integration slices|
|A target solution in a search space|
|A promising solution in a search space|
|The Kriging model for the -th objective function|
|Counter/Number of function evaluations|
|The number of initilized solutions|
|Training dataset for the Kriging models|
|The Lebesgue measure on|
The multivariate independent normal distribution
|The integration slices in an i-dimensional case|
Kriging is a statistical interpolation method. Kriging is a Gaussian process based multivariate regression method. Its evaluation cost is typically low compared to simulator based evaluations in design optimizationli2008metamodel . Kriging is a popular surrogate model to approximate noise-free data in computer experiments. Kriging models are fitted from previously evaluated points.
Given a set of decision vectors in the -dimensional search space, and associated function values , Kriging assumes to be a realization of a random process of the following form chugh2017handling ; jones1998efficient :
is the estimated mean value over all given sampled points, andis a realization of a Gaussian process with zero mean and variance . The regression part approximates globally the function and the Gaussian process
takes local variations into account. Moreover, as opposed to other regression methods, such as support vector machine (SVM), Kriging/GP also provides an uncertainty qualification of a prediction. The correlation between the deviations at two decision vectors (and ) is defined as:
Here is the correlation function, which decreases with the distance between two points. It is common practice to use a Gaussian correlation function (also known as a squared exponential kernel):
where are parameters of the correlation model. They can be interpreted as measuring the importance of the variables. The optimal in the Kriging models are usually optimized by a continuous optimization algorithm. In this paper, the optimal is optimized by the simplex search method of Lagarias et al. () lagarias1998fminsearch , with the parameter of max function evaluations equal to 1000. The covariance matrix can then be expressed by the correlation function:
is assumed to be an unknown constant, the unbiased prediction is called ordinary Kriging (OK). In OK, the Kriging model determines the hyperparametersby maximizing the likelihood on the observed dataset. The expression of the likelihood function is:
The maximum likelihood estimates of the mean and the variance are given by:
Then the predictor of the mean and the variance at the point can be derived. They are shown in jones1998efficient :
2.3 Structure of MOBGO
MOBGO assumes that objective functions are mutually independent in an objective space. In MOBGO, the Kriging method or Gaussian processes approximates Kriging models , for each objective function, from the existing evaluated data . Each objective function at a given point is approximated by a one-dimensional normal distribution, with a mean and a standard deviation . Then, MOBGO can predict the multivariate outputs by means of an independent joint normal distribution with parameters , , and , , at the target point .
These predictive means and standard deviations can be used to calculate infill criteria. An infill criterion quantitatively measures how promising the new point is when compared to a current Pareto-front approximation set. With the assistance of a single-objective optimization algorithm, a promising solution can be found by selecting the highest/lowest score of the infill criterion. Then, the promising solution is evaluated, and both the dataset and the Pareto-front approximation set are updated. The basic structure of the MOBGO algorithm is shown in Algorithm 1. It mainly contains three parts: initialization of a sampling dataset, searching for an optimal solution and updating the Kriging models, and returning the Pareto-front approximation set .
First, a dataset is initialized and a Pareto-front approximation set is computed, as shown in Algorithm 1 from Step 1 to Step 5. The initialization of contains the generation of the decision vectors using Latin Hypercube Sampling method (LHS) lhs1979 (Step 1), calculation of the corresponding objective values (Step 2) and storage of this information in dataset (Step 3). This dataset will be utilized to build the Kriging models in the second part.
The second part of MOBGO is the main loop, as shown in Algorithm 1 from Step 6 to Step 12. This main loop starts with training Kriging models based on data set (Step 7). Note that contains independent models for each objective function, and these models will be used as temporary objective functions instead of ‘true’ objective functions in Step 8. Then, an optimizer finds the promising solution by maximizing or minimizing an infill criterion (Step 8). Here, an infill criterion is calculated by its corresponding calculation formula, whose inputs include Kriging models , the current Pareto-front approximation set , a target decision vector , etc.
Theoretically, any single-objective optimization algorithm can be utilized as an optimizer to search for the promising solution . In this paper, the Bi-Population CMA-ES is chosen for its favorable theoretical properties hansen2009benchmarking . Step 9 and Step 10 will update the dataset by adding into and update the Pareto-front approximation set . The main loop from Step 6 to Step 12 will continue until meets the termination criterion . The last part of MOBGO returns Pareto-front approximation set .
3 Definitions444For the convenience of visualization and consistency, this paper only considers maximization problems. Mimimization problems can always be re-written as maximization problems by multiplying the correponding objective functions by .
Pareto dominance, or briefly dominance, is a fundamental concept in multi-objective optimization, and provides an ordering relation on the set of potential solutions. The dominance is defined as follows:
Definition 1 (Dominance CoelloCoello2011 )
Given two decision vectors and their corresponding objective values , in a maximization problem, it is said that dominates , being represented by , iff and .
From the perspectives of searching and optimization, non-dominated points are of greater interest. The concept of non-dominance is defined as:
Definition 2 (Non-dominance Michael2016book )
Given a decision vector set , and the image of the vector set , the non-dominated subset of is defined as:
A vector is called a non-dominated point of .
Definition 3 (Dominated Subspace of a Set)
Let be a subset of . The dominated subspace of in , notation , is then defined as:
Definition 4 (Non-Dominated Space of a Set)
Let be a subset of and let be such that . The non-dominated space of with respect to , denoted as , is then defined as:
Note that the notion of dominated space as well as the notion of non-dominated space of a set can also be defined for (countably and non-countably) infinite sets .
The Hypervolume Indicator (HV), introduced in zitzler1999multiobjective , is one of the essential unary indicators for evaluating the quality of a Pareto-front approximation set. Its theoretical properties are discussed in zitzler2003performance . Notably, HV does not require knowing the Pareto front in advance. The maximization of HV leads to a high-qualified and diverse Pareto-front approximation set. The Hypervolume Indicator is defined as follows:
Definition 5 (Hypervolume Indicator)
Given a finite approximation to a Pareto front, say , the Hypervolume Indicator of is defined as the -dimensional Lebesgue measure of the subspace dominated by and bounded below by a reference point :
with being the Lebesgue measure on .
The hypervolume indicator measures the size of the dominated subspace bounded below by a reference point . This reference point needs to be provided by users. Theoretically, in order to get the extreme non-dominated points, this reference point should be chosen in such a way that it is dominated by all elements of a Pareto-front approximation set during the optimization process. However, there is no requirement of setting the reference point in practice if the user is not interested in extreme non-dominated points.
Another important infill criterion is Hypervolume Improvement, which is also called Improvement of Hypervolume in emmerich2008computation . The definition of Hypervolume Improvement is:
Definition 6 (Hypervolume Improvement)
Given a finite collection of vectors , the Hypervolume Improvement of a vector is defined as:
In case we want to emphasize the reference point , the notation will be used to denote Hypervolume Improvement.
Figure 1 illustrates the concept of Hypervolume Improvement using two examples. The first example, on the left, is a 2-D example: Suppose a Pareto-front approximation set is , which is composed by , and . When a new point is added, the Hypervolume Improvement is the area of the yellow polygon. The second example (on the right in Figure 1) illustrates the Hypervolume Improvement by means of a 3-D example. Assume a Pareto-front approximation set is ( , , ). The Hypervolume Improvement of relative to is given by the joint volume covered by the yellow slices.
Probability of Improvement (PoI) is an important criterion in MOBGO. It was first introduced by Stuckman in stuckman1988global , and Emmerich et al. emmerich2006single generalized it to multi-objective optimization. PoI is defined as:
Definition 7 (Probability of Improvement)
Given parameters of the multivariate predictive distribution , and the Pareto-front approximation set , the Probability of Improvement is defined as:
where is the multivariate independent normal distribution with the mean values and the standard deviations . Here, () represents as an improvement with respect to , if and only if the following holds: and .
In Equation (16), () is equivalent to saying that is an element of the non-dominated space of . That is, . A reference point is not indicated in Equation (16) because can be chosen as . Therefore, PoI is a reference-free infill criterion.
Definition 8 (Expected Hypervolume Improvement)
Given parameters of the multivariate predictive distribution , and the Pareto-front approximation set , the expected hypervolume improvement is defined as:
An illustration of the 2-D EHVI is shown in Figure 2. The light gray area is the dominated subspace of bounded by the reference point
. The bivariate Gaussian distribution has the parameters
. The probability density function () of the bivariate Gaussian distribution is indicated as a 3-D plot. Here is a sample from this distribution and the area of improvement relative to is indicated by the dark shaded area. Variables and stand for the first and the second objective values, respectively.
For computing integrals of EHVI in Section 4, it is useful to define the functions and .
Definition 9 ( function (see also Michael2016book ; yang2017phdthesis ))
For a given vector of objective function values and , is the subset of the vectors in which are exclusively dominated by the vector and not by elements in and that dominate the reference point , that is:
Definition 10 ( function (see also hupkens2015faster ))
Let denote the PDF () of the standard normal distribution. Moreover, let
denote its cumulative probability distribution function (CDF), anddenote the Gaussian error function. The general normal distribution with mean and standard deviation has PDF and its CDF is . Then the function is defined as:
One can easily show that .
4 Efficient EHVI Calculation
This section mainly discusses an efficient partitioning method for a non-dominated space and how to employ this partitioning method to calculate EHVI and PoI.
4.1 Partitioning into Integration Slices
The efficiency of an infill criterion calculation is determined by a non-dominated search algorithm and the number of integration slices. The main idea of the partitioning method is to separate the integration volume (a non-dominated space) into as few as possible integration slices. Then, the integral of the criterion is calculated within each integration slice. The value of the criterion is the sum of its contribution in every integration slice.
4.1.1 The 2-D case
In the 2-D case, the partitioning method is simple and has already been published by Emmerich et al. Michael2016book . It uses a new way to integrate the EHVI that does not require a partitioning into a grid, as it was neccessary for previously proposed algorithms by Hupkens et al. hupkens2015faster and Couckuyt et al. couckuyt2014fast , but can make use of arbitrary box partitionings. For the sake of completeness and to introduce this integration technique, we will introduce it here briefly.
Suppose and . The integration space (a non-dominated space) can be divided into disjoint integration slices () by drawing lines parallel to -axis at each element in , as indicated in Figure 3. Then, each integration slice can be expressed by its lower bound () and upper bound (). In order to define the slices formally, augment with two sentinels: and . Then, the integration slices for 2-D case are now defined by:
In the 2-D case, it is straightforward to calculate the number of integration slices, namely, .
4.1.2 The 3-D case
Similar to the 2-D partitioning method, in the 3-D case, each integration slice can be defined by its lower bound () and upper bound (). Since the upper bound of each integration slice is always in the axis, we can describe each integration slice as follows:
An illustration of integration slices is shown in Figure 4. A Pareto front set is composed by 4 points ( and ), and this Pareto front is shown in the upper left figure. The upper right figure represents the partitioned integration slices of . The lower center figure illustrates the projection of the upper right figure onto the -plane with rectangle slices and . The rectangular slices, which share a similar color but of different opacity, represent integration slices with the same value of in their lower bound. The lower bound of the 3-D integration slice is , and the upper bound of the slice is .
Algorithm 2 describes how to obtain the slices , , , , with the corresponding lower and upper bounds ( and ). The partitioning algorithm is similar to the sweep line algorithm described in emmerich2011computing . The basic idea of this algorithm is to use an AVL tree to process points in descending order of the coordinate. For each such point, say , the algorithm finds all the points which are dominated by in the -plane and inserts into the tree. Moreover, because of , the algorithm will also discard all the points (, , ) from the AVL tree. See Figure 5 for describing one such iteration. In each iteration, slices are created by coordinates of the points , , , , , and as illustrated in Figure 5.
The number of the integration slices in the 3-D case is where all points are in general position (for each : the -th coordinate is different for each pair of points in ). Otherwise, provides an upper bound for the obtained number of slices.
Proof: In the algorithm each point creates two slices. The first one, say slice , is created when the point is added to the AVL tree. Another slice, say slice , is created when the point is discarded from the AVL tree due to domination by another point, say , in the -plane. These two slices are defined as follows whereas is either if no points are dominated by in the -plane, or , otherwise. Moreover, and denote either the right neighbor among the newly dominated points in the -plane, or if is the rightmost point among all newly dominated points. In this way, each slice can be attributed to exactly one point in , except the slice that is created in the final iteration. In the final iteration, one additional point is added to the AVL tree. This point leads to the creation of a slice when it is added, but because it is never discarded, it adds only a single slice. Therefore, slices are created in total.
4.1.3 Higher dimensional case
In the higher dimensional case, the non-dominated space can be partitioned into axis aligned hyperboxes, similar to the 3-D case. In the -dimensional case (), the hyperboxes can be denoted by with their lower bound () and upper bound ( ). Here, is the number of hyperboxes and has the same definition as and . The hyper-integral box is defined as:
An efficient algorithm for partitioning a higher dimensional, non-dominated space is proposed in this section, which is based on two state-of-the-art algorithms DKLV17 dachert2017efficient by Dächert et al. and LKF17 lacour2017box by Lacour et al. Here, algorithm DKLV17 is an efficient algorithm to locate the local lower bound points555For the definition of the local lower bound points , see dachert2017efficient . () in a dominated space for maximization problems, based on a specific neighborhood structure among local lower bounds. Moreover, LKF17 is an efficient algorithm to calculate the HVI by partitioning the dominated space. In other words, LKF17 is also efficient in partitioning the dominated space and provides the boundary information for each hyperbox in the dominated space.
The basic idea of the proposed algorithm to partition a higher dimensional non-dominated space is transforming the problem of partitioning a non-dominated space into the problem of partitioning the dominated space, by means of introducing an intermediate Pareto-front approximation set . This transformation is done by the following steps. Suppose that we have a current Pareto-front approximation set for a maximization problem and we want to partition the non-dominated space of . Firstly, DKLV17 is applied to locate the local lower bound points () of in the dominated space. Secondly, regard as a new Pareto-front approximation for a minimization problem with a reference point . The dominated space of is actually the non-dominated space of . Then, LKF17 can be applied to partition the dominated space of by locating the lower bound points and the upper bound points . These bound points (,) of in the dominated space for a minimization problem are exact the lower/upper bound points of the partitioned, non-dominated hyperboxes of for a maximization problem. The pseudo code of partitioning non-dominated space in the higher dimensional case is shown in Algorithm 3.
Figure 6 illustrates Algorithm 3. In the 2-D case, suppose a Pareto-front approximation set is , which consists of , and . The reference point is , see Figure 6 (above left). Use DKLV17 to locate the local lower bound points , which consist of , , and , see Figure 6 (above right). Regard all of the local lower bound points as the elements of a new Pareto-front approximation set . Set a new reference point and utilize LKF17 to partition the dominated space of , by considering minimization, see Figure 6 (below left). The partitioned non-dominated space of is then the partitioned dominated space of , see Figure 6 (below right).
4.2 EHVI calculation 666Both C++ and Matlab source code for computing the EHVI are available on http://liacs.leidenuniv.nl/~csmoda/index.php?page=code or on request from the authors.
This section discusses the problem of exact EHVI calculation. Moreover, a new and efficient algorithm is also derived. Section 4.2.1 and 4.2.2 introduce the proposed method in the 2-D and 3-D cases, respectively. Section 4.2.3 illustrates the general calculation formulas in the higher dimensional case, based on the proposed method.
In order to simplify the notation, is used whenever are given by the context. Based on , the expected hypervolume improvement function can be re-defined as:
For the convenience of expressing the EHVI formula in the remainder of this paper, two functions ( and ) are defined as follows:
Definition 11 ( function)
Given the parameters of an integration slice in dimensions, the Hypervolume Improvement of slice in dimension is defined as:
Definition 12 ( function)
Given the parameters of an integration slice in dimensions and multivariate predictive distribution , the function is then defined as:
4.2.1 2-D EHVI calculation
According to the definition of the 2-D integration slice in Equation 4.1.1, the Hypervolume Improvement in the 2-D case is:
gives rise to the compact integral for the original EHVI:
Here , the intersection of with is non-empty if and only if dominates the lower left corner of . Therefore: