A comparison of mixed-variables Bayesian optimization approaches

10/30/2021
by   Jhouben Cuesta-Ramirez, et al.
0

Most real optimization problems are defined over a mixed search space where the variables are both discrete and continuous. In engineering applications, the objective function is typically calculated with a numerically costly black-box simulation.General mixed and costly optimization problems are therefore of a great practical interest, yet their resolution remains in a large part an open scientific question. In this article, costly mixed problems are approached through Gaussian processes where the discrete variables are relaxed into continuous latent variables. The continuous space is more easily harvested by classical Bayesian optimization techniques than a mixed space would. Discrete variables are recovered either subsequently to the continuous optimization, or simultaneously with an additional continuous-discrete compatibility constraint that is handled with augmented Lagrangians. Several possible implementations of such Bayesian mixed optimizers are compared. In particular, the reformulation of the problem with continuous latent variables is put in competition with searches working directly in the mixed space. Among the algorithms involving latent variables and an augmented Lagrangian, a particular attention is devoted to the Lagrange multipliers for which a local and a global estimation techniques are studied. The comparisons are based on the repeated optimization of three analytical functions and a beam design problem.

READ FULL TEXT VIEW PDF

Authors

page 15

page 16

page 18

page 19

03/06/2020

Bayesian optimization of variable-size design space problems

Within the framework of complex system design, it is often necessary to ...
02/02/2022

Mold into a Graph: Efficient Bayesian Optimization over Mixed-Spaces

Real-world optimization problems are generally not just black-box proble...
01/02/2020

Bayesian task embedding for few-shot Bayesian optimization

We describe a method for Bayesian optimization by which one may incorpor...
07/02/2019

Mixed-Variable Bayesian Optimization

The optimization of expensive to evaluate, black-box, mixed-variable fun...
05/31/2016

Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian

An augmented Lagrangian (AL) can convert a constrained optimization prob...
08/18/2020

Efficient Bayesian Structural Equation Modeling in Stan

Structural equation models comprise a large class of popular statistical...
06/08/2021

Bayesian Optimization over Hybrid Spaces

We consider the problem of optimizing hybrid structures (mixture of disc...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

Introduction

A key task in engineering design is to find an optimal configuration from a very large set of alternatives. When the performance of the candidate solutions is measured through a realistic simulation, the numerical cost of the procedure becomes a bottleneck. The optimization of computationally expensive simulators is a topic widely studied in the literature Thi et al. (2019).

In this work, we focus on Bayesian optimization (BO), which is particularly suitable for solving such problems Frazier (2018). Bayesian optimization is a sequential design strategy that requires a data-driven mathematical model or metamodel that provides predictions along with their uncertainty Bartz-Beielstein et al. (2019). The metamodel replaces some of the calls to the expensive simulation and is a key ingredient to the optimization of costly functions. An acquisition criterion Wilson et al. (2018) aggregates the spatial predictions and uncertainties. The metamodel is trained from a reduced set of simulation data and the acquisition criterion is maximized to propose new configurations to be simulated at the next iteration. When the acquisition criterion is the expected improvement (EI), as first introduced in Mockus et al. (1978), the BO algorithm is often called EGO (Efficient Global Optimization, Jones et al. (1998)). EGO is currently a state-of-the-art approach to medium size, continuous and costly optimization problems, both from an empirical Le Riche and Picheny (2021) and a theoretical point of view Vazquez and Bect (2010).


However, in realistic settings, some of decision variables are categorical. In structural design for example, the type of material, the number of components, the choice between alternative technologies lead to discrete variables with no obvious distance between them. The combination of continuous and categorical variables is called a mixed optimization problem.

In non-costly cases, mixed optimization problems can be approached by Mixed-Integer NonLinear Programming Belotti et al. (2013) (when the discrete variables are integers), by sampling based techniques such as evolutionary optimization Cao et al. (2000); Emmerich et al. (2008); Ocenasek and Schwarz (2002) or by alternating mixed programming Audet and Dennis Jr (2001).

When the objective function is costly, mixed optimization problems remain challenging and a topic for research. It is customary to replace some of the calls to the original objective function by calls to a (meta)model of it. Bartz-Beielstein and Zaefferer (2017) provides an overview of metamodels that have or can be used in optimization when the variables are continuous or discrete. Bayesian optimization methods, which rely on metamodels to save computations, have already been extended to mixed problems. It was made possible by the realization that GP kernels (covariance functions) in mixed variables can be created by composing continuous and discrete kernels. The acquisition function is defined over the same space as the objective function. Therefore maximizing the acquisition function is also a mixed variables problem.

To the best of our knowledge, the first EGO-like algorithm for mixed variables has been proposed in Hutter et al. (2011)

. In this article, the mixed kernel is a product of continuous and discrete Gaussian kernels, and random forests constitute an alternative choice of mixed metamodel. More precisely, the discrete kernel is a Gaussian of integer or hamming (also known as Gower) distance for ordinal or nominal variables, respectively. In

Hutter et al. (2011), the expected improvement is first optimized with a multi-start local search for both the continuous and discrete variables (thus a neighborhood for the discrete variables is defined) which is then complemented by a random search. This work was continued with the REMBO method in Wang et al. (2016), where a random linear embedding is introduced to tackle high-dimensional problems. Discrete variables were relaxed into continuous variables thanks to a mapping function. The optimization of the acquisition function was made with a combination of the DIRECT and CMA-ES continuous global optimizers. Both Hutter et al. (2011) and Wang et al. (2016)

have been motivated by applications to the automatic configuration of algorithms. The goal of reaching very high dimensions (millions) probably forced the authors to use isotropic kernels.

A Bayesian mixed optimizer is presented in Pelamatti et al. (2019)

. The GP kernels are products of continuous and discrete kernels. Different discrete kernels are compared, namely the homo- and hetero-scedastic hypersphere decomposition and the compound symmetric kernels. The optimization of the acquisition function is performed with a genetic algorithm in mixed variables. A similar BO with mixed kernel is described in

Zuniga and Sinoquet (2020), but the expected improvement is optimized with the mixed version of the MADS algorithm Audet and Dennis Jr (2001) and the neighborhood of the categorical variables is defined through a probabilistic model.

Random forests can replace the kriging model in BO with mixed inputs as they natively have a measure of prediction uncertainty. Such an implementation, first done in Hutter et al. (2011), is part of the mlrMBO R package Bischl et al. (2018), in conjunction with several acquisition criteria that can be optimized with a “focus-search” algorithm. The focus-search algorithm hierarchically samples the search space of the chosen acquisition criterion.


Recent developments in metamodels involving mixed variables show that it is possible to map the categorical variables into quantitative non-observed latent variables that are then considered as continuous Zhang et al. (2019). Whenever it is possible to write a model of the studied system, quantitative latent variables exist that describe the effects of the categorical variables. Typically, there are more latent variables than categorical variables. The existence of continuous latent variables can sometimes be established from the physics of the considered phenomena, e.g. in material science Zhang et al. (2020)

. In structural mechanics for example, if the categorical variable describes the shape and the material of an element load in flexion, its bending moment of inertia is a candidate latent variable. Latent variables can emulate the properties of the original categorical variables, in particular within the metamodel, and open the way to reasonings with continous quantities: the kernels of the Gaussian processes can be taken as continuous, gradients and neighborhoods are naturally defined during the optimization. On the contrary, categorical variables and their inherent lack of distance definition is the cause of complications in the kernel definition and in the optimization.


This article presents a new Bayesian optimization algorithm for mixed variables called LV-EGO (for Latent Variable EGO). Our contribution with respect to Zhang et al. (2020) is that the continuity of the latent variables is also taken advantage of during the optimization of the acquisition criterion. This implies that categorical variables must be recovered from the continuous latent variables proposed by the optimizer, which creates a new “pre-image” problem.

Section 1 introduces the problem and the principles of Bayesian optimization. In Section 2, several variants of LV-EGO are described. They differ in the handling of the relationship between the categorical and the latent variables: the “vanilla” LV-EGO just recovers categorical variables after the optimization while augmented Lagrangian versions account for the link during the optimization through constraints. Section 3 presents a set of benchmarks comparing our method to other state-of-the-art techniques. One of the benchmarks is a beam design problem and gives the opportunity to discuss the interpretation of the latent variables. Finally, Section 5 offers conclusions and perspectives to this work.

Notations and abbreviations

(by alphabetical order)

  • [label=,itemsep=0mm]

  • DoE : Design of Experiment.

  •  : dual function.

  • MLE : Maximum Likelihood Estimation

  • E[] : mathematical expectation

  • (), (,), (,) : expected improvement (at point and iteration ).

  •  : objective function to minimize.

  •  : inequality () or equality constraint function.

  •  , 

    : vector of latent mapping functions stemming from MLE maximization (at iteration

    ), .

  •  : vector of relaxed latent variables. They take a value in .

  •  : number of continuous variables.

  •  : number of discrete variables.

  •  : total number of latent variables for all discrete variables, in this article .

  • q : number of latent variables per discrete variable, in this article .

  •  : number of levels for all discrete variables or for the discrete variable .

  •  : number of calls to the expensive objective function, superscript for functions redefined at each iteration.

  •  : vector of discrete (ordinal or nominal) variables, .

  •  : set of the discrete part of already evaluated points, .

  •  : vector of mixed (continuous, discrete) variables, .

  •  : current set of evaluated points, . Contrarily to and L, it is a generic set, the nature of the variables is not specified.

  •  : vector of continuous variables, .

  •  : set of the continuous part of already evaluated points, .

  •  : current set of outputs of the evaluated points, .

1 Problem statement and background

We consider the problem of minimizing a function depending on a vector of continuous variables and a vector of discrete variables , where each has levels encoded . We denote the domain of definition for the continuous inputs, typically, after rescaling, the hypercubic domain . Similarly, we denote the domain of definition for the discrete inputs. We also denote and . For simplicity, the definition of is overloaded in the following, and we will not make a distinction between and .
We focus on costly functions, meaning that each evaluation of is time-consuming, and we aim at minimizing with a tiny budget of evaluations. In this context, minimizing directly is hardly possible. An alternative is to use Bayesian optimization (BO). In BO approaches, there are two main ingredients: a Gaussian process (GP) serving as a fast proxy, often called metamodel, built from the current learning set, and a sampling criterion, often called acquisition criterion, used to update the learning set with a new data point computed with . A famous acquisition criterion is the expected improvement (EI). In that case, the BO approach is often called Efficient Global Optimization (EGO) algorithm.

To be more precise, let be a design of experiments (DoE), and be the corresponding function evaluations (). Let be the current minimum. Let us now assume that is a particular realization of the GP defined on . In that case, the EI criterion is defined by

where is the conditional GP knowing the observations:

Notice that is large when exploiting interesting area, that is to say when there is a good chance that is smaller than . This may occur when is close to

, or when exploring unvisited areas, i.e. when the variance of

is large compared to . The idea of EGO is to evaluate at a new point maximizing the EI criterion until a stopping criterion is reached. See Algorithm 1 for a synthetic description of the EGO algorithm when the stopping criterion is a maximum number of evaluations of , noted budget.

1:  Generate the initial DoE of size , , and calculate , .
2:  while  do
3:     Estimate the GP from the learning set formed by and .
4:     Look for the current minimum and maximize on : .
5:     Evaluate at , .
6:     Update the learning set: , .
7:     
8:  end while
9:  ,
10:  return  ()
Algorithm 1 EGO algorithm on a generic space

This EGO algorithm has been intensively studied to minimize nonlinear functions that are expensive to be evaluated in the case , i.e. when all input variables are continuous (see Le Riche and Picheny (2021) for numerical illustrations of its efficiency). The application of this algorithm in the presence of categorical variables is much less documented (see e.g. Pelamatti et al. (2019); Zuniga and Sinoquet (2020)), which can be explained by two main difficulties. The first one is related to the difficult estimation of covariance kernels on mixed spaces. Indeed, multi-dimensional covariance functions are often built by combination of one-dimensional ones. Therefore, covariance functions on can be obtained by combining covariance functions on and , so that, for all and in :

(1.1)

where are covariance functions and is an operation that preserves positive definiteness, such as sum or product. If we focus on the single categorical variable with levels , we can identify the covariance function to a -dimensional positive semidefinite matrix , such that for all ,

(1.2)

This means that coefficients need to be estimated to determine a covariance on in the general case. That number can be large when is large, which very often makes this estimation very difficult in practice. Furthermore, the optimization problem is often harder than the box-constrained one met with continuous variables. Indeed it is either constrained by the positive definiteness of , which is non-linear, or defined on a manifold if is parameterized in spherical coordinates. We refer to Roustant et al. (2020) for more details and other parsimonious representations of , which can reduce but not totally fix these issues. The second reason that can explain the few number of direct applications of EGO algorithm on mixed space is related to the difficult maximization of the expected improvement, i.e. the search of the new input points where to call the function , which are solutions of:

(1.3)

Indeed, classical optimization algorithms on continuous spaces usually try to exploit information related to the gradient of the function to be maximized, as well as notions of proximity in the space of the inputs. However, these two notions are difficult to exploit when dealing with categorical inputs, i.e. without any a priori ordering between the input instances. To circumvent this difficulty, a naive approach of resolution would consist in no longer considering a single maximization problem on , but the resolution in parallel of maximization problems on , i.e. one problem per combination of instances of the categorical inputs

. Such an approach is not tractable when the number of optimization problems to be solved becomes large, which has motivated the definition of heuristics, such as evolutionary algorithms

Li et al. (2013); Cao et al. (2000); Lin et al. (2018), which seek to concentrate the searches only on the interesting instances of . However, these approaches still rely on a large number of calls to the function to be optimized, and their convergence is not always easy to quantify.

Because mixed optimization problems are difficult, an alternative approach is proposed in the rest of this paper. It is based on the possibility to relax the discrete variables into continuous latent variables, therefore benefiting from the more efficient search mechanisms that exist in continuous spaces (e.g. gradients).

2 EGO with latent variables

2.1 Latent variables

For an easier handling of categorical inputs, it was proposed in Zhang et al. (2019) to replace each categorical input by a vector of continuous inputs with values in , noted . To give an intuition of the underlying idea in the automotive domain, a category of lubricant may be determined by physical continuous features such as boiling temperature, viscosity, etc that act as latent variables. In structural mechanics, the shape of a load carrying structure, which is categorical, has underlying continuous flexural and membrane moments that drive its behavior. This amounts to associating to the Gaussian process (GP) a new GP , such that for each instance of the categorical inputs there exists a particular value of , which is called latent variable, allowing us to write:

(2.1)

An important point is that the values of are unobserved and therefore is unknown. Nevertheless, in order to replace the EI maximization problem on by a new optimization problem on , a precise knowledge of is not necessary. Indeed, assuming that kernels for mixed inputs are built by combining -dimensional ones as in (1.1), it is sufficient to identify the mappings from to to each variable such that

(2.2)

where is a continuous kernel on . Thus, it is not so much the values of that are important, but their relative positions in in order to allow a reasonable reconstruction of the dependency structure between and .

According to the works achieved in Zhang et al. (2019), it appears that interesting mappings can be obtained by likelihood maximization and that relatively small values of can give a satisfying reconstruction. Following their recommendations, can be chosen equal to if and to otherwise, which will be the values chosen in the rest of this paper. We denote by the total number of latent variables. Following Roustant et al. (2020), the continuous kernel associated to the latent variables was chosen as the dot product kernel . The corresponding covariance matrix is then low-rank, and provided better performances than the Gaussian kernel in the examples considered in the latter reference.

This new parametrization leads us to the following adaptation of the EI maximization problem defined by Eq. (1.3), which we name acquisition problem as it allows to acquire a new point to evaluate:

(2.3)

Here, is the expected improvement associated with GP at iteration , is the vector-valued mapping from to at iteration , and the constraint on the values of is driven by the fact that the values of the latent variables at the new point have to remain compatible with the current mapping functions.

We follow two paths to solve this acquisition problem. In the vanilla LV-EGO approach, which will be described soon, the EI maximization and the latent-discrete compatibility constraint are addressed one after each other. Alternatively, with the augmented Lagrangian approaches, which will be described in Section 2.3, the full constrained optimization problem is treated.

2.2 The vanilla LV-EGO algorithm

At each iteration, the vanilla LV-EGO algorithm first maximizes EI in a relaxed, fully continuous, formulation where the discrete variables are replaced by relaxed continuous latent variables. Then, a pre-image problem is solved where EI is maximized over the discrete variables only, the continuous variables being fixed at their value of the relaxed problem. The LV-EGO methodology is summarized in Algorithm 2.

1:  Generate the initial DoE of size : ,
2:  Costly function evaluations ,
3:  while  do
4:     Estimate the latent variable mappings and the parameters of the continuous GP .
5:     Perform one EGO iteration in the relaxed continuous space :.
6:     Recover the discrete pre-image component as: .
7:     Update the DoE with with output value .
8:     
9:  end while
10:  Return
Algorithm 2 Vanilla LV-EGO with mixed inputs

The main difference with the generic Bayesian algorithm 1 is the new discrete pre-image problem in line 6. Notice that the pre-image is formulated in terms of the EI objective, as opposed to a more arbitrary distance like .

In terms of implementation, the EI maximization (line 5) is done with the COBYLA algorithm, a gradient free non-linear optimization technique Powell (1994). Since COBYLA is a local optimizer and the EI is a multimodal function, the maximization is repeated (10 times) from randomly chosen initial points and the best result is kept. An exhaustive search is carried out for the EI maximization of the pre-image problem (line 6).

A comparison of the numerical complexities of the vanilla LV-EGO (Algorithm 2) and the generic EGO (Algorithm 1) shows that the cost of the latent variables is limited. Let us consider that the discrete space can be searched essentially by enumeration in operations (where is the number of levels per discrete variable) while a continuous space can be searched more efficiently in linear time. At each iteration, the Bayesian algorithms of this paper have three steps: first a GP is learned, then an acquisition criterion (EI for now and an augmented Lagrangian later) is maximized and finally a pre-image problem is solved. In the vanilla LV-EGO algorithm, these steps take place at lines 4, 5 and 6 of Algorithm 2, respectively. Table 2.1 summarizes the number of operations per step. The number of operations for learning the GPs is proportional to the cube of the number of points evaluated () because of the inversions of the covariance matrices, times the number of (continuous) parameters of the GP for the likelihood maximization.

The two other steps, the acquisition and the pre-image, imply predictions by the GP in operations times a number of operations that depends on the specific algorithm. Comparing in Table 2.1 the column of the generic EGO with that of the vanilla LV-EGO, and assuming that for all to keep the discussion simple, it can be seen that the latent variables induce a slight extra cost to be learnt. When , which is our default here, this extra cost is operations. would not add any cost to the learning. An advantage, which comes from the sequential resolution of the mixed problem, occurs in the maximization of the acquisition criterion when , at the cost of an additional pre-image problem to solve. Thus, LV-EGO will be faster than a mixed EGO once the latent variables are estimated if , which happens frequently (take for example ).

Mixed space search Vanilla LV-EGO ALV-EGO-g ALV-EGO-l
(Alg. 1) (Alg. 2) (Alg. 3+4) (Alg. 3+5)
GP learning
max acquisition
pre-image 0
Table 2.1: Numerical complexities of the algorithms compared at each iteration (for a given ).

2.3 LV-EGO algorithms with Augmented Lagrangian

A possible pitfall of the vanilla LV-EGO detailed in Algorithm 2.2 is that the link between the discrete variables and their relaxed continuous counterparts is lost when maximizing in line 5. Recovering it during the discrete pre-image problem where is fixed to a value optimal in the relaxed formulation but possibly non-optimal with respect to the mixed problem (1.3) may yield a sub-optimal solution. For this reason, we now propose LV-EGO algorithms that account for the discreteness constraint during the optimization using augmented Lagrangians.

In that prospect, notice that problem (2.3) can be approximated as an optimization problem with an inequality constraint:

(2.4)

where is a small positive relaxation constant and the Euclidean norm. In this reformulation, called relaxed acquisition problem, notice the scaling of the EI which does not change the solution but improves the conditioning of the problem. Two values of will be discussed in the sequel, in which case the constraint becomes an equality constraint, , and but small which corresponds to a relaxation of the equality. In the sequel, is normalized with respect to the size of the vector of latent variables and set to .

The constrained optimization problem (2.4) is solved through an augmented Lagrangian approach Minoux (1986); Nocedal and Wright (2006). The augmented Lagrangian is that of Rockafellar Rockafellar (1993) which, specified for Problem (2.4), is,

(2.5)

When , the constraint becomes an equality constraint, . In this case, the augmented Lagrangian connected to that of Rockaffelar is that of Hestenes Hestenes (1969) and takes the form

(2.6)

Complementary explanations about the augmented Lagrangians are given in Appendix A.

Augmented Lagrangians require to specify the values of the Lagrange multiplier, , and of the penalty parameter, . The general principle to fix them is to calculate the generalized Lagrange multiplier with a dual formulation Minoux (1986): the dual function is maximized with respect to the multiplier while the penalty parameter should take the smallest value that allows to find feasible solutions,

(2.7)

There are two logics to solve Problem (2.7), both of which have been investigated in this study. Following an idea presented in Le Riche and Guyon (2002) for classical Lagrangians, we first propose to approximate the dual function as the lower front of the augmented Lagrangians of a finite set of calculated points. The approximated dual is

(2.8)

where is a DoE that should not be mistaken for , the DoE of the original expensive problem. comes from solving Problem (2.7) with minimizations over the finite set instead of the initial . The functions in Problem (2.4) are not costly, can be quite large. This approach is called global dual as a global approximation to the dual function is built and maximized. It applies to very general functions, e.g., non differentiable functions. Another advantage of this approach is to allow large changes in the dual space. Figure A.1 provides an illustration of the approximated dual function and the effect of on the dual problem. The sketch is done for an inequality constraint, yet it also stands with marginal changes for an equality (cf. Appendix A and the caption to the Figure). Under the non-restrictive hypothesis that there is a beyond which the solution to the primal problem (2.4) maximizes the dual function, maximizing the dual function preserves the global aspect of the search. However, the accuracy of the obtained ’s will depend on the DoE. Because there is only one constraint in the current problem and evaluating it does not require calling the costly function, the maximization on and is done by enumeration on a grid and is a 100 LHS sample.


The other path to updating the multiplier is to progressively change them based on the minimizers of the augmented Lagrangian at the current step. This updating can be seen as a step in the dual space which makes it general, although it is usually proved by analogy with the Karush Kuhn and Tucker optimality conditions Nocedal and Wright (2006) which add unnecessary conditions (like differentiability), cf. Appendix A. Let be a solution to

(2.9)

The update formula reads

(2.10)

As in Picheny et al. (2016), the penalty parameter is simply increased if the constraint is not satisfied,

(2.11)

The update scheme based on equations (2.10) and (2.11) is called local dual as a local step in the dual space is taken.

1:  generate the initial DoE of size for
2:  costly function evaluations ,
3:  initialize budget,
4:  while  do
5:     estimate the latent variables and the GP parameters from current DoE.
6:     {approximately solve the relaxed acquisition problem (2.4) with } s.t. , ALV-EGO-g variant: with the global dual scheme, cf. Algorithm 4 ALV-EGO-l variant: with the local dual scheme, cf. Algorithm 5
7:     recover the discrete pre-image component as:
8:     update DoE: add and its costly evaluation to the DoE .
9:     
10:  end while
11:  return  
Algorithm 3 Augmented Lagrangian Latent Variables EGO with global or local dual scheme (ALV-EGO-g or ALV-EGO-l)

Algorithm 3 gathers all these changes and is called ALV-EGO. The essential difference between this ALV-EGO algorithm and the vanilla counterpart (Algorithm 2) is that the EI maximization step is constrained so that the link between the discrete variables and the relaxed latent variables (hence the continuous ) is not lost and left to the pre-image step. The coupling between the continuous and the discrete variables is better accounted for. However, a pre-image step (line 7) is still necessary to fully recover a discrete solution in cases when the constraint is relaxed (). In ALV-EGO like in the vanilla LV-EGO, there are continuous latent variable per discrete variable.

The global and local dual schemes are further detailed in Algorithms 4 and 5. The continuous minimizations of the Augmented Lagrangians once the Lagrange multipliers are set are always done with 10 random restarts of the COBYLA algorithm Powell (1994). They occur in Algorithm 4, line 4 and Algorithm 5 line 5. To allow comparisons, this implementation is identical to the EI maximization of the vanilla LV-EGO (step 5 of Algorithm 2).

0:  An estimation of the solution to the relaxed acquisition problem (2.4)
0:  , an objective function, , a constraint , ,
1:  Calculate a DoE .Half of the points are feasible by i) sampling a and ii) setting
2:  Create a grid of Lagrange multipliers and penalty parameters, , with and for all
3:  Approximately solve the dual problem by enumeration: smallest that yields a feasible solution, where  
4:  Fine tune the next candidate:
5:  return  
Algorithm 4 Global dual scheme (makes ALV-EGO-g when used in Algorithm 3)
0:  An estimation of the solution to the relaxed acquisition problem (2.4)
0:  , an objective function, , a constraint initial values of the Lagrange multiplier and penalty, and ,
1:  if  then
2:     {when the initial are used} Update according to Eq. (2.10)
3:     Update according to Eq. (2.11) if , otherwise
4:  end if
5:  
6:  return  
Algorithm 5 Local dual scheme (makes in ALV-EGO-l when used in Algorithm 3)

While the local update of and might seem less robust, it is the most common implementation and it might be sufficient for the constrained EI maximization. Indeed, between two iterations, the EI changes only locally around the current iterate. Providing the latent mapping functions do not change too much, a local update of and seems appropriate. The numerical complexity of the ALV-EGO-g and -l algorithms is essentially the same as that of the vanilla LV-EGO, cf. Table 2.1. The global dual scheme has a slight extra-cost because of the search for the Lagrange multiplier and penalty parameter that require extra GP predictions.

Eventually, four variants of ALV-EGO are considered, ALV-EGO-ge or -gi or -le or -li where g stands for global, l for local, e for equality () and i for inequality ().

3 Description of the numerical experiments

3.1 Algorithms tested

The various algorithms tested are summarized in the Table 3.1 which provides their names, the type of formulation for the mixed variables, the type of metamodel, the acquisition criterion and the technique to optimize the acquisition criterion. The two possible formulations for the mixed variables are either by searching in a mixed space (MS) or by a formulation in latent variables (LV). All Gaussian processes (GPs) are built with the kerpg package Deville et al. (2017–2021). The meaning of the acronyms is: LV-EGO, Latent Variables EGO; LV-RFO, Latent Variables Random Forest Optimization ; ALV-EGO-ge/-gi/-le/-li, Augmented Lagrangian Latent Variables global/local dual scheme with equality/inequality pre-image constraints; MS-RFO, Mixed Space search with Random Forest Optimization; MS-ES, Mixed Space search with Evolution Strategy; MS-MKES, Mixed Space search with Mixed Kriging metamodel and Evolution Strategy.

name formulation metamodel acq. crit. optimizer of the acq. crit.
LV-EGO LV GP EI restarted COBYLA
LV-RFO LV randomForest toolbox EI focus-search (from mlrMBO)
ALV-EGO-ge or -gi LV GP EI DoE (for and ) and restarted COBYLA
ALV-EGO-le or -li LV GP EI restarted COBYLA
MS-RFO MS randomForest toolbox EI focus-search (from mlrMBO)
MS-ES MS none evolution strategy (from Li et al. (2013) in CEGO implementation Zaefferer (2014–2021))
MS-MKES MS GP (sym. compound disc. kernel) EI evolution strategy (from Li et al. (2013) in CEGO implementation Zaefferer (2014–2021))
Table 3.1: Summary of the 9 algorithms tested: name, space over which it is defined (mixed versus continuous with latent variables), metamodel used, acquisition criterion, optimizer of the acquisition criterion.

The different algorithms will be tested on the suite of test problems described hereafter.

3.2 Test cases

There are 3 analytical test cases and a beam bending problem. The analytical test cases have all been designed by discretizing some of the variables of classical multimodal continuous test functions. The following notation is introduced to describe the discretization: if the continuous variable is discretized with that takes values in , then means when , a scalar, .

(a) Discretized Branin-Hoo function
(b) Discretized Goldstein-Price function
Figure 3.1: Two of the test functions with 1 discrete variable.

Test case 1: discretized Branin function.

We modified the dimensional Branin-Hoo function whose expression is

where , by keeping continuous in and making discrete with levels . The discretized Branin, which was already used in Zhang et al. (2020), has several local minima as shown in Figure 0(a). The global optimum is located at with .

Test case 2: discretized Goldstein function.

As a second test case, the continuous Goldstein function

is partly discretized by replacing by with levels . The discretized Goldstein, which has also been studied in Zhang et al. (2020), is drawn in Figure 0(b). It has several local optima. The global optimum is located at with .

Test case 3: discretized Hartman function.

Two variables are discretized in the 6 dimensional Hartman function,

where , , and

and are discretized with and levels respectively such that and . Again, there are multiple local minima and the global optimum is located at with .

Euler-Bernoulli beam bending problem.

This test case corresponds to an horizontal beam that is clamped at one end and subject to a vertical force at the other end. If the length of the beam is sufficiently long compared to the dimensions of its cross section, and if it is operating within its linear elastic range, the final beam deflection (to be minimized) is expressed as

(3.1)

where is the horizontal length of the beam, is the cross-section area and is the normalized moment of inertia that can explicitly be derived for a given catalog of beam profiles. The 12 levels of the normalized moment of inertia are

(3.2)

We are interested in finding the best compromise between a minimization of the vertical deflection and the total weight, as expressed in the objective

(3.3)
(3.4)
(3.5)

The solution is with output .

3.3 Experiments setup and metrics

The optimization of each pair of algorithm and test case are repeated 50 times from different initial DoEs. The DoEs are generated by minimax Latin Hypercube Sampling. The size of the DoEs is and a budget of + 50 evaluations of the true objective function. Remember that the true objective function is supposed to be computationally intensive although it is not in these experiments so that runs can be repeated. The evolution strategies are stopped after evaluations of the true function, like the other algorithms.

The internal local optimizer, COBYLA, is restarted 5 times during the likelihood maximization and 10 times during the maximization of the acquisition criterion. The focus-search algorithm has a sample size of with boundary reduction iterations and multi-starts, for a total of 3000 calls to the acquisition criterion.

A summary of the dimensions involved in the different examples is given in Table 3.2.

Branin-Hoo 1 1 4 16
Goldstein 2 1 5 40
Hartmann 4 2 {5,4} 160
Beam Bending 2 1 12 96
Table 3.2: Dimensions and DoE size of the test cases.

4 Results and discussion

The results are provided with 4 main metrics. The performance of an algorithm is classically described by the median objective function over the 50 repeated runs, calculated at each iteration. The associated measure of dispersion of the performance is the interquartile over the repetitions as a function of the iteration. To discriminate between methods that are rapid but provide rough solutions from the ones that take more time but yield better solutions, the two other metrics are based on the definition of targets. For each test case, a target is a given quantile of all the objectives functions found by all the algorithms throughout all the repetitions. A 10% target is difficult, while a 50% target is the median performance. The third metric is the iteration number at which the median objective function of a given algorithm reaches a given target. The fourth metric is the success rate (given a target), which is the percentage of the runs that do better than the target. The metrics associated to the quantile targets have the advantage that they are normalized with respect to the test cases: thanks to the quantiles, the definitions of an easy, a median or a hard target stands accross the different functions to optimize. The target-based metrics will later be averaged over the different test cases.

Let us now review the performances of the algorithms on each test case.

4.1 Analytical test functions

Branin function.

Figure 4.1 presents the results for the Branin function with the four metrics. On the top left plot, showing the median value for the objective function, it is clear that the two methods that rely on the random forest metamodel (MS-RFO and LV-RFO) are overtaken by all other methods. This indicates that, whether in the mixed or in the latent-augmented space, random forests do not represent sufficiently well the Branin function in comparison to Gaussian processes. Looking at Figure 0(b), it is observed that the fast methods typically have the lowest spread in performance and vice versa. This is expected as non converging runs may yield a wide range of performances. All methods involving the discrete constraint (i.e., the augmented Lagrangians) managed to improve over the LV-EGO performance; and including a mixed metamodel increased significantly the success rate and the median solution for the evolutionary strategy.

Regarding the success rate on Figure 0(d), the methods MS-MKES, LV-EGO, ALV-EGO-li, -le, -ge and -gi were the most prominent, the latter being capable to reach success rates of about for a target. Notice that all these methods contain Gaussian processes. Indeed, the Branin function is easy to represent by a GP whether continuous or mixed. In the same vein, MS-MKES which differs from MS-ES by the use of a GP, clearly benefits from that metamodel.
All ALV- methods, which account for the discrete constraint, obtained the best median performances. ALV-EGO-ge in particular found all targets, in the median sense, earlier than the other algorithms as can be seen from Figure 0(c).

A last comment is necessary regarding the bottom of Figure 4.1: the plot on the left describes the median performance (in terms of targets reached) while the right plot counts the success rate at reaching a target over all runs. Therefore, some targets are reached on the right by some of the runs of a given algorithm, while they are never atteined on the left by the median of the same algorithm. This comment stands accross all test cases.

(a) Median solution
(b) Interquartile range
(c) Iteration to median success
(d) Success rate
Figure 4.1: Comparison of all 9 algorithms on the Branin function. .

Goldstein function.

The experiments done with the Goldstein test function are summed up in Figure 4.2. Like with the Branin function, algorithms relying on random forests (LV-RFO and MS-RFO) showed both poor performance (top left plot). The associated high constant interquartile (top right) is that of the best points in the initial designs, which remains unchanged since no better point is found by these algorithms.

Considering the success rates for all targets (bottom plots), it is seen that accounting for the discretness through a constraint (which is the distinctive feature of ALV- methods) is useful with the Goldstein function: like with Branin, ALV-EGO-gi is the best performer, but the other ALV- follow and outperform LV-EGO. All ALV- strategies almost reach the absolute target of percentile with a rate of or higher. The comparison of the plots 1(c) and 1(d) also shows that, behind the ALV- methods, LV-EGO has a good median performance (cf. Figure 1(c)) but more of the MS-MKES searches manage to find difficult targets (the 25% and 10% quantiles).

(a) Median solution
(b) Interquartile range
(c) Iteration to median success
(d) Success rate
Figure 4.2: Comparison of the 9 algorithms on the Golstein function. .

Hartmann function.

Results on the Hartmann function which has 4 continuous and 2 discrete variables, with a total of 9 discrete levels, will be impacted by the sensitivity of the algorithms to an increase in dimension. These results are reported in Figure 4.3.

LV-EGO stands out as the best method with respect to all criteria for Hartmann. The next two best methods are LV-RFO and ALV-EGO-gi, followed by MS-RFO and ALV-EGO-ge. This time, LV-RFO and MS-RFO, which both rely on random forests, belong to the efficient methods: random forests gain in relative performance with respect to the GPs when the dimension and the size of the initial DoE increase. For Hartmann, LV-EGO consistently outperforms the ALV- implementations. The importance of keeping the coupling between discrete and latent variables during the optimization seems less crucial, and even somewhat detrimental, in the Hartmann case. We think that this is due to the very tight budget (50 iterations after the initial DoE) which does not allow the convergence of the optimizers, as can be seen in the Plot 2(a) where the global optimum is not reached. Because the optimum is not really found, constraints on discretness are superfluous and their handling through the pre-image problem is sufficient. As in the other test cases, MS-ES was slower than the other methods.

(a) Median solution
(b) Interquartile range
(c) Iteration to median success
(d) Success rate
Figure 4.3: Comparison of the 9 algorithms on the Hartmann function (for which ).

4.2 Beam bending application

Optimization results.

Figure 4.4 summarizes the 4 comparison metrics of all 9 algorithms in the bended beam test case. The ranking of the algorithms is similar to that obtained with the Branin and Goldstein functions. LV-EGO has the best convergence both in terms of median speed (cf. plots of the left column) and accuracy (bottom right plot). ALV-EGO-gi is the second most efficient method followed by ALV-EGO-ge. Again, the algorithms that resort to random forests, LV-RFO and MS-RFO, are the slowest and most inaccurate. They share this counter-performance with MS-ES.

(a) Median solution
(b) Interquartile range
(c) Iteration to median success
(d) Success rate
Figure 4.4: Comparison of all 9 algorithms on the beam design test case ().

Latent variables in the beam application.

The beam subject to a bending load is a test case that allows to interprete the latent variables. Indeed, the normalized moment of inertia, , is a candidate latent variable once it is allowed to take continuous values as it determines, with the continuous cross-section and the length , the output (the penalized beam deflection) in Equation (3.5). The levels of (given in Equation (3.2)) correspond to 3 increasingly hollow profiles of 4 shapes, as illustrated in Figure 4.5. Because a relaxed is a possible latent variable, it is expected that the latent variables learned from the data will be grouped in the same way as . Looking at values and at Figure 4.5, we thus expect, in the image space defined by latent variables, three groups of levels: those corresponding to solid forms (levels ), medium-hollow forms (levels ) and hollow forms (levels ).

[5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]    [5pt]

Figure 4.5: Shapes of the considered beam profiles. The scale differs from one picture to another, as the areas are supposed to be the same for each cross-section. From Roustant et al. (2020).

For the sake of interpretation, we select run that found the global optimum with the Vanilla LV-EGO algorithm. In Figure 4.6, we represent in a color scale the estimated correlation matrix corresponding to the categorical kernel of Equation (2.2), at iterations . At the beginning of the optimization, at iteration 1, we can see a block-structure which corresponds quite well to the three groups of forms described above. This structure becomes less clear for the next iterations of the LV-EGO algorithm. This may be explained by the fact that the algorithm creates an unbalanced design, with more points in the promising areas according to the optimizers, so that all levels are no longer properly represented.

(a) Correlation of the latent variables at iteration
(b) Correlation of the latent variables at iteration