Efficient sequential experimental design for surrogate modeling of nested codes

12/05/2017 · Sophie Marque-Pucheu et al., CEA

Thanks to the increase in computing power, the certification and the design of complex systems rely more and more on simulation. To this end, predictive codes are needed, which generally have to be evaluated at a large number of input points. When the computational cost of these codes is high, surrogate models are introduced to emulate their response. In this paper, we consider the situation where the system response can be modeled by two nested computer codes, meaning that some inputs of the second code are outputs of the first code. More precisely, the idea is to propose sequential designs that improve the accuracy of the nested code's predictor by exploiting the nested structure of the codes. In particular, a selection criterion is proposed to allow the modeler to choose which code to call, depending on the expected learning rate and the computational cost of each code. The sequential designs are based on the minimization of the prediction variance, so adaptations of the Gaussian process formalism are proposed for this particular configuration in order to quickly evaluate the mean and the variance of the predictor. The proposed methods are then applied to examples.

1 Introduction

Many industrial problems involve multi-physics phenomena, which can be associated with a series of computer codes. However, when these code networks are used for design, uncertainty quantification, or risk analysis purposes, they are generally considered as a single code. In that case, all the inputs characterizing the system of interest are gathered in a single input vector, and little attention is paid to the potential intermediate results. When trying to emulate such code networks, this is clearly sub-optimal: much information is lost in the statistical learning, and too many evaluations of each code are likely to be required to reach a satisfying prediction precision.

In this paper, we focus on the case of two nested computer codes, which means that the output of the first code is an input of the second code. We assume that these two computer codes are deterministic but expensive to evaluate. To predict the value of this nested code at an unobserved point, a Bayesian formalism (Robert2007) is adopted in the following. Each computer code is a priori modeled by a Gaussian process, and the idea is to identify the posterior distribution of the combination of these two processes given a limited number of evaluations of the two codes. The Gaussian process hypothesis is widely used in computer experiments (Sacks1989; Santner2003; Rasmussen2006; Kennedy2000; Kennedy2001; Berger2001; Paulo2005; Kleijnen2017), as it allows a very good trade-off between error control, complexity, and efficiency. The two main issues of this approach, also called Kriging, concern the choice of the statistical properties of the Gaussian processes and the choice of the points where the codes are evaluated. When a single computer code is considered, several methods exist to sequentially add one new point or a batch of new points to an already existing Design of Experiments (Sacks1989; Santner2003; Bect2012; Echard2011; Chevalier2014), in order to minimize the global prediction uncertainty. These methods are generally based on a post-processing of the variance of the code output prediction, whose expression can be explicitly derived under mildly restrictive conditions on the mean and the covariance of the prior Gaussian distribution.

The adaptation of these selection criteria to the case of two nested codes is not straightforward. Indeed, the combination of two Gaussian processes is not Gaussian, so the prediction uncertainty is much more complicated to estimate. Moreover, if the two codes can be launched separately, the selection criterion also has to indicate which of the two codes to run. The first objective of this paper is therefore to propose several adaptations of the Gaussian process formalism to the nested case, in order to be able to quickly evaluate the first two statistical moments of the code output predictor. Then, original sequential selection criteria are introduced, which exploit as much as possible the nested structure of the studied codes. In particular, these criteria are able to take into account the fact that the computational cost associated with the evaluation of each code can be different.

The outline of this paper is the following. Section 2 presents the theoretical framework of Gaussian process-based surrogate models and its generalization to the nested case, and introduces several selection criteria based on the prediction variance to sequentially reduce the prediction uncertainty. Section 3 introduces a series of simplifications that allow a quick evaluation of the prediction variance. In Section 4, the presented methods are applied to two examples.

The proofs of the results that will be presented in the following sections have been moved to the appendix.

2 Surrogate modeling for two nested computer codes

2.1 Notations

In this paper, the following notations are adopted:

  • Lowercase letters, e.g. $x$, correspond to scalars.

  • Bold lowercase letters, e.g. $\boldsymbol{x}$, correspond to vectors.

  • Bold uppercase letters, e.g. $\boldsymbol{X}$, correspond to matrices.

  • The entries of a vector $\boldsymbol{x}$ are denoted by $x_i$, whereas the entries of a matrix $\boldsymbol{X}$ are denoted by $X_{ij}$.

  • $\boldsymbol{X}^T$ denotes the transpose of a matrix $\boldsymbol{X}$.

  • $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{C})$ corresponds to the multidimensional Gaussian distribution, whose mean vector and covariance matrix are respectively given by $\boldsymbol{\mu}$ and $\boldsymbol{C}$.

  • $\mathrm{GP}(\mu, C)$ corresponds to the distribution of a Gaussian process whose mean function is $\mu$, and whose covariance function is $C$.

  • $\mathbb{E}[\cdot]$ and $\mathbb{V}[\cdot]$ are the mathematical expectation and the variance respectively.

  • For all real-valued functions $f$ and $g$ that are square integrable on a set $\mathbb{X}$, $\langle f, g \rangle$ and $\|f\|$ denote respectively the classical scalar product and norm in the space of square integrable real-valued functions on $\mathbb{X}$:

    $$\langle f, g \rangle = \int_{\mathbb{X}} f(\boldsymbol{x}) g(\boldsymbol{x})\, d\boldsymbol{x}, \qquad \|f\| = \sqrt{\langle f, f \rangle}. \qquad (2.1)$$

2.2 General framework

Let $\mathcal{S}$ be a system that is characterized by a vector of input parameters, $\boldsymbol{x}$. Let $y_{\mathrm{nest}}$ be a deterministic mapping that is used to analyze the studied system. In this paper, we focus on the case where this function can be modeled by two nested codes. Two quantities of interest, $y_1$ and $y_2$, are thus introduced to characterize these two codes, which are supposed to be two real-valued continuous functions on their respective definition domains, $\mathbb{X}_1$ and $\mathbb{R} \times \mathbb{X}_2$. Given these two functions, the nested code is defined as follows:

$$y_{\mathrm{nest}}(\boldsymbol{x}) = y_2\big(y_1(\boldsymbol{x}_1), \boldsymbol{x}_2\big), \qquad (2.2)$$

where $\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2)$. The sets $\mathbb{X}_1$ and $\mathbb{X}_2$ are moreover supposed to be two compact subsets of $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$ respectively, where $d_1$ and $d_2$ are two positive integers. In theory, the definition domains may be unbounded, but the reduction to compact sets enables the square integrability of $y_{\mathrm{nest}}$ on $\mathbb{X}_1 \times \mathbb{X}_2$.
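To fix ideas, the composition of Eq. (2.2) can be illustrated by a minimal sketch. The two functions below are arbitrary placeholders chosen for illustration, not the codes studied in this paper:

```python
import numpy as np

def y1(x1):
    """First code: maps its own input x1 to a scalar intermediate quantity."""
    return np.sin(2.0 * np.pi * x1)

def y2(t, x2):
    """Second code: takes the intermediate quantity t and its own input x2."""
    return t**2 + np.cos(np.pi * x2)

def y_nest(x1, x2):
    """Nested code of Eq. (2.2): y_nest(x) = y2(y1(x1), x2)."""
    return y2(y1(x1), x2)

print(y_nest(0.3, 0.7))  # one evaluation of the nested code
```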

Given a limited number of evaluations of the functions $y_1$ and $y_2$, the objective is to build a stochastic predictor of $y_{\mathrm{nest}}$ with the following properties:

  • its mean is as close as possible to the real output of the nested code, that is, the bias is small,

  • its uncertainty (given by its variance) is as small as possible.

In other words, the mean square error of the stochastic predictor has to be small.

2.3 Gaussian process-based surrogate models

Gaussian process regression (GPR), or Kriging, is a technique that is widely used to replace an expensive computer code by a surrogate model, that is to say a fast-to-evaluate mathematical function. GPR is based on the assumption that the two code outputs, $y_1$ and $y_2$, can be seen as sample paths of two stochastic processes, $Y_1$ and $Y_2$, which are supposed to be Gaussian for the sake of tractability:

$$Y_i \sim \mathrm{GP}(\mu_i, C_i), \quad i \in \{1, 2\}, \qquad (2.3)$$

where for all $i \in \{1, 2\}$, $\mu_i$ and $C_i$ denote respectively the mean and the covariance functions of $Y_i$.
Let $\boldsymbol{x}_1^{(1)}, \dots, \boldsymbol{x}_1^{(N_1)}$ be $N_1$ elements of $\mathbb{X}_1$ and $(t^{(1)}, \boldsymbol{x}_2^{(1)}), \dots, (t^{(N_2)}, \boldsymbol{x}_2^{(N_2)})$ be $N_2$ elements of $\mathbb{R} \times \mathbb{X}_2$. Denoting by

$$\boldsymbol{y}_1^{\mathrm{obs}} = \big(y_1(\boldsymbol{x}_1^{(1)}), \dots, y_1(\boldsymbol{x}_1^{(N_1)})\big), \qquad \boldsymbol{y}_2^{\mathrm{obs}} = \big(y_2(t^{(1)}, \boldsymbol{x}_2^{(1)}), \dots, y_2(t^{(N_2)}, \boldsymbol{x}_2^{(N_2)})\big), \qquad (2.4)$$

the vectors that gather the evaluations of $y_1$ and $y_2$ at these points, it can be shown that:

$$Y_i \mid \boldsymbol{y}_i^{\mathrm{obs}} \sim \mathrm{GP}(\mu_i^c, C_i^c), \quad i \in \{1, 2\}, \qquad (2.5)$$

and we refer to Sacks1989; Santner2003 for further details about the expressions of the conditioned mean functions, $\mu_i^c$, and conditioned covariance functions, $C_i^c$.

According to Eq. (2.2), the nested code, $y_{\mathrm{nest}}$, can thus be seen as a particular realization of the combination of the conditioned processes, such that for all $\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2)$ in $\mathbb{X}_1 \times \mathbb{X}_2$,

$$Y_{\mathrm{nest}}(\boldsymbol{x}) = Y_2^c\big(Y_1^c(\boldsymbol{x}_1), \boldsymbol{x}_2\big), \qquad (2.6)$$

where $Y_i^c$ denotes the process $Y_i$ conditioned by $\boldsymbol{y}_i^{\mathrm{obs}}$. Under this Gaussian formalism, the best prediction of $y_{\mathrm{nest}}$ at any unobserved point of $\mathbb{X}_1 \times \mathbb{X}_2$ is given by the mean value of $Y_{\mathrm{nest}}$, whereas its variance can be used to characterize the trust we can put in that prediction. As explained in the Introduction, there is no reason for $Y_{\mathrm{nest}}$ to be Gaussian, but according to Proposition 1, its first- and second-order moments can be obtained by computing two one-dimensional integrals with respect to a Gaussian measure. This can be done by quadrature rules or by Monte Carlo methods (Baker1977).

Proposition 1.

For all $\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2)$ in $\mathbb{X}_1 \times \mathbb{X}_2$, if $Y_{\mathrm{nest}}(\boldsymbol{x}) = Y_2^c(Y_1^c(\boldsymbol{x}_1), \boldsymbol{x}_2)$, then:

$$\mathbb{E}\big[Y_{\mathrm{nest}}(\boldsymbol{x})\big] = \int_{\mathbb{R}} \mu_2^c(t, \boldsymbol{x}_2)\, \phi\big(t;\, \mu_1^c(\boldsymbol{x}_1), \sigma_1^2(\boldsymbol{x}_1)\big)\, dt, \qquad (2.7)$$

$$\mathbb{E}\big[Y_{\mathrm{nest}}(\boldsymbol{x})^2\big] = \int_{\mathbb{R}} \big\{ \sigma_2^2(t, \boldsymbol{x}_2) + \mu_2^c(t, \boldsymbol{x}_2)^2 \big\}\, \phi\big(t;\, \mu_1^c(\boldsymbol{x}_1), \sigma_1^2(\boldsymbol{x}_1)\big)\, dt, \qquad (2.8)$$

where for all $i$ in $\{1, 2\}$, $\sigma_i^2(\cdot) = C_i^c(\cdot, \cdot)$, and $\phi(\cdot;\, m, \sigma^2)$ denotes the Gaussian probability density function with mean $m$ and variance $\sigma^2$.
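As an illustration of Proposition 1, the two one-dimensional integrals can be approximated by Gauss-Hermite quadrature. In the sketch below, `mu2c` and `var2c` stand for the conditioned mean and variance of the second code's Gaussian process, and `m1` and `v1` for the conditioned mean and variance of the first code's predictor at $\boldsymbol{x}_1$; all four are assumed to be available (the names are illustrative, not from the paper):

```python
import numpy as np

def nested_moments(m1, v1, x2, mu2c, var2c, n_quad=30):
    """Mean and variance of the nested predictor via Eqs. (2.7)-(2.8)."""
    xi, w = np.polynomial.hermite.hermgauss(n_quad)  # nodes/weights for exp(-t^2)
    t = m1 + np.sqrt(2.0 * v1) * xi                  # change of variables to N(m1, v1)
    w = w / np.sqrt(np.pi)                           # normalized Gaussian weights
    mom1 = np.sum(w * mu2c(t, x2))                   # first moment, Eq. (2.7)
    mom2 = np.sum(w * (var2c(t, x2) + mu2c(t, x2)**2))  # second moment, Eq. (2.8)
    return mom1, mom2 - mom1**2                      # mean and variance
```

The functions `mu2c` and `var2c` are assumed to accept a vector of values of the intermediate variable.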

2.4 Parametric representations of the mean and covariance functions

As explained in the Introduction, the relevance of the Gaussian process predictor strongly depends on the definitions of the mean and covariance functions. When the maximal information about a code is a finite set of its evaluations, these functions are generally chosen within general parametric families. In this paper, the covariance functions are supposed to be two elements of the Matérn-5/2 class (see Santner2003; Stein1999 for further details about classical parametric expressions), with $\boldsymbol{\theta}$ being the hyper-parameters that characterize these covariance functions, whereas linear representations are considered for the mean functions,

$$\mu_i(\cdot) = \boldsymbol{\beta}_i^T \boldsymbol{h}_i(\cdot), \quad i \in \{1, 2\}, \qquad (2.9)$$

where $\boldsymbol{h}_i$ is a given vector of basis functions (see PerrinJCP2017 for further details on the choice of the basis functions). In the following, the framework of "Universal Kriging" is adopted, which consists in:

  • assuming an (improper) uniform distribution for $\boldsymbol{\beta}_i$,

  • conditioning all the results by the maximum likelihood estimate of $\boldsymbol{\theta}$,

  • integrating over the conditioned distribution of $\boldsymbol{\beta}_i$.

In that case, the distribution of each conditioned process, which is defined by Eq. (2.5), is Gaussian, and its statistical moments can be explicitly derived (see Sacks1989; Bichon2008; Bect2012; PerrinJCP2017).
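For reference, these moments take the standard universal kriging form (a sketch following Santner2003, where $\boldsymbol{R}$ denotes the covariance matrix of the observations of the considered code, $\boldsymbol{r}(\boldsymbol{x})$ the vector of covariances between $\boldsymbol{x}$ and the observation points, $\boldsymbol{H}$ the matrix gathering the basis functions evaluated at the observation points, and $\boldsymbol{y}$ the vector of observations):

$$\hat{\boldsymbol{\beta}} = (\boldsymbol{H}^T \boldsymbol{R}^{-1} \boldsymbol{H})^{-1} \boldsymbol{H}^T \boldsymbol{R}^{-1} \boldsymbol{y},$$
$$\mu^c(\boldsymbol{x}) = \boldsymbol{h}(\boldsymbol{x})^T \hat{\boldsymbol{\beta}} + \boldsymbol{r}(\boldsymbol{x})^T \boldsymbol{R}^{-1} (\boldsymbol{y} - \boldsymbol{H} \hat{\boldsymbol{\beta}}),$$
$$C^c(\boldsymbol{x}, \boldsymbol{x}') = C(\boldsymbol{x}, \boldsymbol{x}') - \boldsymbol{r}(\boldsymbol{x})^T \boldsymbol{R}^{-1} \boldsymbol{r}(\boldsymbol{x}') + \boldsymbol{u}(\boldsymbol{x})^T (\boldsymbol{H}^T \boldsymbol{R}^{-1} \boldsymbol{H})^{-1} \boldsymbol{u}(\boldsymbol{x}'),$$

with $\boldsymbol{u}(\boldsymbol{x}) = \boldsymbol{h}(\boldsymbol{x}) - \boldsymbol{H}^T \boldsymbol{R}^{-1} \boldsymbol{r}(\boldsymbol{x})$.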

2.5 Sequential designs for the improvement of Gaussian process predictors

The relevance of the predictor strongly depends on the space-filling properties of the sets gathering the inputs of the available observations of $y_1$ and $y_2$, which are generally called Designs of Experiments (DoE). Space-filling Latin Hypercube Samplings (LHS) or quasi-Monte Carlo samplings are generally chosen to define such a priori DoE (Fang2003; Fang2006; PerrinSFDS2017). The relevance of the predictor can then be improved by adding new points to an already existing DoE: the larger the numbers of observations $N_1$ and $N_2$, the more chance there is for the prediction uncertainty to be small.

In the case of a single code, most of the existing selection criteria for adding a new point are based on the minimization of a quantity associated with the predictor variance, such as its integral over the input domain (Sacks1989; Santner2003; Echard2011; Bect2012; Chevalier2014; Perrin2016; Hu2017; Gramacy2012Tech). Indeed, if $Z$ is a Gaussian process indexed by $\boldsymbol{x}$ in $\mathbb{X}$, and if we denote by $C_Z$ its covariance function, the variance of the conditioned random variable $Z(\boldsymbol{x}) \mid \boldsymbol{Z}(\boldsymbol{X}) = \boldsymbol{z}$, where $\boldsymbol{X}$ gathers $N$ elements of $\mathbb{X}$ and $\boldsymbol{z}$ is any element of $\mathbb{R}^N$, is given by:

$$\sigma^2(\boldsymbol{x} \mid \boldsymbol{X}) = C_Z(\boldsymbol{x}, \boldsymbol{x}) - C_Z(\boldsymbol{x}, \boldsymbol{X})\, C_Z(\boldsymbol{X}, \boldsymbol{X})^{-1}\, C_Z(\boldsymbol{X}, \boldsymbol{x}), \qquad (2.10)$$

such that it does not depend on the (unknown) value of $\boldsymbol{z}$. To minimize the global uncertainty over $\mathbb{X}$ at a reduced computational cost, a natural approach consists in searching for the value of $\boldsymbol{x}^{\mathrm{new}}$ such that

$$\int_{\mathbb{X}} \sigma^2\big(\boldsymbol{x} \mid \boldsymbol{X} \cup \{\boldsymbol{x}^{\mathrm{new}}\}\big)\, d\boldsymbol{x} \qquad (2.11)$$

is minimal (under the condition that this integral exists).
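A minimal sketch of Eqs. (2.10) and (2.11), with an arbitrary squared-exponential covariance chosen for illustration and the integral replaced by an empirical mean over randomly drawn points:

```python
import numpy as np

def sq_exp(A, B, ell=0.5):
    """Squared-exponential covariance between the rows of A and B (illustrative choice)."""
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def conditional_variance(cov, x, X_obs):
    """Variance of Z(x) given observations at the rows of X_obs, Eq. (2.10)."""
    K = cov(X_obs, X_obs)
    k = cov(X_obs, x)
    return (cov(x, x) - k.T @ np.linalg.solve(K, k)).item()

def integrated_variance(cov, X_obs, X_mc):
    """Empirical-mean approximation of the integral in Eq. (2.11)."""
    return np.mean([conditional_variance(cov, x[None, :], X_obs) for x in X_mc])
```

The candidate $\boldsymbol{x}^{\mathrm{new}}$ minimizing `integrated_variance(cov, np.vstack([X_obs, x_new]), X_mc)` would then be retained.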

In the nested case, we also have to choose on which code a new observation point should be added. To this end, let $c_1$ and $c_2$ be the numerical costs (in CPU time for instance) associated with one evaluation of $y_1$ and $y_2$ respectively. For the sake of simplicity, we assume that these numerical costs are independent of the value of the input parameters, and that they are known a priori. Two selection criteria are eventually proposed to sequentially optimize the relevance of the Gaussian process predictor. To simplify the reading, the following notation is proposed:

$$\sigma^2_{\mathrm{nest}}(\boldsymbol{x}) = \mathbb{V}\big[Y_{\mathrm{nest}}(\boldsymbol{x})\big], \qquad (2.12)$$

and we denote by $\tilde{\sigma}^2_{\mathrm{nest}}(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})$ the variance of $Y_{\mathrm{nest}}(\boldsymbol{x})$ under the hypothesis that the code(s) corresponding to the new point $\tilde{\boldsymbol{x}}$ is (are) evaluated at this point (in practice, we recall that these code evaluations are not required for the estimation of this variance).

  • First, the chained I-optimal criterion selects the best point in $\mathbb{X}_1 \times \mathbb{X}_2$ to minimize the integrated variance of the predictor of the nested code:

    $$\boldsymbol{x}^{\mathrm{new}} = \arg\min_{\tilde{\boldsymbol{x}} \in \mathbb{X}_1 \times \mathbb{X}_2}\; \int_{\mathbb{X}_1 \times \mathbb{X}_2} \tilde{\sigma}^2_{\mathrm{nest}}(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})\, d\boldsymbol{x}. \qquad (2.13)$$

    Such a criterion is a priori adapted to the case when codes 1 and 2 cannot be run independently.

  • Secondly, the best I-optimal criterion selects the best candidate among those in $\mathbb{X}_1$ and in $\mathbb{R} \times \mathbb{X}_2$ in order to maximize the decrease, per unit of computational cost, of the integrated predictor variance of the nested code:

    $$\big(i^{\mathrm{new}}, \boldsymbol{x}^{\mathrm{new}}\big) = \arg\max_{i \in \{1,2\},\; \tilde{\boldsymbol{x}}}\; \frac{1}{c_i} \int_{\mathbb{X}_1 \times \mathbb{X}_2} \Big( \sigma^2_{\mathrm{nest}}(\boldsymbol{x}) - \tilde{\sigma}^2_{\mathrm{nest}}(\boldsymbol{x} \mid \tilde{\boldsymbol{x}}) \Big)\, d\boldsymbol{x}. \qquad (2.14)$$

    In that case, the difference in computational costs is taken into account, and a linear expected improvement per unit of computational cost is assumed for the sake of simplicity. A sketch of the corresponding selection loop is given after this list.
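The selection loop of the best I-optimal criterion can be sketched as follows; `ivar_if_added(code, x)` is a placeholder returning the integrated variance of the nested predictor if a new observation of the given code were added at `x` (no actual code run is needed to evaluate it), and `c1`, `c2` are the computational costs:

```python
def best_i_optimal(candidates_1, candidates_2, ivar_now, ivar_if_added, c1, c2):
    """Return (code, point) maximizing the variance decrease per unit cost, Eq. (2.14)."""
    best = None
    for code, cands, cost in ((1, candidates_1, c1), (2, candidates_2, c2)):
        for x in cands:
            gain = (ivar_now - ivar_if_added(code, x)) / cost  # decrease per unit cost
            if best is None or gain > best[0]:
                best = (gain, code, x)
    return best[1], best[2]
```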

3 Fast evaluation of the prediction variance

As explained in Section 2.5, to choose the position of the new point, we need to compute, for each potential value of $\tilde{\boldsymbol{x}}$, the value of $\tilde{\sigma}^2_{\mathrm{nest}}(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})$ for all $\boldsymbol{x}$ in $\mathbb{X}_1 \times \mathbb{X}_2$. If quadrature rules or Monte Carlo approaches are used to evaluate this variance, as proposed in Section 2.3, the optimization procedure quickly becomes extremely demanding, even if discretized approximations of the optimization problems defined by Eqs. (2.13) and (2.14) are considered, that is to say when the integral over $\mathbb{X}_1 \times \mathbb{X}_2$ is replaced by an empirical mean over a finite set of randomly chosen points. To circumvent this problem, we present in this section several approaches that make the expression of this variance explicit, and therefore extremely fast to evaluate.

3.1 Explicit derivation of the first two statistical moments of the nested code predictor

Proposition 2.

Using the notations of the Universal Kriging framework that is introduced in Section 2.4, if:

  1. the mean function $\mu_2^c$ is of the form:

    $$\mu_2^c(t, \boldsymbol{x}_2) = \sum_{j} f_j(\boldsymbol{x}_2)\, t^{p_j} \exp\big(a_j t^2 + b_j t\big), \qquad (3.1)$$

    where each $f_j$ is a deterministic function from $\mathbb{X}_2$ to $\mathbb{R}$ and each $a_j$ is such that the integrals against the considered Gaussian measures exist,

  2. the covariance function $C_2^c$ is an element of the Gaussian class or corresponds to the covariance function of any derivative of a zero-mean process with covariance function of the Gaussian class,

then the conditional moments of order 1 and 2 of $Y_{\mathrm{nest}}$, which are defined by Eqs. (2.7) and (2.8), can be calculated analytically.

In other words, if the prior of the Gaussian process modeling the second code can be seen as any derivative of a Gaussian process with a trend that is a linear combination of products of polynomials by exponentials of polynomials of degree at most two, and a covariance function of the Gaussian class, then, subject to some integrability conditions, the moments of order 1 and 2 of the coupling of the predictors of the two codes can be computed explicitly at a reduced cost. However, the approach cannot be generalized to the coupling of more than two codes.

3.2 Linearized approach

In the cases where the conditions of Proposition 2 are not fulfilled (or if more than two codes were considered), another approach is proposed in this section, which is based on a linearization of the process modeling the nested code. Indeed, for $i \in \{1, 2\}$, let $\varepsilon_i$ be the Gaussian process such that:

$$Y_i^c = \mu_i^c + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{GP}(0, C_i^c). \qquad (3.2)$$

By construction, $\varepsilon_1$ is the residual prediction uncertainty once $Y_1$ has been conditioned by the evaluations of $y_1$. We recall that these two Gaussian processes are statistically independent. Under the condition that $N_1$ is not too small compared to the complexity of $y_1$, it is therefore reasonable to assume that $\varepsilon_1$ is small compared to $\mu_1^c$.

Proposition 3.

If:

  1. the predictor of the two nested computer codes can be written $Y_{\mathrm{nest}}(\boldsymbol{x}) = Y_2^c(Y_1^c(\boldsymbol{x}_1), \boldsymbol{x}_2)$, where $Y_1^c$ and $Y_2^c$ are Gaussian processes that can be written as $Y_i^c = \mu_i^c + \varepsilon_i$, where $\varepsilon_i \sim \mathrm{GP}(0, C_i^c)$,

  2. and $\varepsilon_1$ is small enough for the linearization to be valid,

then the predictor of the two nested computer codes can be approximated by a Gaussian process with the following mean and covariance functions:

$$\mu_{\mathrm{nest}}(\boldsymbol{x}) = \mu_2^c\big(\mu_1^c(\boldsymbol{x}_1), \boldsymbol{x}_2\big),$$
$$C_{\mathrm{nest}}(\boldsymbol{x}, \boldsymbol{x}') = C_2^c\big((\mu_1^c(\boldsymbol{x}_1), \boldsymbol{x}_2), (\mu_1^c(\boldsymbol{x}'_1), \boldsymbol{x}'_2)\big) + \frac{\partial \mu_2^c}{\partial t}\big(\mu_1^c(\boldsymbol{x}_1), \boldsymbol{x}_2\big)\, \frac{\partial \mu_2^c}{\partial t}\big(\mu_1^c(\boldsymbol{x}'_1), \boldsymbol{x}'_2\big)\, C_1^c(\boldsymbol{x}_1, \boldsymbol{x}'_1). \qquad (3.3)$$

Hence, thanks to the proposed linearization, the mean of $Y_{\mathrm{nest}}(\boldsymbol{x})$ but also its variance can be explicitly derived for all $\boldsymbol{x}$ in $\mathbb{X}_1 \times \mathbb{X}_2$. Under the condition that the linearization is valid, this approach can be applied to configurations with more than two nested codes.

However, it can be inferred from Eq. (3.3) that the variance depends on $\boldsymbol{x}$ through the current estimate $\mu_1^c(\boldsymbol{x}_1)$ of the output of the first code. To circumvent this problem for the evaluation of the forward variance in the sequential designs, we assume that a candidate point is associated with this current estimate of the output of the first code, in accordance with the Kriging Believer strategy proposed in Ginsbourger2010.
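Under these assumptions, a minimal sketch of the resulting computation at a point $\boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2)$ is as follows; `mu1c`, `var1c`, `mu2c`, `var2c` are the conditioned means and variances of the two processes, and `dmu2c_dt` the derivative of `mu2c` with respect to its first argument (illustrative names, assumed available):

```python
def linearized_moments(x1, x2, mu1c, var1c, mu2c, var2c, dmu2c_dt):
    """Mean and variance of the linearized nested predictor, cf. Eq. (3.3)."""
    t = mu1c(x1)                               # Kriging Believer plug-in value
    mean = mu2c(t, x2)                         # linearized mean
    grad = dmu2c_dt(t, x2)                     # sensitivity of code 2 to its first input
    var = var2c(t, x2) + grad**2 * var1c(x1)   # first-order propagation of the code 1 variance
    return mean, var
```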

4 Applications

The previously proposed methods are applied to two examples: a one-dimensional analytical example and a multidimensional hydrodynamic example.

4.1 Characteristics of the examples

4.1.1 Analytical example

In the analytical example, the two codes and the mean functions of the associated Gaussian processes are defined by:

(4.1)
(4.2)

In this example, the inputs of both codes are one-dimensional.

Figure 1 shows the variations of the outputs of code 1, code 2, and the nested code. The outputs of codes 1 and 2 are relatively smooth compared with that of the nested code, and the amplitude of the variations is strongly non-stationary for the nested code.

Figure 1: Analytical example: variations of the outputs of (a) code 1, (b) code 2, and (c) the nested code with respect to their inputs.

4.1.2 Hydrodynamic example

This example consists of the coupling of two computer codes. The objective is to determine the impact point of a conical projectile.

The first code computes the drag coefficient of a cone divided by the height of the cone. Its inputs are the height and the half-angle of the cone, so the dimension of $\boldsymbol{x}_1$ is 2.

The second code computes the range of the ballistic trajectory of the cone. Its inputs are the output of the first code, together with the initial velocity and angle of the ballistic trajectory, gathered in $\boldsymbol{x}_2$. The dimension of $\boldsymbol{x}_2$ is therefore 2.

Figure 2 illustrates the inputs and outputs of the two codes.

Figure 3 shows the variations of the output of each code with respect to each component of its input. This figure makes it possible to propose a basis of functions for the prior mean of the processes associated with the two codes.

For the first code, the scatter plots highlight a linear variation with respect to one input and a multiplicative inverse variation with respect to the other, so the proposed basis functions are:

(4.3)

For the second code, only a multiplicative inverse variation with respect to one of the inputs is evident, so the proposed basis functions are:

(4.4)

The denominator is bounded from below in order to avoid any inversion problem around zero. This bound is small and set arbitrarily; a sketch of such basis functions is given below.
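A hedged sketch of such a basis, with `EPS` an arbitrary small bound and with the assignment of the linear and inverse terms to the two inputs being an assumption made for illustration:

```python
import numpy as np

EPS = 1e-6  # small lower bound on the denominator (the paper only states it is small and arbitrary)

def basis(u, v):
    """Constant, linear, and lower-bounded multiplicative-inverse basis functions."""
    return np.array([1.0, u, 1.0 / np.maximum(v, EPS)])
```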

Figure 2: Hydrodynamic example: inputs and outputs of the two codes. (a) Code 1: drag coefficient divided by the height of the cone; (b) Code 2: range of a ballistic trajectory.

Figure 3: Hydrodynamic example: variation of the outputs of the two codes with respect to the components of the inputs $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ ((a)-(b): code 1; (c)-(e): code 2). The input points are drawn according to a maximin LHS design.

4.2 Reference: the "blind box" method

In this method, the nested computer code is considered as a single computer code. Only the inputs $\boldsymbol{x}$ and the output $y_{\mathrm{nest}}(\boldsymbol{x})$ are taken into account; the intermediate information is discarded. A Gaussian process regression of this single computer code is then performed.

Only the chained I-optimal sequential design can be applied in this framework, since the other proposed sequential design requires access to the intermediate information.

4.3 Choice of the covariance functions and estimation of their hyperparameters

In the analytical example, the covariance functions are Gaussian. This implies that the sample paths of the Gaussian processes associated with the codes are infinitely differentiable functions, which makes it possible to apply Proposition 2 and Proposition 3 to this example.

In the hydrodynamic example, the covariance functions are Matérn-5/2, which implies that the sample paths of the Gaussian processes associated with the codes are continuously differentiable in the mean square sense (see Rasmussen2006). This makes it possible to perform the linearization of Proposition 3.

In both cases the covariance functions include a non-zero nugget term (see Gramacy2012Stat for further details).

The hyperparameters of the covariance functions are estimated for each set of observations, including during the sequential designs. They are estimated by maximizing the Leave-One-Out log predictive probability (see Rasmussen2006, chapter 5, and Bachoc2013), as sketched below.
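A minimal sketch of this estimation for a zero-mean process, using the closed-form Leave-One-Out expressions of Rasmussen2006 (chapter 5), which avoid refitting the model for each left-out point:

```python
import numpy as np

def loo_log_prob(K, y):
    """Leave-One-Out log predictive probability for a zero-mean GP with covariance matrix K."""
    Kinv = np.linalg.inv(K)
    d = np.diag(Kinv)                 # [K^{-1}]_ii
    var = 1.0 / d                     # LOO predictive variances
    mu = y - (Kinv @ y) / d           # LOO predictive means
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - (y - mu)**2 / (2.0 * var))
```

The hyperparameters would then be obtained by maximizing this quantity over the covariance parameters, for instance with a generic numerical optimizer applied to its negative.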

4.4 Comparison between the analytical and the linearized method

Figure 4 illustrates the convergence of the first two statistical moments estimated with the Monte Carlo method (see Proposition 1) and the linearized method (see Proposition 3) toward their reference values calculated with the analytical method described in Proposition 2.

Both methods converge when the uncertainty of the first code predictor decreases. It can be seen that the linearized method is a very good compromise between computation time and accuracy compared to the Monte Carlo method.

Figure 4: Comparison of the linearized (Proposition 3) and Monte Carlo (Proposition 1) methods in terms of (a) computation time and accuracy for the evaluation of the (b) first and (c) second moments of the process $Y_{\mathrm{nest}}$. The Monte Carlo method is run with two different numbers of points to compute the one-dimensional integral with respect to a Gaussian measure; the Monte Carlo draws are repeated and the curves correspond to the median of these repetitions.
The reference values are computed with the analytical method (Proposition 2). The covariance functions are Gaussian. The predictor of the first code is specified as a Gaussian random variable and, for each value of its mean, several values of its standard deviation on a grid are considered. The predictor of the second code is built using input observation points drawn on a grid, for the second code of the analytical example.

4.5 Definition of the performance criterion of the predictor mean

A set of validation observations is available. Let $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N_v)}$ be $N_v$ elements of $\mathbb{X}_1 \times \mathbb{X}_2$.
Denoting by $y_{\mathrm{nest}}(\boldsymbol{x}^{(1)}), \dots, y_{\mathrm{nest}}(\boldsymbol{x}^{(N_v)})$ the evaluations of the nested code at these points, the performance criterion of the nested predictor mean, also called the error on the mean, can be defined as:

$$e = \frac{\sum_{j=1}^{N_v} \big( \mathbb{E}\big[Y_{\mathrm{nest}}(\boldsymbol{x}^{(j)})\big] - y_{\mathrm{nest}}(\boldsymbol{x}^{(j)}) \big)^2}{\sum_{j=1}^{N_v} \big( y_{\mathrm{nest}}(\boldsymbol{x}^{(j)}) - \bar{y}_{\mathrm{nest}} \big)^2}, \qquad (4.5)$$

where $\bar{y}_{\mathrm{nest}}$ denotes the empirical mean of the validation outputs.
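A minimal sketch of this criterion, assuming the normalization of Eq. (4.5):

```python
import numpy as np

def error_on_mean(y_pred_mean, y_true):
    """Normalized quadratic error between the predictor mean and the validation outputs."""
    num = np.sum((y_pred_mean - y_true)**2)
    den = np.sum((y_true - np.mean(y_true))**2)
    return num / den
```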

4.6 Comparison between the blind box and the linearized methods

Figure 5 shows that the linearized method better captures the non-stationarity of the variations of the nested code output. By contrast, with the blind box method the magnitude of the prediction interval is the same across the input domain and depends only on the distance to the observation points: the prediction interval is too wide in the areas with small variations and too narrow in the areas with larger variations.

Figure 5: Analytical example: predictors of the nested code obtained with (a) the linearized and (b) the blind box methods. The set of 20 observations is drawn according to a maximin LHS. Actual values are shown by dots, the prediction mean by a line, and the 95% prediction interval by a grey area.

Figure 6 shows that the prediction means computed with the analytical and linearized methods proposed in Proposition 2 and Proposition 3 have similar accuracies.

For both examples, the precision of the prediction mean is better with the linearized method than with the blind box method, showing the benefit of taking the intermediate information into account.

Figure 6: Comparison of the prediction mean accuracy for the blind box and linearized (Proposition 3) methods and, in the case of a Gaussian covariance function, the analytical method (Proposition 2). (a) Analytical example: Gaussian covariance; (b) hydrodynamic example: Matérn covariance. The curves correspond to the median over repeated draws of maximin LHS designs of increasing size.

4.7 Performances of the sequential designs with identical computational costs

Figure 7 shows the relevance of the proposed sequential designs for improving the prediction mean of the linearized nested predictor, compared to the maximin LHS design on .

In the analytical example, the best I-optimal sequential design yields the most accurate prediction mean at a given computational cost. In the hydrodynamic example, the different sequential designs give similar results, except for the first few added observation points, where the best I-optimal criterion performs better.

Figure 7: Comparison of the linearized predictor mean precision obtained with the maximin LHS design and with the sequential designs applied to (a) the analytical and (b) the hydrodynamic example. In the hydrodynamic example, the two curves representing the sequential designs are almost superimposed. The initial designs are the same for the three curves. The draw of the chained maximin LHS designs is repeated and the curves present the median of the associated results. The costs of the two codes are assumed to be the same.

In both examples, the new observation points are mostly added on the first code, as shown in Figure 8. It seems that the uncertainty propagated from the first code into the second code is dominant at the beginning. The best I-optimal sequential design therefore aims to reduce this uncertainty by first adding new observation points on the first code; new observation points can then be added on both codes.

Figure 8: Number of evaluations of each code for the best I-optimal sequential design applied to (a) the analytical and (b) the hydrodynamic example. The curves correspond to the median over repeated draws of the initial design. The costs of the two codes are assumed to be the same.

4.8 Performances of the sequential designs with different computational costs

Figure 9 shows the prediction mean accuracy obtained with the best I-optimal sequential design when the costs of the two codes are different. At a given total computational cost, the prediction accuracy is better when the cost of the first code is lower, that is, when more observation points can be added to the first code within the same computational budget. These results are consistent with those of Figure 8.

Figure 9: Performance of the best I-optimal sequential design in terms of prediction mean accuracy with different computational costs for the two codes, for (a) the analytical and (b) the hydrodynamic example. The ratio 1:2 corresponds to a cost of 1 for code 1 and 2 for code 2, and 2:1 to a cost of 2 for code 1 and 1 for code 2. The curves correspond to the median over repeated draws of the initial maximin LHS design. The initial designs are the same for the two curves of each example and contain 15 observations (analytical example) and 30 observations (hydrodynamic example) on both codes.

5 Conclusions and future work

In this paper, the Gaussian process formalism has been adapted to the case of two nested computer codes.
Two methods for quickly evaluating the mean and variance of the nested code predictor have been proposed. The first one, called "analytical", computes the exact value of the first two moments of the predictor, but it cannot be applied to the coupling of more than two codes. The second one, called "linearized", yields a Gaussian predictor of the two nested codes, whose mean and variance can be computed almost instantly. This approach could be generalized to the coupling of more than two codes.
Both proposed methods take into account the intermediate information, that is, the output of the first code. They are compared to a reference method, called "blind box", in which a Gaussian process regression of the two codes, considered as a single block, is performed without the intermediate observations. The numerical examples illustrate the benefit, in terms of prediction mean accuracy, of taking the intermediate information into account.

Moreover, two sequential designs have been proposed in order to improve the prediction accuracy of the nested predictor. The first one, the "chained" I-optimal sequential design, corresponds to the case when the two codes cannot be launched separately. The second one, the "best" I-optimal sequential design, makes it possible to choose on which of the two codes to add a new observation point and to take into account the different computational costs of the two codes.
The numerical applications show the interest of the sequential designs compared to a space-filling design (maximin LHS). Furthermore, they illustrate the advantage, in terms of prediction mean accuracy, of choosing on which code to add a new observation point rather than simply adding new observation points of the nested code. The results show an amplification of the uncertainties along the chain of codes, leading the best I-optimal sequential design to add observation points on the first code first. It can be expected that this behavior carries over to the coupling of more than two codes; in other words, the uncertainty at the beginning of the chain should be reduced as a priority.

This paper has focused on the case of two nested codes with a scalar intermediate variable. Considering the case of a functional intermediate variable seems a promising direction for future work.

Appendix

Proof of Proposition 1

According to Eq. (2.5), the two conditioned processes are statistically independent, owing to the independence of the initial processes $Y_1$ and $Y_2$.

Therefore the process modeling the nested code can be written:

Given the independence of the two conditioned processes, it can be inferred that the first moment of $Y_{\mathrm{nest}}$ can be written:

By noting that:

  • and are independent,

  • and ,

the second moment of can be written:

Proof of Proposition 2

If and then the mean of is defined as:

where under the condition that .

Since the moments of a Gaussian random variable can be calculated analytically, these quantities can be computed analytically.

We have thus shown that, under the integrability condition above, the mean can be calculated analytically.

First moment

In the framework of Universal Kriging, the conditional mean function of the process modeling the second code can be written:

where and and .

According to the assumptions of Proposition 2 the mean basis functions can be written:

with deterministic functions.

In the same way, the covariance function can be written:

with positive integer exponents, and with the $n$-th derivative of the corresponding function denoted accordingly. So, we can write that:

where is a deterministic function defined according to the previous equation and real numbers.

So the terms and of the previous equation can be written:

Given that the quantities above are deterministic functions, vectors, and real numbers, it follows that:

These means can therefore be calculated analytically, and consequently, so can the mean of the nested predictor.

Second moment

In the framework of Universal Kriging, it can be written that:

where the quantities involved are deterministic real-valued matrices of appropriate dimensions.

According to the assumptions of Proposition 2 and the previous equations, the terms , and can be rewritten: