Known Boundary Emulation of Complex Computer Models

01/09/2018 ∙ Ian Vernon, et al. ∙ Durham University

Computer models are now widely used across a range of scientific disciplines to describe various complex physical systems, however to perform full uncertainty quantification we often need to employ emulators. An emulator is a fast statistical construct that mimics the complex computer model, and greatly aids the vastly more computationally intensive uncertainty quantification calculations that a serious scientific analysis often requires. In some cases, the complex model can be solved far more efficiently for certain parameter settings, leading to boundaries or hyperplanes in the input parameter space where the model is essentially known. We show that for a large class of Gaussian process style emulators, multiple boundaries can be formally incorporated into the emulation process, by Bayesian updating of the emulators with respect to the boundaries, for trivial computational cost. The resulting updated emulator equations are given analytically. This leads to emulators that possess increased accuracy across large portions of the input parameter space. We also describe how a user can incorporate such boundaries within standard black box GP emulation packages that are currently available, without altering the core code. Appropriate designs of model runs in the presence of known boundaries are then analysed, with two kinds of general purpose designs proposed. We then apply the improved emulation and design methodology to an important systems biology model of hormonal crosstalk in Arabidopsis thaliana.


1 Introduction

The use of mathematical models to describe complex physical systems is now commonplace in a wide variety of scientific disciplines. We refer to such models as simulators. Often they possess high numbers of input and/or output dimensions, and are sufficiently complex that they may require substantial time for the completion of a single evaluation. The simulator may have been developed to aid understanding of the real world system in question, or to be compared to observed data, necessitating a high-dimensional parameter search or model calibration, or to make predictions of future system behaviour, possibly with the goal of aiding a future decision process. The responsible use of simulators in all of the above contexts usually requires a full (Bayesian) uncertainty analysis [6], which will aim to incorporate all the major relevant sources of uncertainty, for example, parametric uncertainty on the inputs to the model, observation uncertainties on the data, structural model discrepancy representing the uncertain differences between the simulator and the real system (both past and future, and the relation between them), and also the many uncertainties related to the decision process [39].

However, such an uncertainty analysis, which represents a critically important part of any serious scientific study, usually requires a vast number of simulator evaluations. For complex simulators with even a modest runtime, this means that such an analysis is utterly infeasible. A solution to this problem is found through the use of emulators: an emulator is a statistical construct that seeks to mimic the behaviour of the simulator over its input space, but which is several orders of magnitude faster to evaluate. Early uses of Gaussian process emulators for computer models were given by [35, 13], with a more detailed account given in [36]. A vital feature of an emulator is that it gives both an expectation of the simulator’s outputs at an unexplored input location as well as an uncertainty statement about the emulator’s accuracy at this point, an attribute that elevates emulation above interpolation or other simple proxy modelling approaches. Therefore, emulators fit naturally within a Bayesian approach, and help facilitate the full uncertainty analysis described above. For an early example of an uncertainty analysis using multilevel emulation combined with structural discrepancy modelling in a Bayesian history matching context see [9, 10], and for a fully Bayesian calibration of a complex nuclear radiation model, again incorporating structural model discrepancy, see [26].

Emulators have now been successfully employed across several scientific disciplines, including cosmology [39, 40, 8, 37, 20, 41, 31], climate modelling [44, 34, 25, 22] (the latter employing emulation to increase the efficiency of an Approximate Bayesian Computation algorithm), epidemiology [2, 3, 1, 30, 29], systems biology [43, 24], oil reservoir modelling [11, 12], environmental science [16], traffic modelling [7], volcanology [5] and even Bayesian analysis itself [42]. The development of improved emulation strategies therefore has the potential to benefit multiple scientific areas, allowing more accurate analyses with lower computational cost. If additional prior insight into the physical structure of the model is available, it is important to develop emulator structures capable of incorporating such insight, so that this information can be fully exploited.

Here we describe such an advance in emulation strategy that, when applicable, can lead to substantial improvements in emulator performance. In most cases, complex deterministic simulators have to be solved numerically for arbitrary input specifications, which leads to substantial runtimes. However, for some simulators there exist input parameter settings, lying possibly on boundaries or hyperplanes in the input parameter space, where the simulator can be solved far more efficiently: either analytically in the ideal case, or significantly faster using a much more efficient and simpler solver. This may be due to the system in question, or at least a subset of the system outputs, behaving in a much simpler way for particular input settings, for example due to various modules decoupling from more complex parts of the model (possibly when certain inputs are set to zero, switching some processes off). Note that this leads to Dirichlet boundary conditions, i.e. known simulator behaviour on various hyperplanes, that impose constraints on the emulator itself, and that these are distinct from the boundary conditions imposed on the physical simulator model (as, for example, analysed approximately using Karhunen-Loève expansions by [38]). The goal, then, is to incorporate these known boundaries, on which we essentially know the function output, into the Bayesian emulation process, which should lead to significantly improved emulators. We do this by formally updating the emulators by the information contained on the known boundaries, obtaining analytic results, and show that this is possible for a large class of emulators, for multiple boundaries of various forms, and, most importantly, for trivial extra computational cost. We also detail how users can include known boundaries when using standard black box Gaussian process software (without altering the core code), although this method is less powerful than implementing the fully updated emulators that we develop here. We then analyse the design problem of how to choose an efficient set of runs of the full simulator, given that we are aware of the existence of one or more known boundaries. Finally we apply this approach to a model of hormonal crosstalk in Arabidopsis, an important model in systems biology, which possesses these features.

The article is organised as follows. In section 2 we describe a standard emulation approach for deterministic simulators. In section 3 we develop the full known boundary emulation (KBE) methodology, explicitly constructing emulators that have been updated by one or two perpendicular (or parallel) known boundaries. In section 4 we discuss the design problem of how to choose efficient sets of simulator runs. In section 5 we apply both the KBE and the design techniques to the systems biology Arabidopsis model, before concluding in section 6.

2 Emulation of Complex Computer Models

We consider a complex computer model represented as a function f(x), where x denotes a d-dimensional vector containing the computer model’s input parameters, and x ∈ X, where X is a pre-specified input parameter space of interest. We imagine that a single evaluation of the computer model takes a substantial amount of time to complete, and hence we will only be able to evaluate it at a small number of locations. Here we assume f(x) is univariate, but the results we present should directly generalise to the corresponding multivariate case.

We represent our beliefs about the unknown f(x) at unevaluated input x via an emulator. For now, we assume that the form of the emulator is that of a pure Gaussian process (or, in a less fully specified version, a weakly second order stationary stochastic process) u(x):

f(x) = u(x),  with  E[u(x)] = β  and  Var[u(x)] = σ².    (1)

We make the judgement, consistent with most of the computer model literature, that the u(x) have a product correlation structure:

Cov[ u(x), u(x') ] = σ² r(x − x') = σ² ∏_{i=1}^{d} r_i( x_i − x'_i ),    (2)

with r_i(0) = 1, corresponding to deterministic f(x). Product correlation structures are very common, with the most popular being the Gaussian correlation structure given by:

r(x − x') = exp{ −|x − x'|² / θ² } = ∏_{i=1}^{d} exp{ −( x_i − x'_i )² / θ² },    (3)

which can be generalised to have a different correlation length θ_i in each direction, while maintaining the product structure. As usual, we will also assume stationarity, but the following derivations do not require this assumption.
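For concreteness, the following is a minimal sketch of the product Gaussian correlation (3) in Python; the correlation length values are purely illustrative, and the per-dimension θ_i generalisation mentioned above is included as an option.

```python
import numpy as np

def gauss_corr(x, xp, theta=0.5):
    """Product Gaussian correlation r(x - x') = prod_i exp{-(x_i - x'_i)^2 / theta_i^2}.

    theta may be a scalar (a single common correlation length, as in (3)) or a
    length-d vector of per-dimension lengths, which preserves the product structure."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    return np.exp(-np.sum(((x - xp) / theta) ** 2))

print(gauss_corr([0.1, 0.2], [0.1, 0.2]))               # r(0) = 1, as required for a deterministic model
print(gauss_corr([0.1, 0.2], [0.4, 0.9], theta=0.35))   # decays smoothly with distance
```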

If we perform a set of n runs at locations x^(1), …, x^(n) over the input space of interest X, giving computer model outputs as the column vector D = ( f(x^(1)), …, f(x^(n)) )^T, then we can update our beliefs about the computer model f(x) in light of D. This can be done either using Bayes theorem (if f(x) is assumed to be a Gaussian process) or using the Bayes linear update formulae (which, following de Finetti [14], treat expectation as primitive, and require only a second order specification [15, 17]):

E_D[ f(x) ] = E[ f(x) ] + Cov[ f(x), D ] Var[D]^{-1} ( D − E[D] ),    (4)
Var_D[ f(x) ] = Var[ f(x) ] − Cov[ f(x), D ] Var[D]^{-1} Cov[ D, f(x) ],    (5)
Cov_D[ f(x), f(x') ] = Cov[ f(x), f(x') ] − Cov[ f(x), D ] Var[D]^{-1} Cov[ D, f(x') ],    (6)

where E_D[f(x)], Var_D[f(x)] and Cov_D[f(x), f(x')] are the expectation, variance and covariance of f(x) adjusted by D [15, 17]. The fully Bayesian calculation, using Bayes theorem, would yield similar update formulae for the analogous posterior quantities. Although we will work within the Bayes linear formalism, the derived results will apply directly to the fully Bayesian case, were one willing to make the additional assumption of full normality that use of a Gaussian process entails. In this case all Bayes linear adjusted quantities can be directly mapped to the corresponding posterior versions, e.g. E_D[f(x)] → E[f(x)|D] and Var_D[f(x)] → Var[f(x)|D]. See [15, 17] for discussion of the benefits of using a Bayes linear approach, and [39, 40] for its benefits within a computer model setting.
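The adjustment (4)–(6) is a small amount of linear algebra once Var[D] and Cov[f(x), D] are built from the correlation function. The sketch below is a minimal implementation under the assumptions of a constant prior mean β, variance σ² and the Gaussian correlation (3); the stand-in simulator and all parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def corr_matrix(X1, X2, theta):
    """Matrix of product Gaussian correlations r(x - x') between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) / theta) ** 2
    return np.exp(-d2.sum(axis=2))

def bl_adjust(x_new, X_run, D, beta, sigma2, theta):
    """Bayes linear adjusted expectation (4) and variance (5) at the rows of x_new."""
    V = sigma2 * corr_matrix(X_run, X_run, theta)                  # Var[D]
    c = sigma2 * corr_matrix(x_new, X_run, theta)                  # Cov[f(x), D]
    E_D = beta + c @ np.linalg.solve(V, D - beta)                  # equation (4)
    Var_D = sigma2 - np.einsum('ij,ij->i', c, np.linalg.solve(V, c.T).T)   # equation (5), diagonal
    return E_D, Var_D

# Toy usage with a cheap stand-in "simulator" in d = 2 dimensions (not a model from the paper).
rng = np.random.default_rng(1)
f = lambda X: np.sin(3 * X[:, 0]) + X[:, 1] ** 2
X_run = rng.uniform(0, 1, size=(5, 2))
x_new = rng.uniform(0, 1, size=(3, 2))
print(bl_adjust(x_new, X_run, f(X_run), beta=0.0, sigma2=1.0, theta=0.4))
```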

The results presented in this article rely on the product correlation structure of the emulator. As such, extension of these methods to more general emulator forms requires further calculation. For example, a more advanced emulator is given by [10, 39]:

f(x) = ∑_{j} β_j g_j(x_A) + u(x_A) + w(x),    (7)

where the active inputs x_A are a subset of x that are strongly influential for f(x), the first term on the right hand side is a regression term containing known functions g_j(x_A) and possibly unknown coefficients β_j, u(x_A) is a Gaussian process over the active inputs only, and w(x) is an uncorrelated nugget term, representing the inactive variables. See also [11] and [39, 40] for discussions of the benefits of using an emulator structure of this kind, and see [26, 21] for discussions of alternative structures. We will discuss the generalisation of our results to equation (7) in section 6, but currently we note that if the regression parameters β_j are assumed known, perhaps due to a sufficiently large number of runs, and if all variables are assumed active, then equation (7) reduces to the required form, and all our results will apply.
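As one illustration of the structure of (7) with known regression coefficients, the implied prior covariance could be sketched as follows; the split into active and inactive inputs and the two variance parameters are assumptions made purely for the example.

```python
import numpy as np

def cov_advanced(x, xp, active, sigma_u2, sigma_w2, theta):
    """Prior covariance implied by (7) when the regression coefficients are known:
    a Gaussian process u over the active inputs plus an uncorrelated nugget w."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    xa, xpa = x[active], xp[active]
    gp_part = sigma_u2 * np.exp(-np.sum(((xa - xpa) / theta) ** 2))   # GP over active inputs only
    nugget = sigma_w2 if np.array_equal(x, xp) else 0.0               # uncorrelated nugget term
    return gp_part + nugget

x  = [0.2, 0.5, 0.9]
xp = [0.2, 0.7, 0.1]
print(cov_advanced(x, xp, active=[0, 1], sigma_u2=1.0, sigma_w2=0.05, theta=0.4))
print(cov_advanced(x, x,  active=[0, 1], sigma_u2=1.0, sigma_w2=0.05, theta=0.4))  # adds the nugget
```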

3 Known Boundary Emulation

We now consider the situation where the computer model f(x) is analytically solvable on some lower dimensional boundary K ⊂ X. Hence we can evaluate f(x) a vast number of times on K, and use these evaluations to supplement our standard emulator evaluations over X to produce an emulator that respects the functional behaviour of f(x) along K. We first examine the case of a finite (but large) number m of evaluations on K, which can be analysed using the standard Bayes linear update, but we structure our calculations so that they can be simply generalised to continuous model evaluations on K, which will require a generalised version of the Bayes linear update, as described in section 3.6.

Call the corresponding length m vector of model evaluations K. Unfortunately, simply plugging these runs into the Bayes linear update equations (4), (5) and (6), replacing D with K, would be infeasible due to the size of the matrix inversion Var[K]^{-1}. For example, if the dimension of the boundary K is not small, we may need m to be extremely large (billions or trillions say) to capture all the information contained in K. Hence a direct update of the emulator in light of the information in K is non-trivial. Here we show from first principles that this update can be performed analytically for a wide class of emulators. We do this by exploiting a sufficiency argument briefly described in the supplementary material of [26], and in [33], but which, to our knowledge, has not been fully explored or utilised in the context of known boundary emulation. The emulation problem is further compounded when we have both evaluations K on the boundary, and evaluations D in the main bulk of X, as defined above. For this case we will develop a sequential update that first updates analytically by K to obtain E_K[f(x)], Var_K[f(x)] and Cov_K[f(x), f(x')], and then subsequently updates by D, as is discussed in Section 3.3.

3.1 A Single Known Boundary

We wish to update the emulator, and hence our beliefs about f(x), at the input point x ∈ X, in light of a single known boundary K, where K is a d−1 dimensional hyperplane perpendicular to the x_1 direction (but we note that our results naturally extend to lower dimensional boundaries). To capture the simulator behaviour along K, we evaluate f(x) at a large number, m, of points on K which we denote y^(1), …, y^(m). We also evaluate the perpendicular projection of the point of interest, x, onto the boundary K, which we denote as x^K. We therefore extend the collection of boundary evaluations, K, to be the column vector

K = ( f(x^K), f(y^(1)), …, f(y^(m)) )^T,

which is illustrated in Figure 1 (left panel).

     
Figure 1: The single known boundary case. Left panel: the points required for the E_K[f(x)] and Var_K[f(x)] calculation. x is the point we wish to emulate at, and x^K its orthogonal projection onto the known boundary K at distance a. Right panel: the points required for the Cov_K[f(x), f(x')] calculation. x and x' are points we wish to update the covariance at, while x^K and x'^K are their orthogonal projections onto the known boundary K, at distances a and a' respectively. In both panels, the y^(j) represent a large number of points for which we can evaluate f analytically (or at least very quickly).

We start by examining the expression for E_K[f(x)]:

E_K[ f(x) ] = E[ f(x) ] + Cov[ f(x), K ] Var[K]^{-1} ( K − E[K] ).    (8)

As noted above, this calculation is seemingly infeasible due to the Var[K]^{-1} term. However, if we evaluate it at the point x^K itself, which lies on K, and as we have evaluated f(x^K), we must find, for the emulator of a smooth deterministic function with suitably chosen correlation structure, that E_K[f(x^K)] = f(x^K) (and that Var_K[f(x^K)] = 0). This is indeed the case as can be seen by examining the structure of the Cov[ f(x^K), K ] Var[K]^{-1} term. As f(x^K) is included as the first element of K, we note that

Cov[ K, K ] = Var[K],    (9)
Var[K] Var[K]^{-1} = I_{m+1},    (10)

where I_{m+1} is the identity matrix of dimension m+1. Taking the first row of equation (10) gives

Cov[ f(x^K), K ] Var[K]^{-1} = ( 1, 0, …, 0 ).    (11)

Substituting equation (11) into the adjusted mean and variance naturally gives E_K[f(x^K)] = f(x^K) and Var_K[f(x^K)] = 0, as it must. Whilst unsurprising, this simple result is of particular value when considering the behaviour at the point of interest, x. As we have defined x^K as the perpendicular projection of x onto K, we can write x = x^K + a (1, 0, …, 0)^T, for some constant a. Now we can exploit the symmetry of the product correlation structure (2), to obtain the following covariance expressions

Cov[ f(x), f(x^K) ] = σ² r_1(a),    (12)

since x_i − x^K_i = 0 for i = 2, …, d and r_i(0) = 1. Furthermore,

Cov[ f(x), f(y^(j)) ] = r_1(a) Cov[ f(x^K), f(y^(j)) ],    (13)

since the first component of x^K and y^(j) must be equal as they all lie on K (i.e. x^K_1 = y^(j)_1). Combining (12) and (13), the covariance between the point x and the set of boundary evaluations K is given by

Cov[ f(x), K ] = r_1(a) Cov[ f(x^K), K ].    (14)

Equations (11) and (14) are very useful results that greatly simplify the emulator calculations. We can use them to write the adjusted emulator expectation for f(x) given K in (8) as

E_K[ f(x) ] = E[ f(x) ] + r_1(a) ( f(x^K) − E[ f(x^K) ] ).    (15)

Thus we have eliminated the need to explicitly invert the large matrix Var[K] entirely, by exploiting the symmetric product correlation structure and the identity (11). Similarly, we find the adjusted variance using equations (5), (11) and (14),

Var_K[ f(x) ] = σ² ( 1 − r_1(a)² ).    (16)

Equations (15) and (16) give the expectation and variance of the emulator at a point x ∈ X, updated by a known boundary K. As they require only evaluations of the analytic boundary function at x^K and the correlation function r_1(a), they can be implemented with trivial computational cost in comparison to a direct update by K. Note that they critically rely on the projected point x^K being in K.
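The single-boundary update (15)–(16) needs only the boundary function at the projection x^K and the one-dimensional correlation r_1(a). The sketch below implements it and, as a sanity check, compares it against a brute-force Bayes linear update (4)–(5) using an explicit vector of boundary evaluations; the boundary location x_1 = 0, the stand-in boundary function and all parameter values are assumptions for the example.

```python
import numpy as np

theta, sigma2, beta = 0.4, 1.0, 0.0              # illustrative emulator parameters
r1 = lambda t: np.exp(-(t / theta) ** 2)         # 1-d Gaussian correlation in the x_1 direction

def f_boundary(x2):
    """Stand-in for the analytically known simulator values on K = {x_1 = 0}."""
    return np.sin(2.0 * np.pi * x2)

def kbe_single(x):
    """Equations (15) and (16): expectation and variance adjusted by the boundary K."""
    a = x[0]                                                    # perpendicular distance from x to K
    E_K = beta + r1(a) * (f_boundary(x[1]) - beta)              # uses the boundary value at x^K
    Var_K = sigma2 * (1.0 - r1(a) ** 2)
    return E_K, Var_K

def corr_mat(A, B):
    """Product Gaussian correlations between the rows of A and B."""
    return np.exp(-(((A[:, None, :] - B[None, :, :]) / theta) ** 2).sum(axis=2))

def brute_force(x):
    """Direct Bayes linear update (4)-(5) by an explicit vector K of boundary evaluations."""
    grid = np.linspace(-1.0, 2.0, 25)
    Y = np.vstack([[0.0, x[1]], np.column_stack([np.zeros(25), grid])])   # x^K first, then the y^(j)
    K = f_boundary(Y[:, 1])
    V = sigma2 * corr_mat(Y, Y) + 1e-8 * np.eye(len(Y))                   # Var[K], with a small jitter
    c = sigma2 * corr_mat(x[None, :], Y)                                  # Cov[f(x), K]
    E = beta + (c @ np.linalg.solve(V, K - beta)).item()
    Var = sigma2 - (c @ np.linalg.solve(V, c.T)).item()
    return E, Var

x = np.array([0.15, 0.6])
print(kbe_single(x))      # analytic known-boundary update: trivial cost
print(brute_force(x))     # should closely agree, at the cost of a 26 x 26 matrix inversion
```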

Finally, we consider the Bayes linear update for the covariance between f(x) and a second input point f(x') given the boundary K. We define the orthogonal projection of x' onto K as x'^K, and denote its perpendicular distance from K as a', as shown in Figure 1 (right panel). Equation (6) now gives

Cov_K[ f(x), f(x') ] = Cov[ f(x), f(x') ] − Cov[ f(x), K ] Var[K]^{-1} Cov[ K, f(x') ]
                     = Cov[ f(x), f(x') ] − r_1(a) Cov[ f(x^K), K ] Var[K]^{-1} Cov[ K, f(x') ]
                     = Cov[ f(x), f(x') ] − r_1(a) Cov[ f(x^K), f(x') ]
                     = Cov[ f(x), f(x') ] − r_1(a) r_1(a') Cov[ f(x^K), f(x'^K) ],    (17)

where in the final line we used the equivalent result to equation (13), rewritten for x'. Noting that we can also write Cov[ f(x), f(x') ] = σ² r_1(a − a') ∏_{i=2}^{d} r_i(x_i − x'_i), and that Cov[ f(x^K), f(x'^K) ] = σ² ∏_{i=2}^{d} r_i(x_i − x'_i), equation (17) becomes

Cov_K[ f(x), f(x') ] = σ² { r_1(a − a') − r_1(a) r_1(a') } ∏_{i=2}^{d} r_i( x_i − x'_i ),    (18)

where we have defined the correlation function of the projection of x and x' onto K as

R_K(x, x') = ∏_{i=2}^{d} r_i( x_i − x'_i ),

and defined the ‘updated correlation component’ in the x_1 direction as

r_1^K(a, a') = r_1(a − a') − r_1(a) r_1(a').    (19)
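A corresponding sketch of the non-stationary covariance (18), built from the updated correlation component (19) in the x_1 direction and the unchanged product correlation over the remaining dimensions (again with illustrative parameter values and boundary location x_1 = 0):

```python
import numpy as np

theta, sigma2 = 0.4, 1.0
r1 = lambda t: np.exp(-(t / theta) ** 2)          # correlation in the x_1 direction

def r1_K(a, ap):
    """Updated correlation component in the x_1 direction, equation (19)."""
    return r1(a - ap) - r1(a) * r1(ap)

def cov_K(x, xp):
    """Equation (18): covariance between f(x) and f(x') adjusted by the boundary K = {x_1 = 0}."""
    a, ap = x[0], xp[0]                            # distances of x and x' from K
    rest = np.exp(-np.sum(((x[1:] - xp[1:]) / theta) ** 2))   # product correlation over dims 2..d
    return sigma2 * r1_K(a, ap) * rest

x, xp = np.array([0.15, 0.6]), np.array([0.30, 0.4])
print(cov_K(x, xp))                        # reduced relative to the prior covariance near K
print(cov_K(np.array([0.0, 0.6]), xp))     # zero: f is known exactly on the boundary
```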

These expressions for the expectation and (co)variance updated by the information at the simulator boundary provide several insights:

  1. Sufficiency: for the updating of our beliefs about the emulator, we see that f(x^K) is sufficient for K. Hence, only the evaluation f(x^K) is required and the evaluations f(y^(1)), …, f(y^(m)) are redundant (note that under an assumption of an underlying Gaussian process, this result corresponds to a conditional independence statement discussed in the supplementary material to [26]). This has important ramifications for users of black box GP packages, as we discuss in section 3.3.

  2. The correlation structure is now no longer stationary: the contribution to the correlation function from dimensions 2 to d, denoted R_K(x, x') above, is unchanged by the update (as we would expect from symmetry arguments), however the contribution in the x_1 direction depends on the distances a and a' to the boundary K, through r_1^K(a, a'), breaking stationarity.

  3. The correlation structure is still in product form: critically, as the correlation structure has maintained its product form, this suggests that we can update by further known boundaries, either perpendicular to any of the remaining inputs x_i, with i = 2, …, d, hence perpendicular to K, or indeed by a second boundary parallel to K. We perform these updates in sections 3.4 and 3.5.

  4. Intuitive limiting behaviour: as we move towards K, the emulator tends towards the known boundary function, and as we move away from K the emulator reverts to its prior form, as expected:

    E_K[ f(x) ] → f(x^K)  and  Var_K[ f(x) ] → 0,  as a → 0,    (20)
    E_K[ f(x) ] → E[ f(x) ]  and  Var_K[ f(x) ] → σ²,    (21)

    as a → ∞. Similarly, the behaviour of Cov_K[ f(x), f(x') ] is as expected, tending to its prior form far from the boundary (with a − a' finite), and to zero as either a or a' tends to zero:

    Cov_K[ f(x), f(x') ] → Cov[ f(x), f(x') ],  as a, a' → ∞,    (22)
    Cov_K[ f(x), f(x') ] → 0,  as a → 0 or a' → 0.    (23)

    (A small numerical check of these limits is given after this list.)
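As a quick numerical illustration of the limits (20)–(23), using the Gaussian choice of r_1 with an illustrative correlation length:

```python
import numpy as np

theta = 0.4
r1 = lambda t: np.exp(-(t / theta) ** 2)                 # 1-d Gaussian correlation
r1K = lambda a, ap: r1(a - ap) - r1(a) * r1(ap)          # updated component, equation (19)

print(r1(0.0), 1 - r1(0.0) ** 2)     # a -> 0 : weight on f(x^K) is 1, Var_K -> 0         (20)
print(r1(5.0), 1 - r1(5.0) ** 2)     # a -> inf: weight -> 0, Var_K -> sigma^2 (prior)    (21)
print(r1K(5.0, 5.2), r1(-0.2))       # a, a' large with a - a' finite: prior correlation  (22)
print(r1K(0.0, 0.3))                 # a = 0 : covariance with any point is zero          (23)
```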

3.2 Application to a 2-Dimensional Model

For illustration, we consider the problem of emulating a simple 2-dimensional function

(24)

defined over a rectangular region X of the (x_1, x_2) plane, where we assume a known boundary K at a fixed value of one of the inputs, and hence have that f(x) is known for all x ∈ K. The true output of f(x) over X is given in Figure 2(b) for reference.

Figure 2: Updating by a single known boundary K. (a) The emulator expectation E_K[f(x)]. (b) The true 2-dimensional function f(x). (c) The emulator standard deviation. (d) Emulator diagnostics.

Using a constant prior expectation E[f(x)] = β, and a product Gaussian covariance structure (3) with suitable choices of σ and θ, we apply the expectation and variance updates (15) and (16) given the boundary K. Figure 2(a) shows the adjusted expectation E_K[f(x)] over X, clearly illustrating how the expectation surface has been changed in the vicinity of K to agree with the simulator behaviour. Figure 2(c) shows the adjusted emulator standard deviation sqrt(Var_K[f(x)]) and demonstrates the significant reduction in emulator uncertainty near K. Finally, Figure 2(d) shows simple emulator diagnostics over X of the form of the standardised values S_K(x) = ( f(x) − E_K[f(x)] ) / sqrt(Var_K[f(x)]). Thus any values of x for which S_K(x) was far from zero (a typical choice being |S_K(x)| > 3) would indicate a conflict between emulator and simulator (see [4] for details). For our boundary-adjusted emulator, the standardised diagnostics all maintain modest values lying well within 3 standard deviations, giving no cause for concern.
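The standardised diagnostics shown in Figure 2(d) follow directly from the boundary-adjusted quantities (15) and (16). A self-contained sketch, in which the boundary location, correlation length and stand-in simulator are all assumptions for the example:

```python
import numpy as np

theta, sigma2, beta = 0.4, 1.0, 0.0
r1 = lambda t: np.exp(-(t / theta) ** 2)

f = lambda x: np.sin(2 * np.pi * x[1]) * np.exp(-x[0])   # stand-in 2-d simulator, known on x_1 = 0

def diagnostic(x):
    """Standardised value S_K(x) = (f(x) - E_K[f(x)]) / sqrt(Var_K[f(x)]) from (15)-(16)."""
    E_K = beta + r1(x[0]) * (f(np.array([0.0, x[1]])) - beta)
    Var_K = sigma2 * (1.0 - r1(x[0]) ** 2)
    return (f(x) - E_K) / np.sqrt(Var_K)

# Values well inside +/- 3 indicate agreement between emulator and simulator.
for x in [np.array([0.2, 0.3]), np.array([0.5, 0.8]), np.array([1.0, 0.1])]:
    print(x, diagnostic(x))
```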

3.3 Updating by further model evaluations

Most importantly, as we have analytic expressions for E_K[f(x)], Var_K[f(x)] and Cov_K[f(x), f(x')], we are now able to include additional simulator evaluations into the emulation process. To do this, we perform n (expensive) evaluations, D, of the full simulator across X, and use these to supplement the evaluations, K, available on the boundary. We want to update the emulator by the union of the evaluations K and D, that is to find E_{D∪K}[f(x)], Var_{D∪K}[f(x)] and Cov_{D∪K}[f(x), f(x')]. This can be achieved via a sequential Bayes linear update:

E_{D∪K}[ f(x) ] = E_K[ f(x) ] + Cov_K[ f(x), D ] Var_K[D]^{-1} ( D − E_K[D] ),    (25)
Var_{D∪K}[ f(x) ] = Var_K[ f(x) ] − Cov_K[ f(x), D ] Var_K[D]^{-1} Cov_K[ D, f(x) ],    (26)
Cov_{D∪K}[ f(x), f(x') ] = Cov_K[ f(x), f(x') ] − Cov_K[ f(x), D ] Var_K[D]^{-1} Cov_K[ D, f(x') ],    (27)

where we first update our emulator analytically by K, and subsequently update these quantities by the evaluations D [17]. As n is typically small due to the relative expense of evaluating the full simulator, these calculations will remain tractable, as the inversion of the n × n matrix Var_K[D] will be feasible for modest n.

Not only will the known boundary improve the accuracy of the emulator compared to just updating by D, for only trivial computational cost, it will also allow us to design a more informative set of runs that constitute D. We discuss appropriate designs for this scenario in section 4.
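A sketch of the sequential update (25)–(27): the prior quantities in the standard Bayes linear formulae are simply replaced by their boundary-adjusted counterparts E_K, Var_K and Cov_K. The boundary location x_1 = 0, the stand-in simulator and the parameter values below are illustrative assumptions.

```python
import numpy as np

theta, sigma2, beta = 0.4, 1.0, 0.0
r1 = lambda t: np.exp(-(t / theta) ** 2)

f = lambda X: np.sin(2 * np.pi * X[..., 1]) * np.exp(-X[..., 0])   # stand-in simulator, known on x_1 = 0

def E_K(X):
    """Equation (15) applied row-wise: expectation adjusted by the boundary K = {x_1 = 0}."""
    XK = X.copy(); XK[..., 0] = 0.0
    return beta + r1(X[..., 0]) * (f(XK) - beta)

def cov_K(X1, X2):
    """Equation (18) applied pairwise: covariance adjusted by K."""
    a, ap = X1[:, None, 0], X2[None, :, 0]
    rest = np.exp(-np.sum(((X1[:, None, 1:] - X2[None, :, 1:]) / theta) ** 2, axis=2))
    return sigma2 * (r1(a - ap) - r1(a) * r1(ap)) * rest

def sequential_update(x_new, X_run):
    """Equations (25)-(27): update by K analytically, then by the n runs D numerically."""
    D = f(X_run)
    V = cov_K(X_run, X_run) + 1e-10 * np.eye(len(X_run))        # Var_K[D]
    c = cov_K(x_new, X_run)                                      # Cov_K[f(x), D]
    E_DK = E_K(x_new) + c @ np.linalg.solve(V, D - E_K(X_run))                     # (25)
    Var_DK = sigma2 * (1 - r1(x_new[:, 0]) ** 2) \
             - np.einsum('ij,ij->i', c, np.linalg.solve(V, c.T).T)                 # (26), diagonal
    return E_DK, Var_DK

rng = np.random.default_rng(2)
X_run = rng.uniform(0, 1, size=(6, 2))          # n = 6 expensive runs in the bulk of X
x_new = np.array([[0.3, 0.55], [0.8, 0.2]])
print(sequential_update(x_new, X_run))
```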

3.3.1 Incorporating Known Boundaries into Black Box Emulation Packages

Consideration of the form of the sequential update given by equations (25)-(27), combined with the sufficiency argument presented in section 3.1, shows that for the full joint update by K ∪ D, a sufficient set of points is composed of a) the n points in D, b) the n points formed from the projection of D onto the boundary K, and c) the projection x^K of the point of interest x, giving a total of 2n+1 points. This has ramifications for users of black box Gaussian process emulation packages (such as BACCO [19] or GPfit [28] in R, or GPy [18] for Python), which perhaps cannot be easily recoded to use the more sophisticated analytic emulation formulae of equations (15) and (16), but for which the inclusion of extra simulator evaluations is trivial. Hence such a user simply has to add the n+1 extra projected points on K to their usual set of n runs, and their black box GP package will produce results that precisely match equations (25)-(27). This will, however, require inverting a matrix of size (2n+1) × (2n+1) and hence will be slower than directly using the above analytic results, which only require inverting a matrix of size n × n.
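In code, this black box recipe amounts to augmenting the training data before fitting. The sketch below only constructs the augmented design and output vector; fit_gp is a placeholder for whichever package is used (BACCO, GPfit, GPy or similar), not a real function, and the boundary description is an assumption for the example.

```python
import numpy as np

def augment_with_boundary(X_run, D, x_new, boundary_value, boundary_dim=0, boundary_loc=0.0):
    """Augment the n runs with their projections onto the known boundary K (plus the projection
    of the prediction point x), whose outputs are known analytically via boundary_value().
    Returns the (2n+1)-point training set described in section 3.3.1."""
    X_proj = X_run.copy(); X_proj[:, boundary_dim] = boundary_loc      # projections of the runs
    x_proj = x_new.copy(); x_proj[boundary_dim] = boundary_loc         # projection x^K of x
    X_aug = np.vstack([X_run, X_proj, x_proj])
    y_aug = np.concatenate([D, boundary_value(X_proj), boundary_value(x_proj[None, :])])
    return X_aug, y_aug

# Usage (illustrative): fit any standard GP package to (X_aug, y_aug) and predict at x_new;
# this reproduces the sequential update (25)-(27) at the cost of a (2n+1) x (2n+1) inversion.
# gp = fit_gp(X_aug, y_aug)          # placeholder for the user's black box emulator
# print(gp.predict(x_new))
```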

3.4 Updating by Two Perpendicular Known Boundaries

Given the above results, we now proceed to discuss the update of the emulator by a second known boundary, K2. In the first case, discussed here, K2 is assumed perpendicular to the first boundary, now denoted K1, and in the second case, discussed in section 3.5, K2 is parallel to K1. Detailed derivations of the results presented can be found in appendices A and B, and the key results for the single and dual-boundary cases are summarised in Table 1, for ease of comparison.

First, we assume the second known boundary K2 is a d−1 dimensional hyperplane, perpendicular to the x_2 direction, as illustrated in Figure 3 (left panel). Our goal is to update the emulator for f(x), by our knowledge of the function’s behaviour on both boundaries K1 and K2, and subsequently by a set of runs D within X. Thus we must find E_{D∪K2∪K1}[f(x)] and Var_{D∪K2∪K1}[f(x)]. We do this sequentially by analytically updating by K1 followed by K2, then numerically by D.

     
Figure 3: Left panel: two perpendicular known boundaries. Right panel: two parallel known boundaries. In both cases x and x' are the points of interest for the emulation calculation, while x^{K1} and x'^{K1} are their orthogonal projections onto the known boundary K1, and x^{K2} and x'^{K2} their orthogonal projections onto the known boundary K2. The y^(j) and z^(j) represent a large number of points on the boundaries K1 and K2 respectively for which we can evaluate f analytically.

As before, assume that f(x) is analytically solvable, and hence inexpensive to evaluate, along K2, permitting a large but finite number, m, of evaluations on K2, denoted z^(1), …, z^(m). As in Section 3.1, we define the corresponding length m+1 vector of boundary values as

K2 = ( f(x^{K2}), f(z^(1)), …, f(z^(m)) )^T,    (28)

which includes the projection x^{K2} of x onto K2. An analogous proof to that of equation (11) gives

Cov_{K1}[ f(x^{K2}), K2 ] Var_{K1}[K2]^{-1} = ( 1, 0, …, 0 ),    (29)

while as the product correlation structure is not disturbed by the update by K1, we also have

Cov_{K1}[ f(x), K2 ] = r_2(b) Cov_{K1}[ f(x^{K2}), K2 ],    (30)

where b is the perpendicular distance from x to K2 and r_2 is the correlation function in the direction perpendicular to K2, as shown in Figure 3 (left panel). Using (29) and (30) the expectation of f(x) adjusted by K1 then K2 can now be calculated using the sequential update (25), giving

E_{K2∪K1}[ f(x) ] = E[ f(x) ] + r_1(a) ( f(x^{K1}) − E[ f(x^{K1}) ] ) + r_2(b) ( f(x^{K2}) − E[ f(x^{K2}) ] ) − r_1(a) r_2(b) ( f(x^{K1K2}) − E[ f(x^{K1K2}) ] ),    (32)

where we have also used equation (15) for E_{K1}[ f(x^{K2}) ], with a the perpendicular distance from x to K1, and denoted the projection of x^{K2} onto K1 as x^{K1K2}, which is just the perpendicular projection of x^{K1} onto K2. An expression for the covariance adjusted by K1 then K2 is obtained by a similar argument (see appendix A)

Cov_{K2∪K1}[ f(x), f(x') ] = Cov_{K1}[ f(x), f(x') ] − r_2(b) r_2(b') Cov_{K1}[ f(x^{K2}), f(x'^{K2}) ]    (33)
                           = σ² { r_1(a − a') − r_1(a) r_1(a') } { r_2(b − b') − r_2(b) r_2(b') } R_{K1K2}(x, x'),    (34)

where b' is the perpendicular distance of x' from K2, and we have defined the correlation function of the projection of x and x' onto K1 ∩ K2 as

R_{K1K2}(x, x') = ∏_{i=3}^{d} r_i( x_i − x'_i ).    (35)

The updated variance is trivially obtained by setting x' = x to get

Var_{K2∪K1}[ f(x) ] = σ² ( 1 − r_1(a)² ) ( 1 − r_2(b)² ).    (36)

As a consistency check, we see that all three expressions (32), (34) and (36) are invariant under interchange of the two boundaries, represented as the transformation a ↔ b, r_1 ↔ r_2 and x^{K1} ↔ x^{K2}, as they should be. They also exhibit intuitive limiting behaviours as the distances from the boundaries tend to 0 or ∞ (see appendix A). Again, we observe that were we to sequentially update by a further n evaluations, D, and calculate E_{D∪K2∪K1}[f(x)] and Var_{D∪K2∪K1}[f(x)], the only points we require for sufficiency are D and the projections of x and D onto K1, K2 and K1 ∩ K2. This represents only 4n + 3 points, which is far fewer than the full set of boundary evaluations (with m extremely large) that we started with. Again, users of black box emulators can easily insert these points, at the cost of having to invert a matrix now of size (4n+3) × (4n+3), instead of a single inversion of size n × n were they to encode the above analytic results directly.
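A sketch of the two-perpendicular-boundary update, implementing the expectation (32) and variance (36) as given above; the boundary locations x_1 = 0 and x_2 = 0, the stand-in boundary function and the parameter values are assumptions for the example.

```python
import numpy as np

theta, sigma2, beta = 0.4, 1.0, 0.0
r = lambda t: np.exp(-(t / theta) ** 2)      # 1-d Gaussian correlation, used for both r_1 and r_2

def f_known(x):
    """Stand-in simulator, assumed analytically solvable on K1 = {x_1 = 0} and K2 = {x_2 = 0}."""
    return np.sin(2 * np.pi * x[1]) * np.cos(np.pi * x[0]) * np.exp(-x[0] - x[1])

def kbe_two_perpendicular(x):
    """Expectation (32) and variance (36) adjusted by both perpendicular boundaries."""
    a, b = x[0], x[1]                              # distances to K1 and K2
    xK1  = np.array([0.0, x[1]])                   # projection onto K1
    xK2  = np.array([x[0], 0.0])                   # projection onto K2
    xK12 = np.array([0.0, 0.0])                    # double projection onto K1 ∩ K2
    E = (beta + r(a) * (f_known(xK1) - beta)
              + r(b) * (f_known(xK2) - beta)
              - r(a) * r(b) * (f_known(xK12) - beta))
    Var = sigma2 * (1.0 - r(a) ** 2) * (1.0 - r(b) ** 2)
    return E, Var

for x in [np.array([0.0, 0.7]), np.array([0.3, 0.0]), np.array([0.5, 0.5])]:
    print(x, kbe_two_perpendicular(x))   # reproduces f on either boundary, with zero variance there
```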

An example of an emulator updated by two perpendicular known boundaries is shown in Figure 4(a)-(c), which give the emulator expectation, standard deviation and diagnostics respectively, for the simple function introduced in section 3.2. A second known boundary K2, perpendicular to K1, is added, along which the simulator behaviour is also known. As expected, we see the emulator expectation agrees exactly with the behaviour of the simulator on K1 and K2 (as given in Figure 2(b)). We note also the intuitive property that the variance of the emulator reduces to zero as we approach either boundary, but remains at σ² when we are sufficiently distant from both. This sensibly represents the increase in knowledge about the simulator behaviour the closer we are to K1 or K2. Diagnostics are again acceptable.

Figure 4: Emulators updated by two boundaries K1 and K2. Top row, panels (a)-(c): perpendicular boundaries. Bottom row, panels (d)-(f): parallel boundaries. Panels (c) and (f) show emulator diagnostics.

3.5 Updating by Two Parallel Known Boundaries

Consider now a second boundary K2, located at a fixed value of x_1 and therefore parallel to the original boundary K1. As updating by K1 leaves the correlation structure still in product form, critically with respect to the x_1 term, we can still perform a subsequent analytic update by K2. We define K2 as before by (28), and denote the distance from the point x to its perpendicular projection, x^{K2}, onto K2 as c (see Figure 3, right panel), where now both x^{K1} and x^{K2} differ from x only in the x_1 direction. See appendix B for more details of the following derivations.

In this parallel case, analogous results to (30) and (29) can be derived, which combine to give

Cov_{K1}[ f(x), K2 ] Var_{K1}[K2]^{-1} = R_1(a, c) ( 1, 0, …, 0 ),    (37)

where R_1(a, c) = r_1^{K1}(a, a + c) / r_1^{K1}(a + c, a + c), with a + c the perpendicular distance between the two parallel boundaries, and r_1^{K1} is the updated correlation component of (19) for the boundary K1. Inserting into equation (25), the emulator expectation adjusted by both K1 and K2 can be shown to be (see appendix B):

E_{K2∪K1}[ f(x) ] = E_{K1}[ f(x) ] + R_1(a, c) ( f(x^{K2}) − E_{K1}[ f(x^{K2}) ] ),    (38)

where we have exploited the fact that the projection of x^{K2} onto K1 is just x^{K1}. Similarly, the covariance adjusted by K1 and K2 can be shown to be (see appendix B):