The use of mathematical models to describe complex physical systems is now commonplace in a wide variety of scientific disciplines. We refer to such models as simulators. Often they possess high numbers of input and/or output dimensions, and are sufficiently complex that they may require substantial time for the completion of a single evaluation. The simulator may have been developed to aid understanding of the real world system in question, or to be compared to observed data, necessitating a high-dimensional parameter search or model calibration, or to make predictions of future system behaviour, possibly with the goal of aiding a future decision process. The responsible use of simulators in all of the above contexts usually requires a full (Bayesian) uncertainty analysis , which will aim to incorporate all the major relevant sources of uncertainty, for example, parametric uncertainty on the inputs to the model, observation uncertainties on the data, structural model discrepancy that represents the uncertain differences between the simulator and the real system, both for past and future and the relation between them, and also all the many uncertainties related to the decision process .
However, such an uncertainty analysis, which represents a critically important part of any serious scientific study, usually requires a vast number of simulator evaluations. For complex simulators with even a modest runtime, this means that such an analysis is utterly infeasible. A solution to this problem is found through the use of emulators: an emulator is a statistical construct that seeks to mimic the behaviour of the simulator over its input space, but which is several orders of magnitude faster to evaluate. Early uses of Gaussian process emulators for computer models were given by [35, 13] with a more detailed account given in 
. A vital feature of an emulator is that it gives both an expectation of the simulator’s outputs at an unexplored input location as well as an uncertainty statement about the emulator’s accuracy at this point, an attribute that elevates emulation above interpolation or other simple proxy modelling approaches. Therefore, emulators fit naturally within a Bayesian approach, and help facilitate the full uncertainty analysis described above. For an early example of an uncertainty analysis using multilevel emulation combined with structural discrepancy modelling in a Bayesian history matching context see[9, 10], and for a fully Bayesian calibration of a complex nuclear radiation model, again incorporating structural model discrepancy, see .
Emulators have now been successfully employed across several scientific disciplines, including cosmology [39, 40, 8, 37, 20, 41, 31], climate modelling [44, 34, 25, 22] (the later employing emulation to increase the efficiency of an Approximate Bayesian Computation algorithm), epidemiology [2, 3, 1, 30, 29], systems biology [43, 24], oil reservoir modelling [11, 12], environmental science , traffic modelling , vulcanology  and even to Bayesian analysis itself . The development of improved emulation strategies therefore has the potential to benefit multiple scientific areas, allowing more accurate analyses with lower computational cost. If additional prior insight into the physical structure of the model is available, it is of real importance that emulator structures capable of incorporating such insights have been developed to fully exploit this information.
Here we describe such an advance in emulation strategy that, when applicable, can lead to substantial improvements in emulator performance. In most cases, complex deterministic simulators have to be solved numerically for arbitrary input specifications, which leads to substantial runtimes. However, for some simulators, there exist input parameter settings, lying possibly on boundaries or hyperplanes in the input parameter space, where the simulator can be solved far more efficiently, either analytically in the ideal case or just significantly faster using a much more efficient and simpler solver. This may be due to the system in question, or at least a subset of the system outputs, behaving in a much simpler way for particular input settings, possibly due for example to various modules decoupling from more complex parts of the model (possibly when certain inputs are set to zero, switching some processes off). Note that this leads to Dirichlet boundary conditions, i.e. known simulator behaviour on various hyperplanes, that impose constraints on the emulator itself, and that these are distinct from the boundary conditions imposed on the physical simulator model (as for example, analysed approximately using KL expansions by ). The goal then, is to incorporate these known boundaries, situated where we essentially know the function output, into the Bayesian emulation process, which should lead to significantly improved emulators. We do this by formally updating the emulators by the information contained on the known boundaries, obtaining analytic results, and show that this is possible for a large class of emulators, for multiple boundaries of various forms, and most importantly, for trivial extra computational cost. We also detail how users can include known boundaries when using standard black box Gaussian Process software (without altering the core code), although this method is less powerful than implementing the fully updated emulators that we develop here. We then analyse the design problem of how to choose an efficient set of runs of the full simulator, given that we are aware of the existence of one or more known boundaries. Finally we apply this approach to a model of hormonal crosstalk in Arabidopsis, an important model in systems biology, which possesses these features.
The article is organised as follows. In section 2 we describe a standard emulation approach for deterministic simulators. In section 3 we develop the full known boundary emulation (KBE) methodology, explicitly constructing emulators that have been updated by one or two perpendicular (or parallel) known boundaries. In section 4 we discuss the design problem of how to choose efficient sets of simulator runs. In section 5 we apply both the KBE and the design techniques to the systems biology Arabidopsis model, before concluding in section 6.
2 Emulation of Complex Computer Models
We consider a complex computer model represented as a function , where denotes a
-dimensional vector containing the computer model’s input parameters, andis a pre-specified input parameter space of interest. We imagine that a single evaluation of the computer model takes a substantial amount of time to complete, and hence we will only be able evaluate it at a small number of locations. Here we assume is univariate, but the results we present should directly generalise to the corresponding multivariate case.
We represent our beliefs about the unknown at unevaluated input via an emulator. For now, we assume that the form of the emulator is that of a pure Gaussian process (or in a less fully specified version, a weakly second order stationary stochastic process):
We make the judgement, consistent with most of the computer model literature, that the have a product correlation structure:
with , corresponding to deterministic . Product correlation structures are very common, with the most popular being the Gaussian correlation structure given by:
which can be generalised to have different in each direction, while maintaining the product structure. As usual, we will also assume stationarity, but the following derivations do not require this assumption.
If we perform a set of runs at locations over the input space of interest , giving computer model outputs as the column vector , then we can update our beliefs about the computer model in light of
. This can be done either using Bayes theorem (ifis assumed to be a Gaussian process) or using the Bayes linear update formulae (which following DeFinetti , treats expectation as primitive, and requires only a second order specification [15, 17]):
where , and
are the expectation, variance and covariance ofadjusted by [15, 17]. The fully Bayesian calculation, using Bayes theorem, would yield similar update formulae for the analogous posterior quantities. Although we will work within the Bayes linear formalism, the derived results will apply directly to the fully Bayesian case, were one willing to make the additional assumption of full normality that use of a Gaussian process entails. In this case all Bayes linear adjusted quantities can be directly mapped to the corresponding posterior versions e.g. and . See [15, 17] for discussion of the benefits of using a Bayes linear approach, and [39, 40] for its benefits within a computer model setting.
The results presented in this article rely on the product correlation structure of the emulator. As such, expansion of these methods to more general emulator forms requires further calculation. For example, a more advanced emulator is given by [10, 39]:
where the active inputs are a subset of that are strongly influential for , the first term on the right hand side is a regression term containing known functions and possibly unknown , is a Gaussian process over the active inputs only, and is an uncorrelated nugget term, representing the inactive variables. See also  and [39, 40] for discussions of the benefits of using an emulator structure of this kind, and see [26, 21] for discussions of alternative structures. We will discuss the generalisation of our results to equation (7) in section 6, but currently we note that if the regression parameters are assumed known, perhaps due to sufficiently large run number, and if all variables are assumed active, then equation (7) reduces to the required form, and all our results will apply.
3 Known Boundary Emulation
We now consider the situation where the computer model is analytically solvable on some lower dimensional boundary . Hence we can evaluate a vast number of times on , and use these to supplement our standard emulator evaluations over to produce an emulator that respects the functional behaviour of along . We first examine the case of finite (but large) , which can be analysed using the standard Bayes linear update, but structure our calculations so that they can be simply generalised to continuous model evaluations on , which will require a generalised version of the Bayes linear update, as described in section 3.6.
Call the corresponding length vector of model evaluations . Unfortunately simply plugging these runs into the Bayes Linear update equations (4), (5) and (6), replacing with , would be infeasible due to the size of the matrix inversion . For example, if the dimension of is not small, we may need to be extremely large (billions or trillions say) to capture all the information contained in . Hence a direct update of the emulator in light of the information in is non-trivial. Here we show from first principles that this update can be performed analytically for a wide class of emulators. We do this by exploiting a sufficiency argument briefly described in the supplementary material of , and in , but which, to our knowledge, has not been fully explored or utilised in the context of known boundary emulation. The emulation problem is further compounded when we have both evaluations on the boundary, and in the main bulk defined as above. For this case we will develop a sequential update that first updates analytically by to obtain , and , and then subsequently updates by , as is discussed in Section 3.3.
3.1 A Single Known Boundary
We wish to update the emulator, and hence our beliefs about , at the input point in light of a single known boundary , where is a dimensional hyperplane perpendicular to the direction (but we note that our results naturally extend to lower dimensional boundaries). To capture the simulator behaviour along , we evaluate at a large number, , of points on which we denote . We also evaluate the perpendicular projection of the point of interest, , onto the boundary , which we denote as . We therefore extend the collection of boundary evaluations, , to be the column vector
which is illustrated in Figure 1 (left panel).
We start by examining the expression for
As noted above, this calculation is seemingly infeasible due to the term. However, if we evaluate it at the point itself, which lies on , and as we have evaluated , we must find, for the emulator of a smooth deterministic function with suitably chosen correlation structure, that (and that ). This is indeed the case as can be seen by examining the structure of the term. As is included as the first element of , we note that
is the identity matrix of dimension. Taking the first row of equation (10) gives
Substituting equation (11) into the adjusted mean and variance naturally gives and as it must. Whilst unsurprising, this simple result is of particular value when considering the behaviour at the point of interest, . As we have defined as the perpendicular projection of onto , we can write , for some constant . Now we can exploit the symmetry of the product correlation structure (2), to obtain the following covariance expressions
since for and . Furthermore,
Thus we have eliminated the need to explicitly invert the large matrix entirely by exploiting the symmetric product correlation structure and the identity (11). Similarly, we find the adjusted variance using equations (5), (11) and (14),
Equations (15) and (16) give the expectation and variance of the emulator at a point , updated by a known boundary . As they require only evaluations of the analytic boundary function and the correlation function they can be implemented with trivial computational cost in comparison to a direct update by . Note that they critically rely on the projected point being in .
Finally, we consider the Bayes linear update for the covariance between and a second input point given the boundary . We define the orthogonal projection of onto as , and denote its perpendicular distance from as , as shown in Figure 1 (right panel). Equation (6) now gives
where we have defined the correlation function of the projection of and onto as
and defined the ‘updated correlation component’ in the direction as
These expressions for the expectation and (co)variance updated by the information at the simulator boundary provide several insights:
Sufficiency: for the updating of our beliefs about the emulator, we see that is sufficient for . Hence, only the evaluation is required and the evaluations are redundant (note that under an assumption of an underlying Gaussian process, this result corresponds to a conditional independence statement discussed in the supplementary material to ). This has important ramifications for users of black box GP packages, as we discuss in section 3.3.
The correlation structure is now no longer stationary: the contribution to the correlation function from dimensions 2 to , denoted , is unchanged by the update (as we would expect from symmetry arguments), however the contribution in the direction depends on the distance to the boundary , through , breaking stationarity.
The correlation structure is still in product form: Critically, as the correlation structure has maintained its product from, this suggests that we can update by further known boundaries, either perpendicular to any of the remaining inputs , with , hence perpendicular to or indeed by a second boundary parallel to . We perform these updates in sections 3.4 and 3.5.
Intuitive limiting behaviour: As we move towards , the emulator tends towards the known boundary function, and as we move away from the emulator reverts to its prior form, as expected:
as . Similarly, the behaviour of is as expected, tending to its prior form far from the boundary (with finite), and to zero as either and tend to zero:
3.2 Application to a 2-Dimensional Model
For illustration, we consider the problem of emulating the 2-dimensional function
defined over the region given by , , where we assume a known boundary at , and hence have that . The true output of over is given in Figure 1(b) for reference.
shows the adjusted emulator standard deviationand demonstrates the significant reduction in emulator uncertainty near . Finally, Figure 1(d) shows simple emulator diagnostics over of the form of the standardised values . Thus any values of for which was far from (a typical choice being ) would indicate a conflict between emulator and simulator (see  for details). For our boundary-adjusted emulator, the standardised diagnostics all maintain modest values lying well within standard deviations giving no cause for concern.
3.3 Updating by further model evaluations
Most importantly, as we have analytic expressions for , and we are now able to include additional simulator evaluations into the emulation process. To do this, we perform (expensive) evaluations, , of the full simulator across , and use these to supplement the evaluations, , available on the boundary. We want to update the emulator by the union of the evaluations and , that is to find , and . This can be achieved via a sequential Bayes Linear update:
where we first update our emulator analytically by , and subsequently update these quantities by the evaluations . As typically is small due to the relative expense of evaluating the full simulator, these calculations will remain tractable, as will be feasible for modest .
Not only will the known boundary improve the accuracy of the emulator compared to just updating by , for only trivial computational cost, it will also allow us to design a more informative set of runs that constitute . We discuss appropriate designs for this scenario in section 4.
3.3.1 Incorporating Known Boundaries into Black Box Emulation Packages
Consideration of the form of the sequential update given by equations (25)-(27), combined with the sufficiency argument presented in section 3.1, shows that for the full joint update by , a sufficient set of points is composed of a) the points in , b) the points formed from the projection of onto the boundary , and c) the projection of the point of interest , giving a total of points. This has ramifications for users of black box Gaussian process emulation packages (such as BACCO  or GPfit  in R, or GPy  for Python), which perhaps cannot be easily recoded to use the more sophisticated analytic emulation formula of equations (15) and (16), but for which the inclusion of extra simulator evaluations is trivial. Hence such a user simply has to add the extra projected points on to their usual set of runs, and their black box GP package will produce results that precisely match equations (25)-(27). This however will require inverting a matrix of size and hence will be slower than directly using the above analytic results, which only require inverting a matrix of size .
3.4 Updating by Two Perpendicular Known Boundaries
Given the above results, we now proceed to discuss the update of the emulator by a second known boundary, . In the first case, discussed here, is assumed perpendicular to , and in the second case, discussed in section 3.5, is parallel to . Detailed derivations of the results presented can be found in appendices A and B, and the key results for single and dual-boundaries are summarised in Table 1, for ease of comparison.
First, we assume the second known boundary is a dimensional hyperplane, perpendicular to the direction, as illustrated in Figure 3 (left panel). Our goal is to update the emulator for , by our knowledge of the function’s behaviour on both boundaries and , and subsequently by a set of runs within . Thus we must find and . We do this sequentially by analytically updating by followed by , then numerically by .
As before, assume that the is analytically solvable and hence inexpensive to evaluate along , permitting a large but finite number, , of evaluations on , denoted . As in Section 3.1, we define the corresponding length vector of boundary values as
which includes the projection of onto . An analogous proof to that of equation (11) gives
while as the product correlation structure is not disturbed by the update by , we also have
where is the perpendicular distance from to and is the correlation function in the perpendicular direction to , as shown in Figure 3 (left panel). Using (29) and (30) the expectation of adjusted by then can now be calculated using the sequential update (25) giving
where we have also used equation (15) for , defined and denoted the projection of onto as , which is just the perpendicular projection of onto . An expression for the covariance adjusted by then is obtained by a similar argument (see appendix A)
where we have defined the correlation function of the projection of and onto as
The updated variance is trivially obtained by setting to get
As a consistency check, we see that all three expressions (32), (34) and (36) are invariant under interchange of the two boundaries, represented as the transformation and , as they should be. They also exhibit intuitive limiting behaviours as the distance from the boundary tends to 0 or (see appendix A). Again, we observe that were we to sequentially update by a further evaluations, , and calculate and , the only points we require for sufficiency are and the projections of and onto , , and . This represents only points, which is far fewer than the points (with extremely large) that we started with. Again, users of black box emulators can easily insert these points, at the cost of having to invert a matrix now of size , instead of a single inversion of size were they to encode the above analytic results directly.
An example of an emulator updated by two perpendicular known boundaries is shown in Figures 3(a)– 3(c), which give , and respectively, for the simple function introduced in section 3.2. A second known boundary is now located at , where we know that . As expected, we see the emulator expectation agrees exactly with the behaviour of the simulator on and (as given in Figure 1(b)). We note also the intuitive property that the variance of the emulator reduces to zero as we approach the boundary, but remains at when we are sufficiently distant. This sensibly represents the increase in knowledge about the simulator behaviour the closer we are to or . Diagnostics are again acceptable.
3.5 Updating by Two Parallel Known Boundaries
Consider now a second boundary located at , that is therefore parallel to the original boundary at . As updating by leaves the correlation structure still in product form, critically with respect to the term, we can still perform a subsequent analytic update by . We define as before by (28), and denote the distance from point to its perpendicular projection, , onto as , (see Figure 3, right panel), where now . See appendix B for more details of the following derivations.
where we have exploited the fact that the projection of onto is just . Similarly, the covariance adjusted by and can be shown to be (see appendix B):