Using Bayes Linear Emulators to Analyse Networks of Simulators

10/17/2019 · Samuel E. Jackson, et al.

The key dynamics of processes within physical systems are often represented in the form of computer models, or simulators. Often, the overarching system of interest, comprising multiple processes, may be represented as a network of simulators, with some of the inputs to some simulators arising from the outputs of other simulators. Much of the computational statistics literature focusses on approximating computationally intensive simulators using statistical emulators. An important, yet underexplored question in this literature is: can emulating the individual simulators within a network be more powerful than emulating the composite simulator network as a single simulator? In this article we present two novel approaches for linking Bayes linear emulators of several component simulators together to obtain an approximation for the composite simulator, comparing these ideas to approximating the composite simulator using a single emulator. These techniques, termed the Bayes linear sampling approach and the Uncertain Input Bayes linear emulator, are demonstrated on a couple of illustrative simulated examples, as well as being applied to an important dispersion dose-response chain of simulators used for disease modelling.


1 Introduction

Physical processes of scientific interest are often represented in the form of computer models, or simulators. Such simulators aim to represent the key kinetics and dynamics of a physical system using, for example, sets of differential equations. Such a representation can aid the understanding of the behaviour of the physical system itself. Scientists are frequently interested in complex physical systems which can be thought of in terms of simpler component processes. In this case, it is common for a simulator of the overall physical system to be composed of a chain or network of less complex simulators, each representing one of these component processes. For example, the disease modelling application of this article involves an atmospheric dispersion model of a biological agent linked to a dose-response model, representing how the population will respond to contact with certain doses of the agent. The chaining of simulators also exists in many other important applications, including climate modelling [36] and volcanic eruption modelling [2].

For any individual component simulator, there are various sources of uncertainty arising as a result of the modelling procedure [15]. In particular, there is often substantial uncertainty resulting from uncertain knowledge of simulator behaviour across the vast majority of the input space, since the computational intensity of running the simulator even once prohibits arbitrarily large numbers of runs. Powerful analytical tools, such as emulators, are often used to aid the analysis of these computationally intensive simulators. Bayes linear emulators are fast statistical approximations of simulators, built using a set of simulator runs across the input space. These emulators provide an expected value of the simulator output at any input, along with a corresponding uncertainty estimate reflecting our beliefs about the uncertainty in the approximation [13, 39]. The key advantage of these emulators is their computational efficiency, allowing them to be used as surrogates for the simulators themselves for the purposes of inference and analysis.

An important, yet rather underexplored, question in the literature is: can combining emulators for the individual simulators within a network be more powerful than emulating the network as a single composite simulator? In this article we present two novel approaches for linking Bayes linear emulators of several component simulators together to obtain an approximation for the composite simulator, comparing these ideas to approximating the composite simulator using a single emulator. We demonstrate the discussed techniques on an illustrative simulated example throughout the paper, before applying them to a scientifically relevant chain of simulators used in epidemiology.

The epidemiological chain that we consider consists of an Anthrax dispersion model linked through to a dose-response model (based on [19]), which, as illustrated by the graphical representation of Figure 1, can be represented by a composite simulator f. The dispersion model simulates the dispersion behaviour of a biological agent that has been released across a specific region. Input parameters of interest for this simulator correspond to the physical quantities wind speed, wind direction and source mass. Since interest in this article is in linking simulators together as a network, for simplicity we take the output of the dispersion model to be the dose of the agent at a single specific location, although dose across a spatial domain could also be incorporated. The dose-response simulator takes dose as input and outputs a resulting proportion of casualties. Whilst the two aspects of this overall epidemiological system are modelled by two different groups of experts, there is overarching interest in the effect of the release conditions on the proportion of casualties.

Figure 1: Graphical representation of the dispersion dose-response network of simulators, where f1 represents the dispersion model and f2 represents the dose-response model. The nodes of the graph correspond to wind speed, wind direction, source mass, intervention policy, dose and the proportion of casualties.

1.1 Bayes Linear Statistics

In this article, we focus on the Bayes linear approach to analysis (and hence emulation; see Section 1.2). The Bayes linear approach [20, 32, 13, 16] to statistical inference takes expectation as primitive, following De Finetti [8, 9, 40], and deals with second-order belief specifications (that is, expectations, variances and covariances) of observable quantities. Probabilities can be represented as the expectation of the corresponding indicator function when required. More precisely, suppose that there are two collections of random quantities, B and D. Bayes linear analysis involves updating subjective beliefs about B given observation of D. In order to do so, prior mean vectors and covariance matrices for B and D (that is, E(B), E(D), Var(B) and Var(D)), along with a covariance matrix between B and D (that is, Cov(B, D)), must be specified. Second-order beliefs about B can be adjusted in the light of D using the Bayes linear update formulae:

E_D(B) = E(B) + Cov(B, D) Var(D)^(-1) (D − E(D))    (1)
Var_D(B) = Var(B) − Cov(B, D) Var(D)^(-1) Cov(D, B)    (2)
Cov_D(B1, B2) = Cov(B1, B2) − Cov(B1, D) Var(D)^(-1) Cov(D, B2)    (3)

E_D(B) and Var_D(B) are termed the adjusted expectation and variance of B given D [16]. Cov_D(B1, B2) is termed the adjusted covariance of B1 and B2 given D, where B1 and B2 are subcollections of B. For a more detailed overview of Bayes linear methods, see [13], and for a thorough treatment, see [16]. For a comparison of Bayes linear methods with the full Bayesian approach, see, for example, [16, 13, 38].
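To make the update concrete, the following is a minimal numerical sketch (our own illustrative code, not code from the paper) of the adjustment in Equations (1)-(3), using NumPy; the toy prior values are purely hypothetical.

```python
import numpy as np

def bl_adjust(E_B, var_B, E_D, var_D, cov_BD, d):
    """Bayes linear adjustment of beliefs about B given an observed value d of D."""
    # Solve against Var(D) rather than forming an explicit inverse.
    update = np.linalg.solve(var_D, d - E_D)
    E_adj = E_B + cov_BD @ update                                 # Equation (1)
    var_adj = var_B - cov_BD @ np.linalg.solve(var_D, cov_BD.T)   # Equations (2)-(3)
    return E_adj, var_adj

# Toy second-order prior specification for scalar B and D.
E_B, var_B = np.array([0.0]), np.array([[1.0]])
E_D, var_D = np.array([0.0]), np.array([[2.0]])
cov_BD = np.array([[0.8]])
print(bl_adjust(E_B, var_B, E_D, var_D, cov_BD, d=np.array([1.5])))
```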

1.2 Emulation

We consider a simulator represented by the function f, which takes a vector of input parameters x and outputs a vector f(x). We choose to represent our beliefs about the behaviour of any scalar simulator output component f(x), for any input x, in the following form [14]:

f(x) = g(x)^T β + u(x)    (4)

where g(x) is a vector of known basis regression functions of x, β is a vector of unknown regression coefficients, and u(x) is a second-order weakly stationary stochastic process.

Suppose F = (f(x^(1)), …, f(x^(n))) represents simulator output evaluated at simulator input locations x^(1), …, x^(n). We can adjust a second-order prior belief specification about f(x) using the Bayes linear update Equations (1)-(3) to obtain an emulator prediction for simulator output at a new x, given by

E_F[f(x)] = E[f(x)] + Cov[f(x), F] Var[F]^(-1) (F − E[F])    (5)

along with a measure of the uncertainty in the prediction, given by

Var_F[f(x)] = Var[f(x)] − Cov[f(x), F] Var[F]^(-1) Cov[F, f(x)]    (6)

In Equations (5) and (6), the notation E_F[·] and Var_F[·] reflects the fact that we have adjusted our prior beliefs about f(x) by the simulator runs F. The terms on the right-hand side correspond to the simulator runs F and a second-order prior belief specification for f across x, which can be derived from the assumed form given by Equation (4) and a second-order prior belief specification across the collection {β, u(x)}. Prior specification, as for a full Bayesian analysis, is not trivial. It is common, however, to assume E[u(x)] = 0 and Cov[β, u(x)] = 0, along with a covariance structure between the values of u(x) and u(x′) at two inputs x and x′ of the following form:

Cov[u(x), u(x′)] = σ² c(x, x′)    (7)

where c(x, x′) represents a stationary correlation function, for example

c(x, x′) = exp{ − Σ_k (x_k − x′_k)² / θ_k² }    (8)

The form of Equation (8) is known as the Gaussian correlation function, and involves specification of the variance and correlation length parameters σ² and θ.
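As an illustration of Equations (4)-(8), the sketch below (our own illustrative code, not the authors') builds a simple one-dimensional Bayes linear emulator with linear basis g(x) = (1, x) and a Gaussian correlation function; the prior values for E(β), Var(β), σ² and θ, and the stand-in "simulator", are arbitrary assumptions.

```python
import numpy as np

def g(x):                                   # basis functions g(x) = (1, x)
    return np.array([1.0, float(x)])

def corr(x, xp, theta):                     # Gaussian correlation, Equation (8)
    return np.exp(-((x - xp) ** 2) / theta ** 2)

def bl_emulator(X, f_X, E_beta, V_beta, sigma2, theta):
    """Return a predictor giving (adjusted mean, adjusted variance) of f at a new x,
    via the Bayes linear update (5)-(6) with prior covariances from (4), (7) and (8)."""
    G = np.array([g(x) for x in X])                            # n x q regression matrix
    C = np.array([[corr(a, b, theta) for b in X] for a in X])  # residual correlation matrix
    var_F = G @ V_beta @ G.T + sigma2 * C                      # prior Var(F)
    resid = np.linalg.solve(var_F, f_X - G @ E_beta)
    var_F_inv = np.linalg.inv(var_F)
    def predict(x):
        gx = g(x)
        cov_xF = gx @ V_beta @ G.T + sigma2 * np.array([corr(x, b, theta) for b in X])
        mean = gx @ E_beta + cov_xF @ resid                            # Equation (5)
        var = gx @ V_beta @ gx + sigma2 - cov_xF @ var_F_inv @ cov_xF  # Equation (6)
        return mean, var
    return predict

# Illustrative usage with assumed hyperparameter values and a stand-in function.
X = np.linspace(0, 1, 8)
emulate = bl_emulator(X, np.sin(3 * X), E_beta=np.zeros(2),
                      V_beta=100 * np.eye(2), sigma2=1.0, theta=0.3)
print(emulate(0.37))
```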

As already discussed, we choose to focus on the Bayes linear approach to emulation. In the literature, it is common to instead assume normal and Gaussian process priors for β and u(x) in Equation (4) [27, 4, 25], thus resulting in Gaussian process emulation, otherwise known as kriging [26, 34]. In this case, the resulting Bayesian update equations are practically similar to Equations (5) and (6) presented above; methodologically, however, they involve additional distributional assumptions that we may prefer not to make, nor want any consequential inference or decisions to rest upon. Therefore, whilst carefully specified distributions may allow a more detailed Bayesian analysis of uncertain parameters, there are times when it is preferable to restrict ourselves to a second-order specification. It is for this reason that we present novel approaches to linking Bayes linear emulators for networks of simulators in this article. It is also worth noting that various results in the literature, such as Chebyshev’s inequality [3] or Pukelsheim’s rule [33] (at least 95% of the probability mass of any unimodal continuous distribution lies within 3 standard deviations of the mean), facilitate making inferential statements about the simulator, and hence the system, of interest when a second-order belief specification has been the basis for analysis.

1.3 Outline

The remainder of this article is outlined as follows. Section 2 formally introduces the notion of linking simulators together in a network. Section 3 introduces two novel approaches to linking Bayes linear emulators of each individual simulator in a network, namely the Bayes linear sampling approach and the Uncertain Input Bayes linear emulator. Section 4 demonstrates and compares these novel approaches to emulating the application chain of simulators introduced in Section 1 with the direct emulation approach. Whereas much of the article focusses on a relatively simple composite simulator formed of two individual simulators, Section 5 demonstrates the methodology on a more complicated network of simulators. The conclusions of our research are discussed in Section 6.

2 Networks of Simulators

Linking of simulators exists in many areas of science, including epidemiology and seismic activity modelling [24]. Such a network of simulators can be thought of as a single simulator made up of several modules [23]: some of the inputs to the simulators in the network are taken to be outputs of other simulators. More precisely, suppose we have a simulator which takes an input vector and generates an output vector; in a network of simulators, some of the input parameters of this simulator arise from the outputs of one or more other simulators. These networks of simulators may exist, for example, as a result of the different modules being constructed by different groups of scientific experts, but where overarching interest lies in the entire physical process described across the network. In this section, we define some notation for a generic network of two simulators linked together as a chain. We discuss some of the current approaches in the literature, before introducing the idea of emulating the entire network as a single simulator.

For the purposes of clarity, from here on we focus on perhaps the simplest of simulator networks: a chain of two simulators f1 and f2 forming a composite simulator f(x) = f2(f1(x)), taking input vector x and generating output f(x), as illustrated in Figure 2. In addition, we have dropped the bold vector notation for f1 and f2, essentially considering scalar-output functions for clarity. The methods discussed, however, generalise directly to more complicated networks of vector-output simulators; this is demonstrated in Section 5 on the network of simulators illustrated in Figure 7.

Figure 2: Graphical representation of a generic chain of models f(x) = f2(f1(x)).

Even for the simple network of Figure 2 there are many problems to be addressed, such as possible links and discrepancies between the output from f1 and the input to f2. Such links should likely be made with reference to the physical system properties associated with these quantities. We put such issues aside for the time being, focussing instead on the power of approximating networks of simulators by linking Bayes linear emulators of the modules, in comparison to using a single Bayes linear emulator to approximate the simulator network. Crucially, however, a new challenge arises from the fact that the output of the first simulator becomes uncertain, resulting in an uncertain input to the second simulator.

In relation to this, Kyzyurova et al. [28] propose coupling simulators by linking independently developed Gaussian process emulators of the simulators. Their motivation arose from potentially having separate training runs for the two simulators f1 and f2, prohibiting direct emulation of f (see Section 2.1). For certain Gaussian processes, they derive closed-form expressions for the overall mean and variance of the emulator chain, approximating the distribution by a normal distribution with equivalent mean and variance. The lack of a closed-form distribution, but availability of second-order posterior statistics, for their linked emulator has inspired our investigation into Bayes linear approaches to emulation in this context.

Firstly, use of the Bayes linear approach removes the requirement to assume that the output of the first emulator follows a particular distribution (specifically normal, Laplace or exponential). Secondly, in the methods that we present in this article, there is no requirement to restrict the correlation functions to power exponential correlation functions (with power parameter 1 or 2). Finally, it is not directly clear how the calculations in [28] generalise to larger networks of simulators. In contrast, our methods generalise directly, as discussed in Section 5. We feel that, if the assumptions required by [28] are deemed reasonable, and the chain only involves two simulators, then the precise results of [28] are appropriate for analysis. On the other hand, the methods presented in this article are more flexible, primarily as a result of working in a Bayes linear framework. The proposed approaches allow inferences to be made about a network of simulators without relying on distributional assumptions to which any consequential analysis may be sensitive.

We also note here the similarity between emulating networks of simulators using Gaussian processes and modelling using deep Gaussian processes [7, 10]. Deep Gaussian processes arise from belief networks about simulator behaviour based on Gaussian process mappings, such that layers of Gaussian process latent variables exist between the input and the output. Intermediate nodes act as inputs for the layer below and outputs for the layer above. Each layer adds model parameters and a regularisation challenge, arising from needing to choose the size of each latent layer. The latent variables are effectively marginalised out variationally (see, for example, [37]). The primary difference between the latent variables of a deep Gaussian process and the intermediate variables in a simulator network is that, in a network, the variables themselves represent physical system properties, which aid the construction and modelling of the emulators for each of the individual component processes. Using a deep Gaussian process in place of chaining individual simulators in a network is likely to discard this additional information. This does not mean that the ideas of deep Gaussian processes could not be applied to the individual simulators within a network, with these then being linked together using the methods that we present in this article.

2.1 Direct Emulation of f

The direct method of emulating f neglects the fact that f is composed of two modules. Such an approach relies on being able to run the composite simulator for any choice of input to the first simulator; that is, the training runs for f2 are at the output locations of the training runs for f1. In this case, we have a set of training runs:

F = ( f(x^(1)), …, f(x^(n)) ) = ( f2(f1(x^(1))), …, f2(f1(x^(n))) )    (9)

and use the standard Bayes linear update for f as given by Equations (5) and (6), namely

E_F[f(x)] = E[f(x)] + Cov[f(x), F] Var[F]^(-1) (F − E[F])    (10)
Var_F[f(x)] = Var[f(x)] − Cov[f(x), F] Var[F]^(-1) Cov[F, f(x)]    (11)

We demonstrate the direct emulation approach in the following example, which thus serves as an example of standard Bayes linear emulation methodology. This example will be used throughout the article to demonstrate the novel methods for emulating networks of simulators.

2.2 Illustrative Example

For the purposes of this example, we will consider the three functions:

(12)
(13)
(14)

over a domain of interest for the input to f1, and a domain of interest for the input to f2 roughly corresponding to the output domain of f1. These three functions are plotted in the top left, top right and bottom left panels of Figure 3 respectively. Note how chaining even simple functions together can lead to much more complex behaviour. We treat these three functions as computationally intensive in order to demonstrate the emulation techniques discussed throughout this article.

In this section, we focus on emulating simulator f directly. We represent our beliefs about the behaviour of f using the form given by Equation (4), with covariance structure given by Equations (7) and (8). We take the regression functions g(x) to be first-order polynomial functions of x.

We let the regression and covariance parameters for the emulator for f be denoted by β, σ² and θ. Specification of σ² and θ is in general non-trivial, and several methods are proposed in the literature. Ideally, this specification should be made a priori, although this is often difficult in practice. Alternatively, the parameters may enter into our belief network; however, this is also challenging and, owing to their not having a physical interpretation or “correct” value, somewhat meaningless. As a result, many pragmatic approaches are available for estimating σ² and θ from the data. For example, we can use maximum likelihood estimates if we are happy to specify distributions [1], we can use simple heuristics [38], or we can use predictive diagnostics, such as Leave-One-Out Cross-Validation (LOOCV) [5, 29]. However they are chosen, the resulting choices should be checked using rigorous diagnostics.

In this section, we fit σ² and θ via a LOOCV diagnostic prediction approach, following the guidelines suggested in [35] for when only small numbers of training runs are available. In addition, we specify vague prior beliefs for β (which results in the adjusted specification for β being the generalised least squares estimate of its mean and variance). Having specified prior beliefs about simulator f, we proceed to adjust these beliefs for any x in light of the simulator runs F using Equations (10) and (11), where the training design is chosen to be 8 equally spaced points over the input domain.
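One simple way to implement such a LOOCV approach is a grid search over candidate values of σ² and θ, scoring each pair by leave-one-out prediction error; the sketch below is an illustrative implementation under our own assumptions rather than the exact procedure of [35] (a coverage-based score comparing errors to the 3 standard deviation bounds could equally be used).

```python
import itertools
import numpy as np

def loocv_score(build_emulator, X, f_X, sigma2, theta):
    """Mean squared leave-one-out prediction error for a given (sigma2, theta) pair.
    `build_emulator(X, f_X, sigma2, theta)` should return a predictor x -> (mean, var)."""
    sq_errors = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        predict = build_emulator(X[keep], f_X[keep], sigma2, theta)
        mean, _ = predict(X[i])
        sq_errors.append((f_X[i] - mean) ** 2)
    return float(np.mean(sq_errors))

def fit_by_loocv(build_emulator, X, f_X, sigma2_grid, theta_grid):
    """Grid search returning the (sigma2, theta) pair with the smallest LOOCV score."""
    scores = {(s2, th): loocv_score(build_emulator, X, f_X, s2, th)
              for s2, th in itertools.product(sigma2_grid, theta_grid)}
    return min(scores, key=scores.get)

# Example, assuming a constructor such as the bl_emulator sketch of Section 1.2:
# sigma2, theta = fit_by_loocv(
#     lambda X, f, s2, th: bl_emulator(X, f, np.zeros(2), 100 * np.eye(2), s2, th),
#     X, f_X, sigma2_grid=[0.1, 1.0, 10.0], theta_grid=[0.1, 0.2, 0.4])
```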

Figure 3: Top panels: simulator functions f1 and f2 evaluated across their respective input domains. Bottom left panel: composite simulator function f. Bottom right panel: a Bayes linear emulator of f constructed using 8 training points. The blue lines represent emulator expectation, and the red lines represent emulator expectation plus and minus 3 standard deviations.

The results of this emulation process are shown in the bottom right panel of Figure 3. The blue lines represent the emulator expectation E_F[f(x)]. The red lines represent the emulator mean plus and minus 3 emulator standard deviations, these being bounds for a 95% credible interval, following Pukelsheim’s rule [33]. Also plotted is the true simulator (green line); note that this would be unavailable in general for computationally intensive simulators. There are several ways to diagnose the overall ability of an emulator; these measures are used throughout this article for emulator comparison purposes. Standardised prediction errors, given by:

SPE(x) = ( f(x) − E_F[f(x)] ) / ( Var_F[f(x)] )^(1/2)    (15)

can be used to assess the validity of an emulator. Large absolute errors indicate conflict between simulator and emulator; if these are observed, the emulator is not valid for inference. It may be that the emulator prior beliefs were misspecified, for example as a result of incorrect prior specifications for the parameters β, σ² and θ. Alternatively, it could be an indication of an erratically behaved simulator that would require substantially more simulator runs in order to be emulated well. In accordance with this diagnostic, the fitted emulator in Figure 3 is valid, since the green line largely falls within the 3 standard deviation error bars. On the other hand, the emulator is fairly inaccurate (emulator predictions are far from the true simulator values) and imprecise (large emulator uncertainty). This motivates exploration of alternative approaches to emulating this function, for example by making use of the simpler behaviour present in the component simulators f1 and f2.
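A small sketch of computing the standardised prediction errors of Equation (15) for a set of held-out diagnostic runs, flagging any beyond 3 standard deviations (our own illustrative code); `predict` is assumed to be any emulator returning (adjusted mean, adjusted variance).

```python
import numpy as np

def standardised_prediction_errors(predict, X_diag, f_diag):
    """Equation (15) evaluated at each diagnostic input."""
    errors = []
    for x, fx in zip(X_diag, f_diag):
        mean, var = predict(x)
        errors.append((fx - mean) / np.sqrt(var))
    return np.array(errors)

# Large absolute values indicate conflict between emulator and simulator, e.g.:
# conflicts = np.abs(standardised_prediction_errors(predict, X_diag, f_diag)) > 3
```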

We note at this point that various approaches exist in the literature for obtaining a better emulator for an erratically behaved function, such as . Examples include local Gaussian processes [17], Treed Gaussian processes [18] and the aforementioned deep Gaussian processes, amongst others. The aim of this article is not to compete with such existing methods, but to demonstrate the efficacy of emulating individual simulators in a network and linking them together over emulating the composite simulator directly. In particular, these alternative methods, with possible slight modification for use in the Bayes linear paradigm, may still be used on any component simulator in a network to improve emulation of that component.

3 Approaches to Emulating Networks of Simulators

In Section 2, we motivated finding alternative approaches to emulating networks of simulators (rather than direct emulation of the entire network) as a result of the composite simulator potentially arising from a chain or network of less complex simulators. In addition, it may well be the case that the training runs for f1 and f2 simply do not match up (that is, the outputs of f1 do not correspond to the training run input locations for f2). We are therefore interested in making inference for a new input x to the composite model given the union of training runs F1 and F2, at input locations X1 and X2 respectively. From a Bayes linear point of view, we are interested in the adjusted expectation and variance of f(x), given F1 and F2, for any input x.

Such adjusted beliefs can be calculated via sequential adjustment of our second-order prior beliefs about f(x), first by F1 and then by F2, via a sequential Bayes linear update:

(16)
(17)

Equivalently, we could adjust sequentially by F2 and then by F1, swapping the roles of the two sets of runs in Equations (16) and (17). Consideration of the belief specification required to calculate the above expressions is likely to be challenging. We can avoid direct specification by claiming that F2 is Bayes linear sufficient for F1 for adjusting f2 [16], written:

(18)

In other words, the training runs F1 have no effect on our beliefs about f2 once adjusted by F2. This is very similar to a conditional independence property in the full Bayesian paradigm. As a result, we have that:

(19)
(20)

where we are treating f1(x) and f2(f1(x)) as uncertain quantities, our beliefs about which we wish to adjust in light of the training runs. From this point there are several approaches to making inferences about f(x) given F1 and F2. We present three such approaches here; the first is a generalisation of Section 2.1, and the latter two are novel, crucially making use of the fact that, for any x, we have access to E_F1[f1(x)] and Var_F1[f1(x)].

3.1 Direct Emulation of Second-Order Belief Specification

This first approach is a generalisation of the direct emulation approach of Section 2.1, and is discussed for completeness. We define the adjusted expectation and adjusted variance of f(x) given F1 and F2 as two simulators with the same input x as f. We proceed to construct Bayes linear emulators for these two quantities in the usual way (using Equations (4)-(6)). Note that the required training runs can be obtained from the available sets of training runs, since we can trivially calculate the adjusted expectation and variance at any training input. Such emulation results in expressions which we can then plug into Equations (19) and (20) in order to calculate the adjusted expectation and variance of f(x).

There are benefits to splitting the composite simulator into these two simulators before emulating, primarily because this removes the restriction that the two emulators share a common set of training inputs. As a result, we no longer need the numbers of training points for the two emulators to be the same, which may be of advantage if one simulator is much quicker to evaluate than the other. However, there are limits to the benefits, in terms of approximating f, of improving the emulators in this manner. Emulation of the adjusted expectation and variance becomes more accurate and precise as the number of their training points increases, but the quality of the resulting approximation to f remains constrained by the information in the underlying simulator runs. In other words, for a fixed set of simulator training runs, an increase in the number of training points for the adjusted-belief simulators is highly unlikely to lead to an improved approximation over the direct emulation of Section 2.1 with the same runs.

The direct emulator of Section 2.1 can be recovered as a special case in which the training inputs for the adjusted-belief simulators coincide with those of the composite simulator runs. In that case, combining the resulting expressions with Equations (19) and (20) returns the same adjusted expectation and variance as adjusting directly by the composite training runs. This is a logical conclusion; it would be strange if we were unable to adjust our belief specification for f(x) in this indirect way and recover the direct adjustment.

In this approach, the training runs are used to calculate the second-order belief specification of the composite simulator, but thereafter any additional knowledge about the behaviour of the component simulators is lost. For example, for prediction of f(x) at a new x, we have information about f1(x) from its own emulator, but this is not utilised. It is for this reason that we present the two novel approaches of Sections 3.2 and 3.3. We will compare these two approaches with the direct emulation approach of Section 2.1, which, as discussed, is a special case of the emulation method discussed in this section.

3.1.1 Emulator Design

Emulator design is the process of selecting the points in simulator parameter space at which the simulator will be run in order to construct an emulator [35]. If f is emulated as a composite simulator, this amounts to selecting the input locations x^(1), …, x^(n) for f. Computer experimental design is well covered in the literature in the context of emulating a single model (see, for example, [11, 31]). On the other hand, if each simulator is emulated separately, a design for each is required, for example X1 and X2 for f1 and f2 respectively. Experimental design in this context likely involves optimising over the two parameter spaces simultaneously, and is scope for much research in its own right. As a result, in this article we will use the popular Maximin Latin Hypercube (MLH) design [30, 6] over assumed known simulator parameter spaces whenever a design is required, whether this be for an emulator of an individual simulator or of the entire simulator network. The exception to this is for simulators with only one parameter, in which case an evenly spaced set of points across the parameter range is used, as for the simulated example of Section 2.2.
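A pragmatic way to generate an MLH design is to draw many random Latin hypercubes on [0, 1]^p and retain the one with the largest minimum pairwise distance; the sketch below takes this simple candidate-set approach (an illustrative stand-in, not necessarily the construction used in [30, 6]).

```python
import numpy as np

def random_latin_hypercube(n, p, rng):
    """n points in [0, 1]^p with one point per equal-width stratum in each dimension."""
    cols = [(rng.permutation(n) + rng.random(n)) / n for _ in range(p)]
    return np.column_stack(cols)

def maximin_latin_hypercube(n, p, n_candidates=500, seed=0):
    """Keep the candidate Latin hypercube maximising the minimum pairwise distance."""
    rng = np.random.default_rng(seed)
    best_design, best_score = None, -np.inf
    for _ in range(n_candidates):
        design = random_latin_hypercube(n, p, rng)
        dists = np.sqrt(((design[:, None, :] - design[None, :, :]) ** 2).sum(axis=-1))
        score = dists[np.triu_indices(n, k=1)].min()
        if score > best_score:
            best_design, best_score = design, score
    return best_design

# e.g. a 50-point design over three rescaled inputs, as in the application of Section 4:
design = maximin_latin_hypercube(n=50, p=3)
```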

3.2 Bayes Linear Sampling Approach

Let us consider Equations (19) and (20) once again. The required second-order belief specification is across the quantities appearing there. In the previous section, we treated the adjusted belief specification as a simulator itself, and noted that, given appropriate training runs, we could emulate it, thus effectively avoiding making this complex second-order belief specification. In this section, we also seek to avoid making this specification; however, we now wish to utilise knowledge about f1(x) (namely E_F1[f1(x)] and Var_F1[f1(x)]) directly for making inference about f at a new input point x. Since x only affects f2 through f1(x), we have that

(21)
(22)

Note that obtaining the expectation and variance through Equations (21) and (22) is tantamount to requiring belief statements about the output of simulator f2 when its input X = f1(x) is a random variable with second-order belief specification E_F1[f1(x)] and Var_F1[f1(x)]. We present two approaches to emulating f2 in this case. The approach in the next section remains fully within the Bayes linear paradigm, whereas the approach in this section utilises appropriate distributions to effectively integrate X out of the emulator specification. In particular, we assume that the random variable X follows an appropriate probability distribution which is consistent with our second-order adjusted belief specification for f1 at x, for example

X ~ N( E_F1[f1(x)], Var_F1[f1(x)] )    (23)

It is important to note that being Bayes linear in principle does not prevent us from investigating the consequences of assuming certain distributions, for example, as tools for sampling purposes. The dependence of any results to this choice of distribution should be explored by performing a robustness analysis.

We then assume interest lies in the expectation and variance of f2(X), which effectively replaces the second-order specification given in Equations (21) and (22) with the distribution of X given by Expression (23). The fact that we have now assumed a distribution for X allows us to integrate X out in a fully Bayesian fashion, so that:

E*[ f2(X) ] = E*[ E_F2[ f2(X) ] ]    (24)
Var*[ f2(X) ] = E*[ Var_F2[ f2(X) ] ] + Var*[ E_F2[ f2(X) ] ]    (25)

In Equations (24) and (25) it is convenient to distinguish between expectations and variances in the Bayes linear framework (where expectation is primitive) and the full Bayesian framework (where expectation is derived) using slightly different notation, E_F[·], Var_F[·] and E*[·], Var*[·] respectively.

Although integration over the specified distribution (23) may be possible for specific distributional choices, we instead propose taking a Monte Carlo sample X^(1), …, X^(N) of possible values of f1(x) for each input x, and using these to approximate Expressions (24) and (25) above, so that

E*[ f2(X) ] ≈ (1/N) Σ_s E_F2[ f2(X^(s)) ]    (26)
Var*[ f2(X) ] ≈ (1/N) Σ_s Var_F2[ f2(X^(s)) ] + (1/(N−1)) Σ_s ( E_F2[ f2(X^(s)) ] − m̄ )²    (27)

where m̄ denotes the sample mean of the E_F2[ f2(X^(s)) ].

Note that such Monte Carlo sampling should be relatively fast, as it only involves evaluation of emulators. In addition, we anticipate that, by emulating a (hopefully less complex) single module, each emulator should be simpler and thus cheaper to evaluate (by requiring fewer training points). Further, if we wish to evaluate f at very many points, then we could emulate the approximation calculation itself by viewing it as a stochastic simulator. This would avoid the need to perform the Monte Carlo calculation multiple times for each x.
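The following sketch (our own illustrative code) carries out the Monte Carlo approximation of Equations (26) and (27) for a single input x, given two fitted emulators `emulate1` and `emulate2` that each return (adjusted mean, adjusted variance); the choice between a normal and a moment-matched uniform sampling distribution mirrors the robustness check described above.

```python
import numpy as np

def bl_sampling_link(emulate1, emulate2, x, n_samples=100, dist="normal", seed=0):
    """Approximate the mean and variance of f2(f1(x)) by sampling the uncertain input."""
    rng = np.random.default_rng(seed)
    m1, v1 = emulate1(x)                        # adjusted beliefs about f1(x)
    if dist == "normal":
        samples = rng.normal(m1, np.sqrt(v1), n_samples)
    else:                                       # uniform with the same mean and variance
        half_width = np.sqrt(3.0 * v1)
        samples = rng.uniform(m1 - half_width, m1 + half_width, n_samples)
    m2, v2 = np.array([emulate2(s) for s in samples]).T
    mean = m2.mean()                            # Equation (26)
    # Equation (27): the first term reflects uncertainty from emulating f2,
    # the second reflects uncertainty propagated from emulating f1.
    variance = v2.mean() + m2.var(ddof=1)
    return mean, variance
```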

We can also gain much insight into the uncertainty arising from emulating the separate modules, since each term corresponds to one of the two parts of Equation (25):

  • E*[ Var_F2[ f2(X) ] ] reflects uncertainty in f(x) as a result of emulating f2.

  • Var*[ E_F2[ f2(X) ] ] reflects uncertainty in f(x) as a result of emulating f1.

This separation of the contributions to the overall variance of f(x) could be insightful for multiple reasons. An example is experimental design, briefly discussed in Section 3.1.1, where one is allocating a computational resource budget between training runs of simulators f1 and f2. Finally, having obtained a Monte Carlo sample, it is trivial to calculate further approximate summaries of f(x), which may also be useful for design purposes, as well as for diagnostics.

3.2.1 Illustrative Example

Figure 4: Top panels: Bayes linear emulators of f1 and f2, each constructed using 8 training points. The blue lines represent emulator expectation, and the red lines represent emulator expectation plus and minus 3 standard deviations. The green lines represent the true function for comparison purposes. Bottom panels: emulators for f resulting from applying the Bayes linear sampling approach (using normal and uniform distributions respectively).

In this section, we continue the example of Section 2.2. We construct Bayes linear emulators for each of the functions f1 and f2 using 8 training points and the same prior beliefs and techniques for eliciting the hyperparameters as discussed for f in the previous section. The results of emulating these two simulators are shown in the top two panels of Figure 4. We can see that both of these emulators are valid and accurate, with low uncertainty. One of the two emulators is noticeably more precise, reflecting the simpler behaviour of its underlying function.

We proceed to combine the two emulators for f1 and f2 using the Bayes linear sampling approach to emulating networks of simulators. We begin by evaluating the emulator for f1 at 1000 evenly-spaced points across the input space to obtain E_F1[f1(x)] and Var_F1[f1(x)]. For each of these points we then sampled 100 points from a normal distribution with the corresponding mean and variance. We then approximated the expectation and variance of f(x) using Equations (26) and (27). The results of this approximation are shown in the bottom left panel of Figure 4. We observe that f has been emulated well using this sampling approach, with low uncertainty. Areas of slightly larger uncertainty can be associated with regions of the input spaces for f1 and/or f2 with larger emulator uncertainty, as should be expected.

To assess the effect of the chosen sampling distribution, we repeat the sampling approximation using a uniform distribution, the results of which are shown in the bottom right panel of Figure 4. We can see that the results are fairly similar, suggesting that the effect of exact distributional specification isn’t too influential in this case.

3.3 Uncertain Input Emulator Approach

In this section, we propose a third method for emulating networks of simulators. In particular, following Equations (21) and (22), we consider a Bayes linear emulation approach for simulators whose uncertain inputs are random variables with a second-order specification. In our case, this uncertainty arises as a result of the input to one simulator being (connected to) the output of another simulator about which we are uncertain. Whilst work has been done on uncertain-input emulators, it usually involves Gaussian assumptions which we prefer not to make [12]. The suggested approach is efficient and, as we demonstrate, results in reasonable approximations to the overall simulator network. We proceed to describe our approach to Bayes linear emulation with uncertain inputs; the linking of Bayes linear emulators follows naturally.

To be consistent with the chained simulators discussed above, we consider a scalar-output simulator f, for which we have a set of training runs

(28)

at known inputs x^(1), …, x^(n). We wish to make inference about f(X), where X is an uncertain input to the simulator and f is uncertain across the majority of the input space, so that

(29)

We assume a second-order prior specification for the regression coefficients and the residual process similar to that which may be made in the case of known inputs. One of the key differences in the context of unknown inputs is the specification of a correlation function involving unknown input points. We assume a general form, similar to that given by Equation (7), as follows:

(30)

For example, we propose a reasonable extension to the Gaussian correlation function, presented in Equation (8), to be

(31)

this being derived from the general formula for the expected value of a quadratic form [21]. In addition, this choice follows a general rule which we believe should hold, namely that a generic correlation function involving uncertain inputs should reduce to the standard correlation function form when those inputs are known.

Given this choice of correlation function, the Bayes linear update at an uncertain input follows relatively straightforwardly from that assuming known inputs. For details of calculating a Bayes linear update in the case of known inputs, see, for example, [22]. We express the vector of training outputs, following Equation (4), as

(32)

where G is the n × q design matrix of basis functions evaluated at the training inputs, β is the q-vector of regression parameters and u is an n-vector of residuals. Following the prior specification above, we can write down the prior expectation, variance and relevant covariances of the training outputs, where we define

(33)

The expected value of f(X) at the random input X is

(34)

which follows since the residual process has zero prior expectation and g is assumed to be a known function of its argument (as usual). Specification of the expectation of g(X) (and later its variance) is straightforward for first-order linear basis functions. It is also possible for further functions of the input components, but these transformed input components will require a sensible second-order specification as a result of emulating the previous simulator. There are several ways to achieve this, the most straightforward being to emulate the transformed inputs as further output quantities of the previous simulator.

We now proceed to adjust our beliefs about the variance of f(X) using the Bayes linear update Equation (2):

(35)

where

(36)
(37)

and

(38)

so that

(39)

These equations permit the output of a simulator to be emulated at an unknown input, provided a second-order specification is given for that input. We can apply this approach to emulation directly to a network of simulators, using the adjusted second-order belief structure about the output of any simulator as the required specification for the unknown input to another simulator. Since the result in each case is a second-order belief specification, the approach can be applied to arbitrarily large networks of simulators. We now demonstrate this on the illustrative example.
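For intuition, the sketch below gives one possible implementation of a Bayes linear emulator evaluated at an uncertain scalar input with specified mean and variance. The treatment of the correlation function (the expected squared distance substituted into the Gaussian form) and of the prior variance at the uncertain input reflect our own reading of the description above, under assumed independence between β, u(·) and the input; it is an illustration, not a reproduction of Equations (30)-(39).

```python
import numpy as np

def g(x):                                      # linear basis (1, x)
    return np.array([1.0, float(x)])

def corr_uncertain(mu, var, xj, theta):
    """Gaussian correlation with the expected squared distance to xj substituted in."""
    return np.exp(-((mu - xj) ** 2 + var) / theta ** 2)

def uncertain_input_emulator(X, f_X, E_beta, V_beta, sigma2, theta):
    """Return a predictor mapping the (mean, variance) of an uncertain input to the
    adjusted (mean, variance) of the simulator output."""
    G = np.array([g(x) for x in X])
    C = np.array([[np.exp(-((a - b) ** 2) / theta ** 2) for b in X] for a in X])
    var_F = G @ V_beta @ G.T + sigma2 * C
    resid = np.linalg.solve(var_F, f_X - G @ E_beta)
    var_F_inv = np.linalg.inv(var_F)
    def predict(mu, var):
        Eg = np.array([1.0, mu])               # E[g(X)] for a linear basis
        cov_xF = Eg @ V_beta @ G.T + sigma2 * np.array(
            [corr_uncertain(mu, var, xj, theta) for xj in X])
        mean = Eg @ E_beta + cov_xF @ resid
        # Prior variance at the uncertain input: regression part, input-uncertainty
        # contribution through the slope term, and residual part (an assumption).
        prior_var = Eg @ V_beta @ Eg + (E_beta[1] ** 2 + V_beta[1, 1]) * var + sigma2
        return mean, max(prior_var - cov_xF @ var_F_inv @ cov_xF, 0.0)
    return predict
```

Linking two emulators then amounts to feeding the adjusted mean and variance from the first emulator into `predict`.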

3.3.1 Illustrative Example

We evaluate the emulator for f1 at 1000 evenly-spaced points across the input space to obtain E_F1[f1(x)] and Var_F1[f1(x)]. The uncertain input emulator for f2 is trained using the same training runs as in the previous sections, resulting in the same values for the parameters σ² and θ. Given these parameters, we can now approximate the output of f2 at each corresponding uncertain input resulting from the adjusted second-order belief specification for f1 at each x of interest. The results of doing this are shown in Figure 5.

Figure 5: Emulator for f resulting from applying the Bayes linear uncertain inputs approach discussed in Section 3.3. The blue lines represent emulator expectation, and the red lines represent emulator expectation plus and minus 3 standard deviations. The green lines represent the true function for comparison purposes.

The results of emulating f using this uncertain inputs approach are slightly different to those obtained using the sampling approach. The expectation (blue line) is very similar; however, the standard deviation bounds are slightly wider in places. On the whole, the prediction is quite accurate, with much lower uncertainty than the direct emulation approach used in the example of Section 2.

4 Application to a Dispersion Dose-Response Chain of Simulators

In this section, we apply the three techniques discussed in Section 3 to the dispersion dose-response network of simulators introduced in Section 1. Due to the behaviour of the dispersion model, we chose to emulate a transformation of its dose output, treating this transformed function as the first simulator f1 of the network. To be consistent with this output, we considered the second simulator f2 to take the correspondingly transformed dose as its input. The output of f2, like that of the composite simulator f, represents the number of casualties. The combined simulator f takes wind speed, wind direction and source mass as input, and directly outputs a number of casualties. The graphical representation showing the links between the quantities of interest and the inputs and outputs of simulators f1 (dispersion) and f2 (dose-response) is given in Figure 1.

We proceeded to construct Bayes linear emulators for each of the individual simulators f1 and f2, as well as for the combined simulator f. We consider the ranges of interest of the inputs to simulator f1 (and thus f) to be as given in Table 1; each input was rescaled for the purposes of our analysis. We constructed a training point design for f1 and f using an MLH of size 50 across the three input dimensions. In contrast, simulator f2 is one-dimensional and thus needs far fewer training points, so we take a sample of 20 points from a uniform distribution. Note that, as a result of needing to run simulator f2 many times at each training point in order to estimate its mean function, this simulator is slow; requiring fewer training points to train its emulator is therefore already an advantage. For each of the emulators, we assumed a Gaussian correlation function, as given by Equation (8), along with a first-order polynomial mean function. We represent the scalar variance parameter and correlation length vectors as σ² and θ respectively. We fit these parameters using maximum likelihood for each emulator, as recommended in [35] for models with a moderate/large number of training runs, this also permitting a fair comparison between the emulation methods presented. It should also be noted that, whilst the focus of the article concerns the use of Bayes linear emulators, this does not prevent us from considering the effects of assuming certain distributions, in this case for selecting parameters for the correlation functions using maximum likelihood in order to obtain adequate emulators [38].

Input parameter Minimum Maximum
37 63
1 150
0.001 1
Table 1: Ranges of the input parameters to f1 (which were rescaled for the purposes of our analysis).
Figure 6: Adjusted expectation plus and minus 3 standard deviations against simulator output for six different emulators; the first row for Bayes linear emulators of the dispersion simulator f1 and the dose-response simulator f2 respectively; the left panel of the second row showing the results of emulating the joint dispersion dose-response network of simulators directly as a single simulator; the right panel of the second row and the left panel of the bottom row showing the results of emulating f by combining the emulators for f1 and f2 using the sampling approach of Section 3.2, first using a normal and then a uniform distribution; the right panel of the bottom row showing the results of emulating f by combining the emulators for f1 and f2 using the uncertain inputs approach of Section 3.3.

Given the individual emulators for f1 and f2, we can then combine them using the sampling method of Section 3.2 and the uncertain inputs method of Section 3.3 to yield chained emulators for f. Figure 6 shows diagnostic plots of adjusted expectation plus and minus 3 standard deviations against simulator output for the six emulators described in its caption. The inputs for these diagnostic runs, of size 50 and 20 (for f1 and f2 respectively), were constructed in the same manner as the training runs. In addition to the plots, Table 2 shows Mean Absolute Standardised Prediction Errors (MASPEs):

MASPE = (1/m) Σ_{i=1..m} | E[f(x^(i))] − f(x^(i)) | / ( Var[f(x^(i))] )^(1/2)    (40)

and Root Mean Squared Errors (RMSEs):

RMSE = ( (1/m) Σ_{i=1..m} ( E[f(x^(i))] − f(x^(i)) )² )^(1/2)    (41)

for the diagnostic runs for each of the six emulators, with f replaced by f1 or f2 as appropriate, and with E[·] and Var[·] representing the appropriate emulator mean and variance.
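The two summaries of Equations (40) and (41) are straightforward to compute from any emulator returning an adjusted mean and variance; a short illustrative sketch:

```python
import numpy as np

def maspe_rmse(predict, X_diag, f_diag):
    """Mean Absolute Standardised Prediction Error and Root Mean Squared Error
    over a set of diagnostic runs, Equations (40) and (41)."""
    means, variances = np.array([predict(x) for x in X_diag]).T
    f_diag = np.asarray(f_diag)
    maspe = np.mean(np.abs(means - f_diag) / np.sqrt(variances))
    rmse = np.sqrt(np.mean((means - f_diag) ** 2))
    return maspe, rmse
```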

Emulator            MASPE    RMSE
Dispersion (f1)     1.638    0.6457
Dose-response (f2)  0.766    0.0003
direct              1.579    0.0312
SA_Normal           1.256    0.0235
SA_Uniform          1.242    0.0243
Uncertain Inputs    0.767    0.0240
Table 2: Mean Absolute Standardised Prediction Errors and Root Mean Squared Errors for each of the six emulators discussed in the text.

We can see that the emulator for the dispersion simulator f1 is fairly accurate, with the exception of points towards the bottom end of the output range, where there are several cases of severe overestimation (with underestimated uncertainty). The emulator for the dose-response simulator f2 is very accurate, reflecting the fact that its predictions can be treated with almost as much certainty as running the simulator itself. The diagnostics for these emulators should be kept in mind as we discuss the emulators for f.

The direct emulator yields predictions with underestimated uncertainty. By comparison, the estimated uncertainty for the remaining methods is larger, with more appropriate MASPE values. In addition, the accuracy of the predictions for the chained emulators is, on the whole, improved, this being confirmed by the RMSE values in Table 2. The RMSE values for the sampling approach and the uncertain inputs approach are comparable. It is interesting to note, however, that the uncertainty attributed to each diagnostic point differs between the two emulators, with the uncertainty of the Uncertain Inputs emulator being larger for runs resulting in low or high output values, and smaller for those points in the middle. This is likely to be a consequence of the way the uncertainty in the intermediate dose is propagated through in the two methods. The sampling approach propagates uncertainty by sampling possible input values for the dose-response emulator according to the adjusted beliefs about the dispersion output. This results in a heteroscedastic error structure across the emulator for f (for example, if the dose-response output is expected to change little regardless of the sampled input values, the uncertainty is small). In contrast, the uncertain inputs approach derives its uncertainty from the regression part and the covariance structure. As with a standard Bayes linear emulator that uses a single correlation structure across the input space, there is some averaging of the uncertainty estimates for simulator prediction across the input space, even if the behaviour at some points is smoother than at others. Incorporation of more sophisticated methodology into the current emulator network methodology, for example utilising ideas similar to local Gaussian processes [17], may be of benefit in this case. To summarise, we feel that the results presented give evidence for preferring the two presented methods for linking networks of emulators over applying a single emulator to a composite simulator in many cases. We defer further discussion to Section 6.

5 Extension to Networks of Simulators

The previous sections of this article have demonstrated how emulating individual simulators in a network can result in better overall emulators than emulating the entire network directly. The previous examples have focussed on chains of two models; we now demonstrate how the sampling approach and the uncertain input approach generalise directly to more complex networks of simulators. In each case, the second-order specification resulting from one emulator provides the sampling distribution or the uncertain input specification for the next.
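For a simple chain, this generalisation amounts to repeatedly passing the adjusted mean and variance from one uncertain-input emulator to the next, as in the short sketch below (illustrative only; a branching network such as that of Figure 7 additionally requires collecting the specifications of all parent nodes, and in general their covariances, before evaluating a downstream emulator).

```python
def propagate_chain(emulators, x0):
    """Propagate a second-order specification through a chain of uncertain-input
    emulators. Each emulator maps (input mean, input variance) to the adjusted
    (mean, variance) of its output; a known scalar input has zero variance."""
    mean, variance = float(x0), 0.0
    for emulate in emulators:
        mean, variance = emulate(mean, variance)
    return mean, variance

# e.g. propagate_chain([em_f1, em_f2], x0=0.5) for a two-simulator chain, where
# em_f1 and em_f2 are uncertain-input emulators as sketched in Section 3.3.
```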

5.1 Illustrative Example

We here demonstrate the applicability of our methods to a more complicated network of simulators. We analyse the small network of simulators shown in Figure 7. f1 and f2 are the same functions as in Section 2.2, f3 is the one-dimensional function

(42)

with a specified domain of interest, and f4 is the three-dimensional function

(43)

with domains of interest for two of its inputs roughly corresponding to the output ranges of f2 and f3 respectively, along with a specified domain for its third input. We denote the entire composite function as f, with input x. This illustrative example demonstrates potential benefits of utilising networks of emulators over emulating the whole network directly.

Figure 7: Graphical representation of the one-dimensional network example of Section 5.1

To begin with, we construct Bayes linear emulators for f1, f2, f3 and f4. We take the training points for the direct emulator of f to be a Latin hypercube of size 30 across the three input dimensions, appealing to the rough heuristic suggesting a minimum of 10p design points, where p is the parameter space dimension. The relevant inputs of this Latin hypercube can then also be used as the training points for f1 and f3. For f2, a new set of 30 equally spaced points is taken as the training points. A Latin hypercube of size 30 was also taken as the training points for the emulator for f4. We again assume emulators of the form given by Equation (4) with covariance structure given by Equation (8). We specify vague prior beliefs on β, fitting σ² and θ by maximum likelihood, as recommended in [35] for simulators with a moderate/large number of training runs. The emulators for f1, f2, f3 and f4 were then combined, similarly to the previous examples, using the Uncertain Input emulator approach to yield an approximation for f. The results of these six emulators are shown in Figure 8.

Figure 8: Adjusted expectation plus and minus 3 standard deviations against simulator output for six different emulators related to the illustrative network of simulators example of Section 5; the top row for emulators of f1, f2 and f3, each constructed using 30 training points; the bottom row for an emulator of f4 constructed using 30 training points, a direct emulator of the whole network constructed using 30 training points, and the results of emulating f by combining the emulators for f1, f2, f3 and f4 using the uncertain inputs approach of Section 3.3.

The top three panels of Figure 8 show diagnostic plots for the individual emulators of f1, f2 and f3. Since these are relatively simple one-dimensional functions, 30 training points allow almost-perfect predictions. Moving onto the bottom row, the left panel suggests that f4 is emulated well with some uncertainty, whereas the middle panel shows that the behaviour of f is hard to mimic using a direct emulator. The uncertain input emulator for f (bottom right panel) is much more accurate than the direct emulator. The most obvious benefit of emulating the individual simulators in this example, as for many networks of complex simulators, arises from their reduced dimension. Since many of the individual simulators are one-dimensional, their behaviour can be captured accurately.

Figure 9: Adjusted expectation plus and minus 3 standard deviations against simulator output for nine different emulators related to the illustrative network of simulators example of Section 5; the top row for emulators of f1, f2 and f3, each constructed using 8 training points; the middle row for an emulator of f4 constructed using 30 training points, and direct emulators of the whole network constructed using 30 and 120 training points respectively; the bottom row showing the results of emulating f by combining the emulators for f1, f2, f3 and f4, first using the sampling approach of Section 3.2 with a normal and then a uniform sampling distribution, and then using the uncertain inputs approach of Section 3.3.

The results shown in Figure 8 indicate that linking emulators together may sometimes be a useful strategy for obtaining more accurate emulators of the entire network of simulators for the same amount of computational expenditure. Figure 9 explores the results of reducing the number of training points for the emulators of the one-dimensional simulators f1, f2 and f3 to 8, as was the case in Sections 3.2.1 and 3.3.1. Due to the reduction in the number of training points, we this time fitted