Active learning for efficiently training emulators of computationally expensive mathematical models

12/18/2018 · by Alexandra G. Ellis, et al. · Brown University

An emulator is a fast-to-evaluate statistical approximation of a detailed mathematical model (simulator). When used in lieu of simulators, emulators can expedite tasks that require many repeated evaluations, such as model calibration and value-of-information analyses. Emulators are developed using the output of simulators at specific input values (design points). Developing an emulator that closely approximates the simulator can require many design points, which becomes computationally expensive. We describe a self-terminating active learning algorithm to efficiently develop emulators tailored to a specific emulation task. Its postulated advantages over the prevalent approaches include (1) self-termination and (2) development of emulators with smaller mean squared errors. To explicate, we develop and compare Gaussian Process emulators of a prostate screening model using the adaptive algorithm versus standard approaches.




1 Introduction

Many decisions in health must be made under uncertainty or with incomplete understanding of the underlying phenomena. In such cases, mathematical modeling can help decision-makers synthesize evidence from different sources, estimate and aggregate anticipated outcomes while accounting for stakeholder preferences, understand trade-offs, and quantify the impact of uncertainty on decision making [1]. To be informative, a model should be detailed enough to capture salient aspects of the decisional problem at hand, but a highly detailed model can render routine analyses computationally expensive, hindering its usability.

As an example, consider the prostate-specific antigen (PSA) growth and prostate cancer progression (PSAPC) model, a microsimulation model that projects outcomes for PSA-based screening strategies defined by combinations of screening schedules and age-specific PSA positivity thresholds [2, 3]. Although evaluating the expected outcomes of one screening strategy takes just a few minutes on modern computer hardware, analyses that require a large number of model evaluations, such as calibration of model parameters to external data [4, 5], sensitivity analysis [4, 6, 7], and uncertainty analysis [4, 8], become expensive if not impractical. For example, in part because of the computational burden, Gulati and colleagues evaluated only a small number of screening strategies [3], out of a much larger set of policy-relevant strategies that could have been explored [9].

A way to mitigate the computational cost of such analyses is to use emulators of the original mathematical models. Emulators, also called surrogate models or meta-models, are approximations of the original models, hereafter referred to as simulators for consistency with prior works [7, 8, 10, 11]. Once developed, emulators are orders of magnitude faster to evaluate than the simulators and, thus, can be used instead of, or in tandem with, the simulators to perform computationally intensive tasks [7, 12, 11]. Although uncommon in healthcare applications [13, 14, 15], using emulators in place of simulators is a well-established practice in mechanical, electrical, and industrial engineering, geology, meteorology, and military modeling [7, 16, 17, 4, 8].

Developing an emulator that approximates a simulator over an input domain is analogous to estimating a response surface, so it is an experimental-design problem [18]. In this context, the simulator is used to generate data (a design), and an emulator is fit to the design. In this work we describe an adaptive algorithm for developing designs that are tailored to a specific simulator and a particular emulation task. We compare the performance of emulators generated with the adaptive design versus the standard approach of Latin Hypercube Sampling (LHS), which generates generic designs that are agnostic to the emulation task [19] and distributes points across intervals of the input domain.

This work is organized as follows. In Section 2, we describe desirable characteristics for emulators of simulators in health to motivate our choice of Gaussian Process emulators. In Section 3, we introduce an adaptive design for developing emulators, which is the focus of this work. Section 4 uses the PSAPC model to explicate the development of emulators that predict the gain in life expectancy over no screening for large sets of practically-implementable prostate cancer screening strategies. Although our adaptive design is meant for very expensive simulators (that, e.g., take days rather than minutes to evaluate), we use the PSAPC simulator as a model with which we can actually perform full computations. We compare emulators developed with the sequential design versus those developed with LHS in Section 5, and conclude with key remarks (Section 6).

2 Emulation goals and emulator models

2.1 Simulators

For this exposition, a simulator is a deterministic smooth function that maps a vector of input parameters to a scalar output. The number of input parameters is typically in the many dozens or hundreds. We treat the simulator as a black-box function that can be evaluated (expensively) at specific input values. Typically, a baseline set of input values has been specified for the simulator through data analysis, calibration, or expert knowledge.

While this setup is not the most general case for which we could present our algorithm, it is not as restrictive as it appears for the following reasons: (i) Many simulators in health that are not smooth are Lipschitz-continuous (i.e. have no “extreme” changes in the output given a marginal change in input) and, for the precision demanded in practice and for our purposes, can be treated as if they were smooth functions. (ii) Simulator outputs can be random variables, e.g., if some inputs are random variables, but we are usually interested in expectations of outcomes, which marginalize over the random input variables and result in deterministic mappings. (iii) Most simulators have multidimensional outputs, but, often, we are interested in one critical or composite outcome or are willing to consider one outcome at a time. (iv) Finally, we treat the simulator as a black box and do not attempt to exploit its analytical form. Often the analytical form is not available, or, when it is, its theoretical analysis may severely test our mathematical ability or be intractable.

With reference to more general setups for emulator development, in Section 6.2 we argue that extensions of our work to stochastic simulators and to simulators with multidimensional outputs require only minor modifications of the adaptive design algorithms that we present below.

2.2 Goal of emulation

We aim to develop an emulator that statistically approximates the simulator’s input-output mapping over the domain of a selected subset of the simulator’s input parameters [8]. Writing f for the simulator and f̂ for the emulator, we seek

f̂(x) ≈ f(x, z*),

where x are values for the inputs whose mapping we wish to emulate, z* are values for the remaining inputs of the simulator that are kept equal to their corresponding elements in the baseline-values vector, and the symbol “≈” means “either equal or close enough for the purpose of the application”. From now on we write f(x) instead of f(x, z*) to ease notation. We assume that the domain of x is a polytope, i.e., a convex polyhedron, which is most often the case in applications.

Approximating the behavior of the simulator over all vectors in the input domain is a reasonable goal when we wish to gain insights about the behavior of the simulator or to perform sensitivity and uncertainty analysis. For other tasks, e.g., calibration of input variables to external data, it may suffice to approximate the simulator in the neighborhood of the set of optimal values [20].

To fit the emulator, we need a design: a specification of the set of points in the input domain at which we will evaluate the simulator. Let the design be a set of distinct vectors in the domain that includes the extreme vectors (vertices) of the domain, so that all vectors in the domain are convex combinations of design vectors. Then, with some abuse of terminology, we will call these vectors the design vectors or design points. We further abbreviate notation when referring to the simulator output at each design vector.
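For a hyper-rectangular domain, a special case of the polytopes considered here, the extreme vectors are simply the corners. The following sketch (the bounds, dimension, and number of interior points are illustrative assumptions) builds a seeding design from the corners plus random interior points and verifies that the whole domain lies in the convex hull of the design vectors:

```python
import numpy as np
from itertools import product
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
lo, hi, d = 0.5, 6.0, 3

# Extreme vectors of a box-shaped domain: all 2**d corners.
corners = np.array(list(product([lo, hi], repeat=d)))
interior = rng.uniform(lo, hi, size=(5, d))
design = np.vstack([corners, interior])

# Every point of the domain is a convex combination of the design vectors,
# i.e., it lies inside their convex hull (checked via a triangulation).
tri = Delaunay(design)
probes = rng.uniform(lo, hi, size=(1000, d))
print(np.all(tri.find_simplex(probes) >= 0))  # True: all probes are covered
```

The same triangulation machinery (a tessellation of the design points) is what the candidate-selection algorithm mentioned in Appendix D relies on.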

2.3 Choice of emulator model

We follow others in requiring that the emulator be an exact interpolator [8, 4, 11, 7, 10]. Specifically, we demand that:

  1. At each design point, the emulator’s prediction should agree exactly with the simulator’s output value, because the simulator is deterministic.

  2. At all other input vectors, the emulator must provide an interpolation of the simulator output value, whose uncertainty decreases when closer to a design point, so that it becomes 0 at the design points.

  3. The emulator should be orders of magnitude faster to evaluate than the simulator.

Despite the fact that practically all statistical and machine learning models are fast to evaluate, i.e., they satisfy criterion 3, approaches such as linear regressions and neural networks do not satisfy the first two criteria, whereas kernel-based methods, such as Gaussian Processes (GPs), do. Figure 1 helps build intuition. Therefore, despite its popularity in the literature, in part owing to its simple form and familiarity to researchers [21, 7], regression is not a preferred emulator type in this context. We use GP-based emulation, also referred to as kriging in other fields [8, 4, 11, 7, 10]. We review GPs briefly in Appendix C.

Regression emulator.
Gaussian Process emulator.
Figure 1: Two emulator models. Two emulator functions predict the output of a simplistic model (grey curve). The input is a single variable (X1, horizontal axis). The 10 grey points represent the data (“design points”) used to fit the emulators. The solid blue curves are the emulators’ predictions, and the areas between the red dashed curves are the associated 95% prediction intervals. The prediction from the linear regression emulator (a) passes near, but not through, all design points. Further, the prediction uncertainty is not zero at the design points. The Gaussian Process emulator (b) passes through all design points, and the uncertainty of its predictions has the desired pattern. Figure adapted from [4].
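The exact-interpolation behavior in Figure 1(b) is easy to reproduce with any noise-free GP implementation. A minimal sketch using scikit-learn (an illustrative choice; the authors' implementation uses R's GPfit), with a hypothetical 1-D stand-in simulator:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Ten design points from a toy 1-D "simulator" (hypothetical stand-in).
def simulator(x):
    return np.sin(2 * x) + 0.5 * x

X = np.linspace(0, 5, 10).reshape(-1, 1)
y = simulator(X).ravel()

# A noise-free GP (tiny jitter alpha) interpolates the design points exactly.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-10)
gp.fit(X, y)

pred, std = gp.predict(X, return_std=True)
print(np.allclose(pred, y, atol=1e-6))   # criterion 1: exact interpolation
print(np.allclose(std, 0.0, atol=1e-4))  # criterion 2: zero uncertainty there
```

Between design points, `std` grows with distance from the nearest design point, giving the uncertainty pattern required by criterion 2.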

2.4 Experimental designs for developing emulators

The most common experimental designs for emulators are space-filling designs and sequential designs. Space-filling designs, such as LHS and its variants, choose input points from the input domain while ensuring that pre-specified intervals of each input variable are sampled. Space-filling designs place points regardless of an input variable’s influence on the output variable [22, 23, 24]. Alternatively, sequential designs make use of the estimated relationship between inputs and outputs in the simulator [11, 7, 25]. Broadly, these approaches fit an emulator based on an initial set of design points and then choose an additional new design point based on pre-established criteria related to the emulator’s fit or uncertainty. When a new design point is chosen, it is added to the current set of design points and the emulator is refit with the augmented set. The process repeats until pre-established stopping criteria are met.

LHS and other space-filling designs are easy to implement and are available in many software packages (e.g., the lhs package [26] in R). The number of design points is fixed in advance in the form of a computational budget; however, there is little guidance on pre-determining the adequate number of points. Furthermore, although LHS performs well when the simulator behaves similarly across the input domain, it may approximate poorly when the simulator behaves qualitatively differently in an input subdomain that is not sampled [27]. By contrast, sequential designs do not require the user to specify the number of design points in advance, but rather allow them to continue to add design points until the stopping criteria are met. They may be more efficient in that they may require fewer design points than fixed designs like LHS [7, 28]. However, sequential designs may be computationally expensive due to the re-estimation of the emulator’s parameters after each additional design point [29].
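An LHS design and its defining property, one sample per interval of each input variable, can be sketched with SciPy's quasi-Monte Carlo module (the R lhs package cited above plays the same role; the dimension and sample size here are illustrative):

```python
import numpy as np
from scipy.stats import qmc

# A 4-dimensional Latin hypercube design with 20 points.
sampler = qmc.LatinHypercube(d=4, seed=1)
unit = sampler.random(n=20)  # points in the unit hypercube [0, 1)^4

# Scale to a PSA-threshold-like range, 0.5-6.0 ng/mL in each dimension.
design = qmc.scale(unit, l_bounds=[0.5] * 4, u_bounds=[6.0] * 4)

# LHS property: each of the 20 equal-width intervals of every input
# variable contains exactly one sampled point.
bins = np.floor(unit * 20).astype(int)
print(all(len(set(bins[:, j])) == 20 for j in range(4)))  # True
```

Note that the stratification is per input variable; nothing in the construction accounts for how strongly each input influences the output, which is the task-agnosticism discussed above.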

3 An active learning algorithm

DESIGN POINT ALGORITHM (ALGORITHM 1)

Start with a seeding set of input vectors, at least one of which lies in the interior of the input polytope. Call D the set of corresponding design points.

Repeat until the stopping criteria are met:

  Step 0. Fit an EMULATOR to all design points in D.
  Step 1. For each interior input vector:
    Step 1.1. Remove the corresponding design point from D.
    Step 1.2. Refit the EMULATOR to the remaining design points in D.
  Step 2. Obtain a set of candidate input vectors (see Appendix D).
  Step 3. For each candidate input vector:
    Step 3.1. Estimate predictions with the emulator from Step 0 and with each refit emulator from Step 1.2.
    Step 3.2. Get the range of these predictions.
  Step 4. Identify the candidate input vector with the largest range of predictions.
  Step 5. IF this range exceeds a predefined resampling threshold:
    - Add the corresponding design point to D and return to Step 0.
  ELSE:
    Step 5.1. Estimate the standard error of the emulator’s predictions at all candidate input vectors.
    Step 5.2. Identify the candidate point with the largest standard error.
    IF this standard error exceeds a predefined threshold:
      - Add the corresponding design point to D and return to Step 0.
    ELSE: END.

The proposed active learning (AL) algorithm (Box 3: ALGORITHM 1) adopts aspects of sequential designs from the current literature [7]. The algorithm seeks to select design points in the regions where (i) the simulator output changes quickly and (ii) the emulator has high predictive uncertainty. The first goal is pursued through a resampling procedure, which successively removes existing design points from the complete set of design points, refits emulators using each of the resulting subsets of design points, and then identifies the “candidate” input vectors with the largest range in predictions obtained using this series of emulators. Candidate input vectors are input vectors that have not yet been evaluated with the simulator. With this resampling procedure, the regions where the simulator is “fast-changing” are likely to have larger ranges in predictions than regions where the simulator is relatively flat [18]. For example, removing a design point from a region where small changes in a PSA threshold value produce large changes in life-years saved will likely change the prediction at a nearby PSA threshold value. If the resampling procedure does not identify any “fast-changing” regions (towards satisfying the first goal above), then the second goal is pursued by examining the regions with large variance in the emulator’s prediction. Choosing additional design points in these regions will reduce the emulator’s overall predictive uncertainty. Details of the algorithm’s process are described below, and Figure 2 illustrates the steps of the algorithm using a simplistic 1-dimensional example.

To start, the algorithm requires an initial set of input vectors, which includes the extreme vertices of the input space and a non-empty set of input vectors in the interior of the input polytope (Figure 2 (a)). The initial input vectors are paired with their output values evaluated in the simulator to obtain the set of initial design points (Figure 2 (b)). An emulator is then fit to the full set of design points (Step 0, Figure 2 (c)). Next, for each interior input vector, the corresponding design point is removed from the complete set of design points (Step 1.1) and a new emulator is fit to the remaining set of design points (Step 1.2, Figure 2 (d)). A set of candidate input vectors with “large” prediction errors from the current emulator is then obtained (Step 2); we describe a simple algorithm for selecting candidate points in Appendix D. For each candidate input vector, predictions are estimated using the emulator and each re-fit emulator from Step 1.2 (Step 3.1, Figure 2 (e)); the range of these predictions is obtained (Step 3.2, Figure 2 (f)). The “winning” candidate input vector is the one with the largest range in predictions (Step 4). If the range for this candidate input vector is above a pre-specified resampling threshold, the new design point is added to the set of existing design points and we return to Step 0 (Step 5, Figure 2 (g)). Steps 0 through 5 are repeated until the range of predictions for the winning candidate is below the threshold (Figure 2 (h)). Then, the standard error of the predictions with the current emulator is estimated for each candidate input vector (Step 5.1, Figure 2 (i)), and the candidate input vector with the largest prediction error is identified (Step 5.2). If the prediction error for this candidate input vector is above a pre-specified threshold, the corresponding new design point is added to the set of existing design points and we return to Step 0. The process repeats until both criteria are satisfied.

(a) Initial set-up: Input vectors.
(b) Initial set-up: Design points.
(c) Fit an emulator (Step 0).

(d) Remove the i-th interior point (Step 1.1) and re-fit the emulator (Step 1.2).
(e) For each candidate input (Step 2), obtain predictions using the re-fit emulators (Step 3.1).
(f) Obtain the range of predictions at each candidate input across the full and re-fit emulators (Step 3.2).

(g) With the candidate input with the largest range in predictions (Step 4), fit a new emulator (Step 5).
(h) Repeat Steps 0-5 until range of predictions is below threshold.
(i) Estimate standard error of predictions (Step 5.1) and identify input with largest prediction error (Step 5.2).
Figure 2: 1-dimensional example of ALGORITHM 1. Blue points: interior design points; grey points: design points corresponding to extreme vertices; grey curve: simulator; blue curve: emulator prediction; red dashed curves: the emulator’s 95% prediction intervals.
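The loop described above can be sketched in a few dozen lines. The sketch below uses scikit-learn's GP regressor on a hypothetical 1-D toy simulator (the authors' implementation is in R with GPfit); the toy function, the candidate grid, the thresholds, and the safety budget are all illustrative assumptions, and the candidate-selection step is simplified to a fixed grid rather than the tessellation-based routine of Appendix D:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulator(x):                      # hypothetical cheap stand-in simulator
    return np.sin(3 * x) * x

DELTA, SIGMA = 0.05, 0.05              # resampling and standard-error thresholds

def fit_gp(X, y):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                                  optimizer=None, alpha=1e-6, normalize_y=True)
    return gp.fit(X.reshape(-1, 1), y)

# Seeding set: the extreme points of the domain [0, 2] plus interior points.
X = np.array([0.0, 0.7, 1.3, 2.0])
y = simulator(X)

while len(X) < 40:                     # safety budget; normally self-terminates
    gp_full = fit_gp(X, y)             # Step 0

    grid = np.linspace(0.0, 2.0, 101)  # Step 2: candidate input vectors
    cand = grid[np.abs(grid[:, None] - X[None, :]).min(axis=1) > 1e-9]

    # Steps 1 and 3: leave-one-out refits and the range of their predictions.
    preds = [gp_full.predict(cand.reshape(-1, 1))]
    for i in range(1, len(X) - 1):     # interior design points only
        gp_loo = fit_gp(np.delete(X, i), np.delete(y, i))
        preds.append(gp_loo.predict(cand.reshape(-1, 1)))
    ranges = np.ptp(np.array(preds), axis=0)

    if ranges.max() > DELTA:           # Steps 4-5: fast-changing region found
        x_new = cand[ranges.argmax()]
    else:                              # Steps 5.1-5.2: predictive uncertainty
        _, se = gp_full.predict(cand.reshape(-1, 1), return_std=True)
        if se.max() > SIGMA:
            x_new = cand[se.argmax()]
        else:
            break                      # both criteria met: self-termination
    X = np.sort(np.append(X, x_new))
    y = simulator(X)                   # only x_new is a genuinely new evaluation

print(len(X))                          # design points at termination
```

In a real application the simulator call is the expensive step, so only the single new point would be evaluated per iteration; the leave-one-out refits reuse already-computed outputs.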

3.1 Implementation

The AL algorithm is implemented in R and uses a modified version of the GPfit package [30] for fitting GP emulators. When refitting emulators during iterations of the active learning algorithm, we used the parameter estimates of previously fit emulators as initial values. We use the geometry package [31] to obtain the tessellation of the input domain that is required for the algorithm in Appendix D, which identifies the candidate points at each iteration of the active learning algorithm, and the lhs package to generate LHS designs [26].

4 Application

4.1 The PSAPC model

The PSAPC microsimulation model accounts for the relationship between PSA levels, prostate cancer disease progression, and clinical detection [32, 2]. The model, its estimation approach, its calibration, and its comparison with other prostate cancer models have been described in detail elsewhere [2, 3, 33, 34]. Here, we treat the PSAPC model as a “black box”.

Figure 3 outlines the PSAPC model. Briefly, simulated healthy men may develop preclinical, local-regional cancer (disease onset). The PSAPC version we are using incorporates disease grade (low=Gleason scores 2-7; high=Gleason scores 8-10) which is determined and fixed upon disease onset. Patients with low- or high-grade, local-regional cancer may progress to distant sites (metastatic spread). Patients with either local-regional or metastatic disease may present with symptoms (clinical detection). Those with a clinically-detectable form of disease may die from prostate cancer (prostate cancer mortality). At any time and any health state in the model, patients may die from other causes (other-cause mortality). Disease progression is related to age or PSA levels. PSA levels are modeled as a function of age and age of disease onset, such that there is a linear changepoint in (log) PSA after disease onset. PSA levels after disease onset differ for those with low versus high-grade disease. Parameters for the age of disease onset, metastatic spread, and clinical detection are estimated from calibration.

Figure 3: PSAPC model natural history of disease progression. Healthy, preclinical, clinical, prostate cancer mortality, and other-cause mortality states in the absence of screening. Rounded rectangles represent the various health states. Arrows between rectangles represent allowable transitions between health states. People develop preclinical local-regional or distal disease, which may manifest clinically. Patients can die of prostate cancer only after they have developed clinical local-regional or distal disease. People can die of other causes from any “alive” health state. For simplicity, transitions from the “alive” health states to “death from other causes” are not drawn explicitly, but are depicted by the broken arrows and the letter “d”. GS: Gleason Score.

In the presence of screening, the simulated individuals with cancer may be identified and treated earlier (i.e., during preclinical state) than without screening. The model assigns each simulated individual a schedule of PSA-based screening tests and biopsies, as determined by the simulated screening strategy. Every time screening occurs, men with PSA levels above the PSA positivity threshold are referred to biopsy, and those with a positive biopsy result are managed with radical prostatectomy, radiation therapy, or active surveillance (i.e., no treatment but continued monitoring) [32, 2].

The PSAPC model projects several outcomes, including the number of screenings, false-positive results, prostate cancer diagnoses, prostate cancer deaths, and life-years gained with PSA-screening versus no screening (i.e., clinical detection only). The model results are presented as the mean number of events or the lifetime probability of each outcome based on the simulated cohorts of men (e.g., 100 million men).


4.2 Example goal: Life expectancy with different PSA positivity thresholds for a set of screening policies

Using the PSAPC model described above, we aimed to estimate the expected life-days saved with PSA-based screening versus no screening for policy-relevant screening strategies that used age-specific PSA thresholds. We considered all annual PSA screening strategies with PSA positivity thresholds between 0.5 and 6.0 ng/mL and four age-specific PSA positivity thresholds: one for men aged 40-44 years, a second for men aged 45-49 years, a third for men aged 50-54, and a fourth for men aged 55-74. Because PSA levels increase with increasing age [35], we took “policy-relevant” strategies to be those whose age-specific PSA positivity thresholds are constant within each age group, with the threshold for a given age group at least as high as the threshold in the preceding age group. In sensitivity analyses, we also varied the number of age-specific PSA positivity thresholds.
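To get a feel for how quickly the policy-relevant space grows, one can enumerate the monotone (non-decreasing) threshold combinations on a discretized grid; the 0.5 ng/mL grid step below is an illustrative assumption, not a value from the paper:

```python
from itertools import combinations_with_replacement
import numpy as np

# Candidate PSA positivity thresholds: 0.5 to 6.0 ng/mL in steps of 0.5
# (12 values; the step size is an illustrative assumption).
grid = np.arange(0.5, 6.01, 0.5)

# Policy-relevant strategies: four age-specific thresholds that are
# non-decreasing with age. combinations_with_replacement yields exactly
# the sorted (monotone) 4-tuples.
strategies = list(combinations_with_replacement(grid, 4))
print(len(strategies), 12 ** 4)  # 1365 monotone strategies vs 20736 unrestricted
```

Even on this coarse grid, evaluating every monotone strategy directly in a simulator that needs minutes per run would take days, which motivates the emulator.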

To obtain results with the PSAPC model with negligible Monte Carlo error, we simulated cohorts of 100 million men aged 40 in the year 2000 who were screened annually from age 40 to age 74.

4.3 The PSAPC emulators

A series of GP emulators of the PSAPC model were developed to predict the expected life-days saved with screening versus no screening, using as predictors the PSA positivity thresholds in each age group. The parameter values for the emulators’ initial set-up and the selection of subsequent design points are outlined in Table 1. Briefly, an initial seeding set of PSA positivity threshold values was sampled over the input space; it included the extreme vertices of the input space and additional randomly sampled interior vectors. Because the location of the seeding design points can affect the fit of the initial emulator, one emulator was fit to each of 10 initial seeding sets. The mean function for the GP-based emulators was a constant. Overall, there were 8 cases governing the initial set-up of the emulators (Table 1).

Parameter for emulator development Values
Initial set-up (both approaches)
   Number of input variables () {}
   Number of initial design points () {}
   Number of initial seeding sets of design points {}
   Emulator mean function () {constant}
AL approach
   Resampling threshold in life-days () {}
   SE threshold in life-days () {}
   Number of consecutive iterations meeting threshold criteria {}
LHS approach
   Number of seeds per initial set of design points {}
   Number of LHS samples per iteration () {}
   Final number of design points* () {}
  • AL: active learning algorithm; LHS: Latin hypercube sampling.

  • *The LHS approach stopped selecting design points when 100 points had been selected. For , the final number of design points with LHS was 103 (), 101 (), 104 (), and 102 ().

Table 1: Set-up for emulators developed with the AL and LHS approaches.

For each of the 8 initial cases, the GP emulators were developed using the proposed active learning algorithm (AL approach) and using an LHS space-filling design (LHS approach). The AL algorithm in Section 3 was applied to each of the 10 initial seeding sets of design points. The stopping criteria were 0.2 or 0.5 life-days gained for the resampling threshold and 0.5 life-days gained for the predictive uncertainty threshold, because we judged that a precision on the order of 1 day was good enough for policy making in this application. The stopping criteria needed to be met for up to 5 consecutive iterations. With the LHS approach, sequential batches of 5 points were repeatedly sampled across the input space, evaluated in the PSAPC simulator, and added as design points to the previous set until 100 design points were obtained. Because LHS is a randomized routine, 10 LHS-based emulators were developed using different random seeds for each of the 10 initial seeding sets, resulting in a total of 100 LHS-based emulators (10 replications of 10 random initial sets) for each of the 8 initial cases.

4.4 Performance metrics

To assess each emulator’s “closeness” to the PSAPC simulator, the following metrics were estimated by evaluating the PSAPC simulator over a fine grid of PSA positivity threshold values: (i) the square-root mean squared error, RMSE = sqrt( (1/N) Σ_i ( f̂(x_i) − f(x_i) )² ), which averages discrepancies over the N grid points spanning the whole input domain; and (ii) the maximum absolute difference between the emulator prediction and the PSAPC simulator output, MAX = max_i | f̂(x_i) − f(x_i) |. For both metrics, a lower value indicates better performance (i.e., closer to the PSAPC simulator). PSA positivity threshold values ranged from 0.5 to 6 ng/mL, with the threshold for a given age group at least as high as the value in the preceding age group, as described above. The number of grid evaluations depended on the number p of age-specific thresholds.¹ ¹The number of evaluations is as follows. For p = 1, there were 101 evaluations, so that each interval had length 0.01. For p = 2, a fine grid with intervals of 0.02 was obtained and then restricted to the monotone subset; the 101 values from the p = 1 grid were also included, for a total of 1,376 evaluations. For p = 3, a fine grid with intervals of 0.04 was obtained and then restricted to the monotone subset; the 1,376 values from the p = 2 grid were also included, for a total of 4,301 evaluations. For p = 4, a fine grid with intervals of 0.05 was obtained and then restricted to the monotone subset; the 4,301 values from the p = 3 grid were also included.
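The two metrics are straightforward to compute once the simulator has been evaluated on the grid. A minimal sketch (the toy functions and grid are illustrative assumptions):

```python
import numpy as np

def rmse_max(emulator, simulator, grid):
    """Closeness metrics between an emulator and a simulator on a fixed grid."""
    diff = emulator(grid) - simulator(grid)
    rmse = np.sqrt(np.mean(diff ** 2))  # average discrepancy over the domain
    mx = np.max(np.abs(diff))           # worst-case discrepancy
    return rmse, mx

grid = np.linspace(0.5, 6.0, 101)
f = lambda x: np.log(x)                 # toy "simulator"

# A perfect emulator scores zero on both metrics.
print(rmse_max(f, f, grid))             # (0.0, 0.0)

# An emulator with a constant bias of 0.1 scores 0.1 on both.
g = lambda x: f(x) + 0.1
print(rmse_max(g, f, grid))             # approximately (0.1, 0.1)
```

RMSE rewards good average behavior, while MAX flags localized failures; the paper reports both for exactly this reason.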

5 Results

Figure 4 illustrates the performance of emulators fit with the AL approach using a resampling threshold of 0.5 life-days (solid black lines) and with the LHS approach (dashed grey lines) for one of the initial cases. In particular, Figures 4 (a, b) show the RMSE and Figures 4 (c, d) show the MAX, as assessed at each iteration of the design-point selection process. For both approaches, the addition of design point(s) may improve or worsen an emulator’s performance, although performance typically improves with more design points. The improvement tends to be greater in the beginning, when there are few design points, as indicated by the initial steep slopes of the curves. The curves level off as more design points are selected; the AL approach terminates shortly after the curves level off (e.g., 40-50 design points), whereas the LHS approach exhausted the budget of 100 design points. At any number of design points, all AL-trained emulators performed at least as well as the LHS-trained emulators. A more extensive set of emulator result comparisons is available elsewhere [20].

Table 2 presents the emulators’ performance results for all scenarios. For the AL-trained emulators, the results are for the emulators fit to the final set of design points (i.e., when the AL terminated). To facilitate comparisons with the LHS-based emulators, the table presents performance results of LHS emulators fit to intermediate numbers of design points and to the full budget of design points. The AL-trained emulators typically achieved a better performance more quickly (i.e., with fewer design points/simulator evaluations) than the LHS-trained emulators. For example, with stopping criteria met for 5 consecutive iterations, the median [range] MAX for emulators trained with the AL approach was 0.78 [0.37, 1.30] with between 26 and 50 design points, whereas the MAX for LHS-based emulators was 0.86 [0.50, 1.44] with 62 design points. However, in practice, one would train the LHS emulators on the whole budget, where the respective results are 0.71 [0.47, 1.22]. Thus, the AL-trained emulators attained comparable MAX metrics with 2 to 4 times fewer design points.

Across all analyses, the performance gains of the AL-trained emulators over the LHS-trained emulators with comparable numbers of design points were larger when using more stringent stopping criteria (e.g., a lower resampling threshold or a higher number of consecutive iterations).

(a) RMSE
(b) RMSE for first 60 design points.

(c) MAX
(d) MAX for first 60 design points.
Figure 4: Performance of emulators for one initial case with a constant mean function. From 10 initial seeding sets of design points, 10 emulators were fit using AL (black solid curves) and 100 emulators (10 per initial seed) were fit using LHS (grey dashed curves). (a) Square-root mean squared error (RMSE). (b) Magnification of the RMSE from (a) for the first 60 design points. (c) Maximum difference between emulator prediction and simulator results (MAX). (d) Magnification of the MAX from (c) for the first 60 design points.
Initial emulators 12 1.83 [1.71, 1.99] 0.62 [0.44, 0.81] 40 1.15 [0.82, 1.55] 0.25 [0.19, 0.40]
AL ()
  # consecutive iterations
         1 [14-36] 0.98 [0.43, 1.75] 0.19 [0.10, 0.47] [43-55] 0.78 [0.44, 1.24] 0.14 [0.11, 0.28]
         2 [15-41] 0.86 [0.44, 1.70] 0.16 [0.11, 0.49] [45-59] 0.71 [0.29, 0.86] 0.12 [0.07, 0.16]
         3 [19-44] 0.77 [0.48, 1.68] 0.14 [0.10, 0.50] [49-63] 0.67 [0.31, 0.88] 0.12 [0.07, 0.17]
         4 [24-49] 0.76 [0.42, 1.41] 0.14 [0.09, 0.36] [50-64] 0.67 [0.31, 0.96] 0.11 [0.07, 0.19]
         5 [26-50] 0.78 [0.37, 1.30] 0.13 [0.09, 0.33] [53-65] 0.68 [0.32, 0.87] 0.11 [0.07, 0.17]
AL ()
  # consecutive iterations
         1 [14-39] 1.25 [0.86, 1.75] 0.31 [0.13, 0.51] [43-63] 0.76 [0.27, 1.38] 0.13 [0.06, 0.28]
         2 [19-54] 0.99 [0.47, 1.38] 0.21 [0.09, 0.38] [49-70] 0.69 [0.27, 1.13] 0.11 [0.06, 0.22]
         3 [25-55] 1.03 [0.46, 1.32] 0.21 [0.09, 0.34] [50-71] 0.70 [0.27, 1.14] 0.11 [0.06, 0.22]
         4 [26-56] 0.99 [0.46, 1.32] 0.17 [0.09, 0.34] [56-72] 0.54 [0.27, 0.96] 0.09 [0.06, 0.27]
         5 [27-57] 0.99 [0.45, 1.33] 0.16 [0.09, 0.32] [57-73] 0.54 [0.27, 1.04] 0.09 [0.06, 0.31]
   17 1.59 [0.89, 2.45] 0.38 [0.22, 0.62] 45 0.97 [0.64, 1.85] 0.19 [0.12, 0.35]
   22 1.47 [0.80, 2.00] 0.28 [0.17, 0.41] 50 0.94 [0.62, 1.74] 0.17 [0.10, 0.32]
   62 0.86 [0.50, 1.44] 0.14 [0.08, 0.26] 90 0.74 [0.41, 1.38] 0.11 [0.07, 0.21]
  last 102 0.71 [0.47, 1.22] 0.11 [0.07, 0.16] 100 0.70 [0.39, 1.32] 0.10 [0.07, 0.20]
  • AL: Active Learning algorithm; LHS: Latin Hypercube Sampling; MAX: Maximum deviation between emulator and simulator; RMSE: Square-Root Mean Squared Error.

Table 2: Overview of performance results for emulators with a constant mean function. The number of design points used to fit the emulators is shown for each row. For the active learning algorithm approach, the range in brackets is the number of design points when stopping criteria were met for 1, 2, 3, 4, or 5 consecutive iterations. Results for MAX and RMSE are the median [range] for emulators developed with 10 different initial seeds (AL approach) and 10 replications of the 10 random initial seeds (LHS approach).

6 Discussion

Our AL algorithm appears to be a reasonable choice for many emulation problems, because (i) it is self-stopping and (ii) it is efficient, in that it requires only a limited number of simulator evaluations to achieve a close-enough approximation to the simulator. The self-stopping attribute is important because there is little guidance on selecting the total number of design points for GP emulators. “Rules of thumb” based on simulation testing suggest that about 100 points may be sufficient for a few input variables [36, 27]. Others have needed up to 400 points for emulators with 28 to 30 input variables [37]. By contrast, in all analyses, the AL algorithm self-terminated to yield well-performing emulators. We believe (but have not proven) that for smooth simulators the AL algorithm will always terminate, because GPs converge to the underlying simulator as the number of design points increases [38]. Therefore, we believe that this work responds to the call by the Second Panel on Cost-Effectiveness in Health and Medicine for future research on “practical approaches to efficiently develop well-performing emulators” of mathematical models in healthcare [15, 13, 39].

Our results are consistent with the literature, which also suggests that sequential designs can approximate simulators with predictive accuracy that is at least comparable to or better than that of conventional space-filling designs [18, 40]. The efficiency of the AL algorithm is an important consideration for simulators that require substantially more computational resources compared to the PSAPC, such as the physiology-based chronic disease model Archimedes [41, 42] or detailed modeling of the biomechanics of red blood cells [43]. In our case, we chose a computationally tractable simulator so as to quantify the evolution of RMSE and MAX metrics for emulators developed with different designs and to demonstrate computational efficiencies. Such calculations would be impractical with mathematical models that take hours or days, rather than minutes, to evaluate.

6.1 Limitations

Several of the limitations pertain to our use of GPs for emulation: GPs do not scale well to many inputs, do not extrapolate well, and can themselves be expensive to fit [38]. O’Hagan suggests that GPs may be practical for up to a limited number of input dimensions [8]. If a larger number of inputs must be emulated, one should explore whether developing several lower-dimensional emulators suffices. Knowing the analytic specification of the underlying model could offer insights into the feasibility of this simplification. GPs are interpolators and should not be used to extrapolate outside the input polytope. If such extrapolations are required, other types of emulators should be used, as the emulation problem then has different goals than those described in Section 2.2.

The computational cost of fitting a GP grows as approximately $O(n^3)$ in the number $n$ of design points [38], so for large $n$, fitting GPs becomes computationally expensive. The AL algorithm requires successively fitting a large number of GPs. However, each fit involves a GP with one fewer or one more design point than an already fitted GP. Re-fitting on a minimally reduced or augmented dataset can be sped up by using previously estimated parameter values as starting values for the fitting algorithm (as was done here), and with other approximations [44].
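To illustrate the warm-start strategy (a sketch under our own assumptions, not the paper's implementation), the code below fits a constant-mean GP with a squared exponential correlation by minimizing the profile negative log-likelihood with scipy, then refits after adding one design point, using the previous roughness estimates as starting values. All function names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def neg_profile_loglik(log_beta, X, y):
    # Profile negative log-likelihood (up to constants) for a constant-mean
    # GP with squared-exponential correlation; roughness = exp(log_beta).
    n = len(y)
    B = np.exp(log_beta)
    R = np.exp(-cdist(X * np.sqrt(B), X * np.sqrt(B), "sqeuclidean"))
    R += 1e-8 * np.eye(n)                       # jitter for numerical stability
    Ri = np.linalg.inv(R)
    one = np.ones(n)
    mu = (one @ Ri @ y) / (one @ Ri @ one)      # closed-form constant mean
    s2 = (y - mu) @ Ri @ (y - mu) / n           # closed-form process variance
    _, logdet = np.linalg.slogdet(R)
    return n * np.log(s2) + logdet

def refit(X, y, start):
    # Warm-start the optimizer at the previous fit's roughness estimates.
    return minimize(neg_profile_loglik, x0=start, args=(X, y),
                    method="Nelder-Mead").x

rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
theta = refit(X, y, start=np.zeros(2))          # initial fit
# Add one design point and refit, warm-starting at the previous estimate.
X2 = np.vstack([X, [[0.5, 0.5]]])
y2 = np.append(y, np.sin(1.5) + 0.25)
theta2 = refit(X2, y2, start=theta)
```

In practice the warm start mainly saves optimizer iterations, since the likelihood surface changes little when a single design point is added or removed.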

GPs trained with the AL algorithm are not guaranteed to converge to the simulator, whereas GPs trained with LHS will eventually do so [23]. Asymptotic convergence could probably be achieved by modifying step 5 of the AL algorithm to also allow choosing, with some probability, a set of random design points in an “exploration” step. However, this might undercut the AL algorithm's efficiency gains.
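A minimal sketch of such an “exploration” modification (illustrative names, not from the paper): with probability epsilon the algorithm picks a random candidate instead of the maximum-SE one.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_design_point(candidates, se, epsilon=0.1):
    # With probability epsilon, explore: pick a random candidate;
    # otherwise exploit: pick the candidate with the largest predicted SE.
    if rng.random() < epsilon:
        return candidates[rng.integers(len(candidates))]
    return candidates[int(np.argmax(se))]

cands = np.array([[0.1, 0.2], [0.8, 0.3], [0.5, 0.9]])
se = np.array([0.02, 0.10, 0.05])
x_new = next_design_point(cands, se, epsilon=0.0)  # pure exploitation
```

Setting epsilon to zero recovers the original exploit-only behavior; small positive values trade some efficiency for asymptotic space-filling coverage.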

6.2 Extensions

In some applications, we may be interested in both the expectation and the variance of the simulator output. The AL algorithm could be used by substituting stochastic GPs for the deterministic GPs that we have used here. Stochastic GPs model the error in the observation [38, 8, 7]. If a stochastic GP is used, some care is needed to avoid setting too stringent a resampling or SE threshold in the stopping criteria. For example, if the threshold is smaller than the simulator’s prediction uncertainty, the AL algorithm will never terminate.
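A toy illustration of the point about thresholds (our sketch, not the paper's code): a stochastic GP adds a nugget variance to the covariance diagonal, which bounds how small the predictive SE can become.

```python
import numpy as np

def cov_with_nugget(R, sigma2, tau2):
    # A stochastic (noisy) GP adds a nugget tau2 to the covariance diagonal,
    # modeling observation error in the simulator output. The predictive SE
    # at a design point then no longer shrinks to zero, so a stopping
    # threshold below sqrt(tau2) could never be met.
    return sigma2 * R + tau2 * np.eye(R.shape[0])

R = np.array([[1.0, 0.6], [0.6, 1.0]])   # correlation matrix of two points
C = cov_with_nugget(R, sigma2=2.0, tau2=0.5)
```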

For emulating multivariate simulator outputs, a multivariate GP can be used, which can also model correlations between the multiple outputs [38, 16, 45]. If multivariate emulators are used, the stopping criteria for the AL algorithm would also need to be modified; e.g., they should be met for each output, or for some function of the outputs. However, in practice, for a large class of models, there appears to be little gain in learning multivariate emulators compared to several univariate ones [46, 47, 48].

The AL algorithm can be used with different types of emulators, including regression or neural network learners, although we have not examined these in an example. We have also not explicitly discussed the case of linked emulators, which are used to approximate systems of computer simulators or, equivalently, distinct modules of a modular simulator, where the results of one module are used by other modules to yield the output of the overall simulator [48, 49]. That literature suggests that we should be able to fit separate emulators for each simulator module with our AL algorithm. Finally, elsewhere, we build on this work to develop AL algorithms that train emulators to a target accuracy, albeit at an increased computational cost [20].


We thank Dr. Roman Gulati for providing the PSAPC model source code and for answering technical questions about the implementation.

Conflict of interest

We have no conflicts to declare.


  • Owens et al. [2016] Owens DK, Whitlock EP, Henderson J, Pignone MP, Krist AH, Bibbins-Domingo K, et al. Use of Decision Models in the Development of Evidence-Based Clinical Preventive Services Recommendations: Methods of the US Preventive Services Task Force Decision Modeling for USPSTF Recommendations. Annals of Internal Medicine 2016;165(7):501–508.
  • Gulati et al. [2010] Gulati R, Inoue L, Katcher J, Hazelton W, Etzioni R. Calibrating disease progression models using population data: a critical precursor to policy development in cancer control. Biostatistics 2010;11(4):707–719.
  • Gulati et al. [2013] Gulati R, Gore J, Etzioni R. Comparative effectiveness of alternative prostate-specific antigen–based prostate cancer screening strategies: Model estimates of potential benefits and harms. Annals of Internal Medicine 2013;158(3):145–153. +
  • National Research Council [2012] National Research Council. Assessing the Reliability of Complex Models: Mathematical and Statistical Foundations of Verification, Validation, and Uncertainty Quantification. Washington, D.C.: NRC; 2012.
  • Rutter et al. [2009] Rutter CM, Miglioretti DL, Savarino JE. Bayesian calibration of microsimulation models. Journal of the American Statistical Association 2009;104(488):1338–1350.
  • Saltelli et al. [2008] Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, et al. Global sensitivity analysis: the primer. John Wiley & Sons; 2008.
  • Kleijnen [2008] Kleijnen J. Design and analysis of simulation experiments, vol. 20. Springer; 2008.
  • O’Hagan [2006] O’Hagan A. Bayesian analysis of computer code outputs: A tutorial. Reliability Engineering & System Safety 2006;91(10–11):1290–1300. (The Fourth International Conference on Sensitivity Analysis of Model Output, SAMO 2004.)
  • Bertsimas et al. [2018] Bertsimas D, Silberholz J, Trikalinos TA. Optimal healthcare decision making under multiple mathematical models: application in prostate cancer screening. Health care management science 2018;21(1):105–118.
  • Santner et al. [2013] Santner TJ, Williams BJ, Notz WI. The design and analysis of computer experiments. Springer Science & Business Media; 2013.
  • Sacks et al. [1989] Sacks J, Welch W, Mitchell T, Wynn H. Design and analysis of computer experiments. Statistical science 1989;p. 409–423.
  • Blanning [1975] Blanning RW. The construction and implementation of metamodels. Simulation 1975;24(6):177–184.
  • Neumann et al. [2018] Neumann PJ, Kim DD, Trikalinos TA, Sculpher MJ, Salomon JA, Prosser LA, et al. Future Directions for Cost-effectiveness Analyses in Health and Medicine. Med Decis Making 2018 Oct;38(7):767–777.
  • Jalal et al. [2015] Jalal H, Goldhaber-Fiebert JD, Kuntz KM. Computing expected value of partial sample information from probabilistic sensitivity analysis using linear regression metamodeling. Medical Decision Making 2015;35(5):584–595.
  • Neumann et al. [2016] Neumann PJ, Sanders GD, Russell LB, Siegel JE, Ganiats TG. Cost-effectiveness in health and medicine. Oxford University Press; 2016.
  • Conti and O’Hagan [2010] Conti S, O’Hagan A. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference 2010;140(3):640 – 651.
  • Kennedy et al. [2006] Kennedy M, Anderson C, Conti S, O’Hagan A. Case studies in Gaussian process modelling of computer codes. Reliability Engineering & System Safety 2006;91(10–11):1301–1309. (The Fourth International Conference on Sensitivity Analysis of Model Output, SAMO 2004.)
  • Kleijnen and Van Beers [2004] Kleijnen J, Van Beers W. Application-driven sequential designs for simulation experiments: Kriging metamodelling. Journal of the Operational Research Society 2004;55(8):876–883.
  • McKay et al. [1979] McKay MD, Beckman RJ, Conover WJ. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 1979;21(2):239–245.
  • Ellis [2018] Ellis AG. Emulators to facilitate the development and analysis of computationally expensive decision analytic models in health. PhD thesis, Brown University; 2018.
  • İnci Batmaz and Tunali [2003] İnci Batmaz, Tunali S. Small response surface designs for metamodel estimation. European Journal of Operational Research 2003;145(2):455 – 470.
  • Lin et al. [2010] Lin CD, Bingham D, Sitter RR, Tang B. A new and flexible method for constructing designs for computer experiments. The Annals of Statistics 2010;p. 1460–1477.
  • McKay et al. [2000] McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 2000;42(1):55–61.
  • Salagame and Barton [1997] Salagame RR, Barton RR. Factorial hypercube designs for spatial correlation regression. Journal of Applied Statistics 1997;24(4):453–474.
  • Jones et al. [1998] Jones DR, Schonlau M, Welch WJ. Efficient global optimization of expensive black-box functions. Journal of Global optimization 1998;13(4):455–492.
  • Carnell [2016] Carnell R. lhs: Latin Hypercube Samples; 2016. R package version 0.14.
  • Meckesheimer et al. [2001] Meckesheimer M, Barton RR, Simpson T, Limayem F, Yannou B. Metamodeling of combined discrete/continuous responses. AIAA journal 2001;39(10):1950–1959.
  • Park et al. [2002] Park S, Fowler JW, Mackulak GT, Keats JB, Carlyle WM. D-optimal sequential experiments for generating a simulation-based cycle time-throughput curve. Operations Research 2002;50(6):981–990.
  • Gano et al. [2006] Gano SE, Renaud JE, Martin JD, Simpson TW. Update strategies for kriging models used in variable fidelity optimization. Structural and Multidisciplinary Optimization 2006;32(4):287–298.
  • MacDonald et al. [2015] MacDonald B, Ranjan P, Chipman H, et al. GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs. Journal of Statistical Software 2015;64(12).
  • Barber et al. [2012] Barber C, Habel K, Grasman R, Gramacy RB, Stahel A, Sterratt DC. geometry: Mesh generation and surface tessellation; 2012. R package version 0.3-1, https://cran.r-project.org/web/packages/geometry/.
  • CISNET [accessed July 2015] CISNET, Cancer Intervention and Surveillance Modeling Network. Prostate Cancer Modeling. Accessed: July 2015.
  • Gulati et al. [2014] Gulati R, Tsodikov A, Etzioni R, Hunter-Merrill RA, Gore JL, Mariotto AB, et al. Expected population impacts of discontinued prostate-specific antigen screening. Cancer 2014;120(22):3519–3526.
  • FHCRC [2009] FHCRC, Fred Hutchinson Cancer Research Center. PSAPC model profile; 2009 (accessed April 26, 2018).
  • Oesterling et al. [1993] Oesterling JE, Jacobsen SJ, Chute CG, Guess HA, Girman CJ, Panser LA, et al. Serum prostate-specific antigen in a community-based population of healthy men: establishment of age-specific reference ranges. Jama 1993;270(7):860–864.
  • Helton et al. [2006] Helton JC, Johnson JD, Oberkampf W, Sallaberry CJ. Sensitivity analysis in conjunction with evidence theory representations of epistemic uncertainty. Reliability Engineering & System Safety 2006;91(10):1414–1434.
  • Rojnik and Naveršnik [2008] Rojnik K, Naveršnik K. Gaussian Process Metamodeling in Bayesian Value of Information Analysis: A Case of the Complex Health Economic Model for Breast Cancer Screening. Value in Health 2008;11(2):240 – 250.
  • Rasmussen and Williams [2006] Rasmussen CE, Williams CK. Gaussian processes for machine learning, vol. 2. the MIT Press; 2006.
  • Sanders et al. [2016] Sanders GD, Neumann PJ, Basu A, Brock DW, Feeny D, Krahn M, et al. Recommendations for conduct, methodological practices, and reporting of cost-effectiveness analyses: second panel on cost-effectiveness in health and medicine. JAMA 2016;316(10):1093–1103.
  • Jin et al. [2001] Jin R, Chen W, Simpson TW. Comparative studies of metamodelling techniques under multiple modelling criteria. Structural and Multidisciplinary Optimization 2001;23(1):1–13.
  • Schlessinger and Eddy [2002] Schlessinger L, Eddy DM. Archimedes: a new model for simulating health care systems—the mathematical formulation. Journal of biomedical informatics 2002;35(1):37–50.
  • Eddy and Schlessinger [2003] Eddy DM, Schlessinger L. Archimedes: a trial-validated model of diabetes. Diabetes care 2003;26(11):3093–3101.
  • Tang et al. [2017] Tang YH, Lu L, Li H, Evangelinos C, Grinberg L, Sachdeva V, et al. OpenRBC: a fast simulator of red blood cells at protein resolution. Biophysical journal 2017;112(10):2030–2037.
  • Shen et al. [2006] Shen Y, Seeger M, Ng AY. Fast gaussian process regression using kd-trees. In: Advances in neural information processing systems; 2006. p. 1225–1232.
  • Genton et al. [2015] Genton MG, Kleiber W, et al. Cross-covariance functions for multivariate geostatistics. Statistical Science 2015;30(2):147–163.
  • Kleijnen and Mehdad [2014] Kleijnen JP, Mehdad E. Multivariate versus univariate Kriging metamodels for multi-response simulation models. European Journal of Operational Research 2014;236(2):573–582.
  • Gu et al. [2016] Gu M, Berger JO, et al. Parallel partial Gaussian process emulation for computer models with massive output. The Annals of Applied Statistics 2016;10(3):1317–1347.
  • Kyzyurova [2017] Kyzyurova K. On Uncertainty Quantification for Systems of Computer Models. PhD thesis, Duke University; 2017.
  • Kyzyurova et al. [2018] Kyzyurova KN, Berger JO, Wolpert RL. Coupling Computer Models through Linking Their Statistical Emulators. SIAM/ASA Journal on Uncertainty Quantification 2018;6(3):1151–1171.
  • Kleijnen and Sargent [2000] Kleijnen J, Sargent R. A methodology for fitting and validating metamodels in simulation. European Journal of Operational Research 2000;120(1):14 – 29.
  • Delaunay [1934] Delaunay B. Sur la sphere vide. Bulletin of Academy of Sciences of the USSR 1934;7:793–800.


C Review of Gaussian Processes

A Gaussian Process is defined as “a collection of random variables, any finite number of which have a joint normal distribution” [38]. A Gaussian Process is specified by its mean function $m(\mathbf{x}) = \mathrm{E}[f(\mathbf{x})]$ and covariance function $k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}\big(f(\mathbf{x}), f(\mathbf{x}')\big)$. We write $f(\cdot) \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big)$.

In practice, a simulator is evaluated $n$ times to obtain a set of design points, $D = \{(\mathbf{x}_i, y_i),\ i = 1, \ldots, n\}$. Typically, the inputs are transformed so that their domain is the unit cube. A Gaussian Process emulator model approximates the simulator with

$y(\mathbf{x}) = \mu(\mathbf{x}) + Z(\mathbf{x}),$

where $Z(\cdot)$ is a zero-mean Gaussian Process and $\mu(\cdot)$ is the mean function.

Commonly, the mean function is a constant, $\mu(\mathbf{x}) = \beta_0$, but it may also be a function of the input vectors, e.g., a linear component $\mu(\mathbf{x}) = \mathbf{h}(\mathbf{x})^T \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of coefficients and $\mathbf{h}(\mathbf{x})$ is a vector of known functions of the input vectors. Under this specification, the residuals from the linear regression are modeled by a zero-mean Gaussian Process. Although specifying a non-constant $\mu(\cdot)$ is not necessary, doing so may increase the smoothness of the fitted Gaussian Process and hence may require fewer design points for a “good” model fit [8].

The “behavior” of the zero-mean Gaussian Process is determined by the covariance function $k(\cdot, \cdot)$. Many choices for the covariance function exist; see [38] for a discussion. A common class of functions assumes a stationary covariance process, i.e., that the relationship between any two points depends only on their distance and not their location. Within this class, a typical choice for the covariance function is the squared exponential or radial basis function

$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\big\{-(\mathbf{x} - \mathbf{x}')^T \mathbf{B} (\mathbf{x} - \mathbf{x}')\big\} = \sigma^2 R(\mathbf{x}, \mathbf{x}'),$

where $\sigma^2$ is the variance of the process, $\mathbf{B}$ a diagonal matrix of non-negative roughness parameters, and $R(\cdot, \cdot)$ the implied correlation function. Note that $R(\mathbf{x}, \mathbf{x}) = 1$, and that the correlation between $y(\mathbf{x})$ and $y(\mathbf{x}')$ is positive and decreases with the distance between $\mathbf{x}$ and $\mathbf{x}'$ [50, 38]. The roughness parameters determine how quickly the correlation decreases in each dimension. A large roughness parameter suggests a low correlation for a given dimension, even when two points are close.
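As a concrete illustration (our own sketch, not code from the paper), the squared exponential covariance with diagonal roughness parameters can be written in a few lines; the function name and arguments are ours.

```python
import numpy as np

def sq_exp_cov(x1, x2, sigma2, beta):
    # sigma2 * exp(-sum_k beta_k * (x1_k - x2_k)^2): the squared
    # exponential covariance with roughness parameters beta.
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return sigma2 * np.exp(-np.sum(np.asarray(beta, float) * d ** 2))

# Correlation is 1 at zero distance and decays with distance; a larger
# roughness parameter makes it decay faster in that dimension.
c_near = sq_exp_cov([0.2, 0.4], [0.25, 0.4], 1.0, [5.0, 5.0])
c_far = sq_exp_cov([0.2, 0.4], [0.90, 0.4], 1.0, [5.0, 5.0])
```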

Let $\mathbf{y} = (y_1, \ldots, y_n)^T$ be a vector of observations modeled with a Gaussian Process. From the definition of the Gaussian Process, these observations are modeled as following an $n$-dimensional multivariate normal distribution. The parameters $\mu$, $\sigma^2$, and $\mathbf{B}$ of the mean function and the covariance function can be obtained by optimizing the likelihood of the observations [7, 38, 30]. For example, for a constant mean function $\mu(\mathbf{x}) = \mu$, the mean and the variance of the Gaussian Process can be expressed in closed form as functions of the roughness parameters,

$\hat{\mu} = \frac{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{y}}{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{1}\hat{\mu})^T \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{n},$

where $\mathbf{1}$ is an $n$-vector of ones and $\mathbf{R}$ is a symmetric positive definite correlation matrix in which the correlation between the $i$-th and $j$-th observations is $R(\mathbf{x}_i, \mathbf{x}_j)$. The negative profile log likelihood

$-2 \log L \propto n \log \hat{\sigma}^2 + \log |\mathbf{R}|$

is a function of the roughness parameters because $\mathbf{R}$, $\hat{\mu}$, and $\hat{\sigma}^2$ are functions of $\mathbf{B}$. In the equation above, $|\mathbf{R}|$ is the determinant of $\mathbf{R}$. Optimizing the likelihood yields estimates $\hat{\mathbf{B}}$, and thus estimates for $\hat{\mu}$, $\hat{\sigma}^2$, and the correlation matrix $\mathbf{R}$.

The best linear unbiased predictor of the output value at a new input vector $\mathbf{x}^*$ is [7, 38]

$\hat{y}(\mathbf{x}^*) = \hat{\mu} + \mathbf{r}^T \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu}),$

where $\mathbf{r}$ is an $n$-vector of correlations between $\mathbf{x}^*$ and the input vectors from the design points, such that the $i$-th element of $\mathbf{r}$ is $R(\mathbf{x}^*, \mathbf{x}_i)$. The prediction error is [7, 38]

$s^2(\mathbf{x}^*) = \hat{\sigma}^2 \left[ 1 - \mathbf{r}^T \mathbf{R}^{-1} \mathbf{r} + \frac{\big(1 - \mathbf{1}^T \mathbf{R}^{-1} \mathbf{r}\big)^2}{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{1}} \right].$

The maximum likelihood estimate $\hat{\mathbf{B}}$ is used to obtain $\mathbf{R}$ and $\mathbf{r}$ in the two prediction formulas above.
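The closed-form estimates and the two prediction formulas can be sketched in a few lines of Python. This is an illustrative implementation with fixed roughness parameters (our code, not the paper's), with a small jitter added for numerical stability.

```python
import numpy as np

def fit_predict(X, y, Xnew, beta):
    # Constant-mean GP: closed-form mu-hat and sigma2-hat for fixed
    # roughness parameters beta, then the BLUP and its standard error.
    n = len(y)
    def corr(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2 * beta).sum(-1)
        return np.exp(-d2)
    R = corr(X, X) + 1e-10 * np.eye(n)       # jitter for stability
    Ri = np.linalg.inv(R)
    one = np.ones(n)
    mu = (one @ Ri @ y) / (one @ Ri @ one)
    s2 = (y - mu) @ Ri @ (y - mu) / n
    r = corr(Xnew, X)                        # correlations with design inputs
    pred = mu + r @ Ri @ (y - mu * one)      # best linear unbiased predictor
    se2 = s2 * (1 - np.einsum("ij,jk,ik->i", r, Ri, r)
                + (1 - r @ Ri @ one) ** 2 / (one @ Ri @ one))
    return pred, np.sqrt(np.maximum(se2, 0))

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.0, 0.25, 1.0])
pred, se = fit_predict(X, y, np.array([[0.5]]), beta=np.array([5.0]))
# At a design point the GP interpolates: pred ~= 0.25 and se ~= 0.
```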

D Algorithm for candidate points

The identification of the set of candidate input vectors noted in ALGORITHM 1 is outlined in ALGORITHM 2 (Box D). A second example is provided in Figure D.1 to illustrate this algorithm in a 2-dimensional setting, where the steps can be illustrated more clearly than in the 1-dimensional case. To start, the algorithm requires a set of design points (Figure D.1 (a)), the corresponding set of input vectors, and an emulator fit with those design points.

First, a triangulation of the input vectors is obtained (Step 1, Figure D.1 (b)). The triangulation partitions the convex hull of the input vectors into simplexes, using all input vectors as vertices. The partitioning is such that the intersection of any two simplexes is either the empty set or a shared lower-dimensional simplex (a shared extreme point, line, or facet).

Any triangulation would do. For relatively small numbers of points (in the few hundreds) and few dimensions (say, fewer than 10), it is practical to use a Delaunay triangulation, for which many routines exist [51]. Within each returned simplex, we identify the vector that maximizes the standard error of the predictions with the emulator, over all points in the simplex (Step 1.1 in ALGORITHM 2, Figure D.1 (c)). The set of unique maximizing vectors are selected as the candidate input vectors to be exploited in ALGORITHM 1.
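To make ALGORITHM 2 concrete, the sketch below uses scipy's Delaunay routine and approximates the within-simplex maximization of the prediction standard error by random barycentric sampling (the paper does not prescribe a particular optimizer, so this is an assumption); `se_fun` is a stand-in for the emulator's prediction SE.

```python
import numpy as np
from scipy.spatial import Delaunay

def candidate_points(X, se_fun, n_per_simplex=50, seed=0):
    # Triangulate the design inputs, then within each simplex pick the
    # point (among random barycentric samples) with the largest predicted SE.
    rng = np.random.default_rng(seed)
    tri = Delaunay(X)
    candidates = []
    for simplex in tri.simplices:
        verts = X[simplex]                          # (d+1, d) simplex vertices
        w = rng.dirichlet(np.ones(len(verts)), n_per_simplex)
        pts = w @ verts                             # random points in the simplex
        candidates.append(pts[np.argmax(se_fun(pts))])
    # Return the unique maximizers as the candidate input vectors.
    return np.unique(np.round(candidates, 8), axis=0)

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.4, 0.5]], dtype=float)
se = lambda P: np.linalg.norm(P - 0.5, axis=1)      # stand-in for emulator SE
cands = candidate_points(X, se)
```

Each candidate lies inside one simplex of the triangulation, so the candidates stay within the input polytope, consistent with the interpolation-only use of GPs.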


For a given set of design points, consider the set of the corresponding input vectors and an emulator fit with those design points.

Step 1. Obtain a triangulation of the input vectors. For each simplex in the triangulation:

Step 1.1. Find the input vector that maximizes the standard error of the prediction over all points in the simplex.

Step 2. Return the candidate input vectors from the input vectors identified in Step 1.

(a) Initial set-up: input vectors. The polytope of the input vectors is the grey shaded rectangle, which includes the extreme vertices A, B, C, D (grey points). E, F are interior vectors (blue points).

(b) Step 1. The triangulation gives the 6 simplexes defined by the blue line segments.

(c) Step 1.1. The vectors (green diamonds) that maximize the standard error of the predictions within each of the 6 simplexes are identified.
Figure D.1: Illustrative example of ALGORITHM 2.