1 Introduction
Many decisions in health must be made under uncertainty or with incomplete understanding of the underlying phenomena. In such cases, mathematical modeling can help decision-makers synthesize evidence from different sources, estimate and aggregate anticipated outcomes while accounting for stakeholder preferences, understand tradeoffs, and quantify the impact of uncertainty on decision making
[1]. To be informative, a model should be detailed enough to capture salient aspects of the decisional problem at hand, but a highly detailed model can render routine analyses computationally expensive, hindering its usability. As an example, consider the prostate-specific antigen (PSA) growth and prostate cancer progression (PSAPC) model, which is a microsimulation model that projects outcomes for PSA-based screening strategies defined by combinations of screening schedules and age-specific PSA positivity thresholds [2, 3]. Although evaluating the expected outcomes of one screening strategy takes just a few minutes on modern computer hardware, analyses that require many model evaluations, such as calibration of model parameters to external data [4, 5], sensitivity analysis [4, 6, 7], and uncertainty analysis [4, 8], become expensive if not impractical. For example, in part because of the computational burden, Gulati and colleagues evaluated only a small subset of screening strategies [3], out of the much larger set of policy-relevant strategies that could have been explored [9].
A way to mitigate the computational cost of such analyses is to use emulators of the original mathematical models. Emulators, also called surrogate models or metamodels, are approximations of the original models, hereafter referred to as simulators for consistency with prior works [7, 8, 10, 11]. Once developed, emulators are orders of magnitude faster to evaluate than the simulators and, thus, can be used instead of, or in tandem with, the simulators to perform computationally intensive tasks [7, 12, 11]. Although uncommon in healthcare applications [13, 14, 15], using emulators in place of simulators is a well-established practice in mechanical, electrical, and industrial engineering, geology, meteorology, and military modeling [7, 16, 17, 4, 8].
Developing an emulator that approximates a simulator over an input domain is analogous to estimating a response surface, so it is an experimental-design problem [18]. In this context, the simulator is used to generate data (a design), and an emulator is fit to the design. In this work we describe an adaptive algorithm for developing designs that are tailored to a specific simulator and a particular emulation task. We compare the performance of emulators generated with the adaptive design versus the standard approach of Latin Hypercube Sampling (LHS), which generates generic designs that are agnostic about the emulation task [19] and distributes points across intervals of the input domains.
This work is organized as follows. In Section 2, we describe desirable characteristics for emulators of simulators in health to motivate our choice of Gaussian Process emulators. In Section 3, we introduce an adaptive design for developing emulators, which is the focus of this work. Section 4 uses the PSAPC model to explicate the development of emulators that predict the gain in life expectancy over no screening for large sets of practically implementable prostate cancer screening strategies. Although our adaptive design is meant for very expensive simulators (that, e.g., take days rather than minutes to evaluate), we use the PSAPC simulator as a model with which we can actually do full computations. We compare emulators developed with the sequential design versus those developed with LHS in Section 5, and conclude with key remarks (Section 6).
2 Emulation goals and emulator models
2.1 Simulators
For this exposition, a simulator is a deterministic smooth function $f$ that maps a vector of $p$ input parameters to scalar outputs; $p$ is typically in the many dozens or hundreds. We treat the simulator as a black-box function that can be evaluated (expensively) at specific input values. Typically, a baseline set of values has been specified for the simulator's inputs through data analysis, calibration, or expert knowledge.
While this setup is not the most general case for which we could present our algorithm, it is not as restrictive as it appears, for the following reasons: (i) Many simulators in health that are not smooth are Lipschitz continuous (i.e., they have no “extreme” changes in the output given a marginal change in input) and, for the precision demanded in practice and for our purposes, can be treated as if they were smooth functions. (ii) Simulator outputs can be random variables, e.g., if some inputs are random variables, but we are usually interested in expectations of outcomes, which marginalize over the random input variables and result in deterministic mappings. (iii) Most simulators have multidimensional outputs, but, often, we are interested in one critical or composite outcome or are willing to consider one outcome at a time. (iv) Finally, we treat the simulator as a black box and do not attempt to exploit its analytical form. Often the analytical form is not available, or, when it is, its theoretical analysis may severely test our mathematical ability or be intractable.
With reference to more general setups for emulator development, in Section 6.2 we argue that extensions of our work to stochastic simulators and to simulators with multidimensional outputs require only minor modifications of the adaptive design algorithms that we present below.
2.2 Goal of emulation
We aim to develop an emulator $\hat{f}$ that statistically approximates the simulator's input–output mapping over the domain $\mathcal{X}$ of a subset of the simulator's input parameters, where $\mathcal{X} \subset \mathbb{R}^d$ [8]. We seek

$$\hat{f}(\mathbf{x}) \approx f(\mathbf{x}, \mathbf{z}_0) \quad \text{for all } \mathbf{x} \in \mathcal{X}, \qquad (1)$$

where $\mathbf{x}$ are values for the inputs whose mapping we wish to emulate, $\mathbf{z}_0$ are values for the remaining inputs of the simulator that are kept equal to their corresponding elements in the baseline-values vector, and the symbol “$\approx$” means “either equal or close enough for the purpose of the application”. From now on we write $f(\mathbf{x})$ instead of $f(\mathbf{x}, \mathbf{z}_0)$ to ease notation. We assume that $\mathcal{X}$ is a polytope, i.e., a convex polyhedron in $\mathbb{R}^d$, which is most often the case in applications.

Approximating the behavior of the simulator over all vectors in $\mathcal{X}$ is a reasonable goal when we wish to gain insights about the behavior of the simulator or for sensitivity and uncertainty analysis. For different tasks, e.g., for calibration of input variables to external data, it may suffice to approximate the simulator in the neighborhood of the set of optimal values [20].
To fit $\hat{f}$, we need a design that specifies the set of points in $\mathcal{X}$ at which we will evaluate the simulator. Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a set of $n$ distinct vectors in $\mathcal{X}$ that includes the extreme vectors of $\mathcal{X}$, so that all vectors in $\mathcal{X}$ are convex combinations of vectors in $X$. Then, with some abuse of terminology, we will call the vectors in $X$ the design vectors or design points. We further simplify notation by writing $y_i$ instead of $f(\mathbf{x}_i)$ for each vector $\mathbf{x}_i$ in $X$.
2.3 Choice of emulator model
We follow others in requiring that the emulator be an exact interpolator [8, 4, 11, 7, 10]. Specifically, we demand that:

1. $\hat{f}(\mathbf{x}_i) = y_i$ for $i = 1, \ldots, n$; that is, the emulator's prediction should agree with the simulator's output value at the design points, because the simulator is deterministic.

2. At all other input vectors, the emulator must provide an interpolation of the simulator output value, whose uncertainty decreases when closer to a design point, so that it becomes 0 at the design points.

3. The emulator should be orders of magnitude faster to evaluate than the simulator.
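To build intuition for the first two criteria, consider a minimal noise-free GP in Python (an illustrative sketch with an arbitrary kernel and lengthscale, not the GPfit implementation used later): it reproduces a toy simulator exactly at the design points, with vanishing predictive variance there.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_new, lengthscale=1.0, jitter=1e-10):
    """Posterior mean and variance of a zero-mean, noise-free GP."""
    K = rbf_kernel(x_train, x_train, lengthscale) + jitter * np.eye(len(x_train))
    Ks = rbf_kernel(x_new, x_train, lengthscale)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = rbf_kernel(x_new, x_new, lengthscale) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Design points from a toy deterministic "simulator"
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)

mean_at_design, var_at_design = gp_predict(x, y, x)
# Criterion 1: the GP interpolates the simulator exactly at the design points.
# Criterion 2: its predictive variance is (numerically) zero there and grows
# with distance from the design points.
```

In contrast, a least-squares regression line through the same four points would generally miss them, violating criterion 1.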
Practically all statistical and machine learning models are fast to evaluate, i.e., they satisfy criterion 3, but approaches such as linear regressions and neural networks do not satisfy the first two criteria, whereas kernel-based methods, such as Gaussian Processes (GPs), do. Figure 1 helps build intuition. Therefore, despite its popularity in the literature, owed in part to its simple form and its familiarity to researchers [21, 7], regression is not a preferred emulator type in this context. We use GP-based emulation, also referred to as kriging in other fields [8, 4, 11, 7, 10]. We review GPs briefly in Appendix C.

2.4 Experimental designs for developing emulators
The most common experimental designs for emulators are space-filling designs and sequential designs. Space-filling designs such as LHS and its variants choose input points from $\mathcal{X}$, ensuring that pre-specified intervals of each input variable are sampled. Space-filling designs sample points regardless of an input variable's influence on the output variable [22, 23, 24]. Alternatively, sequential designs make use of the estimated relationship between inputs and outputs in the simulator [11, 7, 25]. Broadly, these approaches fit an emulator based on an initial set of design points and then choose an additional new design point based on pre-established criteria related to the emulator's fit or uncertainty. When a new design point is chosen, it is added to the current set of design points and the emulator is refit with this new set of design points. The process repeats until pre-established stopping criteria are met.
LHS and other space-filling designs are easy to implement and are available in many software packages (e.g., the lhs package [26] in R). The number of design points is fixed in advance in the form of a computational budget; however, there is little guidance on predetermining an adequate number of points. Furthermore, although LHS performs well when the simulator behaves similarly across the input domain, it may approximate the simulator poorly when the simulator behaves qualitatively differently in an input subdomain that is not sampled [27]. By contrast, sequential designs do not require the user to specify the number of design points in advance, but rather allow the user to continue adding design points until the stopping criteria are met. They may be more efficient, in that they may require fewer design points than fixed designs like LHS [7, 28]. However, sequential designs may be computationally expensive because of the re-estimation of the emulator's parameters after each additional design point [29].
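For concreteness, an LHS design of the kind used as the comparator in this work can be generated with SciPy (an illustrative sketch; the paper's implementation uses the R lhs package, and the bounds below merely mimic the PSA-threshold ranges used later):

```python
import numpy as np
from scipy.stats import qmc

# A 20-point Latin hypercube over 4 inputs, rescaled to [0.5, 6.0] ng/mL
sampler = qmc.LatinHypercube(d=4, seed=1)
unit_design = sampler.random(n=20)
design = qmc.scale(unit_design, l_bounds=[0.5] * 4, u_bounds=[6.0] * 4)

# The Latin hypercube property: each of the 20 equal-width strata of every
# input dimension contains exactly one design point.
strata = np.floor(unit_design * 20).astype(int)
```

Note that the stratification is marginal, per input; nothing ties the number or location of points to how strongly each input drives the simulator output, which is the task-agnosticism discussed above.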
3 An active learning algorithm
ALGORITHM 1

Step 0: Start with a seeding set $X_0$ of input vectors. At least one vector in the set is in the interior of the polytope $\mathcal{X}$. Call $D$ the set of corresponding design points.
For each iteration, until the stopping criteria are met:
  Step 1: Fit an EMULATOR to all design points in $D$.
    Step 1.1: For each interior input vector $\mathbf{x}_j$: remove the design point $(\mathbf{x}_j, y_j)$ from $D$.
    Step 1.2: Refit an EMULATOR$_{-j}$ to the remaining design points.
  Step 2: Obtain a set of candidate input vectors (see Appendix D).
  Step 3: For each candidate input vector: estimate predictions with EMULATOR and with each EMULATOR$_{-j}$ from Step 1.2, and get the range of the predictions.
  Step 4: Identify the candidate input vector with the largest range of predictions, $r_{\max}$.
  Step 5: IF $r_{\max} > t_r$, where $t_r$ is a predefined threshold:
      Add the corresponding design point to $D$ and return to Step 1.
    ELSE:
  Step 6: Estimate the standard error of the EMULATOR predictions at each candidate input vector and identify the candidate with the largest standard error, $s_{\max}$.
  Step 7: IF $s_{\max} > t_{SE}$, where $t_{SE}$ is a predefined threshold:
      Add the corresponding design point to $D$ and return to Step 1.
    ELSE: END.

The proposed active learning (AL) algorithm (Box 3: ALGORITHM 1) adopts aspects of sequential designs from the current literature [7]. The algorithm seeks to select design points in the regions where (i) the simulator output changes quickly and (ii) the emulator has high predictive uncertainty. The first goal is pursued through a resampling procedure, which successively removes existing design points from the complete set of design points, refits emulators using each of the resulting subsets of design points, and then identifies the “candidate” input vectors with the largest range in the predictions obtained with this series of emulators. Candidate input vectors are input vectors that have not yet been evaluated with the simulator. With this resampling procedure, the regions where the simulator is “fast-changing” are likely to have larger ranges in predictions than the regions where the simulator is relatively flat [18]. For example, removing a design point from a region where small changes in a PSA threshold value produce large changes in life-years saved will likely change the prediction at a nearby PSA threshold value. If the resampling procedure does not identify any “fast-changing” regions (towards satisfying the first goal above), then the second goal is pursued by examining the regions with large variance in the emulator's prediction. Choosing additional design points in these regions will reduce the emulator's overall predictive uncertainty. Details of the algorithm's process are described below, and Figure 2 illustrates the steps of the algorithm using a simplistic low-dimensional example.

To start, the algorithm requires an initial set of input vectors $X_0$, which includes the extreme vertices of the input space and a nonempty set of input vectors in the interior of the polytope $\mathcal{X}$ (Figure 2 (a)). The initial input vectors are paired with their output values evaluated in the simulator to obtain the set of initial design points $D$ (Figure 2 (b)). An emulator is then fit to $D$ (Step 1, Figure 2 (c)). Next, for each interior input vector $\mathbf{x}_j$, the corresponding design point is removed from the complete set of design points (Step 1.1) and a new emulator is fit to the remaining set of design points (Step 1.2, Figure 2 (d)). A set of candidate input vectors with potentially “large” prediction errors from the current emulator is then obtained (Step 2). We describe a simple algorithm for selecting candidate points in Appendix D. For each candidate input vector, predictions are estimated using the emulator from Step 1 and each refit emulator from Step 1.2 (Step 3, Figure 2 (e)); the range of these predictions is obtained (Step 3, Figure 2 (f)). The “winning” candidate input vector is the one with the largest range in predictions (Step 4). If the range for this candidate input vector is above a pre-specified threshold (i.e., $r_{\max} > t_r$), the corresponding new design point is added to the set of existing design points and we return to Step 1 (Step 5, Figure 2 (g)). Steps 1 to 5 are repeated until the range of predictions is below the threshold (Figure 2 (h)). Then, the standard error of the predictions with the emulator is estimated for each candidate input vector, and the candidate input vector with the largest prediction standard error is identified (Step 6, Figure 2 (i)). If the standard error for this candidate input vector is above a pre-specified threshold (i.e., $s_{\max} > t_{SE}$), then the corresponding new design point is added to the set of existing design points and we return to Step 1 (Step 7). The process repeats until both criteria are satisfied.
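The resampling loop (Steps 1 to 5) can be sketched compactly. The following Python toy version uses a fixed-lengthscale GP mean, a one-dimensional stand-in simulator, and an arbitrary threshold, and omits the standard-error step; all names and values are illustrative, not the paper's R implementation:

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential covariance between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_mean(x_train, y_train, x_new, ell=0.5, jitter=1e-8):
    """Posterior mean of a zero-mean, noise-free GP (an exact interpolator)."""
    K = rbf(x_train, x_train, ell) + jitter * np.eye(len(x_train))
    return rbf(x_new, x_train, ell) @ np.linalg.solve(K, y_train)

def simulator(x):
    """Stand-in for an expensive simulator."""
    return np.sin(3 * x)

# Seeding design: the extremes of [0, 2] plus two interior vectors (Step 0)
x_design = np.array([0.0, 0.7, 1.3, 2.0])
y_design = simulator(x_design)
interior = [1, 2]                          # indices of interior design points
candidates = np.linspace(0.05, 1.95, 39)   # not-yet-evaluated input vectors
t_r = 0.05                                 # resampling threshold (arbitrary)

for _ in range(10):                        # a few AL iterations
    # Predictions from the full emulator and each leave-one-out refit
    preds = [gp_mean(x_design, y_design, candidates)]
    for j in interior:
        keep = np.arange(len(x_design)) != j
        preds.append(gp_mean(x_design[keep], y_design[keep], candidates))
    ranges = np.ptp(np.vstack(preds), axis=0)  # per-candidate prediction range
    winner = int(np.argmax(ranges))
    if ranges[winner] <= t_r:
        break                              # resampling criterion met
    # Evaluate the simulator at the winning candidate and grow the design
    x_design = np.append(x_design, candidates[winner])
    y_design = np.append(y_design, simulator(candidates[winner]))
    interior.append(len(x_design) - 1)
    candidates = np.delete(candidates, winner)
```

Each pass either terminates (the range criterion is met) or spends exactly one new simulator evaluation at the point where the leave-one-out emulators disagree most.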
3.1 Implementation
The AL algorithm is implemented in R and uses a modified version of the GPfit package [30] for fitting GP emulators. When refitting emulators during iterations of the active learning algorithm, we used the parameter estimates of previously fit emulators as initial values. We use the geometry package [31] to obtain the tessellation of $\mathcal{X}$ that is required for the algorithm in Appendix D, which identifies the candidate points at each iteration of the active learning algorithm, and the lhs package to generate LHS designs [26].
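Appendix D is not reproduced here, but a tessellation-based candidate rule of the same flavor can be sketched with SciPy's Delaunay tessellation; taking the centroid of each simplex as a candidate is our illustrative choice, not necessarily the rule in Appendix D:

```python
import numpy as np
from scipy.spatial import Delaunay

# Current design vectors in a 2-D input polytope: corners plus interior points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [0.3, 0.4], [0.7, 0.6]])

tri = Delaunay(points)                    # tessellate the design vectors
# One candidate per simplex: its centroid, which lies away from the
# existing design vectors that span that simplex.
candidates = points[tri.simplices].mean(axis=1)
```

Because every candidate is a convex combination of existing design vectors, all candidates remain inside the input polytope, so the GP emulator is never asked to extrapolate.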
4 Application
4.1 The PSAPC model
The PSAPC microsimulation model accounts for the relationship between PSA levels, prostate cancer disease progression, and clinical detection [32, 2]. The model, its estimation approach, its calibration, and its comparison with other prostate cancer models have been described in detail elsewhere [2, 3, 33, 34]. Here, we treat the PSAPC model as a “black box”.
Figure 3 outlines the PSAPC model. Briefly, simulated healthy men may develop preclinical, local-regional cancer (disease onset). The PSAPC version we are using incorporates disease grade (low = Gleason scores 2–7; high = Gleason scores 8–10), which is determined and fixed upon disease onset. Patients with low- or high-grade, local-regional cancer may progress to distant sites (metastatic spread). Patients with either local-regional or metastatic disease may present with symptoms (clinical detection). Those with a clinically detectable form of disease may die from prostate cancer (prostate cancer mortality). At any time and in any health state in the model, patients may die from other causes (other-cause mortality). Disease progression is related to age or PSA levels. PSA levels are modeled as a function of age and age of disease onset, such that there is a linear changepoint in (log) PSA after disease onset. PSA levels after disease onset differ for those with low- versus high-grade disease. Parameters for the age of disease onset, metastatic spread, and clinical detection are estimated from calibration.
In the presence of screening, the simulated individuals with cancer may be identified and treated earlier (i.e., during the preclinical state) than without screening. The model assigns each simulated individual a schedule of PSA-based screening tests and biopsies, as determined by the simulated screening strategy. Every time screening occurs, men with PSA levels above the PSA positivity threshold are referred to biopsy, and those with a positive biopsy result are managed with radical prostatectomy, radiation therapy, or active surveillance (i.e., no treatment but continued monitoring) [32, 2].
The PSAPC model projects several outcomes, including the number of screenings, false-positive results, prostate cancer diagnoses, prostate cancer deaths, and life-years gained with PSA screening versus no screening (i.e., clinical detection only). The model results are presented as the mean number of events or the lifetime probability of each outcome based on the simulated cohorts of men (e.g., 100 million men) [3].

4.2 Example goal: Life expectancy with different PSA positivity thresholds for a set of screening policies
Using the PSAPC model described above, we aimed to estimate the expected life-days saved with PSA-based screening versus no screening for policy-relevant screening strategies that used age-specific PSA thresholds. We considered all annual PSA screening strategies that used PSA positivity thresholds between 0.5 and 6.0 ng/mL and four age-specific PSA positivity thresholds $(t_1, t_2, t_3, t_4)$: one for men aged 40–44 years ($t_1$), a second for men aged 45–49 years ($t_2$), a third for men aged 50–54 ($t_3$), and a fourth for men aged 55–74 ($t_4$). Because PSA levels increase with increasing age [35], we assumed that “policy-relevant” strategies are those with age-specific PSA positivity thresholds that are constant within each age group, where the PSA positivity threshold value for a given age group is at least as high as the value in the preceding age group ($t_1 \le t_2 \le t_3 \le t_4$). In sensitivity analyses, we also considered smaller numbers of age-specific PSA positivity thresholds.
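To give a sense of scale for the policy-relevant strategy space, the non-decreasing constraint can be enumerated directly; the 0.1 ng/mL grid spacing below is our illustrative assumption (the spacing of the strategy grid is not stated above):

```python
from itertools import combinations_with_replacement

# PSA positivity thresholds from 0.5 to 6.0 ng/mL on an assumed 0.1 ng/mL grid
grid = [round(0.5 + 0.1 * k, 1) for k in range(56)]  # 56 threshold values

# Policy-relevant strategies: four age-group thresholds, non-decreasing with age
strategies = list(combinations_with_replacement(grid, 4))
n_strategies = len(strategies)  # C(56 + 4 - 1, 4) = 455,126
```

Even on this coarse grid, hundreds of thousands of strategies qualify, which is why exhaustive evaluation with the simulator is impractical and an emulator is attractive.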
To obtain results with the PSAPC model with negligible Monte Carlo error, we simulated cohorts of 100 million men aged 40 in the year 2000 who were screened annually from age 40 to age 74.
4.3 The PSAPC emulators
A series of GP emulators of the PSAPC model were developed to predict the expected life-days saved with screening versus no screening, using as predictors the PSA positivity thresholds in each age group. The parameter values for the emulators' initial setup and the selection of subsequent design points are outlined in Table 1. Briefly, an initial seeding set of PSA positivity threshold vectors was sampled over the input space, which included the extreme vertices of the input space and additional randomly sampled interior vectors. Because the location of the seeding design points can affect the fit of the initial emulator, one emulator was fit to each of 10 initial seeding sets. The mean function for the GP-based emulators was a constant. Overall, there were 8 cases governing the initial setup of the emulators (Table 1).
Parameter for emulator development | Values

Initial setup (both approaches)
Number of input variables | {1, 2, 3, 4}
Number of initial design points | {}
Number of initial seeding sets of design points | {10}
Emulator mean function | {constant}

AL approach
Resampling threshold in life-days ($t_r$) | {0.2, 0.5}
SE threshold in life-days ($t_{SE}$) | {0.5}
Number of consecutive iterations meeting threshold criteria | {1, 2, 3, 4, 5}

LHS approach
Number of seeds per initial set of design points | {10}
Number of LHS samples per iteration | {5}
Final number of design points* | {100}

AL: active learning algorithm; LHS: Latin hypercube sampling.

*The LHS approach stopped selecting design points when 100 points had been selected; in some cases, the final number of design points with LHS was 103, 101, 104, or 102.
For each of the 8 initial cases, the GP emulators were developed using the proposed active learning algorithm (AL approach) and using an LHS space-filling design (LHS approach). The AL algorithm in Section 3 was applied for each of the 10 initial seeding sets of design points. The stopping criteria were 0.2 or 0.5 life-days gained for the resampling threshold ($t_r$) and 0.5 life-days gained for the predictive uncertainty threshold ($t_{SE}$), because we judged that a precision on the order of 1 day was good enough for policy making in this application. The stopping criteria needed to be met for up to 5 consecutive iterations. With the LHS approach, sequential batches of 5 points were repeatedly sampled across the input space, evaluated in the PSAPC simulator, and added as design points to the previous set until 100 design points were obtained. Because LHS is a randomized routine, 10 LHS-based emulators were developed using different random seeds for each of the 10 initial seeding sets, resulting in a total of 100 LHS-based emulators (10 replications of 10 random initial sets) for each of the 8 initial cases.
4.4 Performance metrics
To assess each emulator's “closeness” to the PSAPC simulator, the following metrics were estimated by evaluating the PSAPC simulator over a fine grid of $N$ PSA positivity threshold values $\mathbf{x}_k$: (i) the square-root mean squared error, $\mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{k=1}^{N}\left(\hat{f}(\mathbf{x}_k) - f(\mathbf{x}_k)\right)^2}$, which averages discrepancies over the whole input domain; and (ii) the maximum absolute difference between the emulator prediction and the PSAPC simulator output, $\mathrm{MAX} = \max_{k}\left|\hat{f}(\mathbf{x}_k) - f(\mathbf{x}_k)\right|$. For both metrics, a lower value indicates better performance (i.e., closer to the PSAPC simulator). PSA positivity threshold values ranged from 0.5 to 6 ng/mL, with the PSA positivity threshold value for a given age group at least as high as the value in the preceding age group, as described above. The number of grid evaluations for each case is described in the footnote.^1

^1 The number of evaluations is as follows. With one age-specific threshold, there were 101 equally spaced evaluations. With two age-specific thresholds, a fine grid with intervals of 0.02 was obtained and restricted to the subset with non-decreasing threshold values across age groups; the 101 values from the one-threshold grid were also included. With three age-specific thresholds, a fine grid with intervals of 0.04 was obtained and similarly restricted; the 1,376 values from the two-threshold grid were also included. With four age-specific thresholds, a fine grid with intervals of 0.05 was obtained and similarly restricted; the 4,301 values from the three-threshold grid were also included.
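Both metrics are straightforward to compute from paired emulator and simulator evaluations over the grid; a minimal sketch (array names and values are illustrative):

```python
import numpy as np

def rmse(emulator_pred, simulator_out):
    """Square-root mean squared error over the evaluation grid."""
    return float(np.sqrt(np.mean((emulator_pred - simulator_out) ** 2)))

def max_abs(emulator_pred, simulator_out):
    """Maximum absolute deviation between emulator and simulator."""
    return float(np.max(np.abs(emulator_pred - simulator_out)))

pred = np.array([1.0, 2.0, 3.5])    # emulator predictions (toy values)
truth = np.array([1.0, 2.5, 3.0])   # simulator outputs at the same grid points
rmse(pred, truth)     # ≈ 0.408
max_abs(pred, truth)  # 0.5
```

RMSE summarizes average closeness, while MAX guards against locally poor fits that an averaged metric can hide.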
5 Results
Figure 4 illustrates the performance of emulators fit with the AL approach using a resampling threshold of 0.5 life-days (solid black lines) and with the LHS approach (dashed grey lines) for one of the cases. In particular, Figures 4 (a, b) show the RMSE and Figures 4 (c, d) show the MAX, as assessed at each iteration of the design point selection process. For both approaches, the addition of design point(s) may improve or worsen an emulator's performance, although performance typically improves with more design points. The improvement tends to be greater in the beginning, when there are few design points, as indicated by the initial steep slopes of the curves. The curves level off as more design points are selected; the AL approach terminates shortly after the curves level off (e.g., 40–50 design points), whereas the LHS approach exhausted the budget of 100 design points. At any number of design points, the AL-trained emulators performed at least as well as the LHS-trained emulators. A more extensive set of emulator result comparisons is available elsewhere [20].
Table 2 presents the emulators' performance results for all scenarios. For the AL-trained emulators, the results are for the emulators fit to the final set of design points (i.e., when the AL algorithm terminated). To facilitate comparisons with the LHS-based emulators, the table presents performance results of LHS emulators fit to intermediate numbers of design points and to the full budget of design points. The AL-trained emulators typically achieved a better performance more quickly (i.e., with fewer design points/simulator evaluations) than the LHS-trained emulators. For example, with stopping criteria met for 5 consecutive iterations, the median [range] MAX for emulators trained with the AL approach was 0.78 [0.37, 1.30] with between 26 and 50 design points, whereas the MAX for LHS-based emulators was 0.86 [0.50, 1.44] with 62 design points. However, in practice, one would train the LHS emulators on the whole budget, where the respective results are 0.71 [0.47, 1.22]. Then, the AL-trained emulators would have attained comparable MAX metrics with 2 to 4 times fewer design points.
Across all analyses, the AL-trained emulators outperformed the LHS-trained emulators with comparable numbers of design points when more stringent stopping criteria were used (e.g., a lower resampling threshold or a higher number of consecutive iterations).
 | $n$ | MAX | RMSE | $n$ | MAX | RMSE

Initial emulators | 12 | 1.83 [1.71, 1.99] | 0.62 [0.44, 0.81] | 40 | 1.15 [0.82, 1.55] | 0.25 [0.19, 0.40]

AL ()
# consecutive iterations
1 | [14–36] | 0.98 [0.43, 1.75] | 0.19 [0.10, 0.47] | [43–55] | 0.78 [0.44, 1.24] | 0.14 [0.11, 0.28]
2 | [15–41] | 0.86 [0.44, 1.70] | 0.16 [0.11, 0.49] | [45–59] | 0.71 [0.29, 0.86] | 0.12 [0.07, 0.16]
3 | [19–44] | 0.77 [0.48, 1.68] | 0.14 [0.10, 0.50] | [49–63] | 0.67 [0.31, 0.88] | 0.12 [0.07, 0.17]
4 | [24–49] | 0.76 [0.42, 1.41] | 0.14 [0.09, 0.36] | [50–64] | 0.67 [0.31, 0.96] | 0.11 [0.07, 0.19]
5 | [26–50] | 0.78 [0.37, 1.30] | 0.13 [0.09, 0.33] | [53–65] | 0.68 [0.32, 0.87] | 0.11 [0.07, 0.17]

AL ()
# consecutive iterations
1 | [14–39] | 1.25 [0.86, 1.75] | 0.31 [0.13, 0.51] | [43–63] | 0.76 [0.27, 1.38] | 0.13 [0.06, 0.28]
2 | [19–54] | 0.99 [0.47, 1.38] | 0.21 [0.09, 0.38] | [49–70] | 0.69 [0.27, 1.13] | 0.11 [0.06, 0.22]
3 | [25–55] | 1.03 [0.46, 1.32] | 0.21 [0.09, 0.34] | [50–71] | 0.70 [0.27, 1.14] | 0.11 [0.06, 0.22]
4 | [26–56] | 0.99 [0.46, 1.32] | 0.17 [0.09, 0.34] | [56–72] | 0.54 [0.27, 0.96] | 0.09 [0.06, 0.27]
5 | [27–57] | 0.99 [0.45, 1.33] | 0.16 [0.09, 0.32] | [57–73] | 0.54 [0.27, 1.04] | 0.09 [0.06, 0.31]

LHS
 | 17 | 1.59 [0.89, 2.45] | 0.38 [0.22, 0.62] | 45 | 0.97 [0.64, 1.85] | 0.19 [0.12, 0.35]
 | 22 | 1.47 [0.80, 2.00] | 0.28 [0.17, 0.41] | 50 | 0.94 [0.62, 1.74] | 0.17 [0.10, 0.32]
 | 62 | 0.86 [0.50, 1.44] | 0.14 [0.08, 0.26] | 90 | 0.74 [0.41, 1.38] | 0.11 [0.07, 0.21]
last | 102 | 0.71 [0.47, 1.22] | 0.11 [0.07, 0.16] | 100 | 0.70 [0.39, 1.32] | 0.10 [0.07, 0.20]

AL: Active Learning algorithm; LHS: Latin Hypercube Sampling; MAX: maximum absolute deviation between emulator and simulator; RMSE: square-root mean squared error. Cells are median [range] over the replicate emulators; $n$ is the number of design points (a range [min–max] for the self-stopping AL approach).
6 Discussion
Our AL algorithm appears to be a reasonable choice for many emulation problems, because (i) it is self-stopping and (ii) it is efficient, in that it requires only a limited number of simulator evaluations to achieve a close-enough approximation to the simulator. The self-stopping attribute is important, because there is little guidance on selecting the total number of design points for GP emulators. “Rules of thumb” based on simulation testing suggest that about 100 points may be sufficient for a few input variables [36, 27]. Others have needed up to 400 points for emulators with 28 to 30 input variables [37]. By contrast, in all analyses, the AL algorithm self-terminated to yield well-performing emulators. We believe (but have not proven) that for smooth simulators, the AL algorithm will always terminate, because GPs converge to the underlying simulator with an increasing number of design points [38]. Therefore, we believe that this work responds to the call by the Second Panel on Cost-Effectiveness in Health and Medicine for future research on “practical approaches to efficiently develop well-performing emulators” of mathematical models in healthcare [15, 13, 39].
Our results are consistent with the literature, which also suggests that sequential designs can approximate simulators with predictive accuracy that is at least comparable to, or better than, that of conventional space-filling designs [18, 40]. The efficiency of the AL algorithm is an important consideration for simulators that require substantially more computational resources than the PSAPC model, such as the physiology-based chronic disease model Archimedes [41, 42] or detailed models of the biomechanics of red blood cells [43]. In our case, we chose a computationally tractable simulator so as to quantify the evolution of the RMSE and MAX metrics for emulators developed with different designs and to demonstrate computational efficiencies. Such calculations would be impractical with mathematical models that take hours or days, rather than minutes, to evaluate.
6.1 Limitations
Several of the limitations pertain to the fact that we use GPs for emulation, which do not scale well to many inputs, do not extrapolate well, and can themselves be expensive to fit [38]. O'Hagan suggests that GPs may be practical only up to a modest number of input dimensions [8]. If a large number of inputs must be emulated, one should explore whether developing several lower-dimensional emulators suffices. Knowing the analytic specification of the underlying model could offer insights about the feasibility of this simplification. GPs are interpolators and should not be used to extrapolate outside the input polytope. If such extrapolations are required, other types of emulators should be used, as the emulation problem has different goals than those described in Section 2.2.
The computational cost of learning GPs is about $O(n^3)$ in the number of design points $n$ [38], and for large $n$, fitting GPs becomes computationally expensive. The AL algorithm requires successively fitting a large number of GPs. However, it fits GPs that have one less or one more design point than an already fitted GP. Refitting on a minimally reduced or augmented dataset can be sped up by using previously estimated parameter values as starting values for the fitting algorithm (as was done here), and with other approximations [44].
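One standard device of this kind (our illustration, not necessarily the approximation in [44]) is to grow the Cholesky factor of the kernel matrix by one design point in $O(n^2)$ instead of refactorizing in $O(n^3)$, when the kernel hyperparameters are held fixed:

```python
import numpy as np
from scipy.linalg import solve_triangular

def rbf(a, b, ell=0.5):
    """Squared-exponential covariance between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def chol_append(L, k_cross, k_new, jitter=1e-10):
    """Extend a lower Cholesky factor after adding one design point.

    L: factor of the current n x n kernel matrix (L @ L.T = K).
    k_cross: length-n covariances between the old points and the new point.
    k_new: prior variance of the new point.
    Costs O(n^2), versus O(n^3) for a full refactorization.
    """
    c = solve_triangular(L, k_cross, lower=True)
    d = np.sqrt(k_new + jitter - c @ c)
    n = L.shape[0]
    L_aug = np.zeros((n + 1, n + 1))
    L_aug[:n, :n] = L
    L_aug[n, :n] = c
    L_aug[n, n] = d
    return L_aug

x = np.array([0.0, 0.5, 1.0, 1.5])
L = np.linalg.cholesky(rbf(x, x) + 1e-10 * np.eye(4))

x_new = 0.8
L_aug = chol_append(L, rbf(x, np.array([x_new]))[:, 0], 1.0)
# L_aug equals the Cholesky factor of the augmented 5 x 5 kernel matrix.
```

After the cheap update, the hyperparameters can still be re-optimized starting from their previous estimates, as described above.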
GPs trained with the AL algorithm are not guaranteed to converge to the simulator, whereas GPs trained with LHS will eventually do so [23]. Asymptotic convergence would probably be achieved by modifying step 5 in the AL algorithm to also allow choosing, with some probability, a set of random design points in an “exploration” step. However, this might undercut the AL algorithm's efficiency gains.
6.2 Extensions
In some applications, we may be interested in both the expectation and the variance of the simulator output. The AL algorithm could be used by substituting stochastic GPs for the deterministic GPs that we have used here. Stochastic GPs model the error in the observation [38, 8, 7]. If a stochastic GP is used, some care is needed to avoid setting too stringent a resampling or SE threshold in the stopping criteria. For example, if the threshold is smaller than the simulator’s prediction uncertainty, the AL algorithm will never terminate.
For emulating multivariate simulator outputs, a multivariate GP can be used, which can also model correlations between the multiple outputs [38, 16, 45]. If multivariate emulators are to be used, the stopping criteria for the AL algorithm would also need to be modified; e.g., the resampling and SE thresholds would have to be met for each output, or for some function of the outputs. However, in practice, for a large class of models, there appears to be little gain in learning multivariate emulators compared to several univariate ones [46, 47, 48].
The AL algorithm can be used with different types of emulators, including regression or neural network learners, although we have not examined this case in an example. We have also not explicitly discussed the case of linked emulators, which are used to approximate systems of computer simulators or, equivalently, distinct modules of a modular simulator, where the results of one module are used by other modules to yield the output of the overall simulator [48, 49]. The finding above, that univariate emulators often suffice, suggests that we should be able to fit separate emulators for each simulator module with our AL algorithm. Finally, elsewhere, we build on this work to develop AL algorithms that train emulators to a target accuracy, albeit at an increased computational cost [20].
Acknowledgements
We thank Dr. Roman Gulati for providing the PSAPC model source code and for answering technical questions about the implementation.
Conflict of interest
We have no conflicts to declare.
References
 Owens et al. [2016] Owens DK, Whitlock EP, Henderson J, Pignone MP, Krist AH, Bibbins-Domingo K, et al. Use of Decision Models in the Development of Evidence-Based Clinical Preventive Services Recommendations: Methods of the US Preventive Services Task Force. Annals of Internal Medicine 2016;165(7):501–508.
 Gulati et al. [2010] Gulati R, Inoue L, Katcher J, Hazelton W, Etzioni R. Calibrating disease progression models using population data: a critical precursor to policy development in cancer control. Biostatistics 2010;11(4):707–719.
 Gulati et al. [2013] Gulati R, Gore J, Etzioni R. Comparative effectiveness of alternative prostate-specific antigen–based prostate cancer screening strategies: Model estimates of potential benefits and harms. Annals of Internal Medicine 2013;158(3):145–153. http://dx.doi.org/10.7326/0003-4819-158-3-201302050-00003.
 National Research Council [2012] National Research Council. Assessing the Reliability of Complex Models: Mathematical and Statistical Foundations of Verification, Validation, and Uncertainty Quantification. Washington, D.C.: NRC; 2012.
 Rutter et al. [2009] Rutter CM, Miglioretti DL, Savarino JE. Bayesian calibration of microsimulation models. Journal of the American Statistical Association 2009;104(488):1338–1350.
 Saltelli et al. [2008] Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, et al. Global sensitivity analysis: the primer. John Wiley & Sons; 2008.
 Kleijnen [2008] Kleijnen J. Design and analysis of simulation experiments, vol. 20. Springer; 2008.
 O’Hagan [2006] O’Hagan A. Bayesian analysis of computer code outputs: A tutorial. Reliability Engineering & System Safety 2006;91(10–11):1290 – 1300. http://www.sciencedirect.com/science/article/pii/S0951832005002383, the Fourth International Conference on Sensitivity Analysis of Model Output (SAMO 2004).
 Bertsimas et al. [2018] Bertsimas D, Silberholz J, Trikalinos TA. Optimal healthcare decision making under multiple mathematical models: application in prostate cancer screening. Health care management science 2018;21(1):105–118.
 Santner et al. [2013] Santner TJ, Williams BJ, Notz WI. The design and analysis of computer experiments. Springer Science & Business Media; 2013.
 Sacks et al. [1989] Sacks J, Welch W, Mitchell T, Wynn H. Design and analysis of computer experiments. Statistical Science 1989;p. 409–423.
 Blanning [1975] Blanning RW. The construction and implementation of metamodels. Simulation 1975;24(6):177–184.
 Neumann et al. [2018] Neumann PJ, Kim DD, Trikalinos TA, Sculpher MJ, Salomon JA, Prosser LA, et al. Future Directions for Cost-effectiveness Analyses in Health and Medicine. Med Decis Making 2018 Oct;38(7):767–777.
 Jalal et al. [2015] Jalal H, Goldhaber-Fiebert JD, Kuntz KM. Computing expected value of partial sample information from probabilistic sensitivity analysis using linear regression metamodeling. Medical Decision Making 2015;35(5):584–595.
 Neumann et al. [2016] Neumann PJ, Sanders GD, Russell LB, Siegel JE, Ganiats TG. Cost-effectiveness in health and medicine. Oxford University Press; 2016.
 Conti and O’Hagan [2010] Conti S, O’Hagan A. Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference 2010;140(3):640–651. http://www.sciencedirect.com/science/article/pii/S0378375809002559.
 Kennedy et al. [2006] Kennedy M, Anderson C, Conti S, O’Hagan A. Case studies in Gaussian process modelling of computer codes. Reliability Engineering & System Safety 2006;91(10–11):1301 – 1309. http://www.sciencedirect.com/science/article/pii/S0951832005002395, the Fourth International Conference on Sensitivity Analysis of Model Output (SAMO 2004).
 Kleijnen and Van Beers [2004] Kleijnen J, Van Beers W. Application-driven sequential designs for simulation experiments: Kriging metamodelling. Journal of the Operational Research Society 2004;55(8):876–883.
 McKay et al. [1979] McKay MD, Beckman RJ, Conover WJ. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 1979;21(2):239–245.
 Ellis [2018] Ellis AG. Emulators to facilitate the development and analysis of computationally expensive decision analytic models in health. PhD thesis, Brown University; 2018.
 İnci Batmaz and Tunali [2003] İnci Batmaz, Tunali S. Small response surface designs for metamodel estimation. European Journal of Operational Research 2003;145(2):455 – 470. http://www.sciencedirect.com/science/article/pii/S0377221702002072.
 Lin et al. [2010] Lin CD, Bingham D, Sitter RR, Tang B. A new and flexible method for constructing designs for computer experiments. The Annals of Statistics 2010;p. 1460–1477.
 McKay et al. [2000] McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 2000;42(1):55–61.
 Salagame and Barton [1997] Salagame RR, Barton RR. Factorial hypercube designs for spatial correlation regression. Journal of Applied Statistics 1997;24(4):453–474.
 Jones et al. [1998] Jones DR, Schonlau M, Welch WJ. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 1998;13(4):455–492.
 Carnell [2016] Carnell R. lhs: Latin Hypercube Samples; 2016. https://CRAN.R-project.org/package=lhs, R package version 0.14.
 Meckesheimer et al. [2001] Meckesheimer M, Barton RR, Simpson T, Limayem F, Yannou B. Metamodeling of combined discrete/continuous responses. AIAA journal 2001;39(10):1950–1959.
 Park et al. [2002] Park S, Fowler JW, Mackulak GT, Keats JB, Carlyle WM. D-optimal sequential experiments for generating a simulation-based cycle time-throughput curve. Operations Research 2002;50(6):981–990.
 Gano et al. [2006] Gano SE, Renaud JE, Martin JD, Simpson TW. Update strategies for kriging models used in variable fidelity optimization. Structural and Multidisciplinary Optimization 2006;32(4):287–298.
 MacDonald et al. [2015] MacDonald B, Ranjan P, Chipman H, et al. GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs. Journal of Statistical Software 2015;64(i12).
 Barber et al. [2012] Barber C, Habel K, Grasman R, Gramacy RB, Stahel A, Sterratt DC. geometry: Mesh generation and surface tessellation; 2012. R package, http://cran.r-project.org/web/packages/geometry/index.html.
 CISNET [n.d.] CISNET, Cancer Intervention and Surveillance Modeling Network. Prostate Cancer Modeling; http://cisnet.cancer.gov/prostate/. Accessed: July 2015.
 Gulati et al. [2014] Gulati R, Tsodikov A, Etzioni R, Hunter-Merrill RA, Gore JL, Mariotto AB, et al. Expected population impacts of discontinued prostate-specific antigen screening. Cancer 2014;120(22):3519–3526.
 FHCRC [2009 (accessed April 26, 2018)] FHCRC, Fred Hutchinson Cancer Research Center (PSAPC) model profile; 2009 (accessed April 26, 2018). http://cisnet.cancer.gov/prostate/profiles.html.
 Oesterling et al. [1993] Oesterling JE, Jacobsen SJ, Chute CG, Guess HA, Girman CJ, Panser LA, et al. Serum prostate-specific antigen in a community-based population of healthy men: establishment of age-specific reference ranges. JAMA 1993;270(7):860–864.
 Helton et al. [2006] Helton JC, Johnson JD, Oberkampf W, Sallaberry CJ. Sensitivity analysis in conjunction with evidence theory representations of epistemic uncertainty. Reliability Engineering & System Safety 2006;91(10):1414–1434.
 Rojnik and Naveršnik [2008] Rojnik K, Naveršnik K. Gaussian Process Metamodeling in Bayesian Value of Information Analysis: A Case of the Complex Health Economic Model for Breast Cancer Screening. Value in Health 2008;11(2):240 – 250. http://www.sciencedirect.com/science/article/pii/S1098301510605177.
 Rasmussen and Williams [2006] Rasmussen CE, Williams CK. Gaussian processes for machine learning, vol. 2. the MIT Press; 2006.
 Sanders et al. [2016] Sanders GD, Neumann PJ, Basu A, Brock DW, Feeny D, Krahn M, et al. Recommendations for conduct, methodological practices, and reporting of costeffectiveness analyses: second panel on costeffectiveness in health and medicine. JAMA 2016;316(10):1093–1103.
 Jin et al. [2001] Jin R, Chen W, Simpson TW. Comparative studies of metamodelling techniques under multiple modelling criteria. Structural and Multidisciplinary Optimization 2001;23(1):1–13. http://dx.doi.org/10.1007/s00158-001-0160-4.
 Schlessinger and Eddy [2002] Schlessinger L, Eddy DM. Archimedes: a new model for simulating health care systems—the mathematical formulation. Journal of biomedical informatics 2002;35(1):37–50.
 Eddy and Schlessinger [2003] Eddy DM, Schlessinger L. Archimedes: a trial-validated model of diabetes. Diabetes Care 2003;26(11):3093–3101.
 Tang et al. [2017] Tang YH, Lu L, Li H, Evangelinos C, Grinberg L, Sachdeva V, et al. OpenRBC: a fast simulator of red blood cells at protein resolution. Biophysical journal 2017;112(10):2030–2037.
 Shen et al. [2006] Shen Y, Seeger M, Ng AY. Fast Gaussian process regression using kd-trees. In: Advances in Neural Information Processing Systems; 2006. p. 1225–1232.
 Genton et al. [2015] Genton MG, Kleiber W, et al. Cross-covariance functions for multivariate geostatistics. Statistical Science 2015;30(2):147–163.
 Kleijnen and Mehdad [2014] Kleijnen JP, Mehdad E. Multivariate versus univariate Kriging metamodels for multi-response simulation models. European Journal of Operational Research 2014;236(2):573–582.
 Gu et al. [2016] Gu M, Berger JO, et al. Parallel partial Gaussian process emulation for computer models with massive output. The Annals of Applied Statistics 2016;10(3):1317–1347.
 Kyzyurova [2017] Kyzyurova K. On Uncertainty Quantification for Systems of Computer Models. PhD thesis, Duke University; 2017.
 Kyzyurova et al. [2018] Kyzyurova KN, Berger JO, Wolpert RL. Coupling Computer Models through Linking Their Statistical Emulators. SIAM/ASA Journal on Uncertainty Quantification 2018;6(3):1151–1171.
 Kleijnen and Sargent [2000] Kleijnen J, Sargent R. A methodology for fitting and validating metamodels in simulation. European Journal of Operational Research 2000;120(1):14 – 29. http://www.sciencedirect.com/science/article/pii/S0377221798003920.
 Delaunay [1934] Delaunay B. Sur la sphère vide. Bulletin of Academy of Sciences of the USSR 1934;7:793–800.
Appendices
C Review of Gaussian Processes
A Gaussian Process is defined as
“a collection of random variables, any finite number of which have a joint normal distribution”
[38]. A Gaussian Process is specified by its mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$.
We write $f(\cdot) \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big)$.
In practice, a simulator $f$ is evaluated $n$ times to obtain a set of design points, $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i = f(\mathbf{x}_i)$. Typically, the inputs are transformed so that their domain is the unit cube. A Gaussian Process emulator model approximates the simulator with
$$ y_i = m(\mathbf{x}_i) + Z(\mathbf{x}_i), $$
where $Z(\cdot)$ is a zero-mean Gaussian Process and $\mathbf{x}_i \in [0, 1]^d$.
Commonly, the mean function is a constant model $m(\mathbf{x}) = \mu$, but it may also be a function of the input vectors, e.g., a linear component $m(\mathbf{x}) = \mathbf{h}(\mathbf{x})^\top \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of coefficients and $\mathbf{h}(\mathbf{x})$ is a vector of known functions of the input vectors. Under this specification, the residuals from the linear regression are modeled by a zero-mean Gaussian Process. Although specifying a non-constant $m(\cdot)$ is not necessary, doing so may increase the smoothness of the fitted Gaussian Process and hence may require fewer design points for a “good” model fit [8].
The “behavior” of the zero-mean Gaussian Process is determined by the covariance function $k(\mathbf{x}, \mathbf{x}')$. Many choices for the covariance function exist; see [38]
for a discussion. A common class of functions assumes a stationary covariance process, i.e., that the relationship between any two points depends only on their distance and not their location. Within this class, a typical choice for the covariance function is the squared exponential or radial basis function
$$ k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\!\left\{ -(\mathbf{x} - \mathbf{x}')^\top \mathbf{B} \,(\mathbf{x} - \mathbf{x}') \right\} = \sigma^2 R(\mathbf{x}, \mathbf{x}'), $$
where $\sigma^2$ is the variance of the process, $\mathbf{B} = \mathrm{diag}(b_1, \dots, b_d)$ a diagonal matrix of non-negative roughness parameters, and $R(\cdot, \cdot)$ the implied correlation function. Note that $k(\mathbf{x}, \mathbf{x}) = \sigma^2$, and that the correlation between $Z(\mathbf{x})$ and $Z(\mathbf{x}')$ is positive and decreases with the distance between $\mathbf{x}$ and $\mathbf{x}'$ [50, 38]. The roughness parameters determine how quickly the correlation decreases in each dimension. A large value of $b_j$ implies a low correlation along dimension $j$, even when two points are close.
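As a quick numerical check of these properties, the squared exponential correlation can be written directly; the function name and inputs below are illustrative.

```python
import numpy as np

def sq_exp_corr(x1, x2, b):
    """Squared exponential correlation R(x, x') = exp{-(x - x')^T B (x - x')},
    where B = diag(b) holds the non-negative roughness parameters."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.exp(-np.sum(np.asarray(b, float) * d ** 2)))

# R(x, x) = 1, and larger roughness values make the correlation decay
# faster with distance in that dimension.
sq_exp_corr([0.3, 0.7], [0.3, 0.7], b=[1.0, 5.0])  # -> 1.0
```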
Let $\mathbf{y} = (y_1, \dots, y_n)^\top$ be a vector of $n$ observations modeled with a Gaussian Process. From the definition of the Gaussian Process, these observations follow an $n$-dimensional multivariate normal distribution. The parameters $\mu$, $\sigma^2$, and $\mathbf{B}$ of the mean function and the covariance function can be obtained by optimizing the likelihood of the observations [7, 38, 30]. For example, for a constant mean function $m(\mathbf{x}) = \mu$, the mean and the variance of the Gaussian Process can be expressed in closed form as a function of the roughness parameters
$$ \hat{\mu} = \frac{\mathbf{1}^\top \mathbf{R}^{-1} \mathbf{y}}{\mathbf{1}^\top \mathbf{R}^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{1}\hat{\mu})^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu})}{n}, $$
where $\mathbf{1}$ is an $n$-vector of ones and $\mathbf{R}$ is a symmetric positive definite correlation matrix in which the correlation between the $i$th and $j$th observations is $R_{ij} = R(\mathbf{x}_i, \mathbf{x}_j)$. The negative profile log likelihood
$$ -2 \log L \propto \log |\mathbf{R}| + n \log \hat{\sigma}^2 $$
is a function of the roughness parameters because $\hat{\mu}$, $\hat{\sigma}^2$, and $\mathbf{R}$ are functions of $\mathbf{B}$. In the equation above, $|\mathbf{R}|$ is the determinant of $\mathbf{R}$. Optimizing the likelihood yields estimates $\hat{\mathbf{B}}$, and thus estimates for $\hat{\mu}$, $\hat{\sigma}^2$, and the correlation matrix $\mathbf{R}$.
The best linear unbiased predictor of the output value at a new input vector $\mathbf{x}^*$ is [7, 38]
$$ \hat{y}(\mathbf{x}^*) = \hat{\mu} + \mathbf{r}^\top \mathbf{R}^{-1} (\mathbf{y} - \mathbf{1}\hat{\mu}), $$
where $\mathbf{r}$ is a vector of correlations between $\mathbf{x}^*$ and the input vectors from the design points, such that the $i$th element of $\mathbf{r}$ is $r_i = R(\mathbf{x}^*, \mathbf{x}_i)$. The prediction error is [7, 38]
$$ s^2(\mathbf{x}^*) = \hat{\sigma}^2 \left( 1 - \mathbf{r}^\top \mathbf{R}^{-1} \mathbf{r} + \frac{(1 - \mathbf{1}^\top \mathbf{R}^{-1} \mathbf{r})^2}{\mathbf{1}^\top \mathbf{R}^{-1} \mathbf{1}} \right). $$
The maximum likelihood estimate $\hat{\mathbf{B}}$ is used to obtain $\mathbf{R}$ and $\mathbf{r}$ in the two prediction formulas above.
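The closed-form estimates and the two prediction formulas above can be sketched numerically as follows, taking the roughness parameters as given rather than optimizing the profile likelihood; the function name and the toy design are illustrative.

```python
import numpy as np

def fit_predict_gp(X, y, b, x_new):
    """Constant-mean GP sketch: closed-form mu_hat and sigma2_hat for
    fixed roughness parameters b, then the BLUP and its standard error
    at x_new. (The roughness parameters are taken as given here, not
    estimated by maximum likelihood.)"""
    X, y, b, x_new = map(np.asarray, (X, y, b, x_new))
    n = len(y)
    # Correlation matrix R_ij = exp{-(x_i - x_j)^T B (x_i - x_j)}
    diff = X[:, None, :] - X[None, :, :]
    R = np.exp(-np.einsum("ijk,k,ijk->ij", diff, b, diff))
    Rinv = np.linalg.inv(R)
    one = np.ones(n)
    # Closed-form estimates of the mean and process variance
    mu = (one @ Rinv @ y) / (one @ Rinv @ one)
    resid = y - mu
    sigma2 = (resid @ Rinv @ resid) / n
    # Correlations r_i = R(x_new, x_i) with the design points
    d = X - x_new
    r = np.exp(-np.einsum("ik,k,ik->i", d, b, d))
    # BLUP and its prediction error
    y_hat = mu + r @ Rinv @ resid
    se2 = sigma2 * (1.0 - r @ Rinv @ r
                    + (1.0 - one @ Rinv @ r) ** 2 / (one @ Rinv @ one))
    return float(y_hat), float(np.sqrt(max(se2, 0.0)))
```

At a design point the BLUP interpolates the observed output and the prediction error vanishes, as expected for a deterministic simulator.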
D Algorithm for candidate points
The identification of the set of candidate input vectors used in ALGORITHM 1 is outlined in ALGORITHM 2 (Box D). Figure D.1 provides a second, higher-dimensional example, in which the steps can be illustrated more clearly. To start, the algorithm requires a set of design points $D$ (Figure D.1 (a)), with $X$ denoting the corresponding set of input vectors, and an emulator $\hat{f}$ fit with $D$.
First, a triangulation $T$ of $X$ is obtained (Step 1, Figure D.1 (b)). The triangulation partitions the convex hull of $X$ into $d$-dimensional simplexes $T_k$ using all $\mathbf{x} \in X$ as vertices. The partitioning is such that, for $k \neq l$, $T_k \cap T_l$ is either the empty set or a lower-dimensional simplex (a shared extreme point, line, or facet).
Any triangulation would do. For relatively small numbers of points (in the few hundreds) and few dimensions (say, less than 10), it is practical to use a Delaunay triangulation, for which many routines exist [51]. Within each returned simplex $T_k$, we identify the vector $\mathbf{x}^*_k$ that maximizes the standard error of the predictions with emulator $\hat{f}$, over all $\mathbf{x} \in T_k$ (Step 1 in ALGORITHM 2, Figure D.1 (c)). The set of unique $\mathbf{x}^*_k$ are selected as the candidate input vectors to be exploited in ALGORITHM 1.
For a given set of design points $D$, let $X$ be the set of the corresponding input vectors, and $\hat{f}$ an emulator fit with $D$.
Step 1. Obtain a triangulation $T$ of $X$. For each simplex $T_k$ in $T$:
find the input vector $\mathbf{x}^*_k$ that maximizes the standard error of the prediction over all $\mathbf{x} \in T_k$.
Step 2. Return the candidate input vectors: the unique input vectors $\mathbf{x}^*_k$ identified in Step 1.
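The steps above can be sketched with SciPy's Delaunay routine standing in for the triangulation library used in this work; the `standard_error` callable and the interior sampling scheme are illustrative simplifications (the maximization within each simplex is done over a random sample of interior points rather than with a formal optimizer).

```python
import numpy as np
from scipy.spatial import Delaunay

def candidate_points(X, standard_error, n_grid=200, seed=0):
    """Sketch of ALGORITHM 2: triangulate the design inputs X, then
    within each simplex pick the sampled interior point with the
    largest emulator standard error. `standard_error` is a
    hypothetical callable for the fitted emulator's prediction SE."""
    X = np.asarray(X, float)
    tri = Delaunay(X)                       # Step 1: triangulation of X
    rng = np.random.default_rng(seed)
    candidates = []
    for simplex in tri.simplices:
        verts = X[simplex]                  # vertices of this simplex
        # Sample interior points as random convex combinations of vertices
        w = rng.dirichlet(np.ones(len(verts)), size=n_grid)
        pts = w @ verts
        se = np.array([standard_error(p) for p in pts])
        candidates.append(pts[np.argmax(se)])   # max-SE point in simplex
    # Step 2: return the unique candidate input vectors
    return np.unique(np.round(candidates, 8), axis=0)
```

For a unit-square design, the routine returns at most one candidate per simplex, each lying inside the convex hull of the design points.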