1 Introduction
Simulation studies are computer experiments that involve creating data by pseudorandom sampling from known probability distributions. They are an invaluable tool for statistical research, particularly for the evaluation of new methods and for the comparison of competing methods. Simulation studies are much used in the pages of
Statistics in Medicine, but our experience is that some statisticians lack the necessary understanding to execute a simulation study with confidence, while others are overconfident and so fail to think carefully about design and report results poorly. Proper understanding of simulation studies would enable the former to both run and critically appraise published simulation studies themselves and the latter to conduct simulation studies with greater care and report with transparency. Simulation studies are empirical experiments and statisticians should therefore use knowledge of experimental design and analysis in running them. As we shall see, inadequacies with design, analysis and reporting lead to uncritical use and interpretation of simulation studies. In this context, better understanding of the rationale, design, execution, analysis and reporting of simulation studies is necessary to improve understanding and interpretation of the findings.Simulation studies are used to obtain empirical results about the performance of statistical methods in certain scenarios, as opposed to more general analytic (algebraic) results, which may cover many scenarios. It is not always possible, or may be difficult, to obtain analytic results. Simulation studies come into their own when methods make wrong assumptions or data are messy because they can assess the resilience of methods in such situations. This is not always possible with analytic results, where results may apply only when data arise from a specific model.
‘Monte Carlo simulation’ means statistical techniques that use repeated random sampling, and has many uses that are not simulation studies
. For example it is required for example multiple imputation and Markov Chain Monte Carlo methods. The remainder of this paper does not consider these uses of Monte Carlo simulation, unless the properties of some such method are being evaluated by a simulation study.
There are many ways to use simulation studies in medical statistics. Some examples are:

To check algebra (and code), or to provide reassurance that no large error has been made, where a new statistical method has been derived mathematically.

To assess the relevance of largesample theory approximations (e.g. considering the sampling distribution of an estimator) in finite samples.

The absolute evaluation of a new or existing statistical method. Often a new method is checked using simulation to ensure it works in the scenarios for which it was designed.

The comparative evaluation of two or more statistical methods.

Calculation of sample size / power, when designing a study under certain assumptions about the datagenerating process^{1}.
This article is focused primarily on using simulation studies for the evaluation of methods. Simulation studies for this purpose are typically motivated by frequentist theory and used to evaluate the frequentist properties of methods, even if the methods are Bayesian^{2, 3}.
Simulation studies are empirical experiments, and statisticians – particularly those doing working in applications such as clinical trials – should be familiar with good practice regarding design, analysis, presentation and reporting. It seems that as a profession we fail to follow this practice in our methodological work, as previously lamented by Hoaglin & Andrews^{4}, Hauck & Anderson^{5}, Ripley^{6}, Burton et al.^{7}, and Koehler, Brown & Haneuse^{8}. For example, few reports of simulation studies acknowledge that Monte Carlo procedures will give different results when based on a different set of random numbers and hence are subject to uncertainty; failing to report measures of uncertainty would be unacceptable in medical research. There are some wonderful books on simulation methods in general ^{6, 9, 10} and several excellent articles encouraging rigour in specific aspects of simulation studies (for example ^{1, 4, 5, 8, 11, 12, 13, 14, 15, 16}) but, until now, no unified practical guidance on simulation studies.
This tutorial outlines the rationale for using simulation studies, and offers practical guidance for design, execution, analysis, reporting and presentation. More specifically we: introduce a structured approach for planning and reporting simulation studies; provide coherent terminology for simulation studies; offer guidance on coding simulation studies; critically discuss key performance measures and their estimation; make suggestions for structuring tabular and graphical presentation of results; and introduce several new graphical presentations. This guidance should enable practitioners to execute a simulation study for the first time and contains much for more experience practitioners. For reference, the main steps involved, key decisions and recommendations are summarised in table 1.
The structure of this tutorial is as follows. We describe a review of a sample of the simulation studies reported in Statistics in Medicine Volume 34 (section 2). In section 3 we outline a systematic approach to planning simulation studies, using the new ‘ademp’ structure (which we define there). Section 4 gives guidance on computational considerations for coding simulation studies. In section 5, we discuss the purposes of various performance measures and their estimation, stressing the importance of estimating and reporting
the Monte Carlo standard error (SE) as a measure of uncertainty due to using a finite number of simulation repetitions
. Section 6 outlines how to report simulation studies, again using the ademp structure, and offers guidance on tabular and graphical presentation of results. Section 7 works through a simple simulation to illustrate in practice the approaches that we are advocating. Section 8 offers some concluding remarks, with a short subsection (8.1) that considers some future directions. Examples are drawn from the review and from the authors’ areas of interest (which relate mainly to modeling survival data, missing data, metaanalysis and randomised trial design).Section  
Planning  3  
Aims  3.1  
Identify specific aims of simulation study.  
Datagenerating mechanisms  3.2  
In relation to the aims, decide whether to use resampling or simulation from some parametric model. 

For simulation from a parametric model, decide how simple or complex the model should be and whether it should be based on real data.  
Determine what factors to vary and what levels of factors to use,  
Decide whether factors should be varied fully factorially, partly factorially or oneatatime.  
Estimand / target of analysis  3.3  
Define estimands and/or other targets of the simulation study.  
Methods  3.4  
Identify methods to be evaluated and consider whether they are appropriate for estimand/target identified. For method comparison studies, make a careful review of the literature to ensure inclusion of relevant methods.  
Performance measures  3.5, 5.2  
List all performance measures to be estimated, justifying their relevance to estimands or other targets.  
For lessused performance measures, give explicit formulae for the avoidance of ambiguity.  5.2  
Choose a value of that achieves acceptable Monte Carlo SE for key performance measures.  5.2, 5.3  
Coding and execution  4  
Separate scripts used to analyse simulated datasets from scripts to analyse estimates datasets.  
Start small and build up code, including plenty of checks.  
Set the random number seed once per simulation repetition.  
Store the random number states at the start of each repetition.  
If running chunks of the simulation in parallel, use separate streams of random numbers ^{17}.  
Analysis  5  
Conduct exploratory analysis of results, particularly graphical exploration.  
Compute estimates of performance and Monte Carlo SEs for these estimates.  5.2  
Reporting  6  
Describe simulation study using ademp structure with sufficient rationale for choices.  
Structure graphical and tabular presentations to place performance of competing methods sidebyside.  
Include Monte Carlo SE as an estimate of simulation uncertainty.  5.2  
Publish code to execute the simulation study including userwritten routines.  8 
2 Simulation in practice: a review of Statistics in Medicine, Volume 34
We undertook a review of practice in articles published in Volume 34 of Statistics in Medicine (2015). This review recorded information relevant to the ideas in this article. In this section we briefly outline the review but do not give results. Instead, results appear in the tutorial at relevant points.
We restricted attention to research articles, excluding tutorials in biostatistics, commentaries, book reviews, corrections, letters to the editor and authors’ responses. In the volume, there were a total of 264 research articles of which 199 (75%) included at least one simulation study.
In planning the review, we needed to select a sample size. Most of the questions of interest involved binary answers. For such questions, to estimate proportions with maximum standard error of 0.05 (occurring when the proportion is 0.5), we randomly selected 100 articles that involved a simulation study, before randomly assigning articles to a reviewer. TPM reviewed 35 simulation studies, IRW reviewed 34 and MJC reviewed 31. In case the reviewer was an author or coauthor of the article, the simulation study was swapped with another reviewer. TPM also reviewed five of the simulation studies allocated to each of the other reviewers to check agreement on key information.
3 Planning simulation studies using ademp
For clarity about the concepts that will follow, we introduce some notation in table 2.
An estimand (conceptually); also true value of the estimand  
Sample size of a simulated dataset  
Number of repetitions used; the simulation sample size  
Indexes the repetitions of the simulation  
the estimate of from the th repetition  
the mean estimate of across repetitions  
the estimate of from the th repetition  
the true variance of , which can be estimated with large 

the nominal significance level  
the pvalue returned by the th repetition 
In the following sections, we outline the ademp structured approach to planning simulation studies. This acronym comes from: Aims, Datagenerating mechanisms, Methods, Estimands, Performance measures.
3.1 Aims
In considering the aims of a simulation study it is instructive to first consider desirable properties of an estimator of from a frequentist perspective.

should be consistent: as , . It is also desirable that be unbiased for in finite samples:
. Some estimators may be consistent but exhibit smallsample bias (logistic regression for example). Kahan reports a procedure that appears to be unbiased but inconsistent
^{18}. 
The sample estimate should be a consistent estimate of the longrun variance of (see for example ^{19}).

Confidence intervals should have the property that at least of intervals contain (see section 5.2).

It is desirable that be as small as possible: that be an efficient estimator of .
There are other properties we may hope for, but these generally involve combinations of the above. For example, short average confidence interval length may be desirable; this relates to (4) and its validity depends on (1), (2) and (3). Mean squared error is a combination of (1) and (4). Further, properties may be traded off; small bias may be accepted if there is a substantial reduction in the standard error.
The aims of a simulation study will typically be set out in relation to the above properties, depending on what specifically we wish to learn. A simulation study might primarily investigate: large or smallsample bias (e.g. ^{20}; precision, particularly relative to other available methods (e.g. ^{21}); Variance estimation (e.g. ^{22}); or robustness to misspecification (e.g. ^{23}).
There is a distinction between simulation studies that offer a proofofconcept, i.e. showing that a method is viable (or fallible) in some settings, and those that aim to stretch or break methods, i.e. identifying settings where the method may fail. Both are useful and important in statistical research. For example, one may be faced with two competing methods of analysis, both of which are equally easy to implement. Even if the choice is unlikely to materially affect the results, it may be useful to have unrealistically extreme datagenerating mechanisms to understand when and how methods fail; e.g. ^{23}.
Alternatively, it may be of interest to compare methods where some or all methods have been shown to work in principle but the methods under scrutiny were designed to address slightly different problems. They may be put headtohead in realistic scenarios. This could be to investigate properties when one method is correct – How badly do others fail? – or when all are incorrect in some way – Which is most robust? No method will be perfect and it is useful to understand how methods are likely to perform in the sort of scenarios that might be expected in practice. However, such an approach poses tough questions in terms of generating data: Does the datagenerating mechanism favour certain methods over others? How can this be justified? One common justification is by reference to motivating data. However, in the absence of a broad spectrum of such motivating data, there is a risk of failing to convince readers that a method is fit for general use.
3.2 Datagenerating mechanisms
We use the term ‘datagenerating mechanism’ to denote how random numbers are used to generate a dataset. This is in preference to the term ‘datagenerating model’, which implies parametric models and so is a specific class of datagenerating mechanism. It is not the purpose of this article to explain how specific types of data should be generated. See Ripley ^{6} or Morgan ^{9} for methods to simulate data from specific distributions.
Data may be generated by producing parametric draws from a known model (once or many times), or by repeated resampling with replacement from a specific dataset (where the true datagenerating model is unknown). For resampling studies, the true datagenerating mechanism is unknown and resamples are used to study the sampling distribution. While parametric simulation can explore many different datagenerating mechanisms (which may be completely unrealistic), resampling typically explores only one mechanism (which will be relevant for at least the study at hand).
The choice of datagenerating mechanism(s) will depend on the aims. As noted above, we might investigate a method under a simple datagenerating mechanism, under a realistic mechanism, or under a completely unrealistic mechanism designed to stretch a method to breaking point.
Simulation studies provide us with empirical results for specific scenarios. For this reason, simulation studies will often involve more than one datagenerating mechanism to ensure coverage of different scenarios. For example, it is very common to vary the sample size of simulated datasets.
Much can be controlled in a simulation study and statistical principles for designing experiments can and should thus be called on. In particular, there is often more than one factor that will vary across specific datagenerating mechanisms. Varying these factorially is likely to be more informative than onebyone away from a ‘basecase’ datagenerating mechanism, as doing so permits the exploration of interactions between factors. There are however practical implications that might make this infeasible. The first regards presentation of results (covered in section 6) and the second computational time. If the issue is simply around presentation, it may be preferable to define a ‘base case’ but perform a factorial simulation study anyway and, if results are consistent with no interaction, presentation can vary factors away from the base case onebyone.
If the main issue with executing a fully factorial design is computational time, it may be necessary for the simulation study to follow a nonfactorial structure. Three approaches are noted below.
A first pragmatic check may be to consider interactions only where main effects exist. If performance seems acceptable and does not vary according to factor A, it would seem unlikely to have chosen a datagenerating mechanism that happened to exhibit this property when performance would have been poor for other choices of datagenerating mechanism.
A more careful approach could be taken based on making and checking predictions beyond the datagenerating mechanisms initially used; an idea similar to external validation. Suppose we have two factors, A and B, where and in the datagenerating mechanism. The basecase is . If the nonfactorial portion of the design varies A from 1 to 8 holding , and varies B from 1 to 5 holding , this portion of the simulation study could be used to predict performance when . Predictions may be purely qualitative (‘bias increases as A increases and as B increases so when we increase both together we would expect even larger bias’), or quantitative (based on the marginal effects after fitting a model to existing results, thereby producing explicit predictions at unexplored values of A and B). The simulation study can then be rerun for that single datagenerating mechanism, say and predictions compared with the empirical results (with a responsibility to expore further when predictions are poor or incorrect).
Finally, a more satisfactory solution is of course to use a fractional factorial design for the datagenerating mechanisms^{24, 3}.
In our review, 97 simulation studies used some form of parametric model to generate data while three used resampling methods. Of the 97 that simulated from a parametric model, 27 based parameter values on data, one based parameter values partly on data, and the remaining 69 on no data. Of these 97, 91 (94%) provided the parameters used. The most careful example ^{25} explored analysis of metaanalysis data and drew the design factors from empirical data on 14,886 performed metaanalyses from 1,991 Cochrane Reviews. The total number of datagenerating mechanisms per simulation study ranged from 1 to ; figure 7 (in the appendix) summarises aspects of the datagenerating mechanisms. Where more than one factor was varied, fully factorial designs were the most frequent, while some used partially factorial designs. None used any of the alternative approaches we have described.
3.3 Estimands and other targets
The majority of simulation studies evaluate or compare methods for estimating one or more population quantities, which we term estimands and denote by . Usually an estimand is a parameter of the data generating model, but occasionally it is some other quantity. For example, when fitting regression models with parameter , the estimand may be a specific , a measure of prognostic ability, the fitted outcome mean, or something else. To choose a relevant estimand it is important to understand the aims of analysis in practice.
The choice of estimand is important and is sometimes a simple matter of stating a parameter of interest. At other times, it is more subtle. For example, a logistic regression model unadjusted for covariates implies a marginal estimand; a model adjusted for covariates implied a conditional estimand with a different true value (this example is further unpacked in section 3.4).
Not all simulation studies evaluate or compare methods that concern an estimand. Other simulation studies evaluate methods for testing a null hypothesis, for selecting a model, or for prediction. We refer to these as
targets of the simulation study. The same statistical method could be evaluated against multiple targets. For example, the best method to select a regression model to estimate the coefficient of an exposure (targeting an estimand) may differ from the best model for prediction of outcomes (targeting prediction). Where a simulation study evaluates methods for design, rather than analysis, of a biomedical study, the design is the target.Table 3 summarises different possible targets of a simulation study and suggests some performance measures (described more fully in section 3.5) that may be relevant for each target, with examples taken from Volume 34.
Statistical task  Target  Examples of performance  Example 

measures  
Analysis  
Estimation  Estimand  Bias, empirical SE, meansquared error, coverage  Kuss compares a number of existing methods in terms of bias, power and coverage ^{25} 
Testing  Null hypothesis  Type I error rate, power  Chaurasia and Harel compare new methods inn terms of type I and II error rates ^{26} 
Model selection  Model  Correct model rate, sensitivity or specificity for covariate selection  Wu et al. compare four new methods in terms of ‘true positive’ and ‘false positive’ rates of covariate selection ^{27} 
Prediction  Prediction/s  Measures of predictive accuracy, calibration, discrimination  Ferrante compares four methods in terms of mean absolute prediction error, etc. ^{28} 
Design  
Design a study  Selected design  Sample size, expected sample size, power / precision  Zhang compares designs across multiple datagenerating mechanisms in terms of number of significant test results (described as ‘gain’) and frequency of achieving the (near) optimal design ^{29} 
In our review, 64 simulation studies targeted an estimand, 21 targeted a null hypothesis, eight targeted a selected model, three targeted predictive performance, and four had some other target. Of the 64 targeting an estimand, 51 stated what the estimand was (either in the description of the simulation study or elsewhere in the article). A figure detailing the number of estimands in simulation studies that targeted an estimand is given in the appendix, figure 8.
3.4 Methods
The term ‘method’ is generic. It most often refers to a model for analysis, but might refer to a design or some procedure (such as a decision rule). For example, ^{18} and ^{30} evaluated procedures that involved choosing an analysis based on the result of a preliminary test in the same data.
In some simulation studies there will be only one method with no comparators. In this case, selecting the method to be evaluated is very simple. When we aim to compare several methods in order to identify the best, it is important to include serious contenders. There are two issues.
First, it is necessary to have knowledge of previous work in the area to understand which methods are and are not serious contenders. Some methods may be legitimately excluded if they have already been shown to be flawed, and it may be unnecessary to include such methods if the only consequences are repetition of previous research and bloating of results. An exception might be if a flawed method is used frequently in practice (for example last observation carried forward with incomplete longitudinal data, or the ‘’ design for doseescalation).
Second, code. New methods are sometimes published but not implemented in any software (for example ^{31} and ^{32}), implemented poorly, or implemented in unfamiliar software. For methods that have not been implemented, it is hard to argue that they should be included in simulation studies. Although R and Stata appear to dominate for userwritten methods, it is not uncommon to see methods implemented in other packages. See section 4.3 for a discussion of software implemented in other software. Note that one important role of simulation is to verify that code is correct.
Methods may also be excluded if they do not target the specified estimand/s. Understanding whether methods target an estimand or not can be subtle. Returning to the example of randomised trial with a binary outcome, one might compare two logistic regression analyses, one unadjusted and one adjusted for a covariate, where the estimand is the log odds ratio for randomised group. In a simulation study, one would be likely to find that the two methods give different mean estimates, and it would be tempting to conclude that at least one of the methods is biased. However, the two methods target different estimands – that is, the true unadjusted and adjusted log odds ratios differ.
^{33}One way to tackle the problem of different estimands is to ensure that both methods estimate the same estimand: in the example of the randomised trial using logistic regression, this would involve postprocessing the adjusted regression results to estimate the adjusted marginal odds ratio, which is the same estimand as the unadjusted analysis ^{34}. This of course implies that the adjustment vs. non adjustment is the method comparison we are interested in (it may not be), and that the conditional estimand is simply a nuisance part of standard adjustment. An alternative (but coarser) way to tackle the problem is to target the null hypothesis, if the two methods test the same null. In the logistic regression example described above, because the setting is a randomised trial, the null hypothesis that the odds ratio equals 1 is the same whether the odds ratio is conditional or marginal.
The number of methods evaluated in our review of Volume 34 ranged from 1 to 33 (see figure 8).
Nonconvergence and other related issues such as perfect prediction (‘separation’^{35}) can blight some simulation studies. In such situations, there is a conceptual issue with defining a method. A ‘pure’ method evaluation would simply assess performance of a model when it converges. However, in practice, an analyst whose model fails to converge would not give up but explore other models until one converges. Thus, evaluation of such a procedure may be of interest in simulation studies. Crowther, Look and Riley evaluated such a procedure^{36}: if a model failed to converge, they applied a model with more quadrature points. We will commment further on this issue in section 5.2.
3.5 Performance measures
The term ‘performance measure’ describes a numerical quantity used to assess the performance of a method. The equivalent term ‘operating characteristic’ is sometimes used, particularly in the context of study designs (see for example ^{37}). Statistical methods for estimation may output for example an estimate , its variance (or standard error
), degrees of freedom, confidence intervals, test statistics and more (such as an estimate of prognostic performance).
The performance measures required in a simulation study depend on the aims and what the study targets (see section 3.3). When the target is an estimand, the most obvious performance measure to consider is bias: the amount by which exceeds on average (this can be positive or negative). Precision and coverage of confidence intervals will also be of interest. Meanwhile, if the target is a null hypothesis, power and type I error rates will be of primary interest. A simulation study targeting an estimand may of course also assess power and type I error.
The performance measures seen in our review are summarised in table 4. The denominator changes according across performance measures because some are not applicable for some simulation studies. Further, sometimes simulation studies had secondary targets. For example, nine simulation studies primarily targeted a null hypothesis but secondarily targeted an estimand and could have assessed bias, and one of these did so. For eight articles, some performance measures were unclear. In some, a performance measure was given a name that its formula demonstrated to be misleading (an example is the term ‘mean error’, which is bias, when the formula is for mean absolute error), emphasising the importance of clear terminology in simulation studies.
Overall  By primary target  

Null hyp  Selected  Predictive  
Performance  Estimand  othesis  model  performance  Other  
measure  ()  ()  ()  ()  ()  
Convergence  12/85 (14%)  10/61 (16%)  1/15 (7%)  1/6 (17%)  0/2  0/1 
Bias  63/80 (79%)  59/64 (92%)  1/9 (11%)  0/2  2/3  1/2 
Empirical SE  31/78 (40%)  31/62 (50%)  0/9  0/2  0/3  0/2 
Mean squared error  26/78 (33%)  22/62 (35%)  2/9 (22%)  0/2  1/3  1/2 
Model SE  22/77 (29%)  21/62 (34%)  1/9 (11%)  0/2  0/3  0/1 
Type I error  31/95 (33%)  8/62 (13%)  18/21 (86%)  4/6  0/3  1/3 
Power  28/95 (29%)  8/63 (13%)  14/20 (17%)  4/6  0/3  2/3 
Coverage  42/79 (53%)  39/63 (62%)  1/9 (11%)  0/2  1/3  1/2 
Conf. int. length  11/80 (14%)  9/63 (14%)  0/10  0/2  1/3  1/2 
Note – denominator changes across performance measures because not all are applicable in all simulation studies
Description and estimation of common performance measures of interest are given in section 5. An important point to appreciate in design and analysis is that simulation studies are empirical experiments, meaning performance measures are themselves estimated, and estimates of performance are thus subject to error. This fundamental feature of simulation studies does not seem to be widely appreciated, as previously noted^{6}. The implications are twofold. First, we should present estimates of uncertainty (quantified as the Monte Carlo standard error; see section 5.2). Second, we need to consider the number of repetitions and how this can be chosen (see section 5.2).
4 Computational and programming issues in simulation studies
In this section we discuss consideration when coding a simulation study. It is useful to understand what sort of data are involved. There may be up to four classes of dataset, listed and described in table 5.
Dataset  Description and notes 

Simulated  A dataset of size produced with some element of randomnumber generation, to which one or more methods are applied to produce some quantity relating to the target of the study, such as an estimate of . 
Estimates  Dataset containing summaries of information from repetitions (e.g. , , indication of hypothesis rejection) for each combination of datagenerating mechanism, method and target (e.g. each estimand). 
States  Dataset of containing randomnumbergenerator states: the start state for each simulated dataset and the final state (see section 4.1). 
Performance measures  Dataset containing estimated performance and Monte Carlo standard errors for each datagenerating mechanism, method and target. 
†or corresponding summaries for nonestimand targets
4.1 Random numbers: setting seeds and storing states
All statistical packages capable of Monte Carlo simulation use a pseudorandomnumber generator. Each random number is a deterministic function of the current ‘state’ of the randomnumber generator. After a random number is produced, the state changes, ready to produce the next random number. Because the function is deterministic, the state can be set. Typically, the state is set using a ‘seed’. Seeds to not necessarily map 1:1 to states, and provide doors onto the path of possible states. After enough randomnumber draws (a very large number in software using modern pseudorandomnumber generators), the state will eventually repeat: the path is circular.
The ‘pseudo’ element to randomnumber generators is sometimes characterised as negative. This is perhaps an artefact of the fact that some early algorithms provided very poor imitations of random numbers. However, modernera algorithms such as the Mersenne Twister do not suffer from these problems and can, for simulation purposes, be regarded as truly random when used correctly. The toss of a coin or roll of a die may be regarded as equally deterministic, albeit the result of a complex set of unknown factors that act in an uncontrollable fashion. These are not denigrated with the term ‘pseudorandom’: in statistical teaching they are often given as the ultimate example of randomness. However, many stage magicians can control the flip of a coin! If a computer pseudorandom number generator is sufficiently unpredictable and passes the various tests for randomness, it is churlish to regard the ‘pseudo’ aspect as a weakness.
There are several positive implications of using a deterministic and reproducible process for generating random numbers. First, if the number of repetitions is regarded as insufficient, the simulation study can continue from its end state. Second, and more importantly, if a certain repetition results in some failure such as nonconvergence, the starting state for that repetition can be noted and the repetition rerun under that state, enabling better understanding of when the method does not work so that issues leading to nonconvergence can be tackled. Finally, the whole simulation study can be independently run by other researchers, giving the potential for exact (rather than approximate) reproduction of results and the scope for additional methods to be included.
Our practical advice for utilising the deterministic nature of randomnumber generators is simple but strong: 1. set the seed at the beginning, once and only once; 2. store the state of the randomnumber generator often (ideally once at the beginning of each repetition and once following repetition ). This is important; the following chunk of pseudocode demonstrates the concept:
SET randomseed to #
FOR repetition 1 to n_sim
STORE repetition and randomstate in StatesData[repetition]
GENERATE simulated dataset
…
END FOR STORE nsim+1 and randomstate in StatesData[nsim+1]
The reason for this advice is to avoid unintended dependence between simulated datasets. We will illustrate our caution: one undesirable method of knowing the states for
repetitions is to set an initial seed and generate a single vector of length
by recording the starting state, generating a single random number, recording the new state, and so on. For the simulation itself, the seed for the th repetition is then set to the th element. To clarify the problem, let and let the first simulation step be generation of vector from a Uniform(0,1) distribution. The first repetition simulates (which changes the random number state four times) and proceeds. The second repetition then simulates , which is made up of observations 2 to 4 from repetition and just one new value. Run in Stata 15 (see supplementary material), the resulting draws of for the four repetitions are:Note that elements with the same shading contain the same values across rows: The fourth element of is the first element of and appears in all repetitions. Only when is the draw of actually independent of the first repetition. Such dependency in simulated data can compromise both performance estimates and Monte Carlo SEs and must be avoided.
4.1.1 ‘Stream’ random numbers
It is common for parts of simulation studies – fractions of all the repetitions, for example – to be run in parallel on different cores of highperformance computers (which this article will not mention further). If the advice to set the seed once only is followed, the implication for parallelisation is that, while runs for different datagenerating mechanisms may be parallelised, it is inadvisable to parallelise repetitions within a specific datagenerating mechanism. ‘Stream’ random number generators are designed to handle this problem ^{17}. A conceptual (and caricatured) outline is given by the picture. The period of modern random number generators is extremely long, but any pseudorandom number generator will eventually repeat. The two circles represent every step in this sequence (drawing on the circularpath description at the beginning of section 4.1).
Suppose we wish to parallelise two sets of repetitions. Any simulation study will use random numbers (in order) from a section of the circle. Here, each set of repetitions is represented by a clockwise arrow, and uses of the total of random numbers available in the full circle (a caricature for illustrative purposes; in practice, a much smaller fraction would be used). The seed dictates the position on the circle at which an arrow begins (and thus ends). The random numbers used up by for the first repetitions are represented by the red arrow and for the second by the blue arrow. The left circle depicts two chunks run in parallel with two different, arbitrarilychosen starting seeds. By chance, they may overlap as seen. This would be a cause for concern. The right circle uses separate streams of random numbers. This breaks the circle into quadrants, and setting the same value of a seed within a stream means that the separate chunks will start at the equivalent point on the quadrants and here there is no chance that one stream will enter another. In the absence of streams, repetitions should not be parallelised for the same datagenerating mechanism.
There are several userwritten R packages allowing independent streams of random numbers. In Stata (version 15 or newer), it is achieved with
. set rngstream #
prior to setting the seed. In Sas, it is achieved within a data step with
. call stream(#);
Regardless of the package, the same seed must be used within different values of #.
When a simulation study uses multiple datagenerating mechanisms, these may be run in parallel. Because performance is typically estimated separately for different datagenerating mechanisms, using the same seeds is less of a problem (and may in fact be advantageous, as described in section 5.4).
Many programs execute methods involving some stochastic element. Examples include multiple imputation, the bootstrap, the gcomputation formula, multistate models and Bayesian methods that use Markov Chain Monte Carlo. Commands to implement these methods involve some randomnumber generation. It is important to check that such programs do not manipulate the seed. Some packages do have a default seed if not input by the user. If they do set the seed internally, many of the results will be highly correlated, if not identical, and results should not then be trusted. Checking for such behaviour is worthwhile. One simple technique is to display the current state of the randomnumber generator, twice issue the command, and display the state after each run. If the first and second states are the same then the program probably does not use random numbers. If the first and second states differ but the second and third do not, the seed is being reset by the program.
4.2 Start small and build up code
It is alltooeasy to obtain misleading results in a simulation study through very minor coding errors; see for example the comments section of ^{38}, where fixing a mistake in a single line of code completely changed the results. Such errors are often detected when results are unexpected: for example, when bias appears much greater than theory suggests. One design implication is that methods with known properties should be included where possible as a check that these properties are exhibited. One straightforward and intuitive approach for minimising errors is to start small and specific for one repetition, then build and generalise, including plenty of builtin checks.
In a simulation study with and several simulated variables, a good starting point is to generate one simulated dataset with large . If variables are being generated separately then the code for each should be added one by one and the generated data explored to 1) check that the code behaves as expected and 2) ensure the data have the desired characteristics. For example, in Stata, the rnormal(m,s) function simulates normal variates with mean m
s. The usual notation for a normal distribution uses a mean and
variance. We have seen this syntax trip up several good programmers. By checking the variance of a variable simulated by rnormal in a large simulated dataset, it will be obvious if it does not behave in the expected fashion. The simulation file should be built to include different datagenerating mechanisms, methods or estimands, again checking that behaviour is as expected. Using the above example again, if the basic datagenerating mechanism used , the issue with specifying standard deviations vs. variances would not be detected, but it would for datagenerating mechanisms with . When satisfied with the large dataset being generated, we apply each method.Once satisfied that one large run is behaving sensibly, it is worth setting the required for the simulation study and exploring the simulated datasets produced under a handful of different seeds. When satisfied that the program still behaves sensibly, it may be worth running a few (say tens of) repetitions. If for example convergence problems are anticipated, or bias is expected to be 0, this can be checked informally without the full set of simulations.
After thoroughly checking through and generalising code, the full set of repetitions may be run. However, recall the precaution in section 4.1 to store the states of the randomnumber generator and the reasons. If failure occurs in repetition of , we will want to understand why. In this case, a record of the th start state means we can reproduce the problematic dataset quickly.
While the ability to reproduce specific errors is useful, it is also practically helpful to be able to continue even when an error occurs. For this purpose, we direct readers to the capture command in Stata and the try command in R. The failed analysis must be recorded as a missing value in the Estimates dataset, together with reasons if possible.
4.3 Using different software packages for different methods
It is frequently the case that competing methods are implemented in different software packages, and it would be more burdensome to try and code them all in one package than to implement them in different packages. There are two possible solutions. The first is to simulate data separately in the different packages and then use the methods on those data. The second is to simulate data in one package and export simulated data so that different methods are based on the same simulated datasets.
Both approaches are valid in principle but we advocate the latter for two reasons. First, it is cumbersome to do a job twice, and because different software packages have different quirks, it will not be easy to ensure data really are being generated identically. Second, if data are generated independently for different methods, there will be different (random) Monte Carlo error affecting each repetition. By using the same simulated data for both comparisons, this Monte Carlo error will affect methods’ performance in the same way because methods are matched on the same generated data.
Sixty two simulation studies in our review mentioned software. Table 9 (in the appendix) describes the specific statistical software mentioned. Seven simulation studies mentioned using more than one statistical package.
5 Analysis of estimates data
This section describes estimation for various performance measures along with Monte Carlo SEs. We advocate two preliminaries: checking for missing estimates and plots of the estimates data.
5.1 Checking the estimates data and preliminaries
The number of missing values, e.g. of and (for example due to nonconvergence), is the first performance measure to assess. The data produced under repetitions for which missing values were returned should be explored to understand how a method failed (see section 4) and, ideally, the code made more robust to reduce the frequency of failures.
Missing values in the estimates dataset pose a missing data problem regarding the analysis of other performance measures. It seems implausible that values would be missing completely at random^{39}; estimates will usually be missing due to nonconvergence so will likely depend on some characteristic/s of a given simulated dataset. When the ‘method’ being evaluated involves an analyst’s procedure (as described in section 3.4), for example, the model changes if the firstchoice model does not converge, this can reduce or remove missing values from the estimates data (though it changes the nature of the method being evaluated; see section 3.4).
If more than two methods are evaluated, and one always returns an estimate , then missing values for another method may be related to the returned values for the first method. In the presence of a nontrivial proportion of missing estimates data, analysis of further performance measures should be tentative, particularly when comparing methods with different numbers of missing. ‘Nontrivial’ means any proportion that could meaningfully alter estimated performance. If we are interested in detecting tiny biases, even 1% may be nontrivial.
Before undertaking a formal analysis of the estimates dataset, it is sensible to undertake some exploratory analysis. Plots are often helpful here. For example, ^{18} assessed the performance of a twostage procedure for the analysis of factorial trials. The procedure was unbiased (both conditionally and unconditionally), yet a histogram of exhibited a bimodal distribution with modes equally spaced at either side of , with almost no values of close to . This may cause concern and would have been missed had the analysis proceeded straight to the estimation of performance.
For simulation studies targeting an estimand, the following plots are often informative:

A univariate plot of the distribution of and
for each datagenerating mechanism, estimand and method, to inspect the distribution and, in particular, to look for outliers.

A bivariate plot of vs. for each datagenerating mechanism, estimand and method, with the aim of identifying bivariate outliers.

Bivariate plots of for one method vs. another for each datagenerating mechanism and estimand. The purpose here is to look for correlations between methods and any systematic differences. Where more than two methods are compared, a graph of every method vs. every other or vs. the comparator can be useful.

A plot of confidence intervals fractionally ranked by the significance of the interval’s test against the null (as in figure 5). This is called a zip plot and is a means of understanding any issues with coverage.
These plots will be demonstrated and interpreted in the worked example (section 7).
5.2 Estimation of performance and Monte Carlo standard errors for some common performance measures
This section outlines some common performance measures, properties they are designed to assess, how they are estimated and how Monte Carlo standard errors are computed. We suppress the hat notation for performance measures, but emphasise that these are estimates.
For interpretation of results, performance measures should usually be considered jointly (one could prefer a method with zero variance by conveniently ignoring bias).
Monte Carlo standard errors quantify simulation uncertainty: they provide an estimate of the SE of (estimated) performance due to using finite . The Monte Carlo SE targets the sampling distribution of repeatedly running the same simulation study (with repetitions) under different randomnumber seeds.
In our review of simulation studies in Statistics in Medicine Volume 34, 93 did not mention Monte Carlo SEs for estimated performance. The formulas for computing Monte Carlo SEs given in table 6 with description and comments in the text. For empirical SE, relative % increase in precision, and relative error, the Monte Carlo SE formulas assume normally distributed ; for nonnormal , robust SEs exist – see ^{40}.
Performance  

measure  Definition  Estimate  Monte Carlo SE of estimate 
Bias  
EmpSE  
Relative % increase in precision (B vs. A) * 

MSE  
Average ModSE *  
Relative % error in ModSE * 

Coverage  
Biaseliminated coverage 

Rejection % (power or type I error) 
*Monte Carlo SEs are approximate for Relative % increase in precision, Average ModSE and Relative % error in ModSE.
.
Bias is frequently of central interest, and quantifies whether a method targets on average. Frequentist theory holds unbiasedness to be a key property.
The mean of , , is often reported instead. This is estimated in the same way but without subtracting the constant , and so has the same Monte Carlo SE. It is sometimes preferable to report the relative bias, rather than absolute. If different values of are used for different datagenerating mechanisms then relative bias permits a more straightforward comparison across values. However, relative bias can be used only for . Lack of bias is only one property of an estimator; while it is often of central interest, we may sometimes accept small biases because of other good properties.
The empirical SE is a measure of the precision or efficiency of the estimator of . It depends only on and does not require knowledge of . The empirical SE estimates the longrun standard deviation of over the repetitions. Several other designations are in common use; in our review, the terms used included ‘empirical standard deviation’, ‘Monte Carlo standard deviation’, ‘observed SE’, and ‘sampling SE’.
The empirical standard error can be hard to interpret for a single method (unless compared to a lower bound), and the relative precision is often of interest when comparing methods.
Note that, if either method is biased, relative precision should be interpreted with caution because an estimator that is biased towards the null can have small empirical SE as a result of the bias: has smaller empirical SE than .
A related measure, which also takes the true value of into account, is the mean squared error (MSE). The MSE is the sum of the squared bias and variance of . This appears a natural way to integrate both measures into one summary performance measure (low variance is penalised for bias), but we caution that, for method comparisons, the relative influence of bias and of variance on the MSE tends to vary with (except when all methods are unbiased), making generalisation of results difficult. This issue is illustrated in the first three panels of figure 1, which depict the bias, empirical standard error and root MSE (favoured here because it is on the same scale as Empirical SE) for three hypothetical methods. Method A is unbiased but imprecise (and so root MSE is simply the empirical SE); method B is biased, with bias constant over but more precise (as is often the case with biased methods, see for example ^{41}); and method C, which is biased in a different way (bias and with the same precision as method B. For , root MSE is lower for method B than A, but for , root MSE is lower for method A. The lesson is that comparisons of (root) MSE are more sensitive to choice of than comparisons of bias or empirical SE alone. MSE is nonetheless an important performance measure – particularly when the aims of a method relate to prediction rather than estimation – but the implication is that, when MSE is a performance measure, datagenerating mechanisms should include a range of values of .
We term the average of the estimated SEs the ‘average model SE’. The aim of the model SE is that . The model SE explicitly targets the empirical SE. If it systematically misses, this represents a bias in the estimation of model SE. The relative error in average model SE is then an informative performance measure (some prefer the ratio of average model SE to empirical SE).
Coverage of confidence intervals is a key property for the longrun frequentist behaviour of an estimator. It is defined as the probability that a confidence interval contains .
Note that Neyman’s original description of confidence intervals defined the property of randomisation validity as exactly of intervals containing (see ^{42, 43, 44}). Confidence validity is the property that the true percentage is at least . This latter definition is less well known than the former, with the result that over and undercoverage are sometimes regarded as similarly bad^{45}. Of course, randomisation validity would usually be preferred over confidence validity because it implies shorter intervals – but this is not always the case! There are examples of procedures that return both shorter intervals and higher coverage (see for example ^{43, 44}).
Undercoverage is to be expected if, for example, i) , ii) , iii) the distribution of is not normal and intervals have been constructed assuming normality, or iv) is too variable. Overcoverage occurs as a result of . This may occur either in the absence or presence of issues (i) and (iii).
Undercoverage due to bias will tend to worsen as increases (unless bias reduces at or faster than a rate of ). Intuitively, as increases, confidence intervals zeroin on the wrong value. The situation is illustrated in the fourth panel of figure 1 (with some loss of generality, the true coverage was calculated for normalbased confidence intervals with bias the only source of undercoverage. Method B, for which bias is independent of has deteriorating coverage as increases. Method C, for which , has constant undercoverage. As for MSE, bias dominates as increases. The implication is that, if coverage is being assessed in the presence of bias, the datagenerating mechanisms should include a range of values of .
As noted previously, bias leads to undercoverage, while under or overcoverage may be caused by or other reasons. We propose a decomposition of poor coverage into two parts with a new performance measure: ‘biaseliminated coverage’. In essence, the bias of a method is eliminated from confidence intervals by studying confidence interval coverage for rather than . Note that this is not a performance measure in its own right and we suggest its use for understanding coverage performance. It is obvious that, for the methods in figure 1, biaseliminated coverage will be equal for all methods.
Rejection rates – power and type I error – are often of principal interest in simulation studies that target a null hypothesis, and power is of particular interest when competing designs are being compared by simulation. Assume we have pvalues in the estimates data and are considering nominal significance level . The pvalues may be derived from a Wald statistic or output directly, for example by a likelihoodratio test. An appropriate test would reject a proportion of the repetitions when the null is true and as often as possible when it is false. The obvious warning is that if the test does not control type I error at level , power should be interpreted with caution.
It is sometimes, but not always, of interest to estimate conditional performance. This is particularly true for simulation studies that aim to evaluate alternative study designs, for example where design decisions are made based on the early data. Two simulation studies in our review sample explored twostage procedures in randomised trials, where the estimand is selected after the first stage: the estimand was the treatment effect in a selected subpopulation ^{46} or the effect of a selected treatment ^{47}. In both cases, estimators were designed to be conditionally unbiased. Kimani, Todd & Stallard reported bias conditional on each possible selection of estimand ^{46}, while Carreras, Gutjahr & Brannath considered the bias averaged across estimands ^{47}. The former method is stricter and arguably more appropriate since, having selected an estimand, the observer is not interested in the other case^{48}.
Now consider the following form of conditional performance: observations are simulated from , with , . For each repetition, 95% confidence intervals for are constructed using the distribution. The process is repeated times, and we study the coverage, 1) for all repetitions, and 2) according to tertiles of the model SE. The results, table 7, show that coverage is below 95% for the lowest third of standard errors, above 95% for the highest third, and slightly above for the middle third. Poor conditional performance in this sense should not cause concern. Further, it is unhelpful for an analyst face with a dataset: one would not in practice know in which tertile of possibly model SEs a particular model SE lies.
Approach  analysed  Coverage (Monte Carlo SE) 

All observations  30,000  95.0% (0.1%) 
Conditional: ModSE in highest third  10,000  98.0% (0.1%) 
Conditional: ModSE in middle third  10,000  95.5% (0.2%) 
Conditional: ModSE in lowest third  10,000  91.5% (0.3%) 
Estimating performance conditional on true (rather than sampleestimated) parameters that vary across datagenerating mechanisms is where methods should be expected to provide good performance, and we do not recommend averaging over these, as is done informally in ^{49}.
We have described the most commonly reported and generally applicable performance measures, particularly when a simulation study targets an estimand. There are others that are sometimes used (such as the proportion of times the correct dose is selected by dose–finding designs) and others that we have not yet thought of.
5.3 Sample size for simulation studies
In choosing , the central issue is Monte Carlo error: key performance measures need to be estimated to an acceptable degree of precision.
The values of reported in our review are shown in figure 7, panel d. Four simulation studies did not report . Common sample sizes are and , as previously reported by Burton et al.^{7}. Of the 87 studies reporting , four provided any justification of the choice. These were:

‘To evaluate the asymptotic biases’ ^{50}

‘errors can be reduced by the large number of simulation replicates’ ^{51}

‘number was determined mainly to keep computing time within a reasonable limit. A reviewer pointed out that, as an additional justification, by using 10,000 metaanalyses the standard error of an estimated percentage (e.g., for the empirical coverage) is guaranteed to be smaller than 0.5.’ ^{25}

Most positively, Marozzi gave an explicit derivation of Monte Carlo SE ^{52}
Clearly this is a suboptimal state of affairs. For some more concrete justifications, see the worked illustrative example in section 7, Keogh and Morris^{53}, or Morris et al.^{54}
There exist situations where only one repetition is necessary, particularly when investigating largesample bias; see for example ^{20}. Here, the aim was to demonstrate largesample bias of an estimator and the single estimate of was many model standard errors from its true value.
Where the key performance measure is coverage, can be defined as follows. The Monte Carlo SE of coverage is given in section 5.2. Plugging in the expected coverage (for example 95%) and rearranging, we get
(1) 
with a similar expression if is to be determined based on power. For example, if the SE required for a coverage of 95% is 0.5%,
Coverage is estimated from binary summaries of the repetitions, so the worstcase SE occurs when coverage is 50%. In this scenario, to keep the required Monte Carlo SE below 0.5% , (1) says that repetitions will achieve this Monte Carlo SE.
A convenient feature of simulation studies is that the Monte Carlo SE can be assessed and increased much more cheaply than with other empirical studies. The cost is computational time. To continue, rather than start again, it is important to have a record the end state of the randomnumber generator (which can be used as the seed if further repetitions are added) or to use a different stream.
5.4 Remarks on analysis
We have emphasised repeatedly that simulation studies are empirical experiments. In many biomedical experiments, ‘controls’ are used as a benchmark and the estimated effects of other conditions are estimated as a contrast vs. control. However, simulation studies often benefit from having a known ‘truth’, meaning that the contrast vs. a control is not often of interest (hence the term ‘comparator’ in section 3.4). That is, bias need not be estimated as the difference between for method A and for the comparator; rather the bias for a method stands alone, being computed against , which is the control. There are benchmarks for other performance measures as well, such as coverage (the nominal %) and precision (the Cramér–Rao lower bound^{55, 56}).
In some cases, the true value of is unknown: it may not appear in the datagenerating mechanism. If performance measures involving are not of interest, this poses no problem. Otherwise, one solution is to estimate by simulation. Williamson et al. simulated data from a logistic model, but was not the conditional odds ratio used to generate data; was the marginal odds ratio, risk ratio and risk difference^{57}. They thus estimated for each of these estimands from a large simulated dataset.
In our review of Volume 34, of 74 studies that included some , nine estimated it, 57 used a known and 8 were unclear. Estimating is in our view a sensible and pragmatic approach. However, such an approach must simulate a dataset so large that it is fair to assume that the variance of ‘’ is negligible, particularly compared to that of , and ensure that the states of the randomnumber generators used in the simulation study do not overlap with the states used for the purpose of estimating . In practice, the way to do this is either to use a separate stream for the random numbers, or to run the estimation simulation immediately before the main run.
The estimation of performance measures in section 5.2 is described for estimating performance once per datageneratingmechanism. This is most suited to simulation studies with few datagenerating mechanisms, but many simulation studies are considerably more complex. In such cases, it is natural to fit a model (termed ‘metamodel’ by Skrondal^{24}) for performance in terms of the datagenerating mechanisms.
An advantage of modeling performance across datagenerating mechanisms is that we are able to match repetitions. This reduces the Monte Carlo SE for the comparison of methods. For example, suppose that we have two datagenerating mechanisms with equal to 1 and 2. We could use the same starting seed so that results are correlated within .
6 Reporting
6.1 The ‘methods’ section
The rationale for the ordering of elements in ademp is that this is usually the appropriate order to report them in a methods section. If the simulation study has been planned and written out before it is executed, the methods section is largely written. This is a particularly helpful ordering for other researchers who might wish to replicate it. Details should be included to allow reproduction as far as possible, such as the value of and how this was decided on, dependence among simulated datasets. Another important element to report is a justification of the chosen targets for particular applied contexts.
6.2 Presentation of results
Some simulation studies can be very small, for example exploring one or two performance measures under a single datagenerating mechanism. These can be reported in text (as in He et al. ^{58}). In other cases, there are enough results that it becomes necessary to report them in tabular or graphical form. For any tabulation or plot of results, there are four potential dimensions: data generating mechanisms, methods, estimands and performance measures. This section provides some considerations for presenting these results.
In tabular displays, it is common to divide rows according to datagenerating mechanisms and methods as columns (as in Chen, et al.^{59}), though if there are more methods than datagenerating mechanisms it is better to swap these (as in Hsu, Taylor and Hu ^{60}). Performance measures and estimands may vary across columns or across rows depending on what makes the table easier to digest (see for example Alonso et al. ^{61}).
There are two key considerations in the design of tables. The first is how to place the important comparisons sidebyside. The most important comparisons will typically be of methods, so bias (for example) for different methods should be arranged in adjacent rows or columns.
The second consideration regards presentation of Monte Carlo SEs, and this tends to confound the first. By presenting them next to performance results, for example in parentheses, the table becomes cluttered and hard to digest, obscuring interesting comparisons. For this reason, some authors will report the maximum Monte Carle SE in the caption of tables (for example ^{62, 41}). Results should not be presented to a greater accuracy than is justified by the Monte Carlo SE (e.g. 3dp for coverage). In our review of Volume 34, seven articles presented Monte Carlo SEs for estimated performance: three in the text, two in a table, one in a graph, and one in a float caption.
The primary advantage of graphical displays of performance is that it is easier to quickly spot patterns, particularly over dimensions that are not compared sidebyside. A second advantage is that it becomes possible to present raw data estimates (for example the ) as well as performance results summarising them (see for example figure 3 of ^{63}). In our experience, these plots are popular and intuitive ways to summarise the and model SE’s. Another example of a plot of estimates data is a histogram given in Kahan^{18} (this was particularly important as , but almost no was close to ). Even if plots of estimates are not planned to be included in publications we urge their use in exploration of simulation results. The main disadvantages of graphical displays of results is that plots can be less spaceefficient than tables, it is not possible to read the exact numbers, and separate plots will frequently be required for different performance measures.
Compared with tables, it is easier to accommodate display of Monte Carlo SE’s directly in plots of performance results, and this should be done, for example as 95% confidence intervals. The considerations about design of plots to facilitate the most relevant comparisons apply as with tables. Methods often have names that are hard to arrange side by side in a legible manner. It is preferable to arrange methods in horizontal rows and performance results moving horizontally.
As noted previously, full factorial designs can pose problems for presentation of results. One option for presentation is to present data assuming no interaction unless one is obviously present. An alternative approach taken by Rücker and Schwarzer is to present all results of a full factorial simulation study with datagenerating mechanisms, and comparison of six methods^{64}. Their proposal is a ‘nestedloop plot’, which loops through nested factors – usually datagenerating mechanisms – for an estimand, and plots results for different methods on top of each other^{64}. This is a useful graphic but will not suit all designs (and makes depiction of Monte Carlo SE difficult).
There is no one correct way to present results, but we encourage careful thought to facilitate readability and clarity, considering the comparisons that need to be made by readers.
7 Worked illustrative example
To make clear the ideas described in this article and demonstrate how they may be put into practice, we conduct one example simulation study. We hope that the aims and methods are simple enough to be understood by all readers. Further, the files required to run the simulation in Stata are available at https://github.com/tpmorris/simtutorial.
7.1 Design of example
The example is a comparison of three different methods for estimating the hazard ratio in a randomised trial with a survival outcome.
Consider the proportional hazards model, where we have the hazard rate (event rate at time conditional on survival until at least time ) for the th patient
(2) 
with the baseline hazard function, a binary treatment indicator variable coded 0 for control and 1 for the research arm, and the log hazard ratio for the effect of treatment. There are various ways to estimate this hazard ratio, with common approaches being the Cox model, and standard parametric survival models, such as the exponential and Weibull. The parametric approaches make assumptions about the form of the baseline hazard function whereas the Cox model makes no such assumption. We now describe a simulation study to evaluate the three methods in this simple setting.
Aims: To evaluate the impacts of 1) misspecifying the baseline hazard function on the estimate of the treatment effect ; 2) of fitting too complex a model when an exponential is sufficient; and 3) of avoiding the issue by using a semiparametric model.
Datagenerating mechanisms: We consider two datagenerating mechanisms. For both, data are simulated on patients, representing a possible phase III trial with survival outcome. Let be an indicator denoting assignment to treatment, where assignment is generated using – simple randomisation with an equal allocation ratio. We simulate survival times from the model in equation 2, assuming that , corresponding to a hazard ratio of 0.607 (3dp). We let . The two datagenerating mechanisms differ only in the values of :
1:
both an exponential and a Weibull model
2:
a Weibull but not an exponential model
A plot of the hazard rate for the two datagenerating mechanisms is given in figure 2.
Data are simulated using Stata 15 using the 64bit Mersenne twister for random number generation. The input seed is ‘72789’.
Methods: Each simulated dataset is analysed in three ways, using:

An exponential proportionalhazards model

A Weibull proportionalhazards model

A Cox proportionalhazards model
Note that the exponential model is correctly specified for the first datagenerating mechanism but misspecified for the second; the Weibull model is correctly specified for both mechanisms; and the Cox model does not make any assumption about the baseline hazard so is not misspecified for either mechanism.
Estimands: Our estimand is the loghazard ratio for vs. , which would represent the treatment effect in a randomised trial.
Performance measures: We will assess convergence, bias, coverage, empirical and modelbased standard errors for .
Bias is our key performance measure of interest, and we will assume that , meaning that . (A conservative estimate based on an initial small simulation run.) We decide that we require Monte Carlo SE of bias to be lower than 0.005. Given that
this implies that we need 1,600 repetitions. If coverage of all methods is 95%, the implication of using is
With 50% coverage, the Monte Carlo SE is maximised at 1.25. We consider this satisfactory and so proceed with (to be revised if, for example, ).
7.2 Exploration and visualisation of results
The first result to note is that there were no missing or , and ‘separation’ was not an issue.
We first explore the raw results. Figure 3 plots the estimates and for the two datagenerating mechanisms and three methods, with means displayed as yellow pipes. The left panels plot . It is clear that, when , the mean and variance of is very similar for the three methods. The mean is close to the true value of for all methods. When in truth , the empirical SE is slightly higher for all methods (because there are fewer events among the 500 observations under this datagenerating mechanism). The exponential proportionalhazards model is now misspecified and we observe a shift of the mean of towards the null, indicating some bias. The right panels of figure 3 plot the estimated standard errors . These are smaller for the upper panel () than the lower panel () but there is little to choose between the methods.
We next compare these estimates by plotting for each method vs. every other method, and the same for . The data pairs come from the same repetition (i.e. they are estimated in the same simulated dataset) and are compared to the line of equality. This is done in figure 4, for the second datagenerating mechanism only (), which is interesting because the exponential model is misspecified. We can see that the estimates of both and are highly correlated across all methods. The upper triangle of plots in figure 4 shows that, while is almost identical for the Weibull and Cox models, it tends to be systematically closer to 0 for the exponential model. The estimates of show that again, the estimates are extremely similar for the Weibull and Cox models, and are very slightly larger for the exponential model.
Figure 5 is a new visualisation, the ‘zip plot’, which helps to understand coverage by viewing the confidence intervals directly (implemented in Stata; for implementations in R and Sas, see ^{65} and ^{66} respectively). For each datagenerating mechanism and method, the confidence intervals are fractionalcentileranked according to their significance against the null that . This ranking is used for the vertical axis and is plotted against the intervals themselves. Intervals for which a twosided test yields are coloured purple (towards the top), while the remainder are in blue (at the bottom). When a method has 95% coverage, the colour of the intervals switches at 95 on the vertical axis. Finally, the yellow horizontal lines are Monte Carlo 95% confidence intervals for per cent coverage. (As a general comment, note that the interesting area in many zip plots will be near to the top and so it will often be more informative to ‘zoom in’ on the action, as suggested in ^{66}.)
In figure 5, the upper panel again displays the results when and the lower panel when . Despite coverage being approximately 95% as advertised, there are more intervals to the right of than to the left, particularly for those that do not cover . This indicates that the model SEs must overestimate the empirical SE, because coverage is adequate despite bias. Figure 5 helps to make such a feature clear.
7.3 Analysis of example
The previous section demonstrated some exploratory analyses that may be of value. Next, we estimate performance for the measures of interest and present them in a table for which (we hope) the ademp structure is clear: different performance measures are stacked vertically; for each performance measure, the results for the two datagenerating mechanisms each occupy one row; results for different methods are arranged across three columns (with Monte Carlo SEs in parantheses at a smaller point size than the estimate); there is only one estimand.
Performance  Datagenerating  Method  

measure  mechanism  Exponential  Weibull  Cox  
Bias  (0.005)  (0.005)  (0.005)  
0.049  (0.003)  0.005  (0.004)  0.006  (0.004)  
Coverage  95.4%  (0.5)  95.4%  (0.5)  95.4%  (0.5)  
96.0%  (0.5)  95.6%  (0.5)  95.8%  (0.5)  
Biaseliminated  95.6%  (0.5)  95.3%  (0.5)  95.4%  (0.5)  
coverage  97.2%  (0.4)  95.7%  (0.5)  96.1%  (0.5)  
Empirical SE  0.209  (0.004)  0.209  (0.004)  0.209  (0.004)  
0.138  (0.002)  0.152  (0.003)  0.151  (0.003)  
Relative precision  0.2%  (0.1)  0  (–)  0.3%  (0.1)  
gain vs. Weibull  20.5%  (0.4)  0  (–)  0.6%  (0.2)  
Model SE  0.208  (<0.001)  0.208  (<0.001)  0.208  (<0.001)  
0.154  (<0.001)  0.154  (<0.001)  0.154  (<0.001)  
Relative error in  %  (1.8)  %  (1.8)  %  (1.8)  
Model SE  11.5%  (2.0)  1.7%  (1.8)  2.1%  (1.8) 
Also given in figure 6 is an alternative graphical presentation of estimated performance called a lollipop plot. The admep structure is slightly different to the table but again clear: different performance measures are stacked vertically; for each performance measure the results for the three methods now occupy one row each; results for different methods are arranged across the two columns. Monte Carlo 95% confidence intervals are now represented via parentheses (a visual cue due to the usual presentation of intervals as two numbers within parentheses).
The results confirm more formally some of the features we saw in our exploration of the estimates data. The interesting features concern the exponential model when , since the Weibull and Cox models behave well in all cases. We see that the exponential model suffers some bias towards the null, which is approximately 10% of the true value. This is nonnegligible. Next, we see that coverage is still over the nominal 95%, which is surprising in the presence of bias. The empirical SE is the same for all models when and lowest for the exponential model when , while the Weibull and Cox models are very similar; recall however that in the presence of different biases, the empirical SE is not comparable across methods. For relative precision (vs. the Weibull model) a very similar pattern is seen as for empirical SE. The Model SE is the same for all methods and datagenerating mechanisms. This explains why the exponential model has acceptable coverage when : the bias is cancelled out by the fact that the model SE is overestimated. This is confirmed by the relative error in Model SE.
Looking at table 8, the Monte Carlo SEs of performance estimates are all acceptable and so we would be happy to draw conclusions about the methods based on the 1,600 repetitions.
7.4 Conclusions of example
When an exponential model is misspecified the hazard ratio can be biased. Probably not by much. Further research is needed (using different values of and of ).
More seriously, note that the datagenerating mechanisms we used to not cover a range of scenarios. For example, we might have explored varying , and over a range of values to explore when issues are present.
8 Concluding remarks
Simulation studies are an invaluable tool for research into statistical methods, evidenced by the large proportion of Volume 34 Statistics in Medicine articles whose conclusions relied in part on simulation studies. Because methods promoted may be used in medical research, transparent reporting of the design and execution of simulation studies is critical.
While simulation studies are widely used, they tend to be poorly reported by those who publish their results.
There are many areas to be improved in the reporting of simulation studies. Our view is that the two main shortcomings are (i) lack of clarity over the design, which ademp aims to deal with, and (ii) failure to report estimates of MonteCarlo uncertainty.
We have described – and advocate – a structured approach to the planning of simulation studies that involves identifying aims, datagenerating mechanisms, methods, estimands and performance measures. All of these and the rationale for decisions should be included in reporting. For an excellent example of a clearly described design, see Austin and Stuart^{67}. Reports of simulation studies are now beginning to explicitly use the ademp structure; see Thompson et al.^{68}, Sayers et al.^{69}, Morris et al.^{54} and Keogh & Morris^{53}.
We have given formulas for computing the Monte Carlo standard error for the most common performance measures, and made some suggestions about reporting. Note that the Stata package simsum^{12} and R package rsimsum^{65} automate this process for commonly used performance measures. See Boos and Osborne for more general assessment of Monte Carlo SEs for complex performance measures^{70}.
8.1 Future directions
Three areas that we regard as of increasing future importance are simulation protocols, release of code, and consortia of authors. We discuss these as a step towards resolving occurrences of simulation studies with contradictory results.
A charitable view of such contradictory results, which we tend to hold, is that methods are developed by researchers who concerned with handling the specific problems they have seen in practice. Given such a background, they are running the relevant simulation studies (that may not be relevant to others). A less charitable possibility might be selective reporting of only the most favourable (or unfavourable) configurations of datagenerating mechanisms, running the simulations many times under different seeds and selecting the most favourable^{3}, and more (left to the reader’s imagination).
A starting point to addressing this issue is to write detailed simulation protocols before writing code. This would ideally protect against authors choosing to report favourable results and force one to be clear about ‘ademp’ in advance. Of course, the weak point is that simulation studies do not to need approval before ‘data collection’, so protocols could be written afterthefact and this cannot be clear even with published protocols. The counterargument is that protocols must justify the rationale for choices: as well as describing what is planned, there is a burden to explain why.
Due to the prejudices introduced by experiences of data, it is constructive for authors who have produced contradictory simulation results to work together on ‘latephase’ simulation studies (using an analogy from clinical trials in drug development). This allows robust discussion of the design and exploration of disparities among previous work. One exemplar of such an approach is in methods for handling incomplete data where the analysis has a multilevel structure. Three groups of researchers had developed methods and worked together on understanding how the methods differ and on simulation studies to evaluate their performance, resulting in the paper by Audigier et al^{71}. This approach is in our view more satisfactory than a group of researchers executing a large, latephase simulation study without the input of authors of previous work, though this strategy is sometimes adopted^{72}.
No simulation study is definitive and new methods or refinements of methods are inevitable. For researchers wishing to replicate or extend the results of earlier simulation studies, the design of earlier work must have been written out fully and unambiguously. This can be a difficult task, and to ensure this, authors should release simulation code publicly (a policy that is now encouraged by some journals, most notably Biometrical Journal, and required by others). The happy corollary is that code may be checked more thoroughly by its authors if it is subject to external scrutiny. There is generally no excuse for withholding code. One caveat is for resampling studies, where permissions to release the original data may be lacking (note that code for running the simulation can still be made available even if it cannot be run on the same data).
8.2 Final remark
Simulation studies are a powerful tool. However, it is important to be aware that, because a simulation study took hard work and thought, we are liable to believe it tells us more than it truly does. To quote Patrick Royston, ‘Simulation studies reveal points of light on a landscape, but can not illuminate the entire landscape.’ We can hope and plan to illuminate important points and to build up a picture of the landscape, particularly where terrain may be particularly rocky or particularly fertile.
We hope that the guidance in this tutorial will improve researchers’ understanding, planning, execution and future reporting of simulation studies.
Conflicts of interest
All authors declare that they developed, and regularly deliver, a short course on simulation studies, from which this work grew, and from which they have benefited financially.
Acknowledgements
Tim Morris and Ian White are supported by the Medical Research Council (grant numbers MC_UU_12023/21 and MC_UU_12023/29). Michael Crowther is partly supported by an MRC New Investigator Research Grant (grant number MR/P015433/1).
For thoughtprovoking discussions and input to this work, we thank Alessandro Gasparini, Tra Pham, Brennan Kahan, Ruth Keogh, Clémence Leyrat, Kristian Brock, Christian Hennig and Patrick Royston. We also thank the many participants who have attended our courses and whose questions and feedback provided the motivation for this article.
The first submission of this article was released as a preprint, and we are grateful to the people whose comments have contributed to this version. In particular, those who made important substantive comments: Maarten van Smeden, Rolf Groenwold, Bas Penningde Vries, Kim Luijken, Martina McMenamin, Paula Dhiman, Leane McCabe, Alessandro Gasparini, David Mannheim, Daniel Oberski and Rick Wicklin.
References
 1 Feiveson A. H.. Power by simulation. The Stata Journal. 2002;2(2):107–124.
 2 Rubin D. B.. Bayesianly justifiable and relevant frequency calculations for the applies statistician. The Annals of Statistics. 1984;12(4):1151–1172.
 3 Grieve A. P.. Idle thoughts of a ‘wellcalibrated’ Bayesian in clinical drug development. Pharmaceutical Statistics. 2016;15(2):96–108.
 4 Hoaglin D. C., Andrews D. F.. The reporting of computationbased results in statistics. The American Statistician. 1975;29(3):122–126.
 5 Hauck W. W., Anderson S.. A survey regarding the reporting of simulation studies. The American Statistician. 1984;38(3):214–216.
 6 Ripley B. D.. Stochastic Simulation. New York: Wiley; 1987.
 7 Burton A., Altman D. G., Royston P., Holder R. L.. The design of simulation studies in medical statistics. Statistics in Medicine. 2006;25(24):4279–4292.
 8 Koehler E., Brown E., Haneuse . On the assessment of Monte Carlo error in simulationbased statistical analyses. The American Statistician. 2009;63:155–162.
 9 Morgan B. J. T.. Elements of simulation. Boca Raton: Chapman & Hall/CRC; 1995.
 10 Chang M.. Monte Carlo simulation for the pharmaceutical industry : concepts, algorithms, and case studies. CRC Press; 2011.
 11 DíazEmparanza I.. Is a small Monte Carlo analysis a good analysis?. Statistical Papers. 2002;43(4):567–577.
 12 White I. R.. simsum: Analyses of simulation studies including Monte Carlo error. Stata Journal. 2010;10(3):369–385.
 13 Smith M. K., Marshall A.. Importance of protocols for simulation studies in clinical drug development. Statistical Methods in Medical Research. 2011;20(6):613–622.
 14 Crowther M. J., Lambert P. C.. Simulating biologically plausible complex survival data. Statistics in Medicine. 2013;32(23):4118–4134.
 15 Holford N. H. G., Hale M., Ko H. C., et al. Simulation in drug development: good practices. 1999.
 16 O’Kelly M., Anisimov V., Campbell C., Hamilton S.. Proposed best practice for projects that involve modelling and simulation. Pharmaceutical Statistics. 2017;16(2):107–113.
 17 Haramoto H., Matsumoto M., Nishimura T., Panneton F., L’Ecuyer P.. Efficient jump ahead for F2linear random number generators. INFORMS Journal on Computing. 2008;20(3):385–390.
 18 Kahan B. C.. Bias in randomised factorial trials. Statistics in Medicine. 2013;32(26):4540–4549.
 19 Kenward M. G., Roger J. H.. Small sample inference for fixed effects from restricted maximum likelihood. Biometrics. 1997;53:983–997.
 20 White I. R.. Letter to the Editor: Survival analysis of randomized clinical trials adjusted for patients who switch treatments by M. G. Law and J. M. Kaldor, Statistics in Medicine, 15, 2069–2076 (1996). Statistics in Medicine. 1997;16(22):2619–2620.
 21 White I. R., Thompson S. G.. Adjusting for partially missing baseline measurements in randomised trials. Statistics in Medicine. 2005;24(7):993–1007.
 22 Hughes R. A., Sterne J. A. C., Tilling K.. Comparison of imputation variance estimators. Statistical Methods in Medical Research. 2014;:n/a+.
 23 Morris T. P., White I. R., Royston P.. Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology. 2014;14(1):75+.
 24 Skrondal A.. Design and Analysis of Monte Carlo Experiments: Attacking the Conventional Wisdom. Multivariate Behavioral Research. 2000;35(2):137–167.
 25 Kuss O.. Statistical methods for metaanalyses including information from studies without any events—add nothing to nothing and succeed nevertheless. Statistics in Medicine. 2015;34(7):1097–1116.

26
Chaurasia A., Harel O.. Partial Ftests with multiply imputed data in the linear regression framework via coefficient of determination.
Statistics in Medicine. 2015;34(3):432–443.  27 Wu C., Shi X., Cui Y., Ma S.. A penalized robust semiparametric approach for gene–environment interactions. Statistics in Medicine. 2015;34(30):4016–4030.
 28 Ferrante L., Skrami E., Gesuita R., Cameriere R.. Bayesian calibration for forensic age estimation. Statistics in Medicine. 2015;34(10):1779–1790.
 29 Zhang Z., Wang C., Troendle J. F.. Optimizing the order of hypotheses in serial testing of multiple endpoints in clinical trials. Statistics in Medicine. 2015;34(9):1467–1482.
 30 Campbell H., Dean C. B.. The consequences of proportional hazards based model selection. Statistics in Medicine. 2014;:1042–1056.
 31 Robins J. M., Wang N.. Inference for imputation estimators. Biometrika. 2000;87(1):113–124.
 32 Reiter J. P.. Multiple imputation when records used for imputation are not used or disseminated for analysis. Biometrika. 2008;95(4):933–946.
 33 Hauck W. W., Anderson S., Marcus S. M.. Should we adjust for covariates in nonlinear regression analyses of randomized trials?. Controlled Clinical Trials. 1998;19:249–256.
 34 Zhang Z.. Estimating a marginal causal odds ratio subject to confounding. Communications in Statistics  Theory and Methods. 2008;38(3):309–321.

35
Smeden M., Groot J. A. H., Moons K. G. M., et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis.
BMC Medical Research Methodology. 2016;16(1).  36 Crowther M. J., Look M. P., Riley R. D.. Multilevel mixed effects parametric survival models using adaptive Gauss–Hermite quadrature with application to recurrent events and individual participant data metaanalysis. Statistics in Medicine. 2014;33(22):3844–3858.
 37 Royston P., Barthel F. MS, Parmar M. K. B., ChoodariOskooei B., Isham V.. Designs for clinical trials with timetoevent outcomes based on stopping guidelines for lack of benefit. Trials. 2011;12(1):81+.
 38 Bartlett J. W.. Combining bootstrapping with multiple imputation. 2016.
 39 Rubin D. B.. Inference and missing data. Biometrika. 1976;63:581–592.
 40 White Ian R., Carlin John B.. Bias and efficiency of multiple imputation compared with completecase analysis for missing covariate values.. Statistics in medicine. 2010;29(28):2920–2931.
 41 White I. R., Royston P.. Imputing missing covariate values for the Cox model.. Statistics in Medicine. 2009;28(15):1982–1998.
 42 Neyman J.. On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, Series A. 1934;97(4):558–625.
 43 Meng X. L.. Multipleimputation inferences with uncongenial sources of input. Statistical Science. 1994;9:538–558.
 44 Rubin D. B.. Multiple imputation after 18+ years. Journal of the American Statistical Association. 1996;91(434):473–489.
 45 Morris T. P.. Rank minimization with a twostep analysis should not replace randomization in clinical trials. Journal of Clinical Epidemiology. 2012;65(7):810–811.
 46 Kimani P. K., Todd S., Stallard N.. Estimation after subpopulation selection in adaptive seamless trials. Statistics in Medicine. 2015;34(18):2581–2601.
 47 Carreras M., Gutjahr G., Brannath W.. Adaptive seamless designs with interim treatment selection: a case study in oncology. Statistics in Medicine. 2015;34(8):1317–1333.
 48 Efron B., Hastie T.. Computer age statistical inference. Cambridge University Press; 2016.
 49 Gutman R., Rubin D. B.. Robust estimation of causal effects of binary treatments in unconfounded studies with dichotomous outcomes.. Statistics in medicine. 2013;32(11):1795–1814.
 50 Taguri M., Chiba Y.. A principal stratification approach for evaluating natural direct and indirect effects in the presence of treatmentinduced intermediate confounding. Statistics in Medicine. 2015;34(1):131–144.
 51 Li P., Redden D. T.. Small sample performance of biascorrected sandwich estimators for clusterrandomized trials with binary outcomes. Statistics in Medicine. 2015;34(2):281–296.
 52 Marozzi M.. Multivariate multidistance tests for highdimensional low sample size casecontrol studies. Statistics in Medicine. 2015;34(9):1511–1526.
 53 Keogh R. H., Morris T. P.. Multiple imputation in Cox regression when there are timevarying effects of covariates. Statistics in Medicine. 2018;.
 54 Morris T. P., Fisher D. J., Kenward M. G., Carpenter J. R.. Metaanalysis of Gaussian individual patient data: two stage or not two stage?. Statistics in Medicine. inpress 2017;.
 55 Cramér H. C.. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press; 1946.
 56 Rao C. R.. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society. 1945;37:81–91.
 57 Williamson E. J., Forbes A., White I. R.. Variance reduction in randomised trials by inverse probability weighting using the propensity score. Statistics in Medicine. 2014;33(5):721–737.
 58 He X., Whitmore G. A., Loo G. Y., Hochberg M. C., Lee ML T.. A model for time to fracture with a shock stream superimposed on progressive degradation: the Study of Osteoporotic Fractures. Statistics in Medicine. 2015;34(4):652–663.
 59 Chen Y., Hong C., Riley R. D.. An alternative pseudolikelihood method for multivariate randomeffects metaanalysis. Statistics in Medicine. 2015;34(3):361–380.
 60 Hsu C. H., Taylor J. M. G., Hu C.. Analysis of accelerated failure time data with dependent censoring using auxiliary variables via nonparametric multiple imputation. Statistics in Medicine. 2015;34(19):2768–2780.
 61 Alonso A., Milanzi E., Molenberghs G., Buyck C., Bijnens L.. A new modeling approach for quantifying expert opinion in the drug discovery process. Statistics in Medicine. 2015;34(9):1590–1604.
 62 Seaman S. R., Bartlett J. W., White I. R.. Multiple imputation of missing covariates with nonlinear effects and interactions: an evaluation of statistical methods.. BMC Medical Research Methodology. 2012;12(1):46+.
 63 Lambert P. C., Dickman P. W., Rutherford M. J.. Comparison of different approaches to estimating age standardized net survival.. BMC Medical Research Methodology. 2015;15(1):64+.
 64 Rücker G., Schwarzer G.. Presenting simulation results in a nested loop plot. BMC Medical Research Methodology. 2014;14(1):129+.
 65 Gasparini A., White I. R.. rsimsum: Analysis of simulation studies including monte carlo error2018. R package version 0.3.1.
 66 Wicklin R.. A zipper plot for visualizing coverage probability in simulation studies. 2018.
 67 Austin P. C., Stuart E. A.. Optimal full matching for survival outcomes: a method that merits more widespread use. Statistics in Medicine. 2015;34(30):3949–3967.
 68 Thompson J. A., Fielding K. L., Davey C., Aiken A. M., Hargreaves J. R., Hayes R. J.. Bias and inference from misspecified mixedeffect models in stepped wedge trial analysis. Statistics in Medicine. 2017;36(23):3670–3682.
 69 Sayers A., Crowther M. J., Judge A., Whitehouse M. R., Blom A. W.. Determining the sample size required to establish whether a medical device is noninferior to an external benchmark. BMJ Open. 2017;7(8):e015397+.

70
Boos D. D., Osborne J. A.. Assessing Variability of Complex Descriptive Statistics in Monte Carlo Studies Using Resampling Methods.
International Statistical Review. 2015;83(2):228–238. 
71
Audigier V., White I. R., Jolani S., et al.
Multiple imputation for multilevel data with continuous and binary variables.
2018.  72 Langan D., Higgins J. P. T., Simmonds M.. Comparative performance of heterogeneity variance estimators in metaanalysis: a review of simulation studies. Research Synthesis Methods. 2017;8(2):181–198.
Appendix A Review: summary of information around datageneration.
Figure 7 gives summary information about how data were generated. Panel (a) shows that there was great variation in the total number of datagenerating mechanisms, with the majority of simulation studies using under 20, and the largest number being billion. Panel (b) shows that simulation studies tended to vary few factors (with one exception). For the simulation studies varying more than one factor, the most common way to do this was in a fully factorial manner (panel (c)). However, some studies varied the factors oneatatime and others mixed the two together. Unfortunately, not all simulation studies noted the number of repetitions (panel (d)). The most common choices of were, in descending order: , and .

Figure 8a shows the number of estimands evaluated by the simulation studies included in the review. In general, there were few, with a single estimand the most common. Figure 8 panel (b) gives the number of methods evaluated by the simulation studies included in the review (right panel). The majority evaluated few methods (with four the most common number). This suggests that simulation studies provide a proofofconcept, or that the methods are designed for new problems for which there are few alternatives available.
Table 9 lists the software packages mentioned and the number of mentions in simulation studies included in the review. This was based on a lenient judgement: for example, many articles mentioned a software package in which a method was implemented but did not mention what software was used to run the simulation study.
Software  Freq. 

None mentioned  38 
C  1 
Jags  1 
Matlab  1 
R  41 
Sas  17 
Satscan  1 
Stata  4 
Statxact  1 
Winbugs  3 
Comments
There are no comments yet.