Pitfalls and Potentials in Simulation Studies

by Samuel Pawel, et al.
Universität Zürich

Comparative simulation studies are workhorse tools for benchmarking statistical methods, but if not performed transparently they may lead to overoptimistic or misleading conclusions. The current publication requirements adopted by statistics journals do not prevent questionable research practices such as selective reporting. The past years have witnessed numerous suggestions and initiatives to improve on these issues but little progress can be seen to date. In this paper we discuss common questionable research practices which undermine the validity of findings from comparative simulation studies. To illustrate our point, we invent a novel prediction method with no expected performance gain and benchmark it in a pre-registered comparative simulation study. We show how easy it is to make the method appear superior over well-established competitor methods if no protocol is in place and various questionable research practices are employed. Finally, we provide researchers, reviewers, and other academic stakeholders with concrete suggestions for improving the methodological quality of comparative simulation studies, most importantly the need for pre-registered simulation protocols.






1 Introduction

“The first principle is that you must not fool yourself and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists.”
(Richard P. Feynman, 1974)


Simulation studies are to a statistician what experiments are to a scientist (Hoaglin1975). They have become a ubiquitous tool for the evaluation of statistical methods, mainly because simulation can be used to solve problems which cannot be solved using purely theoretical arguments. In this paper we focus on simulation studies where the objective is to compare the performance of two or more statistical methods (comparative simulation studies). Such studies are needed to ensure that previously proposed methods work as expected under various conditions, as well as to identify conditions under which they fail. Moreover, evidence from comparative simulation studies is often the only guidance available to data analysts for choosing from the plethora of available methods (Boulesteix2013; Boulesteix2017b). Proper design and execution of comparative simulation studies is therefore important, and the results of flawed studies can lead to serious damage due to misinformed decisions.

Just like empirical experiments in other fields of science, comparative simulation studies require many decisions to be made, for instance: How will the data be generated? How often will a simulation condition be repeated? Which statistical methods will be compared and how are their parameters specified? How will the performance of the methods be evaluated? The degree of flexibility, however, is much higher for simulation studies than for real experiments as they can often be rapidly repeated under different conditions at practically no additional cost. This is why numerous guidelines and best practices for design, execution, and reporting of simulation studies have been proposed (Hoaglin1975; Holford2000; Burton2006; Smith2010; OKelly2016; Monks2018; Elofsson2019; Morris2019; Boulesteix2020B). We recommend Morris2019 for an introduction to state-of-the-art simulation study methodology.

Despite wide availability of such guidelines, statistics articles often provide too little detail about the reported simulation studies to enable quality assessment and replication. Journal policies sometimes require the computer code to reproduce the results, but they rarely require or promote sound simulation methodology (e.g., the preparation of a simulation protocol). This leaves researchers with considerable flexibility in how they conduct and present simulation studies. As a consequence, readers of statistics papers can rarely be sure of the quality of evidence that a simulation study provides.

Unfortunately, there are many questionable research practices (QRPs) which may undermine the validity of comparative simulation studies and which can easily go undetected under current standards. There is often a fine line between QRPs and legitimate research practices. For instance, there are good reasons to modify the data-generating process of a simulation study based on the observed results, e.g., if the initially considered data-generating process results in many missing or non-convergent simulations. However, it is then important that such post hoc modifications are transparently reported. These practices only become questionable when they serve to confirm the hopes and beliefs of researchers regarding a particular method. Consequently, the results and conclusions of the study will be biased in favor of this method (Niessl2021).

It is imperative to note that researchers most often do not engage in QRPs with the intent to deceive, but rather due to their subconscious biases, expectations, or negligence (Simmons2011). External pressures, e.g., to publish novel and superior methods (Boulesteix2015) or to concisely report large amounts of simulation results, may also lead honest researchers to (unknowingly) employ QRPs. As we will argue, it is not only up to the researchers but also other academic stakeholders to improve on these issues.

The aim of this paper is to raise awareness about the issue of QRPs in comparative simulation studies, and to highlight the need for the adoption of higher standards. To this end, we provide an illustrative list of QRPs related to comparative simulation studies (Section 2). With an actual simulation study, we then show how easy it is to present a novel, made-up method as an improvement over others if QRPs are employed and a priori simulation plans remain undisclosed (Section 3). The main inspiration for this work is drawn from similar illustrative studies which have been conducted by Niessl2021 and Jelizarow2010 in the context of benchmarking studies with real data sets and by Simmons2011 in the context of p-hacking in psychological research. In Section 4, we then provide concrete suggestions for researchers, reviewers, editors, and funding bodies to alleviate the issues of QRPs and improve the methodological quality of comparative simulation studies. Section 5 closes with a discussion of the results and concluding remarks.

2 Questionable research practices in comparative simulation studies

There are various QRPs which threaten the validity of comparative simulation studies (see Table 1 for an overview). QRPs can be categorized with respect to the stage of research at which they can occur. Our classification is inspired by the classification of QRPs in experimental research from Wicherts2016. In the following, we describe QRPs from all phases of conducting a simulation study, namely, design, execution, and reporting.

Tag | Related | Type of QRP
D1 | R2 | Not/vaguely defining objectives of simulation study
D2 | E1 | Not/vaguely defining data-generating process
D3 | E2, E3 | Not/vaguely defining which methods will be compared and how their parameters are specified
D4 | E4 | Not/vaguely defining evaluation criteria
D5 | E6 | Not/vaguely defining how to handle ambiguous analytic situations (e.g., non-convergence of methods)
D6 | E5 | Not computing required number of simulations to achieve desired precision
E1 | D2 | Adapting data-generating process to achieve certain outcomes
E2 | D3, R1 | Adding/removing comparison methods based on outcome of simulations
E3 | D3 | Selective tuning of hyperparameters of certain methods
E4 | D4 | Choosing evaluation criteria based on outcome of simulations
E5 | D6 | Choosing number of simulations to achieve desired outcome
E6 | D5 | Including/excluding certain simulations to achieve desired outcome
E7 | E5 | Choosing random seed to achieve desired outcome
R1 | E2, D3 | Selective reporting of results from simulation conditions that lead to certain outcomes
R2 | D1 | Presenting exploratory simulation studies as confirmatory (HARKing)
R3 | | Failing to report Monte Carlo uncertainty
R4 | | Failing to assure computational reproducibility (e.g., not sharing code and details about computing environment)
R5 | | Failing to assure replicability (e.g., not reporting design and execution methodology)
Table 1: Types of questionable research practices (QRPs) in comparative simulation studies at different stages of the research process.

2.1 Design

The a priori specification of research hypotheses, study design, and analytic choices is what separates confirmatory from exploratory research (Tukey1980). Often, large comparative simulation studies follow the publication of a novel method (which is commonly proposed theoretically or together with a smaller, exploratory simulation study). To allow readers to distinguish between confirmatory and exploratory research, many non-methodological journals require pre-registration of study design and analysis protocols. For instance, pre-registration is common practice in randomized controlled clinical trials (DeAngelis2004) and increasingly adopted in experimental psychology (Nosek2018). Thus, if researchers plan to conduct confirmatory research, it is generally recommended to write a simulation protocol (Morris2019). In contrast, vaguely defined or undefined simulation study goals (D1) can be considered a QRP. A priori specification is arguably even more important in simulation studies than in non-methodological research because the multiplicity of design and analysis choices is far higher (Hoffmann2021). For instance, failing to define a priori the data-generating process (D2), the methods under investigation (D3), or the evaluation metrics (D4) leaves many researcher degrees of freedom open, which increases the chance of overoptimistic conclusions.

Another crucial part of rigorous design is simulation size calculation (see Section 5.3 in Morris2019 for an overview). The number of simulations should be planned in terms of the expected precision of the primary estimand. An arbitrarily chosen, often too small, number of simulations can be executed faster, but it yields noisier results. By failing to conduct a simulation size calculation (D6), researchers risk drawing the wrong conclusions in the worst case (if the number of simulations is too small), or wasting computational resources in the best case (if it is too large).
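A simulation size calculation of this kind can be sketched in a few lines. The sketch below is in Python for illustration only (the study itself was run in R) and assumes that the performance measure is a mean over simulations, so that its Monte Carlo standard error is approximately the standard deviation divided by the square root of the number of simulations. The function name `required_simulations` and the planning values are hypothetical and not taken from the protocol.

```python
import math


def required_simulations(sd_estimate: float, target_mcse: float) -> int:
    """Smallest n_sim such that sd_estimate / sqrt(n_sim) <= target_mcse.

    sd_estimate: guess of the per-simulation standard deviation of the
                 performance measure (e.g., from a small pilot run).
    target_mcse: Monte Carlo standard error the study should achieve.
    """
    if sd_estimate <= 0 or target_mcse <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(sd_estimate**2 / target_mcse**2)


# Hypothetical planning values, chosen only to keep the arithmetic simple
pilot_sd = 2.0
target = 0.5
print(required_simulations(pilot_sd, target))  # 16
```

Fixing the resulting number in the protocol before running the study removes the freedom to keep adding simulations until a desired result appears.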

2.2 Execution

While executing a simulation study, researchers can (often unknowingly) engage in various QRPs that increase the chance of showing superiority of their preferred method. For instance, if under the originally envisioned data-generating process a proposed method does not perform better than its competitors, the data-generating process may be adapted until conditions are found in which the proposed method appears superior (E1). For example, noise levels, the number of predictors, or the effect sizes could be changed. It is usually not difficult to find reasonable justification for such modifications and then present them as if they were hypothesized during the planning of the study (R2).

In case no favorable data-generating process can be found, competitor methods that are superior to the proposed method may be excluded from the comparison altogether (E2). Similarly, additional methods which perform worse under the (adapted) data-generating process may be included in the simulation study.

The methods under comparison may come with hyperparameters (e.g., regularization parameters in penalized regression models). In this case, hyperparameters of a favored method may be tuned until the method appears superior to others (E3). Similarly, the hyperparameters of competitor methods may be tuned less carefully, e.g., left at their default values.

The evaluation criteria for comparing the performance of the investigated methods can be changed to make a particular method look better than the others (E4). For instance, even though the original aim of the study may have been to compare predictive performance among methods using the Brier score, the evaluation criterion of the simulation study may be switched to area under the curve if the results suggest that the favored method performs better with respect to the latter metric. This QRP parallels the well-known “outcome-switching” problem in clinical trials (Altman2017). Similarly, the objective of the simulation study may also be changed depending on the outcome, e.g., an initial comparison of predictive performance may be changed to comparing estimation performance if the results suggest that the favored method performs better at estimation tasks rather than prediction (see also R2).
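To make the effect of switching evaluation criteria concrete, the following toy example (in Python, for illustration only) computes the Brier score and the area under the curve (AUC) for two sets of predicted probabilities. The data and the labels `method_a` and `method_b` are made up for this sketch and do not come from the simulation study; they are chosen so that the two criteria rank the methods in opposite order.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random case is ranked above a random non-case
    (ties count one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up outcomes and predictions (hypothetical, for illustration)
y = [0, 0, 1, 1]
method_a = [0.10, 0.60, 0.55, 0.90]  # fairly sharp, one ranking error
method_b = [0.40, 0.45, 0.50, 0.55]  # perfectly ranked, poorly separated

for name, prob in [("A", method_a), ("B", method_b)]:
    print(name, round(brier_score(y, prob), 4), auc(y, prob))
```

Here method A has the lower (better) Brier score, while method B has the higher (better) AUC, so the choice of criterion alone decides which method "wins". This is why the criterion needs to be fixed before the results are seen.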

If no a priori simulation size calculation was conducted, the simulation size may be changed until favorable results are obtained (E5). If the number of simulations is small (relative to the noise level), the risk that a positive finding is a false positive increases. Similarly, when too few simulations are conducted, the initializing seed for generating random numbers may be tuned until a seed is found for which a preferred method seems superior (E7).
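The interplay between a small simulation size and seed selection can be illustrated with a toy experiment. In the Python sketch below, two hypothetical methods have exactly the same expected loss, so any observed difference is pure Monte Carlo noise. The function names, the normal noise model, and all numeric settings are assumptions made for this illustration.

```python
import random
import statistics


def simulated_difference(seed, n_sim, noise_sd=1.0):
    """Mean loss difference (favored minus competitor) for one seed.

    Both methods draw their losses from the same distribution, so the
    true difference is zero by construction.
    """
    rng = random.Random(seed)
    diffs = [rng.gauss(0.0, noise_sd) - rng.gauss(0.0, noise_sd)
             for _ in range(n_sim)]
    return statistics.fmean(diffs)


def most_favorable_difference(candidate_seeds, n_sim):
    """Seed tuning (E7): keep the seed with the most negative difference."""
    return min(simulated_difference(s, n_sim) for s in candidate_seeds)


seeds = range(50)
print(most_favorable_difference(seeds, n_sim=10))    # few simulations
print(most_favorable_difference(seeds, n_sim=1000))  # many simulations
```

With few simulations, trying a modest number of seeds is enough to find an apparently large advantage for the favored method. With many simulations the most favorable seed yields a much smaller spurious difference, which is one reason why an a priori simulation size calculation protects against this QRP.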

In some simulations, a method may fail to converge and thus produce missing values in the estimands. If it is not pre-specified how these situations will be handled, different inclusion/exclusion or imputation strategies may be tried out until a favored method appears superior (E6). Choosing an inadequate strategy can result in systematic bias and misleading conclusions.

2.3 Reporting

In the reporting stage, researchers are faced with the challenge of reporting the design, results, and analyses of their simulation study in a digestible manner. Various QRPs can occur at this stage. For instance, reporting may focus on results in which the method of interest performs best (R1). Failing to mention conditions in which the method was inferior (or at least not superior) to competitors creates overoptimistic impressions, and may lead readers to think that the method uniformly outperforms competitors. Similarly, presenting exploratory simulation studies as if they were hypothesized a priori (R2) leads to over-confidence in the results.

Failing to report Monte Carlo uncertainty (R3), e.g., error bars or confidence intervals reflecting uncertainty in the simulation, is a QRP with various consequences (van2019communicating). It impairs the readers’ ability to assess the quality of evidence from the simulation study. Furthermore, not disclosing Monte Carlo uncertainty allows presenting random differences in performance as if they were systematic.
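Reporting Monte Carlo uncertainty requires little extra work. The Python sketch below (for illustration only) computes the Monte Carlo standard error and an unadjusted normal-approximation 95% confidence interval for a mean performance difference. The function name `monte_carlo_summary` and the input values are hypothetical, and the interval is not adjusted for multiple comparisons, unlike the adjusted intervals shown in Figure 1.

```python
import math
import statistics


def monte_carlo_summary(values, z=1.96):
    """Mean, Monte Carlo standard error, and normal-approximation CI.

    values: one performance value (e.g., a Brier score difference)
            per simulation repetition.
    """
    n = len(values)
    mean = statistics.fmean(values)
    mcse = statistics.stdev(values) / math.sqrt(n)
    return mean, mcse, (mean - z * mcse, mean + z * mcse)


# Made-up per-simulation differences (hypothetical, for illustration)
differences = [-0.004, 0.002, -0.001, 0.003, -0.002, 0.001]
mean, mcse, (lower, upper) = monte_carlo_summary(differences)
print(f"mean={mean:.4f}, MCSE={mcse:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

An interval that comfortably covers zero signals that an observed difference may be nothing more than simulation noise.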

Finally, by failing to assure computational reproducibility of the simulation study (R4), it is more likely that coding errors remain undetected. By not reporting the design and execution of the study in enough detail, other researchers are unable to replicate and expand on the simulation study (R5).

3 Empirical study: The Adaptive Importance Elastic Net (AINET)

To illustrate the application of QRPs from Table 1 we conducted a simulation study. The objective of the study was to evaluate the predictive performance of a made-up regression method termed the “adaptive importance elastic net” (ainet). The main idea of ainet is to use variable importance measures from a random forest for a weighted penalization of the variables in an elastic net regression model. The hope is that this ad hoc modification improves predictive performance in clinical prediction modeling settings where penalized regression models are frequently used. Superficially, ainet may seem sensible. However, for a linear data-generating process no advantage over the classical elastic net is expected. For more details on the method, we refer to the simulation protocol (Appendix A). We report the per-protocol simulation study results in Appendix B. As expected, the performance of ainet was virtually identical to standard elastic net regression. ainet also did not yield any improvements over logistic regression for the data-generating process that we considered sensible a priori.

Figure 1: Differences in Brier score with 95% adjusted confidence intervals between ainet and random forest (RF), logistic regression (GLM), elastic net (EN), and adaptive elastic net (AEN) are shown for representative simulation conditions (correlated covariates , prevalence , a range of sample sizes and events per variable (EPV), in each simulation the Brier score is computed for 10’000 test observations; for details see Appendix A). The top row depicts the per-protocol results in which ainet does not outperform any competitor uniformly, except AEN. In the second row, we apply QRP E1: altering the data-generating process by adding a non-linear effect and sparsity. The arrows point from the per-protocol result to the results under the tweaked simulation. In the third row, the QRP E2 is applied: EN is removed as a competitor. In the bottom row, selective reporting R1 is applied: only low EPV settings are reported to give a more favorable impression for ainet. Note, to reduce computation time, the most computationally expensive conditions with and were removed in the tweaked simulation.

We now show how application of QRPs changes the above per-protocol conclusions. Figure 1 illustrates different types of QRPs sequentially applied to simulation-based evaluation of ainet. The top row depicts the per-protocol differences in Brier score (x-axis) between ainet and competitor methods (y-axis) for a representative subset of the simulation conditions. A negative difference indicates superior performance of ainet. In the second row, the arrows depict the change in the per-protocol results after changing the data-generating process (E1). The third row shows the result after removal of the elastic net competitor (E2). Finally, the bottom row shows the end result where selective reporting of simulation conditions and competitor methods (R1) is applied to give a more favorable impression of ainet. We will now discuss these QRPs in more detail.

Altering the data-generating process (E1)

We could not detect a systematic performance benefit of ainet over standard logistic regression, elastic net regression, or random forest for the scenarios specified in the protocol. After simple modifications of the data-generating process, we found that ainet outperforms logistic regression under the following conditions: only few variables being associated with the outcome (sparsity), a non-linear effect, and a low number of events per variable (EPV). Figure 1 (second row) shows the changes in Brier score difference between the pre-registered and the tweaked simulation. As can be seen, the tweaked data-generating process leads to ainet being superior to competitors in some conditions, and at least not inferior in others.

Removing competitor methods (E2)

Despite the adapted data-generating process, we still observed only minor (if any) improvements of ainet over the elastic net. In order to present ainet in a better light we could omit the comparisons with the elastic net (E2), as shown in Figure 1 (third row). This could be justified, for example, by arguing that a less flexible method (logistic regression), a more flexible method (random forest), and a comparably flexible method (adaptive elastic net) are sufficient for neutral comparison.

Selective reporting of simulation results (R1)

After the removal of the competitor elastic net, there are still some simulation conditions under which ainet is not superior to the remaining competitors. To make ainet appear more favorable, we may thus report only simulation conditions with low EPV, as shown in Figure 1 (fourth row). This could be justified by the fact that journals require authors to be concise in their reporting. Moreover, further conditions with low EPV values could be simulated to make the results seem more exhaustive. Focusing primarily on low EPV settings could be justified in hindsight by framing ainet as a method designed for high-dimensional data (low sample size relative to the number of variables).

4 Recommendations

The previous sections painted a rather negative picture of how undisclosed changes in simulation design, analysis, and reporting increase the risk of overoptimistic conclusions. In the following, we summarize what we consider to be practical recommendations for improving the methodological quality of simulation studies; see Table 2 for an overview. Our recommendations are grouped by the stakeholder they concern.

Researchers
  • Adopt (pre-registered) simulation protocols
  • Adopt good computational practices (code review, packaging, unit-tests, etc.)
  • Blind analyses of simulation results
  • Collaborate with other research groups (with possibly “competing” methods)
  • Distinguish exploratory from confirmatory findings
  • Disclose multiplicity and uncertainty of results

Editors and reviewers
  • Require/demand conditions where methods break down
  • Require/demand (pre-registered) simulation protocols
  • Provide enough space for description of simulation methodology

Journals and funding bodies
  • Provide incentives for rigorous simulation studies (e.g., badges on papers)
  • Require code and data
  • Enforce adherence to reporting guidelines
  • Adopt reproducibility checks
  • Promote/fund research and software to improve simulation study methodology

Table 2: Recommendations for improving quality of comparative simulation studies and preventing QRPs.

4.1 Recommendations for researchers

Adopting pre-registered simulation protocols is arguably the most important measure that researchers can take to prevent themselves from subconsciously engaging in QRPs. Pre-registration enables readers to distinguish between confirmatory and exploratory findings, and it lowers the risk of potentially flawed methods being promoted as an improvement over competitors. While pre-registered simulation protocols may at first seem disadvantageous due to the additional work and possibly lower chance of publication, they provide researchers with the means to differentiate their high-quality simulation studies from the numerous unregistered and possibly untrustworthy simulation studies in the literature. Platforms such as GitHub (https://github.com/), OSF (https://osf.io/), or Zenodo (https://zenodo.org/) can be used for uploading and time-stamping documents.

When pre-registering and conducting simulation studies, we recommend using a robust computational workflow. Such a workflow encompasses packaging the software, writing unit tests, and reviewing code (see, e.g., schwab2021statistical). Other researchers and the authors themselves then benefit from improved computational reproducibility and less error-prone code.

While planning a simulation study, it is impossible to think of all potential weaknesses or problems that may arise when conducting the planned simulations. As a result, researchers may be reluctant to tie their hands in a pre-registered protocol. However, a transparently conducted and reported preliminary simulation can obviate most of these problems. We recommend that researchers disclose preliminary results and any ensuing changes to the protocol, e.g., in a revised and time-stamped version of the protocol. This approach parallels running a small pilot study, which is often done in empirical research. A different approach for making post hoc changes to the protocol is to use blinding in the analysis of the simulation results (Dutilh2019). Blinded analysis is a standard procedure in particle physics to prevent data analysts from biasing their result towards their own beliefs (Klein2005), and it lends legitimacy to post hoc modifications of the simulation study. For instance, researchers might shuffle the method labels and only unblind themselves after the necessary analysis pipelines are set in place.
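The label-shuffling idea can be sketched as follows (in Python, for illustration only). Method names are replaced by neutral codes, and the key that maps codes back to names is kept aside until the analysis pipeline is fixed. The function names `make_blinding_key`, `blind_results`, and `unblind_results`, as well as the result values, are hypothetical and not part of the study's code.

```python
import random


def make_blinding_key(method_names, seed=None):
    """Map neutral codes (method_1, method_2, ...) to shuffled method names."""
    rng = random.Random(seed)
    shuffled = list(method_names)
    rng.shuffle(shuffled)
    return {f"method_{i + 1}": name for i, name in enumerate(shuffled)}


def blind_results(results, key):
    """Replace true method names in a results dict by their codes."""
    code_of = {name: code for code, name in key.items()}
    return {code_of[name]: value for name, value in results.items()}


def unblind_results(blinded, key):
    """Restore the true method names once the analysis is finalized."""
    return {key[code]: value for code, value in blinded.items()}


# Made-up mean Brier scores (hypothetical, for illustration)
results = {"ainet": 0.181, "EN": 0.180, "GLM": 0.185, "RF": 0.190}
key = make_blinding_key(results.keys())   # ideally stored by a colleague
blinded = blind_results(results, key)     # analysts only see this
print(sorted(blinded))
```

In practice the key would be generated and kept by someone who is not involved in the analysis, and unblinding would happen only after plots, tables, and exclusion rules are final.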

Another way of improving simulation studies is to collaborate with other researchers, possibly familiar with “competing” methods. This helps to design simulation studies which are more objective and whose results are more useful for making a decision about which method to choose.

It is important for researchers to separate exploratory from confirmatory findings in the reporting of their simulation studies. By clearly indicating exploratory findings, overoptimism can be avoided and a more realistic picture of method performance presented. Furthermore, researchers should try to disclose the multiplicity and uncertainty inherent to the design and analysis of their simulation studies (Hoffmann2021). For instance, if possible, they should report sensitivity analyses that show how much their conclusions change if different decisions had been made. Methods from multivariate statistics can be used for visualizing the influence of different design choices, e.g., a multidimensional unfolding approach as shown by Niessl2021.

4.2 Recommendations for editors and reviewers

Peer review is an important tool for identifying QRPs in research results submitted to methodological journals. For instance, reviewers may ask researchers to include competitor methods which are not yet part of their comparison (or which might have been excluded from it). However, reviewers can only identify a subset of all QRPs since some types are impossible to spot if no pre-registered simulation protocol is in place (e.g., a reviewer cannot know whether the evaluation criterion was switched). Even QRPs which can be detected by peer review may be difficult to spot in practice. It is thus important that reviewers and editors demand that authors make simulation protocols and computer code available alongside the manuscript. Moreover, by providing enough space and encouraging authors to provide detailed descriptions of their simulation studies, replicability of the simulation studies can be improved. Finally, reviewers should not be satisfied with manuscripts showing that a method is uniformly superior; they should also urge authors to find cases where their method is inferior or edge cases where it breaks down entirely.

4.3 Recommendations for journals and funding bodies

Journals and funding bodies can improve on the status quo by either demanding stricter requirements or by providing incentives for more rigorous simulation study methodology. For example, journals can make (pre-registered) simulation protocols mandatory for all articles featuring a simulation study. A less extreme measure would be to indicate with a badge whether an article contains a pre-registered simulation study. Such an approach rewards researchers who take the extra effort. Similar initiatives have led to a large increase in the adoption of pre-registered study protocols in the field of psychology (Kidwell2016). Another measure could be to require standardized reporting of simulation studies, e.g., the “ADEMP” reporting guideline by Morris2019. Journals may also employ reproducibility checks to ensure computational reproducibility of the published simulation studies. This is already done, for example, by the Journal of Open Source Software or the Journal of Statistical Software. Finally, journals and funding bodies can promote or fund research and software to improve simulation study methodology. For instance, a journal might have special calls for papers on simulation methodology. Similarly, a funding body could have special grants dedicated to software development that facilitates sound design, execution, and reporting of simulation studies (as, for example, White2010; Gasparini2018; Chalmers2020).

5 Conclusions

Simulation studies should be viewed and treated analogously to (empirical) experiments from other fields of science. The distinction between exploratory and confirmatory simulation studies is essential to contextualize the results of such a study. As in other empirical sciences, QRPs in simulation studies can obfuscate the usefulness of a novel method and lead to misleading and non-replicable results.

By deliberately using several QRPs we were able to present a method with no expected benefits and little theoretical justification – invented solely for this article – as an improvement over theoretically and empirically well-established competitors. While such intentional engagement in these practices is far from the norm, unintentional QRPs may have the same detrimental effect. We hope that our illustration will increase awareness about the fragility of findings from simulation studies and the need for higher standards.

While this article focused on comparative simulation studies, many of the issues and recommendations also apply to neutral comparison studies with real data sets as discussed in Niessl2021. Some of the noted problems even exist in theoretical research; due to the incentive to publish positive results, researchers often selectively study optimality conditions of methods rather than conditions under which they fail.

Again, it is imperative to note that researchers rarely engage in QRPs with malicious intent but because humans tend to interpret ambiguous information self-servingly, and because they are good at finding reasonable justifications that match their expectations and desires (Simmons2011). As in other domains of science, it is easier to publish positive results in methodological research, i.e., novel and superior methods (Boulesteix2015). Thus, methodological researchers will typically desire to show the superiority of a method rather than to disclose its strengths and weaknesses. Aligning incentives for individual researchers with rigorous simulation research will require a range of actions involving various stakeholders in the research community. We have provided some recommendations that, we believe, could help achieve this goal. Most importantly, we think that reviewers, journals, and funders need to raise the standards for simulation studies by requiring pre-registered simulation protocols and rewarding researchers who invest the extra effort.

Software and data

The simulation study was conducted in the R language for statistical computing (pkg:base), version 4.1.1. The method ainet is implemented in the ainet package and available on GitHub (https://github.com/SamCH93/SimPaper). We provide scripts for reproducing the different simulation studies in the GitHub repository. Due to the computational overhead, we also provide the resulting data so that the analyses can be conducted without rerunning the simulations. We used pROC version 1.18.0 to compute the AUC (pkg:proc). Random forests were fitted using ranger version 0.13.1 (ranger2017). For penalized likelihood methods, we used glmnet version 4.1.2 (Friedman2010; Simon2011). The SimDesign package version 2.7.1 was used to set up simulation scenarios (Chalmers2020).


We would like to thank Eva Furrer, Malgorzata Roos, and Torsten Hothorn for helpful discussion and comments on the simulation protocol and drafts of the manuscript. The authors declare that they do not have any conflicts of interest.

Appendix A Simulation protocol

Below, we include an excerpt of the final version of the protocol for the simulation-based evaluation of ainet. All time-stamped versions of the protocol are available at https://doi.org/10.5281/zenodo.6364575.

A.1 Aims

The aim of this simulation study is to systematically study the predictive performance of ainet for a binary prediction task. The simulation conditions should resemble typical conditions found in the development of prediction models in biomedical research. In particular we want to evaluate the performance of ainet conditional on

  • low- and high-dimensional covariates

  • (un-)correlated covariates

  • small and large sample sizes

  • varying baseline prevalences

ainet will be compared to other (penalized) binary regression models from the literature, namely

  • Binary logistic regression: the simplest and most popular method for binary prediction

  • Elastic net: a generalization of LASSO and ridge regression, the most widely used penalized regression methods

  • Adaptive elastic net: a generalization of the most popular weighted penalized regression method (adaptive LASSO)

  • Random forest: a popular, more flexible method. This method is related to ainet, see Section A.4.

These cover a wide range of established methods with varying flexibility and serve as a reasonable benchmark for ainet. There are many more extensions of the adaptive elastic net in the literature (see, e.g., the review by Vidaurre2013). However, most of these extensions focus on variable selection and estimation instead of prediction, which is why we restrict our focus to the four methods above.

A.2 Data-generating process

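In each simulation $b = 1, \dots, B$, we generate a data set consisting of $n$ realizations, i.e., $\{(y_i, \boldsymbol{x}_i)\}_{i=1}^{n}$. A datum $(y_i, \boldsymbol{x}_i)$ consists of a binary outcome $y_i \in \{0, 1\}$ and a $p$-dimensional covariate vector $\boldsymbol{x}_i \in \mathbb{R}^p$. The binary outcomes are generated by

$$Y_i \mid \boldsymbol{x}_i \sim \operatorname{Bernoulli}\left(\operatorname{expit}\left(\beta_0 + \boldsymbol{x}_i^\top \boldsymbol{\beta}\right)\right)$$

with $\operatorname{expit}(z) = 1/(1 + \exp(-z))$ and the covariate vectors are generated by

$$\boldsymbol{X}_i \sim \mathrm{N}_p(\boldsymbol{0}, \Sigma)$$

with covariance matrix $\Sigma$ that may vary across simulation conditions (see below). The baseline prevalence is $\text{prev} = \operatorname{expit}(\beta_0)$. The coefficient vector $\boldsymbol{\beta}$ is generated from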

once per simulation. Finally, the simulation parameters are varied fully factorially (except for the removal of some unreasonable conditions), leading to a total of 128 scenarios, as described below.

Sample size

The sample size used in the development of prediction models varies widely (Damen2016). We will use , which span typical values occurring in practice. Note that previous simulation studies usually chose the sample size based on the implied number of events together with the number of covariates in the model for easier interpretation (vanSmeden2018; Riley2018). We will use this approach in reverse to determine the dimensionality of the parameters below.


Previous simulation studies showed that the events per variable (EPV), rather than the absolute sample size and dimensionality, influence the predictive performance of a method. We will therefore define the dimensionality $p$ via EPV by

$$p = \frac{n \cdot \text{prev}}{\text{EPV}},$$

where $n$ is the sample size and $\text{prev}$ the baseline prevalence. If the above formula gives non-integer values, the next larger integer will be used for $p$. When the formula gives values above 100 or below 2, this simulation condition will be removed from the design. This is done because prediction models are in practice only multivariable models ($p \geq 2$), but at the same time the number of predictors is rarely larger than $p = 100$ (Kreuzberger2020; Seker2020; Wynants2020). The exceptions are studies considering complex data, such as images, omics, or text data, which are not the focus here. The values are chosen to cover scenarios with a small to large number of covariates (cf. vanSmeden2018).
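The rule for deriving the dimensionality can be sketched in a few lines of Python. This is an illustration only: the function name `dimensionality` and the example inputs are hypothetical and not taken from the protocol. Only the relation between sample size, prevalence and EPV, the rounding-up rule, and the bounds of 2 and 100 come from the text. The sketch assumes that the bounds are checked on the value of the formula before rounding, which is one possible reading of the rule.

```python
import math


def dimensionality(n, prev, epv, p_min=2, p_max=100):
    """Number of covariates implied by sample size, prevalence and EPV.

    Computes n * prev / epv and rounds up to the next integer. Returns None
    when the condition would be removed from the design because the value
    of the formula lies below p_min or above p_max (assumed reading: the
    bounds are applied before rounding up).
    """
    # round() removes floating-point noise such as 10.000000000000002
    raw = round(n * prev / epv, 9)
    if raw < p_min or raw > p_max:
        return None
    return math.ceil(raw)


# Hypothetical design values, chosen only to exercise the rule
for n, prev, epv in [(100, 0.1, 1), (500, 0.05, 10), (5000, 0.3, 0.5)]:
    print(n, prev, epv, dimensionality(n, prev, epv))
```

Returning `None` for removed conditions makes it easy to filter a full factorial grid down to the retained scenarios.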

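Collinearity in the covariates

We distinguish between no, low, medium and high collinearity. The diagonal elements of the covariance matrix $\Sigma$ are given by $\Sigma_{jj} = 1$ and the off-diagonal elements are set to a common value $\Sigma_{jk} = \rho$, $j \neq k$, with one value of $\rho$ per collinearity level. These values cover the typical (positive) range of correlations.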

Baseline prevalence

Different baseline prevalences are considered, reflecting a reasonable range of prevalences for rare to common diseases/adverse events.

Test data

In order to test the out-of-sample predictive performance, we generate a test data set of data points in each simulation .
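The data-generating process of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: it assumes a logistic model for the outcome, zero-mean normal covariates with unit variances and a common pairwise correlation `rho`, and an intercept chosen so that the event probability at a covariate vector of zero equals the baseline prevalence. The standard normal distribution for the coefficients is a placeholder assumption, and all function names are hypothetical.

```python
import math
import random


def expit(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def draw_covariates(n, p, rho, rng):
    """Draw n covariate vectors with unit variances and correlation rho.

    Uses the one-factor construction x_j = sqrt(rho)*z0 + sqrt(1-rho)*z_j,
    which yields an equicorrelated normal vector for 0 <= rho <= 1.
    """
    rows = []
    for _ in range(n):
        z0 = rng.gauss(0.0, 1.0)  # factor shared by all p covariates
        rows.append([
            math.sqrt(rho) * z0 + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ])
    return rows


def simulate_dataset(n, p, rho, prev, rng):
    """Simulate covariates, binary outcomes and the true event probabilities."""
    beta0 = math.log(prev / (1.0 - prev))  # so that expit(beta0) == prev
    # ASSUMPTION: standard normal coefficients, drawn once per data set
    beta = [rng.gauss(0.0, 1.0) for _ in range(p)]
    X = draw_covariates(n, p, rho, rng)
    probs = [expit(beta0 + sum(b * x for b, x in zip(beta, row))) for row in X]
    y = [1 if rng.random() < pr else 0 for pr in probs]
    return X, y, probs, beta0, beta


rng = random.Random(2022)
X, y, probs, beta0, beta = simulate_dataset(n=200, p=5, rho=0.5, prev=0.1, rng=rng)
print(len(X), len(X[0]), sum(y))
```

A test data set for out-of-sample evaluation can be obtained by drawing new covariates and outcomes with the same `beta0` and `beta`.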

A.3 Estimands

We will estimate different quantities to evaluate overall predictive performance, calibration, and discrimination, respectively. All methods will be evaluated on independently generated test data.

A.3.1 Primary estimand

  • Brier score. We compute the Brier score as

    $$\mathrm{BS} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (y_i - \hat{y}_i)^2,$$

    where $y_i$ is the observed outcome and $\hat{y}_i$ the predicted probability for test observation $i$, and $n_{\text{test}}$ is the size of the test data set. Lower values indicate better predictive performance in terms of calibration and sharpness. A prediction is well-calibrated if the observed proportion of events is close to the predicted probabilities. Sharpness refers to how concentrated a predictive distribution is (e.g., how wide/narrow a prediction interval is), and the predictive goal is to maximize sharpness subject to calibration (Gneiting2008). The Brier score is a proper scoring rule, meaning that its expected value is minimized if the predicted distribution is equal to the data-generating distribution (Gneiting2007). Proper scoring rules thus encourage honest predictions. The Brier score is therefore a principled choice for our primary estimand.
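A short Python sketch may help make the primary estimand and its propriety concrete. The function names and the toy numbers are hypothetical. `expected_brier` computes the expected Brier score of a constant forecast `q` when the true event probability is `p`, which is smallest at `q = p`.

```python
def brier_score(y, prob):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    if len(y) != len(prob) or len(y) == 0:
        raise ValueError("y and prob must be non-empty and of equal length")
    return sum((yi - pi) ** 2 for yi, pi in zip(y, prob)) / len(y)


def expected_brier(p, q):
    """Expected Brier score of forecast q when the event occurs with probability p."""
    return p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2


# Toy example: four test outcomes with predicted probabilities
print(brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]))

# Propriety: over a grid of forecasts, the expected score is smallest at q = p
grid = [k / 100 for k in range(101)]
print(min(grid, key=lambda q: expected_brier(0.3, q)))
```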

A.3.2 Secondary estimands

  • Scaled Brier score. The scaled Brier score (also known as Brier skill score) is computed as

    $$\mathrm{BS}_{\text{scaled}} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_0}$$

    with $\mathrm{BS}$ the Brier score, $\mathrm{BS}_0 = \bar{y}(1 - \bar{y})$, and $\bar{y}$ the observed prevalence in the data set. The scaled Brier score takes into account that the prevalence varies across simulation conditions. Hence, the scaled Brier score can be compared between conditions (Schmid2005; steyerberg2019clinical).

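  • Log-score. The log-score, computed on independently generated test data as

    $$\mathrm{LS} = -\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left\{ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right\},$$

    with $\hat{y}_i$ the predicted probability for test outcome $y_i$, will be used as a secondary measure of overall predictive performance. Lower values indicate better predictive performance in terms of calibration and sharpness. The log-score is a strictly proper scoring rule; however, it is more sensitive to extreme predicted probabilities than the Brier score (Gneiting2007).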

  • AUC. The AUC is given by

    $$\mathrm{AUC} = \Pr\left(\hat{y}_i > \hat{y}_j \mid y_i = 1, y_j = 0\right),$$

    where $y_i = 1$ and $y_j = 0$ denote case and non-case, respectively, and $\hat{y}_i$ and $\hat{y}_j$ are the corresponding predicted probabilities. The AUC is related to the area under the receiver-operating-characteristic (ROC) curve (steyerberg2019clinical). It will be used as a measure of discrimination, and values closer to one indicate better discriminative ability. Discrimination describes the ability of a prediction model to discriminate between cases and non-cases. Other discrimination measures, such as accuracy, sensitivity, specificity, etc., are not considered because we want to evaluate predictive performance in terms of probabilistic predictions instead of point predictions/classification.

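  • Calibration slope $b$. The calibration slope is obtained by regressing the test data outcomes $y_i$ on the models’ predicted logits $\operatorname{logit}(\hat{y}_i)$, i.e.,

    $$\operatorname{logit} \Pr(Y_i = 1) = a + b \operatorname{logit}(\hat{y}_i).$$

    This measure will be used to assess calibration, and deviations of $b$ from one indicate miscalibration (steyerberg2019clinical).

  • Calibration in the large $a$. We inspect calibration in the large on independently generated test data, from the model

    $$\operatorname{logit} \Pr(Y_i = 1) = a + \operatorname{logit}(\hat{y}_i),$$

    in which the slope on the predicted logits is fixed at one. This measure will also be used to assess calibration, and deviations of $a$ from zero indicate miscalibration (steyerberg2019clinical).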
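The secondary estimands listed above can be sketched in plain Python as follows. This is a hedged illustration using standard textbook definitions, not the implementation used in the study (which relied on R packages such as pROC). The function names are hypothetical, the log-score is taken as a mean over test observations, and the calibration models are fitted with a basic Newton-Raphson iteration without safeguards, so predictions of exactly zero or one are not handled.

```python
import math


def scaled_brier(y, prob):
    """1 - BS / BS_0, with BS_0 the Brier score of predicting the observed prevalence."""
    n = len(y)
    bs = sum((yi - pi) ** 2 for yi, pi in zip(y, prob)) / n
    ybar = sum(y) / n
    return 1.0 - bs / (ybar * (1.0 - ybar))


def log_score(y, prob):
    """Mean negative Bernoulli log-likelihood (infinite if a certain prediction is wrong)."""
    total = 0.0
    for yi, pi in zip(y, prob):
        p_obs = pi if yi == 1 else 1.0 - pi
        if p_obs == 0.0:
            return math.inf
        total -= math.log(p_obs)
    return total / len(y)


def auc(y, prob):
    """Share of case/non-case pairs in which the case has the higher prediction (ties count 1/2)."""
    cases = [pi for yi, pi in zip(y, prob) if yi == 1]
    controls = [pi for yi, pi in zip(y, prob) if yi == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def calibration(y, prob, fix_slope=False, iters=50):
    """Fit logit P(y = 1) = a + b * logit(prob) by Newton-Raphson.

    With fix_slope=True the slope b is held at one and only the intercept a
    is estimated (calibration in the large). Returns the pair (a, b).
    """
    z = [math.log(pi / (1.0 - pi)) for pi in prob]
    a, b = 0.0, (1.0 if fix_slope else 0.0)
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]
        w = [m * (1.0 - m) for m in mu]
        ga = sum(yi - m for yi, m in zip(y, mu))  # score for a
        haa = sum(w)
        if fix_slope:
            a += ga / haa
            continue
        gb = sum((yi - m) * zi for yi, m, zi in zip(y, mu, z))  # score for b
        hab = sum(wi * zi for wi, zi in zip(w, z))
        hbb = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = haa * hbb - hab * hab
        # Solve the 2x2 Newton system explicitly
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b


# Toy data: 20 outcomes, predictions of 0.2 and 0.8 that match the observed event rates
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
prob = [0.2] * 10 + [0.8] * 10
print(scaled_brier(y, prob), log_score(y, prob), auc(y, prob), calibration(y, prob))
```

In the toy data the predictions match the observed event rates in both groups, so the fitted intercept is close to zero and the fitted slope close to one.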

To facilitate comparison between simulation conditions, all estimands will also be corrected by the oracle version of the estimand. For example, the Brier score will be computed from the ground-truth parameters and the simulated data; subsequently, the oracle Brier score will be subtracted from the estimated Brier score.
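The oracle correction for the Brier score can be sketched as follows. The sketch assumes that the data come from a logistic model with known intercept `beta0` and coefficient vector `beta`, so that the oracle predictions are the true event probabilities. All names and the toy inputs are hypothetical.

```python
import math


def brier(y, prob):
    """Mean squared difference between outcomes and predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, prob)) / len(y)


def oracle_probabilities(X, beta0, beta):
    """True event probabilities under the assumed logistic model."""
    return [
        1.0 / (1.0 + math.exp(-(beta0 + sum(b * x for b, x in zip(beta, row)))))
        for row in X
    ]


def oracle_corrected_brier(y, prob, X, beta0, beta):
    """Brier score of a method minus the Brier score of the oracle predictions."""
    return brier(y, prob) - brier(y, oracle_probabilities(X, beta0, beta))


# Toy example with two covariates and a method that always predicts 0.5
X = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]]
y = [0, 1, 1]
print(oracle_corrected_brier(y, [0.5, 0.5, 0.5], X, beta0=-1.0, beta=[1.0, 0.5]))
```

A corrected value of zero means that the method predicts as well as the true model on the given test data.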

A.4 Methods

A.4.1 ainet

We now present the mock-method and give a superficial motivation why it could lead to improved predictive performance: Choosing the vector of penalization weights in the adaptive LASSO becomes difficult in high-dimensional settings. For instance, using absolute LASSO estimates as penalization weights omits the importance of several predictors by not selecting them, especially in the case of highly correlated predictors (Algamal2015). The adaptive importance elastic net (ainet) circumvents this problem by employing a random forest to estimate the penalization weights via an a priori chosen variable importance measure. In this way, the importance of all variables enters the penalization weights simultaneously.

The penalized log-likelihood for ainet for a single observation is defined as


denotes the log-likelihood of a binomial GLM and is derived from a random forest variable importance measure as

where we transform to be non-negative via

and is a hyperparameter for the influence of the weights similar to hyperparameter of the adaptive elastic net. ainet is fitted by maximizing its penalized log-likelihood assuming i.i.d. observations , i.e.,

By default, we choose mean decrease in the Gini coefficient for . Hyperparameters of the random forest are not tuned, but kept at their default values (e.g., mtry, ntree). The hyperparameter will stay constant for all simulations.

ainet is supposed to seem like a reasonable method at first glance. However, ainet cannot be expected to share desirable theoretical properties with the usual adaptive LASSO, such as oracle estimation (Zou2006). This is because the penalization weights do not meet the required consistency assumption. Also in terms of prediction performance, ainet is not expected to outperform methods of comparable complexity.

A.4.2 Benchmark methods

  • Binary logistic regression (mccullagh2019generalized) with and without ridge penalty for high- and low-dimensional settings, respectively. In case a ridge penalty is needed, it is tuned via 5-fold cross-validation by following the “one standard error” rule as implemented in glmnet (Friedman2010).

  • Elastic net (Zou2005), for which the penalized log-likelihood is given by

    Here, and are tuned via 5-fold cross-validation by following the “one standard error” rule.

  • Adaptive elastic net (Zou2006), with penalized loss function

    Here, the penalty weights are inverse coefficient estimates from a binary logistic regression

    where and are tuned via 5-fold cross-validation by following the “one standard error” rule. The hyperparameter will stay constant for all simulations. In case , we estimate the penalty weights using a ridge penalty, tuned via an additional nested 5-fold cross-validation by following the “one standard error” rule.

  • Random forests (Breiman2001) for binary outcomes without hyperparameter tuning. The default parameters of ranger will be used (ranger2017).

A.5 Performance measures

The distribution of all estimands from Section A.3 will be assessed visually with box- and violin-plots that are stratified by method and simulation conditions. We will also compute mean, median, standard deviation, interquartile range, and 95% confidence intervals for each of the estimands. Moreover, instead of “eye-balling” differences in predictive performance across methods and conditions, we will formally assess them by regressing the estimands on the method and simulation conditions (cf. Skrondal2000). To do so, we will use a fully interacted model with the interaction between the methods and the 128 simulation conditions, i.e., in R notation: estimand ~ 0 + method:scenario. We will rank pairwise comparisons between two methods within a single condition by their $p$-values, to more easily identify conditions where methods show differences in predictive performance. The choice of a significance level at which a method is deemed superior will be determined based on preliminary simulations. We set this level to 5%, where $p$-values will be adjusted using the single-step method (pkg:multcomp) within a single simulation condition for comparisons between ainet and any other method.
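In a model without intercept and with one coefficient per combination of method and scenario, the least-squares coefficients are simply the cell means of the estimand. The following Python sketch illustrates this point estimate and the within-scenario differences between ainet and each other method. It does not reproduce the multiplicity-adjusted $p$-values or confidence intervals, and the function names and toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cell_means(records):
    """Mean estimand per (method, scenario) cell.

    records is an iterable of (method, scenario, estimand) tuples, one per
    simulation. The result corresponds to the coefficients of a no-intercept
    model with one indicator per cell.
    """
    cells = defaultdict(list)
    for method, scenario, value in records:
        cells[(method, scenario)].append(value)
    return {cell: mean(values) for cell, values in cells.items()}


def differences_from_reference(means, reference="ainet"):
    """Within each scenario, reference cell mean minus the other method's cell mean."""
    out = {}
    for (method, scenario), m in means.items():
        if method != reference and (reference, scenario) in means:
            out[(method, scenario)] = means[(reference, scenario)] - m
    return out


# Hypothetical Brier scores from two simulations in a single scenario
records = [
    ("ainet", "s1", 0.10), ("ainet", "s1", 0.12),
    ("EN", "s1", 0.11), ("EN", "s1", 0.13),
]
print(differences_from_reference(cell_means(records)))
```

With this sign convention, a negative difference in Brier score means that ainet had the lower (better) mean Brier score in that scenario.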

A.6 Determining the number of simulations

We determine the number of simulations such that the Monte Carlo standard error of the primary estimand, the mean Brier score

, is sufficiently small. The variance of

is given by

and could be decomposed further (Bradley2008). However, the resulting expression is difficult to evaluate for our data-generating process as it depends on several of the simulation parameters. We therefore follow a similar approach as in Morris2019 and estimate it from an initial small simulation run with 100 simulations per condition to get an upper bound for the worst-case variance across all simulation conditions. The number of simulations is then given by

Since we decide that we require the Monte Carlo standard error of to be lower than four significant digits, .

The initial simulation run led to an estimated worst case variance of . Therefore, we compute that

replications are required to obtain Brier score estimates with the desired precision.
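The general logic of this calculation can be sketched in Python. The sketch assumes the usual relation that the Monte Carlo standard error of a mean over independent simulations is the square root of the variance divided by the number of simulations. The function names and the pilot values are hypothetical and do not reproduce the numbers of the study.

```python
import math
from statistics import variance


def worst_case_variance(pilot):
    """Largest sample variance of the estimand across simulation conditions.

    pilot maps each condition to the list of estimand values (e.g. Brier
    scores) obtained in an initial small simulation run.
    """
    return max(variance(values) for values in pilot.values())


def required_simulations(var, target_mcse):
    """Smallest number of simulations B with sqrt(var / B) <= target_mcse."""
    if var < 0 or target_mcse <= 0:
        raise ValueError("var must be non-negative and target_mcse positive")
    return math.ceil(var / target_mcse ** 2)


# Hypothetical pilot results for two conditions
pilot = {
    "condition_1": [0.08, 0.10, 0.09, 0.11],
    "condition_2": [0.20, 0.26, 0.23, 0.17],
}
var = worst_case_variance(pilot)
print(var, required_simulations(var, target_mcse=0.001))
```

Using the worst-case variance makes the resulting number of simulations conservative for all other conditions.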

A.7 Handling exceptions

It is inevitable that convergence issues and other problems will arise in the simulation study. We will handle them as follows:

  • If a method fails to converge, the simulation will be excluded from the analysis. The failing simulations will not be replaced with new simulations that successfully converge as convergence may be impossible for some scenarios.

  • We will report the proportion of simulations with convergence issues for each method and discuss the potential reasons for their emergence.

  • In case of severe convergence issues or other problems (more than 10% of the simulations failing within a setting), we may adjust the simulation parameters post hoc. This will be indicated in the discussion of the results.

  • Convergence may be possible for certain tuning parameters of a method (e.g., cross-validation of LASSO may fail for some values while it could work for others). In this case we will choose a parameter value where the method still converges, as one would usually do with a real data set.
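The bookkeeping implied by these rules can be sketched as follows. This is a hypothetical illustration in Python, not the code used in the study: `fit_and_score` stands for any function that fits a method to one simulated data set and returns an estimand, and it is assumed to signal non-convergence by raising an exception.

```python
def run_condition(fit_and_score, datasets, max_failure_rate=0.10):
    """Apply a method to each simulated data set and track failures.

    Failed simulations are excluded and not replaced. Returns the estimands
    of the successful fits, the proportion of failures, and a flag that is
    True when this proportion exceeds max_failure_rate.
    """
    scores, failures = [], 0
    for data in datasets:
        try:
            scores.append(fit_and_score(data))
        except (ArithmeticError, RuntimeError, ValueError):
            failures += 1  # counted and later reported, never replaced
    rate = failures / len(datasets) if datasets else 0.0
    return scores, rate, rate > max_failure_rate


def toy_method(data):
    """Hypothetical stand-in for a method that fails on negative inputs."""
    if data < 0:
        raise RuntimeError("did not converge")
    return 2 * data


print(run_condition(toy_method, [1, 2, -1, 4]))
```

The returned flag marks settings in which more than 10% of the simulations failed, which is the threshold named above for considering post hoc adjustments.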

Appendix B Per-protocol results

Here, we describe the outcomes of the preregistered simulations. Overall, the performance of ainet was virtually identical to elastic net regression. The adaptive penalization weights of ainet do not seem to make a difference for the data-generating mechanism considered in our simulations. Moreover, since the data were generated under a process equivalent to a logistic regression model, it is no surprise that for reasonably large sample sizes, logistic regression also performed the best. The only exceptions are conditions with small sample size and low number of events per variable. Here, ainet and elastic net led to more stable and better calibrated predictions than logistic regression. The random forest was outperformed by ainet in most simulation conditions, with the exception of very small sample size and prevalence, as well as when a high correlation between covariates was present. Finally, the performance of the adaptive elastic net was generally worse compared to ainet and elastic net. In the following, we summarize the results for each estimand.

B.1 Brier score (primary estimand)

Figure 2 shows the differences in mean Brier score between ainet and the other methods stratified by simulation conditions. We see that there is hardly any difference between ainet and the elastic net (EN) across all simulation conditions, meaning that the predictive performance of both methods seems to be very similar in the investigated scenarios.

The random forest (RF) shows better predictive performance than ainet in conditions with very low sample size and prevalence. For increasing sample size and prevalence, the performance of ainet seems to become similar to or better than that of RF when the correlation of the covariates is not too large, especially for low events per variable. For highly correlated covariates, the performance of ainet is similar or worse across most simulation conditions.

Logistic regression (GLM) showed better predictive performance compared to ainet in most simulation conditions. Exceptions are the conditions with small sample size, medium to large prevalence, and low events per variable, where ainet performed better than GLM.

The adaptive elastic net (AEN) method performed worse than ainet in almost all simulation conditions. Only in conditions with very large sample size, very small prevalence, and high events per variable did AEN show predictive performance on par with ainet.

B.2 Scaled Brier score (secondary estimand)

Figure 3 shows the differences in scaled Brier score between ainet and the other methods stratified by simulation conditions. The scaled Brier score is useful to compare the actual values of Brier scores across conditions with different prevalence, but not so much to compare Brier scores of different methods within a simulation condition with fixed prevalence.

We see that for most conditions the plots look like a flipped version of the original Brier scores from Figure 2. Therefore, conclusions are mostly the same. For very small sample sizes coupled with low prevalence and low events per variable (the top-left plots), the scaled Brier score indicates superiority of ainet over RF and GLM, which is the opposite of the conclusion based on the raw Brier score. We advise interpreting these conditions cautiously since the prevalence prediction which is used for scaling is based on the much larger test data set.

B.3 Log-score (secondary estimand)

Figure 4 shows the differences in log-score between ainet and the other methods stratified by simulation conditions. We see that in certain conditions, the error bars of certain methods are much larger. This is due to the log-score’s sensitivity to extreme predictions, which often happen under the RF (and sometimes under the GLM). Despite the larger variability of the log-score, conclusions regarding the comparison between ainet and the other methods are largely the same as under the Brier score.

B.4 Area under the curve (secondary estimand)

Figure 4 shows the differences in area under the curve (AUC) between ainet and the other methods stratified by simulation conditions. As with the other estimands, ainet shows virtually identical performance to EN regression across all simulation conditions. ainet seems to outperform RF across most simulation conditions, with the exception of conditions with low sample size, medium prevalence, and low events per variable. GLM typically outperforms ainet in conditions with small to medium sample size, and also in conditions with larger sample size when the events per variable is normal to high and the prevalence is small. Finally, the AEN is worse with respect to AUC than ainet across all simulation conditions.

B.5 Calibration slope (secondary estimand)

Figure 5 shows boxplots of calibration slopes stratified by simulation condition and method. For each condition the percentage of simulations where no estimate could be obtained is indicated. This usually happened because of extreme (close to zero or one) predictions, or non-convergence of the method itself. We caution against interpretation of the random forest (RF) calibration slopes because this method often resulted in predicted probabilities of zero or one, so that a calibration slope could not be fitted.

We see that logistic regression (GLM) shows on average optimal calibration slopes in most simulation conditions. In cases where the slope deviates from one, it is usually too small, indicating overoptimistic predictions. In general, worse calibration slopes are obtained for lower events per variable (EPV).

The penalized methods (ainet, EN, AEN) show a more stable behavior, and on average larger calibration slopes than GLM. This is likely confounded by the simulation conditions in which no GLM calibration slope can be estimated, but estimation of the penalized methods’ calibration slope is still possible. Among the penalized methods, ainet and EN show relatively similar calibration slopes whereas the AEN shows worse calibration slopes that deviate more from the value of one.

B.6 Calibration in the large (secondary estimand)

Figure 6 shows boxplots of calibration in the large estimates stratified by simulation condition and method. For each condition, the percentage of simulations where no estimate could be obtained is also indicated. This usually happened because of extreme (close to zero or one) predictions.

We see that the number of simulations with non-estimable calibration is substantially larger when the sample size is small, whereas it decreases for larger sample sizes. An exception is the RF where the number of non-estimable calibrations stays high across most conditions.

While all methods seem to be marginally well calibrated, the penalized methods (ainet, EN, and AEN) show lower numbers of simulations with non-estimable calibration compared to GLM, especially for low to medium sample sizes and low events per variable.

Figure 2: Tie-fighter plot for the difference in Brier score between any method on the -axis and ainet. The 95% confidence intervals are adjusted per simulation condition using the single-step method. Lower values indicate better performance of ainet.
Figure 3: Tie-fighter plot for the difference in scaled Brier score between any method on the -axis and ainet. The 95% confidence intervals are adjusted per simulation condition using the single-step method. Larger values indicate better performance of ainet.
Figure 4: Tie-fighter plot for the difference in log-score between any method on the -axis and ainet. The 95% confidence intervals are adjusted per simulation condition using the single-step method. Lower values indicate better performance of ainet.
Figure 5: Boxplots of calibration slopes stratified by method and simulation conditions. Mean calibration slope is indicated by a cross. A value of one indicates optimal calibration. The percentage of simulations where the calibration slope could not be estimated (due to extreme predictions or complete separation) is also indicated.
Figure 6: Boxplots of calibration in the large stratified by method and simulation conditions. Mean calibration in the large is indicated by a cross. A value of zero indicates optimal calibration in the large. The percentage of simulations where calibration in the large could not be estimated (due to extreme predictions or complete separation) is also indicated.