1 Introduction
Evaluating the comparative efficacy of alternative health care interventions lies at the heart of health technology assessments (HTAs), such as those commissioned by the National Institute of Health and Care Excellence (NICE), the body responsible for providing guidance on whether health care technologies should be publicly funded in England and Wales [49]. The randomized controlled trial (RCT) is the most reliable design for estimating the relative effectiveness of new treatments [18]. However, new treatments are typically compared against placebo or standard of care before the licensing stage, but not necessarily against other active interventions — a comparison that is required for HTAs. In the absence of data from headtohead RCTs, indirect treatment comparisons (ITCs) are at the top of the hierarchy of evidence when assessing the relative effectiveness of interventions and can inform treatment and reimbursement decisions [14].
Standard ITC techniques, such as network metaanalysis, are useful when there is a common comparator arm between RCTs, or more generally a connected network of studies [14, 8]. These methods can be used with individual patient data (IPD) or aggregatelevel data (ALD), with IPD considered the gold standard [48]. However, standard ITCs assume that there are no crosstrial differences in the distribution of effectmodifying variables (constant relative effects) and produce biased estimates when these exist [34]. Popular balancing methods such as propensity score matching [2] can account for these differences but require access to IPD for all the studies being compared [17].
In many HTA processes, there are: (1) no headtohead trials comparing the interventions of interest; (2) IPD available for at least one intervention (e.g. from the submitting company’s own trial), but only published ALD for the relevant comparator(s); and (3) crosstrial differences in effectmodifying baseline characteristics, implying that the relative effects are not constant. Several methods, labeled populationadjusted indirect comparisons, have been introduced to estimate relative treatment effects in this scenario. These include matchingadjusted indirect comparison (MAIC) [42, 43, 41] and simulated treatment comparison (STC) [10] and generally require access to IPD from at least one of the trials.
The NICE Decision Support Unit has published formal submission guidelines for population adjustment with limited access to IPD [34, 33]. Various reviews [34, 33, 21, 47] define the relevant terminology and assess the theoretical validity of these methodologies but do not express a preference. Questions remain about the correct application of the methods and their validity in HTA [34, 51, 33]. Thus, Phillippo et al. [34] state that current guidance can only be provisional, as more thorough understanding of the properties of populationadjusted indirect comparisons is required.
Consequently, several simulation studies have been published since the release of the NICE guidance [23, 32, 12, 19, 5, 24]. These have primarily assessed the performance of MAIC relative to standard ITCs in a limited number of simulation scenarios. In general, the studies set relatively low effect modifier imbalances and do not vary these, even though MAIC may lead to large reductions in effective sample size and imprecise estimates of the treatment effect when imbalances are high [35].
In this paper, we carry out a comprehensive simulation study to benchmark the performance of MAIC and STC against the standard ITC. The simulation study evaluates these methods in a wide range of settings; varying the trial sample size, effectmodifying strength of covariates, explanatory power of covariates, covariate imbalance and correlation levels. A total of 162 simulation scenarios are considered, providing the most extensive evaluation of population adjustment methods to date. A complementary objective of the simulation study is to inform the circumstances under which population adjustment should be applied and which specific method is preferable in a given situation.
In Section 2, we establish the context and data requirements for populationadjusted indirect comparisons. In Section 3, MAIC and STC are presented. Section 4 describes a simulation study, which extensively evaluates the properties of these approaches under a variety of conditions. Section 5 presents results from the simulation study. An extended discussion of our findings and their implications is provided in Section 6. Finally, we make some concluding remarks in Section 7.
2 Context
HTA typically takes place late in the drug development process, after a new medical technology has obtained regulatory approval, typically based on a twoarm RCT that compares the new intervention to the placebo or standard or care. In this case, the question of interest is whether or not the drug is effective. In HTA, the relevant policy question is: “given that there are finite resources available to finance or reimburse health care, which is the best treatment of all available options in the market?”. In order to answer this question, one must evaluate the relative efficacy of interventions that may not have been trialed against each other.
Indirect treatment comparison methods are used when we wish to compare the relative effect of interventions and
for a specific outcome, but no headtohead trials are currently available. Typically, it is assumed that the comparison is undertaken using additive effects for a given linear predictor, e.g. log hazard ratio for timetoevent outcomes or logodds ratio for binary outcomes. Indirect comparisons are typically performed on this scale
[14, 8]. In addition, we assume that the comparison is “anchored”, i.e., a connected treatment network is available through a common comparator , e.g. placebo or standard of care. We note that comparisons can be unanchored, e.g. using singlearm trials or disconnected treatment networks, but this requires much stronger assumptions [34]. The NICE Decision Support Unit discourages the use of unanchored comparisons when there is connected evidence and labels these as problematic [34, 33]. Hence, we do not present the methodology behind these.A manufacturer submitting evidence for reimbursement to HTA bodies has access to patientlevel data from its own trial that compares its product against standard intervention . However, as disclosure of proprietary, confidential patientlevel data from industrysponsored clinical trials is rare, IPD for the competitor’s trial, comparing its treatment against , are, almost invariably, unavailable. We consider, without loss of generality, that IPD are available for a trial comparing intervention to intervention (denoted ) and published ALD are available for a trial comparing to ().
Standard methods for indirect comparisons such as the Bucher method [8], a special case of network metaanalysis, allow for the use of ALD and estimate the vs. treatment effect as , where is the relative treatment effect of vs. (observed in the population), and is the relative effect of vs. (in the population). The treatment effect
and its variance can be estimated from the available IPD. The treatment effect
and its variance may be directly published or derived from aggregate outcomes made available in the literature. As the indirect comparison is based on relative treatment effects observed in separate RCTs, the withintrial randomization of the originally assigned patient groups is preserved. The withintrial relative effects are statistically independent of each other; hence, their variances are simply summed (or derived empirically from a Bayesian analysis) to obtain the variance of the vs. treatment effect.Standard indirect comparisons assume that there are no crosstrial differences in the distribution of effectmodifying variables. That is, the relative treatment effect of vs. in the population (indicated as ) is assumed equivalent to the treatment effect that would occur in the population (denoted ) — throughout the paper the asterisk superscript indicates that the variable of interest, in this case the relative treatment effect observed in one population (), has been “mapped” into the effect that would have been observed in a different population, in this case .
Often, treatment effects are influenced by variables that interact with treatment on a specific scale (e.g. the linear predictor), altering the effect of treatment on outcomes. If these effect modifiers are distributed differently across and , relative treatment effects differ in the trial populations and the assumptions of the Bucher method are broken. In this case, a standard ITC between and is liable to bias and may produce overly precise efficacy estimates [45]. From the economic modelling point of view, these features are undesirable, as they impact negatively on the “probabilistic sensitivity analysis” [3], the (often mandatory) process used to characterise the impact of the uncertainty in the model inputs on the decisionmaking process.
The use of population adjustment in HTA, both in published literature as well as in submissions for reimbursement, and its acceptability by national HTA bodies, e.g. in England and Wales, Scotland, Canada and Australia [51], is increasing across diverse therapeutic areas [51, 53, 35, 29]. As of April 11, 2020, a search among titles, abstracts and keywords for “matchingadjusted indirect comparison” and “simulated treatment comparison” in Scopus, reveals at least 89 peerreviewed applications of MAIC and STC and conceptual papers about the methods. In addition, at least 30 technology appraisals (TAs) published by NICE use MAIC or STC — of these, 23 have been published since 2017. Figure 1 shows the rapid growth of peerreviewed publications and NICE TAs featuring MAIC or STC since the introduction of these methods in 2010. MAIC and STC are predominantly applied in the evaluation of cancer drugs, as 26 of the 30 NICE TAs using population adjustment have been in oncology.
3 Methodology
We shall assume that the following data are available for the th subject () in the trial, of size :

A treatment indicator . Without loss of generality, we assume here for simplicity that for the common comparator and active treatment, respectively;

An observed outcome , e.g. a timetoevent or binary indicator for some clinical measurement.
Given this information, we can estimate the vs. treatment effect in the population, which we indicate as , and derive its variance.
For the trial, data available are:

Published summary values
, e.g. proportions or means and standard deviations for the baseline characteristics. We shall assume that these are available for
covariates. 
Estimates of the vs. treatment effect in the population, , and its variance, either published directly or derived from aggregate outcomes from the literature.
The baseline characteristics can be classed as prognostic variables (covariates that affect outcome), effect modifiers (covariates that interact with treatment to affect outcome), both or none. For simplicity in the notation, it is assumed that all available baseline characteristics are prognostic of the outcome and that a subset of these, , are selected as effect modifiers on the linear predictor scale. Similarly, for the published summary values, . Appendix A of the Supplementary Material provides an indepth discussion of the assumptions made by population adjustment methods.
3.1 Matchingadjusted indirect comparison
Matchingadjusted indirect comparison (MAIC) is a population adjustment method based on inverse propensity score weighting [38]. IPD from the trial are weighted so that the means of specified covariates match those in the trial. The mean outcomes under treatment (for and , respectively) in the population are predicted as the weighted average:
where denotes the outcome for patient receiving treatment in the patientlevel data and represents the number of patients in arm of the trial. Each individual is assigned a weight , which represents the odds of being enrolled in the trial as opposed to being enrolled in the trial and is defined as a function of the effectmodifying baseline characteristics . Note that we have summary data from the trial to estimate absolute outcomes under . However, in this instance, we wish to use the predicted outcomes to generate a relative effect for vs. in the population.
Typically, the weights are estimated using a propensity score logistic regression model such that
. However, the regression parameters cannot be derived using conventional methods (e.g. maximumlikelihood), because IPD are not available for . Signorovitch et al. [42]propose using a method of moments to estimate the model parameters by setting the weights so that the mean covariates are exactly balanced across the two trial populations. One can also balance higher order terms, e.g. by including squared covariates in the method of moments to match variances. However, this decreases the degrees of freedom and may increase finite sample bias
[55]. Matching both means and variances (as opposed to means only) appears to result in more biased and less accurate treatment effect estimates when covariate variances differ across trials [32, 19].In the original MAIC approach, covariates are matched for active treatment and control arms combined and standard errors are computed using a robust sandwich estimator
[42, 54]. This estimator does not rely upon strong assumptions about the MAIC weights. It is empirically derived from the data and accounts for the uncertainty in the estimation of the weights rather than assuming that these are fixed and known. Proposed modifications to MAIC are based on entropy balancing [32, 5], performing the matching separately for active and control arms [32, 5], and using the bootstrap [16, 15] to compute standard errors [44].As MAIC is a reweighting procedure, it will reduce the effective sample size (ESS) of the trial. The approximate ESS of the trial is estimated as , where is the weight estimated for individual using the method of moments. For relative effects to be conditionally constant and eventually produce an unbiased indirect comparison, one needs to include all effect modifiers in the weighting procedure. However, including too many covariates or large covariate imbalance (poor overlap in the covariate distributions) can lead to extreme weights and give treatment effect estimates with poor precision [37]. In general, this is a pervasive problem in NICE TAs as most of the reported ESSs are small with a large percentage reduction from the original sample size [35].
3.2 Simulated treatment comparison
While MAIC is a reweighting method, simulated treatment comparison (STC) is based on regression adjustment [56]. IPD from the trial are used to fit a regression model between the outcome and the baseline characteristics. In STC, the following linear predictor is fitted to the IPD:
(1) 
where is the expected outcome (in the population) of subject ,
is an appropriate link function (e.g. logit for binary outcomes),
is the intercept, is a vector of regression coefficients for the prognostic variables, is a vector of interaction coefficients for the effect modifiers and is the vs. treatment coefficient. The prognostic variables and effect modifiers are centered at the published mean values from the population, and , respectively. Hence, the estimated is directly interpreted as the vs. treatment effect in the population.Similarly to MAIC, for relative effects to be conditionally constant, one needs to include all effect modifiers in the model. However, it is optional to include (and to center) variables that are purely prognostic. As with any regression adjustment method, STC requires a strong fit of the outcome model. NICE guidance [33] suggests adding purely prognostic variables if they increase the precision of the model and account for more of its underlying variance, as reported by model selection criteria (e.g. residual deviance or information criteria). However, such tools should not guide decisions on effect modifier status, which must be defined prior to fitting the outcome model. As effectmodifying covariates are likely to be good predictors of outcome, the inclusion of appropriate effect modifiers should provide an acceptable fit.
A major issue with the implementation of STC presented above is that it is systematically biased in the case of nonlinear outcomes [21]
, e.g. binary outcomes modelled using logistic regression or timetoevent outcomes. In linear models, the arithmetic mean of the predicted outcome (the expected outcome for patients sampled under the centered covariates) coincides with its geometric mean (the outcome evaluated at the expectation of the centered covariates). However, this is not the case for models that are nonlinear. Initially, this hindered the widespread adoption of STC as most applications of population adjustment in HTA are in oncology and are typically concerned with survival outcomes, or rate outcomes such as response rates.
To address nonlinearity bias, Ishak et al. [21, 22] propose simulating individualized profiles based on the summary values of the covariates and the covariance matrix observed in the trial. Ideally, the
population should be characterized by the full joint distribution of covariates. However, the restriction of limited IPD makes it unlikely that the joint distribution of the
covariates is available. Such distribution must be approximated to avoid bias arising from the incomplete specification of thepopulation. Ishak et al. do so by simulating continuous covariates at the individual level from a multivariate normal distribution with the
means and correlation structure.In this simulationbased approach, a regression of the outcome on the covariates is fitted to the IPD. This regression is similar to that in Equation 1, although now the prognostic variables and effect modifiers are not centered at the mean covariates,
(2) 
is the (unadjusted) expected outcome of subject and represents the baseline vs. treatment effect (when the values of the prognostic variables and effect modifiers are zero). The coefficients of this regression are applied to the covariates simulated for each patient , including the effect modifiers , to produce individual linear predictors (e.g. the predicted log hazard or logodds) in the population:
where , , and are the regression parameters estimated in Equation 2. The individual linear predictors under each treatment (for and respectively) and their variances are averaged such that the vs. treatment effect is estimated as , and a measure of the variance is provided by , where denotes the fitted standard error of the linear prediction. See [11, 36, 52] for recent applications of this methodology.
4 Simulation Study
4.1 Aims
The objectives of the simulation study are to compare MAIC, STC and the Bucher method across a wide range of scenarios that may be encountered in practice. For each estimator, we assess the following properties [28]: (1) unbiasedness; (2) variance unbiasedness; (3) randomization validity;^{1}^{1}1In a sufficiently large number of repetitions,
% confidence intervals based on normal distributions should contain the true value
% of the time, for a nominal significance level . and (4) precision and efficiency. The selected performance measures evaluate these criteria specifically (see 4.5). The simulation study is reported following the ADEMP (Aims, Datagenerating mechanisms, Estimands, Methods, Performance measures) structure [28]. All simulations and analyses were performed using R software version 3.6.3 [50].^{2}^{2}2The files required to run the simulations are available at http://github.com/remiroazocar/population_adjustment_simstudy. Appendix B of the Supplementary Material presents a more detailed description of the simulation study design and Appendix C lists the specific settings of each simulation scenario.4.2 Datagenerating mechanisms
As most applications of MAIC and STC are in oncology, the most prevalent outcome types are survival or timetoevent outcomes (e.g. overall or progressionfree survival) [35]. Hence we consider these using the log hazard ratio as the measure of effect.
Following Bender et al. [6], we simulate CoxWeibull survival times for subject for trials and according to the formula:
(3) 
where
is a uniformly distributed random variable,
. We set the inverse scale of the Weibull distribution to and the shape to as these parameters produce a functional form and treatment effect reflecting frequently observed mortality trends in metastatic cancer patients [19]. Four correlated or uncorrelated binary covariates are generated per subject, following Leisch et al. [25]. Two of these are purely prognostic variables; the other two () are effect modifiers and prognostic variables.We introduce random right censoring to simulate loss to followup within each trial. Censoring times
are generated from the exponential distribution
, where the rate parameter is selected to achieve a censoring rate of 35% under the active treatment at baseline (with the values of the prognostic variables and effect modifiers set to zero), considered moderate censoring [1]. We fix the value of before generating the datasets, by simulating survival times for 1,000,000 subjects with Equation 3 and using the R function optim (Brent’s method [7]) to minimize the difference between the observed and targeted censoring proportion.For the trial, the individuallevel covariates and outcomes are aggregated to obtain summaries. The treatment effect and its variance are estimated through a Cox proportional hazards regression.
The simulation study examines five factors in a fully factorial arrangement with scenarios to explore the interaction between factors. The simulation scenarios are defined by the values of the following parameters:

The number of patients in the trial, under a 1:1 active intervention vs. control allocation ratio. The sample sizes correspond to typical values for a Phase III RCT [46].

The strength of the association between the prognostic variables and the outcome, (moderate, strong and very strong prognostic variable effect). These regression coefficients correspond to fixing the conditional hazard ratios for the effect of each prognostic variable at 1.5, 2 and 3, respectively.

The strength of interaction of the effect modifiers, (moderate, strong and very strong interaction effect). These parameters have a material impact on the vs. treatment effect, biasing the log hazard ratio by magnitudes of 0.24, 0.42 and 0.67, respectively, when covariate overlap is moderate (see below). Hence, population adjustment is warranted in order to remove the induced bias.

The level of correlation between covariates, (no correlation and moderate correlation).

The degree of covariate imbalance, with for the trial (while, for the trial, simultaneously varying , respectively). This yields strong, moderate and poor covariate overlap, with crosstrial differences in proportions of 0.2, 0.3 and 0.4, respectively.
Each active intervention has a very strong baseline treatment effect versus the common comparator. The number of subjects in the trial is 600, under a 1:1 active treatment vs. control allocation ratio. The prognostic variables and effect modifiers may represent comorbidities, which are associated with shorter survival and, in the case of the effect modifiers, which interact with treatment to render it less effective.
4.3 Estimands
The estimand of interest is the true marginal vs. treatment effect. As the true relative effect is identical for both vs. and vs. , the true vs. treatment effect . We generate and analyze 1,000 Monte Carlo replicates of trial data per simulation scenario. Let denote the estimator (the adjusted vs. treatment effect, in the population) for the th Monte Carlo replicate and let denote its mean across the 1,000 simulations.
4.4 Methods
Each simulated dataset is analyzed using the following methods:

Matchingadjusted indirect comparison, as originally proposed by Signorovitch et al. [42], where covariates are matched for active treatment and control arms combined and weights are estimated using the method of moments. A weighted Cox proportional hazards model is fitted to the IPD using the R package survey [27], with the weights specified as sampling weights of a survey design. Standard errors for the vs. treatment effect are computed using a robust sandwichtype variance estimator based on Taylor series linearization [27].

Simulated treatment comparison: a Cox proportional hazards regression on survival time is fitted to the IPD, including all effect modifiers and prognostic variables. covariates are simulated from a multivariate Gaussian copula [30]
, using Bernoullidistributed marginals with
means, and the pairwise correlations of the IPD. We simulate covariates for 100,000 patients, a value assumed large enough to minimize the variability in sampling from the aggregate values, and keep the same allocation ratio of the original trial. The vs. treatment effect (in the population) is derived from the log hazards predicted for the simulated patients. The original “plugin” approach to STC, where a regression is fitted with the IPD covariates centered at the mean values, is evaluated in the Supplementary Material. We choose to include all of the prognostic variables in the regression but only center the effect modifiers. 
The Bucher method [8] gives the standard indirect comparison. We know that this will be biased as it does not adjust for the bias induced by the imbalance in effect modifiers.
In all methods, the variances of each withintrial relative effect are summed to estimate the variance of the vs. treatment effect, . Confidence intervals are constructed using normal distributions: , assuming relatively large .
4.5 Performance measures
The following criteria are considered jointly to assess the methods’ performances:

To assess aim 1, we compute the bias in the estimated treatment effect
As , the bias is equal to the average treatment effect across the simulations.

To assess aim 2, we calculate the variability ratio of the treatment effect estimate, defined [26] as the ratio of the average standard error and the observed standard deviation (empirical standard error):
suggests that, on average, standard errors overestimate (or underestimate) the variability of the treatment effect estimate.

Aim 3 is assessed using the coverage of confidence intervals, estimated as the proportion of times that the true treatment effect is enclosed in the confidence interval of the estimated treatment effect, where is the nominal significance level.

We use empirical standard error (ESE) to assess aim 4 as it measures the precision or longrun variability of the treatment effect estimate:

The mean square error (MSE) of the estimated treatment effect
provides a summary value of overall accuracy (efficiency), integrating elements both bias (aim 1) and variability (aim 4).
5 Results
The performance measures across all 162 simulation scenarios are illustrated in Figures 2 to 6 using nested loop plots [39], which arrange all scenarios into a lexicographical order, looping through nested factors. In the nested sequence of loops, we consider first the parameters with the largest perceived influence on the performance metric. Notice that a different order is selected for bias than for the other performance measures. Monte Carlo standard errors quantifying the simulation uncertainty of each performance metric are reported in Appendix D of the Supplementary Material. Additional performance measures are considered in Appendix E of the Supplementary Material.
5.1 Unbiasedness of treatment effect
Figure 2 shows the bias for the methods across all scenarios. In a review of missing data methods, Schafer and Graham [40] consider bias to be troublesome if its absolute size is greater than about one half of the estimate’s empirical standard error. Under this rule of thumb, MAIC does not produce problematic biases in any of the simulation scenarios. On the other hand, STC and the Bucher method generate problematic biases in 101 of 162 scenarios, and in all 162 scenarios, respectively. The application of this rule is somewhat arbitrary as the empirical standard error is dependent on several factors such as the number of simulations, also set to 1,000 by Schafer and Graham. Most importantly, the impact of the biases will depend on the uncertainty in the estimated treatment effect [40, 9]. We consider standardizing the biases [9] by computing these as a percentage of the empirical standard error in order to assess their impact.
In the missing data literature, standardized biases of magnitude greater than 40% have been observed to considerably affect the precision, coverage and error rates of estimates. For MAIC, no standardized biases are above such threshold in either direction, the maximum absolute value being 14.7% in a simulation scenario with very strong prognostic and interaction effects. For STC, the magnitude of the standardized biases is above the 40% threshold in 54 of 162 scenarios. The biases in MAIC do not appear to have any practical significance, as they do not degrade coverage and efficiency. MAIC is the least biased method, followed by STC and the Bucher method.
For STC, we adopted a “covariate simulation” approach, proposed by Ishak et al. [21] to avoid the nonlinearity bias arising when the mean effect modifiers are substituted in Equation 1. However, we find that simulating the covariates at the individual level leads to biases that are virtually identical to those of the “plugin” approach (see Appendix D and Appendix E of the Supplementary Material). Both implementations display the same degree of systematic bias arising from the nonlinear link function in the Cox regression. In some cases, e.g. under very strong prognostic variable effects and moderate effectmodifying interactions, STC even has increased bias compared to the Bucher method.
In the scenarios considered in this simulation study, STC produces negative bias when the interaction effects are moderate and positive bias when they are very strong. In addition, biases vary more widely when prognostic effects are larger. When interaction effects are weaker, stronger prognostic effects shift the bias negatively. It is worth noting that conclusions arising from the interpretation of patterns in Figure 2 for STC may be byproducts of the nonlinearity bias and should be interpreted with caution.
As expected, the strength of interaction effects is an important driver of bias in the Bucher method and the incurred bias increases with poorer covariate overlap. This is because the more substantial the imbalance in effect modifiers and the greater their interaction with treatment, the larger the bias of the unadjusted comparison. The impact of these factors on the bias appears to be slightly reduced when prognostic effects are stronger and contribute more “explanatory power” to the outcome. Varying the number of patients in the trial does not seem to have any discernible impact on the bias for any method. Biases in MAIC seem to be unaffected when varying the degree of covariate overlap.
5.2 Unbiasedness of variance of treatment effect
In MAIC and the Bucher method, the variability ratio of treatment effect estimates is close to one under all simulation scenarios (Figure 3). This suggests that standard error estimates for the methods are unbiased.
In STC, variability ratios are well above one in all 162 scenarios. Hence, there is a consistent overestimation of the variability of the treatment effect estimate. This overestimation can be attributed to the “covariate simulation” approach, where adjusted vs. treatment effects are derived by averaging the predicted log hazards for individual patients. The fitted standard error of each individual linear prediction is the standard error of its estimated mean value at a given value of the predictors. If this error is averaged over the simulated patients, it represents the expected error in the mean linear predictor for a random patient. However, the treatment effect and its error should be taken averaging over the whole population, characterized by the approximate joint distribution of its covariates, as opposed to averaging predictions for a randomly chosen individual.
Variability ratios for the original “plugin” approach to STC are very close to one across all scenarios, suggesting that any bias in the estimated variances is negligible. The covariate simulation artificially inflates the uncertainty of the estimated treatment effect, without addressing the nonlinearity bias of the plugin approach. The overstated uncertainty is an important issue, as it will be propagated through the costeffectiveness analysis and may lead to inappropriate decisionmaking [13].
5.3 Randomization validity
Figure 4
shows the empirical coverage rates for the methods across all scenarios. The empirical coverage rate should be approximately equal to the nominal coverage rate, in this case 0.95 for 95% confidence intervals, to obtain appropriate type I error rates for testing a “no effect” null hypothesis. Theoretically, the empirical coverage rate is statistically significantly different to 0.95 if, roughly, it is less than 0.9365 or more than 0.9635, assuming 1,000 independent simulations per scenario. These values differ by approximately two standard errors from the nominal coverage rate.
In general, empirical coverage rates for MAIC are not significantly different from the advertised nominal coverage rate; only three scenarios have a rate below 0.9365 and eleven empirical coverage rates are above 0.9635. Almost all empirical coverage rates appear to be appropriate under MAIC: generally, these are within simulation error of 95% and never fall below 90%, i.e., never at least double the nominal rate of error. As mentioned in subsection 5.1, any bias does not appear to degrade the coverage rates in this method.
On the other hand, overcoverage is a pervasive problem in STC, for which 145 of 162 scenarios have empirical coverage rates above 0.9635. Poor coverage rates are a decomposition of both the bias and the standard error used to compute the width of the confidence intervals. In the case of STC, overcoverage is attributed to the additional variation introduced by covariate simualtion discussed in subsection 5.2. The coverage rates only fall under the most important determinants of bias, e.g. moderate effectmodifying interactions and very strong prognostic variable effects. Under these conditions, the bias of the STC treatment effect is high enough to shift the coverage rates negatively and may pull these closer to 0.95. However, this is a result of compounding two problems: high bias and incorrectly constructed interval estimates. In the “plugin” approach to STC, variances of the treatment effect are generally unbiased and poor coverage is exclusively undercoverage, induced by the bias in the treatment effect and not by the construction of the confidence intervals.
From a frequentist viewpoint [31], confidence intervals and values from the Bucher method are not confidence valid, i.e., the 95% confidence intervals are not guaranteed to include the true treatment effect at least 95% of the time, for all scenarios. Coverage rates deteriorate markedly under the most important determinants of bias. When there is less overlap between covariate distributions and when interaction effects are stronger, the induced bias is larger and coverage rates are degraded. Under very strong interactions with treatment, empirical coverage may drop below 50%. Therefore, the Bucher method will incorrectly detect significant results a large proportion of times in these scenarios. Such overconfidence will lead to very high type I error rates for testing a “no effect” null hypothesis. This is problematic as indirect comparisons are typically used in the context of decisionmaking as opposed to prediction.
5.4 Precision and efficiency
Several trends are revealed upon visual inspection of the empirical standard error across scenarios (Figure 5). As expected, the ESE decreases for all methods (i.e., the estimate is more precise) as the number of subjects in the trial increases. The strengths of interaction effects and of prognostic variable effects appear to have a negligible impact on the precision of population adjustment methods.
The degree of covariate imbalance has an important influence on the ESE and population adjustment methods incur losses of precision when covariate overlap is poor. This is expected as regression adjustment methods such as STC require greater extrapolation when the covariate imbalance is larger [33]. In reweighting methods such as MAIC, extrapolation is not even possible. Some observations in the patientlevel data are assigned extreme weights when covariate overlap is poor. Effective sample sizes are reduced as a relatively small number of observations dominate the reweighted sample and precision deteriorates. In MAIC, the presence of correlation mitigates the effect of increasing covariate imbalance on a consistent basis. ESE for the Bucher method does not vary across different degrees of covariate overlap, as these are not considered by the method, and overprecise estimates are produced.
Figure 6 is inspected in order to explore patterns in the mean square error, with a particular focus on MAIC. Contrary to ESE, MSE also takes into account the true value of the estimand as it incorporates the bias. Hence, main drivers of bias and ESE are generally key properties for MSE. For instance, estimates are less accurate for MAIC when prognostic variable effects are stronger, sample sizes are smaller and covariate overlap is poorer. As bias is negligible for MAIC, precision is the driver of accuracy. On the contrary, as the Bucher method is systematically biased and overprecise, the driver of accuracy is bias. Poor accuracy in STC is driven by both bias and lack of precision, particularly under low sample sizes and strong prognostic variable effects. STC was consistently less accurate than MAIC, with larger mean square errors in all simulation scenarios. In some cases where the STC bias was strong, e.g. very strong prognostic variable effects and moderate effectmodifying interactions, STC even increased the MSE compared to the Bucher method. Both implementations of STC produced virtually identical ESEs and MSEs.
In accordance with the trends observed for the ESE, the MSE is also very sensitive to the value of and decreases for all methods as increases. We highlight that the number of subjects in the trial (not varied in this simulation study) is a less important performance driver than the number of subjects in ; while it contributes to sampling variability, the reweighting or regressions are performed in the patientlevel data.
6 Discussion
In this section, we discuss the implications of, and recommendations for, performing population adjustment, based on the simulation study. We focus on MAIC as a population adjustment method, given the systematic bias and potentially underestimated precision in STC. Finally, we highlight potential limitations of the simulation study, primarily relating to the extrapolation of its results to practical guidance.
Before performing population adjustment, it is important to assess the magnitude of the bias induced by effect modifier imbalances. Such bias depends on the degree of covariate overlap and on the strength of interaction effects, i.e., the effect modifier status of the covariates. The combination of these two factors determines the level of bias reduction that would be achieved with population adjustment.
We know that the accuracy or efficiency of an estimate, as measured by the mean square error, is a composite of unbiasedness and precision. Inevitably, due to biasvariance tradeoffs, the increase in variability that we are willing to accept with population adjustment depends on the magnitude of the bias that would be corrected. Such variability is largely driven by the degree of covariate overlap and by the sample size. Hence, while the potential extent of bias correction increases with greater covariate imbalance, so does the potential imprecision of the treatment effect estimate.
In our simulation study, this tradeoff always favours the bias correction offered by MAIC over the precision of the Bucher method, implying that the reductions in ESS based on unstable weights are merited, even under lower covariate imbalance. Across scenarios, the relative accuracy of MAIC with respect to that of the Bucher method improves under greater degrees of covariate imbalance. It is worth noting that, even in scenarios where the Bucher method is relatively accurate, it is still a flawed method in the context of decisionmaking due to overprecision and undercoverage.
The magnitude of the bias that would be corrected with population adjustment also depends on the strength of interaction effects, i.e., the effect modifier status of the covariates. In the simulation study, the lowest effectmodifying interaction coefficient was . Given two effect modifiers, it induced biases in the treatment effect of relatively low magnitudes: 0.16, 0.24 and 0.32, for strong, moderate and poor covariate overlap, respectively. Despite this, MAIC was consistently more efficient than the Bucher method in these scenarios. Larger interaction effects warrant greater bias reduction but do not degrade the precision of the populationadjusted estimate. Hence, the relative accuracy of MAIC with respect to the Bucher method improves further as the effectmodifying coefficients increase.
In the simulation study, we know that population adjustment is required as we set the crosstrial imbalances between covariates and have specified some of these as effect modifiers. Most applications of population adjustment present evidence of the former, e.g. through tables of baseline characteristics with covariate means and proportions. However, quantitative evidence justifying the effect modifier status of the selected covariates is rarely brought forward. Presenting this type of supporting evidence is of capital importance when justifying the use of population adjustment.
Typically, the selection of effect modifiers is supported by clinical expert opinion. However, clinical expert judgment and subjectmatter knowledge are fallible when determining effect modifier status because (1) the therapies being evaluated are often novel, and (2) effect modifier status is scalespecific; clinical experts may not have the mathematical intuition to assess whether covariates are effect modifiers on the linear predictor scale (as opposed to the natural outcome scale).
Therefore, applications of population adjustment often balance all available covariates on the grounds of expert opinion, probably because the clinical experts cannot rule out interactions with treatment for any of the baseline characteristics. Presenting quantitative evidence along with clinical expert opinion would help establish whether population adjustment is necessary. As proposed by Phillippo et al.
[34], we encourage the analyst to fit regression models with interaction terms to the IPD for an exploratory assessment of effect modifier status.It is worth noting that the conclusions of this simulation study are dependent on the outcome and model type. We have considered survival outcomes and Cox proportional hazards models, as these are the most prevalent outcome and modelling framework in MAIC and STC applications. However, further simulation studies are required with alternative outcomes and models. For example, exploratory simulations with binary outcomes and logistic regression have found that the precision of MAIC is more affected by low effective sample sizes than seen for survival outcomes.
Furthermore, we highlight that we have only considered and adjusted for two effect modifiers that induce bias in the same direction. In real applications of population adjustment, it is not uncommon to see more than 10 being balanced [35]. As this simulation study considered percentage reductions in effective sample size for MAIC that are representative of scenarios encountered in NICE TAs (see Appendix B of the Supplementary Material), real applications will have imbalances for each individual covariate that are smaller than those considered in this study. In addition, the effectmodifying interactions for each covariate could be in opposite directions and thus, the induced biases could cancel out.
This simulation study also assumed that there is no model misspecification. For instance, it assumed that 1) complete information on effect modifiers is available from both trials and that all effect modifiers have been accounted for, and 2) effect modifiers interact with treatment in the same way in trial and trial .
In practice, the first assumption may be problematic because it is difficult to ascertain the effect modifier status of covariates, particularly for new treatments with limited prior empirical evidence and clinical domain knowledge. Hence, the analyst may select the effect modifiers incorrectly. In addition, effect modifier status could be determined on the wrong scale or information on some effect modifiers could be unavailable or unpublished for one of the trials. The second assumption may also be hard to meet if the competing interventions do not belong to the same class, and have dissimilar mechanisms of action or clinical properties.
Another source of misspecification in population adjustment is the incorrect specification of the joint distribution of covariates. In MAIC, it is implicitly assumed that correlations between covariates are identical across trials or that the joint distribution of the baseline characteristics is the product of the published marginal distributions [33]. In the “covariate simulation” approach to STC, assumptions are more explicit: it is assumed that (1) the joint distribution of the covariates is specified correctly, by the combination of the specified marginal distributions and correlation structure, and (2) that the pairwise correlations and the form of the covariates’ marginal distributions are identical across trials.
More generally, the simulation study uses the correct datagenerating mechanism, i.e., it is known that there is a linear relationship between the effects and the log hazard ratio, and that the covariates have Bernoullidistributed marginals. In real applications, the datagenerating process is not known.
Regression adjustment methods such as STC, which specify an outcome model explicitly, are prone to this type of misspecification. However, there is also an implicit outcome model in MAIC; treatment and effect modifiers are assumed to be additive on the log hazard ratio scale. Hence, all population adjustment methods are susceptible to bias when effect modifiers have a nonlinear effect on the log hazard, e.g. age in cardiovascular disease. Future simulation studies should assess the robustness of population adjustment to failures in assumptions under different degrees of data availability and model misspecification.
Finally, it is worth noting that, while this article focuses on anchored indirect comparisons, most applications of population adjustment in HTA are in the unanchored setting [35]. We stress that RCTs deliver the gold standard for evidence on effectiveness and that unanchored comparisons make very strong assumptions which are largely considered impossible to meet (absolute effects are conditionally constant as opposed to relative effects being conditionally constant) [34, 33]. Unanchored comparisons effectively assume that absolute outcomes can be predicted from the covariates and that all effect modifiers and prognostic variables are accounted for.
However, the number of unanchored comparisons is likely to continue growing as regulators are, increasingly and particularly in oncology, approving new treatments on the basis of singlearmed evidence or disconnected networks [20, 4]. As pharmaceutical companies use this type of evidence to an increasing extent to obtain regulatory approval, reimbursement agencies will, in turn, be increasingly asked to evaluate interventions where only this type of evidence is available. Therefore, further examinations of the performance of population adjustment methods must be performed in the unanchored setting.
7 Concluding remarks
In the performance measures we considered, MAIC was the least biased, most accurate and only randomizationvalid method. We therefore recommend its use for survival outcomes. STC was systematically biased and should be avoided in settings with a nonlinear link function. The Bucher method is systematically biased and overprecise when there are imbalances in effect modifiers and interaction effects that induce bias in the treatment effect. Future simulation studies should assess population adjustment methods with different outcome types and under model misspecification.
Acknowledgments
The authors thank Anthony Hatswell for discussions that contributed to the quality of the manuscript and acknowledge Andreas Karabis for his advice and expertise in MAIC. This article is based on research supported by Antonio RemiroAzócar’s PhD scholarship from the Engineering and Physical Sciences Research Council of the United Kingdom. Gianluca Baio is partially funded by a research grant sponsored by Mapi/ICON at University College London. Anna Heath was funded through an Innovative Clinical Trials Multiyear Grant from the Canadian Institutes of Health Research (funding reference number MYG151207; 2017  2020).
Financial disclosure
Funding agreements ensure the authors’ independence in designing the simulation study, interpreting the results, writing, and publishing the article.
Conflict of interest
The authors declare no potential conflict of interests.
References
 [1] (2004) Bias due to aggregation of individual covariates in the cox regression model. American journal of epidemiology 160 (7), pp. 696–706. Cited by: §4.2.
 [2] (2011) An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research 46 (3), pp. 399–424. Cited by: §1.
 [3] (2015) Probabilistic sensitivity analysis in health economics. Statistical methods in medical research 24 (6), pp. 615–634. Cited by: §2.
 [4] (2018) A 25year experience of us food and drug administration accelerated approval of malignant hematology and oncology drugs and biologics: a review. JAMA oncology 4 (6), pp. 849–856. Cited by: §6.
 [5] (2015) Inclusion of multiple studies in matching adjusted indirect comparisons (maic). Value in Health 18 (3), pp. A33. Cited by: §1, §3.1.
 [6] (2005) Generating survival times to simulate cox proportional hazards models. Statistics in medicine 24 (11), pp. 1713–1723. Cited by: §4.2.
 [7] (1971) An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal 14 (4), pp. 422–425. Cited by: §4.2.
 [8] (1997) The results of direct and indirect treatment comparisons in metaanalysis of randomized controlled trials. Journal of clinical epidemiology 50 (6), pp. 683–691. Cited by: §1, §2, §2, 3rd item.
 [9] (2006) The design of simulation studies in medical statistics. Statistics in medicine 25 (24), pp. 4279–4292. Cited by: §5.1.
 [10] (2010) No headtohead trial? simulate the missing arms. Pharmacoeconomics 28 (10), pp. 957–967. Cited by: §1.
 [11] (2019) Comparative effectiveness of nivolumab versus clinical practice for advanced gastric or gastroesophageal junction cancer. Journal of Comparative Effectiveness Research (00). Cited by: §3.2.
 [12] (2019) The statistical performance of matchingadjusted indirect comparisons. arXiv preprint arXiv:1910.06449. Cited by: §1.
 [13] (2005) Probabilistic sensitivity analysis for nice technology assessment: not an optional extra. Health economics 14 (4), pp. 339–347. Cited by: §5.2.
 [14] (2013) Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network metaanalysis of randomized controlled trials. Medical Decision Making 33 (5), pp. 607–617. Cited by: §1, §1, §2.
 [15] (1994) An introduction to the bootstrap. CRC press. Cited by: §3.1.
 [16] (1992) Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp. 569–593. Cited by: §3.1.
 [17] (2015) NICE dsu technical support document 17: the use of observational data to inform estimates of treatment effectiveness for technology appraisal: methods for comparative individual patient data. Sheffield: NICE Decision Support Unit. Cited by: §1.
 [18] (2005) Indirect comparisons of competing interventions. Cited by: §1.
 [19] (2020) The effects of model misspecification in unanchored matching adjusted indirect comparison (maic); results of a simulation study. Value in Health Accepted for publication. Cited by: §1, §3.1, §4.2.
 [20] (2016) Regulatory approval of pharmaceuticals without a randomised controlled study: analysis of ema and fda approvals 1999–2014. BMJ open 6 (6). Cited by: §6.
 [21] (2015) Simulation and matchingbased approaches for indirect comparison of treatments. Pharmacoeconomics 33 (6), pp. 537–549. Cited by: §1, §3.2, §3.2, §5.1.
 [22] (2015) Simulated treatment comparison of timetoevent (and other nonlinear) outcomes. Value in Health 18 (7), pp. A719. Cited by: §3.2.
 [23] (2017) Evaluation of adjusted and unadjusted indirect comparison methods in benefit assessment. Methods of information in medicine 56 (03), pp. 261–267. Cited by: §1.

[24]
(2019)
Assessing the impact of a matchingadjusted indirect comparison in a bayesian network metaanalysis
. Research synthesis methods. Cited by: §1.  [25] (1998) On the generation of correlated artificial binary data. Cited by: §4.2.
 [26] (2014) Propensity score methods for estimating relative risks in cluster randomized trials with lowincidence binary outcomes and selection bias. Statistics in medicine 33 (20), pp. 3556–3575. Cited by: 2nd item.
 [27] (2004) Analysis of complex survey samples. J Stat Softw 9 (1), pp. 1–19. Cited by: 1st item.
 [28] (2019) Using simulation studies to evaluate statistical methods. Statistics in medicine 38 (11), pp. 2074–2102. Cited by: §4.1.
 [29] (2016) Trends in the use of matchingadjusted indirect comparisons in published literature and nice technology assessments: a systematic review. Value in Health 19 (3), pp. A99–A100. Cited by: §2.
 [30] (2007) An introduction to copulas. Springer Science & Business Media. Cited by: 2nd item.
 [31] (1934) On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society 97 (4), pp. 558–625. Cited by: §5.3.
 [32] (2019) Alternative weighting approaches for anchored matchingadjusted indirect comparisons via a common comparator. Value in Health 22 (1), pp. 85–91. Cited by: §1, §3.1, §3.1.
 [33] (2016) NICE dsu technical support document 18: methods for populationadjusted indirect comparisons in submissions to nice. Cited by: §1, §2, §3.2, §5.4, §6, §6.
 [34] (2018) Methods for populationadjusted indirect comparisons in health technology appraisal. Medical Decision Making 38 (2), pp. 200–211. Cited by: §1, §1, §2, §6, §6.
 [35] (2019) Population adjustment methods for indirect comparisons: a review of national institute for health and care excellence technology appraisals. International journal of technology assessment in health care, pp. 1–8. Cited by: §1, §2, §3.1, §4.2, §6, §6.
 [36] (2019) Indirect treatment comparison of inotuzumab ozogamicin versus blinatumomab for relapsed or refractory acute lymphoblastic leukemia. Advances in therapy 36 (8), pp. 2147–2160. Cited by: §3.2.
 [37] (2012) Evaluating treatment effectiveness in patient subgroups: a comparison of propensity score methods with an automated matching approach. The international journal of biostatistics 8 (1). Cited by: §3.1.
 [38] (1987) Modelbased direct adjustment. Journal of the American Statistical Association 82 (398), pp. 387–394. Cited by: §3.1.
 [39] (2014) Presenting simulation results in a nested loop plot. BMC medical research methodology 14 (1), pp. 129. Cited by: §5.
 [40] (2002) Missing data: our view of the state of the art.. Psychological methods 7 (2), pp. 147. Cited by: §5.1.
 [41] (2012) Matchingadjusted indirect comparisons: a new tool for timely comparative effectiveness research. Value in Health 15 (6), pp. 940–947. Cited by: §1.
 [42] (2010) Comparative effectiveness without headtohead trials. Pharmacoeconomics 28 (10), pp. 935–945. Cited by: §1, §3.1, §3.1, 1st item.
 [43] (2012) Comparative effectiveness research using matchingadjusted indirect comparison: an application to treatment with guanfacine extended release or atomoxetine in children with attentiondeficit/hyperactivity disorder and comorbid oppositional defiant disorder. pharmacoepidemiology and drug safety 21, pp. 130–137. Cited by: §1.
 [44] (2013) Comparative efficacy of guanfacine extended release versus atomoxetine for the treatment of attentiondeficit/hyperactivity disorder in children and adolescents: applying matchingadjusted indirect comparison methodology. CNS drugs 27 (11), pp. 943–953. Cited by: §3.1.
 [45] (2003) Validity of indirect comparison for estimating efficacy of competing interventions: empirical evidence from published metaanalyses. Bmj 326 (7387), pp. 472. Cited by: §2.
 [46] (2007) Design of randomized controlled trials. Circulation 115 (9), pp. 1164–1169. Cited by: 1st item.
 [47] (2018) A review of methods for comparing treatments evaluated in studies that form disconnected networks of evidence. Research synthesis methods 9 (2), pp. 148–162. Cited by: §1.
 [48] (2002) To ipd or not to ipd? advantages and disadvantages of systematic reviews using individual patient data. Evaluation & the health professions 25 (1), pp. 76–97. Cited by: §1.
 [49] (2008) Use of indirect and mixed treatment comparisons for technology assessment. Pharmacoeconomics 26 (9), pp. 753–767. Cited by: §1.
 [50] (2013) R: a language and environment for statistical computing. Cited by: §4.1.
 [51] (2016) Matching adjusted indirect comparisons to assess comparative effectiveness of therapies: usage in scientific literature and health technology appraisals. Value in Health 19 (3), pp. A100–A101. Cited by: §1, §2.
 [52] (2019) Overall survival of glasdegib in combination with lowdose cytarabine, azacitidine, and decitabine among adult patients with previously untreated aml: comparative effectiveness using simulated treatment comparisons. ClinicoEconomics and Outcomes Research: CEOR 11, pp. 551. Cited by: §3.2.
 [53] (2016) A scoping review of indirect comparison methods and applications using individual patient data. BMC medical research methodology 16 (1), pp. 47. Cited by: §2.
 [54] (1980) A heteroskedasticityconsistent covariance matrix estimator and a direct test for heteroskedasticity. econometrica 48 (4), pp. 817–838. Cited by: §3.1.
 [55] (2005) A finite sample correction for the variance of linear efficient twostep gmm estimators. Journal of econometrics 126 (1), pp. 25–51. Cited by: §3.1.
 [56] (2009) Covariateadjusted putative placebo analysis in activecontrolled clinical trials. Statistics in Biopharmaceutical Research 1 (3), pp. 279–290. Cited by: §3.2.
Comments
There are no comments yet.