1 Study designs for transporting trial findings
The most natural study design for transporting inferences from trial participants to a new target population is a trial nested within a cohort of eligible individuals, including those who refuse to participate in the randomized component of the study. In this design, investigators collect baseline covariate information from all cohort members, but often collect treatment and outcome information only from randomized individuals. For example, this setup occurs in comprehensive cohort studies [16] and in trials embedded in healthcare systems where data are routinely collected from all members of the system [17].
Another design for transporting inferences to a new target population uses an artificial composite dataset created by appending the data from a completed trial to a separately obtained sample from the target population [6]. As in the trial-nested-within-cohort design, baseline covariate information is available from all patients, but treatment and outcome information is only available from randomized participants (e.g., when the interventions examined in the trial are not commercially available and no observational data can be collected). This setup arises often in drug development and regulatory settings, because exposure and outcome data are available only from a small number of clinical trial participants prior to drug approval, but baseline covariate data can be collected from large samples of untreated individuals who would be eligible for participation in the trials. The setup also arises in public policy research, when randomized trials are conducted in selected samples but target population data are available from administrative databases or surveys.
The methods we describe in this tutorial can be used both with trials nested within cohorts of eligible individuals and with composite datasets.
2 Causal contrasts
The data generated from the designs described in the previous section consist of independent observations, indexed by $i = 1, \ldots, n$, on baseline covariates $X$; treatment $A$; outcome $Y$; and a trial participation indicator $S$ that takes the value 1 for trial participants or 0 for nonparticipants. The data exhibit a special missingness pattern: for trial participants we have data on $(X_i, S_i = 1, A_i, Y_i)$ but for nonparticipants we only have data on $(X_i, S_i = 0)$. Table 1 shows the observed data structure for the special case of binary treatment.
We use the random variables $Y^a$ to denote the potential (counterfactual) outcome under intervention to set treatment to $a$, possibly contrary to fact [18, 19]. We only consider discrete treatments in this tutorial; extensions to continuous treatments are straightforward. For any two treatments $a$ and $a'$ in the set of treatments $\mathcal{A}$, a causal effect of interest is the average treatment effect in the target population of eligible nonparticipants,
$$E[Y^a - Y^{a'} \mid S = 0].$$
The average treatment effect among nonparticipants generally differs from the average treatment effect among trial participants, $E[Y^a - Y^{a'} \mid S = 1]$, when the effect of treatment varies over baseline covariates that are differentially distributed between trial participants and nonparticipants [20, 21].
When using the trials-nested-in-cohorts design, another causal effect, the average effect in the population of all eligible individuals, $E[Y^a - Y^{a'}]$, may also be of interest, and we have discussed its identification and estimation in previous work [22]. Even if the primary target of inference for the trial-nested-in-cohort design is $E[Y^a - Y^{a'}]$, investigators are typically also interested in efficiently estimating the effect among nonparticipants, $E[Y^a - Y^{a'} \mid S = 0]$. Comparing the effect among eligible nonparticipants against the effect among trial participants is often of substantial scientific and policy interest. Importantly, when using the composite dataset design, $E[Y^a - Y^{a'}]$ is not a meaningful estimand because the population implied by the composite dataset design is a mixture of the population of trial participants and the target population, with arbitrary mixing proportions (determined by the sample size of the trial and target sample). Thus, for composite datasets, $E[Y^a - Y^{a'} \mid S = 0]$ is the more reasonable target of inference.
In the remainder of this tutorial we focus on the identification and estimation of the components of the average treatment effect among eligible nonparticipants, $E[Y^a \mid S = 0]$, under either of the designs discussed in Section 1.
3 Identifiability conditions & identification
3.1 Identifiability conditions
We now discuss sufficient conditions for identifying the mean of the potential outcomes under each treatment $a$ in the set of treatments $\mathcal{A}$ in the population of trial-eligible nonparticipants, $E[Y^a \mid S = 0]$.
(1) Consistency of potential outcomes: The observed outcome for the $i$th individual who received treatment $a$ equals that individual’s potential outcome under the same treatment, that is, if $A_i = a$, then $Y_i = Y_i^a$.
(2) Mean exchangeability (over $A$): Among trial participants, the potential outcome mean under treatment $a$ is independent of treatment, conditional on baseline covariates,
$$E[Y^a \mid X, S = 1, A = a] = E[Y^a \mid X, S = 1].$$
We expect conditional exchangeability in the trial to hold, regardless of whether the randomization was unconditional or conditional on covariates. The mean exchangeability assumption is weaker than the assumption of exchangeability in distribution, $Y^a \perp\!\!\!\perp A \mid (X, S = 1)$.
(3) Positivity of treatment assignment: In the trial, the probability of being assigned to each of the treatments being compared, conditional on the covariates needed for exchangeability, is bounded away from zero: $\Pr[A = a \mid X = x, S = 1] > 0$ for each $a \in \mathcal{A}$.
(4) Mean transportability (exchangeability over $S$): The potential outcome mean is independent of trial participation, conditional on baseline covariates:
$$E[Y^a \mid X, S = 1] = E[Y^a \mid X, S = 0].$$
Again, this assumption is weaker than the often-invoked assumption of conditional transportability in distribution, $Y^a \perp\!\!\!\perp S \mid X$.
(5) Positivity of trial participation: The probability of participating in the trial, conditional on the covariates needed to ensure mean transportability, is positive,
$$\Pr[S = 1 \mid X = x] > 0 \quad \text{for every } x \text{ that can occur among nonparticipants}.$$
Consistency, mean exchangeability, and positivity of treatment assignment are expected to hold in (marginally or conditionally) randomized clinical trials of well-defined interventions. In contrast, mean transportability and positivity of trial participation are strong and largely untestable assumptions, and their plausibility needs to be assessed on the basis of substantive knowledge when transporting inferences from a trial to a new target population.
3.2 Identification
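Under the above assumptions, as shown in the Appendix, for each treatment $a \in \mathcal{A}$, the potential outcome mean in the target population of nonparticipants can be identified using the observed data,
$$E[Y^a \mid S = 0] = E\big[\, E[Y \mid X, S = 1, A = a] \,\big|\, S = 0 \,\big].$$
These potential outcome means are inherently interesting. Furthermore, the treatment effect among trial-eligible nonparticipants can be identified as
$$E[Y^a - Y^{a'} \mid S = 0] = E\big[\, E[Y \mid X, S = 1, A = a] - E[Y \mid X, S = 1, A = a'] \,\big|\, S = 0 \,\big].$$
In the next section we review methods for estimating $E[Y^a \mid S = 0]$.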
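The identification result can be checked by hand when there is a single binary covariate: the target-population mean is the trial's conditional outcome mean in each covariate stratum, averaged over the covariate distribution of the nonparticipants. The sketch below uses made-up numbers (the conditional means and stratum proportions are assumptions for illustration, not values from this tutorial).

```python
# Hypothetical inputs for one binary covariate X (illustrative values only).
# Conditional outcome means among trial participants: E[Y | X = x, S = 1, A = a].
mean_y_trial = {
    (0, 0): 0.30, (1, 0): 0.50,  # keys are (x, a); these two are for a = 0
    (0, 1): 0.20, (1, 1): 0.25,  # these two are for a = 1
}
# Covariate distribution among nonparticipants: Pr[X = x | S = 0].
pr_x_target = {0: 0.7, 1: 0.3}


def identified_mean(a):
    """Return E[ E[Y | X, S = 1, A = a] | S = 0 ] for a discrete covariate."""
    return sum(mean_y_trial[(x, a)] * pr_x_target[x] for x in pr_x_target)


mu1 = identified_mean(1)   # potential outcome mean under a = 1 among nonparticipants
mu0 = identified_mean(0)   # potential outcome mean under a = 0 among nonparticipants
effect = mu1 - mu0         # average treatment effect among nonparticipants
print(mu1, mu0, effect)
```

The same arithmetic with the covariate distribution of the trial participants in place of `pr_x_target` would give the effect among participants, which is how differences in covariate distributions translate into different average effects.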
4 Estimating potential outcome means in the target population
We consider three approaches for estimating the potential outcome mean in the target population of nonparticipants: (1) outcome modeling followed by standardization; (2) probability of trial participation modeling followed by inverse odds weighting; and (3) doubly robust approaches that combine outcome and trial participation modeling. Under the consistency assumption, these methods can be thought of as solutions to a missing data problem for the outcome, where the missing data indicator is the product $S \times I(A = a)$, and $I(\cdot)$ is the indicator function. In the Appendix, we show that the three approaches provide consistent estimators (i.e., estimators that converge in probability) for $E[Y^a \mid S = 0]$ provided the identifiability conditions hold and models are correctly specified; here, we focus on developing intuition about the methods. Throughout, we assume that the models for the probability of participation and the expectation of the outcome are parametric (finite-dimensional), as is usually the case in applied work (we address more flexible models in the Discussion section).
4.1 Outcome modeling followed by standardization
The first approach transports inferences from trial participants to the population of nonparticipants by “extrapolating” an outcome regression model fit among the former to a sample of the latter [23]. In essence, we use data from trial participants to estimate models for the expectation of the outcome and then standardize the model predictions to the baseline covariate distribution of the nonparticipants. The estimator for the potential outcome mean among eligible nonparticipants under treatment $a$ is
$$\hat\mu_{\text{OM}}(a) = \Big\{ \sum_{i=1}^{n} (1 - S_i) \Big\}^{-1} \sum_{i=1}^{n} (1 - S_i)\, g_a(X_i; \hat\beta), \tag{1}$$
where $g_a(X; \hat\beta)$ is a predicted value from a model (with finite-dimensional parameter $\beta$) for $E[Y \mid X, S = 1, A = a]$. We usually estimate separate outcome models in each treatment group of the trial, to allow for all possible treatment-covariate interactions. When the identifiability conditions hold and the model is correctly specified, the estimator $\hat\mu_{\text{OM}}(a)$ is consistent for the potential outcome mean among eligible nonparticipants.
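As a minimal sketch of outcome modeling followed by standardization, the code below fits a separate straight-line outcome model in one trial arm and averages its predictions over the nonparticipants. It assumes a single continuous covariate and ordinary least squares; the data layout (dictionaries with keys `x`, `s`, `a`, `y`), the function names, and the toy data are all hypothetical choices for illustration, not the implementation provided in the Appendix.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def om_estimate(rows, a):
    """Outcome-model estimate of the mean under treatment a among rows with s == 0."""
    # Fit the outcome model using only trial participants assigned to treatment a.
    arm = [r for r in rows if r["s"] == 1 and r["a"] == a]
    b0, b1 = fit_line([r["x"] for r in arm], [r["y"] for r in arm])
    # Standardize: average the predictions over the nonparticipants' covariates.
    target = [r for r in rows if r["s"] == 0]
    return sum(b0 + b1 * r["x"] for r in target) / len(target)


# Toy data (hypothetical): outcomes lie exactly on a line within each arm.
rows = [
    {"x": 0.0, "s": 1, "a": 1, "y": 1.0},
    {"x": 1.0, "s": 1, "a": 1, "y": 3.0},
    {"x": 2.0, "s": 1, "a": 1, "y": 5.0},
    {"x": 0.0, "s": 1, "a": 0, "y": 0.0},
    {"x": 1.0, "s": 1, "a": 0, "y": 1.0},
    {"x": 2.0, "s": 1, "a": 0, "y": 2.0},
    {"x": 3.0, "s": 0},  # nonparticipants contribute covariates only
    {"x": 5.0, "s": 0},
]
mu1_om = om_estimate(rows, 1)
mu0_om = om_estimate(rows, 0)
print(mu1_om, mu0_om, mu1_om - mu0_om)
```

Note that the nonparticipants in this toy example lie outside the covariate range of the trial, so the estimate rests entirely on extrapolating the fitted line; this is the dependence on outcome model specification that arises when positivity of trial participation is doubtful.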
4.2 Trial participation modeling and odds weighting
The second approach transports inferences from trial participants to the population of nonparticipants using a model for the probability of trial participation [6, 15]. In essence, we are treating the randomized trial participants as a sample from the target population with sampling probabilities that depend on baseline covariates and need to be estimated, an idea that connects this approach to survey sampling [24]. Specifically, we estimate the potential outcome mean under treatment $a$ among trial nonparticipants as
$$\hat\mu_{\text{IOW1}}(a) = \Big\{ \sum_{i=1}^{n} (1 - S_i) \Big\}^{-1} \sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i)\, Y_i, \tag{2}$$
with $\hat w_a(X, S, A)$ defined as
$$\hat w_a(X, S, A) = \frac{I(S = 1, A = a)\, \{1 - p(X; \hat\gamma)\}}{p(X; \hat\gamma)\, e_a(X; \hat\theta)}.$$
Here, $p(X; \hat\gamma)$ is a predicted value from a model (with finite-dimensional parameter $\gamma$) for the probability of participation in the trial, $\Pr[S = 1 \mid X]$, and $e_a(X; \hat\theta)$ is a predicted value from a model (with finite-dimensional parameter $\theta$) for the probability of being assigned to treatment $a$ among trial participants, $\Pr[A = a \mid X, S = 1]$. The probability of participation in the trial is, in general, unknown and has to be estimated (e.g., by fitting a logistic regression model). In contrast, the probability of treatment in the randomized trial is known (determined by the investigators) and the true values can be used instead of estimated values (estimating the probability may, however, lead to smaller standard errors). We refer to the estimator in (2) as an “odds of participation weighted” estimator (IOW1) and the weights as “odds of participation” weights because $\{1 - p(X; \hat\gamma)\} / p(X; \hat\gamma)$ is the inverse of the estimated odds of trial participation conditional on baseline covariates.
An alternative odds of participation weighted estimator (IOW2) normalizes the weights to sum to 1,
$$\hat\mu_{\text{IOW2}}(a) = \Big\{ \sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i) \Big\}^{-1} \sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i)\, Y_i. \tag{3}$$
In survey research, this estimator is referred to as the ratio estimator [25]; it can be obtained as the solution of the estimating equations of weighted least squares regression of the outcome on treatment, using weights equal to $\{1 - p(X_i; \hat\gamma)\} / \{p(X_i; \hat\gamma)\, e_{A_i}(X_i; \hat\theta)\}$ for trial participants and 0 for nonparticipants.
When the identifiability conditions hold and the model for the probability of participation is correctly specified, both odds weighting estimators are consistent for the potential outcome mean among eligible nonparticipants. The small difference in the normalization of the weights between IOW1 and IOW2 can have a big effect when weights are highly variable [26], because estimator (2) is unbounded (i.e., it may produce estimates that fall outside the support of the outcome variable), whereas estimator (3) is bounded by the range of the observed outcome.
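The two odds weighting estimators differ only in their denominators, which a few lines of code make concrete. The sketch below assumes that the fitted probability of participation (`p`) and the probability of assignment to treatment 1 (`e1`) have already been attached to each trial participant's record; the field names, function names, and toy values are hypothetical and chosen only for illustration.

```python
def odds_weight(row, a):
    """Weight for treatment a: I(S = 1, A = a) * (1 - p) / (p * e_a)."""
    if row["s"] != 1 or row["a"] != a:
        return 0.0
    e_a = row["e1"] if a == 1 else 1.0 - row["e1"]
    return (1.0 - row["p"]) / (row["p"] * e_a)


def iow1(rows, a):
    """Unnormalized estimator: weighted outcome sum over the number of nonparticipants."""
    n0 = sum(1 for r in rows if r["s"] == 0)
    return sum(odds_weight(r, a) * r["y"] for r in rows if r["s"] == 1) / n0


def iow2(rows, a):
    """Normalized (ratio) estimator: weighted outcome sum over the sum of the weights."""
    pairs = [(odds_weight(r, a), r["y"]) for r in rows if r["s"] == 1]
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)


# Toy data (hypothetical): one participant has a small participation probability
# and therefore a large weight.
rows = [
    {"s": 1, "a": 1, "y": 2.0, "p": 0.5, "e1": 0.5},  # weight 2 for a = 1
    {"s": 1, "a": 1, "y": 4.0, "p": 0.2, "e1": 0.5},  # weight 8 for a = 1
    {"s": 1, "a": 0, "y": 1.0, "p": 0.5, "e1": 0.5},  # weight 2 for a = 0
    {"s": 0}, {"s": 0}, {"s": 0}, {"s": 0},
]
est_iow1 = iow1(rows, 1)
est_iow2 = iow2(rows, 1)
print(est_iow1, est_iow2)
```

In this toy dataset the weights for treatment 1 sum to 10 while there are only 4 nonparticipants, so the unnormalized estimate is 36 / 4 = 9.0, well outside the observed outcome range of 2 to 4, whereas the normalized estimate is 36 / 10 = 3.6.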
4.3 Doubly robust estimators
In practical applications, background knowledge is typically inadequate to ensure correct specification of the working models for the probability of participation or the expectation of the outcome, and misspecification of these models can lead to estimator inconsistency. We can gain some robustness to misspecification and increase efficiency by combining the two models to obtain doubly robust estimators that are consistent when either model is correctly specified [27, 28, 29]. Here, we examine three doubly robust estimators that are easy to implement in standard statistical software.
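In-sample one-step doubly robust estimator:
The first doubly robust estimator we consider relies on estimating models for the conditional expectation of the outcome, $g_a(X; \beta)$; the probability of trial participation, $p(X; \gamma)$; and (optionally) the probability of treatment among trial participants, $e_a(X; \theta)$. Predicted values from these models are then combined to obtain the unbounded estimator
$$\hat\mu_{\text{DR1}}(a) = \Big\{ \sum_{i=1}^{n} (1 - S_i) \Big\}^{-1} \sum_{i=1}^{n} \Big\{ \hat w_a(X_i, S_i, A_i) \big\{ Y_i - g_a(X_i; \hat\beta) \big\} + (1 - S_i)\, g_a(X_i; \hat\beta) \Big\}, \tag{4}$$
where $\hat w_a(X_i, S_i, A_i) = I(S_i = 1, A_i = a)\, \{1 - p(X_i; \hat\gamma)\} / \{p(X_i; \hat\gamma)\, e_a(X_i; \hat\theta)\}$ is the odds of participation weight defined in the previous section.
In-sample one-step doubly robust estimator with normalized weights:
Using the same strategy of normalizing the weights as for IOW2, an alternative, bounded (provided the outcome model is well-chosen) variant of DR1 is
$$\hat\mu_{\text{DR2}}(a) = \Big\{ \sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i) \Big\}^{-1} \sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i) \big\{ Y_i - g_a(X_i; \hat\beta) \big\} + \Big\{ \sum_{i=1}^{n} (1 - S_i) \Big\}^{-1} \sum_{i=1}^{n} (1 - S_i)\, g_a(X_i; \hat\beta). \tag{5}$$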
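The two one-step doubly robust estimators can be sketched as a weighted average of outcome-model residuals among trial participants plus the standardized outcome-model predictions among nonparticipants. The code below assumes that each record already carries fitted values: `p` (probability of participation), `e1` (probability of assignment to treatment 1), and `g` (outcome-model predictions for each treatment). These field names, the function names, and the toy values are hypothetical and for illustration only.

```python
def odds_weight(row, a):
    """Weight for treatment a: I(S = 1, A = a) * (1 - p) / (p * e_a)."""
    if row["s"] != 1 or row["a"] != a:
        return 0.0
    e_a = row["e1"] if a == 1 else 1.0 - row["e1"]
    return (1.0 - row["p"]) / (row["p"] * e_a)


def standardized_prediction(rows, a):
    """Average outcome-model prediction under treatment a among nonparticipants."""
    target = [r for r in rows if r["s"] == 0]
    return sum(r["g"][a] for r in target) / len(target)


def dr1(rows, a):
    """Residual correction scaled by the number of nonparticipants (unnormalized)."""
    n0 = sum(1 for r in rows if r["s"] == 0)
    correction = sum(odds_weight(r, a) * (r["y"] - r["g"][a])
                     for r in rows if r["s"] == 1)
    return correction / n0 + standardized_prediction(rows, a)


def dr2(rows, a):
    """Residual correction scaled by the sum of the weights (normalized)."""
    pairs = [(odds_weight(r, a), r["y"] - r["g"][a]) for r in rows if r["s"] == 1]
    correction = sum(w * res for w, res in pairs) / sum(w for w, _ in pairs)
    return correction + standardized_prediction(rows, a)


# Toy data (hypothetical fitted values attached to every record).
rows = [
    {"s": 1, "a": 1, "y": 2.0, "p": 0.5, "e1": 0.5, "g": {1: 2.5, 0: 1.0}},
    {"s": 1, "a": 1, "y": 4.0, "p": 0.2, "e1": 0.5, "g": {1: 3.5, 0: 1.5}},
    {"s": 1, "a": 0, "y": 1.0, "p": 0.5, "e1": 0.5, "g": {1: 3.0, 0: 1.0}},
    {"s": 0, "g": {1: 3.0, 0: 1.0}},
    {"s": 0, "g": {1: 3.0, 0: 1.0}},
    {"s": 0, "g": {1: 4.0, 0: 2.0}},
    {"s": 0, "g": {1: 4.0, 0: 2.0}},
]
est_dr1 = dr1(rows, 1)
est_dr2 = dr2(rows, 1)
print(est_dr1, est_dr2)
```

When the outcome model fits the trial data well, the weighted residuals are close to zero and both estimators stay close to the standardized predictions; the two versions diverge when large weights multiply large residuals, as they do for the second participant here.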
Weighted regression doubly robust estimator:
A third doubly robust estimator involves fitting a model for the outcome conditional on covariates among trial participants, using a weighted regression with the odds of participation weights $\hat w_a(X_i, S_i, A_i)$ as defined above, and then standardizing the predicted values, $g_a(X; \hat\beta_w)$, to the covariate distribution of eligible nonparticipants,
$$\hat\mu_{\text{DR3}}(a) = \Big\{ \sum_{i=1}^{n} (1 - S_i) \Big\}^{-1} \sum_{i=1}^{n} (1 - S_i)\, g_a(X_i; \hat\beta_w), \tag{6}$$
where $\hat\beta_w$ is the vector of estimated parameters from the weighted outcome regression. This estimator is bounded provided the outcome model is well-chosen and doubly robust when the outcome is modeled with a linear exponential family quasi-likelihood [30] and the canonical link function [26, 31].
Provided the identification conditions hold, doubly robust estimators are consistent and asymptotically normal when either the model for the probability of participation or the expectation of the outcome is correctly specified [32]. When both models are correctly specified, the large-sample variance of the doubly robust estimators is less than or equal to the variance of the inverse odds weighting estimators [28, 32]. When one of the two models is incorrectly specified, the asymptotic distribution of the doubly robust estimators remains normal (and centered on the true value) but their variance is increased. In rare cases, when misspecification of the outcome model is combined with highly variable weights, doubly robust estimators can perform worse than non-doubly robust estimators that use the same misspecified outcome model [33, 26].
4.4 Inference
When using parametric working models, all the estimators described above can be viewed as partial M-estimators [34] and it is possible to employ the usual “sandwich” approach to obtain their sampling variances (e.g., [35, 36]). Inference based on the nonparametric bootstrap [37], however, is easy to obtain with modern software and will often be preferred in practice.
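A percentile bootstrap interval can be sketched by resampling whole records with replacement and re-running the entire estimation procedure, including all model fitting, on each resample. The example below is a minimal illustration under stated assumptions: the data are simulated with made-up parameters, the covariate is binary, and the estimator is an outcome-model estimator with a saturated model (stratum means), so no regression software is needed. The names `transported_rd` and `percentile_ci` are hypothetical.

```python
import random


def transported_rd(rows):
    """Risk difference among nonparticipants, using stratum means in binary x."""
    target = [r for r in rows if r["s"] == 0]

    def mu(a):
        total = 0.0
        for x in (0, 1):
            # Share of nonparticipants in stratum x.
            share = sum(1 for r in target if r["x"] == x) / len(target)
            # Observed risk among trial participants with A = a in stratum x.
            cell = [r["y"] for r in rows
                    if r["s"] == 1 and r["a"] == a and r["x"] == x]
            total += share * sum(cell) / len(cell)
        return total

    return mu(1) - mu(0)


def percentile_ci(rows, estimator, n_boot=500, level=0.95, seed=0):
    """Percentile interval from resampling records with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    draws = sorted(
        estimator([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = draws[int(n_boot * (1 - level) / 2)]
    upper = draws[int(n_boot * (1 + level) / 2) - 1]
    return lower, upper


# Simulated composite dataset (all parameter values are illustrative assumptions).
rng = random.Random(1)
rows = []
for _ in range(300):  # trial participants
    x = int(rng.random() < 0.6)
    a = int(rng.random() < 0.5)
    y = int(rng.random() < 0.2 + 0.2 * x - 0.1 * a)
    rows.append({"s": 1, "x": x, "a": a, "y": y})
for _ in range(300):  # nonparticipants: covariates only
    rows.append({"s": 0, "x": int(rng.random() < 0.3)})

point = transported_rd(rows)
lower, upper = percentile_ci(rows, transported_rd)
print(point, lower, upper)
```

This sketch resamples the stacked dataset as a whole, so the number of trial participants varies across resamples; with a composite dataset one might instead resample the trial and the target sample separately. A sparse dataset could also produce a resample with an empty stratum, which this sketch does not handle.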
5 Simulation study
We conducted a simulation study to examine the finite-sample performance of different estimators for the average treatment effect in the target population of eligible nonrandomized patients.
5.1 Data generation
We ran a factorial experiment using 3 trial sample sizes × 3 target population sample sizes × 2 magnitudes of departure from additive effects in the outcome model × 2 magnitudes of selection, resulting in a total of 36 simulation scenarios.
We considered of 250, 500, or 1000 randomized participants and of 2,500, 5,000 or 10,000 nonrandomized individuals. We generated baseline covariates for randomized trial participants (), as and ; ; . We then generated baseline covariates for the sample of eligible nonparticipants (), with ; ; . The difference in the means of the covariate distributions of trial participants and nonparticipants represents selection into the trial based on baseline covariates. The parameter controls selection on ; we used values 0 and 1, representing no and strong selection. Because the distribution of baseline covariates is homoskedastic over , a logistic regression model of on is correctly specified [38]. We generated outcomes using the linear model
where is the main treatment effect, determines the magnitude of effect modification by ; , , and . We examined scenarios with different levels of effect modification by setting to 0 or 1; we set the “main” treatment effect to in all scenarios.
For each simulated dataset, we applied the estimators in equations (1) through (6), and also obtained a trial-only estimator of the treatment effect. All working models required for the different estimators were correctly specified, in the sense that the true models were nested within the parametric working models on which the estimators relied. Specifically, outcome models included main effects for all covariates and were fit separately in each arm; logistic regression models for trial participation and treatment included the main effects of all covariates; all models had intercept terms. We estimated the bias and variance for each estimator over 10,000 runs for each scenario.
5.2 Simulation results
Tables 2 and 3 summarize simulation results from selected simulation scenarios for continuous, normally distributed outcomes and linear outcome models. Additional simulation results are presented in Appendix Tables A1 and A2.
When all models were correctly specified, all estimators were approximately unbiased, even with fairly small trial and target population sample sizes. The outcome model-based estimator had the lowest variance, followed closely by the three doubly robust estimators. The probability of participation-based estimators had substantially larger variance than all other estimators; that variance, though, became smaller with increasing trial sample sizes. When trial sample size was much smaller than the target sample size, in the presence of strong selection on covariates, estimators that used weights normalized to sum to one (IOW2, DR2, and DR3) had smaller variance compared to estimators that used unnormalized weights (IOW1 and DR1). As expected, in the presence of effect modification, the trial-only estimator gave different results compared to the estimators in equations (1) through (6). The trial-only estimator is biased for $E[Y^a - Y^{a'} \mid S = 0]$ when selection into the trial depends on the effect modifier, but is, of course, unbiased for $E[Y^a - Y^{a'} \mid S = 1]$ under very general conditions.
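The contrast between the trial-only estimator and a transported estimate can be reproduced qualitatively in a toy simulation. The sketch below is not the simulation design of this section: the data-generating values (one covariate with mean 1 among participants and mean 0 among nonparticipants, a treatment effect of 1 + x, and the sample sizes) are assumptions chosen so that the true effect is 2 among participants and 1 among nonparticipants.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def simulate(n_trial=2000, n_target=5000, seed=0):
    """Hypothetical data: the effect modifier x is shifted upward in the trial."""
    rng = random.Random(seed)
    trial = []
    for _ in range(n_trial):
        x = rng.gauss(1.0, 1.0)                  # participants: mean 1
        a = int(rng.random() < 0.5)              # 1:1 randomization
        y = 1.0 + x + a * (1.0 + x) + rng.gauss(0.0, 1.0)
        trial.append((x, a, y))
    target_x = [rng.gauss(0.0, 1.0) for _ in range(n_target)]  # nonparticipants: mean 0
    return trial, target_x


def trial_only(trial):
    """Difference in mean outcomes between the randomized arms."""
    y1 = [y for _, a, y in trial if a == 1]
    y0 = [y for _, a, y in trial if a == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def transported(trial, target_x):
    """Arm-specific outcome models standardized to the nonparticipants' covariates."""
    means = {}
    for arm in (0, 1):
        b0, b1 = fit_line([x for x, a, _ in trial if a == arm],
                          [y for _, a, y in trial if a == arm])
        means[arm] = sum(b0 + b1 * x for x in target_x) / len(target_x)
    return means[1] - means[0]


trial, target_x = simulate()
est_trial = trial_only(trial)
est_target = transported(trial, target_x)
print(est_trial, est_target)
```

Under these assumptions the trial-only estimate should land near 2 and the transported estimate near 1, up to sampling error.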
5.3 Code to implement the methods
In the Appendix, we provide R [39] code implementing the methods compared in the simulation study. Specifically, we provide a collection of basic standalone functions, one for each estimator in equations (1) through (6), using parametric working models estimated by standard maximum likelihood methods. Readers can modify the functions to incorporate alternative estimation approaches and to obtain bootstrap-based inference. To allow inference with standard errors obtained with the sandwich method, we also provide an implementation of the estimators using the R package geex [40]. Lastly, we provide Stata code to reproduce the simulation study.
6 Transportability analyses for the Coronary Artery Surgery Study
The Coronary Artery Surgery Study (CASS) included a randomized trial nested within a cohort study, comparing coronary artery surgery plus medical therapy (henceforth, “surgery”) versus medical therapy alone for patients with chronic coronary artery disease. Of the 2099 eligible patients, 780 consented to randomization and 1319 declined. We excluded six patients for consistency with prior CASS analyses [41, 42] and in accordance with CASS data release notes; in total, we used data from 2093 patients. Details about the design of the CASS are available elsewhere [43, 44]. Here, we focus on estimating the survival probability and treatment effects among eligible nonparticipants and comparing them against estimates obtained among trial participants.
We implemented the methods described in Section 4 to estimate the 10-year risk (cumulative incidence proportion) of death from any cause in the surgery and medical therapy groups, the risk difference, and the relative risk for the population of eligible patients who did not consent to randomization. Risks are reasonable measures of incidence in CASS because no patients were censored during the first 10 years of follow-up. The working models for the outcome, the probability of participation in the trial, and the probability of treatment were logistic regression models with the following covariates: age, severity of angina, history of previous myocardial infarction, percent obstruction of the proximal left anterior descending artery, left ventricular wall motion score, number of diseased vessels, and ejection fraction. We selected variables for inclusion in the models based on a previous analysis of the CASS data [42]; age and ejection fraction were modeled using restricted cubic splines with 5 knots [45]. We used bootstrap resampling (10,000 samples of as many observations as in the dataset) to obtain percentile 95% confidence intervals.
Of the 2093 patients in the CASS dataset, 1686 had complete data on all baseline covariates (731 randomized, 368 to surgery and 363 to medical therapy; 955 nonrandomized, 430 receiving surgery and 525 medical therapy); for simplicity, we only report analyses restricted to patients with complete data. Table 4 summarizes baseline covariates in trial participants (by treatment group) and nonparticipants. Figure 1 presents the kernel density of the estimated probability of trial participation for trial participants and nonparticipants and a kernel density of the estimated weights for trial participants. The sample proportion of nonparticipants divided by the sample average of the inverse odds of trial participation among trial participants was approximately 1.001.
Estimates of the 10-year risk (by treatment group), risk difference, and risk ratio are shown in Table 5. The outcome model-based estimator (OM), the inverse odds of participation estimators (IOW1 and IOW2), and the doubly robust estimators (DR1, DR2, and DR3) produced similar results, suggesting that findings are not driven by model specification decisions [29].
7 Transportability analyses in practice
We now discuss practical issues related to variable selection, other aspects of model specification, and positivity violations and highly variable weights, all of which arise in transportability analyses using the methods described above.
7.1 Practical considerations
Variable selection:
Throughout, we have used $X$ to signify baseline covariates measured both among randomized trial participants and the sample from the target population. In principle, investigators can use any subset of the available covariates that satisfies the mean transportability assumption. When investigators are interested in estimating the potential outcome mean under each treatment (not only the average treatment effect), outcome predictors that are also associated with trial participation should be included in models for the outcome or the probability of participation (or both, when using doubly robust estimators). Including outcome predictors that are not associated with trial participation in models for the expectation of the outcome will often improve the precision of the outcome model-based and doubly robust estimators; including strong predictors of trial participation that are not associated with the outcome in regressions for the probability of participation will generally increase the variance of the odds weighted and doubly robust estimators without improving transportability. When investigators are primarily interested in the average treatment effect (instead of the potential outcome mean under each treatment), only effect modifiers (on the mean difference scale) need to be modeled [12, 8]. Because background knowledge about effect modification is typically very limited, even when interest is centered on treatment effect estimation, in practice it is probably best to include as many outcome predictors as possible in regression models for the expectation of the outcome or the probability of trial participation. We followed this strategy in our CASS reanalysis: we selected covariates for “adjustment” based on prior work on outcome modeling and used the same covariates when modeling trial participation, the outcome, and treatment in the trial.
Other aspects of model specification:
Models for the expectation of the outcome and the probability of trial participation need to be flexible in order to approximate the corresponding “true” conditional expectation/probability functions. This will often mean including nonlinear terms (e.g., splines for continuous variables) or interactions between predictors. When modeling the expectation of the outcome (for outcome model-based or doubly robust estimators) we recommend fitting separate regression models in each treatment group in the trial, as we did in the CASS reanalysis (equivalent to fitting a single regression model that includes all possible treatment-covariate interactions).
More broadly, model specification for transportability analyses involves trading off bias and variance using informal [46] or formal methods (e.g., [47]). When background knowledge suggests that a large number of covariates need to be modeled, formal model specification search methods can be particularly helpful. In our experience, especially when using composite datasets, the variables measured both among trial participants and the sample of the target population are often few and model specification is not a pressing concern (of course, such cases raise concerns about violations of the mean transportability assumption and necessitate sensitivity analyses, which we address in the discussion). When richer data are available (e.g., when trials are nested in cohorts of eligible individuals [22]), transportability analyses need to be combined with more sophisticated strategies for model specification search (e.g., [48] provides an overview in the context of causal inference for observational studies, but the same principles apply to transportability analyses).
Positivity violations and highly variable weights:
To prevent structural violations of the positivity of trial participation assumption, investigators should ensure that the sample from the target population meets the trial eligibility criteria. For example, if the trial restricted enrollment to patients under 85 years of age, it is prudent to apply the same restriction in the sample of patients from the target population. When positivity is violated, odds weighting estimators are inconsistent, whereas outcome model-based and doubly robust estimators rely heavily on the specification of the outcome model (to extrapolate from participants to nonparticipants) [49]. Empirical (finite-sample) violations of positivity can arise due to chance, particularly when the trial sample size is small or when the mean transportability assumption requires adjustment for continuous covariates or a large number of discrete covariates. Empirical violations of positivity increase bias and variance in a way that depends on the particular estimator being used, model specification, and the underlying data generating mechanism.
It is always a good idea to examine the distribution of the estimated probabilities of trial participation (even if using an outcome model-based estimator), because values near zero are warning signs for possible positivity violations. Inspection of the estimated probabilities of trial participation can be combined with diagnostics for positivity violations [49]. It is also useful to inspect the distribution of the weights that are used for the odds weighting and doubly robust estimators. By inspecting the distribution of the odds weights, investigators can identify extreme values and visually assess the spread of the weight distribution. A sometimes useful diagnostic is that the sample proportion of nonparticipants divided by the sample average of the estimated inverse odds of trial participation among trial participants should be approximately equal to one; or, in symbols,
$$\frac{n^{-1} \sum_{i=1}^{n} (1 - S_i)}{n^{-1} \sum_{i=1}^{n} S_i \, \{1 - p(X_i; \hat\gamma)\} / p(X_i; \hat\gamma)} \approx 1.$$
Values different from 1 suggest positivity violations or model misspecification. (Footnote 1: The rationale for the diagnostic is provided by the identity $E\big[ S \, \{1 - \Pr[S = 1 \mid X]\} / \Pr[S = 1 \mid X] \big] = E\big[ 1 - \Pr[S = 1 \mid X] \big] = \Pr[S = 0]$.)
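The diagnostic is a one-line computation once estimated probabilities of participation are available. The sketch below takes the participation indicators and the fitted probabilities as plain lists; the function name and the toy values are hypothetical, and the "poorly calibrated" probabilities are deliberately wrong to show how the ratio moves away from 1.

```python
def participation_diagnostic(s, p_hat):
    """Proportion of nonparticipants divided by the sample average of
    S * (1 - p_hat) / p_hat, the inverse odds among trial participants."""
    n = len(s)
    prop_nonparticipants = sum(1 - si for si in s) / n
    mean_inverse_odds = sum(si * (1 - pi) / pi for si, pi in zip(s, p_hat)) / n
    return prop_nonparticipants / mean_inverse_odds


# Toy data (hypothetical): two covariate strata with 8 individuals each.
# Stratum 0 has 2 participants and 6 nonparticipants (true probability 0.25);
# stratum 1 has 4 participants and 4 nonparticipants (true probability 0.50).
s = [1] * 2 + [0] * 6 + [1] * 4 + [0] * 4

p_calibrated = [0.25] * 8 + [0.50] * 8   # matches the stratum proportions
p_poor = [0.10] * 8 + [0.50] * 8         # understates participation in stratum 0

ratio_calibrated = participation_diagnostic(s, p_calibrated)
ratio_poor = participation_diagnostic(s, p_poor)
print(ratio_calibrated, ratio_poor)
```

A ratio near 1 is reassuring but is not proof of correct specification: in this toy example a model that assigns everyone the overall participation proportion (6/16) also yields a ratio of exactly 1.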
In applied analyses, we have found that problems with extreme weights can often be addressed by making sensible modeling choices [46] and ensuring that the sample of nonparticipants is properly selected to avoid violations of the positivity of trial participation assumption. Trimming or truncation of extreme weights may also help, but these strategies shift the causal estimand, which is often undesirable.
In our CASS reanalyses, the estimated probabilities of trial participation were far from zero. Their distribution was similar among trial participants and nonparticipants (as shown in Figure 1), reflecting the fairly similar observed covariate distribution in trial participants and nonparticipants and the absence of strong selection into the trial (at least based on available covariates, as shown in Table 4). As noted, the sample proportion of nonparticipants divided by the sample average of the inverse odds of trial participation among trial participants was approximately 1, providing some reassurance that gross violations of positivity were absent.
8 Discussion
In this tutorial, we reviewed methods for transporting inferences about the average effect of a time-fixed treatment from a randomized clinical trial to a new target population using baseline covariate data from randomized participants and a sample from the target population, but treatment and outcome data only from the randomized participants. We considered estimation approaches that rely on modeling the probability of trial participation, the expectation of the outcome, or both, and can be implemented easily in all popular statistical software packages.
A major challenge in applying any of the methods discussed in this tutorial is the need to collect adequate covariate information, both from trial participants and nonparticipants, for the mean transportability assumption to hold. Because the transportability assumption is not testable using the observed data, one has to rely on background substantive knowledge to assess its plausibility. Reasoning about the assumption can be facilitated using directed acyclic graphs, including recent graphical identification algorithms for assessing transportability [50, 51, 52]. Because background knowledge is often incomplete, it is often necessary to conduct sensitivity analyses, to examine how violations of the transportability assumption influence study results [53, 54].
Methods related to those discussed in this tutorial have been discussed in a number of recent publications [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] addressing trial transportability (or the related but distinct concept of generalizability [55]). With few exceptions, such as the careful asymptotic study of an estimator closely related to DR1 in [12] or the targeted maximum likelihood estimators in [14], prior work has focused on weighting [6, 11, 15] or stratification-based methods [8, 9, 10] that rely only on the probability of trial participation. Theoretical arguments, our simulation results, and practical experience suggest that methods that combine modeling the probability of trial participation with modeling the expectation of the outcome are most promising for applied work for two reasons: first, the double robustness property in effect gives investigators two opportunities for approximately correct inference [29]; second, doubly robust estimators often produce estimates that are more precise than those from methods that exclusively rely on modeling the probability of trial participation, even when the outcome model is misspecified [29, 26, 56, 32]. In our simulation studies, which used correctly specified parametric working models with few covariates, all estimators performed reasonably well in terms of bias. Interestingly, the two inverse odds weighting estimators had very different finite-sample performance in the presence of strong selection. Based on this observation, we recommend avoiding weighted estimators that do not normalize the weights to sum to 1.
When data are available on numerous baseline covariates, many of which are continuous, correct specification of parametric models for the probability of trial participation or expectation of the outcome will be impossible. Future research should address estimation using more flexible models (e.g., nonparametric or semiparametric regression) to mitigate model misspecification. Flexible models are particularly appealing when using doubly robust estimators, because the estimators remain consistent even when estimating the conditional expectation of the outcome or the probability of participation nonparametrically [57, 58]. Further research is also needed to study the behavior of different estimators under misspecification and to develop alternatives that are more robust to misspecification of the outcome model (e.g., along the lines suggested in [59, 60]). Lastly, throughout this tutorial, we have assumed perfect adherence to treatment in the randomized trial, no missing outcome data, and no measurement error. In practice, adherence is often imperfect, outcomes are missing (e.g., due to right censoring in failure-time analyses), and measurement error is a concern (e.g., differential measurement error in effect modifiers when using composite datasets). Established methods to address these issues in the trial data can be combined with the methods described in this tutorial in a modular fashion. For example, adjustment for imperfect adherence via inverse probability of treatment weighting can be combined with inverse odds weighting for transportability. Future work should assess the properties of such combined procedures and evaluate them in practical applications.
9 Acknowledgments
The authors thank Dr. Nina Joyce (Brown University) and Dr. John Wong (Tufts Medical Center) for helpful comments on earlier versions of the manuscript.
This work was supported in part through Patient-Centered Outcomes Research Institute (PCORI) Methods Research Awards ME-1306-03758 and ME-1502-27794 to I.J. Dahabreh, and ME-1503-28119 to M.A. Hernán. All statements in this paper, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of PCORI, its Board of Governors, or the Methodology Committee.
References
 [1] Peter M Rothwell. External validity of randomised controlled trials: “to whom do the results of this trial apply?”. The Lancet, 365(9453):82–93, 2005.
 [2] Andrew Evans and Lalit Kalra. Are the results of randomized controlled trials on anticoagulation in patients with atrial fibrillation generalizable to clinical practice? Archives of Internal Medicine, 161(11):1443–1447, 2001.
 [3] Linda S Elting, Catherine Cooksley, B Nebiyou Bekele, Michael Frumovitz, Elenir BC Avritscher, Charlotte Sun, and Diane C Bodurka. Generalizability of cancer clinical trial results. Cancer, 106(11):2452–2458, 2006.
 [4] Philippe Gabriel Steg, José López-Sendón, Esteban Lopez de Sa, Shaun G Goodman, Joel M Gore, Frederick A Anderson, Dominique Himbert, Jeanna Allegrone, and Frans Van de Werf. External validity of clinical trials in acute myocardial infarction. Archives of Internal Medicine, 167(1):68–73, 2007.
 [5] Antonio L Dans, Leonila F Dans, Gordon H Guyatt, Scott Richardson, Evidence-Based Medicine Working Group, et al. Users’ guides to the medical literature: XIV. How to decide on the applicability of clinical trial results to your patient. JAMA, 279(7):545–549, 1998.
 [6] Stephen R Cole and Elizabeth A Stuart. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. American Journal of Epidemiology, 172(1):107–115, 2010.
 [7] Eloise E Kaizar. Estimating treatment effect via simple cross design synthesis. Statistics in Medicine, 30(25):2986–3009, 2011.
 [8] Colm O’Muircheartaigh and Larry V Hedges. Generalizing from unrepresentative experiments: a stratified propensity score approach. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(2):195–210, 2014.
 [9] Elizabeth Tipton. Improving generalizations from experiments using propensity score subclassification: assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38(3):239–266, 2012.
 [10] Elizabeth Tipton, Larry Hedges, Michael Vaden-Kiernan, Geoffrey Borman, Kate Sullivan, and Sarah Caverly. Sample selection in randomized experiments: A new method using propensity score stratified sampling. Journal of Research on Educational Effectiveness, 7(1):114–135, 2014.
 [11] Erin Hartman, Richard Grieve, Roland Ramsahai, and Jasjeet S Sekhon. From SATE to PATT: combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society Series A (Statistics in Society), 10:1111, 2013.
 [12] Zhiwei Zhang, Lei Nie, Guoxing Soon, and Zonghui Hu. New methods for treatment effect calibration, with applications to noninferiority trials. Biometrics, 72(1):20–29, 2016.
 [13] Ashley L Buchanan, Michael G Hudgens, Stephen R Cole, Katie R Mollan, Paul E Sax, Eric S Daar, Adaora A Adimora, Joseph J Eron, and Michael J Mugavero. Generalizing evidence from randomized trials using inverse probability of sampling weights. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2016.
 [14] Kara E Rudolph and Mark J van der Laan. Robust estimation of encouragement design intervention effects transported across sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1509–1525, 2017.
 [15] Daniel Westreich, Jessie K Edwards, Catherine R Lesko, Elizabeth Stuart, and Stephen R Cole. Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology, 2017.
 [16] M Olschewski, H Scheurlen, et al. Comprehensive cohort study: an alternative to randomized consent design in a breast preservation trial. Methods Archive, 24:131–134, 1985.
 [17] Louis D Fiore and Philip W Lavori. Integrating randomized comparative effectiveness research with patient care. New England Journal of Medicine, 374(22):2152–2158, 2016.

 [18] Jerzy Splawa-Neyman. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–472, 1990.
 [19] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
 [20] Issa J Dahabreh, Rodney Hayward, and David M Kent. Using group data to treat individuals: understanding heterogeneous treatment effects in the age of precision medicine and patient-centred evidence. International Journal of Epidemiology, 45(6):2184–2193, 2016.
 [21] Issa J Dahabreh, Thomas A Trikalinos, David M Kent, and Christopher H Schmid. Heterogeneity of treatment effects. Methods in Comparative Effectiveness Research, page 227, 2017.
 [22] Issa J Dahabreh, Sarah E Robertson, Elizabeth A Stuart, and Miguel A Hernán. Extending inferences from randomized participants to all eligible individuals using trials nested within cohort studies. arXiv preprint arXiv:1709.04589, 2017.
 [23] James M Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512, 1986.
 [24] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
 [25] J Hájek. Comment on “An essay on the logical foundations of survey sampling by D. Basu”. In V P Godambe and D A Sprott, editors, Foundations of Statistical Inference. 1971.
 [26] James M Robins, Mariela Sued, Quanhong Lei-Gomez, and Andrea Rotnitzky. Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science, pages 544–559, 2007.
 [27] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
 [28] James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
 [29] James M Robins and Andrea Rotnitzky. Comments. Statistica Sinica, 11(4):920–936, 2001.
 [30] Christian Gourieroux, Alain Monfort, and Alain Trognon. Pseudo maximum likelihood methods: Theory. Econometrica: Journal of the Econometric Society, pages 681–700, 1984.
 [31] Jeffrey M Wooldridge. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301, 2007.
 [32] Anastasios Tsiatis. Semiparametric theory and missing data. Springer Science & Business Media, 2007.
 [33] Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, pages 523–539, 2007.
 [34] Leonard A Stefanski and Dennis D Boos. The calculus of M-estimation. The American Statistician, 56(1):29–38, 2002.
 [35] Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.
 [36] Ziyue Chen and Eloise Kaizar. On variance estimation for generalizing from a trial to a target population. arXiv preprint arXiv:1704.07789, 2017.
 [37] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
 [38] Bradley Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352):892–898, 1975.
 [39] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015.
 [40] Bradley C Saul and Michael G Hudgens. The calculus of M-estimation in R with geex. arXiv preprint arXiv:1709.01413, 2017.
 [41] Bernard R Chaitman, Thomas J Ryan, Richard A Kronmal, Eric D Foster, Peter L Frommer, Thomas Killip, CASS Investigators, et al. Coronary artery surgery study (CASS): comparability of 10 year survival in randomized and randomizable patients. Journal of the American College of Cardiology, 16(5):1071–1078, 1990.
 [42] Manfred Olschewski, Martin Schumacher, and Kathryn B Davis. Analysis of randomized and nonrandomized patients in clinical trials using the comprehensive cohort follow-up study design. Controlled Clinical Trials, 13(3):226–239, 1992.
 [43] J William, R Russell, T Nicholas, et al. Coronary artery surgery study (CASS): a randomized trial of coronary artery bypass surgery. Circulation, 68(5):939–950, 1983.
 [44] CASS Principal Investigators. Coronary artery surgery study (CASS): a randomized trial of coronary artery bypass surgery: comparability of entry characteristics and survival in randomized patients and nonrandomized patients meeting randomization criteria. Journal of the American College of Cardiology, 3(1):114–128, 1984.
 [45] Frank Harrell. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer, 2015.
 [46] Stephen R Cole and Miguel A Hernán. Constructing inverse probability weights for marginal structural models. American journal of epidemiology, 168(6):656–664, 2008.
 [47] M Alan Brookhart and Mark J Van Der Laan. A semiparametric model selection criterion with applications to the marginal structural model. Computational statistics & data analysis, 50(2):475–498, 2006.
 [48] Stijn Vansteelandt, Maarten Bekaert, and Gerda Claeskens. On model selection and model misspecification in causal inference. Statistical methods in medical research, 21(1):7–30, 2012.
 [49] Maya L Petersen, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J van der Laan. Diagnosing and responding to violations in the positivity assumption. Statistical methods in medical research, 21(1):31–54, 2012.

 [50] Elias Bareinboim and Judea Pearl. Meta-transportability of causal effects: A formal approach. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143, 2013.
 [51] Judea Pearl, Elias Bareinboim, et al. External validity: from do-calculus to transportability across populations. Statistical Science, 29(4):579–595, 2014.
 [52] Elias Bareinboim and Judea Pearl. Transportability of causal effects: Completeness results. In AAAI, pages 698–704, 2012.
 [53] Andrea Rotnitzky, James M Robins, and Daniel O Scharfstein. Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association, 93(444):1321–1339, 1998.
 [54] James M Robins, Andrea Rotnitzky, and Daniel O Scharfstein. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical models in epidemiology, the environment, and clinical trials, pages 1–94. Springer, 2000.
 [55] MA Hernán. Discussion of “Perils and potentials of self-selected entry to epidemiological studies and surveys”. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2):346–347, 2016.
 [56] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

 [57] James M Robins and Ya’acov Ritov. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine, 16(3):285–319, 1997.
 [58] Ashley I Naimi and Edward H Kennedy. Nonparametric double robustness. arXiv preprint arXiv:1711.07137, 2017.
 [59] Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
 [60] Karel Vermeulen and Stijn Vansteelandt. Bias-reduced doubly robust estimation. Journal of the American Statistical Association, 110(511):1024–1036, 2015.
Tables
Unit  

1  0  
1  
1  0  
1  1  
1  1  
0  
0 
Trial  OM  IOW1  IOW2  DR1  DR2  DR3  

250  2500  0.998  0.002  0.008  0.081  0.001  0.001  0.002 
250  5000  1.003  0.002  0.007  0.068  0.006  0.007  0.008 
250  10000  0.999  0.005  0.000  0.078  0.001  0.002  0.003 
500  2500  1.000  0.000  0.022  0.067  0.001  0.000  0.000 
500  5000  1.003  0.000  0.004  0.054  0.003  0.002  0.001 
500  10000  1.000  0.001  0.013  0.045  0.004  0.003  0.003 
1000  2500  1.000  0.001  0.010  0.031  0.001  0.001  0.002 
1000  5000  1.000  0.001  0.005  0.021  0.004  0.004  0.005 
1000  10000  1.000  0.000  0.011  0.014  0.005  0.003  0.003 
Trial  OM  IOW1  IOW2  DR1  DR2  DR3  

250  2500  0.087  0.067  4.698  1.341  0.300  0.164  0.172 
250  5000  0.089  0.068  5.324  1.387  0.342  0.166  0.176 
250  10000  0.090  0.066  3.295  1.399  0.291  0.163  0.173 
500  2500  0.044  0.034  2.515  0.962  0.169  0.100  0.095 
500  5000  0.043  0.033  2.016  0.962  0.143  0.096  0.092 
500  10000  0.044  0.032  2.321  0.945  0.161  0.096  0.093 
1000  2500  0.022  0.016  1.094  0.617  0.087  0.055  0.049 
1000  5000  0.023  0.016  1.070  0.644  0.088  0.057  0.050 
1000  10000  0.022  0.016  1.088  0.667  0.081  0.056  0.049 
Variable  Level 

Trial participants  

Surgery  Medical  
N  955  368  363  
Age, years  50.9 (7.7)  51.4 (7.2)  50.9 (7.4)  
Angina  None  195 (20.4)  83 (22.6)  81 (22.3)  
Present  760 (79.6)  285 (77.4)  282 (77.7)  
History of MI  No  406 (42.5)  159 (43.2)  135 (37.2)  
Yes  549 (57.5)  209 (56.8)  228 (62.8)  
LAD % obstruction  39.1 (38.7)  36.4 (38.0)  34.9 (37.0)  
Left ventricular score  7.1 (2.7)  7.4 (2.9)  7.3 (2.8)  
Diseased vessels  0  347 (36.3)  146 (39.7)  133 (36.6)  
608 (63.7)  222 (60.3)  230 (63.4)  
Ejection fraction, %  60.2 (12.3)  60.9 (13.1)  59.8 (12.8) 
CASS = Coronary Artery Surgery Study; LAD = left anterior descending coronary artery; MI = myocardial infarction; SD = standard deviation.
Estimator 


Risk difference  Risk ratio  

Trial-only  17.4% (13.6%, 21.4%)  20.4% (16.3%, 24.6%)  -3.0% (-8.7%, 2.7%)  0.85 (0.62, 1.15)  
OR  18.9% (13.9%, 22.7%)  20.1% (15.9%, 24.5%)  -1.3% (-7.9%, 4.2%)  0.94 (0.65, 1.24)  
IOW1  18.2% (13.9%, 22.7%)  20.1% (15.9%, 24.4%)  -1.9% (-7.8%, 4.2%)  0.91 (0.66, 1.24)  
IOW2  18.2% (14.6%, 23.5%)  20.1% (16.0%, 24.4%)  -1.9% (-7.2%, 4.8%)  0.91 (0.69, 1.28)  
DR1  18.7% (14.5%, 23.3%)  20.1% (16.0%, 24.4%)  -1.4% (-7.3%, 4.7%)  0.93 (0.68, 1.27)  
DR2  18.7% (14.5%, 23.3%)  20.1% (16.0%, 24.4%)  -1.4% (-7.3%, 4.7%)  0.93 (0.68, 1.27)  
DR3  18.7% (14.4%, 23.2%)  20.0% (15.9%, 24.3%)  -1.4% (-7.3%, 4.6%)  0.93 (0.68, 1.27) 
Appendix
In this Appendix we collect results regarding the consistency and double robustness properties of estimators discussed in the tutorial.
Identification
Under the assumptions in Section 3, the potential outcome mean among nonparticipants can be identified by the observed data:
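A sketch of the identification argument is given next. The notation is an assumption made for this sketch: $X$ denotes the baseline covariates, $S$ the trial participation indicator, $A$ the treatment, $Y$ the outcome, and $Y^a$ the potential outcome under treatment $a$. The steps presume that the assumptions in Section 3 are the usual ones of consistency, conditional exchangeability over treatment in the trial, conditional exchangeability over trial participation, and positivity (which ensures that the conditional expectations below are well defined).

```latex
\begin{align*}
\mathrm{E}[Y^a \mid S = 0]
  &= \mathrm{E}\big[\, \mathrm{E}[Y^a \mid X, S = 0] \,\big|\, S = 0 \,\big]
     && \text{(iterated expectations)} \\
  &= \mathrm{E}\big[\, \mathrm{E}[Y^a \mid X, S = 1] \,\big|\, S = 0 \,\big]
     && \text{(exchangeability over } S \text{)} \\
  &= \mathrm{E}\big[\, \mathrm{E}[Y^a \mid X, S = 1, A = a] \,\big|\, S = 0 \,\big]
     && \text{(exchangeability over } A \text{ in the trial)} \\
  &= \mathrm{E}\big[\, \mathrm{E}[Y \mid X, S = 1, A = a] \,\big|\, S = 0 \,\big]
     && \text{(consistency)}
\end{align*}
```

The last line involves only quantities that can be estimated from the observed data: the inner expectation uses trial participants, and the outer expectation averages over the covariate distribution of nonparticipants.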
Consistency and double robustness
In what follows, we assume that the model for the probability of treatment among trial participants is correctly specified (in fact, the “true” probability can be used instead).
OM:
When the model for the expectation of the outcome is correctly specified,
Consistency of the OM estimator follows because the numerator of the last expression above can be rewritten as
where the last equality follows from the identification result above.
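The argument for the OM estimator can be sketched as follows. The notation is assumed for illustration: $S$ is the trial participation indicator, $X$ the baseline covariates, $g_a(X) = \mathrm{E}[Y \mid X, S = 1, A = a]$ the conditional expectation of the outcome, and $\hat g_a$ its fitted version; the estimator is taken to be the average of fitted values over nonparticipants.

```latex
\begin{align*}
\hat\mu_{\mathrm{OM}}(a)
  &= \frac{\sum_{i=1}^{n} (1 - S_i)\, \hat g_a(X_i)}{\sum_{i=1}^{n} (1 - S_i)}
   \;\xrightarrow{p}\;
   \frac{\mathrm{E}\big[(1 - S)\, g_a(X)\big]}{\Pr[S = 0]}, \\
\mathrm{E}\big[(1 - S)\, g_a(X)\big]
  &= \Pr[S = 0]\; \mathrm{E}\big[\, g_a(X) \,\big|\, S = 0 \,\big] \\
  &= \Pr[S = 0]\; \mathrm{E}[Y^a \mid S = 0].
\end{align*}
```

Dividing the numerator by $\Pr[S = 0]$ leaves the potential outcome mean among nonparticipants.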
IOW1:
When the model for the probability of trial participation is correctly specified,
Consistency of IOW1 follows because the numerator of the last expression above can be rewritten as
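A sketch of this rewriting is given below. The notation is assumed for illustration: $p(X) = \Pr[S = 1 \mid X]$ is the probability of trial participation, $e_a(X) = \Pr[A = a \mid X, S = 1]$ the probability of treatment among trial participants, $g_a(X) = \mathrm{E}[Y \mid X, S = 1, A = a]$, and $I(\cdot)$ the indicator function. The numerator is taken to be the expectation of the outcome weighted by the inverse odds of participation and the inverse probability of treatment.

```latex
\begin{align*}
\mathrm{E}\!\left[ \frac{S\, I(A = a)\, \{1 - p(X)\}}{p(X)\, e_a(X)}\; Y \right]
  &= \mathrm{E}\!\left[ \frac{1 - p(X)}{p(X)\, e_a(X)}\;
       \mathrm{E}\big[ S\, I(A = a)\, Y \,\big|\, X \big] \right] \\
  &= \mathrm{E}\!\left[ \frac{1 - p(X)}{p(X)\, e_a(X)}\;
       p(X)\, e_a(X)\, g_a(X) \right] \\
  &= \mathrm{E}\big[ \{1 - p(X)\}\, g_a(X) \big] \\
  &= \mathrm{E}\big[ (1 - S)\, g_a(X) \big] \\
  &= \Pr[S = 0]\; \mathrm{E}[Y^a \mid S = 0].
\end{align*}
```

The second equality uses $\mathrm{E}[S\, I(A = a)\, Y \mid X] = p(X)\, e_a(X)\, g_a(X)$, the fourth uses $\mathrm{E}[1 - S \mid X] = 1 - p(X)$, and the last uses the identification result.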
IOW2:
When the model for the probability of trial participation is correctly specified, IOW2 converges to the same limit as IOW1 because
(A.1) 
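One way to see why normalization does not change the limit is sketched below; whether this matches the exact form of (A.1) is an assumption. The notation is also assumed: $p(X) = \Pr[S = 1 \mid X]$, $e_a(X) = \Pr[A = a \mid X, S = 1]$, and $I(\cdot)$ is the indicator function.

```latex
\begin{align*}
\mathrm{E}\!\left[ \frac{S\, I(A = a)\, \{1 - p(X)\}}{p(X)\, e_a(X)} \right]
  &= \mathrm{E}\!\left[ \frac{1 - p(X)}{p(X)\, e_a(X)}\;
       \Pr[S = 1, A = a \mid X] \right] \\
  &= \mathrm{E}\big[ 1 - p(X) \big] \\
  &= \Pr[S = 0].
\end{align*}
```

Under this sketch, the sum of the estimated weights divided by the number of nonparticipants converges to 1, so dividing by either quantity gives the same limit.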
Influence function for the observed data functional:
Write the identified observed data functional as a function of the observed data law. The first-order influence function for this functional is
where the 0 subscript indicates the “true” law. The above influence function implies the following in-sample one-step estimator for the functional:
where the estimator involves the estimated proportion of nonparticipants and generic estimators for the conditional expectation of the outcome, the conditional probability of treatment among trial participants, and the conditional probability of trial participation. Of note, this result suggests that, unless one is willing to make assumptions beyond those in Section 3, estimators of the functional should ignore treatment and outcome data in the sample of trial-eligible nonparticipants from the target population (i.e., individuals whose trial participation indicator equals 0), even if available (see [12] for a similar observation). Estimator DR1 in the main text of the tutorial is obtained from this one-step estimator, using parametric models to estimate conditional probabilities and expectations.
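The structure of such a one-step estimator can be sketched in code. The function below is a minimal illustration rather than the implementation used in this tutorial: the name `dr1_estimate`, the argument names, and the assumed form (outcome-model predictions averaged over nonparticipants, plus inverse-odds-weighted residuals from trial participants in the arm of interest) are assumptions. Nuisance estimates are passed in as plain lists, so any fitting method can supply them.

```python
def dr1_estimate(s, a, y, p_hat, e_hat, g_hat, arm):
    """Augmented inverse odds weighting estimate of the mean outcome under
    `arm` among nonparticipants (s == 0).

    s: participation indicators; a, y: treatment and outcome (None when s == 0);
    p_hat: estimated probability of trial participation given covariates;
    e_hat: estimated probability of receiving `arm` among trial participants;
    g_hat: predicted outcome under `arm` from a model fit in the trial,
           evaluated for every cohort member.
    """
    n0 = sum(1 for si in s if si == 0)
    total = 0.0
    for si, ai, yi, p, e, g in zip(s, a, y, p_hat, e_hat, g_hat):
        if si == 0:
            # nonparticipants contribute only their outcome-model prediction;
            # their treatment and outcome data (if any) are never used
            total += g
        elif ai == arm:
            # participants in the arm contribute a weighted residual
            w = (1.0 - p) / (p * e)
            total += w * (yi - g)
    return total / n0


# Tiny worked example with made-up numbers (illustrative only).
s = [1, 1, 0, 0]
a = [1, 0, None, None]
y = [3.0, 1.0, None, None]
p_hat = [0.5, 0.5, 0.5, 0.5]
e_hat = [0.5, 0.5, 0.5, 0.5]
g_hat = [2.0, 2.0, 4.0, 6.0]

estimate = dr1_estimate(s, a, y, p_hat, e_hat, g_hat, arm=1)
print(estimate)  # 6.0: predictions sum to 10, weighted residual is 2 * (3 - 2), n0 = 2
```

When the outcome-model predictions match the observed outcomes exactly in the trial, the residual term vanishes and the estimate reduces to the average prediction among nonparticipants.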
Double robustness of DR1:
Provided the limiting values for the working models exist (even if the models are misspecified), DR1 converges to
(A.2) 
We now consider two cases with respect to potential misspecification of the working models for the probability of trial participation and the expectation of the outcome.

Model for the probability of trial participation correctly specified; model for the expectation of the outcome incorrectly specified:

following the argument for estimator IOW1, the first term in the bracketed part of (A.2) converges to the potential outcome mean among nonparticipants; by iterated expectation, the sum of the other two terms converges to 0.

Model for the expectation of the outcome correctly specified; model for the probability of trial participation incorrectly specified:

following the argument for the OM estimator, the last term in the bracketed part of (A.2) converges to the potential outcome mean among nonparticipants; by iterated expectation, the sum of the other two terms converges to 0.
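The two cases can be sketched with explicit expressions. The form of the limit below is an assumption about (A.2), and the notation is also assumed: $p^*(X)$ and $g^*_a(X)$ denote the limits of the working models for the probability of trial participation and the expectation of the outcome, $p(X)$ and $g_a(X)$ the corresponding true functions, and $e_a(X)$ the probability of treatment among trial participants.

```latex
% Assumed form of the limit of DR1:
\begin{align*}
\frac{1}{\Pr[S = 0]}\;
\mathrm{E}\big[\, w^*_a\, Y \;-\; w^*_a\, g^*_a(X) \;+\; (1 - S)\, g^*_a(X) \,\big],
\qquad
w^*_a = \frac{S\, I(A = a)\, \{1 - p^*(X)\}}{p^*(X)\, e_a(X)} .
\end{align*}

% Case 1: participation model correct (p^* = p, so w^*_a = w_a), outcome model possibly wrong.
\begin{align*}
\mathrm{E}\big[ \{(1 - S) - w_a\}\, g^*_a(X) \big]
  &= \mathrm{E}\big[ \{ \mathrm{E}[1 - S \mid X] - \mathrm{E}[w_a \mid X] \}\, g^*_a(X) \big] \\
  &= \mathrm{E}\big[ \{ (1 - p(X)) - (1 - p(X)) \}\, g^*_a(X) \big] = 0 .
\end{align*}

% Case 2: outcome model correct (g^*_a = g_a), participation model possibly wrong.
\begin{align*}
\mathrm{E}\big[ w^*_a\, \{ Y - g_a(X) \} \big]
  &= \mathrm{E}\!\left[ \frac{1 - p^*(X)}{p^*(X)\, e_a(X)}\;
       \mathrm{E}\big[ S\, I(A = a)\, \{ Y - g_a(X) \} \,\big|\, X \big] \right] \\
  &= \mathrm{E}\!\left[ \frac{1 - p^*(X)}{p^*(X)\, e_a(X)}\;
       p(X)\, e_a(X)\, \{ g_a(X) - g_a(X) \} \right] = 0 .
\end{align*}
```

In the first case the remaining term is the limit of IOW1; in the second it is the limit of the OM estimator. Either way the limit equals the potential outcome mean among nonparticipants.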
DR2:
When the model for the probability of trial participation is correctly specified, DR2 converges to the same limit as DR1 because of (A.1). When the model for the expectation of the outcome is correctly specified, DR2 converges to the same limit as DR1 because, by iterated expectation, the first term of the estimator in equation (5) of the main text converges to 0.
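For concreteness, the argument can be sketched under an assumed form of DR2, in which the weighted residual term is normalized by the sum of the weights. This form is an assumption and may differ in detail from equation (5) of the main text; $\hat w_a(X_i, S_i, A_i)$ denotes the estimated inverse odds weight for observation $i$ and $\hat g_a$ the fitted outcome model.

```latex
\begin{align*}
\hat\mu_{\mathrm{DR2}}(a)
  = \frac{\sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i)\, \{ Y_i - \hat g_a(X_i) \}}
         {\sum_{i=1}^{n} \hat w_a(X_i, S_i, A_i)}
  \;+\;
    \frac{\sum_{i=1}^{n} (1 - S_i)\, \hat g_a(X_i)}
         {\sum_{i=1}^{n} (1 - S_i)} .
\end{align*}
```

With a correctly specified outcome model, the numerator of the first term has mean zero in the limit while its denominator, scaled by the sample size, converges to a positive constant, so the first term vanishes and the second term is the OM estimator. With a correctly specified participation model, the sum of the weights and the number of nonparticipants are asymptotically interchangeable, so DR2 and DR1 share the same limit.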
Double robustness of DR3:
Complete simulation results
Trial  OM  IOW1  IOW2  DR1  DR2  DR3  
250  2500  0  0  0.001  0.000  0.009  0.011  0.002  0.002  0.003 
250  2500  0  1  0.002  0.003  0.010  0.004  0.001  0.001  0.000 
250  2500  1  0  0.007  0.001  0.020  0.009  0.003  0.002  0.002 
250  2500  1  1  0.998  0.002  0.008  0.081  0.001  0.001  0.002 
250  5000  0  0  0.003  0.000  0.011  0.005  0.001  0.000  0.001 
250  5000  0  1  0.000  0.003  0.023  0.001  0.012  0.008  0.005 
250  5000  1  0  0.001  0.001  0.005  0.004  0.003  0.003  0.004 
250  5000  1  1  1.003  0.002  0.007  0.068  0.006  0.007  0.008 
250  10000  0  0  0.000  0.002  0.005  0.004  0.000  0.002  0.002 
250  10000  0  1  0.002  0.001  0.019  0.003  0.004  0.005  0.003 
250  10000  1  0  0.005  0.004  0.007  0.002  0.002  0.002  0.002 
250  10000  1  1  0.999  0.005  0.000  0.078  0.001  0.002  0.003 
500  2500  0  0  0.000  0.000  0.010  0.008  0.000  0.001  0.001 
500  2500  0  1  0.003  0.002  0.008  0.001  0.001  0.003  0.003 
500  2500  1  0  0.002  0.001  0.005  0.003  0.001  0.001  0.001 
500  2500  1  1  1.000  0.000  0.022  0.067  0.001  0.000  0.000 
500  5000  0  0  0.001  0.000  0.005  0.006  0.002  0.002  0.001 
500  5000  0  1  0.001  0.001  0.012  0.005  0.007  0.005  0.003 
500  5000  1  0  0.000  0.003  0.006  0.004  0.002  0.002  0.002 
500  5000  1  1  1.003  0.000  0.004  0.054  0.003  0.002  0.001 
500  10000  0  0  0.001  0.001  0.001  0.001  0.001  0.001  0.001 
500  10000  0  1  0.003  0.001  0.001  0.002  0.001  0.000  0.000 
500  10000  1  0  0.002  0.001  0.016  0.013  0.002  0.002  0.002 
500  10000  1  1  1.000  0.001  0.013  0.045  0.004  0.003  0.003 
1000  2500  0  0  0.001  0.001  0.003  0.001  0.001  0.000  0.000 
1000  2500  0  1  0.001  0.001  0.014  0.003  0.000  0.001  0.002 
1000  2500  1  0  0.000  0.001  0.004  0.002  0.001  0.001  0.001 
1000  2500  1  1  1.000  0.001  0.010  0.031  0.001  0.001  0.002 
1000  5000  0  0  0.000  0.001  0.000  0.000  0.000  0.000  0.000 
1000  5000  0  1  0.001  0.003  0.005  0.006  0.000  0.001  0.002 
1000  5000  1  0  0.001  0.002  0.001  0.002  0.002  0.002  0.003 
1000  5000  1  1  1.000  0.001  0.005  0.021  0.004  0.004  0.005 
1000  10000  0  0  0.001  0.001  0.003  0.002  0.002  0.002  0.002 
1000  10000  0  1  0.000  0.001  0.013  0.010  0.004  0.003  0.002 
1000  10000  1  0  0.000  0.000  0.004  0.002  0.003  0.003  0.002 
1000  10000  1  1  1.000  0.000  0.011  0.014  0.005  0.003  0.003 
Trial  OM  IOW1  IOW2  DR1  DR2  DR3  
250  2500  0  0  0.066  0.050  0.691  0.504  0.114  0.094  0.093 
250  2500  0  1  0.063  0.067  4.032  0.999  0.328  0.164  0.177 
250  2500  1  0  0.088  0.051  1.303  0.629  0.125  0.097  0.094 
250  2500  1  1  0.087  0.067  4.698  1.341  0.300  0.164  0.172 
250  5000  0  0  0.064  0.050  1.009  0.508  0.115  0.094  0.094 
250  5000  0  1  0.063  0.067  2.760  0.961  0.276  0.164  0.174 
250  5000  1  0  0.090  0.050  0.952  0.637  0.122  0.096  0.094 
250  5000  1  1  0.089  0.068  5.324  1.387  0.342  0.166  0.176 
250  10000  0  0  0.067  0.051  0.902  0.513  0.123  0.097  0.095 
250  10000  0  1  0.063  0.067  2.897  0.988  0.298  0.162  0.174 
250  10000  1  0  0.086  0.050  0.854  0.629  0.111  0.091  0.092 
250  10000  1  1  0.090  0.066  3.295  1.399  0.291  0.163  0.173 
500  2500  0  0  0.032  0.024  0.435  0.319  0.061  0.051  0.047 
500  2500  0  1  0.032  0.034  1.572  0.712  0.169  0.098  0.092 
500  2500  1  0  0.044  0.025  0.437  0.369  0.062  0.052  0.049 
500  2500  1  1  0.044  0.034  2.515  0.962  0.169  0.100  0.095 
500  5000  0  0  0.032  0.024  0.389  0.311  0.059  0.050  0.046 
500  5000  0  1  0.032  0.033  1.443  0.699  0.157  0.098  0.094 
500  5000  1  0  0.044  0.024  0.450  0.363  0.059  0.051  0.047 
500  5000  1  1  0.043  0.033  2.016  0.962  0.143  0.096  0.092 
500  10000  0  0  0.032  0.025  0.372  0.311  0.057  0.050  0.047 
500  10000  0  1  0.032  0.033  1.431  0.700  0.150  0.099  0.093 
500  10000  1  0  0.045  0.024  0.509  0.367  0.060  0.051  0.048 
500  10000  1  1  0.044  0.032  2.321  0.945  0.161  0.096  0.093 
1000  2500  0  0  0.016  0.012  0.182  0.176  0.030  0.027  0.025 
1000  2500  0  1  0.016  0.016  1.134  0.457  0.077  0.055  0.049 
1000  2500  1  0  0.021  0.013  0.227  0.211  0.030  0.028  0.025 
1000  2500  1  1  0.022  0.016  1.094  0.617  0.087  0.055  0.049 
1000  5000  0  0  0.016  0.012  0.174  0.173  0.029  0.027  0.025 
1000  5000  0  1  0.016  0.016  0.658  0.451  0.075  0.056  0.049 
1000  5000  1  0  0.022  0.013  0.240  0.219  0.031  0.028  0.025 
1000  5000  1  1  0.023  0.016  1.070  0.644  0.088  0.057  0.050 
1000  10000  0  0  0.016  0.012  0.170  0.176  0.030  0.027  0.025 
1000  10000  0  1  0.016  0.016  0.798  0.462  0.077  0.057  0.050 
1000  10000  1  0  0.022  0.012  0.222  0.211  0.030  0.027  0.025 
1000  10000  1  1  0.022  0.016  1.088  0.667  0.081  0.056  0.049 
transportability_odds, Date: August 24, 2019 Revision: 15.0