Study designs for extending causal inferences from a randomized trial to a target population

05/19/2019
by   Issa J. Dahabreh, et al.
Brown University

We examine study designs for extending (generalizing or transporting) causal inferences from a randomized trial to a target population. Specifically, we consider nested trial designs, where randomized individuals are nested within a sample from the target population, and non-nested trial designs, including composite dataset designs, where a randomized trial is combined with a separately obtained sample of non-randomized individuals from the target population. We show that the causal quantities that can be identified in each study design depend on what is known about the probability of sampling non-randomized individuals. For each study design, we examine identification of potential outcome means via the g-formula and inverse probability weighting. Last, we explore the implications of the sampling properties underlying the designs for the identification and estimation of the probability of trial participation.


1 Sampling properties and the observed data

For a well-defined causal question, investigators can specify a set of eligibility criteria that define an actual population of individuals to whom research findings would be applicable, in the sense that we can in principle identify all individuals who meet the criteria. For instance, when designing a randomized trial, the trial eligibility criteria define an actual population of all trial-eligible individuals. In this paper, we view the actual population as a simple random sample from an (infinite) superpopulation of individuals; we refer to this superpopulation as the target population [10]. We are interested in causal quantities that pertain to the target population or to its subsets (e.g., defined by trial participation status).

To introduce some notation, let $X$ denote a vector of baseline covariates; $A$ the treatment assignment indicator; $Y$ the observed outcome; and $S$ the trial participation indicator, with $S = 1$ for randomized individuals and $S = 0$ for non-randomized individuals (individuals who are either not invited to participate in the trial or who are invited but decline). To capture the notion that some non-randomized individuals in the actual population ($S = 0$) may not be sampled, let $D$ be an indicator for whether an individual in the actual population is sampled and contributes data to the analyses, with $D = 1$ for sampled individuals and $D = 0$ for non-sampled individuals.

We can now describe the sampling properties that underlie nested and non-nested study designs. These properties describe how the observed study sample relates to the actual population; the underlying actual population and (hypothetical) target population are the same. Figure 1 illustrates the conceptual relationships between designs, their sampling properties, and the observed data.

In the main text of this paper, we consider simple random samples, with known or unknown sampling probabilities, from the actual population or from the non-randomized subset of the actual population. As we discuss below, our main results, with minor modifications, hold when the sampling probability is a known function of auxiliary baseline covariates rather than a known constant (i.e., when we have random sampling, not simple random sampling). Allowing the sampling probabilities to depend on auxiliary covariates, however, does not lead to additional insights regarding study design [11]; for this reason, in the main text, we assume that the sampling probability does not depend on covariates.

1.1 Nested trial designs

We consider two variants of the nested trial design: one in which a census of the actual population is taken, and one in which the non-randomized individuals are sub-sampled.

Census of the actual population: In this variant of the nested trial design, the individuals contributing data to the analysis are assumed to be a census of the actual population, that is,

$$\Pr[D = 1 \mid X, S = 1] = \Pr[D = 1 \mid X, S = 0] = 1;$$

thus, nested trial designs can be viewed as simple random samples from the superpopulation. In this design, it is common to define the target population implicitly, based on the actual population in which the trial is nested. For example, in comprehensive cohort studies [12], investigators nest a trial within a cohort of all individuals who met the trial’s eligibility criteria and were invited to participate in the trial. They then define the target population as the population from which cohort members (i.e., the actual population of trial-eligible individuals invited to participate in the trial) could have been a simple random sample. Thus, in this design, investigators need to ensure that the cohort represents the target population they are interested in; that is, the trial eligibility criteria need to be broad enough to address the research question and the individuals invited to participate in the trial (who form the cohort in which the trial is nested) need to represent the target population of interest.

Sub-sampling of non-randomized individuals: In this variant, we collect data from all randomized individuals in the actual population but only collect baseline covariate data from a sub-sample of the non-randomized individuals in the actual population, with sampling probability that is a known constant. The sampling properties can be summarized as

$$\Pr[D = 1 \mid X, S = 1] = 1 \quad \text{and} \quad \Pr[D = 1 \mid X, S = 0] = c,$$

where $c$ is a known constant, with $0 < c \leq 1$. Note that the nested trial design with a census of the actual population can be viewed as a special case of the sub-sampling design, with $c = 1$. Using $c < 1$ is statistically less efficient than using $c = 1$, but may improve research economy, for example, if the collection of covariate data among non-randomized individuals is expensive [11]. Furthermore, as noted, a variant of the nested trial design with sub-sampling allows the selection of non-randomized individuals to depend on auxiliary baseline covariates; we show how our results extend to that case in the Appendix.

1.2 Non-nested trial designs

In non-nested trial designs, data from randomized and non-randomized individuals are obtained separately. Investigators assume that data from all randomized individuals can be combined with data from a simple random sample of non-randomized individuals from the actual population, with sampling probability that is an unknown constant (e.g., [4]). The sampling properties can be summarized as

$$\Pr[D = 1 \mid X, S = 1] = 1 \quad \text{and} \quad \Pr[D = 1 \mid X, S = 0] = u,$$

where $u$ is an unknown constant.

An example of a non-nested trial design is the composite dataset design [4, 7]. Here, investigators append the data from a randomized trial to data from a convenience sample of non-randomized individuals, often obtained from routinely collected data sources (such as claims or electronic medical records databases, or prospective cohort studies). The assumption, often left unstated, is that the sample of non-randomized individuals is a simple random sample from the population of non-randomized individuals (or a well-defined subset thereof) to whom the investigators wish to extend the trial results [7, 4].

1.3 The observed data

In both nested and non-nested designs, we collect data on baseline covariates, treatment, and outcome from randomized individuals; in contrast, as we show in Section 3, only baseline covariate data are needed from non-randomized individuals.

More specifically, for nested designs the observed data consist of realizations of

$$(S_i, \; D_i, \; D_i X_i, \; S_i A_i, \; S_i Y_i), \quad i = 1, \ldots, n.$$

Because all randomized individuals are sampled, we have that $\Pr[D = 1 \mid S = 1] = 1$. No covariate, treatment, or outcome data are available for non-sampled non-randomized individuals, $\{S = 0, D = 0\}$. Note also that in nested trial designs with a census of the actual population, the subset $\{S = 0, D = 0\}$ does not exist.

In non-nested trial designs (e.g., composite dataset designs), we typically do not know the number of non-sampled non-randomized individuals; thus, the observed data consist of realizations only of

$$(X_i, \; S_i, \; S_i A_i, \; S_i Y_i) \quad \text{for individuals with } D_i = 1.$$

2 Causal quantities of interest and identifiability conditions

2.1 Causal quantities of interest

In order to define causal quantities, let $Y^a$ be the potential (counterfactual) outcome under intervention to set treatment to $a$ [13, 14]. We are interested in the mean of these potential outcomes in the target population, $E[Y^a]$, or in the non-randomized subset of the target population, $E[Y^a \mid S = 0]$. For example, $E[Y^a]$ captures the mean outcome under the strategy of treating all individuals in the target population with $a$. It is also often scientifically and methodologically interesting to compare $E[Y^a \mid S = 1]$ against $E[Y^a \mid S = 0]$, to examine whether the potential outcome mean under treatment $a$ differs between trial participants and non-participants in the target population [3].

2.2 Identifiability conditions

For all study designs, the following identifiability conditions are sufficient to extend inferences from a clinical trial to a target population [3, 7]:

(1) Consistency of potential outcomes: interventions are well-defined, so that if $A_i = a$ then $Y_i^a = Y_i$. Implicit in this notation is that the offer to participate in the trial and trial participation itself do not have an effect on the outcome except through treatment assignment (e.g., there are no Hawthorne effects).

(2) Mean exchangeability among trial participants: $E[Y^a \mid X, S = 1, A = a] = E[Y^a \mid X, S = 1]$. This condition is expected to hold because of randomization (marginal or conditional on $X$).

(3) Positivity of treatment assignment in the trial: for each $a$ and each $x$ with positive density among randomized individuals, $\Pr[A = a \mid X = x, S = 1] > 0$. This condition is also expected to hold because of randomization.

(4) Mean generalizability (exchangeability over $S$): $E[Y^a \mid X, S = 1] = E[Y^a \mid X]$ for each $a$. For binary $S$, this condition implies the mean transportability condition $E[Y^a \mid X, S = 1] = E[Y^a \mid X, S = 0]$ (provided both conditional expectations are well-defined).

(5) Positivity of trial participation: for each $x$ with positive density in the target population, $\Pr[S = 1 \mid X = x] > 0$.

In the conditions listed above, we have used $X$ generically to denote baseline covariates. It is possible, however, that strict subsets of $X$ are adequate to satisfy different exchangeability conditions. For example, in a marginally randomized trial the mean exchangeability among trial participants holds unconditionally. Furthermore, to focus on issues related to selective trial participation, we will assume complete adherence to the assigned treatment and no loss-to-follow-up.

2.3 Trial eligibility criteria and choice of target population

Now that we have specified the causal quantities of interest and listed identifiability conditions, we can consider the choice of target population in more detail. As noted, the target population should be determined by the scientific question investigators hope to address. In many cases, when using the methods described in this paper, it is sensible to limit the target population to the population of individuals meeting the trial eligibility criteria or to a subset of that population. To the extent that the variables used to define the trial eligibility criteria are needed to satisfy the mean generalizability condition, the restriction to trial-eligible individuals is needed for the positivity of trial participation condition to hold – individuals not meeting the criteria are not allowed to enter the trial. In some cases, however, investigators may be able to argue that only a subset of the variables used to determine trial eligibility are necessary for the mean generalizability condition to hold. In such cases, the target population can be broader than the population of trial eligible individuals. The essential requirement is that the distributions of covariates needed to satisfy the mean generalizability condition among randomized and non-randomized individuals should have common support.

3 Identification via the g-formula

We begin by considering identification by the g-formula [15]. Using the identifiability conditions of Section 2, it is straightforward to show that the potential outcome mean in the target population, $E[Y^a]$ [3], can be re-expressed as

$$E[Y^a] = \int E[Y \mid X = x, S = 1, A = a] \, dF(x), \tag{1}$$

where $F(x)$ denotes the distribution of $X$ in the target population.

The potential outcome mean among non-randomized individuals in the target population, $E[Y^a \mid S = 0]$ [7], can be re-expressed as

$$E[Y^a \mid S = 0] = \int E[Y \mid X = x, S = 1, A = a] \, dF(x \mid S = 0), \tag{2}$$

where $F(x \mid S = 0)$ denotes the distribution of $X$ among non-randomized individuals in the target population (i.e., the subset with $S = 0$).

First, we note that both results involve the conditional expectation of the outcome among trial participants assigned to treatment $a$, $E[Y \mid X, S = 1, A = a]$. Because both nested and non-nested designs assume that all randomized individuals are sampled, this expectation is identifiable in both designs.

Next, we turn our attention to the identification of $F(x)$ and $F(x \mid S = 0)$, which are necessary to identify $E[Y^a]$ and $E[Y^a \mid S = 0]$, respectively. There are interesting differences between the designs when it comes to identifying these distributions, and we consider each design individually below.

3.1 Nested trial designs

Census of the actual population: Identification is most straightforward in this design, because data are available from all members of the actual population (both randomized and non-randomized) and the actual population is a simple random sample from the target population. Thus, $F(x)$ is identifiable. Furthermore, in this design, every subgroup of the actual population defined on the basis of baseline covariates or trial participation is a simple random sample from the corresponding subgroup in the target population. Thus, the distribution of covariates among non-randomized individuals, $F(x \mid S = 0)$, can also be identified. It follows that all the components on the right-hand-sides of (1) and (2) are identifiable, establishing that $E[Y^a]$ and $E[Y^a \mid S = 0]$ are identifiable.

Sub-sampling of non-randomized individuals: For this design, identification of the marginal distribution of $X$ is slightly more involved because the non-randomized individuals contributing data to the analysis are a sub-sample from the non-randomized individuals in the actual population.

First, by the law of total probability, for binary $S$,

$$F(x) = F(x \mid S = 1) \Pr[S = 1] + F(x \mid S = 0) \Pr[S = 0].$$

Clearly, $F(x \mid S = s)$, for $s = 0, 1$, is identifiable because the randomized and non-randomized sampled individuals are simple random samples of the target population subsets with $S = 1$ and $S = 0$, respectively. The only difficulty, then, is identification of the marginal probability of trial participation, $\Pr[S = 1]$, because $\Pr[S = 0] = 1 - \Pr[S = 1]$. As we show in the Appendix, under the sampling properties of the nested trial design with sub-sampling of non-randomized individuals,

$$\Pr[S = 1] = \left\{ 1 + \frac{1}{c} \times \frac{\Pr[S = 0 \mid D = 1]}{\Pr[S = 1 \mid D = 1]} \right\}^{-1}. \tag{3}$$

The odds of non-participation in the trial among sampled individuals, $\Pr[S = 0 \mid D = 1] / \Pr[S = 1 \mid D = 1]$, are identifiable; and, as defined in Section 1.1, $c$ is a known constant. It follows that $\Pr[S = 1]$ is identifiable and, consequently, $E[Y^a]$ is also identifiable because all the components of the integral on the right-hand-side of (1) are identifiable.
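As a minimal sketch of how the marginal probability of trial participation might be recovered under sub-sampling, the snippet below rescales the odds of non-participation among sampled individuals by the known sampling probability of non-randomized individuals. The function name `marginal_participation` and the counts are hypothetical, and sample proportions stand in for the corresponding probabilities.

```python
def marginal_participation(n_randomized, n_sampled_nonrandomized, sampling_prob):
    """Marginal probability of trial participation in the target population.

    sampling_prob is the known probability of sampling a non-randomized
    individual; all randomized individuals are assumed to be sampled.
    """
    if not 0 < sampling_prob <= 1:
        raise ValueError("sampling_prob must be in (0, 1]")
    # Odds of non-participation among sampled individuals.
    odds_nonparticipation_sampled = n_sampled_nonrandomized / n_randomized
    # Dividing by the sampling probability recovers the odds in the actual population.
    return 1.0 / (1.0 + odds_nonparticipation_sampled / sampling_prob)


# Hypothetical actual population: 1,000 randomized and 9,000 non-randomized
# individuals, of whom a 20% sub-sample (1,800) contributes covariate data.
p_subsample = marginal_participation(1000, 1800, 0.2)
# With a census (sampling probability 1) the same formula applies to all 9,000.
p_census = marginal_participation(1000, 9000, 1.0)
print(p_subsample, p_census)  # both approximately 0.1
```

Ignoring the sampling probability in the first call would give 1000 / 2800, overstating trial participation.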

Turning our attention to $F(x \mid S = 0)$, we note that it is identifiable because the sampled non-randomized individuals are a simple random sample from the non-randomized individuals in the actual population. It follows that $E[Y^a \mid S = 0]$ is identifiable because all the components of the integral on the right-hand-side of (2) are identifiable.

3.2 Non-nested trial designs

Using an argument parallel to that for nested trial designs with sub-sampling, when the probability of sampling a non-randomized individual is unknown, the probability of trial participation, $\Pr[S = 1]$, can be expressed in the form of equation (3), substituting the unknown constant $u$ for $c$,

$$\Pr[S = 1] = \left\{ 1 + \frac{1}{u} \times \frac{\Pr[S = 0 \mid D = 1]}{\Pr[S = 1 \mid D = 1]} \right\}^{-1}.$$

Because, as defined in Section 1.2, $u$ is an unknown constant, $\Pr[S = 1]$ is not identifiable and consequently $E[Y^a]$ is also not identifiable.

Turning our attention to $F(x \mid S = 0)$, we see that it is identifiable because the non-randomized individuals contributing data to the analysis are a simple random sample from the non-randomized individuals in the actual population (even though the sampling probability is unknown). It follows that $E[Y^a \mid S = 0]$ is identifiable in non-nested trial designs because all the components of the integral in (2) are identifiable.

4 Identification via IP weighting

There has been considerable recent interest [1, 3, 2, 4, 7] in using weighting methods to identify the potential outcome means in equations (1) and (2), because the specification of models for the probability of trial participation is often considered a somewhat easier task than the specification of models for the outcome among trial participants.

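First, consider $E[Y^a]$, which we argued is identifiable in nested trials. As shown in [3], we can re-express the right-hand-side of (1) as

$$\int E[Y \mid X = x, S = 1, A = a] \, dF(x) = E\left[ \frac{I(S = 1, A = a) \, Y}{\Pr[S = 1 \mid X] \, \Pr[A = a \mid X, S = 1]} \right], \tag{4}$$

where $I(\cdot)$ denotes the indicator function.

Now, consider $E[Y^a \mid S = 0]$, which we argued is identifiable by the g-formula in both nested and non-nested trials. As shown in [7], we can re-express the right-hand-side of (2) as

$$\int E[Y \mid X = x, S = 1, A = a] \, dF(x \mid S = 0) = \frac{E\left[ \dfrac{I(S = 1, A = a) \, Y \, \Pr[S = 0 \mid X]}{\Pr[S = 1 \mid X] \, \Pr[A = a \mid X, S = 1]} \right]}{\Pr[S = 0]}. \tag{5}$$

The probability of treatment among trial participants, $\Pr[A = a \mid X, S = 1]$, is under the control of the investigators and does not pose any difficulties for identification of either functional. Now, for each design, we focus our attention on the conditional probability or the conditional odds of trial participation, which appear in expressions (4) and (5), respectively.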
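The agreement between the g-formula and weighting expressions can be checked by direct enumeration in a toy example. The sketch below assumes a hypothetical target population with one binary covariate; every number and name in it is made up for illustration. It computes the g-formula averages and the inverse probability and inverse odds weighted expectations from the same ingredients.

```python
# Hypothetical discrete target population with a single binary covariate X.
pr_x = {0: 0.6, 1: 0.4}             # Pr[X = x]
pr_s1_given_x = {0: 0.10, 1: 0.30}  # Pr[S = 1 | X = x]
pr_a_given_x = {0: 0.5, 1: 0.5}     # Pr[A = a | X = x, S = 1], set by the trial design
mean_y = {0: 0.20, 1: 0.50}         # E[Y | X = x, S = 1, A = a]

# Marginal probability of non-participation, Pr[S = 0].
pr_s0 = sum(pr_x[x] * (1 - pr_s1_given_x[x]) for x in pr_x)

# g-formula: average the trial conditional means over each covariate distribution.
g_target = sum(mean_y[x] * pr_x[x] for x in pr_x)
g_nonrandomized = sum(
    mean_y[x] * pr_x[x] * (1 - pr_s1_given_x[x]) / pr_s0 for x in pr_x
)


def weighted_expectation(weight):
    """E[I(S = 1, A = a) * Y * weight(X)], enumerating the levels of X."""
    total = 0.0
    for x in pr_x:
        pr_cell = pr_x[x] * pr_s1_given_x[x] * pr_a_given_x[x]  # Pr[X=x, S=1, A=a]
        total += pr_cell * mean_y[x] * weight(x)
    return total


# Inverse probability of participation weighting (target population).
ipw_target = weighted_expectation(
    lambda x: 1 / (pr_s1_given_x[x] * pr_a_given_x[x])
)

# Inverse odds of participation weighting (non-randomized individuals).
iow_nonrandomized = weighted_expectation(
    lambda x: (1 - pr_s1_given_x[x]) / (pr_s1_given_x[x] * pr_a_given_x[x])
) / pr_s0

print(g_target, ipw_target)
print(g_nonrandomized, iow_nonrandomized)
```

Each weighted expectation matches its g-formula counterpart because the participation and treatment probabilities in the weights cancel the probability of being a treated trial participant at each covariate level.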

4.1 Nested trial designs

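Census of the actual population: Identification of $\Pr[S = 1 \mid X]$ in this design is an obvious consequence of the fact that individuals contributing data to the analysis are a simple random sample from the target population. In other words, because we have sampled all individuals in the actual population, which is a simple random sample of the target population, $\Pr[S = 1 \mid X] = \Pr[S = 1 \mid X, D = 1]$.

Sub-sampling of non-randomized individuals: Identification of $\Pr[S = 1 \mid X]$ is only a little more difficult when we sample non-randomized individuals from the actual population. As we show in the Appendix, under the sampling properties of this design,

$$\Pr[S = 1 \mid X] = \frac{c \times \dfrac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}}{1 + c \times \dfrac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}}, \tag{6}$$

where the conditional odds of trial participation among sampled individuals, $\Pr[S = 1 \mid X, D = 1] / \Pr[S = 0 \mid X, D = 1]$, are identifiable and $c$ is a known constant defined in Section 1.1. It follows that $\Pr[S = 1 \mid X]$ is identifiable and the odds of trial participation can be written as

$$\frac{\Pr[S = 1 \mid X]}{\Pr[S = 0 \mid X]} = c \times \frac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}. \tag{7}$$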

In sum, the IP weighting re-expressions of the functionals of interest are identifiable in nested trial designs.
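The conversion from the conditional probability of participation among sampled individuals to the probability in the target population can be sketched in a few lines. The function name `population_participation_prob` and the numeric values are hypothetical; the only inputs are the sampled-data probability at a given covariate value and the known sampling probability of non-randomized individuals.

```python
def population_participation_prob(p_sampled, sampling_prob):
    """Conditional probability of trial participation in the target population.

    p_sampled is the probability of participation among sampled individuals
    at some covariate value; sampling_prob is the known probability of
    sampling a non-randomized individual.
    """
    odds_sampled = p_sampled / (1.0 - p_sampled)
    # Sub-sampling non-randomized individuals inflates the odds of
    # participation by 1 / sampling_prob; multiplying undoes the inflation.
    odds_population = sampling_prob * odds_sampled
    return odds_population / (1.0 + odds_population)


# Hypothetical covariate stratum: 4 of every 13 sampled individuals are randomized,
# and one in four non-randomized individuals was sampled.
p_pop = population_participation_prob(4 / 13, 0.25)
print(p_pop)  # approximately 0.1
```

With a sampling probability of 1 the function returns its input unchanged, which corresponds to the census variant.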

4.2 Non-nested trial designs

We can use an argument parallel to that for nested trial designs with sub-sampling to establish that, when the sampling probability for non-randomized individuals is unknown, the probability of trial participation, $\Pr[S = 1 \mid X]$, can be expressed as

$$\Pr[S = 1 \mid X] = \frac{u \times \dfrac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}}{1 + u \times \dfrac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}}. \tag{8}$$

Because, as defined in Section 1.2, $u$ is unknown, the conditional probability of trial participation, which appears on the right-hand-side of (4), is not identifiable; this confirms our earlier result that $E[Y^a]$ cannot be identified in non-nested trials.

Furthermore, the conditional odds of trial participation are also not identifiable because they depend on $u$. In fact, using equation (7), substituting $u$ for $c$, we see that the odds of trial participation in the target population are, up to an unknown multiplicative constant, equal to the odds of trial participation among sampled individuals,

$$\frac{\Pr[S = 1 \mid X]}{\Pr[S = 0 \mid X]} = u \times \frac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}. \tag{9}$$

We have come to an apparent conflict: the right-hand-side of (5) involves the conditional odds of trial participation, a quantity that is not identifiable in non-nested designs. Yet, we argued in the previous section that the left-hand-side of (5) is identifiable. The conflict can be easily resolved by noting that, because both the numerator and the denominator of (5) are multiplied by the unknown constant $u$, which cancels out, identification via IP weighting is possible (see the appendix of [7] for technical details).
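The cancellation can be illustrated numerically. The sketch below uses a normalized (ratio-type) odds-weighted mean, chosen here for illustration and not necessarily the estimator used in [7]; the data, the function name `odds_weighted_mean`, and the value of the sampling probability are all hypothetical. Rescaling every odds weight by the same constant leaves the weighted mean unchanged.

```python
def odds_weighted_mean(outcomes, odds_nonparticipation, pr_treatment):
    """Normalized odds-weighted outcome mean among participants assigned treatment a."""
    weights = [o / p for o, p in zip(odds_nonparticipation, pr_treatment)]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical trial participants assigned to treatment a.
outcomes = [1.0, 0.0, 1.0, 1.0]
pr_treatment = [0.5, 0.5, 0.5, 0.5]
# Odds of non-participation given covariates, estimated among sampled individuals.
odds_sampled = [0.5, 2.0, 1.0, 4.0]

# If the sampling probability of non-randomized individuals were known
# (0.1 is a made-up value), the population odds would be the sampled odds
# divided by it.
sampling_prob = 0.1
odds_population = [o / sampling_prob for o in odds_sampled]

estimate_sampled_odds = odds_weighted_mean(outcomes, odds_sampled, pr_treatment)
estimate_population_odds = odds_weighted_mean(outcomes, odds_population, pr_treatment)
print(estimate_sampled_odds, estimate_population_odds)  # equal up to rounding error
```

Because the constant appears in both the weighted sum and the sum of the weights, the analyst never needs to know it to compute this estimate.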

Table 1 summarizes the sampling properties and identification results for each study design.

5 Estimating the probability of trial participation

In realistic analyses, the dimension of $X$ will be fairly large, necessitating some modeling assumptions about $\Pr[S = 1 \mid X]$ or $\Pr[S = 1 \mid X, D = 1]$ [16]. In this section we discuss the relationship between study design and model specification and estimation approaches.

5.1 Nested trial designs

Census of the actual population: In this type of nested trial design, it is straightforward to estimate the probability of trial participation, $\Pr[S = 1 \mid X]$, in the sense that we can use the model we believe is most likely to be correctly specified for the target population.

For concreteness, suppose that we are willing to assume a parametric model, $p(X; \beta)$, for the probability of trial participation in the target population, $\Pr[S = 1 \mid X]$, with $\beta$ a finite dimensional parameter. In the nested-trial designs with a census of non-randomized individuals, we typically estimate the parameters by maximizing the likelihood function

$$\mathcal{L}(\beta) = \prod_{i=1}^{n} p(X_i; \beta)^{S_i} \{1 - p(X_i; \beta)\}^{1 - S_i},$$

where $i = 1, \ldots, n$ indexes individuals, and $n$ is the number of individuals in the study (i.e., the actual population). Under reasonable technical conditions [17], the usual maximum likelihood methods use a sample-size normalized objective function that converges uniformly in probability to

$$Q(\beta) = E\left[ S \ln p(X; \beta) + (1 - S) \ln\{1 - p(X; \beta)\} \right]. \tag{10}$$

For example, when $p(X; \beta)$ is a logistic model with $\beta = (\beta_0, \beta_1)$,

$$Q(\beta) = E\left[ S (\beta_0 + \beta_1^\top X) - \ln\{1 + \exp(\beta_0 + \beta_1^\top X)\} \right]$$

is the large sample limit of the sample-size normalized log-likelihood function for logistic regression.

Sub-sampling of non-randomized individuals: When we sub-sample the non-randomized individuals in the actual population, it is not possible to maximize the likelihood function above, because data are not available from all non-randomized individuals in the actual population. A natural idea is to use equation (6), which provides an explicit formula for identifying the conditional probability of trial participation, using the probability of trial participation among sampled individuals, $\Pr[S = 1 \mid X, D = 1]$, and the sampling probability for non-randomized individuals, $c$. When modeling the probability of trial participation among sampled individuals, however, the following difficulty arises: in general, when sampling depends on trial participation status, the correctly specified model for trial participation among sampled individuals does not have the same form as the correctly specified model in the target population, with the notable exception of the logistic regression model [18, 19]. This implies that naive estimation of the parameters of the model for trial participation among sampled individuals will typically be inconsistent for the population model.

Because the sampling probability of non-randomized individuals is known, we can use the following weighted pseudo-likelihood function, which only uses data from sampled individuals [18, 20],

$$\mathcal{L}_w(\beta) = \prod_{i=1}^{n} \left[ p(X_i; \beta)^{S_i} \{1 - p(X_i; \beta)\}^{1 - S_i} \right]^{w_i},$$

with $w_i = D_i \{ S_i + (1 - S_i)/c \}$. Weighted maximum likelihood methods use a sample-size normalized objective function that converges uniformly in probability to

$$Q_w(\beta) = E\left[ D \left\{ S + \frac{1 - S}{c} \right\} \left( S \ln p(X; \beta) + (1 - S) \ln\{1 - p(X; \beta)\} \right) \right], \tag{11}$$

which is restricted to sampled individuals ($D = 1$).

As we show in the Appendix, under the sampling properties for this design, the large sample limits of the objective functions in (10) and (11) are equal, $Q(\beta) = Q_w(\beta)$. It follows that, under reasonable technical conditions [17], weighted likelihood estimation of $\beta$ in the nested trial design with sub-sampling of non-randomized individuals converges in probability to the same parameter as unweighted regression in the actual population.

In practical terms, as long as a reasonable parametric model can be specified for the target population, the model parameters can be estimated using weighted maximum likelihood methods [18] on data from sampled individuals, with individual level weights equal to 1 for randomized individuals, $\{S = 1, D = 1\}$; $1/c$ for sampled non-randomized individuals, $\{S = 0, D = 1\}$; and 0 for unsampled individuals, $\{S = 0, D = 0\}$.
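The weighting scheme just described can be sketched with a small weighted logistic regression. The example below is only an illustration: the grouped counts, the sampling probability of 0.25, and the Newton-Raphson helper `fit_weighted_logistic` are all hypothetical. The counts are constructed so that the underlying actual population has participation probabilities 0.1 and 0.3 at the two levels of a binary covariate, with exactly one quarter of the non-randomized individuals sampled.

```python
import numpy as np


def fit_weighted_logistic(design, s, weights, n_iter=25):
    """Maximize a weighted logistic log-likelihood by Newton-Raphson."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        score = design.T @ (weights * (s - p))
        information = (design * (weights * p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(information, score)
    return beta


# Hypothetical grouped data: one row per (x, s) cell among sampled individuals.
x = np.array([0.0, 0.0, 1.0, 1.0])
s = np.array([1.0, 0.0, 1.0, 0.0])
n_sampled = np.array([800.0, 1800.0, 600.0, 350.0])
sampling_prob = 0.25  # assumed known probability of sampling a non-randomized individual

design = np.column_stack([np.ones_like(x), x])
# Weight 1 for randomized, 1 / sampling_prob for sampled non-randomized individuals,
# multiplied by the cell counts because the data are grouped.
weights = n_sampled * np.where(s == 1.0, 1.0, 1.0 / sampling_prob)

beta_weighted = fit_weighted_logistic(design, s, weights)

# Coefficients implied by the population participation probabilities 0.1 and 0.3.
beta_population = np.array(
    [np.log(0.1 / 0.9), np.log((0.3 / 0.7) / (0.1 / 0.9))]
)
print(beta_weighted, beta_population)
```

Unsampled individuals never enter the arrays, which is equivalent to giving them weight 0.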

5.2 Non-nested trial designs

In non-nested trial designs, the weighting approach described above is not applicable because the sampling probability of non-randomized individuals is unknown. Provided, however, that the sampling probability does not depend on $X$ (i.e., the assumed sampling property), we can show that, if a logistic model for trial participation is correctly specified in the target population, then a logistic model is correctly specified in the non-nested trial design. To see this, suppose that we are willing to assume a logistic regression model in the population, such that

$$\ln \frac{\Pr[S = 1 \mid X]}{\Pr[S = 0 \mid X]} = \beta_0 + \beta_1^\top X.$$

Using the result in (8) and taking logarithms, we have that

$$\ln \frac{\Pr[S = 1 \mid X]}{\Pr[S = 0 \mid X]} = \ln(u) + \ln \frac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]}.$$

Equating the right-hand-sides of the last two equations, we obtain

$$\ln \frac{\Pr[S = 1 \mid X, D = 1]}{\Pr[S = 0 \mid X, D = 1]} = \beta_0^{*} + \beta_1^\top X, \tag{12}$$

where $\beta_0^{*} = \beta_0 - \ln(u)$, a well-known result in the context of case-referent studies [21]. Thus, if a logistic model is correctly specified in the target population, then a model of the same functional form is correctly specified in the non-nested trial design. In fact, the coefficients $\beta_1$ in the two models are equal, and only the intercept differs. Because $0 < u < 1$, $\beta_0^{*} > \beta_0$: the sub-sampling of non-randomized patients simply results in an intercept that is “shifted” upwards. As we have shown in the section on IP weighting, the resulting shift in the odds of participation does not affect identification of the potential outcome mean in the non-randomized individuals, $E[Y^a \mid S = 0]$, which is the parameter of interest in non-nested trial designs with unknown sampling probability of non-randomized individuals.
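The intercept shift can be verified numerically. In the sketch below, the coefficient values and the sampling probability are hypothetical, and `sampled_participation_prob` is an illustrative helper that applies Bayes' rule under the assumption that all randomized individuals, and a constant fraction of non-randomized individuals, are sampled.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def sampled_participation_prob(p_population, sampling_prob):
    """Probability of trial participation among sampled individuals.

    Assumes every randomized individual is sampled and each non-randomized
    individual is sampled with probability sampling_prob, regardless of covariates.
    """
    return p_population / (p_population + sampling_prob * (1.0 - p_population))


# Hypothetical population logistic model and sampling probability.
intercept, slope, sampling_prob = -2.0, 0.8, 0.1
shifted_intercept = intercept - math.log(sampling_prob)

for x in (-1.0, 0.0, 2.5):
    p_sampled = sampled_participation_prob(expit(intercept + slope * x), sampling_prob)
    log_odds_sampled = math.log(p_sampled / (1.0 - p_sampled))
    # Same slope, intercept shifted upwards by -log(sampling_prob).
    assert abs(log_odds_sampled - (shifted_intercept + slope * x)) < 1e-9
```

The same check fails for most non-logistic link functions, which is why the logistic model is singled out as an exception.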

The above result is also important for estimation of the model parameters: combined with the results in [22, 23], it implies that the unconstrained and unweighted maximum likelihood estimator for the logistic model in (12), fit among sampled individuals, is the efficient estimator for the parameters of that model.

6 Discussion

We presented a unified description of study designs for extending inferences from randomized trials to a well-defined target population and showed that commonly invoked identifiability conditions need to be combined with the sampling properties of each study design in order to determine which causal quantities can be identified. Our approach uses a superpopulation framework, which is a natural choice for extending trial findings beyond the sample of randomized individuals [24].

In non-nested trial designs, where the sampling probability for non-randomized individuals is unknown, the marginal potential outcome means in the target population are not identifiable, but the potential outcome means in the sub-population of non-randomized individuals are identifiable. This restriction may be less severe than it appears: for most trials, we want to estimate the effect of applying the interventions to a new population, which can be represented by a well-chosen sample of non-randomized individuals [7]. In any case, when available, knowledge of the sampling probability of non-randomized individuals can be used to mitigate these limitations, without requiring the collection of covariate information from all non-randomized individuals in the actual population. Thus, in general, nested trial designs will often be the preferred approach for generalizing trial findings when it is possible to define and sample the actual population when a randomized trial is planned. Such nested trial designs will typically have broad (pragmatic [25]) eligibility criteria and define the target population as the population of individuals meeting the trial eligibility criteria. When that is not possible, as is the case in already completed randomized trials, non-nested trial designs might be a reasonable alternative. For example, in non-nested trial designs, the comparison of estimates for the potential outcome means among randomized, $E[Y^a \mid S = 1]$, and non-randomized individuals, $E[Y^a \mid S = 0]$, is of practical interest: provided the identifiability conditions hold, if the two estimates are similar, we may conclude that the trial results are likely generalizable (up to sampling variability); in contrast, if the estimates are different, trial results may not be generalizable.

We also showed that the different study designs have implications for identifying and estimating the conditional probability of trial participation. This probability is of inherent interest because it captures aspects of decision-making related to trial participation [26, 27]. We showed that the probability is identifiable in nested trial designs, but not in non-nested trial designs (e.g., composite dataset designs). Indeed, any reasonable parametric model for the probability of participation in the population can be identified in nested trial designs. In nested trial designs with sampling of non-randomized individuals, estimation of model parameters can be facilitated by the use of weighted maximum likelihood estimation where randomized patients are given weight 1 and non-randomized patients are given weight equal to the inverse of the probability of being sampled among non-randomized individuals in the actual population. In non-nested trial designs, model specification is complicated by the fact that, when sampling depends on trial participation status, the model for the probability of trial participation among sampled individuals is not of the same form as the model in the population (the logistic regression model being a notable exception [18]).

The probability of trial participation in the target population is also important for identification and estimation using inverse probability (or odds) weighting methods. Our argument that the odds of participation after selection of non-randomized individuals are equal to the odds of participation in the target population, up to an unknown multiplicative constant, clarifies how the validity of estimators when using composite dataset designs [4, 7] depends critically on the assumed sampling properties.

Astute readers will have noticed the many connections between our results and the theory of case-referent (case-control) studies [28, 21, 18, 20, 19]. Indeed, our approach can be placed in the case-base paradigm, viewing randomized individuals as “cases” in a cumulative incidence case-referent study [28] nested in the “cohort” of the actual population. There is another interesting parallel with case-referent studies: the difficulty in specifying the population of non-randomized individuals that should be sampled in composite dataset designs is similar in nature to the validity issues of case-referent studies with a secondary base [29, 30, 31].

In this paper, for simplicity, we focused on causal quantities that are most meaningful for point treatments with complete adherence and no loss to follow-up. Our results can be extended to address time-varying treatments using well-known extensions of the identifiability conditions for randomized trials [15, 32, 24], without any changes to the sampling properties or the modeling assumptions about the probability of trial participation. Perhaps, then, the most consequential causal assumption that we required was that the invitation to participate in the trial and trial participation itself do not have an effect on the outcome except through treatment assignment. Unless investigators are willing to contemplate much more complex study designs involving multistage data collection about (and possibly randomization of) the invitation to participate, trial participation itself, and treatment assignment [33], our results are best viewed as applying to trials where the not-through-treatment effects of the invitation to participate in the trial and of trial participation are negligible compared to the effects of treatment. For example, they are applicable to pragmatic randomized trials embedded in large health-care systems or registries, where trial procedures other than treatment assignment can be assumed to be similar to usual medical practice [34, 25, 35].

7 Figure

Figure 1: Conceptual graph depicting the sampling designs for studies extending inferences from a randomized trial to a target population.

8 Table

Study design | Sampling probabilities | Marginal probability of trial participation | Conditional probability of trial participation | Identifiable potential outcome means (when the conditions in Section 2 hold)
Nested-trial
and
and
and
,
is a known constant
and
Non-nested trial
and
is an unknown constant
Not identifiable Not identifiable
Note that the formulas for the marginal and conditional probability of trial participation in nested-trial designs with a census of the actual population can be obtained from the formulas for the nested trial design with known sampling probabilities by setting the sampling probability of non-randomized individuals to 1.
Table 1: Summary of sampling properties and identification results by study design.

References

  • [1] S. R. Cole and E. A. Stuart, “Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial,” American Journal of Epidemiology, vol. 172, no. 1, pp. 107–115, 2010.
  • [2] A. L. Buchanan, M. G. Hudgens, S. R. Cole, K. R. Mollan, P. E. Sax, E. S. Daar, A. A. Adimora, J. J. Eron, and M. J. Mugavero, “Generalizing evidence from randomized trials using inverse probability of sampling weights,” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 181, no. 4, pp. 1193–1209, 2018.
  • [3] I. J. Dahabreh, S. E. Robertson, E. J. T. Tchetgen, E. A. Stuart, and M. A. Hernán, “Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals,” Biometrics, 2018.
  • [4] D. Westreich, J. K. Edwards, C. R. Lesko, E. Stuart, and S. R. Cole, “Transportability of trial results using inverse odds of sampling weights,” American Journal of Epidemiology, vol. 186, no. 8, pp. 1010–1014, 2017.
  • [5] C. R. Lesko, A. L. Buchanan, D. Westreich, J. K. Edwards, M. G. Hudgens, and S. R. Cole, “Practical considerations when generalizing study results: a potential outcomes perspective,” Epidemiology, vol. 28, no. 4, pp. 553–561, 2017.
  • [6] K. E. Rudolph and M. J. van der Laan, “Robust estimation of encouragement design intervention effects transported across sites,” Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol. 79, no. 5, pp. 1509–1525, 2017.
  • [7] I. J. Dahabreh, S. E. Robertson, E. A. Stuart, and M. A. Hernán, “Transporting inferences from a randomized trial to a new target population,” arXiv preprint arXiv:1805.00550, 2018.
  • [8] N. Keiding and T. A. Louis, “Perils and potentials of self-selected entry to epidemiological studies and surveys,” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 179, no. 2, pp. 319–376, 2016.
  • [9] M. Hernán, “Discussion of “Perils and potentials of self-selected entry to epidemiological studies and surveys”,” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 179, no. 2, pp. 346–347, 2016.
  • [10] J. M. Robins, “Confidence intervals for causal parameters,” Statistics in Medicine, vol. 7, no. 7, pp. 773–785, 1988.
  • [11] I. J. Dahabreh, M. A. Hernán, S. E. Robertson, A. Buchanan, and J. A. Steingrimsson, “Generalizing trial findings in nested trial designs with sub-sampling of non-randomized individuals,” arXiv preprint arXiv:1902.06080, 2019.
  • [12] M. Olschewski and H. Scheurlen, “Comprehensive cohort study: an alternative to randomized consent design in a breast preservation trial,” Methods of Information in Medicine, vol. 24, pp. 131–134, 1985.
  • [13] D. B. Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, vol. 66, no. 5, p. 688, 1974.
  • [14] J. M. Robins and S. Greenland, “Causal inference without counterfactuals: comment,” Journal of the American Statistical Association, vol. 95, no. 450, pp. 431–435, 2000.
  • [15] J. M. Robins, “A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect,” Mathematical Modelling, vol. 7, no. 9, pp. 1393–1512, 1986.
  • [16] J. M. Robins and Y. Ritov, “Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models,” Statistics in Medicine, vol. 16, no. 3, pp. 285–319, 1997.
  • [17] W. K. Newey and D. McFadden, “Large sample estimation and hypothesis testing,” Handbook of econometrics, vol. 4, pp. 2111–2245, 1994.
  • [18] C. F. Manski and S. R. Lerman, “The estimation of choice probabilities from choice based samples,” Econometrica: Journal of the Econometric Society, vol. 45, no. 8, pp. 1977–1988, 1977.
  • [19] A. J. Scott and C. Wild, “Fitting logistic models under case-control or choice based sampling,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 48, no. 2, pp. 170–182, 1986.
  • [20] S. R. Cosslett, “Maximum likelihood estimator for choice-based samples,” Econometrica: Journal of the Econometric Society, vol. 49, no. 5, pp. 1289–1316, 1981.
  • [21] N. Mantel, “Synthetic retrospective studies and related topics,” Biometrics, vol. 23, no. 3, pp. 479–486, 1973.
  • [22] R. L. Prentice and R. Pyke, “Logistic disease incidence models and case-control studies,” Biometrika, vol. 66, no. 3, pp. 403–411, 1979.
  • [23] N. E. Breslow, J. M. Robins, J. A. Wellner, et al., “On the semi-parametric efficiency of logistic regression under case-control sampling,” Bernoulli, vol. 6, no. 3, pp. 447–455, 2000.
  • [24] M. A. Hernán and J. M. Robins, Causal inference (forthcoming). Boca Raton, FL: Chapman & Hall/CRC, 2019.
  • [25] I. Ford and J. Norrie, “Pragmatic trials,” New England Journal of Medicine, vol. 375, no. 5, pp. 454–463, 2016.
  • [26] D. McFadden, “Conditional logit analysis of qualitative choice behavior,” in Frontiers in econometrics (P. Zarembka, ed.), ch. 4, pp. 105–142, Berkeley, CA: Institute of Urban and Regional Development, University of California Berkeley, CA, 1973.
  • [27] E. A. Stuart, S. R. Cole, C. P. Bradshaw, and P. J. Leaf, “The use of propensity scores to assess the generalizability of results from randomized trials,” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 174, no. 2, pp. 369–386, 2011.
  • [28] O. S. Miettinen, “Estimability and estimation in case-referent studies,” American Journal of Epidemiology, vol. 103, no. 2, pp. 226–235, 1976.
  • [29] O. S. Miettinen, “The “case-control” study: valid selection of subjects,” Journal of Chronic Diseases, vol. 38, no. 7, pp. 543–548, 1985.
  • [30] O. S. Miettinen, “Response: The concept of secondary base,” Journal of Clinical Epidemiology, vol. 43, no. 9, pp. 1017–1020, 1990.
  • [31] S. Wacholder, J. K. McLaughlin, D. T. Silverman, and J. S. Mandel, “Selection of controls in case-control studies: I. principles,” American Journal of Epidemiology, vol. 135, no. 9, pp. 1019–1028, 1992.
  • [32] J. M. Robins, “Marginal structural models versus structural nested models as tools for causal inference,” in Statistical models in epidemiology, the environment, and clinical trials, pp. 95–133, Springer, 2000.
  • [33] J. J. Heckman, “Randomization and social policy evaluation,” Tech. Rep. 107, National Bureau of Economic Research, Cambridge, Mass., USA, 1991.
  • [34] T.-P. van Staa, L. Dyson, G. McCann, S. Padmanabhan, R. Belatri, B. Goldacre, J. Cassell, M. Pirmohamed, D. Torgerson, S. Ronaldson, et al., “The opportunities and challenges of pragmatic point-of-care randomised trials using routinely collected electronic records: evaluations of two exemplar trials,” Health Technology Assessment, vol. 18, no. 43, pp. 1–146, 2014.
  • [35] N. K. Choudhry, “Randomized, controlled trials in health insurance systems,” New England Journal of Medicine, vol. 377, no. 10, pp. 957–964, 2017.

Appendix A Identification of the probability of trial participation in nested trial designs with sub-sampling

A.1 Identification of the marginal probability of trial participation

Using the definition of conditional probability and re-arranging,

Taking the ratio of the above expressions and exploiting the sampling properties for nested trial designs with sub-sampling,

With a bit of algebra, the above expression can be re-arranged to show that

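By setting the sampling probability of non-randomized individuals to 1, we see that in the nested-trial design with a census of non-randomized individuals the marginal probability of trial participation in the actual population equals the marginal probability of trial participation in the sampled data.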
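The algebra described in this subsection can be sketched as follows. The notation is an assumption introduced here for illustration: $S = 1$ denotes randomized individuals, $D = 1$ denotes individuals sampled into the data, all randomized individuals are sampled, and non-randomized individuals are sampled with a known probability $u$.

```latex
% Assumed notation (illustrative only): S = trial participation indicator,
% D = sampling indicator, Pr[D = 1 | S = 1] = 1, Pr[D = 1 | S = 0] = u (known).
\begin{align*}
\Pr[S = 1 \mid D = 1] &= \frac{\Pr[D = 1 \mid S = 1]\,\Pr[S = 1]}{\Pr[D = 1]}, \\
\Pr[S = 0 \mid D = 1] &= \frac{\Pr[D = 1 \mid S = 0]\,\Pr[S = 0]}{\Pr[D = 1]}, \\
\frac{\Pr[S = 1 \mid D = 1]}{\Pr[S = 0 \mid D = 1]} &= \frac{\Pr[S = 1]}{u\,\Pr[S = 0]}, \\
\Pr[S = 1] &= \frac{u\,\Pr[S = 1 \mid D = 1]}{u\,\Pr[S = 1 \mid D = 1] + \Pr[S = 0 \mid D = 1]}.
\end{align*}
```

Under these assumptions, taking $u = 1$ in the last line gives $\Pr[S = 1] = \Pr[S = 1 \mid D = 1]$, which is the census case.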

A.2 Identification of the conditional probability of trial participation

The argument for the conditional probability is parallel to the one presented above for the marginal probability. Again, using the definition of conditional probability,

Taking the ratio of the above expressions and exploiting the sampling properties for nested trial designs with sub-sampling,

The above expression can be re-arranged to show that

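By setting the sampling probability of non-randomized individuals to 1, we see that in the nested-trial design with a census of non-randomized individuals the conditional probability of trial participation in the actual population equals the conditional probability of trial participation in the sampled data.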
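As a small numerical illustration of this identification result, the sketch below maps a probability of trial participation computed in the sampled data to the corresponding probability in the actual population. The function name `participation_probability` and the argument names are hypothetical, and the sketch assumes that all randomized individuals are sampled and that non-randomized individuals are sampled with a known probability `u` that does not depend on covariates.

```python
def participation_probability(p_sampled: float, u: float) -> float:
    """Convert Pr[S=1 | X, D=1] (sampled data) into Pr[S=1 | X] (actual population).

    Assumes (illustrative notation): every randomized individual is sampled,
    and each non-randomized individual is sampled with known probability u.
    """
    if not 0.0 < u <= 1.0:
        raise ValueError("u must lie in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must lie in [0, 1]")
    # The odds in the sampled data equal the population odds divided by u,
    # so multiplying the sampled-data odds by u recovers the population odds.
    return u * p_sampled / (u * p_sampled + (1.0 - p_sampled))


# Hypothetical covariate stratum: 200 randomized and 9,000 non-randomized
# individuals in the actual population; 10% of the latter are sampled.
n_randomized, n_nonrandomized, u = 200, 9000, 0.1
expected_sampled_nonrandomized = u * n_nonrandomized
p_sampled = n_randomized / (n_randomized + expected_sampled_nonrandomized)
p_population = participation_probability(p_sampled, u)
p_truth = n_randomized / (n_randomized + n_nonrandomized)
```

In this hypothetical stratum the recovered value agrees with the population proportion `p_truth`, and with `u = 1.0` the function returns its input unchanged.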

Appendix B Estimating the conditional probability of trial participation

We outline the proof for the convergence in probability of the estimators for the conditional probability of trial participation described in the main text, without delving into the technical conditions needed to make the arguments rigorous.

Consider the likelihood function for the nested trial design with a census of the actual population,

and the pseudo-likelihood function for the nested trial design with known sampling probability of the non-randomized individuals,

For the likelihood function under a census of the actual population, the sample size-normalized objective function to be maximized is

Provided the technical conditions for the uniform law of large numbers obtain, the above objective function converges uniformly in probability, in the sense of the definition in Section 2.1 of [17], to

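By Theorem 2.1 of [17], if the limiting objective function is uniquely maximized at the true parameter value, the parameter space is compact, and the limiting objective function is continuous, then the estimator obtained by maximizing the sample objective function converges in probability to the true parameter value.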

For the weighted pseudo-likelihood function, the sample size-normalized objective function to be maximized is

Because the sampling probability of non-randomized individuals is assumed to be bounded away from 0, and provided the technical conditions for the uniform weak law of large numbers obtain, the above objective function converges uniformly in probability to

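We will now show that this limit coincides with the limit of the objective function under a census of the actual population.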

By design, if , then ; if , then . Thus, to establish the result we only need to show that

Starting from the right-hand-side,

which establishes the result.
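The key step of this argument can be sketched with the law of iterated expectations. The notation is an assumption introduced here for illustration: $S$ is the trial participation indicator, $D$ the sampling indicator, $p(X; \beta)$ the model for the conditional probability of trial participation, and $u$ the known probability of sampling a non-randomized individual, taken not to depend on $X$.

```latex
% Assumed notation (illustrative only): Pr[D = 1 | X, S = 0] = u, with u known;
% the weight for a sampled non-randomized individual is 1/u.
\begin{align*}
\mathrm{E}\!\left[ \frac{(1 - S)\,D}{u}\, \log\{1 - p(X; \beta)\} \right]
&= \mathrm{E}\!\left[ \frac{1 - S}{u}\, \log\{1 - p(X; \beta)\}\; \mathrm{E}[D \mid X, S] \right] \\
&= \mathrm{E}\!\left[ \frac{1 - S}{u}\, \log\{1 - p(X; \beta)\}\; \Pr[D = 1 \mid X, S = 0] \right] \\
&= \mathrm{E}\!\left[ (1 - S)\, \log\{1 - p(X; \beta)\} \right].
\end{align*}
```

Under these assumptions, the inverse probability weight exactly offsets the sub-sampling of non-randomized individuals, so the weighted contribution has the same expectation as the corresponding contribution under a census.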

Because the two limiting objective functions coincide, it follows that the maximizer of the weighted objective function converges in probability to the true parameter value.

To obtain the asymptotic distribution of the estimators, we need additional technical conditions as in Theorem 3.1 of [17]; provided these conditions hold, both estimators are asymptotically normal.
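The consistency argument above can be checked informally by simulation. The sketch below is not the authors' implementation: it assumes a logistic model for trial participation with a single covariate, a hypothetical known sampling probability `u = 0.2` for non-randomized individuals, and a hand-written Newton-Raphson routine (`fit_weighted_logistic`, a name introduced here) in place of any particular regression library.

```python
import numpy as np


def fit_weighted_logistic(x, s, w, n_iter=100, tol=1e-10):
    """Maximize a weighted Bernoulli log-likelihood by Newton-Raphson."""
    design = np.column_stack([np.ones(len(s)), x])  # intercept plus covariate
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (w * (s - p))
        hessian = (design * (w * p * (1.0 - p))[:, None]).T @ design
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(0)
n = 200_000                          # size of the simulated actual population
true_beta = np.array([-2.0, 1.0])    # hypothetical participation model
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
s = rng.binomial(1, p_true)          # S = 1: randomized

u = 0.2                              # assumed known sampling probability
# All randomized individuals are sampled; non-randomized with probability u.
d = np.where(s == 1, 1, rng.binomial(1, u, size=n))
sampled = d == 1

# Weights: 1 for randomized, 1/u for sampled non-randomized individuals.
w = np.where(s[sampled] == 1, 1.0, 1.0 / u)
beta_weighted = fit_weighted_logistic(x[sampled], s[sampled], w)

# Unweighted fit on the sampled data, shown for contrast.
beta_unweighted = fit_weighted_logistic(
    x[sampled], s[sampled], np.ones(int(sampled.sum()))
)
```

In this setup the weighted fit should land close to `true_beta`, whereas the unweighted fit on the sampled data should recover the slope but have its intercept shifted by roughly `-log(u)`, the familiar offset from case-control sampling.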

Appendix C Nested trial design with covariate-dependent sampling probabilities

C.1 Sampling properties

As noted in the main text, a more general version of the nested trial design assumes that the sampling probabilities for non-randomized individuals depend on baseline auxiliary covariates. Partition the baseline covariates into two components: auxiliary covariates that are available on all members of the actual population in which the trial is nested, and covariates that are only measured among randomized individuals and sampled non-randomized individuals.

The identifiability conditions and identification results remain the same as in the main text; but the sampling properties of this design are

where the sampling probability for non-randomized individuals is a known function that depends only on the auxiliary covariates, allowing the sampling of non-randomized individuals to depend on the covariates that are available from all members of the actual population.

C.2 Identification of the conditional probability of trial participation

Using an argument similar to the case when the sampling probability for non-randomized individuals does not depend on covariates, we obtain

which is identifiable because the inverse of the conditional odds of trial participation in the sampled data is identifiable, and the sampling probability for non-randomized individuals is known by design.
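One way to write an expression of this form is sketched below. The notation is an assumption introduced here for illustration: $X^{\text{aux}}$ denotes the auxiliary covariates (a component of $X$), $u(X^{\text{aux}})$ the known sampling probability for non-randomized individuals, $S$ the trial participation indicator, and $D$ the sampling indicator, with all randomized individuals sampled.

```latex
% Assumed notation (illustrative only):
% Pr[D = 1 | X, S = 1] = 1 and Pr[D = 1 | X, S = 0] = u(X^{aux}), a known function.
\begin{align*}
\frac{\Pr[S = 0 \mid X]}{\Pr[S = 1 \mid X]}
&= \frac{1}{u(X^{\text{aux}})} \times \frac{\Pr[S = 0 \mid X, D = 1]}{\Pr[S = 1 \mid X, D = 1]}, \\
\Pr[S = 1 \mid X]
&= \left\{ 1 + \frac{1}{u(X^{\text{aux}})} \times
   \frac{\Pr[S = 0 \mid X, D = 1]}{\Pr[S = 1 \mid X, D = 1]} \right\}^{-1}.
\end{align*}
```

When $u(X^{\text{aux}})$ is constant, this sketch reduces to the covariate-independent case, and when it equals 1 it reduces to the census case.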

C.3 Estimating the probability of trial participation by weighted regression

As before, we assume a model for the conditional probability of trial participation with a finite-dimensional parameter. The weighted pseudo-likelihood function becomes

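Note that the only difference between this weighted pseudo-likelihood function and the one for covariate-independent sampling is that the weights in the former depend on the auxiliary covariates. The sample size-normalized objective function to be maximized is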

Because the sampling probability for non-randomized individuals is assumed to be bounded away from 0, and provided the technical conditions for the uniform weak law of large numbers obtain, the above objective function converges uniformly in probability to

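We will now show that this limit coincides with the limit of the objective function under a census of the actual population.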

As noted above, by design, if , then ; if , then . Thus, to establish the result we only need to show that

Starting from the right-hand-side,

which establishes the result.

Because the two limiting objective functions coincide, it follows that the maximizer of the weighted objective function converges in probability to the true parameter value.

In practical terms, this result suggests that the conditional probability of trial participation in the target population can be estimated using a weighted regression of the trial participation indicator on the covariates among sampled patients, using weights equal to 1 for randomized patients (all of whom are sampled); the inverse of the known sampling probability for sampled non-randomized individuals; and 0 for non-sampled non-randomized individuals.
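The weighting scheme described above can be sketched as a small helper. The function name `participation_model_weights` and the argument names are hypothetical; `u_aux` is assumed to hold, for each individual, the known sampling probability evaluated at that individual's auxiliary covariates.

```python
import numpy as np


def participation_model_weights(s, d, u_aux):
    """Weights for the regression of trial participation on covariates.

    s     : 1 for randomized, 0 for non-randomized individuals
    d     : 1 if sampled into the data, 0 otherwise
    u_aux : known sampling probability for non-randomized individuals,
            evaluated at each individual's auxiliary covariates (assumed input)
    """
    s = np.asarray(s)
    d = np.asarray(d)
    u_aux = np.asarray(u_aux, dtype=float)
    if np.any(u_aux <= 0.0) or np.any(u_aux > 1.0):
        raise ValueError("sampling probabilities must lie in (0, 1]")
    if np.any((s == 1) & (d == 0)):
        raise ValueError("all randomized individuals are assumed to be sampled")
    # 1 for randomized; 1/u for sampled non-randomized; 0 for non-sampled.
    return np.where(s == 1, 1.0, np.where(d == 1, 1.0 / u_aux, 0.0))


# Hypothetical toy data: two randomized and four non-randomized individuals,
# with sampling probability 0.5 in one auxiliary stratum and 0.25 in another.
s = [1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0]
u_aux = [0.5, 0.5, 0.5, 0.5, 0.25, 0.25]
weights = participation_model_weights(s, d, u_aux)
```

Because non-sampled individuals receive weight 0, passing these weights to a weighted logistic regression fit on all individuals is equivalent to fitting it among sampled individuals only; the values in `u_aux` for randomized individuals do not affect the result.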

As above, provided the technical conditions of Theorem 3.1 of [17] hold, the resulting estimator is asymptotically normal.


transportability_study_design, Date: August 24, 2019   Revision: 31.0