# Parameterizing and Simulating from Causal Models

Many statistical problems in causal inference involve a probability distribution other than the one from which data are actually observed; as an additional complication, the object of interest is often a marginal quantity of this other probability distribution. This creates many practical complications for statistical inference, even where the problem is non-parametrically identified. Naïve attempts to specify a model parametrically can lead to unwanted consequences such as incompatible parametric assumptions or the so-called 'g-null paradox'. As a consequence it is difficult to perform likelihood-based inference, or even to simulate from the model in a general way. We introduce the 'frugal parameterization', which places the causal effect of interest at its centre, and then build the rest of the model around it. We do this in a way that provides a recipe for constructing a regular, non-redundant parameterization using causal quantities of interest. In the case of discrete variables we use odds ratios to complete the parameterization, while in the continuous case we use copulas. Our methods allow us to construct and simulate from models with parametrically specified causal distributions, and fit them using likelihood-based methods, including fully Bayesian approaches. Models we can fit and simulate from exactly include marginal structural models and structural nested models. Our proposal includes parameterizations for the average causal effect and effect of treatment on the treated, as well as other causal quantities of interest. Our results will allow practitioners to assess their methods against the best possible estimators for correctly specified models, in a way which has previously been impossible.


## 1 Introduction

In many multivariate statistical problems, inferential interest lies in properties of specific functionals of the joint distribution, such as marginal or conditional distributions; this means it is generally desirable to specify a model for these functionals directly, with other parts of the distribution often being regarded as nuisance parameters. In causal inference problems, the target of inference may be a probability distribution other than the one that generates the observed data, but one which corresponds to some sort of experimental intervention on that system.

###### Example 1.1.

Consider the causal system represented by the graph in Figure 1(a), and suppose we are interested in the causal effect of X on Y. This can be formulated as a prediction problem: “what would happen if we performed an experiment in which we set X = x by external intervention?” Let the variables (Z, X, Y) be distributed according to a distribution P with some density p. Under the causal DAG assumptions of Spirtes et al. (2000) and Pearl (2009), the conditional distribution of Z and Y after an experiment to fix X = x is

 P∗(Z=z,Y=y∣X=x)≡P(Z=z)⋅P(Y=y∣Z=z,X=x).

Note that the idealized intervention on X removes any dependence of X on the confounder Z, but preserves the marginal distribution of Z, and the conditional distribution of Y given Z and X. This distribution is Markov with respect to the graph in Figure 1(b). Interest may then lie in the marginal effect on just Y,

 P∗(Y=y∣X=x)=∑zP(Z=z)⋅P(Y=y∣Z=z,X=x), (1)

sometimes denoted P(Y=y | do(X=x)), or as the distribution of the potential outcome Yx. Models of this quantity are known as marginal structural models (or MSMs, Robins, 2000).

For the purposes of simulation and likelihood-based inference it is often necessary to work with the joint distribution directly, and it may be difficult to specify this joint so that it is compatible with a particular marginal model on (1). Indeed, providing a model for the joint distribution parametrically may lead to a situation in which (1) cannot logically be independent of the value of x, unless we impose the much stronger condition that Y is independent of X given Z; in this case our model implicitly assumes the non-existence of the causal effect whose presence we are trying to infer. More generally, specifying separate models for joint and marginal quantities—and ignoring the information that is shared between them—can lead to incompatible or incoherent models, non-regular estimators, and severe misspecification problems.

### 1.1 Contribution of this Paper

We will show that one can break down a joint distribution pZXY into three pieces: the distribution of ‘the past’, pZX; the causal quantity of interest, p∗Y|X; and a conditional odds ratio, copula or other dependence measure between Y and Z given X. We use a star (e.g. p∗ or θ∗) to denote that the distribution or parameter is from the causal or interventional distribution, and omit the star if the distribution or parameter is from the observational regime.

Note that the causal quantity may be much more general than just pY|X(y|do(x)); see Section 2.1. Suppose that the respective parameterizations for these quantities are called θZX, θ∗Y|X and (with some abuse of notation) ϕ∗ZY|X; we call (θZX, θ∗Y|X, ϕ∗ZY|X) a frugal parameterization. The terminology is chosen because there is no redundancy and the parameterization is saturated, so any distribution with a positive joint density can be decomposed in this manner. If we use smooth and regular parameterizations of the three pieces (that is, such that the model is differentiable in quadratic mean and has a positive definite Fisher Information Matrix; see Appendix A for a formal statement) then the resulting parameterization of the joint model is also smooth and regular.

Note that, in addition to providing a parameterization, the parameters θZX and θ∗Y|X are always variation independent; we can also choose ϕ∗ZY|X to be variation independent of the other two parameters. As an example of the benefits of this property, we add a dependence for Y on covariates via a link function:

 logitP∗(Y=1|X=x,C=c) =μ+αx+βc+γxc, for all c.

Now we can be certain that—regardless of the values of μ, α, β and γ—there is a coherent joint distribution satisfying this requirement. This would allow us, for example, to model the causal effect of alcohol (X) on blood pressure (Y) conditional on a person’s genes (C), but marginally over factors such as socio-economic status (Z).

We start with a very simple example, to illustrate exactly what we propose to do.

###### Example 1.2.

Suppose that (Z, X, Y) follow a multivariate Gaussian distribution with zero mean, and that we wish to specify that Y given do(X = x) is normal with mean βx and variance σ². To complete the frugal parameterization we must specify ‘the past’ (i.e. pZX) and a dependence measure between Y and Z conditional upon X (ϕ∗ZY|X). We therefore take Z and X to be normal with mean 0 and variances τ², υ² respectively and correlation ρ, and assume the regression parameter for Y on Z (in the regression that includes X) is α; note that we could alternatively specify the covariance or partial correlation between Y and Z. Using this information, one can directly compute that the distribution of Y given (Z, X) = (z, x) after the intervention is normal with mean αz + βx and variance σ² − α²τ², and consequently the observational joint distribution of (Z, X, Y) is:

$$\begin{pmatrix} Z \\ X \\ Y \end{pmatrix} \sim N_3\left(0,\;\begin{pmatrix} \tau^2 & \rho\tau\upsilon & \alpha\tau^2+\beta\rho\tau\upsilon \\ \rho\tau\upsilon & \upsilon^2 & \beta\upsilon^2+\alpha\rho\tau\upsilon \\ \alpha\tau^2+\beta\rho\tau\upsilon & \beta\upsilon^2+\alpha\rho\tau\upsilon & \sigma^2+\beta^2\upsilon^2+2\rho\tau\upsilon\alpha\beta \end{pmatrix}\right).$$

We may do this for any values of α and β, provided that σ² &gt; α²τ² and |ρ| &lt; 1, and indeed we can obtain any trivariate Gaussian distribution from these parameters.

Once we are able to construct the joint distribution, simulation is trivial. We take the Cholesky decomposition of the covariance matrix and apply the lower triangular part to independent standard normals. Likelihood-based inference is also straightforward once the covariance is known.
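The construction above can be sketched numerically. The parameter values below are our own illustrative choices, following the convention (read off the covariance matrix above) that β is the causal coefficient in E[Y | do(X=x)] and α is the coefficient of Z in the observational regression of Y on (Z, X):

```python
import numpy as np

# Illustrative frugal parameters (our choices, not from the paper):
tau, ups, rho = 1.0, 1.5, 0.4    # sd(Z), sd(X), corr(Z, X): 'the past' pZX
beta, sigma2 = 0.8, 2.0          # causal margin: Y | do(X=x) ~ N(beta*x, sigma2)
alpha = 0.5                      # dependence: coefficient of Z in the Y ~ Z + X regression

# Observational covariance implied by the frugal parameterization;
# positive definite provided sigma2 > alpha^2 * tau^2 and |rho| < 1.
Sigma = np.array([
    [tau**2, rho*tau*ups, alpha*tau**2 + beta*rho*tau*ups],
    [rho*tau*ups, ups**2, beta*ups**2 + alpha*rho*tau*ups],
    [alpha*tau**2 + beta*rho*tau*ups, beta*ups**2 + alpha*rho*tau*ups,
     sigma2 + beta**2*ups**2 + 2*rho*tau*ups*alpha*beta],
])

# Simulate (Z, X, Y): apply the lower-triangular Cholesky factor to iid N(0,1)s
rng = np.random.default_rng(0)
L = np.linalg.cholesky(Sigma)
samples = rng.standard_normal((100_000, 3)) @ L.T
```

Solving the normal equations for the regression of Y on (Z, X) from this Sigma returns exactly (α, β), which confirms that each piece of the parameterization was specified precisely once.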

In this example we took our three pieces, pZX (a bivariate normal), p∗Y|X (a linear regression) and ϕ∗ZY|X (a regression parameter), and used them to obtain the joint distribution pZXY. Note that our parameterization was chosen so that every quantity of interest is specified precisely once, and the overall model is saturated (i.e. any multivariate Gaussian distribution can be deconstructed in this manner, simply by varying the parameters). This contrasts with the alternative of specifying pY|ZX directly, as this does not give a transparent model for the causal effect.

The above example may seem somewhat trivial, but the main contribution of this paper is that we will do this in a much more general fashion, enabling simulation from almost arbitrary models.

###### Example 1.3.

Now suppose that X and Y are binary with Z still continuous, and we continue to work with the model in Figure 1(a). This time we specify that

 logitE[Y|do(X=x)]=β0+β1x;

in addition suppose that we specify a marginal distribution for Z and a conditional distribution for X given Z, and that the log odds ratio between Y and Z conditional on X is constant (we could also allow it to vary with x).

The joint distribution in this example is considerably more difficult to write in closed form than the one in Example 1.2. However, in this paper we will show that we may: (i) specify this model using the parameters just given; (ii) simulate samples from the distribution described; and (iii) give a map for numerically evaluating the joint density and fit such a model to data using likelihood-based methods. Furthermore, we can do all this (almost) as easily as with the multivariate Gaussian distribution. Note that because logistic regression is not collapsible, this model illustrates why we should not just provide a conditional model pY|ZX to compute the joint distribution: doing so would lead to a very different marginal model for Y given do(X=x) than the one we chose.
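To make the recovery of the joint concrete, here is a binary-Z analogue of this example (the example's Z is continuous; we use a binary Z so that the odds-ratio inversion has the classical Plackett closed form, and all parameter values are our own):

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def joint_prob(a, b, psi):
    """P(Y=1, Z=1) for binary (Y, Z) with margins P(Y=1)=a, P(Z=1)=b and
    odds ratio psi, via the standard Plackett closed form."""
    if abs(psi - 1.0) < 1e-12:
        return a * b
    s = 1.0 + (a + b) * (psi - 1.0)
    return (s - np.sqrt(s * s - 4.0 * psi * (psi - 1.0) * a * b)) / (2.0 * (psi - 1.0))

# Hypothetical parameter values (ours):
b0, b1 = -0.5, 1.0        # logit E[Y | do(X=x)] = b0 + b1*x
pz = 0.3                  # P(Z = 1); preserved by the intervention
g0, g1 = -0.2, 0.8        # logit P(X = 1 | Z = z) = g0 + g1*z
log_or = 1.5              # log odds ratio between Y and Z given X

# Recover the shared conditional p(Y=1 | z, x) from the causal margin,
# the Z margin and the odds ratio:
p_y_given_zx = np.empty((2, 2))         # indexed [z, x]
for x in (0, 1):
    a = expit(b0 + b1 * x)              # P*(Y=1 | do(X=x))
    p11 = joint_prob(a, pz, np.exp(log_or))
    p_y_given_zx[1, x] = p11 / pz
    p_y_given_zx[0, x] = (a - p11) / (1.0 - pz)

# Simulate observational data: Z, then X | Z, then Y | Z, X
rng = np.random.default_rng(1)
n = 200_000
z = rng.binomial(1, pz, n)
x = rng.binomial(1, expit(g0 + g1 * z))
y = rng.binomial(1, p_y_given_zx[z, x])
```

Averaging the recovered conditional over the Z margin returns exactly the logistic causal margin we specified, which is the point of the construction.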

As we show, the method is particularly applicable to survival models and dynamic treatment models where we marginalize over the time-varying confounders; both of these are widely used but are difficult to simulate from (Havercroft and Didelez, 2012; Young and Tchetgen Tchetgen, 2014). In addition, it allows Bayesian and other likelihood-based methods to be applied coherently to marginal causal models (Saarela et al., 2015).

Though Examples 1.1–1.3 are presented for univariate Gaussian or discrete variables, in fact the results are entirely general and can be adapted to vectors of arbitrary cardinality and general continuous or mixed variables; implementation does become more complicated in such situations, however. As noted by Robins (2000), calculation of the likelihood becomes a ‘computational nightmare’ for marginal structural models with continuous variables, but we show that copulas can be used to overcome this problem. In the sequel we denote the observational joint density by p with, for example, pY|ZX meaning the conditional density of Y given Z and X. In the discrete case, this is just the probability mass function.

### 1.2 Existing Work

Commonly used alternatives to likelihood-based approaches are generalized estimating equations (GEEs) and semiparametric methods, as these do not require full specification of the joint distribution (Diggle et al., 2002). However, neither approach allows for simulation from the model, and both may be less powerful than likelihood-based methods.

Robins (1992) provides an algorithm for simulating from a Structural Nested Accelerated Failure Time Model (SNAFTM), a survival model in which one models survival time as an exponential variable whose parameter varies with treatment. This is adapted by Young et al. (2008) to simulate from a Cox MSM model. Young et al. (2009) consider a special case of a Cox MSM that approximates a SNAFTM and also a SNCFTM (a special case of the structural nested model—see Section 5). Havercroft and Didelez (2012) consider the problem of specifying models such that bias due to selection effects and blocking mediation effects will be strong if a naïve approach is used.

More recently, Richardson et al. (2017) give a variation independent parameterization for fully conditional models by using the odds product; this also allows for fully likelihood-based methods. This is extended by Wang et al. (2017) to the Structural Nested Mean Model, which we will meet in Section 5. The main difference between this work and ours is that it is much harder to extend their approach to other models and to continuous variables.

Much of the trend in causal inference is towards ‘fully conditional’ models specified by structural equations. For instance, in Example 1.3 this would have meant specifying pY|ZX, which would not have allowed us to directly model pY|X(y|do(x)). In particular, the work of Pearl generally assumes that causal distributions should be conditional on all previous variables, while allowing for some conditional independence constraints (see, for example, Pearl, 2009; Peters et al., 2017). We certainly do not wish to single these authors out for criticism (indeed the authors of this paper have often considered such approaches), but such models do seem to be less useful in epidemiological or other medical contexts, in which conditional independences are often—though not always—implausible assumptions. In such a context, one has to specify distributions conditional on the entire past, which may be very difficult if there are a large number of relevant variables.

### 1.3 Causal Models

Throughout the paper we will have a running example based on Figure 2; each of these examples is labelled with a prefix ‘R’.

Example R1. The model in Figure 2 arises in dynamic treatment models and is studied in Havercroft and Didelez (2012). The variables A and B represent two treatments, with the second treatment depending on both the first (A) and an intermediate outcome L. The variable U is ‘hidden’ or latent, and therefore identifiable quantities are functions of the observed joint distribution pALBY. A typical quantity of interest is the distribution of the outcome Y after interventions on the two treatments A and B. This is identified by the g-formula of Robins (1986) as

 pY|AB(y|do(a,b)) :=∫pY|ALB(y|a,ℓ,b)⋅pL|A(ℓ|a)dℓ. (2)

Havercroft and Didelez note that after specifying a model for pY|AB(y|do(a,b)), it is difficult to parameterize and simulate from the full joint distribution, partly because of the complexity of the relationship (2). They are only able to simulate from the special case of Figure 2 in which L has no direct effect on Y, so any dependence between them is entirely due to the latent variable U.
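A discrete sketch of the g-formula (2), with binary L and hypothetical linear/logistic models of our own choosing:

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def p_l_given_a(l, a):
    """p(L = l | A = a) under a hypothetical logistic model."""
    p1 = expit(-0.3 + 0.9 * a)
    return p1 if l == 1 else 1.0 - p1

def e_y(a, l, b):
    """E[Y | A = a, L = l, B = b] under a hypothetical linear model."""
    return 0.2 + 0.5 * a + 0.7 * l + 0.4 * b

def e_y_do(a, b):
    """E[Y | do(a, b)]: average the conditional mean over p(L | A = a),
    as in the g-formula (2)."""
    return sum(e_y(a, l, b) * p_l_given_a(l, a) for l in (0, 1))
```

Note that even though the integrand is linear in (a, ℓ, b), the marginal causal mean contains the non-linear term expit(−0.3 + 0.9a), which is the source of the difficulties discussed below.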

For related reasons, the model in Figure 2 is also the subject of the so-called g-null paradox (Robins and Wasserman, 1997) when testing the hypothesis of whether pY|AB(y|do(a,b)) depends upon (a,b). This arises because seemingly innocuous parameterizations of the conditional distributions pL|A and pY|ALB lead to situations where the null hypothesis can never hold for any plausible parameter setting: that is, it is impossible for pY|AB(y|do(a,b)) not to depend upon (a,b) unless one of the two conditional distributions above is trivial. The reason for the ‘paradox’ can be understood as a problem of attempting to specify the relationship between Y and (A,B) in two different and potentially incompatible ways.

Note that the g-null paradox is not the same as the presence of singularities (see Appendix A for a formal definition) or non-collapsibility; rather it is a result of non-collapsibility over a marginal model that leads to singularities.

Example R2. Considering Figure 2 again, suppose that we choose E[Y | a, ℓ, b] to depend linearly on a, ℓ and b (including any interactions we wish), and that L is binary and we use a logistic parameterization for its dependence upon A. Then, if A takes four or more distinct values, it is essentially impossible for the causal null hypothesis to hold in such a distribution, even if Y doesn’t depend directly upon A (i.e. β₁ = 0). This is because

$$E[Y \mid do(a,b)] = \sum_{\ell=0}^{1} p_{L|A}(\ell \mid a)\cdot E[Y \mid a,\ell,b] = \beta_0+\beta_1 a+\beta_3 b+\operatorname{expit}(\theta_0+\theta_1 a)\,\beta_2,$$

so the only way for this quantity to be independent of an A with at least four levels is for β₁ = 0 and either β₂ = 0 or θ₁ = 0. This ‘union’ model is singular at β₁ = β₂ = θ₁ = 0, and being in it implies a much stronger null hypothesis (that either β₂ = 0 or θ₁ = 0 in addition to the causal independence) than the one we are interested in investigating.
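The singularity can be checked numerically; the coefficient values here are arbitrary choices of ours:

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def causal_mean(a, b, beta, theta):
    """E[Y | do(a, b)] for the linear-plus-logistic model of Example R2."""
    b0, b1, b2, b3 = beta
    t0, t1 = theta
    return b0 + b1 * a + b3 * b + expit(t0 + t1 * a) * b2

a_grid = np.linspace(-2, 2, 5)

# Even with no 'direct' linear effect of a (b1 = 0), the causal mean still
# varies with a through the L-distribution term, unless b2 = 0 or t1 = 0:
m = causal_mean(a_grid, 0.0, beta=(0.1, 0.0, 0.7, 0.2), theta=(0.0, 1.0))
print(np.ptp(m) > 1e-6)   # True: still depends on a

# The null only holds on the stronger subset (here t1 = 0 as well):
m0 = causal_mean(a_grid, 0.0, beta=(0.1, 0.0, 0.7, 0.2), theta=(0.0, 0.0))
print(np.ptp(m0) < 1e-12)  # True: constant in a
```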

Generally speaking, if we try to state models for pL|A and pY|ALB as well as requiring that pY|AB(y|do(a,b)) does not depend on (a,b), we effectively try to specify the A–Y relationship in two different margins; in the case above these margins are incompatible, leading to the singularity. This is avoided by constructing a smooth, regular and variation independent parameterization, avoiding any redundancy. We show that, in fact, a frugal parameterization of the joint distribution exists that separates into variation independent parameterizations of the quantities

 pALB(a,ℓ,b), pY|AB(y|do(a,b)), and ϕ∗L,Y|AB(ℓ;y|a,b).

This entirely avoids the g-null paradox when considering hypotheses about pY|AB(y|do(a,b)), since variation independence means that it may be freely specified. In addition this parameterization is saturated, so one can logically specify any model in this manner.

Note that the example above does not give a separate specification of the dependence of Y on L that is causal, and the spurious dependence due to the latent parent U: both kinds of dependence are tied up in the association parameter ϕ∗L,Y|AB. An alternative is to explicitly include U in the model, leaving us with

 pUALB(u,a,ℓ,b), pY|AB(y|do(a,b)), and ϕ∗UL,Y|AB(u,ℓ;y|a,b),

where ϕ∗UL,Y|AB has to model the dependence between (U, L) and Y, conditional on (A, B). Of course, some of these quantities will be unidentifiable (specifically, pUALB and ϕ∗UL,Y|AB), but we will want to be able to simulate how well the effects of A and B on Y are estimated in the presence of unobserved confounding of various strengths.

###### Remark 1.4.

Statistical causality is represented using a number of different overlapping frameworks, including potential outcomes (Rubin, 1974), causal directed graphs (e.g. Spirtes et al., 2000), decision theory (Dawid and Didelez, 2010), non-parametric structural equation models (e.g. Pearl, 2009), Finest Fully Randomized Causally Interpretable Structure Tree Graphs (Robins, 1986) and their implementation as Single World Intervention Graphs (Richardson and Robins, 2013). The discussions in this paper are broadly applicable to any of these frameworks. For notational purposes we choose to use Pearl’s ‘do’ operator to indicate interventions. For example, p(y | z, do(X=x)) refers to the conditional distribution of Y given Z = z under an experiment where X is fixed by intervention to the value x. We generally abbreviate this to p(y | z, do(x)). The same quantity in the potential outcomes framework would generally be denoted by P(Yx = y | Z = z).

Though slightly more verbose, the do notation has the advantage that the quantity is more immediately seen to be a conditional distribution indexed by both z and x, which is critical to our method. In fact, the interventions we consider are generally ‘soft’, meaning that we change the distribution of X rather than setting it to a particular value. However, since most of the distributions we actually consider are conditional upon X, this marginal distribution does not play a significant role; therefore, we will continue to use the do notation. Note also that it is ambiguous from notation alone whether p(y | do(x)) is identifiable or not, since it depends upon both the causal model being postulated and the available data; this problem also arises with the other frameworks.

The remainder of the paper is structured as follows: in Section 2 we provide our main assumptions and theoretical results, including the parameterization. In Section 3 we describe how to simulate from our models and give a series of examples, and in Section 4 we show how to fit these models using maximum likelihood estimation. Section 4.1 contains an analysis of real data on the relationship between fibre intake, a polygenic risk score for obesity and children’s BMI. Section 5 contains an extension to models in which the causal parameter is different for distinct levels of , and we conclude with a discussion in Section 6.

## 2 The Frugal Parameterization

Here we present a formalization of the ideas in the introduction. Suppose we have three random vectors (Z, X, Y), where Y is an outcome (or set of outcomes) of interest, and Z and X consist of relevant variables that are considered to be causally prior to Y. There is no restriction on the state-spaces of these variables provided they admit a joint density with respect to a product measure, and satisfy standard statistical regularity conditions. In particular, each of Z, X and Y may be finite-dimensional vector valued, and either continuous, discrete or a mixture of the two. The fact that each of these variables may be vector valued means the method is considerably more flexible than it might at first appear.

Throughout this paper we use the notation pX to denote the marginal density of the random variable X, and θX to denote the parameter in a model for this distribution; similarly pY|X and θY|X relate to the distribution of Y conditional upon X. We will want to consider marginal and conditional distributions that are not obtained by the usual operations; for example, a marginal distribution taken by averaging over a population with a different distribution of covariates. We will typically denote such non-standard distributions by indexing with a star: e.g. p∗Y|X or θ∗Y|X.

We use ϕ to denote parameters that describe the dependence structure of a joint distribution; specifically, such that when combined with the relevant marginal distributions they allow us to recover an entire joint distribution. Examples include odds ratios or the parameters of a particular copula. We also consider quantities that provide such a dependence structure between Y and Z conditional on a third variable X, and denote this as ϕZY|X. Again, if the dependence is in the causal distribution then we will write this quantity as ϕ∗ZY|X.

We will assume that we have three separate, smooth and regular parametric models for pZX, p∗Y|X and ϕ∗ZY|X, with corresponding parameters θZX, θ∗Y|X and θ∗ZY|X. In this sense our model can be equated with θ = (θZX, θ∗Y|X, θ∗ZY|X), and for this reason we will often refer to θ as ‘the model’. For convenience, we will refer to pZXY as the observational distribution, and p∗ZXY as the causal distribution; we do this even though p∗ZXY will not always correspond to any standard causal intervention on pZXY.

### 2.1 Cognate Distributions

The set of quantities we will parameterize is based on distributions that are ‘cognate’ to ordinary conditional distributions (see Definition 2.1 below). Cognate distributions tend to arise if we assume some sort of transportability or generalizability from one population to another: that is, suppose that there are two populations in which the mechanistic dependence of Y on (Z, X) is the same, but whose distributions over (Z, X) differ (for example, a different age distribution). This is related to the notions discussed by Pearl and Bareinboim (2014), but their perspective falls within the ‘fully conditional’ view of causality which we try to avoid in this paper. See the discussion in Section 6.

Note that such quantities do not necessarily have a causal interpretation in terms of experimental interventions. Standardized Mortality Ratios use this approach, for example, as populations are reweighted by age and sex for purposes of comparison (see, for example Keiding and Clayton, 2014).

We say that a function w of two arguments is a kernel if w(z | x) ≥ 0 and ∫ w(z | x) dz = 1 for all x. In other words, it behaves like a conditional density for Z given X.

###### Definition 2.1.

We say that p∗Y|X is a cognate distribution to pY|X within pZXY if it is of the form

 p∗Y|X(y|x) =∫ZpY|ZX(y|z,x)⋅w(z|x)dz, x∈X,y∈Y, (3)

for some kernel w.

It is straightforward to check that p∗Y|X is itself a kernel for Y given X. One may think of p∗Y|X as being a conditional distribution taken from the larger distribution p∗ZXY, where

$$p^*_{ZXY}(z,x,y) = p_{ZXY}(z,x,y)\cdot\frac{p^*_{ZX}(z,x)}{p_{ZX}(z,x)} = p^*_X(x)\cdot w(z \mid x)\cdot p_{Y|ZX}(y \mid z,x). \tag{4}$$

Note that p and p∗ share a conditional distribution for Y given (Z, X)—only the marginal distribution of Z and X has been altered. The marginal distribution p∗X is essentially arbitrary, though we may need it to satisfy some of Assumptions A4–A7 in order to apply our results. Notice that the factorization used in the cognate distribution is not the usual causal one; this is a deliberate choice since it allows a wider range of quantities of interest to be parameterized.
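For discrete variables the reweighting in (3) and (4) amounts to a couple of array operations. Here is a minimal sketch with binary Z, X, Y and the adjustment kernel w(z | x) = pZ(z), which yields the do(x) distribution (cf. Example 2.2); the starting joint is randomly generated:

```python
import numpy as np

rng = np.random.default_rng(2)
# A random strictly positive joint p(z, x, y) over binary variables,
# stored as an array indexed [z, x, y]:
p = rng.uniform(0.05, 1.0, (2, 2, 2))
p /= p.sum()

p_zx = p.sum(axis=2)                      # p(z, x)
p_y_given_zx = p / p_zx[..., None]        # shared conditional p(y | z, x)

# Kernel w(z | x) = p(z): reweight the conditional by the Z marginal,
# which is exactly the integral in (3):
w = p_zx.sum(axis=1)                      # p(z)
p_star_y_given_x = np.einsum('zxy,z->xy', p_y_given_zx, w)

print(np.allclose(p_star_y_given_x.sum(axis=1), 1.0))  # True: a kernel in y
```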

###### Example 2.2.

Our definition of a cognate distribution includes the ordinary conditional distribution as a special case, since setting w(z | x) = pZ|X(z | x) we obtain p∗Y|X = pY|X. If w is a Dirac measure with mass at z₀, then p∗Y|X(y | x) = pY|ZX(y | z₀, x). Common causal quantities obtained by re-weighting are also cognate; for example, given the causal model implied by Figure 1(a) we have

 pY|X(y|do(x)) ≡ ∫Z pY|ZX(y|z,x)⋅pZ(z) dz,

so picking w(z | x) = pZ(z) shows that pY|X(y|do(x)) is also cognate to pY|X. In other words, this formulation allows for adjustment by a subset of the previous variables. The effect of treatment on the treated (ETT) is also a function of a distribution cognate to pY|X; for the model in Figure 1(a) it contrasts E[Y₁ ∣ X=1] which (recall that Yx is the potential outcome for Y when X = x), under causal consistency, is E[Y ∣ X=1], with

 E[Y₀ ∣ X=1] = pY|X(1 | X=1, do(X=0)).

The kernel in this case is w(z | x) = pZ|X(z | 1) for all x.

### 2.2 Assumptions

Take a statistical model indexed by a parameter θ, and let θ₁ and θ₂ be two parameters of the model (i.e. functions of θ). Denote by Θᵢ the range of the parameter θᵢ: that is, the image of the map θ ↦ θᵢ. We say that θ₁ and θ₂ are variation independent if the image of the pair (θ₁, θ₂) is Θ₁ × Θ₂: in other words, if it corresponds to the Cartesian product of all pairs of values obtained by the parameters individually. A variation independent parameterization helps to ensure that the parameters characterize separate, non-overlapping aspects of the joint distribution. Note that we may sometimes refer to distributions being variation independent, and in this case we are referring to their parameterizations.

We are now ready to formally state our assumptions. Suppose that we have three parametric models, for pZX, pY|X and ϕZY|X.

Assumptions.

1. There is a smooth and regular parameterization of the joint distribution pZXY, given by the three parametric models θ = (θZX, θY|X, θZY|X).

2. θZX and θY|X are smooth and regular (and variation independent) parameterizations of the distributions pZX and pY|X.

3. The parameter θZY|X is variation independent of the marginal parameters θZX and θY|X.

###### Definition 2.3.

Let p∗Y|X be a cognate distribution defined by pY|X using the kernel w in (3); then, with p∗ZXY as in (4), we define θ∗Y|X to be the parameterization θY|X applied to p∗Y|X.

Similarly we let θ∗ZY|X be the parameterization θZY|X applied to the dependence structure of p∗ZXY.

In other words, (θ∗Y|X, θ∗ZY|X) are parameters isomorphic to (θY|X, θZY|X), but they apply to the interventional distribution p∗ZXY rather than to pZXY.

###### Remark 2.4.

It is important to note that, though the parameterizations θ and θ∗ are defined to be largely isomorphic and both describe pZXY, they generally do not induce the same model. This is because θ imposes a model on pY|X, whereas θ∗ imposes (the same) model on p∗Y|X; in general, therefore, they will not induce the same family of distributions over (Z, X, Y). Indeed, in a case such as Example 1.3 almost all distributions obtained from θ∗ are not contained within the model induced by θ.

### 2.3 Choices of the Association Parameter

Now that we have formally set out the assumptions, let us return to the original problem. We want to be able to (i) construct, (ii) simulate from, and (iii) fit a model with a particular parametric cognate distribution. In order to do this we have to make some choices. We take pZX and p∗Y|X as given, because they are chosen by the analyst using subject matter considerations; this leaves us to select a parametric family for the dependence structure, and an association parameter within the causal model, ϕ∗ZY|X.

In the case of binary Y and Z the natural choice for such an object is the conditional odds ratio:

$$\phi^*_{ZY|X}(x) = \frac{P^*(Y=1, Z=1 \mid x)\; P^*(Y=0, Z=0 \mid x)}{P^*(Y=1, Z=0 \mid x)\; P^*(Y=0, Z=1 \mid x)}.$$

This is known to be variation independent of the margins p∗Y|X and p∗Z|X, and it also has the property that if the joint density is multiplied by any function of (y, x) or (z, x) it does not change. More specifically, note that p∗ZXY is pZXY multiplied by a function of (z, x) alone, and hence ϕ∗ZY|X = ϕZY|X. In other words, the conditional odds ratios for the causal and observational distributions are the same.

This definition and the invariance result extend to any distribution with conditional density pYZ|X via

$$\phi_{YZ|X}(y, z \mid x) = \frac{p_{YZ|X}(y, z \mid x)\; p_{YZ|X}(y_0, z_0 \mid x)}{p_{YZ|X}(y, z_0 \mid x)\; p_{YZ|X}(y_0, z \mid x)}, \qquad y \neq y_0,\; z \neq z_0,$$

where (y₀, z₀) is an arbitrary baseline value for which the densities in the denominator are always strictly positive for all values of x (Chen, 2007); the values of y₀ and z₀ may be dependent upon x, if it is necessary to avoid a point at which the density is zero. In either case, the joint distribution can—in theory—be recovered from the odds ratio and marginal distributions using the iterative proportional fitting (IPF) algorithm (Csiszár, 1975; Rüschendorf, 1995). Other fitting approaches are discussed by Tchetgen Tchetgen et al. (2010). Note that, for general continuous distributions, it is not possible to implement the algorithm in practice in most cases; an obvious exception to this is the multivariate Gaussian distribution.
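For a pair of binary variables, IPF recovers the joint from the two margins and the odds ratio, because rescaling rows and columns leaves the odds ratio of the starting table unchanged. A minimal sketch (the margin and odds-ratio values are our own):

```python
import numpy as np

def ipf_from_or(p_y, p_z, odds_ratio, tol=1e-12, max_iter=1000):
    """Recover the joint of two binary variables from its margins and odds
    ratio by iterative proportional fitting; row/column rescaling preserves
    the odds ratio of the starting table."""
    t = np.array([[1.0, 1.0], [1.0, odds_ratio]])   # carries the target OR
    t /= t.sum()
    rows = np.array([1.0 - p_y, p_y])               # target Y margin
    cols = np.array([1.0 - p_z, p_z])               # target Z margin
    for _ in range(max_iter):
        t *= (rows / t.sum(axis=1))[:, None]        # match the Y margin
        t *= cols / t.sum(axis=0)                   # match the Z margin
        if np.allclose(t.sum(axis=1), rows, atol=tol):
            break
    return t

joint = ipf_from_or(p_y=0.4, p_z=0.25, odds_ratio=3.0)
```

The returned table has the requested margins to numerical tolerance, and its odds ratio is exactly that of the starting table.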

Alternative possibilities include the risk difference and risk ratio, though these do lack the variation independence in A3 possessed by the odds ratio, unless combined with the odds product as in Richardson et al. (2017). We will use these in Section 5, to parameterize ‘blips’ in a structural nested mean model.

###### Proposition 2.5.

If Z, X and Y are discrete and have strictly positive conditional distribution pZY|X, then using the marginal distributions pZX and p∗Y|X and odds ratio ϕ∗ZY|X satisfies assumptions A1–A3. Indeed, Z can also be a continuous or mixed variable provided that pZX and p∗Y|X satisfy A2 (c.f. Example 1.3).

###### Proof.

This follows from the results of Bergsma and Rudas (2002). ∎

###### Example 2.6.

For Gaussian random variables, or other distributions that are defined by their first two moments, the correlation ρYZ|X satisfies the conditions for being an association parameter ϕ, in the sense that when combined with the marginal distributions for Y and Z one can recover the joint distribution. The log odds ratio parameter for a Gaussian distribution can equivalently be used (Chen, 2007).

###### Example 2.7.

An alternative to the odds ratio for continuous variables is to use a copula C, which separates out the dependence structure from the margins by rescaling the variables via their univariate cumulative distribution functions. A multivariate copula is a cumulative distribution function with uniform marginals; i.e. (in the bivariate case) a function C : [0,1]² → [0,1] which is increasing and right-continuous in each argument, and such that C(u, 1) = u and C(1, v) = v for all u, v ∈ [0,1].

Recall that, for a continuous real-valued random variable Y with CDF FY, the random variable FY(Y) is uniform on [0,1]. The copula model for Y and Z is then

 CYZ(u,v)≡P(FY(Y)≤u,FZ(Z)≤v), u,v∈[0,1].

There is a one-to-one correspondence between bivariate copulas and bivariate CDFs with uniform marginals. By Sklar’s Theorem (Sklar, 1959, see also Sklar, 1973), any copula can be combined with any pair of continuous margins to give a joint distribution, via

 FYZ(y,z)≡C(FY(y),FZ(z)), y,z∈R.

We will assume that the copula is parametric, and then θ∗ZY|X represents the parameters of the particular family of copulas.
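A minimal sketch of the copula construction: simulate dependent uniforms from a Gaussian copula and push them through inverse CDFs of the desired margins (the correlation value and the exponential/uniform margins are our own illustrative choices):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

r = 0.6                                   # Gaussian copula correlation
rng = np.random.default_rng(3)
n = 50_000

# Correlated standard normals, then transform to uniforms: this pair has
# uniform margins but Gaussian dependence structure (the copula)
cov = np.array([[1.0, r], [r, 1.0]])
normals = rng.standard_normal((n, 2)) @ np.linalg.cholesky(cov).T
u = np.vectorize(norm_cdf)(normals)

# Paste arbitrary continuous margins on via inverse CDFs (Sklar's theorem):
y = -np.log(1.0 - u[:, 0])                # Exponential(1) margin
z = 2.0 * u[:, 1] - 1.0                   # Uniform(-1, 1) margin
```

The margins of (y, z) are exactly those specified, while the dependence between them is governed solely by the copula parameter r.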

###### Proposition 2.8.

If Y and Z are continuous with a positive conditional density pZY|X for each x, then any smooth and regular parameterization of their margins p∗Y|X and pZ|X together with a smooth and regular conditional copula satisfies assumptions A1–A3.

###### Proof.

This follows from the results of Sklar (1973). ∎

Note that the use of a copula still gives us a transparent model, since we are simply imposing a marginal structure on the causal distribution p∗Y|X. We will return to these examples in Section 3.

### 2.4 Parameterization

We first give the main result outlined in the introduction: given a weight function w, any frugal parameterization (θZX, θY|X, θZY|X) of pZXY induces a corresponding isomorphic parameterization (θZX, θ∗Y|X, θ∗ZY|X) of p∗ZXY, where p∗Y|X is the cognate distribution to pY|X via w. In other words, in terms of parameterization there is no essential difference between choosing a model for the cognate distribution p∗Y|X or for the ordinary conditional distribution pY|X. When we do this, the smoothness and regularity of the parameterization of the observational model (θY|X) as well as its variation independence to θZX and—possibly—the association parameters, is preserved in the new parameterization of p∗ZXY. The next theorem formalizes this.

We first need to introduce two additional assumptions: that the weighted kernel is smoothly and regularly parameterized by a smooth function of the original parameters, and a relative positivity of the observational distribution.

1. Given the weight function in Definition 2.1, the product of the kernel and the weight function has a smooth and regular parameterization, given by a twice differentiable function with a Jacobian of constant rank.

2. The cognate distribution is absolutely continuous with respect to the observational distribution.

We remark at this point that we now have three separate parameterizations: the causal and observational parameterizations of the same joint distribution, together with the reduced parameters that parameterize the weighted kernel.

###### Theorem 2.9.

Let the observational distribution be given by a frugal parameterization, with a kernel satisfying A4; we also assume that A5 holds. Let the cognate distribution be defined by (3).

Then the frugal parameterization of the observational model satisfies A1 and A2 if and only if the corresponding parameterization of the cognate model satisfies A1 and a modified version of A2 that replaces the conditional distribution with its cognate. In addition, one satisfies A3 with respect to its association parameters if and only if the other does so with respect to its own.

###### Proof.

From A2 (for either or ) and A4 we know that we can obtain from . Then, by A5, we know that

 (5)

Now, the equivalence of assumption A1 for the two models becomes obvious, because one joint distribution can be obtained from the other in either direction, and the ratio in (5) is already determined. The equivalence of the remaining part of A2 follows from Definition 2.3.

The conclusion about the equivalence of the variation independence statements also follows directly from the isomorphism in Definition 2.3. ∎

###### Remark 2.10.

The previous result tells us that, given a suitable dependence measure, we can propose almost arbitrary separate parametric models (i.e. any models satisfying the assumptions of the theorem) for each of the three pieces of the frugal parameterization, and be sure that there exists a unique joint distribution compatible with that collection of models. Of course, this leaves open the question of how we should compute that joint distribution.

Computation is relatively straightforward in some special cases: with an odds ratio parameterization of discrete variables, it is possible to recover the full conditional from the odds ratio in closed form via the method of Chen (2007). If the quantity of interest is truly marginal (i.e. we have actually had to integrate over covariates to compute it), then the IPF algorithm (Csiszár, 1975) can be employed; however, this is impractical in cases other than the discrete and multivariate Gaussian ones. For continuous variables we generally use a copula-based approach.
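To give a flavour of the IPF approach, the following sketch (plain Python; the 2×2 setting, starting table, and iteration count are illustrative assumptions) alternately rescales the rows and columns of a table to match target margins. Because row and column rescalings leave the odds ratio unchanged, starting from a table with the desired odds ratio yields, in the limit, the unique joint with the required margins and odds ratio.

```python
def ipf_2x2(p_row, p_col, odds_ratio, iters=500):
    """Find the 2x2 joint with margins (p_row, 1 - p_row) and
    (p_col, 1 - p_col) and the given odds ratio, by iterative
    proportional fitting. Index 0 corresponds to the first level."""
    # Start from a table that already has the target odds ratio;
    # row/column rescaling preserves it.
    t = [[odds_ratio, 1.0], [1.0, 1.0]]
    row_targets = [p_row, 1.0 - p_row]
    col_targets = [p_col, 1.0 - p_col]
    for _ in range(iters):
        for i in range(2):  # rescale rows to match row margins
            s = t[i][0] + t[i][1]
            t[i][0] *= row_targets[i] / s
            t[i][1] *= row_targets[i] / s
        for j in range(2):  # rescale columns to match column margins
            s = t[0][j] + t[1][j]
            t[0][j] *= col_targets[j] / s
            t[1][j] *= col_targets[j] / s
    return t
```

The same scheme extends to larger discrete tables (and, with moment-matching steps, to the multivariate Gaussian case), but, as noted above, it is impractical beyond those settings.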

Example R3. Picking up Example R1.3 again, for now we consider only the observed variables (though see Example R3.4 in Section 3.4 for details on how to simulate all the variables). Theorem 2.9 says that we can parameterize the model using parametric models of the three pieces

p_ALB(a, ℓ, b)   p_Y|AB(y | do(a, b))   φ*_LY|AB(ℓ, y | a, b).   (6)

For convenience, we choose to factorize p_ALB according to the ordering A, L, B: we specify a marginal distribution for A, take L given A = a to be conditionally exponentially distributed with a mean depending on a, and set

B | A = a, L = ℓ ∼ Bernoulli(expit(γ0 + γ_a a + γ_ℓ ℓ + γ_aℓ aℓ)).

Let us suppose that Y | do(A = a, B = b) is normally distributed, with mean

E[Y | do(A = a, B = b)] = β0 + β_a a + β_b b + β_ab ab

and variance σ². Let φ*_LY|AB be a conditionally bivariate Gaussian copula, with correlation parameter given by some function of a and b. This parameterization now satisfies A1–A3.
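To illustrate the copula piece of this example, the sketch below draws (L, Y) jointly for a fixed treatment pair (a, b) from a bivariate Gaussian copula, with an exponential margin for L and the causal Gaussian margin for Y. All numerical values here (the β's, the exponential mean, and the correlation ρ) are invented for illustration; combining these draws with the model for B given (A, L) in general requires the sampling methods of Section 3.

```python
import math
import random

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def draw_l_y(a, b, rng,
             beta=(0.0, 1.0, 0.5, 0.25),  # (beta0, beta_a, beta_b, beta_ab), assumed
             sd_y=1.0, mean_l=1.0, rho=0.5):
    """Draw (L, Y) for fixed (a, b) via a bivariate Gaussian copula:
    L has an exponential margin, Y has its causal Gaussian margin."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    u1 = phi_cdf(z1)                    # probability transform of the first coordinate
    l = -mean_l * math.log(1.0 - u1)    # inverse exponential CDF
    b0, ba, bb, bab = beta
    y = b0 + ba * a + bb * b + bab * a * b + sd_y * z2
    return l, y
```

Because Y is itself Gaussian, its coordinate needs no probability transform: the correlated normal draw can be rescaled directly to the causal mean and variance.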

### 2.5 Larger Models

The following corollary of Theorem 2.9 allows us to ‘build up’ a frugal parameterization of the joint distribution using several cognate quantities. Given a collection of variables under some natural ordering (typically a temporal ordering), each variable has a well defined set of predecessors.

###### Corollary 2.11.

Let the variables have a joint distribution, and consider the conditional distribution of each variable given its predecessors; let a cognate to each such conditional be given. Then there is a smooth and regular, variation independent parameterization of the joint distribution containing each of these cognate quantities.

###### Proof.

Applying Theorem 2.9 to the final variable in the ordering yields a smooth and regular parameterization of the joint distribution in terms of its cognate; the result then follows by applying the theorem recursively to the earlier variables. ∎

###### Example 2.12.

Young and Tchetgen Tchetgen (2014) consider survival models with time-varying covariates and treatments. Let Y_t be an indicator of survival up to time t (with Y_t = 0 indicating failure), and let L_t and A_t be respectively the covariates and treatment at time t. Young and Tchetgen Tchetgen model the quantities

P(Y_t = 1 | Y_{t−1} = 1, do(a_1, …, a_{t−1})),   t = 1, …, T;

i.e. probability of survival to the next time point given treatment history and survival so far. Under their assumptions these quantities are identifiable via the g-formula as

p(y_t | y_{t−1}, do(ā_{t−1})) = Σ_{ℓ_1,…,ℓ_t} p(y_t | y_{t−1}, ā_{t−1}, ℓ̄_t) ∏_{s=1}^{t} p(ℓ_s | ā_{s−1}, ℓ̄_{s−1}),   (7)

where ā_{t−1} = (a_1, …, a_{t−1}) and similarly for ℓ̄_t (note that we omit some subscripts for brevity). This quantity is cognate to the ordinary conditional p(y_t | y_{t−1}, ā_{t−1}) within the joint distribution, so our previous result tells us that a parameterization exists of the joint distribution that uses these quantities for each t. Given the distribution of the covariates and treatments, the quantities

p(y_t | y_{t−1}, do(ā_{t−1}))   φ*_{Y_t L̄_{t−1} | Ā_{t−1} Ȳ_{t−1}}(y_t, ℓ̄_{t−1} | ā_{t−1}, ȳ_{t−1})

may be used to recover the full joint distribution.
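For a small discrete model, the g-formula (7) can be evaluated by brute-force enumeration over covariate histories. The sketch below (plain Python) is an illustration only: the functions `p_surv` and `p_cov` are hypothetical user-supplied conditionals, and, as in (7), the conditioning on survival is suppressed in the covariate model for brevity.

```python
from itertools import product

def g_formula_survival(a_seq, p_surv, p_cov):
    """Evaluate (7) by summing over binary covariate histories.

    a_seq  : treatments (a_1, ..., a_{t-1})
    p_surv : p_surv(a_seq, l_hist) = P(Y_t = 1 | Y_{t-1} = 1, history)
    p_cov  : p_cov(s, a_prev, l_prev) = P(L_{s+1} = 1 | earlier history)
    """
    t = len(a_seq) + 1  # covariates ℓ_1, ..., ℓ_t are summed over
    total = 0.0
    for l_hist in product([0, 1], repeat=t):
        weight = 1.0
        for s in range(t):
            p1 = p_cov(s, a_seq[:s], l_hist[:s])
            weight *= p1 if l_hist[s] == 1 else 1.0 - p1
        total += p_surv(a_seq, l_hist) * weight
    return total
```

Enumeration grows exponentially in t, which is exactly why the closed-form and IPF constructions discussed earlier matter in practice.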

## 3 Sampling from a marginal causal model

In this section we consider how to sample from the observational distribution using a frugal parameterization: sometimes analytically, but more commonly via the method of rejection sampling. Note that, now that we have constructed a valid parameterization, we no longer need to refer to the auxiliary model used in its construction; from this point on, we discuss only the causal model and the corresponding observational model that it parameterizes.

We first review how one should go about choosing such a parameterization.

1. Choose the causal quantity which you wish to model, or of which you wish to model a function, and select a parameterization (this should include the quantity of interest).

2. Determine the kernel over which we need to integrate in order to obtain that quantity.

3. Introduce a parameterization of the remaining pieces, such that the weighted kernel is smoothly and regularly parameterized by a twice differentiable function of the parameters.

4. Choose a ‘suitable’ parameterization of the dependence between the covariates and the outcome, conditional upon treatment, in the causal distribution.

The three pieces together make up the frugal parameterization. To make point 3 more concrete, in Example 1.3 we can choose the parameterization so that the resulting proposal has heavier tails than the target distribution; as we will see in Section 3.2, this is crucial for sampling.

For point 4, the question of suitability of the dependence measure, we would wish to consider: (i) whether the relevant variables can be modelled with the particular dependence measure selected (e.g. odds ratios are suitable for discrete variables, but not so useful in practice for continuous ones); (ii) the computational cost of constructing the joint distribution; and (iii) whether we want the dependence measure to be variation independent of its baseline measure; if so, that would rule out risk ratios and risk differences. For a larger model with a vector-valued treatment, we might wish to fit different dependence measures for each treatment variable; see Section 5 for an example of this with a Structural Nested Mean Model.
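The rejection-sampling device underlying the heavier-tails requirement in point 3 can be stated in a few lines: if the target density is proportional to q(x)·w(x), we can sample from q, and w is bounded above by some constant, then accepting a draw x from q with probability proportional to w(x) yields an exact sample from the target. A minimal sketch in plain Python follows, with a toy exponential example (all distributional choices here are assumptions for illustration).

```python
import math
import random

def rejection_sample(sample_q, weight, bound, n, rng):
    """Draw n samples from the density proportional to q(x) * weight(x),
    assuming weight(x) <= bound everywhere, by accept/reject against q."""
    out = []
    while len(out) < n:
        x = sample_q(rng)
        if rng.random() * bound < weight(x):
            out.append(x)
    return out

# Toy check: proposal q = Exp(1) (the heavier-tailed distribution),
# weight w(x) = exp(-x) <= 1, so the target is proportional to exp(-2x),
# i.e. Exp(2) with mean 1/2.
rng = random.Random(0)
draws = rejection_sample(lambda r: -math.log(1.0 - r.random()),
                         lambda x: math.exp(-x), 1.0, 5000, rng)
```

If the proposal had lighter tails than the target, the weight would be unbounded and the accept/reject step would fail, which is precisely why point 3 insists on the heavier-tailed parameterization.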

### 3.1 Direct Sampling

For fully discrete or multivariate Gaussian models, it is possible to compute the causal joint distribution and then ‘reweight’ it to obtain the observational distribution. As noted in Proposition 2.5, in the discrete case this is straightforward using (conditional) log odds ratios to obtain a frugal parameterization of the distributions. For example, if Y and L are both binary, taking values in {0, 1}, we can use the conditional odds ratio

φ_LY(a, b) = [p(L=0, Y=0 | a, b) p(L=1, Y=1 | a, b)] / [p(L=0, Y=1 | a, b) p(L=1, Y=0 | a, b)].

For further details, including what happens if there are more than two levels of Y or L, see Bergsma and Rudas (2002). As noted in Section 2, a nice property of the odds ratio as the association parameter is that its value in the observational distribution is always the same as its value in the causal distribution.
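In the 2×2 case the joint distribution can be written down in closed form: given the two margins and the odds ratio, the probability P(L = 1, Y = 1) solves a quadratic equation. The sketch below (plain Python) implements this standard reconstruction; it is given only as an illustration, and is not necessarily the general algorithm of Chen (2007).

```python
import math

def joint_from_margins_and_or(p_y, p_l, psi):
    """Return the 2x2 table {(y, l): prob} with P(Y=1) = p_y,
    P(L=1) = p_l and odds ratio psi, for binary Y and L."""
    if abs(psi - 1.0) < 1e-12:
        p11 = p_y * p_l  # odds ratio 1 means independence
    else:
        # P(Y=1, L=1) is the smaller root of a quadratic in the cell probability.
        s = 1.0 + (p_y + p_l) * (psi - 1.0)
        disc = s * s - 4.0 * psi * (psi - 1.0) * p_y * p_l
        p11 = (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))
    p10 = p_y - p11
    p01 = p_l - p11
    p00 = 1.0 - p11 - p10 - p01
    return {(1, 1): p11, (1, 0): p10, (0, 1): p01, (0, 0): p00}
```

This gives the same answer as iterative proportional fitting but without iteration, which is convenient when the reconstruction must be done inside a likelihood evaluation.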

Example R4. Let us apply this to a discrete version of Example R2.10 from Havercroft and Didelez (2012); we know that the objects in (6) are sufficient to define the model of interest. If all the variables are binary, then we start with a parameterization of p_ALB and p_Y|AB using (conditional) probabilities, and of the dependence φ*_LY|AB using conditional odds ratios.

Assume, for example, that

Y | do(A = a, B = b) ∼ Bernoulli(expit(−1 + a + ab)),

with , and . Then specifying, for instance,