Study design in causal models

11/13/2012
by   Juha Karvanen, et al.
Jyväskylän yliopisto

Abstract

The causal assumptions, the study design and the data are the elements required for scientific inference in empirical research. The research is adequately communicated only if all of these elements and their relations are described precisely. Causal models with design describe the study design and the missing data mechanism together with the causal structure and allow the direct application of causal calculus in the estimation of the causal effects. The flow of the study is visualized by ordering the nodes of the causal diagram in two dimensions by their causal order and the time of the observation. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph. Causal models with design offer a systematic and unifying view of scientific inference and increase the clarity and speed of communication. Examples of causal models for a case-control study, a nested case-control study, a clinical trial and a two-stage case-cohort study are presented.

1 Introduction

Causal models are commonly used to describe the true or hypothesized causal relationships between a set of variables. The model is typically presented as a directed acyclic graph (DAG), where the nodes represent the variables and the edges represent the causal relationships so that the arrow shows the direction of the effect. A graphical model serves as a tool for visualizing and discussing causal relationships but even more importantly it is a mathematically well-defined object from which causal conclusions can be drawn in a systematic way. Causal calculus (Pearl, 1995, 2009) can be used to estimate causal effects from observational data provided that the study has been carefully designed (Rubin, 2008).

Causal models are not sufficient for the estimation of causal effects without the data. After specifying the causal model and the objectives of the study, the first questions of the researcher should be “How should the data be collected?” and “How should the data collection be taken into account in the analysis?” (Heckman, 1979; Rosenbaum and Rubin, 1983). In many fields of science, the data are not obtained as a simple random sample of the population. The pressure of cost-efficiency leads to complex study designs where the expensive measurements are made only for a carefully selected subset of individuals (Reilly, 1996; McNamee, 2002; Langholz, 2007; Kulathinal et al., 2007; Van Gestel et al., 2000; Karvanen et al., 2009a). It is therefore crucial to take the study design into account in the estimation of causal effects. The increased complexity of study designs also emphasizes the need for accurate and efficient reporting (von Elm et al., 2007; Vandenbroucke et al., 2007; Schulz et al., 2010; Moher et al., 2010).

An introduction to causal models with design is given through an example in Section 2. The formal definition of the concept is then presented in Section 3. In Section 4 it is shown how the causal effects can be estimated from a case-control study. Examples describing a clinical trial, a nested case-control study and a two-stage case-cohort study as causal models with design are provided in Section 5. Finally, the benefits, the limitations and the implications of the proposed concept are discussed in Section 6.

2 Introductory example

Pearl (2009) considers an example where the causal effect of smoking $X$ on lung cancer $Y$ is studied. It is assumed that the causal effect is mediated through the tar deposits in the lungs $Z$. In addition, there might be an unknown confounder $U$ which has a causal effect on both $X$ and $Y$ but not on $Z$. Figure 1(a) illustrates the causal model.

Figure 1: Graphical models for the example on the causal effect of smoking on lung cancer

In numerical calculations, Pearl implicitly assumes that the data are obtained as a simple random sample from the population. This assumption is made explicit in Figure 1(b). A population indicator, defined for each individual, represents membership in a finite, well-defined closed population: it has value 1 if the individual belongs to the population and 0 otherwise. A second indicator variable represents the sampling. It has value 1 if the individual was selected to the sample and 0 otherwise. The arrow from the population indicator to the sampling indicator describes the fact that the sample is selected from the population, i.e. selection to the sample implies membership in the population. The value of a selection indicator can be determined by the researcher, which is shown in the graph by using diamond symbols for these nodes.

Variables $X$ (smoking), $Z$ (tar deposits) and $Y$ (lung cancer) are related to the underlying population and are not directly observed, which is shown in the visualization with open circles. Instead, the corresponding measurements $X^*$, $Z^*$ and $Y^*$ are obtained from the sample. Because these variables are observed, they are shown as filled circles. The value of $X^*$ equals the value of $X$ if the individual belongs to the sample; otherwise $X^*$ is not available. This is described in the graph by arrows from $X$ and from the sampling indicator to $X^*$. In other words, the causal assumptions, the study design and the data are all presented in the same graph where the causal effects are defined consistently regardless of the type of the variable.

Instead of simple random sampling, case-control designs are often used in epidemiology to study rare diseases. Figure 1(c) represents a case-control design where the selection for the risk factor measurement is made on the basis of the lung cancer status. In practice, for instance, 1000 lung cancer cases and 1000 non-cases are selected. The lung cancer status is determined for the whole sample. Smoking and tar deposits are measured only for the case-control set. In the graph, there are arrows from the measured lung cancer status and from the sampling indicator to the selection indicator of the case-control set, which indicates that the selection for the case-control set depends on the measured lung cancer status.

It is well known that the study design must be taken into account in the analysis of the data from the case-control design. This means that although Figure 1(a) presents the causal model for both situations (b) and (c), the analysis of the case-control study (c) differs from the analysis of the simple random sample (b). This difference is made explicit by combining the study design with the causal model. As these causal models with design are causal models, the actual estimation of causal effects can be carried out by applying the rules of causal calculus, as demonstrated in Section 4.

3 Causal models with design

The formal definition of causal models with design relies on the definition of causal models as presented by Pearl (2009) and the missing data concept presented by Rubin (1976). The definition of causal models is extended to reflect the elements of inference: the causal assumptions, the study design and the data. The immediate benefit is that the methods of causal calculus are directly applicable for questions related to the study design and estimation. Graphical models with explicit sampling or selection mechanism have been earlier used by Cooper (2000), Geneletti et al. (2009), Didelez et al. (2010) and Bareinboim and Pearl (2012b).

Causal model and probabilistic causal model are defined by Pearl (2009) as follows:

Definition 1 (Structural Causal Model, Pearl 7.1.1)

A causal model is a triple $M = \langle U, V, F \rangle$, where

  1. $U$ is a set of background variables that are determined by factors outside the model;

  2. $V$ is a set $\{V_1, V_2, \ldots, V_n\}$ of variables, called endogenous, that are determined by variables in the model – that is, variables in $U \cup V$; and

  3. $F$ is a set of functions $\{f_1, f_2, \ldots, f_n\}$ such that each $f_i$ is a mapping from (the respective domains of) $U_i \cup PA_i$ to $V_i$ where $U_i \subseteq U$ and $PA_i \subseteq V \setminus V_i$ and the entire set $F$ forms a mapping from $U$ to $V$. In other words, each $f_i$ in $v_i = f_i(pa_i, u_i)$, $i = 1, \ldots, n$, assigns a value to $V_i$ that depends on (the values of) a select set of variables in $V \cup U$, and the entire set $F$ has a unique solution $V(u)$.

Definition 2 (Probabilistic Causal Model, Pearl 7.1.6)

A probabilistic causal model is a pair $\langle M, P(u) \rangle$ where $M$ is a causal model and $P(u)$ is a probability function defined over the domain of $U$.

The causal diagram of a causal model is a directed graph where each node corresponds to a variable and the directed edges point from members of $PA_i$ and $U_i$ toward $V_i$.

Causal model with design can be defined as an extension of the probabilistic causal model presented by Pearl where the notation for selection and missing data follows the lines of (Rubin, 1976):

Definition 3 (Causal model with design)

Causal model with design is a probabilistic causal model that fulfills the following conditions:

  1. Each node in the causal diagram is either a causal node, a selection node or a data node. Each node has an information type attribute with possible values: ‘observed’, ‘not observed’, ‘determined and known’ and ‘determined and unknown’.

  2. Each selection node represents a binary variable with the possible values 1 and 0. There is always a unique selection node (the population node) which is an ancestor of all other selection nodes and has value 1 for the members of the population.

  3. Each data node has two parents, one causal node and one selection node. A causal node cannot be a parent for more than one data node. For a data node $X_i^*$ with parents causal node $X_i$ and selection node $M_i$, it holds

    $$X_i^* = \begin{cases} X_i, & \text{if } M_i = 1, \\ \text{NA}, & \text{if } M_i = 0, \end{cases}$$

    where NA represents a missing value.
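The structural conditions of Definition 3 can be checked mechanically once a diagram is written down as a list of edges. The following Python sketch is only an illustration under stated assumptions: the function name `check_design_structure`, the node names and the dictionary-based graph encoding are hypothetical and do not come from the paper, and the population node is taken to be an ancestor of every other selection node.

```python
def check_design_structure(node_types, edges, population):
    """Return a list of violated structural conditions (empty if none are found).

    node_types: dict mapping node name -> 'causal', 'selection' or 'data'
    edges:      iterable of (parent, child) pairs
    population: name of the population node
    """
    parents = {node: set() for node in node_types}
    children = {node: set() for node in node_types}
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)

    # Collect every descendant of the population node.
    descendants, stack = set(), [population]
    while stack:
        for child in children[stack.pop()]:
            if child not in descendants:
                descendants.add(child)
                stack.append(child)

    problems = []
    for node, kind in node_types.items():
        if kind == "selection" and node != population and node not in descendants:
            problems.append(f"{node}: population node is not an ancestor")
        if kind == "data":
            parent_kinds = sorted(node_types[p] for p in parents[node])
            if parent_kinds != ["causal", "selection"]:
                problems.append(f"{node}: needs one causal and one selection parent")
        if kind == "causal":
            data_children = [c for c in children[node] if node_types[c] == "data"]
            if len(data_children) > 1:
                problems.append(f"{node}: parent of more than one data node")
    return problems


# Hypothetical graph in the spirit of a simple random sampling design.
node_types = {
    "X": "causal", "Y": "causal",
    "M_pop": "selection", "M_sample": "selection",
    "X_obs": "data", "Y_obs": "data",
}
edges = [
    ("X", "Y"), ("M_pop", "M_sample"),
    ("X", "X_obs"), ("M_sample", "X_obs"),
    ("Y", "Y_obs"), ("M_sample", "Y_obs"),
]
print(check_design_structure(node_types, edges, "M_pop"))  # prints []
```

The check covers only the graph structure of items 2 and 3; the information type attributes of item 1 are not modelled in this sketch.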

In the first item of Definition 3, the node types are named and the possible values of the information type attribute are listed. The information type attribute of a variable, with the possible values ‘observed’, ‘not observed’, ‘determined and known’ and ‘determined and unknown’, describes the knowledge of the researcher. In visualizations these types are presented as a filled circle, an open circle, a filled diamond and an open diamond, respectively. In an observational setup, a causal variable is not observed as such; only the corresponding measurement is observed. In an experimental setup, the values of some causal variables can be determined by the researcher. Usually, causal variables determined by the researcher are known but in principle they can also be unknown if the information on the values set for the variable has been lost after the execution of the experiment. The data are by definition always observed. A selection variable can have all four information types. The value of a selection variable is determined when sampling or other selection is applied to the population. The selection variable can be determined and known or determined and unknown. The latter type, ‘determined and unknown’, may occur, for instance, when the sample is drawn from an administrative register with personal identifiers but these are later removed from the data and the researcher is not able to tell which individuals of the population are present in the sample. When the missing data can be identified as an empty record, the selection variable is observed. If the missing individuals are not identified at all, as is the case in left truncation for instance, the selection variable is not observed.

In the second item of Definition 3, the role of the population and the selection variables is specified. Causal assumptions are always made with respect to some finite population known as study source in epidemiology (Miettinen, 2011). There is always only one population node. If there is more than one conceptual population, the population can be defined as the union of the conceptual populations. The conceptual population, for instance, a geographical area, becomes a causal variable in the model. If the causal mechanisms differ by the area, the model contains arrows from the area to the causal nodes where the functions differ between the conceptual populations. This allows defining models where some causal relationships are similar across the areas and some are different. The selection probabilities for the sampling may also differ by the area, which is shown in the model by an arrow from the area to the selection node.

The members of the population can be a priori known or unknown. In the former case, the researcher has a unique identifier, for instance, the social security number, available for each member of the population before the study. In the latter case, the researcher identifies the members of the population only when they enter the study. A selection node induces a subpopulation, which consists of the selected individuals. The causal effects are typically estimated for the population but, for instance, in epidemiological cohort studies the effects are often estimated only for the cohort, also known as the study base (Miettinen, 2011).

In the third item of Definition 3, the relations of the causal variables, the selection variables and the data are specified. The value of a random variable $X_i$ is measured only if the individual is selected to be measured, which is indicated by the selection variable $M_i$. This means that the measured value $X_i^*$ is a random variable which depends on the variables $X_i$ and $M_i$ in a deterministic way. The definition of a univariate random variable is extended so that in addition to the real axis, a random variable may also have a special value ‘NA’ which indicates missing data. With this definition, all elements of scientific inference can be expressed as random variables and their causal relationships. If a data node or a selection node has a directed path to a causal node, the measurement or the selection has a causal effect on the underlying causal variable. This may be the case, for instance, in health examination studies where participation in the study may increase awareness of a healthy lifestyle and consequently also have an impact on later measurements of health indicators.
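The deterministic relation between a causal variable, its selection variable and the resulting data node can be written in a few lines of Python. This is a minimal sketch: the function name `data_node`, the use of `None` as the stand-in for ‘NA’ and the example values are assumptions made for illustration only.

```python
NA = None  # stand-in for the special value 'NA' (an assumption of this sketch)


def data_node(causal_values, selection):
    """Return the data-node values implied by a causal variable and its selection indicator."""
    if len(causal_values) != len(selection):
        raise ValueError("one selection indicator is needed per individual")
    # The measurement equals the causal value when selected, and is missing otherwise.
    return [x if m == 1 else NA for x, m in zip(causal_values, selection)]


cholesterol = [5.2, 6.1, 4.8, 7.0]   # hypothetical values of the causal variable
selected = [1, 0, 1, 0]              # hypothetical selection indicators
print(data_node(cholesterol, selected))  # prints [5.2, None, 4.8, None]
```

Because the mapping is deterministic, all randomness in the data node comes from its two parents, which is what allows the data to be treated as ordinary nodes of the causal diagram.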

In a causal model, the causal effects define a partial ordering between the variables. In addition to this causal time, the time of observation can be linked to each variable in a causal model with design. Together the causal time and the observational time define the relative location of each node in a visualization where the causal time is presented on x-axis and the observational time on y-axis. To make the visualization more informative, the stages of the study can be used as labels for the y-axis as it is done in the examples of Sections 2 and 5.

Measurement error can be added to a causal model with design by introducing two causal variables: the original variable and its version with measurement error. In the graph there is an arrow from the original variable to the version with measurement error. Both are causal variables and thus unobserved as such; only the data node of the version with measurement error is observed for the sample. The original variable is usually not measured unless some kind of benchmark measurements without measurement error are carried out for a subsample. If two variables have correlated measurement errors, an explicit unobserved causal variable is needed to describe the structure of the measurement error. In the graph, there are arrows from this variable to both versions with measurement error, in addition to the arrows from each original variable to its version with measurement error. Again, only the data nodes of the versions with measurement error are observed for the sample.

In causal models with design, sampling and nonresponse are formally treated in a similar way; the only difference is the type of the selection node, which is ‘determined’ for sampling and ‘observed’ for nonresponse. Some conclusions on the type of missing data mechanism (Rubin, 1976) can be made directly from the causal model with design. Let $M_X$ be the selection variable for the measurement of causal variable $X$, and let $X^*$ be the corresponding data node. If there is no (undirected) path from $M_X$ to $X$ except through $X^*$, the data on $X$ are missing completely at random (MCAR), more precisely everywhere MCAR (Seaman et al., 2013). If there is an arrow from $X$ to $M_X$, the data are missing not at random (MNAR). The traditional MCAR/MAR/MNAR classification concerns the data as a whole whereas causal models with design provide a description of the missingness mechanism variable by variable.

Many recent theoretical results on missing data and selection bias in causal inference can be applied to causal models with design. As these results are not defined directly for causal models with design but for other extensions of causal models, transformations are applied as the first step. Mohan et al. (2013) consider estimation when data are MNAR and derive conditions a “missingness graph” should satisfy to ensure the existence of a consistent estimator for a given probabilistic relation. In order to utilize these results, a causal model with design can be collapsed to a missingness graph by removing the intermediate selection nodes, i.e. selection nodes that are not parents of a data node. Formally this can be defined as follows:

Definition 4 (Collapse to a Missingness Graph)

Missingness graph $G'$ is a collapse of a causal model with design with causal diagram $G$ if (i) the set of nodes in $G'$ consists of the causal nodes of $G$, the data nodes of $G$ and such selection nodes of $G$ that are parents of some data node, (ii) there exists an edge from node $A$ to node $B$ in $G'$ if there exists an edge from $A$ to $B$ in $G$ or if $A$ is a causal node and $B$ is a selection node and there exists a directed path from $A$ to $B$ in $G$.
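The collapse of Definition 4 can be sketched in Python as a direct transcription of conditions (i) and (ii). The function name `collapse_to_missingness_graph`, the graph encoding and the small left-truncation example are hypothetical and serve only as an illustration; the sketch assumes that the diagram is acyclic and that every node appears in `node_types`.

```python
def collapse_to_missingness_graph(node_types, edges):
    """Collapse a causal diagram with design into (nodes, edges) of a missingness graph.

    node_types: dict mapping node name -> 'causal', 'selection' or 'data'
    edges:      iterable of (parent, child) pairs of the original diagram
    """
    edge_set = set(edges)
    children = {}
    for parent, child in edge_set:
        children.setdefault(parent, set()).add(child)

    # (i) Keep causal nodes, data nodes and selection nodes that have a data child.
    kept_selection = {
        parent for parent, child in edge_set
        if node_types[parent] == "selection" and node_types[child] == "data"
    }
    kept = {n for n, kind in node_types.items() if kind in ("causal", "data")}
    kept |= kept_selection

    def descendants(start):
        seen, stack = set(), [start]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    # (ii) Keep the original edges between kept nodes ...
    new_edges = {(a, b) for a, b in edge_set if a in kept and b in kept}
    # ... and link each causal node to every kept selection node it reaches.
    for node in kept:
        if node_types[node] == "causal":
            for target in descendants(node) & kept_selection:
                new_edges.add((node, target))
    return kept, new_edges


# Hypothetical example: health status Y0 affects survival to baseline (M_alive),
# sampling (M_sample) happens among the survivors, and Y0 is measured as Y0_obs.
node_types = {
    "Y0": "causal", "Y0_obs": "data",
    "M_pop": "selection", "M_alive": "selection", "M_sample": "selection",
}
edges = [
    ("M_pop", "M_alive"), ("Y0", "M_alive"), ("M_alive", "M_sample"),
    ("Y0", "Y0_obs"), ("M_sample", "Y0_obs"),
]
nodes, collapsed = collapse_to_missingness_graph(node_types, edges)
print(sorted(nodes))      # prints ['M_sample', 'Y0', 'Y0_obs']
print(sorted(collapsed))  # prints [('M_sample', 'Y0_obs'), ('Y0', 'M_sample'), ('Y0', 'Y0_obs')]
```

In the example the intermediate selection nodes disappear, and the dependence of the selection on the health status survives as the direct edge from `Y0` to `M_sample`.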

The results and algorithms by Bareinboim and Pearl (2012b) can be used to mitigate and sometimes to eliminate the selection bias caused by preferential data collection. The results are applicable in the important special case where a single selection node (often marked by $S$) is the parent for all data nodes. In order to apply these results, a causal model with design is first collapsed to a missingness graph and then the data nodes are removed. The transformed graph contains the selection node and all causal nodes. The results by Didelez et al. (2010), Geneletti et al. (2009) and Cooper (2000) can also be applied to the same transformed graph.

Bareinboim and Pearl (2013a, b) consider theoretical conditions for the transfer of experimental results from one or several populations to other populations. Causal models with design have only one population but the transportability results can be used between the conceptual populations. The application of the results and the algorithms by Bareinboim and Pearl (2013a, b) requires that the causal model with design has been collapsed to a selection diagram as follows:

Definition 5 (Collapse to a Selection Diagram for Transportability)

Selection diagram is a collapse of causal model with design with respect to a set of selection variables if (i) the conceptual populations of are identified by the variables of (ii) the set of nodes in consist of the causal nodes of (iii) there exist an edge from node to node in if there exists an edge from to in and does not belong to .

Other recent developments that can be applied to causal models with design include the results on the testability of counterfactuals (Shpitser and Pearl, 2007) and z-identifiability of surrogate experiments (Bareinboim and Pearl, 2012a).

4 Estimation of causal effects

The following steps are required to estimate causal effects using causal models with design:

  1. Specify the causal model.

  2. Check the identifiability of the causal effect in the causal model using the results by Tian and Pearl (2002), Shpitser and Pearl (2006b, a) and Bareinboim and Pearl (2012a). If the effect can be identified, use the rules of causal calculus (Pearl, 1995, 2009) to express the causal effect in terms of observed probabilities.

  3. Expand the causal model to the causal model with design.

  4. Form the likelihood according to the causal model with design and integrate it over the unobserved variables.

  5. Estimate the parameters needed to calculate the causal effect as derived in Step 2.

Causal models with design allow the estimation of causal effects in complex designs using only the rules of causal calculus and the likelihood. This requires, however, that the causal effect can be expressed in terms of observed probabilities (Step 2) and the parameters of the likelihood can be estimated (Step 5). Even if a causal effect is not identifiable in the general nonparametric form it may still be identifiable under a specific parametric model. For example, an instrumental variable may help to identify a causal effect in a linear model but not in a nonlinear model (Pearl, 2009) and the average causal effect in clinical trials with noncompliance can be identified under specific assumptions (Angrist et al., 1996). Even if a causal effect is identifiable in the general nonparametric form, it may not be estimable from the collected data. A well-known example is the MNAR situation where a variable has a causal effect on its selection variable and the estimation is not possible in general without strong assumptions on the selection mechanism (Little and Rubin, 2002).

As an example of the estimation procedure, the smoking and lung cancer example of Section 2 is considered again. The causal model is specified in Figure 1(a) (Step 1). The goal is to estimate the causal effect $P(y \mid do(x))$ where the do-operator represents action/intervention. The result (Step 2)

$$P(y \mid do(x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z) P(x') \tag{1}$$

is obtained applying the following three rules of causal calculus (Pearl, 1995, 2009), stated here for arbitrary disjoint sets of nodes $X$, $Y$, $Z$ and $W$ of a causal diagram $G$:

  1. Insertion and deletion of observations:

    $$P(y \mid do(x), z, w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}$$

  2. Exchange of action and observation:

    $$P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$$

  3. Insertion and deletion of actions:

    $$P(y \mid do(x), do(z), w) = P(y \mid do(x), w) \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}, \overline{Z(W)}}}$$

    where $Z(W)$ is the set of the $Z$-nodes that are not ancestors of any $W$-node in the graph $G_{\overline{X}}$.

Here $G_{\overline{X}}$ represents a graph where the incoming edges of the set of nodes $X$ are removed, $G_{\underline{X}}$ represents a graph where the outgoing edges of the set of nodes $X$ are removed and $G_{\overline{X}\underline{Z}}$ represents a graph where the incoming edges of the $X$-nodes and the outgoing edges of the $Z$-nodes are removed. The rules of causal calculus are sufficient for deriving all identifiable causal effects from observational data (Huang and Valtorta, 2006; Shpitser and Pearl, 2006b) and experimental data (Bareinboim and Pearl, 2012a) for a given population. Alternatively, the back-door and front-door criteria (Pearl, 2009) and the moralization (Lauritzen et al., 1990) can also be used to derive formulas for the causal effects. Algorithms for the automated application of causal calculus have been developed (Tian and Pearl, 2002; Huang and Valtorta, 2006; Shpitser and Pearl, 2006b; Bareinboim and Pearl, 2012a).
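For binary smoking, tar and lung cancer variables, the front-door adjustment of Pearl (2009), which is the kind of expression Step 2 produces for this example, can be evaluated with a few lines of Python. The sketch below is illustrative only: the function name `front_door_binary` and the probabilities passed to it are assumptions chosen for this example, not values stated in the text.

```python
def front_door_binary(p_x1, p_z1_given_x, p_y1_given_xz, x):
    """Return P(Y=1 | do(X=x)) from the front-door formula for binary X, Z, Y.

    p_x1:          P(X=1)
    p_z1_given_x:  dict x -> P(Z=1 | X=x)
    p_y1_given_xz: dict (x, z) -> P(Y=1 | X=x, Z=z)
    """
    p_x = {0: 1.0 - p_x1, 1: p_x1}
    effect = 0.0
    for z in (0, 1):
        # P(Z=z | X=x): how the intervention on X shifts the mediator
        p_z = p_z1_given_x[x] if z == 1 else 1.0 - p_z1_given_x[x]
        # sum over x' of P(Y=1 | x', z) P(x'): effect of the mediator on Y
        inner = sum(p_y1_given_xz[(x_prime, z)] * p_x[x_prime] for x_prime in (0, 1))
        effect += p_z * inner
    return effect


# Illustrative (assumed) probabilities.
p_z1_given_x = {0: 0.05, 1: 0.95}
p_y1_given_xz = {(0, 0): 0.10, (1, 0): 0.90, (0, 1): 0.05, (1, 1): 0.85}

effect_smoking = front_door_binary(0.5, p_z1_given_x, p_y1_given_xz, x=1)
effect_no_smoking = front_door_binary(0.5, p_z1_given_x, p_y1_given_xz, x=0)
print(round(effect_smoking, 4), round(effect_no_smoking, 4))  # prints 0.4525 0.4975
```

Only the observed distributions of smoking, tar given smoking, and lung cancer given smoking and tar enter the computation; the unobserved confounder never has to be specified.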

Next consider the case-control design of Figure 1(c) (Step 3). To estimate the causal effects, the model parameters must be estimated from the data collected according to this design. The likelihood can be factorized according to the graphical model

where represents the model parameters,

represents parameters related to the design and the vector notation, such as

, refers to the variables for all individuals in the population. The distributions are defined with respect to the first argument unless otherwise specified. The likelihood of the observed data is obtained as an integral over the unknown variables , , and (Step 4)

(2)

As the selection is random sampling from the population, the term may be ignored in the estimation of . The selection depends on the response and the term must not be ignored. Note also that although is not a parent of in the causal model, the likelihood (2) has the term .

In Step 5 the likelihood must be written in a parametric form. Finding a good parametrization, i.e. finding a good statistical model, is purely a statistical problem. Causal considerations are not needed in the model selection or in the parameter estimation and the vast literature on these topics is directly applicable. It follows from equation (1) that the probabilities $P(z \mid x)$, $P(y \mid x, z)$ and $P(x)$ are needed to estimate $P(y \mid do(x))$. The same probabilities are also components in the likelihood (2) and it is therefore natural to parametrize them. For simplicity Pearl (2009) assumes that the variables $X$, $Z$ and $Y$ have possible values 0 and 1. The observed probabilities mentioned above can now be parametrized as follows:

With this parametrization, the causal effect of smoking to the risk of lung cancer given by equation (1) can be written as

(3)
(4)

These equations link the model parameters to the causal effects. The dependency of the selection probability on may be parametrized as

As the variables are binary, the data collected according to the case-control design can be presented in the form of frequencies given in Table 1. The size of the population is where is the number of cases selected, is the number of non-cases selected, is the number of cases not selected and is the number of non-cases not selected. In the other words, it is assumed that the lung cancer prevalence in the population is known. The log-likelihood derived from the likelihood (2) becomes

where represents summation over the corresponding marginal and

is a shorthand notation for the marginal probability of . The maximum likelihood estimates of can be obtained by numerical optimization of the log-likelihood. Naturally, a Bayesian analysis may be carried out as well.

Smoking  Tar deposits  Cases  Non-cases
0        0             100    814
1        0             47     5
0        1             3      45
1        1             850    136
sum                    1000   1000
Table 1: Data collected from the case-control study (frequencies of the numerical illustration)

For a numerical illustration, consider a case-control study where 1000 lung cancer cases and 1000 controls are selected for the covariate measurements. The parameters are set according to the (unrealistic) population probabilities used in (Pearl, 2009, page 84). The expected frequencies are shown in Table 1. With these frequencies and the numbers of non-selected individuals and , the maximum likelihood estimates , , , , , , , , and are obtained. The equations (3) and (4) give the causal effects

which are similar to the causal effects estimated from the whole population in (Pearl, 2009, page 84). The differences in the third decimal are due to the rounding of the expected frequencies in Table 1 to the nearest integer.
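The numerical illustration can be approximated with a simple plug-in calculation in Python. This sketch is not the maximum likelihood procedure described above; it replaces the numerical optimization with a reweighting of the case-control frequencies by a known prevalence. Several things are assumptions here: the row order of Table 1 is taken to be (smoking, tar) = (0,0), (1,0), (0,1), (1,1) with cases in the first column, the population prevalence of lung cancer is set to 0.475, and the function names are hypothetical.

```python
# (smoking, tar) -> (cases, non-cases); row labels are an assumption of this sketch
counts = {(0, 0): (100, 814), (1, 0): (47, 5), (0, 1): (3, 45), (1, 1): (850, 136)}
PREVALENCE = 0.475  # assumed known population probability of lung cancer


def joint_from_case_control(counts, prevalence):
    """Rebuild P(x, z, y) from P(x, z | y) in the case-control set and a known P(y)."""
    n_cases = sum(cases for cases, _ in counts.values())
    n_controls = sum(controls for _, controls in counts.values())
    joint = {}
    for (x, z), (cases, controls) in counts.items():
        joint[(x, z, 1)] = prevalence * cases / n_cases
        joint[(x, z, 0)] = (1.0 - prevalence) * controls / n_controls
    return joint


def front_door_from_joint(joint, x):
    """Return P(Y=1 | do(X=x)) by the front-door formula applied to a joint table."""
    p_x = {v: sum(p for (xx, _, _), p in joint.items() if xx == v) for v in (0, 1)}
    effect = 0.0
    for z in (0, 1):
        p_z_given_x = (joint[(x, z, 0)] + joint[(x, z, 1)]) / p_x[x]
        inner = 0.0
        for x_prime in (0, 1):
            p_xz = joint[(x_prime, z, 0)] + joint[(x_prime, z, 1)]
            inner += joint[(x_prime, z, 1)] / p_xz * p_x[x_prime]
        effect += p_z_given_x * inner
    return effect


joint = joint_from_case_control(counts, PREVALENCE)
effect_smoking = front_door_from_joint(joint, x=1)
effect_no_smoking = front_door_from_joint(joint, x=0)
print(round(effect_smoking, 3), round(effect_no_smoking, 3))
```

Under these assumptions the two estimates come out close to 0.45 and 0.50, with the effect under smoking slightly lower; small deviations are expected because the tabulated frequencies are rounded to integers.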

5 Examples with complex study design

The examples presented in this section aim to demonstrate how causal models with design can describe the essential features of complex experimental and observational studies in a precise and illustrative way. The examples are from medicine and epidemiology where complex study designs are commonly used. The first example is based on a real study and causal models with design are used to make conclusions on the identifiability of various causal effects from data missing not at random. The two other examples describe imaginary but realistic scenarios.

Causal graphs with design remove the ambiguity related to the common names of study designs such as retrospective study, prospective study, cohort study, case-control study and two-stage study (Vandenbroucke, 1991; Knol et al., 2008). The process of data collection can be seen directly from the causal graph with design.

For the estimation of causal effects, the procedure presented in Section 4 is applicable. Causal models with design are also useful in the estimation of predictive models when the study design and the missing data mechanism must be taken into account in the analysis. The likelihood factorized according to the causal model with design offers a natural starting point for the parameter estimation in both the frequentist and the Bayesian approach. The idea is to write first the full likelihood for the data, the design and the latent variables, and then see which parts of the likelihood are not needed in the estimation of the parameters of the interest. The likelihood functions for the examples of this section are given in Appendix.

Figure 2 illustrates a causal model with design for the two-stage case-cohort design used in the MORGAM Project (Kulathinal et al., 2007; Evans et al., 2005). The project aims to estimate the impact of classic and genetic risk factors on the risk of cardiovascular diseases. Currently 15 cohorts from 6 countries participate in the genetic component of the project. Most of the cohorts are selected as random samples of the underlying population of a certain age range, typically 24–65 years, although there is variation between the cohorts. Over 50,000 individuals have been examined for the classic risk factors and followed up for mortality and disease endpoints. Due to the cost of genotyping, genes have been measured only for a subset of each cohort. Over 10,000 individuals have been genotyped in the case-cohort setting.

The causal assumptions are described using four variables: genetic risk factors, classic risk factors, health status at baseline and health status at the end of the follow-up. Here classic risk factors are understood to include the actual risk factors such as smoking, cholesterol and blood pressure as well as all relevant background variables measured at baseline. The internal causal structure between these variables is not specified because it is not needed in the following considerations. From the graph it can be read that genes may affect the disease risk directly and via classic risk factors. Classic risk factors measured at baseline may be affected by the health status at baseline. The following conclusions can be made using causal calculus:

  1. The causal effect of genes to disease corresponds to the observed effect in the population

  2. The causal effect of classic risk factors to disease in the population is confounded by genes and health status at baseline

    (5)
  3. The causal effect of classic risk factors to disease conditioned on health status at baseline is confounded by genes

    (6)
  4. Causal effect of genes, classic risk factor and health status at baseline corresponds to the observed conditional effect in the population

In order to see whether these effects can be estimated from the collected data, the study design needs to be investigated. The population is defined to include all individuals living in a specified geographical area and born in specified years. However, the sampling frame is not the birth cohort but the individuals alive at baseline. In other words, individuals who have died before baseline are left truncated. In the graph this left truncation is shown by an arrow from the health status at baseline to an unobserved selection node. Sampling is then carried out and each selected individual makes a decision on participation. This decision depends on health status, socio-economic status and classic risk factors (Chou et al., 1997; Hara et al., 2002; Cohen and Duffy, 2002; Jousilahti et al., 2005; Drivsholm et al., 2006; Knudsen et al., 2010; Alkerwi et al., 2010). The data are MNAR because whether the classic risk factors and the health status at baseline are measured depends on the values of these variables. However, the missingness mechanism may still be ignorable in some analyses. Applying d-separation to the causal model with design it can be concluded that

(7)
(8)
(9)
(10)

From result (7) it follows that the cohort data (and consequently the case-cohort data) cannot be used to estimate the causal (or predictive) effect of genetic risk factors on disease in the population without accounting for the missingness mechanism. Result (8) shows that conditioning on the health status at baseline does not change the situation qualitatively. From result (9) it follows that the cohort data can be used to estimate the predictive model for the healthy population. This kind of conditioning on the health status is commonly used in epidemiology and has also been applied in the MORGAM Project, e.g. in (Asplund et al., 2009). To estimate the causal effects of classic risk factors in the population, the missingness mechanism must be taken into account because equation (5) contains the distribution , which is potentially different for participants and non-participants. Similarly, the term in equation (6) implies that the missingness mechanism must also be taken into account when the causal effects of classic risk factors are estimated for the healthy population. From result (10) it follows that the case-cohort data can be used to estimate the causal effect of genetic risk factors for the healthy population conditional on classic risk factors. As the case-cohort selection depends on , the case-cohort selection should be taken into account in the estimation (Kulathinal et al., 2007; Kulathinal and Arjas, 2006).
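Conditional independence statements of this kind can be checked mechanically. The sketch below tests d-separation with the moralized ancestral graph criterion (Lauritzen et al., 1990) on a deliberately simplified graph. The node names G, Z0, X, Y and M (participation) and the arrows are assumptions chosen for illustration. The graph is not the full model of Figure 2, and the two queries are not results (7) to (10).

```python
def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(parents, xs, ys, zs):
    """True if xs and ys are d-separated by zs in the DAG given by `parents`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)

    # Moralize the ancestral graph: link each child to its parents,
    # link parents of a common child, and drop edge directions.
    adjacent = {node: set() for node in keep}
    for child in keep:
        node_parents = list(parents.get(child, ()))
        for parent in node_parents:
            adjacent[child].add(parent)
            adjacent[parent].add(child)
        for i, first in enumerate(node_parents):
            for second in node_parents[i + 1:]:
                adjacent[first].add(second)
                adjacent[second].add(first)

    # Search for a path from xs to ys that avoids the conditioning set.
    frontier = [node for node in xs if node not in zs]
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in ys:
            return False
        for neighbour in adjacent[node]:
            if neighbour not in zs and neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return True


# Hypothetical simplified graph: participation M depends on X and Z0.
toy_parents = {
    "X": ["G", "Z0"],
    "Y": ["G", "Z0", "X"],
    "M": ["X", "Z0"],
}

print(d_separated(toy_parents, {"Y"}, {"M"}, {"G"}))
print(d_separated(toy_parents, {"Y"}, {"M"}, {"G", "X", "Z0"}))
```

In this toy graph, participation and outcome remain dependent when only G is conditioned on, but become independent once X and Z0 are also conditioned on. This mirrors the kind of reasoning used above, without reproducing it.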

The data on health status at baseline includes information on non-fatal cardiovascular events before baseline. Restricting the analysis to the individuals healthy at baseline, i.e. removing individuals with prior non-fatal events, discards a considerable amount of potentially useful data. In the MORGAM Project, several attempts have been made to use these so-called baseline cases. In (Karvanen et al., 2009b) baseline cases are analyzed separately. The joint analysis of baseline cases and follow-up cases requires compensation for the left truncation, which can be done using nonparametric imputation (Karvanen et al., 2010) or conditional likelihood (Saarela et al., 2009). These works, however, do not take the non-participation into account.

Figure 3 shows how the experimental setup of a clinical trial can be described in a causal model with design. The treatment in the clinical trial is a causal variable determined by the researcher by means of randomization. In the graph, this is represented by causal node , which has the type determined and known. The example also demonstrates the compliance problem encountered in clinical trials: the actual treatment may differ from the allocated treatment if the participant does not follow the instructions given. In the graph, there is an arrow from to the actual treatment and affects outcome only through . In the intention-to-treat analysis, the observed outcome is explained by the intended treatment using all included participants in the trial . In the per-protocol analysis, only the compliant participants with are included.
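The difference between the two analyses can be illustrated with a small simulation. This is a sketch under stated assumptions: the variable names (`covariate`, `allocated`, `actual`), the compliance probabilities and the outcome model are all hypothetical and are not taken from any trial discussed here. The true effect of the actual treatment is set to 1.0, participants with `covariate == 1` comply less often, and the control arm has no access to the treatment.

```python
import random


def simulate_trial(n=20000, seed=0):
    """Simulate (allocated, actual, outcome) triples for a two-arm trial."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        covariate = 1 if rng.random() < 0.5 else 0
        allocated = 1 if rng.random() < 0.5 else 0      # randomized allocation
        p_comply = 0.4 if covariate == 1 else 1.0       # assumed compliance model
        complies = rng.random() < p_comply
        actual = allocated if complies else 0           # non-compliers take no treatment
        outcome = 1.0 * actual - 2.0 * covariate + rng.gauss(0.0, 1.0)
        rows.append((allocated, actual, outcome))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def arm_difference(rows):
    """Mean outcome in the allocated-treatment arm minus the control arm."""
    treated = mean(y for allocated, _, y in rows if allocated == 1)
    control = mean(y for allocated, _, y in rows if allocated == 0)
    return treated - control


def intention_to_treat(rows):
    """Compare arms as allocated, using all included participants."""
    return arm_difference(rows)


def per_protocol(rows):
    """Compare arms using only participants whose actual treatment equals the allocation."""
    return arm_difference([row for row in rows if row[0] == row[1]])


trial = simulate_trial()
print(f"intention-to-treat estimate: {intention_to_treat(trial):.3f}")
print(f"per-protocol estimate:       {per_protocol(trial):.3f}")
```

Under these assumptions the population value of the intention-to-treat contrast is 0.7, because 70% of the treatment arm actually receives the treatment. The per-protocol contrast is about 1.43, because the compliant participants differ from the control arm in the covariate. Neither equals the effect of the actual treatment, which is why the compliance mechanism needs to appear in the graph.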

Figure 4 illustrates a situation where there is a dependence structure between the selection variables of the individuals in the sample. In a nested case-control design, the controls are selected from the individuals at risk at the time (age or calendar time) of the disease event. A control may later become a case, which creates a complicated dependence structure between the selection probabilities (Saarela et al., 2012). Consequently, the selection probability for individual depends on the covariates and outcomes of all other individuals. In the graphical presentation drawn for individual , the case-control selection node has incoming arrows from , , and where index is used to refer to all other individuals.
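The dependence between individuals can be made concrete with a sketch of risk-set sampling. The cohort below, the function name `risk_set_sample` and the rule of drawing `m` controls per case are assumptions made for illustration. Tied event times are not handled, and no particular study's sampling scheme is reproduced.

```python
import random


def risk_set_sample(cohort, m, seed=0):
    """Draw up to m controls for each case from those still at risk at the event time.

    cohort maps an individual id to (exit_time, is_case), where exit_time is
    the time of the disease event or the end of follow-up.
    """
    rng = random.Random(seed)
    case_times = sorted((time, ident) for ident, (time, is_case) in cohort.items()
                        if is_case)
    sampled_sets = []
    for event_time, case_id in case_times:
        # At risk: still under follow-up and event-free after the case's event time.
        at_risk = sorted(ident for ident, (time, _) in cohort.items()
                         if time > event_time)
        controls = rng.sample(at_risk, min(m, len(at_risk)))
        sampled_sets.append((case_id, controls))
    return sampled_sets


# Hypothetical cohort of six individuals: id -> (exit_time, is_case).
toy_cohort = {
    1: (2.0, True),
    2: (5.0, True),
    3: (6.0, False),
    4: (3.0, False),
    5: (8.0, True),
    6: (7.0, False),
}

for case_id, controls in risk_set_sample(toy_cohort, m=2):
    later_cases = [c for c in controls if toy_cohort[c][1]]
    print(f"case {case_id}: controls {sorted(controls)}, later cases {later_cases}")
```

Whether a given individual is eligible as a control for a case depends on that individual's own exit time and on the event times of everyone else. This is the dependence that the incoming arrows from other individuals express in the graph.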

Figure 2: Causal model with design for the two-stage case-cohort design used in the MORGAM Project (Kulathinal et al., 2007; Evans et al., 2005). The sampling frame is conditioned on the health status at the beginning of the study and this dependence must be taken into account when estimates for the population are required. At the first stage of the study, a random sample is selected. The decision to participate may depend on classic risk factors and current health status. Classic risk factors and current health status are measured at the beginning of the study for the cohort members . Blood samples taken at the baseline are frozen to be used later. After a follow-up period of 10 years or more, the selection for the second stage is made on the basis of the measurements and . All disease cases and an age-stratified random subset of the cohort are selected to the case-cohort set for which genetic factors are measured. Nonresponse occurs due to missing or contaminated samples or other technical reasons.
Figure 3: Causal model with design for a clinical trial. A sample is selected for screening from the population . The inclusion for the trial is based on the screening variable . At the baseline, covariate is measured for the trial participants and a randomized decision on the treatment is made. The actual treatment during the treatment period may differ from the intended treatment because of non-compliance. The outcome depends on the covariate and the treatment . At the end of the treatment period, measurements for the observed outcome and the observed treatment are made.
Figure 4: Causal model with design for a nested case-control study. The idea of the case-control design is to select the individuals for the measurement of the expensive risk factor on the basis of the outcome and the inexpensive risk factor . At the first stage, a sample is selected from the population and variables and are measured. The selection of cases and controls depends not only on measurements of individual , and , but also on the outcome and the covariate of all other individuals in the sample. Each individual has a similar causal graph which has been omitted in the figure. The nonresponse reflects the fact that the measurement may not be available for all individuals selected to the case-control set.

6 Discussion

Causal models with design offer a systematic and unifying view of scientific inference. They present the causal assumptions, the study design and the data collection in a way that accounts for the complexity encountered in real-world problems. The examples in Section 5 demonstrate how the concept can be used to describe medical studies with multiple stages. Conclusions on whether a causal or observational relationship can be estimated from the collected incomplete data can be made directly from the graph, as was demonstrated with the MORGAM Project. Despite the complex design, the estimation of the causal effects can be carried out in a systematic way via causal calculus as illustrated in Section 4.

Causal models with design present the population and the selection as intrinsic parts of the model. Selection nodes may have both incoming and outgoing connections to other nodes. A distinction is made between a random variable and its measured value. Combined with the selection this allows the description of various sampling and missing data setups in terms of causal effects.

The limitations of the causal model with design are in many ways similar to the limitations of the causal models in general. The presentation of causal assumptions in the form of a graphical model has the benefit that many problems can be solved without specifying the parameters of the model. On the other hand, the explicit parametric definition of the functional relationships is still the only decisive presentation of the model. Certain causal effects may be identifiable only under specific parametric assumptions such as linearity of the effect.

The implications of the concept are two-fold. First, it ties together causality and study design and opens new possibilities for the practical application of graphical models. Second, it shows the key elements of the study in a compact visual format and thus increases the clarity and speed of communication. High standards of design, analysis and communication of scientific studies will significantly reduce the time and effort needed for the synthesis of scientific knowledge.

Acknowledgement

The author thanks Olli Saarela, Mervi Eerola, Antti Penttinen, Jukka Nyblom, Jaakko Reinikainen and the anonymous referees for their comments that helped to improve the article. Kari Kuulasmaa is acknowledged for the numeric details in the description of the MORGAM Project.

References

  • Alkerwi et al. (2010) Alkerwi, A., Sauvageot, N., Couffignal, S., Albert, A., Lair, M.-L., and Guillaume, M. (2010). Comparison of participants and non-participants to the ORISCAV-LUX population-based study on cardiovascular risk factors in Luxembourg. BMC Medical Research Methodology, 10(1):80.
  • Angrist et al. (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with comments). Journal of the American Statistical Association, 91(434):444–472.
  • Asplund et al. (2009) Asplund, K., Karvanen, J., Giampaoli, S., Jousilahti, P., Niemelä, M., Broda, G., Cesana, G., Dallongeville, J., Ducimetriere, P., Evans, A., Ferrières, J., Haas, B., Jorgensen, T., Tamosiunas, A., Vanuzzo, D., Wiklund, P.-G., Yarnell, J., Kuulasmaa, K., and Kulathinal, S., for the MORGAM Project (2009). Relative risks for stroke by age, sex, and population based on follow-up of 18 European populations in the MORGAM Project. Stroke, 40(7):2319–2326.
  • Bareinboim and Pearl (2012a) Bareinboim, E. and Pearl, J. (2012a). Causal inference by surrogate experiments: z-identifiability. In de Freitas, N. and Murphy, K., editors, Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press.
  • Bareinboim and Pearl (2012b) Bareinboim, E. and Pearl, J. (2012b). Controlling selection bias in causal inference. In JMLR Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, pages 100–108.
  • Bareinboim and Pearl (2013a) Bareinboim, E. and Pearl, J. (2013a). A general algorithm for deciding transportability of experimental results. Journal of Causal Inference, 1(1):107–134.
  • Bareinboim and Pearl (2013b) Bareinboim, E. and Pearl, J. (2013b). Meta-transportability of causal effects: A formal approach. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143.
  • Chou et al. (1997) Chou, P., Kuo, H.-S., Chen, C.-H., and Lin, H.-C. (1997). Characteristics of non-participants and reasons for non-participation in a population survey in Kin-Hu, Kinmen. European Journal of Epidemiology, 13(2):195–200.
  • Cohen and Duffy (2002) Cohen, G. and Duffy, J. C. (2002). Are nonrespondents to health surveys less healthy than respondents? Journal of Official Statistics, 18(1):13–24.
  • Cooper (2000) Cooper, G. F. (2000). A Bayesian method for causal modeling and discovery under selection. In Boutilier, C. and Goldszmidt, M., editors, Proceedings of 16th Conference on Uncertainty in Artificial Intelligence, pages 98–106.
  • Didelez et al. (2010) Didelez, V., Kreiner, S., and Keiding, N. (2010). Graphical models for inference under outcome-dependent sampling. Statistical Science, 25(3):368–387.
  • Drivsholm et al. (2006) Drivsholm, T., Eplov, L. F., Davidsen, M., Jørgensen, T., Ibsen, H., Hollnagel, H., and Borch-Johnsen, K. (2006). Representativeness in population-based studies: a detailed description of non-response in a Danish cohort study. Scandinavian Journal of Public Health, 34(6):623–631.
  • Evans et al. (2005) Evans, A., Salomaa, V., Kulathinal, S., Asplund, K., Cambien, F., Ferrario, M., Perola, M., Peltonen, L., Shields, D., Tunstall-Pedoe, H., and Kuulasmaa, K., for the MORGAM Project (2005). MORGAM (an international pooling of cardiovascular cohorts). International Journal of Epidemiology, 34:21–27.
  • Geneletti et al. (2009) Geneletti, S., Richardson, S., and Best, N. (2009). Adjusting for selection bias in retrospective case-control studies. Biostatistics, 10(1):17–31.
  • Hara et al. (2002) Hara, M., Sasaki, S., Sobue, T., Yamamoto, S., and Tsugane, S. (2002). Comparison of cause-specific mortality between respondents and nonrespondents in a population-based prospective study: ten-year follow-up of JPHC Study Cohort I. Journal of Clinical Epidemiology, 55(2):150–156.
  • Heckman (1979) Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47(1):153–161.
  • Huang and Valtorta (2006) Huang, Y. and Valtorta, M. (2006). Pearl’s calculus of intervention is complete. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 217–224. AUAI Press.
  • Jousilahti et al. (2005) Jousilahti, P., Salomaa, V., Kuulasmaa, K., Niemelä, M., and Vartiainen, E. (2005). Total and cause specific mortality among participants and non-participants of population based health surveys: a comprehensive follow up of 54 372 Finnish men and women. Journal of Epidemiology and Community Health, 59(4):310–315.
  • Karvanen et al. (2009a) Karvanen, J., Kulathinal, S., and Gasbarra, D. (2009a). Optimal designs to select individuals for genotyping conditional on observed binary or survival outcomes and non-genetic covariates. Computational Statistics & Data Analysis, 53:1782–1793.
  • Karvanen et al. (2010) Karvanen, J., Saarela, O., and Kuulasmaa, K. (2010). Nonparametric multiple imputation of left censored event times in analysis of follow-up data. Journal of Data Science, 8:151–172.
  • Karvanen et al. (2009b) Karvanen, J., Silander, K., Kee, F., Tiret, L., Salomaa, V., Kuulasmaa, K., Wiklund, P.-G., Virtamo, J., Saarela, O., Perret, C., Perola, M., Peltonen, L., Cambien, F., Erdmann, J., Samani, N. J., Schunkert, H., and Evans, A., for the MORGAM Project (2009b). The impact of newly-identified loci on coronary heart disease, stroke and total mortality in the MORGAM prospective cohorts. Genetic Epidemiology, 33:237–246.
  • Knol et al. (2008) Knol, M. J., Vandenbroucke, J. P., Scott, P., and Egger, M. (2008). What do case-control studies estimate? Survey of methods and assumptions in published case-control research. American Journal of Epidemiology, 168(9):1073–1081.
  • Knudsen et al. (2010) Knudsen, A. K., Hotopf, M., Skogen, J. C., Øverland, S., and Mykletun, A. (2010). The health status of nonparticipants in a population-based health study: the Hordaland Health Study. American Journal of Epidemiology, 172(11):1306–1314.
  • Kulathinal and Arjas (2006) Kulathinal, S. and Arjas, E. (2006). Bayesian inference from case-cohort data with multiple end-points. Scandinavian Journal of Statistics, 33:25–36.
  • Kulathinal et al. (2007) Kulathinal, S., Karvanen, J., Saarela, O., and Kuulasmaa, K., for the MORGAM Project (2007). Case-cohort design in practice – experiences from the MORGAM Project. Epidemiological Perspectives & Innovations, 4(1):15.
  • Langholz (2007) Langholz, B. (2007). Use of cohort information in the design and analysis of case-control studies. Scandinavian Journal of Statistics, 34:120–136.
  • Lauritzen et al. (1990) Lauritzen, S., Dawid, A., Wen, B., and Leimer, H.-G. (1990). Independence properties of directed Markov fields. Networks, 20:491–505.
  • Little and Rubin (2002) Little, R. J. A. and Rubin, D. B. (2002). Statistical analysis with missing data. Wiley.
  • McNamee (2002) McNamee, R. (2002). Optimal designs of two-stage studies for estimation of sensitivity, specificity and positive predictive value. Statistics in Medicine, 21:3609–3625.
  • Miettinen (2011) Miettinen, O. S. (2011). Epidemiological research: terms and concepts. Springer, Dordrecht.
  • Mohan et al. (2013) Mohan, K., Pearl, J., and Tian, J. (2013). Graphical models for inference with missing data. In Proceedings of Neural Information Processing Systems Conference (NIPS-2013).
  • Moher et al. (2010) Moher, D., Hopewell, S., Schulz, K. F., Montori, V., Gotzsche, P. C., Devereaux, P., Elbourne, D., Egger, M., and Altman, D. G. (2010). CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. Journal of Clinical Epidemiology, 63(8):e1–e37.
  • Pearl (1995) Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–710.
  • Pearl (2009) Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition.
  • Reilly (1996) Reilly, M. (1996). Optimal sampling strategies for two-stage studies. American Journal of Epidemiology, 143(1):92–100.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Rubin (1976) Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
  • Rubin (2008) Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3):808–840.
  • Saarela et al. (2009) Saarela, O., Kulathinal, S., and Karvanen, J. (2009). Joint analysis of prevalence and incidence data using conditional likelihood. Biostatistics, 10:575–587.
  • Saarela et al. (2012) Saarela, O., Kulathinal, S., and Karvanen, J. (2012). Secondary analysis under cohort sampling designs using conditional likelihood. Journal of Probability and Statistics, Article ID 931416:37 pages.
  • Schulz et al. (2010) Schulz, K. F., Altman, D. G., Moher, D., and CONSORT Group (2010). CONSORT 2010 Statement: Updated guidelines for reporting parallel group randomized trials. Annals of Internal Medicine, 152(11):726–732.
  • Seaman et al. (2013) Seaman, S., Galati, J., Jackson, D., and Carlin, J. (2013). What is meant by “missing at random”? Statistical Science, 28(2):257–268.
  • Shpitser and Pearl (2006a) Shpitser, I. and Pearl, J. (2006a). Identification of conditional interventional distributions. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI2006), pages 437–444. AUAI Press.
  • Shpitser and Pearl (2006b) Shpitser, I. and Pearl, J. (2006b). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219–1226. AAAI Press.
  • Shpitser and Pearl (2007) Shpitser, I. and Pearl, J. (2007). What counterfactuals can be tested. In Proceedings of Twenty Third Conference on Uncertainty in Artificial Intelligence, pages 352–359, Vancouver, Canada.
  • Tian and Pearl (2002) Tian, J. and Pearl, J. (2002). A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567–573. AAAI Press/The MIT Press.
  • Van Gestel et al. (2000) Van Gestel, S., Houwing-Duistermaat, J. J., Adolfsson, R., van Duijn, C. M., and Broeckhoven, C. V. (2000). Power of selective genotyping in genetic association analyses of quantitative traits. Behaviour Genetics, 30(2):141–146.
  • Vandenbroucke (1991) Vandenbroucke, J. P. (1991). Prospective or retrospective: what’s in the name? British Medical Journal, 302:249–250.
  • Vandenbroucke et al. (2007) Vandenbroucke, J. P., von Elm, E., Altman, D. G., Gøtzsche, P. C., Mulrow, C. D., Pocock, S. J., Poole, C., Schlesselman, J. J., Egger, M., and for the STROBE Initiative (2007). Strengthening the reporting of observational studies in epidemiology (STROBE): Explanation and elaboration. Epidemiology, 18(6):805–835.
  • von Elm et al. (2007) von Elm, E., Altman, D. G., Egger, M., Pocock, S., Gøtzsche, P., Vandenbroucke, J., and for the STROBE Initiative (2007). The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology, 18(6):800–804.

Appendix: Likelihood factorizations

In this section, likelihood functions are presented for the examples of Section 5. The likelihood functions are derived for the population with the size starting from the factorization that follows directly from the DAG. At the first step, the likelihood function is written assuming that all variables are observed for the whole population. The measurements are redundant in this case because they are deterministic functions of the causal variables and the selection variables. The measurements become explicit when the likelihood function is further factorized according to the selection variables. Finally, the likelihood of the observed data is obtained as an integral over the unknown causal variables.

Parameters define the distribution of the causal variables and parameters define the distribution of the selection variables. A vectorized notation similar to is used for all variables and the distributions are defined with respect to the first argument unless otherwise specified.

The likelihood function for the MORGAM Project case-cohort design presented in Figure 2 has the form