1 Introduction
In the formal framework of causal inference it is sometimes possible to make experimental claims using observational data alone. First, we construct a causal model by encoding our knowledge into a graph and specifying a probability distribution over the observed variables. An experiment can then be carried out symbolically in the model through an intervention, which is an action that forces variables to take specific values irrespective of the mechanism that would otherwise determine their values. The question is whether the observed probability distribution alone is enough to determine the effect of this intervention. This problem, known as the identifiability problem, has been studied extensively in the literature, and solutions in the form of graphical criteria (pearl1995, ; pearl2009, ) as well as algorithms have been proposed (huang2006:complete, ; shpitser2006, ; tian2002:phdthesis, ). Various extensions to the identifiability problem have emerged in recent years. These include concepts such as transportability, where identifiability is considered in a target population, but information for the task is available from multiple source populations (bareinboim2013:metatransportability, ; pearl2014, ).
The presence of unobserved confounders often renders causal effects of interest nonidentifiable from observational data alone. This leads us to ask whether experimental data can be of use in the identification task. The concept of surrogate experiments, or z-identifiability, considers this problem in a setting where, in addition to the observed probability distribution, experimentation is allowed on a set of variables that is disjoint from the interventions of the target causal effect (bareinboim2012a, ) and the experimental distribution of these surrogate experiments is available over all variables. By experimental distribution we mean the distribution of a set of outcome variables when some variables have been intervened on. We consider a more general problem than identifiability: instead of assuming that an experimental distribution over all variables is available, we assume that a collection of experimental distributions is available in which not every variable has necessarily been observed. This kind of setting can occur, for example, in mediation analysis, where we have previously performed an experiment in which the mediator was the outcome variable. Another example is a setting where we are interested in two outcome variables but have only measured one of them in a previous experiment.
In a practical study we usually have access to information about population characteristics when performing an experiment. Sometimes not all of these characteristics are measured in conjunction with the experiment itself, which leads to incomplete knowledge regarding the experimental distribution. Suppose that we are interested in the experimental distribution of another variable, one that was not measured during the experiment. The question is whether this distribution can be obtained from the observational data and the outcome of the previous experiment, which we refer to as the surrogate outcome. We label this generalization of identifiability as surrogate outcome identifiability.
Remarkably, a connection can be drawn between surrogate outcome identifiability and transportability. Transportability is concerned with identifiability across conceptual domains where both observational and experimental data are available from each domain. In practical terms, a domain can be for example a city, and data from multiple domains in this case could be for example the age distributions of the populations of these cities. Naturally, discrepancies between causal mechanisms can arise between domains, which has to be taken into account in the causal modeling framework. Typically, we are interested in the effect of an intervention in a single domain, known as the target domain, and the domains providing additional information for the task are known as source domains. However, existing methods for determining transportability only allow a single experiment to take place within a single domain, whereas our surrogate identifiability is concerned with multiple distributions from differing experiments in a single domain. We incorporate the framework of transportability by depicting each available experiment of the surrogate outcome problem as a source domain of a transportability problem with the same experiments.
An introductory example illustrates the difference between surrogate outcome identifiability and identifiability. We are interested in the causal effect of $X_1$ and $X_2$ on $Y_1$ and $Y_2$ in the graph of Fig. 1, which is easily determined to be nonidentifiable from the joint observational distribution alone, for example via the application of the ID algorithm (shpitser2006, ; tikka2017:causaleffect, ). Suppose now that two surrogate outcomes were measured in previous experiments, providing us with two experimental distributions. The availability of these two distributions cannot be represented as a z-identifiability problem, since they are conditional causal effects and they have common interventions with the target causal effect. We cannot directly regard this problem as a transportability problem either, since we are concerned with only a single domain. The causal effect can nevertheless be identified with the help of the two experimental distributions, which we will show later in Section 3.
In this paper we propose a way to transform a surrogate outcome problem into a transportability problem. We show that the identifiability of the transformed problem is a sufficient condition for identifiability of the surrogate outcome problem. We derive an identifiability algorithm for surrogate outcome problems and implement it as a part of the R package causaleffect (rsoft, ; tikka2017:causaleffect, ).
2 Notation and definitions
We assume that the reader is familiar with graph theoretic concepts fundamental to causal inference and refer them to works such as (koller2009, ). We use capital letters to denote vertices and the respective variables and small letters to denote their values. We sometimes write singleton sets as for clarity. A directed graph with a vertex set and an edge set is denoted by . For a graph and a set of vertices the sets and denote a set that contains in addition to its parents, children, ancestors and descendants in , respectively. A subgraph of a graph induced by a set of vertices is denoted by . This subgraph retains all edges of such that . The graph obtained from by removing all incoming edges of and all outgoing edges of is written as . A backdoor path from to is a path with an edge incoming to and . A topological ordering of is an ordering of its vertices in which every node is smaller than its descendants in . The set of vertices smaller than a vertex in is denoted by . To facilitate analysis of identifiability and the generalization to surrogate outcomes, we must first define the probabilistic causal model (pearl2009, ).
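The graph operations introduced above are straightforward to compute. The following is a minimal Python sketch, not part of the causaleffect package, that assumes a directed graph is stored as a list of (parent, child) pairs. The function names and the four-vertex example graph are hypothetical and chosen for illustration only.

```python
from collections import deque

def ancestors(edges, targets):
    """Return the targets together with all of their ancestors.

    The result includes the targets themselves, mirroring the convention
    that the ancestor set of a vertex set also contains the set itself.
    """
    found = set(targets)
    queue = deque(found)
    while queue:
        v = queue.popleft()
        for a, b in edges:
            if b == v and a not in found:
                found.add(a)
                queue.append(a)
    return found

def topological_order(vertices, edges):
    """Order the vertices so that every vertex precedes its descendants.

    Uses Kahn's algorithm and raises ValueError if the graph is cyclic.
    """
    indegree = {v: 0 for v in vertices}
    for _, b in edges:
        indegree[b] += 1
    ready = sorted(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.pop(0)
        order.append(v)
        for a, b in sorted(edges):
            if a == v:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    if len(order) != len(vertices):
        raise ValueError("graph has a directed cycle")
    return order

# Hypothetical example graph: W -> X -> Z -> Y and W -> Y.
vertices = ["W", "X", "Z", "Y"]
edges = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("W", "Y")]

order = topological_order(vertices, edges)
# Vertices that are smaller than Z in the ordering.
before_z = order[:order.index("Z")]
```

A topological ordering is generally not unique; the sketch breaks ties alphabetically only to make the result reproducible.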
Definition 2.1 (Probabilistic causal model).
A probabilistic causal model is a quadruple $M = \langle U, V, F, P(u) \rangle$, where $U$ is a set of unobserved (exogenous) variables that are determined by factors outside the model, and $V$ is a set of observed (endogenous) variables that are determined by variables in $U \cup V$. $F$ is a set of functions such that each $f_i$ is a mapping from (the respective domains of) $U \cup (V \setminus \{V_i\})$ to $V_i$, and such that the entire set $F$ forms a mapping from $U$ to $V$, and $P(u)$ is a joint probability distribution of the variables in the set $U$.
Each causal model induces a graph through the following construction: a vertex is added for each variable in the model, and a directed edge is added from one variable into another whenever the latter is defined in terms of the former. Conventionally, causal inference focuses on a subclass of models with additional assumptions: each unobserved variable appears in at most two functions of the model, the unobserved variables are mutually independent, and the induced graph of the model is acyclic. Models that satisfy these additional assumptions are called semi-Markovian causal models. The induced graph of a semi-Markovian model is called a semi-Markovian graph. In semi-Markovian graphs every unobserved variable has at most two children. In semi-Markovian models it is common not to depict background variables in the induced graph explicitly. Unobserved variables with exactly two children are not drawn as vertices but are represented by a bidirected edge between the two children instead. Furthermore, unobserved variables with only one or no children are omitted entirely. We also adopt these abbreviations. For semi-Markovian graphs the parent, child, ancestor and descendant sets contain only observed vertices. Additionally, an induced subgraph of a semi-Markovian graph retains any bidirected edges between the vertices that induce it.
A graph induced by a probabilistic causal model also encodes conditional independences among the variables in the model through a concept known as d-separation. We use the definition in (shpitser2008, ), which explicitly accounts for the presence of bidirected edges, making it suitable for semi-Markovian graphs.
Definition 2.2 (d-separation).
A path $P$ in a semi-Markovian graph $G$ is said to be d-separated by a set $Z$ if and only if either $P$ contains one of the following three patterns of edges: $I \rightarrow M \rightarrow J$, $I \leftrightarrow M \rightarrow J$ or $I \leftarrow M \rightarrow J$, such that $M \in Z$, or $P$ contains one of the following three patterns of edges: $I \rightarrow M \leftarrow J$, $I \leftrightarrow M \leftarrow J$, $I \leftrightarrow M \leftrightarrow J$, such that $De(M)_G \cap Z = \emptyset$. Disjoint sets $X$ and $Y$ are said to be d-separated by $Z$ in $G$ if every path from $X$ to $Y$ is d-separated by $Z$ in $G$.
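The d-separation criterion can be checked mechanically. The sketch below does not enumerate paths as in the definition. Instead it uses an equivalent and widely used construction: each bidirected edge is replaced by an explicit latent common parent, the graph is restricted to the ancestors of the sets involved, the result is moralized, and separation is tested by reachability. The function name and the edge-list representation are assumptions made for illustration.

```python
from itertools import combinations

def d_separated(directed, bidirected, xs, ys, zs):
    """Return True if xs and ys are d-separated by zs.

    directed:   list of (parent, child) pairs
    bidirected: list of (a, b) pairs, each standing for a latent confounder
    xs, ys, zs are assumed to be pairwise disjoint sets of observed vertices.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    # Replace every bidirected edge by an explicit latent parent.
    edges = list(directed)
    for i, (a, b) in enumerate(bidirected):
        latent = ("U", i)
        edges += [(latent, a), (latent, b)]
    # Restrict to the ancestral set of xs, ys and zs.
    keep = set(xs | ys | zs)
    frontier = list(keep)
    while frontier:
        v = frontier.pop()
        for a, b in edges:
            if b == v and a not in keep:
                keep.add(a)
                frontier.append(a)
    edges = [(a, b) for a, b in edges if a in keep and b in keep]
    # Moralize: drop directions and connect parents of a common child.
    neighbours = {v: set() for v in keep}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for child in keep:
        pa = [a for a, b in edges if b == child]
        for p, q in combinations(pa, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # xs and ys are separated if no path avoids the conditioning set.
    seen = set(xs)
    frontier = list(xs)
    while frontier:
        v = frontier.pop()
        for w in neighbours[v]:
            if w not in seen and w not in zs:
                seen.add(w)
                frontier.append(w)
    return not (seen & ys)

# Hypothetical examples.
chain = [("X", "Z"), ("Z", "Y")]
chain_blocked = d_separated(chain, [], {"X"}, {"Y"}, {"Z"})
chain_open = d_separated(chain, [], {"X"}, {"Y"}, set())
confounded = d_separated(chain, [("X", "Y")], {"X"}, {"Y"}, {"Z"})
# Bidirected collider X <-> M <-> Y: blocked unless M is conditioned on.
collider = [("X", "M"), ("M", "Y")]
collider_blocked = d_separated([], collider, {"X"}, {"Y"}, set())
collider_open = d_separated([], collider, {"X"}, {"Y"}, {"M"})
```

The last two calls exercise the pattern of two consecutive bidirected edges, which is the case that distinguishes the semi-Markovian definition from the one for ordinary directed acyclic graphs.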
Since we are dealing entirely with semiMarkovian graphs, we will henceforth refer to them simply as graphs. If no conditional independence statements other than those already encoded in the graph are implied by the distribution of the variables in the model, we say that the distribution is faithful (spirtes2000, ).
A causal model allows us to manipulate the functional relationships encoded in the set $F$. An intervention $do(x)$ on a model $M$ forces $X$ to take the specified value $x$. The intervention also creates a new submodel, denoted by $M_x$, where the functions in $F$ that determine the value of $X$ have been replaced with constant functions. The interventional distribution of a set of variables $Y$ in the model $M_x$ is denoted by $P(y \mid do(x))$. This distribution is also known as the causal effect of $X$ on $Y$. Three inference rules known as do-calculus (pearl1995, ) provide the means for manipulating interventional distributions.
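To make the notion of a submodel concrete, the following sketch enumerates a small causal model exactly. The model is hypothetical and was chosen only because it is easy to verify by hand: a single binary unobserved variable enters the functions of both observed variables, and the intervention replaces the function of the manipulated variable with a constant.

```python
from fractions import Fraction

# Hypothetical model: U ~ Bernoulli(1/2), X := U, Y := X XOR U.
# U appears in two functions, so it acts as a confounder of X and Y.
P_U = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def f_x(u):
    return u

def f_y(x, u):
    return x ^ u

def distribution(do_x=None):
    """Joint distribution of (X, Y).

    If do_x is given, the function that determines X is replaced by the
    constant do_x, which yields the submodel of the intervention.
    """
    joint = {}
    for u, p in P_U.items():
        x = f_x(u) if do_x is None else do_x
        y = f_y(x, u)
        joint[(x, y)] = joint.get((x, y), 0) + p
    return joint

obs = distribution()
p_x1 = sum(p for (x, _), p in obs.items() if x == 1)
# Conditioning on X = 1 in the observational distribution.
p_y1_given_x1 = obs.get((1, 1), 0) / p_x1
# Forcing X = 1 by intervention.
p_y1_do_x1 = sum(p for (_, y), p in distribution(do_x=1).items() if y == 1)
```

In this model the conditional probability of the outcome is zero while the interventional probability is one half, which illustrates why the effect of an intervention cannot in general be read off from conditional distributions when confounding is present.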

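Insertion and deletion of observations:
$$P(y \mid do(x), z, w) = P(y \mid do(x), w), \quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}}.$$

Exchange of actions and observations:
$$P(y \mid do(x, z), w) = P(y \mid do(x), z, w), \quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\underline{Z}}.$$

Insertion and deletion of actions:
$$P(y \mid do(x, z), w) = P(y \mid do(x), w), \quad \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}},$$
where $Z(W) = Z \setminus An(W)_{G_{\overline{X}}}$.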
Regarding the identifiability problem, the goal is to transform $P(y \mid do(x))$ into an expression that does not contain the do-operator using do-calculus. A causal effect that admits this transformation is called identifiable, which is formally defined in e.g. (shpitser2006, ). Do-calculus has been shown to be complete with respect to the identifiability problem (shpitser2006, ; huang2006:complete, ) as well as the transportability and z-identifiability problems (bareinboim2013:general, ; bareinboim2012a, ).
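The graphical conditions of do-calculus are evaluated in graphs from which the incoming edges of some vertices and the outgoing edges of others have been removed, as described at the beginning of this section. A minimal sketch of this edge removal for semi-Markovian graphs is given below. The function name and the edge-list representation are assumptions, and the example graph is hypothetical.

```python
def mutilate(directed, bidirected, cut_incoming=(), cut_outgoing=()):
    """Remove incoming edges of cut_incoming and outgoing edges of cut_outgoing.

    A bidirected edge stands for an unobserved common parent, so it counts
    as an incoming edge at both of its endpoints and never as an outgoing one.
    """
    cut_in, cut_out = set(cut_incoming), set(cut_outgoing)
    new_directed = [
        (a, b) for a, b in directed
        if b not in cut_in and a not in cut_out
    ]
    new_bidirected = [
        (a, b) for a, b in bidirected
        if a not in cut_in and b not in cut_in
    ]
    return new_directed, new_bidirected

# Hypothetical example: X -> Z -> Y with X <-> Y.
directed = [("X", "Z"), ("Z", "Y")]
bidirected = [("X", "Y")]

# Incoming edges of X and outgoing edges of Z removed.
result = mutilate(directed, bidirected, cut_incoming={"X"}, cut_outgoing={"Z"})
# Only the outgoing edges of X removed; the bidirected edge survives.
result_out = mutilate(directed, bidirected, cut_outgoing={"X"})
```

Combining this operation with a d-separation test gives a direct way to check whether a particular application of a do-calculus rule is licensed by the graph.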
Special graphs known as c-components (confounded components) are crucial for causal effect identification (shpitser2006, ).
Definition 2.3 (c-component).
Let $G$ be a graph. A c-component $C$ (of $G$) is a subgraph of $G$ such that every pair of vertices in $C$ is connected via a bidirected path (a path consisting entirely of bidirected edges). A c-component $C$ is maximal if there are no vertices in $G$ outside of $C$ that are connected to $C$ in $G$ via bidirected paths and $C$ is an induced subgraph of $G$.
The joint distribution of a causal model admits the so-called c-component factorization with respect to the set of maximal c-components of the induced graph of the model. Henceforth we will use the term c-component to refer to maximal c-components for brevity.
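Maximal c-components are the connected components of the graph when only bidirected edges are considered. The sketch below computes their vertex sets. As an example it uses the bidirected edges of the graph constructed in the R listing of Section 3, under the assumption that the edge pairs marked with the description "U" in that listing encode the bidirected edges between w and z, between z and x_2, and between y_1 and x_1. The function name is hypothetical.

```python
def c_components(vertices, bidirected):
    """Return the vertex sets of the maximal c-components.

    Two vertices belong to the same component if they are connected by a
    path that consists entirely of bidirected edges.
    """
    neighbours = {v: set() for v in vertices}
    for a, b in bidirected:
        neighbours[a].add(b)
        neighbours[b].add(a)
    components = []
    assigned = set()
    for v in vertices:
        if v in assigned:
            continue
        component = {v}
        stack = [v]
        while stack:
            u = stack.pop()
            for w in neighbours[u]:
                if w not in component:
                    component.add(w)
                    stack.append(w)
        assigned |= component
        components.append(component)
    return components

# Vertices and assumed bidirected edges of the introductory example.
vertices = ["x_1", "x_2", "y_1", "y_2", "w", "z"]
bidirected = [("w", "z"), ("z", "x_2"), ("y_1", "x_1")]

components = c_components(vertices, bidirected)
```

A vertex without any bidirected edges, such as y_2 here, forms a c-component on its own.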
If, in addition to the joint observed probability distribution, experimentation is allowed on a set $Z$, the identifiability problem is known as z-identifiability (bareinboim2012a, ). The set $Z$ is known as the set of surrogate experiments.
Definition 2.4 (z-identifiability).
Let be a graph and let , and be disjoint sets of variables such that . The causal effect of on is said to be identifiable from in if is uniquely computable from together with the interventional distributions , for all , in any model that induces .
As an example of a identifiable causal effect, we consider the identification of from and in the graph of Fig. 2. This effect is not identifiable without the experimental distribution, which can be verified for example by using the ID algorithm of (shpitser2006, ).
We derive the effect using do-calculus:
where the second equality follows from the third rule of docalculus, since . The third equality follows from the third rule of docalculus, since and the fourth equality follows from the second rule of docalculus, since . The term is identifiable from via marginalization and conditioning and is identifiable from via marginalization.
The available information in a z-identifiability problem consists of a single observational distribution and experimental distributions resulting from interventions on subsets of $Z$. Our goal is to extend this problem to a setting where experimentation is allowed on the subsets of multiple surrogate experiments. Furthermore, we do not require that the distribution of the entire set of observed variables is known under these experiments or that the experiments have to be disjoint from $X$, the intervention in the target causal effect. We formalize these notions in the following definition.
Definition 2.5 (Surrogate outcome query).
A surrogate outcome query is a quadruple , where is a graph, are disjoint sets of variables. The set of surrogate outcomes is a collection of intervention–outcome pairs such that for all it holds that , , and for each .
While requirements for the sets and may appear complicated, they are only a formal statement of the fact that we require all variables subject to experimentation to precede all of the outcome variables in the causal order. We also assume that outcomes in a single intervention–outcome pair have the same ancestors. This assumption is made for technical reasons and outcomes with different ancestry can still be represented through separate intervention–outcome pairs. The intuition behind these assumptions is that each intervention–outcome pair should correspond to a single experiment where every manipulated variable has a potential effect on the outcomes. For example, in the graph of Fig. 3, we would not consider to be a valid intervention–outcome pair, since manipulating cannot affect .
Identifiability of a causal effect defined by a surrogate outcome query is characterized by the following definition.
Definition 2.6 (Surrogate outcome identifiability).
Let be a surrogate outcome query. Let , where , and let . Then the causal effect of on is said to be surrogate outcome identifiable from in if is uniquely computable from in any model that induces .
The precise formulation of the sets and the experimental distributions is needed to closely connect surrogate outcome identifiability to transportability, as we will show later in Section 3. While the assumption that the interventional distributions are always available for every subset of every intervention set is technical, it can have a real-world interpretation as well. For example, it is realistic to assume that when the joint effect of two medical treatments is studied, either the effect of each individual treatment is already known or the individual effects can be estimated from the same experiment. In many cases, it may be unethical to test for the joint effect if it is not known that the individual treatments are safe and effective.
As an example on surrogate outcome identifiability, we consider the graphs of Fig. 4 and attempt to identify the causal effect of on from and . This corresponds to setting in Definition 2.6. It should be noted that this problem cannot be expressed as a identifiability problem, since the experimental distribution that is available contains an intervention on and it is not a full experimental distribution over the variables and , but is instead restricted to only.
We can derive the effect as follows in both Fig. 4(fig:intro_graph1) and 4(fig:intro_graph2):
Both terms in this expression are computable from : the term can be obtained via conditioning from and the term is already included in . Here the second equality follows from the second rule of docalculus, since . In this trivial example we can easily determine the correct sequence of applications of docalculus to reach the desired expression. In general, it is difficult to find such a sequence or determine whether such a sequence even exists. For tasks such as identifiability, the solution was to construct an algorithm that either derives the expression for the effect, or returns a graph structure that can be used to construct two models where the distributions over the observed variables agree, but the interventional distributions differ. Instead of developing a similar algorithm for surrogate outcome identifiability, we will describe this problem as a transportability problem, for which a complete solution already exists in the form of an algorithm (bareinboim2014:completeness, ).
3 Identifying surrogate outcome queries using transportability
In order to describe the connection between surrogate outcomes and transportability we first provide the definition of a transportability diagram.
Definition 3.1 (Transportability diagram).
Let be a pair of probabilistic causal models relative to domains , sharing a graph . The pair is said to induce a transportability diagram if is constructed as follows: every edge in is also an edge in , contains an extra edge whenever there might exist a discrepancy or between and .
In the above definition, a domain is simply a formalization of the intuitive notion of different contexts of the same phenomena. The domains serve as indices to differentiate between the different causal models that are depicted by the same graph and to associate the available observational and experimental distributions with specific models. We illustrate Definition 3.1 via an example. We consider two models, and that share graph of Fig. 5(fig:tr_exampleG) and have the same causal mechanism with the exception that . This discrepancy between the models is now depicted by the transportability diagram of Fig. 5(fig:tr_exampleD) where the corresponding transportability node and the extra edge have been added. Transportability nodes are denoted by gray squares. We note that transportability diagrams and transportability nodes are sometimes called selection diagrams and selection nodes (bareinboim2013:metatransportability, ) which should not be confused with the concept of selection bias.
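The construction of a transportability diagram amounts to adding one extra parent for every vertex whose mechanism may differ between the domains. The sketch below illustrates this under the assumption that the set of such vertices is given. The tuple-based naming of transportability nodes and both function names are hypothetical.

```python
def transportability_diagram(directed, discrepancies):
    """Add a transportability node pointing into each vertex with a discrepancy.

    directed:      list of (parent, child) pairs of the shared graph
    discrepancies: vertices whose mechanisms may differ between the domains
    """
    extra = [(("T", v), v) for v in sorted(discrepancies)]
    return list(directed) + extra

def remove_transportability_nodes(diagram):
    """Recover the shared graph by dropping every transportability node."""
    return [
        (a, b) for a, b in diagram
        if not (isinstance(a, tuple) and a[0] == "T")
    ]

# Hypothetical shared graph X -> Z -> Y with a discrepancy in Z.
graph = [("X", "Z"), ("Z", "Y")]
diagram = transportability_diagram(graph, {"Z"})
recovered = remove_transportability_nodes(diagram)
```

Because a transportability node has no parents and exactly one child, removing it never disconnects two ordinary vertices.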
The connection between transportability and surrogate outcome identifiability is not obvious. The general idea is to represent every available experimental distribution as a domain where discrepancies described by the transportability nodes take place in variables that have not been measured or randomized in the corresponding experiment, that is in . In the domain experimentation on is available and the goal is to now use the information provided by each domain to derive a transport formula for the causal effect. A transportability problem is often implicitly described by the target of identification and available experiments (e.g. bareinboim2013:metatransportability, ; bareinboim2014:completeness, ). Similarly to a surrogate outcome query, we formalize transportability queries in the following definition.
Definition 3.2 (Transportability query).
A transportability query is an octuple
where is a collection of transportability diagrams relative to source domains , is the graph of the target domain , are disjoint sets of variables, is a collection of sets of variables in which experiments can be conducted in each domain , and is the set of available experiments in the target domain.
Each transportability diagram in depicts the discrepancies between the domains and . Mirroring Definition 2.6, transportability of a causal effect defined by a transportability query is characterized by the following definition.
Definition 3.3 (Transportability).
Let be a transportability query. Let be the pair of observational and interventional distributions of , where , and in an analogous manner, let be the observational and interventional distributions of . Let be the set of available information. The causal effect is said to be transportable from to in with information if is uniquely computable from in any model that induces .
This definition is referred to as transportability in (bareinboim2014:completeness, ). Henceforth the superscript is used to refer to the source domain . A distribution governing a source domain is simply a shorthand notation for the conditional distribution where the transportability nodes of the corresponding domain are active, meaning that , where is the set of all transportability nodes of .
We present an example on transportability of using two source domains. The transportability diagrams and associated with the sources are depicted in Fig. 6(fig:tr_example2D1) and Fig. 6(fig:tr_example2D2) for and , respectively. In transportability diagrams, black squares denote variables for which experimentation is available in the corresponding domain. We assume that experiments on are available in and on in domain . No experiments are available in the target domain . The graph of the target domain can be obtained from either or by simply omitting the transportability nodes. The corresponding transportability query for this problem is
The transport formula can be derived using docalculus as follows:
Where the equalities follow from the following sequence: second equality from rules three and two by and , third equality from rules two and three by and , fourth equality from rule one by and . The last equality is just a rewrite of the terms in the shorthand notation for active transportability nodes of a specific domain.
Next, we will outline the procedure to transform a surrogate outcome identifiability query into a transportability query.
Definition 3.4 (Query transformation).
Let be a surrogate outcome query that is to be transformed into a transportability query , where sets and remain unchanged. The graph of the target domain is . The set of source domains and the collection of their respective transportability diagrams are constructed from as follows: contains an edge for every vertex , where and is the set of vertices of the ccomponent that contains the vertex . The collection of available experiments is obtained directly from by setting ( for ).
The transformation provided by Definition 3.4 serves as our basis for solving a given surrogate outcome identifiability problem. Transportability nodes are used to denote our lack of experimental information and to exert control over which transformed transportability queries should be identifiable. For each set , we know that the flow of information caused by the intervention of will not propagate to nondescendants of , which is why we add a transportability node for each vertex in . However, confounding must also be taken into account in the outcome set , which is why a transportability node is added for each vertex of each ccomponent that shares a vertex with with the exception of ancestors of . Later we will show that a causal effect is surrogate outcome identifiable if the corresponding causal effect obtained from the query transformation is transportable.
We return to the example on surrogate outcome identifiability in Section 2 and show how the surrogate outcome query is transformed into a transportability query in this instance. The task is to identify from and in the graph of Fig. 4(fig:intro_graph1). The corresponding surrogate outcome query is
The set consists of a single element , which means that our transformed query will have a single source domain . The transportability diagram for this domain is constructed according to Definition 3.4 by adding a transportability node for each vertex in . The set is empty so no other transportability nodes have to be added. The resulting transportability diagram is shown in Fig. 7. The transformed query is now
Next, we present an algorithm labeled TRSO for computing transportability formulas that is a modification of the algorithm presented in (bareinboim2014:completeness, ). The purpose of this modified algorithm is to solve transportability queries that have been obtained through a query transformation of a surrogate outcome problem. In the original formulation, experimental information from the source domains is used only if identification in the target domain fails. Instead, we will prioritize experiments over observations to make full use of the available information.
Some restrictions have to be imposed, since when transportability of causal effects is considered we always have access to the full experimental distributions in any domain. This has to be taken into account by preventing certain operations on the joint distributions from being carried out when query transformations for surrogate outcomes are considered. For example, when line 10 is triggered, we check whether the local c-component is affected by transportability nodes and prevent the use of experimental information if this is the case. The original formulation of the algorithm also includes a weighting scheme for effects that can be identified from multiple domains. We omit this part for clarity and use the first domain where an identifiable effect was encountered. The following theorem formally describes the purpose of TRSO.
Theorem 3.1.
Technical details and auxiliary results required to prove Theorem 3.1 are presented in the next section.
We recall the example from Section 1 on identifying from and in the graph of Fig. 1, and solve its query transformation using TRSO. The set of surrogate outcomes contains two intervention–outcome pairs, and . For , transportability nodes are added for
For , transportability nodes are added for
The corresponding transportability diagrams and for the domains and of the query transformation are shown in Fig. 8.
By tracing the algorithm, we trigger line 4 first and obtain
Since , line 2 and then line 1 are triggered for the last term, which is simply . The recursive calls for the first two terms both trigger line 2 due to not being an ancestor of and not being ancestors of . After these calls we have
Line 10 is triggered next for both of the first two terms because
and
This means that intervention on is activated for the first term and intervention on is activated for the second term. Finally, line 7 is triggered for both remaining terms and we obtain
as the final expression. We obtain a solution for the original surrogate outcome problem by simply omitting the domain indicators from this expression
We can also derive the effect using the causaleffect R package with the following commands:
> library(causaleffect)
> library(igraph)
> fig1 <- graph.formula(x_1 -+ y_2, x_1 -+ y_1, w -+ y_1, w -+ y_2,
+   z -+ y_1, x_2 -+ y_2, z -+ y_2, z -+ x_2, w -+ z, z -+ w,
+   z -+ x_2, x_2 -+ z, y_1 -+ x_1, x_1 -+ y_1, simplify = FALSE)
> fig1 <- set.edge.attribute(fig1, "description", 9:14, "U")
> s1 <- list(
+   list(Z = c("x_2"), W = c("y_2")),
+   list(Z = c("x_1"), W = c("y_1"))
+ )
> cat(surrogate.outcome(y = c("y_1", "y_2"), x = c("x_1", "x_2"),
+   S = s1, G = fig1))
\sum_{w,z}P_{x_2}(y_2|x_1,w,z)P(w,z)P_{x_1}(y_1|w,z)
The package uses the notation $P_x(y)$ to denote $P(y \mid do(x))$.
In the next section we prove the correctness of TRSO and show that the omission of the domain indicators from its output always produces a valid expression for the original surrogate outcome identifiable causal effect.
4 Correctness of the modified transportability algorithm
First, we recall that do-calculus is complete with respect to transportability and prove some useful lemmas.
Theorem 4.1 (do-calculus characterization).
The rules of do-calculus together with standard probability manipulations are complete for establishing transportability of causal effects.
Proof.
See (bareinboim2014:completeness, ). ∎
Theorem 4.1 shows that a sequence of valid operations necessarily exists for a transportable causal effect. We define this sequence explicitly.
Definition 4.1 (do-calculus sequence).
Let be a graph or a transportability diagram, let be an identifiable or transportable causal effect and let be a set of available information. A docalculus sequence for in is a pair
where is an $n$-tuple such that each is either a member of the index set or a quintuple such that and if , if and if and . is a sequence of probability distributions such that if is of the first type described above, is obtained from via marginalization for , conditioning for and the chain rule if . If is of the second type, then is obtained from using rule number of do-calculus licensed by the sets and . Furthermore, when is transformed as dictated by the sequence , an expression is obtained such that each term that appears in is a member of or computable from without do-calculus.
The idea is to use a do-calculus sequence of a transportable causal effect to construct a do-calculus sequence for its query transformation counterpart. However, as do-calculus statements stem from d-separation in the underlying graph, we must first establish that d-separation is invariant to the presence of transportability nodes.
Lemma 4.1.
Let be a transportability diagram and let be its subgraph obtained by removing all transportability nodes from . Let be disjoint sets of vertices of such that they do not contain transportability nodes. Then and are dseparated by for every in if and only if and are dseparated by in .
Proof.
(i) Suppose that and are dseparated by in . By assumption and do not contain any transportability nodes. This means that no path from to contains transportability nodes, since a path containing such a node would necessarily have it as one of the path’s endpoints by Definition 3.1. Furthermore, a transportability node cannot be a descendant of a collider by definition. Thus all paths from to remain dseparated if we remove all transportability nodes from .
(ii) Suppose that and are dseparated by in . Adding transportability nodes to cannot create any new paths between and since a transportability node is only connected to other vertices of the graph through a single vertex. As before, a transportability node cannot be a descendant of a collider by definition. Thus all paths between and are dseparated by in for any subset . ∎
Corollary 4.1.
Let be a surrogate outcome query with a graph and let be its query transformation with a collection of transportability diagrams . Then any conditional independence statement that holds in some transportability diagram of also holds in if the sets and do not contain transportability nodes. Conversely, every conditional independence statement of holds in every diagram of .
Proof.
The proof immediately follows from Lemma 4.1 by noting that can be obtained from each element of by removing all transportability nodes. ∎
We show that there always exists a do-calculus sequence such that every operation that manipulates transportability nodes does not manipulate any other vertices at the same time.
Lemma 4.2.
Let be a transportability query and let be the set of all transportability nodes over the domains of and the target diagram . If is a transportable causal effect with transportability information of Definition 3.3, then there exists a docalculus sequence such that whenever is of the form then either or .
Proof.
Let