Even the best possible evidence regarding the effects of a treatment on an outcome is generally not enough to identify the probability that the outcome was caused by the treatment.
For instance, researchers conducting randomised controlled trials may determine that providing a medicine to school children increases the overall probability of good health from one third to two thirds. This information, no matter how precise, is not enough to answer the following question: Is Ann healthy because she took the medicine? It is not even enough to answer the question probabilistically. The reason is that, consistent with these results, it may be that the medicine makes a positive change for 2 out of 3 students, but an adverse change for the remainder: in that case the medicine certainly helped Ann. But it might alternatively be that the medicine makes a positive change for 1 in 3 children but no change for the others. In that case the chances it helped Ann are just 1 in 2. Of the children taking the medicine, two thirds are healthy. Half of these are healthy because of the medicine, whereas the other half would have been healthy anyway.
Put differently, the experimental data identify the “effects of causes” (EoC), but we are interested in the reverse problem of quantifying the “causes of effects” (CoE). The CoE task of defining and assessing the probability of causation (Robins and Greenland, 1989) in an individual case has been considered by Tian and Pearl (2000); Dawid (2011); Yamamoto (2012); Pearl (2015); Dawid, Musio and Fienberg (2016); Dawid, Murtas and Musio (2016); Dawid, Musio and Murtas (2017); and Murtas, Dawid and Musio (2017). Note that this is distinct from the “reverse causal question” of Gelman and Imbens (2013), which is an EoC task aimed at ascertaining which causes have an effect on an outcome.
To understand causes of effects better, we might seek additional evidence along causal pathways. For example, researchers evaluating development programs specify “theories of change” and seek evidence for intermediate outcomes along a pathway linking treatment to outcomes, most simply: Was the treatment received? Was the medicine ingested? Van Evera (1997) describes various tests that might be implemented using such ancillary evidence. A “smoking gun test” searches for evidence that, though unlikely to be found, would give great confidence in a claim if it were to be found; a “hoop test” is a search for evidence that we expect to find, but which, if found to be absent, would provide compelling evidence against a proposition (as if the proposition were asked to jump through a hoop).
Sometimes many points along a causal pathway are investigated. An intervention might be to provide citizens with information on political corruption, in the hope that this will lead to ultimate changes in politicians’ behavior. Researchers might then check many points along a chain of intermediate outcomes. Was the political message delivered? Was it understood? Was it believed? Did it induce a change in behavior by citizens? Did this in turn produce a change in behavior by politicians?
Seeing positive evidence at many points along such a causal chain would appear to give confidence that the final outcome is indeed due to the conjectured cause. This is the core premise of “process tracing,” as deployed by qualitative political scientists (Collier, 2011), as well as of mixed methods research as used in development evaluation (White, 2009). In the most optimistic accounts it is assumed that, as one gets close enough to a process, by observing more and more links in a chain, the link between any two steps becomes less questionable and eventually the causal process reveals itself (Mahoney, 2012, 581).
We here provide a comprehensive treatment of the scope for inferences of this form from knowledge of causal chains. We obtain a general formula for calculating bounds on the probability of causation, for an arbitrary pattern of data along chains of binary variables. We derive implications of this formula, and calculate the largest and smallest upper and lower bounds achievable from any causal chain consistent with the known relation between treatment and outcome. We give special attention to what might appear to be the best possible conditions: those in which causal processes really do follow a simple causal chain, in which researchers have complete experimental evidence about the probabilistic relationship between any two consecutive nodes in the chain, in which the chain is arbitrarily long, in which the causal effect of each intermediate variable on its successor climbs to 1, and in which researchers observe outcomes consistent with positive effects at every point on the chain. We show that such information does indeed increase confidence that an outcome can be attributed to a cause and, for homogeneous chains at least, that the longer the chain the better. However, we find that even under these ideal conditions our ability to narrow the bounds for the probability of causation can be modest. In the example of attributing Ann’s health to good medicine, a homogeneous process with arbitrarily many positive intermediate steps observed might only tighten the bounds from to .
In contrast, we show that non-homogeneous processes can tighten the bounds considerably. For example, suppose Ann was prescribed the medicine and recovered. If we know that being prescribed the medicine is the only way in which Ann could have obtained and taken the medicine, and that taking the medicine helps anyone who would otherwise be sick, then with positive evidence on a single intermediate point on the causal chain—that Ann did indeed take the medicine—we can identify the probability that prescribing the medicine caused Ann’s recovery at . (We are still short of 1, because it is possible that Ann would have recovered even without the medicine.) A process like this, in which we observe a “necessary condition for a sufficient condition”, provides the largest possible lower bound on the probability of causation available from any observations on any chain. At this point we have done the best possible and more data along the chain will not help.
Although achieving identification of the probability of causation at 1 is generally elusive, negative data can yield identification at 0, either in two steps from a heterogeneous process, or from alternating data along an infinite homogeneous chain. In this sense, information on mediators can support “hoop” tests but not “smoking gun” tests.
1.1 Plan of paper
Existing results (Dawid, Murtas and Musio, 2016) have considered the case of a single unobserved mediator. We generalize this in two ways. First, we consider situations with chains of arbitrary length. Secondly, we calculate bounds for general data, that is, for situations in which the values of none, some or all the mediators are observed.
We proceed as follows. Section 2 introduces the set-up, and provides general formulae for bounding the probability of causation for a simple one-step process. In § 3 we extend these results to cases in which we know the structure of a complete mediation process. We consider various degrees of knowledge of the values of the mediators for the individual case at hand: all unobserved, all observed, or just some observed. Our main result is Theorem 4, which provides a general formula applicable to all cases.
Section 4 draws out the detailed implications of this result in a variety of contexts. In § 4.1 we investigate the largest achievable lower and upper bounds from any sequence, and find that these can be achieved by heterogeneous two-step processes. Section 4.2 examines the case of homogeneous processes of arbitrary length. We show that an alternating pattern for the values at all intermediate points can lead to a limiting value of 0 for the probability of causation. However, it is not generally possible for even the most positive evidence to identify the probability of causation (and a fortiori not possible to identify it at 1), even in the limit of infinitely many steps. Section 4.3 considers implications of our results for gathering data on mediators. In § 5 we compare the bounds based on knowledge of mediator processes with those achievable from knowledge of covariates, which can be much tighter. We summarise our findings in § 6. Various technical details for the proofs in the paper are elaborated in three appendices.
We consider a binary treatment variable and binary outcome variable . We suppose we have access to experimental (or unconfounded observational) data supplying values for , where we use the notation to denote a regime in which is set to value by external intervention.
Then is the average causal effect of on , while is a measure of how common is.
The transition matrix from to (where the row and column labels of any such matrix are implicitly and in that order) can be written:
All entries of must be non-negative: this holds if and only if
We have equality in (2) if and only if one of the entries of (1) is 1, in which case we term degenerate. For , this will happen if either , in which case and can be thought of as a sufficient condition for ; or , in which case , and can be thought of as a necessary condition for . Defining, for ,
we might thus regard as measuring the relative sufficiency of for . (Although we do not focus on it, for the analogous quantity can be interpreted as the relative sufficiency of for .)
2.1 Potential outcomes and causes of effects
While knowledge of the transition matrix , and in particular the “average causal effect” , is directly relevant for EoC (“effects of causes”) analysis, it is not enough to support CoE (“causes of effects”) analysis. For this we need to introduce the pair of potential outcomes, , where we conceive of as the value would take, if . We regard both and as existing simultaneously, even prior to setting the value of , and as having a bivariate probability distribution.
We can now define the following events in terms of (where denotes , the value distinct from , etc.):
- General causation
That is, changing the value of will result in a change to the value of . We can also describe this as “ affects .”
When the relevant variables and are clear from the context we will simplify the notation to .
- Specific causation
:= “” (for or ).
That is, changing the value of from to would change the value of from to . We can also describe this as “ causes .” When the relevant variables and are clear from the context we will simplify the notation to .
We note that .
- Probability of Causation.
In cases of interest we will have observed , and want to know the probability that caused , given this information. We denote this quantity by , or when the relevant variables and are clear from the context. Thus
The joint distribution for, while constrained by knowledge of the transition matrix , is in general not fully determined by it. Rather, we can only deduce that it has the form of Table 1, where the marginal probabilities agree with (1) according to .
However, the internal entries of Table 1 are not determined by , but have one degree of freedom, expressed by the “slack” quantity = . We see that
the probability of general causation.
The only constraints on are that all internal entries of Table 1 must be non-negative, which holds if and only if
In particular , and thus the bivariate distribution of in Table 1, is uniquely determined by if and only if is degenerate.
We further note
whence, by (6),
Throughout this article we shall assume no confounding, expressed mathematically as . Then
This analysis delivers the following lower and upper bounds (prefix “s” for “simple”):
In the absence of additional information, the above bounds constitute the best available inference regarding the probability of causation.
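As a numerical illustration of these simple bounds, the following Python sketch computes them from the two interventional probabilities alone. It assumes no confounding and uses the standard Fréchet limits on the joint probability of the two potential outcomes; the function name `simple_pc_bounds` and its argument names are illustrative choices, not notation from the paper.

```python
from fractions import Fraction


def simple_pc_bounds(p_y1_do1, p_y1_do0, x=1, y=1):
    """Bounds on the probability that X=x caused Y=y (no confounding assumed).

    p_y1_do1, p_y1_do0: P(Y=1) when X is set to 1 and to 0 (hypothetical names).
    """
    p_do = {1: Fraction(p_y1_do1), 0: Fraction(p_y1_do0)}
    # P(Y=y) when X is set to x, and when X is set to the other value.
    p_same = p_do[x] if y == 1 else 1 - p_do[x]
    p_other = p_do[1 - x] if y == 1 else 1 - p_do[1 - x]
    if p_same == 0:
        raise ValueError("the observed outcome has probability zero")
    # The joint probability that Y=y under x and Y!=y under the other value
    # lies between max(0, p_same - p_other) and min(p_same, 1 - p_other).
    lower = max(Fraction(0), p_same - p_other) / p_same
    upper = min(p_same, 1 - p_other) / p_same
    return lower, upper


# Medicine example from the introduction: good health rises from 1/3 to 2/3.
ann_lower, ann_upper = simple_pc_bounds(Fraction(2, 3), Fraction(1, 3))
print(ann_lower, ann_upper)  # 1/2 1
```

The output reproduces the two scenarios described for Ann: the data are compatible with a probability of causation anywhere between one half and one.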
Specifically, when , on defining
we have the following upper bounds:
2.2 Special case
Of particular interest are cases where (so the overall effect of on is positive) and we observe positive outcomes, , . In this case we omit the subscript . We have
and interval bounds given by
PC is identified (i.e., the interval in (26) reduces to a single point) if and only if , which holds when is degenerate with either the lower left or upper right element of being 0. In the former case , while in the latter case .
More generally, we have , so .
3 Bounds from mediation
We now suppose that, in addition to and , we can gather data on one or more binary mediator variables . We also define and . We are interested in assessing the probability that caused for a new case where we have information on the values of some or all of the mediators .
We assume that the data are based on experiments, or in any case are such as to allow us to determine the one-step interventional probabilities , . We shall here confine attention to the case of a complete mediation sequence, where
We shall further suppose that, for any new case considered, there is no confounding at every step, so that
In this case the sequence of observations
on a new case will form a (generally non-stationary) Markov chain. This is an empirically testable consequence of our assumptions, which would therefore be falsified if the Markov property were found to fail (although the assumptions are not guaranteed to be valid when it is found to hold).
Let the transition matrix from to be , and the overall transition matrix from to be . We shall write
to indicate that we are assuming the above mediation sequence, and refer to (27) as a decomposition of the matrix . In particular we then have .
We can readily show by induction that
In particular, for the case , (29) becomes
On account of (28) we have the following result:
The average causal effect of on is the product of the successive average causal effects of each variable in the sequence on the following one.
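The product rule for average causal effects can be checked numerically. The sketch below is only an illustration: the helper names (`transition`, `compose`, `ace`) and the two-step chain are hypothetical, and the matrices follow the convention that rows and columns are ordered 1, 0.

```python
from fractions import Fraction


def transition(p_do1, p_do0):
    """2x2 transition matrix; rows and columns are ordered 1, 0."""
    p1, p0 = Fraction(p_do1), Fraction(p_do0)
    return [[p1, 1 - p1], [p0, 1 - p0]]


def compose(first, second):
    """Matrix product of two consecutive one-step transition matrices."""
    return [[sum(first[i][k] * second[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def ace(matrix):
    """Average causal effect: P(next=1 | set to 1) - P(next=1 | set to 0)."""
    return matrix[0][0] - matrix[1][0]


# Hypothetical two-step chain (numbers chosen only for illustration).
step_one = transition(Fraction(1, 2), Fraction(0))
step_two = transition(Fraction(1), Fraction(1, 3))
overall = compose(step_one, step_two)

print(ace(step_one), ace(step_two), ace(overall))  # 1/2 2/3 1/3
```

Here the overall effect, 1/3, equals the product of the two one-step effects, 1/2 and 2/3.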
Again, to conduct CoE rather than EoC analysis, we introduce, for , bivariate variables
where denotes the potential value of under , supposed unaffected by values of previous ’s. We further assume that the variable is common to all the various worlds, whether actual or counterfactual, under consideration. The actually realised values satisfy .
As the expression of our “no confounding” assumptions, we impose mutual independence between , ,…,.
. That is to say, affects if and only if each affects the next.
Proof. Suppose first that each variable affects the next. Then changing the value of will change that of , which in turn will change that of , and so on until the value of is changed, so showing that affects . Conversely, if, for some , does not affect , then, whether or not has been changed, the value of will be unchanged, whence so too will that of , and so on until the value of is unchanged, whence does not affect .
Given the detailed information on the decomposition (27), the constraints on are now:
On account of (i) we have:
For any decomposition, the probability that affects is the product of the probabilities that each variable in the sequence from to affects the next in the sequence.
Proof. Consider first the case . Then
It follows that
The result for general follows easily by induction.
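The same product rule can be verified by brute force for a small chain. The following sketch enumerates every combination of potential-response pairs, assuming (as in the no-confounding condition) that the pairs belonging to different steps are mutually independent. The dictionaries `step_a` and `step_b` are hypothetical joint distributions chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def prob_affects(step):
    """P(next(0) != next(1)) for one step; keys are (value under 0, value under 1)."""
    return step[(0, 1)] + step[(1, 0)]


def prob_chain_affects(steps):
    """Exact probability that the first variable affects the last one.

    Enumerates all combinations of response pairs, treating the pairs of
    different steps as mutually independent (an assumption of this sketch).
    """
    total = Fraction(0)
    for combo in product(*(step.items() for step in steps)):
        weight = Fraction(1)
        end_from_0, end_from_1 = 0, 1  # first variable set to 0 and to 1
        for (under_0, under_1), prob in combo:
            weight *= prob
            end_from_0 = under_1 if end_from_0 else under_0
            end_from_1 = under_1 if end_from_1 else under_0
        if end_from_0 != end_from_1:
            total += weight
    return total


# Hypothetical joint distributions of the response pairs for two steps.
step_a = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 2),
          (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 8)}
step_b = {(0, 0): Fraction(1, 3), (0, 1): Fraction(1, 3),
          (1, 0): Fraction(0), (1, 1): Fraction(1, 3)}

print(prob_chain_affects([step_a, step_b]))         # 5/24
print(prob_affects(step_a) * prob_affects(step_b))  # 5/24
```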
We note that the above condition for strict inequality in (34), while sufficient, is not necessary. For example, in the case it will also hold if and have different signs, since then we would have strict inequality in (32).
Consider two decompositions and , where . Then the upper bound for for the former does not exceed that for the latter.
3.1 Bounds when mediators are unobserved
Suppose first that, for the new case, we have observed , but the values of the mediators are not observed. Even in this case, as shown for the two-term decomposition in Dawid, Murtas and Musio (2016), knowledge of the decomposition (27) of can alter the bounds for PC.
Indeed, in this case (4) still applies, where is given by (7) or (8) as appropriate, but now with subject to the revised bounds of (31). In each case the lower bound is unaffected, but, by Theorem 3, the upper bound is reduced.
This analysis delivers the following revised bounds (prefix “u” for “unobserved mediators”):
3.2 Special case
In particular, for the case , where we observe , (but the values of mediators are not observed), we have revised bounds
For this agrees with the analysis of Dawid, Murtas and Musio (2016).
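A small numerical sketch may help to show how knowledge of a decomposition can shrink the upper bound even when no mediator value is observed. The upper bound below is derived for this illustration from the product rule for the probability of general causation, together with the identity that the probability of specific causation equals half the sum of the probability of general causation and the average causal effect; it should be read as an assumption of the sketch rather than a transcription of the displayed formulae. The function names and the numerical chain are hypothetical.

```python
from fractions import Fraction


def max_prob_affects(p_do1, p_do0):
    """Largest P(next(1) != next(0)) compatible with one step's two margins."""
    p1, p0 = Fraction(p_do1), Fraction(p_do0)
    return min(p1, 1 - p0) + min(1 - p1, p0)


def unobserved_bounds(steps):
    """Bounds on PC for a treated case with a positive outcome, mediators unobserved.

    steps: list of (P(next=1 | current set to 1), P(next=1 | current set to 0)).
    Assumes a non-negative overall average causal effect.
    """
    p1, p0 = Fraction(1), Fraction(0)  # treatment set to 1 and to 0
    slack_max = Fraction(1)
    for p_do1, p_do0 in steps:
        a, b = Fraction(p_do1), Fraction(p_do0)
        # Propagate P(current=1) one step along the chain under each regime.
        p1, p0 = p1 * a + (1 - p1) * b, p0 * a + (1 - p0) * b
        slack_max *= max_prob_affects(a, b)
    effect = p1 - p0
    if effect < 0:
        raise ValueError("this sketch assumes a non-negative overall effect")
    lower = effect / p1
    upper = (slack_max + effect) / (2 * p1)
    return lower, upper


# Overall relation from the medicine example, with no mediator at all ...
print(unobserved_bounds([(Fraction(2, 3), Fraction(1, 3))]))
# ... and with a hypothetical two-step decomposition of the same relation.
print(unobserved_bounds([(Fraction(1, 2), Fraction(0)),
                         (Fraction(1), Fraction(1, 3))]))
```

With no mediator the bounds are 1/2 and 1; under the hypothetical decomposition both bounds equal 1/2, so the lower bound is unchanged while the upper bound falls.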
3.3 Bounds when some or all mediators are observed
Now suppose that, in addition to , , we also observe data on mediators () for the new case. In particular we observe , for . For notational simplicity we write for , for . We also identify and (so , ).
The relevant probability of causation is now
Note that in contrast to the difference between (35)–(38) on the one hand and (11)–(14) on the other hand, which relate to the same quantity but express different conclusions about it, is a genuinely different quantity from , since it conditions on different information about the new case.
Given observations on , the probability that caused is given by the product of the probabilities that each observed term in the sequence caused the next observed term:
Proof. From Theorem 2 we have
whence, using the “no-confounding” independence properties,
Now since we have the decomposition information about the mediators (if any) occurring between and , but not their values for the new case, the bounds on any factor in (40) will, mutatis mutandis, have the form of the relevant expressions for and , as displayed in (35)–(38). Then the overall lower [resp., upper] bound on will be the product of these lower [resp., upper] bounds, across all terms. This procedure supplies a complete recipe for determining the appropriate bounds on in the knowledge of the full decomposition of and the values of the observed mediators for the new case.
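The recipe can be sketched in Python for the simplest situation, in which every mediator is observed, so that each factor is bounded by the simple one-step bounds. The function names are illustrative, and the numbers are hypothetical values in the spirit of the prescription example from the introduction (the prescription is necessary for taking the medicine, and taking it is sufficient for good health).

```python
from fractions import Fraction


def step_bounds(p_do1, p_do0, x, y):
    """Simple bounds on the probability that value x caused value y in one step.

    Assumes the observed transition from x to y has positive probability.
    """
    p_do = {1: Fraction(p_do1), 0: Fraction(p_do0)}
    p_same = p_do[x] if y == 1 else 1 - p_do[x]
    p_other = p_do[1 - x] if y == 1 else 1 - p_do[1 - x]
    lower = max(Fraction(0), p_same - p_other) / p_same
    upper = min(p_same, 1 - p_other) / p_same
    return lower, upper


def chain_bounds(steps, values):
    """Multiply one-step bounds along a chain in which every variable is observed.

    steps: list of (P(next=1 | current set to 1), P(next=1 | current set to 0)).
    values: observed values of all variables, from treatment to outcome.
    """
    lower, upper = Fraction(1), Fraction(1)
    for (p_do1, p_do0), x, y in zip(steps, values, values[1:]):
        step_lower, step_upper = step_bounds(p_do1, p_do0, x, y)
        lower *= step_lower
        upper *= step_upper
    return lower, upper


# Hypothetical chain: overall, good health rises from 1/3 to 2/3.
steps = [(Fraction(1, 2), Fraction(0)), (Fraction(1), Fraction(1, 3))]

print(chain_bounds(steps, [1, 1, 1]))  # lower and upper bounds both 2/3
print(chain_bounds(steps, [1, 0, 1]))  # lower and upper bounds both 0
```

Under these assumed numbers, positive evidence on the mediator identifies the probability of causation at 2/3, while negative evidence identifies it at 0.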
3.4 Special cases
Again consider the case , . On account of (28) we can, after possibly switching the labels and for some of the ’s, take , all . We assume henceforth that this is the case. The above procedure then delivers lower bound unless , all , so that , all . In that case we obtain lower bound (with prefix “o” for “observed mediators”):
It is easy to see that this lower bound can only increase if we introduce further observed mediators. It follows that the smallest lower bound occurs when there are no observed mediators, when it reduces to as in (39) and (26); while the largest lower bound occurs when all mediators are observed (all taking value 1), that is to say, when there is positive evidence for every link in the mediation chain.
In the remainder of this paper we shall give special attention to this case, and write simply for , etc. The bounds for are then:
The following result follows directly from the above considerations:
It is not, however, always the case that : see (45) below.
Equation (40) provides a general formula for calculating bounds on the probability of causation for any pattern of data observed on mediating variables (including no data).
We now derive implications from this analysis.
4.1 Largest and smallest upper and lower bounds
Consider an arbitrary decomposition of :
with , . We restrict attention to the case and assume that variables are labeled so that each .
We investigate the smallest and largest achievable values for , mUB (prefix “m” for “mixed evidence”) and show that in each case these are achievable by decompositions involving at most one mediator.
Let the (known, fixed) transition matrix from to be , with and . The largest and smallest upper and lower bounds from any complete mediation process for the case with mediators unobserved, for the case with positive outcomes on all mediators observed, and for mixed cases, that include some negative evidence on the mediators, are as given in Table 2.
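Table 2: Largest and smallest upper and lower bounds on the probability of causation, by evidence on the mediators (columns: no evidence, positive evidence, mixed evidence).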
These can all be achieved by decompositions of length 1 or 2.
Proof. See Appendix A.
The largest upper bound with mediators unobserved, , can be achieved without any mediators. Since unobserved mediators do not alter the lower bound we have . In addition we have , which is achievable, for example, from the following decomposition:
Note that with this decomposition PC is identified via two degenerate transition matrices: is a sufficient condition for , while is a necessary condition for .
The smallest upper and lower bounds available when mediators are observed agree with the simple lower bound. Positive evidence cannot reduce the lower bound, but it can reduce the upper bound to the lower bound, at which point is identified. This can be achieved by the same decomposition given in (44).
The largest upper bound with positive evidence on mediators, , can exceed the simple upper bound when . It is achieved by the following two-term decomposition, involving a single mediator:
The lower bound can be raised with positive information on mediators, and takes its largest value with the following degenerate two-term decomposition , involving a single mediator:
With this decomposition is identified via two degenerate transition matrices: in this case is a necessary condition for , while is a sufficient condition for . The largest lower bound with positive evidence from this decomposition is which can fall far short of 1, implying that in general mediators cannot provide “smoking gun” evidence that caused .
For the case with mixed evidence on the mediators the lower bound is always 0. The smallest upper bound is also 0, which can be achieved by the decomposition (46) above, with the single mediator observed at 0 (the key feature of this decomposition is that cannot be caused by ). In this case is identified at 0, showing that it is possible for negative data on mediators to provide “hoop” evidence that did not cause . The highest upper bound, , can be achieved by a two-step decomposition , with the mediator taking value 0. For this occurs with the decomposition with parameters
For it occurs with decomposition parameterized by
4.2 Homogeneous transitions
Throughout this section we confine attention to the special case , . We specialize further to the case of a constant one-step transition matrix, for all . We define , , in terms of and in parallel to (3), (15) and (16).
In particular, we note that the relative sufficiency of for is preserved at each intermediate step: . It follows that .
Note that, for large , must be close to 1 and close to 0, with the same sign as .
In particular, for the degenerate cases , so that , we see that, for all , PC and are both identified, at when , and at 1 when ; the existence of the mediators is irrelevant in these cases.
Here we assume the process is non-degenerate.
For the case with some negative evidence the lower bound, say, is always 0, as noted in Section 3.4. The upper bound, however, depends on the particular pattern of positive and negative evidence. For any sequence of observations on consecutive mediators (allowing and , both required to take value 1), denote the associated upper bound by . Let denote a full sequence of observations (i.e., on all mediators). We search for a full sequence yielding the maximum value, say, of .
For large enough , we have
The optimal sequence alternates , except, if is odd, for the final 2 symbols.
Proof. See Appendix B.
For the smallest possible upper bound is for all . Otherwise, as . Then with alternating evidence on many mediators the associated probability of causation, say, is effectively identified as .
Figure 1 plots the intervals , and for a range of cases. It highlights how modest are the gains from repeated observation of homogeneous mediators and how alternating evidence can tighten bounds as long as .
4.2.1 Unboundedly many mediators
We now consider the behaviour of the bounds when we have a potentially unlimited sequence of variables directly mediating between and —still assuming identical one-step transition matrices. Our results are given in Theorem 7.
Proof. See Appendix C.
In particular, for we have