The goal of causal inference is to understand mechanistic relationships between random variables. Beyond simply observing that smokers have a higher rate of lung cancer than non-smokers, for instance, causal inference aims to determine whether lung cancer is a downstream effect of the act of smoking. As random variables are probabilistic objects, probability theory is intrinsic to causality.
Despite the centrality of probability in causal inference, the precise relationship between the two has historically been contested. For instance, it has long been emphasized that probabilistic relationships often have no causal interpretation, as any discussion of causality is quick to remark that “correlation is not causation.” The earliest recorded distinctions between dependence and causation predate the introduction of the correlation coefficient itself. Fechner, who in 1851 differentiated between a “causal dependency” and a “functional relationship” in his work on mathematical psychology, is possibly the first to articulate this distinction.
In contrast, Karl Pearson, the eponym of the Pearson correlation, held that correlation subsumed causation. In his influential book The Grammar of Science, Pearson states:
It is this conception of correlation between two occurrences embracing all relationships from absolute independence to complete dependence, which is the wider category by which we have to replace the old idea of causation.
To Pearson, causation was simply perfect co-occurrence: a correlation coefficient of exactly 1. Notions of causality beyond probabilistic correlation, Pearson argued, were outside the realm of scientific inquiry.
Pearson’s view of causality is far from the main formulations of causality today. Under our modern understanding of causality, one can easily construct examples in which $X$ and $Y$ have correlation 1, yet neither is $X$ causal for $Y$ nor is $Y$ causal for $X$. Both $X$ and $Y$ may be the result of some common confounding cause, for instance. Likewise, one can construct examples of systems in which the observed correlation between $X$ and $Y$ is 0, yet $X$ is causal for $Y$. $X$ may be confounded with a third variable $Z$, which masks the effect of $X$ on $Y$ in the population. Causality and correlation are now viewed as conceptually distinct phenomena.
The earliest attempts to define causality in a manner that resembles our current conception avoided probabilistic language altogether. A representative example of an early definition of causality, typically credited to Marshall though likely of earlier origins, is paraphrased as follows.
Definition 1 (Early notion of causality (Ceteris Paribus)).
$X$ is said to be causal for $Y$ if directly manipulating the value of $X$, keeping everything else unchanged, changes the value of $Y$.
While Definition 1 is intuitively appealing—providing a practical description of causality for controlled laboratory settings—it clearly lacks mathematical rigor. In particular, it is unclear how to translate the idea of a “direct manipulation” into probabilistic language.
Viewed within the measure theoretic framework of probability, Definition 1 is particularly problematic. A pair of random variables $X$ and $Y$ defined on the same probability space are determined by a common source of randomness: the selection of a random outcome $\omega \in \Omega$. Thus, it is not at all clear why “directly” manipulating the value of $X$ would have an impact on $Y$. Classical probability allows random variables to convey information about each other, but only through the symmetric notion of probabilistic dependence. In contrast, causal inference hopes to distinguish directionality; the statement “smoking causes lung cancer” is distinct from the statement “lung cancer causes smoking.” Where causal inference seeks to draw arrows between random variables ($X \rightarrow Y$), classical probability treats $X$ and $Y$ symmetrically in that both are functions of a single random outcome, $X(\omega)$ and $Y(\omega)$.
The central aim of this work is to clearly explain how causal models can be constructed within the measure theoretic framework of classical probability theory. We take as our starting point the Neyman-Rubin model (NRM) of potential outcomes [8, 9, 10], and describe the structure of the probability space on which these potential outcomes are defined. From this perspective, we will see that a precise definition of causality can be couched in the standard probabilistic language of measure theory. Rather than defining causality in terms of “direct manipulations” of $X$, we will define $X$ as causal for $Y$ if the potential outcomes are unequal on subsets of nonzero measure. We emphasize throughout this work that causal models are probabilistic models with structured constraints between observed and unobserved (i.e., potential outcome) random variables.
We should be clear that we do not claim to unify probability theory with causality. The notion ceteris paribus from Definition 1 was formalized in probabilistic language as early as 1944 by Haavelmo [11, 2]. Today, probability is the common language of all modern causal inference frameworks. Within the Directed Acyclic Graphs (DAG) framework, causal relationships are discovered by searching for sets of random variables satisfying certain conditional independence relationships [12, 13, 14]. Within the potential outcomes framework of causality [8, 9, 10], the primary goal is to estimate causal effects, defined in terms of expectations of partially observable random variables (e.g., the ACE). In each framework, causal relationships map onto probabilistic relationships, which are in turn diagnosed by statistical tests. The contribution of the present work is not to unify causality with probability, but rather to explicate fundamental concepts of modern causal inference in the language of measure theory.
Clarifying the interface between causality and measure theory is useful for several reasons. First, measure theory provides a simplifying perspective for understanding the basic framework of causality. Classical probability theory, we will find, is completely sufficient to describe causal models. Second, the measure theoretic perspective is an insightful one. For instance, we find that consideration of the underlying probability spaces provides insight into experimental procedures (such as randomization) and non-experimental procedures (such as matching). Additionally, a simple method of visualizing causal models on probability spaces, which we employ throughout this work, enables one to generate and reason about a rich set of instructive examples. Third, by making explicit the relationship between causality and measure theory, we hope to initiate interest in applying the tools from measure theory to further develop causal inference.
The remainder of this work is organized as follows. In Section 2 we provide a brief overview of the measure theoretic framework of probability theory. We also introduce examples, notation, and a method for visualizing causal models that will frequently be used in later sections. In Section 3 we closely examine the simplest causal system: two binary random variables. Here we review the potential outcomes framework within the language of probability spaces, emphasizing that potential outcomes are simply random variables in the familiar sense of classical probability theory. We also introduce a formal definition of causality and a formal model for experimental randomization in this simple system. In Section 4, we consider a system of three binary random variables, an incrementally more complex system that introduces several new conceptual challenges. First, we see how two random variables may be jointly causal for a third random variable, despite neither being individually causal. We also re-examine the concept of matching–a popular method of causal inference in the observational setting–from the measure theoretic perspective. Finally, in Section 5 we expand the ideas developed in Sections 3 and 4 to more general causal models.
2 Background and notation: probability spaces and visual representation
In the present section, we provide a brief review of the measure theoretic framework of classical probability theory, both to establish notation and to introduce a method for visualizing probabilistic systems that we will use throughout this work. For a more detailed review of classical probability theory, please refer to Appendix A.
The central construct within the measure theoretic framework of probability theory is the probability space. Denoted by the triple $(\Omega, \mathcal{F}, P)$, the probability space consists of a sample space ($\Omega$), a $\sigma$-algebra ($\mathcal{F}$), and a probability measure ($P$). A random variable $X$ is an $\mathcal{F}$-measurable function, mapping elements $\omega \in \Omega$ (called random outcomes) to $\mathbb{R}$. Somewhat counter-intuitively, random variables are deterministic functions of $\omega$. Perfect knowledge of $\omega$ implies perfect knowledge of a random variable; uncertainty in $\omega$ results in uncertainty in a random variable. A random variable $X$ and a probability measure $P$ together define the probability law $P_X$, which maps elements $B$ of the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ to $[0, 1]$ as follows:
$$P_X(B) = P\left(X^{-1}(B)\right) = P\left(\{\omega \in \Omega : X(\omega) \in B\}\right).$$
The goal of classical statistical inference is to understand the probability law $P_X$ from observed realizations of the random variable $X$.
Throughout this work, we will find it convenient to visually represent random variables on a simple probability space, which we call the square space. The square space is defined by the triple $(\Omega, \mathcal{F}, P)$, where the sample space $\Omega = [0,1]^2$ is the unit square in $\mathbb{R}^2$, $\mathcal{F}$ is the Borel $\sigma$-algebra on $\Omega$, and $P$ is the two-dimensional Lebesgue measure (equivalent to the common notion of “area”). We will find the square space particularly useful because it is both amenable to visualization and flexible enough to accommodate many probabilistic systems. In Figure 1, we represent a binary random variable $X$ on the square space. In this, and in all following examples, shaded regions of the sample space correspond to the pre-image of 1 for the corresponding binary random variable. Therefore, all points in the upper half of $\Omega$ map to 1 and all points in the lower half of $\Omega$ map to 0. Since the underlying probability measure is the Lebesgue measure, the probability law for $X$ is that of a fair coin: $P(X = 1) = P(X = 0) = 1/2$.
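Multiple random variables can be defined on a single probability space with multivariate probability laws defined in the natural way. If $X$ and $Y$ are two random variables defined on $(\Omega, \mathcal{F}, P)$, then the multivariate random variable $(X, Y)$ is defined as the following map between $\Omega$ and $\mathbb{R}^2$:
$$(X, Y)(\omega) = \left(X(\omega), Y(\omega)\right).$$
The joint probability law $P_{X,Y}$ is defined as a map between $\mathcal{B}(\mathbb{R}^2)$, the Borel $\sigma$-algebra on $\mathbb{R}^2$, and $[0, 1]$:
$$P_{X,Y}(B) = P\left(\{\omega \in \Omega : (X(\omega), Y(\omega)) \in B\}\right).$$
If for any Borel rectangle $B_X \times B_Y$, we have the relationship
$$P_{X,Y}(B_X \times B_Y) = P_X(B_X)\, P_Y(B_Y),$$
then $X$ and $Y$ are called independent. Otherwise, $X$ and $Y$ are dependent.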
In Figure 2, we represent two binary random variables $X$ and $Y$ simultaneously on the square space. $X$ is defined as in Figure 1, while $Y$ maps from the upper right triangle to 1. The region where both $X$ and $Y$ map to 1 is shaded darker; in this region, $X(\omega) = Y(\omega) = 1$. In this example, $X$ and $Y$ are dependent. This can be seen qualitatively from Figure 2 by noting that the distribution of $Y$ differs on the subsets $X^{-1}(0)$ and $X^{-1}(1)$.
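The dependence described above can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original development. It assumes that "upper half" means the second coordinate exceeds 1/2 and that "upper right triangle" means the region above the anti-diagonal; the helper names `sample_square` and `conditional_mean` are hypothetical.

```python
import random

def X(w):
    # Assumed layout of Figure 1: the upper half of the unit square maps to 1.
    return 1 if w[1] > 0.5 else 0

def Y(w):
    # Assumed layout of Figure 2: the triangle above the anti-diagonal maps to 1.
    return 1 if w[0] + w[1] > 1.0 else 0

def sample_square(n, seed=0):
    # Draw n random outcomes uniformly from the unit square (Lebesgue measure).
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

def conditional_mean(f, g, value, outcomes):
    # Monte Carlo estimate of E[f | g = value].
    selected = [f(w) for w in outcomes if g(w) == value]
    return sum(selected) / len(selected)

outcomes = sample_square(200_000)
p_y_given_x1 = conditional_mean(Y, X, 1, outcomes)
p_y_given_x0 = conditional_mean(Y, X, 0, outcomes)
print(f"P(Y=1 | X=1) ~ {p_y_given_x1:.3f}")
print(f"P(Y=1 | X=0) ~ {p_y_given_x0:.3f}")
```

Under these layout assumptions the exact conditional probabilities are 3/4 and 1/4, so the two estimates should differ clearly, which is the dependence visible in the figure.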
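Two probability spaces $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ can be used to construct a third probability space, called the product space:
$$(\Omega_1 \times \Omega_2,\ \mathcal{F}_1 \otimes \mathcal{F}_2,\ P_1 \times P_2).$$
A feature of the product space construction, which we will make use of in our discussion of experimental randomization, is that it induces independence between random variables. In particular, when $X$ is defined on $(\Omega_1, \mathcal{F}_1, P_1)$ and $Y$ is defined on $(\Omega_2, \mathcal{F}_2, P_2)$, $X$ and $Y$ are independent random variables when defined jointly on the product space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2, P_1 \times P_2)$. Figure 3 displays a product space construction. In this example,
$$(\Omega_1, \mathcal{F}_1, P_1) = (\Omega_2, \mathcal{F}_2, P_2) = \left([0,1],\ \mathcal{B}([0,1]),\ \lambda\right),$$
where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ is the one-dimensional Lebesgue measure. The product space is therefore the square space, $([0,1]^2, \mathcal{B}([0,1]^2), \lambda \times \lambda)$, and $X$ and $Y$ are independent random variables by construction.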
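As a hedged illustration of how the product construction induces independence, the sketch below samples outcomes from the product of two unit intervals and compares the joint probability with the product of the marginals. The thresholds defining `X` and `Y` are hypothetical and are not taken from Figure 3.

```python
import random

def X(w):
    # X depends only on the first coordinate (an outcome of the first factor space).
    return 1 if w[0] > 0.5 else 0

def Y(w):
    # Y depends only on the second coordinate; the 0.3 threshold is hypothetical.
    return 1 if w[1] > 0.3 else 0

rng = random.Random(1)
n = 200_000
# An outcome of the product space is a pair of independently drawn coordinates.
outcomes = [(rng.random(), rng.random()) for _ in range(n)]

p_x = sum(X(w) for w in outcomes) / n          # estimate of P(X = 1)
p_y = sum(Y(w) for w in outcomes) / n          # estimate of P(Y = 1)
p_xy = sum(X(w) * Y(w) for w in outcomes) / n  # estimate of P(X = 1, Y = 1)
print(f"P(X=1, Y=1) ~ {p_xy:.3f}, P(X=1) * P(Y=1) ~ {p_x * p_y:.3f}")
```

Because each random variable reads a different coordinate of the outcome, the two printed quantities should agree up to sampling error.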
Before discussing causal models in the following sections, it is important to note that the measure theoretic framework of probability just discussed initially seems at odds with causal intuitions. In particular, the causal notion of random variables affecting one another is unnatural under the measure theoretic model in which all random variables are functions of a single random outcome selected from the sample space. Later we will see that this contradiction is superficial. Causal models are a special class of probabilistic models, with structured relationships between observed and unobserved (i.e., potential outcome) random variables.
3 Causal inference on two variables
The minimal causal model, and by far the most studied, is that of a binary treatment and a binary response. For the sake of simplicity, this is where we begin. We frame our discussion around the quintessential causal inference question: Does smoking cause lung cancer?
3.1 Smoking and lung cancer
We model both smoking ($X$) and lung cancer ($Y$) as binary random variables on the square space as in Figure 2. In this example, the marginal probability of both smoking and lung cancer is $1/2$. A natural (but incorrect) approach one may take to quantify the effect of smoking on lung cancer is to estimate the Average Observed Effect (AOE):
$$\mathrm{AOE} = E[Y \mid X = 1] - E[Y \mid X = 0].$$
For this particular example, $\mathrm{AOE} = 1/2$.
As a population quantity, the AOE must be estimated. Given a dataset of $n$ i.i.d. realizations of the bivariate random variable $(X, Y)$, one can compute the quantity:
$$\widehat{\mathrm{AOE}} = \frac{1}{n_1} \sum_{i=1}^{n} x^{(i)} y^{(i)} - \frac{1}{n_0} \sum_{i=1}^{n} \left(1 - x^{(i)}\right) y^{(i)},$$
where $n_1 = \sum_{i=1}^{n} x^{(i)}$, $n_0 = n - n_1$, and $(x^{(i)}, y^{(i)})$ denotes the $i$th sample (superscripts are used rather than subscripts to avoid confusion with notation introduced later). The law of large numbers ensures that $\widehat{\mathrm{AOE}}$ converges to the true AOE as $n \to \infty$. Given enough samples, therefore, the AOE is estimable from the observed data.
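The plug-in estimate of the AOE is a difference of two group means. The sketch below is one possible implementation, with a hypothetical function name `estimate_aoe`; the simulated data again assume the square-space layout in which $X$ is the upper half and $Y$ is the triangle above the anti-diagonal.

```python
import random

def estimate_aoe(samples):
    """Plug-in estimate of E[Y | X=1] - E[Y | X=0] from (x, y) pairs."""
    y_treated = [y for x, y in samples if x == 1]
    y_control = [y for x, y in samples if x == 0]
    if not y_treated or not y_control:
        raise ValueError("need at least one sample with x=1 and one with x=0")
    return sum(y_treated) / len(y_treated) - sum(y_control) / len(y_control)

# Simulate i.i.d. realizations of (X, Y) under the assumed layout.
rng = random.Random(0)
samples = []
for _ in range(200_000):
    w1, w2 = rng.random(), rng.random()
    x = 1 if w2 > 0.5 else 0        # assumed: upper half of the square
    y = 1 if w1 + w2 > 1.0 else 0   # assumed: triangle above the anti-diagonal
    samples.append((x, y))

aoe_hat = estimate_aoe(samples)
print(f"estimated AOE ~ {aoe_hat:.3f}")
```

With this many samples the estimate should sit close to the population value of 1/2 for this layout.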
While the AOE is estimable from observable data, it does not generally correspond to any causal quantity. In particular, a nonzero AOE implies nothing about how the incidence of lung cancer would change under an intervention in which cigarettes are eliminated from society altogether. Importantly, the difference in the conditional means of $Y$ could be completely or partially explained by a third confounding variable $Z$.
3.2 Potential outcome random variables
The potential outcomes of the Neyman-Rubin model (NRM) [8, 15] provide a language for causality distinct from statistical relationships between observed random variables. Following convention, we notate potential outcomes with subscripts and describe them intuitively as follows:
$Y_0$: the value $Y$ would take had $X$ been 0.
$Y_1$: the value $Y$ would take had $X$ been 1.
If $Y_1(\omega) = 0$, then lung cancer would not be observed ($Y = 0$) in this particular individual if he had smoked, irrespective of whether or not he actually did smoke ($X(\omega) = 1$). Potential outcomes are often described in the language of “alternate universes.” If $X(\omega) = 1$, then $Y_1(\omega)$ is observed as $Y(\omega)$. On the other hand, $Y_0(\omega)$ is observed in the alternate universe which is identical to our universe in all respects except for the fact that $X(\omega) = 0$.
Though useful for intuition, this description of potential outcomes in terms of counterfactual realities is not stated in terms of probability spaces. In the present work, we emphasize that potential outcomes are familiar objects: random variables mapping $\Omega$ to $\mathbb{R}$ defined on the same probability space as the random variables $X$ and $Y$. Potential outcomes are defined here according to a relationship with observable random variables. In the current example, the potential outcomes are related to the observable random variable $Y$ by the following equation:
$$Y = \mathbb{1}_{\{X = 0\}}\, Y_0 + \mathbb{1}_{\{X = 1\}}\, Y_1, \tag{2}$$
where $\mathbb{1}_{\{X = x\}}$ is the indicator random variable for the event $\{X = x\}$. This relationship, further generalized in Sections 4 and 5 by the contraction operation, defines the essential structure of a causal model.
One important feature of Equation 2 is that $Y_0(\omega)$ and $Y_1(\omega)$ are never simultaneously observed for a single $\omega$; $Y_0(\omega)$ is observed only when $X(\omega) = 0$ while $Y_1(\omega)$ is observed only when $X(\omega) = 1$. This observation is typically referred to as the fundamental problem of causal inference. As a consequence of the fundamental problem of causal inference, there are generally many distinct sets of potential outcomes consistent with the observed random variables. For example, the three distinct sets of potential outcomes from Figure 4 are all consistent with the observable random variables in Figure 2. This is achieved since $Y_0 = Y$ on the pre-image $X^{-1}(0)$, while $Y_1 = Y$ on the pre-image $X^{-1}(1)$. However, on $X^{-1}(1)$, the potential outcomes $Y_0$ of the different sets may differ without altering the observed random variable $Y$. Likewise on $X^{-1}(0)$, the potential outcomes $Y_1$ of the different sets may differ without altering the observed random variable $Y$.
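To make the non-uniqueness concrete, the following sketch builds three hypothetical sets of potential outcomes (labelled `a`, `b`, `c` here; they are illustrative and not necessarily the sets drawn in Figure 4) and checks that each reproduces the same observed $Y$ through Equation 2. The square-space layout of `X` and `Y` is assumed as before.

```python
import random

def X(w):
    return 1 if w[1] > 0.5 else 0          # assumed: upper half of the square

def Y(w):
    return 1 if w[0] + w[1] > 1.0 else 0   # assumed: triangle above the anti-diagonal

# Each entry is a pair (Y0, Y1). All three agree with Y where they are observed
# (Y0 on {X=0}, Y1 on {X=1}) and are chosen freely elsewhere.
potential_outcome_sets = {
    "a": (lambda w: Y(w),
          lambda w: Y(w)),
    "b": (lambda w: Y(w) if X(w) == 0 else 1 - Y(w),
          lambda w: Y(w) if X(w) == 1 else 1 - Y(w)),
    "c": (lambda w: Y(w) if X(w) == 0 else 1,
          lambda w: Y(w) if X(w) == 1 else 0),
}

def observed(y0, y1, w):
    # Equation 2: Y = 1{X=0} * Y0 + 1{X=1} * Y1.
    return (1 - X(w)) * y0(w) + X(w) * y1(w)

rng = random.Random(0)
outcomes = [(rng.random(), rng.random()) for _ in range(10_000)]
consistent = {
    name: all(observed(y0, y1, w) == Y(w) for w in outcomes)
    for name, (y0, y1) in potential_outcome_sets.items()
}
print(consistent)
```

All three sets pass the consistency check even though they disagree with one another on the unobserved regions, which is the fundamental problem of causal inference in miniature.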
3.3 Causal effects
With potential outcomes, we can now define precise notions of causal effects by comparing the random variables $Y_0$ and $Y_1$. The definition below improves on the informal Definition 1 by providing an unambiguous way to assess whether a binary random variable $X$ is causal for another random variable $Y$.
Definition 2 (Formal definition of causality).
A binary random variable $X$ is causal for another random variable $Y$ (denoted $X \rightarrow Y$) if $Y_0 \neq Y_1$ on a subset of nonzero measure.
Referring to Figure 4, we see that under some of the displayed sets of potential outcomes we would conclude that $X$ is causal for $Y$, while under another we would conclude that $X$ is not causal for $Y$. Importantly, each set of potential outcomes is consistent with the observable random variables $X$ and $Y$. Irrespective of how large our sample is, we cannot conclude whether $X$ is causal for $Y$ from the observed data alone. Definition 2 makes clear that the fundamental problem of causal inference is in direct conflict with any attempt to determine causal relationships from observed data. We develop this relationship further in Section 5, where we generalize Definition 2 and the fundamental problem of causal inference beyond the simple treatment and response paradigm discussed in the present section.
As was noted previously, it is typically not the case that we have complete knowledge of the probability space. Rather, we observe realizations of random variables. Through these observations we then try to infer their probability laws. Thus, it is important to have a definition of causality that depends only on distributional information. Perhaps the most important such metric is the average causal effect (ACE):
$$\mathrm{ACE} = E[Y_1] - E[Y_0].$$
Referring again to Figure 4, we can compute the following:
For one set of potential outcomes in Figure 4, the ACE is zero, consistent with the observation that $X$ is not causal for $Y$. However, another set yields an ACE which is also zero, despite the fact that $X$ is causal for $Y$. Finally, for the remaining set the ACE is negative. This is opposite in sign to the observable AOE, which we found in Section 3.1 to be $1/2$.
This example suggests that a nonzero ACE implies that $X$ is causal for $Y$ (although the converse implication does not hold). For example, $\mathrm{ACE} = 0$ and $X$ is causal for $Y$ under one of the sets of potential outcomes in Figure 4. Corollary 1 below confirms this relationship for the case of binary $X$ and $Y$.
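The failure of the converse can be seen in a small numerical sketch. The potential outcomes below are hypothetical and chosen only for simplicity (they are not claimed to match any panel of Figure 4): $Y_1$ is the complement of $Y_0$, so the two differ everywhere, yet their expectations coincide.

```python
import random

def Y0(w):
    # Hypothetical potential outcome: 1 on the right half of the square.
    return 1 if w[0] > 0.5 else 0

def Y1(w):
    # Hypothetical potential outcome: the complement of Y0.
    return 1 - Y0(w)

rng = random.Random(0)
n = 200_000
outcomes = [(rng.random(), rng.random()) for _ in range(n)]

# Monte Carlo estimate of ACE = E[Y1] - E[Y0].
ace_hat = sum(Y1(w) - Y0(w) for w in outcomes) / n
# Fraction of sampled outcomes on which the potential outcomes differ.
frac_differ = sum(Y1(w) != Y0(w) for w in outcomes) / n
print(f"ACE ~ {ace_hat:.3f}, fraction with Y0 != Y1 = {frac_differ:.3f}")
```

Here the potential outcomes differ on every sampled outcome, so $X$ is causal for $Y$ in the sense of Definition 2, while the estimated ACE is close to zero.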
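Corollary 1.
For binary $X$ and $Y$, if $\mathrm{ACE} \neq 0$ then $X$ is causal for $Y$.
Proof. We first note that
$$\mathrm{ACE} = E[Y_1] - E[Y_0] = P(Y_1 = 1) - P(Y_0 = 1),$$
since $Y$ is assumed to be binary. Decomposing $P(Y_1 = 1) = P(Y_1 = 1, Y_0 = 0) + P(Y_1 = 1, Y_0 = 1)$ and $P(Y_0 = 1) = P(Y_1 = 0, Y_0 = 1) + P(Y_1 = 1, Y_0 = 1)$,
$$\mathrm{ACE} = P(Y_1 = 1, Y_0 = 0) - P(Y_1 = 0, Y_0 = 1) \neq 0.$$
Therefore at least one of the events $\{Y_1 = 1, Y_0 = 0\}$ or $\{Y_1 = 0, Y_0 = 1\}$ must have nonzero measure. Therefore, by Definition 2, $X \rightarrow Y$. ∎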
As a brief side note, it is at least conceptually clear how one could generalize Definition 2 to handle non-binary $X$. In particular, if $\mathcal{X}$ denotes the image of $X$, then $X$ is causal for $Y$ if the set of potential outcomes $\{Y_x\}_{x \in \mathcal{X}}$ differ on a set of nonzero measure. However, when $\mathcal{X}$ contains infinitely many elements, it may be the case that the potential outcomes differ on a subset of $\Omega$ of nonzero measure, but only for a subset of $\mathcal{X}$ of zero measure. For instance, suppose $X$ is continuous and all of the potential outcomes are identical except for a single potential outcome, which differs from all other potential outcomes on all of $\Omega$. For simplicity, we avoid such subtleties, considering exclusively finite discrete random variables in the present work, where the generalization of Definition 2 is obvious.
We saw in the previous section a set of observable random variables consistent with many sets of potential outcome random variables, each implying different causal relationships. We also saw that determining whether $X$ is causal for $Y$ according to Definition 2 is generally impossible since $Y_0(\omega)$ and $Y_1(\omega)$ are never simultaneously observable for any single $\omega$. Similarly, computing the ACE is generally impossible since it requires evaluating expectations of random variables $Y_0$ and $Y_1$, which we only observe on incomplete and disjoint subsets of the sample space $\Omega$.
However, when $X$ is independent of the potential outcomes (which we will denote as $X \perp\!\!\!\perp (Y_0, Y_1)$), estimation of the average causal effect is possible. When this is the case, the following simple argument shows that $\mathrm{AOE} = \mathrm{ACE}$:
$$\begin{aligned}
\mathrm{AOE} &= E[Y \mid X = 1] - E[Y \mid X = 0] \\
&= E\left[\mathbb{1}_{\{X=0\}} Y_0 + \mathbb{1}_{\{X=1\}} Y_1 \mid X = 1\right] - E\left[\mathbb{1}_{\{X=0\}} Y_0 + \mathbb{1}_{\{X=1\}} Y_1 \mid X = 0\right] \\
&= E[Y_1 \mid X = 1] - E[Y_0 \mid X = 0] \\
&= E[Y_1] - E[Y_0] \\
&= \mathrm{ACE}.
\end{aligned}$$
The second line follows from the first line by applying Equation 2, which defines our causal model. The fourth line follows from the third line by our assumption that $X$ is independent of the potential outcome random variables.
In a properly randomized experiment, it is often assumed that $X \perp\!\!\!\perp (Y_0, Y_1)$. In the present section, we describe a measure theoretic model of the process of randomization, which takes advantage of the product measure construction described in Section A.4.
Definition 3 (Experimental randomization of ).
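Suppose $X$ and $Y$ are defined on a probability space $(\Omega, \mathcal{F}, P)$. An experimental randomization of $X$ produces a new probability space $(\Omega \times \tilde{\Omega}, \mathcal{F} \otimes \tilde{\mathcal{F}}, P \times \tilde{P})$ and new random variables $\tilde{X}$ and $\tilde{Y}$ defined as follows:
$$\tilde{X}(\omega, \tilde{\omega}) = \tilde{X}(\tilde{\omega}), \qquad \tilde{Y}(\omega, \tilde{\omega}) = \mathbb{1}_{\{\tilde{X}(\tilde{\omega}) = 0\}}\, Y_0(\omega) + \mathbb{1}_{\{\tilde{X}(\tilde{\omega}) = 1\}}\, Y_1(\omega),$$
where $\tilde{X}$ is defined arbitrarily on a new probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$ such that $\tilde{P}(\tilde{X} = x) > 0$ for all $x$.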
In an experimental randomization of $X$, the scientist replaces the “naturally occurring” $X$ with an “artificially generated” $\tilde{X}$, derived from an external source of randomization. An ideal (although unethical) randomized experiment to determine whether smoking causes lung cancer would allow the scientist to force individuals to smoke or not to smoke based on the outcome of a coin toss. Under experimental randomization, the choice to smoke is tied to an external source of randomness, and hence occurs altogether on a separate probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$. The definition of $\tilde{Y}$ ensures that $\tilde{Y}$ responds to the randomized version ($\tilde{X}$) in the same way that $Y$ responded to the nonrandomized version ($X$). Defining the new observable random variables $\tilde{X}$ and $\tilde{Y}$ on the product space ensures that $\tilde{X}$ is independent of the potential outcome random variables as desired.
As an example, suppose we experimentally randomize $X$ in the example from Figure 2, where the underlying potential outcomes are as in Figure 4(b). Suppose $\tilde{X}$ is defined on the probability space $([0,1], \mathcal{B}([0,1]), \lambda)$ such that $\tilde{P}(\tilde{X} = 1) = 1/2$, as in the toss of an unbiased coin. Then the random variables $\tilde{X}$ and $\tilde{Y}$ live on the space $([0,1]^3, \mathcal{B}([0,1]^3), \lambda_3)$, where $\mathcal{B}([0,1]^3)$ represents the Borel $\sigma$-algebra on $[0,1]^3$ and $\lambda_3$ represents the three-dimensional Lebesgue measure (equivalent to the common notion of volume).
Figure 5 visualizes the randomized system $(\tilde{X}, \tilde{Y})$. We can compute the AOE on the randomized system as follows:
$$\mathrm{AOE} = E[\tilde{Y} \mid \tilde{X} = 1] - E[\tilde{Y} \mid \tilde{X} = 0] = E[Y_1] - E[Y_0] = \mathrm{ACE},$$
as expected. It may be instructive to verify that $\mathrm{AOE} = \mathrm{ACE}$ upon experimental randomization of $X$ for the three other sets of consistent potential outcomes in Figure 4, but geometric intuition should make it clear that this will always work. For the region $\{\tilde{X} = 1\}$, we observe $\tilde{Y} = Y_1$ over the entire cross section $\Omega$, so $E[\tilde{Y} \mid \tilde{X} = 1] = E[Y_1]$. The same reasoning makes it clear that $E[\tilde{Y} \mid \tilde{X} = 0] = E[Y_0]$. These two observations imply $\mathrm{AOE} = \mathrm{ACE}$ when $X$ is experimentally randomized.
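The effect of randomization can be simulated directly. The sketch below is illustrative only: the potential outcomes `Y0` and `Y1` are hypothetical (consistent with the assumed square-space layout of `X` and `Y`, but not claimed to match Figure 4(b)), and the external coin is modelled as an extra uniform coordinate drawn independently of the original outcome.

```python
import random

def X(w):
    return 1 if w[1] > 0.5 else 0          # assumed: upper half of the square

def Y(w):
    return 1 if w[0] + w[1] > 1.0 else 0   # assumed: triangle above the anti-diagonal

def Y1(w):
    # Hypothetical: agrees with Y where X = 1, equals 0 elsewhere.
    return Y(w) if X(w) == 1 else 0

def Y0(w):
    # Hypothetical: agrees with Y where X = 0, equals 1 elsewhere.
    return Y(w) if X(w) == 0 else 1

def diff_in_means(pairs):
    treated = [y for x, y in pairs if x == 1]
    control = [y for x, y in pairs if x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(0)
n = 200_000
observational, randomized, effects = [], [], []
for _ in range(n):
    w = (rng.random(), rng.random())        # outcome of the original space
    coin = rng.random()                     # outcome of the external space
    x_tilde = 1 if coin > 0.5 else 0        # randomized treatment
    y_tilde = Y1(w) if x_tilde == 1 else Y0(w)
    observational.append((X(w), Y(w)))
    randomized.append((x_tilde, y_tilde))
    effects.append(Y1(w) - Y0(w))

aoe_observational = diff_in_means(observational)
aoe_randomized = diff_in_means(randomized)
ace = sum(effects) / n
print(f"observational AOE ~ {aoe_observational:.3f}")
print(f"randomized AOE    ~ {aoe_randomized:.3f}")
print(f"ACE               ~ {ace:.3f}")
```

For these hypothetical potential outcomes the exact ACE is $3/8 - 5/8 = -1/4$, while the observational AOE is $1/2$; the randomized AOE should track the ACE rather than the observational AOE.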
Theorem 1 describes an even more important consequence of experimental randomization. If is experimentally randomized, the probability law of potential outcomes can be deduced from observed conditional probability laws.
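Theorem 1.
Under experimental randomization of $X$, for any Borel set $B$ and any $x$,
$$(P \times \tilde{P})(\tilde{Y} \in B \mid \tilde{X} = x) = P(Y_x \in B).$$
The proof follows from simply writing out the conditional probability explicitly:
$$(P \times \tilde{P})(\tilde{Y} \in B \mid \tilde{X} = x) = \frac{(P \times \tilde{P})(Y_x \in B,\ \tilde{X} = x)}{\tilde{P}(\tilde{X} = x)} = \frac{P(Y_x \in B)\, \tilde{P}(\tilde{X} = x)}{\tilde{P}(\tilde{X} = x)} = P(Y_x \in B).$$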
The discussion of the present section makes clear why randomization is such a powerful technique. In a properly randomized system, true causal quantities such as the ACE can be computed from observed data. However, it is important to recognize the shortcomings of experimental randomization. First, experimental randomization is still inadequate for the purposes of uncovering causality in situations like Figure 4d; although $X$ is causal for $Y$ according to Definition 2, the probability laws $P_{Y_0}$ and $P_{Y_1}$ are identical. Second, the conditions of Definition 3 are very strict. Beyond just ensuring that $\tilde{X} \perp\!\!\!\perp (Y_0, Y_1)$, experimental randomization requires that $\tilde{X}$ can behave as a substitute for $X$ in Equation 2. For instance, if being involved in a randomized trial induces behavior that has some effect on lung cancer (i.e., cognizance of enrollment in a lung cancer trial may cause participants to pursue a healthier lifestyle), we cannot expect the causal effects computed from the randomized trial to reflect the causal effect of smoking “in the wild.”
4 Causal inference on three variables
Several new concepts in causality arise in systems of three observable variables. As such, in this section we study the simplest three-variable system: three binary random variables. We add to our running example of smoking ($X$) and lung cancer ($Y$) a third binary random variable $Z$ representing exercise habits. $Z = 0$ indicates a low level of exercise while $Z = 1$ indicates a high level of exercise. One could imagine exercise habits influencing both lung cancer outcomes and smoking choices.
4.1 A comment on notation
In previous sections we only needed a single subscript to specify potential outcomes. For instance, $Y_1$ implicitly referred to the potential outcome “$Y$ had $X$ been 1.” The potential outcome $Y_1$ from previous sections will now be denoted $Y_{X=1}$ in order to distinguish it from the potential outcome $Y_{Z=1}$. Further, the potential outcome “$Y$ had $X$ been 0 and $Z$ been 1” will be denoted $Y_{X=0,Z=1}$.
Equation 2 specifies the relationship between potential outcome random variables $Y_{X=x}$ and observable random variables $X$ and $Y$. In the case of three random variables, we might naturally generalize Equation 2 as follows:
$$Y = \sum_{x} \sum_{z} \mathbb{1}_{\{X = x\}}\, \mathbb{1}_{\{Z = z\}}\, Y_{X=x,Z=z}.$$
Since Equation 2 must still hold, we have the following equality:
$$\sum_{x} \mathbb{1}_{\{X = x\}}\, Y_{X=x} = \sum_{x} \sum_{z} \mathbb{1}_{\{X = x\}}\, \mathbb{1}_{\{Z = z\}}\, Y_{X=x,Z=z}. \tag{7}$$
Together with the observation that $\mathbb{1}_{\{X = 0\}} \mathbb{1}_{\{X = 1\}} = 0$, Equation 7 implies the following relationship between the double-subscripted potential outcomes $Y_{X=x,Z=z}$ and the single-subscripted potential outcomes $Y_{X=x}$:
$$Y_{X=x} = \sum_{z} \mathbb{1}_{\{Z = z\}}\, Y_{X=x,Z=z}.$$
Similar reasoning suggests the following relationship for the potential outcomes $Y_{Z=z}$:
$$Y_{Z=z} = \sum_{x} \mathbb{1}_{\{X = x\}}\, Y_{X=x,Z=z}.$$
In this manner, any single-subscripted potential outcome may be derived from double-subscripted potential outcomes and observable random variables: by summing over the subscript to be removed and multiplying by the corresponding indicator random variables. We will refer to this operation as contraction, due to its similarity to tensorial contraction. For instance, the set of potential outcomes $\{Y_{X=x}\}$ are obtained from the potential outcomes $\{Y_{X=x,Z=z}\}$ by “contraction over $Z$.” The observable random variable $Y$ can be obtained by “contracting $\{Y_{X=x,Z=z}\}$ over $X$ and $Z$” or equivalently “contracting $\{Y_{X=x}\}$ over $X$.” Thus, the simple relationship in Equation 2 represents a contraction. We will formalize and generalize the notion of contraction in Section 5.
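The contraction operation can be sketched in a few lines. Everything below is illustrative: the layout of `X` and `Z` on the square and the double-subscripted outcomes produced by `make_outcome` are hypothetical, and the function names are not taken from the source.

```python
def X(w):
    return 1 if w[0] > 0.5 else 0   # hypothetical layout: right half of the square

def Z(w):
    return 1 if w[1] > 0.5 else 0   # hypothetical layout: upper half of the square

def make_outcome(x, z):
    # Hypothetical double-subscripted outcome: the subscripts shift a threshold.
    return lambda w: 1 if w[0] + 0.25 * x + 0.25 * z > 0.75 else 0

Y_XZ = {(x, z): make_outcome(x, z) for x in (0, 1) for z in (0, 1)}

def contract_over_z(y_xz, x):
    # Y_{X=x}(w) = sum_z 1{Z(w) = z} * Y_{X=x,Z=z}(w)
    return lambda w: sum((1 if Z(w) == z else 0) * y_xz[(x, z)](w) for z in (0, 1))

def contract_over_x(y_xz, z):
    # Y_{Z=z}(w) = sum_x 1{X(w) = x} * Y_{X=x,Z=z}(w)
    return lambda w: sum((1 if X(w) == x else 0) * y_xz[(x, z)](w) for x in (0, 1))

def contract_over_both(y_xz):
    # Y(w) = sum_x sum_z 1{X(w) = x} 1{Z(w) = z} * Y_{X=x,Z=z}(w)
    return lambda w: sum(
        (1 if X(w) == x else 0) * (1 if Z(w) == z else 0) * y_xz[(x, z)](w)
        for x in (0, 1) for z in (0, 1)
    )

Y_X = {x: contract_over_z(Y_XZ, x) for x in (0, 1)}
Y_Z = {z: contract_over_x(Y_XZ, z) for z in (0, 1)}
Y = contract_over_both(Y_XZ)

# Evaluate on a grid of cell midpoints in the unit square.
grid = [((i + 0.5) / 20, (j + 0.5) / 20) for i in range(20) for j in range(20)]
# Contracting {Y_{X=x}} over X should reproduce the observed Y (Equation 2).
equation_2_holds = all(
    Y(w) == sum((1 if X(w) == x else 0) * Y_X[x](w) for x in (0, 1)) for w in grid
)
print(equation_2_holds)
```

The check confirms that contracting first over $Z$ and then over $X$ agrees with contracting over both subscripts at once, at least on the grid points examined.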
4.3 Joint causality
In a system of three binary observable random variables $(X, Y, Z)$, Definition 2 is still applicable to pairs of variables. For instance, $X$ is causal for $Y$ if the potential outcomes $Y_{X=0}$ and $Y_{X=1}$, obtained by contracting over $Z$, are different on a subset of the sample space of nonzero measure. Similarly, one can assess if $Z$ is causal for $Y$ by examining the potential outcomes obtained by contracting over $X$.
However, it is also possible for $X$ and $Z$ to affect $Y$ in a way not fully explained by their individual effects on $Y$. Figure 6 displays a particularly pronounced example. Here, neither $X$ nor $Z$ is causal for $Y$ alone according to Definition 2. This is because $Y_{X=0}(\omega) = Y_{X=1}(\omega)$ and $Y_{Z=0}(\omega) = Y_{Z=1}(\omega)$ for all $\omega \in \Omega$. In fact, all single-subscripted potential outcomes $Y_{X=x}$ and $Y_{Z=z}$ are equal to zero on all of $\Omega$. For example:
$$Y_{X=0}(\omega) = \mathbb{1}_{\{Z = 0\}}(\omega)\, Y_{X=0,Z=0}(\omega) + \mathbb{1}_{\{Z = 1\}}(\omega)\, Y_{X=0,Z=1}(\omega) = 0$$
for all $\omega \in \Omega$. This is because $Y_{X=0,Z=0} = 0$ on $Z^{-1}(0)$, where $\mathbb{1}_{\{Z = 0\}} = 1$. Likewise $Y_{X=0,Z=1} = 0$ on $Z^{-1}(1)$, where $\mathbb{1}_{\{Z = 1\}} = 1$. Similar calculations can be done for each of the other three single-indexed potential outcomes $Y_{X=1}$, $Y_{Z=0}$, and $Y_{Z=1}$, and one can confirm that each of these potential outcomes is identically zero on all of $\Omega$.
However, the double-subscripted potential outcomes differ from each other on a subset of $\Omega$ of measure one. This is because for all $\omega \in \Omega$ (excluding the measure zero subset along the vertical and horizontal mid-line of $\Omega$), exactly one double-subscripted potential outcome is equal to one, with each of the other three equal to zero. In this example, we will say that $X$ and $Z$ are jointly causal for $Y$.
Before precisely defining joint causality, we first recognize that Definition 2 can also apply to causal relationships between observable and potential outcome random variables. Noting that $Y_{X=x}$ is itself a random variable, we can conclude that $Z$ is causal for $Y_{X=x}$ if $Y_{X=x,Z=0} \neq Y_{X=x,Z=1}$ on a subset of nonzero measure. Intuitively, if $Z$ is causal for $Y_{X=x}$, the effect $X$ has on $Y$ is modified by the value of $Z$.
However, being causal for alone does not capture the notion of joint causality. For example, consider the set of potential outcomes displayed in Figure 7. In this case, is causal for since and differ on all of . Similarly, is also causal for . However, the potential outcomes do not depend on the subscript at all: the value of can be determined by and the subscript alone.
To ensure that we exclude scenarios like that in Figure 7, we define joint causality as follows:
Definition 4 (Joint causality).
Two binary random variables $X$ and $Z$ are said to be jointly causal for a third random variable $Y$ if both of the following hold:
1. $X$ is causal for $Y_{Z=z}$ for some $z$.
2. $Z$ is causal for $Y_{X=x}$ for some $x$.
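A small sketch can reproduce the qualitative behaviour of a system in which every single-subscripted potential outcome vanishes while the double-subscripted ones do not. The construction below is hypothetical (it is built to have the stated properties and is not claimed to be identical to Figure 6), and a grid of cell midpoints is used as a crude stand-in for "a subset of nonzero measure".

```python
def X(w):
    return 1 if w[0] > 0.5 else 0   # hypothetical layout: right half of the square

def Z(w):
    return 1 if w[1] > 0.5 else 0   # hypothetical layout: upper half of the square

def make_outcome(x, z):
    # Y_{X=x,Z=z} equals 1 only on the quadrant where X != x and Z != z.
    return lambda w: 1 if (X(w) != x and Z(w) != z) else 0

Y_XZ = {(x, z): make_outcome(x, z) for x in (0, 1) for z in (0, 1)}

# Cell midpoints avoid the measure-zero mid-lines of the square.
grid = [((i + 0.5) / 50, (j + 0.5) / 50) for i in range(50) for j in range(50)]

def fraction_differ(f, g):
    # Grid-based proxy for the measure of the set where f and g differ.
    return sum(f(w) != g(w) for w in grid) / len(grid)

# Contraction: sum_z 1{Z(w)=z} Y_{X=x,Z=z}(w) picks out the term with z = Z(w).
Y_X = {x: (lambda w, x=x: Y_XZ[(x, Z(w))](w)) for x in (0, 1)}
Y_Z = {z: (lambda w, z=z: Y_XZ[(X(w), z)](w)) for z in (0, 1)}

single_subscripted_all_zero = all(
    f(w) == 0 for f in list(Y_X.values()) + list(Y_Z.values()) for w in grid
)

# Definition 4, condition 1: X is causal for Y_{Z=z} for some z.
x_causal_for_some_yz = any(
    fraction_differ(Y_XZ[(0, z)], Y_XZ[(1, z)]) > 0 for z in (0, 1)
)
# Definition 4, condition 2: Z is causal for Y_{X=x} for some x.
z_causal_for_some_yx = any(
    fraction_differ(Y_XZ[(x, 0)], Y_XZ[(x, 1)]) > 0 for x in (0, 1)
)
jointly_causal = x_causal_for_some_yz and z_causal_for_some_yx
print(single_subscripted_all_zero, jointly_causal)
```

In this construction neither variable is individually causal for $Y$ (all contracted outcomes are zero), yet both conditions of Definition 4 hold.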
As with Definition 2, some generalizations of Definition 4 are obvious while others are not. For one, Definition 4 does not at all depend on $X$ and $Z$ being binary; the definition is equally applicable to any finite discrete $X$ and $Z$. When either $X$ or $Z$ is continuous, we encounter the same subtleties as in Definition 2. We can also imagine the definition of joint causality applying to sets of more than two random variables. For three random variables to be jointly causal for a fourth random variable, we require each of the three to be causal for the potential outcome obtained by fixing the other two at some pair of values. The generalization to four or more finite discrete random variables is now straightforward.
4.4 Joint randomization
In Theorem 1, we saw that experimental randomization of $X$ allowed us to infer the distribution of the potential outcome $Y_x$ from the distribution of the observable random variable $\tilde{Y}$. In the present section, we show how one can simultaneously randomize $X$ and $Z$ to infer the distribution of the potential outcomes $Y_{X=x,Z=z}$. This procedure of simultaneous randomization, detailed in Definition 5, is a natural extension of the procedure detailed in Definition 3.
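Definition 5 (Joint experimental randomization of $X$ and $Z$).
Suppose $X$, $Y$, and $Z$ are defined on a probability space $(\Omega, \mathcal{F}, P)$. A joint experimental randomization of $X$ and $Z$ produces a new probability space $(\Omega \times \tilde{\Omega}, \mathcal{F} \otimes \tilde{\mathcal{F}}, P \times \tilde{P})$ and new random variables $\tilde{X}$, $\tilde{Y}$, and $\tilde{Z}$ defined as follows:
$$\tilde{X}(\omega, \tilde{\omega}) = \tilde{X}(\tilde{\omega}), \qquad \tilde{Z}(\omega, \tilde{\omega}) = \tilde{Z}(\tilde{\omega}), \qquad \tilde{Y}(\omega, \tilde{\omega}) = \sum_{x} \sum_{z} \mathbb{1}_{\{\tilde{X}(\tilde{\omega}) = x\}}\, \mathbb{1}_{\{\tilde{Z}(\tilde{\omega}) = z\}}\, Y_{X=x,Z=z}(\omega),$$
where $\tilde{X}$ and $\tilde{Z}$ are defined arbitrarily on a probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$ such that $\tilde{P}(\tilde{X} = x, \tilde{Z} = z) > 0$ for all $(x, z)$.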
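In the definition of joint experimental randomization, we do not require $X$ and $Z$ to be randomized on separate probability spaces. In other words, joint experimental randomization of $X$ and $Z$ does not necessarily require $\tilde{X}$ and $\tilde{Z}$ to be independent of each other. Of course, randomizing $X$ and $Z$ on separate probability spaces $(\tilde{\Omega}_X, \tilde{\mathcal{F}}_X, \tilde{P}_X)$ and $(\tilde{\Omega}_Z, \tilde{\mathcal{F}}_Z, \tilde{P}_Z)$ such that the randomized probability space is
$$(\Omega \times \tilde{\Omega}_X \times \tilde{\Omega}_Z,\ \mathcal{F} \otimes \tilde{\mathcal{F}}_X \otimes \tilde{\mathcal{F}}_Z,\ P \times \tilde{P}_X \times \tilde{P}_Z)$$
also satisfies Definition 5.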
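Lastly, we prove that joint experimental randomization allows us to observe the distribution of the double-subscripted potential outcomes. This result extends Theorem 1.
Theorem 2.
Under joint experimental randomization of $X$ and $Z$, for any Borel set $B$ and any $(x, z)$,
$$(P \times \tilde{P})(\tilde{Y} \in B \mid \tilde{X} = x, \tilde{Z} = z) = P(Y_{X=x,Z=z} \in B).$$
The proof is analogous to the proof of Theorem 1. We simply write out the conditional probability explicitly:
$$(P \times \tilde{P})(\tilde{Y} \in B \mid \tilde{X} = x, \tilde{Z} = z) = \frac{P(Y_{X=x,Z=z} \in B)\, \tilde{P}(\tilde{X} = x, \tilde{Z} = z)}{\tilde{P}(\tilde{X} = x, \tilde{Z} = z)} = P(Y_{X=x,Z=z} \in B).$$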