# Probabilistic Reasoning across the Causal Hierarchy

We propose a formalization of the three-tier causal hierarchy of association, intervention, and counterfactuals as a series of probabilistic logical languages. Our languages are of strictly increasing expressivity, the first capable of expressing quantitative probabilistic reasoning—including conditional independence and Bayesian inference—the second encoding do-calculus reasoning for causal effects, and the third capturing a fully expressive do-calculus for arbitrary counterfactual queries. We give a corresponding series of finitary axiomatizations complete over both structural causal models and probabilistic programs, and show that satisfiability and validity for each language are decidable in polynomial space.


## Introduction and Summary

Intelligence commonly involves prediction, anticipating future events on the basis of past observations (e.g., “Will the water pipes freeze again this winter?”). Intelligent planning and decision-making additionally require predicting what would happen under a hypothetical action (“Will the pipes freeze if we keep the heat on?”). An even more sophisticated ability—critical for tasks like explanation—is to reason counterfactually about what would have happened given knowledge about what in fact happened (“Would the pipes have frozen if we had left the heat on, given that the heat was off and the pipes in fact froze?”). These three modes of reasoning constitute a causal hierarchy [26, 24], highlighting the significance of structural causal knowledge for flexible thought and action.

The aim of the present article is to gain conceptual as well as technical insight into this hierarchy by employing tools from logic. Loosely following earlier work [26], we propose a characterization of its levels in terms of logical syntax. Semantically, all three languages are interpreted over the same class of models, namely structural causal models (and later probabilistic programs). The languages differ in how much they can express about these models. $\mathcal{L}_1$, the language of association, expresses only “pure” probabilistic facts and relationships; $\mathcal{L}_2$, the language of probabilistic intervention, allows expressing probabilities of basic conditional “if…then…” statements; and $\mathcal{L}_3$, the language of probabilistic counterfactuals, encodes probabilities for arbitrary boolean combinations of such conditional statements.

Using standard ideas from logic and existing results, we can address questions about definability and expressiveness. For instance, it is easy to prove in our framework that each language is strictly more expressive than those below it in the hierarchy (Prop. 1, 2 below). We can also interpret well-known insights from the graphical models and causal learning literatures as graph definability results for appropriate probabilistic logical languages, analogously to correspondence theory in modal logic [32].

With a precise syntax and semantics for probabilistic causal reasoning in hand, questions of axiomatization naturally arise. That is, we would like to identify a perspicuous set of basic principles that underlie all such reasoning. One of our main technical contributions is a series of finitary (sound and complete) axiomatizations, one for each level of the causal hierarchy (Thm. 5), relying on methods from semialgebraic geometry. As a corollary to these completeness results, we also reveal a “small-model property” with the consequence that satisfiability and validity for $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ can be decided in polynomial space (Thm. 11).

Finally, in the last part of the paper we consider an alternative interpretation for our three logical languages. Probabilistic programs, with an appropriate notion of causal intervention, provide a procedural semantics for probabilistic counterfactual claims and queries. We establish an equivalence between these models and a natural subclass of computable structural causal models (Thm. 12). The equivalence in turn implies soundness and completeness of our axiomatizations with respect to this interpretation as well.

### Relation to Previous Work

While deterministic causal counterfactuals and probabilistic logics have both received extensive treatment independently, the present contribution appears to be the first systematic study of probabilistic counterfactuals. Our work thus synthesizes and improves upon a long line of previous work. Axioms for causal conditionals interpreted over structural causal models are well understood [7, 11, 24, 35, 16] and play a distinct role in causal reasoning tasks such as identification [26, 22]. Indeed, some prominent approaches to causal learning and identification even employ algorithmic logic-based techniques [13, 14, 31].

Meanwhile, much is known about formalized probability calculi. [6] considered a probability logic built over a language of polynomials sufficiently expressive to encompass essentially all ordinary probabilistic reasoning about propositional facts, including Bayesian inference, conditional independence, and so on. However, they left open the problem of explicit axiomatization. A complete axiomatization was later provided by [25] using an infinitary proof rule. Whereas our main interest is causal reasoning beyond the first level of the hierarchy, Thm. 5 incidentally establishes the first (weakly) complete finitary axiomatization for “pure” probability logic over a language of polynomials. Moving to the second and third levels of the hierarchy, Thm. 5 also presents the first combined axiomatization for probabilistic reasoning about causal counterfactuals.

[6] established a complexity upper bound ($\mathsf{PSPACE}$) for their satisfiability problem. On all three levels of the hierarchy, we obtain the same upper bound for our decision problem (Thm. 11). Both arguments rely crucially on the procedure given by [5] to decide the existential theory of a real closed field. It has been previously suggested to apply cylindrical algebraic decomposition (which decides the full first-order theory of real closed fields) to causal questions [8].

Encoding causal knowledge in an implicit way via a generative probabilistic program has been explored recently by a number of research groups [21, 4, 30]. Deterministic conditionals over “simulation programs” have been axiomatized [15], showing that in general such an interpretation validates strictly fewer principles than structural causal models. This weaker axiomatic system was also embedded in a probability logic with linear inequalities [17]. It is possible, however, to restrict the class of probabilistic programs so as to ensure equivalence with (an appropriate class of) structural causal models [16]. We draw on all of this work in what follows.

## Causal Models

We are interested in structural causal models [27, 24] defined over a set of endogenous variables and a set of exogenous variables; let the combined variable set be . Every variable $X$ takes on a value from an admissible set $\mathrm{Val}(X)$. While the sets of endogenous and exogenous variables may be infinite, we assume $\mathrm{Val}(X)$ is finite for every variable $X$.

###### Definition 1 (Structural Causal Model).

We define a SCM to be a pair , where is a probability measure on a -algebra of settings of , and is a set of functions, one for each endogenous variable.

Given a set of variables , we call an assignment of each variable to a value in a setting. Thus every is a function from settings of endogenous variables and settings of exogenous variables to a value . and

thus define the obvious joint probability distribution

.

###### Definition 2 (Intervention).

An intervention is a partial function . It specifies variables to be held fixed, and the values to which they are fixed. Intervention induces a mapping of SCMs, also denoted , so that is identical to , but with replaced by the constant function for each . We say is finite whenever is finite.
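As a rough illustration of Defs. 1 and 2, the following Python sketch represents a small acyclic SCM as a dictionary of structural functions and implements an intervention by swapping selected functions for constants. The two-variable model and the helper names `intervene`, `solve`, and `prob` are hypothetical choices made for this example; they are not taken from the paper.

```python
from fractions import Fraction
from itertools import product

# Hypothetical SCM: exogenous bits U1, U2 are independent fair coins;
# endogenous X := U1 and Y := X AND U2. Keys are listed in causal order.
EQUATIONS = {
    "X": lambda v, u: u["U1"],
    "Y": lambda v, u: v["X"] & u["U2"],
}

def intervene(equations, do):
    """Return a copy of the equations with f_X replaced by the constant x
    for every assignment X = x in the intervention `do`."""
    new = dict(equations)
    for var, val in do.items():
        new[var] = lambda v, u, val=val: val
    return new

def solve(equations, u):
    """Compute the endogenous setting determined by the exogenous setting u."""
    v = {}
    for var, f in equations.items():  # insertion order is the causal order
        v[var] = f(v, u)
    return v

def prob(equations, event):
    """Exact probability that `event` holds of the induced endogenous setting."""
    total = Fraction(0)
    for u1, u2 in product((0, 1), repeat=2):
        if event(solve(equations, {"U1": u1, "U2": u2})):
            total += Fraction(1, 4)
    return total

# Observational probability of Y = 1 versus its probability under do(X = 1).
p_obs = prob(EQUATIONS, lambda v: v["Y"] == 1)
p_do = prob(intervene(EQUATIONS, {"X": 1}), lambda v: v["Y"] == 1)
print(p_obs, p_do)
```

The intervened model keeps the same exogenous distribution and differs only in the replaced equations, which is the sense in which an intervention induces a mapping of SCMs.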

Using this definition we introduce a relation of direct causal influence [10]. We say when there are endogenous settings differing only in their assignment to , and an exogenous setting such that . This in turn induces a dag . Throughout we restrict attention to SCMs such that is well-founded, generalizing the common acyclicity condition. This is also equivalent to the requirement for infinite SCMs from [16].

We say is Markov if each variable is -independent of its non-descendants (in ) conditional on its parents. The Markov condition is guaranteed provided the exogenous variables are jointly independent [24, Thm. 1.4.1]. We define d-separation on a dag standardly, and write to say that the variables are d-separated from given . In Markov structures d-separation guarantees conditional independence.

In order to establish a correspondence (Thm. 12) with probabilistic programs, we introduce a further restriction on SCMs, following previous work on computable causal models [20, 16]. Def. 3 below assumes a very simple “coin-flip” probability space; we leave it as an exercise to show that there is no loss of generality compared to using any computable probability space as in, e.g., [1], including any standard continuous probability distributions.

###### Definition 3.

is computable if (1) its exogenous variables consist of infinitely many binary

, each uniformly distributed by

, and (2) the collection is uniformly computable [33].

Call a model measurable if under every finite intervention

, the joint distribution

is well-defined. The next Fact is straightforward.

###### Fact 1.

A computable SCM is both Markov and measurable.

###### Proof.

A computable is Markov simply because all exogenous variables are jointly independent. Let be an endogenous setting and let where . Letting be our -algebra, is well-defined if . It suffices to show that for all since closes under countable intersection. There is a machine that halts outputting the value for any

. By then it has seen only finitely many exogenous bits, whose values we write in a finite vector

. Thus the cylinder set of that agree with wherever the latter is defined is contained in . So letting , . Every cylinder is in and there are only countably many cylinders, so ; obviously so as desired. ∎

Let be the class of all measurable SCMs that have a well-founded causal influence graph. Further, let be the subclass of computable models in .

## Probabilistic Conditionals

### Syntax

We define a succession of language fragments as follows, where and :

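$$
\begin{aligned}
\mathcal{L}_{\text{int}} &::= \top \mid X = x \mid \mathcal{L}_{\text{int}} \wedge \mathcal{L}_{\text{int}} \\
\mathcal{L}_{\text{prop}} &::= X = x \mid \neg \mathcal{L}_{\text{prop}} \mid \mathcal{L}_{\text{prop}} \wedge \mathcal{L}_{\text{prop}} \\
\mathcal{L}_{\text{cond}} &::= [\mathcal{L}_{\text{int}}]\, \mathcal{L}_{\text{prop}} \\
\mathcal{L}_{\text{full}} &::= \mathcal{L}_{\text{cond}} \mid \neg \mathcal{L}_{\text{full}} \mid \mathcal{L}_{\text{full}} \wedge \mathcal{L}_{\text{full}}
\end{aligned}
$$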

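Based on these fragments we define a sequence of three increasingly expressive probabilistic languages $\mathcal{L}_1$, $\mathcal{L}_2$, $\mathcal{L}_3$. Each language $\mathcal{L}_i$ speaks about probabilities over the base language $\mathcal{L}^i_{\text{base}}$. The base languages are

$$
\mathcal{L}^1_{\text{base}} = \mathcal{L}_{\text{prop}}, \qquad \mathcal{L}^2_{\text{base}} = \mathcal{L}_{\text{cond}}, \qquad \mathcal{L}^3_{\text{base}} = \mathcal{L}_{\text{full}}.
$$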

Our languages describe facts about the probabilities that base language formulas hold. As our formulas are finitary, such facts correspond to polynomials in these probabilities. Let us make this precise. Fixing a set of variables, define the polynomial terms in those variables to be the terms $t$ generated by this grammar (where $V$ generates any element of the fixed set):

$$
t ::= V \mid t + t \mid t \cdot t \mid -t.
$$

Then the terms of $\mathcal{L}_i$ are polynomials over probabilities of base formulas, i.e., polynomial terms in the variables $P(\varphi)$ for $\varphi \in \mathcal{L}^i_{\text{base}}$. The language $\mathcal{L}_i$, $i = 1, 2, 3$, is then a propositional language of term inequalities:

$$
\mathcal{L}_i ::= t \geq t \mid \neg \mathcal{L}_i \mid \mathcal{L}_i \wedge \mathcal{L}_i
$$

where $t$ is a term of $\mathcal{L}_i$. We employ the following abbreviations. For and we take to stand for any propositional contradiction, and for any propositional tautology. For terms: for , for . For formulas, we write for , and for . Note that we may use any rational number as a term via representing its numerator as a sum of s and clearing its denominator through an inequality , and we write for a rational thus considered as a term.

Strictly speaking, for is not a well-formed term in or . We will nonetheless use this notation with the understanding that is a shorthand for . and thus extend in this sense.

$\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ correspond to the three levels of the causal hierarchy as proposed by [26, 24]; see also [3]. $\mathcal{L}_1$ is simply the language of probability, capturing statements like , which is shorthand for the expression . $\mathcal{L}_2$ encompasses assertions about so-called causal effects, e.g., statements like .

### Semantics

A model is simply a measurable SCM . Since is well-founded, each determines the values of all endogenous variables. Thus, for we will write , defined in the obvious way. For we define the intervention operation so that is the result of applying the interventions specified by to (Def. 2). Then we say just in case . We have thus defined for all .

Finally, for any and model define the set . Measurability of guarantees that is always measurable. Toward specifying the semantics of we define recursively, with the crucial clause given by . Satisfaction of is as expected: iff , iff , and iff and . As in previous work, it is easy to see that none of the languages is compact [25, 15, 16]; consequently Thm. 5 shows weak completeness only.
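To make the satisfaction clauses concrete, here is a minimal Python sketch of an evaluator for terms and formulas, assuming the probabilities of the relevant base formulas have already been computed from a model and stored in a dictionary. The tuple encoding of syntax, the function names `eval_term` and `holds`, and the treatment of an equality between terms as a conjunction of two inequalities are assumptions of this sketch.

```python
from fractions import Fraction

def eval_term(t, probs):
    """Evaluate a polynomial term over probabilities of base formulas.
    Terms: ("P", name), ("+", t1, t2), ("*", t1, t2), ("neg", t)."""
    op = t[0]
    if op == "P":
        return probs[t[1]]
    if op == "+":
        return eval_term(t[1], probs) + eval_term(t[2], probs)
    if op == "*":
        return eval_term(t[1], probs) * eval_term(t[2], probs)
    if op == "neg":
        return -eval_term(t[1], probs)
    raise ValueError(f"unknown term constructor: {op}")

def holds(f, probs):
    """Decide a formula: (">=", t1, t2), ("not", f), ("and", f, g)."""
    op = f[0]
    if op == ">=":
        return eval_term(f[1], probs) >= eval_term(f[2], probs)
    if op == "not":
        return not holds(f[1], probs)
    if op == "and":
        return holds(f[1], probs) and holds(f[2], probs)
    raise ValueError(f"unknown formula constructor: {op}")

# Hypothetical probabilities of three base formulas in some model.
probs = {"A": Fraction(1, 2), "B": Fraction(1, 3), "A&B": Fraction(1, 6)}

# Independence of A and B: P(A & B) equals P(A) * P(B), as two inequalities.
lhs = ("P", "A&B")
rhs = ("*", ("P", "A"), ("P", "B"))
independent = ("and", (">=", lhs, rhs), (">=", rhs, lhs))
print(holds(independent, probs))
```

Exact rational arithmetic keeps the comparison of polynomial terms free of floating-point error, which matters because the formulas are inequalities between polynomials.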

### Comparing Expressivity

With a precise semantic interpretation of our three languages in hand, we can now show rigorously that they form a strict hierarchy, in the sense that models may be distinguishable only by moving up to higher levels of the hierarchy.

###### Proposition 1.

$\mathcal{L}_2$ is strictly more expressive than $\mathcal{L}_1$.

###### Proof.

Consider with and while ; in we have and . It is easy to see by an induction on terms in that , and thus and validate the same formulas. Yet, , while . In particular the schema is also falsified by , a reflection of the distinction between observation and intervention. ∎
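The phenomenon behind Prop. 1 can be checked on a small example. The Python sketch below uses a hypothetical pair of models (not necessarily the pair used in the proof): in both, $X$ copies a fair coin $U$, but $Y$ copies $X$ in the first model and copies $U$ directly in the second. The observational distributions coincide, while the probability of $Y = 1$ under the intervention $X = 1$ differs.

```python
from fractions import Fraction

def solve(eqs, u, do=None):
    """Endogenous setting for exogenous value u under intervention `do`."""
    do = do or {}
    v = {}
    for var, f in eqs.items():  # insertion order is the causal order
        v[var] = do[var] if var in do else f(v, u)
    return v

def dist(eqs, do=None):
    """Exact joint distribution of (X, Y) when U is a fair coin."""
    d = {}
    for u in (0, 1):
        v = solve(eqs, u, do)
        key = (v["X"], v["Y"])
        d[key] = d.get(key, Fraction(0)) + Fraction(1, 2)
    return d

# Hypothetical models: Y listens to X in M1 but to U in M2.
M1 = {"X": lambda v, u: u, "Y": lambda v, u: v["X"]}
M2 = {"X": lambda v, u: u, "Y": lambda v, u: u}

same_observational = dist(M1) == dist(M2)
p1 = sum(p for (x, y), p in dist(M1, {"X": 1}).items() if y == 1)
p2 = sum(p for (x, y), p in dist(M2, {"X": 1}).items() if y == 1)
print(same_observational, p1, p2)
```

Since the two models induce the same joint distribution, no purely probabilistic formula can separate them, whereas a single interventional probability does.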

###### Proposition 2.

$\mathcal{L}_3$ is strictly more expressive than $\mathcal{L}_2$.

###### Proof.

Consider an example adapted from [2] with two endogenous and two exogenous variables , where and , while . The difference between models and

is the equation for binary variable

. In we have equal to , and in we have given by . It is then easy to check (by induction) that and validate all the same formulas, whereas, e.g., , with the term denoting the probability of necessity and sufficiency . ∎
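A similar check illustrates Prop. 2. The Python sketch below uses a hypothetical pair of models in the same spirit as the proof (the exact equations of the proof are not reproduced here): $X$ copies a fair coin $U_1$, and $Y$ is either a second fair coin $U_2$ or the exclusive-or of $X$ and $U_2$. The two models induce the same distribution over $(X, Y)$ under every intervention, yet they disagree on the probability of necessity and sufficiency, which evaluates two different interventions on the same exogenous setting.

```python
from fractions import Fraction
from itertools import product

def solve(eqs, u, do):
    """Endogenous setting for exogenous pair u under intervention `do`."""
    v = {}
    for var, f in eqs.items():  # insertion order is the causal order
        v[var] = do[var] if var in do else f(v, u)
    return v

def dist(eqs, do):
    """Exact joint distribution of (X, Y); U1 and U2 are fair coins."""
    d = {}
    for u in product((0, 1), repeat=2):
        v = solve(eqs, u, do)
        key = (v["X"], v["Y"])
        d[key] = d.get(key, Fraction(0)) + Fraction(1, 4)
    return d

def pns(eqs):
    """P([X=1]Y=1 and [X=0]Y=0): both conditionals on the same u."""
    total = Fraction(0)
    for u in product((0, 1), repeat=2):
        y1 = solve(eqs, u, {"X": 1})["Y"]
        y0 = solve(eqs, u, {"X": 0})["Y"]
        if y1 == 1 and y0 == 0:
            total += Fraction(1, 4)
    return total

# Hypothetical models differing only in the equation for Y.
M_A = {"X": lambda v, u: u[0], "Y": lambda v, u: u[1]}
M_B = {"X": lambda v, u: u[0], "Y": lambda v, u: v["X"] ^ u[1]}

interventions = (
    [{}]
    + [{"X": x} for x in (0, 1)]
    + [{"Y": y} for y in (0, 1)]
    + [{"X": x, "Y": y} for x in (0, 1) for y in (0, 1)]
)
agree_on_level_2 = all(dist(M_A, a) == dist(M_B, a) for a in interventions)
print(agree_on_level_2, pns(M_A), pns(M_B))
```

Agreement on every interventional distribution means the two models assign equal values to every probability of a single conditional, so only a conjunction of conditionals with different antecedents tells them apart.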

Note also that in this second example, showing that even allowing simple conjunctions of the form would increase the expressive power of $\mathcal{L}_2$. It is thus not possible in general to reason in $\mathcal{L}_2$ about conditional expressions such as . On the other hand $\mathcal{L}_2$ does handle conditional effects, since, e.g., can be rewritten as .

Footnote 1: In the notation of do-calculus [23, 24], the expression would be written as .

In a companion manuscript [3] we improve upon Props. 1 and 2 by showing that for the -theory of a model almost-never determines its -theory, in the sense that the proportion of models where such collapse occurs goes asymptotically to zero.

### Graph Definability and Do-Calculus

Given the languages and interpretation considered so far, we mention as an aside that it may be enlightening to consider a notion of graph validity, analogous to “frame validity” in modal logic [32]. Let us say just in case for all Markov structures such that .

For any dag there is a probability distribution whose conditional independencies are exactly those implied by d-separation in [9]. It is then easy to construct an SCM with and , which immediately gives the following proposition.

Footnote 2: represents a conjunction over all instances of this schema with all combinations of values , , . Here are lists of variables and are corresponding lists of values in the respective ranges.

###### Proposition 3.

if and only if . In other words, the graph property of d-separation is definable in .

One of the most intriguing components of structural causal reasoning is the do-calculus [23, 24, 34], allowing the derivation of causal effects from observational data. This calculus can also be seen as involving graph validity. The next proposition is a slight extension of what was already proved in [23], combined with the result of [9] mentioned above.

###### Proposition 4.

Let be a dag over variables . Then (1) iff ; (2) iff ; and (3) iff . All formulas here are in .

Footnote 3: is minus any edges into or out of , and is the set of all -nodes that are not ancestors of any -node in .

We leave further exploration of questions about graph definability in these languages for a future occasion.

## Axiomatizations

We now give systems each of which axiomatizes the validities of over both and . These probabilistic logics build on the base (deterministic) logics, as any equivalent base formulas must be assigned the same probability. We call an -validity, and write , if for all and we have . For , write if is a propositional tautology. The validities of have been axiomatized in [16]. Suppose . It is easy to see that just in case . It is even easier to see that for , just in case . Satisfiability for every base language is -complete. We need to add one more axiom schema, for all :

$$
\textsf{Def}. \quad \bigwedge_{\substack{x, x' \in \mathrm{Val}(X) \\ x \neq x'}} \neg (X = x \wedge X = x') \;\wedge \bigvee_{x \in \mathrm{Val}(X)} X = x.
$$

Footnote 4: In [16], Def merely amounts to the law of excluded middle since for all ; here we assume only that every $\mathrm{Val}(X)$ is finite.

The system for is then as follows.

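$$
\begin{array}{ll}
\textsf{MP}. & \text{Inference rule: } \varphi,\ \varphi \to \psi \vdash \psi \\
\textsf{Bool}. & \text{Boolean tautologies over } \mathcal{L}_i \\
\textsf{NonNeg}. & P(\varphi) \geq \underline{0} \\
\textsf{Add}. & P(\varphi \wedge \psi) + P(\varphi \wedge \neg\psi) \equiv P(\varphi) \\
\textsf{Dist}. & P(\varphi) = P(\psi) \text{ whenever } \models_{\mathcal{L}^i_{\text{base}}} \varphi \leftrightarrow \psi \\
\textsf{Poly}. & \text{The polynomial schemata below.}
\end{array}
$$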

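The following 15 axioms constitute Poly.

$$
\begin{array}{ll}
\textsf{OrdTot}. & t_1 \geq t_2 \vee t_2 \geq t_1 \\
\textsf{OrdTrans}. & t_1 \geq t_2 \wedge t_2 \geq t_3 \to t_1 \geq t_3 \\
\textsf{NonDegen}. & \neg(\underline{0} \equiv \underline{1}) \\
\textsf{AddComm}. & t_1 + t_2 \equiv t_2 + t_1 \\
\textsf{AddAssoc}. & (t_1 + t_2) + t_3 \equiv t_1 + (t_2 + t_3) \\
\textsf{Zero}. & t + \underline{0} \equiv t \\
\textsf{AddOrd}. & t_1 \geq t_2 \to t_1 + t_3 \geq t_2 + t_3 \\
\textsf{MulComm}. & t_1 \cdot t_2 \equiv t_2 \cdot t_1 \\
\textsf{MulAssoc}. & (t_1 \cdot t_2) \cdot t_3 \equiv t_1 \cdot (t_2 \cdot t_3) \\
\textsf{One}. & t \cdot \underline{1} \equiv t \\
\textsf{MulDist}. & t_1 \cdot (t_2 + t_3) \equiv t_1 \cdot t_2 + t_1 \cdot t_3 \\
\textsf{MulNonNeg}. & t_1 \geq \underline{0} \wedge t_2 \geq \underline{0} \to t_1 \cdot t_2 \geq \underline{0} \\
\textsf{ZeroMul}. & t \cdot \underline{0} \equiv \underline{0} \\
\textsf{NoZeroDiv}. & t_1 \cdot t_2 \equiv \underline{0} \to t_1 \equiv \underline{0} \vee t_2 \equiv \underline{0} \\
\textsf{Neg}. & t + (-t) \equiv \underline{0}
\end{array}
$$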

As for , Add turns out to be too weak—it only captures consequents of the trivial antecedent () since a purely propositional in the base language is interpreted as . Thus to obtain , we form the axiomatization as above, but add the following axiom Add2:

### Sample Derivation

Before proving completeness (Thm. 5) we illustrate the power of (and ) through a representative derivation. Our goal is to derive the example in [23, §3.2]:

$$
P([x^*]y^*) \equiv \sum_{z} P(z \mid x^*) \sum_{x} P(y^* \mid x \wedge z)\, P(x) \tag{1}
$$

for any specific values $x^*$ and $y^*$. (Rather than writing, e.g., , we are simply writing .) This formula (in ) is not in general valid. But it does follow from further assumptions easily statable in . Formulas (2)-(4) below are instances of the second do-calculus schema, while (5) and (6) are instances of the third schema (recall Prop. 4).

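$$
\begin{aligned}
P([X]Z) &\equiv P(Z \mid X) && (2) \\
P([X]Y \mid [X]Z) &\equiv P([X \wedge Z]Y) && (3) \\
P([Z]Y \mid [Z]X) &\equiv P(Y \mid X \wedge Z) && (4) \\
P([X \wedge Z]Y) &\equiv P([Z]Y) && (5) \\
P([Z]X) &\equiv P(X) && (6)
\end{aligned}
$$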

Prop. 4 provides the graphical assumptions needed to justify each of these assertions. For example, they are all valid over the graph in Fig. 1 [23, 24].

We now argue that (1) is derivable in our calculus. Notably, this can be done in using only . First, by appeal to MP, Bool, and Dist, we have that , which in turn using Add2 is equal to . By Poly this can be shown equal to . By (2) and (3) this is equal to , and by (5) to . Employing a similar argument to that above (using MP, Bool, Add2, and Dist), this is equal to . By (4) and (6) we finally obtain .

Footnote 5: It is worth observing that this derivation would go through even if we considered the weaker logic for studied in [15]. That is, deriving (1) does not depend on any of the causal axioms that characterize structural causal models [11, 24]. However, slightly weaker assumptions (e.g., in place of (2)) would require the additional axioms.
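Equation (1) can also be sanity-checked numerically on a concrete model. The Python sketch below assumes a hypothetical SCM with the structure $X \to Z \to Y$ and an exogenous bit $U$ influencing both $X$ and $Y$; the graph of Fig. 1 is not reproduced in the text, so this structure, the particular equations, and the helper names are assumptions of the example. Both sides of (1) are computed exactly by enumerating the 32 exogenous settings.

```python
from fractions import Fraction
from itertools import product

def solve(u, do=None):
    """Endogenous values for exogenous fair bits u = (U, A, B, C, D).
    Hypothetical equations: X := U or A; Z := X if B else C;
    Y := Z xor (U and D). U confounds X and Y but does not affect Z."""
    do = do or {}
    U, A, B, C, D = u
    x = do.get("X", U | A)
    z = do.get("Z", x if B else C)
    y = do.get("Y", z ^ (U & D))
    return {"X": x, "Z": z, "Y": y}

def prob(event, do=None):
    """Exact probability of `event`, optionally under intervention `do`."""
    hits = sum(1 for u in product((0, 1), repeat=5) if event(solve(u, do)))
    return Fraction(hits, 32)

def cond(event, given):
    """Observational conditional probability P(event | given)."""
    return prob(lambda v: event(v) and given(v)) / prob(given)

x_star, y_star = 1, 1

# Left-hand side of (1): probability of Y = y* under the intervention X = x*.
lhs = prob(lambda v: v["Y"] == y_star, do={"X": x_star})

# Right-hand side of (1): sum_z P(z | x*) sum_x P(y* | x, z) P(x).
rhs = sum(
    cond(lambda v: v["Z"] == z, lambda v: v["X"] == x_star)
    * sum(
        cond(lambda v: v["Y"] == y_star,
             lambda v: v["X"] == x and v["Z"] == z)
        * prob(lambda v: v["X"] == x)
        for x in (0, 1)
    )
    for z in (0, 1)
)
print(lhs, rhs)
```

The right-hand side uses only observational probabilities, so agreement with the left-hand side shows the causal effect being recovered from non-interventional quantities in this particular model. The check depends on every pair of values for $X$ and $Z$ having positive probability, which holds for the equations chosen here.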

### Completeness Theorems

###### Theorem 5.

Each is sound and complete for the validities of with respect to both and .

###### Proof.

Soundness is straightforward. For completeness, we show any consistent is satisfiable. There is a consistent clause in its disjunctive normal form so we may assume that is a conjunction of literals (using MP, Bool). We can now obtain a normal form for as in [6]. Lem. 6 below gives this for the case. We will show later how to modify it (Lem. 9) for .

###### Lemma 6.

Suppose is a conjunction of literals. Let be the base atoms appearing in . Let . Let . Then there are polynomial terms in the variables such that is provably-in- equivalent to a conjunction

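$$
\bigwedge_{\delta \in \Delta_\bot} P(\delta) \equiv \underline{0} \;\wedge \bigwedge_{\delta \in \Delta \setminus \Delta_\bot} P(\delta) \geq \underline{0} \;\wedge\; \sum_{\delta \in \Delta} P(\delta) \equiv \underline{1} \;\wedge \bigwedge_{1 \leq i \leq m} t_i \geq \underline{0} \;\wedge \bigwedge_{1 \leq i \leq m'} t'_i > \underline{0}. \tag{7}
$$
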
###### Proof.

The first part of (7) comes from Dist, the second from NonNeg, and the remainder from Add and Poly. This is most clearly illustrated by example. Consider so that and . Let abbreviate . By NonNeg, and by Add, . Now, we may compute the in (7). Using Poly and Add:

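$$
\begin{aligned}
&\vdash \varphi \leftrightarrow \underline{2} P([X]Y) + \underline{4} P([Y]Z) - \underline{1} \equiv \underline{0} \\
&\vdash \varphi \leftrightarrow \underline{2} P([X]Y \wedge [Y]Z) + \underline{2} P([X]Y \wedge \neg [Y]Z) + \underline{4} P([X]Y \wedge [Y]Z) + \underline{4} P(\neg [X]Y \wedge [Y]Z) - \underline{1} \equiv \underline{0} \\
&\vdash \varphi \leftrightarrow \underline{6} p_1 + \underline{2} p_2 + \underline{4} p_3 - \underline{1} \equiv \underline{0} \\
&\vdash \varphi \leftrightarrow \underline{6} p_1 + \underline{2} p_2 + \underline{4} p_3 - \underline{1} \leq \underline{0} \;\wedge\; -\underline{6} p_1 - \underline{2} p_2 - \underline{4} p_3 + \underline{1} \leq \underline{0}
\end{aligned}
$$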

so that in (7), and is while is . We can carry out this process for any , with if contains negations. ∎

(7) is a system of polynomial inequalities in the unknowns . We now demonstrate this system has a solution provided (7) is consistent. Our primary tool is the following semialgebraic result [28].

###### Theorem 7 (Positivstellensatz).

Let and suppose , , . Let be the closure of under addition and multiplication, and let . Then either the system has a solution over , or there exist , , with

 g+h+F2n=0 (8)

where .

Each clause in (7) easily translates to a polynomial in Thm. 7; a clause becomes two constraints: and . If there’s no solution, let for some as in (8), where is an iterated multiplication. We claim so that is inconsistent, a contradiction. We use the principles below, all derivable from Poly:

First, we show . Note that by OrdSq and by ZeroMul given Thm. 7 and the clauses of (7). Also, by NonDegen if in (7) or in (8) and by MulPos otherwise. So by AddPos. Now we show . In fact, we don’t need . We show that Poly is powerful enough to simplify polynomials; then by soundness and since (8) holds identically, . Using MulDist, NegAdd, where is a sum of non- monomials, each of which is either itself or contains no factors of (One), and contains at most one sign (NegNeg). By group the factors in each left-associatively and in increasing (lexicographic) order of their variables. Then with AddComm, AddAssoc, Neg group and cancel out to equal but opposite monomials. Adding all the s, we have .

The can be assumed computable, as there is an algebraic [29] and a fortiori computable solution to (7). Now, this implies there is a that satisfies each with probability , so that :

###### Lemma 8.

Let be satisfiable -formulas no pair of which is jointly satisfiable, and let be nonnegative computable reals summing to unity. Then there is a such that for all .

###### Proof.

[16] give semantics of over deterministic SCMs, i.e., those in which . Thus there are deterministic such that for all . Consider with one exogenous variable such that , where for all the probability that is . Define as for any endogenous setting . When , the structural equations of and coincide, so . Conversely, if , then : otherwise . Thus and for all . Clearly can be made to satisfy Def. 3. ∎

As for the case, Lem. 6 must be modified, but the proof is the same (no elements of are jointly satisfiable):

###### Lemma 9.

Suppose is a conjunction of literals. Let be the -atoms appearing in (i.e. its subformulas of the form ) and let be the antecedents of any conditionals appearing in . Let and let . Let . Then there are polynomial terms in the variables such that is provably-in- equivalent to a conjunction

 ⋀δ∈Δ⊥P(δ)≡0–∧⋀δ∈Δ∖Δ⊥P(δ)⩾0–∧⋀1≤i≤l∑δprop∈ΔpropP([αi]δ% prop)≡1–∧⋀1≤i≤mti⩽0–∧⋀1≤i≤m′t′i>