# Interventions and Counterfactuals in Tractable Probabilistic Models: Limitations of Contemporary Transformations

In recent years, there has been an increasing interest in studying causality-related properties in machine learning models generally, and in generative models in particular. While that is well motivated, it inherits the fundamental computational hardness of probabilistic inference, making exact reasoning intractable. Tractable probabilistic models have also recently emerged, which guarantee that conditional marginals can be computed in time linear in the size of the model, where the model is usually learned from data. Although initially limited to low tree-width models, recent tractable models such as sum product networks (SPNs) and probabilistic sentential decision diagrams (PSDDs) exploit efficient function representations and also capture high tree-width models. In this paper, we ask the following technical question: can we use the distributions represented or learned by these models to perform causal queries, such as reasoning about interventions and counterfactuals? By appealing to some existing ideas on transforming such models to Bayesian networks, we answer mostly in the negative. We show that when transforming SPNs to a causal graph, interventional reasoning reduces to computing marginal distributions; in other words, only trivial causal reasoning is possible. For PSDDs the situation is only slightly better. We first provide an algorithm for constructing a causal graph from a PSDD, which introduces augmented variables. Intervening on the original variables, once again, reduces to marginal distributions, but when intervening on the augmented variables, a deterministic but nonetheless causal semantics can be provided for PSDDs.

## 1 Introduction

In recent years, there has been an increasing interest in studying causality-related properties in machine learning models. For example, [kansky2017schema] argue for the ability to assess past observations and explain away alternative causes in deep reinforcement learning methods. In [DBLP:journals/corr/abs-1811-10597], the question of what units are responsible for controlling and manipulating certain features within an image is considered. In [DBLP:journals/corr/abs-1812-03253], strategies to give a causal interpretation to the intrinsic structure of deep learning models are investigated. Broadly speaking [pearl2019seven], the motivation stems from extending the query and reasoning capabilities over probabilistic domains. That is, in standard probabilistic models, one is simply interested in conditioning on observations: e.g., what is the likelihood of lung inflammations given that the patient smokes? Causal reasoning allows us to reason about interventions: e.g., how are lung inflammations affected when the patient reduces the amount of tobacco smoked in a day? Counterfactual queries allow us to directly reason about alternate worlds: e.g., what state would the patient’s lung inflammations be in had he not smoked in the previous year? Thus, causal reasoning allows us to inspect our domain model much more comprehensively than is possible by observational conditioning alone.

A fundamental challenge underlying stochastic models, however, is the intractability of inference [article]. This has led to the paradigm of tractable probabilistic models, where conditional or marginal distributions can be computed in time linear in the size of the model. Although initially limited to low tree-width models [bach2002thin], recent tractable models such as sum product networks (SPNs) [6130310, SPN_structure_learning] and probabilistic sentential decision diagrams (PSDDs) [kisa2014probabilistic] are derived from arithmetic circuits (ACs) and knowledge compilation approaches, more generally [darwiche2002logical, choi2017relaxing], which exploit efficient function representations and also capture high tree-width models. These models can also be learnt from data [Peharz2014LearningSS, kisa2014probabilistic] which leverage the efficiency of inference. Consider that in classical structure learning approaches for graphical models, once learned, inference would have to be approximated, owing to its intractability. In that regard, such models offer a robust and tractable framework for learning and inferring from data.

Naturally, owing to such attractive properties, the theoretical underpinnings of such models have received considerable attention. On the one hand, when viewed from a knowledge representation angle, they are related to tractable representations of Boolean functions, including BDDs [Akers:1978:BDD:1310167.1310815] and d-DNNFs [ddnnf, kisa2014probabilistic]. On the other hand, from a probabilistic modeling perspective, they can be derived as instances of ACs, providing a tractable representation of probabilistic reasoning, owing to the fact that ACs can compactly represent and compute the network polynomial of a Bayesian network (BN) [Darwiche2000ADA]. In the presence of latent variables they can also be seen as a deep architecture with probabilistic semantics [6130310], leading to numerous extensions, e.g., for mixed discrete-continuous domains [inproceedings18], and applications, including preference ranking [Choi2015TractableLF], classification [Liang2018LearningLC] [6130310, article17]. Owing to their clear probabilistic semantics, the expressive power of such deep models is studied in [NIPS2011_4350], and the relationship between SPNs and BNs is further analyzed in [Zhao:2015:RSN:3045118.3045132].

In this paper we push the envelope on this inquiry towards the following objective: can tractable models offer not only a computationally attractive but also compelling alternative to standard graphical models, especially when reasoning about causality? In fact, on studying the relationship between SPNs and BNs [Zhao:2015:RSN:3045118.3045132], the authors conclude with:

“The structure of the resulting BNs can be used to study probabilistic dependencies and causal relationships between the variables of the original SPNs.”

Unfortunately, we answer in the negative for SPNs and PSDDs.

For SPNs, using the transformation from [Zhao:2015:RSN:3045118.3045132] we are going to show that the resulting graph is not sufficient for studying causal relationships between the variables. Roughly, the problem is that this class of models allows for a lot of expressive freedom, and, because of that, all the correlations between the variables are attributed to external latent factors. Next, for PSDDs, we first provide an algorithm for constructing a causal graph that also needs to introduce some augmented variables, which conforms to PSDD on all probabilistic queries. On the one hand, intervening on the original variables in the resulting causal graph is also uninteresting, and reduces to computing marginal distributions, like in SPNs. However we can perform non-trivial counterfactuals on the augmented variables. This is possible because, in contrast to SPNs, PSDDs impose more restrictions on the structure of the resulting model, specifically in terms of its equivalence to a propositional formula, which we then can use to recover a structural equation model (SEM) [Pearl2009CausalII]. We note that this structure is of a somewhat “deterministic” nature, and so, in a sense, the result is also negative. Nonetheless, we can provide a “causal semantics” for PSDDs in the process.

We reiterate that our focus is purely on the distributions represented or learnt using tractable probabilistic models, and specifically SPNs and PSDDs. These models do not come with any guarantees that the dependencies learnt actually capture the underlying causal process of the domain (in contrast to approaches such as [DBLP:conf/icml/GhassamiSKB18]). Throughout the rest of our analysis, we suppose that the causal graph is not known beforehand and our aim is to examine what kind of information can be recovered using the trained SPNs or PSDDs. Our results demonstrate that regardless of whether these models do capture causal information, performing causal reasoning on them is not immediately attractive.

The paper is organized as follows. We first consider the SPN case, and then move on to the PSDD case. We conclude with some discussion.

## 2 Background

We will briefly review PSDDs, SPNs and SEMs, and we refer the reader to [kisa2014probabilistic, 6130310, 10.2307/3541871] for more extensive discussions.

Our notation will be as follows: An uppercase letter $X$ denotes a Boolean random variable. In the context of a probabilistic statement, we will use a lowercase letter $x$ to denote an assignment to $X$; for example, $\Pr(X = x)$, where $x \in \{0, 1\}$, denotes the probability of the event where $X$ is assigned the value $x$. In the context of a logical formula, $X$ and $\neg X$ respectively assign true ($\top$) and false ($\bot$) to variable $X$. Sets of variables and joint assignments are denoted in bold.

PSDDs.    The idea behind PSDDs is to use Sentential Decision Diagrams (SDDs) [inproceedings] to represent a propositional logic theory, and then recursively define a probability distribution over it. Terminal nodes can be either a literal, $\top$, or $\bot$, while decision (intermediate) nodes are of the form $(p_1 \wedge s_1) \vee \dots \vee (p_k \wedge s_k)$, where the $p_i$ are called primes and the $s_i$ subs. The primes form a partition, meaning they are mutually exclusive and their disjunction is valid. Each prime $p_i$ in a decision node is assigned a non-negative parameter $\theta_i$ such that $\sum_{i=1}^{k} \theta_i = 1$ and $\theta_i = 0$ if and only if $s_i = \bot$. Additionally, each terminal node corresponding to $\top$ has a parameter $\theta$ such that $0 < \theta < 1$. (The notion of a vtree, defined in [vtree], is needed to fully define an SDD; vtrees can be obtained directly from data or by compiling domain constraints [Liang2017LearningTS].) Using this notation, a PSDD node $n$ defines a distribution $\Pr_n$ over the variables of the vtree node that it is normalized for, as follows.

• If $n$ is a terminal node, and has variable $X$, then

| $n$ | $\Pr_n(X)$ | $\Pr_n(\neg X)$ |
|---|---|---|
| $\top$ | $\theta$ | $1 - \theta$ |
| $\bot$ | $0$ | $0$ |
| $X$ | $1$ | $0$ |
| $\neg X$ | $0$ | $1$ |

• If $n$ is a decision node and has left children $\mathbf{X}$ and right children $\mathbf{Y}$, then $\Pr_n(\mathbf{x}\mathbf{y}) = \Pr_{p_i}(\mathbf{x}) \cdot \Pr_{s_i}(\mathbf{y}) \cdot \theta_i$, for $i$ where $\mathbf{x} \models p_i$, and where $\Pr_{p_i}, \Pr_{s_i}$ denote the distributions of the PSDD nodes corresponding to $p_i, s_i$, respectively.

SPNs.   SPNs are rooted directed graphical models that provide an efficient way of representing the network polynomial [Darwiche2000ADA] of a BN [6130310], as a multilinear function $\sum_{\mathbf{x}} f(\mathbf{x}) \prod_{n=1}^{N} \mathbb{1}_{x_n}$. Here $f$ is the (possibly unnormalized) probability distribution of the BN, $\mathbf{x}$ is a vector containing all the variables of the model, i.e., $\mathbf{x} = (x_1, \dots, x_N)$, the summation is over all possible states, and $\mathbb{1}$ is the indicator function. An SPN over Boolean variables $X_1, \dots, X_N$ has leaves corresponding to indicators $\mathbb{1}_{x_n}$ and $\mathbb{1}_{\bar{x}_n}$, and its internal nodes are sums and products.

Any edge exiting a sum node has a non-negative weight assigned to it. The value of a product node is the product of its children, while the value of a sum node $i$ is a weighted sum of its children, $\sum_{j \in Ch(i)} w_{ij} S_j$, where $Ch(i)$ is the set containing the children of node $i$, and $S_j$ is the sub-SPN rooted at node $j$. SPNs can represent a wide class of models, including weighted mixtures of univariate distributions; see [6130310] for discussions.
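The bottom-up evaluation just described can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the classes `Leaf`, `Product`, `Sum`, the helper `bernoulli`, and all weights are hypothetical choices made for this example. Marginalizing a variable amounts to setting both of its indicators to 1, which is why a marginal costs a single pass over the network.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    var: str    # variable name
    value: int  # the indicator fires when var == value


@dataclass
class Product:
    children: list


@dataclass
class Sum:
    children: list  # list of (weight, child) pairs


def evaluate(node, evidence):
    """Value of the SPN rooted at `node`.

    Variables missing from `evidence` are marginalized out by letting
    both of their indicators evaluate to 1.
    """
    if isinstance(node, Leaf):
        if node.var not in evidence:
            return 1.0
        return 1.0 if evidence[node.var] == node.value else 0.0
    if isinstance(node, Product):
        result = 1.0
        for child in node.children:
            result *= evaluate(child, evidence)
        return result
    # Sum node: weighted sum of the children.
    return sum(w * evaluate(child, evidence) for w, child in node.children)


def bernoulli(var, p_one):
    """Univariate distribution written as a sum over two indicators."""
    return Sum([(p_one, Leaf(var, 1)), (1.0 - p_one, Leaf(var, 0))])


# Hypothetical SPN: a weighted mixture of two product distributions.
spn = Sum([
    (0.3, Product([bernoulli("X1", 0.9), bernoulli("X2", 0.8)])),
    (0.7, Product([bernoulli("X1", 0.2), bernoulli("X2", 0.1)])),
])

joint = evaluate(spn, {"X1": 1, "X2": 1})   # Pr(X1=1, X2=1)
marginal = evaluate(spn, {"X1": 1})         # Pr(X1=1), X2 summed out
partition = evaluate(spn, {})               # normalization constant
```

Here the root sum node plays the role of a two-state latent variable, which is the reading of sum nodes that the SPN-to-BN transformation discussed later relies on.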

Causality.   We base our causal analysis on SEMs [Pearl2009CausalII], which provide an effective way to encode dependencies between variables, as well as allow for queries regarding interventions and counterfactuals. In this setting we can represent a set of probabilistic dependencies through a BN, as usual, but on top of that we can also encode the specific mechanism that determines the value of each variable. In this sense, it is more general than just having a BN, since we not only possess a distribution over the variables, but also a set of (either stochastic or deterministic) equations. In what follows we denote by $\mathcal{V}$ the set of variables that are internal to the model, and by $\mathcal{U}$ the exogenous or external variables (that act as random, latent factors). We use $\mathcal{R}$ to denote the set containing the plausible values of each variable. Every endogenous (internal) variable is assigned an equation determining its value as a function of both its endogenous and exogenous parents in the BN, called a structural equation. Finally, in what follows, we make the standard assumption that these BNs do not contain any directed cycle, so they are equivalently referred to as directed acyclic graphs (DAGs).

###### Definition 1.

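A causal model is a pair $(\mathcal{S}, \mathcal{F})$, where $\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$ is a signature and $\mathcal{F}$ is a set of structural equations.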

One of the advantages of employing graphical models is that by just utilizing the topology of the graph we can answer probabilistic queries, such as whether two sets of variables are dependent.

###### Definition 2.

A path $p$ is d-separated (blocked) by a set of nodes $\mathbf{Z}$ iff one of the following holds:

1. It contains a triple $i \rightarrow m \rightarrow j$ or $i \leftarrow m \rightarrow j$, such that $m \in \mathbf{Z}$.

2. It contains a triple $i \rightarrow m \leftarrow j$, such that neither $m$ nor any of its descendants are in $\mathbf{Z}$.

Two sets $\mathbf{X}, \mathbf{Y}$ are d-separated by $\mathbf{Z}$ if and only if every path between any two nodes $X \in \mathbf{X}$, $Y \in \mathbf{Y}$ is blocked by $\mathbf{Z}$. It is a well established result that if two nodes are d-separated by a set $\mathbf{Z}$, then they are conditionally independent (where $\mathbf{Z}$ is the conditioning set). As we mentioned earlier, SEMs allow for studying interventional distributions, meaning the distribution of a set of variables after we force a second set of variables to attain certain values. We denote the distribution of $\mathbf{Y}$ after the intervention $\mathbf{X} = \mathbf{x}$ by $\Pr(\mathbf{Y} \mid do(\mathbf{X} = \mathbf{x}))$ or $\Pr(\mathbf{Y} \mid do(\mathbf{x}))$. In order to study such probabilistic statements we transform the original DAG corresponding to our model by deleting all the edges pointing towards $\mathbf{X}$, set $\mathbf{X}$ to $\mathbf{x}$, and then proceed with the analysis. What follows is an essential graphical tool for deciding under what conditions we can reduce interventional queries to conditional ones. Here, $G_{\overline{\mathbf{X}}}$ denotes the graph obtained after deleting all the edges pointing to $\mathbf{X}$, $G_{\underline{\mathbf{X}}}$ the one resulting after deleting all the edges emerging from $\mathbf{X}$, and $G_{\overline{\mathbf{X}}\underline{\mathbf{Z}}}$ the one resulting after deleting both the edges pointing to $\mathbf{X}$ and the edges emerging from $\mathbf{Z}$.

###### Definition 3.

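(Rules of do-Calculus) Let $G$ be a DAG corresponding to a SEM and $\Pr$ the probability measure induced by it. If $\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{W}$ are disjoint sets of nodes, then the following hold:

• Rule 1: $\Pr(\mathbf{y} \mid do(\mathbf{x}), \mathbf{z}, \mathbf{w}) = \Pr(\mathbf{y} \mid do(\mathbf{x}), \mathbf{w})$ if $(\mathbf{Y} \perp \mathbf{Z} \mid \mathbf{X}, \mathbf{W})$ in $G_{\overline{\mathbf{X}}}$.

• Rule 2: $\Pr(\mathbf{y} \mid do(\mathbf{x}), do(\mathbf{z}), \mathbf{w}) = \Pr(\mathbf{y} \mid do(\mathbf{x}), \mathbf{z}, \mathbf{w})$ if $(\mathbf{Y} \perp \mathbf{Z} \mid \mathbf{X}, \mathbf{W})$ in $G_{\overline{\mathbf{X}}\underline{\mathbf{Z}}}$.

• Rule 3: $\Pr(\mathbf{y} \mid do(\mathbf{x}), do(\mathbf{z}), \mathbf{w}) = \Pr(\mathbf{y} \mid do(\mathbf{x}), \mathbf{w})$ if $(\mathbf{Y} \perp \mathbf{Z} \mid \mathbf{X}, \mathbf{W})$ in $G_{\overline{\mathbf{X}}\,\overline{\mathbf{Z}(\mathbf{W})}}$, where $\mathbf{Z}(\mathbf{W})$ is the set of $\mathbf{Z}$-nodes that are not ancestors of any $\mathbf{W}$-node in $G_{\overline{\mathbf{X}}}$.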

## 3 Main Results

### 3.1 The SPN Case

As discussed, SPNs are an elegant formalism for capturing weighted mixtures of distributions, and so the relative expressive power of SPNs and standard BNs has been of considerable interest. The question of how to transform SPNs to BNs, and the recoverability of the SPN from the transformation, was studied in [Zhao:2015:RSN:3045118.3045132]. For space reasons, we cannot provide too many details on this transformation, but the key idea is to compactly represent the local conditional probability distribution in the corresponding BN by exploiting context-specific independence. Intuitively, we create a node for every observable variable, a latent variable for every sub-SPN, and then draw an arrow from each latent variable to the observable variables corresponding to the scope of the sub-SPN. This procedure yields a bipartite graph with arrows stemming only from latent to observable variables.

To our knowledge this is the only way proposed so far to turn an SPN to a BN, and many subsequent papers on SPNs’ theoretical properties [Peharz2016OnTL] are similar in thrust. And, as stated previously, the authors of [Zhao:2015:RSN:3045118.3045132] were hopeful about the causal expressiveness of their approach.

So we will base our analysis on that approach.

We first make the following technical observation about graphs having this topology.

###### Theorem 4.

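Let $G$ be the DAG associated with a causal model $M$. For any set of nodes $\mathbf{X}$ such that no node in $\mathbf{X}$ has an edge coming out of it, the interventional distribution of the remaining variables equals their joint distribution, i.e. $\Pr(\mathbf{Y} \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y})$, where $\mathbf{Y}$ denotes the nodes of $G$ that are not in $\mathbf{X}$ and $\mathbf{x}$ is any assignment to $\mathbf{X}$. More specifically, we have that any subset of the remaining variables is unaffected by the intervention, i.e. $\Pr(\mathbf{Y}' \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y}')$ for $\mathbf{Y}' \subseteq \mathbf{Y}$.

Proof:    Using the 3rd rule of Pearl’s do-calculus, it suffices to show that $(\mathbf{Y} \perp \mathbf{X})$ in $G_{\overline{\mathbf{X}}}$. By assumption, no edges emanate from nodes in $\mathbf{X}$, which implies that each of them will be isolated in $G_{\overline{\mathbf{X}}}$, so the desired independence holds, meaning that $\Pr(\mathbf{Y} \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y})$. In addition, marginalizing both sides over $\mathbf{Y} \setminus \mathbf{Y}'$, we have that $\Pr(\mathbf{Y}' \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y}')$. ∎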

Unfortunately, since the BN stemming from the algorithm in [Zhao:2015:RSN:3045118.3045132] has no edge coming out of an observable variable, we get the following:

###### Theorem 5.

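The BN, $G$, that results after transforming an SPN using the procedure described in [Zhao:2015:RSN:3045118.3045132] satisfies the property $\Pr(\mathbf{Y} \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y})$, for any subset $\mathbf{X}$ of the observable variables, where $\mathbf{Y}$ denotes the remaining variables.

This result shows that the method proposed in [Zhao:2015:RSN:3045118.3045132] for producing a BN is not useful for causal inference tasks, since every interventional distribution it yields coincides with an observational marginal.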

What are the reasons for this limitation? As has been noted in previous work [Peharz2016OnTL, 6130310], sum nodes in SPNs can be interpreted as marginalized, latent, variables, whose values correspond to the children of the sum node. Thus, when an SPN is turned into a BN all of the variables within the scope of a sum node are treated as children of a latent variable. This leads to every probabilistic dependency being attributed to an unobserved confounder, and there is no edge between the SPN variables. Thus, it is reasonable that any intervention on a subset of the observable variables would not affect the rest, because the mechanism encoded in the graph tells that no variable has any causal effect on the others.
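A small numerical sketch makes the limitation concrete. It assumes a hypothetical bipartite BN with one latent variable `H` and two observable children `X1`, `X2`; all probabilities are made up for illustration. The interventional distribution is computed with the truncated factorization, i.e. by dropping the factor of the intervened variable and clamping its value.

```python
# Hypothetical parameters of a latent -> observable bipartite BN.
p_h = {1: 0.3, 0: 0.7}            # Pr(H = h)
p_x1_given_h = {1: 0.9, 0: 0.2}   # Pr(X1 = 1 | H = h)
p_x2_given_h = {1: 0.8, 0: 0.1}   # Pr(X2 = 1 | H = h)


def bern(p_one, value):
    """Probability of `value` under a Bernoulli with Pr(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(h, x1, x2):
    return p_h[h] * bern(p_x1_given_h[h], x1) * bern(p_x2_given_h[h], x2)


def marginal_x2(x2):
    return sum(joint(h, x1, x2) for h in (0, 1) for x1 in (0, 1))


def conditional_x2_given_x1(x2, x1):
    numerator = sum(joint(h, x1, x2) for h in (0, 1))
    denominator = sum(joint(h, x1, v) for h in (0, 1) for v in (0, 1))
    return numerator / denominator


def interventional_x2_do_x1(x2, x1):
    # Truncated factorization: the factor Pr(X1 | H) is removed and X1 is
    # clamped to x1. Since X1 has no children, x1 never enters the sum.
    return sum(p_h[h] * bern(p_x2_given_h[h], x2) for h in (0, 1))


observational = conditional_x2_given_x1(1, 1)   # Pr(X2=1 | X1=1)
interventional = interventional_x2_do_x1(1, 1)  # Pr(X2=1 | do(X1=1))
prior = marginal_x2(1)                          # Pr(X2=1)
```

Conditioning on `X1` shifts the belief about `X2` through the latent parent, whereas intervening on `X1` leaves `X2` at its marginal, as the theorems above state for graphs of this shape.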

A special class of SPNs, referred to as selective SPNs were introduced recently [Peharz2014LearningSS]. They impose determinism in that only one of the children of a sum node can be true for any given variable assignment. Interestingly, even this stipulation does not remedy the problem, since the discussion in [Peharz2016OnTL] makes clear the resulting BN would still have no edges between the SPN variables. Consequently, we get:

###### Theorem 6.

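The BN, $G$, that results after transforming a selective SPN using the procedure described in [Peharz2016OnTL] satisfies the property $\Pr(\mathbf{Y} \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y})$, for any subset $\mathbf{X}$ of the observable variables, where $\mathbf{Y}$ denotes the remaining variables.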

We suspect that to get rid of this limitation SPNs should be augmented in a way that captures the functional dependency between the variables in the scope of a sum node. Another strategy perhaps is to enable a more expressive way to represent and reason about probabilistic dependencies between the variables, although it is not immediate how this could be made possible.

### 3.2 The PSDD Case

Interestingly, the situation turns out to be slightly better for PSDDs. The intuitive reason is the dependency that can be established between a node in the PSDD and its children. More precisely, consider that each node $n$ in a PSDD [kisa2014probabilistic] has a support, the set of assignments to which it assigns a positive probability, which is related to the support of its children. This set is called the base of $n$ and is denoted by $[n]$. It can also be defined as a logical formula: if $n$ is a decision node with primes $p_i$ and subs $s_i$, then $[n] = \bigvee_{i} [p_i] \wedge [s_i]$. Since the $p_i$’s form a partition, their corresponding bases are disjoint as well, so a decision node can be seen as deciding between different possible worlds, based on which prime base was satisfied. Since the prime bases of a node form a partition, we can apply the law of total probability and Proposition 1 from [liang2017learning] to get that $\Pr_n(\mathbf{x}\mathbf{y}) = \sum_{i} \Pr_n(\mathbf{x}\mathbf{y} \mid [p_i]) \Pr_n([p_i])$. Combining this expression with the semantics provided in Proposition 1 in [liang2017learning] and Theorem 2 in [kisa2014probabilistic], as well as the fact that under any given assignment the only non-zero term of the form $\Pr_n(\mathbf{x}\mathbf{y} \mid [p_i])$ is the one for which $\mathbf{x} \models p_i$, we see that the probability of a node is not a mixture over its children (as in SPNs). Indeed, the distribution of a decision node is understood very differently. In fact, we can also see that PSDD nodes do not condition on a latent variable, but on their prime bases instead, which do not depend on unobserved quantities.

Our work builds on this observation and the fact that by construction PSDDs are probabilistic extensions of SDDs, which, in turn, denote a propositional formula. Basically, we use that formula to create an augmented set of variables, not just the original ones the PSDD used for training, in such a way so the PSDD distribution and the BN one agree on the original variables. It is worth noting that the resulting BN is also equipped with a set of equations that determine the value of the children as functions of their parents, so we end up having a SEM. Below we present the procedure to construct this SEM, where the input propositional formula is the one represented by the trained PSDD.

About the hidden variable: The latent variable, H, is the only component of the graph that is purely stochastic, and we motivate its need here.

Note that any instantiation of it is enough to determine all the other variables in the model. Conversely, each probabilistic query about any of the rest of the variables can be reduced to another query relying solely on $\mathbf{H}$ (since there is no other source of randomness in the model). Its dimension is equal to the number of the original variables in the PSDD and its distribution is equal to the PSDD distribution of the original variables. Denoting by $\Pr_G$ the probability measure over the DAG’s variables and by $\Pr_{PSDD}$ the PSDD distribution over the original variables, we set these two measures to satisfy the following condition: $\Pr_G(\mathbf{H} = \mathbf{h}) = \Pr_{PSDD}(\mathbf{X} = \mathbf{h})$. Suppose the PSDD is comprised of $n$ variables, $X_1, \dots, X_n$; then $\mathbf{H} = (H_1, \dots, H_n)$. The structural equations connecting them are:

$$X_i = H_i, \quad i = 1, \dots, n$$

Looking at these equations we see that $\Pr_G(\mathbf{X} = \mathbf{x}) = \Pr_G(\mathbf{H} = \mathbf{x}) = \Pr_{PSDD}(\mathbf{X} = \mathbf{x})$. This remark assures us about the consistency between the PSDD and the SEM distribution of the original variables. We would also like to note that although $\mathbf{H}$ is introduced as a vector, it could be rewritten as a single categorical variable with an exponential number of states, each one corresponding to a different configuration of the original variables. We present this result in a more formal way, using the vectorized version of $\mathbf{H}$.

###### Theorem 7.

Let P be a PSDD over variables $X_1, \dots, X_n$ and let $G$ be the DAG resulting from Algorithm 1. The distribution of $X_1, \dots, X_n$ induced by $G$ is equal to their PSDD distribution, meaning that $\Pr_G(X_1, \dots, X_n) = \Pr_{PSDD}(X_1, \dots, X_n)$.

Interestingly, the SEM obtained from a PSDD in this manner has the same limitations as identified for SPNs when intervening on the original variables:

###### Theorem 8.

The SEM, $M$, that results after applying Algorithm 1 to a PSDD compiled formula satisfies the property $\Pr(\mathbf{Y} \mid do(\mathbf{X} = \mathbf{x})) = \Pr(\mathbf{Y})$, where $\mathbf{X}$ is any subset of the original variables, and $\mathbf{Y}$ denotes the rest of the original variables.

Proof:    We are going to use the 3rd rule of Pearl’s do-calculus, so it is enough to show that any path between the original variables is blocked. Let $X$ be the variable we intervene on and let $Y$ be any of the rest of the original variables. We have to show that $(Y \perp X)$ in $G_{\overline{X}}$. By construction, since the BN is created using Algorithm 1, there are no edges between the original variables. Furthermore, no original variable is a descendant of another one, since the only parent of an original variable is the latent variable. This means that, in $G_{\overline{X}}$, all the paths connecting $X$ and $Y$ contain v-structures, so they are blocked and the 3rd rule is satisfied. Since $X$ was chosen arbitrarily, we can generalize this result to arbitrary subsets of the original variables, concluding the proof. ∎

However, when intervening on the augmented variables, we are able to enable non-trivial (but also non-standard) causal reasoning, a point we return to shortly.

Moreover, $\mathbf{H}$ serves another purpose. Using the BN from Algorithm 1 without including the hidden variable, it is not difficult to see that the original PSDD variables are independent, since all the paths connecting them are blocked by v-structures, meaning that (2) in Definition 2 is satisfied with $\mathbf{Z} = \emptyset$. On the other hand, it is not necessarily the case that the PSDD distribution encodes such properties about the variables, so there is a chance that the BN distribution enforces independences that do not agree with the PSDD one, rendering the DAG unfaithful [Pearl:2009:CMR:1642718]. By including the hidden variable we eliminate this behaviour, but we introduce a new property at the other extreme: all of the variables are dependent. This might not be the actual case either, but we think that it is safer to assume dependence among the variables than independence, which is a fairly strong assumption. A better way to address this behaviour would be to utilize the PSDD distribution and some independence tests in order to decide the subsets of dependent variables, and then use as many hidden variables as there are dependent subsets, so that we explicitly encode only the dependencies that are implied by the PSDD distribution. (Incidentally, such tests are used when learning SPNs [SPN_structure_learning].) In this work we are mostly interested in introducing the connection between BNs and PSDDs, so we leave this for future research.

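We should also note that for any node $X$ in the graph resulting from Algorithm 1, denoting the set of its parents as $PA_X$, we have:

$$\Pr(X = 1 \mid PA_X) = \begin{cases} 1 & \text{if assignments in } PA_X \text{ render } X = 1 \\ 0 & \text{otherwise} \end{cases}$$

Building on top of this remark, the distribution of $X$ given the specification of any partial subset $\mathbf{V}$ of its parents is as follows:

$$\Pr(X = 1 \mid \mathbf{V}) = \begin{cases} 1 & \text{if assignments in } \mathbf{V} \text{ render } X = 1 \\ 0 & \text{if assignments in } \mathbf{V} \text{ render } X = 0 \\ \Pr(X|_{\mathbf{V}} = 1) & \text{otherwise} \end{cases}$$

where $X|_{\mathbf{V}}$ denotes the formula that results from $X$ after substituting the assignments from $\mathbf{V}$ in it. Finally, the marginal distribution of $X|_{\mathbf{V}}$, for example, can be computed by using the PSDD.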

Example:  We will give an example of how to construct a SEM using a PSDD. We start by studying the PSDD in Figure 1. This is the PSDD corresponding to a problem considered in [kisa2014probabilistic]. The setting is that there is a department offering four courses: Logic (L), Knowledge Representation (K), Probability (P), and Artificial Intelligence (A). Students must enroll in them, but at the same time they have to obey the following constraints: $P \vee L$, $A \Rightarrow P$, $K \Rightarrow (A \vee L)$ (where implication means that if they enroll in the LHS, then they must enroll in the RHS). The objective is to learn the joint distribution of $L, K, P, A$ using a dataset of student enrollments and the above constraints. The authors utilize PSDDs to perform this task and the resulting model can be seen in Figure 1.

Starting from the bottom of Figure 1 and moving towards the root, we see that it corresponds to the following propositional formula:

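$$
\begin{aligned}
&\big(((\neg L \wedge K) \vee (L \wedge \bot)) \wedge ((P \wedge A) \vee (\neg P \wedge \bot))\big) \\
\vee\; &\big(((L \wedge K) \vee (\neg L \wedge \top)) \wedge ((\neg P \wedge \neg A) \vee (P \wedge A))\big) \\
\vee\; &\big(((\neg L \wedge \neg K) \vee (L \wedge \bot)) \wedge ((P \wedge A) \vee (\neg P \wedge \bot))\big)
\end{aligned}
$$

This is the raw form of the formula, so some terms are tautologically false. Rewriting the above expression after eliminating inconsistencies yields the following:

$$
\big((\neg L \wedge K) \wedge (P \wedge A)\big) \vee \big((L \wedge K) \wedge ((\neg P \wedge \neg A) \vee (P \wedge A))\big) \vee \big((\neg L \wedge \neg K) \wedge (P \wedge A)\big) \tag{$\star$}
$$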

Algorithm 1 takes (⋆) as input and constructs a SEM as follows. The first step is to create a node corresponding to the whole expression. Then, since (⋆) is composed of three disjuncts, we make three new variables, one for each of them, and draw arrows from them pointing to the first variable. We continue this procedure recursively; so for example, the term $(\neg L \wedge K) \wedge (P \wedge A)$ is made from two formulas that are connected with a conjunction, so we create two new nodes, one for $\neg L \wedge K$ and one for $P \wedge A$, and draw arrows from them towards the node representing their conjunction. Now we have reached the point where the formulas under consideration are just conjunctions of literals, so if we look at $\neg L \wedge K$, we make a node for the variable $L$ (although it is $\neg L$ that is part of the formula) and one for $K$. We repeat the above procedure until we go through all the formulas, and in the end we create an additional latent variable that is a parent of all the original PSDD variables, here $L, K, P, A$. The resulting BN can be seen in Figure 2 (Left). It is also worth noting that since we create at most two new nodes for any disjunction or conjunction, the size of the BN is linear in their number. Furthermore, looking at the procedure described above, we see that Algorithm 1 is not directly applicable to SPNs, since sums and products are between distributions, while in PSDDs conjunctions and disjunctions are between variables, which is exactly what Algorithm 1 exploits in order to construct the resulting SEM.
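The recursive construction walked through above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the nested-tuple encoding of formulas, the function `build_dag`, and the node-naming order are all assumptions made for this example, and identical subformulas are assumed to share a single node.

```python
def lit(var, positive=True):
    """A literal; the graph only keeps the variable, not its polarity."""
    return ("lit", var, positive)


def build_dag(formula):
    """Turn a nested and/or formula into (root, nodes, edges, originals).

    Each distinct non-literal subformula becomes one augmented node, each
    literal maps to its original variable, and a latent node "H" is added
    as the parent of every original variable.
    """
    nodes = {}         # subformula -> augmented node name
    edges = set()      # (parent, child) pairs
    originals = set()  # original PSDD variables

    def visit(f):
        if f[0] == "lit":
            originals.add(f[1])
            return f[1]
        if f not in nodes:
            nodes[f] = f"X{len(nodes) + 1}"
            for sub in f[1:]:
                edges.add((visit(sub), nodes[f]))
        return nodes[f]

    root = visit(formula)
    for var in sorted(originals):
        edges.add(("H", var))
    return root, nodes, edges, originals


# The simplified formula of the running example.
PA = ("and", lit("P"), lit("A"))
nPnA = ("and", lit("P", False), lit("A", False))
nLK = ("and", lit("L", False), lit("K"))
LK = ("and", lit("L"), lit("K"))
nLnK = ("and", lit("L", False), lit("K", False))
star = ("or",
        ("and", nLK, PA),
        ("and", LK, ("or", nPnA, PA)),
        ("and", nLnK, PA))

root, nodes, edges, originals = build_dag(star)
```

On this input the sketch creates ten augmented nodes, the same count as in the running example, although the indices it assigns need not match the naming used in the text. No edge connects two original variables, which is the property the proof of Theorem 8 relies on.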

We have kept the names of the original variables the same and have named the rest $X_1, \dots, X_{10}$. In addition, the latent variable $\mathbf{H}$ is a vector of 4 random variables, since the PSDD was over four variables. By construction, it is now apparent that the structural equations of this BN are:

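$$
\begin{aligned}
&A = H_1,\quad L = H_2,\quad K = H_3,\quad P = H_4, \\
&X_1 = P \wedge A,\quad X_2 = \neg P \wedge \neg A,\quad X_3 = \neg L \wedge K,\quad X_4 = L \wedge K,\quad X_5 = \neg L \wedge \neg K, \\
&X_6 = X_1 \vee X_2,\quad X_7 = X_1 \wedge X_3,\quad X_8 = X_4 \wedge X_6,\quad X_9 = X_1 \wedge X_5, \\
&X_{10} = X_7 \vee X_8 \vee X_9
\end{aligned}
$$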
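Because every equation is deterministic, the whole model can be evaluated from a single draw of the latent vector. The sketch below encodes the equations above directly; the function names `solve` and `prob` and the distribution `pr_h` are hypothetical, since the learned PSDD parameters are not reproduced here.

```python
from itertools import product


def solve(h):
    """Evaluate all structural equations for a latent draw h = (H1..H4)."""
    A, L, K, P = h  # A = H1, L = H2, K = H3, P = H4
    x = {"A": A, "L": L, "K": K, "P": P}
    x["X1"] = P & A
    x["X2"] = (1 - P) & (1 - A)
    x["X3"] = (1 - L) & K
    x["X4"] = L & K
    x["X5"] = (1 - L) & (1 - K)
    x["X6"] = x["X1"] | x["X2"]
    x["X7"] = x["X1"] & x["X3"]
    x["X8"] = x["X4"] & x["X6"]
    x["X9"] = x["X1"] & x["X5"]
    x["X10"] = x["X7"] | x["X8"] | x["X9"]
    return x


# Latent configurations (A, L, K, P) under which the root formula holds.
models = [h for h in product((0, 1), repeat=4) if solve(h)["X10"] == 1]

# Hypothetical latent distribution supported on those configurations.
pr_h = dict(zip(models, (0.1, 0.2, 0.3, 0.4)))


def prob(var, value, pr_h):
    """Pr(var = value): total mass of the latent draws that produce it."""
    return sum(p for h, p in pr_h.items() if solve(h)[var] == value)


p_x4 = prob("X4", 1, pr_h)   # Pr(L and K)
p_x10 = prob("X10", 1, pr_h)
```

Every query about an augmented variable reduces to summing latent mass, which is the sense in which the latent vector is the only source of randomness in the model.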

We can immediately see that an intervention on one of the augmented variables will result in a non-trivial interventional distribution. We expand on this point in the next section.

### 3.3 Interventions

In this section we state a result connecting the interventional to the observational distribution. The idea is that, since there is no noise in the graph, every augmented variable is a deterministic function of its parents. In turn, this reduces the problem of estimating interventional queries to a simpler one, which is essentially a problem of counting all the possible assignments of the parents of the intervened variable that lead to the intervened variable getting the corresponding value. This transforms the question from a statement of the form “how probable is it to observe $Y = y$, given an intervention $do(X = x)$” to a statement of the form “how probable is it to observe $Y = y$ and $X = x$ simultaneously”. Formally, we have the following:

###### Theorem 9.

Let $G$ be the DAG resulting from Algorithm 1, and let $X, Y$ be two augmented variables. Then for any intervention $do(X = x)$, we have that $\Pr(Y = y \mid do(X = x)) = \Pr(Y = y, X = x)$.

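Proof:    We are going to base our proof on the back-door criterion [Pearl:2009:CMR:1642718], adjusting for the parents of $X$, $X_1, \dots, X_N$. By doing that, we can rewrite our expression as follows:

$$
\begin{aligned}
\Pr(Y = y \mid do(X = x)) &= \sum_{x_1, \dots, x_N} \Pr(Y = y \mid X = x, X_1 = x_1, \dots, X_N = x_N) \cdot \Pr(X_1 = x_1, \dots, X_N = x_N) \\
&= \sum_{x_1, \dots, x_N : X = x} \Pr(Y = y \mid X_1 = x_1, \dots, X_N = x_N) \cdot \Pr(X_1 = x_1, \dots, X_N = x_N) \\
&= \sum_{x_1, \dots, x_N : X = x} \Pr(Y = y, X_1 = x_1, \dots, X_N = x_N) = \Pr(Y = y, X = x)
\end{aligned}
$$

The first equality is due to the back-door criterion. The second holds because, since $X$ is a deterministic function of its parents, all the assignments of $X_1, \dots, X_N$ that do not result in $X = x$ make the term $\Pr(Y = y \mid X = x, X_1 = x_1, \dots, X_N = x_N)$ equal to zero. On the other hand, all the assignments that result in $X = x$ lead to $\Pr(Y = y \mid X = x, X_1 = x_1, \dots, X_N = x_N) = \Pr(Y = y \mid X_1 = x_1, \dots, X_N = x_N)$, since now the condition $X = x$ is redundant. ∎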

The above formula clearly gives rise to a non-trivial distribution, although its usefulness and the overall utility of performing causal analysis based on it, should probably be assessed depending on the application. In the next section we demonstrate how counterfactual queries can be estimated using the output of Algorithm 1.

### 3.4 Counterfactuals

In this section we examine whether the BN from Algorithm 1 can be used to compute counterfactual quantities. We mostly investigate counterfactuals conditioned on some evidence, which amounts to computing probabilistic statements of the form $\Pr(Y_{x} = y \mid e)$. These statements can be handled using the following major result [Pearl:2009:CMR:1642718]:

###### Theorem 10.

Let $M$ be a causal model and $\Pr$ a probability measure over the variables in $M$. The counterfactual probability $\Pr(Y_{x} = y \mid e)$, meaning “Had $X$ been $x$, then $Y$ would have been $y$”, given evidence $e$, can be computed as follows:

• Abduction: Update the distribution $\Pr$ by incorporating the evidence $e$, to obtain $\Pr(\cdot \mid e)$.

• Action: Construct the graph that results from the intervention $do(X=x)$.

• Prediction: Use the probability measure and the graph from the previous steps to compute the probability of $Y=y$.
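The three steps above can be sketched for a small deterministic model as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the structural equations (`A = U1`, `P = U2`, `X1 = A and P`), the function names `solve` and `counterfactual`, and the prior values are all hypothetical.

```python
from itertools import product

def solve(u, do=None):
    """Evaluate the (assumed) structural equations for exogenous values u.

    Variables listed in `do` are clamped to the given value instead of
    being computed from their parents.
    """
    do = do or {}
    values = {}
    values["A"] = do.get("A", u[0])
    values["P"] = do.get("P", u[1])
    values["X1"] = do.get("X1", values["A"] & values["P"])
    return values

def counterfactual(prior, evidence, do, query):
    """Abduction, action, prediction over an enumerated exogenous prior."""
    # Abduction: keep only the exogenous settings consistent with the evidence.
    consistent = {u: p for u, p in prior.items()
                  if all(solve(u)[k] == v for k, v in evidence.items())}
    z = sum(consistent.values())
    if z == 0.0:
        raise ValueError("evidence has zero probability")
    posterior = {u: p / z for u, p in consistent.items()}
    # Action and prediction: re-evaluate the equations under the intervention
    # and accumulate the posterior mass of the settings satisfying the query.
    return sum(p for u, p in posterior.items()
               if all(solve(u, do)[k] == v for k, v in query.items()))

# Hypothetical independent prior: Pr(U1 = 1) = 0.6, Pr(U2 = 1) = 0.3.
prior = {(a, p): (0.6 if a else 0.4) * (0.3 if p else 0.7)
         for a, p in product([0, 1], repeat=2)}

# "Having observed X1 = 0, what is the probability of X1 = 1 had P been 1?"
result = counterfactual(prior, evidence={"X1": 0}, do={"P": 1}, query={"X1": 1})
print(result)
```

Because the model has no noise, abduction amounts to discarding the exogenous settings that contradict the evidence and renormalising; the remaining two steps are purely logical evaluation.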

Since our model is deterministic, we do not need many probabilistic calculations; we mostly manipulate logical expressions. We continue with our working example to demonstrate how counterfactuals and their properties can be studied. The question of interest is the following: supposing we have observed that $X_1 = 0$, what is the probability that it would have been equal to 1, had $P$ been equal to 1? At this point we would like to emphasize that an intervention on one of the augmented variables corresponds to multiple interventions on the original ones. For example, suppose that later on we decide to intervene on $X_1$ and force it to become equal to zero. In turn, this would mean that we force $A \wedge P$ to become zero. This outcome can be achieved by several assignments of these two variables, namely $(A=0, P=0)$, $(A=0, P=1)$, and $(A=1, P=0)$. This means that a single intervention on $X_1$ induces three interventions on $A$ and $P$ simultaneously. We would also like to emphasize that although $X_1$ belongs to the augmented set of variables, it still has an interpretation relating it to the original variables, as is the case with any of the augmented variables. In this case, $X_1$ just represents the event of taking both courses, $A$ and $P$.
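As a small illustration of this point, the assignments of the original variables that are compatible with forcing the augmented variable to zero can be enumerated directly. The sketch assumes that the augmented variable is the conjunction of `A` and `P`, which is what the structural equations of the example suggest; the variable names in the code are illustrative.

```python
from itertools import product

# Assumption: the augmented variable is X1 = A and P.
# Forcing X1 = 0 is compatible with every (A, P) pair whose conjunction is 0.
compatible = [(a, p) for a, p in product([0, 1], repeat=2) if (a & p) == 0]
print(compatible)  # [(0, 0), (0, 1), (1, 0)]
```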

Formally, we ask for the counterfactual probability that $X_1$ would have been 1 under $do(P=1)$, given the evidence $X_1 = 0$. The first step is to update the distribution of our exogenous variables (in our case, $H_1, \dots, H_4$) by conditioning on the evidence $X_1 = 0$. As we have already discussed, $X_1 = A \wedge P$, so $A$ or $P$ is equal to zero. This means the updated distribution should assign zero probability to the case of $H_1$ and $H_4$ being true at the same time. Since this is the only fact we can recover from the conditioning observation, the posterior and the prior distributions should agree, up to normalization, on all other cases. Thus, we end up with the following posterior:

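$$
\Pr(H_1,H_2,H_3,H_4 \mid X_1=0) =
\begin{cases}
\dfrac{\Pr(H_1,H_2,H_3,H_4,X_1=0)}{\Pr(X_1=0)} = \dfrac{\Pr(H_1,H_2,H_3,H_4)}{\Pr(X_1=0)} & \text{if } H_1=0 \text{ or } H_4=0 \\[2ex]
0 & \text{otherwise}
\end{cases}
$$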

The equality in the upper branch follows from Bayes’ Theorem and the fact that $X_1 = 0$ holds with certainty whenever $H_1 = 0$ or $H_4 = 0$. Next, we construct the graph corresponding to the world where we intervene on $P$ and force it to be true, which is shown in Figure 2 (Right). We then update the structural equations by substituting $P = 1$ into all of them. Since we are not going to make use of all of them in this particular example, we write down only the first few.

$$
A = H_1, \quad L = H_2, \quad K = H_3, \quad P = 1, \quad X_1 = A, \quad X_2 = 0
$$

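Now we are ready to perform all the desired calculations, in our case the probability of $X_1 = 1$ in the causal graph of Figure 2 (Right). All probabilities except those in the final ratio are taken with respect to the updated distribution obtained in the abduction step. We proceed as follows: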

$$
\begin{aligned}
\Pr(X_1=1) &= \Pr(A=1) = \Pr(H_1=1) = \Pr(H_1=1, H_4=0) \\
&= \frac{\sum_{H_2} \sum_{H_3} \Pr(H_1=1, H_2, H_3, H_4=0)}{\Pr(X_1=0)}
\end{aligned}
$$

We immediately see that all of the needed probabilistic quantities can be calculated directly using the PSDD and the correspondence between $H_1, H_2, H_3, H_4$ and $A, L, K, P$.
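The closed form obtained above can be compared against a direct run of the three steps. The sketch below is a sanity check under stated assumptions: the prior over $H_1, \dots, H_4$ is taken to be a product of hypothetical marginals (the numbers are illustrative and not the values from [kisa2014probabilistic]), and the structural equations are assumed to be $A = H_1$, $P = H_4$, $X_1 = A \wedge P$.

```python
from itertools import product

# Hypothetical marginals Pr(H_i = 1); independence is assumed for illustration.
MARGINALS = (0.6, 0.5, 0.2, 0.3)

def prior(h):
    """Prior probability of an exogenous assignment h = (h1, h2, h3, h4)."""
    weight = 1.0
    for value, p in zip(h, MARGINALS):
        weight *= p if value else 1.0 - p
    return weight

def x1(h):
    """Assumed equations: A = H1, P = H4, X1 = A and P."""
    return h[0] & h[3]

worlds = list(product([0, 1], repeat=4))
pr_x1_zero = sum(prior(h) for h in worlds if x1(h) == 0)

# Abduction: condition the exogenous prior on the evidence X1 = 0.
posterior = {h: (prior(h) / pr_x1_zero if x1(h) == 0 else 0.0) for h in worlds}

# Action and prediction: under do(P = 1) we have X1 = A = H1.
three_step = sum(p for h, p in posterior.items() if h[0] == 1)

# Closed form: sum over H2, H3 of Pr(H1 = 1, H2, H3, H4 = 0), divided by Pr(X1 = 0).
closed_form = sum(prior((1, h2, h3, 0))
                  for h2, h3 in product([0, 1], repeat=2)) / pr_x1_zero

print(three_step, closed_form)
```

Both quantities agree, as expected, since the posterior puts no mass on assignments with $H_1 = 1$ and $H_4 = 1$.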

We can also ask more complex counterfactual queries. This time we include the actual numeric values, so that we can compare the various distributions of the variable of interest. The data, taken from [kisa2014probabilistic], is shown here, but the full calculations can be found in the supplementary material. The question this time is: supposing we have witnessed that , meaning there is a student not satisfying the property “he/she has taken both and , while taking neither nor ”, what is the probability of him/her satisfying this property, had been equal to ? We repeat the same steps as before to obtain the probability , which turns out to be equal to . We compare the resulting counterfactual distribution to the conditional and the plain marginal . The results can be seen in Figure 3. It is evident that the counterfactual distribution is vastly different from the others, expanding the semantics of PSDDs in a non-trivial way.

## 4 Discussion and Conclusions

Tractable models are attractive because they offer polynomial-time inference, and hence are gaining in popularity. The theoretical properties of such models have received considerable attention recently. In this work we studied whether these models can also be useful for causal reasoning, and we showed that the results are mostly of a negative nature. For SPNs, we showed that interventional reasoning reduces to computing marginal distributions, so only trivial causal reasoning is possible. For PSDDs, we motivated a way to construct a SEM from a trained PSDD. We showed that when intervening on the original variables the situation is once again uninteresting, but non-trivial properties emerge when augmented variables are considered. While this does provide a causal semantics for PSDDs, we observe that the causal graph is very unusual in lacking noise. So, the overall usefulness of this class of tractable models for causal reasoning is questionable. We would like to reiterate that the thrust of this contribution assumes that the only information we have is the probabilistic circuit. Clearly, if we had the original BN in hand, we would perform causal reasoning directly on that BN. However, starting from the circuit, we show that going to the BN loses information about the underlying mechanisms through which the variables interact with each other, as is evident when using SPNs. For PSDDs, although the outcome of the analysis is about the same, the problem is of a different nature, and it is mostly due to the absence of latent factors. In many cases in causal modeling we tend to include some latent variables in order to account for unobserved background factors, but it is unclear why one should do this for PSDDs. From a causal viewpoint, SPNs and PSDDs also seem to be on opposite sides of the spectrum, the former attributing everything to latent factors, while the latter attributes nothing to them.
To recap, SPN sum nodes define weighted mixtures over their children, while PSDD decision nodes are propositional expressions over them. This difference lies at the core of our results.

We think there are many interesting directions for the future. For example, given our last observation about latent factors, are there tractable models that enable causal graphs lying somewhere in the middle ground? Current structure learning algorithms for tractable models also do not attempt to capture the underlying causal process. In that regard, the quality of the causal graph obtained from Algorithm 1 is only going to be as good as the quality of the PSDD. We think there are three ways to improve this situation, all under the assumption that we are in possession of prior knowledge in terms of certain dependencies and independencies. Firstly, a brute-force (and very likely inefficient) structure learner would first build a PSDD, use Algorithm 1 to recover all the dependencies and interactions between variables, and test whether our prior knowledge is in agreement with what the PSDD has learned. An insufficient model would then be discarded by means of a suitable evaluation metric. Secondly, it is shown in [liang2017learning] that the training of PSDDs can be subjected to logical prior knowledge. It may be possible to extend that approach so that we learn PSDDs that are also subject to independence constraints expressed as probabilistic prior knowledge. Thirdly, and perhaps most significantly, it is worth investigating whether ideas from the existing literature on learning causal relations (e.g., [DBLP:conf/icml/GhassamiSKB18]) can be imported to tractable learners. Of course, such an endeavor would be most useful if we discover ways to augment SPNs and PSDDs in some (clever) way that goes beyond trivial and/or deterministic reasoning. That is perhaps the main open challenge resulting from our work.