# Learning and Testing Causal Models with Interventions

We consider testing and learning problems on causal Bayesian networks as defined by Pearl (Pearl, 2009). Given a causal Bayesian network M on a graph with n discrete variables, bounded in-degree, and bounded "confounded components", we show that O(log n) interventions on an unknown causal Bayesian network X on the same graph, and Õ(n/ϵ^2) samples per intervention, suffice to efficiently distinguish whether X=M or whether there exists some intervention under which X and M are farther than ϵ in total variation distance. We also obtain sample/time/intervention efficient algorithms for: (i) testing the identity of two unknown causal Bayesian networks on the same graph; and (ii) learning a causal Bayesian network on a given graph. Although our algorithms are non-adaptive, we show that adaptivity does not help in general: Ω(log n) interventions are necessary for testing the identity of two unknown causal Bayesian networks on the same graph, even adaptively. Our algorithms are enabled by a new subadditivity inequality for the squared Hellinger distance between two causal Bayesian networks.


## 1 Introduction

A central task in statistical inference is learning properties of a high-dimensional distribution over some variables of interest given observational data. However, probability distributions only capture the association between variables of interest and may not suffice to predict what the consequences would be of setting some of the variables to particular values. A standard example illustrating the point is this: From observational data, we may learn that atmospheric air pressure and the readout of a barometer are correlated. But can we predict whether the atmospheric pressure would stay the same or go up if the barometer readout was forcefully increased by moving its needle?

Such issues are at the heart of causal inference, where the goal is to learn a causal model over some variables of interest, which can predict the result of external interventions on the variables. For example, a causal model on two variables of interest $X$ and $Y$ must determine not only conditional probabilities of the form $\Pr[Y \mid X = x]$, but also interventional probabilities $\Pr[Y \mid do(X = x)]$ where, following Pearl’s notation [Pea09], $do(X = x)$ means that $X$ has been forced to take the value $x$ by an external action. In our previous example, $\Pr[\text{Pressure} \mid do(\text{Barometer} = b)] = \Pr[\text{Pressure}]$ but $\Pr[\text{Barometer} \mid do(\text{Pressure} = p)] \neq \Pr[\text{Barometer}]$, reflecting that the atmospheric pressure causes the barometer readout, not the other way around.
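The gap between conditioning and intervening in the barometer example can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: the two-valued variables, the probability tables, and the helper names `joint` and `pressure_given_barometer` are all illustrative assumptions.

```python
from itertools import product

# Hypothetical two-variable causal model: pressure -> barometer.
P_PRESSURE = {"low": 0.5, "high": 0.5}
# Pr[barometer | pressure]: the readout tracks the pressure 90% of the time.
P_BARO_GIVEN_PRESSURE = {
    "low": {"low": 0.9, "high": 0.1},
    "high": {"low": 0.1, "high": 0.9},
}


def joint(do_barometer=None):
    """Joint distribution over (pressure, barometer), optionally under do(barometer = b)."""
    dist = {}
    for p, b in product(P_PRESSURE, ("low", "high")):
        if do_barometer is None:
            prob_b = P_BARO_GIVEN_PRESSURE[p][b]
        else:
            # The intervention replaces the barometer's mechanism by a constant.
            prob_b = 1.0 if b == do_barometer else 0.0
        dist[(p, b)] = P_PRESSURE[p] * prob_b
    return dist


def pressure_given_barometer(dist, b):
    """Pr[pressure | barometer = b] computed from a joint distribution."""
    total = sum(q for (_, bb), q in dist.items() if bb == b)
    return {p: dist[(p, b)] / total for p in P_PRESSURE}


observed = pressure_given_barometer(joint(), "high")
intervened = pressure_given_barometer(joint(do_barometer="high"), "high")
print(observed)    # conditioning: a high readout makes high pressure likely
print(intervened)  # intervening: pressure keeps its marginal distribution
```

Under these assumed tables, seeing a high readout is strong evidence of high pressure, whereas forcing the needle to "high" leaves the pressure distribution unchanged.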

Causality has been the focus of extensive study, with a wide range of analytical frameworks proposed to capture causal relationships and perform causal inference. A prevalent class of causal models are graphical causal models, going back to Wright [Wri21] who introduced such models for path analysis, and Haavelmo [Haa43] who used them to define structural equation models. Today, graphical causal models are widely used to represent causal relationships in a variety of ways [SDLC93, GC99, Pea09, SGS00, Nea04, KF09].

In our work, we focus on the central model of causal Bayesian networks (CBNs) [Pea09, SGS00, Nea04]. Recall that a (standard) Bayesian network is a distribution over several random variables that is associated with a directed acyclic graph. The vertices of the graph are the random variables over which the distribution is defined, and the graph describes conditional independence properties of the distribution. In particular, every variable is independent of its non-descendants, conditioned on the values of its parents in the graph. A CBN is also associated with a directed acyclic graph (DAG) whose vertices are the random variables on which the distribution is defined. However, a CBN is not a single distribution over these variables but the collection of all possible interventional distributions, defined by setting any subset of the variables to any set of values. In particular, every vertex is both a variable and a mechanism to generate the value of that variable given the values of its parent vertices, and the interventional distributions are defined in terms of these mechanisms.

We allow CBNs to contain both observable and unobservable (hidden) random variables. Importantly, we allow unobservable confounding variables. These are variables that are not observable, yet they are ancestors of at least two observable variables. These are especially tricky in statistical inference, as they may lead to spurious associations.

### 1.1 Our Contributions

Consider the following situations:

1. An engineer designs a large circuit using a circuit simulation program and then builds it in hardware. The simulator predicts relationships between the voltages and currents at different nodes of the circuit. Now, the engineer would like to verify whether the simulator’s predictions hold for the real circuit by doing a limited number of experiments (e.g., holding some voltages at set levels, cutting some wires, etc.). If not, then she would want to learn a model for the system that has sufficiently good accuracy.

2. A biologist is studying the role of a set of genes in migraine. He would like to know whether the mechanisms relating the products of these genes are approximately the same for patients with and without migraine. He has access to tools (e.g., CRISPR-based gene editing technologies [DPL16]) that generate data for gene activation and knockout experiments.

Motivated by such scenarios, we study the problems of hypothesis testing and learning CBNs when both observational and interventional data are available. The main highlight of our work is that we prove bounds on the number of samples, interventions, and time steps required by our algorithms.

To define our problems precisely, we need to specify what we consider to be a good approximation of a causal model. Given $\epsilon \in (0, 1)$, we say that two causal models $M$ and $N$ on a set of variables $\mathbf{V} \cup \mathbf{U}$ (observable and unobservable resp.) are $\epsilon$-close (denoted $\Delta(M, N) \le \epsilon$) if for every subset $\mathbf{S}$ of $\mathbf{V}$ and assignment $\mathbf{s}$ to $\mathbf{S}$, performing the same intervention $do(\mathbf{S} = \mathbf{s})$ on both $M$ and $N$ leads to the two interventional distributions being $\epsilon$-close to each other in total variation distance. Otherwise, the two models are said to be $\epsilon$-far and $\Delta(M, N) > \epsilon$.

Thus, two models $M$ and $N$ are close according to the above definition if there is no intervention which can make the resulting distributions differ significantly. This definition is motivated by the philosophy articulated by Pearl (p. 414, [Pea09]) that “causation is a summary of behavior under intervention”. Intuitively, if there is some intervention that makes $M$ and $N$ behave differently, then $M$ and $N$ do not describe the same causal process. Without having any prior information about the set of relevant interventions, we adopt a worst-case view and simply require that causal models $M$ and $N$ behave similarly for every intervention to be declared close to each other. (To quote Pearl again, “It is the nature of any causal explanation that its utility be proven not over standard situations but rather over novel settings that require innovative manipulations of the standards.” (p. 219, [Pea09]).)
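For very small models, this worst-case distance over interventions can be computed by brute force, which helps build intuition for why two models can look alike observationally and still be far apart. The sketch below assumes fully observable models over binary variables; the model encoding and the names `interventional_dist`, `tv_distance`, and `model_distance` are hypothetical and exist only for this illustration.

```python
from itertools import combinations, product

VALUES = (0, 1)


def interventional_dist(model, do):
    """Distribution over all variables of a fully observable CBN under do(S = s).

    `model` maps each variable to (parents, cpt), where cpt[parent_values][value]
    is a probability; `do` maps the intervened variables to their forced values.
    """
    names = list(model)
    dist = {}
    for values in product(VALUES, repeat=len(names)):
        assignment = dict(zip(names, values))
        prob = 1.0
        for var, (parents, cpt) in model.items():
            if var in do:
                # An intervened variable ignores its mechanism.
                prob *= 1.0 if assignment[var] == do[var] else 0.0
            else:
                key = tuple(assignment[p] for p in parents)
                prob *= cpt[key][assignment[var]]
        dist[values] = prob
    return dist


def tv_distance(p, q):
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


def model_distance(m, n):
    """Largest TV distance between m and n over all interventions (brute force)."""
    names = list(m)
    worst = 0.0
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            for forced in product(VALUES, repeat=size):
                do = dict(zip(subset, forced))
                worst = max(worst, tv_distance(interventional_dist(m, do),
                                               interventional_dist(n, do)))
    return worst


# Two hypothetical models on the graph A -> B that differ only when A = 0,
# an event that is rare observationally.
A_CPT = {(): (0.1, 0.9)}
M = {"A": ((), A_CPT), "B": (("A",), {(0,): (0.9, 0.1), (1,): (0.5, 0.5)})}
N = {"A": ((), A_CPT), "B": (("A",), {(0,): (0.1, 0.9), (1,): (0.5, 0.5)})}

observational_gap = tv_distance(interventional_dist(M, {}), interventional_dist(N, {}))
worst_case_gap = model_distance(M, N)
print(observational_gap, worst_case_gap)
```

In this assumed example the observational distributions are close, but forcing A to 0 exposes the disagreement, so the worst-case distance is much larger. The enumeration is exponential in the number of variables and is meant only as an illustration of the definition.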

The goodness-of-fit testing problem can now be described as follows. Suppose that a collection $\mathbf{V} \cup \mathbf{U}$ (observable and unobservable resp.) of random variables are causally related to each other. Let $M$ be a hypothesized causal model for $\mathbf{V} \cup \mathbf{U}$ that we are given explicitly. Suppose that the true model describing the causal relationships is an unknown $X$. Then, the goodness-of-fit testing problem is to distinguish between: (i) $X = M$, versus (ii) $X$ and $M$ being $\epsilon$-far, i.e. $\Delta(X, M) > \epsilon$, by sampling from and experimenting on $\mathbf{V}$, i.e. forcing some variables in $\mathbf{V}$ to certain values and sampling from the resulting intervened-upon distribution.

We study goodness-of-fit testing assuming and are causal Bayesian networks over a known DAG . Given a DAG , CBN and , we denote the corresponding goodness-of-fit testing problem . For example, the engineer above, who wants to determine whether the circuit behaves as the simulation software predicts, is interested in the problem where is the simulator’s prediction, is determined by the circuit layout, and is a user-specified accuracy parameter. Here is our theorem for goodness-of-fit testing.

###### Theorem 1.1 (Goodness-of-fit Testing – Informal).

Let be a DAG on vertices with bounded in-degree and bounded “confounded components.” Let be a given CBN over . Then, there exists an algorithm solving that makes interventions, takes samples per intervention and runs in time . Namely, the algorithm gets access to a CBN over , accepts with probability if and rejects with probability if .

By “confounded component” in the above statement, we mean a c-component in $G$, as defined in Definition 2.10. Roughly, a c-component is a maximal set of observable vertices that are pairwise connected by paths of the form $V_{i_1} \leftarrow U_{j_1} \rightarrow V_{i_2} \leftarrow U_{j_2} \rightarrow \cdots \rightarrow V_{i_k}$, where the $V_i$’s and $U_j$’s correspond to observable and unobservable variables respectively. The decomposition of CBNs into c-components has been important in earlier work [TP02] and continues to be an important structural property here.

We can use our techniques to extend Theorem 1.1 in several ways:

In the two-sample testing problem for causal models, the tester gets access to two unknown causal models $X$ and $Y$ on the same set of variables $\mathbf{V} \cup \mathbf{U}$ (observable and unobservable resp.). For a given $\epsilon$, the goal is to distinguish between (i) $X = Y$ and (ii) $X$ and $Y$ being $\epsilon$-far, by sampling from and intervening on $\mathbf{V}$ in both $X$ and $Y$.

We solve the two-sample testing problem when the inputs are two CBNs over the same DAG in variables; for a given and DAG , call the problem . Specifically, we show an algorithm to solve that makes interventions on the input models and , uses samples per intervention and runs in time , when has bounded in-degree and c-component size.222Of course, it is allowed for the two networks to be different subgraphs of . So, could be defined by the graph and by . Our result holds when is a DAG with bounded in-degree and c-component size.

For the two-sample testing problem, the requirement that the graph be fully known is rather strict. Instead, suppose the common graph is unknown and only bounds on its in-degree and maximum c-component size are given. For example, the biologist above who wants to test whether certain causal mechanisms are identical for patients with and without migraine can reasonably assume that the underlying causal graph is the same (even though he doesn’t know what it is exactly) and that only the strengths of the relationships may differ between subjects with and without migraine. For this problem, we obtain an efficient algorithm with nearly the same number of samples and interventions as above.

The problem of learning a causal model can be posed as follows: the learning algorithm gets access to an unknown causal model over a set of variables $\mathbf{V} \cup \mathbf{U}$ (observable and unobservable resp.), and its objective is to output a causal model that is $\epsilon$-close to the unknown one.

We consider the problem of learning a CBN over a known DAG on the observable and unobservable variables. For example, this is the problem facing the engineer above who wants to learn a good model for his circuit by conducting some experiments; the DAG in this case is known from the circuit layout. Given a DAG with bounded in-degree and c-component size and a parameter , we design an algorithm that on getting access to a CBN defined over , makes interventions, uses samples per intervention, runs in time , and returns an oracle that can efficiently compute for any and with error at most in TV distance.

The sample complexity of our testing algorithms matches the state-of-the-art for testing identity of (standard) Bayes nets [DP17, CDKS17]. Designing a goodness-of-fit tester using samples is a very interesting challenge and seems to require fundamentally new techniques.

We also show that the number of interventions for and is nearly optimal, even in its dependence on the in-degree and c-component size, and even when the algorithms are allowed to be adaptive. By ‘adaptive’ we mean the algorithms are allowed to choose the future interventions based on the samples observed from the past interventions. Specifically,

###### Theorem 1.2.

There exists a causal graph on vertices, with maximum in-degree at most and largest c-component size at most , such that interventions are necessary for any algorithm (even adaptive) that solves or .

### 1.2 Related Work

#### 1.2.1 Causality

As mentioned before, there is a large and long-standing literature on causality, covering both the testing of causal relationships and the inference of causal graphs, which is impossible to detail here. Below, we point out some representative directions of research that are relevant to our work. This discussion is far from exhaustive, and the reader is encouraged to pursue the references cited in the mentioned works.

Most work on statistical tests for causal models has been in the parametric setting. Structural equation models have traditionally been tested for goodness-of-fit by comparing observed and predicted covariance matrices [BL92]. Another class of tests that has been proposed assumes that the causal factors and the noise factors are conditionally independent. In the additive noise model [HJM09, PJS11, ZPJS12, SSS17], each variable is the sum of a (non-linear) function of its parent variables and independent noise, often assumed to be Gaussian. This point of view has been refined into an information-geometric criterion in [JMZ12]. In the non-parametric setting, which is the concern of this paper, Tian and Pearl [TP02] show how to derive functional constraints from causal Bayesian graphs that give equality and inequality constraints among the (distributions of) observed variables, not just conditional independence relations. Kang and Tian [KT06] derive such functional constraints on interventional distributions. Although these results yield non-trivial constraints, they are valid for any model that respects a particular graph and it is not clear how to use them for testing goodness-of-fit with statistical guarantees.

Learning in the context of causal inference has been extensively studied. To the best of our knowledge, though, most previous work is on learning only the causal graph, whereas our objective is to learn the entire causal model (i.e., the set of all interventional distributions). Pearl and Verma [PV95, VP92] investigated the problem of finding a causal graph with hidden variables that is consistent with a given list of conditional independence relations in observational data. In fact, there may be a large number of causal graphs that are consistent with a given set of conditional independence relations [SGS00, ARSZ05], and Zhang [Zha08] (building on the FCI algorithm [SMR99]) has given a complete and sound algorithm for recovering a representative of the equivalence class consistent with a set of conditional independence relations.

Subsequent work considered the setting when both observational and interventional data are available. This setting has been a recent focus of study [HB12a, WSYU17, YKU18], motivated by advances in genomics that allow high-resolution observational and interventional data for gene expression using flow cytometry and CRISPR technologies [SPP05, MBS15, DPL16]. When there are no confounding variables, Hauser and Bühlmann [HB12b], following up on work by Eberhardt and others [EGS05, Ebe07], find the information-theoretically minimum number of interventions that are sufficient to identify the underlying causal graph (more precisely, to discover the causal graph given the conditional independence relations satisfied by the interventional distributions) and provide a polynomial time algorithm to find such a set of interventions. A recent paper [KDV17] extends the work of [HB12b] to minimize the total cost of interventions where each vertex is assigned a cost. Another work by Shanmugam et al. [SKDV15] investigates the problem of learning causal graphs without confounding variables using interventions on sets of small size. In the presence of confounding variables, there are several works which aim to learn the causal graph from interventional data (e.g., [MMLM06, HEH13]). In particular, a recent work of Kocaoglu et al. [KSB17] gives an efficient randomized algorithm to learn a causal graph with confounding variables while minimizing the number of interventions from which conditional independence relations are obtained.

All the works mentioned above assume access to an oracle that gives conditional independence relations between variables in the observed and interventional distributions. This is clearly a problematic assumption because it implicitly requires unbounded training data. For example, Scheines and Spirtes [SS08] have pointed out that measurement error, quantization and aggregation can easily alter conditional independence relations. The problem of developing finite sample bounds for testing and learning causal models has been repeatedly posed in the literature. The excellent survey by Guyon, Janzing and Schölkopf [GJS10] on causality from a machine learning perspective underlines the issue as one of the “ten open problems” in the area. To the best of our knowledge, our work is the first to show finite sample complexity and running time bounds for inference problems on causal Bayesian networks.

An application of our learning algorithm is to the problem of transportability, studied in [BP13, SP08, LH13, PB11, BP12], which refers to the notion of transferring causal knowledge from a set of source domains to a target domain to identify causal effects in the target domain, when there are certain commonalities between the source and target domains. Most work in this area assumes the existence of an algorithm that learns the set of all interventions, that is, the complete specification of the source domain model. Our learning algorithm can be used for this purpose; it is efficient in terms of time, interventions, and sample complexity, and it learns each interventional distribution to error at most $\epsilon$.

#### 1.2.2 Distribution Testing and Learning

There is a vast literature on testing and learning high-dimensional distributions in statistics and information theory, and more recently in computer science with a focus on the computational efficiency of solving such problems. We cannot do justice to all of these works in this section. Instead, we provide pointers to some of the resources and discuss the recent progress that is most closely related to the work we present here.

In the distribution learning and testing framework, the closest to our work is learning and testing graphical models. The seminal work of Chow-Liu [CL68] considered the problem of learning tree-structured graphical models. Motivated by applications across many fields, the problem of learning graphical models from samples has gathered recent interest. Of particular interest is the apparent gap between the sample complexity and computational complexity of learning graphical models. [AKN06, BMS08] provided algorithms for learning bounded degree graphical models with polynomial sample and time complexity. A lower bound on the sample complexity that grows exponentially with the degree, and only logarithmically with the number of dimensions was provided by [SW12], and recent works [Bre15, VMLC16, KM17] have proposed algorithms with near optimal sample complexity, and polynomial running time for learning Ising models.

The sample and computational complexity of testing graphical models has been studied recently, in [CDKS17] for testing Bayesian networks and in [DDK18] for testing Ising models. Given sample access to an unknown Bayesian network or Ising model, these works study the sample complexity and computational complexity of deciding whether the unknown model is equal to a known fixed model (hypothesis testing).

The problem of testing and learning distribution properties has itself received wide attention in statistics with a history of over a century [Fis25, LR06, CT06]. In these fields, the emphasis is on asymptotic analysis characterizing the convergence rates, and error exponents, as the number of samples tends to infinity. A recent line of work originating from [GR00, BFR00] focuses on sublinear algorithms where the goal is to design algorithms with the number of samples that is smaller than the domain size (e.g., [Can15, Gol17], and references therein).

While most of these results are for learning and testing low dimensional (usually one dimensional) distributions, there are some notable exceptions. Testing for properties such as independence, and monotonicity in high dimensions have been considered recently [BFRV11, ADK15, DK16]. These results show that the optimal sample complexity for testing these properties grows exponentially with the number of dimensions. A line of recent work [DP17, CDKS17, DDK17, DDK18] overcomes this barrier by utilizing additional structure in the high-dimensional distribution induced by Bayesian network or Markov Random Field assumptions.

### 1.3 Overview of our Techniques

In this section, we give an overview of the proof of Theorem 1.1 and the lower bound construction. We start by making a well-known observation [TP02, VP90] that CBNs can be assumed to be over a particular class of DAGs known as semi-Markovian causal graphs. A semi-Markovian causal graph is a DAG where every vertex corresponding to an unobservable variable is a root and has exactly two children, both observable. More details of the correspondence are given in Appendix B.

In a semi-Markovian causal graph, two observable vertices $V_i$ and $V_j$ are said to be connected by a bi-directed edge if there is a common unobservable parent of $V_i$ and $V_j$. Each connected component of the graph restricted to bi-directed edges is called a c-component. The decomposition into c-components gives very useful structural information about the causal model. In particular, a fact that is key to our whole analysis is that if $N$ is a semi-Markovian Bayesian network on observable and unobservable variables $\mathbf{V} \cup \mathbf{U}$ with c-components $\mathbf{C}_1, \dots, \mathbf{C}_p$, then for any $\mathbf{v} \in \Sigma^{|\mathbf{V}|}$:

$$P_N[\mathbf{v}] = \prod_{i=1}^{p} P_N[\mathbf{c}_i \mid do(\mathbf{V} \setminus \mathbf{C}_i = \mathbf{v} \setminus \mathbf{c}_i)] \tag{1}$$

where $\Sigma$ is the alphabet set, $\mathbf{c}_i$ is the restriction of $\mathbf{v}$ to $\mathbf{C}_i$ and $\mathbf{v} \setminus \mathbf{c}_i$ is the restriction of $\mathbf{v}$ to $\mathbf{V} \setminus \mathbf{C}_i$. Moreover, one can write a similar formula (Lemma 2.12) for an interventional distribution on $N$ instead of the observable distribution $P_N[\mathbf{v}]$.
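Since c-components are just the connected components of the bi-directed part of the graph, they can be computed with a standard union-find pass. The following is a minimal sketch under the assumption that the semi-Markovian graph is given as a list of observable vertices and a list of bi-directed edges; the function name `c_components` and the example graph are hypothetical.

```python
def c_components(observables, bidirected_edges):
    """Connected components of the graph restricted to bi-directed edges."""
    parent = {v: v for v in observables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in bidirected_edges:
        parent[find(a)] = find(b)

    groups = {}
    for v in observables:
        groups.setdefault(find(v), set()).add(v)
    # Sort for a deterministic output order.
    return sorted(groups.values(), key=lambda c: sorted(c))


# Hypothetical graph: hidden common causes link V1-V2 and V2-V3 only.
components = c_components(
    ["V1", "V2", "V3", "V4", "V5"],
    [("V1", "V2"), ("V2", "V3")],
)
print(components)
```

Vertices with no unobservable parent end up as singleton c-components, which is the situation in a graph without confounders.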

The most direct approach to test whether two causal Bayes networks $X$ and $Y$ are identical is to test whether each interventional distribution is identical in the two models. This strategy would require exponentially many interventions, each on a variable set of size up to $n$, where $n$ is the total number of observable vertices. To reduce the number of interventions as well as the sample complexity, a natural approach, given (1) and its extension to interventional distributions, is to test for identity between each pair of “local” distributions

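$$P_X[\mathbf{S} \mid do(\mathbf{v} \setminus \mathbf{s})] \quad \text{and} \quad P_Y[\mathbf{S} \mid do(\mathbf{v} \setminus \mathbf{s})]$$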

for every subset $\mathbf{S}$ of a c-component and assignment $\mathbf{v} \setminus \mathbf{s}$ to $\mathbf{V} \setminus \mathbf{S}$. We assume that each c-component is bounded, so each local distribution has bounded support. Moreover, using the conditional independence properties of Bayesian networks, note that in each local distribution, we only need to intervene on observable parents of $\mathbf{S}$ that are outside $\mathbf{S}$, not on all of $\mathbf{V} \setminus \mathbf{S}$.

Through a probabilistic argument, we efficiently find a small set of covering interventions, which are defined as a set of interventions with the following property: for every subset $\mathbf{S}$ of a c-component and for every assignment to the observable parents of $\mathbf{S}$, there is an intervention in the set that does not intervene on $\mathbf{S}$ and sets the parents of $\mathbf{S}$ to exactly that assignment. Our test performs all the interventions in this set on both $X$ and $Y$ and hence can observe each of the local distributions in both models. What remains is to bound the distance between $X$ and $Y$ in terms of the distances between each pair of local distributions.
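To make the covering property concrete, the sketch below draws random interventions until every (subset, parent assignment) requirement is met. It is an illustration of the property only, under stated assumptions, and not the construction analysed in the paper: the sampling rule (leave each vertex alone with probability 1/2, otherwise force a uniformly random value), the greedy filtering, and the names `requirements`, `covers`, and `random_covering_set` are all hypothetical.

```python
import random
from itertools import combinations, product


def requirements(components, obs_parents, alphabet):
    """All (S, assignment to observable parents of S outside S) pairs to realise."""
    reqs = []
    for comp in components:
        comp = sorted(comp)
        for size in range(1, len(comp) + 1):
            for subset in combinations(comp, size):
                outside = sorted({p for v in subset for p in obs_parents[v]} - set(subset))
                for values in product(alphabet, repeat=len(outside)):
                    reqs.append((frozenset(subset), tuple(zip(outside, values))))
    return reqs


def covers(intervention, req):
    """True if the intervention leaves S free and sets its parents as required."""
    subset, assignment = req
    return (not any(v in intervention for v in subset)
            and all(intervention.get(p) == a for p, a in assignment))


def random_covering_set(vertices, components, obs_parents, alphabet, seed=0):
    rng = random.Random(seed)
    pending = requirements(components, obs_parents, alphabet)
    chosen = []
    while pending:
        # Leave each vertex alone with probability 1/2, else force a random value.
        candidate = {v: rng.choice(alphabet) for v in vertices if rng.random() < 0.5}
        remaining = [r for r in pending if not covers(candidate, r)]
        if len(remaining) < len(pending):  # keep only interventions that help
            chosen.append(candidate)
            pending = remaining
    return chosen


# Hypothetical graph V1 -> V2 -> V3 with a hidden common cause of V2 and V3.
vertices = ["V1", "V2", "V3"]
components = [{"V1"}, {"V2", "V3"}]
obs_parents = {"V1": [], "V2": ["V1"], "V3": ["V2"]}
reqs = requirements(components, obs_parents, (0, 1))
cover = random_covering_set(vertices, components, obs_parents, (0, 1))
print(len(cover), "interventions cover", len(reqs), "requirements")
```

The choice of interventions here depends only on the graph and never on observed samples, so the resulting set can be fixed before any experiment is run.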

To that end, we develop a subadditivity theorem about CBNs, and this is the main technical contribution of our upper bound results. We show that if each pair of local distributions is close in squared Hellinger distance, then for any intervention, applying it to $X$ and $Y$ results in distributions that are correspondingly close in squared Hellinger distance, assuming bounded in-degree and c-component size of the underlying graph. A bound on the total variation distance between the interventional distributions, and hence on the distance between the two models, follows. The subadditivity theorem is inspired by [DP17], where it was shown that for Bayes networks, “closeness of local marginals implies closeness of the joint distribution”. Our result is in a very different set-up, where we prove “closeness of local interventions implies closeness of any joint interventional distribution”, and requires a new proof technique. We relax the squared Hellinger distance between the interventional distributions as the objective of a minimization program in which the constraints are that each pair of local distributions is close in squared Hellinger distance. By a sequence of transformations of the program, we lower bound its objective in terms of the local closeness parameter, thus proving our result. In the absence of unobservable variables, the analysis becomes much simpler and is sketched in Appendix A.
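The flavour of subadditivity can be checked numerically in the simplest possible special case: independent variables (a graph with no edges and no hidden variables), where the squared Hellinger distance between two product distributions is at most the sum of the coordinate-wise squared Hellinger distances. The sketch below covers only this classical special case, not the theorem for CBNs, and it assumes the convention $H^2(P, Q) = 1 - \sum_i \sqrt{P(i) Q(i)}$; the helper names are hypothetical.

```python
import math
import random
from itertools import product


def hellinger_sq(p, q):
    """Squared Hellinger distance, assuming H^2 = 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))


def random_dist(rng, size):
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def product_dist(marginals):
    """Joint distribution of independent coordinates with the given marginals."""
    size = len(marginals[0])
    return [math.prod(m[v] for m, v in zip(marginals, values))
            for values in product(range(size), repeat=len(marginals))]


rng = random.Random(0)
violations = 0
for _ in range(200):
    ps = [random_dist(rng, 3) for _ in range(4)]
    qs = [random_dist(rng, 3) for _ in range(4)]
    joint_gap = hellinger_sq(product_dist(ps), product_dist(qs))
    local_sum = sum(hellinger_sq(p, q) for p, q in zip(ps, qs))
    if joint_gap > local_sum + 1e-12:
        violations += 1
print("violations:", violations)
```

In the product case the inequality follows from $1 - \prod_i (1 - h_i) \le \sum_i h_i$ for $h_i \in [0, 1]$, so no violations are expected.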

Regarding the lower bound, we prove that the number of interventions required by our algorithms is indeed necessary for any algorithm that solves the goodness-of-fit or two-sample testing problem, even if the algorithm is provided with infinite samples and time. For any algorithm that fails to perform some local intervention, we construct two models that disagree on that intervention and agree on all other interventions. Our construction is designed so that it also handles adaptive algorithms. The idea is to exhibit an adversary that, for each intervention, reveals a distribution to the algorithm. At the end, when the algorithm has failed to perform some local intervention, we construct two models such that: (i) the models disagree on that intervention, and the total variation distance between the corresponding interventional distributions is equal to one; and (ii) for all other interventions, the interventional distributions revealed by the adversary match the corresponding distributions in both models. This, together with a probabilistic argument, shows the existence of a causal graph that requires a sufficiently large number of interventions to solve either problem.

### 1.4 Future Directions

We hope that this work paves the way for future research on designing efficient algorithms with bounded sample complexity for learning and testing causal models. For the sake of concreteness, we list a few open problems.

• Interventional experiments are often expensive or infeasible, so one would like to deduce causal models from observational data alone. In general, this is impossible. However, in identifiable causal Bayesian networks (see [Tia02]), one can identify causal effects from observational data alone. Is there an efficient algorithm to learn an identifiable interventional distribution from samples? (Schulman and Srivastava [SS16] have shown that under adversarial noise, there exist causal Bayesian networks in which estimating an identifiable intervention to a given precision requires much higher precision in the estimates of the probabilities of observed events. However, this instability is likely due to the adversarial noise and does not preclude an efficient sampling-based algorithm, especially if we assume a balancedness condition as in [CDKS17].)

• A deficiency of our work is that we assume the underlying causal graph is fully known. Can our learning algorithm be extended to the setting where the hypothesis only consists of some limited information about the causal graph (e.g., in-degree, c-component size) instead of the whole graph? In fact, it is open how to efficiently learn the distribution given by a Bayesian network based on samples from it, if we don’t have access to the underlying graph [DP17, CDKS17].

• Our goodness-of-fit algorithm might reject even when the input $X$ is very close to the hypothesis $M$. Is there a tolerant goodness-of-fit tester that accepts when $\Delta(X, M) \le \epsilon_1$ and rejects when $\Delta(X, M) > \epsilon_2$ for $0 < \epsilon_1 < \epsilon_2 < 1$? Our current analysis does not extend to a tolerant tester. The same question holds for two-sample testing.

• In many applications, causal models are described in terms of structural equation models, in which each variable is a deterministic function of its parents as well as some stochastic error terms. Design sample and time efficient algorithms for testing and learning structural equation models. Other questions such as evaluating counterfactual queries or doing policy analysis (see Chapter 7 of [Pea09]) also present interesting algorithmic problems.

## 2 Preliminaries

##### Notation.

We use capital (bold capital) letters to denote variables (sets of variables), e.g., $A$ is a variable and $\mathbf{B}$ is a set of variables. We use small (bold small) letters to denote values taken by the corresponding variables (sets of variables), e.g., $a$ is the value of $A$ and $\mathbf{b}$ is the value of the set of variables $\mathbf{B}$. The variables in this paper take values in a discrete set $\Sigma$. We use to denote .

##### Probability and Statistics.

The total variation (TV) distance between distributions $P$ and $Q$ over the same finite set $\Omega$ is $\delta_{TV}(P, Q) = \frac{1}{2} \sum_{i \in \Omega} |P(i) - Q(i)|$. The squared Hellinger distance (given in (9)) and the total variation distance are related by the following.

###### Lemma 2.1 (Hellinger vs total variation).

The Hellinger distance and the total variation distance between two distributions $P$ and $Q$ are related by the following inequality:

$$H^2(P, Q) \le \delta_{TV}(P, Q) \le \sqrt{2 H^2(P, Q)}.$$
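A quick numerical sanity check of this inequality on random distributions is sketched below. It assumes the normalisation $H^2(P, Q) = 1 - \sum_i \sqrt{P(i) Q(i)}$ for the squared Hellinger distance (the definition referred to as (9) is not reproduced in this section), and the helper names are hypothetical.

```python
import math
import random


def tv(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger_sq(p, q):
    """Squared Hellinger distance, assuming H^2 = 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))


def random_dist(rng, size):
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
violations = 0
for _ in range(1000):
    p, q = random_dist(rng, 5), random_dist(rng, 5)
    h2 = max(hellinger_sq(p, q), 0.0)  # guard against tiny negative rounding
    d = tv(p, q)
    if not (h2 <= d + 1e-12 and d <= math.sqrt(2.0 * h2) + 1e-12):
        violations += 1
print("violations:", violations)
```

The upper bound is what lets a guarantee proved in squared Hellinger distance be converted into a total variation guarantee, at the cost of a square root.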

The problem of two-sample testing for discrete distributions in Hellinger distance, and learning with respect to total variation distance has been studied in the literature, and the following two lemmas state two results we use. Let and denote distributions over a domain of size .

###### Lemma 2.2 (Hellinger Test, [DK16]).

Given samples from each unknown distributions and , we can distinguish between vs with probability at least . This probability can be boosted to at a cost of an additional factor in the sample complexity. The running time of the algorithm is quasi-linear in the sample size.

###### Lemma 2.3 (Learning in TV distance, folklore (e.g. [DL12])).

For all , the empirical distribution computed using samples from satisfies , with probability at least .

##### Bayesian Networks.

Bayesian networks are popular probabilistic graphical models for describing high-dimensional distributions.

###### Definition 2.4.

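A Bayesian Network (BN) $N$ is a distribution that can be specified by a tuple $\langle \mathbf{V}, G, \{\Pr[V_i \mid pa(V_i)] : V_i \in \mathbf{V}\} \rangle$ where: (i) $\mathbf{V}$ is a set of variables over alphabet $\Sigma$, (ii) $G$ is a directed acyclic graph with nodes corresponding to the elements of $\mathbf{V}$, and (iii) $\Pr[V_i \mid pa(V_i)]$ is the conditional distribution of variable $V_i$ given that its parents in $G$ take the values $pa(V_i)$.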

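The Bayesian network $N$ defines a unique probability distribution $P_N$ over $\Sigma^{|\mathbf{V}|}$, as follows. For all $\mathbf{v} \in \Sigma^{|\mathbf{V}|}$,

$$P_N[\mathbf{v}] = \prod_{V_i \in \mathbf{V}} \Pr[v_i \mid pa(V_i)].$$

In this distribution, each variable is independent of its non-descendants given its parents in $G$.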
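The product formula can be illustrated on a three-variable chain $A \to B \to C$ with binary variables. The conditional probability tables below are made-up numbers, and the data layout (`parents`, `cpt`) is an assumption chosen for brevity rather than anything prescribed by the paper.

```python
from itertools import product

# Parents of each variable in the DAG A -> B -> C.
parents = {"A": (), "B": ("A",), "C": ("B",)}

# cpt[var][parent_values][value] = Pr[var = value | parents = parent_values]
cpt = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "C": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}},
}

def joint_probability(assignment):
    """Multiply one conditional probability per variable, as in the BN factorization."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[var][pa_values][assignment[var]]
    return prob

# The probabilities of all 2^3 assignments must sum to one.
total = sum(
    joint_probability(dict(zip("ABC", values)))
    for values in product([0, 1], repeat=3)
)
```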

Conditional independence relations in graphical models are captured by the following definitions.

###### Definition 2.5.

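Given a DAG $G$, a (not necessarily directed) path $p$ in $G$ is said to be blocked by a set of nodes $\mathbf{Z}$ if (i) $p$ contains a chain node $B$ ($A \to B \to C$) or a fork node $B$ ($A \leftarrow B \to C$) such that $B \in \mathbf{Z}$, or (ii) $p$ contains a collider node $B$ ($A \to B \leftarrow C$) such that $B \notin \mathbf{Z}$ and no descendant of $B$ is in $\mathbf{Z}$.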

###### Definition 2.6 (d-separation).

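For a given DAG $G$ on $\mathbf{V}$, two disjoint sets of vertices $\mathbf{X}, \mathbf{Y} \subseteq \mathbf{V}$ are said to be d-separated by $\mathbf{Z}$ in $G$ if every (not necessarily directed) path in $G$ between $\mathbf{X}$ and $\mathbf{Y}$ is blocked by $\mathbf{Z}$.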

###### Lemma 2.7 (Graphical criterion for independence).

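For a given BN $N$ and $\mathbf{X}, \mathbf{Y}, \mathbf{Z} \subseteq \mathbf{V}$, if $\mathbf{X}$ and $\mathbf{Y}$ are d-separated by $\mathbf{Z}$ in $G$, then $\mathbf{X}$ is independent of $\mathbf{Y}$ given $\mathbf{Z}$ in $P_N$, denoted by $(\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z})$ in $P_N$.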
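The d-separation criterion can be checked mechanically. The sketch below does not enumerate paths; it uses the standard equivalent test on the moralized ancestral graph (restrict to ancestors of the three sets, connect co-parents, drop directions, delete the conditioning set, and test connectivity). The graph encoding (a dictionary from each node to its list of parents) and the function names are assumptions for illustration, and the three sets are assumed to be pairwise disjoint.

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """True if xs and ys are d-separated by zs in the DAG given by `parents`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    # Moralize the ancestral subgraph: link each node to its parents,
    # and link every pair of parents that share a child.
    adj = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            adj[v].add(p)
            adj[p].add(v)
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Search from xs in the undirected graph with zs removed.
    seen = set(xs)
    stack = list(xs)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for w in adj[v]:
            if w not in zs and w not in seen:
                seen.add(w)
                stack.append(w)
    return True

# Chain A -> B -> C and collider A -> E <- D.
graph = {"B": ["A"], "C": ["B"], "E": ["A", "D"]}
chain_blocked = d_separated(graph, {"A"}, {"C"}, {"B"})       # chain node in Z
chain_open = d_separated(graph, {"A"}, {"C"}, set())          # nothing blocks
collider_blocked = d_separated(graph, {"A"}, {"D"}, set())    # collider not in Z
collider_open = d_separated(graph, {"A"}, {"D"}, {"E"})       # conditioning opens it
```

The four examples correspond to the two cases of Definition 2.5: conditioning on a chain node blocks the path, whereas conditioning on a collider unblocks it.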

### 2.1 Causality

We describe Pearl’s notion of causality from [Pea95]. Central to his formalism is the notion of an intervention. Given a variable set $\mathbf{V}$ and a subset $\mathbf{X} \subseteq \mathbf{V}$, an intervention $do(\mathbf{x})$ is the process of fixing the set of variables $\mathbf{X}$ to the values $\mathbf{x}$. The interventional distribution $P[\mathbf{V} \mid do(\mathbf{x})]$ is the distribution on $\mathbf{V}$ after setting $\mathbf{X}$ to $\mathbf{x}$. As discussed in the introduction, an intervention is quite different from conditioning.

Another important component of Pearl’s formalism is that some variables may be unobservable. The unobservable variables can neither be observed nor be intervened. We partition our variable set into two sets and , where the variables in are observable and the variables in are unobservable. Given a directed acyclic graph on and a subset , we use , , and to denote the set of all parents, observable parents, observable ancestors and observable descendants respectively of , excluding , in . When the graph is clear, we may omit the subscript. As usual, small letters, , and are used to denote their corresponding values. And, we use and to denote the graph obtained from by removing the incoming edges to and outgoing edges from respectively.

###### Definition 2.8 (Causal Bayesian Network).

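A causal Bayesian network (CBN) $M$ is a collection of interventional distributions that can be defined in terms of a tuple $\langle \mathbf{V}, \mathbf{U}, G, \{\Pr[V_i \mid \pi(V_i)]\}, \{\Pr[U_i \mid \pi(U_i)]\} \rangle$, where (i) $\mathbf{V}$ and $\mathbf{U}$ are the sets of observable and unobservable variables respectively, (ii) $G$ is a directed acyclic graph on $\mathbf{V} \cup \mathbf{U}$, and (iii) $\Pr[V_i \mid \pi(V_i)]$ and $\Pr[U_i \mid \pi(U_i)]$ are the conditional probability distributions of $V_i \in \mathbf{V}$ and $U_i \in \mathbf{U}$ resp. given that their parents take the values $\pi(V_i)$ and $\pi(U_i)$ resp.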

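A CBN $M$ defines a unique interventional distribution $P_M[\mathbf{V} \mid do(\mathbf{x})]$ for every subset $\mathbf{X} \subseteq \mathbf{V}$ (including $\mathbf{X} = \emptyset$) and assignment $\mathbf{x} \in \Sigma^{|\mathbf{X}|}$, as follows. For all $\mathbf{v} \in \Sigma^{|\mathbf{V}|}$:

$$P_M[\mathbf{v} \mid do(\mathbf{x})] = \begin{cases} \sum_{\mathbf{u}} \prod_{V_i \in \mathbf{V} \setminus \mathbf{X}} \Pr[v_i \mid \pi(V_i)] \cdot \prod_{U_i \in \mathbf{U}} \Pr[u_i \mid \pi(U_i)] & \text{if } \mathbf{v} \text{ is consistent with } \mathbf{x}, \\ 0 & \text{otherwise.} \end{cases}$$

We say that $G$ is the causal graph corresponding to the CBN $M$.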

Another equivalent way to define a CBN is by specifying the set of interventional distributions for all subsets and assignments . To connect to the preceding definition, we require that each is defined by the Bayesian network described by with the conditional probability distributions obtained by setting the variables in to the constants .
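To make the difference between intervening and conditioning concrete, here is a minimal sketch of the interventional distribution for a toy CBN with observables $X \to Y$ and one unobservable $U$ that is a parent of both. All numbers, variable names and the function `p_do` are assumptions chosen for illustration; the formula implemented is the truncated product above, in which the factors of intervened variables are dropped.

```python
# Toy CBN: U -> X, U -> Y, X -> Y, all binary (made-up probabilities).
pr_u = {0: 0.5, 1: 0.5}
pr_x_given_u = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
pr_y_given_xu = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.7, 1: 0.3},
    (1, 1): {0: 0.1, 1: 0.9},
}

def p_do(v, x):
    """P[v | do(x)]: v assigns X and Y, x maps intervened variables to values."""
    # An outcome inconsistent with the intervention has probability zero.
    if any(v[name] != value for name, value in x.items()):
        return 0.0
    total = 0.0
    for u in (0, 1):
        term = pr_u[u]
        if "X" not in x:  # factor of X is kept only when X is not intervened
            term *= pr_x_given_u[u][v["X"]]
        if "Y" not in x:  # factor of Y is kept only when Y is not intervened
            term *= pr_y_given_xu[(v["X"], u)][v["Y"]]
        total += term
    return total

# Interventional probability of Y = 1 after setting X = 1.
y1_do_x1 = p_do({"X": 1, "Y": 1}, {"X": 1})

# Observational conditional probability of Y = 1 given X = 1.
joint_x1 = {y: p_do({"X": 1, "Y": y}, {}) for y in (0, 1)}
y1_given_x1 = joint_x1[1] / (joint_x1[0] + joint_x1[1])
```

With these numbers the interventional probability is 0.6 while the conditional probability is 0.84; the gap comes entirely from the unobserved common cause $U$.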

It is standard in the causality literature to work with causal graphs of a particular structure:

###### Definition 2.9 (Semi-Markovian causal graph and Semi-Markovian Bayesian network).

A semi-Markovian causal graph (SMCG) is a directed acyclic graph on where every unobservable variable is a root node and has exactly two children, both observable. A semi-Markovian Bayesian network (SMBN) is a causal Bayesian network where the causal graph is semi-Markovian.

There exists a known reduction (described formally in Appendix B) from general causal Bayesian networks to semi-Markovian Bayesian networks that preserves all the properties we use in our analysis, so that henceforth, we will restrict only to SMBNs.

In SMCGs, the divergent edges are usually represented by bi-directed edges . A bi-directed edge between two observable variables implicitly represents the presence of an unobservable parent.

###### Definition 2.10 (c-component).

For a given SMCG , is a c-component of , if is a maximal set such that between any two vertices of , there exists a path that uses only bi-directed edges.

Since being connected by a path of bi-directed edges is an equivalence relation, the set of all c-components forms a partition of $\mathbf{V}$, the observable vertices of $G$. We use the notation to denote the partition of into the c-components of , where each is a c-component of .
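Computing the c-components amounts to finding connected components of the graph that keeps only the bi-directed edges. A minimal sketch, with a hypothetical function name and a made-up example graph:

```python
def c_components(observables, bidirected_edges):
    """Partition `observables` into classes connected by bi-directed paths."""
    adj = {v: set() for v in observables}
    for a, b in bidirected_edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for v in observables:
        if v in seen:
            continue
        # Depth-first search restricted to bi-directed edges.
        component, stack = set(), [v]
        seen.add(v)
        while stack:
            w = stack.pop()
            component.add(w)
            for nb in adj[w]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(frozenset(component))
    return components

# Five observables with bi-directed edges A <-> B and B <-> C.
components = c_components(["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
```

Vertices with no bi-directed edge, such as `D` and `E` here, form singleton c-components.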

Also, for , the induced subgraph is the subgraph obtained by removing the vertices and their corresponding edges from . We use the notation to denote the set of all c-components of , that is each is a c-component of . The next two lemmas capture the factorizations of distributions in SMBN.

###### Lemma 2.11.

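Let $M$ be a given SMBN with respect to the SMCG $G$. For any set $\mathbf{S} \subseteq \mathbf{V}$, and a subset $\mathbf{D} \subseteq \mathbf{V} \setminus \mathbf{S}$ such that $Pa(\mathbf{S}) \subseteq \mathbf{D}$, and for any assignment $\mathbf{d} \in \Sigma^{|\mathbf{D}|}$,

$$P_M[\mathbf{s} \mid do(\mathbf{d})] = P_M[\mathbf{s} \mid do(pa(\mathbf{S}))]$$

where $pa(\mathbf{S})$ is consistent with the assignment $\mathbf{d}$.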

###### Proof.

When the parents of , , are targeted for intervention, the distribution on remains the same irrespective of whether the other vertices in are intervened or not. ∎

###### Lemma 2.12 (c-component factorization, [TP02]).

Given a SMBN with respect to the causal graph and a subset , let . For any given assignment ,

For a given SMCG , the in-degree and out-degree of an observable vertex denote the number of observable parents and observable children of in respectively. The maximum in-degree of a SMCG is the maximum in-degree over all the observable vertices. The maximum degree of a SMCG is the maximum of the sum of the in-degree and out-degree over all the observable vertices.

###### Definition 2.13 (Graphs with bounded in-degree and bounded c-component).

denotes the class of SMCGs with maximum in-degree at most and the size of the largest c-component at most .

### 2.2 Problem Definitions

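Here we define the testing and learning problems considered in the paper. Let $M$ and $N$ be two SMBNs. We say that $M = N$, if

$$P_M[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})] = P_N[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})] \quad \forall\, \mathbf{T} \subseteq \mathbf{V},\ \mathbf{t} \in \Sigma^{|\mathbf{T}|}.$$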

And we say that , if there exists and such that

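$$\delta_{TV}\left(P_M[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})],\ P_N[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})]\right) > \varepsilon.$$

###### Definition 2.14 (Causal Goodness-of-fit Testing (CGFT(G,M,ε))).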

Given a SMCG , a (known) SMBN on , and . Let denote an unknown SMBN on . The objective of is to distinguish between versus with probability at least 2/3, by performing interventions and taking samples from the resulting interventional distributions of .

###### Definition 2.15 (Causal Two-sample Testing (C2ST(G,ε))).

Given a SMCG , and . Let and be two unknown SMBNs on . The objective of is to distinguish between versus with probability at least 2/3, by performing interventions and taking samples from the resulting interventional distributions of and .

###### Definition 2.16 (Learning SMBNs (CL(G,ε))).

Given a SMCG and . Let be an unknown SMBN on . The objective of is to perform interventions and take samples from the resulting interventional distributions of , and to return an oracle that for any and returns an estimated interventional distribution such that

$$\delta_{TV}\left(P_X[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})],\ P_{ES}[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})]\right) < \varepsilon.$$

We emphasize that in all three problems, the causal graph is known explicitly in advance.

## 3 Testing and Learning Algorithms for SMBNs

Before we discuss our algorithms, we begin by defining covering intervention sets.

###### Definition 3.1.

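A set of interventions $\mathcal{I}$ is a covering intervention set if for every subset $\mathbf{S}$ of every c-component, and every assignment $pa(\mathbf{S}) \in \Sigma^{|Pa(\mathbf{S})|}$, there exists an $I \in \mathcal{I}$ such that,

• No node in $\mathbf{S}$ is intervened in $I$.

• Every node in $Pa(\mathbf{S})$ is intervened.

• $I$ restricted to $Pa(\mathbf{S})$ has the assignment $pa(\mathbf{S})$.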
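The three conditions can be verified by brute force on small graphs. The sketch below assumes a hypothetical encoding: each intervention is a dictionary from intervened variables to their assigned values (variables absent from the dictionary are observed), `parents` maps each observable to its observable parents, and only non-empty subsets are checked.

```python
from itertools import combinations, product

def observable_parents(parents, subset):
    """Observable parents of `subset`, excluding members of `subset` itself."""
    pa = set()
    for v in subset:
        pa.update(parents.get(v, ()))
    return sorted(pa - set(subset))

def is_covering(interventions, components, parents, alphabet):
    """Brute-force check of the covering conditions (exponential in component size)."""
    for component in components:
        nodes = sorted(component)
        for size in range(1, len(nodes) + 1):
            for subset in combinations(nodes, size):
                pa = observable_parents(parents, subset)
                for values in product(alphabet, repeat=len(pa)):
                    target = dict(zip(pa, values))
                    covered = any(
                        # no node of the subset is intervened ...
                        all(v not in iv for v in subset)
                        # ... and every parent is set to the required value
                        and all(iv.get(p) == val for p, val in target.items())
                        for iv in interventions
                    )
                    if not covered:
                        return False
    return True

# Graph X -> Y with no bi-directed edges: c-components {X} and {Y}.
parents = {"Y": ["X"]}
components = [{"X"}, {"Y"}]
only_x = [{"X": 0}, {"X": 1}]        # never leaves X observed
with_empty = only_x + [{}]           # the empty intervention observes X
covers_without_empty = is_covering(only_x, components, parents, [0, 1])
covers_with_empty = is_covering(with_empty, components, parents, [0, 1])
```

In the example, setting $X$ to each value covers the subset $\{Y\}$, but an intervention that leaves $X$ untouched is also needed to cover $\{X\}$.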

Our algorithms rest on two key ingredients.

• A procedure to compute a covering intervention set of small size.

• A sub-additivity result for CBNs that allows us to localize the distances: we show that if two CBNs are far, then there exists an intervention in the covering intervention set under which the corresponding marginal distributions of the two CBNs are far.

These two results are formalized in Section 4.1, and Section 4.2 respectively.

### 3.1 Testing

Our main testing result is the following upper bound for testing of causal models.

###### Theorem 3.2 (Algorithm for C2ST(G,ε)).

Let be a SMCG with vertices. Let the variables take values over a set of size . Then, there is an algorithm to solve , that makes interventions to each of the unknown SMBNs and , taking samples per intervention, in time .

When the maximum degree (in-degree plus out-degree) of is bounded by , then our algorithm uses interventions with the same sample complexity and running time as above.

This result gives Theorem 1.1 as a corollary, since two sample tests are harder than one sample tests.

###### Proof of Theorem 3.2.

Our algorithm is described in Algorithm 3.1.

The algorithm starts with a covering intervention set . Lemma 4.1 gives an with interventions. When the maximum degree is bounded by , then Lemma 4.3 gives an of size . Moreover, by the remarks following Lemmas 4.1 and 4.3, can be found in time.

Algorithm for C2ST(G, ε):

• Input: Covering intervention set $\mathcal{I}$.

• Under each intervention $I \in \mathcal{I}$:

• Obtain samples from the interventional distribution of $I$ in both models $X$ and $Y$.

• For any subset $\mathbf{S}$ of a c-component of $G$, if $I$ does not set $\mathbf{S}$ but sets $Pa(\mathbf{S})$ to $pa(\mathbf{S})$, then using Lemma 2.2, Lemma 2.11 and the obtained samples, test (with error probability at most ):

• Output “” if the latter. Output “”.

We will now analyze the performance of our algorithm.

Number of interventions, time, and sample requirements. The number of interventions is the size of $\mathcal{I}$, which is bounded by Lemma 4.1 or Lemma 4.3. The number of samples per intervention is given in the algorithm. The algorithm performs sub-tests, and for each such sub-test its running time is quasi-linear in the sample complexity (Lemma 2.2), for a total time of .

Correctness. In Theorem 4.5, we show that when , there exists a subset of some c-component, and an that does not intervene any node in but intervenes with some assignment such that

$$H^2\left(P_X[\mathbf{S} \mid do(pa(\mathbf{S}))],\ P_Y[\mathbf{S} \mid do(pa(\mathbf{S}))]\right) > \varepsilon^2 / \left(2K^{\ell(d+1)} n\right).$$

This structural result is the key to our algorithm. This together with Lemma 2.1 proves that $X$ and $Y$ are far in terms of the total variation distance. To bound the error probability, note that the total number of sub-tests we run is bounded by , and the error probability of each sub-test is at most . By the union bound, the error over the entire algorithm is at most . ∎

In some cases, the underlying SMCG might not be known. We will now consider the problem of two sample testing, where and are still on the same common SMCG , but is unknown. We now show an algorithm that uses the same number of interventions and samples as Theorem 3.2 for the known case, however requiring time.

###### Theorem 3.3 (Algorithm for C2ST(G,ε) – Unknown graph).

Consider the same set-up as Theorem 3.2, except that the SMCG $G$ is unknown. Then, there is an algorithm for this problem that makes interventions to $X$ and $Y$, taking samples per intervention, in time .

###### Proof.

We first use Lemma 4.1 and obtain a set of interventions , such that is a covering set with error probability at most . Note that Lemma 4.1 holds even when the underlying graph is unknown.

Algorithm for C2ST(G, ε) – Unknown graph:

• Input: Covering intervention set $\mathcal{I}$.

• Under each intervention $I \in \mathcal{I}$:

• Obtain samples from the interventional distribution of $I$ in both models $X$ and $Y$.

• For each subset of size , using Lemma 2.2, Lemma 2.11 and the obtained samples, test (with error probability at most ):

• Output “” if the latter. Output “”.

For each intervention, we go over all subsets of size . Therefore we perform at most sub-tests per intervention. For each sub-test, the algorithm’s running time is quasi-linear in the sample complexity (Lemma 2.2), for a total time of . The number of interventions follows from Lemma 4.1 and the number of samples follows from the algorithm.

##### Correctness.

As in the proof of Theorem 3.2, we use Theorem 4.5 to show that when , then there exists a subset of some c-component and an that does not intervene any node in but intervenes with some assignment such that

$$H^2\left(P_X[\mathbf{S} \mid do(pa(\mathbf{S}))],\ P_Y[\mathbf{S} \mid do(pa(\mathbf{S}))]\right) > \varepsilon^2 / \left(2K^{\ell(d+1)} n\right).$$

This together with Lemma 2.1 proves that $X$ and $Y$ are far in terms of the total variation distance. Since the error probability of each sub-test is at most and the probability that fails to be a covering intervention set is at most , by the union bound the error over the entire algorithm is at most . ∎

### 3.2 Learning

Our next result is on learning SMBNs over a known causal graph. Our algorithm is improper, meaning that it does not output a causal model in the form of an SMBN, but rather outputs an oracle which succinctly encodes all the interventional distributions. See Definition 2.16 for a rigorous formulation of the problem.

###### Theorem 3.4 (Algorithm for CL(G,ε)).

For any given SMCG  with vertices and a parameter , there exists an algorithm that takes as input an unknown SMBN over , that performs interventions to , taking samples per intervention, that runs in time , and that with probability at least , outputs an oracle with the following behavior. Given as input any and assignment , outputs an interventional distribution such that:

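$$\delta_{TV}\left(P_X[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})],\ P_N[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})]\right) < \varepsilon$$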

When the maximum degree (in-degree plus out-degree) of is bounded by , then our algorithm uses interventions with the same sample complexity and running time as above.

Algorithm for CL(G, ε):

• Input: Covering intervention set $\mathcal{I}$.

• Under each intervention $I \in \mathcal{I}$:

• Obtain samples from the interventional distribution of $I$ in $X$.

• For each subset $\mathbf{S}$ of a c-component, if $I$ does not set $\mathbf{S}$ but sets $Pa(\mathbf{S})$ to $pa(\mathbf{S})$, use Lemma 2.3, Lemma 2.11 and the obtained samples to learn:

with probability of error at most .

• Return the following oracle that takes as input: $\mathbf{T}$ and $\mathbf{t}$. Let . Output the distribution where for any assignment :

The covering intervention set used in the algorithm above is as defined in Definition 3.1.

Number of interventions, time, and sample requirements. The number of interventions is obtained using the bound on the size of the covering intervention set from Lemma 4.1. When the maximum degree is bounded, we can use Lemma 4.3. The number of samples per intervention is obtained from Lemma 2.3. Since the algorithm learns at most interventions (subroutines), and each subroutine takes time linear in the sample size, the time complexity follows.

##### Correctness.

For any given , , let . Lemma 2.12 justifies that

Similar to the proof of Theorem 3.2, using Theorem 4.5 and Lemma 2.1, we get:

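$$H^2\left(P_N[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})],\ P_X[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})]\right) < \varepsilon^2/2 \implies \delta_{TV}\left(P_N[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})],\ P_X[\mathbf{V} \setminus \mathbf{T} \mid do(\mathbf{t})]\right) < \varepsilon.$$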
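The final implication is a direct application of the upper bound in Lemma 2.1. Writing $P$ and $Q$ as shorthand (introduced here only for readability) for the two interventional distributions, the step can be spelled out as:

```latex
\delta_{TV}(P, Q) \;\leqslant\; \sqrt{2 H^2(P, Q)} \;<\; \sqrt{2 \cdot \varepsilon^2 / 2} \;=\; \varepsilon .
```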

## 4 Main Ingredients of the Analysis

### 4.1 Covering Intervention Sets

###### Lemma 4.1 (Counting Lemma: bounded in-degree).

Let $G$ be a SMCG with $n$ vertices and $\Sigma$ be an alphabet set of size $K$. Then, there is a randomized algorithm that outputs a set $\mathcal{I}$ of size such that, with probability at least $1-\delta$, $\mathcal{I}$ is a covering intervention set.

###### Proof.

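Let . The interventions $I_1, \dots, I_t$ in $\mathcal{I}$ are chosen by the following procedure: for each $V_i \in \mathbf{V}$ and for each $j \in [t]$, $V_i$ is observed in $I_j$ with probability $1/(d+1)$, and otherwise $V_i$ is intervened with the assignment chosen uniformly from $\Sigma$. Let $I_j(V_i) = *$ denote that $V_i$ is not intervened in $I_j$. Consider a fixed c-component $\mathbf{C}$, a fixed subset $\mathbf{S} \subseteq \mathbf{C}$, a fixed assignment $pa(\mathbf{S}) \in \Sigma^{|Pa(\mathbf{S})|}$ and a fixed $j \in [t]$. Now,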

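$$\begin{aligned}
\Pr\left[I_j(\mathbf{S}) = *^{|\mathbf{S}|} \wedge I_j(Pa(\mathbf{S})) = pa(\mathbf{S})\right] &= \left(\frac{1}{d+1}\right)^{|\mathbf{S}|} \cdot \left(\frac{d}{K(d+1)}\right)^{|Pa(\mathbf{S})|} \\
&\geqslant (d+1)^{-\ell} K^{-\ell d} e^{-\ell} \quad [\text{since } |Pa(\mathbf{S})| \leqslant \ell d \text{ and } |\mathbf{S}| \leqslant \ell] \\
&\geqslant (3d)^{-\ell} K^{-\ell d}.
\end{aligned}$$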

This implies that

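$$\Pr\left[\forall j \in [t],\ \left(I_j(\mathbf{S}) \neq *^{|\mathbf{S}|} \vee I_j(Pa(\mathbf{S})) \neq pa(\mathbf{S})\right)\right] \leqslant \left(1 - (3d)^{-\ell} K^{-\ell d}\right)^t \leqslant \frac{\delta}{n} K^{-2\ell d}.$$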

Hence,

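$$\Pr\left[\exists \text{ c-component } \mathbf{C},\ \exists \mathbf{S} \subseteq \mathbf{C},\ \exists pa(\mathbf{S}) \in \Sigma^{|Pa(\mathbf{S})|},\ \forall j \in [t],\ \left(I_j(\mathbf{S}) \neq *^{|\mathbf{S}|} \vee I_j(Pa(\mathbf{S})) \neq pa(\mathbf{S})\right)\right] \leqslant n 2^{\ell} K^{\ell d} \cdot \frac{\delta}{n} K^{-2\ell d} \leqslant \delta$$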

by the union bound. ∎
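The sampling procedure in the proof can be sketched as follows. The number of interventions `t` is left as a free parameter here, since its exact value is not reproduced above; the function names, the dictionary encoding of an intervention and the seed argument are likewise assumptions for illustration.

```python
import random

def random_interventions(variables, alphabet, d, t, seed=0):
    """Draw t random interventions: each variable is left observed with
    probability 1/(d+1), otherwise it is set to a uniform alphabet symbol."""
    rng = random.Random(seed)
    interventions = []
    for _ in range(t):
        intervention = {}
        for v in variables:
            if rng.random() >= 1.0 / (d + 1):
                intervention[v] = rng.choice(alphabet)
        interventions.append(intervention)
    return interventions

def hit_probability(s_size, pa_size, d, K):
    """Exact probability that one random intervention leaves a fixed set of
    s_size nodes observed and sets pa_size parents to prescribed values."""
    return (1.0 / (d + 1)) ** s_size * (d / (K * (d + 1))) ** pa_size

example = random_interventions(["A", "B", "C"], [0, 1], d=2, t=50, seed=1)
single_hit = hit_probability(s_size=1, pa_size=1, d=1, K=2)
```

The function `hit_probability` evaluates the first line of the displayed calculation for one fixed subset and parent assignment; repeating the draw `t` times drives the probability of missing any such pair down geometrically, which is what the union bound exploits.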

###### Remark 4.2.

The above proof can be made deterministic by using explicit deterministic constructions of almost -wise independent random variables [AGHP92, EGL92].

###### Lemma 4.3 (Counting Lemma: bounded total degree).

Let be an SMCG with vertices, whose variables take values in with , and whose maximum degree is bounded by . Then, there exists a covering intervention set of size .

###### Proof.

Let . The interventions in are chosen by the following procedure: For each and for each , is observed in with probability and otherwise, is intervened with the assignment chosen uniformly from the set . Let denote that is observed (not intervened).

For a fixed set that is a subset of a c-component and a fixed assignment , let be the event: . Similar to the proof of Lemma 4.1, for any fixed and : .

Now, note that and are independent if