
Quantifying causal contribution via structure preserving interventions

by   Dominik Janzing, et al.

We introduce 'Causal Information Contribution (CIC)' and 'Causal Variance Contribution (CVC)' to quantify the influence of each variable in a causal directed acyclic graph on some target variable. CIC is based on the underlying Functional Causal Model (FCM), in which we define 'structure preserving interventions' as those that act on the unobserved noise variables only. This way, we obtain a measure of influence that quantifies the contribution of each node in its 'normal operation mode'. The total uncertainty of a target variable (measured in terms of variance or Shannon entropy) can then be attributed to the information from each noise term via Shapley values. CIC and CVC are inspired by Analysis of Variance (ANOVA), but also apply to non-linear influence with causally dependent causes.


1 Introduction

Quantification of causal influence plays a role not only in experts' research on specific scientific problems, but also in highly controversial public discussions. For instance, the question to what extent environmental factors versus genetic disposition influence human intelligence is an ongoing debate [1]. Given the relevance of these questions, there is surprisingly little clarity about how to define strength of influence in the first place, see e.g. [2], apart from the problem of estimating it from empirical data after it is properly defined. More recent discussions on feature relevance quantification in explainable artificial intelligence have raised the problem of quantification of influence once more, see e.g. [3, 4, 5].

Probabilistic versus functional causal models

Our language for talking about causal relations will be based on Pearl’s seminal work [6].

Definition 1 (Causal Bayesian Network).

A causal Bayesian network is a directed acyclic graph (DAG) $G$ whose nodes are random variables $X_1,\dots,X_n$ with joint distribution $P_{X_1,\dots,X_n}$ satisfying the Markov condition with respect to $G$: each node $X_j$ is conditionally independent of its non-descendants, given its parents $PA_j$, and thus the joint probability factorizes according to

$$P(x_1,\dots,x_n) = \prod_{j=1}^n P(x_j \mid pa_j). \qquad (1)$$

Apart from this purely statistical condition (which may hold for many DAGs other than the causal one), $G$ describes the behaviour of the system under interventions in the following sense. If one sets the parents $PA_j$ of a node $X_j$ to some values $pa_j$, $X_j$ will then be distributed according to $P(X_j \mid pa_j)$.

While Bayesian networks can also be used to formalize statistical conditional (in)-dependences between random variables without causal interpretation, the last condition in Definition 1 clarifies that the DAG is causal because for a general set of variables statistically related to , setting them to certain values does not result in the same distribution as observing them to attain these values (difference between interventional and observational conditionals, see [6]). A more ‘fine-grained’ causal model is given by functional causal models, also called non-linear structural equation models:

Definition 2 (Functional Causal Model).

An FCM is a DAG $G$ with observed variables $X_1,\dots,X_n$ as nodes and $N_1,\dots,N_n$ as unobserved noise variables such that
(1) each $X_j$ is deterministically given by its parents and the noise term, that is,

$$X_j = f_j(PA_j, N_j), \qquad (2)$$

and (2) all noise variables are jointly statistically independent. Moreover, equation (2) is not just an equation, but a causal assignment in the sense that for any particular observed value $x_j$ of $X_j$, setting $PA_j$ to $pa_j'$ instead of $pa_j$ would have changed $x_j$ to $x_j' = f_j(pa_j', n_j)$ (if $n_j$ denotes the value attained by $N_j$ for that particular statistical instance).

The existence of an FCM implies that $P_{X_1,\dots,X_n}$ satisfies the Markov condition with respect to $G$ [6]. On the other hand, every joint distribution that is Markovian relative to $G$ can be generated by an FCM, but this construction is not unique. This is because knowing the causal DAG and the joint distribution alone does not determine counterfactual causal statements like ‘how would $X_j$ have changed, had I set $X_i$ from $x_i$ to $x_i'$’ (see also [7] for an explicit description of the non-uniqueness). While the more ‘conventional’ approaches to causal discovery infer the graphical model [8] from conditional statistical independences (up to Markov equivalence), more recent approaches based on other properties of the joint distribution infer the DAG by inferring FCMs [9, 10], subject to strong additional assumptions. This allows us to draw counterfactual conclusions from empirical data which could not be inferred otherwise.

The paper is structured as follows. Section 2 describes via intuitive examples which notion of causal influence we have in mind. Section 3 defines our quantification of causal influence in the sense of information contribution. Section 4 introduces causal influence in the sense of contribution to the variance, which we mention because it comes with advantages for some practical applications, although the main focus of the paper is information theoretic in order to admit applicability to variables with general range. Then we discuss some properties of both contribution measures in Section 5. Section 6 discusses earlier measures of causal influence that are also information-based and compares them to ours.

The main goal of the present paper is to show the sophistication of the problem of quantifying causal influence. Even if one agrees to base notions of causal influence on interventions in causal DAGs; and even if one decides to measure its strength in terms of Shannon information, there are still substantially different notions of causal strength. We argue that these different notions coexist for good reasons, since they formalize different ideas on what causal influence is about. Although the paper is purely theoretical, the main contribution is not a mathematical one. Instead, it aims at motivating a definition of influence, exploring whether it aligns with intuitive human concepts (by example stories), and critically discussing its limitations.

2 Motivation: influence in the sense of contribution

Assume three authors $A, B, C$ are jointly writing a document. $A$ writes the first section, passes it to $B$ who writes the second section and passes it to $C$. Finally, $C$ adds the third section. Let $D_A, D_B, D_C$ denote the documents after $A, B, C$ contributed their sections $S_A, S_B, S_C$, respectively. We visualize the underlying communication scenario by the causal DAG $D_A \to D_B \to D_C$.

To judge how much each author ‘influenced’ the resulting document we could argue in at least two different ways:

  1. Influence in the sense of contribution: each author contributed his/her section.

  2. Influence in the sense of options for actions: author $A$ had an influence on section $S_A$, author $B$ on sections $S_A, S_B$, while author $C$ influenced $S_A, S_B, S_C$. After all, they also saw the sections of the authors contributing earlier and thus could have changed them drastically. Assume, for instance, author $C$ realizes that $S_A$ is complete nonsense. Then $C$ may not only blame $A$ for having written it but also $B$ for not correcting it. In this sense, author $A$ had the smallest influence because he/she could only influence $S_A$, author $B$ could have influenced $S_A, S_B$, and author $C$ even $S_A, S_B, S_C$.

To describe the difference of these two interpretations quantitatively, we now consider a statistical scenario where the three authors repeatedly write documents following the same communication scenario above, and we consider $S_A, S_B, S_C$ as independent random variables with Shannon entropies $H(S_A), H(S_B), H(S_C)$, respectively (since we argue in terms of information theory, $S_A, S_B, S_C$ may attain values in arbitrary sets, e.g., in the set of texts of finite length). We then define the FCM

$$D_A := S_A, \qquad (3)$$
$$D_B := D_A \circ S_B, \qquad (4)$$
$$D_C := D_B \circ S_C, \qquad (5)$$

where $\circ$ denotes the concatenation of texts. Assume we would like to quantify the influence of each author on the final document in terms of Shannon mutual information [11].

For the influence of each author on $D_C$ in the sense of contribution we get

$$I(S_A : D_C) = H(S_A), \qquad I(S_B : D_C) = H(S_B), \qquad I(S_C : D_C) = H(S_C).$$

On the other hand, let us measure the influence in the sense of ‘options to change’, or ‘potential influence’, by quantifying the amount of information that each author has seen and could therefore have changed. (One may certainly argue that authors $A$ and $B$ would have had the option to write also the sections following theirs, but this ‘potential influence’ breaks the protocol even more drastically.) Accordingly, we consider an experiment where each author randomizes not only his/her section but also replaces the sections that already exist with random inputs from the corresponding distribution. We then obtain $H(S_A)$ for author $A$, $H(S_A) + H(S_B)$ for author $B$, and $H(S_A) + H(S_B) + H(S_C)$ for author $C$.

To elaborate on the difference between both notions of influence, one could say that the first one accounts for the influence in a somehow ‘normal mode’, an influence that respects the mechanisms given by the structural equations (3) to (5), while the second notion assumes a radical change of the mechanisms that blocks the dependence on the causal parents. Although there are certainly good reasons for both notions, we will mainly focus on the first one for two reasons. First, quantification of causal influence so far has covered this notion the least. (Note that two substantially different notions of causal influence, namely Information Flow in [12] and Causal Strength in [13], are related to the second notion rather than to the first, see Section 6.) Second, we believe that influence defined by hypothetical radical mechanism changes is not what one is interested in for typical applications. Assume, for instance, one discusses the factors that influence the revenue of a bank. Following the second notion of influence, one could argue that the person driving the money transporter has a significant influence because he/she could have stolen a lot of money. However, a serious economic discussion on the factors influencing revenue would only account for this case if it happened with significant probability. Then, however, one would introduce the loss caused by the money transporter as an additional noise variable in the corresponding structural equation and again end up at contribution analysis in the sense of our first notion.
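The difference between the two notions can be checked numerically on a toy instance. The sketch below is only illustrative: the three section distributions are hypothetical, documents are modelled as string concatenations, and the ‘potential influence’ of an author is computed as the mutual information between the document that author hands over and the final document (which is what randomizing all sections seen so far with their natural distribution amounts to in this example).

```python
import itertools
import math
from collections import defaultdict

def entropy(joint, idx):
    """Shannon entropy (bits) of the coordinates idx of a joint dict {outcome: prob}."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def mutual_information(joint, a, b):
    """I(A:B) = H(A) + H(B) - H(A,B) for coordinate lists a and b."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Hypothetical section distributions for the three authors (not from the paper).
sections = [
    {"a0": 0.5, "a1": 0.5},                            # entropy 1 bit
    {"b0": 0.25, "b1": 0.25, "b2": 0.25, "b3": 0.25},  # entropy 2 bits
    {"c0": 0.5, "c1": 0.25, "c2": 0.25},               # entropy 1.5 bits
]

# Joint distribution over (S1, S2, S3, D1, D2, D3); each document is the
# concatenation of the previous document and the new section.
joint = {}
for (s1, p1), (s2, p2), (s3, p3) in itertools.product(*(d.items() for d in sections)):
    d1 = s1
    d2 = d1 + s2
    d3 = d2 + s3
    joint[(s1, s2, s3, d1, d2, d3)] = p1 * p2 * p3

# Contribution: information shared between each section and the final document.
contribution = [mutual_information(joint, [i], [5]) for i in range(3)]
# Potential influence: information shared between each handed-over document and the final one.
potential = [mutual_information(joint, [3 + i], [5]) for i in range(3)]
print("contribution:", contribution)
print("potential:   ", potential)
```

In this toy instance the contributions equal the individual section entropies, while the potential influences accumulate the entropies of all sections an author has seen.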

Structure preserving interventions

Our remarks above implicitly refer to actions that do not match the standard notion of interventions. The interventional calculus by [6] refers by default to interventions that set nodes to a particular fixed value. This type is formally defined by removing the right-hand side $f_j(PA_j, N_j)$ in (2) and replacing it with a constant $x_j$. (More general interventions, where (2) is replaced with a different structural equation, are considered by [14, 15, 16, 17, 18]. This certainly includes the notion of intervention below as a special case.) We will now consider interventions that preserve the structural equations because they act on the noise only. To see that this notion is substantially different from the former, just consider the case where a variable is deterministically given by its parents. Then no structure preserving intervention on that node is possible. Accordingly, we cannot talk about its influence on any other node. Following our remarks above, this is in agreement with the common practice of not attributing any influence to mechanisms that just transport the information reliably without any change.

One can think of different types of interventions on the noise: (1) setting $N_j$ to some specific value, (2) applying some function $g$ to $N_j$, that is, $N_j \mapsto g(N_j)$. A quite natural example for real-valued variables is the shift $N_j \mapsto N_j + c$ for some $c \in \mathbb{R}$. Note that the operation on the noise is not allowed to depend on the value of the parents, because this would implicitly mean changing the structure (such an operation changes the way $X_j$ depends on its parents).
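To make the contrast with a hard intervention concrete, here is a minimal sketch on a hypothetical linear chain with three nodes. The mechanisms, the function names `run_fcm` and `shift_noise`, and all numbers are assumptions chosen for illustration only.

```python
def run_fcm(n1, n2, n3, do_x2=None):
    """Evaluate a hypothetical chain X1 -> X2 -> X3 for given noise values.

    If do_x2 is given, the mechanism of X2 is replaced by that constant
    (a hard intervention); otherwise the structural equations are kept.
    """
    x1 = n1
    x2 = x1 + n2 if do_x2 is None else do_x2
    x3 = 2 * x2 + n3
    return x1, x2, x3

def shift_noise(n2, c):
    """Structure preserving intervention: shift the noise of X2 by c."""
    return n2 + c

baseline = run_fcm(1, 2, 3)
shifted = run_fcm(1, shift_noise(2, 5), 3)   # acts on the noise only
hard = run_fcm(1, 2, 3, do_x2=0)             # overrides the mechanism of X2

# Under the noise shift, X2 still responds to a change of its parent X1 ...
responds = run_fcm(4, shift_noise(2, 5), 3)[1] - shifted[1]
# ... whereas under the hard intervention it no longer does.
ignores = run_fcm(4, 2, 3, do_x2=0)[1] - hard[1]
print(baseline, shifted, hard, responds, ignores)
```

The noise shift changes downstream values but leaves the dependence of the second node on its parent intact, while the hard intervention removes that dependence.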

Below, we consider interventions on the noise that randomize $N_j$ according to the un-intervened distribution (the usual term ‘observational distribution’ seems out of place here because $N_j$ is assumed to be unobserved, although it can be computed from observations subject to assumptions like additive noise as used by [9]). There are two arguments for this choice. First, it is convenient that the mutual information that $N_j$ shares with any observed node is then exactly the information the two variables share in the un-intervened distribution (note that noise variables are always unconfounded because they don’t have incoming arrows). This way, we do not need to explicitly compute interventional distributions via ‘blocking back-door paths’ [6]. Second, allowing randomized interventions with a noise distribution different from the natural one would raise the difficult question of which distribution to use instead. Choosing a distribution with larger variance, for instance, could influence downstream variables in a much stronger way than it usually would. Randomizing with the ‘natural distribution’ is therefore aligned with our idea of measuring the influence of a node in the ‘normal mode’ rather than in a scenario with drastic changes.

3 Defining Causal Information Contribution

3.1 Multiple independent direct causes:

To be prepared for general DAGs we first consider the simple DAG shown in Figure 1, where independent variables $X_1,\dots,X_n$ directly influence a target $Y$. A natural information theoretic quantification of causal influence in this case is given as follows.

Figure 1: The simple case of multiple causally independent causes $X_1,\dots,X_n$ influencing the target $Y$.

For any subset $T \subseteq \{1,\dots,n\}$, we define $X_T$ as the vector containing all $X_j$ with $j \in T$. The Shannon mutual information between $X_T$ and $Y$ is nonnegative. We then define the causal information contribution of node $X_j$, given some $T$ not containing $j$, as the difference

$$CIC(X_j \to Y \mid X_T) := I(X_j, X_T : Y) - I(X_T : Y) = I(X_j : Y \mid X_T), \qquad (6)$$

see [11] for the second equality. Obviously, we can then decompose the mutual information between $Y$ and all inputs as

$$I(X_1,\dots,X_n : Y) = \sum_{j=1}^n CIC(X_j \to Y \mid X_1,\dots,X_{j-1}). \qquad (7)$$


Unfortunately, the contribution of each node in (7) depends on the arbitrarily chosen ordering of the nodes, so it is not a well-defined value. Shapley values in cooperative game theory get rid of this dependence by symmetrizing (7) over all orderings [19]. They are defined as follows:

Definition 3 (Shapley value).

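Let $V$ be a set with $n$ elements (called ‘players’ in the context of game theory) and $\nu : 2^V \to \mathbb{R}$ be a set function with $\nu(\emptyset) = 0$ (assigning a ‘worth’ to each ‘coalition’). Then the Shapley value of $i \in V$ is given by

$$\phi_i(\nu) := \sum_{S \subseteq V \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!} \left( \nu(S \cup \{i\}) - \nu(S) \right). \qquad (8)$$

$\phi_i(\nu)$ is thought of as measuring the contribution of each player in a fair way and satisfies

$$\sum_{i \in V} \phi_i(\nu) = \nu(V). \qquad (9)$$

By defining the ‘worth of a coalition’ via $\nu(T) := I(X_T : Y)$, we can define the ‘Shapley contribution’ of each node via

$$CIC^{Sh}(X_j \to Y) := \sum_{T \subseteq \{1,\dots,n\} \setminus \{j\}} \frac{|T|!\,(n-|T|-1)!}{n!}\, CIC(X_j \to Y \mid X_T),$$

which, using (9), still add up to the total information:

$$\sum_{j=1}^n CIC^{Sh}(X_j \to Y) = I(X_1,\dots,X_n : Y).$$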
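The Shapley symmetrization can be sketched in a few lines. The function name `shapley_values` and the example worth function are assumptions made for illustration: the worth used here is the mutual information (in bits) between a target that is the XOR of two independent unbiased bits and a subset of those bits, which is zero for each single bit and one bit for both together.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, worth):
    """Shapley value of every player for a set function worth(frozenset) -> float."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of all coalitions of size k that do not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (worth(s | {i}) - worth(s))
        values[i] = total
    return values

# Hypothetical worth: information about Y = X1 XOR X2 carried by a subset of {X1, X2}.
def xor_worth(coalition):
    return 1.0 if coalition == frozenset({1, 2}) else 0.0

values = shapley_values([1, 2], xor_worth)
print(values)  # each cause receives half a bit
```

Although neither cause carries information about the target on its own, the symmetrization splits the jointly carried bit evenly, and the values sum to the worth of the full coalition.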

Interpretation as reduction of uncertainty:

Note that $I(X_j : Y \mid X_T)$ can also be interpreted as measuring to what extent the uncertainty of $Y$ is reduced by knowing $X_j$ on top of $X_T$. This is because the conditional mutual information can be decomposed into a difference of conditional Shannon entropies [11]:

$$I(X_j : Y \mid X_T) = H(Y \mid X_T) - H(Y \mid X_j, X_T).$$

Relation to ANOVA:

Despite its weaknesses, see e.g. [20, 21], Analysis of Variance (ANOVA) is still a popular measure for quantifying causal contributions (it is also a classical method for quantifying heritability [22]), mainly thanks to its simplicity. For the DAG in Figure 1 it can indeed be thought to measure causal influence subject to the additional assumption that the influence is linear:

$$Y = \sum_{j=1}^n \alpha_j X_j.$$

Due to the uncorrelatedness of the causes we have

$$\mathrm{Var}(Y) = \sum_{j=1}^n \alpha_j^2\, \mathrm{Var}(X_j).$$

Accordingly, one can quantify the fraction of $\mathrm{Var}(Y)$ that is explained by each $X_j$:

$$R_j := \frac{\alpha_j^2\, \mathrm{Var}(X_j)}{\mathrm{Var}(Y)},$$

which entails $\sum_j R_j = 1$. If the values of any subset $X_T$ of variables are known, the variance of $Y$ reduces to $\sum_{j \notin T} \alpha_j^2\, \mathrm{Var}(X_j)$. Accordingly, we can also interpret $R_j$ as the fraction by which the uncertainty of $Y$ is reduced by knowing $X_j$. While above uncertainty was meant in the sense of entropy, here it is measured in terms of variance. The rephrasing (3.1) thus shows that CIC is in the same spirit as ANOVA, with the difference that the reduction of uncertainty in the simple linear model of ANOVA does not depend on which other variables are already known (and thus ANOVA does not require Shapley values). In Section 4 we will define Causal Variance Contribution, which measures uncertainty by variance instead of entropy and thus literally generalizes ANOVA.

3.2 Definition for general DAGs

FCMs provide an elegant way to reduce the case of a general DAG with nodes $X_1,\dots,X_n$ to the case in Subsection 3.1. Let us assume that $X_n$ is the target node we are interested in and, without loss of generality, that $X_n$ is a sink node of the DAG (that is, it has no descendants). We can then recursively insert structural equations (2) into each other and write $X_n$ entirely in terms of the unobserved noise variables:

$$X_n = F(N_1,\dots,N_n). \qquad (12)$$

(Note that writing some target variable in terms of all upstream noise terms appears to be a general approach in various types of attribution analysis. It has also been used for root cause analysis of outliers [23]. More generally speaking, writing the vector $(X_1,\dots,X_n)$ in terms of the independent noise terms $(N_1,\dots,N_n)$ can be seen as a particularly meaningful type of non-linear independent component analysis because it is based on a causal model.)

Therefore, we can think of $X_n$ as being the effect of the independent causes $N_1,\dots,N_n$. This way, we are back at the scenario in Figure 1 by replacing the inputs $X_1,\dots,X_n$ with $N_1,\dots,N_n$. Hence, we define our measure of causal influence accordingly:

Definition 4 (causal information contribution (CIC)).

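Assume we are given variables $X_1,\dots,X_n$ whose causal relation is described by an FCM (2) where $X_n$ is a sink node. Let $F$ as in (12) express $X_n$ in terms of all noise variables $N_1,\dots,N_n$. Then the Causal Information Contribution (CIC) of node $X_j$, given some subset $T \subseteq \{1,\dots,n\} \setminus \{j\}$, is given by

$$CIC(X_j \to X_n \mid T) := I(N_j : X_n \mid N_T).$$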


Note that this definition includes the special case in Section 2, where the contribution has also been defined via the noise variables in the structural equations (3) to (5). To further ‘demystify’ the noise variables in our context, one may think of a train schedule where $X_j$ formalizes the delay of train $j$ and $PA_j$ is the set of all trains that can delay the departure of train $j$ because it waits for them. Then, only $N_j$ in (2) allows us to describe the part of the delay of train $j$ that is genuinely caused by the train itself rather than being ‘inherited’ from other trains. (Note that [24] already argued in favor of using FCMs for causation, in their case for causation of singular events. However, their notion of ‘contributory cause’ is again based on do-interventions rather than structure-preserving interventions and thus substantially different from our notion of ‘contribution’.)

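Further note that we decided to write $CIC(X_j \to X_n \mid T)$ instead of $CIC(X_j \to X_n \mid N_T)$ to emphasize that we do not condition on observations from the random variables $N_T$ (as opposed to (6), where computing CIC really involves conditioning on observed random variables). Then the Shapley CIC, denoted by $CIC^{Sh}(X_j \to X_n)$, is defined similarly as in (3.1) by the weighted average of $CIC(X_j \to X_n \mid T)$ over all conditioning sets $T$:

$$CIC^{Sh}(X_j \to X_n) := \sum_{T \subseteq \{1,\dots,n\} \setminus \{j\}} \frac{|T|!\,(n-|T|-1)!}{n!}\, CIC(X_j \to X_n \mid T).$$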


The following lower bound can be convenient:

Lemma 1 (no information loss by conditioning).

For any sets we have

In particular,


The proof follows easily from $I(A : B \mid C) \geq I(A : B)$ for any random variables $A, B, C$ with $A$ independent of $C$.

We observe that $I(N_1,\dots,N_n : X_n) = H(X_n)$ (which is infinite for continuous variables, in which case we propose to discretize the variables; note that replacing Shannon information with differential Shannon entropy does not solve the problem that the mutual information is still infinite). Hence we obtain

$$\sum_{j=1}^n CIC^{Sh}(X_j \to X_n) = H(X_n).$$

This justifies considering $CIC^{Sh}(X_j \to X_n)$ as the ‘contribution of each node to the information of $X_n$’.
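The whole procedure (express the target through the noise terms, compute conditional mutual informations, symmetrize) can be sketched by brute-force enumeration for small discrete models. Everything specific below is an assumption for illustration: the binary FCM, the noise probabilities, and the helper names `cond_mi` and `shapley_cic`.

```python
import itertools
import math
from collections import defaultdict
from math import factorial

def entropy(joint, idx):
    """Shannon entropy (bits) of the coordinates idx of a joint dict {outcome: prob}."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, a, b, c):
    """I(A:B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

# Hypothetical binary FCM: X1 = N1, X2 = X1 XOR N2, X3 = X2 AND N3.
noise_probs = [0.5, 0.1, 0.5]  # P(N_j = 1), chosen for illustration

def target(n1, n2, n3):
    x1 = n1
    x2 = x1 ^ n2
    return x2 & n3

# Joint distribution over (N1, N2, N3, X3); the noise terms are independent.
joint = defaultdict(float)
for n in itertools.product([0, 1], repeat=3):
    p = math.prod(q if v else 1 - q for v, q in zip(n, noise_probs))
    joint[n + (target(*n),)] += p

TARGET = 3  # coordinate of X3 in the joint outcomes

def shapley_cic(j, n=3):
    """Shapley-weighted average of I(N_j : X3 | N_T) over all sets T not containing j."""
    others = [i for i in range(n) if i != j]
    total = 0.0
    for k in range(n):
        weight = factorial(k) * factorial(n - k - 1) / factorial(n)
        for t in itertools.combinations(others, k):
            total += weight * cond_mi(joint, [j], [TARGET], list(t))
    return total

cic = [shapley_cic(j) for j in range(3)]
h_target = entropy(joint, [TARGET])
print(cic, sum(cic), h_target)
```

Because the target is a deterministic function of the noise terms, the three values add up to the entropy of the target, which the final print lets one confirm.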

Subsection 5.3 will argue, however, that the symmetrization over conditioning sets has also disadvantages. Depending on the context, one may therefore decide to also work with ‘plain’ containing the dependence on the conditioning sets. For many applications one may also stick to , although it is blind to the influence that gets only apparent after some other noise terms are known. For a general discussion of this ‘synergy’ effect in information theory see also [25].

4 Causal Variance Contribution:

The relation to ANOVA mentioned earlier suggests measuring the uncertainty reduction that results from knowing certain noise terms also in terms of variance. In other words, we obtain an extension of ANOVA in a more literal sense by replacing the difference of conditional entropies in (14) with conditional variance:

Definition 5 (Causal Variance Contribution).

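Let $X_n$ be a real-valued target variable. Then the Causal Variance Contribution of $X_j$ on $X_n$, given $T$, is defined by

$$CVC(X_j \to X_n \mid T) := \mathrm{Var}(X_n \mid N_T) - \mathrm{Var}(X_n \mid N_j, N_T),$$

where $\mathrm{Var}(X_n \mid N_T)$ denotes the conditional variance of $X_n$ given $N_T$, averaged over $N_T$.

Likewise, we define the Shapley contribution $CVC^{Sh}(X_j \to X_n)$ by symmetrization over subsets as in (3.2).

$CVC$ reduces to ANOVA for the scenario where $X_n$ is a linear combination of independent causes, that is,

$$X_n = \sum_j \alpha_j N_j.$$

Then the reduction of conditional variance caused by including $N_j$ is given by $\alpha_j^2\, \mathrm{Var}(N_j)$.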

Quantifying uncertainty in terms of variance rather than Shannon entropy may be more intuitive for many practical applications. Further, conditional variance can be better estimated from finite data than entropy (via the squared error of regression models). Nevertheless, information theoretic quantification will remain our main focus in order to allow also for non-real-valued variables such as categorical ones.
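For small discrete models the variance-based contribution can also be computed exactly by enumeration. The sketch below assumes a hypothetical linear chain with noise terms that are uniform on two values, and it interprets the conditional variance as the expected conditional variance; the helper names are illustrative.

```python
import itertools
from math import factorial

noise_values = [-1.0, 1.0]  # each noise term uniform on {-1, +1}, hence variance 1

def target(n1, n2, n3):
    """Hypothetical linear chain: X1 = N1, X2 = X1 + N2, X3 = 2 * X2 + N3."""
    x1 = n1
    x2 = x1 + n2
    return 2 * x2 + n3

# All noise configurations are equally likely.
samples = [(n, target(*n)) for n in itertools.product(noise_values, repeat=3)]

def expected_cond_var(known):
    """Expected variance of the target given the noise terms with indices in known."""
    groups = {}
    for n, x in samples:
        groups.setdefault(tuple(n[i] for i in known), []).append(x)
    total = 0.0
    for xs in groups.values():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        total += var * len(xs) / len(samples)
    return total

def shapley_cvc(j, n=3):
    """Shapley-weighted reduction of expected conditional variance due to noise term j."""
    others = [i for i in range(n) if i != j]
    value = 0.0
    for k in range(n):
        weight = factorial(k) * factorial(n - k - 1) / factorial(n)
        for t in itertools.combinations(others, k):
            value += weight * (expected_cond_var(list(t)) - expected_cond_var(list(t) + [j]))
    return value

cvc = [shapley_cvc(j) for j in range(3)]
total_var = expected_cond_var([])
print(cvc, total_var)
```

In this linear example the target equals a weighted sum of the noise terms with coefficients 2, 2 and 1, so the contributions coincide with the squared coefficients times the noise variances, as in ANOVA, and do not depend on the conditioning set.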

5 Some properties of CIC and CVC

Studying properties of CIC and CVC may help understand in what sense they quantify indirect influence. This way, the reader may judge him/herself whether it describes a notion of influence that is appropriate for the application in mind.

5.1 Inserting dummy nodes

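Assume we are given the causal DAG $X_1 \to X_2$ with the structural equations

$$X_1 := N_1, \qquad X_2 := f_2(X_1, N_2).$$

Then, straightforward application of (3.2) yields

$$CIC^{Sh}(X_1 \to X_2) = \tfrac{1}{2} \left[ I(N_1 : X_2) + I(N_1 : X_2 \mid N_2) \right]. \qquad (18)$$

Let us now insert an intermediate node $\tilde{X}_1$ that is just an exact copy of $X_1$, that is, we define the modified FCM

$$X_1 := N_1, \qquad (19)$$
$$\tilde{X}_1 := X_1, \qquad (20)$$
$$X_2 := f_2(\tilde{X}_1, N_2). \qquad (21)$$

The corresponding DAG reads $X_1 \to \tilde{X}_1 \to X_2$. From a physicist's perspective, such a refinement of the description should always be possible because any causal influence propagates via a signal that can be inspected right after it leaves the source. The following result shows that (19) to (21) entail the same value for $CIC^{Sh}(X_1 \to X_2)$ as (18) because the ‘dummy’ noise variable corresponding to $\tilde{X}_1$ is irrelevant for the contribution of the other nodes: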

Lemma 2 (dummy noise variables).

Let be noise variables of an FCM with observed nodes . Let be a modified FCM with observed variables and noise variables modelling the same joint distribution on . Assume that the additional noise variables are irrelevant for , that is


for all . Then and yield the same values for and for all .

First we need the following property of Shapley values:

Lemma 3 (adding zero value players).

For let be given by

that is, is an extension of to irrelevant elements. Then

We are now able to prove Lemma 2. is given by first defining the function

in Definition 3, for each . Then . Further, define

for each . Then, . To see this, set . Then we have

The second term vanishes due to (22). Hence, defines an extension of to irrelevant elements in the sense of Lemma 3. Since with respect to the extended FCM is given by , the statement follows from Lemma 3.

To show the same for , we define the set function and (22) implies

5.2 Marginalization over grandparents

We are given the causal chain

$$X_1 \to X_2 \to X_3 \qquad (23)$$

with the structural equations

$$X_1 := N_1, \qquad X_2 := f_2(X_1, N_2), \qquad X_3 := f_3(X_2, N_3).$$

Assume now we ignore the variable $X_1$ and consider the causal structure

$$X_2 \to X_3, \qquad (27)$$

which is consistent with (23) because $X_1$ can be thought of as part of the noise term for $X_2$. We would then describe (27) by the structural equations

$$X_2 := \tilde{N}_2, \qquad X_3 := f_3(X_2, N_3), \qquad \text{where } \tilde{N}_2 := f_2(N_1, N_2).$$

One can easily see that $CIC^{Sh}(X_2 \to X_3)$ is not the same as for the larger model. In the limiting case where $X_2$ is just a copy of $X_1$ we obtain $CIC^{Sh}(X_2 \to X_3) = 0$ for (23), while the DAG (27) is blind to the fact that the information $X_2$ passes on has been ‘inherited’ from $X_1$, the grandparent of $X_3$. This matches our explanations on contributions of authors in Section 2: not being aware of the original source, one may erroneously attribute all sections to an author who just added one. The same argument for the deterministic case applies to $CVC^{Sh}$ because a node without noise cannot contribute to the variance either.

5.3 Marginalization over intermediate nodes

While we inserted a deterministic node in Subsection 5.1, we now marginalize over an intermediate node that depends non-deterministically on its cause. (Note that consistency of causal structures under various coarse-grainings is an interesting topic in a more general context too [26].) Let us again consider the chain (23) with the structural equations given in Subsection 5.2. Recall that $CIC^{Sh}(X_1 \to X_3)$ contains the terms $I(N_1 : X_3)$, $I(N_1 : X_3 \mid N_2)$, $I(N_1 : X_3 \mid N_3)$, $I(N_1 : X_3 \mid N_2, N_3)$.

Marginalizing over $X_2$ yields the causal DAG $X_1 \to X_3$ with the structural equations

$$X_1 := N_1, \qquad X_3 := \tilde{f}_3(X_1, \tilde{N}_3),$$

where $\tilde{N}_3 := (N_2, N_3)$ and

$$\tilde{f}_3(x_1, (n_2, n_3)) := f_3(f_2(x_1, n_2), n_3).$$

For the reduced structure, $CIC^{Sh}(X_1 \to X_3)$ contains only terms of the form $I(N_1 : X_3)$ and

$$I(N_1 : X_3 \mid \tilde{N}_3) = I(N_1 : X_3 \mid N_2, N_3),$$

while the terms $I(N_1 : X_3 \mid N_2)$, $I(N_1 : X_3 \mid N_3)$ do not occur any longer. Hence, the causal information contribution is not invariant with respect to the marginalization. The reason is that Shapley symmetrization averages the relevance of $N_1$ over all possible combinations of background conditions. Reducing the possible combinations by ignoring nodes can result in different values for $CIC^{Sh}$. This may be an argument for using ‘plain’ $CIC$, which explicitly refers to the set of background variables under consideration.

One can easily verify that marginalization also affects $CVC^{Sh}$ for qualitatively the same reasons.

5.4 Dependence on the functional causal model:

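$CIC^{Sh}$ and $CVC^{Sh}$ may differ for different FCMs (with the same DAG) describing the same joint distribution. For $X_1 \to X_2$ with binary variables $X_1, X_2$, first consider the structural equations

$$X_1 := N_1, \qquad X_2 := N_2,$$

where $N_1, N_2$ are binary noise variables with $N_2$ unbiased. Then we have

$$CIC^{Sh}(X_1 \to X_2) = 0, \qquad CIC^{Sh}(X_2 \to X_2) = 1 \text{ bit}.$$

The same joint distribution can also be generated by

$$X_1 := N_1, \qquad X_2 := X_1 \oplus N_2,$$

where $\oplus$ denotes XOR, for which we obtain

$$CIC^{Sh}(X_1 \to X_2) = \tfrac{1}{2} H(N_1), \qquad CIC^{Sh}(X_2 \to X_2) = 1 - \tfrac{1}{2} H(N_1)$$

(in bits; one can easily change the model to more generic parameters if one dislikes describing independent variables with the ‘non-minimal’ DAG $X_1 \to X_2$).

Again, the same holds for $CVC^{Sh}$.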
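The dependence on the FCM can be checked by enumeration. The sketch below assumes two unbiased binary noise terms and compares a model in which the second variable simply equals its own noise with one in which it is the XOR of its parent and its noise; the XOR form is our reading of the second set of structural equations, and the helper names are illustrative.

```python
import itertools
import math
from collections import defaultdict

def entropy(joint, idx):
    """Shannon entropy (bits) of the coordinates idx of a joint dict {outcome: prob}."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, a, b, c):
    """I(A:B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

def shapley_cic(mechanism, p1=0.5, p2=0.5):
    """Shapley CIC of N1 and N2 on X2 = mechanism(x1, n2), with X1 = N1."""
    joint = defaultdict(float)
    for n1, n2 in itertools.product([0, 1], repeat=2):
        p = (p1 if n1 else 1 - p1) * (p2 if n2 else 1 - p2)
        joint[(n1, n2, mechanism(n1, n2))] += p
    result = []
    for j, other in ((0, 1), (1, 0)):
        # With two players, each of the two conditioning sets gets weight 1/2.
        result.append(0.5 * cond_mi(joint, [j], [2], [])
                      + 0.5 * cond_mi(joint, [j], [2], [other]))
    return result

fcm_a = shapley_cic(lambda x1, n2: n2)        # X2 := N2
fcm_b = shapley_cic(lambda x1, n2: x1 ^ n2)   # X2 := X1 XOR N2
print(fcm_a, fcm_b)
```

Both models induce the same joint distribution of the two observed variables (independent unbiased bits), yet the first attributes the whole bit to the noise of the second node, while the second splits it evenly.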

Given that the scientific content of causal counterfactuals is philosophically controversial, one may find this dependence on the specific FCM worrisome. The following example suggests that this dependence can also be considered a desirable feature rather than a conceptual flaw. Assume that $X_1, X_2$ stand for the bits of an unencrypted and encrypted text, respectively. Let $X_2$ be obtained from $X_1$ by bitwise XOR with a randomly generated secret key $N_2$. If we have no access to $N_2$, we cannot detect any statistical dependence between $X_1$ and $X_2$. However, we would not argue that the author of $X_1$ did not contribute to $X_2$ just because $X_2$ looks the same to us as an encrypted version of an entirely different text. Merely the knowledge that $X_2$ has been obtained by such a secret encryption of the message $X_1$ would let us claim that $X_1$ contributed to $X_2$. Our argument thus follows the justification of counterfactuals by [6], who does not assume the noise to be unobservable in principle.

Some readers may not find this argument entirely convincing and still try to measure the ‘information contribution’ of each node on a target in terms of a quantity that refers to observed nodes only, rather than to an underlying FCM. We do not want to exclude that reasonable notions of indirect influence of this kind could be defined. Knowing Pearl’s interventional calculus [6] and information theory, it seems natural to compute causal influence via some information shared by nodes after ‘blocking back-door paths’ (to refer to interventional rather than observational probabilities). As far as we can see, these attempts fail to formalize our intuitive notion of ‘information contribution’, since they have some undesired properties, despite being reasonable concepts in their own right. This will be explained in the following section.

6 Previous work on quantifying causal influence

This paper focuses on information-based quantifications of causal influence because they apply to variables with arbitrary range, while measures like Average Causal Effect [6] (whether they measure direct or indirect influence [27]) are restricted to numerical variables since they quantify the change of an expectation of a variable caused by changes of its ancestors in the DAG. Below, we therefore describe previous work only for information-based quantification of causal influence.

6.1 Information Flow

As pointed out by [12], quantifying causal influence between observed nodes via the information they share requires computing information with respect to interventional probabilities rather than information given in a passive observational scenario (recall that this distinction has been irrelevant for us since dependences between observed nodes and noise are always causal and unconfounded). Accordingly, they define the information flow from a set of variables $X_A$ to another set $X_B$, imposing that some background variables $X_U$ are set to $x_U$, by

$$I(X_A \to X_B \mid do(x_U)) := \sum_{x_A, x_B} P(x_B \mid do(x_A, x_U))\, P(x_A \mid do(x_U)) \log \frac{P(x_B \mid do(x_A, x_U))}{\sum_{x_A'} P(x_B \mid do(x_A', x_U))\, P(x_A' \mid do(x_U))}. \qquad (36)$$

Here, the distribution $P(\cdot \mid do(x_A, x_U))$ is Pearl’s [6] notation for the interventional probability after setting $X_A, X_U$ to $x_A, x_U$. Moreover, [12] define also the average of (36) over all $x_U$:

$$I(X_A \to X_B \mid do(X_U)) := \sum_{x_U} P(x_U)\, I(X_A \to X_B \mid do(x_U)).$$

Note that $I(X_A \to X_B \mid do(x_U))$ measures the mutual information of $X_A$ and $X_B$ when $X_U$ is set to $x_U$ and $X_A$ is randomized with probability $P(X_A \mid do(x_U))$. [13] described problems when quantifying the strength of edges via Information Flow by violations of desirable postulates. We now argue that we consider it inappropriate as a measure of total influence in the sense of ‘contribution’ for similar reasons. (We don’t consider the postulates in [13] because they referred to strength of edges. We are not claiming that there cannot be a reasonable way of using the notion of information flow for quantifying the contribution of every node to the information of a target node; we just want to describe why our attempts of doing so failed. This discussion elaborates on ideas in the appendix of [13].) Consider the DAG in Figure 2, left.



Figure 2: Left: Causal DAG for which it is already non-trivial to define the strength of the influence of $X$ on $Z$ – if one demands that this definition should also apply to the special cases in the middle and on the right.

A definition for the influence of $X$ on $Z$ only makes sense to us if it still holds in the limiting case where the edge between $X$ and $Y$ disappears (Figure 2, middle), and if the arrow $X \to Z$ disappears (Figure 2, right). This is hard to achieve if one does not want to give up other strongly desired properties of causal strength. It is tempting to define the strength of the influence of $X$ on $Z$ simply via the information flow $I(X \to Z)$, which here coincides with the usual mutual information $I(X : Z)$. Assume we maintain this definition when $X \to Y$ disappears. For symmetry reasons we should then also define the strength of the effect of $Y$ on $Z$ by $I(Y : Z)$. Unfortunately, this admits a scenario where both $X$ and $Y$ have zero influence on $Z$ although they together have a strong influence on $Z$: Let all three variables be binaries attaining $0$ and $1$ with probability $1/2$ each. Assume $Z$ is generated from $X$ and $Y$ via $Z = X \oplus Y$. Then every pair of variables is statistically independent and thus

$$I(X : Z) = I(Y : Z) = 0,$$

which is disturbing since $X, Y$ together determine $Z$. Here, information flow fails because it does not account for the ‘background’ variable $Y$. For each fixed value of $Y$, $X$ does have an observable influence on $Z$, while the influence gets opaque when we ignore $Y$. Led by this insight, one may want to define the influence of $X$ on $Z$ in Figure 2, middle, by $I(X \to Z \mid do(Y))$. However, this blocks the indirect path $X \to Y \to Z$ in Figure 2, left, and accounts only for the direct effect (because adjusting the intermediate node turns off the influence that is mediated by the latter) and even yields zero for Figure 2, right. Hence, all our attempts of ‘information flow based’ quantification of influence of $X$ on $Z$ for the DAG in Figure 2, left, yield undesired results for the limiting cases shown in Figure 2, middle and right. From a high-level perspective, the reason for this failure is that Information Flow rather formalizes what we described as potential influence in Section 2: for the three-author scenario, we have $I(D_B \to D_C) = H(S_A) + H(S_B)$. This suggests that ‘hard interventions’ that adjust a variable to fixed, but randomized, values are inappropriate for measuring contributions. Other notions of indirect and path-specific influence [27] that rely on hard interventions cannot be used for our purpose either.
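The XOR scenario is easy to verify by enumeration. The following sketch uses hypothetical variable names for the two causes and the target and computes the relevant (conditional) mutual informations in bits.

```python
import math
from collections import defaultdict

def entropy(joint, idx):
    """Shannon entropy (bits) of the coordinates idx of a joint dict {outcome: prob}."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, a, b, c):
    """I(A:B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

# Two independent unbiased bits (coordinates 0 and 1) and their XOR (coordinate 2).
joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

i_first = cond_mi(joint, [0], [2], [])            # first cause alone vs. target
i_second = cond_mi(joint, [1], [2], [])           # second cause alone vs. target
i_both = cond_mi(joint, [0, 1], [2], [])          # both causes together vs. target
i_first_given_second = cond_mi(joint, [0], [2], [1])
print(i_first, i_second, i_both, i_first_given_second)
```

Each cause alone shares no information with the target, both together share one bit, and conditioning on the other cause reveals the full bit.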

6.2 Defining strength of causal arrows and indirect causal influence

[13] stated some properties that they expect from an information theoretic quantification of the strength of an arrow in a general DAG and proposed a measure based on an operation they called ‘cutting of edges’.

To quantify the information transferred along an arrow, [13] think of arrows as channels that propagate information through space – for instance ‘wires’ that connect electrical devices. To measure the impact of an arrow , they ‘cut’ it and feed it with a random input that is an i.i.d. copy of . This results in the following ‘post-cutting’ distribution:

Definition 6 (Single arrow post-cutting distribution).

Let be a causal DAG with nodes and be Markovian with respect to . Further, let denote the parents of without . Define the ‘post-cutting conditional’ by


Then, the post-cutting distribution is defined by replacing in (1) with (38).

The relative entropy between the observed joint distribution and the post-cutting distribution now measures the strength of the arrow:

Definition 7 (Quantifying strength of an arrow).

The strength of the arrow is given by

Note that the values for the DAGs (23) and (27) coincide, which has even been stated as a Postulate by [13]. Translated into the scenario in Section 2, the strength of the edge is given by . We have argued that our measure of causal contribution is not supposed to satisfy this property because we explicitly want to attribute inherited information to the respective ancestors.
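
The cutting operation behind Definitions 6 and 7 can be illustrated on a small discrete example. This is a minimal sketch, not the implementation of [13]: it assumes a target with exactly two discrete parents (a source whose outgoing arrow is cut and one remaining parent), and the names `arrow_strength`, `p_parents` and `p_target` are hypothetical. The post-cutting conditional is obtained by averaging the target's mechanism over an independent copy of the source, and the strength is the relative entropy between the observed and the post-cutting joint distribution.

```python
from math import log2


def arrow_strength(p_parents, p_target):
    """Relative entropy (bits) between observed and post-cutting joint.

    p_parents: {(s, w): prob}, joint of the source s and the remaining parent w.
    p_target:  {(s, w): {y: prob}}, conditional of the target given both parents;
               must be defined for every combination of s and w values.
    """
    # Marginal of the source; the cut arrow is fed with an independent copy of it.
    p_source = {}
    for (s, _), prob in p_parents.items():
        p_source[s] = p_source.get(s, 0.0) + prob

    # Post-cutting conditional of the target given the remaining parent only.
    p_cut = {}
    for w in {w for (_, w) in p_parents}:
        dist = {}
        for s_copy, ps in p_source.items():
            for y, py in p_target[(s_copy, w)].items():
                dist[y] = dist.get(y, 0.0) + ps * py
        p_cut[w] = dist

    # Only the target's conditional changes, so the parent joint cancels in the log.
    total = 0.0
    for (s, w), prob in p_parents.items():
        for y, py in p_target[(s, w)].items():
            if prob > 0 and py > 0:
                total += prob * py * log2(py / p_cut[w][y])
    return total


# Illustrative input: independent fair bits as parents, target = exclusive-or.
parents = {(s, w): 0.25 for s in (0, 1) for w in (0, 1)}
mechanism = {(s, w): {s ^ w: 1.0} for s in (0, 1) for w in (0, 1)}
print(arrow_strength(parents, mechanism))
```

For this illustrative input the cut arrow carries one bit, even though the source and the target are statistically independent when considered pairwise.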

The idea of feeding conditionals with independent copies of variables and measuring the KL-distance between the true distribution and the one resulting from this randomization is also used for quantifying indirect and path-specific causal influence in [28]. Note that these notions of causal influence also have a different intention and do not distinguish whether the information that node propagates to the target has been inherited from ’s parents or generated at the node itself (recall the question of whether the delay of a train has been caused by waiting for other delayed trains or by issues related to the train itself). For the quantification of the influence of on in (23), indirect influence in [28] coincides with the strength of the arrow from [13] anyway.

No attribution analysis

Although satisfies all the Postulates for causal strength stated by [13], it fails to provide an attribution of causal influence in the sense desired here. For the simple case of multiple independent causes of (Subsection 3.1) one easily checks (which also follows easily from Theorem 4 in [13]). Although this makes sense intuitively, the sum of all does not represent anything meaningful. Defining an attribution via thus lacks conceptual justification.

7 Discussion

We have defined an information-based measure of causal influence that is guided by quantifying the contribution of nodes in the sense of the information a node adds on top of the information it inherits from its parents. By discussing other measures of causal influence, we have argued that quantifying causality raises difficult questions, since different definitions capture different ideas of what exactly is supposed to be measured. The present paper is guided by the idea that structure preserving interventions are particularly helpful for formalizing causality in the ‘normal mode’, as opposed to interventions that involve changes of the structural equation of the particular node at hand (as, for instance, the do-operator does).

8 Appendix: Proof of Lemma 3

It is sufficient to show our claim for the case where contains just one additional element, say , since the remaining part follows by induction. When computing via a sum over all we can always merge two corresponding terms: one set not containing and one corresponding set . Due to the irrelevance of we have

that is, both terms are the same as for the set in (8), up to the combinatorial factors. For the computation of , the term with comes with the factor , while comes with . The sum of these terms reads

which coincides with the factor in (8).
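
The merging of combinatorial factors used in this proof can be checked numerically. The sketch below rests on assumptions: it takes (8) to be the standard Shapley formula with weights |T|!(n-|T|-1)!/n!, and the function names `shapley`, `merged_weight` and the toy value function `value_ignoring_c` are hypothetical. It verifies that adding a player who never changes the value function leaves the Shapley values of the other players untouched.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley(players, value):
    """Exact Shapley values for a value function defined on frozensets of players."""
    n = len(players)
    phi = {}
    for j in players:
        others = [p for p in players if p != j]
        total = Fraction(0)
        for size in range(n):
            weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi[j] = total
    return phi


def merged_weight(t, n):
    """Sum of the two weights that a set of size t (out of n - 1 other players)
    receives once one irrelevant player is added: without and with that player."""
    without = Fraction(factorial(t) * factorial(n - t), factorial(n + 1))
    with_extra = Fraction(factorial(t + 1) * factorial(n - t - 1), factorial(n + 1))
    return without + with_extra


def value_ignoring_c(s):
    """Toy value function that depends only on players 'a' and 'b'."""
    return len(s & {"a", "b"}) ** 2


print(shapley(["a", "b"], value_ignoring_c))
print(shapley(["a", "b", "c"], value_ignoring_c))
```

In this toy game the two relevant players keep the same values after the irrelevant player is added, and the irrelevant player receives zero, in line with the weight identity that `merged_weight` reproduces.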


  • [1] E. Krapohl, K. Rimfeld, N. Shakeshaft, M. Trzaskowski, A. McMillan, J-B. Pingault, K. Asbury, N. Harlaar, Y. Kovas, P. Dale, and R. Plomin. The high heritability of educational achievement reflects many genetically influenced traits, not just intelligence. Proceedings of the National Academy of Sciences, 111(42):15273–15278, 2014.
  • [2] S. P. R. Rose. Commentary: heritability estimates–long past their sell-by date. International Journal of Epidemiology, 35(3):525–527, 2006.
  • [3] A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617, 2016.
  • [4] S. Lundberg and S. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
  • [5] D. Janzing, L. Minorics, and P. Blöbaum. Feature relevance quantification in explainable AI: A causal problem. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2907–2916, Online, 26–28 Aug 2020. PMLR.
  • [6] J. Pearl. Causality. Cambridge University Press, 2000.
  • [7] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference – Foundations and Learning Algorithms. MIT Press, 2017.
  • [8] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York, NY, 1993.
  • [9] J. Peters, JM. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053, 2014.
  • [10] J. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
  • [11] T. Cover and J. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, New York, 1991.
  • [12] N. Ay and D. Polani. Information flows in causal networks. Advances in Complex Systems, 11(1):17–41, 2008.
  • [13] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, and B. Schölkopf. Quantifying causal influences. Annals of Statistics, 41(5):2324–2358, 2013.
  • [14] F. Eberhardt and R. Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.
  • [15] K. Korb, L. Hope, A. Nicholson, and K. Axnick. Varieties of causal intervention. In C. Zhang, H. Guesgen, and W. Yeap, editors, Trends in Artificial Intelligence, volume 3157 of Lecture Notes in Computer Science. Springer, 2004.
  • [16] J. Tian and J. Pearl. Causal discovery from changes. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pages 512–521. Morgan Kaufmann Publishers Inc., 2001.
  • [17] D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. In M. Meila and X. Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pages 107–114, San Juan, Puerto Rico, 2007. PMLR.
  • [18] F. Markowetz, S. Grossmann, and R. Spang. Probabilistic soft interventions in conditional Gaussian networks. In AISTATS, 2005.
  • [19] L. Shapley. A value for n-person games. Contributions to the Theory of Games (AM-28), 2, 1953.
  • [20] R. C. Lewontin. Annotation: the analysis of variance and the analysis of causes. American Journal of Human Genetics, 26(3):400–411, 1974.
  • [21] R. Northcott. Can ANOVA measure causal strength? The Quarterly Review of Biology, 83(1):47–55, 2008.
  • [22] O. Kempthorne. An Introduction to Genetic Statistics. John Wiley and Sons Inc, New York, 1957.
  • [23] D. Janzing, K. Budhathoki, L. Minorics, and P. Blöbaum. Causal structure based root cause analysis of outliers, 2019.
  • [24] J. Halpern and J. Pearl. Causes and explanations: A structural-model approach. Part II: Explanations. The British Journal for the Philosophy of Science, 56(4):889–911, 2005.
  • [25] R. Quax, O. Har-Shemesh, and P. Sloot. Quantifying synergistic information using intermediate stochastic variables. Entropy, 19(2), 2017.
  • [26] P. K. Rubenstein, S. Weichwald, S. Bongers, J. M. Mooij, D. Janzing, M. Grosse-Wentrup, and B. Schölkopf. Causal consistency of structural equation models. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI 2017), 2017.
  • [27] J. Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pages 411–420, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
  • [28] G. Schamberg, W. Chapman, S.-P. Xie, and T. Coleman. Direct and indirect effects – an information theoretic perspective. arXiv:1912.10508, 2019.