 # Bayesian Causal Inference

We address the problem of two-variable causal inference. This task is to infer an existing causal relation between two random variables, i.e. X → Y or Y → X, from purely observational data. We briefly review a number of state-of-the-art methods for this, including very recent ones. A novel inference method is introduced, Bayesian Causal Inference (BCI), which assumes a generative Bayesian hierarchical model to pursue the strategy of Bayesian model selection. In the model the distribution of the cause variable is given by a Poisson lognormal distribution, which allows to explicitly regard discretization effects. We assume Fourier diagonal Field covariance operators. The generative model assumed provides synthetic causal data for benchmarking our model in comparison to existing State-of-the-art models, namely LiNGAM, ANM-HSIC, ANM-MML, IGCI and CGNN. We explore how well the above methods perform in case of high noise settings, strongly discretized data and very sparse data. BCI performs generally reliable with synthetic data as well as with the real world TCEP benchmark set, with an accuracy comparable to state-of-the-art algorithms.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 Motivation and Significance of the Topic

Causal Inference regards the problem of drawing conclusions about how some entity we can observe does - or does not - influence or is being influenced by another entity. Having knowledge about such law-like causal relations enables us to predict what will happen ( the effect) if we know how the circumstances ( the cause) do change. For example, one can draw the conclusion that a street will be wet (the effect) whenever it rains (the cause). Knowing that it will rain, or indeed observing the rainfall itself, enables one to predict that the street will be wet. Less trivial examples can be found in the fields of epidemiology (identifying some bacteria as the cause of a desease) or economics (knowing how taxes will influence the GDP of a country).

As Peters et al. (2017)

remark, the mathematical formulation of these topics has only recently been approached. Especially within the fields of data science and machine learning specific tasks from causal inference have been attracting much interest recently.

Hernán et al. (2018) propose that causal inference stands as a third main task of data science besides description and prediction. Judea Pearl, best known for his Standard Reference Causality: Models, Reasoning and Inference, recently claimed that the task of causal inference will be the next “big problem” for Machine Learning (Pearl, 2018). Such a specific problem is the two variable causal inference, also addressed as the cause-effect problem by Peters et al. (2017). Given purely observational data from two random variables, and , which are directly causally related, the challenge is to infer the correct causal direction. Interestingly, this is an incorporation of a fundamental asymmetry between cause and effect which does always hold and can be exploited to tackle such an inference problem. Given two random variables, and which are related causally, (“ causes ”), there exists a fundamental independence between the distribution of the cause and the mechanism which relates the cause to the effect . This independence however does not hold in the reverse direction. Most of the proposed methods for the inference of such a causal direction make use of this asymmetry in some way, either by considering the independence directly (Daniusis et al., 2010; Mooij et al., 2016), or by taking into account the algorithmic complexity for the description of the factorization and comparing it to the complexity of the reverse factorization .

### 1.2 Structure of the Work

The rest of the paper will be structured as following. In Section 2 we will briefly outline and specify the problem setting. We also will review existing methods here, namely Additive Noise Models, Information Geometric Causal Inference and Learning Methods.

Section 3 will describe our inference model which is based on a hierarchical Bayesian model.

In Section 4 we will accompany the theoretical framework with experimental results. To that end we outline the “forward model” which allows to sample causally related data in 4.1. We describe a specific algorithm for the inference model in 4.2, which is then tested on various benchmark data (4.3). The performance is evaluated and compared to state-of-the-art methods mentioned in Section 2.

We conclude in Section 5 by assessing that our model generally can show competitive classification accuracy and propose possibilities to further advance the model.

## 2 Problem Setting and Related Work

Here and in the following we assume two random variables, and , which map onto measurable spaces and . Our problem, the two-variable causal inference, is to determine if causes or causes , given only observations from these random variables.

### 2.1 Problem Setting

Regarding the definition of causality we refer to the do-calculus introduced by Pearl (2000). Informally, the intervention can be described as setting the random variable to attain the value . This defines a causal relation (“ causes ”) via

 X→Y⇔P(y|do(x))≠P(y|do(x′)) (1)

for some being realizations of and being a realization of (Mooij et al., 2016).

We want to focus on the case of two observed variables, where either or holds. Our focus is on the specific problem to decide, in a case where two variables and are observed, whether holds or We suppose to have access to a finite number of samples from the two variables, i.e. samples from and from . Our task is to decide the true causal direction using only these samples:

###### Problem

Prediction of causal direction for two variables
Input: A finite number of sample data , where
Output: A predicted causal direction

### 2.2 Related Work

Approaches to causal inference from purely observational data are often divided into three groups (Spirtes and Zhang, 2016; Mitrovic et al., 2018), namely constraint-based, score-based and asymmetry-based methods. Sometimes this categorization is extended by considering learning methods as a fourth, separate group. Constraint-based and score-based methods are using conditioning on external variables. In a two-variable case there are no external variables so they are of little interest here.

Asymmetry-based methods exploit an inherent asymmetry between cause and effect. This asymmetry can be framed in different terms. One way is to use the concept of algorithmic complexity - given a true direction

, the factorization of the joint probability into

will be less complex than the reverse factorization Such an approach is often used by Additive Noise Models (ANMs). This family of inference models assume additive noise, i.e. in the case , is determined by some function , mapping to , and some collective noise variable , i.e. , where is independent of .

An early model called LiNGAM (Shimizu et al., 2006)

uses Independent Component Analysis on the data belonging to the variables. This model however makes the assumptions of linear relations and non-Gaussian noise.

A more common approach is to use some kind of regression (e.g. Gaussian process regression) to get an estimate on the function

and measure how well the model such obtained fits the data. The latter is done by measuring independence between the cause variable and the regression residuum (ANM-HSIC, Hoyer et al., 2009; Mooij et al., 2016) or by employing a Bayesian model selection (ANM-MML, Stegle et al., 2010).

Another way of framing the asymmetry mentioned above is to state that the mechanism relating cause and effect should be independent of the cause. This formulation is employed by the concept of IGCI (Information Geometric Causal Inference, Daniusis et al., 2010).

The recent advances in the field of deep learning are represented in an approach called

CGNN (Causal Generative Neural Networks, Goudet et al., 2017)

. The authors use Generative Neural Networks to model the distribution of one variable given samples from the other variable. As Neural Networks are able to approximate nearly arbitrary functions, the direction where such a artificial modelling is closer to the real distributions (inferred from the samples) is preferred.

Finally, KCDC (Kernel Conditional Deviance for Causal Inference, Mitrovic et al., 2018) uses the thought of asymmetry in the algorithmic complexity directly on the conditional distributions

. The model measures the variance of the conditional mean embeddings of the above distributions and prefers the direction with the less varying embedding.

## 3 A Bayesian Inference Model

Our contribution incorporates the concept of Bayesian model selection and builds on the formalism of IFT (Information Field Theory, Enßlin et al., 2009). Bayesian model selection compares two competing models, in our case and , and asks for the ratio of the marginalized likelihoods,

 OX→Y=P(d|X→Y,M)P(d|Y→X,M)

Here

denotes the hyperparameters which are assumed to be the same for both models and are yet to be specified.

In the setting of the present causal inference problem a similar approach has already been used by Stegle et al. (2010)

. This approach however does use a Gaussian mixture model for the distribution of the cause variable while we consider a Poissonian distribution following a log-normal field. IFT considers a

signal field via some relation where follows some probability

. Such a signal fields in principle has an infinite number of degrees of freedom, this makes it an interesting choice to model our distribution of the cause variable and the function relating cause and effect.

Throughout the following we will consider as the true underlying direction which we derive our formalism on. The derivation for will follow analogously by switching the variables. We will begin with deriving in 3.1 the distribution of the cause variable, . In 3.2 we continue by considering the conditional distribution

. Combining those results, we compute then the full Bayes factor in

3.3.

### 3.1 Distribution of the Cause Variable

Without imposing any constraints, we reduce our problem to the interval by assuming that . This can always be ensured by rescaling the data. We make the assumption that in principle the cause variable follows a lognormal distribution.

 P(x|β)∝eβ(x)

with

, being some signal field which follows a zero-centered normal distribution,

.
Here we write for the covariance operator .

We postulate statistical homogeneity for the covariance, that is

 Eβ∼P(β)[β(x)]=E[β(x+t)] Eβ∼P(β)[β(x)β(y)]=E[β(x+t)β(y+t)]

i.e. first and second moments should be independent on the absolute location. The

Wiener-Khintchine Theorem now states that the covariance has a spectral decomposition, i.e. it is diagonal in Fourier space, under this condition (see e.g. Chatfield, 2016)

. Denoting the Fourier transform by

, i.e. in the one dimensional case, . Therefore, the covariance can be completely specified by a one dimensional function:

 (FBF−1)(k,q)=2πδ(k−q)Pβ(k)

Here, is called the power spectrum.

Building on these considerations we now regard the problem of discretization. Measurement data itself is usually not purely continuous but can only be given in a somewhat discretized way (e.g. by the measurement device itself or by precision restrictions imposed from storing the data). Another problem is that many numerical approaches to inference tasks, such as Gaussian Process regression, use finite bases as approximations in order to efficiently obtain results (Mooij et al., 2016; Stegle et al., 2010). Here, we aim to directly confront these problems by imposing a formalism where the discretization is inherent.

So instead of taking a direct approach with the above formulation, we use a Poissonian approach and consider an equidistant grid in the interval. This is equivalent to defining bins, where the are the midpoints of the bins. We now take the measurement counts, which gives the number of -measurements within the

-th bin. For these measurement counts we now take a Poisson lognormal distribution as an Ansatz, that is, we assume that the measurement counts for the bins are Poisson distributed, where the means follow a lognormal distribution. We can model this discretization by applying a response operator

to the lognormal field. This is done in the most direct way via employing a Dirac delta distribution

 Rjx≡δ(x−zj)

In order to allow for a more compact notation we will use an index notation from now on, e.g. for some function or for some operator . Whenever the indices are suppressed, an integration (in the continuous case) or dot product (in the discrete case) is understood, e.g.

In the following we will use bold characters for finite dimensional vectors, e.g.

. By inserting such a finite dimensional vector in the argument of a function, e.g. we refer to a vector consisting of the function evaluated at each entry of , that is . Later on we will use the notation which raises some vector to a diagonal matrix ( (no summation implicated)). We will use this notation analogously for fields, e.g. (). Writing denotes the dot product of the vector with a vector of ones and hence corresponds to the summation of the entries of (

). Now we can state the probability distribution for

, the measurement count in bin :

 P(kj|λj)=λkjje−λjkj! λj=E(k|β)[kj]=ρeβzj=∫dxRjxeβx=ρ(Reβ)j

Therefore, considering the whole vector of bin means and the vector of bin counts at once:

 λ=ρReβ=ρeβ(z) P(k|λ)=∏jλkjje−λjkj!=∏j(Rjeβ)kje−Rjeβkj!=(∏j(Rjeβ)kj)e−1†Reβ∏jkj! P(x|k)=1N!

The last equation follows from the consideration that given the counts for the bins, only the positions of the observations is fixed, but the ordering is not. The observations can be ordered in ways.

Now considering the whole vector of bin counts at once, we get

 P(k|β)=e∑jkjβ(zj)e−ρ†eβ(z)∏jkj!=ek†β(z)−ρ†eβ(z)∏jkj! (2)

A marginalization in involving a Laplace approximation around the most probable leads to (see Appendix for a detailed derivation):

 P(x|Pβ,X→Y)≈1N!e+k†β0−ρ†eβ0−12β†0B−1β0∣∣ρBˆeβ0+1∣∣12∏jkj! (4) H(x|Pβ,X→Y)≈H0+12log|ρBˆeβ0+ˆ1|+log(∏jkj!)−k†β0+ρ†eβ0+12β†0B−1β0 (5)

where is called the information Hamiltonian and collects all terms which do not depend on the data .

### 3.2 Functional Relation of Cause and Effect

Similarly to

, we suppose a Gaussian distribution for the function

, relating to :

 R[0,1]∋f∼N(0|f,F)

Proposing a Fourier diagonal covariance once more, determined by a power spectrum ,

 (FFF−1)(k,q)=2πδ(k−q)Pf(k),

we assume additive Gaussian noise, using the notation and . We have

 y =f(x)+ϵ (6) ϵ ∼N(ϵ|0,E) E ≡diag(ς2,ς2,...)=ς21∈RN×N,

that is, each independent noise sample is drawn from a zero-mean Gaussian distribution with given variance .

Knowing the noise , the cause and the causal mechanism completely determines via Equation 6. Therefore, . We can now state the conditional distribution for the effect variable measurements , given the cause variable measurements . Marginalizing out the dependence on the relating function and the noise we get:

 P(y|x,Pf,ς,X→Y) =∫dNq(2π)Neiq†y−12q(~F+E)q =(2π)−N2∣∣~F+E∣∣−12e−12y†(~F+E)−1y (7)

In the equation above, denotes a the -matrix with entries 111 This type of matrix, i.e. the evaluation of covariance or kernel at certain positions, is sometimes called a Gram matrix. . Again, we give a detailed computation in the appendix.

### 3.3 Computing the Bayes factor

Now we are able to calculate the full likelihood of the data given our assumptions for the direction and vice versa . As we are only interested in the ratio of the probabilities and not in the absolute probabilities itself, it suffices to calculate the Bayes factor:

 OX→Y =P(d|Pβ,Pf,ς,X→Y)P(d|Pβ,Pf,ς,Y→X) =exp[H(d|Pβ,Pf,ς,Y→X)−H(d|Pβ,Pf,ς,X→Y)]

Above we used again the information Hamiltonian

Making use of Equations 4 and 7 we get, using the calculus for conditional distributions on the Hamiltonians, ,

 H(d|Pβ,Pf,ς,X→Y) =H(x|Pβ,X→Y)+H(y|x,Pf,ς,X→Y) =H0+log(∏jkj!)+12log|ρBˆeβ0+1|−k†β0+ +ρ†eβ0+12β†0B−1β0+12y†(~F+E)−1y+12∣∣~F+E∣∣ (8)

Where we suppressed the dependence of on (for the latter, the dependence is not explicit, but rather implicit as is determined by the minimum of the -dependent functional ).

We omit stating explicitly as the expression is just given by taking Equation 8 and switching and or and , respectively.

## 4 Implementation and Benchmarks

We can use our model in a forward direction to generate synthetic data with a certain underlying causal direction. We describe this process in 4.1. In 4.2 we give an outline on the numerical implementation of the inference algorithm. This algorithm is tested on and compared on benchmark data. To that end we use synthetic data and real world data. We describe the specific datasets and give the results in 4.3.

### 4.1 Sampling Causal Data via a Forward Model

To estimate the performance of our algorithm and compare it with other existing approaches a benchmark dataset is of interest to us. Such benchmark data is usually either real world data or synthetically produced. While we will use the TCEP benchmark set of Mooij et al. (2016) in 4.3.2, we also want to use our outlined formalism to generate artificial data representing causal structures. Based on our derivation for cause and effect we implement a forward model to generate data as following.

Algorithm 1 Sampling of causal data via forward model
Input: Power spectra ,
noise variance , number of bins ,
desired number222 As we draw the number of samples from Poisson distribution in each bin, we do not deterministically control the total number of samples of samples
Output: samples generated from a causal relation of either or

1. Draw a sample field from the distribution

2. Set an equally spaced grid with points in the interval :

3. Calculate the vector of Poisson means with

4. At each grid point , draw a sample from a Poisson distribution with mean :

5. Set ,

6. For each add times the element to the set of measured . Construct the vector

7. Draw a sample field from the distribution . Rescale s.th. .

8. Draw a multivariate noise sample from a normal distribution with zero mean and variance ,

9. Generate the effect data by applying to and adding :

10. With probability return , otherwise return ,

Comparing the samples for different power spectra (see Fig. 2), we decide to sample data with power spectra and , as these seem to resemble “natural” mechanisms, see Fig. 2. Figure 2: Different field samples from the distribution N(⋅|0,F†ˆPF) (on the left) with the power spectrum P(q)∝1q2+1 (top), P(q)∝1q4+1 (middle), P(q)∝1q6+1 (bottom). On the left, the field values themselves are plotted, on the right an exponential function is applied to those, as in our formulation λj∝eβ(zj) (Same colors / line styles on the right and the left indicate the same underlying functions) .

### 4.2 Implementation of the Bayesian Causal Inference Model

Based on our derivation in Section 3 we propose a specific algorithm to decide the causal direction of a given dataset and therefore give detailed answer for Problem 2.1. Basically the task comes down to find the minimum for the saddle point approximation and calculate the terms given in Equation 8:

Algorithm 2 2-variable causal inference
Input: Finite sample data , Hyperparameters
Output: Predicted causal direction

1. Rescale the data to the interval. I.e. and

2. Define an equally spaced grid of in the interval

3. Calculate matrices representing the covariance operators and evaluated at the positions of the grid, i.e.

4. Find the for which , as defined in Appendix A.1 ( Equation 14), becomes minimal

5. Calculate the -dependent terms of the information Hamiltonian in Equation 8 (i.e. all terms except )

6. Repeat steps 4 and 5 with and switched

7. Calculate the Bayes factor

8. If , return , else return

We provide an implementation of Algorithm 4.2 in Python. We approximate the operators as matrices , which allows us to explicitly numerically compute the determinants and the inverse. As the most critical part we consider the minimization of , i.e. step 4 in Algorithm 4.2. As we are however able to analytically give the curvature and the gradient of the energy to minimize, we can use a Newton-scheme here. We derive satisfying results (see Fig. LABEL:fig:demonstration_beta_min ) using the Newton-CG algorithm (Nocedal and Wright, 2006), provided by the SciPy-Library (Jones et al., 2001). After testing our algorithm on different benchmark data, we choose the default hyperparameters as

 Pβ=Pf∝1q4+1, (9) ς2=0.01, (10) r=512, (11) ρ=1. (12)

While fixing the power spectra might seem somewhat arbitrary, we remark that this corresponds to fixing a kernel e.g. as a squared exponential kernel, which is done in many publications (e.g. Mitrovic et al. (2018); Goudet et al. (2017)). Future Extensions of our method might learn and if the data is rich enough.

### 4.3 Benchmark Results

We compare our outlined model, in the following called BCI (Bayesian Causal Inference), to a number of state-of-the-art approaches. The selection of the considered methods is influenced by the ones in recent publications, e.g. Mitrovic et al. (2018); Goudet et al. (2017). Namely, we include the LiNGAM algorithm, acknowledging it as one of the oldest models in this field and a standard reference in many publications. We also use the ANM algorithm (Mooij et al., 2016) with HSIC and Gaussian Process Regression (ANM-HSIC) as well as the ANM-MML approach (Stegle et al., 2010). The latter uses a Bayesian Model Selection, arguably the closest to the algorithm proposed on this publication, at least to our best knowledge. We further include the IGCI algorithm, as it differs fundamentally in its formulation from the ANM algorithms and has shown strong results in recent publications (Mooij et al., 2016; Goudet et al., 2017; Mitrovic et al., 2018). We employ the IGCI algorithm with entropy estimation for scoring and a Gaussian distribution as reference distribution.

Finally, CGNN (Goudet et al., 2017) represents the rather novel influence of deep learning methods. We use the implementation provided by the authors, with itself uses Python with the Tensorflow (Abadi et al., 2015)

library. The most critical hyper-parameter here is, as the authors themselves mention, the number of hidden neurons which we set to a value of

, as this is the default in the given implementation and delivers generally good results. We use 32 runs each, as recommended by the authors of the algorithm.

A comparison with the KCDC algorithm would be interesting, unfortunately the authors did not provide any computational implementation so far (November 2018).

We compare the mentioned algorithms to BCI on basis of synthetic and real world data. For the synthetic data we use our outlined forward model as outlined in 4.1 with varying parameters. For the real world data we use the well-known TCEP dataset (Tuebingen Cause Effect Pairs, Mooij et al., 2016).

#### 4.3.1 Results for synthetic benchmark data

We generate our synthetic data adopting the power spectra for both, , . We further set =512, =300 and =0.05 as default settings. We provide the results of the benchmarks in Table 1. While BCI achieves almost perfect results (98%), the assessed ANM algorithms provide a perfect performance here.

As a first variation, we explore the influence of high and very high noise on the performance of the inference models. Therefore we set the parameter =0.2 for high noise and =1 for very high noise in Algorithm 4.1, while keeping the other parameters set to the default values. While our BCI algorithm is affected but still performs reliably with an accuracy of , especially the ANM algorithms perform remarkably robust in presence of the noise. This is likely due to the fact that the distribution of the true cause is not influenced by the high noise and this distribution is assessed in its own by those.

As our model uses a Poissonian approach, which explicitly considers discretization effects of data measurement, it is of interest how the performance behaves when using a strong discretization. We emulate such a situation by employing our forward model with a very low number of bins. Again we keep all parameters to default values and set =16 and =8 for synthetic data with high and very high discretization. Again the ANM models turn out to be robust again discretization. CGNN and IGCI perform significantly worse here. In the case of IGCI this can be explained by the entropy estimation, which simply removes non-unique samples. Our BCI algorithm is able to achieve over accuracy here.

We explore another challenge for inference algorithms by strongly reducing the number of samples. While we sampled about 300 observations with our other forward models so far, here we reduce the number of observed samples to 30 and 10 samples. In this case BCI performs very well compared to the other models, in fact it is able to outperform them in the case of just 10 samples being given. We note that of course BCI does have the advantage that it “knows” the hyperparameters of the underlying forward model. Yet we consider the results as encouraging, the advantage will be removed in the confrontation with real world data.

#### 4.3.2 Results for Real World Benchmark Data

The most widely used benchmark set with real world data is the TCEP (Mooij et al., 2016). We use the 102 2-variable datasets from the collection with weights as proposed by the maintainers. As some of the contained datasets include a high number of samples (up to 11000) we randomly subsample large datasets to 500 samples each in order to keep computation time maintainable. We did not include the LiNGAM algorithm here, as we experienced computational problems with obtaining results here for certain datasets (namely pair0098). Goudet et al. (2017) report the accuracy of LiNGAM on the TCEP dataset to be around . BCI performs generally comparable to established approaches as ANM and IGCI. CGNN performs best with an accuracy about here, a bit lower than the one reported by Goudet et al. (2017) of around . The reason for this is arguably to be found in the fact that we set all hyperparameters to fixed values, while Goudet et al. (2017) used a leave-one-out-approach to find the best setting for the hyperparameter .

Motivated by the generally strong performance of our approach in the case of sparse data, we also explore a situation where real world data is only sparsely available. To that end, we subsample all TCEP datasets down to 75 randomly chosen samples kept for each one. To circumvent the influence of the subsampling procedure we average the results over 20 different subsamplings. The results are as well given in Table 2. The loss in accuracy of our model is rather small.

## 5 Discussion and Conclusion

The Bayesian Causal Inference method for the 2-variable causal inference task introduced builds on the formalism of information field theory. In this regard, we employed the concept of Bayesian model selection and made the assumption of additive noise, i.e. . In contrast to other methods which do so, such as ANM-MML, we do not model the cause distribution by a Gaussian mixture model but by a Poisson Lognormal statistic.

We could show that our model is able to provide reliable classification accuracy in the present causal inference task. One difference from our model to existing ones is arguably to be found in the choice of the covariance operators. While most other publications use squared exponential kernels for Gaussian process regression, we choose a covariance which is governed by a power spectrum. This permits more structure at small scales than in methods using a squared exponential kernel.

As a certain weak point of BCI we consider the approximation of the uncomputable path integrals via the Laplace approximation. A thorough investigation of error bounds (e.g. Majerski, 2015) is yet to be carried out. As an alternative, one can think about sampling-based approaches to approximate the integrals. A recent publication (Caldwell et al., 2018)

introduced a harmonic-mean based sampling approach to approximate moderate dimensional integrals. Adopting such a technique to our very high dimensional case might be promising to improve

BCI.

Another interesting perspective is provided by deeper hierarchical models. While the outlined method took the power spectra and the noise variance as fixed hyperparameters it would also be possible to infer these as well in an extension of the method.

Yet, the implementation of our model with fixed noise variance and power spectra was able to deliver competitive results with regard to state-of-the-art methods in the benchmarks. In particular, our method seems to be slightly superior in the low sample regime, probably due to the more appropriate Poisson statistic used. We consider this as an encouraging result for a first work in the context of information field theory-based causal inference.

## Appendix A. Explicit Derivations

We give explicit derivations for the obtained results

### A.1 Saddle Point Approximation for the Derivation of the Cause Likelihood

Marginalizing we get

 P(x|Pβ,X→Y) =1N!∫D[β]P(x|β,X→Y)P(β|Pβ) =1N!|2πB|−12∫D[β]ek†β(z)−ρ†eβ(z)∏jkj!e−12β†B−1β= =|2πB|−12N!∏jkj!∫D[β]e−γ[β] (13)

where

 γ[β]≡−k†β(z)+ρ†eβ(z)+12β†B−1β. (14)

We approach this integration by a saddle point approximation. In the following we will denote the functional derivative by , i.e. .

Taking the first and second order functional derivative of w.r.t. we get

 ∂βγ[β]=−k†+ρ(eβ(z))†+β†B−1 ∂β∂βγ[β]=ˆρeβ(z)+B−1.

The above derivatives are still defined in the space of functions , that is

 k†u≡nbins∑j=1kj(~Rj)u (ˆρeβ(z))uv=ρnbins∑j=1(~Rj)u(~Rj)veβ(u).

The latter expression therefore represents a diagonal operator with as diagonal entries.

Let denote the function that minimizes the functional , i.e.

 δγ[β]δβ∣∣∣β=β0=0.

We expand the functional up to second order around ,

 ∫D[β]e−γ[β]= ∫D[β]e−γ[β0]−(δγ[β]δβ|β=β0)†β−12β†(δ2γ[β]δβ†β|β=β0)β+O(β3) ≈ e−γ[β0]∣∣ ∣∣2π(δ2γ[β]δβ2|β=β0)−1∣∣ ∣∣12 = e+k†β0−ρ†eβ0−12β†0B−1β0∣∣∣12π(ˆρeβ0+B−1)∣∣∣−12, (15)

where we dropped higher order terms of , used that the gradient at vanishes and evaluated the remaining Gaussian integral.

Plugging the result of Equation 15 into Equation 13 and using

 |2πB|−12∣∣∣12π(ˆρeβ0+B−1)∣∣∣−12=∣∣B(ˆρeβ0+B−1)∣∣−12=∣∣ρBˆeβ0+1∣∣−12

we get

 P(x|Pβ,X→Y)≈1N!e+k†β0−ρ†eβ0−12β†0B−1β0∣∣ρBˆeβ0+1∣∣12∏jkj!,and H(x|Pβ,X→Y)≈H0+12log|ρBˆeβ0+ˆ1|+log(∏jkj!)−k†β0+ρ†eβ0+12β†0B−1β0,

### A.2 Explicit Derivation of the Effect Likelihood

Here we will derive the explicit expression of the likelihood of the effect, which can be carried out analytically. We start with the expression for the likelihood of the effect data , given the cause data , the power spectrum , the noise variance and the causal direction . That is, we simply marginalize over the possible functions and noise .

 P(y|x,Pf,ς,X→Y) =∫D[f]dNϵP(y|x,f,ϵ,X→Y)P(ϵ|ς)P(f|Pf) =∫D[f]dNϵδ(y−f(x)−ϵ)N(ϵ|0,E)N(f|0,F)

Above we just used the equation and used the distributions for and . We will now use the Fourier representation of the delta distribution, specifically .

 δ(y−f(x)−ϵ)=∫dNq(2π)Neiq†(y−ϵ−f(x))=∫dNq(2π)Neiq†(y−ϵ−f(x))

Once more we employ a vector of response operators, mapping to ,

 Rx≡(R1x,...,RNx)T=(δ(x−x1),...,δ(x−xN))T.

This allows to represent the evaluation , i.e. as a linear dot-product. Using the well known result for Gaussian integrals with linear terms (see e.g. Greiner et al., 2013),

 ∫D[u]e−12u†Au+b†u=∣∣∣A2π∣∣∣−12e12b†Ab (16)

We are able to analytically do the path integral over ,

 P(y|x,Pf,ς,X→Y) =|2πF|−12∫D[f]dNϵdNq(2π)Neiq†(y−ϵ−R†f)−12f†F−1fN(ϵ|0,E) =∫dNϵdNq(2π)Neiq†(y−ϵ)+(−i)212q†R†FRqN(ϵ|0,E)

Now, we do the integration over the noise variable, , by using the equivalent of Equation 16 for the vector-valued case:

 P(y|x,Pf,ς,X→Y) =|2πE|−12∫dNϵdNq(2π)Neiq†(y−ϵ)−12q†R†FRq−12ϵ†E−1ϵ =∫dNq(2π)Neiq†y−12q(R†FR+E)q

In the following we will write , with entries . The integration over the Fourier modes , again via the multivariate equivalent of Equation 16, will give the preliminary result:

 P(y|x,Pf,ς,X→Y) =∫dNq(2π)Neiq†y−12q(~F+E)q =(2π)−N2∣∣~F+E∣∣−12e−12y†(~F+E)−1y

## References

• Abadi et al. (2015) Martín Abadi, Ashish Agarwal, others Chen, Devin, Kudlur, Steiner, Wattenberg, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
• Caldwell et al. (2018) Allen Caldwell, Raffael C. Schick, Oliver Schulz, and Marco Szalay. Integration with an Adaptive Harmonic Mean Algorithm. ArXiv e-prints, August 2018.
• Chatfield (2016) Chris Chatfield. The Analysis of Time Series: An Introduction, Sixth Edition. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, 2016. ISBN 9780203491683.
• Daniusis et al. (2010) Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Inferring deterministic causal relations. In

Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence

, pages 143–150, Corvallis, OR, USA, July 2010. Max-Planck-Gesellschaft, AUAI Press.
• Enßlin et al. (2009) Torsten A. Enßlin, Mona Frommert, and Francisco S. Kitaura. Information field theory for cosmological perturbation reconstruction and nonlinear signal analysis. Phys. Rev. D, 80:105005, 11 2009.
• Goudet et al. (2017) Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, David Lopez-Paz, Isabelle Guyon, Michèle Sebag, Aris Tritas, and Paola Tubaro. Learning functional causal models with generative neural networks, 2017.
• Greiner et al. (2013) W. Greiner, D.A. Bromley, and J. Reinhardt. Field Quantization. Springer Berlin Heidelberg, 2013. ISBN 9783642614859.
• Hernán et al. (2018) Miguel A. Hernán, John Hsu, and Brian Healy. Data science is science’s second chance to get causal inference right: A classification of data science tasks, 2018.
• Hoyer et al. (2009) Patrik O. Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 689–696. Curran Associates, Inc., 2009.
• Jones et al. (2001) Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001.
• Majerski (2015) Piotr Majerski. Simple error bounds for the multivariate laplace approximation under weak local assumptions. arXiv preprint arXiv:1511.00302, 2015.
• Mitrovic et al. (2018) Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh. Causal inference via kernel deviance measures. In Advances in Neural Information Processing Systems 31, pages 6986–6994. Curran Associates, Inc., 2018.
• Mooij et al. (2016) Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
• Nocedal and Wright (2006) Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
• Pearl (2000) Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8.
• Pearl (2018) Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 3–3, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5581-0.
• Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA, 2017.
• Shimizu et al. (2006) Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
• Spirtes and Zhang (2016) Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. Applied Informatics, 3(1):3, 2 2016.
• Stegle et al. (2010) Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1687–1695. Curran Associates, Inc., 2010.