1 Introduction
Causal inference from observational data, i.e. inferring cause and effect from data that was not collected through randomized controlled trials, is one of the most challenging and important problems in statistics [21]. One of the main assumptions in causal inference is that of causal sufficiency. That is, to make sensible statements on the causal relationship between two statistically dependent random variables X and Y, it is assumed that there exists no hidden confounder Z that causes both X and Y. In practice this assumption is often violated—we seldom know all factors that could be relevant, nor do we measure everything—and hence existing methods are prone to spurious inferences.
In this paper, we study the problem of inferring whether X and Y are causally related, or are more likely jointly caused by an unobserved confounding variable Z. To do so, we build upon the algorithmic Markov condition (AMC) [8]. This recent postulate states that the simplest—measured in terms of Kolmogorov complexity—factorization of the joint distribution coincides with the true causal model. Simply put, this means that if Z causes both X and Y, the complexity of the factorization according to this model, K(P(Z)) + K(P(X | Z)) + K(P(Y | Z)), will be lower than the complexity corresponding to the model where X causes Y, K(P(X)) + K(P(Y | X)). As we obviously do not have access to Z, we propose to estimate it using latent factor modelling. Second, as Kolmogorov complexity is not computable, we use the Minimum Description Length (MDL) principle as a well-founded approach to approximate it from above. This is the method that we develop in this paper.
In particular, we consider the setting where we are given a sample over the joint distribution of a continuous-valued univariate or multivariate random variable X and a continuous-valued scalar Y. Although it has received little attention so far, we are not the first to study this problem. Recently, Janzing and Schölkopf [9, 10] showed how to measure the "structural strength of confounding" for linear models using resp. spectral analysis [9] and ICA [10]. Rather than implicitly measuring the significance, we explicitly model the hidden confounder via probabilistic PCA. While this means our approach is also linear in nature, it gives us the advantage that we can fairly compare the scores for the causal and the confounded model, allowing us to define a reliable confidence measure.
Through extensive empirical evaluation on synthetic and real-world data, we show that our method, CoCa, short for Confounded-or-Causal, performs well in practice. This includes settings where the modelling assumptions hold, but also adversarial settings where they do not. We show that CoCa beats both baselines as well as the recent proposals mentioned above. Importantly, we observe that our confidence score strongly correlates with accuracy. That is, for those cases where we observe a large difference between the scores for causal resp. confounded, we can trust CoCa to provide highly accurate inferences.
The main contributions of this paper are as follows: we


extend the AMC with latent factor models, and propose to instantiate it via probabilistic PCA,

define a consistent and easily computable MDL score to instantiate the framework in practice,

provide extensive evaluation on synthetic and real data, including comparisons to the state of the art.
This paper is structured as usual. In Sec. 2 we introduce basic concepts of causal inference, and hidden confounders. We formalize our information theoretic approach to inferring causal or confounded in Sec. 3. We discuss related work in Sec. 4, and present the experiments in Sec. 5. Finally, we wrap up with discussion and conclusions in Sec. 6.
2 Causal Inference and Confounding
In this work, we consider the setting where we are given samples from the joint distribution over two statistically dependent continuous-valued random variables X and Y. We require Y to be a scalar, i.e. univariate, but allow X to be of arbitrary dimensionality, i.e. univariate or multivariate. Our task is to determine whether it is more likely that X causes Y, or that there exists an unobserved random variable Z that is the cause of both X and Y. Before we detail our approach, we introduce some basic notions, and explain why the straightforward solution does not work.
2.1 Basic Setup
It is impossible to do causal inference from observational data without making assumptions [21]. That is, we can only reason about what we should observe in the data if we were to change the causal model, if we assume (properties of) a causal model in the first place.
A core assumption in causal inference is that the data was drawn from a probabilistic graphical model, a causal directed acyclic graph (DAG). To have a fighting chance to recover this causal graph from observational data, we have to make two further assumptions. The first, and in practice the most troublesome, is that of causal sufficiency. This assumption is satisfied if we have measured all common causes of all measured variables. It is related to Reichenbach's principle of common cause [25], which states that if we find that two random variables X and Y are statistically dependent, there are three possible explanations. Either X causes Y (X → Y), or, the other way around, Y causes X (Y → X), or there is a third variable Z that causes both X and Y (X ← Z → Y). In order to determine the latter case, we need to have measured Z.
The second additional assumption we have to make is that of faithfulness, which is defined as follows.
Definition 1 (Faithfulness).
If a Bayesian network G is faithful to a probability distribution P, then for each pair of nodes X_i and X_j in G, X_i and X_j are adjacent in G iff X_i and X_j are dependent conditioned on every set of nodes S with X_i, X_j ∉ S.
In other words, if we measure that X is independent of Y, denoted X ⊥ Y, there is no direct influence between the two in the underlying causal graph. This is a strong, but generally reasonable assumption; after all, violations of this condition generally do not occur unless the distributions have been specifically chosen to this end.
Under these assumptions, Pearl [21] showed that we can factorize the joint distribution over the measured variables,
P(X_1, …, X_n) = ∏_{j=1}^{n} P(X_j | PA_j) .        (1)
That is, we can write it as a product of the distributions of each variable X_j conditioned on its true causal parents PA_j. This is referred to as the causal Markov condition, and implies that, under all of the above assumptions, we can hope to reconstruct the causal graph from a sample from the joint distribution.
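As a small numerical illustration of this factorization (ours, with arbitrary toy probabilities), consider a binary chain X → Y → Z: the empirical joint distribution indeed factors into each variable given its causal parents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy causal chain X -> Y -> Z over binary variables. The causal Markov
# condition says the joint factorizes into each variable given its
# causal parents: P(x, y, z) = P(x) P(y | x) P(z | y).
n = 500_000
x = (rng.random(n) < 0.3).astype(int)
y = np.where(x == 1, rng.random(n) < 0.8, rng.random(n) < 0.2).astype(int)
z = np.where(y == 1, rng.random(n) < 0.7, rng.random(n) < 0.1).astype(int)

# empirical joint distribution as a 2 x 2 x 2 table
joint = np.zeros((2, 2, 2))
for xi in range(2):
    for yi in range(2):
        for zi in range(2):
            joint[xi, yi, zi] = np.mean((x == xi) & (y == yi) & (z == zi))

p_x = joint.sum(axis=(1, 2))                                  # P(x)
p_y_x = joint.sum(axis=2) / p_x[:, None]                      # P(y | x)
p_z_y = joint.sum(axis=0) / joint.sum(axis=(0, 2))[:, None]   # P(z | y)

# the product of the parent-conditionals reproduces the joint,
# up to sampling noise -- even though P(z | x, y) was never used
recon = p_x[:, None, None] * p_y_x[:, :, None] * p_z_y[None, :, :]
assert np.abs(joint - recon).max() < 5e-3
```

Note that the check is non-trivial: the reconstruction uses P(z | y) only, so it matches the joint precisely because Z depends on X only through its parent Y.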
2.2 Crude Solutions That Do Not Work
Based on the above, many methods have been proposed to infer causal relationships from a dataset. We give a high-level overview of the state of the art in Sec. 4. Here, we continue to discuss why it is difficult to determine whether a given pair (X, Y) is confounded or not, and in particular, why traditional approaches based on probability theory or (conditional) independence do not suffice.
To see this, let us first suppose that X causes Y, and there is no hidden confounder Z. We then have that X and Y are dependent, and P(X, Y) = P(X) P(Y | X). Now, let us suppose instead that Z causes both X and Y, i.e. X ← Z → Y. Then we would have P(X, Y, Z) = P(Z) P(X | Z) P(Y | Z), and, importantly, while X ⊥ Y | Z, we still observe X and Y to be dependent, and hence cannot determine causal or confounded on that alone.
Moreover, as we are only given a sample over (X, Y), but know nothing about Z or P(Z), we cannot directly measure whether X ⊥ Y | Z. A simple approach would be to see if we can generate a Z such that X ⊥ Y | Z; for example through sampling or optimization. However, as we have to assign values for Z, this gives us many degrees of freedom, and it is easy to see that under these conditions it is always possible to generate a Z that achieves this independence, even when there was no confounding. A trivial example is to simply set Z = (X, Y).
A similarly flawed idea would be to decide on the likelihoods of the data alone, i.e. to see if we can find a Z for which P(Z) P(X | Z) P(Y | Z) is at least as large as P(X) P(Y | X). Besides having to choose a prior on Z, as we already achieve equality by initializing Z = X and have many degrees of freedom to spare, we will again virtually always find a Z for which this holds, regardless of whether there was a true confounder or not.
Essentially, the problem here is that it is too easy to find a Z for which these conditions hold, which for a large part is due to the fact that we do not take the complexity of Z into account, and hence face the problem of overfitting. To avoid this, we take an information-theoretic approach, such that in a principled manner we can take both the complexity of Z and its effect on X and Y into account.
3 Telling Causal from Confounded by Simplicity
We base our approach on the algorithmic Markov condition, which in turn is based on the notion of Kolmogorov complexity. We first give short introductions to both notions, and then develop our approach.
3.1 Kolmogorov Complexity
The Kolmogorov complexity of a finite binary string x is the length of the shortest program p* for a universal Turing machine U that generates x and then halts [18, 14]. Formally,

K(x) = min { |p| : p ∈ {0, 1}*, U(p) = x } .        (2)
That is, program p* is the most succinct algorithmic description of x, or, in other words, the ultimate lossless compressor of that string. For our purpose, we are particularly interested in the Kolmogorov complexity of a distribution P,

K(P) = min { |p| : p ∈ {0, 1}*, |U(p, x, q) − P(x)| ≤ 1/q for all x and all q ∈ N } ,        (3)

which is the length of the shortest program p* for a universal Turing machine U that approximates P arbitrarily well [18].
By definition, Kolmogorov complexity will make maximal use of any structure in the input that can be used to compress the object. As such it is the theoretically optimal measure for complexity. Due to the halting problem, Kolmogorov complexity is also not computable, nor approximable up to arbitrary precision [18]. The Minimum Description Length (MDL) principle [4], however, provides a statistically wellfounded approach to approximate it from above. We will later use MDL to instantiate the framework we define below.
3.2 Algorithmic Markov Condition
Recently, Janzing and Schölkopf [8] postulated the algorithmic Markov condition (AMC), which states that if X causes Y, the factorization of the joint distribution over X and Y in the true causal direction has a lower Kolmogorov complexity than in the anti-causal direction, i.e.

K(P(X)) + K(P(Y | X)) ≤ K(P(Y)) + K(P(X | Y))        (4)
holds up to an additive constant. Moreover, under the assumption of causal sufficiency this allows us to identify the true causal network as the least complex one, i.e. the one minimizing

∑_{j=1}^{n} K(P(X_j | PA_j)) ,        (5)
which again holds up to an additive constant.
3.3 AMC and Confounding
Although the algorithmic Markov condition relies on causal sufficiency, it does suggest a powerful inference framework in which we do allow variables to be unobserved. For simplicity of notation, as well as generality, let us ignore Y for now, and instead consider the question of whether X = (X_1, …, X_m) is confounded by some factor Z. We can answer this question using the AMC by including a latent variable Z with independent components Z_i, for which we know the joint distribution P(X, Z) over measured X and unmeasured Z. If this is the case, we can again simply identify the corresponding minimal Kolmogorov complexity network via

min  K(P(Z)) + ∑_{j} K(P(X_j | PA_j)) ,        (6)

where PA_j are now the parents of X_j among X and Z in the extended network. By adding the term K(P(Z)) we implicitly assume that there is no reverse causality from X to Z.
This formulation gives us a principled manner to identify whether a given Z is a (likely) confounder of X. Clearly, with the above we can score the hypothesis that Z confounds X. However, it also allows us to fairly score the hypothesis that there is no confounding, because if we choose P(Z) to be a prior concentrated on a single point, Eq. (6) corresponds to Eq. (5) up to an additive constant. By the algorithmic Markov condition, we can now determine the most likely causal model simply by comparing the two scores and choosing the one with the lower Kolmogorov complexity. This approach does not suffer from the problems discussed in Sec. 2.2, as we explicitly take the complexity of Z into account. Moreover, and importantly, this formulation allows us to consider any distribution P(X, Z) with any type of latent factor Z.
Two problems, however, do remain with this approach. First, we do not know the true distribution P(X, Z), nor even the distributions P(X) or P(Z). Instead, we only have empirical data over X from which we can approximate P(X), but this gives us no explicit information about Z, P(Z), or the joint P(X, Z). Second, as stated above, Kolmogorov complexity is not computable, and the criterion as such is therefore not directly applicable. We deal with the first problem next by making assumptions on the form of P(X, Z), and then in Sec. 3.5 we instantiate this criterion using the Minimum Description Length (MDL) principle.
3.4 Latent Factor Models
Even under the assumption that the Z_i are mutually independent, there are infinitely many possible distributions P(X, Z), and hence we have to make further choices to make the problem feasible. In our setting, a particularly natural choice is latent factor modelling. That is, we require the distribution over X and Z to be of the form

P(X, Z) = P(Z) ∏_{j} P(X_j | Z) ,

where the distribution of Z can be arbitrarily complex. Not only does this give us a very clear and interpretable hypothesis, namely that given Z, every X_j should be independent of every other member of X (X_j ⊥ X_k | Z for j ≠ k), it also corresponds to the notion that Z should explain away as much of the information shared within X as possible, very much in line with Eq. (6). Moreover, from a more practical perspective, it is a well-studied problem for which advanced techniques exist, such as Factor Analysis [19], GPLVM [16], Deep Generative Models [12, 26, 24], as well as Probabilistic PCA (PPCA) [34].
For the sake of simplicity we will here focus on PPCA, which has the linear form

Z ~ N(0, I_k) ,   X | Z ~ N(W Z + μ, σ² I_d) ,        (7)

and is appropriate if we deal with real-valued variables without any constraints and assume Gaussian noise. If the data does not follow these assumptions, one of the other models mentioned above may be a more appropriate choice. An appealing aspect of PPCA is that by marginalizing over Z we can rewrite it only in terms of the matrix W [34], i.e.

X ~ N(μ, C) ,        (8)
C = W W^T + σ² I_d ,        (9)

which both dramatically reduces the computational effort and allows us to make statements about the consistency of our method.
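As a quick numerical check of this marginalization (ours; the dimensions and the values of W, μ, and σ are arbitrary illustrative choices), we can sample from the two-stage model of Eq. (7) and compare the empirical covariance of X against the closed form W W^T + σ² I.

```python
import numpy as np

rng = np.random.default_rng(1)

# PPCA generative model (Eq. 7): Z ~ N(0, I_k), X | Z ~ N(W Z + mu, sigma^2 I_d).
# Marginalizing out Z (Eq. 8-9) gives X ~ N(mu, W W^T + sigma^2 I_d); we
# verify this against the empirical covariance of many samples.
d, k, sigma, n = 5, 2, 0.5, 200_000
W = rng.normal(size=(d, k))       # arbitrary loading matrix
mu = rng.normal(size=d)           # arbitrary mean

Z = rng.normal(size=(n, k))                              # latent factors
X = Z @ W.T + mu + sigma * rng.normal(size=(n, d))       # observed data

emp_cov = np.cov(X, rowvar=False)
marg_cov = W @ W.T + sigma**2 * np.eye(d)
assert np.allclose(emp_cov, marg_cov, atol=0.1)          # up to sampling noise
```

This is exactly why the marginal form is cheap to work with: evaluating the likelihood of X requires only one d-dimensional Gaussian density, with no integral over Z.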
While in this simple form PPCA assumes linear relationships, we can also model nonlinear relationships by adding features to the conditional distribution of X given Z, e.g. using polynomial regression of X on Z. While this increases the modelling power, it comes with an increase in computational effort, as the simplification of Eq. (8) no longer holds.
3.5 Minimum Description Length
While Kolmogorov complexity is not computable, the Minimum Description Length (MDL) principle [27] provides a statistically well-founded approach to approximate it from above. To achieve this, rather than considering all Turing machines, in MDL we consider a model class M for which we know that every model M ∈ M will generate the data and halt, and identify the best model as the one that describes the data most succinctly without loss. If we instantiate M with all Turing machines that do so, the MDL-optimal model coincides with Kolmogorov complexity—this is also known as Ideal MDL [4]. In practice, we of course consider smaller model classes that are easier to handle and match our modelling assumptions.
In two-part, or crude, MDL, we score models by first encoding the model, and then the data given that model,

L(D, M) = L(M) + L(D | M) ,        (10)

where L(M) and L(D | M) are code length functions for the model and for the data conditional on the model, respectively.
Two-part MDL often works well in practice, but by encoding the model separately it introduces arbitrary choices. In one-part MDL—also known as refined MDL—we avoid these choices by encoding the data using the entire model class at once. In order for a code length function to be refined, it has to be asymptotically minimax optimal. That is, no matter what data D' of the same type and dimensions as D we consider, the refined score for D' is within a constant of the score L(D' | M') where we already know its corresponding optimal model M', and this constant is independent of the data. There exist different forms of refined MDL codes [4]. For our setup it is convenient to use the full Bayesian definition,

L(D) = −log ∫ P(D | M) P(M) dM ,

where P(M) is a prior on the model class. In our case, that is, for the PPCA model from Eq. (7), each pair (W, σ) corresponds to one model, and hence the model class corresponds to all possible (W, σ), such that we have

L(X) = −log ∫ P(X | W, σ) P(W, σ) d(W, σ) .        (11)
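To make the Bayesian code length concrete, consider the following sketch (ours, and purely illustrative: it uses a Bernoulli model with a uniform prior instead of PPCA, chosen because there the integral −log ∫ P(D | M) P(M) dM has a closed form against which a Monte Carlo estimate can be checked).

```python
import numpy as np
from scipy.special import betaln, logsumexp

rng = np.random.default_rng(2)

# Refined (Bayesian) MDL code length L(D) = -log integral P(D | M) P(M) dM,
# here for a Bernoulli model with a uniform Beta(1, 1) prior, where the
# integral equals B(h + 1, t + 1) in closed form. We compare a Monte
# Carlo estimate of the same integral against the exact value, in bits.
data = rng.random(100) < 0.7                 # 100 coin flips with bias 0.7
h, t = int(data.sum()), int((~data).sum())   # heads and tails counts

exact = -betaln(h + 1, t + 1) / np.log(2)    # exact code length in bits

thetas = rng.random(200_000)                            # draws from the prior
loglik = h * np.log(thetas) + t * np.log1p(-thetas)     # log P(D | theta)
mc = -(logsumexp(loglik) - np.log(len(thetas))) / np.log(2)

assert abs(exact - mc) < 0.5                 # agree to within half a bit
```

The same prior-sampling trick, applied to the (W, σ) pairs of Eq. (11), is how the scores in Sec. 3.6 are approximated in practice.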
We can now put all the pieces together, and use the above theory to determine whether a pair (X, Y) is more likely causally related or confounded by an unobserved Z.
3.6 Causal or Confounded?
Given the above theory, determining which of the causal and the confounded hypothesis is more likely is fairly straightforward. To do so, we consider two model classes, one for each hypothesis, and determine which of the two leads to the most succinct description of the data sample over X and Y.
First, we consider the causal model class M_ca, consisting of models in which X causes Y in a linear fashion,

Y | X ~ N(w^T X, σ²) ,  with w drawn from a prior P(w) ,        (12)

and, writing L_ca for the resulting code length, we encode the data as

L_ca = L(X) + L(Y | X) ,        (13)
L(Y | X) = −log (1/k) ∑_{i=1}^{k} P(Y | X, w_i) ,        (14)

where we approximate the Bayesian integral by sampling k weight vectors w_i from the distribution defined by Eq. (12).

Second, we consider the confounded model class M_co, where the correlations within X and between X and Y are entirely explained by a hidden confounder Z modelled by PPCA, i.e.

(X, Y) | Z ~ N(W Z + μ, σ² I) ,        (15)
L_co = −log (1/k) ∑_{i=1}^{k} P(X, Y | W_i, σ_i) ,        (16)

where the samples (W_i, σ_i) are drawn from the model we inferred using PPCA, i.e., according to Eq. (7). As in the causal case, the more samples we consider, the better the approximation, but the higher the computational cost.
By MDL we can now check which hypothesis better explains the data by simply considering the sign of L_co − L_ca. If this difference is less than zero, the confounded model does a better job at describing the data than the causal model, and vice versa. We refer to this approach as CoCa.
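As a concrete illustration of this comparison, the following sketch (ours; `coca_decide` and `mc_code_length` are hypothetical helper names, and the standard-normal priors, the fixed σ = 1, and the confidence normalization are illustrative stand-ins rather than the paper's exact encoding) Monte-Carlo-approximates both code lengths and returns the decision together with a normalized confidence.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

def mc_code_length(logliks):
    """Code length in bits: negative log of the Monte-Carlo-averaged
    likelihood over sampled models (cf. Eq. 14 and 16)."""
    logliks = np.asarray(logliks)
    return -(logsumexp(logliks) - np.log(len(logliks))) / np.log(2)

def coca_decide(x, y, n_models=500, sigma=1.0):
    """Compare a causal class (y = w^T x + noise) against a confounded
    class (rank-one PPCA over the joint (x, y)) by sampled code lengths."""
    n, d = x.shape
    data = np.column_stack([x, y])

    # confounded: (x, y) jointly Gaussian with covariance w w^T + sigma^2 I
    ll_co = [multivariate_normal.logpdf(
                 data, np.zeros(d + 1),
                 np.outer(w, w) + sigma**2 * np.eye(d + 1)).sum()
             for w in rng.normal(size=(n_models, d + 1))]
    l_co = mc_code_length(ll_co)

    # causal: encode x via rank-one PPCA on x alone, then y | x linearly
    ll_x = [multivariate_normal.logpdf(
                x, np.zeros(d),
                np.outer(w, w) + sigma**2 * np.eye(d)).sum()
            for w in rng.normal(size=(n_models, d))]
    ll_yx = [(-0.5 * ((y - x @ w) / sigma)**2
              - 0.5 * np.log(2 * np.pi * sigma**2)).sum()
             for w in rng.normal(size=(n_models, d))]
    l_ca = mc_code_length(ll_x) + mc_code_length(ll_yx)

    conf = (l_ca - l_co) / (l_ca + l_co)   # one plausible normalization
    return ("confounded" if l_co < l_ca else "causal"), conf

# usage on toy data where x and y are driven by a shared latent z
n, d = 300, 3
z = rng.normal(size=n)
x = np.outer(z, rng.normal(size=d)) + 0.3 * rng.normal(size=(n, d))
y = 2 * z + 0.3 * rng.normal(size=n)
label, conf = coca_decide(x, y)
```

Note that on fully linear-Gaussian bivariate data the two classes can fit comparably well; this is precisely the regime where the confidence score below is small and one should refrain from a hard decision.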
To make the CoCa scores comparable between different data sets, we further introduce a confidence score: a normalized version of L_co − L_ca that accounts for both the intrinsic complexity of the data and the number of samples. If its absolute value is small, both model classes explain the data approximately equally well, and hence we are not very confident in our result and should perhaps refrain from making a decision.
Last, we consider the question of when our method will properly distinguish between the cases we care about. For this, we use a general result for MDL on the consistency of deciding between two model classes when the data is generated by a model contained in either of these classes [4]. That is, if we let D_n be a sample of size n generated by a true model M*, then for n → ∞

L_ca(D_n) ≤ L_co(D_n)  if M* ∈ M_ca ,  and  L_co(D_n) ≤ L_ca(D_n)  if M* ∈ M_co ,        (17)
with strict inequalities if M* is contained in only one of the two classes. This means that in the limit we will infer the correct conclusion if the true model is within the model classes we assume. Moreover, since our refined MDL formulation is also consistent for model selection [4], following Sec. 3.2 we expect that even if the true model is contained in both model classes, the shortest description corresponds to the true generative process. Importantly, even when the true model is in neither of our model classes, we can still expect reasonable inferences with regard to these model classes: by the minimax property of the refined codes we use, we encode every model as efficiently (up to a constant) as possible, which promises reliable performance and confidence scores even in adversarial cases. As we will see shortly, the experiments confirm this.
4 Related Work
Causal inference is arguably one of the most important problems in statistical inference, and hence has attracted a lot of research attention [28, 21, 32]. The existence of confounders, selection bias, and other statistical problems make it impossible to infer causality from observational data alone [21]. When their assumptions hold, constraint-based [32, 33, 36] and score-based [2] causal discovery can, however, reconstruct causal graphs up to Markov equivalence. This means, however, that they are not applicable to determine the causal direction between just X and Y.
By making assumptions on the shape of the causal process, Additive Noise Models (ANMs) can determine the causal direction between just X and Y. In particular, ANMs assume independence between the cause and the residual (noise), and infer causation if such a model can be found in one direction but not in the other [30, 31, 5, 37]. A more general framework for inferring causation than any of the above is given by the Algorithmic Markov Condition (AMC) [17, 8], which is based on finding the least complex—in terms of Kolmogorov complexity—causal network for the data at hand. Since Kolmogorov complexity is not computable [18], practical instantiations require a computable criterion to judge the complexity of a network, which has been proposed using Rényi entropies [13], information geometry [3, 7, 11], and MDL [1, 20]. All of these methods assume causal sufficiency, however, and are not applicable in the case where there are hidden confounders.
Rather than inferring the causal direction between X and Y, estimating the causal effect of X on Y is also an active topic of research. To do so in the presence of latent variables, Hoyer et al. [6] solve the overcomplete independent component analysis (ICA) problem, whereas Wang and Blei [35] and Ranganath and Perotte [23] control for plausible confounders using a given factor model.
Most relevant to this paper is the recent work by Janzing and Schölkopf on determining the "structural strength of confounding" for a continuous-valued pair (X, Y), which they propose to measure using resp. spectral analysis [9] and ICA [10]. Like us, they also focus on linear relationships, but in contrast to us define a one-sided significance score, rather than a two-sided information-theoretic confidence score. In the experiments we compare to these two methods.
5 Experiments
In this section we empirically evaluate CoCa. In particular, we consider its performance in telling causal from confounded in both in-model and adversarial settings, on both synthetic and real-world data. We compare to the recent methods by Janzing and Schölkopf [9, 10]. We implemented CoCa in Python using PyMC3 [29] for posterior inference via ADVI [15]. All code is available for research purposes at http://eda.mmci.unisaarland.de/coca/.
Throughout this section we infer one-dimensional factor models Z, noting that higher-dimensional Z gave similar results. We approximate the integrals in the MDL scores by sampling. All experiments were executed single-threaded on an Intel Xeon E5-2643 v3 machine with 256GB memory running Linux, and each run took on the order of seconds to finish.
5.1 Synthetic Data
To see whether CoCa works at all, we start by generating synthetic data with known ground truth close to our assumptions. For the confounded case, we generate samples over (X, Y) jointly from a hidden confounder Z, while for the causal case, we generate Y from X. To see how our performance depends on the precise generating process, we consider several source distributions for the latent factors and the noise. We expect CoCa to perform best when the generating process is closest to the assumptions made in Eq. (7) and Eq. (12).
To see how the accuracy of CoCa depends on the confidence assigned to each inference, we consider decision rate (DR) plots. In these, we plot the accuracy over the top k% of pairs, sorted in descending order by absolute confidence. This metric is commonly used in the causal inference literature, as it gives more information about the performance of a classifier than a simple accuracy score.
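Computing such a curve is straightforward; the sketch below (ours; `decision_rate_curve` and `audr` are hypothetical helper names) also includes the area under the DR curve (AUDR) used as a summary statistic in Fig. 2.

```python
import numpy as np

def decision_rate_curve(confidence, correct):
    """Accuracy over the top-k pairs, for k = 1..n, when pairs are
    sorted by absolute confidence in descending order."""
    order = np.argsort(-np.abs(np.asarray(confidence)))
    hits = np.asarray(correct, dtype=float)[order]
    return np.cumsum(hits) / np.arange(1, len(hits) + 1)

def audr(confidence, correct):
    """Area under the decision rate curve; 1.0 is perfect, and a random
    classifier on balanced data obtains about 0.5."""
    return decision_rate_curve(confidence, correct).mean()

# illustrative: the confident decisions are right, the unconfident one is not
conf    = [0.9, -0.8, 0.1, -0.05, 0.7]
correct = [1,    1,   0,    1,    1]
dr = decision_rate_curve(conf, correct)
# sorted by |conf|: 0.9, -0.8, 0.7, 0.1, -0.05 -> correctness 1, 1, 1, 0, 1
assert np.allclose(dr, [1.0, 1.0, 1.0, 0.75, 0.8])
```

The curve starts at the accuracy of the single most confident decision and ends, at a decision rate of 100%, at the plain accuracy over all pairs.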
We first consider the case where we fix the dimensionality of Z, vary the dimensionality of X, and restrict the generating process to precisely the model assumptions made by CoCa. We show the resulting DR plot in the left plot of Fig. 1. We see that for all dimensionalities of X, the pairs for which CoCa is most confident are also those most likely to be classified correctly. While for the smallest dimensionality of X there is (too) little information about Z that can be inferred by the factor model, for higher dimensionalities CoCa is both highly confident and accurate over all decisions.
Next, we move away from our model assumptions and aggregate over all the possibilities listed above. We show the results on the right-hand side of Fig. 1. We observe essentially the same pattern, except that all lines drop off slightly earlier than in the left plot. Experiments where we chose the generating parameters independently at random gave similar results, and are hence not shown for conciseness. This shows that our method continues to work even when the assumptions we make no longer hold.
Importantly, all results in both experiments are significant with regard to the confidence interval of a fair coin flip—except for one setting, which is significant only for the 75% of tuples about which CoCa was most confident. Further, in none of these cases was the method biased towards classifying datasets as causal or confounded.
To see how CoCa fares for a broader variety of combinations of dimensionalities of X and Z, in Fig. 2 we plot a heatmap of the area under the decision rate curve (AUDR) of CoCa. As expected, for a fixed dimensionality of Z we become more accurate as the dimensionality of X increases. Conversely, as the dimensionality of the true Z increases for fixed X, our performance degrades gracefully—this is because we infer a Z of dimensionality one, which deviates further from the true generating process as the dimensionality of the true Z grows. Note that all CoCa AUDR scores are well above the score of 0.5 that a random classifier would obtain.
Finally, we compare CoCa to the only two competitors we are aware of: the two recent approaches by Janzing and Schölkopf, one based on spectral analysis (SA) [9] and the other on independent component analysis (ICA) [10]. The implementations of both methods require X to be multidimensional. We hence consider multivariate X, while allowing the sources to be any of the distributions listed above. We show the results in Fig. 3.
As SA and ICA provide an estimate of the strength of confounding without any two-sided confidence score, we used the distance of this estimate from its decision threshold as a substitute for such a score. That the corresponding lines are shaped as expected gives us some assurance that this is a reasonable choice.
We see that for all dimensionalities CoCa outperforms these competitors by a margin on the pairs where the respective methods are most confident, while the overall accuracies are almost indistinguishable. We further note that as the dimensionality of Z becomes large relative to that of X, the differences in performance between the approaches shrink.
5.2 Simulated Genetic Networks
Next, we consider more realistic synthetic data. For this we consider the DREAM 3 data [22] which was originally used to compare different methods for inferring biological networks. We use this data both because the underlying generative network is known, and because the generative dynamics are biologically plausible [22]. That is, the relationships are highly nonlinear, and therefore an interesting case to evaluate how CoCa performs when our assumptions do not hold at all. Out of all networks in the dataset, we consider the ten largest networks, those of 50 and 100 nodes, which are associated with time series of lengths 500 and 1000, respectively. Since CoCa was not designed to work with time series, we treat the data as if it were generated from an i.i.d. source.
For each network, we take pairs of univariate X and Y such that one of the following two cases holds:


X has a causal effect on Y and there exists no common parent Z, or

X and Y have a common parent Z and there are no causal effects between X and Y.
Although in theory we could also consider tuples exhibiting both effects at once, for this dataset there were too few such tuples to have sufficient statistical power. Further, since the original networks are heavily biased towards causality rather than common parents, we take all confounded tuples and then uniformly sample an equal number of causal tuples from the set of all such tuples.
We show the decision rate plot when applying CoCa to these pairs, aggregated over all networks, in the left-hand plot of Fig. 4. Like before, we see that CoCa is highly accurate for those tuples where it is most confident. In comparison to the results in Fig. 1, performance drops more quickly, which is readily explained by the fact that the simulated dynamics are highly nonlinear. Note, however, that our results are nevertheless statistically significant with regard to a fair coin flip for up to the 75% of pairs that CoCa is most confident about. To further explain the behavior of CoCa on this dataset, we plot the absolute confidence scores we obtain on the right of Fig. 4. We see that particularly for the first 25% of decisions the confidences are much larger than for the remaining pairs. This corresponds nicely to the plot on the left, as the first 25% of our decisions are also those where we compare most favorably to the baseline.
5.3 Tübingen Benchmark Pairs
To consider real-world data suited for causal inference, we now turn to the Tübingen benchmark pairs dataset (https://webdav.tuebingen.mpg.de/causeeffect/). This dataset consists of (mostly) pairs of univariate variables for which plausible directions of causality can be decided assuming there are no hidden confounders. For many of these, however, it is either known, or plausible to posit, that they are confounded rather than directly causally related. For example, for pairs 65–67 certain stock returns are supposedly causal, but given the nature of the market they would likely be better explained by common influences on the returns of the stock options.
We therefore code every pair in the benchmark dataset as either causal (if we think the directly causal part to be stronger), confounded (if we expect the common cause to be the main driver), or unclear (if we are not sure which component is more important), and apply CoCa to the pairs in the first two categories. This leaves 47 pairs, of which we judged 41 to be mostly causal and 6 to be mostly confounded (the complete list can be found in the online appendix at http://eda.mmci.unisaarland.de/coca/).
In Fig. 5 we show the decision rate plots across the datasets, weighted according to the benchmark definition. As in the previous cases, CoCa is most accurate where it is most confident, declining to the baseline as we classify pairs about which it is less and less certain. We note that for these cases CoCa was biased towards labelling datasets as truly causal, even when we judged them to be driven by confounding. Despite this, CoCa does better than the naive baseline of "everything is causal" by assigning more confidence to those datasets which, according to our judgment, were indeed truly causal.
5.4 Optical Data
Finally, we consider real-world optical data [9]. In these experiments, X is a low-resolution image shown on the screen of a laptop, and Y is the brightness measured by a photodiode at some distance from the screen. The confounders are an LED in front of the photodiode and another LED in front of the camera, both controlled by random noise, where the strength of confounding is controlled by the brightness of these LEDs.
We evaluate CoCa on each of the provided datasets, and plot the resulting confidence scores in Fig. 6. The strength of confounding increases from left to right; values larger than zero indicate that CoCa judged the data to be causal, while values smaller than zero indicate confounding. We see that at intermediate confounding strengths our method is very uncertain about its classification, while towards the extreme ends of pure causality or pure confounding it is very confident, and correct in being so.
6 Discussion and Conclusions
We considered the problem of distinguishing between the case where the data has been generated by a genuinely causal model and the case where the apparent cause and effect are in fact confounded by unmeasured variables. We proposed a practical information-theoretic way of comparing these cases on the basis of MDL and latent variable models that can be efficiently inferred using variational inference. Through experiments we showed that CoCa works well in practice—including in cases where the data generating process is quite different from our model assumptions. Importantly, we showed that CoCa is particularly accurate when it is also confident, more so than its competitors.
For future work, we will investigate the behavior of CoCa with more complex latent variable models, as these allow for modelling more complex relations. Such models, however, come at a much higher computational cost and lack theoretical guarantees of consistency, though they may work well in practice. In addition, we would like to infer more complete causal networks over the observed variables while taking into account the presence of confounders. However, this will likely lead to inconsistent inference of edges unless we can find a theoretically well-founded method of telling apart direct and indirect effects. To the best of our knowledge, as of now, no such method is known.
Acknowledgements
David Kaltenpoth is supported by the International Max Planck Research School for Computer Science (IMPRS-CS). Both authors are supported by the Cluster of Excellence "Multimodal Computing and Interaction" within the Excellence Initiative of the German Federal Government.
References
 [1] K. Budhathoki and J. Vreeken. MDL for causal inference on discrete data. In ICDM, pages 751–756. IEEE, 2017.
 [2] D. M. Chickering. Learning equivalence classes of Bayesian-network structures. JMLR, 2:445–498, 2002.
 [3] P. Daniusis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In UAI, pages 143–150, 2010.
 [4] P. D. Grünwald. The Minimum Description Length Principle. MIT press, 2007.
 [5] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, pages 689–696, 2009.
 [6] P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int. J. Approx. Reason., 49:362–378, 2008.
 [7] D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artif. Intell., 182:1–31, 2012.
 [8] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE TIT, 56:5168–5194, 2010.
 [9] D. Janzing and B. Schölkopf. Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6(1), 2018.
 [10] D. Janzing and B. Schölkopf. Detecting non-causal artifacts in multivariate linear regression models. In ICML, pages 2245–2253. JMLR, 2018.
 [11] D. Janzing, B. Steudel, N. Shajarisales, and B. Schölkopf. Justifying information-geometric causal inference. In Measures of Complexity, pages 253–265. Springer, 2015.
 [12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
 [13] M. Kocaoglu, A. G. Dimakis, S. Vishwanath, and B. Hassibi. Entropic causal inference. In AAAI, pages 1156–1162, 2017.
 [14] A. Kolmogorov. On tables of random numbers. Indian J. Stat. Ser. A, 25(4):369–376, 1963.
 [15] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. JMLR, 18:430–474, 2017.
 [16] N. Lawrence. Probabilistic nonlinear principal component analysis with Gaussian process latent variable models. JMLR, 6:1783–1816, 2005.
 [17] J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems. http://parallel.vub.ac.be/~jan/, 2006.
 [18] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer Science & Business Media, 2009.
 [19] J. C. Loehlin. Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis. Psychology Press, 1998.
 [20] A. Marx and J. Vreeken. Telling cause from effect by MDL-based local and global regression. In ICDM, pages 307–316. IEEE, 2017.
 [21] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
 [22] R. J. Prill, D. Marbach, J. Saez-Rodriguez, P. K. Sorger, L. G. Alexopoulos, X. Xue, N. D. Clarke, G. Altan-Bonnet, and G. Stolovitzky. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE, 5(2), 2010.
 [23] R. Ranganath and A. Perotte. Multiple causal inference with latent confounding. CoRR, abs/1805.08273, 2018.
 [24] R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In AISTATS, 2015.
 [25] H. Reichenbach. The direction of time. Dover, 1956.
 [26] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. CoRR, abs/1505.05770, 2015.
 [27] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
 [28] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol., 66:688–701, 1974.
 [29] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Comp Sci, 2, 2016.
 [30] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. JMLR, 7:2003–2030, 2006.
 [31] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. JMLR, 12:1225–1248, 2011.
 [32] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT press, 2000.
 [33] P. Spirtes, C. Meek, and T. Richardson. An algorithm for causal inference in the presence of latent variables and selection bias. In Computation, Causation, and Discovery. MIT Press, 1999.
 [34] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. J. R. Statist. Soc. B, 61:611–622, 1999.
 [35] Y. Wang and D. M. Blei. The blessings of multiple causes. CoRR, abs/1805.06826, 2018.
 [36] J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artif. Intell., 172:1873–1896, 2008.
 [37] K. Zhang and A. Hyvärinen. On the identifiability of the postnonlinear causal model. In UAI, pages 647–655, 2009.
A Appendix
A.1 Coding of the Tübingen Pairs
Here we give a full list of which pairs of the Tübingen pairs dataset we considered to be mainly causal, confounded, or which we were uncertain about.
Causal: 13–16, 25–37, 43–46, 48, 54, 64, 69, 71–73, 76–80, 84, 86–87, 93, 96–98, 100
Confounded: 65–67, 74–75, 99
Uncertain: 1–12, 17–24, 38–42, 47, 49–53, 55–63, 68, 70, 81–83, 85, 88–92, 94–95
For example, for pairs 5–11 it was unclear to us to what extent the age of an abalone should be considered a causal factor for its length, height, weight, and other measurements, and to what extent all of these are simply confounded by the underlying biological processes of development.
As another example, for pair 99 we believed it reasonable to suggest that the correlation between the language test score of a child and the socioeconomic status of its family might more plausibly be explained by the intelligence of parents and child, which are themselves strongly correlated.