1. Introduction
Causal inference from observational data, that is, identifying cause and effect in data that was not collected through carefully controlled randomised trials, is a fundamental problem in both business and science (Spirtes et al., 2000; Pearl, 2009). A particularly interesting setting is to tell cause from effect between a pair of random variables $X$ and $Y$, given data over the joint distribution. That is, to identify which of $X \to Y$ or $Y \to X$ is the most likely causal direction.
In recent years, a number of important ideas have been proposed that allow for accurate causal inference based on properties of the joint distribution. These ideas include that of the Additive Noise Model (ANM), where we assume the effect is a function of the cause with additive noise independent of the cause (Shimizu et al., 2006; Peters et al., 2010; Peters et al., 2014), and that of the algorithmic Markov condition (Janzing and Schölkopf, 2010; Budhathoki and Vreeken, 2016) which is based on Kolmogorov complexity. Loosely speaking, the key idea is that if $X$ causes $Y$, the shortest description of the joint distribution $P(X, Y)$ is given by the separate descriptions of $P(X)$ and $P(Y \mid X)$. That is, if $X \to Y$, these two distributions will be less dependent than $P(Y)$ and $P(X \mid Y)$. However, as Kolmogorov complexity is not computable, any method using this observation requires a computable approximation of this notion, which in general involves arbitrary choices (Sgouritsa et al., 2015; Vreeken, 2015; Liu and Chan, 2016; Janzing et al., 2012).
In this paper, for the first time, we define a causal inference rule based on the algorithmic Markov condition using stochastic complexity. More specifically, we approximate Kolmogorov complexity via the Minimum Description Length (MDL) principle using a score that is minimax optimal with regard to the model class under consideration. This means that even if the true data generating distribution does not reside in the model class $\mathcal{M}$ under consideration, we still obtain the optimal encoding for the data relative to $\mathcal{M}$ (Grünwald, 2007). Best of all, unlike Kolmogorov complexity, stochastic complexity is computable.
We show the strength of this approach by instantiating it for pairs of univariate discrete data using the class of multinomials. For this class the stochastic complexity can be computed remarkably efficiently, so that our score has only linear-time computational complexity. Through experiments we show that our method, cisc, for causal inference by stochastic complexity, performs very well in practice. The strength of the minimax property shows when we consider synthetic data where we vary the data generating process: cisc outperforms the state of the art by a margin, including for out-of-model distributions such as geometric, hypergeometric, and Poisson. On the Tübingen benchmark data set of 95 univariate pairs, cisc significantly outperforms the existing proposals for discrete data, with an accuracy of 100% over the 21 pairs it is most certain about, and an overall accuracy of , which is comparable to the state of the art for causal inference on continuous-valued data. Last, but not least, we perform three case studies which show cisc indeed infers sensible causal directions from real-world data.
In sum, the main contributions of this paper are as follows.

we propose the first computable framework for causal inference by the algorithmic Markov condition with provable minimax optimality guarantees,

define a causal indicator for pairs of discrete variables based on stochastic complexity,

show how to efficiently compute it,

provide extensive experimental results on synthetic, benchmark, and realworld data, and

make our implementation and all used data available.
The paper is structured as usual. We introduce notation and give preliminaries in Sec. 2, and give a brief primer to causal inference by Kolmogorov complexity in Sec. 3. We present cisc, our practical instantiation based on the stochastic complexity score, in Sec. 4. Related work is discussed in Sec. 5, and we evaluate cisc empirically in Sec. 6. We round up with discussion in Sec. 7 and conclude in Sec. 8.
2. Preliminaries
In this section, we introduce the notation and background definitions we will use in subsequent sections.
2.1. Kolmogorov Complexity
The Kolmogorov complexity of a finite binary string $x$ is the length of the shortest binary program $p^*$ for a Universal Turing machine $\mathcal{U}$ that generates $x$, and then halts (Kolmogorov, 1965; Li and Vitányi, 1993). Formally, we have
$$K(x) = \min \{ |p| : p \in \{0,1\}^*, \; \mathcal{U}(p) = x \} .$$
Simply put, $p^*$ is the most succinct algorithmic description of $x$, and the Kolmogorov complexity of $x$ is the length of its ultimate lossless compression. Conditional Kolmogorov complexity, $K(x \mid y)$, is then the length of the shortest binary program that generates $x$, and halts, given $y$ as input.
The amount of algorithmic information contained in $y$ about $x$ is $I(y : x) = K(x) - K(x \mid y^*)$, where $y^*$ is the shortest binary program for $y$, defining $I(x : y)$ analogously. Intuitively, it is the number of bits that can be saved in the description of $x$ when the shortest description of $y$ is already known. Algorithmic information is symmetric, i.e. $I(y : x) \stackrel{+}{=} I(x : y)$, where $\stackrel{+}{=}$ denotes equality up to an additive constant, and is therefore also called algorithmic mutual information (Li and Vitányi, 1993). Two strings $x$ and $y$ are algorithmically independent if they have no algorithmic mutual information, i.e. $I(x : y) \stackrel{+}{=} 0$.
For our purpose, we also need the Kolmogorov complexity of a distribution. The Kolmogorov complexity of a probability distribution $P$, $K(P)$, is the length of the shortest program that outputs $P(x)$ to precision $q$ on input $\langle x, q \rangle$ (Grünwald and Vitányi, 2008). More formally, we have
$$K(P) = \min \{ |p| : p \in \{0,1\}^*, \; |\mathcal{U}(\langle x, \langle q, p \rangle \rangle) - P(x)| \le 1/q \} .$$
We refer the interested reader to Li & Vitányi (Li and Vitányi, 1993) for more details on Kolmogorov complexity.
3. Causal Inference by Complexity
Given two correlated variables $X$ and $Y$, we are interested in inferring their causal relationship. In particular, we want to infer whether $X$ causes $Y$, whether $Y$ causes $X$, or they are only correlated. In doing so, we assume causal sufficiency. That is, there is no confounding variable, i.e. no hidden common cause $Z$ of $X$ and $Y$. We use $X \to Y$ to indicate that $X$ causes $Y$.
We base our causal inference method on the following postulate:
Postulate 1 (independence of input and mechanism (Sgouritsa et al., 2015)).
If $X \to Y$, the marginal distribution of the cause $P(X)$, and the conditional distribution of the effect given the cause $P(Y \mid X)$, are independent ($P(X)$ contains no information about $P(Y \mid X)$, and vice versa) since they correspond to independent mechanisms of nature.
This postulate provides the foundation for many successful causal inference frameworks designed for a pair of variables (Janzing and Steudel, 2010; Janzing et al., 2012; Sgouritsa et al., 2015; Schölkopf et al., 2012). We can think of the conditional $P(Y \mid X)$ as the mechanism that transforms $x$ values into $y$ values, i.e. generates effect $Y$ for cause $X$. The postulate is justified if we are dealing with a mechanism of nature that does not care what input we provide to it ($P(X)$ in this case). This independence will not hold in the other direction as $P(Y)$ and $P(X \mid Y)$ may contain information about each other as both inherit properties from $P(X)$ and $P(Y \mid X)$. This creates an asymmetry between cause and effect.
It is insightful to consider the following example where the amount of radiation per solar cell $X$ (cause) causes power generation in the cell $Y$ (effect). We can affect only $P(X)$ by actions such as moving the solar cell to a shady place or varying its angle to the sun. Likewise we can change only $P(Y \mid X)$ by actions such as using more efficient cells. However, it is hard to find actions that change $P(Y)$ without affecting $P(X \mid Y)$, or vice versa.
The notion of independence, however, is abstract. Accordingly, different formalisations have been proposed. Janzing et al. (Janzing et al., 2012) define independence in terms of information geometry. Liu & Chan (Liu and Chan, 2016) formulate independence in terms of the distance correlation between marginal and conditional empirical distribution. Janzing & Schölkopf (Janzing and Schölkopf, 2010) formalise independence using algorithmic information theory, and postulate algorithmic independence of $P(X)$ and $P(Y \mid X)$. Since the algorithmic formulation captures all types of dependencies, and has a sound theoretical foundation, it is, arguably, a better mathematical formalisation of Postulate 1. Using algorithmic information theory, we arrive at the following postulate.
Postulate 2 (algorithmic independence of Markov kernels (Janzing and Schölkopf, 2010)).
If $X \to Y$, the marginal distribution of the cause $P(X)$ and the conditional distribution of the effect given the cause $P(Y \mid X)$ are algorithmically independent, i.e. $I(P(X) : P(Y \mid X)) \stackrel{+}{=} 0$.
Postulate 2 is equivalent to saying that if $X \to Y$, factorizing the joint distribution over $X$ and $Y$ into $P(X)$ and $P(Y \mid X)$ will lead to simpler models, in terms of Kolmogorov complexity, than factorizing it into $P(Y)$ and $P(X \mid Y)$ (Janzing and Schölkopf, 2010). The following theorem is hence a consequence of the algorithmic independence of input and mechanism.
Theorem 3.1 (Th. 1 in Mooij et al. (2010)).
If $X$ is a cause of $Y$,
$$K(P(X)) + K(P(Y \mid X)) \le K(P(Y)) + K(P(X \mid Y))$$
holds up to an additive constant.
In other words, we can perform causal inference simply by identifying that direction between and for which the factorization of the joint distribution has the lowest Kolmogorov complexity. Although this inference rule has sound theoretical foundations, the problem remains that Kolmogorov complexity is not computable because of the widely known halting problem. In practice, we therefore need other, computable, notions of independence or information. We can, for instance, approximate Kolmogorov complexity from above through lossless compression (Li and Vitányi, 1993). More generally, the Minimum Description Length (MDL) principle (Rissanen, 1978; Grünwald, 2007) provides a statistically sound and computable means for approximating Kolmogorov complexity (Vereshchagin and Vitanyi, 2004; Grünwald, 2007).
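As a rough illustration of approximating Kolmogorov complexity from above through lossless compression, the following sketch uses Python's standard zlib module as a stand-in compressor. The helper name `compressed_length_bits` and the choice of zlib are assumptions made purely for illustration; they are not part of the method developed in this paper.

```python
import random
import zlib


def compressed_length_bits(data: bytes) -> int:
    """Length in bits of a zlib encoding of `data`.

    Any lossless compressor yields an upper bound (up to a constant)
    on the Kolmogorov complexity of the string it compresses.
    """
    return 8 * len(zlib.compress(data, 9))


# A highly regular string: a short program ("repeat 'ab' 500 times") describes it.
regular = b"ab" * 500

# A pseudo-random string of the same length: hardly compressible by zlib.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

print(compressed_length_bits(regular), compressed_length_bits(noisy))
```

The regular string compresses to far fewer bits than the noisy one, which mirrors the intuition that regular data has low complexity. The bound depends on the chosen compressor, which is exactly the arbitrariness that the MDL-based approach aims to reduce.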
4. Causal Inference by Compression
In this section, we discuss how stochastic complexity can be used for practical causal inference. We gradually move towards that goal starting with MDL, and covering the basics along the way.
4.1. Minimum Description Length Principle
The Minimum Description Length (MDL) (Rissanen, 1978) principle is a practical version of Kolmogorov complexity. Instead of all possible programs, it considers only programs for which we know they generate $x$ and halt. That is, lossless compressors.
In MDL theory, programs are often referred to as models. The MDL principle has its root in the two-part decomposition of the Kolmogorov complexity (Li and Vitányi, 1993). It can be roughly described as follows (Grünwald, 2007). Given a set of models $\mathcal{M}$ and data $D$, the best model $M \in \mathcal{M}$ is the one that minimises $L(D, M) = L(M) + L(D \mid M)$, where $L(M)$ is the length, in bits, of the description of the model, and $L(D \mid M)$ is the length, in bits, of the description of the data when encoded with the model $M$. Intuitively $L(M)$ represents the compressible part of the data, and $L(D \mid M)$ represents the noise in the data.
This is called two-part MDL, or crude MDL. To use crude MDL in practice, we have to define our model class $\mathcal{M}$, and the description methods for $L(M)$ as well as $L(D \mid M)$. If the models $M$ under consideration define probability distributions, we can use the optimal prefix code given by Shannon entropy, $L(D \mid M) = -\log P(D \mid M)$, where $P(D \mid M)$ is the probability mass or density function of $D$ according to $M$. The definition of $L(M)$, however, is tricky: $L(M)$ can vary from one encoding to the other, introducing arbitrariness in the process.
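To make the two-part score concrete, here is a minimal sketch of crude MDL for binary data with Bernoulli models. The encoding of the model is an assumption chosen for illustration: the parameter is restricted to a uniform grid and every grid point costs the same number of bits. The function names `data_bits` and `best_two_part` are hypothetical.

```python
import math


def data_bits(data, theta):
    """Shannon code length -log2 P(D | M) of binary data under Bernoulli(theta)."""
    bits = 0.0
    for x in data:
        p = theta if x == 1 else 1.0 - theta
        if p == 0.0:
            return math.inf  # the model assigns zero probability to the data
        bits -= math.log2(p)
    return bits


def best_two_part(data, grid_size):
    """Return (theta, L(M) + L(D|M)) for the best model on a uniform grid.

    Assumption: each of the grid_size + 1 candidate parameters is encoded
    with log2(grid_size + 1) bits. A different grid changes L(M).
    """
    model_bits = math.log2(grid_size + 1)
    candidates = [i / grid_size for i in range(grid_size + 1)]
    best_theta = min(candidates, key=lambda t: data_bits(data, t))
    return best_theta, model_bits + data_bits(data, best_theta)


sample = [1] * 9 + [0]
print(best_two_part(sample, 10))   # finer grid: costlier model, better fit
print(best_two_part(sample, 2))    # coarser grid: cheaper model, worse fit
```

Changing `grid_size` changes the total code length for the same data, which illustrates why the choice of $L(M)$ introduces arbitrariness in crude MDL.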
The refined version of MDL overcomes this arbitrariness by encoding $M$ and $D$ together. Unlike crude MDL, refined MDL encodes $D$ with the (entire) model class $\mathcal{M}$, resulting in a single one-part code $\bar{L}(D \mid \mathcal{M})$ (Grünwald, 2007). The one-part code length $\bar{L}(D \mid \mathcal{M})$ is also called the stochastic complexity of $D$ with respect to $\mathcal{M}$.
The code is designed in such a way that if there exists a model $M^* \in \mathcal{M}$ for which $L(D \mid M^*)$ is minimal then $\bar{L}(D \mid \mathcal{M})$ will also be minimal. Codes with such a property are also called universal codes. There exist various types of universal codes. Although the coding schemes are different across those codes, the resulting code lengths are almost the same (Grünwald, 2007). In this work, we consider the NML universal code in particular.
Next we explain stochastic complexity in detail using the NML universal code.
4.2. Stochastic Complexity
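Let $x^n = (x_1, x_2, \dots, x_n)$ be an i.i.d. sample of $n$ observed outcomes, where each outcome $x_i$ is an element of a space of observations $\mathcal{X}$. Let $\Theta \subseteq \mathbb{R}^d$, where $d \in \mathbb{Z}^+$, be the parameter space. A model class $\mathcal{M}$ is a family of probability distributions consisting of all the different distributions $P(\cdot \mid \theta)$ that can be produced by varying the parameters $\theta$. Formally, a model class $\mathcal{M}$ is defined as
$$\mathcal{M} = \{ P(\cdot \mid \theta) : \theta \in \Theta \} .$$
To encode the data $x^n$ optimally with respect to the model class $\mathcal{M}$, we can use the code corresponding to the distribution $P(\cdot \mid \hat{\theta}(x^n, \mathcal{M}))$ induced by the maximum likelihood estimate $\hat{\theta}(x^n, \mathcal{M})$ of the data $x^n$ for a given model class $\mathcal{M}$, since this distribution assigns shorter code length, i.e. higher likelihood, to the data than any of the other distributions in the model class. The Normalized Maximum Likelihood (NML) distribution is then defined as
$$P_{\mathrm{NML}}(x^n \mid \mathcal{M}) = \frac{P(x^n \mid \hat{\theta}(x^n, \mathcal{M}))}{R^n_{\mathcal{M}}} ,$$
where the normalizing term $R^n_{\mathcal{M}}$ is the sum over maximum likelihoods of all possible datasets of size $n$ under the model class $\mathcal{M}$. For discrete data, $R^n_{\mathcal{M}}$ is defined as
$$R^n_{\mathcal{M}} = \sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat{\theta}(y^n, \mathcal{M})) ,$$
where $\mathcal{X}^n$ is the $n$-fold Cartesian product $\mathcal{X} \times \cdots \times \mathcal{X}$ indicating the set of all possible datasets of size $n$ with domain $\mathcal{X}$. When the data $x^n$ is defined over a continuous sample space, the summation in the definition of $R^n_{\mathcal{M}}$ is replaced by an integral.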
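The NML distribution has a number of important theoretical properties. First, it gives a unique solution to the minimax problem posed by Shtarkov (Shtarkov, 1987),
$$\min_{\hat{P}} \max_{x^n} \log \frac{P(x^n \mid \hat{\theta}(x^n, \mathcal{M}))}{\hat{P}(x^n \mid \mathcal{M})} .$$
That is, for any data $x^n$, the NML distribution $P_{\mathrm{NML}}(x^n \mid \mathcal{M})$ assigns a probability which differs from the highest achievable probability within the model class, the maximum likelihood $P(x^n \mid \hat{\theta}(x^n, \mathcal{M}))$, by a constant factor, the normalizing term $R^n_{\mathcal{M}}$. In other words, the NML distribution is the minimax optimal universal model with respect to the model class (Myung et al., 2006). The NML distribution represents the behaviour of all the distributions in the model class $\mathcal{M}$.
Second, it also provides a solution to another minimax problem formulated by Rissanen (Rissanen, 2001), which is given by
$$\min_{\hat{P}} \max_{Q} \mathbb{E}_{Q} \left[ \log \frac{P(x^n \mid \hat{\theta}(x^n, \mathcal{M}))}{\hat{P}(x^n \mid \mathcal{M})} \right] ,$$
where $Q$ is the worst-case data generating distribution, and $\mathbb{E}_{Q}$ is the expectation over $x^n$. That is, even if the true data generating distribution does not reside in the model class $\mathcal{M}$ under consideration, $P_{\mathrm{NML}}$ still gives the optimal encoding for the data relative to $\mathcal{M}$.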
These properties are very important and relevant when modelling realworld problems. In most cases, we do not know the true data generating distribution. In such cases, ideally we would want to encode our data as best as possible — close to the optimal under the true distribution. The NML distribution provides a theoretically sound means for that.
The stochastic complexity of data $x^n$ relative to a model class $\mathcal{M}$ using the NML distribution is defined as
$$S(x^n \mid \mathcal{M}) = -\log P_{\mathrm{NML}}(x^n \mid \mathcal{M}) = -\log P(x^n \mid \hat{\theta}(x^n, \mathcal{M})) + \log R^n_{\mathcal{M}} . \tag{1}$$
The term $\log R^n_{\mathcal{M}}$, the logarithm of the NML normalizing term, is the parametric complexity of the model class $\mathcal{M}$. It indicates how well $\mathcal{M}$ can fit random data.
The stochastic complexity of data under a model class $\mathcal{M}$ gives the shortest description of the data relative to $\mathcal{M}$. Hence the richer the model class $\mathcal{M}$, the closer we are to Kolmogorov complexity. Intuitively, it is also the amount of information, in bits, in the data relative to the model class. Moreover, it is evident from the formulation that the stochastic complexity of data, relative to a model class, depends only on the data and the model class, but not on the particular way the models are specified.
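The definitions in this section can be checked by brute force on a tiny example. The sketch below assumes the Bernoulli model class over binary sequences and enumerates all $2^n$ datasets to obtain the normalizing term, which is only feasible for very small $n$. The function names are hypothetical, and logarithms are taken to base 2 so that code lengths are in bits.

```python
import itertools
import math


def max_likelihood(seq):
    """P(x^n | ML estimate) for a binary sequence under the Bernoulli class."""
    n, h = len(seq), sum(seq)
    # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
    return (h / n) ** h * ((n - h) / n) ** (n - h)


def normalizer(n):
    """Sum of maximum likelihoods over all 2^n binary datasets of size n."""
    return sum(max_likelihood(s) for s in itertools.product((0, 1), repeat=n))


def nml_probability(seq):
    return max_likelihood(seq) / normalizer(len(seq))


def stochastic_complexity(seq):
    """-log2 of the maximum likelihood plus the parametric complexity."""
    return -math.log2(max_likelihood(seq)) + math.log2(normalizer(len(seq)))


x = (1, 1, 0, 1)
print(stochastic_complexity(x), -math.log2(nml_probability(x)))
```

Both printed values agree, and the gap between the NML code length and the maximum likelihood code length is the same for every dataset of a given size, which is the constant-regret property discussed above.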
4.3. Causal Inference by Stochastic Complexity
Unless stated otherwise, we write $X$ for $x^n$, and $Y$ for $y^n$. The stochastic complexity of data $X$ relative to model class $\mathcal{M}$ corresponds to the complexity of the NML distribution of the data relative to $\mathcal{M}$. This means we can use the stochastic complexity of $X$ as an approximation of the Kolmogorov complexity of $P(X)$. As such, it provides a general, yet computable, theoretically sound foundation for causal inference based on algorithmic information theory.
For ease of notation, wherever clear from context we write $S(X)$ for $S(X \mid \mathcal{M})$. To infer the causal direction, we look over the total stochastic complexity in two directions: $X$ to $Y$ and vice versa. The total stochastic complexity from $X$ to $Y$, approximating $K(P(X)) + K(P(Y \mid X))$, is given by
$$S_{X \to Y} = S(X) + S(Y \mid X) ,$$
and that from $Y$ to $X$ is given by
$$S_{Y \to X} = S(Y) + S(X \mid Y) .$$
Following Theorem 3.1, using the above indicators we arrive at the following causal inference rules.

If $S_{X \to Y} < S_{Y \to X}$, we infer $X \to Y$.

If $S_{X \to Y} > S_{Y \to X}$, we infer $Y \to X$.

If $S_{X \to Y} = S_{Y \to X}$, we are undecided.
That is, if describing $X$ and then describing $Y$ given $X$ is easier, in terms of the total stochastic complexity $S_{X \to Y}$, than vice versa ($S_{Y \to X}$), we infer $X$ is likely the cause of $Y$. If it is the other way around, we infer $Y$ is likely the cause of $X$. If both ways of describing are the same, we remain undecided. We refer to this framework as cisc, which stands for causal inference by stochastic complexity.
Causal inference using stochastic complexity has a number of powerful properties. First, unlike Kolmogorov complexity, stochastic complexity is computable. Second, the inference rule is generic in the sense that we are not restricted to one data type or distribution. We are only constrained by the model class $\mathcal{M}$ under consideration, yet by the minimax property of NML we know that even if the data generating distribution is adversarial, we still identify the best encoding relative to $\mathcal{M}$.
Next we discuss how we can instantiate cisc for discrete data.
4.4. Multinomial Stochastic Complexity
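We consider a discrete random variable $X$ with $K$ values (here $K$ denotes the number of categories, not Kolmogorov complexity). Furthermore we assume that our data $x^n$ is multinomially distributed. The space of observations $\mathcal{X}$ is then $\{1, 2, \dots, K\}$. The multinomial model class $\mathcal{M}_K$ is defined as
$$\mathcal{M}_K = \{ P(X \mid \theta) : \theta \in \Theta_K \} ,$$
where $\Theta_K$ is the simplex-shaped parameter space given by
$$\Theta_K = \{ \theta = (\theta_1, \dots, \theta_K) : \theta_j \ge 0, \; \theta_1 + \cdots + \theta_K = 1 \} ,$$
with $\theta_j = P(X = j \mid \theta)$, $j = 1, \dots, K$. The maximum likelihood parameters for a multinomial distribution are given by $\hat{\theta}(x^n, \mathcal{M}_K) = (h_1/n, \dots, h_K/n)$, where $h_j$ is the number of times an outcome $j$ is seen in $x^n$. Then the distribution induced by the maximum likelihood parameters for $x^n$ under the model class $\mathcal{M}_K$ is given by
$$P(x^n \mid \hat{\theta}(x^n, \mathcal{M}_K)) = \prod_{j=1}^{K} \left( \frac{h_j}{n} \right)^{h_j} .$$
The normalizing term $R^n_{\mathcal{M}_K}$ is given by
$$R^n_{\mathcal{M}_K} = \sum_{h_1 + \cdots + h_K = n} \frac{n!}{h_1! \cdots h_K!} \prod_{j=1}^{K} \left( \frac{h_j}{n} \right)^{h_j} . \tag{2}$$
Then the NML distribution for $x^n$ under the model class $\mathcal{M}_K$ is given by
$$P_{\mathrm{NML}}(x^n \mid \mathcal{M}_K) = \frac{\prod_{j=1}^{K} (h_j/n)^{h_j}}{R^n_{\mathcal{M}_K}} .$$
Then the stochastic complexity of $x^n$ for the model class $\mathcal{M}_K$ is given by
$$S(x^n \mid \mathcal{M}_K) = -\sum_{j=1}^{K} h_j \log \frac{h_j}{n} + \log R^n_{\mathcal{M}_K} , \tag{3}$$
with the convention $0 \log 0 = 0$.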
Computational Complexity — We can compute the counts $h_j$ in $O(n)$ by going through the data once. However, computing the normalizing sum (Equation 2) directly, and hence the parametric complexity, is exponential in the number of values $K$. As a result, the computational complexity of the multinomial stochastic complexity (Equation 3) is dominated by the computation time of the normalizing sum.
However, we can approximate the normalising sum up to a finite floating-point precision in sublinear time with respect to the data size $n$ given precomputed counts (Mononen and Myllymäki, 2008). More precisely, the computational complexity of the sublinear algorithm is , where is the floating-point precision in digits. In the experiments we use . Altogether we can compute the multinomial stochastic complexity in $O(n)$.
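For readers who want to experiment, the following sketch computes the multinomial stochastic complexity exactly in time linear in $n$ and $K$. Note that it does not implement the sublinear approximation of Mononen and Myllymäki cited above. Instead it swaps in the exact recurrence of Kontkanen and Myllymäki (2007), $R^n_{K+2} = R^n_{K+1} + \frac{n}{K} R^n_{K}$, started from $R^n_1 = 1$ and a direct sum for $K = 2$. The function names, and the default of taking $K$ as the number of distinct observed values, are assumptions for illustration.

```python
import math
from collections import Counter


def multinomial_normalizer(n, k):
    """Normalizing sum for n samples and k categories via the exact recurrence."""

    def log_term(h):
        # log of binom(n, h) * (h/n)^h * ((n-h)/n)^(n-h), computed in log space
        value = math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
        if h > 0:
            value += h * math.log(h / n)
        if h < n:
            value += (n - h) * math.log((n - h) / n)
        return value

    previous = 1.0  # k = 1: every dataset has maximum likelihood 1
    if k <= 1:
        return previous
    current = sum(math.exp(log_term(h)) for h in range(n + 1))  # k = 2
    for j in range(1, k - 1):
        # R(j + 2) = R(j + 1) + (n / j) * R(j)
        previous, current = current, current + (n / j) * previous
    return current


def stochastic_complexity(values, k=None):
    """Multinomial stochastic complexity of a sample, in bits."""
    n = len(values)
    counts = Counter(values)
    if k is None:
        k = len(counts)  # assumption: domain size = number of observed values
    neg_log_ml = sum(-h * math.log2(h / n) for h in counts.values())
    return neg_log_ml + math.log2(multinomial_normalizer(n, k))


print(stochastic_complexity([0, 0, 1, 1, 2, 0, 1, 0]))
```

For very large $K$ the normalizing sum itself can exceed the floating-point range, so a production implementation would keep the recurrence in log space; the sketch omits this for brevity.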
4.5. Computing Conditional Complexity
So far we only discussed how to compute the stochastic complexity of data under a model class. For our purpose, we also need to compute the conditional stochastic complexity $S(Y \mid X)$ and vice versa. Let $S(Y \mid X = x)$ be the stochastic complexity of $Y$ conditioned on $X = x$. Then the conditional stochastic complexity $S(Y \mid X)$ is the sum of $S(Y \mid X = x)$ over all possible values of $X$.
Let $\mathcal{X}$ be the domain of $X$. Then the stochastic complexity of $Y$ given $X$ is defined as
$$S(Y \mid X) = \sum_{x \in \mathcal{X}} S(Y \mid X = x) .$$
Computational Complexity — We can compute $S(X)$ in $O(n)$. To compute the conditional stochastic complexity $S(Y \mid X)$, we have to compute $S(Y \mid X = x)$ over all $x \in \mathcal{X}$. Hence the computational complexity of conditional stochastic complexity is $O(n)$. Likewise, for $S(X \mid Y)$, we have $O(n)$. Altogether the computational complexity of cisc is $O(n)$.
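Putting the pieces together, the whole inference rule fits in a few lines. The sketch below is not the released cisc implementation: it is a minimal reconstruction under stated assumptions. It uses an exact linear-time recurrence for the normalizing sum rather than the sublinear approximation, takes the domain size of each variable to be its number of distinct observed values, and uses hypothetical function names throughout.

```python
import math
from collections import Counter, defaultdict


def normalizer(n, k):
    """Multinomial normalizing sum for n samples and k categories."""

    def log_term(h):
        value = math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
        if h > 0:
            value += h * math.log(h / n)
        if h < n:
            value += (n - h) * math.log((n - h) / n)
        return value

    previous = 1.0
    if k <= 1:
        return previous
    current = sum(math.exp(log_term(h)) for h in range(n + 1))
    for j in range(1, k - 1):
        previous, current = current, current + (n / j) * previous
    return current


def sc(values, k):
    """Multinomial stochastic complexity of `values` over k categories, in bits."""
    n = len(values)
    neg_log_ml = -sum(h * math.log2(h / n) for h in Counter(values).values())
    return neg_log_ml + math.log2(normalizer(n, k))


def conditional_sc(targets, givens):
    """Sum of the stochastic complexities of `targets` within each value of `givens`."""
    k = len(set(targets))  # assumption: same target domain in every group
    groups = defaultdict(list)
    for g, t in zip(givens, targets):
        groups[g].append(t)
    return sum(sc(group, k) for group in groups.values())


def cisc(xs, ys):
    """Return (decision, S_{X->Y}, S_{Y->X}) for paired discrete samples."""
    s_xy = sc(xs, len(set(xs))) + conditional_sc(ys, xs)
    s_yx = sc(ys, len(set(ys))) + conditional_sc(xs, ys)
    if s_xy < s_yx:
        return "X->Y", s_xy, s_yx
    if s_xy > s_yx:
        return "Y->X", s_xy, s_yx
    return "undecided", s_xy, s_yx


xs = [0, 0, 1, 1, 2, 2, 0, 1]
ys = [1, 1, 0, 0, 1, 1, 1, 0]
print(cisc(xs, ys))
```

The absolute difference between the two returned scores can serve as a confidence value when ranking decisions.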
5. Related Work
Inferring causal direction from observational data is a challenging task due to the lack of controlled randomised experiments. However, it has also attracted quite a lot of attention over the years (Pearl, 2000; Spirtes et al., 2000; Shimizu et al., 2006; Janzing and Schölkopf, 2010). Yet, most of the causal inference frameworks are built for continuous real-valued data.
Constraint-based approaches like conditional independence tests (Spirtes et al., 2000; Pearl, 2000) are among the most widely used causal inference frameworks. However, they require at least three observed random variables. Therefore they cannot distinguish between $X \to Y$ and $Y \to X$ as the factorization of the joint distribution is the same in both directions, i.e. $P(X) P(Y \mid X) = P(Y) P(X \mid Y)$.
In recent years, several methods have been proposed that exploit the sophisticated properties of the joint distribution. The linear trace method (Janzing et al., 2010; Zscheischler et al., 2011) infers linear causal relations of the form $Y = AX$, where $A$ is the structure matrix that maps the cause to the effect, using the linear trace condition. The kernelized trace method (Chen et al., 2013) can infer nonlinear causal relations, but requires the causal relation to be deterministic, functional, and invertible. In contrast, we do not make any assumptions on the causal relation between the variables.
One of the key frameworks for causal inference is that of the Additive Noise Models (ANMs) (Shimizu et al., 2006). ANMs assume that the effect is a function of the cause and additive noise that is independent of the cause. Causal inference is then done by finding the direction that admits such a model. Over the years, many frameworks for causal inference from real-valued data have been proposed using ANMs (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009; Peters et al., 2014).
Algorithmic information theory provides a sound general theoretical foundation for causal inference (Janzing and Schölkopf, 2010). The key idea is that if $X$ causes $Y$, the shortest description of the joint distribution $P(X, Y)$ is given by the separate descriptions of the distributions $P(X)$ and $P(Y \mid X)$ (Janzing and Schölkopf, 2010). It has also been used in justifying additive noise model based causal discovery (Janzing and Steudel, 2010).
However, as Kolmogorov complexity is not computable, practical instantiations require computable notions of independence. For instance, the information-geometric approach (Janzing et al., 2012) defines independence via orthogonality in information space. Cure (Sgouritsa et al., 2015) defines independence in terms of the accuracy of the estimations of $P(Y \mid X)$ and $P(X \mid Y)$. Using algorithmic information theory, Vreeken (Vreeken, 2015) proposes a causal framework based on relative conditional complexity and instantiates it with cumulative entropy to infer the causal direction in continuous real-valued data. Budhathoki & Vreeken (Budhathoki and Vreeken, 2016) propose a decision tree based approach for causal inference on univariate and multivariate binary data.
All above methods consider either continuous real-valued or binary data. Causal inference from discrete data has received much less attention. Peters et al. (Peters et al., 2010) (dr) extend additive noise models to discrete data. However, regression is not ideal for modelling categorical variables, and as the method relies on a dependence measure, the choice of that measure affects the outcome. Liu & Chan (Liu and Chan, 2016) (dc) define independence in terms of the distance correlation between empirical distributions $P(X)$ and $P(Y \mid X)$ to infer the causal direction from categorical data. As such, it does not consider the space of all possible samples and hence overfits.
In contrast, we look over the space of all possible samples. Moreover, we provide a general, yet computable, theory for causal inference that is applicable to any type of data. In particular, we directly approximate Kolmogorov complexity using a score that is minimax optimal with regard to the model class under consideration. The computational complexity of our instantiation, cisc, is linear in sample size, regardless of the domain of the variables. In the experiments, we consider both dc and dr for comparison.
6. Experiments
We implemented cisc in Python and provide the source code for research purposes, along with the used datasets and the synthetic dataset generator (http://eda.mmci.unisaarland.de/cisc/). All experiments were executed single-threaded on an Intel Xeon E5-2643 v3 machine with GB memory running Linux. We consider synthetic, benchmark, and real-world data. In particular, we note that cisc is parameter-free. We compare cisc against Discrete Regression (dr) (Peters et al., 2010), and dc (Liu and Chan, 2016). In particular, we use significance level of for the independence test in dr, and threshold of for dc.
6.1. Synthetic Data
To evaluate cisc on data with known ground truth, we consider synthetic data. Generating non-trivial synthetic data with identifiable causal direction is surprisingly difficult, though. (Ideally we would generate data with known Kolmogorov complexity, and evaluate our inference methods accordingly, yet as Kolmogorov complexity is not computable it is not apparent how to do this in general.) We generate synthetic cause-effect pairs with ground truth $X \to Y$ using the additive noise model (ANM). That is, first we generate the cause $X$, and then generate the effect $Y$ using the model given by
$$Y = f(X) + N ,$$
where $f$ is a function, and $N$ is additive noise that is independent of $X$. Following Peters et al. (2010), we sample $X$ from the following distributions, using independently generated uniform noise.

uniform from ,

binomial with parameters ,

geometric with parameter ,

hypergeometric with parameters ,

Poisson with parameter ,

negative binomial with parameters , and

multinomial with parameters .
We note that even though we generate data following ANM from $X$ to $Y$, the joint distribution might admit an additive noise model in the reverse direction. Therefore in some cases where we say that $X \to Y$ is the true direction, $Y \to X$ might also be equally plausible, and hence full accuracy might not be achievable in some cases. However, this happens in only a few trivial instances (Peters et al., 2010).
We choose parameters of the distributions randomly for each model class. We choose uniformly between and , uniformly between and , uniformly between and , uniformly between and , uniformly between and , randomly s.t. , function uniformly between to , and noise uniformly between to , where is uniformly randomly chosen between and .
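The generation procedure can be sketched as follows. This is not the released dataset generator: the concrete ranges (`domain_size`, `f_range`, `noise_range`) are hypothetical placeholders, only the uniform cause distribution is shown, and the function name is an assumption. Other cause distributions can be substituted in the line that draws `xs`.

```python
import random


def generate_anm_pair(n, rng, domain_size=10, f_range=7, noise_range=3):
    """Draw n samples of (X, Y) with Y = f(X) + N and N independent of X.

    All ranges are illustrative placeholders, not the settings of the paper.
    Returns the two samples and the random function f as a dictionary.
    """
    # A random function from the domain of X to the integers.
    f = {x: rng.randint(-f_range, f_range) for x in range(1, domain_size + 1)}
    # Cause: uniform over {1, ..., domain_size}.
    xs = [rng.randint(1, domain_size) for _ in range(n)]
    # Effect: function of the cause plus independent uniform integer noise.
    ys = [f[x] + rng.randint(-noise_range, noise_range) for x in xs]
    return xs, ys, f


rng = random.Random(42)
xs, ys, f = generate_anm_pair(1000, rng)
print(xs[:5], ys[:5])
```

Because the noise is drawn independently of the cause, the ground truth of every generated pair is $X \to Y$ by construction, subject to the caveat above that some instances may also admit a model in the reverse direction.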
Accuracy — From each model class, we sample different models, and hence different cause-effect pairs. For each model, we sample points, i.e. . In Figure 1, we compare the accuracy (percentage of correct decisions) of cisc against dc and dr for various model classes. We see that cisc either outperforms or is as good as the other methods in all but one case. This indicates the generality of cisc.
Although we compute the stochastic complexity under the multinomial model class, we are still able to perform as well with other model classes. This is due to the optimality property of the NML distribution: even though the true data generating distribution is not inside the model class $\mathcal{M}$ under consideration, the NML distribution still gives the optimal encoding relative to $\mathcal{M}$. And as we see, it works well in most cases.
Decision Rate — Next we investigate the accuracy of cisc against the fraction of decisions cisc is forced to make. To this end, for each model class, we sample different cause-effect pairs. For each cause-effect pair, we sample points. We sort the pairs by their absolute score difference in two directions ($X \to Y$ vs. $Y \to X$), i.e. $|S_{X \to Y} - S_{Y \to X}|$, in descending order. Then we compute the accuracy over the top-ranked pairs. The decision rate is the fraction of top cause-effect pairs that we consider. Alternatively, it is also the fraction of cause-effect pairs whose $|S_{X \to Y} - S_{Y \to X}|$ is greater than some threshold $\delta$. For undecided pairs, we flip a coin. For other methods, we follow a similar procedure with their respective absolute score difference.
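The decision-rate evaluation described above can be sketched in a few lines. The input format, a list of (absolute score difference, correctness) tuples, and the function name are assumptions for illustration; pairs with a zero difference are treated as undecided and resolved by a coin flip.

```python
import random


def decision_rate_curve(results, rng=None):
    """Accuracy over the top-k most confident pairs, for every k.

    `results` holds (confidence, is_correct) tuples, where confidence is the
    absolute score difference between the two directions. Returns a list of
    (decision_rate, accuracy) points.
    """
    rng = rng or random.Random(0)
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    curve, correct = [], 0
    for k, (confidence, is_correct) in enumerate(ranked, start=1):
        if confidence == 0:
            is_correct = rng.random() < 0.5  # undecided: flip a coin
        correct += bool(is_correct)
        curve.append((k / len(ranked), correct / k))
    return curve


example = [(5.0, True), (1.0, False), (3.0, True), (2.0, True)]
for rate, accuracy in decision_rate_curve(example):
    print(rate, accuracy)
```

In this toy input the only wrong decision has the smallest score difference, so accuracy stays at 1.0 until the decision rate reaches 1.0.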
In Figure 2, we show the decision rate versus accuracy for different model classes. We see that both cisc and dr are highly accurate up to a very high decision rate in all cases. Both cisc and dr are highly accurate on the cause-effect pairs where the absolute score difference is very high, that is, where the methods are most decisive. dc, on the other hand, performs poorly in almost all cases. The only setting where dc has a relatively good performance is in the family of Uniform distributions.
The results indicate that we can increase the threshold $\delta$ on the absolute score difference, and hence lower the decision rate, to obtain higher accuracy.
Scalability — Next we empirically investigate the scalability of cisc. First, we examine runtime with regard to the sample size. To this end, we fix the domain size of the cause-effect pairs to , i.e. . Then for a given sample size, we sample $X$ uniformly randomly between and . Likewise for $Y$.
In Figure 3, we show the runtime of cisc, dc, and dr for various sample sizes. We observe that both cisc and dc (overlapping line) finish within seconds. dr, on the other hand, takes in the order of hours.
Next we fix the sample size to and vary the domain size . We observe that both cisc and dc again finish within seconds over the whole range. As dr iteratively searches over the entire domain, it shows a nonlinear runtime behaviour with respect to the domain size.
Overall, these results indicate that dr is fairly accurate, but relatively slow. dc, on the other hand, is fast, yet inaccurate. cisc is both highly accurate, and fast.
6.2. Benchmark Data
Next we evaluate cisc on benchmark cause-effect pairs with known ground truth (Mooij et al., 2016). In particular, we take 95 univariate cause-effect pairs. So far there does not exist a discretization strategy that provably preserves the causal relationship between variables. Since each cause-effect pair is from a different domain, using one discretization strategy over all the pairs is also unfair. Moreover, we do not know the underlying domain of the data. As a result, we treat the data as discrete for all the pairs.
In Figure 4, we compare the accuracy of cisc against dc and dr at various decision rates together with the confidence interval for a random coin flip. If we look over all the pairs, we find that cisc infers the correct direction in roughly of all the pairs. When we consider only those pairs where cisc is most decisive, with a very high absolute score difference, it is accurate on top of the pairs, accurate on top of the pairs, which is on par with the top-performing causal inference frameworks for continuous real-valued data (Sgouritsa et al., 2015; Janzing et al., 2012). On the other hand, the results from both dc and dr are insignificant at almost every decision rate.
6.3. Qualitative Case Studies
Next we evaluate cisc on real-world data for exploratory purposes.
Abalone — First we consider the Abalone dataset, which is available from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). The dataset contains the physical measurements of abalones, which are large, edible sea snails.
Out of the nine measurements, we consider the sex, length, diameter, and height. The length, diameter, and height of the abalone are all measured in millimetres, and have , and different values, respectively whereas the sex of the abalone is nominal (, , or ). Following Peters et al. (Peters et al., 2010), we regard the data as discrete, and consider sex $\to$ length, sex $\to$ diameter, and sex $\to$ height as the ground truth as sex causes the size of the abalone and not the other way around. cisc infers the correct direction in all three cases.
Car Evaluation — The Car Evaluation dataset is available from the UCI machine learning repository. It has rows, and is derived from a hierarchical decision model. It contains the evaluation of a car for buying purposes based on six characteristics of the car.
We consider the estimated safety of the car against the evaluation of the car. The safety feature of the car takes a nominal value (, , or ), and the evaluation feature of the car also takes a nominal value (, , , or ). We regard safety $\to$ evaluation as the ground truth as the safety of the car causes the decision on buying the car, but not vice versa. cisc identifies the correct direction.
Adult — The Adult dataset is taken from the UCI machine learning repository and consists of records from the census database of the US in .
Out of attributes, we consider only three – education (), occupation (), and income (). The domain of education attribute consists of dropout, associates, bachelors, doctorate, hsgraduate, masters, and profschool. For occupation, we have admin, armedforce, bluecollar, whitecollar, service, sales, professional, and otheroccupation as possible values. Lastly, for income attribute, we have two values: >50K and <=50.
As education intuitively causes income, and not vice versa, we regard education → income as the ground truth. Similarly, as occupation causes income, we regard occupation → income as the ground truth. We run cisc on both pairs and observe that it infers the causal direction correctly in both cases.
Overall, these results illustrate that cisc finds sensible causal directions from real-world data.
7. Discussion
The experiments show that cisc works well in practice. It reliably identifies the true causal direction across the data distributions we considered, and it is remarkably fast. On benchmark data, its performance is comparable to that of state-of-the-art causal inference frameworks for continuous real-valued data. Moreover, the qualitative case studies show that the results are sensible.
In this work, we give a general framework for causal inference based on the solid foundations of information theory. To apply the framework in practice, we only have to compute the stochastic complexity relative to a model class. The richer the model class, the better the solution. Although computing the stochastic complexity involves ranging over all possible datasets, it remains computable in theory, and efficient algorithms exist for certain model classes. The proposed framework lays a clear, computable foundation for the algorithmic causal inference principle postulated by Janzing and Schölkopf (2010).
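To make the computability remark concrete, the following is a minimal sketch for the multinomial model class only. It compares the normalizing sum of the multinomial NML distribution computed by brute force over all possible datasets with the same quantity computed through a recurrence known from the literature on multinomial NML. The function names are illustrative assumptions, and the recurrence is not necessarily the algorithm that cisc itself uses.

```python
from collections import Counter
from itertools import product
from math import comb


def normalizer_brute_force(n: int, k: int) -> float:
    """Sum the maximum-likelihood probability of every length-n sequence
    over k symbols. Exponential in n; only usable for tiny inputs."""
    total = 0.0
    for seq in product(range(k), repeat=n):
        prob = 1.0
        # Symbols that do not occur contribute a factor of 1.
        for count in Counter(seq).values():
            prob *= (count / n) ** count
        total += prob
    return total


def multinomial_normalizer(n: int, k: int) -> float:
    """Same sum, computed via C(j, n) = C(j-1, n) + n / (j-2) * C(j-2, n).

    Sketch only: the binomial terms overflow a float for n beyond roughly
    a thousand, so a production version would work in log space.
    """
    if n == 0 or k == 1:
        return 1.0
    prev = 1.0  # C(1, n): a single symbol admits exactly one dataset
    # C(2, n): closed-form sum over the number h of occurrences of symbol 0.
    # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
    curr = sum(
        comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    for j in range(3, k + 1):
        prev, curr = curr, curr + n / (j - 2) * prev
    return curr
```

For small inputs the two functions agree, while only the second remains practical as the sample size grows. The stochastic complexity of a sample is then the negative log maximum likelihood of the sample plus the logarithm of this sum.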
Although the results show the strength of the proposed framework, and of cisc in particular, we see many possibilities for further improvement. We instantiated the framework using multinomial stochastic complexity on discrete data. We see that cisc performs relatively well even in cases where the data is not sampled from the multinomial model class. This is due to the optimality property of the NML distribution: even if the true data generating distribution is not inside the model class under consideration, the NML distribution still gives the optimal encoding for the data relative to that model class. An interesting direction for future work is to instantiate the framework for other types of data (e.g. continuous real-valued, mixed, etc.) and model classes (e.g. family of Gaussians, Dirichlets, etc.). The key aspect to study would be efficient algorithms for computing the stochastic complexity for such model classes.
We define the conditional stochastic complexity as the sum of the stochastic complexities of one variable conditioned on each value of the other. This way we consider the local stochastic complexities of the parts of the conditioned variable's data that correspond to each value of the conditioning variable. Perhaps we can instead compute the conditional stochastic complexity globally, relative to the conditioning variable as a whole. It would also be interesting to explore factorized normalized maximum likelihood models (Roos et al., 2008) to instantiate the framework for multivariate data (Budhathoki and Vreeken, 2016).
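The definition of conditional stochastic complexity above can be sketched as follows for the multinomial model class. This is an illustration under stated assumptions, not the reference implementation of cisc: the names `xs`, `ys`, and all function names are hypothetical, domain sizes are taken to be the number of distinct observed values, and the decision rule (prefer the direction with the smaller total complexity) is our reading of the framework.

```python
from collections import Counter, defaultdict
from math import comb, log2


def log2_normalizer(n: int, k: int) -> float:
    """log2 of the multinomial NML normalizing sum for n samples, k symbols."""
    if n == 0 or k == 1:
        return 0.0
    prev = 1.0
    curr = sum(
        comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    for j in range(3, k + 1):
        prev, curr = curr, curr + n / (j - 2) * prev
    return log2(curr)


def stochastic_complexity(values, k: int) -> float:
    """Negative log2 maximum likelihood plus log2 normalizer, in bits."""
    n = len(values)
    if n == 0:
        return 0.0
    neg_log_ml = -sum(c * log2(c / n) for c in Counter(values).values())
    return neg_log_ml + log2_normalizer(n, k)


def conditional_sc(target, given, k_target: int) -> float:
    """Sum of the stochastic complexities of `target` restricted to each
    distinct value of `given` (the local, per-value definition)."""
    groups = defaultdict(list)
    for g, t in zip(given, target):
        groups[g].append(t)
    return sum(stochastic_complexity(part, k_target) for part in groups.values())


def infer_direction(xs, ys) -> str:
    """Compare the total complexity of both factorizations.
    Assumption: domain sizes equal the number of distinct observed values."""
    k_x, k_y = len(set(xs)), len(set(ys))
    x_to_y = stochastic_complexity(xs, k_x) + conditional_sc(ys, xs, k_y)
    y_to_x = stochastic_complexity(ys, k_y) + conditional_sc(xs, ys, k_x)
    if x_to_y < y_to_x:
        return "X -> Y"
    if y_to_x < x_to_y:
        return "Y -> X"
    return "undecided"
```

Because each group is scored separately, every distinct value of the conditioning variable pays its own normalizer term, which is the locality the paragraph above refers to. A global variant would replace the per-group sum in `conditional_sc` with a single score.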
To infer the causal relationship between two variables, we assume that there is no confounding variable. It would be interesting to use the framework to additionally discover confounding variables. The rough idea is that factorizing the joint complexity in the presence of the confounding variable leads to a smaller stochastic complexity than factorizing it into the marginal of one variable and the conditional of the other, in either direction.
Another avenue for future work would be to use the framework for causal discovery. The proposed framework infers the causal relationship between a given pair of variables. It would be interesting to explore how the framework can be employed to discover (mine) causal models directly from the data.
8. Conclusion
We considered causal inference from observational data. We proposed a general, yet computable framework for information-theoretic causal inference with optimality guarantees. In particular, we proposed to perform causal inference by stochastic complexity.
To illustrate the strength of this approach, we proposed cisc for pairs of univariate discrete variables, using stochastic complexity over the class of multinomial distributions. Extensive evaluation on synthetic, benchmark, and real-world data showed that cisc is highly accurate, outperforming the state of the art by a margin, and scales extremely well with regard to both sample and domain sizes.
Future work includes considering richer model classes, as well as structure learning for the discovery of causal models from data.
Acknowledgements.
Kailash Budhathoki is supported by the International Max Planck Research School for Computer Science. Both authors are supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.
References
 Budhathoki and Vreeken (2016) Kailash Budhathoki and Jilles Vreeken. 2016. Causal Inference by Compression. In ICDM. IEEE, 41–50.

 Chen et al. (2013) Z. Chen, K. Zhang, and L. Chan. 2013. Nonlinear Causal Discovery for High Dimensional Data: A Kernelized Trace Method. In ICDM. IEEE, 1003–1008.
 Grünwald (2007) Peter Grünwald. 2007. The Minimum Description Length Principle. MIT Press.
 Grünwald and Vitányi (2008) Peter D. Grünwald and Paul M. B. Vitányi. 2008. Algorithmic Information Theory. CoRR abs/0809.2754 (2008).
 Hoyer et al. (2009) PO. Hoyer, D. Janzing, JM. Mooij, J. Peters, and B. Schölkopf. 2009. Nonlinear causal discovery with additive noise models. In NIPS. 689–696.
 Janzing et al. (2010) D. Janzing, P. Hoyer, and B. Schölkopf. 2010. Telling cause from effect based on high-dimensional observations. In ICML. JMLR, 479–486.
 Janzing et al. (2012) Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. 2012. Information-geometric approach to inferring causal directions. AIJ 182–183 (2012), 1–31.
 Janzing and Schölkopf (2010) D. Janzing and B. Schölkopf. 2010. Causal Inference Using the Algorithmic Markov Condition. IEEE TIT 56, 10 (2010), 5168–5194.
 Janzing and Steudel (2010) D. Janzing and B. Steudel. 2010. Justifying Additive Noise Model-Based Causal Discovery via Algorithmic Information Theory. OSID 17, 2 (2010), 189–212.
 Kolmogorov (1965) A.N. Kolmogorov. 1965. Three Approaches to the Quantitative Definition of Information. Problemy Peredachi Informatsii 1, 1 (1965), 3–11.
 Li and Vitányi (1993) M. Li and P. Vitányi. 1993. An Introduction to Kolmogorov Complexity and its Applications. Springer.
 Liu and Chan (2016) Furui Liu and Laiwan Chan. 2016. Causal Inference on Discrete Data via Estimating Distance Correlations. Neur. Comp. 28, 5 (2016), 801–814.
 Mononen and Myllymäki (2008) Tommi Mononen and Petri Myllymäki. 2008. Computing the Multinomial Stochastic Complexity in Sub-Linear Time. In PGM. 209–216.
 Mooij et al. (2016) Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. 2016. Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. JMLR 17, 32 (2016), 1–102.
 Mooij et al. (2010) J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, and B. Schölkopf. 2010. Probabilistic latent variable models for distinguishing between cause and effect. In NIPS. Curran, 1687–1695.
 Myung et al. (2006) Jay I. Myung, Daniel J. Navarro, and Mark A. Pitt. 2006. Model selection by normalized maximum likelihood. J. Math. Psych. 50, 2 (2006), 167–179.
 Pearl (2000) Judea Pearl. 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA.
 Pearl (2009) Judea Pearl. 2009. Causality: Models, Reasoning and Inference (2nd ed.). Cambridge University Press, New York, NY, USA.
 Peters et al. (2010) J. Peters, D. Janzing, and B. Schölkopf. 2010. Identifying Cause and Effect on Discrete Data using Additive Noise Models. In AISTATS. JMLR, 597–604.
 Peters et al. (2014) J. Peters, JM. Mooij, D. Janzing, and B. Schölkopf. 2014. Causal Discovery with Continuous Additive Noise Models. JMLR 15 (2014), 2009–2053.
 Rissanen (1978) Jorma Rissanen. 1978. Modeling by shortest data description. Automatica 14, 1 (1978), 465–471.
 Rissanen (2001) Jorma Rissanen. 2001. Strong optimality of the normalized ML models as universal codes and information in data. IEEE TIT 47, 5 (2001), 1712–1717.
 Roos et al. (2008) T. Roos, T. Silander, P. Kontkanen, and P. Myllymäki. 2008. Bayesian network structure learning using factorized NML universal models. In Proc. Information Theory and Applications Workshop (ITA). IEEE.
 Schölkopf et al. (2012) B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. 2012. On Causal and Anticausal Learning. Omnipress, New York, NY, USA, 1255–1262.
 Sgouritsa et al. (2015) E. Sgouritsa, D. Janzing, P. Hennig, and B. Schölkopf. 2015. Inference of Cause and Effect with Unsupervised Inverse Regression. JMLR, 847–855.
 Shimizu et al. (2006) Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. 2006. A Linear Non-Gaussian Acyclic Model for Causal Discovery. JMLR 7 (2006), 2003–2030.
 Shtarkov (1987) Y. M. Shtarkov. 1987. Universal sequential coding of single messages. Problems of Information Transmission 23, 3 (1987), 175–186.
 Spirtes et al. (2000) P. Spirtes, C. Glymour, and R. Scheines. 2000. Causation, Prediction, and Search. MIT press.
 Vereshchagin and Vitanyi (2004) N.K. Vereshchagin and P.M.B. Vitanyi. 2004. Kolmogorov’s Structure functions and model selection. IEEE TIT 50, 12 (2004), 3265–3290.
 Vreeken (2015) Jilles Vreeken. 2015. Causal Inference by Direction of Information. In SDM. SIAM, 909–917.
 Zhang and Hyvärinen (2009) Kun Zhang and Aapo Hyvärinen. 2009. On the Identifiability of the Post-nonlinear Causal Model. In UAI. AUAI Press, 647–655.

 Zscheischler et al. (2011) J. Zscheischler, D. Janzing, and K. Zhang. 2011. Testing whether linear equations are causal: A free probability theory approach. AUAI Press, 839–847.