1 Introduction
Detecting causal relationships from data is a significant issue in many disciplines. Understanding the causal relations between variables helps to predict how a system behaves under intervention, stabilizes future predictions, and has many other important implications. Identifying causal links (causal discovery) from observed data alone is only possible with further assumptions and/or additional data. Despite the various causal discovery methods available, the problem of finding the causal structure between two random variables remains notoriously hard. In this paper, we use additional data and assume a very natural principle to solve that task. Our work is based on a mathematical framework proposed by Pearl Pea09 that formalizes causality and causal relations. It introduces causal models that represent an (unknown) underlying data generation mechanism responsible for the distribution of the sampled data PJS17 . We include sampled data from situations (environments) where interventions took place together with samples from pure observations. Recent developments in that direction revealed promising results PBM16 ; HPM18 , but these methods are often conservative, leading to situations where no direction is preferred.
This paper focuses on the bivariate discrete case and is based on a natural and weak principle. The principle of independent mechanisms assumes that the data-generating mechanism is independent of the data that is fed into it. From this principle, we derive an invariance relation stating that it does not matter whether we observe an effect due to an observation of its cause or due to an intervention on its cause. Distributions that are generated by an underlying causal model fulfill these invariance relations. If $X$ and $Y$ are discrete random variables, then we can characterize the support set of joint distributions that fulfill these relations by embedding the distributions from observational and interventional samples into a higher-dimensional space and creating a link between them. That means we first embed the empirical distributions into a higher-dimensional space and then find the best approximation of this embedding among the probability distributions that are compatible with the invariance principle, such that the relative entropy between them is minimized. We call this approach an information-theoretic approximation to causal models (IACM), since the relative entropy can be interpreted as an error telling us how much a finite sample deviates from a sample that comes from an assumed underlying causal model. It turns out that solving this optimization problem is equivalent to solving a linear optimization problem, which results in an efficient algorithm.
We use IACM to formulate a causal discovery algorithm that infers the causal direction of two random variables. For this, we approximate to a causal model where $X$ causes $Y$ and to a model where $Y$ causes $X$. We prefer the direction that has the lower relative entropy. With appropriate preprocessing, this can also be applied to continuous data. If we additionally assume that the underlying causal model is monotonic w.r.t. $X$ or $Y$, then we can include this assumption in the support set characterization used by our approach. In the case of binary random variables, an approximation to a monotonic causal model enables us to calculate probabilities of how necessary, sufficient, or necessary and sufficient a cause is for an effect, as defined in Pea09 . We use this as a measure of the strength of a causal link and include it in our causal discovery algorithm.
For the rest of this paper, we assume that we have two random variables $X$ and $Y$ that attain values in finite ranges $\mathcal{X}$ and $\mathcal{Y}$, respectively. The contribution of this paper is twofold. The first contribution is an approximation of (empirical) distributions to a set of distributions that is compatible with an invariance condition induced by an assumed causal model. The second contribution is a method for causal discovery based on this approximation procedure. This method can also be applied if the observed data from $X$ and $Y$ are heterogeneous and continuous. In experiments, we were able to verify the strength of our causal discovery approach; especially for discrete ranges with low cardinality, it outperformed alternative state-of-the-art methods.
The paper is organized as follows. Section 2 introduces causal models and formulates the invariance statement. In Section 3 we present an information-theoretic approximation of distributions to one that is generated by a causal model. We derive the theoretical foundation, illustrate the results for the binary case, and formulate the approximation algorithm. Section 4 shows applications of the approximation algorithm, in particular the calculation of probabilities for causes and the application to causal discovery. Section 5 describes experiments to verify our approach, and we conclude in Section 6.
2 Causal Models
We describe causal relations in the form of a directed graph $G = (V, E)$ with a finite vertex set $V$ and a set of directed edges $E \subseteq V \times V$. A directed edge from $u$ to $v$ is an ordered pair $(u, v)$, often represented as an arrow between vertices, e.g. $u \to v$. For a directed edge $(u, v)$ the vertex $u$ is a parent of $v$ and $v$ is a child of $u$. The set of parents of a vertex $v$ is denoted by $PA_v$. We only consider directed graphs that have no cycles and call them directed acyclic graphs (DAGs). In a DAG we interpret the vertices as random variables and a directed edge $(X, Y)$ as a causal link between $X$ and $Y$. We say that $X$ is a direct cause of $Y$ and $Y$ is a direct effect of $X$. We further specify causal links by introducing functional relations between them.
Definition 1
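A structural causal model (SCM) is a tuple $\mathfrak{C} = (S, P_N)$ where $S$ is a collection of $d$ structural assignments
$$X_j := f_j(PA_j, N_j), \quad j = 1, \dots, d,$$
where $PA_j$ are the parents of $X_j$ and $P_N$ is a joint distribution over the noise variables $N_1, \dots, N_d$ that are assumed to be jointly independent.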
We consider an SCM as a model for a data-generating process PJS17 . This enables us to model a system in an observational state and under perturbations at the same time. An SCM defines a unique distribution over its variables. Perfect interventions are done by replacing an assignment in an SCM. Given an SCM we can replace the assignment for $X$ by $X := x$. The distribution of that new SCM is denoted by $P^{do(X:=x)}$ and called the intervention distribution PJS17 ; Pea09 . When modeling causality, we assume the principle of independent mechanisms. Roughly speaking, this principle states that a change in a variable does not change the underlying causal mechanism, see PJS17 . Formally for an SCM, this means that a change in a child variable will not change the mechanism that is responsible to obtain an effect from . From this principle the following invariance statement follows:
$$p(y \mid x) = p^{do(X:=x)}(y), \qquad (1)$$
where $p(y \mid x)$ is the conditional density of $Y$ given $X = x$ evaluated at $y$ for some $x$. Informally, this means that, if $X$ is a cause of $Y$, then it does not matter whether we observe $y$ when $x$ is present due to an observation of $X$ or when $x$ is present due to an intervention on $X$.
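The invariance statement can be checked on a toy example. The following is a minimal sketch, assuming a hypothetical binary SCM without confounding ($X := N_X$, $Y := X \oplus N_Y$); the noise probabilities and all function names are illustrative choices and are not taken from the paper. It enumerates the noise values exactly and compares the observational conditional of $Y$ given $X = x$ with the distribution of $Y$ under the intervention $do(X := x)$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical binary SCM (an assumption for illustration only):
#   X := N_X,  Y := X xor N_Y,  with independent noises N_X and N_Y.
P_NX = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_NY = {0: Fraction(4, 5), 1: Fraction(1, 5)}


def mechanism_y(x, n_y):
    """Structural assignment for Y; it is left untouched by interventions on X."""
    return x ^ n_y


def observational_joint():
    """Exact joint distribution P(X = x, Y = y) of the unintervened SCM."""
    joint = {}
    for (n_x, p_nx), (n_y, p_ny) in product(P_NX.items(), P_NY.items()):
        x = n_x
        y = mechanism_y(x, n_y)
        joint[(x, y)] = joint.get((x, y), Fraction(0)) + p_nx * p_ny
    return joint


def conditional_y_given_x(joint, x):
    """Observational conditional P(Y = y | X = x)."""
    p_x = sum(p for (xx, _), p in joint.items() if xx == x)
    return {y: joint.get((x, y), Fraction(0)) / p_x for y in (0, 1)}


def interventional_y(x):
    """Distribution of Y after replacing the assignment of X by the constant x."""
    dist = {0: Fraction(0), 1: Fraction(0)}
    for n_y, p_ny in P_NY.items():
        dist[mechanism_y(x, n_y)] += p_ny
    return dist


joint = observational_joint()
for x in (0, 1):
    # Invariance: observing X = x and setting X := x give the same law for Y.
    assert conditional_y_given_x(joint, x) == interventional_y(x)
```

With a hidden common cause of $X$ and $Y$ the two distributions would in general differ, which is why the absence of confounding matters for this check.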
3 Approximation to Causal Models
3.1 The General Case
We are given two random variables $X$, $Y$ with finite ranges $\mathcal{X}$, $\mathcal{Y}$ and data from observations of $X$, $Y$ as well as from interventions on $X$ or $Y$.¹ Alternatively, we can say that we have data of $X$ and $Y$ from different environments, where each environment belongs to a different intervention on $X$ or $Y$. We further assume that the data from different interventions are independent of each other, and that there is no confounding variable. In practical applications, the interventional data can be obtained from experiments or more implicitly from heterogeneous data. Condition (1) is, in general, not fulfilled by empirical distributions obtained from such data. We derive a method that enables us to find a joint probability distribution of $X$ and $Y$ that fulfills (1) and is closest to a given empirical distribution in an information-theoretic sense.
¹ We can also relax the assumption of having interventional data and assume that the data are heterogeneous and show a rich diversity.
Without loss of generality we assume that the intervention took place on with values in and . In the following we assume that the elements of are in a fixed order. We summarize , , where are the observed data and the interventional data. We define that takes values in and with we denote the joint distribution over . The space of probability distributions on is denoted by and for the marginalization of is defined by with , where and . The next Lemma gives us a characterization of distributions that fulfill (1).
Lemma 1
The set of joint probability distributions for , for all which fulfill the consistency condition (1) is called and given as
where .
The proof is given in the Supplementary Material. The support of is therefore given by
Given observational and interventional samples of and and its corresponding empirical distributions , for we try to find a distribution such that
(2) 
We can always find a joint distribution such that (2) holds, since the distributions for all are independent of each other. Although this does not guarantee , we can try to find a distribution in that has minimal relative entropy to . This minimal relative entropy can be interpreted as an approximation error to an assumed causal model. The relative entropy or Kullback-Leibler divergence (KL divergence) between two distributions is defined as follows
We use the convention that for , see also CT91 ; Kak99 . This leads to:
(3) 
This is a nonlinear min-min optimization problem with linear constraints. It turns out that in our situation, the problem simplifies to a linear optimization problem.
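For readers who want to experiment with the relative entropy used above, the following is a minimal sketch for discrete distributions given as lists of probabilities. It assumes the usual conventions from information theory ($0 \log(0/q) = 0$ and an infinite value when the first argument puts mass where the second has none); the function name is illustrative.

```python
import math


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i).

    Assumed conventions: terms with p_i == 0 contribute 0, and the result
    is infinite if p_i > 0 while q_i == 0.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # convention: 0 * log(0 / q_i) = 0
        if q_i == 0:
            return math.inf  # p is not absolutely continuous w.r.t. q
        total += p_i * math.log(p_i / q_i)
    return total
```

Note that the relative entropy is not symmetric, so the order of the two arguments matters when it is used as an approximation error.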
Proposition 1
The proof is given in the Supplementary Material and is an application of the Lagrange multiplier method. The statements of Proposition 1 also hold for any other support set characterization rather than . The global approximation error is given by . Inspired by JMZLZDSS12 ; DJMZSZS10 ; JBGS13 and by the intuition that the information loss from to should be smaller than the other way round (due to an assumed mechanism from to ), the quantity could also be seen as a kind of approximation error to the causal model .
3.2 The Binary Case
To illustrate our approach, we consider the binary case. That means , and . The set of consistent probability distributions is characterized by
and therefore . A probability distribution is a nonnegative vector with elements that sum up to $1$. We encode the conditions (2) into a constraint matrix that takes the following form
and into a corresponding right-hand side
The nonnegativity can be encoded in an identity matrix of length and a zero vector of length as the right-hand side. A probability distribution that solves (2) is then a solution to the following linear optimization problem
The proof of Proposition 1 tells us that a distribution that fulfills condition (1) and is as close as possible to in an information-theoretic sense can be obtained by the following reweighting of
3.3 Implementation
The procedure in Subsection 3.2 can be generalized to arbitrary finite ranges of $X$ and $Y$. The pseudocode of the algorithm is shown in Algorithm 1. The size of the finite ranges is denoted by and . We further assume that we have for every interventional data available. Therefore, the constraint matrix has dimension . The first row of contains at each column, the following rows contain the support patterns of and the final rows contain the support pattern of . The function getConstraintDistribution prepares the right-hand side of accordingly. Since we assumed that the intervention took place on the underlying assumed causal model is . Note that depends on .
We implemented this procedure in Python and used the cvxpy package to solve the linear program. The dimension of the constraint matrix grows exponentially with the size of the ranges of $X$ and $Y$. However, it turns out that it is enough to consider low range sizes . For possibly preprocessed sample data of $X$ and $Y$ with much higher or continuous range sizes, we apply an equal-frequency discretization based on quantiles, which is done before calculating the empirical distributions and feeding them into Algorithm 1.
4 Applications
The approximation approach described in Section 3 has several applications. We describe two of them in the following subsections.
4.1 Probabilities for Causes
To measure the effect of a cause-effect relation, Pearl proposed in Pea09 counterfactual statements that give information about the necessity, the sufficiency, and the necessity and sufficiency of cause-effect relations. A counterfactual statement is a do-statement in a hypothetical situation that can, in general, not be observed or simulated. Formally, this means we condition an SCM on an observed situation and apply a do-operator. The corresponding intervention distribution reads for example which means the probability that equals if would have been where indeed we observed that is and is .
Definition 2
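Let $X$, $Y$ be random variables in an SCM such that $X$ is a (hypothetical) cause of $Y$, let $x$, $y$ be values of $X$, $Y$, and let $Y_x$ denote the value of $Y$ under the intervention $do(X := x)$.
- The probability that $X$ is necessary as a cause for an effect $Y$ is defined as $PN := P(Y_{x'} = y' \mid X = x, Y = y)$, where $x' \neq x$ and $y' \neq y$.
- The probability that $X$ is sufficient as a cause for an effect $Y$ is defined as $PS := P(Y_x = y \mid X = x', Y = y')$.
- The probability that $X$ is necessary and sufficient as a cause for an effect $Y$ is defined as $PNS := P(Y_x = y, Y_{x'} = y')$.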
In general, counterfactual statements cannot be calculated from observational data and without knowing the true underlying SCM. However, Pearl identified situations in which we can exploit the presence of observational and interventional data to calculate the probabilities defined above. One such situation is when the underlying SCM is monotonic.
Definition 3
An SCM with $Y := f(X, N_Y)$ for two random variables $X$ and $Y$ is called monotonic relative to $X$ if and only if $f$ is monotonic in $x$ independent of $N_Y$.
If $X$ and $Y$ are binary and if the SCM is increasing monotonic relative to $X$, then Theorem 9.2.15 in Pea09 gives us
(4) 
(5) 
(6) 
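As an illustration of how such probabilities are obtained from observational and interventional quantities, the following is a minimal sketch that uses the commonly cited form of Pearl's identities for binary variables under increasing monotonicity: $PNS = P(y_x) - P(y_{x'})$, $PN = (P(y) - P(y_{x'})) / P(x, y)$, and $PS = (P(y_x) - P(y)) / P(x', y')$. The notation, the function name, and the toy model are assumptions for illustration, and no particular correspondence to the numbering (4)-(6) is implied.

```python
import math


def probabilities_of_causation(p_y_do_x, p_y_do_not_x, p_y, p_x_y, p_notx_noty):
    """Return (PN, PS, PNS) for binary X, Y under increasing monotonicity.

    p_y_do_x     : P(Y = 1 | do(X := 1))   (interventional)
    p_y_do_not_x : P(Y = 1 | do(X := 0))   (interventional)
    p_y          : P(Y = 1)                (observational)
    p_x_y        : P(X = 1, Y = 1)         (observational)
    p_notx_noty  : P(X = 0, Y = 0)         (observational)
    """
    pns = p_y_do_x - p_y_do_not_x
    pn = (p_y - p_y_do_not_x) / p_x_y
    ps = (p_y_do_x - p_y) / p_notx_noty
    return pn, ps, pns


# Toy monotone model (hypothetical): X ~ Bernoulli(0.5), Y := X or N,
# with N ~ Bernoulli(0.2) independent of X.
pn, ps, pns = probabilities_of_causation(
    p_y_do_x=1.0,       # Y is always 1 when X is set to 1
    p_y_do_not_x=0.2,   # Y = N when X is set to 0
    p_y=0.6,            # 0.5 * 1.0 + 0.5 * 0.2
    p_x_y=0.5,          # X = 1 implies Y = 1
    p_notx_noty=0.4,    # 0.5 * 0.8
)
```

In this toy model the counterfactual quantities can also be read off directly: $Y$ would have been $0$ without $X = 1$ exactly when $N = 0$, which matches the value $0.8$ for both PN and PNS.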
Similarly, if the SCM is decreasing monotonic relative to $X$, we can derive the following formulas in the same fashion as Pearl:
(7) 
(8) 
(9) 
By approximating empirical observational and interventional distributions to a monotonic causal model we can calculate , , and . To do this, we need to further restrict the set and note that the monotonicity of implies that either and has zero probability or that and has zero probability. This means that either or has to hold in addition to the conditions given in . We define as the set of probability conditions with an underlying monotonic increasing data generation process and as the set of probability conditions with an underlying monotonic decreasing data generation process. An approximation in the sense of Subsection 3.1 to or instead of will only change the definition of , the rest will remain the same. In order to calculate , , and we approximate to and , choose the one with the least approximation error and use the formulas given above. We state the pseudocode in Algorithm 2.
4.2 Causal Discovery
When we assume that $X \to Y$, we can test how well the given data fit that assumption and obtain an approximation error . Switching the roles of $X$ and $Y$ we get . The direction with the smallest error is the one we infer as the causal direction. If the error difference is below a small tolerance , we consider both directions as equal and return "no decision". If $X$ and $Y$ are binary and the error to the monotone models is smaller than to the nonmonotone models, then we apply Algorithm 2 to determine for both directions and use this as a decision criterion for the preferred direction (the direction with the higher determines the direction). In general, some kind of data preprocessing and discretization before applying the causal discovery method is advantageous. In our implementation, we included several different preprocessing steps that treat $X$ and $Y$ differently depending on the assumed causal direction; see the Supplementary Material for more details.
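The basic decision rule described above can be sketched as follows. This is a simplified illustration that only covers the comparison of the two approximation errors; the function name, the returned labels for the two directions, and the default tolerance are hypothetical, and the refinement for binary monotone models is omitted.

```python
def infer_direction(error_x_to_y, error_y_to_x, tolerance=1e-3):
    """Pick the causal direction with the smaller approximation error.

    error_x_to_y : approximation error when assuming X causes Y
    error_y_to_x : approximation error when assuming Y causes X
    tolerance    : hypothetical threshold below which no decision is made
    """
    if abs(error_x_to_y - error_y_to_x) < tolerance:
        return "no decision"
    return "X -> Y" if error_x_to_y < error_y_to_x else "Y -> X"
```

For example, `infer_direction(0.02, 0.15)` prefers the direction from $X$ to $Y$, while two errors that differ by less than the tolerance yield "no decision".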
5 Experiments
We test Algorithm 3 with synthetic and real-world benchmark data against alternative causal discovery methods. It runs with , and with a heuristic preprocessing procedure described in the Supplementary Material. Depending on the input data structure, we used or as an approximation error; see also the Supplementary Material for more details.
5.1 Pairwise Causal Discovery Methods
Among the various causal discovery approaches for continuous, discrete, and nonlinear bivariate data, we select those that do not involve any training on labeled cause-effect pairs in order to have a fair comparison. One well-known method uses additive noise models (ANM), which assume SCMs with additive noise and apply to continuous and discrete data HJMPS09 ; PJS11 . Furthermore, we select an information-geometric approach (IGCI) JMZLZDSS12 designed for continuous data and some recent methods for discrete data that use minimal description length (CISC) BV17 , Shannon entropy (ACID) BV18 , and a compact representation of the causal mechanism (HCR) CQZZH18 . We further select conditional distribution similarity (CDS) F16 , regression error based causal inference (RECI) BJWSS18 , and (nonlinear) invariant causal prediction (nonICP, ICP) HPM18 ; PBM16 as baseline methods. For all methods we used the default parameter settings.²
² For HCR, nonICP, and ICP we use the R packages from the references, for CISC and ACID the corresponding Python code, and for ANM, IGCI, CDS, and RECI the Python package Causal Discovery Toolbox KG19 .
5.2 Synthetic Data
We generate a set of synthetic data that differ in their structure (linear, nonlinear, discrete, non-discrete) and their range size. These synthetic data consist of observed data and data from perfect interventions. We use SCMs with additive or multiplicative noise to generate these data or where are sampled independently from a distribution for which the degrees of freedom are chosen randomly from and we randomly decide if we use an additive or multiplicative model. The nonlinear function is randomly selected between the following functions , , and . The linear function is given by , where is randomly selected from the interval . The discrete data are generated using a bins discretization. We simulate perfect interventions on by setting them to every value in the range if the range is discrete and to some randomly selected value if the range is continuous. The sample size is chosen randomly from and we run simulations for each configuration. Figure 1 shows the averaged accuracy of correctly inferred causal directions for linear synthetic data relative to the difference in range size for small range sizes. Our method performs substantially better for positive differences than all alternative approaches. A similar picture can be seen in Figure 1 for nonlinear synthetic data. Therefore, it seems that our method is well suited for situations where the range size of the cause is greater than the range size of the effect. In Figure 2 we see that our method also performs well for synthetic linear and nonlinear continuous data.
5.3 Real-World Data
As a benchmark set consisting of real-world data, we use manually labeled continuous cause-effect pairs (CEP) from different contexts MPJZS16 ; DG19 . As real-world discrete data sets, we use anonymous discrete cause-effect pairs where food intolerances cause health issues (Food)³, the Pittsburgh bridges dataset (Bridge, pairs) from DG19 as it has been used in CQZZH18 , and the Abalone dataset (Abalone, pairs) from the UCI Machine Learning Repository DG19 . For the CEP data set IACM outperforms all other methods. Also for discrete real-world data we can see in Figure 2 that IACM successfully recovers all causal directions and can keep pace with state-of-the-art methods.
³ This dataset, given as discrete time-series data, has been provided by the author, and the causal direction has been independently confirmed by medical tests.
6 Discussion
In this paper, we proposed a way to approximate empirical distributions coming from observations and experiments by distributions that follow the restrictions enforced by an assumed causal model. This can be used to calculate probabilities of causation and leads to a new causal discovery method. In our experiments, we could confirm that our approach can compete with the current state-of-the-art methods, also on real-world datasets (continuous and discrete) and without explicit knowledge of experimental data. Especially for the discrete setting in which the range size of the cause is greater than the range size of the effect, our method has advantages compared to other approaches. Since IACM ran with small range sizes and , it seems that in many cases the essential cause-effect information can be encoded with much less information than we might have in the data. This is interesting in itself and could serve as a basis for future research.
Broader Impact
Like all methods for causal discovery that use observational data and/or data from implicit interventions, the work in this paper could help to avoid unethical experiments. Furthermore, it contributes to a more reliable detection of causal relations, since it fills a gap in the existing causal discovery landscape. Therefore, our research can help during the evaluation and design of studies with few discrete features and contribute to more solid conclusions of those studies. On the other hand, there is a potential risk that our method is used for data that do not follow the assumptions of this paper. This may lead to wrong causal directions and to wrong conclusions, but can be avoided by checking the assumptions on the data before applying our method. Finally, it should be noted that this article may inspire future research projects in the field.
References

(BJW18) P. Bloebaum, D. Janzing, T. Washio, S. Shimizu, and B. Schölkopf, Cause-effect inference by comparing regression errors, International Conference on Artificial Intelligence and Statistics (2018), 900–909.
(BV17) K. Budhathoki and J. Vreeken, MDL for causal inference on discrete data, 2017 IEEE International Conference on Data Mining (ICDM) (2017), 751–756.
(BV18) K. Budhathoki and J. Vreeken, Accurate causal inference on discrete data, 2018 IEEE International Conference on Data Mining (ICDM) (2018), 881–886.
(CLRS01) T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, MIT Press, Cambridge, 2001.
(CQZ18) R. Cai, J. Qiao, K. Zhang, Z. Zhang, and Z. Hao, Causal discovery from discrete data using hidden compact representation, Adv Neural Inf Process Syst (2018), 2666–2674.
(CT91) T. Cover and J. Thomas, Elements of information theory, John Wiley and Sons, 1991.
(DG19) D. Dua and C. Graff, UCI machine learning repository, 2019.
(DJM10) P. Daniušis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf, Inferring deterministic causal relations, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (2010), 143–150.
(Fon19) J. Fonollosa, Conditional distribution variability measures for causality detection, Cause Effect Pairs in Machine Learning (I. Guyon, A. Statnikov, and B. Batu, eds.), Springer, 2019.
(HDPM18) C. Heinze-Deml, J. Peters, and N. Meinshausen, Invariant causal prediction for nonlinear models, Journal of Causal Inference 6 (2018), no. 2.
(HJM09) P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf, Nonlinear causal discovery with additive noise models, Neural Information Processing Systems (NIPS) (2009), 689–696.
(JBGWS13) D. Janzing, D. Balduzzi, M. Grosse-Wentrup, and B. Schölkopf, Quantifying causal influences, The Annals of Statistics 41 (2013), no. 5, 2324–2358.
(JMZ12) D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf, Information-geometric approach to inferring causal directions, Artificial Intelligence 182–183 (2012), 1–31.
(Kak99) Y. Kakihara, Abstract methods in information theory, World Scientific Publishing Co. Pte. Ltd., 1999.
(KG19) D. Kalainathan and O. Goudet, Causal discovery toolbox: Uncover causal relationships in Python.
(MPJ16) J. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf, Distinguishing cause from effect using observational data: methods and benchmarks, Journal of Machine Learning Research 17 (2016), 1–102.
(PBM16) J. Peters, P. Bühlmann, and N. Meinshausen, Causal inference by using invariant prediction: identification and confidence intervals, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (2016), no. 5, 947–1012.
(Pea09) J. Pearl, Causality: models, reasoning, and inference, Cambridge University Press, 2009.
(PJS11) J. Peters, D. Janzing, and B. Schölkopf, Causal inference on discrete data using additive noise models, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011), 2436–2450.
(PJS17) J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference, MIT Press, 2017.
Supplementary Material for Information-Theoretic Approximation to Causal Models
Appendix A Proofs
A.1 Proof of Lemma 1
Lemma 1
The set of joint probability distributions for , for all which fulfill the consistency condition (1) is called and given as
where .
Proof.
The consistency condition (1) implies the following relation for some and
This relation implies
which characterizes the joint distributions that satisfy (1).
A.2 Proof of Proposition 1
Proposition 1
The optimization problem (3) simplifies to the following linear optimization problem
with .
Proof.
We first consider the inner minimization problem of (3) for a given joint distribution . This is a constrained optimization problem where the constraints in are equivalent to the equation
since is a probability distribution. Therefore, the Lagrange functional of this minimization problem reads
with as Lagrange multiplier. Using the Lagrange multiplier method we obtain explicit expressions for the approximating distribution
for and for all . Thus we have solved the inner minimization problem explicitly and the relative entropy simplifies to
Therefore, we can now optimize on the space of possible joint distributions and (3) simplifies to
Since is a monotone function it suffices to maximize given the constraints. This is nothing other than a linear optimization problem, which can be solved by linear programming using the simplex algorithm, see, for example, CLRS01 .
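The reweighting idea behind this proof can be illustrated numerically. The following is a minimal sketch under two assumptions that are not spelled out in the surviving text: the consistent set is characterized purely by a support restriction, and the relative entropy is taken from the approximating distribution to the given one. Under these assumptions, restricting a distribution to the support set and renormalizing gives the closest distribution, and the resulting relative entropy equals the negative logarithm of the retained mass, so minimizing the error amounts to maximizing a linear function of the given distribution. All names are illustrative.

```python
import math


def relative_entropy(q, p):
    """D(q || p) with the convention 0 * log(0 / p_i) = 0."""
    total = 0.0
    for q_i, p_i in zip(q, p):
        if q_i == 0:
            continue
        if p_i == 0:
            return math.inf
        total += q_i * math.log(q_i / p_i)
    return total


def project_onto_support(p_hat, support):
    """Restrict p_hat to the index set `support` and renormalize.

    Returns the reweighted distribution and the error -log(mass on support).
    """
    mass = sum(p_hat[i] for i in support)
    if mass == 0:
        raise ValueError("p_hat puts no mass on the support set")
    q = [p_hat[i] / mass if i in support else 0.0 for i in range(len(p_hat))]
    return q, -math.log(mass)


p_hat = [0.1, 0.2, 0.3, 0.4]   # toy distribution (hypothetical numbers)
support = {0, 3}               # toy support set (hypothetical)
q, error = project_onto_support(p_hat, support)

# Any other distribution on the same support is further away from p_hat.
other = [0.5, 0.0, 0.0, 0.5]
assert relative_entropy(other, p_hat) > relative_entropy(q, p_hat)
```

Because the error depends on the given distribution only through the mass it places on the support set, the outer optimization over admissible joint distributions becomes linear in that distribution.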
Appendix B Application to Timeseries Data
Algorithm 1 can also be applied when we assume that the underlying causal model has a time lag of , which is , and the observational and interventional data have a time order. We only have to shift the incoming data for and so that Algorithm 1 applies to , and we have to take care to preserve the data order during the preprocessing steps. If we do not know the exact time lag, we can run the approximation several times with different time lags to find the approximation with the lowest error.
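The shifting step can be sketched as follows. This is a minimal illustration, assuming that the cause at time $t$ is paired with the effect at time $t + \text{lag}$; the function names are hypothetical, and `error_fn` stands in for whatever approximation error is computed on the shifted data.

```python
def shift_by_lag(x, y, lag):
    """Align x[t] with y[t + lag] while preserving the time order."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if lag < 0 or lag >= len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    if lag == 0:
        return list(x), list(y)
    return list(x[:-lag]), list(y[lag:])


def best_lag(x, y, lags, error_fn):
    """Return the candidate lag whose shifted data give the lowest error.

    error_fn is a placeholder: any callable taking the shifted sequences
    (xs, ys) and returning a number.
    """
    return min(lags, key=lambda lag: error_fn(*shift_by_lag(x, y, lag)))


def toy_error(xs, ys):
    # Toy stand-in for an approximation error: mean squared difference.
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)
```

For instance, with `x = [1, 2, 3, 4, 5]` and `y = [0, 1, 2, 3, 4]`, where `y` follows `x` with a delay of one step, `best_lag(x, y, [0, 1, 2], toy_error)` selects the lag 1.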
Appendix C Experiments
C.1 Data Preprocessing
Algorithm 3 includes several preprocessing steps for sample data from $X$ and $Y$ that can be parametrized. The preprocessing steps split the data into observational data and interventional data for $X$ and $Y$ accordingly. These steps are:

discretecluster: discretize data from $X$ and $Y$ using KBinsDiscretizer to a number of bins. Apply a $k$-means clustering to the discretized data with a fixed number of clusters. We split the data according to the variance in the identified clusters. If $X$ is the assumed cause, then the cluster with the lowest variance in determines the observed data and the union of the other clusters the interventional data. If $Y$ is the assumed cause, then we apply the same procedure by exchanging $X$ with $Y$.

clusterdiscrete: the steps of discretecluster are interchanged. The data are first clustered, then discretized, and split according to the variance in the clusters.
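The discretization part of these steps can be illustrated without external dependencies. The following is a minimal sketch of an equal-frequency (quantile-based) discretization using only the Python standard library; it is an illustrative stand-in and not the KBinsDiscretizer call used in the implementation, whose exact parameters are not given here. The function name is hypothetical, and the clustering and variance-based split are not covered.

```python
import bisect
import statistics


def equal_frequency_bins(values, n_bins):
    """Map each value to a bin index in 0 .. n_bins - 1 using quantile edges.

    The inner bin edges are the n_bins - 1 quantile cut points of the data,
    so each bin receives roughly the same number of samples. Equal values
    always land in the same bin.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    edges = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [bisect.bisect_right(edges, v) for v in values]
```

For example, `equal_frequency_bins([5, 1, 8, 2, 7, 3, 6, 4], 4)` places two of the eight values in each of the four bins.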
C.2 Parameter Setting in Experiments
In the experimental runs we used the following parameter configuration to run IACM. If or we used no preprocessing, since the binary discretization built into IACM itself seems sufficient. If we used discretecluster as preprocessing and used as a decision criterion for the causal direction. In the other case when we used discretecluster but as decision criterion. For data with continuous or large range sizes we used clusterdiscrete as preprocessing and the approximation error as decision criterion. The number of bins used in the discretization part is chosen such that it is a little above the range size of the data. The number of clusters is determined by a simple search procedure. We apply KMeans clustering for every number of clusters between and the number of bins to the data and determine the relative entropies between the observational data and between the interventional data resulting from the cluster split as described above. The number of clusters with the lowest sum of those relative entropies is chosen for preprocessing. Furthermore, before feeding data into preprocessing we applied a scaling to them that is robust w.r.t. outliers. Our experiments also show that these heuristics for defining the meta parameters of IACM are far from optimal. We leave that for future research.
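The outlier-robust scaling mentioned above can be sketched as follows. The text does not specify which scaler was used, so this is an assumption: the sketch centers the data at the median and divides by the interquartile range, a common choice for robust scaling. The function name is hypothetical.

```python
import statistics


def robust_scale(values):
    """Center at the median and scale by the interquartile range (IQR).

    Assumption: median/IQR scaling; if the IQR is zero, the data are only
    centered to avoid a division by zero.
    """
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]
```

A single extreme value, as in `robust_scale([1, 2, 3, 4, 100])`, leaves the scaled values of the remaining points unaffected, because neither the median nor the quartiles move with it.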