
Information-Theoretic Approximation to Causal Models

by   Peter Gmeiner, et al.

Inferring the causal direction and causal effect between two discrete random variables X and Y from a finite sample is often a crucial problem and a challenging task. However, if we have access to observational and interventional data, it is possible to solve that task. If X is causing Y, then it does not matter if we observe an effect in Y by observing changes in X or by intervening actively on X. This invariance principle creates a link between observational and interventional distributions in a higher dimensional probability space. We embed distributions that originate from samples of X and Y into that higher dimensional space such that the embedded distribution is closest to the distributions that follow the invariance principle, with respect to the relative entropy. This allows us to calculate the best information-theoretic approximation for a given empirical distribution, that follows an assumed underlying causal model. We show that this information-theoretic approximation to causal models (IACM) can be done by solving a linear optimization problem. In particular, by approximating the empirical distribution to a monotonic causal model, we can calculate probabilities of causation. It turns out that this approximation approach can be used to successfully solve causal discovery problems in the bivariate, discrete case. Experimental results on both labeled synthetic and real-world data demonstrate that our approach outperforms other state-of-the-art approaches in the discrete case with low cardinality.


1 Introduction

Detecting causal relationships from data is a significant issue in many disciplines. Understanding the causal relations between variables helps to predict how a system behaves under intervention, stabilizes future predictions, and has many other important implications. Identifying causal links (causal discovery) from observed data alone is only possible with further assumptions and/or additional data. Despite the various causal discovery methods available, the problem of finding the causal structure between two random variables remains notoriously hard. In this paper, we use additional data and assume a very natural principle to solve that task. Our work is based on a mathematical framework proposed by Pearl Pea09, which formalizes causality and causal relations. It introduces causal models that represent an (unknown) underlying data generation mechanism responsible for the distribution of the sampled data PJS17. We include sampled data from situations (environments) where interventions took place together with samples from pure observations. Recent developments in that direction revealed promising results PBM16; HPM18, but often these methods are conservative, leading to situations where no direction is preferred. This paper focuses on the bivariate discrete case and is based on a natural and weak principle. The principle of independent mechanism assumes that the data generating mechanism is independent of the data that is fed into such a mechanism. From this principle, we derive an invariance relation that states that it does not matter if we observe an effect due to an observation of its cause or due to an intervention on its cause. Distributions that are generated by an underlying causal model fulfill these invariance relations. If X and Y are discrete random variables, then we can characterize the support set of joint distributions that fulfill these relations by embedding the distributions from observational and interventional samples into a higher dimensional space and creating a link between them. That means we first embed the empirical distributions into a higher dimensional space and then find the best approximation of this embedding to the probability distributions that are compatible with the invariance principle, such that the relative entropy between them is minimized. We call this approach an information-theoretic approximation to causal models (IACM), since the relative entropy can be interpreted as an error telling us how much a finite sample deviates from a sample that comes from an assumed underlying causal model. It turns out that solving this optimization problem is equivalent to solving a linear optimization problem, which yields an efficient algorithm. We use IACM to formulate a causal discovery algorithm that infers the causal direction of two random variables. For this, we approximate to a causal model where X causes Y and to a model where Y causes X. We prefer the direction that has lower relative entropy. With appropriate preprocessing, this can also be applied to continuous data.

If we additionally assume that the underlying causal model is monotonic w.r.t. X or Y, then we can include this assumption in the support set characterization used by our approach. In the case of binary random variables, an approximation to a monotonic causal model enables us to calculate probabilities of how necessary, sufficient, or necessary and sufficient a cause is for an effect, as defined in Pea09. We will use this as a measure of the strength of a causal link and include it in our causal discovery algorithm.

For the rest of this paper, we assume that we have two random variables X and Y that attain values in finite ranges. The contribution of this paper is twofold. The first contribution is an approximation of (empirical) distributions to a set of distributions that is compatible with an invariance condition induced by an assumed causal model. The second contribution is a method for causal discovery based on this approximation procedure. This method can also be applied if we have observed data from X and Y that are heterogeneous and continuous. In experiments, we were able to verify the strength of our causal discovery approach; especially in the case of discrete ranges with low cardinality, we outperformed alternative state-of-the-art methods.

The paper is organized as follows. Section 2 introduces causal models and formulates the invariance statement. In Section 3 we present an information-theoretic approximation of distributions to one that is generated by causal models. We derive the theoretic foundation, illustrate the results for the binary case, and formulate the approximation algorithm. Section 4 shows applications of the approximation algorithm, in particular the calculation of probabilities for causes and the application to causal discovery. Section 5 describes experiments to verify our approach, and we conclude in Section 6.

2 Causal Models

We describe causal relations in the form of a directed graph $G = (V, E)$ with a finite vertex set $V$ and a set of directed edges $E \subseteq V \times V$. A directed edge from $u$ to $v$ is an ordered pair $(u, v)$ and is often represented as an arrow between vertices, e.g. $u \to v$. For a directed edge $(u, v)$ the vertex $u$ is a parent of $v$ and $v$ is a child of $u$. The set of parents of a vertex $v$ is denoted by $PA_v$. We only consider directed graphs that have no cycles and call them directed acyclic graphs (DAGs). In a DAG we interpret the vertices as random variables and a directed edge $(X_i, X_j)$ as a causal link between $X_i$ and $X_j$. We say that $X_i$ is a direct cause of $X_j$ and $X_j$ is a direct effect of $X_i$. We further specify causal links by introducing functional relations between them.

Definition 1

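A structural causal model (SCM) is a tuple $M = (S, P_N)$ where $S = (S_1, \dots, S_n)$ is a collection of structural assignments

$$X_j := f_j(PA_j, N_j), \qquad j = 1, \dots, n,$$

where $PA_j \subseteq \{X_1, \dots, X_n\} \setminus \{X_j\}$ are the parents of $X_j$ and $P_N$ is a joint distribution over the noise variables $N = (N_1, \dots, N_n)$ that are assumed to be jointly independent.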

We consider an SCM as a model for a data generating process PJS17. This enables us to model a system in an observational state and under perturbations at the same time. An SCM defines a unique distribution over the variables $X_1, \dots, X_n$. Perfect interventions are done by replacing an assignment in an SCM. Given an SCM, we can replace the assignment for $X_j$ by $X_j := x$. The distribution of that new SCM is denoted by $P^{do(X_j = x)}$ and called intervention distribution PJS17; Pea09. When modeling causality, we assume the principle of independent mechanism. Roughly speaking, this principle states that a change in a variable does not change the underlying causal mechanism, see PJS17. Formally for an SCM, this means that a change in a child variable will not change the mechanism that is responsible for obtaining an effect from its parents. From this principle the following invariance statement follows:

$$p^{do(X = x)}(y) = p(y \mid x), \qquad (1)$$

where $p(y \mid x)$ is the conditional density of Y evaluated at y given X = x for some x. Informally, this means that, if X is a cause of Y, then it doesn't matter if we observe y when x is present due to an observation of X or when x is present due to an intervention on X.
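The invariance statement can be checked numerically on a toy model. The following minimal sketch assumes a hypothetical binary SCM X → Y; the noise probabilities 0.3 and 0.2 and the helper names `sample_scm` and `prob_y1` are illustrative choices and do not come from the paper. It compares the observational conditional frequency of Y = 1 given X = x with the frequency of Y = 1 under a perfect intervention on X.

```python
import random


def sample_scm(n, rng, do_x=None):
    """Draw n samples (x, y) from a toy binary SCM X -> Y.

    Hypothetical structural assignments, for illustration only:
        X := N_X,         N_X ~ Bernoulli(0.3)
        Y := X XOR N_Y,   N_Y ~ Bernoulli(0.2)
    Passing do_x replaces the assignment for X by X := do_x.
    """
    samples = []
    for _ in range(n):
        x = do_x if do_x is not None else int(rng.random() < 0.3)
        n_y = int(rng.random() < 0.2)
        samples.append((x, x ^ n_y))
    return samples


def prob_y1(samples, given_x=None):
    """Relative frequency of Y = 1, optionally restricted to X = given_x."""
    ys = [y for x, y in samples if given_x is None or x == given_x]
    return sum(ys) / len(ys)


rng = random.Random(0)
observed = sample_scm(50_000, rng)
for x in (0, 1):
    intervened = sample_scm(50_000, rng, do_x=x)
    p_obs = prob_y1(observed, given_x=x)  # estimate of p(Y = 1 | X = x)
    p_do = prob_y1(intervened)            # estimate of p^{do(X = x)}(Y = 1)
    print(x, round(p_obs, 3), round(p_do, 3))
```

Because this toy model has no confounder, the two estimates should agree up to sampling noise for each x. With a hidden common cause of X and Y, they would generally differ.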

3 Approximation to Causal Models

3.1 The General Case

We are given two random variables X and Y with finite ranges and data from observations of X and Y as well as from interventions on X or Y. (Footnote 1: We can also relax the assumption of having interventional data and assume that the data are heterogeneous and show a rich diversity.) Alternatively, we can say that we have data of X and Y from different environments, where each environment belongs to a different intervention on X or Y. We further assume that the data from different interventions are independent of each other, and that there is no confounding variable. In practical applications, the interventional data can be obtained from experiments or more implicitly from heterogeneous data. Condition (1) is, in general, not fulfilled by empirical distributions obtained from such data. We derive a method that enables us to find a joint probability distribution of X and Y that fulfills (1) and is closest to a given empirical distribution in an information-theoretic sense.

Without loss of generality we assume that the intervention took place on with values in and . In the following we assume that the elements of are in a fixed order. We summarize , , where are the observed data and the interventional data. We define that takes values in and with we denote the joint distribution over . The space of probability distributions on is denoted by and for the marginalization of is defined by with , where and . The next Lemma gives us a characterization of distributions that fulfill (1).

Lemma 1

The set of joint probability distributions for , for all which fulfill the consistency condition (1) is called and given as

where .

The proof is given in the Supplementary Material. The support of is therefore given by

Given observational and interventional samples of and and its corresponding empirical distributions , for we try to find a distribution such that


We can always find a joint distribution such that (2) holds, since the distributions for all are independent to each other. Although this does not guarantee , we can try to find a distribution in that has minimal relative entropy to . This minimal relative entropy can be interpreted as an approximation error to an assumed causal model. The relative entropy or Kullback-Leibler divergence (KL-divergence) between two distributions is defined as follows

We use the convention that for , see also CT91 ; Kak99 . This leads to:


That is a nonlinear min-min optimization problem with linear constraints. It turns out that in our situation, the problem simplifies to a linear optimization problem.

Proposition 1

The optimization problem (3) simplifies to the following linear optimization problem

with .

The proof is given in the Supplementary Material and is an application of the Lagrangian multiplier method. The statements of Proposition 1 also hold for any other support set characterization rather than . The global approximation error is given by . Inspired by JMZLZDSS12; DJMZSZS10; JBGS13 and by the intuition that the information loss from X to Y should be smaller than the other way round (due to an assumed mechanism from X to Y), the quantity could also be seen as a kind of approximation error to the causal model X → Y.
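The step behind Proposition 1 can be illustrated on a small example. The sketch below rests on an assumption, since the formulas are not preserved in this text: the approximating distribution q is restricted to a support set and the error is measured as D(q || p). Under that reading, the closest q is p restricted to the support set and renormalized, and the resulting relative entropy equals the negative logarithm of the mass that p places on the support set. The function names and the three-state example are hypothetical.

```python
import math


def kl_divergence(q, p):
    """Relative entropy D(q || p) with the convention 0 * log(0 / p) = 0."""
    total = 0.0
    for state, q_s in q.items():
        if q_s == 0.0:
            continue  # terms with q(s) = 0 contribute nothing
        p_s = p.get(state, 0.0)
        if p_s == 0.0:
            return math.inf  # q puts mass where p has none
        total += q_s * math.log(q_s / p_s)
    return total


def project_onto_support(p, support):
    """Restrict p to `support` and renormalize.

    Returns the re-weighted distribution and the error -log(p(support)).
    """
    mass = sum(p[s] for s in support)
    if mass <= 0.0:
        raise ValueError("p has no mass on the support set")
    q = {s: (p_s / mass if s in support else 0.0) for s, p_s in p.items()}
    return q, -math.log(mass)


p = {"a": 0.5, "b": 0.3, "c": 0.2}
q, error = project_onto_support(p, support={"a", "b"})
print(q)
print(error, kl_divergence(q, p))  # the two values coincide
```

Since the error depends on p only through the mass on the support set, minimizing it over admissible p amounts to maximizing that mass, which is linear in p.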

3.2 The Binary Case

To illustrate our approach, we consider the binary case. That means , and . The set of consistent probability distributions is characterized by

and therefore . A probability distribution

is a non-negative vector with

elements that sums up to 1. We encode the conditions (2) into a constraint matrix that takes the following form

and into a corresponding right-hand side

The non-negativity can be encoded in an identity matrix

of length and a zero vector of length as the right-hand side. A probability distribution that solves (2) is then a solution to the following linear optimization problem

The proof of Proposition 1 tells us that a distribution that fulfill condition (1) and is as close as possible to in an information-theoretic sense can be obtained by the following re-weighting of
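The binary construction can be sketched as follows. This is one plausible reading rather than the paper's exact formulation, because the set definitions and the matrix are not preserved in this text. It assumes that a state is the tuple (x, y, y0, y1), where y0 and y1 stand for Y under do(X = 0) and do(X = 1), that a state is consistent when y equals the interventional value belonging to the observed x, and that the constraints fix the normalization, the observational marginal of (X, Y), and the interventional marginals. The toy distributions `p_obs` and `p_do` are made up for illustration.

```python
from itertools import product

# All states (x, y, y0, y1) of the embedded space for binary X and Y.
STATES = list(product((0, 1), repeat=4))


def is_consistent(state):
    """Assumed support condition: the observed y matches Y under do(X = x)."""
    x, y, y0, y1 = state
    return y == (y0 if x == 0 else y1)


SUPPORT = [s for s in STATES if is_consistent(s)]


def constraint_system(p_obs, p_do):
    """Rows and right-hand side encoding normalization and marginals."""
    rows, rhs = [[1.0] * len(STATES)], [1.0]
    for x, y in product((0, 1), repeat=2):  # marginal over (X, Y)
        rows.append([float(s[0] == x and s[1] == y) for s in STATES])
        rhs.append(p_obs[(x, y)])
    for x, y in product((0, 1), repeat=2):  # marginal over Y under do(X = x)
        rows.append([float(s[2 + x] == y) for s in STATES])
        rhs.append(p_do[x][y])
    return rows, rhs


# Toy empirical distributions (illustrative values only).
p_obs = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_do = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}

# The independent product is one feasible joint distribution.
p = [p_obs[(x, y)] * p_do[0][y0] * p_do[1][y1] for x, y, y0, y1 in STATES]
rows, rhs = constraint_system(p_obs, p_do)
residuals = [abs(sum(a * v for a, v in zip(row, p)) - b)
             for row, b in zip(rows, rhs)]
mass = sum(v for s, v in zip(STATES, p) if is_consistent(s))
print(len(STATES), len(SUPPORT), max(residuals), mass)
```

The product distribution only shows that the constraints are satisfiable. A linear programming solver would search among all feasible vectors for the one with the largest mass on the support set.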

3.3 Implementation

The procedure in Subsection 3.2 can be generalized for arbitrary finite ranges of and . The pseudo-code of the algorithm is shown in Algorithm 1. The size of the finite ranges is denoted by and . We further assume that we have for every interventional data available. Therefore, the constraint matrix has dimension . The first row of contains at each column, the following rows contain the support patterns of and the final rows contain the support pattern of . The function getConstraintDistribution prepares the right hand side of accordingly. Since we assumed that the intervention took place on the underlying assumed causal model is . Note that is depending on .

3:Solve LP problem: s.t. and
5: or depending on the setting
Algorithm 1 IACM(, , , )

We implemented this procedure in Python and used the cvxpy package to solve the linear program. The dimension of the constraint matrix grows exponentially with the size of the ranges of X and Y. However, it turns out that it is enough to consider small range sizes. For possibly preprocessed sample data of X and Y with much larger or continuous ranges, we apply an equal-frequency discretization based on quantiles, which is done before calculating the empirical distributions and feeding them into Algorithm 1.
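An equal-frequency discretization based on quantiles can be sketched with the Python standard library. This is an illustration under assumptions, not the implementation used in the experiments: the function name `equal_frequency_discretize` is hypothetical, and the cut points come from `statistics.quantiles` with its default method.

```python
import bisect
import statistics


def equal_frequency_discretize(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using quantile cut points."""
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_right(cuts, v) for v in values]


# A permutation of 0..99, so the input is not sorted.
values = [(i * 37) % 100 for i in range(100)]
labels = equal_frequency_discretize(values, n_bins=4)
counts = [labels.count(b) for b in range(4)]
print(counts)
```

Each bin receives roughly the same number of samples, which keeps the empirical distributions well populated even for skewed data. Heavily tied values fall on the same side of a cut point, so the bins can be less balanced in that case.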

4 Applications

The approximation approach described in Section 3 has several applications. We describe two of them in the following subsections.

4.1 Probabilities for Causes

To measure the effect of a cause-effect relation, Pearl proposed in Pea09 counterfactual statements that give information about the necessity, the sufficiency, and the necessity and sufficiency of cause-effect relations. A counterfactual statement is a do-statement in a hypothetical situation that cannot, in general, be observed or simulated. Formally, this means we condition an SCM on an observed situation and apply a do-operator. The corresponding intervention distribution reads, for example, $P(Y_{x'} = y' \mid X = x, Y = y)$, where $Y_{x'}$ denotes Y under the intervention do(X = x'). It is the probability that Y equals y' if X would have been x' where indeed we observed that X is x and Y is y.

Definition 2

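Let X, Y be random variables in an SCM such that X is a (hypothetical) cause of Y, let x, x' be values of X and y, y' values of Y, and let $Y_x$ denote Y under the intervention do(X = x).

  • The probability that X = x is necessary as a cause for an effect Y = y is defined as $PN := P(Y_{x'} = y' \mid X = x, Y = y)$, where $x' \neq x$ and $y' \neq y$.

  • The probability that X = x is sufficient as a cause for an effect Y = y is defined as $PS := P(Y_x = y \mid X = x', Y = y')$.

  • The probability that X = x is necessary and sufficient as a cause for an effect Y = y is defined as $PNS := P(Y_x = y, Y_{x'} = y')$.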

In general, counterfactual statements cannot be calculated from observational data and without knowing the true underlying SCM. However, Pearl identified situations in which we can exploit the presence of observational and interventional data to calculate the probabilities defined above. One such situation is when the underlying SCM is monotonic.

Definition 3

An SCM with $Y := f(X, N_Y)$ for two random variables X and Y is called monotonic relative to X if and only if $f$ is monotonic in $x$ independent of $N_Y$.

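If X and Y are binary and if the SCM is increasing monotonic relative to X, then Theorem 9.2.15 in Pea09 gives us

$$PN = \frac{P(Y = y) - P(Y_{x'} = y)}{P(X = x, Y = y)}, \qquad (4)$$

$$PS = \frac{P(Y_x = y) - P(Y = y)}{P(X = x', Y = y')}, \qquad (5)$$

$$PNS = P(Y_x = y) - P(Y_{x'} = y), \qquad (6)$$

where $P(Y_x = y)$ denotes the probability of Y = y under the intervention do(X = x).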


Similarly, if the SCM is decreasing monotonic relative to X, then we can derive, in the same fashion as Pearl, the following formulas


By approximating empirical observational and interventional distributions to a monotonic causal model we can calculate , , and . To do this, we need to further restrict the set and note that the monotonicity of implies that either and has zero probability or that and has zero probability. This means that either or has to hold in addition to the conditions given in . We define as the set of probability conditions with an underlying monotonic increasing data generation process and as the set of probability conditions with an underlying monotonic decreasing data generation process. An approximation in the sense of Subsection 3.1 to or instead of will only change the definition of , the rest will remain the same. In order to calculate , , and we approximate to and , choose the one with the least approximation error and use the formulas given above. We state the pseudo-code in Algorithm 2.

3:if  then
4:     Calculate using and formulas (4) - (6)
6:     Calculate using and formulas (7) - (9)
7:end if
8:return , ,
Algorithm 2 CalcCausalProbabilities()
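For the increasing monotonic case, the identification formulas of Theorem 9.2.15 in Pea09 can be written down directly. The sketch below is an illustration and not the paper's Algorithm 2: it takes the observational joint distribution and the two interventional probabilities as given, encodes x and y as 1 and their complements x' and y' as 0, and uses a hypothetical function name. The example numbers come from a made-up model with X ~ Bernoulli(0.5) and Y := X OR N, N ~ Bernoulli(0.2).

```python
def causation_probabilities(p_obs, p_y_do_x, p_y_do_xp):
    """PN, PS, PNS for binary X, Y under increasing monotonicity.

    p_obs[(x, y)]  observational joint, 1 encodes x / y, 0 encodes x' / y'
    p_y_do_x       P(Y = 1) under do(X = 1)
    p_y_do_xp      P(Y = 1) under do(X = 0)
    """
    if p_obs[(1, 1)] == 0.0 or p_obs[(0, 0)] == 0.0:
        raise ValueError("PN or PS is undefined: conditioning event has probability 0")
    p_y = p_obs[(0, 1)] + p_obs[(1, 1)]          # observational P(Y = 1)
    pn = (p_y - p_y_do_xp) / p_obs[(1, 1)]       # probability of necessity
    ps = (p_y_do_x - p_y) / p_obs[(0, 0)]        # probability of sufficiency
    pns = p_y_do_x - p_y_do_xp                   # necessity and sufficiency
    return pn, ps, pns


p_obs = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.0, (1, 1): 0.5}
pn, ps, pns = causation_probabilities(p_obs, p_y_do_x=1.0, p_y_do_xp=0.2)
print(pn, ps, pns)
```

With empirical inputs that do not come from a monotonic model, these expressions can leave the interval [0, 1], which is one reason to approximate the data to a monotonic model first.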

4.2 Causal Discovery

When we assume that X causes Y, we can test how well the given data fit that assumption and obtain an approximation error. Switching the roles of X and Y, we get the error for the reverse direction. The direction with the smallest error is the one we infer as the causal direction. If the error difference is below a small tolerance, we consider both directions as equal and return "no decision". If X and Y are binary and the error to the monotone models is smaller than to the non-monotone models, then we apply Algorithm 2 to determine the corresponding probability of causation for both directions and use this as a decision criterion for the preferred direction (the direction with the higher value determines the direction). In general, some kind of data preprocessing and discretization before applying the causal discovery method is advantageous. In our implementation, we included several different preprocessing steps that treat X and Y differently depending on the assumed causal direction; see the Supplementary Material for more details.

1: preprocessing of w.r.t.
2: preprocessing of w.r.t.
3:if  AND monotone model is preferred then
4:     use CalcCausalProbabilities to get for and
5:     If then return direction with highest
9:     If then return no decision
10:end if
11:If then return else return
Algorithm 3 IACMDiscovery(X, Y)
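The error-based part of the decision rule can be sketched in a few lines. The function name, the returned labels, and the default tolerance are hypothetical; the two approximation errors are assumed to have been computed beforehand for each direction, and the branch for monotone binary models is not covered.

```python
def infer_direction(error_x_to_y, error_y_to_x, tolerance=1e-3):
    """Prefer the direction with the smaller approximation error."""
    if abs(error_x_to_y - error_y_to_x) < tolerance:
        return "no decision"  # both directions fit about equally well
    return "X -> Y" if error_x_to_y < error_y_to_x else "Y -> X"


print(infer_direction(0.05, 0.20))
print(infer_direction(0.1000, 0.1002))
```

A larger tolerance makes the method more conservative, at the cost of more undecided pairs.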

5 Experiments

We test Algorithm 3 with synthetic and real-world benchmark data against alternative causal discovery methods. It runs with

, and with a heuristic preprocessing procedure described in the Supplementary Material. Depending on the input data structure, we used

or as an approximation error, see also Supplementary Material for more details.

5.1 Pairwise Causal Discovery Methods

Among the various causal discovery approaches for continuous, discrete, nonlinear bivariate data, we select those that do not include any training on labeled cause-effect pairs to have a fair comparison. One well-known method uses additive noise models (ANM), which assume SCMs with additive noise and apply to continuous and discrete data HJMPS09; PJS11. Furthermore, we select an information-geometric approach (IGCI) JMZLZDSS12 designed for continuous data and some recent methods for discrete data that use minimal description length (CISC) BV17, Shannon entropy (ACID) BV18, and a compact representation of the causal mechanism (HCR) CQZZH18. We further select conditional distribution similarity (CDS) F16, regression error based causal inference (RECI) BJWSS18, and (nonlinear) invariant causal prediction (nonICP, ICP) HPM18; PBM16 as baseline methods. For all methods we used the default parameter settings. (Footnote 2: For HCR, nonICP, and ICP we use the R packages from the references, for CISC and ACID the corresponding Python code, and for ANM, IGCI, CDS, and RECI the Python package Causal Discovery Toolbox KG19.)

5.2 Synthetic Data

We generate a set of synthetic data that are different in its structure (linear, nonlinear, discrete, non-discrete) and its range size. These synthetic data consists of observed data and data from perfect interventions. We use SCMs with additive or multiplicative noise to generate these data or where are sampled independently from a

-distribution for which the degrees of freedom are chosen randomly from

and we randomly decide if we use an additive or multiplicative model. The nonlinear function is randomly selected between the following functions , , and . The linear function is given by , where is randomly selected from the interval . The discrete data are generated using a -bins discretization. We simulate perfect interventions on by setting them to every value in the range if the range is discrete and to some randomly selected value if the range is continuous. The sample size is chosen randomly from and we run simulations for each configuration. Figure 1 shows the averaged accuracy of correct inferred causal direction for linear synthetic data relative to the difference in range size for small range sizes. Our method performs substantially better for positive differences than all alternative approaches. A similar picture can be seen in Figure 1 for nonlinear synthetic data. Therefore, it seems that our method is well suited for situations where the range size of the cause is greater than the range size of the effect. In Figure 2 we see that our method performs also well for synthetic linear and nonlinear continuous data.

Figure 1: Averaged accuracies of correct inferred causal direction for linear (a) and nonlinear (b) synthetic data relative to the difference in range size for small range sizes.

5.3 Real-World Data

As a benchmark set consisting of real-world data, we use manually labeled continuous cause-effect pairs (CEP) from different contexts MPJZS16; DG19. As real-world discrete data sets, we use anonymous discrete cause-effect pairs where food intolerances cause health issues (Food), the Pittsburgh bridges dataset (Bridge) from DG19 as it has been used in CQZZH18, and the Abalone dataset (Abalone) from the UCI Machine Learning Repository DG19. (Footnote 3: The Food dataset, given as discrete time series data, has been provided by the author, and the causal direction has been independently confirmed by medical tests.) For the CEP data set, IACM outperforms all other methods. For discrete real-world data, we can also see in Figure 2 that IACM successfully recovers all causal directions and can keep pace with state-of-the-art methods.

(a) Accuracies of synthetic linear and nonlinear continuous data and of continuous real-world data (CEP).
(b) Accuracies of different discrete real-world data sets.
Figure 2: Accuracies of correct inferred causal directions for continuous and discrete data.

6 Discussion

In this paper, we proposed a way to approximate empirical distributions coming from observations and experiments by distributions that follow the restrictions enforced by an assumed causal model. This can be used to calculate probabilities of causation and leads to a new causal discovery method. In our experiments, we could confirm that our approach can compete with the current state-of-the-art methods, also on real-world datasets (continuous and discrete) and without explicit knowledge of experimental data. Especially for the discrete setting in which the range size of the cause is greater than the range size of the effect, our method has advantages compared to other approaches. Since IACM ran with small range sizes, it seems that in many cases the essential cause-effect information can be encoded with much less information than we might have in the data. This is interesting by itself and could serve as a basis for future research.

Broader Impact

Like all methods for causal discovery that use observational data and/or data from implicit interventions, the work in this paper could help to avoid unethical experiments. Furthermore, it contributes to a more reliable detection of causal relations, since it fills a gap in the existing causal discovery landscape. Therefore, our research can help during the evaluation and design of studies with few discrete features and contribute to more solid conclusions of those studies. On the other hand, there is a potential risk that our method is used for data that do not follow the assumptions of this paper. This may lead to wrong causal directions and to wrong conclusions, but can be avoided by checking the assumptions on the data before applying our method. Finally, it should be noted that this article may inspire future research projects in the field.


References

  • (BJW18) P. Bloebaum, D. Janzing, T. Washio, S. Shimizu, and B. Schölkopf, Cause-effect inference by comparing regression errors, International Conference on Artificial Intelligence and Statistics (2018), 900–909.

  • (BV17) K. Budhathoki and J. Vreeken, MDL for causal inference on discrete data, 2017 IEEE International Conference on Data Mining (ICDM) (2017), 751–756.
  • (BV18) K. Budhathoki and J. Vreeken, Accurate causal inference on discrete data, 2018 IEEE International Conference on Data Mining (ICDM) (2018), 881–886.
  • (CLRS01) T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, MIT Press, Cambridge, 2001.
  • (CQZ18) R. Cai, J. Qiao, K. Zhang, Z. Zhang, and Z. Hao, Causal discovery from discrete data using hidden compact representation, Adv Neural Inf Process Syst (2018), 2666‐2674.
  • (CT91) T. Cover and J. Thomas, Elements of information theory, John Wiley and Sons, 1991.
  • (DG19) D. Dua and C. Graff, UCI machine learning repository, 2019.
  • (DJM10) P. Daniušis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf, Inferring deterministic causal relations, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (2010), 143–150.
  • (Fon19) J. Fonollosa, Conditional distribution variability measures for causality detection, Cause Effect Pairs in Machine Learning (I. Guyon, A. Statnikov, and B. Batu, eds.), Springer, 2019.
  • (HDPM18) C. Heinze-Deml, J. Peters, and N. Meinshausen, Invariant causal prediction for nonlinear models, Journal of Causal Inference 6 (2018), no. 2.
  • (HJM09) P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf, Nonlinear causal discovery with additive noise models, In Neural Information Processing Systems (NIPS) (2009), 689–696.
  • (JBGWS13) D. Janzing, D. Balduzzi, M. Grosse-Wentrup, and B. Schölkopf, Quantifying causal influences, The Annals of Statistics 41 (2013), no. 5, 2324–2358.
  • (JMZ12) D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf, Information-geometric approach to inferring causal directions, Artificial Intelligence 182-183 (2012), 1–31.
  • (Kak99) Y. Kakihara, Abstract methods in information theory, World Scientific Publishing Co. Pte. Ltd., 1999.
  • (KG19) D. Kalainathan and O. Goudet, Causal discovery toolbox: Uncover causal relationships in python.
  • (MPJ16) J. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf, Distinguishing cause from effect using observational data: methods and benchmarks, Journal of Machine Learning Research 17 (2016), 1–102.
  • (PBM16) J. Peters, P. Bühlmann, and N. Meinshausen, Causal inference by using invariant prediction: identification and confidence intervals, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (2016), no. 5, 947–1012.
  • (Pea09) J. Pearl, Causality, models, reasoning, and inference, Cambridge University Press, 2009.
  • (PJS11) J. Peters, D. Janzing, and B. Schölkopf, Causal inference on discrete data using additive noise models, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011), 2436–2450.
  • (PJS17) J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference, MIT Press, 2017.

Supplementary Material for Information-Theoretic Approximation to Causal Models


Appendix A Proofs

A.1 Proof of Lemma 1

Lemma 1

The set of joint probability distributions for , for all which fulfill the consistency condition (1) is called and given as

where .


The consistency condition (1) implies the following relation for some and

This relation implies

which characterizes the joint distributions that satisfy (1).

A.2 Proof of Proposition 1

Proposition 1

The optimization problem (3) simplifies to the following linear optimization problem

with .


We first consider the inner minimization problem of (3) for a given joint distribution . This is a constrained optimization problem where the constraints in are equivalent to the equation

since is a probability distribution. Therefore, the Lagrange functional of this minimization problem reads

with as Lagrange multiplier. Using the Lagrange multiplier method we obtain explicit expressions for the approximating distribution

for and for all . Thus we have solved the inner minimization problem explicitly and the relative entropy simplifies to

Therefore, we can now optimize on the space of possible joint distributions and (3) simplifies to

Since the logarithm is a monotone function, it suffices to maximize its argument given the constraints. This is nothing other than a linear optimization problem, which can be solved by linear programming using the simplex algorithm; see, for example, CLRS01.

Appendix B Application to Timeseries Data

Algorithm 1 can also be applied when we assume that the underlying causal model has a time lag of , which is , and the observational and interventional data have a time order. We only have to shift the incoming data for and so that Algorithm 1 applies to , and has to take care that we preserve the data order during preprocessing steps. If we do not know the exact time lag we can run the approximation several times with different time lags to find the approximation with the lowest error.
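The shifting step and the search over time lags can be sketched as follows. The helper names are hypothetical, and the mismatch rate used as `error_fn` is only a stand-in for the approximation error returned by Algorithm 1.

```python
def shift_for_lag(x, y, lag):
    """Pair x[t] with y[t + lag] while preserving the time order."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return list(x), list(y)
    return list(x[:-lag]), list(y[lag:])


def best_lag(x, y, lags, error_fn):
    """Return the lag whose shifted data give the lowest error."""
    return min(lags, key=lambda lag: error_fn(*shift_for_lag(x, y, lag)))


def mismatch_rate(xs, ys):
    """Stand-in error: fraction of positions where the series differ."""
    return sum(a != b for a, b in zip(xs, ys)) / len(xs)


x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y = [0, 0] + x[:-2]  # y repeats x with a delay of two time steps
print(shift_for_lag([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2))
print(best_lag(x, y, lags=range(4), error_fn=mismatch_rate))
```

Longer lags leave fewer aligned samples, so the candidate lags should stay small relative to the length of the series.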

Appendix C Experiments

C.1 Data Preprocessing

Algorithm 3 includes several preprocessing steps for sample data from X and Y that can be parametrized. The preprocessing steps split the data into observational data and interventional data for X and Y accordingly. These steps are:

  • discrete-cluster: discretize data from X and Y using KBinsDiscretizer to a number of bins. Apply a k-means clustering to the discretized data with a fixed number of clusters. We split the data according to the variance in the identified clusters. If X is the assumed cause, then the cluster with the lowest variance determines the observed data and the union of the other clusters the interventional data. If Y is the assumed cause, then we apply the same procedure by exchanging X with Y.

  • cluster-discrete: the steps of discrete-cluster are interchanged. The data are first clustered, then discretized, and split according to the variance in the clusters.
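The variance-based split shared by both steps can be sketched without the clustering itself. The sketch assumes that cluster labels are already available, for example from a k-means run, and that `values` holds the variable whose within-cluster variance is compared; which variable that is for a given assumed direction is left open here. The function name is hypothetical.

```python
from collections import defaultdict
import statistics


def split_by_cluster_variance(values, labels):
    """Return (observational, interventional) sample indices.

    The cluster with the lowest variance supplies the observational data,
    the union of all other clusters the interventional data.
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)

    def variance(indices):
        return statistics.pvariance([values[i] for i in indices])

    obs_label = min(clusters, key=lambda label: variance(clusters[label]))
    observational = clusters[obs_label]
    interventional = sorted(
        i for label, indices in clusters.items() if label != obs_label
        for i in indices
    )
    return observational, interventional


values = [1.0, 1.1, 0.9, 5.0, 9.0, 2.0, 20.0]
labels = [0, 0, 0, 1, 1, 2, 2]  # assumed output of a clustering step
print(split_by_cluster_variance(values, labels))
```

A cluster with a single sample has variance zero and would always be selected, so very small clusters may need to be filtered out beforehand.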

C.2 Parameter Setting in Experiments

In the experimental runs we used the following parameter configuration to run IACM. If or we used no preprocessing, since the binary discretization build into IACM itself seems sufficient enough. If we used discrete-cluster as preprocessing and used as a decision criterion for the causal direction. In the other case when we used discrete-cluster but as decision criterion. For data with continuous or large range sizes we used cluster-discrete as preprocessing and the approximation error as decision criterion. The number of bins used in the discretization part is chosen such that it is a little above the range size of the data. The number of clusters is determined by a simple search procedure. We apply KMeans clustering for every number of clusters between

and the number of bins to the data and determine the relative entropies between the observational data and between the interventional data resulting from the cluster split as described above. The number of clusters with the lowest sum of those relative entropies is chosen for preprocessing. Furthermore, before feeding data into preprocessing, we applied a scaling to them that is robust w.r.t. outliers. Our experiments also show that these heuristics for defining the meta-parameters of IACM are far from optimal. We leave that for future research.