1 Introduction
In many research fields, scientists do experiments to determine the causal relationships between variables. But in disciplines such as ecology and economics, the values of the variables can only be observed passively rather than be controlled by scientific experiments. As a result, causal relationships have to be obtained by leveraging the data. In the early researches in Pearl (2008)Spirtes et al. (2000), graphic models are constructed by calculating the conditional independence among at least three variables. However, inferring the causal direction between two variables is a tougher and more challenging task over years since the graphs represent causes and causes determine the same Markov equivalence class Gillispie and Perlman (2001).
In the past decade, a family of methods based on the functional causal models (FCM) Zhang et al. (2016) are developed to deal with the problem of causal direction inference Mooij et al. (2016). For the causal relationship represented by
, the very basic idea of the FCM based methods lies in the fact that, the effect variable’s probability density function
should be more related with the transfer function , than the cause variable’s probability density function . Many models and methods are proposed in the recent years by using this rule, either implicitly or explicitly, such as the ANM (Additive Noise Models) in Peters et al. (2011)Peters et al. (2014), the PNL (Post Nonliear) in Zhang et al. (2016)Zhang and Hyvarinen (2009) and the IGCI (Information Geometric Causal Inference) in Mooij et al. (2016) and Janzing et al. (2014). More recently, the RECI (Regression Error Causal Inference) method in Bloebaum et al. (2018) uses the regression errors to determine the causal direction. In Goudet et al. (2018), deep neural models are leveraged to fit the data. And in the Slope method in Marx and Vreeken (2019), the minimum description lengths of the regression models are taken into consideration. To this day, causal direction inference remains a research hot spot in the data science community Janzing (2019).The causal direction inference algorithms can be taken as a map from the observation data to a binary value. This is quite different from the model based classification and regression algorithms commonly appeared in the machine learning and data mining literature. These methods usually map the training data into a set of model parameters
in the real number domain . The result of will not be changed dramatically by adding or deleting one or a few data points in the training data set. This is not the case in causal direction inference. The result of the causal direction can be changed simply by adding or deleting one sample point, just as depicted in the Figure 1. This is an unacceptable feature for the task of causal inference. The causal relationships are supposed to be stable and rarely changed over time. In addition, an opposite causal direction will lead to totally different judgments and decisions when applied in large and complex systems in many application fields.In this work, parallel ensemble Zhou (2012) based methods for causal direction inference are proposed to improve the accuracy and stability of the causal direction decisions. Ensemble method is a research hot spot in the recent years to deal with the growing data volumesAlJarrah et al. (2015)
. Many conventional methods can be greatly improved by using an ensemble framework. For instance, gradient boost decision tree model
Ke et al. (2017), which is the sequential ensemble version of the conventional decision tree Quinlan (1986) is very popular nowadays. In a parallel ensemble method, the final result is obtained by aggregating the results of every base learners. If the base learners are capable of performing better than random guesses, the final result of their ensemble are guaranteed to be better than the results of individual base learners. The only problem is that in parallel ensemble, the base learners use subsets of the original data set, that will reduce the accuracy rates of the base learners. Consequently, the trade off between the number of base learners and the resampling rates should be carefully handled. In addition, the parallel ensemble mechanism may make the time complexity increase by almost times, where is the number of the base learners. Fortunately, the algorithm is very easy to be deployed in parallel computing environments.There are two contributions in this paper. First, a parallel ensemble framework for causal direction inference is proposed. Second, the accuracy of the framework is analyzed when IGCI is used as the base method. So the optimal choices of the sub set size and the number of base learners are investigated based on this accuracy analysis. The idea of parallel ensemble has not been used in the field of causal direction inference. We find two empirical research papers in social scienceSliva et al. (2017) and economics Athey et al. (2019), where they use the idea of simple ensemble to determine the causal relationships by collecting the results from several different causal inference methods. In this paper, our base learners use the same algorithms, so the accuracy can be analyzed theoretically.
In the rest of this paper, the preliminaries of the causal criterion of the base learners are given in the Section 2. The novel algorithms and the corresponding accuracy analysis are proposed in the Section 3. Experiments are demonstrated in the Section 4. And the conclusions are given at the end of the article.
2 Preliminaries
When two variables and are statistically related, there are five possible scenarios Mooij et al. (2016). i) causes ; ii) causes ; iii) dependence and feedback; iv) hidden confounder (which means and have a common cause); v) selection bias (which means and have a common effect, and they are observed conditionally on this variable). Note that the relationship in the real world may be more sophisticated, for instances, the feedback and hidden confounder may exist at the same time (a combination of scenarios iii and iv). We would like to simplified the discussion by considering only the scenarios i and ii. That is to say, we assumed that there is no feedback, no hidden confounders and no selection bias in the observation data.
2.1 Causal criterion in the base method
In causal direction inference problems, the values of the cause and the effect are observed and stored in the data set , where and , is the number of sample pairs. In the context of functional casual models, if is the cause variable and is the effect variable, their relationship can be expressed as . The observations may be contaminated by noise, so the data should be expressed as , and . Note that many causal inference methods use the additive noise assumption to tell cause from effect such as ANM, PNL and LinGAM. In some other methods such as IGCI, observation noise can also be part of the reason for their unstable performances.
Similarly, if the truth is causes , the relationship can be described as . The task of the causal direction inference is to find out the correct model that produces the data set . From the perspective of regression method, and can fit the data points equally well. So it is an uneasy task to tell the correct direction. By the way, or
can have explicit formulas, or they can also be described by nonparametric models such as Gaussian process.
The IGCI method mentioned in the introduction will be chosen as the base learners in this work. Not only because it achieves one of the best results on the benchmark data sets, but also because that no regression models are used to fit the data in this method. The results of the regression based methods such as ANM may be influenced by the problems such as under fitting or over fitting. However, the discussion of this paper will be focus on the effect of the ensemble. In addition, the framework proposed in this work can also use other causal direction inference methods as the base learners.
For any nonlinear function , let be produced by and is the inverse function of . The probability density of the effect variable can be expressed as . So the method IGCI assumes that and should be more related with each other than and . This asymmetric property can be used to infer the causal direction between and .
Suppose that causes , the causal criterion of IGCI can be formally expressed as
(1) 
Thus the causal direction of and can be obtained by comparing and , who are the mathematical expectation of and respectively.
(2) 
(3) 
The data pairs in (2) are sorted by , and the data pairs in (3) are sorted by , so the data pairs are in different orders when calculating and . indicates that causes , and indicates that causes .
Note that if there are repeated values in the data set, some of the denominators in (2) and (3) will be zero. So the following procedures are used instead. First record the number of elements that have the same value with as , and then remove the repeated elements. Let the new data sequence be . The in (2) can be calculated as
(4) 
Similarly, record the number of elements that have the same value with as and remove the repeated elements, the in (3) can be calculated as
(5) 
2.2 The unstable performance of the base method
Just as mentioned in the induction part, the results of the causal direction inference are rather unstable. An example is given here to raise the issue. The data pairs are collected by FluxnetYu et al. (2019). denotes the night temperature and is the flux at night at the same place. The respiration of plants will be stronger at higher temperature. So here causes . One pair of data is recorded each day. So there are 365 data pairs in this data set. The results of the IGCI method are depicted in the Figure 1. The experiments are done simultaneously with the data collecting process. The first experiment is done with the data pairs collected from the very beginning to the 200th day and the second experiment is performed with the data pairs collected up to the 201th day, etc. The causal decisions of the base method are depicted in the Figure 1. It shows that the results of the base method are rather unstable and almost half of the results are incorrect. The algorithm’s output can be changed simply by adding one data pair. Specifically speaking , a new data point breaks the sorted sequences of and in (2) and (3). If the new slopes in one direction are sufficiently larger than the other direction, then the sign of the result may be changed. This is the source of instability in the IGCI method. It is partly due to the short of data samples but we believe that the stochastic nature of the specific data set plays a more important role. This is an undesirable feature for the task of causal direction inference. Performances like this are also reported in other applicationsZhang et al. (2018).
3 Parallel ensemble methods for causal direction inference
3.1 Parallel Ensemble Causal Direction Inference
If the correct rate of a causal direction inference method is higher than , a more stable and probably correct result can be anticipated by combining multiple answers of the model using the majority vote strategy. A parallel ensemble framework is proposed in this section. The original data set is resampled to generate a number of sub data sets for base learners. The final decision is the majority vote of the base learners. The algorithm is described in the Table 1. and are the original data with length . is the number of the base learners. is the length of the resampled data sets, and
. The votes are recorded in the vector
, for one causal direction and for the other. The finally decision can be obtained by .Note that if is too close to , the diversity of the base learners may be insufficient. They tends to give the same answer. So the performance of the new algorithm will be very close to that of the base learner on the original data set. Otherwise if is much smaller than , the accuracy of the base learners will drop with . An optimal value of k should be determined by taking the above two issues into consideration. On the other hand, it is easy to see that the accuracy will increase with . We would like to investigate the influences of the parameter settings in the next subsection.
PECI(Parallel Ensemble Causal direction Inference) 

: , , , 
: causal direction 
1 Let be an zero vector; 
2 for t from 1 to T: 
3 Sample data pairs randomly from , as ; 
4 Sort data by and calculate by Eq.(4) 
5 Sort data by and calculate by Eq.(5) 
6 if let 
7 else if let 
8 end for 
9 if output causal direction 
10 else if output causal direction 
11 else output unknown 
3.2 Accuracy Analysis of PECI
In the base method, the logarithmic slope
can be taken as a random variable with arbitrary distribution. Then according to the central limitation theorem,
approximates to the Gaussian distribution. This is written as Lemma 1 and the error rate of the base method can be consequently obtained in the Lemma 2. The upper bound of the error rate of the proposed method can be derived in Theorem 1.
Lemma 1
Let the logarithmic slopes in the equation (2) be a set of random variables with independent identical distribution
whose mathematical expectation and variance are
and respectively. If the number of data points is sufficiently large, the variable in the equation (2) can be considered as a random variable with normal distribution
.Lemma 1 can be derived from the central limitation theorem directly. Identically, in the other direction, the variable in (3) can also be regarded as a random variable with normal distribution .
Lemma 2
Lemma 2 can be simply derived by the fact that the based method will give correct results when , so the error rate can be obtained by calculating (for the case causes ). The subscript denotes the length of the data set. The error rate will be decreased by collecting more data pairs when . Note that if , the output of the function will be negative, and then the error rate of the base method will be greater than , which means the correct rate can be worse than random guesses and collecting more data pairs does not help improving the performances.
From Lemma 2 and The Hoeffding’s inequality, the upper bound of the error rate of PECI can be derived.
Theorem 1
(The upper bound of the error rate of PECI.) Let be the number of ensemble tasks and be the resampling size of each ensemble task in the algorithm PECI. The upper bound of the error rate can be obtained by
(7) 
where in the Lemma 2.
The proof of theorem 1 is provided in the appendix. From the above theorem we can see that the upper bound decreases with and . However, the choice is not that simple. The size of the sub data set should not be too small, sufficient number of data points should be sampled to keep the accuaracy of the base method. On the other hand, it should also not be too large, otherwise the sub data sets may be lack of diversity, which causes the results of the base method being the same. So we should find a appropriate between and . It seems that we should let the value of to be as large as possible. However, the number of is limited by , since ensemble with identical tasks is meaningless. From the above discussion we can see that the ensemble framework does not guarantee better results, unless the parameters are carefully selected. So the next question is to give the condition under which a better result can be guaranteed.
Corollary 1
The inequalities (8) and (9) can be easily obtained from Lemma 2 and Theorem 1. Note that the left hand side of the inequality (8) is nonmonotonic which means there is an optimal value for . However, it is not a solid conclusion that the optimal value of is . We would like to leave that to the further researches.
3.3 Weighted PECI
From the Hoeffding’s inequality in the Lemma 3 in the appendix section we can see that the upper bound also holds when the results of the base estimators are weighted. The values of
can be taken as the confidence of the causal direction decisions. So we can use them to formulate the weight of the votes to improve the PECI method. However, the absolute value of should be normalized toto avoid the extreme outliers. The
function in (11) and the transformation of the sigmoid function in (
10) are most frequently used in the machine learning literature. They can be represented as(10) 
and
(11) 
Note that can be either positive or negative. The final decisions can still be obtained by . The weighted algorithm is summarized in the Table 2. The major difference with PECI lies in the Line 6.
WPECI(Weight Parallel Ensemble Causal direction Inference) 

: data pair , , , 
: causal direction 
1 Let be an zero vector; 
2 for t from 1 to T: 
3 Sample data pairs randomly from , as ; 
4 Sort and calculate by Eq.(4) 
5 Sort and calculate by Eq.(5) 
6 Calculate by Eq.(10) or Eq. (11) 
7 end for 
8 if output causal direction 
9 else if output causal direction 
10 else output unknown 
4 Experiments
4.1 Experiments on artificial data
There are two groups of experiments in this subsection. In the first group, the data sets are produced by functions with explicit analytical equations. And in the second group, the data sets are produced by Gaussian process model, where the input and output of the model cannot be expressed by any explicit analytical equations. The second group of data is actually proposed by Mooij et al. (2016) as a benchmark for causal direction inference.
4.1.1 Data from explicit analytical equation
In this part, the data pairs are produced by an explicit formula with additive noise and . The values of are generated by Gaussian distribution with zero mean and unit variance. and
are zero mean Gaussian white noise whose variances are set to be
. The data points are normalized by to be within . There are 2000 data pairs in each simulation. And the simulations are repeated 10000 times. The values of accuracy are calculated based on the numbers of correct results. Note that the additive noise can also be taken as the unobserved latent variables. And the same 10000 data sets are used for different k and T, so the result of the based method is a horizontal line.In the experiments depicted in the Figure 2, the number of tasks is set to be , and the length of resampling ranges from to . From the result we can see that the when is close to the original data length , the accuracy rates of the based method and that of the ensemble method are close. This is because the data sets obtained by the resampling process are very close to the original data set when is very close to . The accuracy rates of the ensemble methods increase while decreases from to about . After that the accuracy rates decrease with . The performance of the PECI with majority vote is even worse than the base method for . So this is actually caused by the decrease of the base learners’ accuracy rates. The trade off between the accuracy rates of every single base learners and the diversity of them as a whole is clearly revealed by this picture.
In the experiments depicted in Figure 3, is set to be , which equals . The number of base learners ranges from to . From the results we can see that the accuracy rates of the ensemble methods increase with the number of base learners. And the curve of PECI fits well with (7) in the Theorem 1. It is worth mentioning that, the value of cannot be arbitrarily large. It should be bounded by , and if it is too close to this upper bound, there will be lack of diversity for the based learners, so computing resources will be wasted.
In Figure 4, the running times of the PECI method on the parallel computing environment with different numbers of threads are depicted. The simulation is done on a workstation with Intel Xeon CPU E52650v3 2.3GHz. The parallel computing toolbox of MatlabSharma and Martin (2009) is used. The loop between Line 2 and Line 8 in the algorithm PECI in the Table 1 are set to be run in parallel. The running time decreases significantly with the increase of the CPU cores being used. The parallel ensemble method naturally fits the parallel computing environments.
4.1.2 Data from Gaussian processes
The data sets in this part are the artificial benchmark data sets in Mooij et al. (2016). There are four groups of data sets who are produced by Gaussian process model with the additive noise. A brief description of the data generation is given here for the integrity of the article. This data set can be downloaded from the link provided in Mooij et al. (2016) and further details can also be found in its appendix. For SIM, SIMln and SIMG, the values of are sampled from the Gaussian process and the values of are sampled from the Gaussian process . is the Gram matrix of an RBF kernel function with parameter , the superscript represents the i th element of the vector. and
is the Identity matrix.
, and and are also randomly sampled from Gaussian processes independently. The data sets in SIMln has relatively low noise. The distributions of the data sets in SIMG are most similar to Gaussian distributions. The data sets in SIMc are generated with comfounders. For SIMc the values of and are sampled from and .There are 100 identically distributed data sets in each group. And the data length of each single data set is . is set to be and is . The simulation results are recorded in the Table 4.
Data set  IGCI  PECI with  WPECI with  WPECI with 
Accuracy  base method  majority vote  sigmoid weight  tanh weight 
Simln  62%  63%  65%  62% 
SimG  85%  91%  90%  89% 
Simc  46%  46%  45%  45% 
Sim  42%  40%  39%  39% 

From the results in the Table 4, we can see that if the base method performs better than random guess, the accuracy of the PECI will be better than the base method. Otherwise if the base method performs worse than random guess, the result of PECI will be even worse. This is accord with the common sense of the ensemble methods.
4.2 Simulations on the real world data
The real world benchmark data repository proposed in Mooij et al. (2016) currently has 108 data sets from different disciplines and the number is ever increasing. In the simulations, we let the parameter and let be changed with as if , if , if , if , and if . The above choice is based on two considerations. First, the size of the sub data sets should have enough data points in order to keep the accuracy of the base method. Second, more diversity can be obtained by smaller compared with . Beside IGCI, the other two recently proposed methods are also joined as the base method here to enrich the results. These two methods have to use a regression model to fit the data. For the method SlopeMarx and Vreeken (2019), the algorithm chooses the model with minimum description length among nine choices automatically as the regression model. And for the method RECI Bloebaum et al. (2018), nineth order polynomial model is used as the regression model.
Name of the  Accuracy of  Accuracy of ensemble method with  
base method  base method  majority vote  sigmoid weight  tanh weight 
Slope Marx and Vreeken (2019)  59.2%  60.1%  61.1%  61.1% 
IGCI  65.7%  68.5%  67.6%  66.7% 
RECI Bloebaum et al. (2018)  75.0%  80.6%  80.6%  80.6% 
The base method IGCI gives 71 correct causal directions in all the 108 data sets, and PECI with majority vote gives 74 correct answers, while WPECI with sigmoid weights and tanh weights give 73 and 72 correct answers respectively. The base method Slope gives 64 correct causal directions, and the corresponding ensemble method with majority vote gives 65 correct answers. The correct numbers with sigmoid weights and tanh weights are both 66 respectively. The base method RECI gives 81 correct results and its three ensemble counterparts all give 87 correct results.
The results show again that if the base method has a higher accuracy, the performance of its ensemble version will give better results. For the ensemble methods with Slope and RECI, weighted algorithms outperform the algorithm with majority vote. But for the base method IGCI, it is not the case. We carefully examined the data set Number 20, on which the method with majority vote succeeded but the weighted algorithm failed. In the data set is the latitude and is the average temperature. The number of correct decisions are larger than the incorrect ones, but the absolute values of in the incorrect decisions are much larger than the correct ones. It dues to the fact that the stations with very close latitude can have very different temperature records. And the results are always very close to the edge . Another reason we believe here is that the data set is relatively small. Only 349 pairs are recorded. However when we set the parameter from about to , WPECI is capable of giving the correct answer. If the parameter can be carefully tuned for each data set, the number of the corrections of the ensemble method can be increased. However it is impossible or may be unfair to do so.
Finally, let us take a look at the unstable example given in the preliminary section. This data set is recorded as the 83th data set in the above experiment. The setups are the same with that in the Section 2.
The causal decisions of the base learners are depicted in the top picture of Figure 5, which is identical with the Figure 1. It shows that the results of the base method are rather unstable and almost half of the results are incorrect. The second to the fourth pictures show the performances of PECI under , , and . The improvements are obvious. Every steps in the simulation gives the same and correct result.
5 Conclusions
In this paper parallel ensemble based causal direction inference algorithms are proposed. The accuracy rates of the methods are investigated theoretically. Higher correct rates and more stable results are obtained on both the artificial data sets and the real world data sets. In addition, the parallel ensemble framework can be conveniently implemented on parallel computing environments. So the performance of the causal direction inference can be improved with controllable time expenses.
Acknowledgement
This work is supported by NSFC61803337, NSFC61803338, ZJSTFLGF18F020011.
Appendix
Lemma 3
(The Hoeffding’s inequality) Let , where each of , ,…, is a sum of independent random variables. and need not be mutually independent for . is positive and . If for are identically distributed, the following inequalities holds for any .
(12) 
(13) 
where denotes the mathematical expectation.
Note that the Hoeffding’s inequalities with independent assumption are more frequently used in the research field of computer science. However, the inequalities still hold when the random variables are not independent. And this is more in line with the reality in this paper. The upper bound of the PECI can then be obtained by Lemma 13.
Proof of Theorem 1. Let , where is defined in the algorithm PECI. and . If the correct causal direction is , the expectation of can be obtained by Lemma 2 as , then the error rate of the ensemble method is
Then the upper bound (7) can be obtained by using (13). Otherwise if the correct causal direction is , the error rate of the ensemble method is
Then the upper bound (7) can be obtained by using (12). Note that it is assumed the base learns give results better than random guesses, which means when the truth is . ∎
References
 Efficient machine learning for big data: a review. Big Data Research 2 (3), pp. 87–93. Cited by: §1.
 Ensemble methods for causal effects in panel data settings. In AEA Papers and Proceedings, Vol. 109, pp. 65–70. Cited by: §1.

Causeeffect inference by comparing regression errors.
In
International Conference on Artificial Intelligence and Statistics
, pp. 900–909. Cited by: §1, §4.2, Table 4.  Enumerating markov equivalence classes of acyclic digraph models. In Conference on Uncertainty in Artificial Intelligence, Cited by: §1.

Learning functional causal models with generative neural networks
. InExplainable and Interpretable Models in Computer Vision and Machine Learning
, pp. 39–80. Cited by: §1.  Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Cited by: Appendix.
 Justifying informationgeometric causal inference. arXiv preprint arXiv:1402.2499. Cited by: §1.
 The causeeffect problem: motivation, ideas, and popular misconceptions. In Cause Effect Pairs in Machine Learning, pp. 3–26. Cited by: §1.
 Lightgbm: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154. Cited by: §1.
 Telling cause from effect by local and global regression. Knowledge and Information Systems 60 (3), pp. 1277–1305. Cited by: §1, §4.2, Table 4.
 Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research 17 (1), pp. 1103–1204. Cited by: §1, §2, §4.1.2, §4.1, §4.2.
 Causality: models, reasoning, and inference. Cambridge University Press. Cited by: §1.
 Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (12), pp. 2436–2450. Cited by: §1.
 Causal discovery with continuous additive noise models. The Journal of Machine Learning Research 15 (1), pp. 2009–2053. Cited by: §1.
 Induction of decision trees. Machine Learning 1 (1), pp. 81–106. Cited by: §1.
 MATLAB®: a language for parallel computing. International Journal of Parallel Programming 37 (1), pp. 3–36. Cited by: §4.1.1.
 Modeling causal relationships in sociocultural systems using ensemble methods. In Advances in CrossCultural Decision Making, pp. 43–56. Cited by: §1.
 Causation, prediction, and search. Vol. 81, MIT press. Cited by: §1.
 Anticipating global terrestrial ecosystem state change using fluxnet. Global change biology. Cited by: §2.2.
 On the identifiability of the postnonlinear causal model. In Conference on Uncertainty in Artificial Intelligence, pp. 647–655. Cited by: §1.
 On estimation of functional causal models: general results and application to the postnonlinear causal model. ACM Transactions on Intelligent Systems and Technology (TIST) 7 (2), pp. 13. Cited by: §1.
 Causal direction inference for network alarm analysis. Control Engineering Practice 70, pp. 148–153. Cited by: §2.2.
 Ensemble methods: foundations and algorithms. Chapman and Hall/CRC. Cited by: §1.
Comments
There are no comments yet.