1 Introduction
Recent breakthroughs in deep learning have significantly advanced Artificial Intelligence (AI). Deep learning has achieved great success in many applications such as image classification, image recognition, speech recognition and natural language processing. Deep learning methods, specifically deep neural networks, have become the dominant approach in machine learning and AI and thus attract a tremendous amount of attention from both academia and industry. However, a number of open challenges remain. Among the most notorious are the heavy demand for labeled data, poor generalizability, vulnerability to adversarial attacks and the lack of interpretability. Recently, it has been advocated in the AI community that causality might be one of the tools to solve these open problems. It has been argued that causality, rather than “superficial association”, is invariant across domains. Machine learning algorithms that learn and exploit the causal relationships among variables provide better generalization performance, robustness against adversarial attacks and better interpretability. Besides machine learning and AI bengio2019meta ; scholkopf2013semi ; peters2016causal ; lopez2017discovering , causal discovery also plays an important role in economics, sociology, bioinformatics and medical science.
However, unveiling the causal relationships among variables from purely observational (or post-intervention) data is challenging. Many methods have been proposed in the past three decades, including Bayesian networks pearl2003causality and Structural Equation Models (SEM) shimizu2006linear ; hoyer2009nonlinear ; heinze2018causal . However, these methods have their limitations. For example, Bayesian networks learned via constraint-based or score-based approaches cannot fully identify the ground-truth graph but only its “Markov equivalence class” pearl2003causality . In addition, they cannot solve the more fundamental problem of causal discovery for a cause-effect pair.
To solve these problems, researchers have been developing new theory and algorithms that try to extract more regularities from the data distribution spirtes2016causal . Among them, the Independent Mechanism (IM) principle janzing2010causal
is considered to be a promising direction. The basic idea behind the IM principle is that nature is parsimonious in the sense that the mechanism generating the cause and the mechanism mapping the cause to the effect are independent, i.e. the probability distribution of the cause $P_X$ and the conditional distribution mapping the cause to the effect $P_{Y|X}$ contain no information about each other. It has been shown that the factorization of the joint distribution according to the causal direction usually yields simpler terms than that in the anticausal direction sun2008causal ; janzing2010causal , i.e.
$$K(P_X) + K(P_{Y|X}) \le K(P_Y) + K(P_{X|Y}), \tag{1}$$
where $K(\cdot)$ denotes the Kolmogorov complexity, which is not computable. Researchers have proposed computable surrogates, including the RKHS norm sun2008causal ; chen2014causal and Minimum Description Length budhathoki2017mdl , to approximate the Kolmogorov complexity and derive practical algorithms for pairwise causal discovery. The method proposed in this paper falls into this category. According to the IM principle, the conditional distribution $P_{Y|X}$ does not depend on $P_X$, which naturally leads to the conjecture that intrinsic information, e.g. the higher order central moments that characterize the shape of $P_{Y|X=x}$, does not essentially depend on the value of $x$. In this paper, we prove that the existing state-of-the-art norm-based approach along this direction is not sufficient, as it sacrifices important information about the original conditional distributions. Instead, we propose a Kernel Intrinsic Invariance Measure (KIIM) to capture the intrinsic invariance of the conditional distribution, i.e. the higher order statistics corresponding to the shapes of the density functions. We show that our algorithm can be reduced to an eigendecomposition task on a kernel matrix measuring intrinsic deviance/invariance.
The rest of the paper is organized as follows: in Sec. 2, we introduce the basic idea of a recent state-of-the-art method named Kernel Deviance Measure and its limitation; in Sec. 3, we give a brief introduction to Reproducing Kernel Hilbert Space (RKHS) embeddings, which serve as the tool of our method; in Sec. 4, we give a rigorous justification of the limitation of existing methods and then show how our proposed method addresses those issues; in Sec. 5, we verify the effectiveness of our proposed method, followed by a conclusion in Sec. 6.
2 Related Work
Recently, the authors of mitrovic2018causal proposed an idea that exploits the variation of the conditional distribution of the hypothetical effect given the hypothetical cause. They argued that there is less variability in the causal direction than in the anticausal direction. A motivating example used in mitrovic2018causal is as follows. Suppose we have two random variables that follow the generating mechanism , where . As illustrated in Fig. 1, the conditional distributions of the effect instantiated at different values of the cause are almost identical except for the location; in the anticausal direction, however, the conditional distributions instantiated at different values have significant structural variation, including the number of modes, skewness, kurtosis, etc. This structural variation in the conditional distributions leads to the so-called “cause-effect asymmetry” for causal discovery. The basic idea is to investigate how invariant the conditional distribution (instantiated at different values) is, and to prefer the direction with less variation or, in other words, more invariance. To this end, they proposed the following Kernel Deviance Measure:
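$$S_{X \to Y} = \frac{1}{N}\sum_{i=1}^{N}\left( \left\| \mu_{Y|X=x_i} \right\|_{\mathcal{H}} - \frac{1}{N}\sum_{j=1}^{N} \left\| \mu_{Y|X=x_j} \right\|_{\mathcal{H}} \right)^{2}, \tag{2}$$
where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) entailed by a positive definite kernel $k$ and $\mu_{Y|X=x_i}$ is the kernel mean embedding of the conditional distribution instantiated at $x_i$.
The causal discovery rule follows directly from comparing the scores: $X \to Y$ if $S_{X \to Y} < S_{Y \to X}$; $Y \to X$ if $S_{X \to Y} > S_{Y \to X}$; otherwise no conclusion is drawn.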
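The scoring rule above can be sketched numerically as follows. This is a minimal illustration and not the implementation of mitrovic2018causal : it assumes a Gaussian kernel with a fixed width on both variables and the standard regularized estimator of the conditional mean embedding, and all names (`rbf_gram`, `embedding_norms`, `deviance_score`, `infer_direction`, `lam`, `width`) are hypothetical.

```python
import numpy as np

def rbf_gram(a, b, width):
    """Gaussian (RBF) Gram matrix between two 1-D samples a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def embedding_norms(cause, effect, lam=1e-3, width=1.0):
    """RKHS norms of the estimated conditional mean embeddings,
    one per observed value of the hypothetical cause."""
    n = len(cause)
    K = rbf_gram(cause, cause, width)    # Gram matrix of the hypothetical cause
    L = rbf_gram(effect, effect, width)  # Gram matrix of the hypothetical effect
    # Column i of A holds the weights of the embedding conditioned on cause[i]:
    # a_i = (K + lam * I)^{-1} K[:, i]
    A = np.linalg.solve(K + lam * np.eye(n), K)
    # Squared norm of the i-th embedding is a_i^T L a_i.
    sq_norms = np.einsum("ji,jk,ki->i", A, L, A)
    return np.sqrt(np.maximum(sq_norms, 0.0))  # clip round-off negatives

def deviance_score(cause, effect, **kwargs):
    """Variance of the embedding norms across the observed cause values."""
    return float(np.var(embedding_norms(cause, effect, **kwargs)))

def infer_direction(x, y, **kwargs):
    """Prefer the direction with the smaller deviance score."""
    s_xy = deviance_score(x, y, **kwargs)
    s_yx = deviance_score(y, x, **kwargs)
    if s_xy < s_yx:
        return "X->Y"
    if s_xy > s_yx:
        return "Y->X"
    return "undecided"
```

The kernel choice, the kernel width and the regularization strength all influence the scores, so this sketch should be read as a reading aid for the rule rather than as a reproduction of any reported result.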
Positive results were reported in mitrovic2018causal , which suggests that causal discovery via invariance is a promising direction. However, we notice that the above method has a significant limitation that should be addressed. Before introducing our method, we give some preliminaries on RKHS embeddings.
3 Preliminary on Reproducing Kernel Hilbert Space Embeddings
Kernel methods scholkopf2001learning are a class of machine learning algorithms that implicitly map the data from the original space to a high-dimensional or even infinite-dimensional feature space $\mathcal{H}$. One can avoid computing the coordinates of the data in that space explicitly if the algorithm can be reduced to inner products of feature vectors, which can be easily calculated as the kernel function of any two data points. This is called the kernel trick scholkopf2001learning . The kernel function essentially acts as a similarity function between a pair of data points, and thus kernel methods are categorized as a typical instance-based learning method. Formally, $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, where $\phi: \mathcal{X} \to \mathcal{H}$ and $k$ is a positive definite kernel. The kernel mean embedding smola2007hilbert of a probability density $p(x)$ is defined as:
$$\mu_X = \mathbb{E}_{X}[\phi(X)] = \int_{\mathcal{X}} \phi(x)\, p(x)\, dx. \tag{3}$$
One can simply interpret the kernel mean embedding as a vector of (higher order) moments. This interpretation is exact if one uses a polynomial kernel , where . It has been shown that if the kernel is characteristic song2013kernel , e.g. a Gaussian kernel, then the mapping from a probability distribution to its kernel mean embedding is injective, i.e. no information is lost by the mapping. The embedding of a conditional distribution $P_{Y|X}$ sweeps out a family of points in the RKHS song2013kernel , each of which is the kernel mean embedding of the conditional distribution indexed by a fixed value $x$ of the conditioning variable $X$. It is shown in song2013kernel that, under a mild assumption that , the conditional mean embedding can be obtained by Eq. 4:
$$\mu_{Y|X=x} = C_{YX}\, C_{XX}^{-1}\, \phi(x), \tag{4}$$
where $C_{YX} = \mathbb{E}_{XY}[\phi(Y) \otimes \phi(X)]$ and $C_{XX} = \mathbb{E}_{X}[\phi(X) \otimes \phi(X)]$. The empirical estimates of the kernel mean embedding and the conditional mean embedding, given a set of observations $\{x_i\}_{i=1}^{N}$ and $\{y_i\}_{i=1}^{N}$, are:
$$\hat{\mu}_X = \frac{1}{N}\Phi \mathbf{1}, \qquad \hat{\mu}_{Y|X=x} = \Psi\, (K + \lambda I)^{-1}\, \mathbf{k}_x, \tag{5}$$
where $\Phi = [\phi(x_1), \dots, \phi(x_N)]$, $\Psi = [\phi(y_1), \dots, \phi(y_N)]$, $K$ is the kernel Gram matrix of $X$, i.e. $K_{ij} = k(x_i, x_j)$, and $\mathbf{k}_x = [k(x_1, x), \dots, k(x_N, x)]^{\top}$, with $\lambda$ a regularization hyperparameter, $I$ an identity matrix and $\mathbf{1}$ a vector of ones of appropriate dimension. The (conditional) kernel mean embeddings provide a compact and nonparametric representation of (conditional) distributions. Manipulations of probability distributions, such as the complicated operations in Bayesian inference, reduce to matrix manipulations in the RKHS. For example, Maximum Mean Discrepancy (MMD) gretton2008kernel was proposed for the two-sample test, and the Kernel Bayes Rule (KBR) fukumizu2013kernel was proposed to conduct Bayesian inference in the RKHS. Given that the RKHS embedding has solid theoretical support and is easy to use, it is adopted in this paper to measure the intrinsic invariance of the conditional distributions.
4 Causal Discovery by Intrinsic Invariance
Although positive results were reported on synthetic datasets and some real data in mitrovic2018causal , there are potential problems regarding the discriminative power of the RKHS norm-based method, which essentially calculates the variance of the norms of the conditional mean embeddings. Returning to the motivating example, we notice that in the anticausal direction, the conditional density in red and the one in green are symmetric with respect to the y-axis, and the structural variation between them is significant. However, the norms of the RKHS mean embeddings of these two conditional distributions would be equal. A direct norm-based approach mitrovic2018causal might therefore lose the power to distinguish two distributions with significant structural variability. We give a formal justification of this conjecture in the next section.
4.1 Discriminative Power Issues of the RKHS-norm-variance approach
The major limitation of the direct norm-based approach is that the mapping from a probability distribution to the norm of its RKHS mean embedding is not injective, i.e., there might be two distinct probability distributions sharing the same RKHS mean embedding norm. Consequently, a deviance measure that simply calculates the variance of such RKHS norms might not be discriminative enough for causal discovery. In the following lemma, we show that the norms of the kernel mean embeddings $\mu_p$ and $\mu_q$, which correspond to the probability density functions $p(x)$ and $q(x) = p(-x)$, are equal if a stationary (translation-invariant) kernel is used.
Lemma 1.
Denote the domain of $x$ as $\mathcal{X}$. If $\mathcal{X}$ is symmetric with respect to the origin, then given two probability densities $p(x)$ and $q(x)$, where $q(x) = p(-x)$, we attain
$$\left\| \mu_p \right\|_{\mathcal{H}} = \left\| \mu_q \right\|_{\mathcal{H}}, \tag{6}$$
where $\mu_p$ and $\mu_q$ are the kernel mean embeddings of $p$ and $q$ with respect to a stationary kernel $k$.
Proof.
According to Bochner’s theorem rudin1962fourier , for a stationary kernel , we have:
where is the dimension of the feature space. We attain , where and and thus .
Similarly, we have and thus . Consequently, we show that . ∎
According to Lemma 1, we see that even though two probability densities can be very different, e.g. $p(x)$ and $p(-x)$ for a skewed distribution, they share the same norm.
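The statement of Lemma 1 can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a Gaussian kernel of unit width, an exponential sample as a stand-in for a skewed density, and hypothetical helper names (`rbf_gram`, `embedding_sq_norm`, `mmd_sq`). It compares the squared norms of the empirical mean embeddings of a sample and of its mirror image, and also reports the squared RKHS distance between the two embeddings.

```python
import numpy as np

def rbf_gram(a, b, width=1.0):
    """Gaussian (RBF) Gram matrix between two 1-D samples a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def embedding_sq_norm(sample, width=1.0):
    """Squared RKHS norm of the empirical mean embedding,
    i.e. the average of all entries of the Gram matrix."""
    return rbf_gram(sample, sample, width).mean()

def mmd_sq(a, b, width=1.0):
    """Squared RKHS distance between two empirical mean embeddings
    (the biased estimate of the squared MMD)."""
    return (rbf_gram(a, a, width).mean()
            + rbf_gram(b, b, width).mean()
            - 2.0 * rbf_gram(a, b, width).mean())

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)  # right-skewed sample, stands in for p(x)
mirrored = -skewed                             # sample from the reflected density p(-x)

norm_gap = abs(embedding_sq_norm(skewed) - embedding_sq_norm(mirrored))
distance = mmd_sq(skewed, mirrored)
print(f"difference of squared norms: {norm_gap:.2e}")
print(f"squared distance between embeddings: {distance:.4f}")
```

Under these assumptions the two squared norms coincide up to floating point error, because a stationary kernel depends only on differences of points, whereas the squared distance between the embeddings is clearly positive. The latter is the quantity that a norm-of-difference measure would pick up.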
A similar conclusion can be drawn for more general cases and is justified in Lemma 2.
Lemma 2.
Given an arbitrary probability density , where is a Reproducing Kernel Hilbert Space (RKHS) entailed by a positive definite kernel , then with high probability there exists at least one probability density and such that
where and are the kernel mean embeddings of and respectively.
Proof.
Given a positive definite kernel , according to Mercer’s Theorem scholkopf2001learning , we have , where ,where and . We further assume are integrable, , then we write and thus . For an arbitrary probability density , we can represent it as:
where is a vector of coefficient. By definition, the RKHS mean embedding of is obtained as , where is a diagonal matrix with . The norm of can be easily calculated as .
Now we construct another probability density . Without loss of generality, we assume that . Similarly, we have . In order to make , we construct and in the way that:
(7) 
where . We attain . Let and , we obtain:
(8) 
In order to ensure a solution exists for Eq. 8, we need to prove that . We show that
Consequently, we prove that Eq. 8 holds and there exist two solutions, i.e. and , where and . The two solutions collapse to one if and only if , which rarely happens as it requires mutual adjustment of the probability density function and the kernel function. ∎
The intuition behind the proof is illustrated in Fig. 2. The solutions of the first equation form a line and the solutions of the second equation form an ellipse, and thus the solution of Eq. 7 is the intersection of the line and the ellipse. Note that an intersection must exist, as is already a solution to Eq. 7. With high probability, there are two distinct intersection points, as shown in Fig. 2, except for the rare cases in which the points collapse to a single point because the line is tangent to the ellipse. This is rare because it requires mutual adjustment between , and , which in turn requires mutual adjustment between and . According to Lemma 2, the RKHS norm applied directly to the conditional distributions instantiated at different values is not discriminative enough: there are conditional distributions that differ significantly yet have equal norms, which is problematic for the KCDC algorithm proposed in mitrovic2018causal .
4.2 Causal Discovery via Kernel Intrinsic Invariance Measure
Realizing the limitation of the norm-based approach, we propose a method that measures the norm of the difference of the kernel mean embeddings corresponding to conditional distributions instantiated at different values, instead of measuring the difference of their norms. However, a naive application of this idea might not work because, even in the causal direction, conditional distributions instantiated at different cause values are not identical. They can differ from each other in location and scale, although we are more interested in the higher order moments that are more relevant to the shape of the density function. For example, in a toy example proposed in mitrovic2018causal , we have , where . Even in the causal direction, we get . Conditional distributions instantiated at different values of the cause are all Gaussian distributions, but they have different means and thus they are not identical; neither are their kernel mean embeddings or the corresponding norms (this can be easily verified with a polynomial kernel). However, the location and scale of a distribution are of little interest for causal discovery, as we care more about the higher order statistics that reflect shape information.
This observation motivates our method, which captures more intrinsic information about the probability density function. Mathematically, we define the following score that captures the “intrinsic” variation of the conditional distribution instantiated at different values of the conditioning variable $X$ or $Y$ for the two hypothetical directions. Without loss of generality, we give the definition of the score in the direction $X \to Y$:
(9) 
The interpretation of Eq. 9 is that we calculate the norm of the difference of the conditional distributions at different values. The score is zero if and only if all conditional distributions are the same, by the injectivity of the kernel mean embedding. The matrix is introduced to find the subspace that removes the effect of trivial deviance such as location and scale. Empirically, we can calculate the kernel embedding of the conditional distribution instantiated at different as , where , is the kernel Gram matrix of and . Note that the solution of the above optimization problem lies in the span of , and thus we can represent by a linear combination of , i.e. , where is the coefficient matrix. Consequently, we attain
(10)
To avoid the trivial solution , we impose the constraint that . Consequently, we have
(11) 
The intuitive interpretation of the proposed method is that we use the projection matrix to find the intrinsic deviance of the conditional mean embedding. The intrinsic deviance captures the higher order statistics, which are more relevant to the shape of the probability distribution, while discarding less important information, e.g. the location and scale of a distribution. As an illustrative example, suppose we use a polynomial kernel ; it can be easily shown that we essentially map to a space of polynomials of the feature, i.e. . Therefore, the kernel mean embedding of a distribution is in fact a vector of moments up to degree , i.e.
where denotes the th order moment. If the conditional distributions differ from each other only in mean and standard deviation (the first and second order moments), then the projection matrix is expected to find the subspace that contains only higher order moments.
However, how to decide the rank of is an open question. In this paper, we propose a simple but effective algorithm to choose the rank, see Alg. 1. The basic idea is to project onto the subspace corresponding to the smallest eigenvalues that preserve at least of the energy of the whole spectrum.
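The rank selection idea described above can be sketched as follows. This is only an illustration of choosing the smallest-eigenvalue subspace by an energy threshold and is not Alg. 1 itself: the function name `select_small_eigenspace`, the input matrix `M` (assumed symmetric positive semi-definite) and the default value of `energy` are all assumptions made for the example.

```python
import numpy as np

def select_small_eigenspace(M, energy=0.1):
    """Return the eigenvalues and eigenvectors of the subspace spanned by the
    smallest eigenvalues of a symmetric PSD matrix M whose sum reaches at
    least the fraction `energy` of the whole spectrum."""
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    vals = np.clip(vals, 0.0, None)    # remove tiny negative round-off values
    total = vals.sum()
    if total == 0.0:
        return vals[:1], vecs[:, :1]   # degenerate case: keep one direction
    cumulative = np.cumsum(vals) / total
    # Smallest rank r such that the r smallest eigenvalues hold >= energy.
    rank = int(np.searchsorted(cumulative, energy)) + 1
    rank = min(rank, len(vals))        # guard against round-off at energy = 1
    return vals[:rank], vecs[:, :rank]

# Example on a matrix with a known spectrum (1, 2, 3, 4).
M = np.diag([1.0, 2.0, 3.0, 4.0])
small_vals, small_vecs = select_small_eigenspace(M, energy=0.5)
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the cumulative sum directly accumulates energy from the smallest eigenvalue upwards, so no extra sorting is needed.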
4.3 Robust Kernel Intrinsic Invariance Measure by Importance Reweighting
Real-world data are usually contaminated with noise and outliers. The estimate of the kernel mean embedding might be significantly biased by outliers in the data. Furthermore, when estimating the conditional mean embedding, we want to eliminate any effect of the marginal distribution of the hypothetical cause due to the finite sample size. We adopt an importance reweighting scheme as follows:
(12) 
and , where is a reference distribution. The empirical estimation is then obtained by
(13) 
where is a diagonal reweighting matrix with . The main body of the algorithm does not change except for the calculation of and in Alg. 1. We name this variant of our algorithm RwKIIM, for Reweighted Kernel Intrinsic Invariance Measure.
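One way to obtain a diagonal reweighting matrix of this kind is sketched below. The choices here are assumptions made for illustration and are not taken from the paper: the marginal density of the hypothetical cause is estimated with a Gaussian kernel density estimate, the reference distribution is uniform over the observed range, and the names `kde_density` and `importance_weights` are hypothetical.

```python
import numpy as np

def kde_density(points, sample, bandwidth):
    """Gaussian kernel density estimate of `sample`, evaluated at `points`."""
    z = (points[:, None] - sample[None, :]) / bandwidth
    norm_const = len(sample) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm_const

def importance_weights(x, bandwidth=None):
    """Weights q(x_i) / p_hat(x_i), with q uniform over the observed range
    (an assumed reference distribution) and p_hat a Gaussian KDE."""
    n = len(x)
    if bandwidth is None:
        # Normal-reference rule of thumb for the KDE bandwidth.
        bandwidth = 1.06 * np.std(x) * n ** (-0.2)
    p_hat = kde_density(x, x, bandwidth)
    q = 1.0 / (x.max() - x.min())
    w = q / p_hat
    return w * n / w.sum()  # rescale so that the weights average to one

rng = np.random.default_rng(0)
x = rng.normal(size=200)
B = np.diag(importance_weights(x))  # diagonal reweighting matrix
```

With a uniform reference, samples in sparsely populated regions of the cause receive larger weights, which counteracts the influence of the empirical marginal on the estimated conditional embedding.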
5 Experiment
In this section, we conduct experiments using both synthetic data and a real-world dataset called Tuebingen Cause-Effect Pairs (TCEP). We compare our methods with several state-of-the-art methods, including KCDC (although positive results were reported in mitrovic2018causal , unfortunately we are not able to reproduce the results reported in the paper), IGCI janzing2012information , ANM hoyer2009nonlinear and LiNGAM shimizu2006linear (https://github.com/DiviyanKalainathan/CausalDiscoveryToolbox). For IGCI, we use the entropy-based method with two different reference distributions (Gaussian and uniform). We use for the regularization hyperparameter when calculating the conditional mean embedding in Eq. 5. In the following experiments, we use a composite kernel for KIIM, which is the product of an RBF kernel (with the median heuristic for the kernel width), a log kernel and a rational quadratic kernel. For KCDC, we use a log kernel for the input and a rational quadratic kernel for the output, as in mitrovic2018causal .
ANM1 | KCDC | KIIM | RwKIIM | IGCI (entropy, Gaussian) | IGCI (entropy, Uniform) | ANM
Gaussian |
Uniform |
ANM2 | KCDC | KIIM | RwKIIM | IGCI (entropy, Gaussian) | IGCI (entropy, Uniform) | ANM
squaredGaussian |
Uniform |
MNM1 | KCDC | KIIM | RwKIIM | IGCI (entropy, Gaussian) | IGCI (entropy, Uniform) | ANM
Gaussian |
Uniform |
MNM2 | KCDC | KIIM | RwKIIM | IGCI (entropy, Gaussian) | IGCI (entropy, Uniform) | ANM
Gaussian |
Uniform |
Complex | KCDC | KIIM | RwKIIM | IGCI (entropy, Gaussian) | IGCI (entropy, Uniform) | ANM
Gaussian |
Uniform |
5.1 Synthetic Data
In this section, we evaluate pairwise causal discovery algorithms on data generated by 5 different data generation mechanisms following mitrovic2018causal . Details of these mechanisms are given as follows. ANM1: ; ANM2: ; MNM1: ; MNM2: ; and CNM: ; the noise distribution is specified in Tab. 1. We generate 100 samples from each data generation mechanism for the different algorithms to infer the causal direction. Experiments are conducted for 100 independent trials and the results of the different algorithms are reported in Tab. 1. We observe that IGCI is quite robust and performs well when the data generation mechanism is indeed an additive noise model. Unfortunately, we are not able to reproduce the positive results reported in mitrovic2018causal . In this paper, we use a direct version of KCDC without the majority vote and the confidence measure of mitrovic2018causal , because these extra steps are not used in our algorithm KIIM. Even without these extra steps, KIIM and RwKIIM perform quite well except for the linear ANM with uniform noise.
In order to justify the necessity of projecting onto a lower-dimensional space, we compare the performance of KIIM with different ranks of . This amounts to running the algorithm with eigenvalues ranging from the whole spectrum to only the smallest one. Two mechanisms are used in this experiment, as shown in Fig. 2(a). Interestingly, we find that the algorithm using the whole spectrum does not perform best; the one discarding the top-1 eigenvalue consistently performs best. This result supports our motivation and argument: we need an intrinsic deviance/invariance measure that captures only the higher order statistics of the shape of the conditional distribution. Trivial differences arising from location and scale might not be beneficial, and may even be harmful, for causal discovery.
5.2 Tuebingen Cause-Effect Pairs (TCEP)
In this section, we verify the performance of our algorithm on real-world data. We use the open benchmark Tuebingen Cause-Effect Pairs (https://webdav.tuebingen.mpg.de/causeeffect/), which has been widely used to evaluate causal discovery algorithms. The whole dataset contains 108 cause-effect pairs taken from 37 different data sets from various domains mooij2016distinguishing with known ground truth. We exclude pairs 52, 53, 54, 55, 71 and 105 because both $X$ and $Y$ are high-dimensional variables, and pairs 81, 82 and 83 because they contain missing values. The ground-truth direction of pair 86 is not mentioned in the data description document, so this pair is not used in our experiments either. Fig. 2(b) shows that our proposed algorithm significantly outperforms the state-of-the-art methods.
6 Conclusion
In this paper, we focus on causal discovery for cause-effect pairs along the direction of the Independent Mechanism (IM) principle. We prove that the existing norm-based state-of-the-art method, which only compares the norms of conditional mean embeddings, might lose discriminative power. To solve this problem, we propose a Kernel Intrinsic Invariance Measure (KIIM) to capture the intrinsic invariance of the conditional distribution, i.e. the higher order statistics corresponding to the shapes of the density functions. Experiments with synthetic and real data demonstrate the effectiveness of the proposed algorithm and support our argument that higher order statistics/intrinsic invariance are what one should look for in causal discovery.
References
[1] Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019.
[2] K. Budhathoki and J. Vreeken. MDL for causal inference on discrete data. In 2017 IEEE International Conference on Data Mining (ICDM), pages 751–756. IEEE, 2017.
[3] Z. Chen, K. Zhang, L. Chan, and B. Schölkopf. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26(7):1484–1517, 2014.
[4] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes’ rule: Bayesian inference with positive definite kernels. The Journal of Machine Learning Research, 14(1):3753–3783, 2013.
[5] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2008.
[6] C. Heinze-Deml, M. H. Maathuis, and N. Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 5:371–391, 2018.
[7] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689–696, 2009.
[8] D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.
[9] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.
[10] D. Lopez-Paz, R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou. Discovering causal signals in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6979–6987, 2017.
[11] J. Mitrovic, D. Sejdinovic, and Y. W. Teh. Causal inference via kernel deviance measures. In Advances in Neural Information Processing Systems, pages 6986–6994, 2018.
[12] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.
[13] J. Pearl. Causality: models, reasoning, and inference. Econometric Theory, 19(675685):46, 2003.
[14] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
[15] W. Rudin. Fourier analysis on groups, volume 121967. Wiley Online Library, 1962.
[16] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. Semi-supervised learning in causal and anticausal settings. In Empirical Inference, pages 129–141. Springer, 2013.
[17] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
[18] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
[19] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
[20] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
[21] P. Spirtes and K. Zhang. Causal discovery and inference: concepts and recent methodological advances. In Applied Informatics, volume 3, page 3. SpringerOpen, 2016.
[22] X. Sun, D. Janzing, and B. Schölkopf. Causal reasoning by evaluating the complexity of conditional densities with kernel methods. Neurocomputing, 71(7–9):1248–1256, 2008.