1 Introduction
Causal inference has been given rise to extensive attention and applied in several areas including statistics, neuroscience and sociology in recent years. An efficient approach for causal discovery is to conduct randomized controlled experiments. These experiments, however, are usually very expensive and sometimes practically infeasible. Therefore, causal inference methods using passive observational data take center stage, and many of them have been proposed, especially in the past ten years.
Existing causal inference methods that use passive observational data can be roughly categorized into two classes according to their objectives. One class of methods aim at identifying the variable that is the cause of the other in a variable pair [1, 2, 3, 4], which is often termed a causeeffect pair. Most of the methods in this class first model the relation between the cause and the effect using a functional model with certain assumptions. Then they derive a certain property which only holds in the causal direction and is violated in the anticausal direction to infer the true causal direction. This kind of widely used property is often termed causeeffect asymmetry. For example, the additive noise model (ANM) [1] represents the effect as a function of the cause with an additive independent noise: . The authors showed that there is no model of the form that admits an ANM in the anticausal direction for most combinations . Therefore, the inference of ANM is done by finding the direction that fits ANM better. Similar methods include postnonlinear model (PNL) [2] and information geometric causal inference (IGCI) [3]. Recently, a kernelbased, EMD (or abbreviation for EMbeDding) [4] using the framework of IGCI is proposed. EMD differs from the previous methods in the sense that it does not assume any specific functional model, but it still resorts to find the causeeffect asymmetry.
The other class of methods aims at recovering the structure of causal graphs. Constraintbased methods [5, 6, 7, 8, 9], which belong to this class, exploit the causal Markov condition and have been widely used in the social sciences, medical science, and bioinformatics. However, these methods allow one only to obtain the Markov equivalent class of the graph and are of high computational cost. In 2006, a linear nongaussian acyclic model (LiNGAM) [10]
which exploits the nongaussian property of the noise, was showed to be able to recover the full causal structure by using independent component analysis (ICA)
[11, 12]. To avoid the problem that ICA may result in a solution of local optima, different methods [13, 14] were proposed to guarantee the correctness of the causal order of variables in the causal graph.Both classes of existing methods are based on the assumption that all observations are sampled from a fixed causal model. By “fixed causal model,” we mean that the (joint) distribution of variables and the mechanism mapping cause(s) to effect(s) are unchanged during the data collecting process. For example in an ANM
, both the distribution of the cause and the causal mechanism are assumed to be fixed. Although some of these methods do achieve inspiring results and provide valuable insights for subsequent research, data generated from a varying causal model are much more common in practice and existing methods based on a fixed causal model would come across some problems when applied to varying causal models [15]. Therefore, we consider causal models where distributions of variables and causal mechanisms vary across domains or over different time periods and call these models nonstationary causal models. An example is the model of daily returns of different stocks. The distribution of the return of each stock varies with the financial status, and the causal mechanisms between different stocks also vary according to the relations between these companies. Recently, a method called Enhanced Constraintbased Procedure (ECBP) was proposed for causal inference of nonstationary causal models [15]. The authors resorted to an index variable to quantify the nonstationarity and proposed ECBP, which is built on constraintbased methods to recover the skeleton of the augmented graph, which consists of both observed variables and some unobserved quantities determined by . They also showed that it is possible to infer the parent nodes of variables adjacent to (termed specific variables) and proposed a measure to infer the causal direction between each specific variable and its parents. However, their method fails to ensure the recovery of the full causal structure, which is due to the limitation of methods that rely on conditional independence test. In contrast, our method, which is proposed originally for causeeffect pairs inference, is also extended to infer the complete causal structure of two kinds of graphs by transforming the nonstationarity into a LiNGAM model.In this paper, we introduce a nonstationary causal model and develop algorithms, which we call embeddingbased nonstationary causal model inference (ENCI) for inferring the complete causal relations of the model. Our model assumes that the underlying causal relations (i.e. the causal direction of a causeeffect pair or the causal structure of a graph) are fixed, whereas the distributions of variables and the causal mechanisms (i.e. the conditional distribution of the effect given the cause(s)) change across domains or over different time periods. To infer the nonstationary causal model, ENCI reformulates the relation among variables into a linear model in the Reproducing Kernel Hilbert Space (RKHS) and leverages the identifiability of the linear causal model to tackle the original complicated problem. Specifically, for a causeeffect pair, we embed the variation of the density of each variable into an RKHS to transform the original unknown causal model to a linear nongaussian additive model [16] based on the independence between the mechanism generating the cause and the mechanism mapping the cause to the effect. Then we infer the causal direction by exploiting the causal asymmetry of the obtained linear model. We also extend our approach to discover the complete causal structure of two kinds of causal graphs in which the distribution of each variable and the causal mechanism mapping cause(s) to effect(s) vary and the causal mechanism could be nonlinear.
This paper is organized as follows. In section 2, we formally define our model and objective of causal inference. In section 3, some preliminary knowledge of reproducing kernel Hilbert space embedding is introduced. In section 4, we elaborate our methods for causeeffect pairs. In section 5, we extend our methods to two kinds of causal graphs. In section 6, we report experimental results on both synthetic and realworld data to show the advantage of our approach over existing ones.
2 Problem Description
In this section we formalize the nonstationary causal model and the objective of our causal inference task. For a pair of variable and , we consider the case where is the cause and is the effect without loss of generality throughout this paper.
2.1 Nonstationary Causal Model
We assume the data generating process of a causeeffect pair fulfills the following properties:

The causal direction between and stays the same throughout the process.

Observations are collected from different domains. The density of the cause and the conditional density of the effect given the cause are fixed within each domain.

and vary in different domains.
We call this a nonstationary causal model due to the variation in distributions over domains. The datagenerating process is illustrated in Figure 1.
The collection of data obtained from each domain is called a data group , and the entire data set is denoted by . This nonstationarity over groups is common in the real world, as the observations we obtained are usually collected over different time periods or from different sources (e.g. different geographical regions or experimental settings).
2.2 Objective of Nonstationary Causal Model Inference
Our goal of nonstationary causal model inference is, by exploiting the variation of distributions in different groups, to accurately estimate the causal direction between and . We also extend, our approach to learn the full causal structure of two kinds of causal graphs by transforming their relationship among groups into a LiNGAM model. For clarity, we list some of the notations we use in the following sections in Table 1.
Symbol  Description 

Density of , in group  
Base of the density of  
Variation of the density of in group  
Conditional density of given in group  
Base of the conditional density of given  
Variation of the conditional density of given in group  
domain of variable ,  
,  Mean embedding of in , in 
,  Mean embedding of in , in 
,  Mean embedding of in , in 
3 Hilbert Space Embedding of Distributions
Kernel embeddingbased approaches represent probability distributions by elements in a reproducing kernel Hilbert space (RKHS) and it serves as the main tool in this letter to characterize distributions.
An RKHS over with a kernel is a Hilbert space of functions . Denoting its inner product by , RKHS fulfills the reproducing property . People often regard as a feature map of . Kernel embedding of a marginal density [17] is defined as the expectation of its feature map:
(1) 
where is the expectation of . It has been shown that is guaranteed to be an element in RKHS if
is satisfied. It is also generalized to joint distribution using tensor product feature spaces. The kernel embedding of a joint density
is defined as(2) 
Similarly, we have that . The embedding of conditional densities is viewed as an operator that maps from to which is an RKHS over [18]. Imposing that the conditional embedding satisfies the following two properties:
(3)  
(4) 
where and is kernel embedding of marginal density , [18] showed that conditional embedding can be defined as to fulfill equations 3 and 4. In the following sections, we follow the definition of kernel mean embedding and embed distributions in a tensor product space to represent distribution of each group.
4 Embeddingbased Nonstationary Causal Model Inference
In this section we introduce our proposed approach to infer the causal structure of nonstationary causal models.
4.1 Basic Idea
Currently, the most widely used idea of inferring causal direction is to quantify the independence between the mechanism generating the cause and the mechanism mapping the cause to the effect. One way to interpret the independence between these two mechanisms is to measure the independence between the cause and the noise. ANM and PNL lie in this field and LiNGAM methods could also be interpreted from this viewpoint [14]. We adopt a different interpretation which uses the independence between the marginal distribution of the cause and the conditional distribution of the effect given the cause to capture the independence between these two mechanisms and further exploit causal asymmetry. This kind of independence has also been used in many existing causal inference methods [19, 3, 4]. We formalize this independence in postulate 1:
Postulate 1.
The mechanism generating the cause and the mechanism mapping the cause to the effect are two independent natural processes.
[3] proposed this postulate and developed information geometry causal inference (IGCI). IGCI uses the density of the cause to characterize the first mechanism and the derivative of the function mapping the cause to the effect to characterize the second. In our approach, the variation of the marginal density of the cause is used to characterize the first mechanism, which is similar to IGCI. What differs from IGCI is that we use the variation of the conditional density to characterize the second mechanism. In subsequent sections, we introduce how we obtain the variation of densities and how we infer the causal direction based on the independence between them.
4.2 Decomposition of Distributions
Given the entire data set , which consists of groups, we make use of the variation of densities in each group. To obtain the variation, we first compute the mean of marginal densities and conditional densities of all groups as
(5) 
where is the density of and is the conditional density of given in group . We call and the base of marginal and conditional densities, respectively. Then the variation of density of each group is given by:
Definition 1 (Variation of density).
For any , we decompose and into two parts: one is the base of the (conditional) density and the other is a varying part, i.e. and . We call and the variation of the marginal and conditional density of group , respectively.
Since the base of densities is the mean of densities of all groups, and fulfill the following properties.
(6) 
Making use of the decomposition of distributions defined in definition 1, we are able to analyze densities of each group with some components fixed, which finally guides us to a fixed linear causal model. We take group as an example to provides some insights before elaborating the derivations. The marginal density of effect is given by
(7) 
where and are the same in all groups. Therefore, we would obtain a fixed term which does not change over in the expansion of equation 7. Although and vary over groups, they also consist of and which allows us to use the invariant to formulate the relation between them into a fixed causal model. In subsequent sections, we adopt kernel embedding to transform these kinds of invariant into a linear model to infer the causal direction.
4.3 Kernel Embedding of Distributions in Tensor Product Space
We resort to kernel embedding to represent distributions. The marginal distributions of and of each group are embedded in tensor product space and , respectively. For simplicity, we use to represent the tensor product space and to represent in subsequent sections. Following the definition of kernel mean embedding, we define the mean embeddings of and of group in and as:
Definition 2 (tensor mean embedding).
(8) 
where is the feature map of and is the density of in group . Similar notations go for .
Definition 2 is the embedding of marginal densities of each group. Since our analysis is conducted on the base and variation of density of each group, we further define the tensor mean embedding of the base and variation of densities:
Definition 3 (tensor mean embedding of the base and variation of distributions).
(9)  
(10) 
is the same in all groups and we have from definitions 2 and 3. Similarly, there is . Definition 2 and 3 together state how marginal distributions are embedded in the tensor product space after decomposition. Next, we show how we make use of these tensor mean embeddings to infer the causal direction between and . To avoid analyzing probability densities directly, we substitute equation 7 into definition 2 to conduct analysis on their embeddings:
(11) 
where we omit the term . Since the ranges of variables are usually bounded and distributions usually change smoothly instead of drastically in realworld situations, we consider it reasonable to omit the one with two variation terms. Although there exits sets of densities in which the omitted term of certain group would have magnitude comparable to the sum of the remaining three terms when the distribution shifts drastically, we deem it less likely to occur in real situations. Note that this claim is close in spirit to an assumption in [15] in which the authors assume the nonstationarity can be written as smooth functions of time or domain index. With this claim, we have the tensor mean embedding of the base of distributions as:
(12) 
where the approximately equal mark is again derived by omitting the one with two variation terms and the last equality is directly derived from the property shown in equation 6. Then we have the tensor mean embedding of the variation of distributions as:
(13) 
which shows the relation between the tensor mean embedding of the variation of the effect and cause. and are matrices of functions of . In addition, they are both symmetric and positive definite so they admit decomposition:
(14)  
(15) 
where and are lower triangular matrices, and denote the th column of and , respectively; and denotes the dimension of . The symbol indicates the corresponding relation of to the variation of densities. By assuming that and lie in the space of , we have and . and are matrices containing coefficient mapping from to and , respectively. Then we have
(16)  
(17) 
By substituting equation 16 and 17 into equation 13, we further obtain
(18) 
where and are substituted in according to definition 2 and 3.
4.4 Inferring Causal Directions
In this section, we discuss how we infer the causal direction using the kernel embedding of decomposed densities. Note again that we consider the case without loss of generality throughout this letter.
We start by taking normalized trace on both sides of equation 18,
(19) 
where is called the normalized trace of , is the size of , and . Since the independence of the two mechanisms in Postulate 1 is difficult to quantify, we consider to use the density of the cause and the conditional density of the effect given the cause to represent the two mechanisms and adopt the independence between the base and variation of these two densities to infer the causal direction. The independence we rely on is based on the concept of free independence [20, 21].
Definition 4 (Free independence).
It is straightforward from definition 4 that if and are free independent, it holds that [20, 21]. Then we have the following two assumptions to characterize the independence in postulate 1:
Assumption 1.
We assume that the tensor mean embedding of the variation of marginal density of the cause () and is free independent, and the tensor mean embedding of the base of marginal density of the cause () and , is free independent, that is,
(21)  
(22) 
where is the number of groups.
Assumption 1 captures the independence between the mechanism generating the cause and the mechanism mapping the cause to the effect. In equation 21, depends only on the base of the conditional densities which corresponds to the second mechanism, and depends only on the variation of the marginal densities of the cause , which corresponds to the first mechanism. Therefore, the free independence between them characterizes the independence in postulate 1. Similarly, we have assumptions shown in equation 23.
Assumption 2.
Regarding the normalized trace of the tensor mean embedding of variation of marginal densities of the cause in each group as a realization of a random variable
and each as a realization of another random variable , we assume that these two random variables are independent, i.e.(23) 
Assumption 2 is also motivated by the independence in postulate 1. Specifically, captures the information of the variation of marginal densities of the cause, and captures the information of the variation of conditional densities. We interpret postulate 1 as the independence between the marginal and conditional. Therefore, this independence between their variations of densities (approximately) holds. With assumption 1, equation 19 becomes
(24) 
Since and are fixed given , and are the same in all groups. We introduce the following notations for simplicity:
Notation 1. For any , we use and to represent and , respectively. denotes , which is the corresponding noise term. denotes . We view each as a realization of a random variable . Similarly, we have and .
Proposition 1.
Proof.
By adopting notations in notation 1, equation 24 becomes
(26) 
We first show that follows nongaussian distributions. According to assumption 1, we have
(27) 
where is fixed and thus can be viewed as a constant. From the definition of we have
(28) 
Since are positive for all , we have . Therefore, the distribution of
is not symmetric and is thus not Gaussian distributed.
According to the identifiability of LiNGAM [16, 10], and are dependent. By exploiting the causeeffect asymmetry that the cause is independent of the noise only in the causal direction, we propose the following causal inference approach: embeddingbased nonstationary causal model inference (ENCI).
Causal Inference Approach (ENCI): Given data set , we compute and for and conclude that if , otherwise if .
4.5 Empirical Estimations
In this section, we show how to estimate and for based on the observations.
Let and be the feature matrices of and in group , respectively, given observations in . We estimate the mean embedding of in as
(29) 
where is the number of observations in th group, and
is a column vector of all 1s. Since we have
(30) 
for estimating the tensor mean embedding of the base of distributions , the tensor mean embedding of the variation of distributions is estimated as
(31) 
By taking the normalized trace on both sides of equation 31, we have
(32) 
where is the total number of groups, is the number of observations in th group and is the kernel matrix of in th group. Similarly, we have
(33) 
where is the kernel matrix of in th group. We can see that both and can be easily calculated from Gram matrix using kernel methods.
5 Extending ENCI to Causal Graph Discovery
In this section, we extend ENCI to causal discovery for two kinds of directed acyclic graphs (DAGs). One is a treestructured graph in which each node has at most one parent node. The other is multipleindependentparent graph in which parent nodes of each node are mutually independent. Examples of these two kinds of DAGs are shown in Figure 2.
5.1 Describing Causal Relationship by Directed Acyclic Graphs
Consider a finite set of random variables with index set . A graph consists of nodes in and edges in for any . Then we introduce graph terminologies required for subsequent sections. Most of the definitions are from [8].
Edge is a directed link from node to node . Node is called a parent of , and is called a child of if . The parent set of is denoted by and its child set by . Nodes , are called adjacent if either or . A path in is a sequence of distinct vertices such that and are adjacent for all . If for all , the path is also called a directed path from to . is called a partially directed acyclic graph (PDAG) if there is no directed cycle, i.e., there is no pair such that there are directed paths from to and from to . is called a directed acyclic graph (DAG) if it is a PDAG and all edges are directed.
General causal graph discovery is very challenging, especially when the relation between a variable pair is a complicated nonlinear stochastic process. In the following section, we show how we discover the causal structure treestructured graphs (TSG) and multipleindependentparents graph (MIPG). Note that the causal relation between a variable and its parent node in our model not only could be complicated nonlinear functions but also varies in different groups.
5.2 TreeStructured Causal Graph Discovery
In a TSG with nodes, each variable and its only parent node fulfill the linear relation in Equation 25. Thus we have the following proposition for TSG:
Proposition 2.
In a TSG where each variable has only one parent node, the normalized traces of the tensor mean embedding of the variation of densities of all variables fulfill a linear nongaussian acyclic model (LiNGAM) [10] if assumption 1 and 2 hold:
(34) 
where , coefficient matrix whose element on th row and th column equals to could be permuted to a lower triangular matrix and collects all noise terms .
Proof.
First, , where , could be arranged in a causal order in which no later variable is the cause of earlier ones due to the acyclicity of the graph. Note that causal order in subsequent sections also means that this condition holds for a sequence of variables. Second, the noise term , where , follows nongaussian distributions as shown in proposition 1. Thirdly, assumption 2 ensures that for . ∎
Therefore, the graph formed by , where , fulfills the structure of LiNGAM [10] so we can apply LiNGAM on to infer the causal structure of the causal graph consists of .
5.3 MultipleIndependentParent Graph Discovery
We extend ENCI to cases where each node could have more than one parent node provided that all its parent nodes are mutually independent.
Suppose a variable in graph has independent parent nodes  . The marginal density of in group can be obtained from
(35) 
Then by substituting into with decomposed as and integrating with respect to , we have