Causal Discovery and Hidden Driving Force Estimation from Nonstationary/Heterogeneous Data

03/05/2019 ∙ by Biwei Huang, et al. ∙ Lingnan University, Max Planck Society, Carnegie Mellon University

It is commonplace to encounter nonstationary or heterogeneous data, in which the underlying generating process changes over time or across domains. Such distribution shift presents both challenges and opportunities for causal discovery. In this paper, we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from NOnstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and to recover the skeleton of the causal structure over observed variables. Second, we present a way to determine causal orientations by making use of independent changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. After learning the causal structure, we investigate how to efficiently estimate the "driving force" of the nonstationarity of a causal mechanism; that is, we aim to extract from data a low-dimensional and interpretable representation of changes. The proposed methods are nonparametric, with no restrictions on data distributions and causal mechanisms, and do not rely on window segmentation. Furthermore, we find that nonstationarity benefits causal structure identification with particular types of confounders. Finally, we show the tight connection between nonstationarity/heterogeneity and soft intervention in causal discovery. Experimental results on various synthetic and real-world data sets (task-fMRI and stock data) are presented to demonstrate the efficacy of the proposed methods.


1 Introduction

Many tasks across several disciplines of empirical sciences and engineering rely on the underlying causal information. As it is often difficult, if not impossible, to carry out randomized experiments, inferring causal relations from purely observational data, known as the task of causal discovery, has drawn much attention in machine learning, philosophy, statistics, and computer science. Traditionally, for causal discovery from observational data, under appropriate assumptions, so-called constraint-based approaches recover some information about the underlying causal structure based on conditional independence relationships of the variables (Spirtes et al., 1993). Alternatively, approaches based on functional causal models infer the causal structure by exploiting the fact that under certain conditions, the independence between the noise and the hypothetical cause only holds for the correct causal direction but not for the wrong direction (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009a, b).

Over the last few years, with the rapid accumulation of huge volumes of data of various types, causal discovery is facing exciting opportunities but also great challenges. One feature such data often exhibit is distribution shift. Distribution shift may occur across data sets, which may be obtained under different interventions or with different data collection conditions, or over time, as featured by nonstationary data. For an example of the former kind, consider remote sensing imagery data. The data collected in different areas and at different times usually have different distributions due to varying physical factors related to ground, vegetation, illumination conditions, etc. As an example of the latter kind, fMRI recordings are usually nonstationary: information flows in the brain may change with stimuli, tasks, attention of the subject, etc. More specifically, it is believed that one of the basic properties of neural connections is their time-dependence (Havlicek et al., 2011). In these situations many existing approaches to causal discovery may fail, as they assume a fixed causal model and hence a fixed joint distribution underlying the observed data. For example, if changes in local mechanisms of some variables are related, one can model the situation as if there exists some unobserved quantity which influences all those variables and, as a consequence, the conditional independence relationships in the distribution-shifted data will be different from those implied by the true causal structure.

In this paper, we assume that mechanisms or parameters, associated with the causal model, may change across data sets or over time (we allow mechanisms to change in such a way that some causal links in the structure may vanish or appear in some domains or over some time periods). We aim to develop a principled framework to model such situations as well as practical methods, called Constraint-based causal Discovery from heterogeneous/NOnstationary Data (CD-NOD), to address the following questions:

  • How can we efficiently identify variables with changing local mechanisms and reliably recover the skeleton of the causal structure over observed variables?

  • How can we take advantage of the information carried by distribution shifts for the purpose of identifying causal directions?

After identifying the causal structure, it is then appealing to ask how causal mechanisms change across domains or over time, which raises the question:

  • How can we extract from data a low-dimensional and potentially interpretable representation of changes, the so-called “driving force” of changing causal mechanisms?

Furthermore, we extend our approach to deal with more general scenarios, e.g., dynamic systems which involve both time-varying instantaneous and lagged causal relations and the case of stationary confounders.

In answering these questions, we make use of the following properties of causal representations. (i) Causal models and distribution shifts are heavily coupled: causal models provide a compact description of how data-generating processes, as well as data distributions, change, and distribution shifts exhibit such changes. (ii) From a causal perspective, the distribution shift in heterogeneous/nonstationary data is usually constrained: it may be due to the changes in the data-generating processes (i.e., the local causal mechanisms) of a small number of variables. (iii) From a latent variable modeling perspective, heterogeneous/nonstationary data are generated by some quantities that change across domains or over time, which gives hints as to how to understand distribution shift and estimate its underlying driving forces. (iv) Suppose that there are no confounders for cause and effect. Then $P(\text{cause})$ and $P(\text{effect} \mid \text{cause})$ are either fixed or change independently, which is also known as the modularity property of causal systems (Pearl, 2000). Such independence helps identify causal directions in the presence of distribution shifts.

To reliably estimate the skeleton of the causal structure and detect changing causal mechanisms from heterogeneous/nonstationary data (Problem 1), we make use of properties (i), (ii), and (iii) listed above. Specifically, we introduce a surrogate variable $C$ into the causal system to characterize hidden quantities that lead to the changes across domains or over time. The variable $C$ can be a domain or time index. Including $C$ in the causal system provides a convenient way to unpack distribution shifts into causal representations. We show that given $C$, (conditional) independence relationships between observed variables are the same as those implied by the true causal structure.

Regarding Problem 2, as a sub-problem of causal discovery, we show that distribution shift provides additional information for causal direction identification. It is known that with functional causal model-based approaches, there are cases where causal directions are not identifiable, e.g., the linear-Gaussian case and the case with a general functional class (Hyvärinen and Pajunen, 1999; Zhang et al., 2015b). This restricts the causal direction identification to certain functional classes, e.g., additive (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009a) and post-nonlinear models (Zhang and Chan, 2006; Zhang and Hyvärinen, 2009b). We show that using information carried by distribution shifts does not suffer from these restrictions: the method applies to general (nonlinear) functional causal models.

Specifically, we take advantage of property (iv) for causal direction determination: if there is no confounder for $V_i$ and $V_j$, then the causal mechanisms, represented by the conditional distributions $P(V_i \mid PA^i)$ and $P(V_j \mid PA^j)$ (where $PA^i$ denotes the direct causes of $V_i$), change independently across data sets or over time. However, independence typically no longer holds for wrong directions. This gives rise to causal asymmetry. To exploit this asymmetry, we develop a kernel embedding of nonstationary conditional distributions to represent changing causal mechanisms and accordingly propose a dependence measure to determine causal directions.

Regarding Problem 3, traditionally, one may use Bayesian change point detection to detect change points of observed time series (Adams and Mackay, 2007) or use sliding windows and then estimate the causal model within each segment separately. However, Bayesian change point detection can only be applied to detect changes in marginal or joint distributions, whereas causal mechanisms are represented by conditional distributions. Moreover, neither of them is appropriate when causal mechanisms change continuously over time. More recently, a window-free method has been proposed, by extending Gaussian process regression (Huang et al., 2015). However, it requires the assumption of linearity, and it fails to handle the case when nonstationarity results from the change of noise distributions. In this paper, by leveraging property (iii), we propose a nonparametric method to recover a low-dimensional and interpretable representation of changes, which does not rely on window segmentation.

This paper [footnote 1] is organized as follows. In Section 2 we define and motivate the problem in more detail and review related work. Section 3 proposes an enhanced constraint-based method to recover the causal skeleton over observed variables and identify variables with changing causal mechanisms. Section 4 develops a method for determining causal directions by exploiting distribution shifts. It makes use of the property that in a causal system, causal modules change independently if there are no confounders. Section 5 proposes a method, termed Kernel Nonstationary Visualization (KNV), to visualize a low-dimensional and interpretable representation of changing mechanisms, the so-called “driving force”. In Section 6, we extend CD-NOD to the case that allows both time-varying lagged and instantaneous causal relationships, and we discuss whether distribution shifts also help for causal discovery when there exist stationary confounders. In addition, we give a procedure to leverage both CD-NOD and approaches based on constrained functional causal models. In Section 7, we show the connection between heterogeneity/nonstationarity and soft intervention in causal discovery. Section 8 reports experimental results on both synthetic and real-world data sets, including task-fMRI data, Hong Kong stock data, and US stock data.

Footnote 1: This paper is built on the arXiv paper by Zhang et al. (2015a) and the conference papers by Zhang et al. (2017) and Huang et al. (2017) but is significantly extended in several aspects. We reformulate assumptions in Section 3.1. We extend Section 3.2 to show how we detect pseudo confounders for nonadjacent variables. In Section 4.1, we add Algorithm 2 which uses generalization of invariance to identify causal directions. In Section 4.2, we propose a new approach to efficiently identify causal directions using independent changes between causal mechanisms and detect pseudo confounders behind adjacent variables (Algorithm 3). In Section 4.3, we give identifiability conditions of CD-NOD and define the equivalence class that CD-NOD can achieve if those conditions do not hold. Furthermore, we extend CD-NOD to the case when there exist both time-varying instantaneous and lagged causal relationships (Section 6.1). Accordingly, we propose Algorithm 5 to efficiently recover both instantaneous and time-lagged causal relationships. In Section 6.2, we further discuss whether distribution shifts also help for causal discovery when there exist stationary confounders. With CD-NOD, some causal directions may not be identifiable, if the identifiability conditions are not satisfied (Theorem 2). To make the method more applicable, we combine our framework with approaches based on constrained functional causal models (Section 6.3). In Section 7, we show that heterogeneity/nonstationarity and soft intervention are related in causal discovery, and we find that our proposed method is even more effective.

2 Problem Definitions and Related Work

In this section, we first review causal discovery approaches with fixed causal models. Then we give examples to show that if the underlying causal model changes, directly applying approaches with fixed causal models may result in spurious edges or wrong causal directions, which motivates our work in causal discovery with changing causal models.

2.1 Causal Discovery of Fixed Causal Models

Most causal discovery methods assume that there is a fixed causal model underlying the observed data and aim to estimate it from the data. Classic approaches to causal discovery divide roughly into two types. In the late 1980’s and early 1990’s, it was noted that under appropriate assumptions, one could recover a Markov equivalence class of the underlying causal structure based on conditional independence relationships among the variables (Spirtes et al., 1993). This gave rise to the constraint-based approach to causal discovery, and the resulting equivalence class may contain multiple DAGs (or other graphical objects to represent causal structures), which entail the same conditional independence relationships. The required assumptions include the causal Markov condition and the faithfulness assumption, which entail a correspondence between d-separation properties in the underlying causal structure and statistical independence properties in the data. The so-called score-based approach (see, e.g., (Chickering, 2003; Heckerman et al., 1995)) searches for the equivalence class which gives the highest score under some scoring criteria, such as the Bayesian Information Criterion (BIC), the posterior of the graph given the data (Heckerman et al., 1997), and the generalized score functions (Huang et al., 2018).

Another set of approaches is based on constrained functional causal models, which represent the effect as a function of the direct causes together with an independent noise term. The causal direction implied by the constrained functional causal model is generically identifiable, in that the model assumptions, such as the independence between the noise and cause, hold only for the true causal direction and are violated for the wrong direction. Examples of such constrained functional causal models include the Linear, Non-Gaussian, Acyclic Model (LiNGAM (Shimizu et al., 2006)), the additive noise model (Hoyer et al., 2009; Zhang and Hyvärinen, 2009a), and the post-nonlinear causal model (Zhang and Chan, 2006; Zhang and Hyvärinen, 2009b).

2.2 With Changing Causal Models

Suppose that we are given a set of observed variables $\mathbf{V} = \{V_i\}_{i=1}^m$ whose causal structure is represented by a DAG $G$. For each $V_i$, let $PA^i$ denote the set of parents of $V_i$ in $G$. Suppose that at each point in time or in each domain, the joint probability distribution of $\mathbf{V}$ factorizes according to $G$:

$$P(\mathbf{V}) = \prod_{i=1}^{m} P(V_i \mid PA^i). \tag{1}$$

We call each $P(V_i \mid PA^i)$ a causal module (with the same meaning as “causal mechanism” in previous sections). If there are distribution shifts (i.e., $P(\mathbf{V})$ changes across domains or over time), at least some causal modules $P(V_k \mid PA^k)$ must change. We call those causal modules changing causal modules. Their changes may be due to changes of the involved functional models, causal strengths, noise levels, etc. We assume that those quantities that change across domains or over time can be written as functions of a domain or time index, and denote by $C$ such an index.

Figure 1: An illustration of how ignoring changes in the causal model may lead to spurious edges by constraint-based methods. (a) The true causal graph (including confounder $g(C)$). (b) The estimated causal skeleton on the observed data in the asymptotic case given by constraint-based methods, e.g., PC or FCI.

If the changes in some modules are related, one can treat the situation as if there exists some unobserved quantity (confounder) which influences those modules and, as a consequence, the conditional independence relationships in the distribution-shifted data will be different from those implied by the true causal structure. Therefore, standard constraint-based algorithms such as PC or FCI (Spirtes et al., 1993) may not be able to reveal the true causal structure. As an illustration, suppose that the observed data were generated according to Fig. 1(a), where $g(C)$, a function of $C$, is involved in the generating processes for both $V_2$ and $V_4$; the causal skeleton over the observed data then contains spurious edges $V_1 - V_4$ and $V_2 - V_4$, as shown in Fig. 1(b), because there is only one conditional independence relationship, $V_3 \perp V_1 \mid V_2$.

Figure 2: An illustration of a failure of using the approach based on functional causal models for causal direction determination when the causal model changes. (a) Scatter plot of and on data set 1. (b) That on data set 2. (c) That on merged data (both data sets). (d) The scatter plot of and the estimated regression residual on merged data by regressing on .

Moreover, when one fits a fixed functional causal model (e.g., a linear, non-Gaussian model (Shimizu et al., 2006)) to distribution-shifted data, the estimated noise may not be independent of the cause. Consequently, the approach based on constrained functional causal models, in general, cannot infer the correct causal structure either. Figure 2 gives an illustration of this point. Suppose that we have two data sets for variables and : is generated from according to in the first data set and according to in the second one, and in both data sets and are mutually independent and follow a uniform distribution. Figure 2(a-c) show the scatter plots of and on data set 1, on data set 2, and on merged data, respectively. Figure 2(d) shows the scatter plot of , the cause, and the estimated regression residual on the merged data set by regressing on ; they are not independent anymore, although on either data set the regression residual is independent of . Thus, we cannot correctly determine the causal direction.
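The effect illustrated in Figure 2 can be reproduced with a small simulation. The sketch below is a minimal illustration in plain Python, not the authors' code: the slopes 0.5 and 2.0, the noise range, and the sample size are assumptions chosen for illustration, and the correlation between the squared residual and the squared cause is used only as a crude dependence proxy rather than a formal independence test.

```python
import random

def ols(x, y):
    """Least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def corr(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

def residual_dependence(x, y):
    """Fit one linear model, then correlate squared residual with squared cause."""
    slope, intercept = ols(x, y)
    res = [b - slope * a - intercept for a, b in zip(x, y)]
    return corr([r * r for r in res], [a * a for a in x])

def make_domain(slope, n, rng):
    """Cause and noise are independent and uniform; the effect is linear in the cause."""
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [slope * a + rng.uniform(-0.5, 0.5) for a in x]
    return x, y

rng = random.Random(0)
x1, y1 = make_domain(0.5, 5000, rng)   # hypothetical slope in data set 1
x2, y2 = make_domain(2.0, 5000, rng)   # hypothetical slope in data set 2

within_dep = residual_dependence(x1, y1)            # model fitted on one data set
pooled_dep = residual_dependence(x1 + x2, y1 + y2)  # one fixed model on merged data
print(f"within data set 1: {within_dep:.3f}, merged: {pooled_dep:.3f}")
```

On a single data set the residual carries no visible trace of the cause, whereas on the merged data the residual variance grows with the magnitude of the cause, which is the dependence that misleads methods assuming a fixed functional causal model.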

To tackle the issue of changing causal models, one may try to find causal models in each sliding window (Calhoun et al., 2014) (for nonstationary data) or in different domains (for data from multiple domains) separately, and then compare them. Improved versions include the online change point detection method (Adams and Mackay, 2007), the online undirected graph learning (Talih and Hengartner, 2005), and the locally stationary structure tracker algorithm (Kummerfeld and Danks, 2013). Such methods may suffer from high estimation variance due to sample scarcity, large type II errors, or multiple testing problems from a large number of statistical tests. Some methods aim to estimate the time-varying causal model by making use of certain types of smoothness of the change (Huang et al., 2015), but they do not explicitly locate the nonstationary causal modules. Several methods aim to model time-varying time-delayed causal relations (Xing et al., 2010; Song et al., 2009), which can be reduced to online parameter learning because the direction of causal relations is given (i.e., the past influences the future). Compared to them, learning changing instantaneous causal relations, with which we are concerned in this paper, is generally more difficult. Moreover, most of these methods assume linear causal models, limiting their applicability to complex problems with nonlinear causal relations.

In contrast, we developed a nonparametric and computationally efficient method that can identify changing causal modules and reliably recover the causal structure. We showed that distribution shifts actually contain useful information for the purpose of determining causal directions and developed practical algorithms accordingly. After identifying the causal structure, we proposed a method to estimate a low-dimensional and interpretable representation of changes.

3 CD-NOD Phase I: Changing Causal Module Detection and Causal Skeleton Estimation

In this section, we first formalize the assumptions that will be used in CD-NOD. Specifically, we allow a particular type of confounders—pseudo confounders, and we do not put hard restrictions on functional forms of causal mechanisms and data distributions. Accordingly, we propose an approach to efficiently detect changing causal modules and identify the causal skeleton; we call this step CD-NOD phase I. We show that the proposed approach is guaranteed to asymptotically recover the true graph as if unobserved changing factors were known.

3.1 Assumptions

In this paper, we allow changes in causal modules and some of the changes to be related; the related changes can be explained by positing particular types of confounders. Intuitively, such confounders may refer to some high-level background variables. For instance, for fMRI data, they may be the subject’s attention or some unmeasured background stimuli; for the stock market, they may be related to economic policies. Thus, we do not assume causal sufficiency for the set of observed variables. Instead, we assume pseudo causal sufficiency as stated below.

Assumption 1 (Pseudo Causal Sufficiency)

We assume that the confounders, if any, can be written as smooth functions of the domain index or time [footnote 2]. It follows that in each domain or at each time, the values of these confounders are fixed.

Footnote 2: More specifically, for data with multiple domains, we require that the confounders can be written as a function of the domain index (i.e., it does not change within a domain); for nonstationary time series, we require that the confounder is a smooth function of the time index. Roughly speaking, the smoothness constraint requires the gradient of the function to not change rapidly. In practice, one may specify the level of smoothness in advance (say, by assuming the function follows a Gaussian process prior and properly setting the kernel width to some range) or learn it from data by maximizing marginal likelihood or cross validation.

To clearly express our basic idea in the presence of distribution shift, we focus on DAGs and assume pseudo causal sufficiency. Note that our approach is flexible enough to be extended to cover other types of graphs, e.g., graphs with confounders and graphs with cycles. Later in Section 6.2, we will discuss how nonstationarity helps when there exist stationary confounders. Table 1 summarizes the different types of confounders that will be used in this paper, including pseudo confounders, stationary confounders, and nonstationary confounders.

Confounder type | Description
Pseudo confounder | It can be represented as a smooth function of the domain or time index.
Stationary confounder | Its distribution is fixed.
Nonstationary confounder | Its distribution changes across domains or over time, but it cannot be represented as a smooth function of the domain or time index.

Table 1: Descriptions of different types of confounders (latent common causes).

We start with contemporaneous causal relations; the mechanisms and parameters associated with the causal model are allowed to change across data sets or over time, or even vanish or appear in some domains or over some time periods. However, it is natural to generalize our framework to incorporate time-delayed causal relations (Section 6.1).

Denote by $\{g_l(C)\}_{l=1}^L$ the set of pseudo confounders (which may be empty). We further assume that for each $V_i$, its local causal process can be represented by the following structural equation model (SEM):

$$V_i = f_i\big(PA^i, \mathbf{g}^i(C), \theta_i(C), \epsilon_i\big), \tag{2}$$

where $\mathbf{g}^i(C) \subseteq \{g_l(C)\}_{l=1}^L$ denotes the set of confounders that influence $V_i$ (it is an empty set if there is no confounder behind $V_i$ and any other variable), $\theta_i(C)$ denotes the effective parameters in the model that are also assumed to be functions of $C$, and $\epsilon_i$ is a disturbance term that is independent of $C$ and $PA^i$ and has a non-zero variance (i.e., the model is not deterministic). We also assume that the $\epsilon_i$'s are mutually independent. Note that $\theta_i(C)$ is specific to $V_i$ and is independent of $\theta_j(C)$ for $i \neq j$. The variable $C$ can be the domain or time index. In special cases, e.g., when there are multiple domains and all of them exhibit nonstationarity, $C$ has two dimensions: one is the domain index and the other is the time index. The SEM given in Eq. 2 does not place any restrictions on data distributions or functional classes.

In this paper we treat $C$ as a random variable, and so there is a joint distribution over $\mathbf{V} \cup \{g_l(C)\}_{l=1}^L \cup \{\theta_i(C)\}_{i=1}^m$. We assume that this distribution is Markov and faithful to the graph resulting from the following additions to $G$ (which, recall, is the causal structure over $\mathbf{V}$): add $\{g_l(C)\}_{l=1}^L \cup \{\theta_i(C)\}_{i=1}^m$ to $G$, and for each $i$, add an arrow from each variable in $\mathbf{g}^i(C)$ to $V_i$ and add an arrow from $\theta_i(C)$ to $V_i$. We refer to this augmented graph as $G^{aug}$. Obviously, $G$ is simply the induced subgraph of $G^{aug}$ over $\mathbf{V}$. Specifically, the assumption is summarized below.

Assumption 2

The joint distribution over $\mathbf{V} \cup \{g_l(C)\}_{l=1}^L \cup \{\theta_i(C)\}_{i=1}^m$ is Markov and faithful to the augmented graph $G^{aug}$. In addition, there is no selection bias; i.e., the observed data are perfect random samples from the populations implied by the causal model.

The distribution change across domains or over time can be considered in the following way. In the case when $C$ is the domain index, $C$ follows a uniform distribution over all possible values, and we have a particular way to generate its value: all possible values are generated once, resulting in the domain indices. In the case when $C$ is the time index, we take time to be a special random variable which follows a uniform distribution over the considered time period, with the corresponding data points evenly sampled at a certain sampling frequency. Correspondingly, the generating process of nonstationary data can be considered as follows: we generate random values from $P(C)$, and then we generate data points over $\mathbf{V}$ according to the SEM in (2). The generated data points are then sorted in ascending order according to the values of $C$. In other words, we observe the distribution $P(\mathbf{V} \mid C)$, where $P(\mathbf{V} \mid C)$ may change across different values of $C$, resulting in non-identical distributions of the data.
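The generating process just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' simulation setup: the two-variable system, the function `theta`, the noise scale, and the sample size are all hypothetical choices. The two windows at the end are used only to display that the conditional distribution of the second variable changes with the index; they are not part of the method.

```python
import math
import random

def theta(c):
    """Hypothetical smooth changing parameter of the second module."""
    return 1.0 + math.sin(2.0 * math.pi * c)

def generate_nonstationary(n, seed=0):
    """Draw C ~ Uniform(0, 1), generate (V1, V2) from the SEM, then sort by C."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random()
        v1 = rng.gauss(0.0, 1.0)                        # fixed module P(V1)
        v2 = theta(c) * v1 + 0.3 * rng.gauss(0.0, 1.0)  # changing module P(V2 | V1)
        rows.append((c, v1, v2))
    rows.sort(key=lambda r: r[0])                       # ascending order of C
    return rows

def slope(rows):
    """Least-squares slope of V2 on V1 within a block of rows."""
    n = len(rows)
    m1 = sum(r[1] for r in rows) / n
    m2 = sum(r[2] for r in rows) / n
    sxy = sum((r[1] - m1) * (r[2] - m2) for r in rows)
    sxx = sum((r[1] - m1) ** 2 for r in rows)
    return sxy / sxx

data = generate_nonstationary(4000)
early = [r for r in data if r[0] < 0.25]
late = [r for r in data if 0.5 < r[0] < 0.75]
print(f"slope for C < 0.25: {slope(early):.2f}; for 0.5 < C < 0.75: {slope(late):.2f}")
```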

3.2 Detection of Changing Modules and Recovery of Causal Skeleton

In this section, we propose a method to detect variables whose causal modules change and infer the skeleton of $G$. The basic idea is simple: we use the (observed) variable $C$ as a surrogate for the unobserved $\{g_l(C)\}_{l=1}^L \cup \{\theta_i(C)\}_{i=1}^m$, or in other words, we take $C$ to capture C-specific information. We now show that given the assumptions in Section 3.1, we can apply conditional independence tests to $\mathbf{V} \cup \{C\}$ to detect variables with changing modules and recover the skeleton of $G$. We consider $C$ as a surrogate variable (it is not itself a causal variable, it is always available, and confounders and changing parameters are functions of it): by adding only $C$ to the variable set $\mathbf{V}$, the skeleton of $G$ and the changing causal modules can be estimated as if $\{g_l(C)\}_{l=1}^L \cup \{\theta_i(C)\}_{i=1}^m$ were known. This is achieved by Algorithm 1 and supported by Theorem 1.

  1. Build a complete undirected graph $U_G$ on the variable set $\mathbf{V} \cup \{C\}$.

  2. (Detection of changing modules) For each $i$, test for the marginal and conditional independence between $V_i$ and $C$. If they are independent given a subset of $\{V_k \mid k \neq i\}$, remove the edge between $V_i$ and $C$ in $U_G$.

  3. (Recovery of causal skeleton) For every $i \neq j$, test for the marginal and conditional independence between $V_i$ and $V_j$. If they are independent given a subset of $\{V_k \mid k \neq i, k \neq j\} \cup \{C\}$, remove the edge between $V_i$ and $V_j$ in $U_G$.

Algorithm 1: Detection of Changing Modules and Recovery of Causal Skeleton
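The three steps of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `cd_nod_phase1` and the `independent(a, b, cond)` callback interface are hypothetical, the search over conditioning sets is exhaustive (SGS-style, hence exponential in the number of variables), and the statistical test is replaced by a hand-written independence oracle for the toy graph V0 → V1 → V2 with C → V1.

```python
from itertools import combinations

def subsets(items):
    """All subsets of items, smallest first."""
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def cd_nod_phase1(n_vars, independent, c="C"):
    """Return (edges, changing): the skeleton over V and C, and indices adjacent to C.

    independent(a, b, cond) is a user-supplied conditional independence test
    (hypothetical interface) returning True when a and b are independent given cond.
    """
    nodes = list(range(n_vars))
    # Step 1: complete undirected graph on V plus the surrogate variable C.
    edges = {frozenset(p) for p in combinations(nodes + [c], 2)}
    # Step 2: drop V_i - C if some subset of the other V's separates them.
    for i in nodes:
        others = [k for k in nodes if k != i]
        if any(independent(i, c, set(s)) for s in subsets(others)):
            edges.discard(frozenset((i, c)))
    # Step 3: drop V_i - V_j if some subset of the remaining V's and C separates them.
    for i, j in combinations(nodes, 2):
        others = [k for k in nodes if k not in (i, j)] + [c]
        if any(independent(i, j, set(s)) for s in subsets(others)):
            edges.discard(frozenset((i, j)))
    changing = sorted(i for i in nodes if frozenset((i, c)) in edges)
    return edges, changing

def oracle(a, b, cond):
    """Independence oracle for the toy graph V0 -> V1 -> V2 with C -> V1."""
    pair = {a, b}
    if pair == {0, "C"}:
        return not (cond & {1, 2})   # V1 is a collider and V2 is its descendant
    if pair == {2, "C"} or pair == {0, 2}:
        return 1 in cond             # V1 blocks the only connecting path
    return False                     # adjacent pairs are never separated

edges, changing = cd_nod_phase1(3, oracle)
skeleton = sorted("-".join(sorted(map(str, e))) for e in edges)
print(skeleton, changing)
```

In practice the oracle would be replaced by a nonparametric conditional independence test, and a PC-style search that restricts conditioning sets to current neighbors would be preferable for larger variable sets.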

The procedure given in Algorithm 1 outputs an undirected graph $U_G$ that contains $C$ as well as $\mathbf{V}$. In Step 2, whether a variable $V_i$ has a changing module is decided by whether $V_i$ and $C$ are independent conditional on some subset of other variables. The justification for one side of this decision is trivial. If $V_i$'s module does not change, that means $P(V_i \mid PA^i)$ remains the same for every value of $C$, and so $V_i \perp C \mid PA^i$. Thus, if $V_i$ and $C$ are not independent conditional on any subset of other variables, $V_i$'s module changes with $C$, which is represented by an edge between $V_i$ and $C$. Conversely, we assume that if $V_i$'s module changes, which entails that $V_i$ and $C$ are not independent given $PA^i$, then $V_i$ and $C$ are not independent given any other subset of $\mathbf{V} \setminus \{V_i\}$. If this assumption does not hold, then we only claim to detect some (but not necessarily all) variables with changing modules.

Step 3 aims to discover the skeleton of the causal structure over $\mathbf{V}$. In practice, one may apply any constraint-based search procedure on $\mathbf{V} \cup \{C\}$, e.g., SGS and PC (Spirtes et al., 1993). Its (asymptotic) correctness is justified by the following theorem:

Theorem 1

Given Assumptions 1 and 2, for every $V_i, V_j \in \mathbf{V}$, $V_i$ and $V_j$ are not adjacent in $G$ if and only if they are independent conditional on some subset of $\{V_k \mid k \neq i, k \neq j\} \cup \{C\}$.

Basic idea of the proof. For a complete proof see Appendix A. The “only if” direction is proved by making use of the weak union property of conditional independence repeatedly, the fact that all $g_l(C)$ and $\theta_i(C)$ are deterministic functions of $C$, some implications of the SEMs in Eq. 2, the assumptions in Section 3.1, and the properties of mutual information given in (Madiman, 2008). The “if” direction is shown based on the faithfulness assumption on $G^{aug}$ and the fact that $\{g_l(C)\}_{l=1}^L \cup \{\theta_i(C)\}_{i=1}^m$ is a deterministic function of $C$.

For any pair of nonadjacent variables and with and , we can easily detect whether there are pseudo confounders behind and from the independence test results derived from Algorithm 1:

  • if and , then there exist pseudo confounders behind and ;

  • if , then there is no pseudo confounder behind and ;

with .

Note that in Algorithm 1, it is crucial to use a general, nonparametric conditional independence test, because how variables depend on $C$ is unknown and usually highly nonlinear. In this work, we use the kernel-based conditional independence test (KCI-test (Zhang et al., 2011)) to capture the dependence on $C$ in a nonparametric way. By contrast, if we use, for example, tests of vanishing partial correlations, as is widely done in the neuroscience community, the proposed method may not work well.
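To make the contrast with correlation-based tests concrete, the sketch below compares the Pearson correlation with a kernel dependence measure on a variable whose mean varies nonlinearly with the index. It is an illustration under stated assumptions and not the KCI-test used in this work: it uses HSIC with Gaussian kernels, a median-distance bandwidth, and a permutation null, it handles only the marginal case (empty conditioning set), and the cosine-shaped change, noise level, and sample size are hypothetical.

```python
import math
import random
from statistics import median

def gram(values):
    """Gaussian kernel matrix with the median-distance bandwidth heuristic."""
    n = len(values)
    dists = [abs(values[i] - values[j]) for i in range(n) for j in range(i + 1, n)]
    width = median(dists) or 1.0
    return [[math.exp(-((a - b) ** 2) / (2.0 * width ** 2)) for b in values]
            for a in values]

def center(mat):
    """Double-center a symmetric kernel matrix."""
    n = len(mat)
    row = [sum(r) / n for r in mat]
    total = sum(row) / n
    return [[mat[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]

def hsic_permutation_test(x, c, n_perm=100, seed=0):
    """Return (HSIC statistic, permutation p-value) for independence of x and c."""
    n = len(x)
    k_c, l_x = gram(c), center(gram(x))

    def stat(p):
        # Biased HSIC estimate with the samples of c reordered by p.
        return sum(k_c[p[i]][p[j]] * l_x[i][j]
                   for i in range(n) for j in range(n)) / n ** 2

    idx = list(range(n))
    observed = stat(idx)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)                 # break any dependence between x and c
        exceed += stat(idx) >= observed
    return observed, (1 + exceed) / (1 + n_perm)

def pearson(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return sab / (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5

n = 120
rng = random.Random(1)
c = [i / n for i in range(n)]                                    # evenly sampled time index
x = [math.cos(2.0 * math.pi * t) + 0.2 * rng.gauss(0.0, 1.0) for t in c]

r = pearson(x, c)
hsic, p_value = hsic_permutation_test(x, c)
print(f"Pearson r = {r:.3f}, HSIC = {hsic:.4f}, permutation p-value = {p_value:.3f}")
```

The linear correlation is close to zero because the change is symmetric over the period, while the kernel statistic lies well outside its permutation null, so only the nonparametric test would keep the edge between the variable and the index.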

4 CD-NOD Phase II: Distribution Shifts Benefit Causal Direction Determination

We now show that introducing the additional variable $C$ as a surrogate not only allows us to infer the skeleton of the causal structure but also facilitates the determination of some causal directions. Let us call those variables that are adjacent to $C$ in the output of Algorithm 1 “C-specific variables”, which are actually the effects of changing causal modules. For each C-specific variable $V_k$, it is possible to determine the direction of every edge which has an endpoint on $V_k$. Let $V_l$ be any variable adjacent to $V_k$ in the output of Algorithm 1. Then there are two possible scenarios to consider:

  1. is not adjacent to . Then forms an unshielded triple. For practical purposes, we can take the direction between and as (though we do not claim to be a cause in any substantial sense). Then we can use standard orientation rules for unshielded triples to orient the edge between and  (Spirtes et al., 2001; Pearl, 2000). There are two possible situations:
    1.a If and are independent given a set of variables excluding , then the triple is a V-structure, and we have .
    1.b Otherwise, if and are independent given a set of variables including , then the triple is not a V-structure, and we have .

  2. is also adjacent to . This case is more complex than Case 1, but it is still possible to identify the causal direction between and , based on the principle that and change independently; a heuristic method is given in Section 4.2.

The procedure in Case 1, which will be further discussed in Section 4.1, contains the methods proposed in (Hoover, 1990; Tian and Pearl, 2001; Peters et al., 2016) for causal discovery from changes as special cases. It may also be interpreted as a special case of the principle underlying the method for Case 2: if one of and changes while the other remains invariant, they are clearly independent.

4.1 Causal Direction Identification by Generalization of Invariance

There exist methods for causal discovery from differences among multiple data sets (Hoover, 1990; Tian and Pearl, 2001; Peters et al., 2016) that explore invariance of causal mechanisms. They used linear models to represent causal mechanisms and, as a consequence, the invariance of causal mechanisms can be assessed by checking whether the involved parameters change across data sets or not. Case 1.b above provides a nonparametric way to achieve this by means of nonparametric conditional independence tests. For any variable and a set of variables , the conditional distribution is invariant across different values of if and only if

This is exactly the condition under which . In words, testing for invariance (or homogeneity) of the conditional distribution is naturally achieved by performing a conditional independence test on and given the variable , for which there exist off-the-shelf algorithms and implementations. When is the empty set, this reduces to a test of marginal independence between and , or a test of homogeneity of .

In Case 1.a, we have the invariance of when the causal mechanism, represented by , changes, which is complementary to the invariance of causal mechanisms. The (conditional) independence test results between and are readily available from Algorithm 1 and can be applied to determine causal directions between variables which satisfy Case 1. The procedure is summarized in Algorithm 2.

  1. Input: causal skeleton from Algorithm 1.

  2. Orient , for any variable which is adjacent to .

  3. For any unshielded triple with , where is not adjacent to ,

    • if , with and , orient ;

    • if , with and , orient .

  4. Output: partially oriented graph using the property of generalization of invariance.

Algorithm 2 Causal Direction Identification by Generalization of Invariance
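
The orientation steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the dictionary-based graph representation, the node label "C" for the surrogate variable, the `sepsets` dictionary (assumed to store the conditioning sets found by Algorithm 1), and the function name are all assumptions introduced here. The two branches follow cases 1.a and 1.b above.

```python
def orient_by_invariance(adjacency, sepsets, c="C"):
    """Orient edges around the surrogate variable (hypothetical helper).

    adjacency: dict mapping each node to the set of its neighbours in the
               undirected skeleton (surrogate node included).
    sepsets:   dict mapping frozenset({c, v}) to the set of variables that
               rendered c and v independent in the skeleton search.
    Returns a set of directed edges written as (tail, head).
    """
    directed = set()

    # Step 2: orient surrogate -> V_k for every V_k adjacent to the surrogate.
    for vk in adjacency[c]:
        directed.add((c, vk))

    # Step 3: unshielded triples surrogate - V_k - V_l,
    # where V_l is not adjacent to the surrogate.
    for vk in adjacency[c]:
        for vl in adjacency[vk]:
            if vl == c or vl in adjacency[c]:
                continue
            sep = sepsets.get(frozenset((c, vl)))
            if sep is None:
                continue  # no recorded separating set: leave unoriented
            if vk in sep:
                directed.add((vk, vl))  # not a V-structure: V_k -> V_l
            else:
                directed.add((vl, vk))  # V-structure: V_l -> V_k
    return directed


# Toy skeleton: C - A - B and C - D - E.
adjacency = {
    "C": {"A", "D"},
    "A": {"C", "B"},
    "B": {"A"},
    "D": {"C", "E"},
    "E": {"D"},
}
sepsets = {
    frozenset(("C", "B")): {"A"},   # C and B independent given A
    frozenset(("C", "E")): set(),   # C and E marginally independent
}
edges = orient_by_invariance(adjacency, sepsets)
```

In the toy skeleton, the triple C - A - B is oriented as a chain because A belongs to the separating set, while C - D - E is oriented as a V-structure at D.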

Naturally, both invariance properties above are particular cases of the principle of independent changes of causal modules underlying the method for Case 2: if one of and changes while the other remains invariant, they are clearly independent. Usually, there is no reason why only one of them could change, so the above invariance properties are rather restrictive. The property of independent changes holds in rather generic situations, e.g., when there is no confounder behind cause and effect, or even when there are confounders but the confounders are independent of . Below we will propose an algorithm for causal direction determination based on independent changes.

4.2 Causal Direction Identification by Independently Changing Modules

We now develop a method to handle Case 2 above. To clearly express the idea, let us start with a two-variable case: suppose and are adjacent and are both adjacent to . We aim to identify the causal direction between them, which, without loss of generality, we assume to be .

Figure 3(a) shows the case where the involved changing parameters, and , are independent, i.e., and change independently (we dropped the argument in and to simplify notation).

Figure 3: An illustration of a two-variable case: with corresponding parameters and changing independently.

For the reverse direction, one can decompose the joint distribution of according to

(3)

where and are assumed to be sufficient for the corresponding distribution modules and . Generally speaking, and are not independent, because they are determined jointly by and .

Now we face the problem of how to compare the dependence between and with that between and . Since is assumed to be sufficient for the corresponding distribution module, it is equivalent to compare the dependence between and with that between and .

The idea that causal modules are independent is not new (Pearl, 2000), but note that in a stationary situation where each module is fixed, such independence is very difficult, if not impossible, to test. By contrast, in the situation we are considering presently, both and are changing, and we can try to measure the extent to which variation in and variation in are dependent (and similarly for and ). This is the sense in which distribution change actually helps in the identification of causal directions, and as far as we know, this is the first time that such an advantage is exploited in the case where both and change.

We extend the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependence between causal modules. To do so, in Section 4.2.1 we first develop a novel kernel embedding of nonstationary conditional distributions that does not rely on sliding windows, and estimate the corresponding Gram matrices. These Gram matrices are then used in the extended HSIC and the causal direction determination rule in Section 4.2.2. In Section 4.2.3, we propose an algorithm for causal direction determination in multi-variable cases by taking advantage of independent changes.

4.2.1 Kernel Embedding of Constructed Joint Distributions

Notation

Throughout this section, we use the following notation. Let be a random variable on domain , and be a Reproducing Kernel Hilbert Space (RKHS) with a measurable kernel on . Let represent the feature map for each , with . We assume integrability: . Similar notation is used for variables and . The cross-covariance operator is defined as , where is the RKHS associated with .

We represent causal modules by kernel embedding. Intuitively, to represent the kernel embedding of changing causal modules, we need to consider for each value of separately. If is a domain index, for each value of we have a dataset of . If is a time index, one may use a sliding window to use the data of in the window of length centered at . However, in some cases, it might be hard to find an appropriate window length , especially when the causal module changes fast. In the following, we propose a way to estimate the kernel embedding of changing causal modules on the whole dataset, avoiding window segmentation. For the sake of conciseness, below we use and to denote and , respectively.

Suppose that there are samples for each variable. Instead of working with directly, we “virtually” construct a particular distribution as follows (here we use instead of to emphasize that in this constructed distribution and are not symmetric):

(4)

Since does not depend on and its support is rich enough to contain that of , one can see that whenever there are changes in across different values of , there must be changes in , and vice versa. In other words, the constructed distribution captures changes in across different . We let .

Proposition 1 shows that the kernel embedding of the distribution can be estimated on the whole data set, without window segmentation.

Proposition 1

Let represent the direct causes of , and suppose that they have observations. The kernel embedding of distribution can be represented as

where , , , , , and represents point-wise product.

The detailed proof of Proposition 1 is given in Appendix B.

Next we estimate the Gram matrix of . We consider different kernels for the estimation of the Gram matrix. Let represent the Gram matrix of with a linear kernel, and the Gram matrix of with a Gaussian kernel.

If we use a linear kernel, the th entry of the Gram matrix is the inner product between and :

(5)

which is the th entry of the matrix

(6)

If we use a Gaussian kernel with kernel width , the Gram matrix is given by

(7)

where denotes the Frobenius norm. This can be represented in matrix notation as

(8)

where sets all off-diagonal entries to zero, and is a matrix with all entries 1.

We can see that with our methods, we do not need to explicitly learn the high-dimensional kernel embedding for each . With the kernel trick, the final Gram matrix can be represented by kernel matrices directly.
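
The step from the linear-kernel Gram matrix to the Gaussian-kernel Gram matrix can be illustrated with a short sketch. It relies only on the identity that the squared distance between two embeddings equals the sum of their self inner products minus twice their cross inner product, so the Gaussian Gram matrix is obtained from the diagonal and the entries of the linear one. The function name and the explicit finite-dimensional features used for the check are assumptions made here for illustration.

```python
import numpy as np


def gaussian_gram_from_linear(M_l, sigma):
    """Gaussian-kernel Gram matrix computed from a linear-kernel Gram matrix.

    M_l[i, j] is assumed to hold the inner product between embeddings i and j,
    so that ||mu_i - mu_j||^2 = M_l[i, i] + M_l[j, j] - 2 * M_l[i, j].
    """
    d = np.diag(M_l)
    sq_dist = d[:, None] + d[None, :] - 2.0 * M_l
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Sanity check with explicit (finite-dimensional) features.
rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3))   # five embeddings with three coordinates each
M_l = F @ F.T                 # linear-kernel Gram matrix
M_g = gaussian_gram_from_linear(M_l, sigma=1.5)
```

Because only inner products enter the computation, the embeddings themselves never need to be formed explicitly.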

There are several hyperparameters to set. The hyperparameters associated with , , and the regularization parameter in equation (6) are learned through a Gaussian process regression framework, by maximizing the marginal likelihood of . For the hyperparameters associated with and the kernel width in equation (7), we set them to empirical values; please refer to Zhang et al. (2011) for details.

Change in marginal distributions. As a special case, when we are concerned with how the marginal distribution of changes with , i.e., when , we directly make use of

(9)

This can also be obtained by constraining in to take a fixed value. Its empirical estimate is

(10)

Then the th entry of the Gram matrix with a linear kernel is:

(11)

which is the th entry of

(12)

For a Gaussian kernel with kernel width , the Gram matrix is

(13)
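
For the marginal case, a linear-kernel Gram matrix of the changing distribution can be sketched as below. The sketch assumes the standard regularized conditional mean embedding estimator, under which the inner product between the embeddings at two values of the surrogate variable is a quadratic form in the kernel matrices; this may differ in detail from equations (10) to (12). The helper names, the kernel widths, and the regularization value are illustrative assumptions, not learned hyperparameters.

```python
import numpy as np


def rbf_gram(a, width):
    """Gaussian kernel matrix of the samples in a (1-D or N x d array)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    sq = np.sum(a ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * width ** 2))


def marginal_module_gram(K_x, K_c, lam):
    """Linear-kernel Gram matrix of the embeddings of P(X | C = c_t).

    Assumed estimator: K_c (K_c + lam I)^{-1} K_x (K_c + lam I)^{-1} K_c.
    """
    n = K_c.shape[0]
    A = np.linalg.solve(K_c + lam * np.eye(n), K_c)  # (K_c + lam I)^{-1} K_c
    return A.T @ K_x @ A


# Toy data: the mean of X drifts smoothly with a time index c.
rng = np.random.default_rng(0)
N = 60
c = np.arange(N) / N
x = np.sin(2 * np.pi * c) + 0.1 * rng.normal(size=N)
M_l = marginal_module_gram(rbf_gram(x, 1.0), rbf_gram(c, 0.2), lam=1e-2)
```

The whole data set enters through the two kernel matrices, so no window segmentation over the time index is needed.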

4.2.2 Two-Variable Case

In this section, we extend HSIC to measure the dependence between causal modules, based on which we determine causal directions.

For simplicity, let us start with the two-variable case: suppose that and are adjacent and both are adjacent to . We aim to identify the causal direction between them, which, without loss of generality, we assume to be . The guiding idea is that distribution shift may carry information that confirms the independence of causal modules, which, in the simple case we are considering, is the independence between and . If and are independent but and are not, then the causal direction is inferred to be from to .

The dependence between and can be measured by extending the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2008).

HSIC

Given a set of observations for variables and , respectively, HSIC provides a measure of dependence and a statistic for testing their statistical independence. Roughly speaking, it measures the squared covariances between feature maps of and feature maps of . Let and be the Gram matrices for and calculated on the sample, respectively. An estimator of HSIC is given by (Gretton et al., 2008)

(14)

where is used to center the features, with entries .

In what follows, we will use a normalized version of the estimated HSIC, which is invariant to the scale in and :

(15)
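
Both estimators can be sketched in a few lines of Python. The unnormalized estimator below uses one common form, the trace of the product of the two centered Gram matrices divided by (N-1)^2. The normalization, which divides the same trace by the product of the traces of the individual centered Gram matrices, is an assumption chosen here because it is invariant to rescaling either Gram matrix. Function names and the kernel width are illustrative.

```python
import numpy as np


def rbf_gram(a, width=1.0):
    """Gaussian kernel matrix of the samples in a (1-D or N x d array)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    sq = np.sum(a ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * width ** 2))


def centering_matrix(n):
    """H = I - (1/n) 11^T, with entries delta_ij - 1/n."""
    return np.eye(n) - np.ones((n, n)) / n


def hsic(M_x, M_y):
    """Biased HSIC estimate from two Gram matrices."""
    n = M_x.shape[0]
    H = centering_matrix(n)
    return np.trace(M_x @ H @ M_y @ H) / (n - 1) ** 2


def normalized_hsic(M_x, M_y):
    """Scale-invariant variant (assumed form of the normalization)."""
    n = M_x.shape[0]
    H = centering_matrix(n)
    return np.trace(M_x @ H @ M_y @ H) / (np.trace(M_x @ H) * np.trace(M_y @ H))


# Toy comparison: a strongly dependent pair versus an independent pair.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)
y_ind = rng.normal(size=200)
M_x, M_dep, M_ind = rbf_gram(x), rbf_gram(y_dep), rbf_gram(y_ind)
print(normalized_hsic(M_x, M_dep), normalized_hsic(M_x, M_ind))
```
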

Dependence between Changing Modules

In our case, we aim to check whether and change independently when changes. We work with the estimate of their embeddings. Then we can think of as the observed data pairs and measure their dependence from the data pairs.

This can be done by applying (the normalized version of) the estimate of HSIC given in Eq.15 to the above data pairs. The expression then involves , the Gram matrix of at , and , the Gram matrix of at . In particular, the dependence between and on the given data can be estimated by

(16)

Similarly, for the hypothetical direction the dependence between and on the data is estimated by

(17)

We then have the following rule to determine the causal direction between and .

Causal Direction Determination Rule

Suppose that and are two random variables with observations. We assume that and are adjacent and both are adjacent to and assume no pseudo confounders behind them. The causal direction between and is then determined according to the following rule:

  • if , output ;

  • if , output .

In practice, there may exist pseudo confounders. In such a case, we set a threshold on . If and , we conclude that there are pseudo confounders which influence both and and leave the direction undetermined.
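
The decision rule, including the thresholding step for pseudo confounders, can be sketched as follows. The comparison direction is inferred from the guiding idea stated above (the direction whose modules change more independently is preferred); the function name, the argument names, and the string labels are assumptions introduced for illustration, and the threshold value in the example is arbitrary.

```python
def decide_direction(delta_fwd, delta_bwd, alpha=None):
    """Choose a causal direction from two module-dependence estimates.

    delta_fwd: estimated dependence between the modules under the
               hypothesis "first -> second".
    delta_bwd: estimated dependence under "second -> first".
    alpha:     optional threshold; if both estimates exceed it, pseudo
               confounders are suspected and no direction is returned.
    """
    if alpha is not None and delta_fwd > alpha and delta_bwd > alpha:
        return "undetermined"
    if delta_fwd < delta_bwd:
        return "first -> second"
    if delta_bwd < delta_fwd:
        return "second -> first"
    return "undetermined"  # exact tie


print(decide_direction(0.02, 0.30))
print(decide_direction(0.30, 0.40, alpha=0.1))
```

Returning "undetermined" in the thresholded case mirrors the choice to leave the edge unoriented when both hypothetical directions show dependent modules.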

4.2.3 With More Than Two Variables

The causal direction determination rule in the two-variable case can be extended to learn causal directions in multi-variable cases. Suppose that we have observed random variables and a partially oriented graph derived from Algorithms 1 and 2. Let be the subset of , such that if and only if ’s causal module changes.

Before moving forward, we first define deconfounding set and potential deconfounding set of a pair of adjacent variables in .

Definition 1 (Deconfounding Set)

A set of variables is the deconfounding set of a pair of adjacent variables , if

  1. no node in is a descendant of or ,

  2. and blocks every path between and that contains arrows into and .

Furthermore, a set of variables is the minimal deconfounding set of a pair of adjacent variables , if any is not a deconfounding set.

Definition 2 (Potential Deconfounding Set)

A set of variables is the potential deconfounding set of a pair of adjacent variables , if

  1. no node in is a descendant of or ,

  2. blocks every path between and that does not contain an arrow out of or ,

  3. and any is not in the deconfounding set.

Furthermore, a set of variables is the minimal potential deconfounding set of a pair of adjacent variables , if any subset is not a potential deconfounding set.

In Figure 4, for example, the set is a minimal deconfounding set of , and the set is a minimal potential deconfounding set of .

Figure 4: An illustration of the definitions of minimal deconfounding set and minimal potential deconfounding set on a partially oriented graph.

We take advantage of the independence between causal modules to identify directions. To efficiently identify causal directions using independent changes when there are multiple variables, we propose Algorithm 3, whose main procedure is as follows. For each undirected pair and , we denote their minimal deconfounding set by and their minimal potential deconfounding set by . Note that there may be multiple minimal deconfounding sets and multiple minimal potential deconfounding sets; to search for such sets more efficiently, we only consider variables in or that are adjacent to . Let be a subset of , where is the total cardinality of and ; i.e., . We evaluate the dependence between and , and that between and . If we find that and that , we output , and if there are unoriented edges from variables in to or , then we consider those variables as parents. Similarly, if we find that and that , we output