1 Introduction
Causal models play a fundamental role in modern scientific endeavor (Pearl, 2009). While randomized controlled trials are the gold standard, such an approach is infeasible or unethical in many scenarios (Spirtes and Zhang, 2016). Even when it is possible to run randomized controlled trials, the number of experiments required may raise practical challenges (Eberhardt et al., 2005). Furthermore, large data sets made publicly available on the internet are often intended to be generic and thus cannot be strongly based on specific interventions; a prominent example is the Human Connectome Project, which collects resting-state fMRI data from over 500 subjects (Van Essen et al., 2012). As such, it is both necessary and important to develop causal discovery methods through which to uncover causal structure from (potentially large-scale) passively observed data. Such data, collected without explicit manipulation of any variables, is often termed observational data.
The intrinsic appeal of causal discovery methods is that they allow us to uncover the underlying causal structure of complex systems, providing an explicit description of the underlying generative mechanisms. Within the context of machine learning, causal discovery has also been shown to play an important role in domains such as semi-supervised and transfer learning
(Schölkopf et al., 2012; Zhang et al., 2013), covariate shift, and algorithmic fairness (Kusner et al., 2017). As a result, a wide range of methods have been proposed (Shimizu et al., 2006; Pearl, 2009; Hoyer et al., 2009; Zhang and Hyvärinen, 2009; Peters et al., 2016; Zhang et al., 2017). However, many current methods rely on restrictive assumptions regarding the nature of the causal relationships. For example, Shimizu et al. (2006) assume linear causal models with non-Gaussian disturbances and demonstrate that independent component analysis (ICA) may be employed to uncover causal structure. Hoyer et al. (2009) provide an extension to nonlinear causal models under the assumption of additive noise.

In this paper we propose a general method for bivariate causal discovery in the presence of general nonlinearities. The proposed method is able to uncover nonlinear causal relationships without requiring assumptions such as linear causal structure or additive noise. Our approach exploits a correspondence between a nonlinear ICA model and nonlinear causal models, and is specifically tailored to observational data collected across a series of distinct experimental conditions or regimes. Given such data, we seek to exploit the nonstationarity introduced by distinct experimental conditions in order to perform causal discovery. We demonstrate that if latent sources can be recovered via nonlinear ICA, then a series of independence tests may be employed to uncover causal structure. As an alternative to independence testing, we further propose a novel measure of nonlinear causal direction based on an asymptotic approximation to the likelihood ratio.
2 Preliminaries
In this section we introduce the class of causal models studied. We also present an overview of nonlinear ICA methods based on contrastive learning, upon which we base the proposed method.
2.1 Model Definition
Suppose we observe $d$-dimensional random variables $\mathbf{x} = (x_1, \ldots, x_d)$ with joint distribution $p(\mathbf{x})$. The objective of causal discovery is to use the observed data, which give the empirical version of $p(\mathbf{x})$, to infer the associated causal graph which describes the data-generating procedure (Pearl, 2009).

A structural equation model (SEM) is here defined (generalizing the traditional definition) as a collection of structural equations:

x_i = f_i(\mathrm{PA}_i, n_i), \quad i = 1, \ldots, d,   (1)

together with a joint distribution, $p(\mathbf{n})$, over disturbance (noise) variables, $\mathbf{n} = (n_1, \ldots, n_d)$, which are assumed to be mutually independent. We write $\mathrm{PA}_i$ to denote the parents of the variable $x_i$. The causal graph, $\mathcal{G}$, associated with the SEM in equation (1) is a graph consisting of one node corresponding to each variable $x_i$; throughout this work we assume $\mathcal{G}$ is a directed acyclic graph (DAG).
While the functions $f_i$ in equation (1) can be any (possibly nonlinear) functions, to date the causal discovery community has focused on specific special cases in order to obtain identifiability results as well as provide practical algorithms. Pertinent examples include: a) the linear non-Gaussian acyclic model (LiNGAM; Shimizu et al., 2006), which assumes each $f_i$ is a linear function and the $n_i$ are non-Gaussian; b) the additive noise model (ANM; Hoyer et al., 2009), which assumes the noise is additive: $x_i = f_i(\mathrm{PA}_i) + n_i$; and c) the post-nonlinear causal model, which also captures possible nonlinear distortion in the observed variables (Zhang and Hyvärinen, 2009).
The aforementioned approaches enforce strict constraints on the functional class of the SEM. Such constraints are necessary in general: without them, for any two variables one can always express either one as a function of the other and independent noise (Hyvärinen and Pajunen, 1999). We are motivated to develop novel causal discovery methods which benefit from new identifiability results, established from a different angle, in the context of general nonlinear (and non-additive) relationships. A key component of our method exploits recent advances in nonlinear ICA, which we review next.
2.2 Nonlinear ICA via TCL
We briefly outline the recently proposed Time Contrastive Learning (TCL) algorithm, through which it is possible to demix (or disentangle) latent sources from observed nonlinear mixtures; this algorithm provides hints as to the identifiability of causal direction between two variables in general nonlinear cases under certain assumptions and is exploited in our causal discovery method. For further details we refer readers to Hyvärinen and Morioka (2016), but we also provide a brief review in Supplementary Material A. We assume we observe $d$-dimensional data, $\mathbf{X}$, generated according to a smooth and invertible nonlinear mixture $\mathbf{f}$ of independent latent variables $\mathbf{S}$. In particular, we have

\mathbf{x} = \mathbf{f}(\mathbf{s}).   (2)
The goal of nonlinear ICA is then to recover S from X.
TCL, as introduced by Hyvärinen and Morioka (2016), is a method for nonlinear ICA which is premised on the assumption that both latent sources and observed data are nonstationary time series. Formally, they assume that while the components $s_i$ are mutually independent, the distribution of each component is piecewise stationary, implying the data can be divided into non-overlapping time segments such that the distribution varies across segments, indexed by $\tau \in \{1, \ldots, T\}$. In the basic case, the log-density of the $i$th latent source in segment $\tau$ is assumed to follow an exponential family distribution such that:

\log p_\tau(s_i) = q_{i,0}(s_i) + \lambda_i(\tau)\, q_i(s_i) - \log Z_i(\tau),   (3)

where $q_{i,0}$ is a stationary baseline and $q_i$ is a nonlinear scalar function defining an exponential family for the $i$th source. (Exponential families with more than one sufficient statistic are also allowed.) The final term in equation (3) corresponds to a normalization constant. It is important to note that the parameters $\lambda_i(\tau)$ are functions of the segment index, $\tau$, implying that the distribution of sources will vary across segments. It follows from equation (2) that the observations $\mathbf{X}$ may also be divided into non-overlapping segments indexed by $\tau$. We write $\mathbf{x}_t$ to denote the $t$th observation and $\tau_t$ to denote its corresponding segment.
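As a concrete illustration of the source model in equation (3), the following numpy sketch simulates scale-modulated Laplace sources, one simple member of the exponential family; the function name and parameter ranges here are our own choices, not prescribed by the model:

```python
import numpy as np

def simulate_sources(n_segments, seg_len, dim, seed=0):
    """Simulate piecewise-stationary sources: within each segment the
    distribution is fixed, but the Laplace scale lambda_i(tau) varies
    across segments (a scale-modulated special case of equation (3))."""
    rng = np.random.default_rng(seed)
    # one scale parameter per (segment, source)
    scales = rng.uniform(0.5, 2.0, size=(n_segments, dim))
    sources, labels = [], []
    for tau in range(n_segments):
        sources.append(rng.laplace(0.0, scales[tau], size=(seg_len, dim)))
        labels.append(np.full(seg_len, tau))
    return np.vstack(sources), np.concatenate(labels), scales

S, labels, scales = simulate_sources(n_segments=8, seg_len=1000, dim=2)
```

The segment labels produced alongside the sources are exactly the auxiliary information TCL exploits below.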
TCL proceeds by defining a multinomial classification task, where each original data point $\mathbf{x}_t$ is a data point to be classified and the segment indices $\tau_t$ give the labels. Given the observations, $\mathbf{X}$, together with the associated segment labels, $\tau_t$, TCL can then be proven to recover the demixing function as well as the independent components, $\mathbf{S}$, by learning to classify the observations into their corresponding segments. In particular, TCL trains a deep neural network using multinomial logistic regression to perform this classification task. The network architecture employed consists of a feature extractor corresponding to the last hidden layer, denoted $\mathbf{h}(\mathbf{x}; \theta)$ and parameterised by $\theta$, together with a final linear layer. The central theorem on TCL is given in our notation as follows.

Theorem 1 (Hyvärinen and Morioka (2016))
Assume the following conditions hold:

1. The data follow the nonlinear ICA model of equations (2) and (3).

2. We train a neural network consisting of a feature extractor $\mathbf{h}(\mathbf{x}; \theta)$ and a final linear layer (i.e., a softmax classifier) to classify each observation to its corresponding segment label, $\tau_t$. We require the dimension of $\mathbf{h}$ to be the same as that of $\mathbf{s}$.

3. The matrix $\mathbf{L}$ with elements $L_{\tau,i} = \lambda_i(\tau) - \lambda_i(1)$, for $\tau = 1, \ldots, T$ and $i = 1, \ldots, d$, has full rank $d$.

Then in the limit of infinite data, the outputs of the feature extractor are equal to $\mathbf{q}(\mathbf{s}) = (q_1(s_1), \ldots, q_d(s_d))$, up to an invertible linear transformation.
Theorem 1 states that we may perform nonlinear ICA by training a neural network to classify the segments associated with each observation, followed by linear ICA on the hidden representations, $\mathbf{h}(\mathbf{x}_t; \theta)$. This theorem provides identifiability of this particular nonlinear ICA model, meaning that it is possible to recover the sources. This is not the case with many simpler attempts at nonlinear ICA models (Hyvärinen and Pajunen, 1999), such as the case with a single segment in the model above.

While Theorem 1 provides identifiability for a particular nonlinear ICA model, it requires a final linear unmixing of the sources (i.e., via linear ICA). However, when sources follow the piecewise stationary distribution detailed in equation (3), traditional linear ICA methods may not be appropriate, as sources will only be independent conditional on the segment. For example, it is possible that the exponential family parameters, $\lambda_i(\tau)$, are dependent across sources (e.g., they may be correlated). This problem will be particularly pertinent when data is only collected over a reduced number of segments. As such, alternative linear ICA algorithms are required to effectively employ TCL in such a setting, as addressed in Section 3.2.
3 Nonlinear Causal Discovery via Nonlinear ICA
In this section we outline the proposed method for causal discovery over bivariate data, which we term Nonlinear SEM Estimation using Non-Stationarity (NonSENS). We begin by providing intuition for the proposed method in Section 3.1, based on the connection between nonlinear ICA and nonlinear SEMs. In Section 3.2 we propose a novel linear ICA algorithm which complements TCL for the purpose of causal discovery, particularly in the presence of observational data with few segments. Our method is formally detailed in Section 3.3, which also contains a proof of identifiability. Finally, in Section 3.4 we present an alternative measure of causal direction based on asymptotic approximations to the likelihood ratio of nonlinear causal models.
3.1 Relating SEM to ICA
We assume we observe bivariate data $\mathbf{x} = (x_1, x_2)$ and write $x_1$ and $x_2$ to denote the first and second entries of $\mathbf{x}$ respectively. We will omit the observation index $t$ whenever it is clear from context. Following the notation of Peters et al. (2016), we further assume data is available over a set $\mathcal{E}$ of distinct environmental conditions. As such, each observation is allocated to an experimental condition, and we denote by $e_t \in \mathcal{E}$ the experimental condition under which the $t$th observation was generated. Let $n_e$ be the number of observations within each experimental condition $e$, such that $\sum_{e \in \mathcal{E}} n_e = n$.
The objective of the proposed method is to uncover the causal direction between $x_1$ and $x_2$. Without loss of generality, we explain the basic logic supposing that $x_1 \to x_2$, such that the associated SEM is of the form:

x_1 = f_1(n_1),   (4)
x_2 = f_2(x_1, n_2),   (5)
where $n_1, n_2$ are latent disturbances whose distribution is also assumed to vary across experimental conditions. The DAG associated with equations (4) and (5) is shown in Figure 1. Fundamentally, the proposed NonSENS algorithm exploits the correspondence between the nonlinear ICA model described in Section 2.2 and nonlinear SEMs. This correspondence is formally stated as follows: observations generated according to the piecewise stationary nonlinear ICA model of equations (2) and (3) will follow a (possibly nonlinear) SEM where each disturbance variable, $n_i$, corresponds to a latent source, $s_i$, and each structural equation, $f_i$, will correspond to an entry in the smooth, invertible function $\mathbf{f}$. We note that due to the permutation indeterminacy present in ICA, each disturbance variable, $n_i$, will only be identifiable up to some permutation of the set $\{1, 2\}$.

The proposed method consists of a two-step procedure. First, it seeks to recover latent disturbances via nonlinear ICA. Given estimated latent disturbances, the following property highlights how we may employ statistical independencies between observations and estimated sources in order to infer the causal structure:
Property 1
Assume the true causal structure follows equations (4) and (5), as depicted in Figure 1. Then, assuming each observed variable is statistically dependent on its latent disturbance (thus avoiding degenerate cases), it follows that $x_1 \perp n_2$ while $x_1 \not\perp n_1$, $x_2 \not\perp n_1$, and $x_2 \not\perp n_2$.¹

¹ We note that the property that an effect is dependent on its direct causes typically holds, although one may construct specific examples (with discrete variables or continuous variables with complex causal relations) in which an effect is independent of its direct causes. In particular, if faithfulness is assumed (Spirtes et al., 2000), the above property clearly holds.
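Property 1 can be checked numerically on a toy SEM. In the sketch below, the additive form and the use of Pearson correlation are purely our own illustrative choices; the paper uses HSIC and requires neither:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# hypothetical SEM with x1 -> x2, as in equations (4) and (5)
n1 = rng.laplace(size=n)
n2 = rng.laplace(size=n)
x1 = n1
x2 = np.tanh(2 * x1) + 0.5 * n2

def dep(a, b):
    """Crude dependence measure: absolute Pearson correlation
    (sufficient for this monotone example; HSIC is used in the paper)."""
    return abs(np.corrcoef(a, b)[0, 1])
```

The expected pattern is that only the pair $(x_1, n_2)$ is independent: `dep(x1, n2)` is near zero while the other three pairings show clear dependence, which is exactly the asymmetry the independence tests in Section 3.3 exploit.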
3.2 A Linear ICA Algorithm for Piecewise Stationary Sources
Before proceeding, we have to improve the nonlinear ICA theory of Hyvärinen and Morioka (2016). Assumptions 1–3 of Theorem 1 for TCL guarantee that the feature extractor, $\mathbf{h}(\mathbf{x}; \theta)$, will recover a linear mixture of the latent independent sources. As a result, applying a linear unmixing method to the final representations, $\mathbf{h}(\mathbf{x}_t; \theta)$, will allow us to recover the latent disturbances. However, the use of ordinary linear ICA to unmix is premised on the assumption that latent sources are independent. This is not necessarily guaranteed when sources follow the ICA model presented in equation (3) with a fixed number of segments. For example, it is possible that the parameters, $\lambda_i(\tau)$, are correlated across sources. We note that this is not a problem when the number of segments increases asymptotically and the parameters, $\lambda_i(\tau)$, are assumed to be randomly generated, as stated in Corollary 1 of Hyvärinen and Morioka (2016).
In order to address this issue, we propose an alternative linear ICA algorithm to be employed in the final stage of TCL, through which to accurately recover latent sources in the presence of a small number of segments.
The proposed linear ICA algorithm explicitly models latent sources as following the piecewise stationary distribution specified in equation (3). We write $\mathbf{x}_t$ to denote the $t$th observation, generated as a linear mixture of sources: $\mathbf{x}_t = \mathbf{A} \mathbf{s}_t$, where $\mathbf{A}$ is a square mixing matrix. Estimation of parameters proceeds via score matching (Hyvärinen, 2005), which yields an objective function of the form:

\mathcal{J}(\mathbf{W}) = \sum_t \sum_i \lambda_i(\tau_t)\, q_i''(\mathbf{w}_i^\top \mathbf{x}_t)\, \|\mathbf{w}_i\|^2 + \frac{1}{2} \sum_t \Big\| \sum_i \lambda_i(\tau_t)\, q_i'(\mathbf{w}_i^\top \mathbf{x}_t)\, \mathbf{w}_i \Big\|^2,

where $\mathbf{W} = \mathbf{A}^{-1}$ denotes the unmixing matrix (with rows $\mathbf{w}_i^\top$) and $q_i'$ and $q_i''$ denote the first and second derivatives of the nonlinear scalar functions introduced in equation (3). Details and results are provided in Supplementary B, where the proposed method is shown to outperform both FastICA and Infomax ICA, as well as the joint diagonalization method of Pham and Cardoso (2001), which is explicitly tailored for nonstationary sources.
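The score-matching principle underlying this objective can be illustrated in the simplest possible setting, fitting the precision of a zero-mean Gaussian; this is our own toy example of the estimation principle, not the multi-segment ICA objective itself:

```python
import numpy as np

# Score matching (Hyvarinen, 2005) for a toy model: a zero-mean Gaussian
# with unknown precision lam, log p(x) = -lam * x**2 / 2 + const.
# The model score is psi(x) = -lam * x, so the score-matching objective
#     J(lam) = E[psi'(x) + psi(x)**2 / 2] = -lam + lam**2 * E[x**2] / 2
# involves no normalization constant and is minimised at lam = 1 / E[x**2].
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=100_000)   # true precision 1/4

lams = np.linspace(0.01, 1.0, 500)
J = -lams + 0.5 * lams**2 * np.mean(x**2)
lam_hat = float(lams[np.argmin(J)])
```

The appeal carried over to the piecewise stationary model is the same: the objective never requires evaluating the segment-dependent normalization constants $Z_i(\tau)$.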
3.3 Causal Discovery Using Independence Tests
We now give an outline of NonSENS, which performs causal discovery by combining Property 1 with a nonlinear ICA algorithm. Notably, we employ TCL, described in Section 2.2, with the important addition that the final linear unmixing of the hidden representations, $\mathbf{h}(\mathbf{x}_t; \theta)$, is performed using the objective given in Section 3.2. The proposed method is summarized as follows:
1. Using TCL, train a deep neural network with feature extractor $\mathbf{h}(\mathbf{x}; \theta)$ to accurately classify each observation $\mathbf{x}_t$ according to its segment label $\tau_t$.

2. Perform linear unmixing of $\mathbf{h}(\mathbf{x}_t; \theta)$ using the algorithm presented in Section 3.2.

3. Perform the four tests listed in Property 1, and conclude a causal effect in the case where there is evidence to reject the null hypothesis in three of the tests and only one of the tests fails to reject the null. The variable for which the null hypothesis was not rejected is the cause variable.
Each test is run at a prespecified significance level, $\alpha$, and Bonferroni corrected in order to control the family-wise error rate. Throughout this work we employ HSIC as a test for statistical independence (Gretton et al., 2005). Theorem 2 formally states the assumptions and identifiability properties of the proposed method.
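The resulting decision rule can be sketched as follows; the function and variable names are ours, and in practice the p-values would come from HSIC tests between each observed variable and each estimated disturbance:

```python
from itertools import product

def nonsens_decision(pvals, alpha=0.05):
    """Sketch of the NonSENS decision rule (step 3 above).
    pvals[(i, j)] is the p-value of an independence test between observed
    variable x_i and estimated disturbance s_j, for i, j in {1, 2}.
    Each of the four tests is Bonferroni corrected to level alpha / 4."""
    not_rejected = [(i, j) for i, j in product((1, 2), (1, 2))
                    if pvals[(i, j)] >= alpha / 4]
    if len(not_rejected) != 1:
        return "inconclusive"       # need exactly three rejections
    cause = not_rejected[0][0]      # the observed variable in the accepted pair
    return "x1->x2" if cause == 1 else "x2->x1"
```

Note that only the observed-variable index of the accepted pair matters; the source index is immaterial because of the permutation indeterminacy of ICA.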
Theorem 2

Assume the following conditions hold:

1. We observe bivariate data $\mathbf{X}$ which has been generated from a nonlinear SEM with smooth nonlinearities and no hidden confounders.

2. Data is available over at least three distinct experimental conditions, and the latent disturbances, $n_i$, are generated according to equation (3).

3. We employ TCL, with a sufficiently deep neural network as the feature extractor, followed by linear ICA (as described in Section 3.2) on the hidden representations to recover the latent sources.

4. We employ an independence test which can capture any type of departure from independence, for example HSIC, with Bonferroni correction and significance level $\alpha$.

Then in the limit of infinite data the proposed method will identify the cause variable with probability $1 - \alpha$.

3.4 Likelihood Ratio-Based Measures of Causal Direction
While independence tests are widely used in causal discovery, they may not be statistically optimal for deciding causal direction. In this section, we further propose a novel measure of causal direction which is based on the likelihood ratio under nonlinear causal models, and which thus is likely to be more efficient.
The proposed measure can be seen as the extension of linear measures of causal direction, such as those proposed by Hyvärinen and Smith (2013), to the domain of nonlinear SEMs. Briefly, Hyvärinen and Smith (2013) consider the likelihood ratio between two candidate models of causal influence: $x_1 \to x_2$ or $x_2 \to x_1$. The log-likelihood ratio is then defined as the difference in log-likelihoods under each model:
R = \ell_{1\to2} - \ell_{2\to1},   (6)

where we write $\ell_{1\to2}$ to denote the log-likelihood under the assumption that $x_1$ is the causal variable and $\ell_{2\to1}$ for the alternative model. Under the assumption that $x_1 \to x_2$, it follows that the underlying SEM is of the form described in equations (4) and (5). The log-likelihood for a single data point may thus be written as

\log p(x_1, x_2) = \log p(x_1) + \log p(x_2 \mid x_1).

Furthermore, in the context of linear causal models, equations (4) and (5) define a bijection between $(x_1, x_2)$ and $(n_1, n_2)$ whose Jacobian has unit determinant, such that the log-likelihood can be expressed as:

\log p(x_1, x_2) = \log p_{n_1}(n_1) + \log p_{n_2}(n_2).

In the asymptotic limit we can take the expectation of the log-likelihood, which converges to:

\lim_{n \to \infty} \frac{1}{n}\, \ell_{1\to2} = -H(x_1) - H(n_2),   (7)

where $H(\cdot)$ denotes the differential entropy. Hyvärinen and Smith (2013) note that the benefit of equation (7) is that only univariate approximations of the differential entropy are required. In this section we seek to derive equivalent measures of causal direction without the assumption of linear causal effects. Recall that after training via TCL, we obtain an estimate, $\hat{\mathbf{g}}$, of the unmixing function, which is parameterized by a deep neural network.
In order to compute the log-likelihood, $\ell_{1\to2}$, we consider the following change of variables:

(x_1, x_2) \mapsto (x_1, \hat{g}_2(x_1, x_2)) = (x_1, \hat{n}_2),

where $\hat{g}_2$ refers to the second component of the estimated unmixing function $\hat{\mathbf{g}}$. Further, we note that the mapping applies the identity to the first element, thereby leaving $x_1$ unchanged. Given such a change of variables, we may evaluate the log-likelihood as follows:

\log p(x_1, x_2) = \log p_{x_1}(x_1) + \log p_{\hat{n}_2}(\hat{n}_2) + \log |\det \mathbf{J}|,

where $\mathbf{J}$ denotes the Jacobian of the mapping, and we have used $x_1 \perp \hat{n}_2$, which holds by construction under the assumption that $x_1 \to x_2$.

Due to the particular choice of mapping, we are able to easily evaluate the Jacobian, which can be expressed as:

\mathbf{J} = \begin{pmatrix} 1 & 0 \\ \partial \hat{g}_2 / \partial x_1 & \partial \hat{g}_2 / \partial x_2 \end{pmatrix}.

As a result, the determinant can be directly evaluated as $\partial \hat{g}_2 / \partial x_2$. Furthermore, since $\hat{g}_2$ is parameterized by a deep network, we can directly evaluate its derivative with respect to $x_2$. This allows us to directly evaluate the log-likelihood of $x_1$ being the causal variable as:

\ell_{1\to2} = \sum_{t=1}^{n} \left[ \log p_{x_1}\big(x_1^{(t)}\big) + \log p_{\hat{n}_2}\big(\hat{n}_2^{(t)}\big) + \log \left| \frac{\partial \hat{g}_2}{\partial x_2}\big(x_1^{(t)}, x_2^{(t)}\big) \right| \right].

Finally, we consider the asymptotic limit and obtain the nonlinear generalization of equation (7) as:

\lim_{n \to \infty} \frac{1}{n}\, \ell_{1\to2} = -H(x_1) - H(\hat{n}_2) + \mathbb{E} \left[ \log \left| \frac{\partial \hat{g}_2}{\partial x_2} \right| \right].

In practice we use the sample mean instead of the expectation.
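For intuition, the linear special case of this likelihood-ratio measure (Hyvärinen and Smith, 2013) can be sketched with a simple spacing-based entropy estimator standing in for the Kraskov et al. (2004) approximation; the estimator choice and all constants below are our own:

```python
import numpy as np

def entropy_vasicek(x, m=None):
    """Vasicek spacing estimator of differential entropy -- a simple
    stand-in for the Kraskov et al. (2004) estimator used in the paper."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = m or int(np.sqrt(n))
    idx = np.arange(n)
    lo = np.clip(idx - m, 0, n - 1)
    hi = np.clip(idx + m, 0, n - 1)
    spacings = np.maximum(x[hi] - x[lo], 1e-12)
    return float(np.mean(np.log(n * spacings / (hi - lo))))

def standardize(x):
    return (x - x.mean()) / x.std()

# Linear special case: with standardized variables and OLS residuals,
# R = [-H(x1) - H(r2)] - [-H(x2) - H(r1)], and R > 0 favours x1 -> x2.
rng = np.random.default_rng(0)
n = 20_000
x1 = standardize(rng.laplace(size=n))
x2 = standardize(0.8 * x1 + rng.laplace(size=n))
rho = float(np.mean(x1 * x2))
r2 = standardize(x2 - rho * x1)   # residual under x1 -> x2
r1 = standardize(x1 - rho * x2)   # residual under x2 -> x1
R = (-entropy_vasicek(x1) - entropy_vasicek(r2)
     + entropy_vasicek(x2) + entropy_vasicek(r1))
```

With Laplace disturbances the backward residual is a mixture and hence closer to Gaussian, so its entropy is higher and the ratio favours the true direction.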
One remaining issue to address is the permutation indeterminacy of the estimated sources (note that this permutation is not about the causal order of the observed variables): we must consider both permutations of the set $\{1, 2\}$. In order to resolve this issue, we note that if the true permutation is the identity, then assuming $x_1 \to x_2$ we have $\partial \hat{g}_2 / \partial x_2 \neq 0$ while $\partial \hat{g}_1 / \partial x_2 = 0$. This is because $\hat{g}_1$ unmixes the observations to return the latent disturbance for the causal variable, $x_1$, and is therefore not a function of $x_2$. The converse is true if the permutation is flipped. Similar reasoning can be employed for the reverse model, $x_2 \to x_1$. As such, we propose to select, for each candidate model, the permutation which maximizes the corresponding log-likelihood.

For a given permutation, $\sigma$, we may therefore compute the likelihood ratio in equation (6) as:

R = -H(x_1) - H(\hat{n}_2) + \mathbb{E}\left[\log \left| \frac{\partial \hat{g}_2}{\partial x_2} \right|\right] + H(x_2) + H(\hat{n}_1) - \mathbb{E}\left[\log \left| \frac{\partial \hat{g}_1}{\partial x_1} \right|\right].

If $R$ is positive, we conclude that $x_1$ is the causal variable, whereas if $R$ is negative, $x_2$ is reported as the causal variable. When computing the differential entropy, we employ the approximations described in Kraskov et al. (2004). We note that such approximations require variables to be standardized; in the case of latent variables this can be achieved by defining a further change of variables corresponding to a standardization.
Finally, we note that the likelihood ratio presented above can be connected to the independence measures employed in Section 3.3 when mutual information is used as a measure of statistical dependence. In particular, we have

R = I(x_2, \hat{n}_1) - I(x_1, \hat{n}_2),   (8)

where $I(\cdot, \cdot)$ denotes the mutual information between two variables. We provide a full derivation in Supplementary D. This result serves to connect the proposed likelihood ratio to independence-testing methods for causal discovery which use mutual information.
3.5 Extension to Multivariate Data
It is not straightforward to extend NonSENS to multivariate cases. Due to the permutation indeterminacy of the sources, independence testing over $d$ variables would require testing each observed variable against each estimated disturbance, on the order of $d^2$ tests, leading to a significant drop in power after Bonferroni correction. Likewise, the likelihood ratio test inherently considers only two variables.
Instead, we propose to extend the proposed method to the domain of multivariate causal discovery by employing it in conjunction with a traditional constraint-based method such as the PC algorithm, as in Zhang and Hyvärinen (2009). Formally, the PC algorithm is first employed to estimate the skeleton and orient as many edges as possible. Any remaining undirected edges are then directed using either of the proposed bivariate methods.
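The hybrid scheme can be sketched as follows; all names here are our own, and `decide` stands in for either of the proposed bivariate methods:

```python
def orient_with_bivariate(skeleton_edges, oriented, decide):
    """Hybrid PC + bivariate orientation sketch.
    `skeleton_edges` are undirected pairs left over after PC,
    `oriented` maps edges PC already directed to their cause node, and
    `decide(i, j)` is a bivariate causal-direction oracle (e.g. NonSENS)
    returning the cause node, or None if inconclusive."""
    result = dict(oriented)
    for i, j in skeleton_edges:
        cause = decide(i, j)
        if cause is not None:
            result[(i, j)] = cause
    return result

# toy oracle for illustration: the lower index is always the cause
decide = lambda i, j: min(i, j)
out = orient_with_bivariate([(2, 3)], {(1, 2): 1}, decide)
```

Because the skeleton is fixed by PC, the bivariate method is only ever asked to orient edges whose existence is already supported by conditional independence tests.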
3.6 Relationship to Previous Methods
NonSENS is closely related to linear ICA-based methods as described in Shimizu et al. (2006). However, there are important differences: LiNGAM focuses exclusively on linear causal models whilst NonSENS is specifically designed to recover arbitrary nonlinear causal structure. Moreover, the proposed method is mainly designed for bivariate causal discovery, whereas the original LiNGAM method can easily perform multivariate causal discovery by permuting the estimated ICA unmixing matrix. In this sense NonSENS is more closely aligned with the Pairwise LiNGAM method (Hyvärinen and Smith, 2013).
Peters et al. (2014) propose a nonlinear causal discovery method named regression and subsequent independence test (RESIT), which is able to recover the causal structure under the assumption of an additive noise model. RESIT essentially shares the same underlying idea as NonSENS, with the difference being that it estimates latent disturbances via nonlinear regression, as opposed to via nonlinear ICA. Related to the RESIT algorithm is the Regression Error Causal Inference (RECI) algorithm (Blöbaum et al., 2018), which proposes measures of causal direction based on the magnitude of (nonlinear) regression errors. Importantly, both of these methods restrict the nonlinear relations to have additive noise.

Recently, several methods have been proposed which seek to exploit nonstationarity in order to perform causal discovery. For example, Peters et al. (2016) propose to leverage the invariance of causal models under covariate shift in order to recover the true causal structure. Their method, termed Invariant Causal Prediction (ICP), is tailored to the setting where data is collected across a variety of experimental regimes, similar to ours. However, their main results, including identifiability, are restricted to linear or additive noise settings.
In contrast, Zhang et al. (2017) propose a method, termed CD-NOD, for causal discovery from heterogeneous, multiple-domain data or nonstationary data, which allows for general nonlinearities. Their method thus solves a problem very similar to ours, although with a very different approach: it accounts for nonstationarity, which manifests itself via changes in the causal modules, by introducing a surrogate variable representing the domain or time index into the causal DAG. Conditional independence testing is employed to recover the skeleton over the augmented DAG, but their method does not produce an estimate of the SEM to represent the causal mechanism.
4 Experimental Results
In order to demonstrate the capabilities of the proposed method we consider a series of experiments on synthetic data as well as real neuroimaging data.
4.1 Simulations on Artificial Data
In the implementation of the proposed method we employed deep neural networks of varying depths as feature extractors. All networks were trained on the cross-entropy loss using stochastic gradient descent. In the final linear unmixing required by TCL, we employ the linear ICA model described in Section 3.2. For independence testing, we employ HSIC with a Gaussian kernel. All tests are run at level $\alpha$ and Bonferroni corrected.

We benchmark the performance of the NonSENS algorithm against several state-of-the-art methods. As a measure of performance against linear methods we compare against LiNGAM; in particular, we compare performance to DirectLiNGAM (Shimizu et al., 2011). In order to highlight the need for nonlinear ICA methods, we also consider the performance of the proposed method where linear ICA is employed to estimate the latent disturbances; we refer to this baseline as Linear-ICA NonSENS. We further compare against the RESIT method of Peters et al. (2014). Here we employ Gaussian process regression to estimate nonlinear effects and HSIC as a measure of statistical dependence. Finally, we also compare against the CD-NOD method of Zhang et al. (2017) as well as the RECI method presented in Blöbaum et al. (2018). For the latter, we employ Gaussian process regression and note that this method assumes the presence of a causal effect, and it is therefore only included in some experiments. We provide a description of each of the methods in Supplementary Material E.
We generate synthetic data from the nonlinear ICA model detailed in Section 2.2. Nonstationary disturbances, $\mathbf{N}$, were randomly generated by simulating Laplace random variables with distinct variances in each segment. For the nonlinear mixing function we employ a deep neural network ("mixing-DNN") with randomly generated weights such that:

\mathbf{z}^{(1)} = \mathbf{A}^{(1)} \mathbf{n},   (9)
\mathbf{z}^{(l+1)} = \mathbf{A}^{(l+1)} g(\mathbf{z}^{(l)}),   (10)

where we write $\mathbf{z}^{(l)}$ to denote the activations at the $l$th layer and $g$ corresponds to the leaky-ReLU activation function, which is applied element-wise; the observations are taken to be the activations at the final layer. We restrict the matrices $\mathbf{A}^{(l)}$ to be lower-triangular in order to introduce acyclic causal relations. In the special case of multivariate causal discovery, we follow Peters et al. (2014) and randomly include edges with a fixed probability, which determines the expected number of edges; we present multivariate experiments in this setting. Note that equation (9) is a LiNGAM. For depths $L > 1$, equation (10) generates data with nonlinear causal structure.

Throughout the experiments we vary the following factors: the number of distinct experimental conditions (i.e., distinct segments), the number of observations per segment, as well as the depth, $L$, of the mixing-DNN. In the context of bivariate causal discovery we measure how frequently each method is able to correctly identify the cause variable. For multivariate causal discovery we consider the F1 score, which serves to quantify the agreement between estimated and true DAGs.
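A minimal sketch of this generative process also makes the acyclicity argument explicit: with lower-triangular layers, the first observed variable is a function of the first disturbance alone. The weight ranges and the leaky-ReLU slope below are our own choices:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def mixing_dnn(N, depth, seed=0):
    """Map an (n, d) array of disturbances N through random
    lower-triangular 'mixing-DNN' layers, as in equations (9)-(10)."""
    rng = np.random.default_rng(seed)
    d = N.shape[1]
    z = N.T
    for l in range(depth):
        A = np.tril(rng.uniform(0.5, 1.5, size=(d, d)))  # lower-triangular layer
        z = A @ z if l == 0 else A @ leaky_relu(z)       # eq. (9) then eq. (10)
    return z.T

rng = np.random.default_rng(1)
N = rng.laplace(size=(1000, 2))
X = mixing_dnn(N, depth=3)

# Shifting the second disturbance must leave the first observed variable
# untouched (x1 depends only on n1), while changing the second.
N2 = N.copy()
N2[:, 1] += 10.0
X2 = mixing_dnn(N2, depth=3)
```

Because every layer is lower-triangular and the activation acts element-wise, the composition of layers is itself "lower-triangular" as a function, which is what induces the acyclic causal ordering.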
Figure 2 shows results for bivariate causal discovery as the number of distinct experimental conditions increases, with the number of observations within each condition held fixed. Each horizontal panel shows the results as the depth of the mixing-DNN increases. The top panels show the proportion of times the correct cause variable was identified across 100 independent simulations. In particular, the first top panel corresponds to linear causal dependencies; as such, all methods are able to accurately recover the true cause variable. However, as the depth of the mixing-DNN increases, the causal dependencies become increasingly nonlinear and the performance of all methods deteriorates. While we attribute this drop in performance to the increasingly nonlinear nature of the causal structure, we note that the NonSENS algorithm outperforms all alternative methods.
The bottom panels of Figure 2 show the results when no directed acyclic causal structure is present. Here data was generated such that the mixing matrices were not lower-triangular; in particular, we set the off-diagonal entries to be equal and non-zero, resulting in cyclic causal structure. In the context of such data, we would expect all methods to report that the causal structure is inconclusive approximately $(1 - \alpha)$ of the time, as all tests are Bonferroni corrected at level $\alpha$. The bottom panels of Figure 2 show the proportion of times the causal structure is correctly reported as inconclusive. The results indicate that all methods are overly conservative in their testing, and become increasingly conservative as the depth, $L$, increases.
We also consider the performance of all algorithms in the context of a fixed number of experimental conditions and an increasing number of observations per condition. These results, presented in Supplementary F, demonstrate that the proposed method continues to perform competitively in such a scenario.
Furthermore, we also consider the scenario where a causal effect is assumed to exist. In such a scenario, we employ the likelihood ratio approach described in Section 3.4. In the case of algorithms such as RESIT we compare p-values in order to determine the direction. The results for these experiments are shown in Figure 3. The top panels show results as the number of experimental conditions increases, with the number of observations per condition fixed as before. The bottom panels show results for a fixed number of experimental conditions as we increase the number of observations per condition. We note that the proposed measure of causal direction outperforms the alternative algorithms. Performance in Figure 3 appears significantly higher than that shown in Figure 2 due to the fact that a causal effect is assumed to exist; this reduces the bivariate causal discovery problem to recovering the causal ordering over $x_1$ and $x_2$. The CD-NOD algorithm cannot easily be extended to assume the existence of a causal effect and is therefore not included in these experiments.
Finally, the results for multivariate causal discovery are presented in Figure 4, where we plot the F1 score between the true and inferred DAGs as the depth of the mixing-DNN increases. The proposed method is competitive across all depths. In particular, it outperforms the PC algorithm, indicating that its use to resolve undirected edges is beneficial.
4.2 Hippocampal fMRI Data
As a real-data application, the proposed method was applied to resting-state fMRI data collected from six distinct brain regions as part of the MyConnectome project (Poldrack et al., 2015). Data was collected from a single subject over 84 successive days. Further details are provided in Supplementary Material G. We treated each day as a distinct experimental condition and employed the multivariate extension of the proposed method. For each unresolved edge, we employed NonSENS as described in Section 3.3 with a 5-layer network. The results are shown in Figure 5. While there is no ground truth available, we highlight in blue all estimated edges which are feasible given the anatomical connectivity between the regions, and in red estimated edges which are not feasible (Bird and Burgess, 2008). We note that the proposed method recovers feasible directed connectivity structures for the entorhinal cortex (ERc), which is known to play a prominent role within the hippocampus.
5 Conclusion
We present a method to perform causal discovery in the context of general nonlinear SEMs in the presence of nonstationarities or distinct experimental conditions. This is in contrast to alternative methods, which often require restrictions on the functional form of the SEMs. The proposed method exploits the correspondence between nonlinear ICA and nonlinear SEMs, as originally considered in the linear setting by Shimizu et al. (2006). Notably, we established the identifiability of causal direction from a completely different angle, by making use of nonstationarity instead of constraining functional classes. Developing computationally more efficient methods for the multivariate case is one line of our future work.
References
Anthony Bell and Terrence Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Comput., 7(6):1129–1159, 1995.
Chris M. Bird and Neil Burgess. The hippocampus and memory: Insights from spatial processing. Nat. Rev. Neurosci., 9(3):182–194, 2008.
Patrick Blöbaum, Dominik Janzing, Takashi Washio, Shohei Shimizu, and Bernhard Schölkopf. Cause-effect inference by comparing regression errors. AISTATS, 2018.
Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. Proc. Twenty-First Conf. Uncertain. Artif. Intell., pages 178–184, 2005.
Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. Int. Conf. Algorithmic Learn. Theory, pages 63–77, 2005.
Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Neural Inf. Process. Syst., pages 689–696, 2009.
A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
Aapo Hyvärinen. Fast and robust fixed-point algorithm for independent component analysis. IEEE Trans. Neural Networks Learn. Syst., 10(3):626–634, 1999.
Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695–708, 2005.
Aapo Hyvärinen. Some extensions of score matching. Comput. Stat. Data Anal., 51(5):2499–2512, 2007.
Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Neural Inf. Process. Syst., 2016.
Aapo Hyvärinen and Stephen M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. J. Mach. Learn. Res., 14:111–152, 2013.
Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E, 69(6):16, 2004.
Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Neural Inf. Process. Syst., 2017.
Judea Pearl. Causality. Cambridge University Press, 2009.
Jonas Peters, J. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. J. Mach. Learn. Res., 15:2009–2053, 2014.
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. J. R. Stat. Soc. Ser. B (Statistical Methodology), pages 947–1012, 2016.
Dinh Tuan Pham and Jean-Francois Cardoso. Blind separation of instantaneous mixtures of non-stationary sources. IEEE Trans. Signal Process., 49(9):1837–1848, 2001.
Russell A. Poldrack et al. Long-term neural and physiological phenotyping of a single human. Nat. Commun., 6, 2015.
Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. Int. Conf. Mach. Learn., pages 1255–1262, 2012.
Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res., 7:2003–2030, 2006.
Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res., 12:1225–1248, 2011.
Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. Appl. Informatics, 2016.
Peter Spirtes, Clark Glymour, Richard Scheines, David Heckerman, Christopher Meek, and Thomas Richardson. Causation, Prediction and Search. MIT Press, 2000.
D. C. Van Essen et al. The Human Connectome Project: A data acquisition perspective. Neuroimage, 62(4):2222–2231, 2012.
Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. Proc. Twenty-Fifth Conf. Uncertain. Artif. Intell., pages 647–655, 2009.
Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. Proc. 30th Int. Conf. Mach. Learn., 28:819–827, 2013.
Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. Int. Jt. Conf. Artif. Intell., pages 1347–1353, 2017.
Supplementary Material
A Review: Nonlinear ICA via time contrastive learning
Here we provide a more detailed review of the theory of TCL. For further details we refer readers to Hyvärinen and Morioka (2016), from which the results presented in this section are adapted. We begin by showing that an optimally discriminative feature extractor combined with a linear multinomial classification layer learns to model the nonstationary probability density of observations within each experimental condition.
Recall that we observe multivariate data, x, generated via a smooth and invertible nonlinear mixture, f, of latent variables. As in linear ICA, the latent variables are assumed to be mutually independent. However, we also assume they are nonstationary. In particular, we assume the distribution of the latent variables to be piecewise stationary, such that we may associate a label with each observation indicating the piecewise stationary segment from which it was generated. In this manner, it is assumed that the distribution of latent variables varies across segments, as shown in Figure S.1. We write C to denote the segment label of an observation, taking values in the set of all distinct segments. For example, each segment may correspond to a distinct experimental condition. As the function f is smooth and invertible, it follows that the distribution of the observations will also vary across segments.
We may therefore consider the task of classifying observed data into the various segments as a multinomial classification task consisting of features, x, and categorical labels, C. For any observation, x, associated with true label τ, we have:
$$ p(C = \tau \mid \mathbf{x}; \theta, W, \mathbf{b}) = \frac{\exp\left(\mathbf{w}_\tau^\top \mathbf{h}(\mathbf{x}; \theta) + b_\tau\right)}{1 + \sum_{j=2}^{E} \exp\left(\mathbf{w}_j^\top \mathbf{h}(\mathbf{x}; \theta) + b_j\right)} \quad (S.1) $$
where θ denotes the parameters of the neural network feature extractor, h, and the weight vectors, w_τ, and biases, b_τ, parameterize the final multinomial layer. We note that the sum in the denominator runs from j = 2 to E, where E is the number of segments; this is because we fix w_1 = 0 and b_1 = 0 in order to avoid indeterminacy in the softmax function. Conversely, we can derive the true posterior distribution over the label as:
$$ p(C = \tau \mid \mathbf{x}) = \frac{p_\tau(\mathbf{x})\, P(C = \tau)}{\sum_{j=1}^{E} p_j(\mathbf{x})\, P(C = j)} \quad (S.2) $$
where p_τ denotes the density of observations within segment τ. If we assume that the feature extractor has a universal function approximation capacity and that we have infinite data, then the multinomial logistic classifier based on the extracted features will converge to the optimal classifier, implying that equation (S.1) will equal equation (S.2) for all segments τ. We may then consider the following ratio:
$$ \frac{p(C = \tau \mid \mathbf{x}; \theta, W, \mathbf{b})}{p(C = 1 \mid \mathbf{x}; \theta, W, \mathbf{b})} = \frac{p_\tau(\mathbf{x})\, P(C = \tau)}{p_1(\mathbf{x})\, P(C = 1)} \quad (S.3) $$
which after expanding and taking logarithms yields:
$$ \mathbf{w}_\tau^\top \mathbf{h}(\mathbf{x}; \theta) + b_\tau = \log p_\tau(\mathbf{x}) - \log p_1(\mathbf{x}) + \log \frac{P(C = \tau)}{P(C = 1)} \quad (S.4) $$
indicating that the optimal feature extractor computes the log probability density function of the data within each experimental condition (relative to some pivot segment, in this case the first condition). We note that this condition holds for all segments τ. If we further assume the data were generated according to a nonstationary ICA model as described in equation (3), then equation (S.4) yields:
$$ \mathbf{w}_\tau^\top \mathbf{h}(\mathbf{x}; \theta) + b_\tau = \sum_i \left(\lambda_{i,\tau} - \lambda_{i,1}\right) q(s_i) + c_\tau \quad (S.5) $$
where the sum is taken over each independent source, s_i. Equation (S.5) follows from the change of variable from S to X, noting that the Jacobians required by such a transformation cancel out because of the subtraction on the right hand side of equation (S.4). As a result, by modeling the log probability densities with respect to some pivot segment (in this case segment 1), we do not need to explicitly compute the Jacobians. We note that the sufficient statistic of each source does not depend on the segment, and that the remaining constant depends on neither X nor S. As a result, it follows that the learned features and the source sufficient statistics must span the same linear space, implying that we may compute latent sources up to some pointwise nonlinearity by first learning a feature extractor via TCL and subsequently applying linear ICA to the estimated features. Figure S.1 summarizes the relationship between the generative model and TCL.
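The role of the fixed pivot class in the multinomial layer can be illustrated numerically. The sketch below is an illustration we add here (not the experimental code used in the paper): it fixes the first weight row and bias to zero and shows that, without such a constraint, shifting every logit by a constant leaves the posterior unchanged, which is precisely the softmax indeterminacy discussed above.

```python
import numpy as np

def softmax_posterior(h, W, b):
    """Posterior over the E segment labels given extracted features h."""
    logits = W @ h + b
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
E, m = 4, 3                      # number of segments, feature dimension
W = rng.normal(size=(E, m))
b = rng.normal(size=E)
W[0, :] = 0.0                    # fix w_1 = 0 ...
b[0] = 0.0                       # ... and b_1 = 0 (the pivot segment)

h = rng.normal(size=m)
p = softmax_posterior(h, W, b)

# Without the constraint the parameterization is not identifiable: adding a
# constant to every logit (here via the biases) leaves the posterior unchanged.
p_shifted = softmax_posterior(h, W, b + 5.0)
```

Because the shifted biases produce an identical posterior, only differences of logits are identifiable, and fixing the pivot segment's parameters to zero resolves the ambiguity.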
B A linear ICA method for piecewise stationary sources
Formally, assumptions 1–3 of Theorem 1 guarantee that TCL, as presented in Section 2.2, will recover a linear mixture of latent independent sources up to a pointwise transformation. This implies that the hidden representations obtained satisfy:
$$ \mathbf{h}(\mathbf{x}; \theta) = A\, q(\mathbf{s}) \quad (S.6) $$
for some linear mixing matrix, A, where q denotes a pointwise nonlinearity. Equation (S.6) suggests that applying ordinary linear ICA to the hidden representations will allow us to recover the sources up to this pointwise transformation. However, the use of linear ICA is premised on the assumption that the latent variables are independent. This is not necessarily guaranteed under the generative model presented in equation (3) in the context of a fixed number of segments. While the sources are certainly conditionally independent given the segment label, it may be the case that they are not marginally independent over all segments; for example, it is possible that their variances are somehow related (e.g., they may be monotonically increasing together). We note that this is not an issue when the number of segments grows asymptotically and we assume the exponential family parameters are randomly generated, as stated in Corollary 1 of Hyvärinen and Morioka (2016).
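The failure mode discussed above can be made concrete with a small simulation. In the hypothetical example below, sources are drawn independently within each segment, but their scales grow together across segments (Gaussian sources are used purely for simplicity of illustration); pooled over all segments, the sources remain uncorrelated yet their energies are clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(1)
segments, n_per_seg = 5, 2000
s1_parts, s2_parts = [], []
for e in range(segments):
    scale = 2.0 ** e             # both source scales grow with the segment index
    s1_parts.append(rng.normal(scale=scale, size=n_per_seg))
    s2_parts.append(rng.normal(scale=scale, size=n_per_seg))
s1 = np.concatenate(s1_parts)
s2 = np.concatenate(s2_parts)

# Within every segment the two sources are independent, but pooled over all
# segments their energies co-vary: a second-order dependence that ordinary
# linear ICA, which assumes marginally independent sources, does not model.
corr_linear = np.corrcoef(s1, s2)[0, 1]
corr_energy = np.corrcoef(s1**2, s2**2)[0, 1]
```

The linear correlation stays near zero while the energy correlation is clearly positive, which is exactly the kind of dependence the proposed piecewise stationary ICA algorithm is designed to absorb.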
Here we seek to address this issue by proposing an alternative linear ICA algorithm which explicitly models the piecewise stationary nature of the data. In particular, the proposed linear ICA algorithm explicitly incorporates equation (3) as the generative model for latent sources. As such, it can be used to accurately unmix sources in the final stage of TCL, especially when the number of segments is small.
To set notation, we assume we observe data, z, which corresponds to a linear mixture of latent sources, s:

$$ \mathbf{z} = A \mathbf{s}, $$

where A is a square mixing matrix. Popular ICA algorithms, such as FastICA, estimate the unmixing matrix, W, by maximizing the log-likelihood under the assumptions that sources are independent and non-Gaussian. The objective function for FastICA is therefore of the form:
$$ \log p(\mathbf{z}; W) = \sum_i G\left(\mathbf{w}_i^\top \mathbf{z}\right) + \log |\det W| - \log Z \quad (S.7) $$
where Z is the normalization constant and we write w_i to denote the i-th row of W. Typically a parametric form is assumed for G, popular examples being the exponential or log-cosh functions. In the context of the generative model for sources specified in equation (3), the FastICA algorithm therefore proceeds under the assumption that the source density is constant across all segments.
In order to address this issue we consider an alternative model for the density of latent sources. In particular, we seek to directly employ the piecewise stationary log-density described in equation (3). As such, the log-density of an observation within a given segment is defined as:
$$ \log p_\tau(\mathbf{z}; W, \lambda) = \sum_i \lambda_{i,\tau}\, q\left(\mathbf{w}_i^\top \mathbf{z}\right) + \log |\det W| - \log Z(\lambda_\tau) \quad (S.8) $$
In contrast to equation (S.7), the log-density of each observation depends on the segment, the exponential family parameters, and the unmixing matrix.
In order to recover the latent sources, we propose to estimate the parameters, corresponding to the unmixing matrix as well as the exponential family parameters, via score matching (Hyvärinen, 2005). This avoids the need to compute the normalization constant, which may not be available analytically when sources follow unnormalized distributions. The score matching objective for the ICA model defined in equation (S.8) is defined as:
$$ J(W, \lambda) = \frac{1}{T} \sum_{t=1}^{T} \sum_i \left[ \partial_i \psi_i(\mathbf{z}_t) + \tfrac{1}{2}\, \psi_i(\mathbf{z}_t)^2 \right] \quad (S.9) $$
where the objective involves the first and second derivatives of the model log-density with respect to the observations, Z. We propose to minimize equation (S.9) via block gradient descent, conditionally updating the unmixing matrix, W, and the exponential family parameters. This has the important benefit that, conditional on W, there is a closed form update for the exponential family parameters (Hyvärinen, 2007).
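The closed-form conditional update can be illustrated in a one-dimensional toy setting. The sketch below is our own illustration, not the paper's implementation: the smoothed statistic q and its parameters are assumptions chosen for differentiability. It evaluates the empirical score-matching objective for a model log-density proportional to λq(z) and shows that, since the objective is quadratic in λ, the minimizing λ is available in closed form.

```python
import numpy as np

# One-dimensional toy model: log-density  log p(z) = lam * q(z) + const, with
# q a smoothed absolute value (a stand-in for an exponential-family statistic).
eps = 1e-2
q_prime = lambda z: -z / np.sqrt(z**2 + eps)        # q'(z)
q_second = lambda z: -eps / (z**2 + eps) ** 1.5     # q''(z)

def sm_objective(z, lam):
    """Empirical score-matching objective  mean( psi'(z) + 0.5 * psi(z)^2 ),
    where psi(z) = lam * q'(z) is the model score."""
    return np.mean(lam * q_second(z) + 0.5 * (lam * q_prime(z)) ** 2)

rng = np.random.default_rng(2)
z = rng.laplace(scale=1.0, size=50_000)             # roughly matches lam close to 1

# The objective is quadratic in lam, so conditional on the unmixing step the
# update is closed form:  lam* = -mean(q'') / mean(q'^2).
lam_star = -np.mean(q_second(z)) / np.mean(q_prime(z) ** 2)
```

In the multivariate model of equation (S.8) the same quadratic structure holds for each λ parameter once W is held fixed, which is what makes the block updates attractive.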
Experimental results
In order to assess the performance of the proposed linear ICA algorithm, we generated bivariate data following the piecewise stationary distribution described in equation (3). We compare the performance of the proposed algorithm against the following popular linear ICA algorithms: FastICA (Hyvärinen, 1999), Infomax ICA (Bell and Sejnowski, 1995), and the joint diagonalization method proposed by Pham and Cardoso (2001), which also accommodates nonstationary sources.
In order to assess the performance of the proposed method we consider two scenarios:

The exponential family parameters are deliberately generated such that there is a statistical dependence structure across segments. In particular, we generate bivariate data where we explicitly enforce the parameters to be monotonically increasing across segments. As a concrete example, when sources follow a Laplace distribution the parameter corresponds to the variance of each source within a segment; in such a setting we generate piecewise stationary Laplace sources whose variances are correlated across segments.

As a baseline, we also generate data where the exponential family parameters are generated at random. This removes any systematic, higherorder dependence between latent sources.
We begin by generating bivariate data where sources follow a piecewise stationary Laplace distribution. This implies that sources follow the log-density specified in equation (3), where each parameter denotes the variance of a given source within a segment. Results when data is generated over five segments are provided in Figure S.2. The left panel shows the case where the variances of each latent source are correlated across segments. We note that when this is the case, methods such as FastICA and Infomax ICA perform poorly. This is in contrast to the proposed method and the joint diagonalization approach of Pham and Cardoso (2001), which explicitly model the nonstationary nature of the data. The right panel of Figure S.2 shows equivalent results when variances are randomly generated, thereby removing second order dependence between latent variables. As expected, in this setting all methods perform well.
In order to further probe the differences between the proposed method and the approach of Pham and Cardoso (2001), we consider latent sources with an unnormalized distribution. In particular, we generate sources such that the logdensity within a particular segment is as follows:
(S.10) 
Such a density is not only unnormalized but also skewed. As such, we expect the joint diagonalization algorithm of Pham and Cardoso (2001) to perform poorly in this setting, as it exclusively studies covariance structure and therefore cannot model skewed distributions. Figure S.3 visualizes the results for these experiments. As expected, the joint diagonalization algorithm of Pham and Cardoso (2001) suffers a drop in performance. However, it continues to outperform FastICA and Infomax ICA, especially when there are dependencies over the exponential family parameters. We note that the proposed model, where parameters are estimated using score matching, is shown to be more robust.
C Proof of Theorem 2
The proof of Theorem 2 follows from a combination of the presented assumptions together with Property 1. Formally, assumptions 1–3 guarantee that TCL, as presented in Section 2.2, will recover a linear mixture of latent independent sources up to a pointwise transformation. This, combined with the novel linear ICA algorithm described in Section 3.2, implies that the latent disturbances can be recovered. We note that in practice we recover the latent disturbances up to a pointwise transformation, as opposed to the disturbances N themselves. This is not a problem when a general test of statistical independence, such as HSIC, which can capture arbitrary (i.e., nonlinear) dependencies, is employed. Finally, Assumption 1 further states that no latent confounder is present, implying that running all possible pairwise tests using a sufficiently flexible independence test, as required by assumption 4, will allow us to determine the causal structure.
D Relationship between likelihood ratio and measures of independence
In this section we derive the result presented in equation (8) for a given permutation of the latent disturbances. We begin by considering the mutual information between each observed variable and a candidate latent disturbance:
where we employ the same change of variable, whose Jacobian can be easily evaluated, as in Section 3.4. In particular, we have used the property that:
and noted that the particular choice of transformation allows us to directly compute the Jacobian. We may therefore compute the difference in mutual information between each observed variable and the relevant latent disturbance, yielding:
which is precisely the negative of the likelihood ratio presented in Section 3.4.
E Baseline methods
In this section we briefly overview and provide pseudocode for alternative methods which are presented as baselines in the manuscript.

DirectLiNGAM: The DirectLiNGAM method of Shimizu et al. (2011) is based on the property that, within a linear non-Gaussian acyclic model (LiNGAM), if we regress out the parents of any variable then the residuals also follow a LiNGAM. Based on this property, the authors propose an iterative algorithm through which to uncover exogenous variables. Further, if variables follow a LiNGAM then, due to the additive nature of noise in such models, the residuals will be independent of the parent when we regress a child on its parent. As a result, we may infer the causal structure by studying the statistical independences between variables and residuals. Pseudocode for the bivariate DirectLiNGAM method is provided in Algorithm 1.
Input: Bivariate data and significance level.
1. for each candidate cause variable do
2.     Linearly regress the remaining variable on the candidate cause and compute the residual.
3.     Test whether the residual is statistically independent of the candidate cause.
4. end for
5. if we fail to reject the null hypothesis of independence exactly once then
6.     the variable for which we fail to reject is the cause variable
7. else
8.     the causal dependence structure is inconclusive
9. end if
Algorithm 1: Bivariate causal discovery using DirectLiNGAM (Shimizu et al., 2011)
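A minimal sketch of this procedure is given below. It is our own simplified illustration, not the reference implementation: residual independence is scored with a biased empirical HSIC statistic (Gretton et al., 2005), and the direction with the more independent residual is returned directly rather than performing a formal hypothesis test at a significance level.

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels (Gretton et al., 2005)."""
    n = len(x)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(gram(x) @ H @ gram(y) @ H) / n**2

def lingam_direction(x1, x2):
    """Regress each variable on the other; the direction whose residual is
    more independent of the regressor (smaller HSIC) is returned as causal."""
    scores = {}
    for cause, effect, name in [(x1, x2, "x1->x2"), (x2, x1, "x2->x1")]:
        beta = np.cov(cause, effect)[0, 1] / np.var(cause)
        resid = effect - beta * cause
        # standardize before applying the fixed-bandwidth kernel
        c = (cause - cause.mean()) / cause.std()
        r = (resid - resid.mean()) / resid.std()
        scores[name] = hsic(c, r)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(3)
n = 500
x1 = rng.laplace(size=n)                     # non-Gaussian cause
x2 = 0.8 * x1 + 0.5 * rng.laplace(size=n)    # linear effect, non-Gaussian noise
direction, scores = lingam_direction(x1, x2)
```

Comparing HSIC values rather than thresholding them mirrors the likelihood-ratio style of decision used elsewhere in the paper; the full DirectLiNGAM procedure instead tests independence at a chosen significance level.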
RESIT: The RESIT method, first proposed by Hoyer et al. (2009) and subsequently extended by Peters et al. (2014), can be seen as a nonlinear extension of the DirectLiNGAM algorithm. The RESIT algorithm is premised on the assumption of an additive noise model (ANM), which implies that each structural equation consists of a nonlinear function of the parents plus additive noise. Given the ANM assumption, RESIT is able to recover the causal structure by testing for dependence between variables and residuals. Gaussian process regression is employed in order to accommodate nonlinear additive causal relations.
Input: Bivariate data and significance level.
1. for each candidate cause variable do
2.     Regress (via Gaussian process regression) the remaining variable on the candidate cause and compute the residual.
3.     Test whether the residual is statistically independent of the candidate cause.
4. end for
5. if we fail to reject the null hypothesis of independence exactly once then
6.     the variable for which we fail to reject is the cause variable
7. else
8.     the causal dependence structure is inconclusive
9. end if
Algorithm 2: Bivariate causal discovery using RESIT (Hoyer et al., 2009; Peters et al., 2014)
Nonlinear ICP: The ICP method of Peters et al. (2016) takes a fundamentally different approach to causal discovery. The underlying principle of the ICP algorithm is that the direct causal predictors of a given variable must remain invariant across distribution shifts induced by various experimental conditions. In the context of bivariate data, the nonlinear ICP algorithm, as described in Section 6.1 of Peters et al. (2016), therefore corresponds to fitting a nonlinear regression model on the data pooled across all experimental conditions and testing whether the distribution of residuals is the same within each condition. We note that such an approach assumes an additive noise model, as this greatly simplifies testing for invariance.
Input: Bivariate data, experimental condition labels, and significance level.
1. for each candidate cause variable do
2.     Regress (via Gaussian process regression) the remaining variable on the candidate cause, pooling data across all experimental conditions, and compute the residuals.
3.     Test whether the residual distribution is identical across all experimental conditions (the Kolmogorov-Smirnov test is employed).
4. end for
5. if we fail to reject the null hypothesis of invariance exactly once then
6.     the variable for which we fail to reject is the cause variable
7. else
8.     the causal dependence structure is inconclusive
9. end if
Algorithm 3: Bivariate causal discovery via nonlinear ICP (Peters et al., 2016)
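A simplified sketch of this invariance check is given below. It is our own illustration: polynomial least squares stands in for the Gaussian process regression used in the original method, and the raw two-sample Kolmogorov-Smirnov statistic is compared across candidate directions instead of performing a formal test at a significance level.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: maximum ECDF gap."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def icp_direction(x1, x2, cond):
    """For each candidate cause, regress on pooled data and measure how much
    the residual distribution differs between the two conditions; the more
    invariant direction is preferred."""
    stats = {}
    for cause, effect, name in [(x1, x2, "x1->x2"), (x2, x1, "x2->x1")]:
        coefs = np.polyfit(cause, effect, deg=3)   # stand-in for GP regression
        resid = effect - np.polyval(coefs, cause)
        stats[name] = ks_statistic(resid[cond == 0], resid[cond == 1])
    return min(stats, key=stats.get), stats

rng = np.random.default_rng(4)
n = 1000
cond = np.repeat([0, 1], n)
x1 = np.concatenate([rng.normal(0.0, 1.0, n),     # cause shifted by the
                     rng.normal(3.0, 1.0, n)])    # experimental condition
x2 = x1 + 0.5 * rng.normal(size=2 * n)            # invariant causal mechanism
direction, stats = icp_direction(x1, x2, cond)
```

Because the causal mechanism is unchanged across conditions, the residuals in the causal direction are distributed identically in both conditions, whereas the anti-causal residuals shift with the intervention on the cause.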
RECI: Blöbaum et al. (2018) propose a method for inferring the causal relation between two variables by comparing the regression errors in each possible causal direction. Under some mild assumptions, they are able to prove that the magnitude of the residual errors should be smaller in the causal direction. This suggests a straightforward causal discovery algorithm: regress in both directions and report the direction with the smaller error as causal. We note that while any nonlinear regression method may be employed, our implementation used Gaussian process regression.
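A minimal sketch of this idea follows. It is our own illustration rather than the reference implementation: polynomial regression stands in for Gaussian process regression, and the variables are rescaled to the unit interval here so that the errors in the two directions are comparable.

```python
import numpy as np

def reci_direction(x, y, deg=3):
    """Fit a polynomial regression in both directions on variables rescaled
    to [0, 1] and return the direction with the smaller mean-squared error."""
    def rescale(v):
        return (v - v.min()) / (v.max() - v.min())
    x, y = rescale(x), rescale(y)
    def mse(cause, effect):
        coefs = np.polyfit(cause, effect, deg)
        return np.mean((effect - np.polyval(coefs, cause)) ** 2)
    return "x->y" if mse(x, y) < mse(y, x) else "y->x"

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 2000)
y = x**2 + 0.05 * rng.normal(size=2000)   # non-invertible causal mechanism
direction = reci_direction(x, y)
```

The non-invertible mechanism makes the asymmetry stark: predicting the effect from the cause incurs only the noise variance, while predicting the cause from the effect must average over two branches.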

CDNOD: Zhang et al. (2017) propose a causal discovery algorithm which explicitly accounts for nonstationarity or heterogeneity over observed variables. The CDNOD algorithm accounts for nonstationarity, which may manifest itself as changes in the causal modules, by introducing an additional variable representing the time or domain index into the causal DAG. Conditional independence testing is then employed to recover the skeleton over the augmented DAG. Their method can find causal direction by making use of not only invariance, but also independent changes of causal models, as an extended notion of invariance. We also note that CDNOD is a nonparametric method, implying it is able to accommodate nonlinear causal dependencies.
F Further experimental results
In this section of the supplementary material we present further experimental results. In particular, in Section F.1 we present results for bivariate causal discovery in the context of a fixed number of experimental conditions and an increasing number of observations per segment. In Section F.2 we provide additional results in the context of multivariate causal discovery; in particular, we report the Hamming distance between the true and estimated DAGs.
F.1 Additional bivariate causal discovery experiments
We consider the performance of all algorithms in the context of a fixed number of experimental conditions and an increasing number of observations per condition. The results are presented in Figure S.4, where we repeat each experiment 100 times. We note that all algorithms are able to accurately identify causal structure in the presence of LiNGAMs (corresponding to a 1-layer mixing-DNN). However, as the causal structure becomes increasingly nonlinear, the performance of all methods declines. In particular, we note that the proposed method has comparable performance to alternative methods such as RESIT and CDNOD when the number of samples is small. However, as the number of observations increases, the proposed method is able to outperform the alternative algorithms.
F.2 Multivariate causal discovery results
In this section we provide additional performance metrics in the context of multivariate causal discovery. While Figure 4 reported the score, we further provide results for the Hamming distance between true and estimated DAGs in Figure S.5.
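For concreteness, one common convention for the (structural) Hamming distance between two DAGs counts each missing, extra, or reversed edge as a single error. A minimal sketch under that assumption:

```python
def hamming_distance(dag_a, dag_b):
    """Structural Hamming distance between two DAG adjacency matrices:
    each missing, extra, or reversed edge contributes one error."""
    d = len(dag_a)
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            # compare the (possibly directed) edge status of each node pair
            edge_a = (dag_a[i][j], dag_a[j][i])
            edge_b = (dag_b[i][j], dag_b[j][i])
            if edge_a != edge_b:
                dist += 1
    return dist

true_dag = [[0, 1, 0],
            [0, 0, 1],
            [0, 0, 0]]   # x1 -> x2 -> x3
est_dag  = [[0, 1, 0],
            [0, 0, 0],
            [0, 1, 0]]   # x1 -> x2 <- x3: the second edge is reversed
```

Under this convention, the example pair of DAGs differs by exactly one (reversed) edge.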
G Hippocampal functional MRI data
In this section we provide further details of the hippocampal fMRI data employed in Section 4.2. The data was collected as part of the MyConnectome project, presented in Poldrack et al. (2015), which involved daily fMRI scans for a single individual (Caucasian male, aged 45). The data may be freely downloaded from https://openneuro.org/datasets/ds000031/.
We focus only on the restingstate fMRI data taken from this project, noting that future work may also wish to study the other modalities of data collected.
Data was collected from the same subject over a series of 84 successive days, allowing us to treat data collected on distinct days as distinct experimental conditions. Full details of the data acquisition pipelines are provided in Poldrack et al. (2015). For each day, we observe 518 BOLD measurements. After preprocessing, data was collected from the following brain regions: perirhinal cortex (PRC), parahippocampal cortex (PHC), entorhinal cortex (ERC), subiculum (Sub), CA1, and CA3/Dentate Gyrus (DG). This resulted in six-dimensional data. As such, the data employed consists of 518 observations per experimental condition and 84 distinct experimental conditions.