Recovery of non-linear cause-effect relationships from linearly mixed neuroimaging data

05/02/2016 ∙ by Sebastian Weichwald, et al.

Causal inference concerns the identification of cause-effect relationships between variables. However, often only linear combinations of variables constitute meaningful causal variables. For example, recovering the signal of a cortical source from electroencephalography requires a well-tuned combination of signals recorded at multiple electrodes. We recently introduced the MERLiN (Mixture Effect Recovery in Linear Networks) algorithm that is able to recover, from an observed linear mixture, a causal variable that is a linear effect of another given variable. Here we relax the assumption that this cause-effect relationship is linear and present an extended algorithm that can pick up non-linear cause-effect relationships. Thus, the main contribution is an algorithm (and ready-to-use code) that has broader applicability and allows for a richer model class. Furthermore, a comparative analysis indicates that the assumption of linear cause-effect relationships is not restrictive in analysing electroencephalographic data.




I Introduction

Causal inference requires causal variables. However, the variables in a dataset do not always specify the candidate causal relata. In electroencephalography (EEG) studies, for example, what is measured at electrodes placed on the scalp is the instantaneously and linearly superimposed electromagnetic activity of sources in the brain [1]. Standard causal inference methods require that the cortical sources first be recovered from the observed electrode signals [2]. This is disadvantageous. First, any source localisation procedure is prone to modelling errors, which may distort the true cause-effect relationships between cortical sources. Second, source localisation enlarges the data dimensionality by roughly two orders of magnitude, which increases computational complexity.

We recently proposed a novel idea to construct causal variables, i.e., recover cortical sources, by directly optimising for statistical dependences and independences that imply a certain cause-effect relationship [3]. The linear MERLiN algorithm can – skipping potentially error-prone modelling steps – establish a linear cause-effect relationship between brain state features that are observed only as part of a linear mixture. This allows for computationally efficient insights into brain networks beyond those readily obtained from encoding and decoding models trained on pre-defined variables [4]. The linear MERLiN algorithm, however, is unable to reconstruct cortical sources with non-linear cause-effect relationships.

Here we present the non-linear MERLiN algorithm, which relaxes the assumption of linear cause-effect relationships. By integrating kernel ridge regression and a non-linear independence test, the extended algorithm can capture dependences of any higher order. We compare the results of our linear and non-linear MERLiN algorithms on EEG data for which cause-effect relationships have previously only been computed by an exhaustive search approach [5] and find no qualitative differences. The contribution of this work is thus two-fold. First, we provide an algorithm to learn non-linear cause-effect relationships from linear mixtures of causal variables, and, second, we provide empirical evidence that linear methods suffice to identify cause-effect relationships within individual EEG frequency bands. The Python implementation is available online.

II Methods

II-A Causal Bayesian Networks

We briefly introduce the main aspects of Causal Bayesian Networks (CBNs). For an exhaustive treatment see [6, 7]. The important advantage of this framework over methods based on information flow is that it yields testable predictions on the impact of interventions [8, 9].

Definition 1 (Structural Equation Model).

We define a structural equation model (SEM) $\mathcal{S}$ as a set of equations $X_i = f_i(\mathrm{PA}_i, N_i)$, $i = 1, \ldots, d$, where the so-called noise variables $N_1, \ldots, N_d$ are independently distributed according to $\mathbb{P}_{N_1, \ldots, N_d}$. For $i = 1, \ldots, d$ the set $\mathrm{PA}_i \subseteq \{X_1, \ldots, X_d\} \setminus \{X_i\}$ contains the so-called parents of $X_i$ and $f_i$ describes how $X_i$ relates to the random variables in $\mathrm{PA}_i$. The induced joint distribution is denoted by $\mathbb{P}_{X_1, \ldots, X_d}$.

Replacing at least one of the functions $f_i$ by a constant $x_i$ yields a new SEM. We say $X_i$ has been intervened on, which is denoted by $\mathrm{do}(X_i = x_i)$, leads to the SEM $\mathcal{S}_{\mathrm{do}(X_i = x_i)}$, and induces the interventional distribution $\mathbb{P}_{X_1, \ldots, X_d \mid \mathrm{do}(X_i = x_i)}$.
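To make the definitions concrete, the following is a minimal numerical sketch (a hypothetical toy SEM, not taken from the paper): a chain $S \to C_1 \to C_2$ is sampled, and the intervention $\mathrm{do}(C_1 = 2)$ replaces the structural equation of $C_1$ by a constant, shifting the distribution of $C_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

def sample(do_c1=None):
    """Sample from a toy SEM S -> C1 -> C2; optionally intervene do(C1 = c)."""
    s = rng.normal(size=m)                                 # S  := N_S
    c1 = s + rng.normal(size=m) if do_c1 is None else np.full(m, do_c1)
    c2 = np.tanh(c1) + 0.1 * rng.normal(size=m)            # C2 := tanh(C1) + N_2
    return s, c1, c2

# Observational distribution vs. interventional distribution under do(C1 = 2):
_, _, c2_obs = sample()
_, _, c2_do = sample(do_c1=2.0)
# E[C2] differs between the two, so C1 is a cause of C2 in the sense of Definition 2.
print(c2_obs.mean(), c2_do.mean())
```

Since the marginal of $C_2$ changes under the intervention, Definition 2 certifies $C_1 \to C_2$ in this toy model.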

Definition 2 (Cause and Effect).

$X$ is a cause of $Y$ ($X \to Y$) wrt. an SEM $\mathcal{S}$ iff there exists $x$ such that $\mathbb{P}_{Y \mid \mathrm{do}(X = x)} \neq \mathbb{P}_{Y}$.¹ $Y$ is an effect of $X$ iff $X$ is a cause of $Y$. Often the considered SEM $\mathcal{S}$ is omitted if it is clear from the context. [Footnote 1: $\mathbb{P}_{Y}$ and $\mathbb{P}_{Y \mid \mathrm{do}(X = x)}$ denote the marginal distributions of $Y$ corresponding to $\mathbb{P}_{X_1, \ldots, X_d}$ and $\mathbb{P}_{X_1, \ldots, X_d \mid \mathrm{do}(X = x)}$ respectively.]

For each SEM $\mathcal{S}$ there is a corresponding graph $G^{\mathcal{S}} = (\mathbf{V}, \mathbf{E})$ with $\mathbf{V} = \{X_1, \ldots, X_d\}$ and $\mathbf{E} = \{(X_i, X_j) : X_i \in \mathrm{PA}_j\}$ that has the random variables as nodes and directed edges pointing from parents to children. We employ the common assumption that this graph is acyclic, i.e., $G^{\mathcal{S}}$ will always be a directed acyclic graph (DAG).

So far a DAG $G^{\mathcal{S}}$ simply depicts all parent-child relationships defined by the SEM $\mathcal{S}$. Missing directed paths indicate missing cause-effect relationships. In order to specify the link between statistical independence (denoted by $\perp\!\!\!\perp$) wrt. the joint distribution $\mathbb{P}_{X_1, \ldots, X_d}$ and properties of the DAG $G^{\mathcal{S}}$ we need the following definition.

Definition 3 (d-separation).

For a fixed graph $G$, disjoint sets of nodes $\mathbf{A}$ and $\mathbf{B}$ are d-separated by a third disjoint set $\mathbf{C}$ (denoted by $\mathbf{A} \perp_G \mathbf{B} \mid \mathbf{C}$) iff all pairs of nodes $a \in \mathbf{A}$ and $b \in \mathbf{B}$ are d-separated by $\mathbf{C}$. A pair of nodes $a \neq b$ is d-separated by $\mathbf{C}$ iff every path between $a$ and $b$ is blocked by $\mathbf{C}$. A path between nodes $a$ and $b$ is blocked by $\mathbf{C}$ iff there is an intermediate node $z$ on the path such that (i) $z \in \mathbf{C}$ and $z$ is tail-to-tail ($\leftarrow z \rightarrow$) or head-to-tail ($\rightarrow z \rightarrow$), or (ii) $z$ is head-to-head ($\rightarrow z \leftarrow$) and neither $z$ nor any of its descendants is in $\mathbf{C}$.

Conveniently, assuming faithfulness² (and exploiting Markovianity³) we have the following one-to-one correspondence between d-separation and conditional independence statements:

$$\mathbf{A} \perp_{G^{\mathcal{S}}} \mathbf{B} \mid \mathbf{C} \iff \mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}.$$

[Footnote 2: Intuitively, this is saying that conditional independences are due to the causal structure and not accidents of parameter values [6, p. 9]; more formally the assumption reads $\mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C} \implies \mathbf{A} \perp_{G^{\mathcal{S}}} \mathbf{B} \mid \mathbf{C}$.] [Footnote 3: The distribution $\mathbb{P}_{X_1, \ldots, X_d}$ generated by a SEM $\mathcal{S}$ is Markov wrt. $G^{\mathcal{S}}$ (cf. [7, Theorem 1.4.1] for a proof), i.e., $\mathbf{A} \perp_{G^{\mathcal{S}}} \mathbf{B} \mid \mathbf{C} \implies \mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}$.]

Summing up, we have defined interventional causation in terms of SEMs and have seen how a SEM gives rise to a DAG. This DAG has two convenient features. Firstly, the DAG yields a visualisation that makes it easy to grasp missing cause-effect relationships, which correspond to missing directed paths. Secondly, since we assume faithfulness, d-separation properties of this DAG are equivalent to conditional independence properties of the joint distribution. Thus, conditional independences translate into causal statements, e.g. ‘a variable becomes independent of all its non-effects given its immediate causes’ or ‘cause and effect are marginally dependent’. Furthermore, the causal graph can be identified from conditional independences observed in $\mathbb{P}_{X_1, \ldots, X_d}$ – at least up to a so-called Markov equivalence class, the set of graphs that entail the same conditional independences [10].
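The correspondence between d-separation and conditional independence can be illustrated numerically. In the following sketch (a hypothetical linear-Gaussian chain $S \to C_1 \to C_2$, where vanishing partial correlation characterises conditional independence), cause and effect are marginally dependent but become independent given the mediating variable:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200_000
s = rng.normal(size=m)
c1 = 2.0 * s + rng.normal(size=m)     # S -> C1
c2 = -1.5 * c1 + rng.normal(size=m)   # C1 -> C2

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Marginally, cause and effect are dependent:
print(corr(s, c2))                    # clearly non-zero

# C1 d-separates S and C2; regressing both on C1 and correlating the
# residuals estimates their partial correlation given C1:
r_s = s - np.polyval(np.polyfit(c1, s, 1), c1)
r_c2 = c2 - np.polyval(np.polyfit(c1, c2, 1), c1)
print(corr(r_s, r_c2))                # approximately zero
```

In this linear-Gaussian setting, the near-zero partial correlation witnesses $S \perp\!\!\!\perp C_2 \mid C_1$, matching the d-separation of $S$ and $C_2$ by $\{C_1\}$ in the chain graph.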

II-B Formal problem description

In the following, the variables $S$, $C_1, \ldots, C_d$, and $F_1, \ldots, F_d$ may be thought of as a stimulus variable, the activity of multiple cortical sources, and the EEG channel recordings respectively. We aim at recovering an effect of a pre-defined target variable $C_1$. The terminology introduced in Section II-A allows us to state the problem precisely as follows.

II-B1 Assumptions

Let $S$ and $C_1, \ldots, C_d$ denote (finitely many) random variables. We assume the existence of a SEM $\mathcal{S}$, potentially with additional unobserved variables, that induces $\mathbb{P}_{S, C_1, \ldots, C_d}$. We refer to the corresponding graph $G^{\mathcal{S}}$ as the true causal graph and call its nodes causal variables. We further assume that

  • $S$ affects $C_2$ indirectly via $C_1$,⁴ [Footnote 4: By saying a variable $A$ causes $B$ indirectly via $Z$ we imply (a) the existence of a path $A \to \cdots \to Z \to \cdots \to B$, and (b) that there is no path from $A$ to $B$ without $Z$ on it (this also excludes the edge $A \to B$).]

  • there are no edges in $G^{\mathcal{S}}$ pointing into $S$.⁵ [Footnote 5: This condition can for example be ensured by randomising $S$.]

Importantly, we do not require that the structural equation that relates $C_2$ to its parents be linear in $C_1$. Figure 1 depicts an example of what $G^{\mathcal{S}}$ might look like.

Fig. 1: Example graph $G^{\mathcal{S}}$ where $h$ is a hidden variable.

II-B2 Given data

  • $\mathbf{v} \in \mathbb{R}^d$ such that $\mathbf{v}^\top (F_1, \ldots, F_d)^\top = C_1$

  • $m$ iid⁶ samples of $S$ and of $(F_1, \ldots, F_d)^\top = A (C_1, \ldots, C_d)^\top$, where $(F_1, \ldots, F_d)$ is the observed linear mixture of the causal variables and $A \in \mathbb{R}^{d \times d}$ the mixing matrix [Footnote 6: independent and identically distributed]

II-B3 Desired output

Find $\mathbf{w} \in \mathbb{R}^d$ such that $\mathbf{w}^\top (F_1, \ldots, F_d)^\top = C_2$ where $C_2$ is an effect of $C_1$ ($C_1 \to C_2$). For the graph shown in Figure 1 recovery of the causal variable $C_2$ is a valid solution.
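A toy instance of this setup can be sketched as follows (all names, dimensions, and the non-linear structural equation are illustrative, not from the paper): hidden causal variables with $S \to C_1 \to C_2$ are mixed by an unknown matrix $A$, and the given filter $\mathbf{v}$ recovers $C_1$ from the observed mixture, while the sought-after $\mathbf{w}$ would recover $C_2$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 1000

# Hidden causal variables: S -> C1 -> C2; C3..C5 are independent noise sources.
s = rng.normal(size=m)
c = rng.normal(size=(d, m))
c[0] += s                                        # C1 is an effect of S
c[1] = np.sin(c[0]) + 0.3 * rng.normal(size=m)   # C2 = f(C1) + noise, non-linear f

a_mix = rng.normal(size=(d, d))   # unknown mixing matrix A
f = a_mix @ c                     # observed linear mixture F = A C

# Given: v with v^T F = C1 (a row of A^{-1}); sought: w with w^T F = C2.
v = np.linalg.inv(a_mix)[0]
w_true = np.linalg.inv(a_mix)[1]
assert np.allclose(v @ f, c[0])   # v indeed extracts C1 from the mixture
```

In practice $A$ is unknown, so `w_true` cannot be read off an inverse; MERLiN instead searches for $\mathbf{w}$ by optimising statistical (in)dependence properties of $\mathbf{w}^\top F$.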

II-C Strategy

Our approach leverages the following causal inference rule that – under the assumptions in Section II-B1 – applies to a causal variable $C_2$ (cf. [5]).

Causal Inference Rule

If $S \not\perp\!\!\!\perp C_2$ and $S \perp\!\!\!\perp C_2 \mid C_1$, then $S$ indirectly affects $C_2$ via $C_1$. In particular, a causal path $C_1 \to \cdots \to C_2$ exists.

The idea is to recover the sought-after variable $C_2$ from the mixture by optimising for these statistical properties. Thence, the general strategy relies on solving an optimisation problem of the form⁷

$$\operatorname*{argmax}_{\mathbf{w}} \; \hat{D}(S, \mathbf{w}^\top F) - \hat{D}(S, \mathbf{w}^\top F \mid C_1)$$

where $F = (F_1, \ldots, F_d)^\top$ and $\hat{D}$ denotes a (conditional) dependence criterion that estimates from empirical samples the strength of association between the two variables. [Footnote 7: To attenuate the signal of $C_1$ we restrict the search onto the orthogonal complement of $\mathbf{v}$, which is diffeomorphic to the sphere $\mathbb{S}^{d-2}$.]

II-D Non-Linear MERLiN algorithm

The linear MERLiN algorithm uses the empirical correlation between $S$ and $\mathbf{w}^\top F$ and their partial correlation given $C_1$ for the marginal and conditional dependence terms in the objective function. As such, only linear dependence between $S$ and $\mathbf{w}^\top F$ can be detected, while remaining higher-order dependences between $S$ and $\mathbf{w}^\top F$ given $C_1$ may go undetected. A general kernel-based independence criterion, the Hilbert-Schmidt Independence Criterion (HSIC) [11], and a regression-based conditional independence criterion (cf. [5]) in conjunction with kernel ridge regression [12] allow the extension to non-linear dependences.

Regression-Based Conditional Independence Criterion

If there exists a (regression) function $g$ such that $(\mathbf{w}^\top F - g(C_1)) \perp\!\!\!\perp (S, C_1)$, then $S \perp\!\!\!\perp \mathbf{w}^\top F \mid C_1$.

The non-linear MERLiN algorithm solves the following optimisation problem

$$\operatorname*{argmax}_{\mathbf{w}, \sigma, \lambda} \; \widehat{\mathrm{HSIC}}(S, \mathbf{w}^\top F) - \widehat{\mathrm{HSIC}}(S, r)$$

where $\widehat{\mathrm{HSIC}}$ denotes the empirical HSIC estimate⁸ and $r = \mathbf{w}^\top F - g(C_1)$ corresponds to the residuals of kernel ridge regression with a Gaussian kernel of width $\sigma$ and ridge regression parameter $\lambda$. To temper overfitting, the sample is split into three partitions; the residuals of each partition are obtained by using the kernel ridge regression function fitted on the remaining two partitions. The regression parameters $\sigma$ and $\lambda$ are also optimised over, to allow an optimal regression fit wrt. witnessing conditional independence and hence minimising the second summand in the objective function. [Footnote 8: We compute the empirical HSIC estimate based on the Gaussian kernel, where the kernel size is determined by the median distance between points in input space [11].]
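The empirical HSIC estimate with a median-heuristic Gaussian kernel can be sketched as follows (a minimal biased estimator for one-dimensional variables, up to normalisation conventions; variable names are illustrative). Note that HSIC picks up the purely non-linear dependence $y = x^2$, which is uncorrelated with $x$:

```python
import numpy as np

def gauss_kernel(x):
    """Gaussian kernel matrix with median-distance bandwidth (cf. [11])."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    d2 = (x - x.T) ** 2
    dists = np.sqrt(d2[d2 > 0])
    sigma = np.median(dists) if dists.size else 1.0
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC estimate: trace(K H L H) / m^2."""
    m = len(x)
    h = np.eye(m) - np.ones((m, m)) / m   # centring matrix
    k, l = gauss_kernel(x), gauss_kernel(y)
    return np.trace(k @ h @ l @ h) / m ** 2

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y_dep = x ** 2 + 0.1 * rng.normal(size=300)   # non-linearly dependent on x
y_ind = rng.normal(size=300)                  # independent of x
print(hsic(x, y_dep), hsic(x, y_ind))         # first value clearly larger
```

A linear criterion such as correlation would score both pairs near zero here; the HSIC gap between the dependent and the independent pair is what the non-linear objective exploits.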

Implementing the objective function in Theano [13, 14], we use the Python toolbox Pymanopt [15] to run the optimisation on the product manifold of the sphere (for $\mathbf{w}$) and the Euclidean space of the regression parameters, using a steepest descent algorithm with standard back-tracking line-search. This approach is exact and efficient, relying on automatic differentiation and respecting the manifold geometry.

II-E Application to EEG data

We consider EEG trial-data of the form $F \in \mathbb{R}^{d \times m \times n}$ where $d$ denotes the number of electrodes, $m$ the number of trials, and $n$ the length of the time series for each electrode and each trial; that is, $F$ holds $m$ iid samples of a $\mathbb{R}^{d \times n}$-valued random variable. Analyses of EEG data commonly focus on trial-averaged log-bandpower in a particular frequency band. Accordingly, applying our algorithms to EEG data we aim to identify a linear combination $\mathbf{w}$ such that the log-bandpower of the resulting one-dimensional trial signals $\mathbf{w}^\top F$ is a causal effect of the log-bandpower of the one-dimensional trial signals $\mathbf{v}^\top F$.

However, the two operations of computing the log-bandpower and taking a linear combination do not commute. The log-bandpower computation therefore needs to be moved inside the objective function described above. This is accomplished by replacing $\mathbf{w}^\top F$ with $\mathrm{lbp}(\mathbf{w}^\top F)$ and letting $C_1 = \mathrm{lbp}(\mathbf{v}^\top F)$, where $\mathrm{lbp}$ denotes the operation of applying a Hanning window and computing the average log-bandpower in a specified frequency range for the given time series.
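A minimal sketch of such a log-bandpower operation, applied per trial to a channel mixture (sampling rate, frequency band, and array shapes are illustrative assumptions, not the paper's exact preprocessing):

```python
import numpy as np

def log_bandpower(x, fs, f_lo, f_hi):
    """Average log-bandpower of a 1-D signal in [f_lo, f_hi] Hz,
    after applying a Hanning window (cf. Section II-E)."""
    win = np.hanning(len(x))
    spec = np.abs(np.fft.rfft(x * win)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.mean(np.log(spec[band]))

# Per-trial usage on a channel mixture w^T F (hypothetical shapes):
fs, n, m, d = 250, 500, 20, 8
rng = np.random.default_rng(4)
F = rng.normal(size=(d, m, n))            # electrodes x trials x time
w = rng.normal(size=d)
mixed = np.einsum('d,dmn->mn', w, F)      # one time series per trial
lbp = np.array([log_bandpower(tr, fs, 55, 85) for tr in mixed])  # one value per trial
```

The resulting vector `lbp` plays the role of $\mathrm{lbp}(\mathbf{w}^\top F)$ in the objective: one scalar feature per trial, computed after mixing, so that the mixing weights remain inside the optimisation.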

III Comparative EEG analysis

III-A Experimental data

We applied the linear MERLiN algorithm and its non-linear extension to EEG data recorded during a neurofeedback experiment [16]. In this study the γ-log-bandpower in the right superior parietal cortex (SPC) was provided as feedback signal and subjects were instructed to up- or down-regulate this bandpower. Subjects were recorded in multiple sessions, each comprising a number of trials of fixed duration.

The data of one session consists of a stimulus vector $S$ encoding the instructed regulation direction per trial, a spatial filter $\mathbf{v}$ that was used to extract the feedback signal, and a tensor $F$ that holds the time series for each channel and trial. The reader is referred to [16] for more details on the experimental data.

III-B How to compare to previous results

We compare our MERLiN algorithms against a causal analysis of this neurofeedback experiment that is based on source localisation in combination with an exhaustive search procedure [5]. The hypothesis, based on transcranial magnetic stimulation studies [17], was that γ-oscillations in the SPC modulate γ-oscillations in the medial prefrontal cortex (MPC). We briefly describe this exhaustive search approach.

First, the signals of dipoles across the cortical surface were extracted using an LCMV beamformer and a three-shell spherical head model [18]. Then, the authors applied their newly introduced stimulus-based causal inference (SCI) algorithm to assess for every dipole whether its γ-log-bandpower is a linear causal effect of the γ-log-bandpower in the SPC. Group results were summarised in a vector whose entries denote, for each dipole, the percentage of dipoles within a certain radius that were found to be modulated by the SPC. The results of this exhaustive search analysis, visualised on hemisphere plots, supported the hypothesis that the γ-log-bandpower in the MPC is a linear causal effect of that in the SPC. The reader is referred to [5] for more details.

In contrast to exhaustive search, both our linear MERLiN algorithm and its non-linear extension aim at immediately recovering the causal effect by optimising a linear combination $\mathbf{w}$ of electrode signals.⁹ To allow for a qualitative comparison of our results with the results of the exhaustive search analysis, we derive for each spatial filter $\mathbf{w}$ a dipole involvement vector. This vector represents the involvement of each cortical dipole in the recovered signal and is derived from $\mathbf{w}$ as follows. First, a scalp topography $\mathbf{a}$ is obtained via $\mathbf{a} = \Sigma \mathbf{w}$, where the $i$-th entry of $\mathbf{a}$ is the covariance between the $i$-th EEG channel and the source that is recovered by $\mathbf{w}$ [19, Equation (7)]. Here $\Sigma$ denotes the session-specific covariance matrix of the channel signals in the γ-frequency band. Second, the dipole involvement vector is obtained from $\mathbf{a}$ via dynamic statistical parametric mapping (dSPM; with identity noise covariance matrix) [20]. Group results are obtained as the average of the individual dipole involvement vectors. [Footnote 9: Since there were only a limited number of samples per session we decided to select a subset of EEG channels distributed across the scalp. Hence, for each recording session we obtained a spatial filter $\mathbf{w}$ of correspondingly reduced dimension.]
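The first step, turning a backward-model spatial filter into a forward-model activation pattern via the channel covariance, can be sketched on simulated data (a hypothetical one-source toy example following the idea of [19]; all shapes and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 8, 1000

# Forward model: one source z projected into d channels plus sensor noise.
pattern_true = rng.normal(size=d)
z = rng.normal(size=m)
F = np.outer(pattern_true, z) + 0.5 * rng.normal(size=(d, m))

# Backward model: least-squares spatial filter w extracting z from F.
w = np.linalg.lstsq(F.T, z, rcond=None)[0]

# Haufe et al. [19]: for a single recovered source, the activation pattern
# is a = Sigma w (up to scaling), i.e. the channel-source covariances.
sigma = np.cov(F)
a = sigma @ w

# a is proportional to the true forward pattern, unlike the filter w itself:
cos = a @ pattern_true / (np.linalg.norm(a) * np.linalg.norm(pattern_true))
print(cos)   # close to 1
```

The filter weights `w` themselves are not interpretable as source topographies; multiplying by the channel covariance recovers the forward pattern, which is then suitable for source-space mapping such as dSPM.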

III-C Experimental results

The group-averaged results of our extended algorithm are depicted in Figure 2(a). Similar to the results in [5] and those we obtained with the linear MERLiN algorithm (cf. Figure 2(b)), the analysis indicates that the MPC is a causal effect of the SPC. The non-linear method yields results that closely agree with those obtained by our linear method, while exhaustive search additionally revealed the anterior middle frontal gyrus as an effect of the SPC.


Fig. 2: Group-averaged dipole involvement corresponding to the spatial filters identified by the (a) non-linear and (b) linear MERLiN algorithm; lateral and medial views of the left and right hemisphere. (All colour scales from “blue” to “red” range from zero to the largest value to be plotted.)

IV Conclusions

We have developed the non-linear MERLiN algorithm, which is able to recover a causal effect from an observed linear mixture with no constraint on the functional form of this cause-effect relationship. Iteratively projecting out recovered directions and re-applying the MERLiN algorithm may allow the identification of multiple distinct causal effects. For EEG data we found no qualitative difference to the linear method, which indicates that linear methods suffice to identify within-frequency cause-effect relationships in EEG data. Future research will focus on theoretical analysis of the presented methods and assumptions and investigate applicability to other real-world data.


  • [1] P. L. Nunez and R. Srinivasan, Electric Fields of the Brain: The Neurophysics of EEG.   Oxford University Press, 2006.
  • [2] M. Grosse-Wentrup, “Understanding brain connectivity patterns during motor imagery for brain-computer interfacing,” in Advances in Neural Information Processing Systems, 2009, pp. 561–568.
  • [3] S. Weichwald, M. Grosse-Wentrup, and A. Gretton, “MERLiN: Mixture Effect Recovery in Linear Networks,” arXiv preprint arXiv:1512.01255, 2015.
  • [4] S. Weichwald, T. Meyer, O. Özdenizci, B. Schölkopf, T. Ball, and M. Grosse-Wentrup, “Causal interpretation rules for encoding and decoding models in neuroimaging,” NeuroImage, vol. 110, pp. 48–59, 2015.
  • [5] M. Grosse-Wentrup, D. Janzing, M. Siegel, and B. Schölkopf, “Identification of causal relations in neuroimaging data with latent confounders: An instrumental variable approach,” NeuroImage, vol. 125, pp. 825–833, 2016.
  • [6] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed.   MIT press, 2000.
  • [7] J. Pearl, Causality: Models, Reasoning and Inference, 2nd ed.   Cambridge University Press, 2009.
  • [8] M. Eichler and V. Didelez, “On Granger causality and the effect of interventions in time series,” Lifetime Data Analysis, vol. 16, no. 1, pp. 3–32, 2010.
  • [9] J. T. Lizier and M. Prokopenko, “Differentiating information transfer and causal effect,” The European Physical Journal B, vol. 73, no. 4, pp. 605–615, 2010.
  • [10] T. Verma and J. Pearl, “Equivalence and synthesis of causal models,” in Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence.   Elsevier, 1990, pp. 255–268.
  • [11] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A Kernel Statistical Test of Independence,” in Advances in Neural Information Processing Systems 20, 2008, pp. 585–592.
  • [12] C. Saunders, A. Gammerman, and V. Vovk, “Ridge Regression Learning Algorithm in Dual Variables,” in Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 515–521.
  • [13] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), Jun. 2010.
  • [14] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: New features and speed improvements,” Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
  • [15] J. Townsend, N. Koep, and S. Weichwald, “Pymanopt: A Python Toolbox for Manifold Optimization using Automatic Differentiation,” arXiv preprint arXiv:1603.03236, 2016.
  • [16] M. Grosse-Wentrup and B. Schölkopf, “A brain–computer interface based on self-regulation of gamma-oscillations in the superior parietal cortex,” Journal of Neural Engineering, vol. 11, no. 5, p. 056015, 2014.
  • [17] A. C. Chen, D. J. Oathes, C. Chang, T. Bradley, Z.-W. Zhou, L. M. Williams, G. H. Glover, K. Deisseroth, and A. Etkin, “Causal interactions between fronto-parietal central executive and default-mode networks in humans,” Proceedings of the National Academy of Sciences, vol. 110, no. 49, pp. 19944–19949, 2013.
  • [18] J. C. Mosher, R. M. Leahy, and P. S. Lewis, “EEG and MEG: Forward solutions for inverse methods,” Biomedical Engineering, IEEE Transactions on, vol. 46, no. 3, pp. 245–259, 1999.
  • [19] S. Haufe, F. Meinecke, K. Görgen, S. Dähne, J.-D. Haynes, B. Blankertz, and F. Bießmann, “On the interpretation of weight vectors of linear models in multivariate neuroimaging,” NeuroImage, vol. 87, pp. 96–110, 2014.
  • [20] A. M. Dale, A. K. Liu, B. R. Fischl, R. L. Buckner, J. W. Belliveau, J. D. Lewine, and E. Halgren, “Dynamic statistical parametric mapping: Combining fMRI and MEG for high-resolution imaging of cortical activity,” Neuron, vol. 26, no. 1, pp. 55–67, 2000.