MERLiN: Mixture Effect Recovery in Linear Networks (cf. http://arxiv.org/abs/1512.01255)
Causal inference concerns the identification of cause-effect relationships between variables, e.g. establishing whether a stimulus affects activity in a certain brain region. The observed variables themselves often do not constitute meaningful causal variables, however, and linear combinations need to be considered. In electroencephalographic studies, for example, one is not interested in establishing cause-effect relationships between electrode signals (the observed variables), but rather between cortical signals (the causal variables) which can be recovered as linear combinations of electrode signals. We introduce MERLiN (Mixture Effect Recovery in Linear Networks), a family of causal inference algorithms that implement a novel means of constructing causal variables from non-causal variables. We demonstrate through application to EEG data how the basic MERLiN algorithm can be extended for application to different (neuroimaging) data modalities. Given an observed linear mixture, the algorithms can recover a causal variable that is a linear effect of another given variable. That is, MERLiN allows us to recover a cortical signal that is affected by activity in a certain brain region, while not being a direct effect of the stimulus. The Python/Matlab implementation for all presented algorithms is available on https://github.com/sweichwald/MERLiNREAD FULL TEXT VIEW PDF
Causal inference concerns the identification of cause-effect relationshi...
Applications of machine learning tools to problems of physical interest ...
We pose causal inference as the problem of learning to classify probabil...
In addition to efficient statistical estimators of a treatment's effect,...
We propose a method to infer causal structures containing both discrete ...
Empirical Dynamic Modeling (EDM) is a nonlinear time series causal infer...
Extracting causal relationships from observed correlations is a growing ...
MERLiN: Mixture Effect Recovery in Linear Networks (cf. http://arxiv.org/abs/1512.01255)
Causal inference requires causal variables. Observed variables do not themselves always constitute the causal relata that admit meaningful causal statements, however, and transformations of the variables might be required to isolate causal signals. Images, for example, consist of microscopic variables (pixel colour values) while the identification of meaningful cause-effect relationships requires the construction of macroscopic causal variables (e. g. whether the image shows a magic wand) . That is, it is implausible that a description of effects of changing the colour value of one single pixel, the microscopic variable, leads to characterisation of a meaningful cause-effect relationship; however, the existence of a magic wand, the macroscopic variable, may lead to meaningful statements of the form “manipulating the image such that it shows a magic wand affects the chances of little Maggie favouring this image”. A similar problem often occurs whenever only a linear mixture of causal variables can be observed. In electroencephalography (EEG) studies, for example, measurements at electrodes placed on the scalp are considered to be instantaneously and linearly superimposed electromagnetic activity of sources in the brain . Again, statements about the microscopic variables are meaningless, e. g. “manipulating the electrode’s signal affects the subject’s attentional state”; however, macroscopic variables such as the activity in the parietal cortex, extracted as a linear combination of electrode signals, admit meaningful causal statements such as “manipulating the activity in the parietal cortex affects the subject’s attentional state”. Standard causal inference methods require that all underlying causal variables, i. e., the sources in the brain, are first constructed –or rather recovered– from the observed mixture, i. e., the electrode signals.
There exist a plethora of methods to construct macroscopic variables both from images and linear mixtures. However, prevailing methods to learn visual features [3, 4, 5] ignore the causal aspect, and are fundamentally different from the recent and (to our knowledge) only work that demonstrates how visual causal features can be learned by a sequence of interventional experiments 
. Likewise, methods to (re-)construct variables from linear mixtures commonly ignore the causal aspect and often rest on implausible assumptions. For instance, independent component analysis (ICA), commonly employed in the analysis of EEG data, rests on the assumption of mutually independent sources[6, 7, 8]. One may argue that muscular or ocular artefacts are independent of the cortical sources and can be extracted via ICA [9, 10]. It seems implausible, though, that cortical sources are mutually independent. In fact, if they were mutually independent there would be no cause-effect relationships between them. Thus, methods ignoring the causal aspect are unsuited to construct meaningful causal variables.
Mixture Effect Recovery in Linear Networks (MERLiN) aims to construct a causal variable from a linear mixture without requiring multiple interventional experiments. The fundamental idea is to directly search for statistical in- and dependences that imply, under assumptions discussed below, a certain cause-effect relationship. In essence, given iid samples of a univariate randomised variable , a univariate causal effect of , and a multivariate variable , MERLiN searches for a linear combination such that is a causal effect of , i. e., . This implements the novel idea to construct causal variables such that the resulting statistical properties guarantee meaningful cause-effect relationships.
As an illustration, consider the directed acyclic graph (DAG) shown in Figure 1, where the gap indicates that is disconnected from all other variables. In this notation edges denote cause-effect relationships starting at the cause and pointing towards the effect. denotes a randomised variable. We assume that only a linear mixture ( is an unknown mixing matrix) of the causal variables can be observed, and that such that is known. MERLiN’s goal is to recover from the observed linear mixture a causal variable that is an effect of , i. e., to find such that is an effect of , where is a valid solution.
We introduce the basic MERLiN algorithm that can recover the sought-after causal variable when the cause-effect relationships are linear with additive Gaussian noise (cf. Section III). We also demonstrate how the algorithm can be extended for application to different (neuroimaging) data modalities by (a) including data specific preprocessing steps into the optimisation procedure, and (b) incorporating a priori domain knowledge (cf. Section V). Here, both concepts are demonstrated through application to EEG data when cause-effect relationships within individual frequency bands are of interest.
Thus, we present three related algorithms MERLiN, MERLiN, and MERLiN that are based on optimisation of precision matrix entries (indicated by the subscript ). The MERLiN and MERLiN algorithms include preprocessing steps that allow application to timeseries data, identifying a linear combination of timeseries signals such that the resulting log-bandpower (indicated by the superscript ) reveals the sough-after cause-effect relationship. A further extension (indicated by the superscript ) takes domain knowledge about time-lags into account.
For stimulus-based neuroimaging studies , the MERLiN and MERLiN algorithms can establish a cause-effect relationship between brain state features that are observed only as part of a linear mixture. As such, MERLiN is able to provide insights into brain networks beyond those readily obtained from encoding and decoding models trained on pre-defined variables 
. Furthermore, it employs the framework of Causal Bayesian Networks (CBNs) that has recently been fruitful in the neuroimaging community[13, 14, 15, 16, 12, 17] — the important advantage over methods based on information flow being that it yields testable predictions on the impact of interventions [18, 19].
MERLiN shows good performance both on synthetic and EEG data recorded during neurofeedback experiments. The Python/Matlab implementation for all presented algorithms is available on https://github.com/sweichwald/MERLiN.
In general causal inference requires three steps.
construction of (causal) variables
inference of cause-effect relationships among the variables defined in 1)
estimation of the functional form and strength of the causal links established in 2)
In this manuscript we focus on and merge the first two steps. More specifically, a causal variable is implicitly constructed by optimising for properties that at the same time establish a specific cause-effect relationship for this variable. Here we briefly introduce the main aspects of Causal Bayesian Networks (CBNs) that define causation in terms of effects of interventions and allow inference of cause-effect relationships (step 2) from conditional independences in the observed distribution. For an exhaustive treatment see [20, 21].
We define a structural equation model (SEM) as a set of structural equations where the so-called noise variables are independently distributed according to . For the set contains the so-called parents of and describes how
relates to the random variables inand
. This induces the unique joint distribution denoted by.111Note that the distribution has a density if has a density and the functions are differentiable.
Replacing at least one of the functions by a constant yields a new SEM. We say has been intervened on, which is denoted by , leads to the SEM , and induces the interventional distribution .
is a cause of () wrt. a SEM iff there exists such that .222 and denote the marginal distributions of corresponding to and respectively. is an effect of iff is a cause of . Often the considered SEM is omitted if it is clear from the context.
For each SEM there is a corresponding graph with and that has the random variables as nodes and directed edges pointing from parents to children. We employ the common assumption that this graph is acyclic, i. e., will always be a directed acyclic graph (DAG).
It is insightful to consider the following implication of Definition 2: If in there is no directed path from to , is not a cause of (wrt. ). The following example shows that without further assumptions the converse is not true in general, i. e., existence of a path does not generally imply a cause-effect relationship. This nuisance will be accounted for by the faithfulness assumption (cf. Definition 6 below). We provide supportive graphical depictions in Figure 2.
Consider a SEM with structural equations and graph shown in Figure (a)a and noise variables . In there is a directed path (in fact even a directed edge) from to while for all , i. e., intervening on does not alter the distribution of . That is, is not a cause of wrt. despite the existence of the edge (cf. Figure (b)b).
So far a DAG simply depicts all parent-child relationships defined by the SEM . Missing directed paths indicate missing cause-effect relationships. In order to specify the link between statistical independence (denoted by ) wrt. the joint distribution and properties of the DAG (representing a SEM ) we need the following definitions.
For a fixed graph disjoint sets of nodes and are d-separated by a third disjoint set (denoted by ) iff all pairs of nodes and are d-separated by . A pair of nodes is d-separated by iff every path between and is blocked by . A path between nodes and is blocked by iff there is an intermediate node on the path such that (i) and is tail-to-tail () or head-to-tail (), or (ii) is head-to-head () and neither nor any of its descendants is in .
A distribution satisfies the global Markov property wrt a graph if
It satisfies the local Markov property wrt if each node is conditionally independent of its non-descendants given its parents. Both properties are equivalent if has a density333For simplicity we assume that distributions have a density wrt. some product measure throughout this text. (cf. [22, Theorem 3.27]); in this case we say is Markov wrt .
generated by a SEM is said to be faithful wrt. , if
Conveniently the distribution generated by a SEM is Markov wrt. (cf. [21, Theorem 1.4.1] for a proof). Hence, if we assume faithfulness444Intuitively, this is saying that conditional independences are due to the causal structure and not accidents of parameter values [20, p. 9]. conditional independences and d-separation properties become equivalent
Summing up, we have defined interventional causation in terms of SEMs and have seen how a SEM gives rise to a DAG. This DAG has two convenient features. Firstly, the DAG yields a visualisation that allows to easily grasp missing cause-effect relationships that correspond to missing directed paths. Secondly, assuming faithfulness d-separation properties of this DAG are equivalent to conditional independence properties of the joint distribution. Thus, conditional independences translate into causal statements, e. g. ‘a variable becomes independent of all its non-effects given its immediate causes’ or ‘cause and effect are marginally dependent’. Furthermore, the causal graph can be identified from conditional independences observed in — at least up to a so-called Markov equivalence class, the set of graphs that entail the same conditional independences .
The proposed algorithms require optimisation of objective functions over the unit-sphere . For generality we view the sphere as a special case of the Stiefel manifold () for
. Implementing the respective objective functions in Theano[24, 25], we use the Python toolbox Pymanopt  to perform optimisation directly on the respective Stiefel manifold using a steepest descent algorithm with standard back-tracking line-search.555For the experiments presented in this manuscript we set both the minimum step size and gradient norm to (arbitrary choice) and the maximum number of steps to (generous choice based on preliminary test runs that met the former stopping criteria in much earlier iterations). Our toolbox allows to adjust both parameters. This approach is exact and efficient, relying on automated differentiation and respecting the manifold geometry.
We consider a situation in which only a linear combination of observed variables constitutes a meaningful causal variable. These scenarios naturally arise if only samples of a linear mixture of the underlying causal variables are accessible (cf. Figure 1). Standard causal inference methods cannot infer cause-effect relationships among the causal variables without first undoing the unknown linear mixing (also known as blind source separation). MERLiN aims to establish a cause-effect relationship among causal variables in a linear network while reconstructing a causal variable at the same time. In other words, a causal variable is reconstructed by optimising for the statistical properties that imply a certain kind of cause-effect relationship.
In this section we first provide the formal problem description. We then derive sufficient conditions for the kind of cause-effect relationship MERLiN aims to establish, and discuss assumptions on the linear mixing. Finally, the basic precision matrix based MERLiN algorithm is introduced, which optimises for these sufficient statistical properties in order to recover a linear causal effect from an observed linear mixture.
The terminology introduced in Section II-A allows to precisely state the problem as follows.
Let and denote (finitely many) real-valued random variables. We assume existence of a SEM , potentially with unobserved variables , that induces . We refer to the corresponding graph as the true causal graph and call its nodes causal variables. We further assume that
affects indirectly via ,666By saying a variable causes indirectly via we imply (a) existence of a directed path , and (b) that there is no directed path without on it (this also excludes the edge ).
is faithful wrt. ,
there are no edges in pointing into .
In an experimental setting the last condition is ensured by randomising .777Randomisation corresponds to an intervention: the structural equation of is replaced by where is an independent randomisation variable, e. g. assigning placebo or treatment according to an independent Bernoulli variable. Figure 3 depicts an example of how might look.
iid888independent and identically distributed samples of and of where is the observed linear mixture of the causal variables and denotes the unknown mixing matrix
the linear combination that extracts the causal variable is assumed known
That is, we have samples of , , and but not of where is an unknown linear mixture of .
Find such that where is an effect of (). In other words, the aim is to recover a causal variable –up to scaling– that is an effect of . For example, recovery of the causal variable is a valid solution. The factor reflects the scale indeterminacy that results from the linear mixing, i. e., since is unknown the scale of the causal variables cannot be determined unless further assumptions are employed or a priori knowledge is available.
We are given that there exists at least one causal variable that is indirectly affected by via . However, we only have access to samples of the linear mixture and samples of . Note the following properties of :
Since is faithful wrt. it follows that (and ).
Since is Markov wrt. it follows that .
We can derive the following sufficient conditions for a causal variable to be indirectly affected by via .
Given the assumptions in Section III-A1 and a causal variable . If and , then indirectly affects via . In particular, a directed path from to , denoted by , exists.
From and being Markov wrt. it follows that and are not d-separated in by the empty set. In there must be at least one path , or for some node . By assumption is affected by , i. e., we have in . Hence, in there must be at least one path , or for some node . Under the assumption of faithfulness, the latter two cases contradict . Hence, in at least one path exists.
From and being faithful wrt. it follows that and are d-separated in by . That is, given every path between and is blocked. In particular, in there is no edge and no path without on it. Hence, is indeed indirectly affected by via . ∎
This leads to our general idea on how to find a linear combination that recovers a causal effect of . If MERLiN finds such that the following two statistical properties hold true
(a) , and (b)
then we have identified a candidate causal effect of , i. e., we have identified a variable such that . Note that conditioning on cannot unblock a path that was blocked before since there are no edges pointing into ; conversely the conditional dependence implies the marginal dependence . Hence, MERLiN can also optimise for the following alternative statistical properties
(a’) , and (b)
to recover a candidate causal effect of . This reformulation is useful since it allows optimisation of two analogous conditional (in)dependence properties instead of marginal and conditional (in)dependence. Ideally and under mixing assumptions discussed below, optimising wrt. these statistical properties will indeed recover a causal variable, i. e., (), that is an effect of . Note that this approach even works in the presence of hidden confounders.
MERLiN’s strategy presented in the previous section is to optimise a linear combination of the observed linear mixture such that two statistical properties are fulfilled. However, without imposing further assumptions on the linear mixing it may be impossible to recover the desired causal variable by this procedure. Here we discuss problems that can occur for arbitrary mixing and specify our mixing assumptions, namely that is an orthogonal matrix.
In the first place, there may not exist a solution to MERLiN’s problem if has rank less than .999Note that being at least rank is not a necessary condition, i. e., an effect of may be recoverable even in cases where has rank less than . As an example consider the case where is an effect of and . However, the focus is a sufficient condition for the existence of a solution. Hence, assume that has rank and, for simplicity, that is a square matrix. This guarantees existence of a solution: if the mixing matrix is invertible a solution to the problem is to recover via .
However, if we only assume to be invertible MERLiN may not be able to recover (a multiple of) a causal variable from the sought-after statistical properties alone. The following example demonstrates the problem that arises from the fact that itself is part of the observed linear mixture.
Assume is the true but unknown causal graph, where the gap indicates that is disconnected from all other variables. Assume all variables are non-degenerate and that the unknown mixing matrix is invertible. Then, any variable recovered as linear combination from the observed linear mixture can be written as
for some .
MERLiN aims to recover the causal variable (up to scaling) by optimising such that the statistical properties
(or equivalently ), and
hold true (cf. Section III-B). Indeed () fulfils these statistical properties and is the desired output. However, all () likewise fulfil the statistical properties while not being (a multiple of) a causal variable.
This example demonstrates that without imposing further constraints on the linear mixing MERLiN may recover (ensuring the dependence on ) with independent variables added on top (ensuring conditional independence of given ), e. g. in above example. Although the sought-after statistical properties hold true for this variable, this is clearly not a desirable output and does not recover a causal variable.
One way to mitigate this situation is to restrict search to the orthogonal complement of . This way, the signal of in the linear mixture is attenuated. In particular, if the mixing matrix is orthogonal restricting search to amounts to complete removal of ’s signal from . We therefore assume that is an orthogonal matrix and restrict the search to . It is then no longer possible to add arbitrary multiples of onto independent variables to introduce the sought-after dependence, i. e., the recovery of non-causal variables like in above example is prevented.
Note that while adding independent variables onto effects is still possible (e. g. consider in above example), it will be counter-acted by setting up the objective function accordingly — roughly speaking, as we ‘maximise dependence’, then these independent variables will be suppressed, since they act as noise and reduce dependence.
The basic MERLiN algorithm aims to recover a linear causal effect from an observed linear mixture. In particular, we assume that the cause-effect relationships between the underlying causal variables and are linear with additive Gaussian noise. In such a linear network, zero entries in the precision matrix imply missing edges in the graph . Hence, if is a linear effect of the precision matrix of the three variables and is of the form
where stars indicate non-zero entries. This implies the partial correlations and which, in the Gaussian case, amounts to the desired conditional (in-)dependences (a’) and (b) (cf. Section III-B) .
Exploiting this link, the precision matrix based MERLiN algorithm (cf. Algorithm 1) implements the general idea presented in Section III-B by maximising the objective function101010For numerical reasons one might want to use the approximation for small to ensure that is differentiable everywhere.
where (here we assumed and invertibility). Optimisation is performed over the unit-sphere after projecting onto the orthogonal complement .
define the objective function for as
where the empirical precision matrix is
is the orthonormal matrix that accomplishes projection onto the orthogonal complement
is the centering matrix
denotes the synthetic dataset that is generated by Algorithm 2. It consists of samples of an orthogonal linear mixture of underlying causal variables that follow the causal graph shown in Figure 3. The parameters and determine the statistical properties of the generated dataset as follows. The parameter adjusts the severity of hidden confounding between and . Note also that the link between and is weaker for higher values of , i. e., . The link between and becomes noisier for higher values of , i. e., . Furthermore, the value of the objective function for recovering is lower for higher values of since –in the infinite sample limit– we have
Hence, these datasets allow to analyse the behaviour of the algorithm for cause-effect relationships of different strengths and its robustness against hidden confounding.
Input: , ,
generate a random orthogonal matrix
by Gram-Schmidt orthonormalising a matrix with entries independently drawn from a standard normal distribution
generate independent mean parameters from
generate independent samples according to the following SEM
where , , and if or if
arrange the samples of in a column vector
arrange each sample of in a column vector and (pre-)multiply by to obtain the corresponding sample of
arrange the samples of as columns of a matrix
Since we can ignore scaling, it is not a problem that we in fact generate an orthonormal matrix.
We introduce two performance measures to assess MERLiN’s performance on synthetic data with known ground truth . Since a solution can only be identified up to scaling, we only need to consider the -sphere . The closer a vector or its negation is to the ground truth the better. This leads to the performance measure of angular distance
Another approach is to assess the quality of the recovered
by the probability of obtaining a vector that is closer toif chosen uniformly at random on the -sphere. We define the probability of a better vector as
where and is the dimension of the input vector. This quantity is obtained by dividing the area of the smallest hyperspherical cap centred at that contains or by half the area of the -sphere. The former equals the area of the hyperspherical cap of height , the latter equals the area of the hyperspherical cap of height . Exploiting the concise formulas for the area of a hyperspherical cap with radius presented in  we obtain
where and is the regularized incomplete beta function. It is interesting to note that
is the cumulative distribution function of adistribution such that .
For simplicity we drop the ground truth vector from the notation and simply assume that the corresponding ground truth vector is always the point of reference. Both performance measures are related inasmuch as iff and iff . However, they capture somewhat complementary information: assesses how close the vector is in absolute terms, while accounts for the increased problem complexity in higher dimensions.
We applied the precision matrix based MERLiN algorithm (cf. Algorithm 1) to the synthetic datasets described in Section IV-A. The results of runs111111For each run we create a new dataset. This is the case for all experiments on synthetic data. The performance measures and are always considered wrt. the corresponding of each dataset instance. for different configurations of are summarised as boxplots in Figures 4 and 5. Recall that lower values of and indicate that is closer to the ground truth . We observe the following:
The results for Gaussian () or binary (; not shown here) variable are indistinguishable.
Performance is insensitive to the severity of hidden confounding, which can be seen by comparing the plots row-wise for the different values of . This behaviour is expected since .
Performance decreases with increasing noise level, i. e., with increasing values of . Note that is a sum of and noise
with varianceand respectively.
The problem becomes harder in higher dimensions, resulting in worse performance. However, the results for indicate that even if the solution is not that close to in an absolute sense () the solution is good in a probabilistic sense.
More samples increase performance. Especially if the noise level and the dimension are not both high at the same time, MERLiN still achieves good performance on samples (cf. the results for or ).
In this section we demonstrate how the basic MERLiN algorithm can be extended to enable application to different data modalities by (a) including data specific preprocessing steps into the optimisation procedure (cf. Section V-A), and (b) incorporating a priori domain knowledge (cf. Section V-B). In particular, we demonstrate this for neuroimaging data, since stimulus-based experiments pose a prototype application scenario for MERLiN for the following reasons.
Recent work in the neuroimaging community focusses on functional networks, i. e., a (linear) combination of activity spread across the brain that is functionally (read causally) relevant . Additionally, the recorded activity is often assumed to be a linear combination of underlying cortical variables, as for example in EEG .
Simple univariate methods suffice to identify an effect of [12, Interpretation rule S1].
The proposed method can readily be applied and complement the standard analysis procedures employed in such experiments. More precisely, MERLiN can recover meaningful cortical networks (read causal variables) that are causally affected by , thereby establishing a cause-effect relationship between two functional brain state features.
Analyses of EEG data commonly focus on trial-averaged log-bandpower in a particular frequency band. Accordingly, we aim at identifying a linear combination of electrode signals such that the trial-averaged log-bandpower of the recovered signal is indirectly affected by the stimulus via another predefined cortical signal. We demonstrate how to do so by extending the basic MERLiN algorithm to include the log-bandpower computation into the optimisation procedure.
More precisely, we consider EEG trial-data of the form where denotes the number of electrodes, the number of trials, and the length of the time series for each electrode and each sample .121212Note that the MERLiN algorithm takes data of the form as input and cannot readily be applied to timeseries data . The aim is to identify a linear combination such that the log-bandpower of the resulting one-dimensional trial signals is a causal effect of the log-bandpower of the one-dimensional trial signals . However, since the two operations of computing the log-bandpower (after windowing) and taking a linear combination do not commute, we cannot compute the trial-averaged log-bandpower for each channel first and then apply the standard precision matrix based MERLiN algorithm. Instead, MERLiN has been adapted to the analysis of EEG data by switching in the log-bandpower computation.
To simplify the optimisation loop we exploit the fact that applying a Hanning window131313We apply a Hanning window in order to keep the feature computation in line with .
and the FFT to each channel’s signal commutes with taking a linear combination of the windowed and Fourier transformed time series. Note that averaging of the log-moduli () of the Fourier coefficients does not commute with taking a linear combination. Hence, windowing and computing the FFT is done in a separate preprocessing step (cf. Algorithm 3), while the trial-averaged bandpower is computed within the optimisation loop after taking the linear combination. Implementation details for the bandpower and precision matrix based MERLiN algorithm are described in Algorithm 4. To ease implementation we treat the complex numbers as two-dimensional vector space over the reals.
Input: , the sampling frequency , and the desired frequency range defined by and
set , , and
for from to , for from to
center, apply Hanning window and compute FFT, i. e., treat as a column vector and set
extract relevant Fourier coefficients corresponding to , i. e., set
remove direction from , i. e., for from to set
is the orthonormal matrix that accomplishes projection onto the orthogonal complement