1 Introduction
The analysis of multivariate data is often complicated by high dimensionality and complex interdependences between the observed variables. In order to identify patterns in such data, it is therefore desirable and often necessary to separate different aspects of the data. In multivariate statistics, for example, principal component analysis (PCA) is a common preprocessing step that decomposes the data into orthogonal principal components, which are sorted according to how much of the variance of the original data each component explains. There are two important applications of this. Firstly, one can reduce the dimensionality of the data by projecting it onto the lower-dimensional space spanned by the leading principal components, which maximize the explained variance. Secondly, since the principal components are orthogonal, they separate, in some sense, different (uncorrelated) aspects of the data. In many situations this enables a better interpretation and representation.
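To make the PCA preprocessing step concrete, here is a minimal NumPy sketch (our own illustration, not part of this work): it centers the variables, eigendecomposes the empirical covariance, and projects onto the leading components sorted by explained variance.

```python
import numpy as np

def pca_scores(X, k):
    """Project d x n data (rows = variables) onto its k leading principal components."""
    Xc = X - X.mean(axis=1, keepdims=True)        # center each variable
    cov = Xc @ Xc.T / (X.shape[1] - 1)            # empirical covariance, d x d
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # sort by explained variance
    W = eigvecs[:, order[:k]]                     # leading principal directions
    return W.T @ Xc                               # k x n component scores

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000))
scores = pca_scores(X, 2)
```

By construction, the resulting component scores are uncorrelated and ordered by decreasing variance, which is exactly the separation property used in the two applications above.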
Often, however, PCA may not be sufficient to separate the data in a desirable way due to more complex interdependences in the multivariate data (see, e.g., Section 1.3.3 in Hyvärinen et al. (2001) for an instructive example). This observation motivates the development of independent component analysis (ICA), formally introduced in its current form by Cardoso (1989a) and Comon (1994). ICA is a widely used unsupervised blind source separation technique that aims at decomposing an observed mixture of independent source signals. More precisely, assuming that the observed data is a linear mixture of underlying independent variables, one seeks the unmixing matrix that maximizes the independence between the signals it extracts. There has been a large amount of research into different types of ICA procedures and their interpretations, e.g., Bell and Sejnowski (1995, Infomax), who maximize the entropy, Hyvärinen (1999, fastICA), who maximizes the kurtosis, or Belouchrani et al. (1997, SOBI), who minimize time-lagged dependences, to name only the most widespread examples. ICA has applications in many fields, for example in finance (e.g., Back and Weigend, 1997), the study of functional magnetic resonance imaging (fMRI) data (e.g., McKeown et al., 1998a, b; Calhoun et al., 2003), and notably in the analysis of electroencephalography (EEG) data (e.g., Makeig et al., 1996, 1997; Delorme and Makeig, 2004). The latter is motivated by the common assumption that the signals recorded at EEG electrodes are a (linear) superposition of cortical dipole signals (Nunez and Srinivasan, 2006). Indeed, ICA-based preprocessing has become the de facto standard for the analysis of EEG data. The extracted components are interpreted as corresponding to cortical sources (e.g., Ghahremani et al., 1996; Zhukov et al., 2000; Makeig et al., 2002) or used for artifact removal by dropping components that are dominated by ocular or muscular activity (e.g., Jung et al., 2000; Delorme et al., 2007).
In many applications, the data at hand is heterogeneous and parts of the samples can be grouped by the different settings (or environments) under which the observations were taken. For example, we can group those samples of a multi-subject EEG recording that belong to the same subject. For the analysis and interpretation of such data across different groups, it is desirable to extract one common set of features or signals instead of obtaining an individual ICA decomposition for each group of samples separately. Here, we present a novel, methodologically sound framework that extends the ordinary ICA model, respects the group structure, and is robust by explicitly accounting for groupwise stationary confounding. More precisely, we consider a model of the form
(1.1) X = A · (S + H),

where the mixing matrix A remains fixed across different groups, S is a vector of independent source signals and H is a vector of stationary confounding noise variables with fixed covariance within each group (an intuitive example where such a scenario may be encountered in practice is illustrated in Figure 7). Based on this extension of ordinary ICA, we construct a method and an easy-to-implement algorithm to extract one common set of sources that are robust against confounding within each group and can be used for across-group analyses. The unmixing also generalizes to previously unseen groups.

1.1 Relation to existing work
ICA is well-studied, with a tremendous amount of research related to various types of extensions and relaxations of the ordinary ICA model. In light of this, it is important to understand where our proposed procedure is positioned and why it is an interesting and useful extension. Here, we look at ICA research from three perspectives and illustrate how our proposed coroICA methodology relates to existing work. First, in Section 1.1.1 we compare our proposed methodology with other noisy ICA models. In Section 1.1.2, we review ICA procedures based on approximate joint matrix diagonalization. Finally, in Section 1.1.3 we summarize the existing literature on ICA procedures for grouped data and highlight the differences to coroICA.
1.1.1 Noisy ICA models
The ordinary ICA model assumes that the observed process X is a linear mixture of independent source signals S without a confounding term H. Identifiability of the source signals is guaranteed by assumptions on S such as non-Gaussianity or specific time structures. For coroICA we require, similar to other second-order based methods (cf. Section 1.1.2), that the source process is nonstationary. More precisely, we require that either the variance or the autocovariance of S changes across time. An important extension of the ordinary ICA model is known as noisy ICA (e.g., Moulines et al., 1997), in which the data generating process is assumed to be an ordinary ICA model with additional additive noise. In general, this leads to further identifiability issues. These can be resolved by assuming that the additive noise is Gaussian (e.g., Hyvärinen, 1999), which enables a separation of the true (non-Gaussian) sources from the (Gaussian) noise. Another possibility is to assume that the noise is independent over time, while the source signals are time-dependent (e.g., Choi and Cichocki, 2000b). In contrast, our assumption on the noise term is much weaker, since we only require it to be stationary and hence in particular allow for time-dependent noise in coroICA. As we show in our simulations in Section 4.2.3, this renders our method robust with respect to confounding noise: coroICA is more robust against time-dependent noise while remaining competitive in the setting of time-independent noise. We refer to the book by Hyvärinen et al. (2001) for a review of most existing ICA models and the assumptions required for identifiability.
1.1.2 ICA based on approximate joint diagonalization
As an extension of PCA, the concept of ICA is naturally connected to the notion of joint diagonalization of covariance-type matrices. One of the first procedures for ICA was FOBI, introduced by Cardoso (1989b), which aims to jointly diagonalize the covariance matrix and a fourth-order cumulant matrix. Extending this idea, Cardoso and Souloumiac (1993) introduced the method JADE, which improves on FOBI by diagonalizing several different fourth-order cumulant matrices. Unlike FOBI, JADE uses a general joint matrix diagonalization algorithm, which is the de facto standard for all modern approaches. In fact, there is a still-active field that focuses on approximate joint matrix diagonalization, commonly restricted to positive semi-definite matrices, and often with the purpose of improving ICA procedures (e.g., Cardoso and Souloumiac, 1996; Ziehe et al., 2004; Tichavsky and Yeredor, 2009; Ablin et al., 2018).
Both JADE and FOBI are based on the assumption that the signals are non-Gaussian. This ensures that the sources are identifiable given independent and identically distributed observations. A different stream of ICA research departs from this assumption and instead assumes that the data are a linear mixture of independent weakly stationary time series. This model is often referred to as a second-order source separation (SOS) model. The time structure in these models allows one to identify the sources by jointly diagonalizing the covariance and autocovariance. The first method developed for this setting is AMUSE (Tong et al., 1990), which diagonalizes the covariance matrix and the autocovariance matrix for one fixed lag. The performance of AMUSE is, however, fragile with respect to the exact choice of the lag, which complicates practical application (Miettinen et al., 2012). Instead of using only a single lag, Belouchrani et al. (1997) proposed the method SOBI, which uses all lags up to a certain order and jointly diagonalizes all the resulting autocovariance matrices. SOBI is to date still one of the most commonly employed methods, in particular in EEG analysis.
The SOS model is based on the assumption of weak stationarity of the sources, which in particular implies that the signals have fixed variance and autocovariance structure across time. This assumption can be dropped, and the resulting models are often termed nonstationary source separation (NSS) models. The nonstationarity can be leveraged to boost the performance of ICA methods in various ways (see Matsuoka et al., 1995; Hyvärinen, 2001; Choi and Cichocki, 2000a, b; Choi et al., 2001; Pham and Cardoso, 2001). All aforementioned methods make use of the nonstationarity by jointly diagonalizing different sets of covariance or autocovariance matrices and mainly differ in how they perform the approximate joint matrix diagonalization. For example, the methods introduced by Choi and Cichocki (2000a, b) and Choi et al. (2001) make use of nonstationarity across sources by separating the data into blocks and jointly diagonalizing either the covariance matrices, the autocovariances, or both across all blocks. For our experimental comparisons, we implemented all three of these methods with the slight modification that we use the recent uwedge approximate joint matrix diagonalization procedure due to Tichavsky and Yeredor (2009). We denote the resulting three ICA variants, which diagonalize either blocks of covariances, blocks of autocovariances, or both, by choiICA (var), choiICA (TD), and choiICA (var & TD), respectively. For a detailed description of both SOS- and NSS-based methods we refer the reader to the review by Nordhausen (2014).
An exhaustive comparison of all methods is infeasible, on the one hand due to the sheer number of different models and methods, and on the other hand because appropriately maintained and easily adaptable code simply does not exist for most methods. Therefore, we focus our comparison on the following representative, modern methods that are closely related to coroICA: fastICA, SOBI, choiICA (TD), choiICA (var), and choiICA (TD & var). The methods and their respective assumptions on the source and noise characteristics are summarized in Table 1.
Table 1: Signal and noise assumptions of the compared ICA methods.

method              | signal type                              | allowed noise
choiICA (TD)        | varying time-dependence                  | time-independent
choiICA (var)       | varying variance                         | none
choiICA (var & TD)  | varying variance and time-dependence     | none
SOBI                | fixed time-dependence                    | time-independent
fastICA*            | non-Gaussian                             | none
coroICA             | varying time-dependence and/or variance  | groupwise stationary

* The fastICA method can be extended to include Gaussian noise (see Hyvärinen, 1999).
1.1.3 ICA procedures for grouped data
Applications in EEG and fMRI have motivated the development of a wide variety of blind source separation techniques which are capable of dealing with grouped data, e.g., where groups correspond to different subjects or recording sessions. A short review is given in Hyvärinen (2013) and a detailed exposition in the context of fMRI data is due to Calhoun et al. (2003).
Suppose we are given G groups and observe a corresponding data matrix X^g for each group g, where d is the number of observed signals and n the number of observations. Using this notation, all existing ICA procedures for grouped data can be related to one of three underlying models extending the classical mixing model X = A · S. The first, often also referred to as "temporal concatenation", assumes that the mixing remains equal while the sources are allowed to change across groups, leading to data of the form

(1.2) [X^1, …, X^G] = A · [S^1, …, S^G].

The second model, often also referred to as "spatial concatenation", assumes that the sources remain fixed (S^g = S for all g) while the mixing matrices are allowed to change, i.e.,

(1.3) [X^1; …; X^G] = [A^1; …; A^G] · S,

where ";" denotes row-wise (vertical) stacking. Finally, the third model assumes that both the sources and the mixing remain fixed across groups, which implies that for all g it holds that

(1.4) X^g = A · S.
In all three settings, the baseline approach is to simply apply a classical ICA to the corresponding concatenated or averaged data, i.e., to apply the algorithm to the temporally/spatially concatenated data matrices on the left-hand side of the above equations or to the average over groups. These ad-hoc approaches are appealing, since they yield straightforward procedures for solving the problem on grouped data and facilitate interpretability of the resulting estimates. It is these ad-hoc approaches that are implemented as the default behavior in toolboxes like the widely used eeglab for EEG analyses (Delorme and Makeig, 2004).
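The three concatenation baselines correspond to three simple data manipulations. The following NumPy sketch (an illustration of ours, with made-up toy dimensions) shows the data layout a classical ICA would be applied to in each case:

```python
import numpy as np

# toy grouped data: G = 3 groups, d = 2 observed signals, n = 4 samples per group
rng = np.random.default_rng(0)
Xg = [rng.normal(size=(2, 4)) for _ in range(3)]

X_temporal = np.hstack(Xg)        # (1.2): shared mixing A, sources change per group
X_spatial = np.vstack(Xg)         # (1.3): shared sources S, mixing changes per group
X_average = sum(Xg) / len(Xg)     # (1.4): both sources and mixing fixed
```

A classical ICA run on X_temporal, X_spatial, or X_average then yields the temporal-concatenation, spatial-concatenation, or averaging baseline, respectively.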
Several procedures tailored to specific applications have been proposed that extend these baselines by employing additional assumptions. The most prominent such extensions are tensorial methods, which have found popularity in fMRI analysis. They express the group index as an additional dimension (the data is thus viewed as a tensor) and estimate a factorization of the tensor representation. Many of these procedures build on the so-called PARAFAC (parallel factor analysis) model (Harshman, 1970). Recasting the tensor notation, this model is of the form (1.3) with A^g = A · D^g for all groups g, where the D^g are diagonal matrices. As can be seen from this representation, the PARAFAC model allows the mixing matrices to change across groups while they are constrained to be the same up to a different scaling of the mixing matrix columns (intuitively, across groups the source dimensions are allowed to project with different strengths onto the observed signal dimensions). Given that the diagonal matrices D^g are sufficiently different, it is possible to estimate this model uniquely without further assumptions. However, in the case that some of these diagonal matrices are equal, identifiability is lost. In such cases Beckmann and Smith (2005) suggest to additionally require that the individual components of the sources are independent. This is comparable to the case where orthogonality is not sufficient for separation of Gaussian sources but independence is.
The coroICA procedure also allows for grouped data but aims at inferring a fixed mixing matrix A, i.e., a model as given in (1.2) is considered. In contrast to vanilla concatenation procedures, our methodology naturally incorporates changes across groups by allowing and adjusting for different stationary confounding noise in each group. We argue why this leads to a more robust procedure and also illustrate this in our simulations and real data experiments. More generally, our goal is to learn an unmixing that generalizes to new and previously unseen groups; think, for example, of learning an unmixing based on several different training subjects and extending it to new, so far unseen subjects. Such tasks appear in brain-computer interfacing applications and can also be of relevance more broadly in feature learning for classification tasks where classification models are to be transferred from one group/domain to another. Since our aim is to learn a fixed mixing matrix that is confounding-robust and readily applicable to new groups, coroICA cannot naturally be compared to models that are based on spatial concatenation (1.3) or fixed sources and mixings (1.4); these methods employ fundamentally different assumptions on the model underlying the data generating process, the crucial difference being that we allow the sources and their time courses to change between groups.
1.2 Our contribution
One strength of our methodology is that it explicates a statistical model that is sensible for data with group structure and can be estimated efficiently, while being backed by provable identification results. Furthermore, providing an explicit model with all required assumptions enables a constructive discussion about the appropriateness of such modeling decisions in specific application scenarios. The model itself is based on a notion of invariance against confounding structures from groups, an idea that is also related to invariance principles in causality (Haavelmo, 1944; Peters et al., 2016); see also Section 3 for a discussion on the relation to causality.
We believe that coroICA is a valuable contribution to the ICA literature on the following grounds:

- We introduce a methodologically sound framework which extends ordinary ICA to settings with grouped data and confounding noise.

- We prove identifiability of the unmixing matrix under mild assumptions; importantly, we explicitly allow for time-dependent noise, thereby weakening the assumptions required by existing noisy ICA methods.

- We provide an easy-to-implement estimation procedure.

- We illustrate the usefulness, robustness, applicability, and limitations of our newly introduced coroICA algorithm, and characterize its advantage over existing ICAs: the source separation by coroICA is more stable across groups since it explicitly accounts for groupwise stationary confounding.

- We provide an open-source, scikit-learn compatible, ready-to-use Python implementation, available as coroICA from the Python Package Index repository, as well as R and Matlab implementations and an intuitive audible example, available at https://sweichwald.de/coroICA/.
2 Methodology
We consider a general noisy ICA model inspired by ideas employed in causality research (see Section 3). We argue below that it allows one to incorporate group structure and enables joint inference on multi-group data in a natural way. For the model description, let (S_t) and (H_t) be two independent vector-valued sequences of d-dimensional random variables. The components S_t^1, …, S_t^d are assumed to be mutually independent for each t while, importantly, we allow for any weakly stationary noise (H_t). Let A be an invertible d × d matrix. The d-dimensional data process (X_t) is generated by the following noisy linear mixing model

(2.1) X_t = A · (S_t + H_t),

i.e., X_t is a linear combination of source signals S_t and confounding variables H_t. In this model, both S_t and H_t are unobserved. One aims at recovering the mixing matrix A as well as the true source signals S_t from observations of X_t. Without additional assumptions, the confounding makes it impossible to identify the mixing matrix A. Even with additional assumptions it remains a difficult task (see Section 1.1.1 for an overview of related ICA models). Given the mixing matrix, it is straightforward to recover the confounded source signals S_t + H_t.
Throughout this paper, we denote by X the observed d × n data matrix and similarly by S and H the corresponding (unobserved) source and confounding data matrices. For a finite data sample generated by this model we hence have

X = A · (S + H).
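For intuition, the following NumPy sketch (our own illustration; the block structure and parameter values are made up) simulates a finite sample from this model: sources with nonstationary variance, plus confounding noise that is stationary within each group but has a different covariance per group.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_per_group, n_groups = 3, 2000, 2

A = rng.normal(size=(d, d))            # mixing matrix, fixed across groups

def sources(n):
    # independent nonstationary sources: the variance changes in blocks of 100
    scales = np.repeat(rng.uniform(0.5, 3.0, size=n // 100), 100)
    return rng.normal(size=(d, n)) * scales

def confounding(n):
    # weakly stationary within its group, with a group-specific covariance
    L = 0.5 * rng.normal(size=(d, d))
    return L @ rng.normal(size=(d, n))

# X = A (S + H), columns concatenated over the groups
X = np.hstack([A @ (sources(n_per_group) + confounding(n_per_group))
               for _ in range(n_groups)])
```

Note that the confounding has a non-diagonal covariance, so its components are dependent; only the sources are mutually independent.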
In order to distinguish between the confounding and the source signals we assume that the two processes are sufficiently different. This can be achieved by assuming the existence of a group structure such that the covariance of the confounding remains stationary within a group and only changes across groups.
Assumption 1 (groupwise stationary confounding)
There exists a collection of disjoint groups g_1, …, g_m of time indices with g_1 ∪ … ∪ g_m = {1, …, n} such that for each group g the process (H_t) restricted to t ∈ g is weakly stationary.
Under this assumption, and given that the source signals change enough within groups, the mixing matrix A is identifiable (see Section 2.2). Similar to the existing ICA methods discussed in Section 1.1.2, we propose to estimate the mixing matrix by jointly diagonalizing empirical estimates of dependence matrices. In contrast to existing methods, we explicitly allow and adjust for the confounding H. The process of finding a matrix that simultaneously diagonalizes a set of matrices is known as joint matrix diagonalization and has been studied extensively (e.g., Ziehe et al., 2004; Tichavsky and Yeredor, 2009). In Section 2.3, we show how to construct an estimator for A based on approximate joint matrix diagonalization.
The key step in adjusting for the confounding is to exploit the assumption that, in contrast to the signals, the confounding remains stationary within groups. Depending on the type of signal in the sources, one can consider different sets of matrices. Here, we distinguish between two types of signals.
Variance signal
In the case of a variance signal, the variance of each source component changes over time. These changes can be detected by examining the covariance matrix over time. Using (2.1), it holds for all t that

Cov(X_t) = A · Cov(S_t) · A^T + A · Cov(H_t) · A^T.

Since the source signal components are mutually independent, the covariance matrix Cov(S_t) is diagonal. Moreover, due to Assumption 1, the covariance matrix of the confounding is constant, though not necessarily diagonal, within each group. This implies for all groups g and for all time points s, t within group g that

(2.2) A^{-1} (Cov(X_s) - Cov(X_t)) (A^{-1})^T = Cov(S_s) - Cov(S_t)

is a diagonal matrix.
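The cancellation of the confounding covariance in this difference can be checked numerically. The following sketch (our own illustration with arbitrary toy parameters) works directly on population covariances:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
A = rng.normal(size=(d, d))            # invertible mixing (almost surely)

# within one group: the confounding covariance is fixed ...
L = rng.normal(size=(d, d))
cov_H = 0.1 * L @ L.T                  # constant, not necessarily diagonal
# ... while the (diagonal) source covariances differ between two time points
D_s = np.diag([1.0, 4.0, 0.25])
D_t = np.diag([2.0, 0.5, 3.0])

cov_X_s = A @ (D_s + cov_H) @ A.T      # population covariances of X
cov_X_t = A @ (D_t + cov_H) @ A.T

A_inv = np.linalg.inv(A)
diff = A_inv @ (cov_X_s - cov_X_t) @ A_inv.T   # confounding cancels in the difference
```

The unmixed difference equals D_s - D_t, a diagonal matrix, even though cov_H is non-diagonal; this is precisely the property (2.2) that the estimation procedure exploits.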
Time-dependence signal
In the case of a time-dependence signal, the time-dependence of each source component changes over time, i.e., for a fixed lag tau, the autocovariance Cov(S_t, S_{t+tau}) changes over time. These changes lead to changes in the autocovariance matrices of X. Analogous to the variance signal, it holds for all t that

Cov(X_t, X_{t+tau}) = A · Cov(S_t, S_{t+tau}) · A^T + A · Cov(H_t, H_{t+tau}) · A^T.

Since the source signal components are mutually independent, the autocovariance matrix Cov(S_t, S_{t+tau}) is diagonal, and due to the stationarity of the confounding (see Assumption 1) its autocovariance is constant within each group. This implies for all groups g, for all time points s, t within group g, and for all lags tau that

(2.3) A^{-1} (Cov(X_s, X_{s+tau}) - Cov(X_t, X_{t+tau})) (A^{-1})^T = Cov(S_s, S_{s+tau}) - Cov(S_t, S_{t+tau})

is a diagonal matrix.
For both signal types, we can identify A by simultaneously diagonalizing differences of (auto)covariance matrices. Details and identifiability results are given in Sections 2.2 and 2.3. The two signal types considered differ both from the more classical setting of non-Gaussian time-independent signals, as considered for example by fastICA, and from the stationary signals with fixed time-dependence assumed by SOBI (cf. Table 1). Owing to the nonstationarity of the signal, we can allow for more general forms of noise.
2.1 Motivating examples
To get a better understanding of our proposed ICA model in (2.1), we illustrate two different aspects: the group structure and the noise model.
Noise model
coroICA can be viewed as a noisy ICA in which the noise is allowed to be groupwise nonstationary. This generalizes existing noisy ICA methods which, to the best of our knowledge, all assume that the noise is independent over time, with various additional assumptions. The following example illustrates the intuition behind our model via a toy application to natural images.
Example 2.1 (unmixing noisy images)
We provide an illustration of how our proposed method compares to other ICA approaches in the presence of noise. Four images, each with three RGB color channels, are used to construct four sources as follows (the images are freely available from Pexels GmbH, 2018). Every color channel is converted to a one-dimensional vector by cutting each image into equally sized patches and concatenating the row-wise vectorized patches. This procedure preserves the local structure of the image. We concatenate the three color channels and consider them as separate groups for our model. Thus, each of the four sources consists of three groups of observations, corresponding to the RGB color channels. Next, we construct locally dependent noise that differs across color channels. Here, locally dependent means that the added noise is similar (and dependent) for pixels which are close to each other. This results in four noise processes. We combine the sources with the noise and apply a random mixing matrix A to obtain the observed data X = A · (S + H).
The recast noisy images are shown in the first row and the recast observed mixtures in the second row of Figure 1. The last three rows are the resulting reconstructions of three different ICA procedures: coroICA, fastICA, and choiICA (TD). As expected, fastICA, being a noise-free ICA method, appears fragile to the noise in the images. While choiICA (TD) is able to adjust for independent noise, it is unable to properly adjust for the spatial dependence of the noise process and thus leads to undesired reconstruction results. In contrast, coroICA is able to recover the noisy images. It is the noise and its characteristics that break the two competing ICA methods, since all three methods are able to unmix the images in the noise-free case (not shown here).
The noise model we employ is motivated by recent advances in causality research where the groupwise stationary noise can be interpreted as unobserved confounding factors in linear causal feedback models. We describe this in more detail with an explicit example application to Arctic ice core data in Section 3.
Group structure
A key aspect of our model is that it aims to leverage group structure to improve the stability of the unmixing in the presence of groupwise confounding. Here, we refer to the following notion of stability: a stable unmixing matrix extracts the same set of independent sources when applied to the different groups; it is robust against the confounding that varies across groups and introduces dependences. A standard ICA method is not able to estimate the correct unmixing if the data generating process follows our confounded ICA model in (2.1). These methods extract signals that are not only corrupted by the groupwise confounding but are also mixtures of the independent sources, and are thus not stable in the aforementioned sense. This is illustrated by the "America's Got Talent Duet Problem" (cf. Example 2.2), an extension and alteration of the classical "cocktail party problem".
Example 2.2 (America’s Got Talent Duet Problem)
Consider the problem of evaluating two singers at a duet audition individually. This requires listening to the two voices separately while the singers perform simultaneously. There are two sound sources in the audition room (the two singers) and additionally several noise sources which corrupt the recordings at the two microphones (or the jury member's two ears). A schematic of such a setting is illustrated in Supplement B, Figure 12. The additional noise comes from an audience and two open windows. One can assume that this noise satisfies our Assumption 1 on a single group. The sound stemming from the audience can be seen as an average of many sounds, hence remaining approximately stationary over time. Typical sounds from an open window also satisfy this assumption, for example sound from a river or a busy road. Our methodology, however, also allows for more complicated settings in which the noise shifts at known points in time, for example if someone opens or closes a window or starts mowing the lawn outside. In such cases we use the known time blocks of stationary noise as groups and apply coroICA to this grouped data. An example with artificial sound data related to this setting is available at https://sweichwald.de/coroICA/. We show that coroICA is able to recover useful sound signals with the two voices separated into different dimensions, allowing one to listen to them individually. In contrast, existing ICAs applied to the time-concatenated data fail to unmix the two singers.
2.2 Identifiability
Identifiability requires that the source signals change sufficiently strongly within groups. The precise notion of a strong signal depends on the type of signal. As discussed previously, we consider two types of nonstationary signals: (i) variance signals and (ii) time-dependence signals. Depending on the signal type, we formalize two slightly different assumptions on the source signals that ensure identifiability. Firstly, in the case of a variance signal, we have the following assumption.
Assumption 2 (signals with independently changing variance)
For each pair of components i and j, we require the existence of three (not necessarily unique) groups g_1, g_2, g_3 and three corresponding pairs of time points (s_1, t_1), (s_2, t_2), and (s_3, t_3), each pair lying within its respective group, such that the two vectors

( Var(S^i_{s_1}) - Var(S^i_{t_1}), Var(S^i_{s_2}) - Var(S^i_{t_2}), Var(S^i_{s_3}) - Var(S^i_{t_3}) ) and
( Var(S^j_{s_1}) - Var(S^j_{t_1}), Var(S^j_{s_2}) - Var(S^j_{t_2}), Var(S^j_{s_3}) - Var(S^j_{t_3}) )

are neither collinear nor equal to zero.
In the case of time-dependence signals, we have the analogous assumption.
Assumption 3 (signals with independently changing time-dependence)
For each pair of components i and j, we require the existence of three (not necessarily unique) groups g_1, g_2, g_3 and three corresponding pairs of time points (s_1, t_1), (s_2, t_2), and (s_3, t_3), each pair lying within its respective group, for which there exists a lag tau such that the two vectors

( Cov(S^i_{s_1}, S^i_{s_1+tau}) - Cov(S^i_{t_1}, S^i_{t_1+tau}), …, Cov(S^i_{s_3}, S^i_{s_3+tau}) - Cov(S^i_{t_3}, S^i_{t_3+tau}) ) and
( Cov(S^j_{s_1}, S^j_{s_1+tau}) - Cov(S^j_{t_1}, S^j_{t_1+tau}), …, Cov(S^j_{s_3}, S^j_{s_3+tau}) - Cov(S^j_{t_3}, S^j_{t_3+tau}) )

are neither collinear nor equal to zero.
Intuitively, these assumptions ensure that the signals are not changing in exact synchrony across components, which removes degenerate types of signals. In particular, they are satisfied in the case that the variance or autocovariance processes change pairwise independently over time. Whenever one of these assumptions is satisfied, the mixing matrix is uniquely identifiable.
Theorem 2.3 (identifiability of the mixing matrix)
Assume the data generating process follows model (2.1) and that Assumption 1 holds. If additionally Assumption 2 or Assumption 3 is satisfied, then the mixing matrix A is identifiable up to scaling and permutation of its columns.
Proof
A proof is given in Supplement A.
2.3 Estimation
In order to estimate A from a finite observed sample, we first partition each group into subgroups. We then compute the empirical (auto)covariance matrices on each subgroup. Finally, we estimate a matrix that simultaneously diagonalizes the differences of these empirical (auto)covariance matrices using an approximate joint matrix diagonalization technique.
More precisely, for each group g we first construct a partition of g consisting of disjoint subsets that cover g. This partition should be granular enough to capture the changes in the signals described in Assumption 2 or 3. We propose partitioning each group based on a grid such that the separation between grid points is large enough for a reasonable estimation of the covariance matrix and at the same time small enough to capture variations in the signals. In our experiments, we observed robustness with respect to the exact choice; only too-small partitions should be avoided, since otherwise the procedure is fragile due to poorly estimated covariance matrices. More details on the choice of the partition size are given in Remark 2.4. Depending on whether a variance signal, a time-dependence signal, or a hybrid thereof is considered, we fix a set of time lags.
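The steps above can be sketched as follows (our own illustration; function and variable names are not from the paper). The sketch partitions each group into equally sized subgroups and collects differences of empirical (auto)covariance matrices, comparing each subgroup to the rest of its group in the spirit of the complement-based construction (2.5):

```python
import numpy as np

def autocov(X, tau=0):
    """Empirical (auto)covariance of a d x n data block for lag tau."""
    Xc = X - X.mean(axis=1, keepdims=True)
    n = Xc.shape[1]
    if tau == 0:
        return Xc @ Xc.T / (n - 1)
    return Xc[:, :-tau] @ Xc[:, tau:].T / (n - tau - 1)

def difference_matrices(X, groups, n_subgroups=4, lags=(0,)):
    """Differences between each subgroup's (auto)covariance and that of the
    rest of its group; these are the matrices to be jointly diagonalized."""
    diffs = []
    for g in groups:
        Xg = X[:, g]
        bounds = np.linspace(0, Xg.shape[1], n_subgroups + 1, dtype=int)
        for i in range(n_subgroups):
            inside = Xg[:, bounds[i]:bounds[i + 1]]
            rest = np.hstack([Xg[:, :bounds[i]], Xg[:, bounds[i + 1]:]])
            for tau in lags:
                diffs.append(autocov(inside, tau) - autocov(rest, tau))
    return diffs

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 800))
M = difference_matrices(X, [slice(0, 400), slice(400, 800)])
```

With two groups, four subgroups each, and a single lag, this yields eight difference matrices, which are then handed to the approximate joint diagonalizer.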
Next, for each group g, each distinct pair of partition elements P and P', and each lag tau we define the matrix

M(P, P', tau) := ACov_tau(P) - ACov_tau(P'),

where ACov_tau(P) denotes the empirical (auto)covariance matrix for lag tau computed on the columns of the data matrix X that correspond to the subgroup P. Assumption 1 ensures that A^{-1} M(P, P', tau) (A^{-1})^T is approximately diagonal. We are therefore interested in finding an invertible matrix V which approximately jointly diagonalizes the matrices in the set

(2.4) { M(P, P', tau) : P and P' distinct elements of the same partition, tau a chosen lag }.

The number of matrices in this set grows quadratically in the number of partition elements, which can lead to large numbers of matrices to be diagonalized. Another option, which reduces the computational load, is to compare each partition element to its complement within the group, leading to the set of matrices

(2.5) { M(P, g \ P, tau) : P an element of the partition of group g },

or to compare only neighboring partition elements, as in

(2.6) { M(P, P', tau) : P' the neighboring partition element },
where the neighboring element is the partition element immediately to the right of P. The task of jointly diagonalizing a set of matrices is a well-studied topic in the literature and is referred to as approximate joint matrix diagonalization. Many solutions have been proposed, differing in the assumptions made on the matrices to be diagonalized. In this paper, we use the uwedge algorithm introduced by Tichavsky and Yeredor (2009). (As a byproduct of our work, we provide a new, stable, open-source Python/R/Matlab implementation of the uwedge algorithm, which is included in our respective coroICA packages.) The basic idea behind uwedge is to find a minimizer V of a proxy for the loss function

L(V) = sum over all difference matrices M of the squared off-diagonal entries of V M V^T,

over the set of invertible matrices V.
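This off-diagonality loss (not the uwedge proxy itself, which is more involved) can be written down directly. The following sketch of ours evaluates it and checks that, for a set of exactly jointly diagonalizable matrices, the true unmixing drives the loss to zero:

```python
import numpy as np

def offdiag_loss(V, matrices):
    """Sum over M of the squared off-diagonal entries of V @ M @ V.T."""
    total = 0.0
    for M in matrices:
        D = V @ M @ V.T
        total += np.sum(D ** 2) - np.sum(np.diag(D) ** 2)
    return total

# toy matrices that are exactly jointly diagonalizable: M_k = A diag(d_k) A^T
A = np.array([[2.0, 1.0], [0.5, 1.5]])
Ms = [A @ np.diag(dv) @ A.T for dv in ([1.0, -2.0], [0.5, 3.0])]
V_true = np.linalg.inv(A)   # the true unmixing zeroes the loss
```

In practice the difference matrices are only approximately jointly diagonalizable, so uwedge minimizes a proxy of this loss rather than reaching zero.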
The full estimation procedure, based on the set defined in (2.5), is made explicit in the pseudo code of Algorithm 1 (where ApproximateJointDiagonalizer stands for a general approximate joint diagonalizer; here we use uwedge).
Remark 2.4 (choosing the partition and the lags)
Whenever there is no obvious partition of the data, we propose to partition the data into equally sized blocks of a fixed size. The decision on how to choose the partition size should be driven by the type of non-stationary signal one expects and the dimensionality of the data. For example, in the case of a variance signal the partition should be fine enough to capture areas of high and low variance, while at the same time being coarse enough to allow for sufficiently good estimates of the covariance matrices. That said, in applications to real data the signals are often of varying length, implying that there is a whole range of partition sizes which all work well. In cases with few data points, it can then be useful to consider several grids with different partition sizes and diagonalize across all resulting differences simultaneously. This somewhat removes the dependence of the results on the exact choice of partition size and increases the power of the procedure. We employ this approach in Example 3.5. In general, the lags should be chosen depending on whether a variance signal, a time-dependence signal, or a hybrid thereof is considered. For time-dependence signals, we recommend determining up to which time lag the autocorrelation of the observed signals has sufficiently decayed, and using all lags up to that point.
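The multi-grid idea in Remark 2.4 can be sketched as follows: compute neighboring-block covariance differences for several block sizes and pool them into one set for the joint diagonalizer. This is an illustrative sketch (function name ours), again restricted to the lag-zero case:

```python
import numpy as np

def multiscale_differences(X, block_sizes):
    """Pool neighboring-block covariance differences over several grid sizes.

    Jointly diagonalizing across all resulting differences reduces the
    dependence on any single choice of partition size (cf. Remark 2.4).
    X : array of shape (d, n); block_sizes : iterable of block lengths.
    """
    d, n = X.shape
    diffs = []
    for b in block_sizes:
        # consecutive blocks of length b; a trailing remainder is dropped
        covs = [np.cov(X[:, i:i + b]) for i in range(0, n - n % b, b)]
        diffs += [c1 - c2 for c1, c2 in zip(covs[:-1], covs[1:])]
    return diffs
```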
2.4 Assessing the quality of recovered sources
Assessing the quality of the recovered sources in an ICA setting is an inherently difficult task, as is typical for unsupervised learning procedures. The unidentifiable scale and ordering of the sources as well as the unclear choice of a performance measure render this task difficult. Provided that ground truth is known, several scores have been proposed, most notably the Amari measure introduced by
Amari et al. (1996) and the minimum distance (MD) index due to Ilmonen et al. (2010). Here, we use the MD index, which (writing V̂ for the estimated unmixing matrix, A for the true mixing matrix, and d for the dimension) is defined as MD(V̂) = (d−1)^{−1/2} inf_{C ∈ 𝒞} ‖C V̂ A − I‖, where the set 𝒞 consists of matrices for which each row and column has exactly one non-zero element. Intuitively, this score measures how close V̂ A is to a rescaled and permuted version of the identity matrix. One appealing property of this score is that it can be computed efficiently by solving a linear sum assignment problem. In contrast to the Amari measure, the MD index is affine invariant and has desirable theoretical properties
(see Ilmonen et al., 2010). We require a different performance measure for our real data experiments, where the true unmixing matrix is unknown. Here, we check whether the desired independence (after adjusting for the constant confounding) is achieved by computing the following covariance instability score (CIS) matrix. It measures the instability of the covariance structure of the unmixed sources and is defined for each group and a corresponding partition (see Section 2.3) by
where
is the empirical standard deviation of
and the fraction is taken elementwise. The CIS matrix is approximately diagonal whenever the data can be written as the sum of independent source signals and confounding with fixed covariance. This is condensed into one scalar that reflects how stable the sources' covariance structure is by averaging the off-diagonals of the CIS matrix. The differences taken in the CIS score extract the variance signals, such that the mean covariance instability score (MCIS) can be understood as a measure of independence between the recovered variance signal processes. High values of MCIS imply strong dependences beyond stationary confounding between the signals; low values imply weak dependences. MCIS is a reasonable score whenever there is a variance signal in the sources (as described in Section 2) and is a sensible evaluation metric of ICA procedures in such cases. In the case of a time-dependence signal (as described in Section 2), one can define an analogous score based on the autocovariances. Here, we restrict ourselves to the variance signal case, as for all our applications this appeared to constitute the dominant part of the signal. In the case of variance signals, the MCIS appears natural and appropriate as an independence measure: it measures how well the individual variance signals (and hence the relevant information) are separated. To get a better intuition, let A denote the mixing matrix and A^{-1} the corresponding unmixing matrix (with columns of A and rows of A^{-1} denoted accordingly). Then it holds that
(2.7) 
Under our group-wise stationary confounding assumption (Assumption 1), this implies that within each group the following holds:
(2.8) 
This equation also holds in the confounding-free case, and it reflects the contribution of the variance signal of each recovered source to the variance signal in all components of the observed multivariate data.
While in the population case the equality in (2.8) is satisfied exactly, this is no longer the case when the (un)mixing matrix is estimated on finite data. Consider two subsets of some group; then, using the notation from Section 2.3 and denoting the corresponding empirical estimates with hats, it holds that
(2.9) 
The approximation is close only if the empirical estimate correctly unmixes the corresponding source. Essentially, MCIS measures the extent to which this approximation holds for all components simultaneously across the subsets specified by the partition. It is also possible to consider individual components by assessing how closely the following proportionality is satisfied
(2.10) 
In EEG experiments, this can also be assessed visually by comparing the topographic maps corresponding to columns of A with so-called activation maps corresponding to the left-hand side in (2.10). More details on this are provided in Section 4.3.3.
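As noted above, the MD index can be computed efficiently via a linear sum assignment. The sketch below is our reading of the definition in Ilmonen et al. (2010): solving for the optimal per-row scales in closed form leaves an assignment problem over the permutation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def md_index(W_hat, A):
    """Minimum distance (MD) index (our reconstruction of Ilmonen et al., 2010).

    G = W_hat @ A equals a rescaled, permuted identity for a perfect
    recovery, in which case the score is 0. The minimization over matrices C
    with one non-zero entry per row and column reduces, after solving for
    the optimal scale of each row in closed form, to a linear sum assignment.
    """
    G = W_hat @ A
    d = G.shape[0]
    # cost[k, j]: residual of using row k of G for diagonal position j
    cost = 1.0 - G ** 2 / np.sum(G ** 2, axis=1, keepdims=True)
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].sum() / (d - 1))
```

The score is exactly zero when the estimated unmixing equals the true one up to row scaling and permutation, and positive otherwise.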
3 Causal perspective
Our underlying noisy ICA model (2.1) and the assumption on the noise (Assumption 1) are motivated by causality. ICA is closely linked to the problem of identifying structural causal models (SCMs) (see Pearl, 2009; Imbens and Rubin, 2015; Peters et al., 2017). Shimizu et al. (2006) were the first to make this connection explicit and used ICA to infer causal structures. To make this more precise consider the following linear SCM
(3.1) 
where X are the observed covariates and the remaining terms are noise variables. An SCM induces a corresponding causal graph over the involved variables by drawing an edge from each variable on the right-hand side to the one on the left-hand side of (3.1). Moreover, we can define noise interventions (Pearl, 2009) by allowing the distributions of the noise terms to change. In the language of ICA, this means that the signals encode the different interventions (over time) on the noise variables. Assuming that the relevant coefficient matrix is invertible, we can rewrite (3.1) as
which can be viewed as an ICA model. Instead of taking all noise terms to be independent noise sources, one can also allow for a dependent component. In that case, the linear SCM in (3.1) describes a causal model between the observed variables in which hidden confounding is allowed. This is illustrated in Figure 2, which depicts a three-variable SCM with feedback loops and confounding. Learning a causal model as in (3.1) with ICA is generally done by performing the following two steps.

(i) (ICA) The mixing matrix is inferred by ICA, up to scale and permutation of its rows, using an appropriate ICA procedure. This step is often infeasible in the presence of confounding, since existing ICA methods only allow noise under restrictive assumptions (cf. Table 1).

(ii) (identify causal parameters) There are essentially two assumptions that one can make in order for this step to work. The first is to assume that the underlying causal model has an acyclic structure, as in Shimizu et al. (2006). In such cases the matrix needs to be permutable to an upper triangular matrix. The second option is to allow for feedback loops in the causal model but restrict the types of feedback to exclude infinite loops, as in Hoyer et al. (2008) and Rothenhäusler et al. (2015).
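The two steps can be illustrated with a toy acyclic example. Everything below (the coefficient matrix B, the brute-force permutation search standing in for a proper ICA step and triangularity check) is our own illustrative construction, not the procedure of any of the cited papers:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# Acyclic linear SCM  x = B x + e  with strictly lower-triangular B
B = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.3, -0.5, 0.0]])
d = B.shape[0]
W = np.eye(d) - B                      # true unmixing; mixing is inv(W)

# Step (i): an ICA step recovers W only up to row order and scale.
perm = rng.permutation(d)
scales = rng.uniform(0.5, 2.0, size=d)
W_ica = (scales[:, None] * W)[perm]

# Step (ii): in the acyclic case, exactly one reordering of the rows gives
# a matrix with a nowhere-zero diagonal; rescale to unit diagonal and read
# off B. (Brute force over permutations; fine for small d.)
best = max(permutations(range(d)),
           key=lambda p: np.prod(np.abs(W_ica[list(p), range(d)])))
W_sorted = W_ica[list(best)]
B_hat = np.eye(d) - W_sorted / np.diag(W_sorted)[:, None]
```

With generic non-zero coefficients, only one row ordering yields a non-vanishing diagonal, which is why the permutation and scale indeterminacy of step (i) can be resolved in step (ii).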
When performing step (i), two important modelling assumptions are made when selecting the ICA procedure: (a) the type of allowed signals (types of interventions) and (b) the type of allowed confounding. For the classic ICA setting with non-Gaussian source signals and no noise, this translates to the class of linear non-Gaussian models, such as the Linear Non-Gaussian Acyclic Models (LiNGAMs) introduced by Shimizu et al. (2006). While such models are a sensible choice in a purely observational setting (i.e., no samples from interventional settings), they are somewhat misspecified in terms of (a) when data from different interventional settings or time-continuous intervention shifts are observed. In those settings, it is more natural to use ICA methods that are tailored to sequential shifts, such as choiICA or coroICA. Moreover, most common ICA methods consider noise-free mixing, which from a causal perspective implies that no hidden confounding is allowed. While noisy ICA weakens this assumption, existing methods only allow for time-independent or even iid noise, which again greatly restricts the type of confounding. In contrast, our proposed coroICA allows for any type of block-wise stationary confounding, hence greatly increasing the class of causal models which can be inferred. This is attractive for causal modelling, as it is a priori unknown whether hidden confounding exists; our procedure therefore allows for robust causal inference under general confounding settings. In Example 3.5, we illustrate a potential application to climate science and how the choice of ICA can have a strong impact on the estimates of the causal parameters.
Example 3.5 (application to climate science)
To motivate the foregoing causal model, we consider a prominent example from climate science: the causal relationship between carbon dioxide (CO2) and temperature (T). More precisely, we consider Antarctic ice core data consisting of temperature and carbon dioxide measurements of the past 800'000 years due to Bereiter et al. (2014, carbon dioxide) and Jouzel et al. (2007, temperature)
. We combined the temperature and carbon dioxide data and obtained measurements every 500 years by cubic interpolation of the raw data. The data is shown in Figure
3 (right). Oversimplifying, one can model this data as an SCM with time lags as follows, with component-wise independent signal sources and a stationary confounding process. This representation is also referred to as a structural vector autoregressive (SVAR) model (see e.g., Lütkepohl, 2005). A graphical representation of such a model is shown in Supplementary B.2, Figure 13.
Assuming that this was the true underlying causal model, we could use it to predict what happens under interventions. From a climate science perspective, an interesting intervention is the doubling of the CO2 concentration and the resulting instantaneous (faster than 1000 years) effect on the temperature. This effect is commonly referred to as the equilibrium climate sensitivity (ECS), which is loosely defined as the change in temperature (in degrees) associated with a doubling of the concentration of carbon dioxide in the earth's atmosphere. The fifth assessment report of the United Nations Intergovernmental Panel on Climate Change states that "there is high confidence that ECS is extremely unlikely less than 1 °C and medium confidence that the ECS is likely between 1.5 °C and 4.5 °C and very unlikely greater than 6 °C" (Stocker, 2014, Chapter 10). Since the measurement frequency in our model is quite low (500 years) and we model the logarithm of carbon dioxide, the ECS corresponds to
The noise terms in our model consist of two parts. The first are independent signals that affect temperature and carbon dioxide separately, corresponding to variation in external conditions (e.g., environmental catastrophes like volcano eruptions and large wildfires, sunspot activity, or ice coverage). The second part is a stationary component that affects both temperature and carbon dioxide, which could, for example, capture effects due to the earth's rotation. We applied coroICA to this data set in order to estimate the climate sensitivity and compared the result with those obtained using fastICA or choiICA (var). The results are given in Figure 3. In the two-dimensional setting considered here, step (ii) (i.e., identifying the causal parameters from the estimated mixing matrix) only requires the assumption that feedback loops do not blow up, which amounts to a simple constraint on the causal parameters. Trying both potential permutations of the sources, with subsequent scaling, and checking whether this assumption is satisfied leads to the correct causal parameters.
4 Experiments
In this section, we analyze empirical properties of coroICA. To this end, we first illustrate the performance of coroICA compared to time-concatenated versions of (noisy) ICA variants on simulated data with and without confounding. We then compare the methods on real data and outline potential benefits of using our method when analyzing multi-subject EEG data.
4.1 Competing methods
In all of our numerical experiments, we apply coroICA as outlined in Algorithm 1, where we partition each group based on equally spaced grids and run a fixed number of iterations of the uwedge approximate joint diagonalizer. Unless specified otherwise, coroICA refers to coroICA (var) (i.e., the variance signal based version) and we explicitly write coroICA (var), coroICA (TD) and coroICA (var & TD) whenever appropriate to avoid confusion. We compare with all of the methods in Table 1. Since no Python implementation was publicly available, we implemented the choiICA and SOBI methods ourselves, also based on a fixed number of iterations of the uwedge approximate joint diagonalizer. For fastICA we use the implementation from the scikit-learn Python library due to Pedregosa et al. (2011) with its default parameters.
For the simulation experiments in Section 4.2, we also compare to random projections of the sources, where the unmixing matrix is simply sampled with iid standard normal entries. The idea of this comparison is to provide a baseline for the unmixing problem and to build intuition about the scores' behavior. In order to illustrate the variance of this method, we sample several random projections and show the results for each of them. A random mixing does not lead to interpretable sources; we therefore do not compare with random projections in the EEG experiments in Section 4.3.
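This baseline is trivial to generate; a minimal sketch (function name ours):

```python
import numpy as np

def random_unmixings(d, n_draws, seed=0):
    """Baseline 'unmixing matrices' with iid standard normal entries.

    Scores obtained with these random projections indicate what an
    entirely uninformed source separation would achieve.
    """
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((d, d)) for _ in range(n_draws)]
```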
4.2 Simulations
In this section, we investigate empirical properties of coroICA in well-controlled simulated scenarios. First, we show that we can recover the correct mixing matrix given that the data is generated according to our model (2.1) and Assumptions 1 and 2 hold, while the other ICAs necessarily fall short in this setting (cf. Section 4.2.1). Moreover, in Section 4.2.2 we show that even in the absence of any confounding (i.e., when the data follows the ordinary ICA model) we remain competitive with all competing ICAs. Finally, in Section 4.2.3 we analyze the performance of coroICA for various types of signals and noise settings. Our first two simulation experiments are based on block-wise shifting variance signals, described in Data Set 1, and our third simulation experiment is based on GARCH type models, described in Data Set 2.
4.2.1 Dependence on confounding strength
For this simulation experiment, we sample data according to Data Set 1 and choose to simulate (dimension ) samples from groups where each group contains observations. Within each group, we select a random partition consisting of subsets while ensuring that these have the same size on average. We fix the signal strength to and consider the behavior of coroICA (trained on half of the groups with an equally spaced grid of partitions per group) for different confounding strengths . The results for repetitions are shown in Figure 4. To allow for a fair comparison we take the same partition size for choiICA (var).
[roundcorner=5pt, frametitle=Data Set 1: Blockwise shifting variance signals] For our simulations we select equally sized groups of the data points and for each group construct a partition . Then, we sample a model of the form
where the values on the righthand side are sampled as follows:

are sampled with iid entries from and , respectively.

For each the variables are sampled from , where the are sampled iid from .

For each and the variables are sampled from , where the are sampled iid from .
The parameters and are selected in such a way that the expected confounding strength and variance signal strength
are as dictated by the respective experiment. Due to the uniform distribution this reduces to
The results indicate that, in terms of the MD index, all competitors become worse as the confounding strength increases: all competing ICAs systematically estimate an incorrect unmixing matrix. coroICA, on the other hand, only shows a very small loss in precision as confounding increases; this small loss is expected due to the decreasing signal-to-noise ratio. In terms of MCIS, the behavior is analogous but slightly less well resolved; with increasing confounding strength, the unmixing estimation of all competing ICAs is systematically biased, resulting in poor separation of sources and high MCIS scores both out-of-sample and in-sample.
4.2.2 Efficiency in absence of group confounding
For this simulation experiment, we sample data according to Data Set 1 and choose to simulate (dimension ) samples from groups where each group contains observations. Within each group, we then select a random partition consisting of subsets while ensuring that these have the same size on average. This time, to illustrate performance in the absence of confounding, we fix the confounding strengths and consider the behavior of coroICA (applied to half of the groups with an equally spaced grid of partitions per group) for different signal strengths . The results for repetitions are shown in Figure 5. Again, choiICA (var) is applied with the same partition size.
The results indicate that overall coroICA performs competitively in the confounding-free case. In particular, there is no drastic performance penalty for coroICA compared to choiICA (var) in settings where the data follows the ordinary ICA model. The slight advantage over fastICA in this setting is due to the signal type, which favors ICA methods that focus on variance signals.
4.2.3 Comparison with other noisy ICA procedures
To get a better understanding of how our proposed ICA performs for different signal and noise types, we perform the simulation described in Data Set 2. We illustrate the behavior with respect to the different types of signal by applying all three of our proposed coroICA procedures (coroICA (var), coroICA (TD) and coroICA (var & TD)) and comparing them to the corresponding choiICA variants, which do not adjust for confounding (choiICA (var), choiICA (TD) and choiICA (var & TD)). While all coroICA procedures can deal with any type of stationary noise, the choiICA variants only work for more restrictive types of noise (see Table 1). Additionally, we compare with fastICA to assess its performance in the various noise settings. The results are depicted in Figure 6.
[roundcorner=5pt, frametitle=Data Set 2: GARCH simulation] For this simulation we consider different settings of the confounded mixing model
More precisely, we consider the following three different GARCH type signals: (i) changing variance, (ii) changing time-dependence and (iii) both changing variance and changing time-dependence. For each of these signal types we consider two types of confounding (noise) terms: (a) time-independent noise and (b) time-dependent autoregressive noise. For both, we construct independent processes and then combine them with a random mixing matrix as follows
Full details are given in Supplement B.3.
In all settings, the most general method coroICA (var & TD) is able to estimate the correct mixing. The two signal-specific methods coroICA (TD) and coroICA (var) also accurately estimate the mixing in settings where a corresponding signal exists; it is worth noting that they slightly outperform coroICA (var & TD) in these settings. When comparing with the choiICA variants, coroICA in general outperforms the corresponding method. Only in the setting of a changing time-dependence with time-independent noise is choiICA (TD) able to slightly outperform coroICA (TD).
4.2.4 Summary of the performance of coroICA
In summary, coroICA performs well on a larger model class consisting of both the group-wise confounded and the confounding-free case. More precisely, an advantage over all competing ICAs is gained in confounded settings (as shown in Section 4.2.1), while there is arguably no disadvantage in the unconfounded case (cf. Section 4.2.2). This suggests that whenever the data is expected to contain at least small amounts of stationary noise or confounding, one may be better off using coroICA, as the richer model class guards against wrong results. The results in Section 4.2.3 further underline the robustness of our proposed method to the various types of noise (and signals) for which other methods break. Even in settings that favor other methods, coroICA remains competitive.
4.3 EEG experiments
ICA is often applied in the analysis of EEG data. Here, we illustrate the potential benefit and use of coroICA in this context. Specifically, we consider a multi-subject EEG experiment as depicted in Figure 7. The goal is to find a single mixing matrix that separates the sources simultaneously for all subjects. Our proposed model allows the EEG recordings of each subject to have a different but stationary noise term.
We illustrate the applicability of our method to this setting based on two publicly available EEG data sets. [roundcorner=5pt, frametitle=Data Set 3: CovertAttention data] This data set is due to Treder et al. (2011) and consists of EEG recordings of 8 subjects performing multiple trials of covertly shifting visual attention to one out of 6 cued directions. The data set contains recordings of

8 subjects,

for each subject there exist 6 runs with 100 trials,

each recording consists of 60 EEG channels recorded at 1000 Hz sampling frequency; we work with the publicly available data that is downsampled to 200 Hz.
Since visual inspection of the data revealed data segments with huge artifacts, and details about how the publicly available data was preprocessed were unavailable to us, we removed outliers and high-pass filtered the data at 0.5 Hz. In particular, along each dimension we set those values to the median of that dimension that deviate by more than 10 times the median absolute distance from this median. We further preprocess the data by re-referencing to common average reference (car) and projecting onto the orthogonal complement of the null component. For our unmixing estimations, we use the entire data, i.e., including inter-trial breaks.
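The median-based outlier clipping and common average re-referencing described above can be written compactly in NumPy. This is a sketch of our reading of the described preprocessing (function names ours; the high-pass filtering step is omitted):

```python
import numpy as np

def clip_outliers(X, n_mads=10.0):
    """Per channel, replace samples deviating by more than n_mads times the
    median absolute distance from the channel median with that median.

    X : array of shape (n_channels, n_samples).
    """
    med = np.median(X, axis=1, keepdims=True)
    mad = np.median(np.abs(X - med), axis=1, keepdims=True)
    return np.where(np.abs(X - med) > n_mads * mad, med, X)

def common_average_reference(X):
    """Re-reference by subtracting, at each time point, the mean over all
    channels (car)."""
    return X - X.mean(axis=0, keepdims=True)
```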
For the classification experiments (cf. Section 4.3.2) we use, in line with Treder et al. (2011), the 8–12 Hz band-pass filtered data during the 500–2000 ms window of each trial, and use the log-variance as a bandpower feature (Lotte et al., 2018). The classification analysis is restricted to valid trials (approximately 311 per subject) with the desired target latency, as described in Treder et al. (2011). Results on the CovertAttention Data Set 3 are presented here, while the results of the analogous experiments on the BCICompIV2a Data Set 4 are deferred to Supplement C. For both data sets, we compare the sources recovered by coroICA with those recovered by competing ICA methods. Since ground truth is unknown, we report comparisons based on the following three criteria:
 stability and independence

We use MCIS (cf. Section 2.4) to assess the stability and independence of the recovered sources both in- and out-of-sample.
 classification accuracy

For both data sets there is label information available that associates certain time windows of the EEG recordings with the task the subjects were performing at that time. Based on the recovered sources, we build a classification pipeline relying on feature extraction and classification techniques that are common in the field
(Lotte et al., 2018). The achieved classification accuracy serves as a proxy for how informative and suitable the extracted signals are.
 topographies

For a qualitative assessment, we inspect the topographic maps of the extracted sources, as well as the corresponding power spectra and a raw time-series chunk. This is used to illustrate that the sources recovered by coroICA do not appear random or implausible for EEG recordings and are qualitatively similar to what is expected from other ICAs. Furthermore, we provide an overview of all components obtained on Data Set 3 by SOBI, fastICA, and coroICA in Supplementary Section D, where components are considered well resolved when the corresponding topographic map and activation map are close to each other (cf. Section 2.4).
4.3.1 Stability and independence
We aim to probe stability not only in-sample but also to verify the expected increase in stability when applying the unmixing matrix to data from new unseen subjects, i.e., to new groups of samples with a different confounding specific to each subject. In order to assess the stability and independence of the recovered sources in terms of the MCIS, both in- and out-of-sample and for different amounts of training data, we proceed by repeatedly splitting the data into a training and a test set. More precisely, we construct all possible splits into training and test subjects for any given number of training subjects. For each pair of training and test set, we fit an unmixing matrix using coroICA and all competing methods described in Section 4.1. We then compute the MCIS on the training and test data for each method separately and collect the results of each training-test split for each number of training subjects.
Results obtained on the CovertAttention data set (with equally spaced partitions of 15 seconds length) are given in Figure 8, and the results for the BCICompIV2a data set (with equally spaced partitions of 15 seconds length) are shown in Supplement C.1, Figure 14. For both data sets the results are qualitatively similar and support the claim that the unmixing obtained by coroICA is more stable when transferred to new unseen subjects. While for the competing ICAs the instability on held-out subjects does not follow a clear decreasing trend with an increasing number of training subjects, coroICA successfully makes use of additional training subjects to learn a more stable unmixing matrix.
Due to the characteristics and low signal-to-noise ratio of EEG recordings, the evaluation based on the absolute MCIS score is less well resolved than in the simulations above. For this reason, we additionally provide a more focused evaluation by considering the MCIS fraction: the MCIS achieved on a subject by the respective competitor method divided by the MCIS achieved on that subject by coroICA when trained on the same subjects. This score compares MCIS on a per-subject basis, where values greater than 1 indicate that the respective competing ICA method performed worse than coroICA. Figure 9 shows the results on the CovertAttention Data Set 3, confirming that coroICA can successfully incorporate more training subjects to derive a better unmixing of signals.
4.3.2 Classification based on recovered sources
While the results in the previous section indicate that coroICA can lead to more stable separations of sources in EEG than the competing methods, in scenarios with an unknown ground truth the stability of the recovered sources cannot serve as the sole determining criterion for assessing the quality of recovered sources. In addition to asking whether the recovered sources are stable and independent variance signals, we hence also need to investigate whether the sources extracted by coroICA are in fact reasonable or meaningful. In the “America’s Got Talent Duet Problem” (cf. Example 2.2) this means that each of the recovered sources should only contain the voice of one (independent) singer (plus some confounding noise that is not the other singer). For EEG data, this assessment is not as easy. Here, we approach this problem from two angles: (a) in this section we show that the recovered sources are informative and suitable for common EEG classification pipelines, (b) in Section 4.3.3 we qualitatively assess the extracted sources based on their power spectra and topographic maps.
In both data sets there are labeled trials, i.e., segments of data during which the subject covertly shifts attention to one of six cues (cf. Data Set 3) or performs one of four motor imagery tasks (cf. Data Set 4). Based on these, one can try to predict the trial label given the trial EEG data. To mimic a situation where the sources are transferred from other subjects, we assess the informativeness of the extracted sources in a leave-k-subjects-out fashion as follows. We estimate an unmixing matrix on the data from all but k subjects, compute bandpower features for each extracted signal and each trial (as described in Data Set 3 and Data Set 4), and on top of those we train an ensemble of
bootstrapped shrinkage linear discriminant analysis classifiers, where each is boosted by a random forest classifier on the wrongly classified trials. This pipeline (signal unmixing, bandpower-feature computation, trained ensemble classifier) is then used to predict the trials of the held-out subjects. The results are reported in Figure 10 and Supplement C.2, Figure 16, which show, for each number of training subjects, the accuracies achieved on the respective held-out subjects when using the unmixing obtained on the remaining subjects by either coroICA or one of the competitor methods. The results on both data sets support the claim that the sources recovered by coroICA are not only stable but also capture meaningful aspects of the data that enable competitive classification accuracies in fully-out-of-sample classification.
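The core of such a pipeline, bandpower features followed by a shrinkage LDA, can be sketched with scikit-learn. This is a simplified stand-in (function name ours; the paper uses a boosted ensemble of such classifiers rather than a single one):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def bandpower_features(trials, W):
    """Log-variance of the unmixed sources, one feature vector per trial.

    trials : array (n_trials, d, n_samples) of (already band-pass filtered)
    trial data; W : unmixing matrix estimated on the training subjects.
    """
    sources = np.einsum('ij,tjn->tin', W, trials)
    return np.log(np.var(sources, axis=2))

# A single shrinkage LDA on these features as a simplified stand-in for
# the ensemble classifier described in the text:
clf = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')
```

On synthetic trials whose classes differ in the variance of one source, fitting `clf` on these features separates the classes easily, which is the mechanism the bandpower features are designed to exploit.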
It is worth noting that these classification results depend heavily on the classification pipeline employed after the source separation. Here, our goal is only to show that coroICA does indeed separate the data into informative sources. In practice, when only classification accuracy matters, one might also consider using a label-informed source separation (Dähne et al., 2014), employing common spatial patterns (Koles et al., 1990), or using decoding techniques based on Riemannian geometry (Barachant et al., 2012).
4.3.3 Topographic maps
The components that coroICA extracts from EEG signals are stable (cf. Section 4.3.1) and meaningful in the sense that they contain information enabling classification of trial labels, a common task in EEG studies (cf. Section 4.3.2). In this section, we complement the assessment of the recovered sources by demonstrating that coroICA leads to topographies, activation maps, power spectra and raw time series that are similar to what is commonly obtained during routine ICA analyses of EEG data, when the plausibility and nature of ICA components is to be judged.
Topographies are common in the EEG literature to depict the relative projection strength of extracted sources onto the scalp sensors. More precisely, the column vector of the mixing matrix that specifies the mixing of a given source component is visualized as follows. A sketched top view of the head is overlaid with a heatmap where the value at each electrode's position is given by the corresponding entry in that column vector. These topographies are indicative of the nature of the extracted sources; for example, the dipolarity of source topographies is a criterion invoked to identify cortical sources (Delorme et al., 2012), or the topographies reveal that a source mainly picks up changes in the electromagnetic field induced by eye movements. Another way to visualize an extracted source is an activation map, which is commonly obtained by depicting, for each electrode, how the signal observed at that electrode covaries with the extracted source signal (Haufe et al., 2014). Besides inspecting the raw time series data, another criterion invoked to separate cortical from muscular components is the log power spectrum. For example, a monotonic increase in spectral power starting at around 20 Hz is understood to indicate muscular activity (Goncharova et al., 2003), and peaks in typical EEG frequency ranges are used to identify brain-related components. (These are commonly employed criteria which are also advised in the eeglab tutorial (Delorme and Makeig, 2004, https://sccn.ucsd.edu/wiki/Chapter_09:_Decomposing_Data_Using_ICA) and the neurophysiological biomarker toolbox wiki (Hardstone et al., 2012, https://www.nbtwiki.net/doku.php?id=tutorial:how_to_use_ica_to_remove_artifacts).)
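An activation map in the sense of Haufe et al. (2014) is just the vector of covariances between the channels and one extracted source; a minimal sketch (function name ours):

```python
import numpy as np

def activation_map(X, w):
    """Covariance of each observed channel with the source s = w @ X
    (the activation pattern in the sense of Haufe et al., 2014).

    X : array (n_channels, n_samples); w : one row of the unmixing matrix.
    """
    s = w @ X
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc @ (s - s.mean()) / (X.shape[1] - 1)
```

For a correctly unmixed source, this vector is proportional to the corresponding column of the mixing matrix, which is exactly why activation maps can be compared with topographies.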
In Figure 11, we depict the aforementioned criteria for three exemplary components extracted by coroICA on the CovertAttention Data Set 3. Following the discussion in Section 2.4, we show activation maps that capture the variance-changing signal and allow one to assess the quality of a recovered source by comparison to its topographic map (cf. Equation 2.4). Here, the idea is to demonstrate that coroICA components are qualitatively similar to components extracted by the commonly employed SOBI or fastICA. We therefore display one example of an ocular component (the 2nd, where the topography is indicative of eye movements), a cortical component (the 7th, where the dipolar topography, the typical frequency peak at around 8–12 Hz, and the amplitude modulation visible in the raw time series are indicative of its cortical nature), and an artifactual component (the 51st, where the irregular topography and the high-frequency content indicate an artifact). For comparison, we additionally show for each component the topography of the SOBI and fastICA component that most strongly correlates with the one extracted by coroICA. The components extracted by coroICA closely resemble the results one would obtain from a commonly employed ICA analysis of EEG data.
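The matching step used for this comparison, pairing each coroICA component with the most strongly correlated component of another method, can be sketched as follows. The helper name `match_components` is ours, not from the paper's code:

```python
import numpy as np

def match_components(S_a, S_b):
    """For each source (row) of S_a, return the index of the source in
    S_b with the largest absolute Pearson correlation."""
    # Standardize each source; correlations then reduce to inner products.
    Za = (S_a - S_a.mean(axis=1, keepdims=True)) / S_a.std(axis=1, keepdims=True)
    Zb = (S_b - S_b.mean(axis=1, keepdims=True)) / S_b.std(axis=1, keepdims=True)
    corr = np.abs(Za @ Zb.T) / S_a.shape[1]   # |correlation| matrix
    return corr.argmax(axis=1)

# Toy check: S_b is a sign-flipped, rescaled permutation of S_a, which
# mimics the scale/permutation indeterminacy of ICA solutions.
rng = np.random.default_rng(1)
S_a = rng.normal(size=(4, 5000))
S_b = -1.5 * S_a[[2, 0, 3, 1]]
print(match_components(S_a, S_b))
```

Taking absolute correlations is deliberate: ICA recovers sources only up to sign, so a strong negative correlation still indicates the same underlying source.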
For completeness, we provide an overview of all components extracted on Data Set 3 by SOBI, fastICA, and coroICA in Supplementary Section D. Components are well resolved when the corresponding topographic map and activation map are close to each other (cf. Section 2.4), which, by visual inspection, appears to be more often the case for coroICA than for the competing methods.
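The spectral criterion used above to label components (a monotonic increase in log power above roughly 20 Hz suggesting muscular activity, cf. Goncharova et al., 2003) can be screened automatically. A rough sketch using `scipy.signal.welch`; the band edges, the zero-slope threshold, and the toy signals are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.signal import lfilter, welch

def looks_muscular(source, fs=250.0):
    """Crude flag: does log power increase with frequency above ~20 Hz?

    Fits a line to the log power spectrum in the 20-90 Hz band and flags
    the source when the slope is positive. Band edges and the zero-slope
    threshold are illustrative choices.
    """
    freqs, psd = welch(source, fs=fs, nperseg=512)
    band = (freqs >= 20) & (freqs <= 90)
    slope = np.polyfit(freqs[band], np.log(psd[band]), deg=1)[0]
    return bool(slope > 0)

rng = np.random.default_rng(2)
noise = rng.normal(size=50_000)
emg_like = np.diff(noise)                      # high-pass: power rises with frequency
eeg_like = lfilter([1.0], [1.0, -0.9], noise)  # AR(1): power falls with frequency
print(looks_muscular(emg_like), looks_muscular(eeg_like))
```

Such a heuristic can pre-sort components for visual inspection, but it is no substitute for checking topographies and raw time series as done in Figure 11.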
5 Conclusion
In this paper, we construct a method for recovering independent sources corrupted by groupwise stationary confounding. It extends ordinary ICA to an easily interpretable model which we believe is relevant for many practical problems, as demonstrated in Section 4.3 for EEG data. Based on this model, we give explicit assumptions under which the sources are identifiable in the population case (cf. Section 2.2). Moreover, we introduce a straightforward method for estimating the sources based on the well-understood concept of approximate joint matrix diagonalization. As illustrated in the simulations in Section 4.2, this estimation procedure performs competitively even for data from an ordinary ICA model, while additionally being able to adjust for groupwise stationary confounding. Finally, we show that the coroICA model indeed performs reasonably on EEG data sets and leads to improvements in comparison to commonly employed approaches, while at the same time preserving an easy interpretation of the recovered sources.
Acknowledgements
The authors thank Vinay Jayaram, Nicolai Meinshausen, Jonas Peters, Gian Thanei and anonymous reviewers for helpful discussions and constructive comments.
References
 Ablin et al. (2018) Ablin, P., Cardoso, J.F., and Gramfort, A. (2018). Beyond Pham’s algorithm for joint diagonalization. arXiv preprint arXiv:1811.11433.
 Amari et al. (1996) Amari, S., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems, pages 757–763.
 Back and Weigend (1997) Back, A. and Weigend, A. (1997). A first application of independent component analysis to extracting structure from stock returns. International Journal of Neural Systems, 8(4):473–484.
 Barachant et al. (2012) Barachant, A., Bonnet, S., Congedo, M., and Jutten, C. (2012). Multiclass braincomputer interface classification by Riemannian geometry. IEEE Transactions on Biomedical Engineering, 59(4):920–928.
 Beckmann and Smith (2005) Beckmann, C. and Smith, S. (2005). Tensorial extensions of independent component analysis for multisubject fMRI analysis. NeuroImage, 25(1):294–311.
 Bell and Sejnowski (1995) Bell, A. and Sejnowski, T. (1995). An informationmaximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159.
 Belouchrani et al. (1997) Belouchrani, A., AbedMeraim, K., Cardoso, J.F., and Moulines, E. (1997). A blind source separation technique using secondorder statistics. IEEE Transactions on Signal Processing, 45(2):434–444.
 Bereiter et al. (2014) Bereiter, B., Eggleston, S., Schmitt, J., NehrbassAhles, C., Stocker, T., Fischer, H., Kipfstuhl, S., and Chappellaz, J. (2014). Revision of the EPICA Dome C CO2 record from 800 to 600 kyr before present. Geophysical Research Letters, 42(2):542–549.
 Calhoun et al. (2003) Calhoun, V., Adalı, T., Hansen, L., Larsen, J., and Pekar, J. (2003). ICA of functional MRI data: an overview. In Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, pages 281–288.
 Cardoso (1989a) Cardoso, J.F. (1989a). Blind identification of independent components with higherorder statistics. In Workshop on HigherOrder Spectral Analysis, pages 157–162.
 Cardoso (1989b) Cardoso, J.F. (1989b). Source separation using higher order moments. In International Conference on Acoustics, Speech, and Signal Processing, pages 2109–2112, vol. 4.
 Cardoso and Souloumiac (1993) Cardoso, J.F. and Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F - Radar and Signal Processing, 140(6):362–370.
 Cardoso and Souloumiac (1996) Cardoso, J.F. and Souloumiac, A. (1996). Jacobi angles for simultaneous diagonalization. SIAM journal on matrix analysis and applications, 17(1):161–164.
 Choi and Cichocki (2000a) Choi, S. and Cichocki, A. (2000a). Blind separation of nonstationary and temporally correlated sources from noisy mixtures. In Neural Networks for Signal Processing X, 2000. Proceedings of the 2000 IEEE Signal Processing Society Workshop, volume 1, pages 405–414.
 Choi and Cichocki (2000b) Choi, S. and Cichocki, A. (2000b). Blind separation of nonstationary sources in noisy mixtures. Electronics Letters, 36(9):848–849.
 Choi et al. (2001) Choi, S., Cichocki, A., and Belouchrani, A. (2001). Blind separation of secondorder nonstationary and temporally colored sources. In Proc. IEEE Workshop on Statistical Signal Processing (IEEE SSP 2001), pages 444–447.
 Comon (1994) Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3):287–314.
 Dähne et al. (2014) Dähne, S., Meinecke, F., Haufe, S., Höhne, J., Tangermann, M., Müller, K.R., and Nikulin, V. (2014). SPoC: a novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters. NeuroImage, 86:111–122.
 Delorme and Makeig (2004) Delorme, A. and Makeig, S. (2004). EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(1):9–21.
 Delorme et al. (2012) Delorme, A., Palmer, J., Onton, J., Oostenveld, R., and Makeig, S. (2012). Independent EEG Sources Are Dipolar. PLOS One, 7:1–14.
 Delorme et al. (2007) Delorme, A., Sejnowski, T., and Makeig, S. (2007). Enhanced detection of artifacts in EEG data using higherorder statistics and independent component analysis. NeuroImage, 34(4):1443–1449.
 Ghahremani et al. (1996) Ghahremani, D., Makeig, S., Jung, T., Bell, A., and Sejnowski, T. (1996). Independent component analysis of simulated EEG using a threeshell spherical head model. Technical report, Naval Health Research Center San Diego CA.
 Goncharova et al. (2003) Goncharova, I., McFarland, D., Vaughan, T., and Wolpaw, J. (2003). EMG contamination of EEG: spectral and topographical characteristics. Clinical Neurophysiology, 114(9):1580–1593.
 Haavelmo (1944) Haavelmo, T. (1944). The Probability Approach in Econometrics. Econometrica, 12:iii–115.
 Hardstone et al. (2012) Hardstone, R., Poil, S.S., Schiavone, G., Jansen, R., Nikulin, V., Mansvelder, H., and Linkenkaer-Hansen, K. (2012). Detrended fluctuation analysis: a scale-free view on neuronal oscillations. Frontiers in Physiology, 3:450.
 Harshman (1970) Harshman, R. (1970). Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis.
 Haufe et al. (2014) Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.D., Blankertz, B., and Bießmann, F. (2014). On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage, 87:96–110.
 Hoyer et al. (2008) Hoyer, P., Shimizu, S., Kerminen, A., and Palviainen, M. (2008). Estimation of causal effects using linear nonGaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2):362–378.
 Hyvärinen (1999) Hyvärinen, A. (1999). Fast and robust fixedpoint algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634.
 Hyvärinen (1999) Hyvärinen, A. (1999). Fast ICA for noisy data using Gaussian moments. In Circuits and Systems, 1999. ISCAS’99. Proceedings of the 1999 IEEE International Symposium on, volume 5, pages 57–61. IEEE.
 Hyvärinen (2001) Hyvärinen, A. (2001). Blind source separation by nonstationarity of variance: a cumulantbased approach. IEEE Transactions on Neural Networks, 12(6):1471–1474.
 Hyvärinen (2013) Hyvärinen, A. (2013). Independent component analysis: recent advances. Philosophical transactions (Series A), 371(1984):20110534.
 Hyvärinen et al. (2001) Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley & Sons, Inc.
 Ilmonen et al. (2010) Ilmonen, P., Nordhausen, K., Oja, H., and Ollila, E. (2010). A new performance index for ICA: properties, computation and asymptotic analysis. In International Conference on Latent Variable Analysis and Signal Separation, pages 229–236. Springer.
 Imbens and Rubin (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY.
 Jouzel, J., Masson-Delmotte, V., Cattani, O., Dreyfus, G., Falourd, S., Hoffmann, G., Minster, B., Nouet, J., Barnola, J.M., Chappellaz, J., Fischer, H., Gallet, J. C., Johnsen, S., Leuenberger, M., Loulergue, L., Lüthi, D., Oerter, H., Parrenin, F., G., R., Raynaud, D., Schilt, A., Schwander, J., Selmo, E., Souchez, R., Spahni, R., Stauffer, B., Steffensen, J. P., Stenni, B., Stocker, T., Tison, J. L., Werner, M., and Wolff, E. W. (2007). Orbital and millennial Antarctic climate variability over the past 800,000 years. Science, 317(5839):793–796.
 Jung et al. (2000) Jung, T.P., Makeig, S., Humphries, C., Lee, T.W., Mckeown, M., Iragui, V., and Sejnowski, T. (2000). Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37(2):163–178.
 Kleinsteuber and Shen (2013) Kleinsteuber, M. and Shen, H. (2013). Uniqueness analysis of non-unitary matrix joint diagonalization. IEEE Transactions on Signal Processing, 61(7):1786–1796.
 Koles et al. (1990) Koles, Z. J., Lazar, M. S., and Zhou, S. Z. (1990). Spatial patterns underlying population differences in the background EEG. Brain Topography, 2(4):275–284.
 Lotte et al. (2018) Lotte, F., Bougrain, L., Cichocki, A., Clerc, M., Congedo, M., Rakotomamonjy, A., and Yger, F. (2018). A review of classification algorithms for EEGbased braincomputer interfaces: a 10 year update. Journal of Neural Engineering, 15(3):031005.
 Lütkepohl (2005) Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer Science & Business Media.
 Makeig et al. (1996) Makeig, S., Bell, A., Jung, T.P., and Sejnowski, T. (1996). Independent component analysis of electroencephalographic data. In Advances in Neural Information Processing Systems (NIPS 8), pages 145–151.
 Makeig et al. (1997) Makeig, S., Jung, T.P., Bell, A., Ghahremani, D., and Sejnowski, T. (1997). Blind separation of auditory eventrelated brain responses into independent components. Proceedings of the National Academy of Sciences, 94(20):10979–10984.
 Makeig et al. (2002) Makeig, S., Westerfield, M., Jung, T.P., Enghoff, S., Townsend, J., Courchesne, E., and Sejnowski, T. (2002). Dynamic brain sources of visual evoked responses. Science, 295(5555):690–694.
 Matsuoka et al. (1995) Matsuoka, K., Ohoya, M., and Kawamoto, M. (1995). A neural net for blind separation of nonstationary signals. Neural Networks, 8(3):411–419.
 McKeown et al. (1998a) McKeown, M., Jung, T.P., Makeig, S., Brown, G., Kindermann, S., Lee, T.W., and Sejnowski, T. (1998a). Spatially independent activity patterns in functional MRI data during the Stroop colornaming task. Proceedings of the National Academy of Sciences, 95(3):803–810.
 McKeown et al. (1998b) McKeown, M., Makeig, S., Brown, G., Jung, T.P., Kindermann, S., Bell, A., and Sejnowski, T.J. (1998b). Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6(3):160–188.
 Miettinen et al. (2012) Miettinen, J., Nordhausen, K., Oja, H., and Taskinen, S. (2012). Statistical properties of a blind source separation estimator for stationary time series. Statistics & Probability Letters, 82(11):1865–1873.
 Moulines et al. (1997) Moulines, E., Cardoso, J.F., and Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3617–3620. IEEE.
 Nordhausen (2014) Nordhausen, K. (2014). On robustifying some second order blind source separation methods for nonstationary time series. Statistical Papers, 55(1):141–156.
 Nunez and Srinivasan (2006) Nunez, P. and Srinivasan, R. (2006). Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press.
 Pearl (2009) Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(10):2825–2830.
 Peters et al. (2016) Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012.
 Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA.
 Pexels GmbH (2018) Pexels GmbH (2018). Pexels. https://www.pexels.com/. Accessed: 20181030.
 Pham and Cardoso (2001) Pham, D.T. and Cardoso, J.F. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Transactions on Signal Processing, 49(9):1837–1848.
 Rothenhäusler et al. (2015) Rothenhäusler, D., Heinze, C., Peters, J., and Meinshausen, N. (2015). BACKSHIFT: Learning causal cyclic graphs from unknown shift interventions. In Advances in Neural Information Processing Systems (NIPS 28), pages 1513–1521.
 Shimizu et al. (2006) Shimizu, S., Hoyer, P., Hyvärinen, A., and Kerminen, A. (2006). A linear nonGaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10):2003–2030.
 Stocker (2014) Stocker, T. (2014). Climate change 2013: the physical science basis: Working Group I contribution to the Fifth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press.
 Tangermann et al. (2012) Tangermann, M., Müller, K.R., Aertsen, A., Birbaumer, N., Braun, C., Brunner, C., Leeb, R., Mehring, C., Miller, K., MüllerPutz, G., Nolte, G., Pfurtscheller, G., Preissl, H., Schalk, G., Schlögl, A., Vidaurre, C., Waldert, S., and Blankertz, B. (2012). Review of the BCI Competition IV. Frontiers in Neuroscience, 6:55.
 Tichavsky and Yeredor (2009) Tichavsky, P. and Yeredor, A. (2009). Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 57(3):878–891.
 Tong et al. (1990) Tong, L., Soon, V., Huang, Y., and Liu, R. (1990). AMUSE: A new blind identification algorithm. In Circuits and Systems, 1990., IEEE International Symposium on, pages 1784–1787. IEEE.
 Treder et al. (2011) Treder, M., Bahramisharif, A., Schmidt, N., Van Gerven, M., and Blankertz, B. (2011). Braincomputer interfacing using modulations of alpha activity induced by covert shifts of attention. Journal of NeuroEngineering and Rehabilitation, 8(1):24.
 Zhukov et al. (2000) Zhukov, L., Weinstein, D., and Johnson, C. (2000). Independent component analysis for EEG source localization. IEEE Engineering in Medicine and Biology Magazine, 19(3):87–96.
 Ziehe et al. (2004) Ziehe, A., Laskov, P., Nolte, G., and Müller, K.R. (2004). A fast algorithm for joint diagonalization with nonorthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, 5(7):777–800.
Appendix A Identifiability proof
Proof
The theorem is proven by invoking Kleinsteuber and Shen (2013, Theorem 1). We first define the unmixing matrix and introduce the sets of matrices
and
Due to the assumed ICA model and Assumption 1 all matrices in the sets and are diagonal (cf. (2.3)). Moreover, for and it holds that
and
Using notation as in Kleinsteuber and Shen (2013) we define for all the vectors
depending on whether there is a variance signal or a time-dependence signal, respectively. Then, Assumption 2 or Assumption 3 implies for all distinct pairs that
Hence, for either or it holds that , where is as defined in Kleinsteuber and Shen (2013). We can therefore invoke Kleinsteuber and Shen (2013, Theorem 1) to conclude that any matrix for which is diagonal for all is equal to the identity matrix up to scaling and permutation of its columns. Next, we consider the two signal types separately.
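The scaled-permutation property invoked here can be checked mechanically: a square matrix equals the identity up to scaling and permutation of its columns precisely when every row and every column contains exactly one nonzero entry. A small sketch of such a check (the helper name is ours, not from the paper):

```python
import numpy as np

def is_scaled_permutation(M, tol=1e-8):
    """True iff M is a permutation matrix up to nonzero column scalings,
    i.e. every row and every column has exactly one entry above tol."""
    nz = np.abs(M) > tol
    return bool(np.all(nz.sum(axis=0) == 1) and np.all(nz.sum(axis=1) == 1))

# Columns of the identity, permuted and rescaled (including a sign flip):
# this matrix satisfies the property, the all-ones matrix does not.
P = np.eye(3)[:, [2, 0, 1]] * np.array([2.0, -0.5, 3.0])
print(is_scaled_permutation(P), is_scaled_permutation(np.ones((3, 3))))
```

Such a check is also useful in simulations for verifying that an estimated unmixing matrix recovers the true sources up to the inherent ICA indeterminacies.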

variance signal: If there is a variance signal that satisfies Assumption 2, assume there exists an invertible matrix such that for all and all it holds that