Correlated Components Analysis --- Extracting Reliable Dimensions in Multivariate Data

01/26/2018 ∙ by Lucas C. Parra, et al. ∙ City University of New York (CUNY) and Berlin Institute of Technology (Technische Universität Berlin)

How does one find data dimensions that are reliably expressed across repetitions? For example, in neuroscience one may want to identify combinations of brain signals that are reliably activated across multiple trials or subjects. For a clinical assessment with multiple ratings, one may want to identify an aggregate score that is reliably reproduced across raters. The approach proposed here --- "correlated components analysis" --- is to identify components that maximally correlate between repetitions (e.g. trials, subjects, raters). This can be expressed as the maximization of the ratio of between-repetition to within-repetition covariance, resulting in a generalized eigenvalue problem. We show that covariances can be computed efficiently without explicitly considering all pairs of repetitions, that the result is equivalent to multi-class linear discriminant analysis for unbiased signals, and that the approach also maximizes reliability, defined as the mean over repetitions divided by the deviation across repetitions. We also extend the method to non-linear components using kernels, discuss regularization to improve numerical stability, present parametric and non-parametric tests to establish statistical significance, and provide code.

1 Introduction

Consider the following scenario: A group of people are shown a movie while their neural activity is recorded. Guided by the assumption that the movie evokes similar brain responses in different subjects, the goal is to identify movie-related brain activity that is common across individuals (Dmochowski et al., 2012). Because activity is distributed across multiple sensors, we would like to find linear combinations of sensors that have a similar time course in all subjects. In other words, we would like to identify underlying components or factors that have high inter-subject correlation.

In a different setting, the task may be to assess motor skills. A clinician observes a group of individuals at various tasks and provides a rating for each person and task (Rousson et al., 2002). Typically, ratings are combined across tasks so that each person obtains a single aggregate performance score. But how should individual task ratings be combined? Traditionally, this is done by simple averaging of related scores. We propose instead a data-driven approach that aims to make the aggregate scores as consistent as possible across different raters. This means that we want to combine scores so as to maximize inter-rater reliability.

What is common in these two scenarios is the goal of identifying directions in multivariate data sets that are reliably reproduced across repetitions. Correlated Components Analysis (CorrCA) accomplishes this goal. CorrCA was originally developed in the context of neuroimaging studies to extract similar activity in multiple subjects (Dmochowski et al., 2012), and was re-developed independently to extract reliable brain responses across repeated trials in a single subject (Tanaka et al., 2013). Here we will show that the technique can also be used to identify aggregate ratings with high inter-rater reliability.

More generally, CorrCA is applicable whenever a data volume is available with dimensions $D \times T \times N$, where the $N$ renditions constitute congruent repetitions. The method identifies directions in the $D$-dimensional space along which the data maximally correlate between repeated measurements, with correlation measured across the $T$ samples. This is similar to Principal Component Analysis (PCA, Hotelling, 1933) in that CorrCA finds a set of $D$-dimensional projection vectors that linearly decompose a data set (of size $D \times T$). Instead of capturing dimensions with maximum variance within a single data set as in PCA, CorrCA captures the dimensions with maximum correlation between the data sets. As we will show here, the optimality criterion of CorrCA can also be used to derive Canonical Correlation Analysis (CCA, Hotelling, 1936), including multi-set CCA (MCCA, McKeon, 1966; Kettenring, 1971). The methods differ in that CorrCA uses a single set of linear projections for all data sets, whereas CCA and MCCA yield a different set of projection vectors for each repetition. MCCA is well established for the purpose of identifying similar brain activity across subjects (Li et al., 2009; Correa et al., 2010; Hwang et al., 2012; Afshin-Pour et al., 2012; Lankinen et al., 2014; Zhang et al., 2017). Here we will show for brain signals and rating data that CorrCA can achieve better performance than MCCA by virtue of the smaller number of free parameters.

The main contribution of this paper is to provide a formal characterization of CorrCA and, with this, establish a few novel theoretical results. In particular, we show that maximizing the correlation between repeated measurements is indeed equivalent to maximizing the reliability of the projected data. By “reliability” we mean the average over repetitions, divided by the variance over repetitions, which is also a sensible definition for signal-to-noise ratio. We establish a direct link to MCCA, as well as Linear Discriminant Analysis (LDA). CorrCA and MCCA both maximize inter-subject correlation, defined here as the ratio of between- to within-subject covariance, while LDA maximizes the between-class over the within-class variance. Surprisingly, LDA and CorrCA give the same result for equal-mean signals. This direct equivalence allows us to formulate a strict statistical test for significance based on Fisher’s variance ratio (F-statistic). The equivalence to LDA suggests a straightforward extension of CorrCA to non-linear mappings using the well-known kernel approach. We also provide a validation of CorrCA for identifying shared dimensions and validate the proposed tests of statistical significance using simulated data. Another important contribution of this paper is to formalize inter-subject correlation (ISC) as a correlation metric which applies to more than two signals. ISC differs from conventional definitions of correlation, yet seamlessly integrates into the frameworks of Linear Discriminant Analysis, Canonical Correlation Analysis, and the F-statistic.

Before we present the mathematical treatment, we begin with a simple two-dimensional example to visualize the basic concept of identifying reliable signal components that are contained in a shared linear dimension.

Figure 1: Extraction of a correlated component in 2D time courses. A common electrical current source, $s_1(t)$, is captured by voltage sensors $x_{1l}(t)$ and $x_{2l}(t)$ for subject $l = 1, 2$. This common source contributes with identical scaling to both subjects: $x_{il}(t) = a_{i1}\, s_1(t) + a_{i2}\, s_{2l}(t)$, where $s_{2l}(t)$ is a source that contributes with a different value to each subject but with the same mixing coefficients $a_{i2}$. Note that current sources contribute additively to a voltage recording in a resistive medium, hence the addition. (A) Time courses $x_{1l}(t)$ for both subjects. (B) Time courses $x_{2l}(t)$ for both subjects. (C) Same data, but now individual time points are connected (by gray lines) between the subjects. The source directions $\mathbf{a}_1$ and $\mathbf{a}_2$ are indicated with a red and a blue arrow respectively and are the same for both subjects, i.e. they are “shared” dimensions. (D) Projection $y_{1l}(t)$ with maximal correlation. The time courses of the two subjects are perfectly correlated, with inter-subject correlation ISC=1. (E) Projection $y_{2l}(t)$ that is not common between subjects. (F) Along dimension $y_1$, both subjects’ data are expressed identically, i.e. this is the (“reliable”) source dimension that is shared between the two subjects. Values along dimension $y_2$ differ randomly between subjects. This dimension is not “reliable” across subjects.

Extracting shared dimensions — an illustrative example

For this illustrative example we consider a simulation of electrical brain signals recorded from multiple subjects. Assume that we have two sensors, $x_1$ and $x_2$, which record signals over time from each of two subjects, as shown in Figure 1 A & B. The figure caption details the generation of the data; briefly, there are two “sources” of signal: one has a common waveform for both subjects, while a second source has a subject-dependent waveform. The common waveform induces a correlation between subjects in both sensors, although the exact shared waveform is not immediately apparent in the individual time courses (panels A & B). The source directions do become clear from a scatter plot of the same data (panel C, blue and red arrows). Here, for each time point $t$ we have connected the samples of both subjects. Evidently, at any point in time, the signals of the two subjects lie along a line (in the direction of the source not shared by the subjects, blue arrow), but the lines are shifted in parallel (in the direction of the common source, red arrow). The locations along the gray lines are determined by the subject-dependent waveform, while the shift of the parallel line from one time instance to the next is determined by the common waveform. Now we select a linear transformation, $\mathbf{V}$, to map these data onto a component space, $\mathbf{y}_l(t) = \mathbf{V}^\top \mathbf{x}_l(t)$, such that the component signals are maximally correlated between subjects. With this choice, the time courses of both subjects in the first component dimension, $y_{1l}(t)$, are now perfectly correlated (panel D). Thus, we have extracted the common source of variation in these two subjects, whereas the second component dimension, $y_{2l}(t)$, now captures a waveform that is uncorrelated between subjects (panel E). As a result, the gray lines in the scatter plot are now vertical (panel F), separating the signal into a reproducible dimension ($y_1$) and a non-reproducible dimension ($y_2$). Evidently, this transformation captured the source directions in these data (blue and red arrows). The objective of Correlated Components Analysis is to identify the unknown projection matrix $\mathbf{V}$ by analyzing the time courses $\mathbf{x}_l(t)$.
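To make this concrete, the following minimal sketch (our own illustration in Python, not the code accompanying the paper) generates two-sensor data for two subjects from one shared and one subject-specific source; the mixing matrix, the sample count and all variable names are illustrative assumptions. Projecting with the inverse of the mixing matrix recovers one component that is perfectly correlated between subjects and one that is not — exactly the structure that CorrCA will later extract from the data alone.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                 # number of time samples (arbitrary choice)
A = np.array([[1.0, 0.3],              # hypothetical mixing matrix: columns are the
              [0.5, 1.0]])             # source directions (red and blue arrows)

s_common = rng.standard_normal(T)      # waveform shared by both subjects
X = []                                 # sensor data, one (2 x T) array per subject
for _ in range(2):
    s_private = rng.standard_normal(T) # waveform specific to this subject
    S = np.vstack([s_common, s_private])
    X.append(A @ S)                    # additive mixing into the two sensors

# Unmixing with the inverse of A recovers the sources; the first recovered
# component is identical across subjects (correlation 1), the second is not.
V = np.linalg.inv(A).T                 # projection vectors as columns
Y = [V.T @ x for x in X]
for i in range(2):
    r = np.corrcoef(Y[0][i], Y[1][i])[0, 1]
    print(f"component {i + 1}: correlation between subjects = {r:+.2f}")
```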

2 Theory

2.1 Problem statement: Maximal inter-subject correlation

The specific objective of CorrCA is to find dimensions in multivariate data that maximize correlation between repetitions. Denote the $D$-dimensional data by $\mathbf{x}_l(t) \in \mathbb{R}^D$, where $t = 1, \ldots, T$ enumerates exemplars, and $l = 1, \ldots, N$ represents repeated renditions of these data. In the case of brain signals, the exemplars represent time points, so that $t$ indexes a sequence in time of length $T$, and $l$ indexes repeated recordings (subjects or trials). In the case of a behavioral assessment instrument, $t$ may index individuals receiving a rating, while $l$ indexes multiple raters, or repeated ratings given to the same individuals by a single rater. In both instances, the observations $\mathbf{x}_l(t)$ are $D$-dimensional ($D$ sensors that record brain signals, or $D$ items rated in the assessment instrument). For simplicity, in the following we will stick with the terminology of time and subjects, so that correlation is measured between subjects and across time. When necessary, we will rephrase the resulting expressions for the case of assessing inter-rater correlation or test-retest correlation.

The goal is now to identify a linear combination of these measures, defined by a projection vector $\mathbf{v}$:

$y_l(t) = \mathbf{v}^\top \mathbf{x}_l(t)$   (1)

such that the correlation between subjects (repetitions) is maximized. We define inter-subject correlation (ISC) as the ratio of the between-subject covariance, $r_b$, and the within-subject covariance, $r_w$:

$\rho = \dfrac{r_b}{(N-1)\, r_w}$   (2)

The between-subject covariance is a sum over all pairs of subjects, and the within-subject covariance is a sum over all subjects:

$r_b = \sum_{l=1}^{N} \sum_{k=1, k \neq l}^{N} \sum_{t=1}^{T} \big(y_l(t) - \bar{y}_l\big)\big(y_k(t) - \bar{y}_k\big)$   (3)

$r_w = \sum_{l=1}^{N} \sum_{t=1}^{T} \big(y_l(t) - \bar{y}_l\big)^2$   (4)

where $\bar{y}_l = \frac{1}{T}\sum_{t=1}^{T} y_l(t)$ is the sample-mean for subject $l$. The factor $N-1$ in definition (2) is required so that the correlation is normalized, with a maximum value of 1 (see Appendix G.2). To simplify the notation, in our “variance” and “covariance” definitions we will generally omit common scaling factors.

While we refer to $\rho$ as the inter-subject correlation (ISC), it could also be termed the inter-rater correlation or inter-repeat correlation (IRC) depending on the context. In Appendix G.1 we discuss this definition of correlation and its relation to other measures of correlation, such as intra-class correlation or Pearson’s correlation coefficient. Note that $r_w$ is also the variance around the subject-mean.
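As a sanity check, the definition in Eqs. (2)-(4) can be evaluated directly. The sketch below is our own illustration (function and variable names are hypothetical); it computes the ISC of a single projected component given as an array of shape (N, T).

```python
import numpy as np

def isc(Y):
    """Inter-subject correlation (Eq. 2) of one component.

    Y : array of shape (N, T) holding the projected signal y_l(t)
        for each of N repetitions (subjects), T samples each.
    """
    N, T = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)       # remove each subject's mean
    r_w = np.sum(Yc ** 2)                        # within-subject covariance, Eq. (4)
    r_b = 0.0                                    # between-subject covariance, Eq. (3)
    for l in range(N):
        for k in range(N):
            if k != l:
                r_b += np.dot(Yc[l], Yc[k])
    return r_b / ((N - 1) * r_w)

# Example: identical signals in all repetitions give ISC = 1.
rng = np.random.default_rng(1)
y = rng.standard_normal(100)
print(isc(np.tile(y, (4, 1))))                   # -> 1.0 (up to numerical precision)
```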

By inserting Eq. (1) into definition (2) it follows readily that

$\rho(\mathbf{v}) = \dfrac{\mathbf{v}^\top \mathbf{R}_b\, \mathbf{v}}{(N-1)\, \mathbf{v}^\top \mathbf{R}_w\, \mathbf{v}}$   (5)

where $\mathbf{R}_b$ and $\mathbf{R}_w$ are the between-subject and within-subject covariance matrices of $\mathbf{x}_l(t)$, defined analogously to (3)-(4):

$\mathbf{R}_b = \sum_{l=1}^{N} \sum_{k=1, k \neq l}^{N} \sum_{t=1}^{T} \big(\mathbf{x}_l(t) - \bar{\mathbf{x}}_l\big)\big(\mathbf{x}_k(t) - \bar{\mathbf{x}}_k\big)^\top$   (6)

$\mathbf{R}_w = \sum_{l=1}^{N} \sum_{t=1}^{T} \big(\mathbf{x}_l(t) - \bar{\mathbf{x}}_l\big)\big(\mathbf{x}_l(t) - \bar{\mathbf{x}}_l\big)^\top$   (7)

Here $\bar{\mathbf{x}}_l = \frac{1}{T}\sum_{t=1}^{T} \mathbf{x}_l(t)$ is the sample-mean vector for subject $l$. The projection vector that maximizes $\rho$ can be found as the solution of $\partial \rho / \partial \mathbf{v} = 0$, which yields:

$\mathbf{R}_w^{-1}\, \mathbf{R}_b\, \mathbf{v} = (N-1)\, \rho\, \mathbf{v}$   (8)

Given that $\mathbf{R}_w$ is invertible, $\mathbf{v}$ is an eigenvector of $\mathbf{R}_w^{-1}\mathbf{R}_b$ with eigenvalue $\lambda = (N-1)\,\rho$. Maximal inter-subject correlation is achieved by projecting the data onto the eigenvector of $\mathbf{R}_w^{-1}\mathbf{R}_b$ with maximal eigenvalue. We refer to this principal eigenvector as $\mathbf{v}_1$. In addition to this projection with maximal ISC, there may be additional dimensions that capture significant ISC in the data. Our goal is to find all such projection vectors, $\mathbf{v}_i$. Thus, the objective is to maximize ISC summed over all components. However, we want each component to capture a different, independent aspect of the data. We therefore require components to be uncorrelated, similar to the constraints enforced by Independent Components Analysis or Canonical Correlation Analysis. (The alternative of requiring orthogonal projection vectors, as in PCA, is discussed in Appendix A.5.) Thus, in total the goal is to maximize the following cost function with respect to $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_D]$:

$\max_{\mathbf{V}} \sum_{i=1}^{D} \rho(\mathbf{v}_i)$   (9)

subject to the constraint that components are mutually uncorrelated, i.e., $\sum_{l,t} (y_{il}(t) - \bar{y}_{il})(y_{jl}(t) - \bar{y}_{jl}) = 0$ for all $i \neq j$. The solution to this problem is given by the eigenvector matrix $\mathbf{V}$ that satisfies the eigenvalue equation (see Appendix A.4):

$\mathbf{R}_w^{-1}\, \mathbf{R}_b\, \mathbf{V} = (N-1)\, \mathbf{V} \boldsymbol{\Lambda}$   (10)

where $\boldsymbol{\Lambda}$ is a diagonal matrix with diagonal terms measuring the ISC, $\rho_i = \Lambda_{ii}$, of the corresponding projections of the data, $y_{il}(t) = \mathbf{v}_i^\top \mathbf{x}_l(t)$. Note that these additional projections of the data are uncorrelated, as desired, because their correlation matrix, $\mathbf{R}_w^y$ — defined as in Eq. (7) but for $\mathbf{y}_l(t)$ — is diagonal:

$\mathbf{R}_w^y = \mathbf{V}^\top \mathbf{R}_w\, \mathbf{V} = \text{diagonal}$   (11)

The fact that $\mathbf{V}^\top \mathbf{R}_w \mathbf{V}$ is a diagonal matrix is a well-known property of the generalized eigenvalue equation (10) (see Appendix A.1).

To summarize, the principal eigenvector of $\mathbf{R}_w^{-1}\mathbf{R}_b$, i.e. of the between-subject relative to the within-subject covariance, provides the projection of the data that maximizes ISC. Eigenvectors with smaller eigenvalues provide additional projections of the data that are uncorrelated with one another. The corresponding values $\rho_i$ are the ISC that each projection achieves. It follows that the last projection captures the direction with the smallest ISC. Note that the corresponding projections are uniquely defined, provided that the ISC values are distinct (see Appendix A.3).

In the illustrative example of Figure 1 with $D = 2$, the ISC values were $\rho_1 = 1$ and $\rho_2 = 0$ for the red and blue directions, respectively (see Figure 1 D and 1 F). Evidently the shared dimension along which the two subjects vary in unison (red) has been identified. A concrete application of CorrCA to the problem of identifying brain activity that is shared between subjects is described in Section 3.1. A thorough evaluation of the conditions under which the algorithm recovers the optimal projection vectors will be presented in Section 4 (Fig. 7).

Given a set of extracted correlated components, the question arises as to how many of these components exhibit statistically significant correlations. For zero mean, independent test data we can use the F-statistic to determine statistical significance (see Section 2.4, Eq. 26). Statistical testing is described in Appendix D, where we also treat the case of non-independent data using shuffle statistics. The methods for statistical testing will be validated with numerical simulations in Section 4.
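For reference, the entire procedure of Eqs. (6)-(10) fits in a few lines. The following is a minimal sketch, not the reference implementation accompanying the paper; the function name, the (N, D, T) argument layout, and the use of scipy.linalg.eigh are our own choices, and the code assumes a full-rank within-subject covariance (see the regularization discussion in Section 3.5 and Appendix C).

```python
import numpy as np
from scipy.linalg import eigh

def corrca(X):
    """Correlated Components Analysis via the generalized eigenvalue
    problem of Eq. (10).  Minimal sketch.

    X : array of shape (N, D, T) -- N repetitions, D dimensions, T samples.
    Returns projection vectors V (columns sorted by decreasing ISC) and the
    corresponding ISC values rho.
    """
    N, D, T = X.shape
    Xc = X - X.mean(axis=2, keepdims=True)           # subtract each subject's mean
    # Within-subject covariance, Eq. (7)
    Rw = sum(Xc[l] @ Xc[l].T for l in range(N))
    # Total covariance from the subject-average, Eq. (14): no sum over pairs
    Xsum = Xc.sum(axis=0)                            # N times the subject-average
    Rt = Xsum @ Xsum.T
    Rb = Rt - Rw                                     # Eq. (12)
    # Generalized eigenvalue problem Rb v = lambda Rw v, with lambda = (N-1) rho.
    # eigh assumes Rw is positive definite (regularize otherwise).
    lam, V = eigh(Rb, Rw)
    order = np.argsort(lam)[::-1]                    # sort by decreasing eigenvalue
    V, lam = V[:, order], lam[order]
    rho = lam / (N - 1)                              # ISC of each component
    return V, rho
```

Applied to data of shape (N, D, T), the first column of V is the projection with maximal ISC.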

2.2 Fast computation of inter-subject correlation

In Eq. (6), the term with $k = l$ is excluded from the sum over subject pairs. Note that the excluded term is precisely $\mathbf{R}_w$, so that adding the two covariances together completes the sum over all pairs $(l, k)$, including $k = l$. We will refer to this, therefore, as the total covariance:

$\mathbf{R}_t = \mathbf{R}_b + \mathbf{R}_w$   (12)

$\mathbf{R}_t = \sum_{l=1}^{N} \sum_{k=1}^{N} \sum_{t=1}^{T} \big(\mathbf{x}_l(t) - \bar{\mathbf{x}}_l\big)\big(\mathbf{x}_k(t) - \bar{\mathbf{x}}_k\big)^\top$   (13)

This relationship is useful because $\mathbf{R}_t$ can be simplified to:

$\mathbf{R}_t = N^2 \sum_{t=1}^{T} \big(\bar{\mathbf{x}}(t) - \bar{\bar{\mathbf{x}}}\big)\big(\bar{\mathbf{x}}(t) - \bar{\bar{\mathbf{x}}}\big)^\top$   (14)

where $\bar{\mathbf{x}}(t) = \frac{1}{N}\sum_{l=1}^{N}\mathbf{x}_l(t)$ and $\bar{\bar{\mathbf{x}}} = \frac{1}{NT}\sum_{l,t}\mathbf{x}_l(t)$ are the mean across subjects and the grand mean, respectively. Thus, surprisingly, to compute inter-subject correlation one never has to consider correlations between pairs of subjects, because $\mathbf{R}_b$ can be computed from $\mathbf{R}_t$ and $\mathbf{R}_w$ as $\mathbf{R}_b = \mathbf{R}_t - \mathbf{R}_w$, and neither one requires a sum over pairs of subjects. This makes the computation roughly a factor of $N$ more efficient than the direct implementation of Eqs. (6)-(7), which requires a sum over all $N^2$ subject pairs. The direct implementation was used in all previous work (e.g., Dmochowski et al., 2012; Tanaka et al., 2013; Cohen and Parra, 2016), but becomes problematic when one uses population sizes reaching hundreds of subjects or more (e.g., Petroni et al., 2018; Alexander et al., 2017).
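The identity $\mathbf{R}_b = \mathbf{R}_t - \mathbf{R}_w$ is easy to verify numerically. The sketch below (our own illustration, with arbitrary sizes and variable names) compares the direct pairwise sum of Eq. (6) with the fast route of Eqs. (12)-(14).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 50, 8, 200                      # illustrative sizes
X = rng.standard_normal((N, D, T))
Xc = X - X.mean(axis=2, keepdims=True)

# Direct implementation of Eq. (6): sum over all N*(N-1) pairs of subjects.
Rb_direct = np.zeros((D, D))
for l in range(N):
    for k in range(N):
        if k != l:
            Rb_direct += Xc[l] @ Xc[k].T

# Fast route via Eqs. (12)-(14): only one pass over subjects is needed.
Rw = sum(Xc[l] @ Xc[l].T for l in range(N))
S = Xc.sum(axis=0)                        # N times the subject-average, per sample
Rt = S @ S.T                              # Eq. (14)
Rb_fast = Rt - Rw                         # Eq. (12)

print(np.allclose(Rb_direct, Rb_fast))    # -> True
```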

2.3 Relationship to Linear Discriminant Analysis

Perhaps the naming and the additivity of the covariance matrices introduced here — Eqs. (6), (7), (12) — reminds some readers of the total, between-class, and within-class scatter matrices used in Linear Discriminant Analysis (LDA). In the current notation these are:

$\mathbf{S}_t = \sum_{t=1}^{T} \sum_{l=1}^{N} \big(\mathbf{x}_l(t) - \bar{\bar{\mathbf{x}}}\big)\big(\mathbf{x}_l(t) - \bar{\bar{\mathbf{x}}}\big)^\top$   (15)

$\mathbf{S}_b = N \sum_{t=1}^{T} \big(\bar{\mathbf{x}}(t) - \bar{\bar{\mathbf{x}}}\big)\big(\bar{\mathbf{x}}(t) - \bar{\bar{\mathbf{x}}}\big)^\top$   (16)

$\mathbf{S}_w = \sum_{t=1}^{T} \sum_{l=1}^{N} \big(\mathbf{x}_l(t) - \bar{\mathbf{x}}(t)\big)\big(\mathbf{x}_l(t) - \bar{\mathbf{x}}(t)\big)^\top$   (17)

Here, index $t$ enumerates the different classes and index $l$ enumerates the exemplars in each class (see Table 2 for an overview of the use of indices in different algorithms). Therefore $\bar{\mathbf{x}}(t)$ and $\bar{\bar{\mathbf{x}}}$ are the class-mean and grand-mean vectors, respectively. Scatter matrices also satisfy an additivity rule: $\mathbf{S}_t = \mathbf{S}_b + \mathbf{S}_w$ (Duda et al., 2012). The goal of LDA is to maximize the separation between classes, which is measured by the ratio of between-class to within-class variance:

$J(\mathbf{v}) = \dfrac{\mathbf{v}^\top \mathbf{S}_b\, \mathbf{v}}{\mathbf{v}^\top \mathbf{S}_w\, \mathbf{v}}$   (18)

As before, the maximal separation is obtained for the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with the maximum eigenvalue $J$. And as before, we can extract additional projections that capture class separation. If we require that these projections are uncorrelated, then they are given by the additional eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$. This concept was first proposed by Rao (1948) and generalizes the case of two classes introduced by Fisher (1936). We refer to this as classical LDA, and note that it maximizes the sum of variance ratios computed separately for each projection (see Appendix A.4). An alternative approach is needed if instead the goal is to maximize the ratio of summed variances (Yan and Tang, 2006), or if one prefers orthogonal projections (Cunningham and Ghahramani, 2015).

This optimality criterion and the resulting eigenvalue problem look strikingly similar to those of CorrCA in Eq. (5). Despite the similar naming, however, the scatter matrices differ from the between-subject and within-subject covariance matrices, and it is not immediately obvious how the two optimality criteria relate to one another. There is, however, a close relationship between the two (see Appendix B):

$\mathbf{S}_b = \dfrac{1}{N}\big(\mathbf{R}_b + \mathbf{R}_w\big)$   (19)

$\mathbf{S}_w = \dfrac{1}{N}\big((N-1)\,\mathbf{R}_w - \mathbf{R}_b\big) + T\, \mathbf{R}_{\bar{x}}$   (20)

where $\mathbf{R}_{\bar{x}}$ is the covariance of the subject-means $\bar{\mathbf{x}}_l$ across subjects:

$\mathbf{R}_{\bar{x}} = \sum_{l=1}^{N} \big(\bar{\mathbf{x}}_l - \bar{\bar{\mathbf{x}}}\big)\big(\bar{\mathbf{x}}_l - \bar{\bar{\mathbf{x}}}\big)^\top$   (21)

When these mean values are equal across subjects, e.g. when they are zero-mean, then $\mathbf{R}_{\bar{x}}$ vanishes and the scatter matrices are linear combinations of the between- and within-subject covariance matrices. Under this assumption, we can also write a simple functional relationship between class separation $J$ and inter-subject correlation $\rho$:

$J = \dfrac{1 + (N-1)\,\rho}{(N-1)\,(1 - \rho)}$   (22)

Note that this relationship is monotonically increasing for $\rho < 1$, because the slope of this relationship is strictly positive:

$\dfrac{dJ}{d\rho} = \dfrac{N}{(N-1)\,(1-\rho)^2} > 0$   (23)

This means that maximizing class separation $J$ in (18) also results in maximal inter-subject correlation $\rho$, provided we equalize the means. In fact, finding vectors that maximize $J$ of Eq. (18) gives the same set of solutions as maximizing $\rho$ of Eq. (5).

To see this, note that the goal of maximizing a ratio of two matrices, such as in Eq. (5), can also be achieved by joint diagonalization of these two matrices (see Appendix A.1):

$\mathbf{V}^\top \mathbf{R}_b\, \mathbf{V} = \boldsymbol{\Lambda}_b, \qquad \mathbf{V}^\top \mathbf{R}_w\, \mathbf{V} = \boldsymbol{\Lambda}_w$   (24)

where $\boldsymbol{\Lambda}_b$ and $\boldsymbol{\Lambda}_w$ are diagonal matrices. Note that if $\mathbf{V}$ jointly diagonalizes two matrices, it also diagonalizes any linear combination of the two (Fukunaga, 2013, briefly recapitulated in Appendix A.2). Because $\mathbf{S}_b$ and $\mathbf{S}_w$ are linear combinations of $\mathbf{R}_b$ and $\mathbf{R}_w$ (for equal subject-means), the $\mathbf{V}$ that satisfies (24) also diagonalizes $\mathbf{S}_b$ and $\mathbf{S}_w$. Thus, this $\mathbf{V}$ contains the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ as well as $\mathbf{R}_w^{-1}\mathbf{R}_b$, and satisfies in both cases the same constraint of mutually uncorrelated components. Because the eigenvalues of the two problems are monotonically related, the eigenvectors will be sorted in the same order for both problems. Thus, the solutions to both optimization problems are the same, including the constraint of uncorrelated components. To summarize, CorrCA and classical LDA yield the same result, provided that the sample-means are equal across subjects, e.g. zero mean.
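The equivalence is straightforward to check numerically. The sketch below (our own illustration, with arbitrary data sizes) builds both pairs of matrices for zero-mean data, verifies that the CorrCA eigenvectors also diagonalize the LDA scatter matrices, and checks that the class separation follows Eq. (22).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
N, D, T = 6, 4, 500
X = rng.standard_normal((N, D, T))
X -= X.mean(axis=2, keepdims=True)        # equalize (zero) the subject means

# CorrCA covariances, Eqs. (6)-(7)
Rw = sum(x @ x.T for x in X)
S = X.sum(axis=0)
Rb = S @ S.T - Rw

# LDA scatter matrices, Eqs. (16)-(17): classes are samples, exemplars are repeats
xbar = X.mean(axis=0)                     # class means (the grand mean is zero here)
Sb = N * (xbar @ xbar.T)
Sw = sum((x - xbar) @ (x - xbar).T for x in X)

lam, V = eigh(Rb, Rw)                     # CorrCA solution, lam = (N-1)*rho
rho = lam / (N - 1)

# The CorrCA eigenvectors also diagonalize the LDA matrices ...
offdiag = lambda M: M - np.diag(np.diag(M))
print(np.allclose(offdiag(V.T @ Sb @ V), 0, atol=1e-8))
print(np.allclose(offdiag(V.T @ Sw @ V), 0, atol=1e-8))
# ... and the class separation of each component follows Eq. (22).
J = np.diag(V.T @ Sb @ V) / np.diag(V.T @ Sw @ V)
print(np.allclose(J, (1 + (N - 1) * rho) / ((N - 1) * (1 - rho))))
```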

The equivalence between CorrCA and LDA is perhaps surprising. What does this result mean? In discriminant analysis the index $t$ refers to different classes, while in the correlation analysis of CorrCA (and MCCA) it refers to samples, which are used to measure correlation between repeats. What we have demonstrated here is that increasing separation between samples is equivalent to increasing correlation across samples. To understand this, note that separation is high when multiple repeats have similar values, yet these values differ between samples. This means that the multiple repeats have to vary together across samples. In other words, the variations across samples have to be highly correlated across repeats. This is exemplified by the result of Figure 1 F. Remember that the gray lines in this figure connect the two subjects for the same sample $t$. In the language of classification, the gray lines connect two exemplars from the same class (there are 20 classes with two exemplars in each class). We can see that the gray lines (classes) are well separated along the first component axis ($y_1$), with minimal variance between exemplars. In fact, the within-class variance is zero on the $y_1$ axis, as the gray lines are vertical. Thus, with this component CorrCA has found infinite class separation, corresponding to the perfect correlation shown in Figure 1 D (Eq. 22 yields $J = \infty$ for $\rho = 1$).

As an aside, note that the close link between CorrCA and LDA suggests that CorrCA could be readily extended to non-linear mappings following the approach of kernel-LDA (Mika et al., 1999). This is discussed in Appendix F, where we introduce kernel-CorrCA.

CorrCA: maximize ISC, $\rho$   |   LDA: maximize class separation, $J$   |   JD: maximize SNR*
Table 1: Relationship between the optimality criteria of the different algorithms. $\rho$ is the inter-subject correlation, $J$ is the class separation, and SNR is the signal-to-noise ratio defined in Section 2.4; the three criteria are monotonically related and yield the same projections. * assuming zero-mean signals, or unbiased raters (equal subject-means $\bar{\mathbf{x}}_l$).

2.4 Maximum reliability and F-statistics

In the present context, the scatter matrices $\mathbf{S}_b$ and $\mathbf{S}_w$ are interesting for another, perhaps more important reason. Consider the case where $l$ indexes repeated measurements by different raters, or repeated measures on the same subjects, so that $\rho$ now measures inter-rater correlation, or inter-repeat correlation (IRC). According to Eq. (16), $\mathbf{S}_b$ measures (except for a scaling factor) the sample-covariance of $\bar{\mathbf{x}}(t)$, which is the mean across repeats. On the other hand, according to Eq. (17), $\mathbf{S}_w$ measures the sample-covariance around this mean, averaged over repeats. When projected onto $\mathbf{v}$, they define the variance of the mean across repeats, $\mathbf{v}^\top\mathbf{S}_b\mathbf{v}$, and the average variance around these means, $\mathbf{v}^\top\mathbf{S}_w\mathbf{v}$, respectively. The ratio of the two variances is a sensible definition for signal-to-noise ratio (SNR):

$\mathrm{SNR} = \dfrac{\mathbf{v}^\top \mathbf{S}_b\, \mathbf{v}}{\mathbf{v}^\top \mathbf{S}_w\, \mathbf{v}}$   (25)

We take this SNR also as a metric of repeat-reliability. In this view, the results of the previous section show that maximizing correlation between repeats is equivalent to maximizing repeat-reliability. In particular, Eq. (22) provides the relationship between SNR and IRC. An application of CorrCA to the problem of identifying ratings that are reliably reproduced between different raters is described in Sections 3.2 and 3.3. In Section 3.4 we describe results on the problem of identifying components of brain activity that are reliably reproduced across repeated trials.

A component decomposition method that specifically maximizes signal-to-noise ratio was previously described in (de Cheveigné and Parra, 2014). This method can be formulated as a joint diagonalization (JD) problem similar to CorrCA or LDA. When SNR is defined as in Eq. (25), then the JD approach diagonalizes the matrices $\mathbf{S}_b$ and $\mathbf{S}_t$ in the present notation. By the same argument as in Section 2.3 it is clear that this JD approach provides the same solution as LDA and CorrCA. The optimization criteria for the three techniques are summarized in Table 1.

We note that the SNR of Eq. (25) is a ratio of variances, which permits a direct link to the $F$-statistic, as used in the analysis of variance:

$F = \dfrac{\mathbf{v}^\top \mathbf{S}_b\, \mathbf{v}\, /\, d_1}{\mathbf{v}^\top \mathbf{S}_w\, \mathbf{v}\, /\, d_2}$   (26)

For normally and independently distributed samples this quantity follows the $F$-distribution with degrees of freedom $d_1 = T - 1$ and $d_2 = T\,(N - 1)$ (Papoulis and Pillai, 2002, Chapter 6). We can therefore use the $F$-distribution to compute statistical significance for the SNR, and hence the ISC, of a component. This gives us a strict parametric test of significance for each CorrCA component (on unseen data, and provided the means are equalized, i.e. equal $\bar{\mathbf{x}}_l$ for all subjects). We will leverage this observation when we perform tests for statistical significance of $\rho$ evaluated on test data (see Appendix D).
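In code, the parametric test amounts to a one-way ANOVA in which the samples play the role of classes. The sketch below is our own illustration (function name and array layout are hypothetical); it assumes zero-mean, independent samples, which is exactly the situation where the F-test applies — for temporally dependent data the shuffle statistics of Appendix D should be used instead.

```python
import numpy as np
from scipy.stats import f as f_dist

def component_pvalue(Y):
    """Parametric significance test for one component via the F-statistic (Eq. 26).

    Y : array of shape (N, T) with the component signal y_l(t) for each of
        N repetitions and T samples (ideally unseen test data).
    """
    N, T = Y.shape
    mean = Y.mean(axis=0)                              # mean across repeats
    ss_between = N * np.sum((mean - mean.mean()) ** 2)  # ~ v' Sb v
    ss_within = np.sum((Y - mean) ** 2)                 # ~ v' Sw v
    d1, d2 = T - 1, T * (N - 1)
    F = (ss_between / d1) / (ss_within / d2)
    return F, f_dist.sf(F, d1, d2)                     # right-tailed p-value

# Example: a component with a genuinely shared waveform is highly significant.
rng = np.random.default_rng(0)
shared = rng.standard_normal(200)
Y = shared + 0.5 * rng.standard_normal((5, 200))
print(component_pvalue(Y))
```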

2.5 Forward model

An important aspect of multivariate models is parameter interpretation. CorrCA is an example of a “backward model” in the sense that the observed data are transformed into components that capture the sources of covariation. It shares this property with a host of other methods such as PCA and ICA. An alternative modeling strategy is “forward modeling”, such as Factor Analysis or General Linear Modeling, which transforms known variables to recover the observed data. Such models often capture the physical process of data generation. For instance, in the case of electromagnetic brain signals, a forward model refers to the “image” that a current source in the brain generates on the sensors.

For CorrCA, the backward model is given by the matrix $\mathbf{V}$, which maps observations $\mathbf{x}_l(t)$ to components $\mathbf{y}_l(t) = \mathbf{V}^\top \mathbf{x}_l(t)$. A corresponding forward model can be defined as the projection $\mathbf{A}$ that best recovers the measurements from the components (Parra et al., 2002, 2005):

$\mathbf{x}_l(t) \approx \mathbf{A}\, \mathbf{y}_l(t)$   (27)

The least-squares estimate of this forward-model projection is

$\mathbf{A} = \mathbf{R}_w\, \mathbf{V} \left(\mathbf{V}^\top \mathbf{R}_w\, \mathbf{V}\right)^{-1}$   (28)

Note that the columns of this matrix capture how correlated a putative source $y_{il}(t)$ is with the observed sensor recordings $\mathbf{x}_l(t)$. This approach of recovering the forward model from the backward model is explained in more detail in Haufe et al. (2014), along with a discussion of how forward and backward model parameters need to be interpreted.
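In code, Eq. (28) is a one-liner. The sketch below is our own illustration; it assumes Rw is the same (mean-subtracted, pooled) covariance used in Eq. (10) and V holds the backward-model columns of interest.

```python
import numpy as np

def forward_model(Rw, V):
    """Least-squares forward model of Eq. (28): each column of A shows how the
    corresponding component projects back onto the sensors (e.g., a scalp map)."""
    return Rw @ V @ np.linalg.inv(V.T @ Rw @ V)

# Usage (illustrative): scalp maps of the first three correlated components.
# A = forward_model(Rw, V[:, :3])
```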

The utility of the forward model will be demonstrated for the case of brain activity, where we ask how the brain activity of a specific component manifests at the sensor level (Section 3.1). For the case of ratings we ask which rating items contribute most to a reliable aggregate score (Sections 3.2 and 3.3). In that case the original data are already (approximately) uncorrelated, as we can see from the diagonal form of the correlation matrix between questions in Figure 4 D. For a diagonal within-subject covariance the projections are orthogonal, and the forward and backward models are identical except for an overall scale for each component. In that case we can directly inspect the backward model, as is customary with orthogonal projections such as PCA, where one looks at the component “loadings”.

2.6 Relationship to multi-set Canonical Correlation Analysis

Thus far we have assumed that all repetitions should receive a common projection vector. This makes sense in the case of multiple trials if we expect that the sources of interest behave similarly across repetitions. However, in the case of multiple subjects or multiple raters it may make sense to assume that each should receive their own projection vectors. For instance, brain activity recorded at a specific location may differ between subjects because of anatomical or functional differences. This brings us to the case treated by Canonical Correlation Analysis (CCA), or more specifically multi-set CCA (MCCA). There are different versions of MCCA (Kettenring, 1971; Nielsen, 2002). We will derive one of these methods as a maximization of inter-subject correlation as defined in Eq. (2). Let every subject now have their own projection vector $\mathbf{v}_l$:

$y_l(t) = \mathbf{v}_l^\top \mathbf{x}_l(t)$   (29)

By inserting this into definition (2) it follows readily that

$\rho = \dfrac{\sum_{l=1}^{N} \sum_{k=1, k \neq l}^{N} \mathbf{v}_l^\top \mathbf{R}_{lk}\, \mathbf{v}_k}{(N-1) \sum_{l=1}^{N} \mathbf{v}_l^\top \mathbf{R}_{ll}\, \mathbf{v}_l}$   (30)

where $\mathbf{R}_{lk}$ are the cross-covariance matrices between $\mathbf{x}_l(t)$ and $\mathbf{x}_k(t)$:

$\mathbf{R}_{lk} = \sum_{t=1}^{T} \big(\mathbf{x}_l(t) - \bar{\mathbf{x}}_l\big)\big(\mathbf{x}_k(t) - \bar{\mathbf{x}}_k\big)^\top$   (31)

The projection vectors that maximize $\rho$ can be found as the solutions of $\partial\rho/\partial\mathbf{v}_l = 0$, which yields for each $l$ the following equation:

$\sum_{k=1, k \neq l}^{N} \mathbf{R}_{lk}\, \mathbf{v}_k = (N-1)\, \rho\; \mathbf{R}_{ll}\, \mathbf{v}_l$   (32)

This set of equations for all $l = 1, \ldots, N$ can be written as a single equation if we concatenate the $\mathbf{v}_l$ into a single $ND$-dimensional column vector $\bar{\mathbf{v}} = [\mathbf{v}_1^\top, \ldots, \mathbf{v}_N^\top]^\top$:

$\big(\mathbf{R} - \mathbf{D}\big)\, \bar{\mathbf{v}} = (N-1)\, \rho\; \mathbf{D}\, \bar{\mathbf{v}}$   (33)

where $\mathbf{R}$ is the $ND \times ND$ matrix combining all cross-covariance blocks $\mathbf{R}_{lk}$, and the diagonal blocks $\mathbf{R}_{ll}$ are arranged in a block-diagonal matrix $\mathbf{D}$:

$\mathbf{R} = \begin{bmatrix} \mathbf{R}_{11} & \cdots & \mathbf{R}_{1N} \\ \vdots & \ddots & \vdots \\ \mathbf{R}_{N1} & \cdots & \mathbf{R}_{NN} \end{bmatrix}, \qquad \mathbf{D} = \begin{bmatrix} \mathbf{R}_{11} & & \\ & \ddots & \\ & & \mathbf{R}_{NN} \end{bmatrix}$   (34)

If $\mathbf{D}$ is invertible, the stationary points of the ISC are now the eigenvectors of $\mathbf{D}^{-1}(\mathbf{R} - \mathbf{D})$, or more generally, the generalized eigenvectors of Eq. (33). We can arrange all such eigenvectors as columns of a single matrix $\bar{\mathbf{V}}$. The eigenvector with the largest eigenvalue maximizes the ISC because $\rho$ increases linearly with the eigenvalue. The subsequent eigenvectors maximize ISC subject to the constraint that the component signals are uncorrelated with the preceding ones (see Appendix A.4); in other words, the solution diagonalizes $\mathbf{D}$:

$\bar{\mathbf{V}}^\top \mathbf{D}\, \bar{\mathbf{V}} = \text{diagonal}$   (35)

The solution of the MCCA problem as an eigenvalue problem can be derived in multiple ways (Asendorf, 2015). In (Kettenring, 1971) it is presented as the maximization of the largest eigenvalue of the correlation matrix between subjects (MAXVAR criterion), but it can also be derived by maximizing the sum of correlations between subjects (SUMCORR criterion) (Nielsen, 2002). Here we have derived MCCA as a maximization of ISC. In this sense CorrCA can be seen as a version of MCCA constrained to have identical projection vectors for all data sets (repeats/subjects). MCCA has $ND$ free parameters per component, whereas CorrCA has only $D$. When there is an abundance of data (large $T$) and there is good reason to believe that the data sets should be treated individually, then MCCA may be appropriate. However, when the number of repetitions $N$ is large compared to the number of samples $T$, CorrCA may be preferable. This will be demonstrated in Section 3.1.
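For comparison with the CorrCA sketch above, the block-structured eigenvalue problem of Eqs. (31)-(35) can be assembled directly. This is a minimal sketch of this particular ISC-based MCCA variant, not the authors' implementation; function name, array layout, and the use of scipy routines are our own choices, and no regularization is included.

```python
import numpy as np
from scipy.linalg import eigh, block_diag

def mcca(X):
    """Multi-set CCA by maximizing ISC (Eqs. 29-35).  Minimal sketch.

    X : array of shape (N, D, T).
    Returns subject-specific projection vectors (one D-vector per repetition
    and component) and the ISC of each component.
    """
    N, D, T = X.shape
    Xc = X - X.mean(axis=2, keepdims=True)
    # Cross-covariance blocks R_lk of Eq. (31), stacked into an ND x ND matrix
    R = np.block([[Xc[l] @ Xc[k].T for k in range(N)] for l in range(N)])
    Dmat = block_diag(*[Xc[l] @ Xc[l].T for l in range(N)])
    # Generalized eigenvalue problem of Eq. (33); Dmat must be positive definite
    lam, Vbar = eigh(R - Dmat, Dmat)
    order = np.argsort(lam)[::-1]
    rho = lam[order] / (N - 1)                  # ISC of each component
    Vbar = Vbar[:, order]
    # Row blocks of Vbar hold the subject-specific projection vectors v_l
    return Vbar.reshape(N, D, N * D), rho
```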

3 Applications

In the following sections we demonstrate existing and new applications of CorrCA, while also examining issues related to statistical significance testing and regularization of the results. We also compare CorrCA with MCCA and kernel-CorrCA.

3.1 Extract brain activity that correlates between subjects

We have previously used CorrCA to analyze brain activity recorded from multiple subjects (Dmochowski et al., 2012, 2014; Ki et al., 2016; Cohen and Parra, 2016; Poulsen et al., 2017; Cohen et al., 2017; Petroni et al., 2018; Iotzov et al., 2017). Briefly, subjects in these experiments watched movies or listened to narrated stories while their neural responses were recorded with multiple sensors (electroencephalography (EEG) — electrical potentials measured at multiple locations on the scalp surface). CorrCA was then used to extract components that were most reliable across subjects. These experiments showed that brain signals are significantly correlated between subjects on a rapid time-scale of less than a second (signals were high-pass filtered at 0.5 Hz). While the inter-subject correlation (ISC) values are small, they are statistically significant and very reliable across sessions (Dmochowski et al., 2012), even in realistic environments (Poulsen et al., 2017). Interestingly, the level of correlation between subjects as measured by ISC is indicative of how attentive subjects are towards the stimulus (Ki et al., 2016) and, therefore, predictive of how well individuals remember the content of the videos (Cohen and Parra, 2016; see Appendix G.3 for a definition of ISC for individual subjects). ISC of the EEG also predicts the retention of TV and online audiences (Dmochowski et al., 2014; Cohen et al., 2017).

Here we reanalyzed the data of Cohen and Parra (2016) to compare CorrCA with MCCA (see Section 2.6) in terms of their ability to learn reliable projections of the data, while also using the opportunity to demonstrate forward model computation (Section 2.5) and statistical testing (Appendix D). The data set consisted of the EEG of 18 subjects, captured with multiple scalp electrodes across the time samples corresponding to the 197-second video stimulus. We first compared CorrCA with MCCA in terms of the number of reliable dimensions extracted by each, computed here with a non-parametric test using circular shuffle statistics (see Appendix D). The ISC values found on the entire data set are shown for the first 5 components in Figure 2 A, where blue bars correspond to CorrCA and green to MCCA. Darker shading indicates that the ISC of the component was found to be statistically significant: 3 significant dimensions were found by CorrCA, while 2 were found by MCCA, despite the (training set) ISC values being much higher for the latter. Indeed, when evaluating ISC on unseen data using 5-fold cross-validation, CorrCA yielded higher ISC values than MCCA for 4 of the first 5 dimensions (Figure 2 B). Note that MCCA fits a model for each repetition (i.e., subject) of the data, meaning that the number of free parameters was 18 times that of CorrCA. Combined with the low signal-to-noise ratio of the EEG, this led MCCA to overfit the data (compare the ISCs in panel A with those in panel B). Both CorrCA and MCCA learned spatial filters that were applied to the multi-electrode EEG data to maximize the ISC. The activity captured by each spatial filter (component) is best depicted by “forward models”, which are illustrated row-wise for the first three correlated components in Figure 2 C. Each component is marked by a smooth topography consistent with activation of the cerebral cortex. Moreover, each forward model has a different spatial profile, suggesting that the correlated components extract distinct neural activity. Whereas the forward models stemming from CorrCA are smooth, those produced by MCCA are not: Figure 2 D shows the subject-specific forward models for the first three components of 5 of the 18 subjects (i.e., subjects 1, 5, 9, 13, and 17). These topographies are unlikely to be generated by a genuine neural source, and instead further support the assertion that MCCA over-fit the data. CorrCA can be seen as a form of regularization that addresses this over-fitting, but other forms of regularization could be envisioned (de Cheveigné et al., 2018).

Figure 2: Correlated components of the EEG evoked by natural video stimuli. (A) The ISC achieved by both CorrCA (blue bars) and MCCA (green bars) on EEG data from 18 subjects viewing a 197-second video clip (“Sunday at Rocco’s”, 2007, Storycorp). Darker shades indicate that the learned components were found to be statistically significant using circular shuffle statistics (see Appendix D): 3 significant components were found with CorrCA, while 2 were identified with MCCA. Light shades indicate non-significant ISC. (B) The ISC as measured by 5-fold cross-validation on the same data set. On unseen data, CorrCA produces higher ISCs for 4 of the top 5 components, suggesting that MCCA overfit the training data. (C) The spatial distributions of the first three correlated components, displayed row-wise as “forward models” (see Section 2.5). The three components exhibit smooth topographies with unique spatial profiles, suggesting that each component recovers genuine and distinct neural activity. (D) The forward models of the subject-specific components learned by MCCA. Each column displays the forward models of the components from an individual subject (i.e., subjects 1, 5, 9, 13, and 17). The components are markedly less smooth, consistent with over-fitting, as MCCA learns separate parameters for each of the 18 subjects. For details on this data set, please refer to Cohen and Parra (2016).

3.2 Maximizing inter-rater reliability

The theory laid out in Section 2.4 shows that one can use CorrCA to identify components with maximal inter-rater reliability. In the following example, clinicians assessed the motor skill of children by measuring performance on various standard tasks (Zurich Neuromotor Assessment). The rating obtained for each task is the time it takes the child to perform the task, normalized relative to the values of a standard group of children of the same age (Rousson et al., 2008). Although this is an objective measure, there is nonetheless variability due to the variable performance of the child when repeating the task, and also a subjective variability in the observer operating the stop-watch. In this specific data set, we have ratings from 30 children on 12 tasks from 2 raters. The 12 measures are mostly positively correlated across the children (Figure 3 A; see panel G for descriptive task labels). Reliability between raters is measured here as the inter-rater correlation (IRC), which differs across tasks (Figure 3 B).

Figure 3: Ratings from the Zurich Neuromotor Assessment for 30 children, 12 tasks that were rated, and 2 clinical raters. The ratings are continuous numbers normed by age. The goal of CorrCA is to identify aggregate scores that have high inter-rater correlation (IRC) and test-retest correlation. This is equivalent to maximizing inter-rater reliability and test-retest reliability. (A, B, C) Original rating data. (D, E, F) Components extracted with CorrCA. Dark-shaded bars indicate statistically significant IRC (in panel E using circular shuffle statistics for training data; in panels B, C, F using F-statistics with Bonferroni correction; see Appendix D). (G) Projection matrix for the first 5 significant components. Data were provided by Valentin Rousson (Rousson et al., 2008). See text for details on each panel.

The goal of CorrCA now is to find a (linear) combination of these 12 task ratings that provides a reliable overall aggregate score (one that is similar for the two independent raters). The resulting projection vectors are shown in Figure 3 G. Each column corresponds to a different component and has been normalized to have unit norm. Components are sorted in decreasing order of IRC (Figure 3 E) and therefore in decreasing order of inter-rater reliability (see Section 2.4, Eq. 25). Of these components, the first five reach statistical significance at an α-level of 0.05 (highlighted in Figure 3 E). Inspecting the “backward model”, it is clear that the first component relies heavily on a single task (task 2, which has the highest IRC) but has small positive and negative contributions from other tasks. Components 2 and 3 are mostly a contrast between tasks 3 and 8, and tasks 2 and 4, respectively. The detailed results can be obtained by running the code accompanying this article. In total, the method extracts several different aggregate scores of high reliability, relying in some instances on a small subset of tasks. This has the potential of simplifying the assessment of motor performance while increasing its reliability. An important aspect of these various components is that, by design, they are uncorrelated with one another (uncorrelated across the 30 children; Figure 3 D), whereas the original task ratings all measure similar aspects (they are predominantly positively correlated; Figure 3 A).

In this data set, children were also evaluated a second time by the same rater in order to assess test-retest reliability (Figure 3 C). Comparing panels B and C, it is evident that the test-retest variability is somewhat higher than the inter-rater variability (i.e., the evaluation does indeed depend to some degree on the subjective judgment of the rater). The components extracted with the highest IRC also exhibit the strongest test-retest correlations (at least for the first few components). This confirms that these aggregate measures generalize to data that was not used in the CorrCA optimization, and that minimizing subjective variability of the observer also minimizes test-retest variability.

3.3 Identifying agreement between child and parent

We would like to provide an additional example of CorrCA on subjective ratings data. In this example, the goal is to identify questions on which parent and child agree. Specifically, we analyzed data collected with the Alabama Parenting Questionnaire (APQ) as part of the Healthy Brain Network initiative (Alexander et al., 2017). In the APQ, parents and children answer questions pertaining to their relationship by providing a numerical value from 1 to 5 for each question. From this, the APQ averages various questions to provide aggregate scores related to different aspects: “involvement score”, “positive parenting score”, “poor supervision score”, “inconsistent discipline score”, and “corporal punishment score”. At the time of this analysis, complete data were available for a cohort of children/parent pairs (ages 6-17, 40% female), i.e. $N = 2$ raters per family. The first observation on these data is that the answers from children and parents are only weakly correlated (Figure 4 A). The aggregate scores fare even worse: a two-sample t-test suggests that these aggregate scores are less reliable than the individual questions, including several scores with negative correlations. Evidently, children do not much agree with their parents when it comes to parenting issues, despite answering identical questions on the topic.

We applied CorrCA to find a combination of answers for which parents and children agree. To this end, we divided families at random into a training and a test set of equal size and computed the projections that maximized IRC on the training set. We then evaluated the resulting projections in terms of the correlation between child and parent for both the training and the test set (Figure 4 B & C). By design, IRC in the training set is large for the leading components. On the test set, the first component also has a relatively large IRC, whereas the other components do not differ substantially in IRC from the original set of questions (IRC hovers around 0.2 for both the original ratings and the aggregate component scores). The lower IRC values on the test set as compared to the training set indicate the presence of over-fitting, although the discrepancy has been reduced here using shrinkage regularization (see Appendix C). To determine how many significantly correlated dimensions are actually present in these data, we used circular shuffle statistics (see Appendix D) on all the data and found several significant components. On the test data alone the F-statistic also found significant components (Figure 4 B & C, dark shaded).

Figure 4: Ratings from the Alabama Parenting Questionnaire, with 42 questions rated by a child and one of its parents in each family. Here, the goal of CorrCA is to maximize IRC, i.e., to identify agreement between children and parents. Data were collected by the Healthy Brain Network initiative (Alexander et al., 2017). Data were divided evenly and at random into a training and a test set. Second and third columns show results for training and test sets, respectively. Red error bars indicate the standard deviation across 100 random partitions of the data into training and test sets, and sampling with replacement within that (bootstrap estimates). Dark-shaded bars indicate significant IRC (F-statistic, Bonferroni corrected, for panels A and C; circular shuffle statistics for panels B and E). (A) IRC between parent and child on all original questions. Dark-shaded bars indicate significant IRC (tested on Pearson’s correlations). (B) IRC for the components extracted with CorrCA on the training set. Dark-shaded bars indicate significant IRC (circular shuffle statistics). (C) IRC on the test set. Only the first component achieves strong IRC. In this case, we used shrinkage regularization (see Appendix C). Dark-shaded bars indicate significant IRC (F-statistic, Bonferroni corrected). (D) Correlation between questions across all families. (E) IRC for components extracted with MCCA (Section 2.6) and kernel-CorrCA (Appendix F). Dark-shaded bars indicate significant IRC (circular shuffle statistics). (F) Comparison of IRC on the test set for the 4 strongest components extracted with CorrCA, MCCA and kernel-CorrCA. (G) Component weights for the strongest component extracted with CorrCA.

The component with the highest IRC on the test set is a combination of questions that yields a reasonably reliable aggregate score (Figure 4 G). The following are the four questions that contribute most strongly to this aggregate score of component 1 (weights in parentheses): Question 32: “You are at home without an adult being with you” (-0.40); Question 21: “You go out after dark without an adult with you” (-0.37); Question 11: “Your parent helps with your homework” (+0.37); Question 6: “You fail to leave a note or let your parents know where you are going” (-0.33). Positive weights indicate that the answers should have large scores to increase the value on this aggregate measure (and small scores for negative weights). Therefore, this measure seems to capture parental involvement versus independence of the child. In total, while parents and children largely disagreed, there is moderate agreement on how independent their lives are. Given this result, we suspected that this metric increases with the child’s age. Indeed, there is a strong correlation of this component with the child’s age across the cohort, for both the parent and the child scores, with no significant effect of the child’s sex.

Given that parents and children may genuinely differ in their judgments, it would make sense to allow different weightings of the questions for the two. Therefore, we also applied MCCA to these data as described in Section 2.6. We found fewer significant components with MCCA than with CorrCA (Figure 4 E, dark shaded). ISC summed over the first four components was numerically smaller for MCCA than for CorrCA (0.69 vs 1.10, values obtained on the test data without regularization; Figure 4 F). Therefore, it appears that on these data the sample size was not sufficiently large to support a factor of two increase in the number of free parameters (from 42 for CorrCA to 84 for MCCA).

We have also tested the non-linear kernel-CorrCA method (see Appendix F) on these rating data (Figure 4 E). There was a small gain in terms of inter-rater correlation in the test data (Figure 4 F). However, the resulting component projections of the strongest components were not substantially different from the linear projections, suggesting that there were no strong non-linear relationships in the data. We find a similar result with kernel-CorrCA for the data presented in the next Section 3.4 (results not shown).

3.4 Identifying stimulus-related components in EEG with high SNR

PCA is often used to extract a low-dimensional representation of high-dimensional data. To this end, PCA selects orthogonal projections that capture most of the variance of the projected samples. When multiple repetitions of the data are available, CorrCA can be used for the same purpose of reducing the dimensionality of the data, but with the objective of capturing dimensions that are reliably reproduced across repetitions. Here we apply this to a conventional scenario in neuroimaging: neural activity is collected over time with multiple sensors, while the same experimental stimulus is repeated in multiple trials. We propose to use CorrCA to identify spatio-temporal patterns of activity that are well preserved across trials, similar to what was proposed in (Tanaka et al., 2013; de Cheveigné and Parra, 2014; Dmochowski et al., 2015; Dmochowski and Norcia, 2015). Ideally, the corresponding time courses of the components are uncorrelated in time so that they capture different activity. CorrCA guarantees this by design as it diagonalizes $\mathbf{R}_w$. CorrCA is therefore similar to other source separation methods such as Independent Components Analysis (Makeig et al., 1996), Denoising Source Separation (Särelä and Valpola, 2005), or Joint Decorrelation (de Cheveigné and Parra, 2014). In fact, CorrCA can be viewed as a form of blind source separation, as no labels are required for the analysis.

Here we analyze EEG data that were collected during an Eriksen Flanker Task (Parra et al., 2002). Subjects were presented with a collection of arrows pointing left or right. They were asked to decide as quickly as possible between two alternative choices (the direction of the center arrow) by pushing corresponding buttons with either their left or right hand. We analyze data from a single participant who responded correctly in 550 trials and made an error in 46 trials (Figure 5). This error, which participants notice as soon as they make their incorrect response, leads to a negative EEG potential 0-100 ms after the button press (this period is indicated between the black and red vertical lines in panels B & C for electrode #7; see electrode location in panel H). This phenomenon is known as the “error-related negativity” (ERN, Yeung et al., 2004). Button pushes also elicit the “readiness potential” — a gradually increasing potential preceding the button push. Aside from the readiness potential and the error-related negativity there is a series of other potential deflections following the button push. In this example, we have $T$ time samples, $D$ electrodes, and $N$ trials. Time is aligned across trials so that a specific event always occurs at the same time (in this case, the button press at time $t = 0$ ms). The traditional approach to analyzing such data is to average the electric potentials over trials, yielding what are called event-related potentials (ERPs, panel B). The reliability of these trial-averages is relatively low (panels B & D show mean and standard error, and panels C & E show single-trial data). We selected electrodes #7 and #22 for display as these capture distinct time courses and locations identified by prior literature. The SNR (as defined in Section 2.4) is maximal for a cluster of frontal electrodes (#12, #17, #44, #43); the SNR values for all electrodes are shown in Figure 5 A. The SNR values are scaled so that they can be compared against the standard error of the mean, because, typically, the goal is to determine whether the mean value differs significantly from zero, and thus the standard error of the mean is the relevant quantity.

Figure 5: Event-related potentials. EEG recordings from a single participant collected during a speeded two-alternative forced-choice task (Eriksen Flanker Task). (A-E) Original EEG data, highlighting two channels: #7 (Fz electrode, SNR=0.97), where the “error-related negativity” is expected to be most pronounced (between 0 and 100 ms; black and red vertical lines), and #22 (PCz electrode, SNR=1.3), where a “readiness potential” is typically observed. (F-L) First and second correlated components with the strongest inter-trial correlation. (A & F) SNR computed from the inter-trial correlation (Eq. 25) for the original data and the extracted components, respectively. Dark-shaded bars indicate significant components (circular shuffle statistics on the training set; see Appendix D). (B, D, G, J) Average across repeated trials. Blue-shaded areas indicate the standard error of the mean. Evidently the extracted components have a smaller SEM relative to the mean values. The SNR measures are scaled to parallel the SEM shown here; an SNR of 2, for instance, indicates that the mean is 2 SEM away from the zero baseline. (C, E, I, L) EEG resolved over all trials to visualize the inter-trial variability. (H & K) Forward models of the two strongest components, with distributions consistent with the ERN and the readiness potential, respectively. Data from (Parra et al., 2002).

When we apply CorrCA to these data, we obtain a series of components with high inter-trial correlation (Figure 5 F). CorrCA components were extracted using shrinkage regularization (see Appendix C) on a training set consisting of all but one trial (leaving 1 trial out for testing). Performance was evaluated on the leave-one-out test set. The first three components (highlighted in dark blue) are significant according to a non-parametric test based on circularly shifted surrogate data (see Appendix D). The first two of these components have the highest SNR of all components (panel F). Evidently, component 1 is very reliable (panel G shows that the mean is several SEM away from the baseline). The deflections are now obvious even in single trials, without the need to average (panel I). The spatial distribution of this component’s activity (panel H) is indicative of the well-known ERN. Component 2 (panel J) is indicative of the readiness potential (Deecke et al., 1969). It rises leading up to the button push at time $t = 0$ ms, and has a spatial distribution over parietal and motor areas (panel K).

Note that CorrCA has reduced the $D$-dimensional data to just a few components that capture the reliable portion of the data (high SNR; panel F). By compactly representing neural activity in a small number of components, it becomes easier to probe for changes in activity due to the manipulation of experimental variables, whose effects on the original $D$-dimensional data may be obscured due to the lower SNR and the multiple comparison problem.

3.5 Effect of regularization on the reliability of EEG responses

A common numerical difficulty with generalized eigenvalue problems is the required matrix inverse. In the case of CorrCA, it is necessary to invert $\mathbf{R}_w$ in equation (8). When some of its eigenvalues are small (or even zero), then random fluctuations due to noise will dominate the inverse. It is thus important to regularize the estimate of $\mathbf{R}_w$. In the past, we have used the Truncated Singular Value Decomposition (TSVD) (Hansen, 1994) and shrinkage (Ledoit and Wolf, 2004) for this purpose (Dmochowski et al., 2012, 2014; Ki et al., 2016; Cohen and Parra, 2016). These methods are described in detail in Appendix C.
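As a concrete example, shrinkage blends the estimated within-subject covariance with a scaled identity matrix before it is inverted. The sketch below is a minimal illustration (the symbol gamma for the shrinkage parameter and the function name are our own conventions); TSVD instead retains only the leading eigenvectors of $\mathbf{R}_w$ when forming the inverse.

```python
import numpy as np

def shrink(Rw, gamma):
    """Shrinkage regularization of the within-subject covariance (Appendix C).
    gamma = 0 leaves Rw untouched; gamma = 1 replaces it by a scaled identity.
    The regularized matrix is then used in place of Rw in Eq. (10)."""
    D = Rw.shape[0]
    return (1 - gamma) * Rw + gamma * (np.trace(Rw) / D) * np.eye(D)
```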

To evaluate the effects of these regularization methods on the results of CorrCA we use data from our study on neural responses to advertisements (Dmochowski et al., 2014); we select one commercial among several aired during the 2012 Super Bowl. For both the TSVD and the shrinkage approach, we spanned the space of regularization strength (a single parameter in each case: the number of retained dimensions $K$ for TSVD, and the shrinkage parameter $\gamma$) and measured ISC. The training set was the EEG data from 12 subjects during the first viewing of the ad, while the EEG data collected during the second viewing served as the test set.

The sum of the first 3 ISC values for both regularization methods followed a similar trajectory with increasing regularization strength (decreasing $K$ and increasing $\gamma$; Figure 6 B & C). On the training data, ISC drops monotonically, whereas on the test data a maximum is achieved at intermediate regularization values. On these data, shrinkage regularization was most effective at an intermediate value of $\gamma$, and TSVD when retaining only a subset of the dimensions of the within-subject covariance $\mathbf{R}_w$. The forward models of the first three correlated components for each regularization technique are shown in Figure 6 A. Notice that regularization tends to smooth out the topographies of the second and third components compared to no regularization.

Figure 6: Regularization increases the reliability of EEG test data. The inter-subject correlations of EEG responses from 12 subjects viewing a popular television advertisement (Dmochowski et al., 2014). We took the data of the highest-scoring advertisement and maximized ISC while varying the level of TSVD and shrinkage regularization. (A) The spatial distributions of the three most correlated components. Note that the distributions of the second and third components were distinct for unregularized, TSVD-regularized and shrinkage-regularized CorrCA. (B) Performance with TSVD as a function of the number of retained dimensions $K$ (decreasing $K$ corresponds to stronger regularization), and (C) performance with shrinkage as a function of $\gamma$ (increasing $\gamma$ corresponds to stronger regularization). Components were extracted on the data from a first viewing of the advertisement (training) and evaluated on data from a second viewing (test). Reliability on the test data peaks at an intermediate value of $\gamma$ for shrinkage and at an intermediate number of retained dimensions for TSVD, with very similar peak values (sum of the first three inter-subject correlations) for the two regularization approaches.

4 Validation on simulated data

We evaluated the conditions under which CorrCA identifies the true source directions. We also determined the performance of CorrCA at identifying components with high ISC, and compared this against a simple benchmark based on PCA of the subject-averaged data. Additionally, we compared the accuracy of the statistical tests described in Appendix D at estimating the correct number of components. To this end, we generated artificial data with known properties. The data are generated assuming a known number of correlated components (signals) and additive spatially correlated noise. The following factors were varied parametrically: the number of subjects (raters, repeats), $N$, the number of samples per subject, $T$, the number of dimensions, $D$, the number of correlated components, $K$, and the signal-to-noise ratio (SNR) as defined in the space of multivariate measurements. Each simulated correlated component was identically reproduced in each subject, so that the ISC of all components was equal to one. We further varied the temporal dependency structure of the signal and noise components (IID or pink noise), the distribution of the signal and noise components (Gaussian or non-Gaussian), and the heterogeneity of the signal and noise subspaces (either the same or different for all subjects). Signal components were generated to be correlated across subjects, but uncorrelated with each other. Noise components were uncorrelated with each other as well as across subjects. The number of noise components was held fixed. The following default parameters were used unless otherwise noted: $K = 10$ correlated components, SNR = 0 dB, no regularization, IID Gaussian signal and noise components, and ISC = 1 for all components. Details on data generation and evaluation methods can be found in Appendix E.
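For orientation, a strongly simplified version of such a generative process might look as follows; this is our own sketch, not the protocol of Appendix E, and the parameter names, default values, and the exact SNR calibration are illustrative assumptions.

```python
import numpy as np

def simulate(N=10, D=30, T=1000, K=10, snr_db=0.0, seed=0):
    """Generate synthetic data with K signal components that are identical
    across repetitions, plus spatially mixed noise that differs per repetition.
    Simplified sketch; the paper's SNR calibration may differ."""
    rng = np.random.default_rng(seed)
    A_sig = rng.standard_normal((D, K))          # shared signal mixing matrix
    A_noise = rng.standard_normal((D, D))        # noise mixing matrix
    S = rng.standard_normal((K, T))              # signal components, ISC = 1
    gain = 10 ** (snr_db / 20)                   # amplitude scaling of the signal
    X = np.empty((N, D, T))
    for l in range(N):
        noise = A_noise @ rng.standard_normal((D, T))
        X[l] = gain * (A_sig @ S) + noise        # identical signal in every repeat
    return X
```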

4.1 Accuracy as a function of Signal-to-Noise Ratio

Figure 7 depicts CorrCA performance as a function of measurement SNR for fixed . SNR values range from -40 dB to 40 dB. It is apparent that average training and test ISC reach the optimal value of 1 for very high SNR (panel A). For very low SNR ranges, training ISC significantly exceeds test ISC, indicating over-fitting. The identification of the correlated subspace converges to the optimal value for high SNR (panel B). However, individual components are not perfectly identifiable even for high SNR because they share the same ISC values (see Appendix A.3 for a discussion). The identifiability of individual components becomes possible if the ISC of the simulated signal components are adjusted to linearly decrease from 1 down to 1/, and if the measurement SNR is high (data not shown). In practice, one should keep in mind that the uniqueness of the ISC (and thereby, their individual identifiability) depends on the amount of noise that is collinear to each original correlated dimension. In the IID case, all estimates of converge to the true number of 10 for high SNR (panel C). Here, the parametric test and the surrogate data approach using random circular shifts perform similarly, while the surrogate data approach using phase scrambling performs considerably worse at low SNR, presumably due to the absence of a well-defined notion of frequency and phase in the sense of the Fourier transform. In the case of dependent (pink noise) samples, both surrogate data approaches perform identically, converging to the true value of for high SNR (panel D). In contrast, the parametric test, which incorrectly assumes IID data, overestimates the number of correlated components across the entire SNR range, as expected.
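One way the circular-shift surrogate test could be implemented is sketched below. The function names are ours, not the released toolbox API; per-subject means are removed, ISC values are taken to be the generalized eigenvalues of the between- and within-subject covariances, and the within-subject covariance is assumed full rank (otherwise regularize as discussed above).

```python
# Circular-shift surrogate test for the number of significant components.
import numpy as np
from scipy.linalg import eigh

def corrca_isc(X):
    """Sorted ISC values for data X of shape (D, T, N)."""
    D, T, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    Rk = np.einsum('dtn,etn->nde', Xc, Xc) / T        # per-subject covariances
    Rw = Rk.sum(axis=0)                               # within-subject covariance (up to scaling)
    Xsum = Xc.sum(axis=2)
    Rt = Xsum @ Xsum.T / T                            # covariance of the subject-summed data
    Rb = (Rt - Rw) / (N - 1)                          # between-subject covariance, no pairwise sums
    return np.sort(eigh(Rb, Rw, eigvals_only=True))[::-1]

def n_significant_components(X, n_surrogates=200, alpha=0.05, rng=None):
    rng = np.random.default_rng(rng)
    D, T, N = X.shape
    isc = corrca_isc(X)
    null_max = np.empty(n_surrogates)
    for s in range(n_surrogates):
        # independent circular shift per subject destroys between-subject correlation
        Xs = np.stack([np.roll(X[:, :, n], rng.integers(T), axis=1)
                       for n in range(N)], axis=2)
        null_max[s] = corrca_isc(Xs)[0]               # largest ISC under the null
    threshold = np.quantile(null_max, 1 - alpha)
    return int(np.sum(isc > threshold)), threshold
```

Using the maximum ISC across surrogates as the null statistic is one way to control for the multiple components being tested.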

Note that the significance test is applied here to multiple components (). Under appropriate conditions, all three methods converge to the correct number of dimensions in these simulated data (). This indicates that all three methods (described in Appendix D) properly account for these multiple comparisons.

Figure 7: CorrCA performance as a function of signal-to-noise ratio (SNR). Other parameters were fixed to , , , . Plotted are averages across 100 experimental repetitions, while corresponding standard deviations are shown as transparent areas. (A) The average ISC of the reconstructed components converges to the maximal value of 1 for very high SNR. For very low SNR, the expected value of 0 is exceeded on training data, indicating overfitting. CorrCA outperforms a simple PCA on the subject-averaged data. (B) CorrCA finds the correct subspace of correlated components if the SNR is sufficiently high. Individual components may, however, not be identifiable if their ISC are too similar. (C) In case of IID data, all statistical tests to estimate converge to the true number of 10 for high SNR. (D) In case of dependent (pink noise) samples, only the surrogate data approaches perform well, while the parametric test overestimates .

4.2 Accuracy of estimating dimensionality of correlated components

The performance of CorrCA as a function of the true number of correlated components, , is depicted in Figure 8. Here, , , , SNR = 0 dB, where was varied from 3 to 28. As the overlap between signal and noise subspaces increases with , the average ISC of the reconstructed correlated components drops (panel A). A similar decline is seen in the reconstruction of simulated correlated components and their forward models (data not shown). As a result, the percentage of correlated components that can be recovered (i.e., reach statistical significance) also decreases with increasing (panels B & C).

We also compared the ISC values obtained with CorrCA against a simple benchmark, namely, PCA of the subject-averaged data. While this simple method captures the correct directions in the high-SNR case, CorrCA significantly outperforms it at medium and low SNR values (Figure 7 A). The same is true for simulations with a varying number of underlying dimensions (Figure 8 A), where CorrCA outperforms this simple method in all instances tested.
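A sketch of this baseline, assuming data X of shape (D, T, N); the helper names are ours, and the reliability of the fixed projection is evaluated here simply as the average pairwise Pearson correlation across subjects.

```python
# PCA-on-the-average baseline: take the leading principal component of the
# subject-averaged data as a fixed projection and measure its between-subject
# correlation.
import numpy as np
from itertools import combinations

def pca_of_average_projection(X):
    """Leading principal component of the across-subject average (sensor-space direction)."""
    Xbar = X.mean(axis=2)
    Xbar = Xbar - Xbar.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Xbar.T, full_matrices=False)   # rows of Vt are PC directions
    return Vt[0]

def isc_of_projection(X, w):
    """Average pairwise Pearson correlation of the projected data y_n = w^T X_n."""
    _, _, N = X.shape
    Y = np.einsum('d,dtn->tn', w, X)
    r = [np.corrcoef(Y[:, k], Y[:, l])[0, 1] for k, l in combinations(range(N), 2)]
    return np.mean(r)
```

On simulated data of the kind described above, the reliability of this fixed projection can then be compared against the leading ISC obtained with CorrCA.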

Figure 8: CorrCA performance as a function of the true number of correlated components, K. Other parameters were fixed to , , , SNR = 0 dB. (A) The average ISC of the reconstructed correlated components drops with increasing . (B) Consequently, the number of correlated components is estimated less accurately as increases. For IID data, this number is bounded from above by the true number for all statistical approaches. (C) Overestimation of may occur for pink noise data when the parametric F test is used.

Figure 9 shows the estimated number of correlated components as a function of the number of data points, , and the number of repetitions (subjects/raters), , with fixed , , SNR = 0 dB. In case of IID data, all estimates converge to the true value of 10 for increasing or . This is also true for pink noise data in combination with non-parametric statistical tests. For the parametric test, however, the number of seemingly correlated components diverges as increases, and is overestimated regardless of .

Figure 9: Estimated number of correlated components as a function of the number of data points, T, (left panels), and the number of repetitions (subjects/raters), N, (right panels). Other parameters were fixed to , , SNR = 0 dB. (A, B) In case of IID data, all estimates converge to the true value of 10 for increasing or . (C, D) For pink noise data, only non-parametric statistical tests converge to the true value.

4.3 Dependence on signal and noise distributions

Figure 10 A & B shows the performance of CorrCA for different signal and noise distributions. Here, , , , , SNR = 0 dB. Signal and noise components were generated either as Gaussian or squared Gaussian (-distributed). For Gaussian-distributed components, we considered an additional case, where observations were dichotomized (thus, Bernoulli distributed) by taking their sign. This simulation scenario resembles the typical setting in which two raters provide a binary assessment of the same issue. While non-Gaussianity of the underlying components does not substantially affect the extraction of correlated components, extraction is considerably more difficult in the presence of dichotomous observations. An additional analysis was conducted for IID Gaussian data and to assess the performance of CorrCA when signal and noise covariance structures (as defined by the mixing matrices and , ) differ across subjects (Figure 10 C & D). Significant drops in the number of estimated correlated components as well as the average ISC are observed if the signal covariance, the noise covariance, or both, differ across subjects. This result illustrates that identical covariance structures of signal and noise components across subjects are a fundamental assumption of CorrCA. If this assumption is suspected to be violated, the use of MCCA (Section 2.6) may be beneficial, as may approaches that gradually move from CorrCA to MCCA (Kamronn et al., 2015).

Figure 10: CorrCA performance as a function of signal and noise distributions. Other parameters were fixed to , , , SNR = 0 dB. (A, B) Performance for Gaussian and squared Gaussian (-distributed) signal and noise components, as well as for dichotomized (Bernoulli-distributed) observations. Data points were either drawn IID or sampled from a pink noise process. While non-Gaussianity of the underlying components does not substantially affect the extraction of correlated components, extraction is considerably more difficult in the presence of dichotomous observations. (C, D) Performance on IID Gaussian data when the signal and noise covariance structure differs across repetitions (e.g., subjects, raters). Significant drops in the number of estimated correlated components as well as the average ISC are observed if the signal covariance, the noise covariance, or both, differ across subjects.

5 Conclusion

The goal of this work was to provide a formal, yet didactic and comprehensive treatment of this new component analysis technique. The analytic development resulted in a direct link to MCCA and a surprising equivalence to LDA, thus linking the new method with classic concepts of multivariate analysis. We also found an efficient scheme for computing ISC/IRC which, surprisingly, does not require computation of pairwise correlations. We identified the F-statistic as an exact parametric test of statistical significance, valid for independent, identically distributed normal samples. For the case of dependent samples we demonstrated and validated circular-shift shuffle statistics as an efficient non-parametric test of significance. We also presented an extension to non-linear mappings using kernels. Finally, perhaps the most important contribution of this work is the formalization of inter-subject correlation. Conventional metrics, such as pairwise Pearson’s correlation, are limited to two signals. Inter-subject correlation as defined here is applicable to more than two signals. ISC differs from conventional definitions of correlation (see the appendix on intra-class correlation), yet integrates seamlessly with the well-established mathematics of linear discriminants, canonical correlation, and the F-statistic.

We make all code available at http://parralab.org/corrca . This code is significantly faster than previous instantiations and has been extensively evaluated. All figures presented in this paper can be reproduced with this new code base and associated data. We hope that this analysis and code inspire new uses for correlated component analysis.

We are grateful to Valentin Rousson, who pointed us to work on inter-class correlation, and to Oskar Jenni of the Children University Hospital in Zurich, Switzerland, who collected the data published in (Rousson et al., 2008), which we reanalyzed here in Figure 3. Similarly, we would like to thank Mike Milhalm for pointing us to the problem of identifying parent/child agreement and Lindsay Alexander for providing access to the Healthy Brain Network data (Alexander et al., 2017) for Figure 4. We also thank Samantha Cohen for providing the data from (Cohen and Parra, 2016) for Figure 2. We thank Samantha Cohen, Jens Madsen, Michael X Cohen, and two anonymous reviewers for useful comments on an earlier version of this manuscript. We want to thank Flaviano Morone for assisting with the proof in Appendix A.4.

Appendix A Generalized eigenvectors for two symmetric matrices

a.1 Joint diagonalization

Here we will review the well-known relationship between the joint diagonalization of two symmetric matrices and and the eigenvectors of . Consider eigenvectors arranged as columns in with the corresponding eigenvalues in the diagonal matrix :

(36)

To find the solution for this equation, we can replace the positive definite matrix with its Cholesky factorization

(37)

Insert this into (36) and left-multiply with to obtain:

(38)
(39)

with and . Equation (39) is a conventional eigenvalue equation with the same eigenvalues as (36). It is now easy to see that the corresponding eigenvectors diagonalize matrix

(40)

They also diagonalize , which we see by left-multiplying (36) with and right-multiplying with . This yields:

(41)

Thus, the eigenvectors that solve (36) jointly diagonalize and . The reverse is also true. A matrix that satisfies:

(42)
(43)

where and are diagonal matrices, also satisfies (36). To see this, solve (43) for and substitute this into (42), which yields:

(44)

and therefore

(45)

which is the same as the eigenvalue equation (36) with

(46)
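A small numerical illustration of this relationship, with generic matrices standing in for the two symmetric matrices of (36) (the names A and B are ours; B is taken to be positive definite so that the Cholesky factorization exists):

```python
# Joint diagonalization of two symmetric matrices via the Cholesky factor of B.
import numpy as np

rng = np.random.default_rng(0)
D = 5
A = rng.standard_normal((D, D)); A = A + A.T                 # symmetric
B = rng.standard_normal((D, D)); B = B @ B.T + D * np.eye(D) # symmetric positive definite

L = np.linalg.cholesky(B)                             # B = L L^T, cf. (37)
M = np.linalg.solve(L, np.linalg.solve(L, A).T).T     # M = L^{-1} A L^{-T}, cf. (39)
lam, U = np.linalg.eigh(M)                            # ordinary symmetric eigenproblem
W = np.linalg.solve(L.T, U)                           # back-transform the eigenvectors

# W solves the generalized problem (36) and jointly diagonalizes A and B
assert np.allclose(A @ W, B @ W @ np.diag(lam))
assert np.allclose(W.T @ A @ W, np.diag(lam), atol=1e-8)
assert np.allclose(W.T @ B @ W, np.eye(D), atol=1e-8)  # identity, up to the chosen normalization
```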

a.2 Linear combinations

A matrix that simultaneously diagonalizes two matrices and , as in (42)-(43), will also diagonalize any linear combination of these two matrices (Fukunaga, 2013), because

(47)

which is obviously a diagonal matrix. Therefore, the eigenvectors of are also eigenvectors of , assuming vectors and are not collinear.

a.3 Identifiability

An additional important observation is that the eigenvectors of (36) are only uniquely defined for unique eigenvalues. If an eigenvalue is degenerate, say with multiplicity , then the corresponding eigenvectors span a -dimensional subspace. Any vector within that subspace is a solution to the eigenvalue equation with that eigenvalue. In this space any arbitrary rotation will yield a valid solution. Thus, when estimating components, it is possible that solutions with close-by ISC are “mixed”. Such components are not uniquely identifiable. One should also note that the sign of any eigenvector is arbitrary. If is a solution, then so is . One can always arbitrarily swap the sign of for any component , which also reverses the sign of and — the th column of the forward model . Thus, keep in mind that in all figures the signs of individual and are arbitrary. The same problem arises with most signal decomposition methods such as PCA, ICA or CCA. A common approach used in practice to “fix” the sign is to select the sensor/dimension with the strongest forward-model coefficient, and set that to a positive value (if indeed it is significantly larger than other coefficients of opposite sign).
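A sketch of this sign convention (the function name is ours; W holds the projection vectors and A_fwd the corresponding forward model, with one column per component):

```python
# Flip each component so that its largest-magnitude forward-model coefficient is positive.
import numpy as np

def fix_signs(W, A_fwd):
    K = A_fwd.shape[1]
    strongest = np.abs(A_fwd).argmax(axis=0)              # sensor with the strongest coefficient
    flip = np.sign(A_fwd[strongest, np.arange(K)])        # +1 or -1 per component
    return W * flip, A_fwd * flip
```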

a.4 Optimization of sum ISC

The optimization criterion (9) has the form

(48)

and is subject to the constraints that components are uncorrelated (relative to ):

(49)

It will be convenient to convert this optimality criterion into conventional Rayleigh quotients. To do so, use the Cholesky decomposition (37) as before and replace and to yield:

(50)

with the constraint now being a conventional orthogonality constraint:

(51)

One can now optimize for the matrix with the as its column vectors. The solution to this constrained optimization problem is well established (Horn and Johnson, 1990, Corollary 4.3.39). It is given by , the eigenvector matrix of equation (39). Inserting into (39) gives the solution to the original problem, which is given by the generalized eigenvalue equation (36).

To see that the eigenvector matrix of maximizes , first note that except for the constraint, the terms in the sum can be optimized independently, and that they all have the same form. One can therefore consider the maximum of a single term, written here simply as

(52)

which is to be optimized subject to the orthogonality constraint (51). From here on the argument proceeds conventionally. Let be the orthonormal eigenvectors of matrix with eigenvalues . We can express the solution vectors in this orthogonal basis as

(53)

where is the projection of onto the orthonormal basis of these eigenvectors, . Inserting (53) into (52) gives

(54)

This upper bound is attained for and all others equal to zero, and thus . For the second vector we require that it be orthogonal to the first, and thus we have to constrain . The sum in (54) now excludes the first dimension and therefore has an upper bound of , which is attained for . The argument continues iteratively until, for the last component, only remains, which is orthogonal to all other projections. It also happens to be the minimum of .
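The bound underlying this argument can be checked numerically: for a symmetric matrix, the trace over any k orthonormal directions never exceeds the sum of the k largest eigenvalues, and the top eigenvectors attain that bound. A minimal sketch with generic names:

```python
# Numerical check of the trace bound used in the argument above.
import numpy as np

rng = np.random.default_rng(1)
D, k = 8, 3
M = rng.standard_normal((D, D)); M = M + M.T     # symmetric matrix
lam, U = np.linalg.eigh(M)                       # ascending eigenvalues
best = lam[-k:].sum()                            # sum of the k largest eigenvalues

# random orthonormal k-frames never exceed the bound
for _ in range(1000):
    Q, _ = np.linalg.qr(rng.standard_normal((D, k)))
    assert np.trace(Q.T @ M @ Q) <= best + 1e-10

# the top-k eigenvectors attain the bound
V = U[:, -k:]
assert np.isclose(np.trace(V.T @ M @ V), best)
```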

a.5 Uncorrelated components vs orthogonal projections

We have required here that the extracted components are uncorrelated, i.e. the projection vectors diagonalize : . This means that the component dimensions of are uncorrelated, but only on average over all subjects/repeats . Note that for an individual subject, may not be uncorrelated. The same is true for MCCA as derived in Section 2.6. Instead of uncorrelated, we could have required that the projection vectors are orthogonal, , i.e. the linear transformation is limited to rotations. To find the orthogonal (rotations) that maximize ISC, one has to resort to constrained optimization algorithms (e.g. Cunningham and Ghahramani, 2015). A related issue arises in the context of Canonical Correlation Analysis, which attempts to find a linear mapping between two datasets such that the components are uncorrelated (Yohai and Ben, 1980; De la Torre, 2012). Alternatively, the Procrustes problem aims to find a mapping between two datasets with orthogonal projections (Schönemann, 1966). (The latter is what is used for “hyper-alignment” between subjects in fMRI by Haxby et al., 2011). CCA and Procrustes are related, and in fact one can smoothly titrate between orthogonal and uncorrelated solutions (Xu et al., 2012). In our case we favor uncorrelated components, because we want the components to capture independent information. At a minimum, this means the components have to be uncorrelated. In that sense, our method can be seen as a source separation method, similar to Independent Component Analysis (Makeig et al., 1996), Denoising Source Separation (Särelä and Valpola, 2005), or others (Parra and Sajda, 2003).

One advantage of restricting the transformations to rotations is that Euclidean distances in the new space are preserved. This means that differences found in the original sensor space are preserved in the new component space. Another advantage is that all original dimensions contribute equally to the new dimensions. In contrast, the non-orthogonal shearing operation of CorrCA or CCA may discount some dimensions and emphasize others. Both these advantages of orthonormal mappings, however, require that all new dimensions are considered. This is typically the case in the applications of hyper-alignment (Guntupalli et al., 2016). For the purposes of identifying dimensions that are reliably reproduced, however, we find in practice that many dimensions are not significantly correlated between repeats, and one will typically want to remove those. Similarly, one will want to dismiss sensors in the original data that are not reproducible across repeats while emphasizing others. Ultimately, the application will dictate whether one should enforce uncorrelated components or orthogonal projections. In either case, the goal can remain to maximize ISC, and the solution can be found either through an eigenvalue equation (as we presented here for uncorrelated components) or with constrained optimization (for orthogonal projections, along the lines of Cunningham and Ghahramani, 2015).
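For reference, a minimal sketch of the orthogonal Procrustes solution mentioned above (a generic SVD-based rotation mapping one dataset onto another; this is not the CorrCA objective, and maximizing ISC under an orthogonality constraint would instead require an optimizer over rotation matrices):

```python
# Orthogonal Procrustes: the rotation R minimizing ||Y - X R||_F.
import numpy as np

def procrustes_rotation(X, Y):
    """X, Y of shape (T, D); returns an orthogonal D x D matrix."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```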

Appendix B Relationship between Scatter matrices and between- and within-subject covariance matrices

Comparing Eq. (16) to (14) yields:

(55)

One can also relate to as follows. Expanding the definition (6) yields (with abbreviated notation for the sums):

(56)

Matrix , defined in (21), captures the variance of the mean across subjects and is zero when all subjects have equal mean: . For time domain signals, this is often the case as it is customary to subtract the mean value (constant offset) to make the signals zero-mean. For rating data, an unequal mean indicates rater bias, i.e. a rater gives systematically lower/higher ratings than other raters, and it may be advantageous to remove this bias by equalizing all mean values. Note that, when , the present definition of ISC is equivalent to the classic definition of intra-class correlation (ICC, see Appendix B). But in general, this matrix is not zero, which is important when generalizing the approach to non-linear transformations. There, enforcing zero mean in does not guarantee zero mean in . With this result, scatter and covariance matrices can be related as follows:

(57)
(58)
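The pairwise-free computation of the between-subject covariance referred to in the Conclusion rests on a simple algebraic identity: the sum of cross-covariances over all distinct pairs of subjects equals the covariance of the subject-summed data minus the per-subject terms. A small numerical check, with our own centering and normalization conventions:

```python
# Verify that pairwise cross-covariances can be computed without looping over pairs.
import numpy as np

rng = np.random.default_rng(2)
D, T, N = 4, 500, 6
X = rng.standard_normal((D, T, N))
Xc = X - X.mean(axis=1, keepdims=True)           # remove each subject's mean

pairwise = sum(Xc[:, :, k] @ Xc[:, :, l].T
               for k in range(N) for l in range(N) if k != l)
Xsum = Xc.sum(axis=2)
shortcut = Xsum @ Xsum.T - sum(Xc[:, :, k] @ Xc[:, :, k].T for k in range(N))

assert np.allclose(pairwise, shortcut)
```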

Appendix C Regularization and robustness

Note that a high inter-subject correlation corresponds to large eigenvalues of . These eigenvalues will be large if the selected direction exhibits large power in the space of , or low power in the space of . The former is meaningful and desired, but the latter can occur with rank-deficient or impoverished data and may lead to spuriously high inter-subject correlation. Truncating the eigenvalue spectrum of has the desired effect of forcing the extracted data dimensions to have both high power in and . This makes it less likely to find spuriously reliable dimensions. Formally, the TSVD approach is to select