## Abstract

Representational similarity analysis (RSA) tests models of brain computation by investigating how neural activity patterns change in response to different experimental conditions. Instead of predicting activity patterns directly, the models predict the geometry of the representation, i.e. to what extent experimental conditions are associated with similar or dissimilar activity patterns. RSA therefore first quantifies the representational geometry by calculating a dissimilarity measure for all pairs of conditions, and then compares the estimated representational dissimilarities to those predicted by the model. Here we address two central challenges of RSA: First, dissimilarity measures such as the Euclidean, Mahalanobis, and correlation distance, are biased by measurement noise, which can lead to incorrect inferences. Unbiased dissimilarity estimates can be obtained by crossvalidation, at the price of increased variance. Second, the pairwise dissimilarity estimates are not statistically independent. Ignoring the dependency makes model comparison with RSA statistically suboptimal. We present an analytical expression for the mean and (co)variance of both biased and unbiased estimators of Euclidean and Mahalanobis distance, allowing us to exactly quantify the bias-variance trade-off. We then use the analytical expression of the covariance of the dissimilarity estimates to derive a simple method correcting for this covariance. Combining unbiased distance estimates with this correction leads to a novel criterion for comparing representational geometries, the unbiased distance correlation, which, as we show, allows for near optimal model comparison.

## Author summary

Representational Similarity Analysis (RSA) compares brain activity data to predictions from computational models by evaluating which brain activity states are similar and which are dissimilar to each other. For RSA, we first calculate all pairwise dissimilarities between activity patterns, which characterize the geometry of the brain representation. In this paper, we propose a new criterion for determining how closely this representational geometry corresponds to the prediction of a computational model. The new criterion takes into account that the activity patterns are measured with noise, which can bias inference. It also accounts for the fact that the different pairwise dissimilarities are not independent. Using simulations, we show that the new criterion performs near-optimally, making it the method of choice for the comparison of representational geometries.

## Introduction

Systems neuroscience investigates how patterns of brain activity implement the computational processes that support behavior. The computations can be understood as transformations of representations that reflect task-relevant information about the external world, the state of the animal’s body, its needs, goals, plans or actions. Models of brain computation seek to explain how the brain processes information [Kriegeskorte2019]. In order to formally test such models, we must compare the representations in the models to the activity patterns measured in the brain. An essential challenge for computational neuroscience, therefore, is to develop methods for comparing representations between brains and models.

The approach we focus on here is to characterize brain representations at the level of the neural population [Haxby2001, Hung2005]. This approach abstracts from the roles of individual neurons, making it easier to compare representations between brains and models [kriegeskorte2008]. Brain representations are characterized by measuring patterns of activity across a brain region. Each activity pattern is associated with an experimental condition, for example the presentation of a particular sensory stimulus, and defines a point in the multivariate response space [Edelman1998]. The distances among these points define the geometry of the representation. If the noise is isotropic and homoscedastic, each distance determines how discriminable two patterns are in the representation, and the distance matrix determines the encoded information. The representational geometry additionally captures aspects of the format of the code, revealing, for example, what subset of the encoded information is amenable to linear readout [Kriegeskorte2019].

The analysis of representational geometries has come to be called representational similarity analysis (RSA, [kriegeskorte2008]). RSA has been applied to data from invasive electrophysiology, fMRI, electroencephalography (EEG), or other methods, to characterize brain representations at the population level. RSA proceeds in three steps: In the first step, the estimated activity patterns are used to compute a condition-by-condition representational dissimilarity matrix (RDM, see Figure 1). An important decision here is the choice of dissimilarity measure. Choices include the accuracy of pairwise decoders, correlation distance, or Euclidean and Mahalanobis distances ([Walther2016]). In a second step, the data and models are compared by relating the vector of upper-triangular elements of the data RDM (Figure 1) and the corresponding vectors for the model RDMs. Because the dissimilarity estimates typically lack units, models cannot predict the values of the dissimilarities directly. Instead models predict the ratios or ranks of the dissimilarities. Therefore, the off-diagonal elements are compared using cosine similarity, the Pearson correlation, or the Spearman or Kendall rank correlation [nili2014]. In the third and final step, models are inferentially compared using frequentist parametric [Ejaz2015] or non-parametric tests [nili2014].

In this paper we address two closely related problems for inference using RSA. The first problem is that distances (including correlation, Euclidean, and Mahalanobis distances) are positively biased when directly estimated from noisy data [Walther2016]. Even if the true activity patterns are identical, and so the true distance is zero, the measured activity patterns will differ by virtue of the measurement noise, and the estimated distance will be larger than zero. If different conditions are measured with different noise levels, or if measurement noise is correlated across conditions, different distances will be biased to different extents and the representational geometry will be distorted, potentially leading to systematically incorrect inferences [cai2019]. To avoid the noise-induced bias in distance estimates, we have previously proposed crossvalidated dissimilarity estimators [nili2014, Walther2016], which provide unbiased distance estimates with an interpretable 0 point. The removal of bias by crossvalidation comes at the cost of slightly increased variance. In this paper, we derive analytical expressions for the bias and variance of both biased and unbiased distance estimates. This allows us to gain analytic insights into when the use of unbiased distance estimates is advantageous.

The second problem is that the elements of an RDM have a complex covariance structure. This covariance, if not accounted for, can make model selection sub-optimal. The comparison between RDMs can be visualized in a space in which each unique dissimilarity of the RDM defines a dimension (Figure 2). When comparing RDMs with the cosine similarity, the model that has the smaller angle with the data RDM will be considered the better fit. While this approach is not systematically wrong, model comparison will not be optimal. In the second part of the paper, we therefore propose a simple method to address this issue: Using the analytical expression for the covariance of the different dissimilarity estimates, we can effectively calculate a cosine similarity in a "whitened" space, in which the noise is isotropic. We show that this whitened RDM cosine similarity, when based on biased distance estimates, is equivalent to two existing multivariate dependence criteria, the linear Centered Kernel Alignment [kornblith2019] and the distance correlation [Szekely2007]. By combining the whitened RDM cosine similarity with unbiased distance estimates, we define a novel criterion, the unbiased distance correlation. This technique substantially improves the power of inferential model comparisons, and performs, for normally distributed data, close to the theoretical maximum of the likelihood-ratio test [Neyman1933, Diedrichsen2011, Diedrichsen2017Neuroimage]. At the same time, the new criterion is robust against violations of noise assumptions, making it the method of choice for RSA model comparison in many applications.

## Results

### Basic Definitions

Let the matrix $\mathbf{U}$ contain the true activation values for $K$ experimental conditions measured over $P$ measurement channels. Each row of $\mathbf{U}$ contains an activity pattern across channels, elicited by a single experimental condition. Each column of $\mathbf{U}$ contains an activity profile across conditions, for a single channel. As an example, consider the analysis of fMRI data, where the channels are voxels. In this case, the data ($\mathbf{Y}_m$) are time-series of blood-oxygenation level dependent (BOLD) signal measurements for every voxel. These data can be separated into $M$ different scanner runs or sessions of data recording (partitions, indexed by $m$) that can be assumed to be independent. The measured data is assumed to be a linear function of the true activity patterns and a design matrix $\mathbf{Z}_m$, which indicates to what degree each activity pattern is active at each time, plus measurement noise $\mathbf{E}_m$.

$$\mathbf{Y}_m = \mathbf{Z}_m \mathbf{U} + \mathbf{E}_m \tag{1}$$

From each partition of the data, we can derive an estimate of the activity patterns, $\hat{\mathbf{U}}_m$. The first step for RSA is to compute the dissimilarities between activity patterns. Let $\mathbf{u}_i$ be the $i$th row of $\mathbf{U}$, that is, the true activity pattern for the $i$th condition across voxels. We define $d_{i,j}$ to be the dissimilarity between conditions $i$ and $j$. In the results, we consider the Euclidean distance, but we show in the methods how the results generalize to the Mahalanobis distance. Both dissimilarities are based on the difference between activity patterns, $\boldsymbol{\delta}_{i,j} = \mathbf{u}_i - \mathbf{u}_j$. Specifically, the squared Euclidean distance is

$$d_{i,j} = \frac{\boldsymbol{\delta}_{i,j}\,\boldsymbol{\delta}_{i,j}^T}{P} \tag{2}$$

Note that we are normalizing all dissimilarities by the number of channels to make the measures comparable across regions of different sizes. In an experiment with $K$ conditions, we have a total of $K(K-1)/2$ unique pairwise distances.

### Biased and unbiased estimates for the squared Euclidean distance

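The simplest estimator of the squared Euclidean distance can be obtained by first averaging the estimated pattern differences $\hat{\boldsymbol{\delta}}_{i,j,m}$ between conditions $i$ and $j$ across the $M$ partitions (indexed by $m$)

$$\bar{\boldsymbol{\delta}}_{i,j} = \frac{1}{M}\sum_{m=1}^{M} \hat{\boldsymbol{\delta}}_{i,j,m} \tag{3}$$

and then taking the inner product of these averaged pattern-difference vectors. Plugging Eq. 3 into Eq. 2 shows that the distance estimate relies on all the pairwise products of pattern differences, normalized by the number of channels $P$:

$$\hat{d}_{i,j} = \frac{\bar{\boldsymbol{\delta}}_{i,j}\,\bar{\boldsymbol{\delta}}_{i,j}^T}{P} = \frac{1}{M^2 P}\sum_{m=1}^{M}\sum_{n=1}^{M} \hat{\boldsymbol{\delta}}_{i,j,m}\,\hat{\boldsymbol{\delta}}_{i,j,n}^T \tag{4}$$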

As shown in the methods, this estimate is positively biased by measurement noise. The positive bias arises because we are multiplying a noisy pattern estimate with itself. The size of the bias is determined by the measurement variance of the pattern difference .

(5) |

If the measurement variance across all pattern differences is the same, the bias is a constant value across all dissimilarities, and can be accounted for by using Pearson or rank correlations to compare RDMs (see below). However, if the variance differs, the bias will systematically differ across dissimilarities, and possibly distort the representational geometry in favour of the wrong model [cai2019].

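To avoid this bias, we estimate squared distances by only multiplying pattern estimates from different, and hence independent, partitions with each other [nili2014, Walther2016]. Thus, we drop from Eq. 4 all pairs where $m = n$.

$$\tilde{d}_{i,j} = \frac{1}{M(M-1)P}\sum_{m=1}^{M}\sum_{n \neq m} \hat{\boldsymbol{\delta}}_{i,j,m}\,\hat{\boldsymbol{\delta}}_{i,j,n}^T \tag{6}$$

Here $\hat{\boldsymbol{\delta}}_{i,j,m}$ is the pattern-difference estimate for conditions $i$ and $j$ from partition $m$, $M$ is the number of partitions, and $P$ is the number of channels. In contrast to the biased estimate, $\hat{d}_{i,j}$, we denote the unbiased estimate as $\tilde{d}_{i,j}$. The bias is removed, as only independent partitions enter the product (for details, see Methods). Avoiding products where noise is multiplied with itself ensures that the expected value of the estimator is the distance $d_{i,j}$ we want to estimate:

$$E\left(\tilde{d}_{i,j}\right) = d_{i,j} \tag{7}$$

In other words, the cross-validated distance estimator is unbiased.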
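The difference between the two estimators can be sketched in a few lines of NumPy. The sketch below is illustrative only: it assumes that the pattern-difference estimates for one pair of conditions are stacked in an array `delta_hat` with one row per partition, and the function names, array layout, and simulation settings are hypothetical choices rather than part of the original analysis.

```python
import numpy as np

def biased_sq_distance(delta_hat):
    """Inner product of the partition-averaged pattern difference, divided by the number of channels."""
    mean_delta = delta_hat.mean(axis=0)
    return mean_delta @ mean_delta / delta_hat.shape[1]

def crossval_sq_distance(delta_hat):
    """Average product of pattern differences from different partitions, divided by the number of channels."""
    n_part, n_chan = delta_hat.shape
    gram = delta_hat @ delta_hat.T          # all partition-by-partition inner products
    off_diag = gram.sum() - np.trace(gram)  # drop products of a partition with itself
    return off_diag / (n_part * (n_part - 1) * n_chan)

# Simulation with a true distance of zero (illustrative settings)
rng = np.random.default_rng(0)
n_part, n_chan, n_sim = 4, 50, 5000
biased = np.empty(n_sim)
unbiased = np.empty(n_sim)
for s in range(n_sim):
    # Each partition's pattern difference is pure unit-variance noise
    delta_hat = rng.standard_normal((n_part, n_chan))
    biased[s] = biased_sq_distance(delta_hat)
    unbiased[s] = crossval_sq_distance(delta_hat)

print(f"mean biased estimate:   {biased.mean():.3f}")
print(f"mean unbiased estimate: {unbiased.mean():.3f}")
```

With unit-variance noise on each partition's pattern difference, the biased estimate averages to roughly the noise variance divided by the number of partitions (here 1/4), whereas the cross-validated estimate averages to roughly zero.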

### Variance of distance estimates

The removal of the bias, however, does not come for free: As can be seen from Eq. 4 and 6, the unbiased estimate uses fewer pairs of activity patterns to estimate the true distance. We therefore expect this estimate to have a higher variance than the biased estimate. Indeed, we show in the methods that the variance of the biased distance is

(8) |

A very similar expression is obtained for the unbiased estimate of the distance:

(9) |

Both expressions have two components. The first term arises from the multiplication of noise with noise. It scales with the square of the measurement variability of the corresponding pattern differences. The only difference between the biased and unbiased estimates is the size of this component, which is larger by a factor of $M/(M-1)$ for the unbiased estimate, where $M$ is the number of partitions. The second term arises from the multiplication of the true pattern difference with noise. If the true distance is zero, i.e. if there are no differences between the true activity patterns, this second term vanishes. The overall balance between these two terms also depends on the strength of the signal and on the noise covariance structure across channels.

The insights from the equations are summarized in Figure 3, which shows the mean and variance of the squared distance estimates for a range of true distances between 0 and 1.2. If the true distance is 0, the mean of the unbiased estimate is zero, whereas the mean of the biased estimate is inflated by the noise-induced bias (Eq. 5). In exchange, the variance of the biased estimate is lower: with $M$ partitions, it is $(M-1)/M$ times the variance of the unbiased estimate. This difference is caused by using different numbers of pairwise products: The biased estimate uses all $M^2$ possible pairs, whereas the unbiased estimate excludes $M$ of the pairs (those of each partition with itself). Thus, with $M = 2$ partitions, the variance of the unbiased estimate will be twice as large. However, the difference diminishes as the number of independent partitions increases. The second term of Eq. 8, 9 causes the variance of the distance estimate to increase linearly with the true squared distance. This signal-dependence affects biased and unbiased estimates equally.
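The variance cost of cross-validation at a true distance of zero can be checked with a small Monte Carlo simulation. This is a sketch under simplifying assumptions (i.i.d. unit-variance Gaussian noise on each partition's pattern difference, two partitions); the variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_part, n_chan, n_sim = 2, 30, 20000   # two partitions; illustrative sizes

# Noise-only pattern differences (true distance zero), one row per partition
delta_hat = rng.standard_normal((n_sim, n_part, n_chan))

# gram[s, m, n] is the inner product of partitions m and n in simulation s
gram = np.einsum('smp,snp->smn', delta_hat, delta_hat)
all_pairs = gram.sum(axis=(1, 2))
same_partition = np.trace(gram, axis1=1, axis2=2)

# Biased estimate uses all pairs; unbiased estimate drops same-partition pairs
biased = all_pairs / (n_part**2 * n_chan)
unbiased = (all_pairs - same_partition) / (n_part * (n_part - 1) * n_chan)

ratio = unbiased.var() / biased.var()
print(f"variance ratio (unbiased / biased): {ratio:.2f}")
print(f"ratio of pair counts M / (M - 1):   {n_part / (n_part - 1):.2f}")
```

Increasing `n_part` in this sketch moves the ratio toward 1, in line with the statement that the cost of cross-validation shrinks as more independent partitions become available.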

### Model comparison using RDM correlations or cosine similarity

Whether it is better to use biased or unbiased distance estimates depends on how these estimates are used in subsequent inference. A common use case is to compare the measured RDM to different competing models of neuronal representations. For this, the upper triangular part of the RDM (which is symmetric about a diagonal of zeros) is vectorized (Fig. 1). The vector of estimated distances ($\mathbf{d}$) is then correlated with the vector of model-predicted distances ($\mathbf{m}$), and the model with the highest correlation is selected.

Which correlation coefficient is appropriate depends on the level at which the models are meant to make predictions. If the models predict merely the rank-ordering of the distances, a Spearman rank correlation (or, when any of the models predict equal representational distances for different pairs of stimuli, Kendall’s rank correlation) is appropriate [nili2014]. If the models make predictions about distances on an interval scale, one can use the Pearson correlation. Calculating a correlation (whether Kendall, Spearman, or Pearson) allows for arbitrary scaling between observed and predicted distances, which is useful because the model-predicted dissimilarities are typically in arbitrary units and the scaling of the data depends on the signal-to-noise ratio. These correlation coefficients are also invariant to an additive constant. In the context of biased distance estimates this is useful, as it discounts the positive bias arising from noise, if the noise is equal across conditions.

If we have removed the bias already by using an unbiased distance estimator, we have a ratio-scale measure, where 0 is an informative point, indicating that the two patterns only differ by measurement noise. To exploit this additional information, we can compute the correlation without subtracting the mean across distances. This quantity is the cosine similarity: the cosine of the angle between the vectorized data RDM, $\mathbf{d}$, and the vectorized model RDM, $\mathbf{m}$ (Fig. 2).

$$r_{\cos}(\mathbf{d}, \mathbf{m}) = \frac{\mathbf{d}^T\mathbf{m}}{\sqrt{(\mathbf{d}^T\mathbf{d})(\mathbf{m}^T\mathbf{m})}} \tag{10}$$
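The practical difference between the Pearson correlation and the cosine similarity can be illustrated with a toy example. The sketch below uses a hypothetical model RDM vector for four conditions in two categories; the function names and numbers are illustrative assumptions.

```python
import numpy as np

def rdm_cosine(d, m):
    """Cosine of the angle between two vectorized RDMs (no mean subtraction)."""
    return d @ m / np.sqrt((d @ d) * (m @ m))

def rdm_pearson(d, m):
    """Pearson correlation: cosine similarity after removing each vector's mean."""
    return rdm_cosine(d - d.mean(), m - m.mean())

# Hypothetical model: within-category distances are 0, between-category distances are 1
model = np.array([0.0, 1.0, 1.0, 1.0, 1.0, 0.0])
data = 3.0 * model        # same geometry, arbitrary scaling
shifted = data + 0.5      # constant offset, such as a uniform noise-induced bias

print(rdm_pearson(data, model), rdm_pearson(shifted, model))  # offset has no effect
print(rdm_cosine(data, model), rdm_cosine(shifted, model))    # offset lowers the cosine
```

Both criteria ignore the overall scaling, but only the Pearson correlation ignores the additive offset. The cosine similarity treats the zero point as meaningful, which is why it is paired with unbiased distance estimates.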

### Biased or unbiased distance estimates?

For models that predict the RDM on a ratio-scale, this leaves us with two consistent choices: We can either compute the cosine similarity between the model and unbiased distance estimates, or the Pearson correlation between the model and biased distance estimates.

The latter technique, however, will only work correctly if the positive bias is the same across all elements of the RDM - that is, when all pair-wise pattern differences are measured with the same variability. If one condition has a smaller number of trials than another (e.g., after the exclusion of error trials), distances involving this condition will be systematically larger. Similarly, if the estimation errors for one pair of conditions are more correlated than for another pair (e.g. because some conditions were measured with fMRI in temporal proximity, and so the pattern estimates are dependent), then the variance of their pattern differences will be smaller, leading to systematically lower distance estimates. In both cases, the use of biased distance estimates can bias inference towards the incorrect model [cai2019].

Biased distance estimates therefore should only be used if we can be relatively sure that there are no substantial differences in measurement variability across all pairwise pattern differences. Even if this is the case, unbiased distance estimates may still be preferable to biased estimates because they enable us to exploit the additional information inherent in the zero point to sensitively adjudicate among models.

Consider for example the two RSA models depicted in Fig. 4a. Both models have the same category structure, with conditions 1 and 2 belonging to one category, and conditions 3 and 4 belonging to another. The two models only differ in the ratio of the within-category to the between-category distances. Distinguishing between them therefore requires a meaningful zero point. To verify this, we simulated data from each model, varying the number of independent partitions (see Methods). We then determined the better fitting model, using either the Pearson correlation with biased distance estimates, or the cosine similarity with unbiased estimates. As an outcome measure, we counted how often each inference method could identify the true data-generating model. In this particular example, the inference remains at chance level (0.5) when using Pearson correlations of biased estimates – inferences can only be made using the cosine similarity on unbiased distance estimates.

In contrast, the two models depicted in Fig. 4b differ in the predicted similarity structure, with the second model involving 3 different levels of distances. This makes an absolute zero point less important for inference: using biased and unbiased distance estimates leads to approximately equal numbers of correct model decisions. For cross-validated estimates, the performance depends on the number of partitions $M$, as the noise-driven part of their variance is higher than that of the biased estimates by a factor of $M/(M-1)$. In this situation, using unbiased estimates is preferable for more than 4 partitions, whereas biased estimates are better when there are 2 or 3 partitions.

Finally, the second model in Fig. 4c includes a wide range of different distances, including some that are relatively close to zero. This abolishes the advantage of the interpretable zero point almost entirely. In this case, using biased distance estimates is advantageous, especially when only a small number of partitions is available.

It should be kept in mind, however, that all these simulations assume that differences between each pair of activity patterns can be measured with equal variance. Any deviation from this assumption can substantially change the measured representational structure and bias model selection. The use of cross-validated, unbiased distance estimates provides a safeguard against such biasing influences, an advantage that in many cases will be well worth the small cost in statistical power.

### Covariance of distance estimates

The use of correlations (for biased distance estimates) or cosine similarity (for unbiased estimates) for model comparison would be fully adequate if all elements of the RDM were estimated independently and with the same variance. However, our analytical expression for the full covariance matrix of dissimilarities (Eq. 29, 31) shows that this is not the case, even if the underlying activity patterns are measured independently and with the same variance.

The correlation between distance estimates arises from the fact that the pattern difference between condition 1 and 2 is not independent from the pattern difference between condition 1 and 3 (even if all conditions are measured independently). The covariance matrix for the 10 distances of a design with 5 conditions (assuming no true pattern differences and i.i.d. noise) is shown in Figure 5a. Distances that share one of the conditions (e.g., $d_{1,2}$ and $d_{1,3}$) have a correlation of 0.25. Only estimates of distances that do not share any conditions (e.g., $d_{1,2}$ and $d_{3,4}$) are uncorrelated. If the measurement noise on the patterns is not i.i.d., a more complex co-dependence structure can arise.
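The dependence between distance estimates that share a condition can be illustrated with a short simulation. The sketch below assumes pure i.i.d. Gaussian noise and no true pattern differences; the sizes and names are illustrative. Under these assumptions, the pattern differences for two pairs that share a condition correlate at 0.5, and the correlation between the corresponding squared distances is the square of that value, 0.25.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cond, n_chan, n_sim = 4, 20, 20000   # illustrative sizes

# Pure-noise patterns: all true distances are zero, noise is i.i.d.
patterns = rng.standard_normal((n_sim, n_cond, n_chan))

def sq_dist(i, j):
    """Squared Euclidean distance between conditions i and j, divided by the number of channels."""
    diff = patterns[:, i, :] - patterns[:, j, :]
    return (diff**2).mean(axis=1)

d12, d13, d34 = sq_dist(0, 1), sq_dist(0, 2), sq_dist(2, 3)

r_shared = np.corrcoef(d12, d13)[0, 1]     # both distances involve condition 1
r_disjoint = np.corrcoef(d12, d34)[0, 1]   # no condition in common
print(f"shared condition: {r_shared:.2f}, disjoint pairs: {r_disjoint:.2f}")
```

The first value should come out close to 0.25 and the second close to zero, up to simulation noise.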

For a design with a larger number of conditions (K=18, Fig. 5b), the number of uncorrelated distances increases, i.e. the covariance matrix becomes sparser. This does not mean, however, that we can ignore the dependence structure. As we will show below, accounting for the co-dependence structure becomes especially important for designs with large numbers of conditions. Similar to what we observed for the variance, the covariance between distance estimates also increases with increasing true difference between corresponding patterns. Figure 5c shows the covariance matrix for distance estimates from data generated from the RDM model depicted in Figure 1. Large true distances exhibit larger variances and also larger covariances with other distances. The exact shape of the covariance matrix depends on the exact form of the true pattern differences, as well as on the covariance of the noise across voxels.

Given the correlation between elements of the estimated RDM, we would therefore expect that both the RDM Pearson correlation and the RDM cosine similarity will perform sub-optimally (see Figure 2). Indeed, in a previous paper [Diedrichsen2017], we have shown, for several different simulation scenarios, that ignoring this covariance structure leads to 3-12% fewer correct model-selection decisions, as compared to a full likelihood-ratio test between models, implemented in Pattern Component Modelling [Diedrichsen2011, Diedrichsen2017].

### RDM comparison in whitened RDM-space

To improve model inference, we should therefore use the covariance matrix to transform the elements of the RDM into a whitened space, in which all dimensions are measured independently and with the same variance. Originally [Diedrichsen2017], we had suggested using the full expression for the covariance (Eq. 31), and estimating the required quantities iteratively from the data. However, it is difficult to obtain stable estimates for the signal and to account for how it interacts with the spatial structure of the noise. As a result, this approach is slow and unreliable.

Here we propose a simplification to avoid estimating the right side of Eq. 29 and Eq. 31. Specifically, we suggest using the covariance structure of the distances under the assumption that all distances are zero. In this case the variance simplifies to

(11) |

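where the scalar factor is a proportionality constant that is not important for most applications. To take this covariance matrix, denoted $\mathbf{V}$ in the following, into account for model comparison, we can prewhiten the distances by pre-multiplying them with $\mathbf{V}^{-1/2}$. In the case of the cosine similarity between unbiased distance estimates, this leads to a new criterion, the whitened RDM cosine similarity between the data RDM vector $\mathbf{d}$ and the model RDM vector $\mathbf{m}$:

$$r_{w\cos}(\mathbf{d}, \mathbf{m}) = \frac{\mathbf{d}^T\mathbf{V}^{-1}\mathbf{m}}{\sqrt{(\mathbf{d}^T\mathbf{V}^{-1}\mathbf{d})(\mathbf{m}^T\mathbf{V}^{-1}\mathbf{m})}} \tag{12}$$

Similarly, we can define a whitened RDM Pearson correlation, simply by first subtracting the mean dissimilarity of the data ($\bar{d}$) and model ($\bar{m}$) RDM.

$$r_{w\mathrm{corr}}(\mathbf{d}, \mathbf{m}) = \frac{(\mathbf{d}-\bar{d})^T\mathbf{V}^{-1}(\mathbf{m}-\bar{m})}{\sqrt{\left((\mathbf{d}-\bar{d})^T\mathbf{V}^{-1}(\mathbf{d}-\bar{d})\right)\left((\mathbf{m}-\bar{m})^T\mathbf{V}^{-1}(\mathbf{m}-\bar{m})\right)}} \tag{13}$$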
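A minimal sketch of the whitened criteria is given below. It rests on an explicit assumption: for i.i.d. noise across conditions and zero true distances, the covariance of the distance estimates is taken to be proportional to the element-wise square of $\mathbf{C}\mathbf{C}^T$, where $\mathbf{C}$ is the matrix of pairwise contrasts between conditions. The function names and the example vectors are hypothetical.

```python
import numpy as np
from itertools import combinations

def pairwise_contrasts(n_cond):
    """One row per condition pair (i, j), holding +1 at i and -1 at j."""
    pairs = list(combinations(range(n_cond), 2))
    c = np.zeros((len(pairs), n_cond))
    for row, (i, j) in enumerate(pairs):
        c[row, i], c[row, j] = 1.0, -1.0
    return c

def zero_distance_covariance(n_cond):
    """ASSUMED form of the covariance (up to a constant): element-wise square of C C^T."""
    c = pairwise_contrasts(n_cond)
    return (c @ c.T) ** 2

def whitened_cosine(d, m, v):
    """Cosine similarity of RDM vectors d and m after whitening with the inverse square root of v."""
    v_inv_d = np.linalg.solve(v, d)
    v_inv_m = np.linalg.solve(v, m)
    return (d @ v_inv_m) / np.sqrt((d @ v_inv_d) * (m @ v_inv_m))

def whitened_pearson(d, m, v):
    """Whitened cosine similarity after subtracting each vector's mean dissimilarity."""
    return whitened_cosine(d - d.mean(), m - m.mean(), v)

# Hypothetical example: 4 conditions, hence 6 pairwise distances
v = zero_distance_covariance(4)
model = np.array([0.0, 1.0, 1.0, 1.0, 1.0, 0.0])
rng = np.random.default_rng(3)
data = model + 0.2 * rng.standard_normal(6)
print(whitened_cosine(data, model, v), whitened_pearson(data, model, v))
```

Solving linear systems with `np.linalg.solve` avoids forming the matrix square root explicitly; the result is the same as taking the ordinary cosine similarity of the prewhitened vectors.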

### Influence of RDM whitening on model comparisons

To determine the influence of RDM whitening in the context of realistic model comparisons, we simulated data using the designs of three published fMRI experiments and evaluated the associated models, all of which made quantitative predictions about representational distances (Fig. 6). The first two experiments measured activation patterns associated with 5 (Exp 1) or 31 (Exp 2) different finger movements in primary motor cortex [Ejaz2015]. The associated models were derived either from the structure of the associated muscle activity or from the natural statistics of movement. The third experiment measured the activation patterns elicited by 92 images showing a range of animate and inanimate objects in human inferior temporal cortex. The models were derived from the 8 layers of a neuronal network [Khaligh-Razavi2014].

As for Figure 4, we simulated data from each model, using different signal-to-noise levels. We then used RDM correlations or whitened RDM correlations, as well as cosine similarity or whitened cosine similarity to find the best model. For each method we recorded the number of correct model decisions. In addition to the RSA-based methods, we also used Pattern Component Modeling (PCM) [Diedrichsen2011, Diedrichsen2017], which directly compares the marginal likelihood of the data given the models, under the assumption that both signal and noise are normally distributed. For such data, PCM implements the likelihood-ratio test between models. For the case of our simulations, where all assumptions hold, PCM therefore implements the optimal inference procedure [Neyman1933] and provides an upper performance bound for any model-comparison technique.

Across the three different experimental simulations, the simple Pearson correlation or cosine similarity (Fig. 6, solid lines) performed clearly sub-optimally as compared to PCM. Taking the covariance structure of the distances into account (dashed lines) substantially improved model decisions. Indeed, in many cases, the performance of RSA inference was close to optimal. This suggests that accounting for the signal-dependent, second half of the covariance formula (Eq. 29, 31) would not improve inference much further. Instead, these simulations indicate that the observed sub-optimal performance was mostly caused by the assumption that the distances are uncorrelated, rather than by the assumption that the distances have the same variance.

### Factors influencing the advantage of RDM whitening

From the simulations in Figure 6, it appears that the importance of taking the covariance structure of the distance estimates into account is more pronounced for experiments with more conditions. To test this idea directly, we simulated data for Exp 1 with 5 conditions and 32 partitions. We then reanalyzed the data, relabeling the measures from even partitions as conditions 1-5, and from the odd partitions as conditions 6-10. This increases the number of conditions to 10 and reduces the number of partitions to 16. We repeated this procedure two more times, finally ending up with 40 conditions and 4 partitions. As the underlying data is the same, the performance of PCM was relatively constant across these situations. When using cosine similarities, however, the importance of taking the covariance structure into account increased with increasing number of conditions (Fig. 7).

This may at first appear somewhat counter-intuitive, as the proportion of uncorrelated distance pairs in the RDM increases with increasing number of conditions. However, the structure of the covariance matrix (Fig. 5b) is such that the anisotropy of the covariance structure increases with the number of conditions. The axis of highest variability of the distance estimates is always in the direction of the average of all distances. With the covariance matrix scaled such that its smallest eigenvalue is 1, this direction is associated with an eigenvalue of $K$, where $K$ is the number of conditions. There are also $K-1$ orthogonal directions with an eigenvalue of $K/2$, and $K(K-3)/2$ directions with an eigenvalue of 1. Thus, the ratio of the larger to the smaller eigenvalues of the covariance matrix (a measure of anisotropy) scales linearly with the number of conditions. That means that with 40 conditions, the all-mean dimension has 40 times higher variability than most of the other directions. When ignoring the covariance structure, all dimensions are counted as equally important, which leads to sub-optimal inferences.

This consideration also implies that for some model comparison problems, taking into account the covariance will not change the inference. This is the case when two models differ only on dimensions of model space that can be measured with equal variability (i.e., have the same eigenvalue in the covariance matrix). Most model comparison problems will improve, however, and inference will never get worse. Thus, using whitened RDM correlations is always recommended.
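The eigenvalue structure described above can be inspected numerically. The sketch below again assumes that, for i.i.d. noise and zero true distances, the covariance of the distance estimates is proportional to the element-wise square of $\mathbf{C}\mathbf{C}^T$, with $\mathbf{C}$ the matrix of pairwise contrasts. The scaling by 1/2 is chosen here only so that the smallest eigenvalue equals 1; the function name is hypothetical.

```python
import numpy as np
from itertools import combinations

def zero_distance_covariance(n_cond):
    """ASSUMED covariance of distance estimates, scaled so the smallest eigenvalue is 1."""
    pairs = list(combinations(range(n_cond), 2))
    c = np.zeros((len(pairs), n_cond))
    for row, (i, j) in enumerate(pairs):
        c[row, i], c[row, j] = 1.0, -1.0
    return (c @ c.T) ** 2 / 2.0   # element-wise square of C C^T, halved

for n_cond in (5, 10, 40):
    eigvals = np.linalg.eigvalsh(zero_distance_covariance(n_cond))
    # Group numerically identical eigenvalues and count their multiplicities
    values, counts = np.unique(np.round(eigvals, 6), return_counts=True)
    print(n_cond, dict(zip(values.tolist(), counts.tolist())))
```

For 5 conditions, this yields the eigenvalue 5 once, 2.5 four times, and 1 five times, so the ratio of the largest to the smallest eigenvalue equals the number of conditions.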

### Relationship to Distance Correlation and Centered Kernel Alignment

Interestingly, the whitened RDM cosine similarity (calculated from biased distance estimates and assuming i.i.d. measurement noise) is identical to two statistical measures of multivariate dependence: the linear Centered Kernel Alignment (CKA, [kornblith2019, Cristianini2006]), and the distance correlation [Szekely2007]. Linear CKA is a normalized version of the Hilbert-Schmidt independence criterion (HSIC [Gretton2005]) between two sets of multivariate patterns. Let $\mathbf{U}_1$ and $\mathbf{U}_2$ be two matrices, with the same number of rows, containing patterns for $N$ observations (e.g., trials or time points). To make the mean of each column of these matrices equal to zero, we can pre-multiply the patterns with the centering matrix $\mathbf{H} = \mathbf{I}_N - \mathbf{1}_N\mathbf{1}_N^T/N$. We can then define the centred second moment matrix of the patterns as

(14) |

The HSIC is the dot-product between the elements of the two second-moment matrices.

(15) |

The linear CKA is the normalized version of this quantity, just like the correlation coefficient is a normalized version of the covariance.

(16) |

To prove the equivalence of Eq. 12 and Eq. 16, we can express the vector of squared distances as a linear combination of the elements of the second moment matrix,

(17) | |||||

(18) |

which we can write more succinctly using a properly defined linear transformation matrix, such that

(19) |

Given the structure, we can verify that

(20) |

A more intuitive explanation of this equivalence is that all unique elements of an estimate of are mutually uncorrelated, if the true and (Eq. 37). The covariance-structure of the distances is then simply induced by the fact that some of the distances share common elements of the second moment matrix, with the covariance structure determined (up to a constant) by .
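The equivalence can also be checked numerically. The sketch below computes linear CKA from the centred second-moment matrices and the whitened RDM cosine similarity from biased squared Euclidean distances. It assumes that the whitening matrix is proportional to the element-wise square of $\mathbf{C}\mathbf{C}^T$, with $\mathbf{C}$ the matrix of pairwise contrasts; all names and the simulated data are illustrative.

```python
import numpy as np
from itertools import combinations

def linear_cka(x, y):
    """Linear CKA between two pattern matrices with the same number of rows."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    gx = h @ x @ x.T @ h                  # centred second-moment matrices
    gy = h @ y @ y.T @ h
    hsic_xy = np.sum(gx * gy)             # dot product of the matrix elements
    return hsic_xy / np.sqrt(np.sum(gx * gx) * np.sum(gy * gy))

def whitened_rdm_cosine(x, y):
    """Whitened cosine similarity between the squared-distance vectors of x and y."""
    n = x.shape[0]
    pairs = list(combinations(range(n), 2))
    c = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        c[row, i], c[row, j] = 1.0, -1.0
    v = (c @ c.T) ** 2                    # ASSUMED covariance form, up to a constant
    dx = ((c @ x) ** 2).sum(axis=1)       # biased squared Euclidean distances
    dy = ((c @ y) ** 2).sum(axis=1)
    v_inv_dx = np.linalg.solve(v, dx)
    v_inv_dy = np.linalg.solve(v, dy)
    return (dx @ v_inv_dy) / np.sqrt((dx @ v_inv_dx) * (dy @ v_inv_dy))

rng = np.random.default_rng(4)
x = rng.standard_normal((6, 20))                   # 6 conditions, 20 channels
y = x[:, :8] + 0.5 * rng.standard_normal((6, 8))   # related patterns, 8 channels
print(linear_cka(x, y), whitened_rdm_cosine(x, y))
```

The two printed values agree up to floating-point error. Normalizing the distances by the number of channels would not change the result, because any constant scaling of either distance vector cancels in the cosine.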

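The distance correlation [Szekely2007] is defined as the correlation between two double-centered RDMs, i.e. the correlation between $\mathbf{H}\mathbf{D}_1\mathbf{H}$ and $\mathbf{H}\mathbf{D}_2\mathbf{H}$, where $\mathbf{D}$ denotes the RDM arranged as a square matrix and $\mathbf{H}$ the centering matrix. Because of the direct relationship between the centred second-moment matrix $\mathbf{G}$ and the double-centered RDM

$$\mathbf{G} = -\frac{1}{2}\mathbf{H}\mathbf{D}\mathbf{H} \tag{21}$$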

a distance correlation is equivalent to Eq. 16, and hence to Eq. 12. The equivalence between whitened RDM cosine similarity and linear CKA / distance correlation provides a novel way of motivating these two measures. It also immediately suggests an important extension. Whereas the whitened RDM cosine similarity between biased distances (Eq. 4) is equivalent to the distance correlation, the whitened RDM cosine similarity between unbiased distance estimates (Eq. 6) defines an unbiased distance correlation. This measure can be easily computed based on Eq. 15 and 16, replacing with an unbiased estimate for the second moment matrix.

(22) |

In contrast to the traditional definition, the expected value of the unbiased distance correlation will be zero if and only if the two representational structures have no common information (i.e., when there is a reliable difference between two activity patterns in , then there is no reliable difference between the same conditions in ).

## Discussion

RSA provides an intuitive and flexible way of performing inference on representational models (i.e., on models that describe the relationship between high-dimensional activity patterns). There are, however, numerous different dissimilarity measures and ways of comparing measured RDMs to model RDMs, and the optimal way of implementing RSA remains a matter of debate [nili2014, Diedrichsen2017]. In this paper, we derive an analytical expression for the mean and the covariance of the biased and unbiased estimates of squared Euclidean and Mahalanobis distances. This theoretical result leads to two important conclusions.

First, we show that standard distance estimates are positively biased, and that this bias depends on the variances and covariances of the measured activity patterns (). If the measurement noise is i.i.d. across trials, the bias will be the same across all distance estimates, and can be taken into account by ignoring the mean distance in subsequent model comparisons (for example by using the Pearson or rank correlation). If, however, one condition is measured with higher variance (for example because the number of repetitions differed or error trials had to be discarded), then all distances involving this condition will be systematically higher. If two conditions systematically follow each other, such that they are measured with a positive covariance, their dissimilarity will be systematically lower than that of two conditions measured with independent noise. These biases can translate into biases in model selection [cai2019]. To avoid such errors, we can remove the bias in the estimation of the distance using cross-validation. This approach has the substantial advantage that unequal estimation variances across conditions can no longer bias model decisions. However, removing the bias of the distance estimates comes at a cost: the variance of the unbiased distance estimates is slightly higher than that of the biased distance estimates, by a factor of . Thus, when using unbiased estimates, a large number of independent partitions () is desirable.

Second, we show that dissimilarity estimates within an RDM are systematically correlated with each other. In a previous paper, we showed that model selection using the RDM cosine similarity or RDM correlation was less accurate than PCM [Diedrichsen2017]. Here we show that taking the covariance structure of the dissimilarity estimates into account improves our power to adjudicate between models. This improvement can even be achieved when using the covariance structure predicted under the assumption that all true distances are zero, which dramatically simplifies the procedure by avoiding iterative calculation of the covariance matrix. The power achieved with the whitened RDM cosine similarity (Eq. 12) or the whitened RDM Pearson correlation (Eq. 13) is close to the theoretical optimum, as achieved with the likelihood-ratio test implemented by PCM.

Taken together, these two insights suggest the use of unbiased distance estimates combined with the whitened cosine similarity to compare RDMs. We call this new approach the unbiased distance correlation, as it has important connections to the distance correlation [Szekely2007] (as well as the linear CKA [kornblith2019, Cristianini2006]), but extends these two traditional approaches by removing the biasing influence of measurement noise by using a crossvalidated estimate of the distances (Eq. 6) or second moment matrix (Eq. 22).

One important feature of the unbiased distance correlation is that it has an expected value of zero if and only if all pairs of patterns that systematically differ from each other in the data are predicted to be identical (i.e. have a distance of zero) in the model. That is, an unbiased distance correlation of zero shows that the representations in the data and model do not have any shared information. In contrast, the traditional RDM Pearson correlation will be zero when there is no systematic relationship between measured and predicted dissimilarities. As a consequence, the values of the unbiased distance correlation tend to be substantially higher, often relatively close to 1 for all models. To obtain a baseline that is equivalent to a Pearson correlation of zero, we therefore recommend comparing all results to the unbiased distance correlation obtained with a null model that predicts all conditions to be equally distant from each other. This approach combines the interpretability of the Pearson correlation with the increased power of the unbiased distance correlation.

When should the new criterion for RDM model comparison be used? The optimal method of course always depends on the data and models that need to be compared (Figure 8). The first decision is whether the models are meant to predict the dissimilarities quantitatively (ratio scale) or only their ranks (ordinal scale). Quantitative predictions can often be derived if we have an explicit model of the shape of the underlying activity profiles. The distribution of activity profiles may also be predicted from activities in an artificial neural network model [Khaligh-Razavi2014], directly from perceptual judgements [Sormaz2016], or the statistics of external training data [Ejaz2015]. In other cases, the model may only predict the rank ordering of the dissimilarities, but not by how much one dissimilarity is larger than another. In such cases, rank correlations are most appropriate [nili2014]. While this approach can be statistically less powerful [Diedrichsen2017], it is robust against any possible monotonic transformation of the dissimilarities.

The next decision is whether the activity patterns can be estimated independently and with approximately the same variance across all partitions. If this cannot be assumed with confidence, then crossvalidated, unbiased distance estimates should be used. This is important because the bias on the standard Euclidean or Mahalanobis distances will be structured if the noise is not i.i.d., such that the model comparison will be biased. Even in situations in which the measured activity-pattern estimates can be assumed to be i.i.d., the unbiased estimation approach can be more powerful than using the biased estimates and Pearson's correlation. This is because the meaningful zero point (which indicates that there is no pattern difference) can help distinguish models. This advantage of unbiased distance estimates can outweigh the disadvantage of the increased variance compared to biased distance estimates. Which approach is better depends on the number of partitions, the signal-to-noise ratio, the experimental conditions, and the structure of the models (Fig. 4). Overall, however, the increased robustness against violations of the noise assumptions will generally outweigh the cost of increased variance, especially if the number of partitions is large.

Whether biased or unbiased distance estimates are used, RDMs should always be compared using whitened RDM correlations or cosine similarities. These measures often perform substantially better, and never worse, than standard approaches. Because we can approximate the true covariance structure well using the covariance structure under the assumption that there is no true signal, the approach can be implemented in a computationally efficient manner. Thus, we do not see any reason to use the standard Pearson correlation or cosine similarity for RDM comparison.

In many applications, we want to fit and evaluate flexible models, where the vector of predicted dissimilarities depends on a vector of parameters . In some cases the predicted RDM is a weighted sum of different model components, for example the different layers of a deep neural network [Khaligh-Razavi2014, Khaligh2017]. In other cases, the predicted distances are a non-linear function of the model parameters [Diedrichsen2017], for example when parameterizing the width of a tuning function in a population-receptive-field model [Dumoulin2016]. In all of these cases, we can take into account the covariance of the dissimilarity estimates by minimizing the following loss function:

(23) |

Equivalently, we can prewhiten the estimation error, by applying the same linear transform to both the estimated distances and the model prediction by , and then use standard least squares approaches.
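For the linear case, in which the predicted RDM is a weighted sum of component RDMs, the prewhitening route can be sketched as follows. The sketch assumes a generalized-least-squares loss of the form (d̂ − Xw)ᵀ V⁻¹ (d̂ − Xw), with the component RDMs in the columns of X. It places no non-negativity constraint on the weights, and the names `inverse_sqrt` and `fit_weighted_rdm` are hypothetical.

```python
import numpy as np

def inverse_sqrt(V):
    """Symmetric inverse square root of a positive-definite matrix."""
    evals, evecs = np.linalg.eigh(V)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def fit_weighted_rdm(d_hat, components, V):
    """Weights w minimizing (d_hat - X w)^T V^{-1} (d_hat - X w), with X = components."""
    W = inverse_sqrt(V)
    # Apply the same whitening transform to data and model, then ordinary least squares
    weights, *_ = np.linalg.lstsq(W @ components, W @ d_hat, rcond=None)
    return weights

rng = np.random.default_rng(0)
n_dist = 6                                    # e.g. 4 conditions -> 6 pairwise distances
components = rng.uniform(0.5, 2.0, size=(n_dist, 2))
w_true = np.array([0.7, 0.3])
A = rng.standard_normal((n_dist, n_dist))
V = A @ A.T + n_dist * np.eye(n_dist)         # arbitrary positive-definite covariance

d_hat = components @ w_true                   # noise-free "measured" distances
print(fit_weighted_rdm(d_hat, components, V))
```

For non-linear models, the same whitening transform can be applied before passing the residuals to a generic non-linear least-squares routine.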

Taken together, we believe that the unbiased distance correlation provides an important new measure for RDM comparison that should become standard for many applications. The new measure extends the linear centered kernel alignment (CKA) [Cristianini2006, kornblith2019] and the distance correlation [Szekely2007] by removing the biasing influence of measurement noise. Furthermore, it provides a way of incorporating a known noise covariance of the data for optimal inference. In the quadratic form in Eq. 20, the term can simply be replaced with , ensuring that uneven measurement noise across conditions is taken into account during model comparison.

The unbiased distance correlation and whitened RDM Pearson correlation have been implemented in a new Python-based RSA toolbox, released by the team of authors [Schuett2020]. We hope that the results presented in this paper, together with the accessible implementation, will accelerate the adoption of what we consider to be current best practice in RSA.

## Methods

### Extended definitions

To derive the mean and full variance-covariance matrix of the distance estimates, it is useful to make some more general definitions and assumptions. We assume that each measured activity profile (column of ) has covariance between conditions. For fMRI, this correlation structure is caused by the sluggish nature of the hemodynamic response, as well as by the low-frequency noise inherent to the measurements. A reasonable estimate of can be derived directly from the first-level linear model (Eq. 1, [Friston1995]). For other modalities, it may be reasonable to assume independence of measurements.

We also assume that each measured activity pattern (row of ) has a covariance of between channels or voxels. The variability of fMRI, EEG, and MEG measurements clearly shows substantial spatial structure. Again, this noise structure can usually be estimated from the residuals of the first-level linear model (see Mahalanobis distance). To remove the redundancy of and in terms of the overall scaling of the noise, we restrict .

For the derivation of the full covariance matrix, we need to make the slightly more restrictive assumption that has a matrix-normal distribution across partitions . While this assumption is reasonable for fMRI data, it is recommended to apply the square-root transform to neuronal spiking data to make it conform to the normal assumption [Yu2009].

To derive distances between conditions in a matrix notation, we define a contrast matrix . Each row of this matrix contains a 1 and a −1 for the two conditions that are contrasted in the distance; all other entries are 0. The product then results in a matrix that contains the pattern differences in its rows. We define:

(24) |

The diagonal of contains the squared distances (divisively normalized by the number of channels). On the basis of , we can also define the variance-covariance matrix of the pattern-difference estimates ():

(25) |
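The construction of the contrast matrix and of the resulting distances can be made concrete with a short sketch. It assumes that the true patterns are stored as a conditions × channels array, and that the distances are the diagonal of the inner-product matrix of the pattern differences, divided by the number of channels. The variable names are hypothetical.

```python
import itertools
import numpy as np

def contrast_matrix(n_cond):
    """One row per condition pair: +1 and -1 for the contrasted conditions, 0 elsewhere."""
    pairs = list(itertools.combinations(range(n_cond), 2))
    C = np.zeros((len(pairs), n_cond))
    for row, (i, j) in enumerate(pairs):
        C[row, i], C[row, j] = 1.0, -1.0
    return C

rng = np.random.default_rng(0)
K, P = 4, 10
U = rng.standard_normal((K, P))      # patterns: conditions x channels

C = contrast_matrix(K)               # (K(K-1)/2) x K
delta = C @ U                        # pattern differences in rows
inner = delta @ delta.T / P          # inner products of all pattern differences
d = np.diag(inner)                   # squared distances, normalized by number of channels
print(d)
```

The off-diagonal elements of `inner` are non-zero whenever two pattern differences share a condition or point in similar directions, which is the source of the dependence between distance estimates.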

### Bias of distance estimates

Eq. 5 can be derived by expressing the estimated pattern difference () as the sum of the true pattern-difference vectors and the measurement noise (). By substituting this into Eq. 4 and taking the expected value (), it is straightforward to show that the distance estimator is positively biased:

(26) |

The bias arises by multiplying the noise with itself (i.e. for the cases of ). For all other cases (), the noise terms are independent, and the expected value of their product is zero.
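A small simulation makes the origin of the bias tangible. The sketch below assumes two conditions with identical true patterns (a true distance of zero), independent noise with variance `sigma2` on every estimate, and distances normalized by the number of channels. Under these assumptions the biased estimate has expected value `2 * sigma2 / M`, whereas the crossvalidated estimate, which only multiplies differences from distinct partitions, has expected value zero. All names are hypothetical.

```python
import numpy as np

def biased_distance(delta):
    """delta: (M, P) pattern-difference estimates, one row per partition."""
    mean_delta = delta.mean(axis=0)
    return mean_delta @ mean_delta / delta.shape[1]

def crossvalidated_distance(delta):
    """Average inner product over all ordered pairs of distinct partitions."""
    M, P = delta.shape
    total = delta.sum(axis=0)
    cross = total @ total - np.sum(delta * delta)  # removes same-partition products
    return cross / (M * (M - 1) * P)

rng = np.random.default_rng(1)
M, P, sigma2, n_sim = 5, 100, 1.0, 2000
biased, unbiased = [], []
for _ in range(n_sim):
    # Two conditions with identical true patterns: only noise differs
    noise = rng.normal(scale=np.sqrt(sigma2), size=(M, 2, P))
    delta = noise[:, 0] - noise[:, 1]
    biased.append(biased_distance(delta))
    unbiased.append(crossvalidated_distance(delta))

mean_biased, mean_unbiased = np.mean(biased), np.mean(unbiased)
expected_bias = 2 * sigma2 / M
print(mean_biased, expected_bias, mean_unbiased)
```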

### Variance of distance estimates

An analytical expression for the variance-covariance matrix of the vector of distance estimates can be derived using the following general result (see Appendix B1, B2 for details). If the matrix has matrix normal distribution , then the diagonal of has the expected value and variance:

(27) | |||||

(28) |

where is the element-by-element multiplication of two matrices. When setting to the mean of the pattern differences across partitions, we can easily derive the variance of the biased distance estimate (Eq. 4).

(29) | |||||

(30) |

The expression for the variance of the unbiased estimate of the distance (Eq. 6) can be derived by taking the covariance of all pairs of partitions into account (see Appendix B3).

(31) |

Intuitively the variances of distance estimates come from the product of signal and noise (averaged over partitions) and product of noise with noise (averaged over pairs of partitions for the biased distance estimate and over pairs of partitions for the unbiased estimate).
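The signal-by-noise and noise-by-noise contributions can be checked by Monte Carlo simulation for a generic quadratic form. The sketch assumes a random matrix `A` with mean `Delta`, covariance `Sigma` between rows, and independent unit-variance columns, and it omits the normalization by the number of channels used in the text. Under these assumptions, standard results for products of normal variables give a mean of `diag(Delta Delta^T) + P diag(Sigma)` and a covariance of `4 (Delta Delta^T) ∘ Sigma + 2 P Sigma ∘ Sigma` for the diagonal of `A A^T`. The notation is chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows, P, n_sim = 3, 8, 20000

Delta = rng.standard_normal((n_rows, P))              # assumed mean (signal)
L = rng.standard_normal((n_rows, n_rows))
Sigma = L @ L.T / n_rows + 0.5 * np.eye(n_rows)       # assumed covariance between rows
chol = np.linalg.cholesky(Sigma)

# Draw n_sim matrices with correlated rows and independent columns
noise = np.einsum('ij,njp->nip', chol, rng.standard_normal((n_sim, n_rows, P)))
A = Delta + noise
q = np.einsum('nip,nip->ni', A, A)                    # diag(A A^T) for each draw

mean_theory = np.sum(Delta ** 2, axis=1) + P * np.diag(Sigma)
cov_theory = 4 * (Delta @ Delta.T) * Sigma + 2 * P * Sigma * Sigma  # '*' is element-wise

mean_mc = q.mean(axis=0)
cov_mc = np.cov(q, rowvar=False)
print(np.round(cov_theory, 2))
print(np.round(cov_mc, 2))
```

The first term of `cov_theory` is the signal-by-noise contribution and vanishes when all true pattern differences are zero. The second term is the noise-by-noise contribution.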

### Spatial pre-whitening and Mahalanobis distances

In the Results section, we focus on biased and unbiased estimates of the Euclidean distance. Previous work [Walther2016], however, demonstrates clearly that taking into account the spatial covariance structure of fMRI noise () can dramatically increase the reliability of distance estimates.

In the simplest case, we ignore the correlation between voxels and simply divide the activity estimates for each voxel by the square root of the diagonal elements of . This step already prevents noisy voxels from unduly influencing the distance estimate. Additionally, we can use multivariate pre-whitening, i.e. post-multiplication of . This step gives less weight to the information contained in two voxels that are highly correlated in their random variability than to information contained in two uncorrelated voxels. Calculating Euclidean distances on multivariate pre-whitened data is equivalent to calculating a Mahalanobis distance.

In practice, we do not have access to the voxel-by-voxel covariance matrix. However, we can use the residuals from the first-level general linear model to derive an estimate,

(32) |

where K is the number of regressors of interest per partition. Oftentimes, we have the case that , which renders the estimate non-invertible. Even with , it is usually prudent to regularize the estimate, as it stabilizes the distance estimates. A practical way of doing this is to shrink the estimate of the covariance matrix to a diagonal version of the raw sample estimate:

(33) |

The scalar h determines the amount of shrinkage, with 0 corresponding to no shrinkage and 1 to only using the diagonal of the matrix (univariate prewhitening). Estimation methods for the optimal shrinkage coefficient have been proposed [Ledoit2004], but in practice values in the range of perform well for fMRI data. The estimate is then used to obtain a spatially prewhitened version of :

(34) |
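The shrinkage and prewhitening steps can be sketched as follows. The example assumes that the covariance is estimated from a residual matrix (time points × channels), normalized here simply by the number of rows rather than by the degrees of freedom of the first-level model. The shrinkage value `h=0.4` is an arbitrary illustrative choice, not a recommendation from the text, and the function names are hypothetical.

```python
import numpy as np

def shrink_covariance(S, h):
    """Shrink a sample covariance toward its diagonal (h=0: none, h=1: diagonal only)."""
    return h * np.diag(np.diag(S)) + (1.0 - h) * S

def prewhiten(U_hat, Sigma_reg):
    """Post-multiply patterns (conditions x channels) by Sigma_reg^{-1/2}."""
    evals, evecs = np.linalg.eigh(Sigma_reg)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return U_hat @ W

rng = np.random.default_rng(3)
P, n_resid = 20, 15                     # fewer residual rows than channels
R = rng.standard_normal((n_resid, P))   # stand-in for first-level residuals
S = R.T @ R / n_resid                   # rank-deficient sample covariance

Sigma_reg = shrink_covariance(S, h=0.4) # illustrative shrinkage value (assumption)
U_hat = rng.standard_normal((5, P))
U_white = prewhiten(U_hat, Sigma_reg)

print(np.linalg.matrix_rank(S), np.linalg.eigvalsh(Sigma_reg).min())
```

With fewer residual rows than channels the raw estimate `S` is singular, whereas the shrunk estimate is positive definite for any `h > 0` as long as all channel variances are positive.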

Biased and unbiased estimates of the Mahalanobis distance can then be calculated via Eq. 5, 6, using instead of . To obtain a full expression of the variance-covariance matrix of this distance, we need to know the mean and covariance of the pre-whitened data. If whitening worked perfectly, the data would be independent across voxels. However, given that we operate with an estimate of , this is not the case. Rather, the pre-whitened data will have matrix-normal distribution

(35) | |||||

(36) |

The covariance matrix of these distance estimates is given by Eq. 29, 31, with replaced by , and with . The covariance structure under the assumption that , however, will be the same as for the Euclidean distance. Therefore, the whitened RDM correlation and the unbiased distance correlation can be used equivalently for the biased and unbiased estimates of the Mahalanobis distance.

### Simulations

To evaluate different ways of comparing RDMs, we conducted a range of simulations, each with a known ground truth. For the results shown in Figure 4, we used 2 simple models for each simulation, each predicting the dissimilarity patterns between 4 conditions. In each simulation run, we generated artificial data from Eq. 1 for 2-12 partitions from one of the models. The variance of the noise of the simulation was set to be proportional to the number of partitions, such that the variance of the average activity patterns was always constant. The noise was assumed to be independent across the voxels. We then computed either biased or unbiased distance estimates, and finally compared the simulated RDM to the two candidate models using different criteria. We then counted how often each method decided for the true (i.e. data-generating) model.
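The logic of such a model-selection simulation can be sketched in a few lines. The example is a simplified stand-in for the simulations described above, not a reproduction of them. It assumes two arbitrary random models with 4 conditions, patterns generated directly rather than through a first-level model, crossvalidated distances, and the whitened cosine similarity with the null covariance as the comparison criterion. All function names and parameter values are hypothetical.

```python
import itertools
import numpy as np

def contrast_matrix(n_cond):
    pairs = list(itertools.combinations(range(n_cond), 2))
    C = np.zeros((len(pairs), n_cond))
    for row, (i, j) in enumerate(pairs):
        C[row, i], C[row, j] = 1.0, -1.0
    return C

def crossvalidated_distances(U_parts, C):
    """Average inner product of pattern differences over pairs of distinct partitions."""
    M, K, P = U_parts.shape
    delta = np.einsum('dk,mkp->mdp', C, U_parts)
    total = delta.sum(axis=0)
    cross = np.sum(total * total, axis=1) - np.einsum('mdp,mdp->d', delta, delta)
    return cross / (M * (M - 1) * P)

def whitened_cosine(d1, d2, V_inv):
    return (d1 @ V_inv @ d2) / np.sqrt((d1 @ V_inv @ d1) * (d2 @ V_inv @ d2))

def model_selection_accuracy(U_true, U_alt, n_part, base_sd, n_sim, rng):
    """Fraction of simulated data sets for which the data-generating model wins."""
    K, P = U_true.shape
    C = contrast_matrix(K)
    V_inv = np.linalg.inv((C @ C.T) ** 2)      # assumed null covariance, up to a constant
    d_true = np.sum((C @ U_true) ** 2, axis=1) / P
    d_alt = np.sum((C @ U_alt) ** 2, axis=1) / P
    noise_sd = base_sd * np.sqrt(n_part)       # keeps the variance of the mean pattern constant
    n_correct = 0
    for _ in range(n_sim):
        data = U_true + rng.normal(scale=noise_sd, size=(n_part, K, P))
        d_hat = crossvalidated_distances(data, C)
        if whitened_cosine(d_hat, d_true, V_inv) > whitened_cosine(d_hat, d_alt, V_inv):
            n_correct += 1
    return n_correct / n_sim

rng = np.random.default_rng(4)
U_a = rng.standard_normal((4, 20))
U_b = rng.standard_normal((4, 20))
acc_low_noise = model_selection_accuracy(U_a, U_b, n_part=8, base_sd=0.01, n_sim=200, rng=rng)
acc_high_noise = model_selection_accuracy(U_a, U_b, n_part=8, base_sd=3.0, n_sim=200, rng=rng)
print(acc_low_noise, acc_high_noise)
```

Swapping `crossvalidated_distances` for a biased estimator, or `whitened_cosine` for a plain correlation, gives the other combinations of distance estimate and comparison criterion.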

For the simulations shown in Figure 6, we used three example experiments from published fMRI studies. The first two examples come from a paper investigating the representational structure of finger movements in primary motor and sensory cortex [Ejaz2015]. In Experiment 1, the activity patterns for K=5 fingers were measured. The resultant RDM was then compared to two models, one that predicts the similarity structure based on the natural statistics of movement, the other that predicts the structure based on the similarity of muscle activity patterns. The RDM correlation between the two models was relatively high ().

The second example comes from Experiment 3 in the same paper, this time looking at 31 different finger movements, which span the whole space of possible “piano-chord” combinations. Again, the predictions of the natural statistics and muscle activity model were compared.

The third example uses an experiment investigating the response of the human inferior temporal cortex to 96 images, including animate and inanimate objects [Khaligh-Razavi2014]. The model predictions are derived from a convolutional deep neural network model – with each of the 7 layers providing a separate representational model. The bitmap images were presented to the deep neural network and the internal activity patterns used as representational models.

All data sets were simulated with 8 runs, 160 voxels, and independent noise on the observations. The noise variance was set to . We first normalized the model predictions, such that the norm of the predicted squared Euclidean distances was 1. We then varied the strength of the signal systematically from 0 (pure noise data) to a level that achieved reasonably high accuracy. We generated 3,000 data sets for each experiment, parameter setting, and model. For Experiment 3, where there were 7 alternative models, we generated data sets from each of the models. We then decided whether the data were better fit by the data-generating model or by one of the alternative models. Accuracy was then averaged over all possible model pairs. Thus, for all 3 experiments, chance performance was at 0.5.

## References

## Appendix A Symbol Table

| Symbol | Size | Meaning |
|---|---|---|
| | 1 | Number of conditions or columns in design matrix |
| | 1 | Number of channels or voxels |
| | 1 | Number of independent data partitions |
| | 1 | Number of distances, usually |
| | 1 | Number of activity measurements for partition |
| | | Activity measurements for partition |
| | | Design matrix for partition |
| | | True activity patterns |
| | | True activity patterns for condition |
| | | Estimated activity patterns for partition |
| | | Variance-covariance matrix of across conditions |
| | | Variance-covariance matrix of across channels or voxels |
| | | True pattern difference for condition pair |
| | | True distance for condition pair |
| | | Vector of all pairwise distances |
| | | Biased distance estimates |
| | | Unbiased distance estimates |
| | | Matrix of inner products of all pattern differences |
| | | Variance-covariance matrix of all estimated pattern differences |
| | | Variance-covariance matrix of distance estimates |

Table 1. Table of symbol sizes and meanings. Size is given in number of rows × number of columns. For consistency of notation, vectors are defined to be in either row or column orientation.

## Appendix B Derivation of variance-covariance matrix of the distance estimate

### B.1 Expectations of products of normal random variables

The variance of distance estimates can be derived from the basic expectations of products of normally distributed variables. If are jointly normally distributed variables, then we have the following general expectations:

(37) |

### B.2 Expectations of the product of normal matrices

From Eq. 37, we can derive the basic expectations on the product of normal matrices. Let us assume that vector has multi-variate normal distribution with mean and variance-covariance matrix . To derive Eq. 29 and 31, we require the following results for the outer product . The mean is given by

(38) |

The variance-covariance matrix of the diagonal of is

(39) |

These results can be easily extended to the distribution of the matrix product , where is a random matrix with independent normally-distributed columns, i.e. with matrix normal distribution .

(40) |

The full variance-covariance matrix of the diagonal of is

(41) |

Finally, we need to generalize these results to a situation where the columns of are not independent, but have element-wise covariance of . Thus, we are interested in the joint distribution of the elements of the quadratic form , where has matrix normal distribution .

(42) | |||||

(43) |

From this result, we can obtain Eq. 29 by considering that the mean of across partitions has variance .

### B.3 Averaging across partitions

To derive the variance of the unbiased distances, we need to take into account the averaging of the estimated distances across the M different cross-validation folds. While data from different partitions can be assumed to be independent, the inner products across cross-validation folds are not. This is because the partitions from one cross-validation fold are included again in other folds. The two pattern differences that enter the product in Eq. 6 come from a single partition (that is, ), or from the set of all other partitions (that is, ), which we will denote here in short by .

As a shorthand for the covariance between difference estimates and that are based on the set of partitions and , we introduce the symbol

(44) |

This is the covariance for each individual voxel. We now exploit the bilinearity of the covariance operator, that is,

(45) |

to obtain the following general result:

(46) |

where

(47) |

and

(48) |

This is the most general expression of the variance of the unbiased distance, which can even be used when the covariance structure of different partitions () differs from each other (see Appendix C).

For the case in which the difference estimates from all M partitions can be assumed to have the same covariance, that is, , we can simplify the expression dramatically. In this instance the best estimate of is the average of all partitions except m:

(49) |

Accordingly, for , we have

(50) |


Substituting the elements of the appropriate representations of into Eq. 47, 48 and summing up, we have

(51) |

and

(52) |

Finally, on writing the desired complete covariance matrix using element-by-element multiplication, we obtain the result given in Eq. 31.

## Appendix C Unbalanced designs

In unbalanced designs, the noise covariance differs across partitions, i.e. each partition has its own covariance matrix . In the calculation of the distances, this ideally should be taken into account. To simplify this problem, we here assume that all signals are zero, as we did for the derivation of the unbiased distance correlation.

Given the zero-mean assumption and independent runs, the product of patterns from one pair of partitions is uncorrelated with the product of two patterns from any other pair of partitions (Eq. 37). Thus, for both biased and unbiased distance estimates, the optimal pooling of the estimates from the pairs is their precision-weighted average.

The distance estimate from a single pair of partitions has the following expected value () and covariance ():

If :

(53) |

(54) |

If :

(55) |

(56) |

Using these formulas we derive an optimal unbiased estimate using precision weighting:

(57) |

The covariance matrix of this combined estimate is then:

(58) |
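The precision weighting itself follows the generic recipe for combining uncorrelated estimates of the same quantity, sketched below. The example assumes that each pair of partitions supplies a distance vector together with its covariance matrix. It is a generic illustration and has not been checked term by term against the expressions above. The function name is hypothetical.

```python
import numpy as np

def precision_weighted_average(estimates, covariances):
    """Combine uncorrelated estimates, weighting each by its inverse covariance."""
    precisions = [np.linalg.inv(V) for V in covariances]
    total_precision = np.sum(precisions, axis=0)
    combined_cov = np.linalg.inv(total_precision)
    weighted_sum = np.sum([Pm @ d for Pm, d in zip(precisions, estimates)], axis=0)
    return combined_cov @ weighted_sum, combined_cov

# Two estimates of a 3-element distance vector, the second four times noisier
d_a = np.array([1.0, 2.0, 3.0])
d_b = np.array([2.0, 2.0, 1.0])
V_a = np.eye(3)
V_b = 4.0 * np.eye(3)

d_comb, V_comb = precision_weighted_average([d_a, d_b], [V_a, V_b])
print(d_comb, np.diag(V_comb))
```

Estimates with a large covariance contribute little to the pooled value, and the covariance of the pooled estimate is the inverse of the summed precisions.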
