
Grounding Representation Similarity with Statistical Testing

To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar representations. These disagreements raise the question: which, if any, of these dissimilarity measures should we believe? We provide a framework to ground this question through a concrete test: measures should have sensitivity to changes that affect functional behavior, and specificity against changes that do not. We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement.



1 Introduction

Understanding neural networks is not only scientifically interesting, but critical for applying deep networks in high-stakes situations. Recent work has highlighted the value of analyzing not just the final outputs of a network, but also its intermediate representations (Li et al., 2015; Raghu et al., 2019). This has motivated the development of representation similarity measures, which can provide insight into how different training schemes, architectures, and datasets affect networks’ learned representations.

A number of similarity measures have been proposed, including centered kernel alignment (CKA) (Kornblith et al., 2019), ones based on canonical correlation analysis (CCA) (Raghu et al., 2017; Morcos et al., 2018), single neuron alignment (Li et al., 2015), vector space alignment (Arora et al., 2017; Smith et al., 2017; Conneau et al., 2018), and others (Laakso and Cottrell, 2000; Wang et al., 2018b; Liang et al., 2019; Lenc and Vedaldi, 2015; Alain and Bengio, 2018; Feng et al., 2020). Unfortunately, these different measures tell different stories. For instance, CKA and projection weighted CCA disagree on which layers of different networks are most similar (Kornblith et al., 2019). This lack of consensus is worrying, as measures are often designed according to different and incompatible intuitive desiderata, such as whether finding a one-to-one assignment, or finding few-to-one mappings, between neurons is more appropriate (Li et al., 2015). As a community, we need well-chosen formal criteria for evaluating metrics to avoid over-reliance on intuition and the pitfalls of too many researcher degrees of freedom (Leavitt and Morcos, 2020).

In this paper we view representation dissimilarity measures as implicitly answering a classification question: whether two representations are essentially similar or importantly different. Thus, in analogy to statistical testing, we can evaluate them based on their sensitivity to important changes and specificity (non-responsiveness) against unimportant changes or noise.

As a warm-up, we first consider two intuitive criteria: first, that metrics should have specificity against random initialization; and second, that they should be sensitive to deleting important principal components (those that affect probing accuracy). Unfortunately, popular metrics fail at least one of these two tests. CCA is not specific: random initialization noise overwhelms differences between even far-apart layers in a network (Section 3.1). CKA, on the other hand, is not sensitive, failing to detect changes in all but the top principal components of a representation (Section 3.2).

We next construct quantitative benchmarks to evaluate a dissimilarity measure’s quality. To move beyond our intuitive criteria, we need a ground truth. For this we turn to the functional behavior of the representations we are comparing, measured through probing accuracy (an indicator of syntactic information) (Belinkov et al., 2017; Peters et al., 2018; Tenney et al., 2019) and out-of-distribution performance of the model they belong to (Naik et al., 2018; McCoy et al., 2020; D’Amour et al., 2020). We then score dissimilarity measures based on their rank correlation with these measured functional differences. Overall our benchmarks contain examples and vary representations across several axes including random seed, layer depth, and low-rank approximation (Section 4).¹

¹ Code to replicate our results can be found at

Our benchmarks confirm our two intuitive observations: on subtasks that consider layer depth and principal component deletion, we measure the rank correlation with probing accuracy and find CCA and CKA lacking, as the previous warm-up experiments suggested. Meanwhile, the Orthogonal Procrustes distance, a classical but often overlooked² dissimilarity measure, balances gracefully between CKA and CCA and consistently performs well. This underscores the need for systematic evaluation; otherwise, we may fall prey to a recency bias that undervalues classical baselines.

² For instance, Raghu et al. (2017) and Morcos et al. (2018) do not mention it, and Kornblith et al. (2019) relegates it to the appendix; although Smith et al. (2017) does use it to analyze word embeddings and prefers it to CCA.

Other subtasks measure correlation with OOD accuracy, motivated by the observation that random initialization sometimes has large effects on OOD performance (McCoy et al., 2020). We find that dissimilarity measures can sometimes predict OOD performance using only the in-distribution representations, but we also identify a challenge set on which none of the measures do statistically better than chance. We hope this challenge set will help measure and spur progress in the future.

2 Problem Setup: Metrics and Models

Our goal is to quantify the similarity between two different groups of neurons (usually layers). We do this by comparing how their activations behave on the same dataset. Thus for a layer with $p_1$ neurons, we define $A \in \mathbb{R}^{p_1 \times n}$, the matrix of activations of the $p_1$ neurons on $n$ data points, to be that layer’s raw representation of the data. Similarly, let $B \in \mathbb{R}^{p_2 \times n}$ be a matrix of the activations of $p_2$ neurons on the same $n$ data points. We center and normalize these representations before computing dissimilarity, per standard practice. Specifically, for a raw representation $A$ we first subtract the mean value from each column, then divide by the Frobenius norm, to produce the normalized representation $A^*$, used in all our dissimilarity computations. In this work we study dissimilarity measures $d(A^*, B^*)$ that allow for quantitative comparisons of representations both within and across different networks. We colloquially refer to values of $d(A^*, B^*)$ as distances, although they do not necessarily satisfy the triangle inequality required of a proper metric.
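The centering and normalization step can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' released code: the function name `normalize_representation` and the random test data are hypothetical, and the sketch assumes activations are stored with one row per neuron and one column per data point, so that centering removes each neuron's mean over the data points (axis 1). If activations are stored the other way around, the mean would be taken along axis 0 instead.

```python
import numpy as np

def normalize_representation(raw: np.ndarray) -> np.ndarray:
    """Center each neuron over the data points, then scale to unit Frobenius norm.

    Assumes `raw` has shape (neurons, data_points); this layout is an assumption.
    """
    centered = raw - raw.mean(axis=1, keepdims=True)
    # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
    return centered / np.linalg.norm(centered)

rng = np.random.default_rng(0)
A_raw = rng.normal(loc=3.0, size=(8, 100))   # hypothetical raw activations
A_star = normalize_representation(A_raw)
```

After this step every representation has unit Frobenius norm, so differences in overall activation scale between layers do not affect the dissimilarity values.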

We study five dissimilarity measures: centered kernel alignment (CKA), three measures derived from canonical correlation analysis (CCA), and a measure derived from the orthogonal Procrustes problem. As argued in Kornblith et al. (2019), similarity measures should be invariant to left orthogonal transformations to accommodate the symmetries of neural networks, and all five measures satisfy this requirement.

Centered kernel alignment (CKA) uses an inner product to quantify similarity between two representations. It is based on the idea that one can first choose a kernel, compute the kernel matrix for each representation, and then measure similarity as the alignment between these two kernel matrices. The measure of similarity thus depends on one’s choice of kernel; in this work we consider Linear CKA:

$$d_{\text{Linear CKA}}(A, B) = 1 - \frac{\|A B^\top\|_F^2}{\|A A^\top\|_F \, \|B B^\top\|_F},$$

as proposed in Kornblith et al. (2019). Other choices of kernel are also valid; we focus on Linear CKA here since Kornblith et al. (2019) report similar results from using either a linear or RBF kernel.
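As a rough illustration of Linear CKA as a distance, the sketch below computes one minus the linear CKA similarity for centered representations and checks numerically that mixing the neurons with an orthogonal matrix leaves the distance at zero. The function names, the random data, and the neurons-by-data-points layout are assumptions made for this example.

```python
import numpy as np

def center(X):
    """Remove each neuron's mean over data points (rows = neurons; assumed layout)."""
    return X - X.mean(axis=1, keepdims=True)

def linear_cka_distance(A, B):
    """1 - Linear CKA for centered A (p1 x n) and B (p2 x n)."""
    cross = np.linalg.norm(A @ B.T, ord="fro") ** 2
    scale = np.linalg.norm(A @ A.T, ord="fro") * np.linalg.norm(B @ B.T, ord="fro")
    return 1.0 - cross / scale

rng = np.random.default_rng(0)
A = center(rng.normal(size=(6, 200)))
B = center(rng.normal(size=(4, 200)))

# A random orthogonal matrix acting on the left (it mixes neurons).
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))

d_same = linear_cka_distance(A, Q @ A)   # rotated copy of the same representation
d_diff = linear_cka_distance(A, B)       # unrelated random representation
```

Note that the two representations may have different numbers of neurons, since only the products over the shared data-point axis enter the formula.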

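Canonical correlation analysis (CCA) finds orthogonal bases ($w_A^i$, $w_B^i$) for two matrices such that after projection onto $w_A^i$, $w_B^i$, the projected matrices have maximally correlated rows. For $1 \le i \le p_1$, the $i$th canonical correlation coefficient $\rho_i$ is computed as follows:

$$\rho_i = \max_{w_A^i, w_B^i} \frac{\langle w_A^{i\top} A, \, w_B^{i\top} B \rangle}{\|w_A^{i\top} A\| \, \|w_B^{i\top} B\|} \quad \text{s.t.} \quad \langle w_A^{i\top} A, \, w_A^{j\top} A \rangle = 0, \;\; \langle w_B^{i\top} B, \, w_B^{j\top} B \rangle = 0, \;\; \forall j < i.$$

To transform the vector of correlation coefficients into a scalar measure, two options considered previously (Kornblith et al., 2019) are the mean correlation coefficient, $\bar{\rho}_{\text{CCA}}$, and the mean squared correlation coefficient, $R^2_{\text{CCA}}$, defined as follows:

$$d_{\bar{\rho}_{\text{CCA}}}(A, B) = 1 - \frac{1}{p_1} \sum_{i=1}^{p_1} \rho_i, \qquad d_{R^2_{\text{CCA}}}(A, B) = 1 - \frac{1}{p_1} \sum_{i=1}^{p_1} \rho_i^2.$$

To improve the robustness of CCA, Morcos et al. (2018) propose projection-weighted CCA (PWCCA) as another scalar summary of CCA:

$$d_{\text{PWCCA}}(A, B) = 1 - \frac{\sum_{i=1}^{p_1} \alpha_i \rho_i}{\sum_{i=1}^{p_1} \alpha_i}, \qquad \alpha_i = \sum_j |\langle h_i, A_j \rangle|,$$

where $A_j$ is the $j$th row of $A$, and $h_i = w_A^{i\top} A$ is the projection of $A$ onto the $i$th canonical direction. We find that PWCCA performs far better than $\bar{\rho}_{\text{CCA}}$ and $R^2_{\text{CCA}}$, so we focus on PWCCA in the main text, but include results on the other two measures in the appendix.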
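The three CCA-based summaries can be sketched with a QR-plus-SVD computation of the canonical correlations. This is one standard way to compute CCA, not necessarily the implementation used for the reported results. The function name `cca_distances` is hypothetical, and the sketch assumes centered inputs with full row rank, neurons as rows, at most as many neurons in the first representation as in the second, and canonical variables scaled to unit norm when forming the PWCCA weights.

```python
import numpy as np

def cca_distances(A, B):
    """Return (mean-CCA, mean-squared-CCA, PWCCA) distances.

    Assumes A (p1 x n) and B (p2 x n) are centered, have full row rank,
    and satisfy p1 <= p2 < n.
    """
    # Orthonormal bases for the spans of the neurons' activation vectors.
    Qa, _ = np.linalg.qr(A.T)            # shape (n, p1)
    Qb, _ = np.linalg.qr(B.T)            # shape (n, p2)
    # The singular values of Qa^T Qb are the canonical correlations.
    U, rho, _ = np.linalg.svd(Qa.T @ Qb, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)         # guard against round-off
    H = (Qa @ U).T                       # row i is the unit-norm canonical variable h_i
    alpha = np.abs(H @ A.T).sum(axis=1)  # alpha_i = sum_j |<h_i, A_j>|
    d_mean = 1.0 - rho.mean()
    d_sq = 1.0 - (rho ** 2).mean()
    d_pw = 1.0 - (alpha * rho).sum() / alpha.sum()
    return d_mean, d_sq, d_pw

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 300))
A -= A.mean(axis=1, keepdims=True)
B = rng.normal(size=(7, 300))
B -= B.mean(axis=1, keepdims=True)

# A well-conditioned invertible matrix mixing the neurons of A.
M = rng.normal(size=(5, 5)) + 5.0 * np.eye(5)

d_invertible = cca_distances(A, M @ A)   # same span of activation vectors
d_unrelated = cca_distances(A, B)
```

The first call illustrates that every canonical correlation equals one whenever the two representations are related by an invertible linear map, so all three distances vanish.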

The orthogonal Procrustes problem consists of finding the left-rotation of $A$ that is closest to $B$ in Frobenius norm, i.e. solving the optimization problem:

$$\min_{R} \|B - RA\|_F^2 \quad \text{subject to} \quad R^\top R = I.$$

The minimum is the squared orthogonal Procrustes distance between $A$ and $B$, and is equal to

$$d_{\text{Proc}}(A, B) = \|A\|_F^2 + \|B\|_F^2 - 2\|A B^\top\|_*,$$

where $\|\cdot\|_*$ is the nuclear norm (Schönemann, 1966). Unlike the other metrics, the orthogonal Procrustes distance is not normalized between 0 and 1, although for normalized $A^*$, $B^*$ it is bounded, lying in $[0, 2]$.
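The closed form can be checked numerically against an explicitly constructed optimal rotation. The sketch below is illustrative only: the helper names are hypothetical, the data are random, and neurons are assumed to be rows. The rotation is built from the singular value decomposition of $B A^\top$, which is the standard solution of the orthogonal Procrustes problem.

```python
import numpy as np

def normalize(X):
    """Center each neuron over data points and scale to unit Frobenius norm."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.linalg.norm(X)

def procrustes_distance(A, B):
    """Squared orthogonal Procrustes distance via the nuclear-norm closed form."""
    nuc = np.linalg.norm(A @ B.T, ord="nuc")
    return np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2 - 2.0 * nuc

def best_rotation(A, B):
    """Orthogonal R minimizing ||B - R A||_F (needs equal numbers of neurons)."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    return U @ Vt

rng = np.random.default_rng(0)
A = normalize(rng.normal(size=(5, 100)))
B = normalize(rng.normal(size=(5, 100)))

R = best_rotation(A, B)
closed_form = procrustes_distance(A, B)
explicit = np.linalg.norm(B - R @ A) ** 2   # objective value at the optimum
```

The two quantities agree up to round-off, and because both inputs have unit Frobenius norm the value falls between 0 and 2.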

2.1 Models we study

We investigate representations computed by the BERT model family (Devlin et al., 2018) on sentences from the Multigenre Natural Language Inference (MNLI) dataset (Williams et al., 2018). We study BERT models of two sizes: BERT base, with 12 hidden layers of 768 neurons, and BERT medium, with 8 hidden layers of 512 neurons. We use the same architectures as in the open source BERT release,³ but to generate diversity we study 3 variations of these models:

  1. 10 BERT base models pretrained with different random seeds but not finetuned for particular tasks, released by Zhong et al. (2021).⁴

  2. BERT medium models that were initialized from pretrained models released by Zhong et al. (2021), that we further finetuned on MNLI with different finetuning seeds (100 models total).

  3. 100 BERT base models that were initialized from the pretrained BERT model in (Devlin et al., 2018) and finetuned on MNLI with different seeds, released by McCoy et al. (2020).⁵

³ available at

⁴ available at

⁵ available at

Further training details, as well as checks that our training protocol results in models with comparable performance to the original BERT release, can be found in Appendix A.

3 Warm-up: Intuitive Tests for Sensitivity and Specificity

When designing dissimilarity measures, researchers usually consider invariants that these measures should not be sensitive to (Kornblith et al., 2019); for example, symmetries in neural networks imply that permuting the neurons in a fully connected layer does not change the representations learned. We take this one step further and frame dissimilarity measures as answering whether representations are essentially the same, or importantly different. We can then evaluate measures based on whether they respond to important changes (sensitivity) while ignoring changes that don’t matter (specificity).

Assessing sensitivity and specificity requires a ground truth: which representations are truly different? To answer this, we begin with the following two intuitions: 1) neural network representations trained on the same data but from different random initializations are similar, and 2) representations lose crucial information as principal components are deleted. These motivate the following intuitive tests of specificity and sensitivity. We expect a dissimilarity measure to: 1) assign a small distance between architecturally identical neural networks that only differ in initialization seed, and 2) assign a large distance between a representation and the representation after deleting important principal components (enough to affect accuracy). We will see that PWCCA fails the first test (specificity), while CKA fails the second (sensitivity).

Figure 1: PWCCA fails the intuitive specificity test. Top: PWCCA, CKA, and Orthogonal Procrustes pairwise distances between each layer of two differently initialized networks (Model A and B). Bottom: We zoom in to analyze the layer of Model A, plotting this layer’s distance to every other layer in both networks; the dashed line indicates the distance to the corresponding layer in Model B. For PWCCA, none of the distances in model A exceed this line, indicating that random initialization affects this distance more than large changes in layer depth.

3.1 Specificity against changes to random seed

Neural networks with the same architecture trained from different random initializations show many similarities, such as highly correlated predictions on in-distribution data points (McCoy et al., 2020). Thus it seems natural to expect a good similarity measure to assign small distances between architecturally corresponding layers of networks that are identical except for initialization seed.

To check this property, we take two BERT base models pre-trained with different random seeds and, for every layer in the first model, compute its dissimilarity to every layer in both the first and second model. We do this for 5 separate pairs of models and average the results. To pass the intuitive specificity test, a dissimilarity measure should assign relatively small distances between a layer in the first network and its corresponding layer in the second network.

Figure 1 displays the average pair-wise PWCCA, CKA, and Orthogonal Procrustes distances between layers of two networks differing only in random seed. According to PWCCA, these networks’ representations are quite dissimilar; for instance, the two layer representations are further apart than they are from any other layer in the same network. PWCCA is thus not specific against random initialization, as it can outweigh even large changes in layer depth.

In contrast, CKA can separate layer in a different network from layers or in the same network, showing better specificity to random initialization. Orthogonal Procrustes exhibits smaller but non-trivial specificity, distinguishing layers once they are - layers apart.

3.2 Sensitivity to removing principal components

(a) First layer of BERT
(b) Last layer of BERT
Figure 2: CKA fails to be sensitive to all but the largest principal components. We compute dissimilarities between a layer’s representation and low-rank approximations to that representation obtained by deleting principal components, starting from the smallest (solid lines). We also compute the average distance between networks trained with different random seeds as a baseline (dotted line), and mark the intersection point with a star. The starred points indicate that CKA requires almost all the components to be deleted before CKA distance exceeds the baseline.

Dissimilarity measures should also be sensitive to deleting important principal components of a representation.⁶ To quantify which components are important, we fix a layer of a pre-trained BERT base model and measure how probing accuracy degrades as principal components are deleted, since probing accuracy is a common measure of the information captured in a representation (Belinkov et al., 2017). We probe linear classification performance on the Stanford Sentiment Tree Bank task (SST-2) (Socher et al., 2013), following the experimental protocol in Tamkin et al. (2020). Figure 3(b) shows how probing accuracy degrades with component deletion. Ideally, dissimilarity measures should be large by the time probing accuracy has decreased substantially.

⁶ For a representation $A$, we define $A_{-k}$, the result of deleting the $k$ smallest principal components from $A$, as follows: we compute the singular value decomposition $U \Sigma V^\top = A$, construct $U_{-k} \in \mathbb{R}^{p \times (p-k)}$ by dropping the lowest $k$ singular vectors of $U$, and finally take $A_{-k} = U_{-k}^\top A$.
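The principal component deletion described in the footnote can be sketched as below. This is a minimal example under stated assumptions: the function name `delete_smallest_pcs` is hypothetical, the representation is assumed to have neurons as rows with fewer neurons than data points, and the truncated representation is taken to be the projection of the representation onto its leading left singular vectors.

```python
import numpy as np

def delete_smallest_pcs(A, k):
    """Drop the k smallest principal components of A (p x n, with p <= n)."""
    # NumPy returns singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_keep = U[:, : U.shape[1] - k]   # remove the k lowest left singular vectors
    return U_keep.T @ A               # shape (p - k, n)

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 50))
A -= A.mean(axis=1, keepdims=True)

A_del = delete_smallest_pcs(A, k=4)
full_spectrum = np.linalg.svd(A, compute_uv=False)
kept_spectrum = np.linalg.svd(A_del, compute_uv=False)
```

The surviving representation keeps exactly the largest singular values of the original one, which is what makes it a low-rank approximation in the sense used in this section.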

To assess whether a dissimilarity measure is large, we need a baseline to compare to. For each measure, we define a dissimilarity score to be above the detectable threshold if it is larger than the dissimilarity score between networks with different random initialization. Figure 2 plots the dissimilarity induced by deleting principal components, as well as this baseline.

For the last layer of BERT, CKA requires 97% of a representation’s principal components to be deleted for the dissimilarity to be detectable; after deleting these components, probing accuracy shown in Figure 3(b) drops significantly from 80% to 63% (chance is 50%). CKA thus fails to detect large accuracy drops and so fails our intuitive sensitivity test.

Other metrics perform better: Orthogonal Procrustes’s detection threshold is 85% of the principal components, corresponding to an accuracy drop from 80% to 70%. PWCCA’s threshold is 55% of principal components, corresponding to an accuracy drop from 80% to 75%.

PWCCA’s failure of specificity and CKA’s failure of sensitivity on these intuitive tests are worrying. However, before declaring definitive failure, in the next section, we turn to making our assessments more rigorous.

4 Rigorously Evaluating Dissimilarity Metrics

In the previous section, we saw that CKA and PWCCA each failed intuitive tests, based on sensitivity to principal components and specificity to random initialization. However, these were based primarily on intuitive, qualitative desiderata. Is there some way for us to make these tests more rigorous and quantitative?

First consider the intuitive layer specificity test (Section 3.1), which revealed that random initialization affects PWCCA more than large changes in layer depth. To justify why this is undesirable, we can turn to probing accuracy, which is strongly affected by layer depth, and only weakly affected by random seed (Figure 3(a)). This suggests a path forward: we can ground the layer test in the concrete differences in functionality captured by the probe.

More generally, we want metrics to be sensitive to changes that affect functionality, while ignoring those that don’t. This motivates the following general procedure, given a distance metric $d(\cdot, \cdot)$ and a functionality $f(\cdot)$ (which assigns a real number to a given representation):

  1. Collect a set $S$ of representations that differ along one or more axes of interest (e.g. layer depth, random seed).

  2. Choose a reference representation $A \in S$. When $f$ is an accuracy metric, it is reasonable to choose $A = \arg\max_{A' \in S} f(A')$.⁷

  3. For every representation $B \in S$:

    • Compute $|f(A) - f(B)|$

    • Compute $d(A, B)$

  4. Report the rank correlation between $|f(A) - f(B)|$ and $d(A, B)$ (measured by Kendall’s $\tau$ or Spearman $\rho$).

⁷ Choosing the highest accuracy model as the reference makes it more likely that as accuracy changes, models are on average becoming more dissimilar. A low accuracy model may be on the “periphery” of model space, where it is dissimilar to models with high accuracy, but potentially even more dissimilar to other low accuracy models that make different mistakes.

The above procedure provides a quantitative measure of how well the distance metric $d$ responds to the functionality $f$. For instance, in the layer specificity test, since depth affects probing accuracy strongly while random seed affects it only weakly, a dissimilarity measure with high rank correlation will be strongly responsive to layer depth and weakly responsive to seed; thus rank correlation quantitatively formalizes the test from Section 3.1.
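The four-step procedure can be sketched in a few lines using SciPy's rank-correlation functions. Everything named here is illustrative: `benchmark_score`, the toy representations, the toy functionality, and the toy distance are hypothetical stand-ins, and the reference is simply the representation with the highest functionality value.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def benchmark_score(reps, functionality, distance):
    """Rank correlation between d(A, B) and |f(A) - f(B)| over all B in reps."""
    f_vals = np.array([functionality(r) for r in reps])
    ref_idx = int(np.argmax(f_vals))          # step 2: highest-scoring reference
    ref = reps[ref_idx]
    # Step 3: functional gap and distance from the reference to every representation.
    f_gaps = np.abs(f_vals[ref_idx] - f_vals)
    dists = np.array([distance(ref, r) for r in reps])
    # Step 4: rank correlations (the p-values are discarded here).
    rho, _ = spearmanr(dists, f_gaps)
    tau, _ = kendalltau(dists, f_gaps)
    return rho, tau

# Toy set: constant matrices whose "accuracy" is their mean value.
reps = [np.full((2, 3), v) for v in [0.0, 1.0, 2.0, 3.0, 4.0]]
toy_functionality = lambda r: float(r.mean())
toy_distance = lambda a, b: float(np.linalg.norm(a - b))

rho, tau = benchmark_score(reps, toy_functionality, toy_distance)
```

In this toy set the distance grows monotonically with the functional gap, so both rank correlations are perfect; on real representations the scores would be lower because a distance must respond to more than one kind of functional change.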

Figure 3: Our perturbations induce substantial variation on probing tasks and stress tests. (a) Probing variation across layers: changing the depth of the examined BERT base layer strongly affects probing accuracy on QNLI. The trend for each randomly initialized model is displayed semi-transparently, and the solid black line is the mean trend. (b) Variation across principal axis deletions: truncating principal components from pretrained BERT base significantly degrades probing accuracy on SST-2 (BERT layer 12 shown here). (c) Variation across finetuning seeds: finetuning BERT base with different seeds leads to variation in accuracies on the Lexical (non-entailed) subset of HANS (McCoy et al., 2020), shown via histogram. (d) Variation across different seeds: pretraining and finetuning BERT medium with 10 different pretraining seeds and 10 different finetuning seeds per pretrained model leads to variation in accuracies on the Antonymy (yellow scatter points) and Numerical (blue scatter points) stress tests (Naik et al., 2018).

Correlation metrics also capture properties that our intuition might miss. For instance, Figure 3(a) shows that some variation in random seed actually does affect accuracy, and our procedure rewards metrics that pick up on this, while the intuitive specificity test would penalize them.

Our procedure requires choosing a collection of models $S$; the crucial feature of $S$ is that it contains models with diverse behavior according to $f$. Different sets $S$, combined with a functional difference $f$, can be thought of as miniature “benchmarks” that surface complementary perspectives on dissimilarity measures’ responsiveness to that functional difference. In the rest of this section, we instantiate this quantitative benchmark for several choices of $S$ and $f$, starting with the layer and principal component tests from Section 3 and continuing on to several tests of OOD performance.

The overall results are summarized in Table 1. Note that for any single benchmark, we expect the correlation coefficients to be significantly lower than 1, since the metric $d$ must capture all important axes of variation while $f$ measures only one type of functionality. A good metric is one that has consistently high correlation across many different functional measures.

Benchmark 1: Layer depth.

To turn the layer test into a benchmark, we construct a set $S$ of 120 representations by pretraining 10 BERT base models with different initialization seeds and including each of the 12 BERT layers as a representation. We separately consider two functionalities $f$: probing accuracy on QNLI (Wang et al., 2018a) or SST-2 (Socher et al., 2013). To compute the rank correlation, we take the reference representation $A$ to be the depth-12 (final layer) representation with highest probing accuracy. We compute the Kendall’s $\tau$ and Spearman’s $\rho$ rank correlations between the dissimilarities $d(A, B)$ and the probing accuracy differences $|f(A) - f(B)|$ and report the results in Table 1.

We find that PWCCA has lower rank correlations compared to CKA and Procrustes for both probing tasks. This corroborates the intuitive specificity test (Section 3.1), suggesting that PWCCA registers too large of a dissimilarity across random initializations.

| | | Functionality | Procrustes ρ | Procrustes τ | CKA ρ | CKA τ | PWCCA ρ | PWCCA τ |
|---|---|---|---|---|---|---|---|---|
| Pretraining seed, layer depth | 120 | QNLI probe | 0.862 | 0.670 | 0.876 | 0.685 | 0.763 | 0.564 |
| | 120 | SST-2 probe | 0.890 | 0.707 | 0.905 | 0.732 | 0.829 | 0.637 |
| Pretraining seed, PC deletion | | SST-2 probe | 0.860 | 0.677 | 0.751 | 0.564 | 0.870 | 0.690 |
| Finetuning seed | | HANS: Lexical | | | | | | |
| Pretraining and finetuning seeds | | stress test | 0.243 | 0.178 | 0.227 | 0.160 | 0.204 | 0.152 |
| | | stress test | 0.071 | 0.049 | 0.122 | 0.084 | 0.031 | 0.023 |
| Total Average | | | 0.580 | 0.447 | 0.557 | 0.426 | 0.544 | 0.413 |

Table 1: Summary of rank correlation results. For the first four rows (benchmarks), all dissimilarity measures successfully achieve significant positive rank correlation with the functionality of interest. Both CKA and PWCCA dominate certain benchmarks and fall behind on others, while Procrustes is more consistent and often close to the leader. The last two benchmarks are more challenging, and no dissimilarity measure achieves a high correlation.

Benchmark 2: Principal component (PC) deletion.

We next quantify the PC deletion test from Section 3.2, by constructing a set $S$ of representations that vary in both random initialization and fraction of principal components deleted. We pretrain 10 BERT base models with different initializations, and for each pretrained model we obtain 14 different representations by deleting that representation’s $k$ smallest principal components, for 14 different values of $k$. Thus $S$ has $10 \times 14 = 140$ elements. The representations themselves are the layer-$\ell$ activations, for each of 5 layers $\ell$,⁸ so there are 5 different choices of $S$. We use SST-2 probing accuracy as the functionality of interest $f$, and select the reference representation $A$ as the element in $S$ with highest accuracy. Rank correlation results are consistent across the 5 choices of $S$ (Appendix C), so we report the average as a summary statistic in Table 1.

⁸ Earlier layers have near-chance accuracy on probing tasks, so we ignore them.

We find that PWCCA has the highest rank correlation between dissimilarity and probing accuracy, followed by Procrustes, and distantly followed by CKA. This corroborates the intuitive observations from Section 3.2 that CKA is not sensitive to principal component deletion.

4.1 Investigating variation in OOD performance across random seeds

So far our benchmarks have been based on probing accuracy, which only measures in-distribution behavior (the train and test set of the probe are typically i.i.d.). In addition, representations were always pretrained but not finetuned. To add diversity to our benchmarks, we next consider the out-of-distribution performance of several collections of fine-tuned models.

Benchmark 3: Changing fine-tuning seeds.

McCoy et al. (2020) show that a single pretrained BERT base model finetuned on MNLI with different random initializations will produce models with similar in-distribution performance, but widely variable performance on out-of-distribution data (see Figure 3(c)). We thus create a benchmark out of McCoy et al.’s 100 released fine-tuned models, using OOD accuracy on the “Lexical Heuristic (Non-entailment)” subset of the HANS dataset (McCoy et al., 2019) as our functionality $f$. This functionality is associated with the entire model, rather than an individual layer (in contrast to the probing functionality), but we consider one layer at a time to measure whether dissimilarities between representations at that layer correlate with $f$. This allows us to also localize whether certain layers are more predictive of $f$.

We construct 12 different $S$ (one for each of the 12 layers of BERT base), taking the reference representation $A$ to be that of the highest accuracy model according to $f$. As before, we report each dissimilarity measure’s rank correlation with $f$ in Table 1, averaged over the 12 runs.

All three dissimilarity measures correlate with OOD accuracy, with Orthogonal Procrustes and PWCCA being more correlated than CKA. Since the representations in our benchmarks were computed on in-distribution MNLI data, this has the interesting implication that dissimilarity measures can detect OOD differences without access to OOD data. It also implies that random initialization leads to meaningful functional differences that are picked up by these measures, especially Procrustes and PWCCA. Contrast this with our intuitive specificity test in Section 3.1, where all sensitivity to random initialization was seen as a shortcoming. Our more quantitative benchmark here suggests that some of that sensitivity tracks true functionality.

To check that the differences in rank correlation for Procrustes, PWCCA, and CKA are statistically significant, we compute bootstrap estimates of their 95% confidence intervals. With 2000 bootstrapped samples, we find statistically significant differences between all pairs of measures for most choices of layer depth, so we conclude PWCCA > Orthogonal Procrustes > CKA (the full results are in Appendix D). We do not apply this procedure for the previous two benchmarks, because the different models have correlated randomness and so any $p$-value based on independence assumptions would be invalid.
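A bootstrap confidence interval for a rank correlation can be sketched as follows. The source does not state which bootstrap variant was used, so this example assumes a simple percentile bootstrap that resamples models with replacement; the function name and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for Spearman's rank correlation."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample models with replacement
        stats[b] = spearmanr(x[idx], y[idx])[0]
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(stats, [tail, 100.0 - tail])
    return lo, hi

# Synthetic stand-ins for 100 models: distances and functional gaps.
rng = np.random.default_rng(1)
dists = rng.normal(size=100)
f_gaps = dists + 0.5 * rng.normal(size=100)

lo, hi = bootstrap_spearman_ci(dists, f_gaps)
```

Two measures could then be compared by checking whether their intervals overlap, with the caveat from the text that such intervals are only meaningful when the resampled models are independent.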

Changing both pretraining and fine-tuning seeds: a challenge set.

We also construct benchmarks from a collection of 100 BERT medium models, trained with all combinations of 10 pretraining and 10 fine-tuning seeds. The models are fine-tuned on MNLI, and we consider two different functionalities of interest $f$: accuracy on the Antonymy stress test and on the Numerical stress test (Naik et al., 2018), which both show significant variation in accuracy across models (see Figure 3(d)). We obtain 8 different sets $S$ (one for each of the 8 layer depths in BERT medium), again taking $A$ to be the representation of the highest-accuracy model according to $f$. Rank correlations for each dissimilarity measure are averaged over the 8 runs and reported in Table 1.

None of the dissimilarity measures show a large rank correlation for either task, and for the Numerical task, at most layers, the associated $p$-values (assuming independence) are non-significant at the 0.05 level (see Appendix C).⁹ Thus we conclude that all measures fail to be sensitive to OOD accuracy in these settings. One reason for this could be that there is less variation in the OOD accuracies compared to the previous experiment (compare Figure 3(c) to 3(d)). Another reason could be that it is harder to correctly account for both pretraining and fine-tuning variation at the same time. Either way, we hope that future dissimilarity measures can improve upon these results, and we present this benchmark as a challenge task to motivate progress.

⁹ See Appendix C for $p$-values as produced by scikit-learn. Strictly speaking, the $p$-values are invalid because they assume independence, but the pretraining seed induces correlations. However, correctly accounting for these would tend to make the $p$-values larger, thus preserving our conclusion of non-significance.

5 Discussion

In this work we proposed a quantitative measure for evaluating similarity metrics, based on the rank correlation with functional behavior. Using this, we generated tasks motivated by sensitivity to deleting important directions, specificity to random initialization, and sensitivity to out-of-distribution performance. Popular existing metrics such as CKA and CCA often performed poorly on these tasks, sometimes in striking ways. Meanwhile, the classical Orthogonal Procrustes transform attained consistently good performance.

Given the success of Orthogonal Procrustes, it is worth reflecting on how it differs from the other metrics and why it might perform well. To do so, we consider a simplified case where and

have the same singular vectors but different singular values. Thus without loss of generality

and , where the are both diagonal. In this case, the Orthogonal Procrustes distance reduces to , or the sum of the squared distances between the singular values. We will see that both CCA and CKA reduce to less reasonable formulae in this case.

Orthogonal Procrustes vs. CCA. All three metrics derived from CCA assign zero

distance even when the (non-zero) singular values are arbitrarily different. This is because CCA correlation coefficients are invariant to all invertible linear transformations. This invariance property may help explain why CCA metrics generally find layers within the same network to be much more similar than networks trained with different randomness. Random initialization introduces noise, particularly in unimportant principal components, while representations within the same network more easily preserve these components, and CCA may place too much weight on their associated correlation coefficients.

Orthogonal Procrustes vs. CKA. In contrast to the squared distance of Orthogonal Procrustes, CKA actually reduces to a quartic function based on the dot products between the squared entries of the two diagonal matrices $\Lambda_1$ and $\Lambda_2$. As a consequence, CKA is dominated by representations’ largest singular values, leaving it insensitive to meaningful differences in smaller singular values as illustrated in Figure 2. This lack of sensitivity to moderate-sized differences may help explain why CKA fails to track out-of-distribution error effectively.
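A small numerical sketch makes the contrast concrete. It applies the linear CKA formula without the usual centering step, an assumption made here so that the diagonal structure of the simplified case is preserved. The singular values `lam1` and `lam2` are illustrative.

```python
import numpy as np


def linear_cka_uncentered(a, b):
    """Linear CKA formula without the usual centering step (rows = examples)."""
    cross = np.linalg.norm(a.T @ b) ** 2
    return cross / (np.linalg.norm(a.T @ a) * np.linalg.norm(b.T @ b))


lam1 = np.array([10.0, 1.0, 0.5])
lam2 = np.array([10.0, 0.1, 0.05])  # only the small singular values change
a, b = np.diag(lam1), np.diag(lam2)

cka = linear_cka_uncentered(a, b)
# Quartic closed form in the singular values for the diagonal case.
closed_form = np.sum(lam1**2 * lam2**2) / np.sqrt(np.sum(lam1**4) * np.sum(lam2**4))
procrustes = np.sum((lam1 - lam2) ** 2)  # squared-distance form from the text
```

Here the two smaller singular values shrink tenfold, yet `cka` stays within 1e-3 of 1 because the shared largest singular value dominates every sum, while the squared-distance form registers the change directly.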

In addition to helping understand similarity measures, our benchmarks pinpoint directions for improvement. No method was sensitive to accuracy on the Numerical stress test in our challenge set, possibly due to a lower signal-to-noise ratio. Since Orthogonal Procrustes performed well on most of our tasks, it could be a promising foundation for a new measure, and recent work shows how to regularize Orthogonal Procrustes to handle high noise (Pumir et al., 2021). Perhaps similar techniques could be adapted here.

An alternative to our benchmarking approach is to directly define two representations’ dissimilarity as their difference in a functional behavior of interest. Feng et al. (2020) take this approach, defining dissimilarity as difference in accuracy on a handful of probing tasks. One drawback of this approach is that a small set of probes may not capture all the differences in representations, so it is useful to base dissimilarity measures on representations’ intrinsic properties. Intrinsically defined dissimilarities also have the potential to highlight new functional behaviors, as we found that representations with similar in-distribution probing accuracy often have highly variable OOD accuracy.

A limitation of our work is that we only consider a handful of model variations and functional behaviors, and restricting our attention to these settings could overlook other important considerations. To address this, we envision a paradigm in which a rich tapestry of benchmarks is used to ground and validate neural network interpretations. Other axes of variation in models could include training on more or fewer examples, training on shuffled labels vs. real labels, training from specifically chosen initializations (Frankle and Carbin, 2018), and using different architectures. Other functional behaviors to examine could include modularity and meta-learning capabilities. Benchmarks could also be applied to other interpretability tools beyond dissimilarity. For example, sensitivity to deleting principal components could provide an additional sanity check for saliency maps and other visualization tools (Adebayo et al., 2018).

More broadly, many interpretability tools are designed as audits of models, although it is often unclear what characteristics of the models are consistently audited. We position this work as a counter-audit, where by collecting models that differ in functional behavior, we can assess whether the interpretability tools CKA, PWCCA, etc., accurately reflect the behavioral differences. Many other types of counter-audits may be designed to assess other interpretability tools. For example, models that have backdoors built into them to misclassify certain inputs provide counter-audits for interpretability tools that explain model predictions; these explanations should reflect any backdoors present (Li et al., 2020; Chen et al., 2017; Wang et al., 2019; Kurita et al., 2020). We are hopeful that more comprehensive checks on interpretability tools will provide deeper understanding of neural networks, and more reliable models.

Thanks to Ruiqi Zhong for helpful comments and assistance in finetuning models, and thanks to Daniel Rothchild and our anonymous reviewers for helpful discussion. FD is supported by an NSF Graduate Research Fellowship and the Open Philanthropy Project AI Fellows Program.



Appendix A BERT finetuning details

We fine-tuned models from Zhong et al. [2021] and the original BERT models from Devlin et al. [2018] on three tasks: Quora Question Pairs (QQP), Multi-Genre Natural Language Inference (MNLI; Williams et al. [2018]), and the Stanford Sentiment Treebank (SST-2; Socher et al. [2013]). Table 2 shows each model’s accuracy on these tasks. Our models generally have comparable accuracy.

As in Turc et al. [2019], we finetune for 4 epochs for each dataset. For each task and model size, we tune hyperparameters in the following way: we first randomly split our new training set into 80% and 20%; then we finetune on the 80% split with all 9 combinations of batch size [16, 32, 64] and learning rate [1e-4, 5e-5, 3e-5], and choose the combination that leads to the best average accuracy on the remaining 20%. Finetuning these models for all three tasks requires around 500 hours.
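The selection procedure can be sketched as a small grid search. This is an illustrative outline only: `split_80_20`, `select_hyperparameters`, and the user-supplied `finetune_and_score` callable are hypothetical names, and the actual fine-tuning code is not shown in this document.

```python
import itertools
import random

BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [1e-4, 5e-5, 3e-5]


def split_80_20(examples, seed=0):
    """Randomly split a list of examples into 80% train and 20% held-out."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def select_hyperparameters(examples, finetune_and_score):
    """Return the (batch_size, learning_rate) with the best held-out accuracy.

    `finetune_and_score(train, heldout, batch_size, learning_rate)` is a
    hypothetical user-supplied callable that fine-tunes on `train` and
    returns accuracy on `heldout`.
    """
    train, heldout = split_80_20(examples)
    grid = list(itertools.product(BATCH_SIZES, LEARNING_RATES))  # 9 combinations
    scores = {cfg: finetune_and_score(train, heldout, *cfg) for cfg in grid}
    return max(scores, key=scores.get)
```

Passing the scoring routine in as a callable keeps the search logic independent of any particular training framework.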

89.8% 79.6% 94.2%
89.5% 78.9% 94.2%
90.8% 83.8% 95.0%
90.6% 81.2% 94.6%
Table 2: Comparing accuracy of our pretrained model (superscript ) to the original release by Devlin et al. [2018] and Turc et al. [2019] (superscript ) on a variety of fine-tuned tasks.

Appendix B Licenses

The source code for BERT models available at is licensed under the Apache License 2.0.

The model weights for the 100 BERT base models provided by McCoy et al. [2020] are licensed under the Creative Commons Attribution 4.0 International license, and their source code is licensed under the MIT license (

Appendix C Layer-wise results

Some of the results presented in Table 1 were averaged over multiple layers, since rankings between dissimilarity measures were consistent across different layers. Rank correlation scores are higher across all measures for certain layers, however, so we include layer-by-layer results here for completeness. We also include scores for the other two CCA-based measures (mean CCA and mean-squared CCA) here, and note that they are often similar to PWCCA, and generally dominated by other measures. We expand each row of Table 1 into a subsection of its own. We also include p-values as reported by scikit-learn, although we note that because random seeds are shared among some representations, these p-values all overstate significance, with the exception of those for the experiment perturbing only fine-tuning seed, and assessing functionality through HANS (C.3). The invalid p-values may all be thought of as upper bounds on the significance of the rank correlation results.

C.1 Perturbation: pretraining seed and layer depth

Tables 3 and 4 show the full results (including p-values and all 5 dissimilarity measures) using the QNLI probe as the functionality of interest, for Spearman's $\rho$ and Kendall’s $\tau$, respectively. Tables 5 and 6 present results for the probing task SST-2 as the functionality of interest.

Layer Procrustes CKA PWCCA
12 0.862 (6.5E-37) 0.876 (1.6E-39) 0.763 (2.2E-24) 0.849 (1.0E+00) 0.846 (1.0E+00)
Table 3: Spearman results for perturbing pretraining seed and layer depth, and assessing functionality through the QNLI probe
Layer Procrustes CKA PWCCA
12 0.670 (1.1E-27) 0.685 (7.4E-29) 0.564 (3.2E-20) 0.652 (1.0E+00) 0.647 (1.0E+00)
Table 4: Kendall’s $\tau$ results for perturbing pretraining seed and layer depth, and assessing functionality through the QNLI probe
Layer Procrustes CKA PWCCA
12 0.890 (2.7E-42) 0.905 (5.3E-46) 0.829 (7.7E-32) 0.857 (1.0E+00) 0.854 (1.0E+00)
Table 5: Spearman results for perturbing pretraining seed and layer depth, and assessing functionality through the SST-2 probe
Layer Procrustes CKA PWCCA
12 0.707 (1.2E-30) 0.732 (1.0E-32) 0.637 (3.1E-25) 0.662 (1.0E+00) 0.658 (1.0E+00)
Table 6: Kendall’s $\tau$ results for perturbing pretraining seed and layer depth, and assessing functionality through the SST-2 probe

C.2 Perturbation: pretraining seed and principal component deletion

We find that for these experiments, results are consistent across the layers we analyze (the last 6 layers of BERT base). Tables 7 and 8 show results for Spearman's $\rho$ and Kendall’s $\tau$, respectively.

Layer Procrustes CKA PWCCA
8 0.764 (2.4E-36) 0.668 (3.2E-25) 0.776 (3.4E-38) 0.700 (1.9E-28) 0.700 (1.8E-28)
9 0.813 (2.1E-44) 0.706 (4.0E-29) 0.825 (9.2E-47) 0.728 (1.3E-31) 0.728 (1.2E-31)
10 0.873 (2.1E-58) 0.818 (2.7E-45) 0.874 (1.1E-58) 0.748 (3.2E-34) 0.749 (2.7E-34)
11 0.918 (1.2E-74) 0.797 (1.4E-41) 0.922 (1.7E-76) 0.781 (6.6E-39) 0.781 (7.0E-39)
12 0.932 (1.1E-81) 0.766 (1.1E-36) 0.955 (4.2E-97) 0.810 (6.1E-44) 0.810 (6.1E-44)
Table 7: Layer-wise Spearman results for perturbing pretraining seed and principal component deletion, and assessing functionality through the SST-2 probe
Layer Procrustes CKA PWCCA
8 0.560 (1.8E-29) 0.479 (4.4E-22) 0.573 (1.1E-30) 0.512 (6.8E-25) 0.512 (6.6E-25)
9 0.602 (1.2E-33) 0.509 (1.2E-24) 0.618 (2.5E-35) 0.542 (1.1E-27) 0.543 (9.7E-28)
10 0.684 (5.6E-43) 0.627 (2.1E-36) 0.685 (5.3E-43) 0.588 (2.9E-32) 0.589 (2.5E-32)
11 0.751 (2.8E-51) 0.616 (3.3E-35) 0.756 (6.4E-52) 0.648 (9.2E-39) 0.648 (9.2E-39)
12 0.787 (3.4E-56) 0.588 (2.9E-32) 0.819 (1.2E-60) 0.701 (4.7E-45) 0.701 (4.9E-45)
Table 8: Layer-wise Kendall’s $\tau$ results for perturbing pretraining seed and principal component deletion, and assessing functionality through the SST-2 probe

C.3 Perturbation: fine-tuning seed, Functionality: HANS

Results for this experiment are similar across layers for Procrustes and all three CCA-based measures, with middle layers of BERT base having a slightly higher rank correlation score in general. For CKA, this effect is even more pronounced. Tables 9 and 10 show the results for Spearman's $\rho$ and Kendall’s $\tau$, respectively.

Layer Procrustes () CKA () PWCCA () () ()
1 0.425 (5.1E-06) 0.361 (1.1E-04) 0.405 (1.4E-05) 0.388 (3.4E-05) 0.389 (3.2E-05)
2 0.510 (3.1E-08) 0.410 (1.2E-05) 0.486 (1.5E-07) 0.488 (1.3E-07) 0.483 (1.8E-07)
3 0.531 (6.6E-09) 0.427 (4.6E-06) 0.538 (3.8E-09) 0.533 (5.6E-09) 0.532 (6.2E-09)
4 0.543 (2.6E-09) 0.506 (3.9E-08) 0.552 (1.4E-09) 0.555 (1.0E-09) 0.550 (1.5E-09)
5 0.563 (5.3E-10) 0.512 (2.6E-08) 0.570 (2.9E-10) 0.582 (1.1E-10) 0.580 (1.3E-10)
6 0.629 (1.2E-12) 0.641 (3.6E-13) 0.621 (2.8E-12) 0.621 (2.7E-12) 0.622 (2.5E-12)
7 0.647 (1.7E-13) 0.658 (5.0E-14) 0.647 (1.7E-13) 0.653 (9.0E-14) 0.650 (1.2E-13)
8 0.643 (2.7E-13) 0.552 (1.3E-09) 0.653 (9.5E-14) 0.651 (1.1E-13) 0.651 (1.2E-13)
9 0.589 (5.9E-11) 0.419 (7.1E-06) 0.641 (3.5E-13) 0.662 (3.3E-14) 0.660 (4.2E-14)
10 0.536 (4.6E-09) 0.437 (2.7E-06) 0.559 (7.3E-10) 0.612 (6.6E-12) 0.614 (5.4E-12)
11 0.532 (6.2E-09) 0.426 (4.9E-06) 0.565 (4.7E-10) 0.619 (3.4E-12) 0.614 (5.5E-12)
12 0.465 (5.3E-07) 0.192 (2.8E-02) 0.574 (2.1E-10) 0.609 (9.2E-12) 0.610 (7.9E-12)
Table 9: Layer-wise Spearman results for perturbing finetuning seed, and assessing functionality through the HANS: Lexical (non-entailment) OOD dataset
Layer Procrustes () CKA () PWCCA () () ()
1 0.295 (6.7E-06) 0.269 (3.6E-05) 0.277 (2.2E-05) 0.265 (4.7E-05) 0.268 (4.0E-05)
2 0.363 (4.6E-08) 0.288 (1.1E-05) 0.343 (2.1E-07) 0.342 (2.3E-07) 0.342 (2.4E-07)
3 0.372 (2.1E-08) 0.290 (9.5E-06) 0.378 (1.3E-08) 0.375 (1.6E-08) 0.375 (1.6E-08)
4 0.393 (3.4E-09) 0.358 (6.6E-08) 0.401 (1.7E-09) 0.405 (1.2E-09) 0.403 (1.4E-09)
5 0.410 (7.7E-10) 0.367 (3.3E-08) 0.417 (4.1E-10) 0.428 (1.4E-10) 0.424 (2.0E-10)
6 0.464 (4.2E-12) 0.474 (1.5E-12) 0.460 (6.3E-12) 0.460 (5.8E-12) 0.461 (5.6E-12)
7 0.483 (5.5E-13) 0.488 (3.3E-13) 0.481 (7.1E-13) 0.486 (3.9E-13) 0.483 (5.5E-13)
8 0.478 (9.2E-13) 0.392 (3.7E-09) 0.483 (5.7E-13) 0.481 (6.5E-13) 0.480 (7.7E-13)
9 0.432 (1.0E-10) 0.293 (7.7E-06) 0.475 (1.2E-12) 0.496 (1.3E-13) 0.494 (1.6E-13)
10 0.380 (1.0E-08) 0.306 (3.4E-06) 0.401 (1.7E-09) 0.447 (2.3E-11) 0.448 (2.1E-11)
11 0.376 (1.5E-08) 0.292 (8.3E-06) 0.411 (6.9E-10) 0.448 (2.1E-11) 0.445 (2.7E-11)
12 0.330 (5.7E-07) 0.127 (3.1E-02) 0.416 (4.4E-10) 0.446 (2.5E-11) 0.447 (2.2E-11)
Table 10: Layer-wise Kendall’s $\tau$ results for perturbing finetuning seed, and assessing functionality through the HANS: Lexical (non-entailment) OOD dataset

C.4 Perturbation: pretraining seeds and finetuning seeds of BERT medium

Rank correlation scores are low across the board for this task, suggesting that it is difficult for all existing dissimilarity measures, regardless of the layer within a network. Results on the Antonymy stress test for Spearman's $\rho$ and Kendall’s $\tau$ are in Tables 11 and 12, respectively. Results on the Numerical stress test for Spearman's $\rho$ and Kendall’s $\tau$ are in Tables 13 and 14, respectively.

Layer Procrustes CKA PWCCA
1 0.252 (5.7E-03) 0.241 (7.8E-03) 0.168 (4.7E-02) 0.305 (1.0E+00) 0.327 (1.0E+00)
2 0.213 (1.7E-02) 0.145 (7.5E-02) 0.131 (9.7E-02) 0.047 (6.8E-01) 0.031 (6.2E-01)
3 0.260 (4.5E-03) 0.262 (4.2E-03) 0.208 (1.9E-02) 0.137 (9.1E-01) 0.111 (8.6E-01)
4 0.260 (4.5E-03) 0.265 (3.8E-03) 0.265 (3.8E-03) 0.276 (1.0E+00) 0.254 (9.9E-01)
5 0.273 (3.0E-03) 0.302 (1.1E-03) 0.278 (2.5E-03) 0.339 (1.0E+00) 0.310 (1.0E+00)
6 0.330 (3.9E-04) 0.280 (2.4E-03) 0.346 (2.1E-04) 0.313 (1.0E+00) 0.304 (1.0E+00)
7 0.271 (3.2E-03) 0.315 (7.1E-04) 0.111 (1.4E-01) 0.091 (8.2E-01) 0.090 (8.1E-01)
8 0.084 (2.0E-01) 0.004 (4.8E-01) 0.123 (1.1E-01) 0.204 (9.8E-01) 0.198 (9.8E-01)
Table 11: Layer-wise Spearman results for perturbing pretraining seed and finetuning seed, and assessing functionality through the Antonymy stress test
Layer Procrustes CKA PWCCA
1 0.199 (1.7E-03) 0.171 (5.9E-03) 0.126 (3.3E-02) 0.244 (1.0E+00) 0.243 (1.0E+00)
2 0.179 (4.3E-03) 0.123 (3.5E-02) 0.118 (4.2E-02) 0.061 (8.1E-01) 0.042 (7.3E-01)
3 0.185 (3.3E-03) 0.186 (3.2E-03) 0.139 (2.0E-02) 0.110 (9.5E-01) 0.096 (9.2E-01)
4 0.187 (3.0E-03) 0.191 (2.6E-03) 0.188 (2.9E-03) 0.206 (1.0E+00) 0.193 (1.0E+00)
5 0.192 (2.4E-03) 0.194 (2.2E-03) 0.202 (1.5E-03) 0.267 (1.0E+00) 0.242 (1.0E+00)
6 0.236 (2.7E-04) 0.197 (1.9E-03) 0.252 (1.1E-04) 0.229 (1.0E+00) 0.221 (1.0E+00)
7 0.189 (2.8E-03) 0.217 (7.3E-04) 0.091 (9.1E-02) 0.081 (8.8E-01) 0.082 (8.9E-01)
8 0.061 (1.9E-01) -0.000 (5.0E-01) 0.101 (6.9E-02) 0.155 (9.9E-01) 0.150 (9.9E-01)
Table 12: Layer-wise Kendall’s $\tau$ results for perturbing pretraining seed and finetuning seed, and assessing functionality through the Antonymy stress test
Layer Procrustes CKA PWCCA
1 0.137 (8.7E-02) 0.108 (1.4E-01) 0.107 (1.4E-01) 0.072 (7.6E-01) 0.072 (7.6E-01)
2 -0.012 (5.5E-01) 0.060 (2.8E-01) 0.062 (2.7E-01) 0.004 (5.1E-01) 0.001 (5.0E-01)
3 -0.059 (7.2E-01) 0.011 (4.6E-01) -0.031 (6.2E-01) -0.060 (2.8E-01) -0.056 (2.9E-01)
4 0.041 (3.4E-01) 0.052 (3.0E-01) -0.026 (6.0E-01) -0.101 (1.6E-01) -0.084 (2.0E-01)
5 0.003 (4.9E-01) 0.131 (9.7E-02) -0.047 (6.8E-01) -0.061 (2.7E-01) -0.061 (2.7E-01)
6 0.092 (1.8E-01) 0.260 (4.5E-03) -0.029 (6.1E-01) -0.064 (2.6E-01) -0.056 (2.9E-01)
7 0.164 (5.2E-02) 0.250 (6.1E-03) 0.037 (3.6E-01) 0.040 (6.5E-01) 0.040 (6.5E-01)
8 0.202 (2.2E-02) 0.105 (1.5E-01) 0.175 (4.1E-02) 0.134 (9.1E-01) 0.143 (9.2E-01)
Table 13: Layer-wise Spearman results for perturbing pretraining seed and finetuning seed, and assessing functionality through the Numerical stress test
Layer Procrustes CKA PWCCA
1 0.103 (6.5E-02) 0.083 (1.1E-01) 0.074 (1.4E-01) 0.050 (7.7E-01) 0.048 (7.6E-01)
2 -0.010 (5.6E-01) 0.046 (2.5E-01) 0.046 (2.5E-01) 0.006 (5.3E-01) 0.001 (5.0E-01)
3 -0.041 (7.3E-01) 0.014 (4.2E-01) -0.018 (6.0E-01) -0.047 (2.5E-01) -0.047 (2.4E-01)
4 0.031 (3.2E-01) 0.038 (2.9E-01) -0.020 (6.2E-01) -0.076 (1.3E-01) -0.065 (1.7E-01)
5 0.005 (4.7E-01) 0.086 (1.0E-01) -0.031 (6.8E-01) -0.042 (2.7E-01) -0.042 (2.7E-01)
6 0.060 (1.9E-01) 0.175 (5.1E-03) -0.020 (6.2E-01) -0.050 (2.3E-01) -0.046 (2.5E-01)
7 0.112 (4.9E-02) 0.168 (6.8E-03) 0.030 (3.3E-01) 0.019 (6.1E-01) 0.024 (6.4E-01)
8 0.131 (2.7E-02) 0.063 (1.8E-01) 0.125 (3.3E-02) 0.099 (9.3E-01) 0.103 (9.4E-01)
Table 14: Layer-wise Kendall’s $\tau$ results for perturbing pretraining seed and finetuning seed, and assessing functionality through the Numerical stress test

Appendix D Bootstrap significance testing for changing fine-tuning seeds

To assess whether the differences between rank correlations are statistically significant in the experiments varying finetuning seed and comparing functional behavior on the OOD HANS dataset, we conduct bootstrap resampling. Concretely, for every pair of metrics and every layer depth, we do the following:

  • Sample 100 models with replacement, and collect their representations at the specified layer depth

  • Let the reference representation be the one corresponding to the sampled model with maximum accuracy at that depth

  • Compute the dissimilarities between the reference representation and the 100 sampled representations

  • Compute the Kendall’s $\tau$ and Spearman’s $\rho$ rank correlations for Orthogonal Procrustes, CKA, and PWCCA

  • Record $\rho$(Procrustes) - $\rho$(CKA), $\rho$(PWCCA) - $\rho$(CKA), and $\rho$(PWCCA) - $\rho$(Procrustes), and the same pairwise differences for Kendall’s $\tau$.

  • Repeat the above 2000 times

This gives us bootstrap distributions for the differences in rank correlations, and we may compute the 95% confidence intervals for these distributions. When the confidence interval does not overlap with 0, we conclude that the difference in rank correlation is statistically significant. The figures below show the results for each layer. We see that in the deeper layers of the network (layers 8-12), PWCCA has statistically significantly higher rank correlation than Orthogonal Procrustes, which in turn has statistically significantly higher rank correlation than CKA. In earlier layers, results are sometimes statistically significant, but not always.
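The resampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes pairwise dissimilarity matrices have been precomputed for each metric, and it correlates dissimilarity to the reference with the accuracy gap to the reference. The function name and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr


def bootstrap_rank_correlation_diff(acc, dissim_a, dissim_b, stat=spearmanr,
                                    n_boot=2000, seed=0):
    """Bootstrap the difference in rank correlation between two metrics.

    acc:       (n,) accuracies of the n models.
    dissim_a:  (n, n) pairwise dissimilarities under metric A (precomputed).
    dissim_b:  (n, n) pairwise dissimilarities under metric B (precomputed).
    stat:      spearmanr or kendalltau from scipy.stats.
    Returns the bootstrap differences corr(A) - corr(B) and a 95% interval.
    """
    acc = np.asarray(acc, dtype=float)
    dissim_a = np.asarray(dissim_a, dtype=float)
    dissim_b = np.asarray(dissim_b, dtype=float)
    n = len(acc)
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for t in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample models with replacement
        ref = idx[np.argmax(acc[idx])]     # most accurate sampled model
        gap = acc[ref] - acc[idx]          # accuracy gap to the reference
        corr_a, _ = stat(dissim_a[ref, idx], gap)
        corr_b, _ = stat(dissim_b[ref, idx], gap)
        diffs[t] = corr_a - corr_b
    low, high = np.percentile(diffs, [2.5, 97.5])
    return diffs, (low, high)


# Synthetic illustration (made-up data): metric A tracks the accuracy gap,
# metric B is pure noise.
rng = np.random.default_rng(1)
acc = rng.uniform(0.5, 0.9, size=100)
d_a = np.abs(acc[:, None] - acc[None, :]) + 0.01 * rng.normal(size=(100, 100))
d_b = rng.uniform(size=(100, 100))
_, (low, high) = bootstrap_rank_correlation_diff(acc, d_a, d_b, n_boot=200)
significant = not (low <= 0.0 <= high)
```

Passing `stat=kendalltau` gives the corresponding interval for Kendall's tau. The difference is declared significant when the interval excludes 0, matching the criterion described above.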

Figure 4: Bootstrap comparison of rank correlations between metrics, layers 1-4
Figure 5: Bootstrap comparison of rank correlations between metrics, layers 5-8
Figure 6: Bootstrap comparison of rank correlations between metrics, layers 9-12
Figure 7: Bootstrap comparison of rank correlations between metrics, layers 1-4
Figure 8: Bootstrap comparison of rank correlations between metrics, layers 5-8
Figure 9: Bootstrap comparison of rank correlations between metrics, layers 9-12