1 Introduction
Understanding neural networks is not only scientifically interesting, but critical for applying deep networks in high-stakes situations. Recent work has highlighted the value of analyzing not just the final outputs of a network, but also its intermediate representations (Li et al., 2015; Raghu et al., 2019). This has motivated the development of representation similarity measures, which can provide insight into how different training schemes, architectures, and datasets affect networks' learned representations.
A number of similarity measures have been proposed, including centered kernel alignment (CKA) (Kornblith et al., 2019), ones based on canonical correlation analysis (CCA) (Raghu et al., 2017; Morcos et al., 2018), single neuron alignment (Li et al., 2015), vector space alignment (Arora et al., 2017; Smith et al., 2017; Conneau et al., 2018), and others (Laakso and Cottrell, 2000; Wang et al., 2018b; Liang et al., 2019; Lenc and Vedaldi, 2015; Alain and Bengio, 2018; Feng et al., 2020). Unfortunately, these different measures tell different stories. For instance, CKA and projection-weighted CCA disagree on which layers of different networks are most similar (Kornblith et al., 2019). This lack of consensus is worrying, as measures are often designed according to different and incompatible intuitive desiderata, such as whether finding a one-to-one assignment, or finding few-to-one mappings, between neurons is more appropriate (Li et al., 2015). As a community, we need well-chosen formal criteria for evaluating metrics to avoid over-reliance on intuition and the pitfalls of too many researcher degrees of freedom
(Leavitt and Morcos, 2020). In this paper we view representation dissimilarity measures as implicitly answering a classification question: whether two representations are essentially similar or importantly different. Thus, in analogy to statistical testing, we can evaluate them based on their sensitivity to important changes and their specificity (non-responsiveness) against unimportant changes or noise.
As a warmup, we first consider two intuitive criteria: first, that metrics should have specificity against random initialization; and second, that they should be sensitive to deleting important principal components (those that affect probing accuracy). Unfortunately, popular metrics fail at least one of these two tests. CCA is not specific: random initialization noise overwhelms differences between even far-apart layers in a network (Section 3.1). CKA, on the other hand, is not sensitive, failing to detect changes in all but the top principal components of a representation (Section 3.2).
We next construct quantitative benchmarks to evaluate a dissimilarity measure's quality. To move beyond our intuitive criteria, we need a ground truth. For this we turn to the functional behavior of the representations we are comparing, measured through probing accuracy (an indicator of syntactic information) (Belinkov et al., 2017; Peters et al., 2018; Tenney et al., 2019) and out-of-distribution performance of the model they belong to (Naik et al., 2018; McCoy et al., 2020; D'Amour et al., 2020). We then score dissimilarity measures based on their rank correlation with these measured functional differences. Overall, our benchmarks vary representations across several axes, including random seed, layer depth, and low-rank approximation (Section 4). Code to replicate our results can be found at https://github.com/jsd/sim_metric.
Our benchmarks confirm our two intuitive observations: on subtasks that consider layer depth and principal component deletion, we measure the rank correlation with probing accuracy and find CCA and CKA lacking, as the previous warmup experiments suggested. Meanwhile, the Orthogonal Procrustes distance, a classical but often overlooked dissimilarity measure (for instance, Raghu et al. (2017) and Morcos et al. (2018) do not mention it, and Kornblith et al. (2019) relegate it to the appendix, although Smith et al. (2017) do use it to analyze word embeddings and prefer it to CCA), balances gracefully between CKA and CCA and consistently performs well. This underscores the need for systematic evaluation; otherwise we may fall prey to recency bias that undervalues classical baselines.
Other subtasks measure correlation with OOD accuracy, motivated by the observation that random initialization sometimes has large effects on OOD performance (McCoy et al., 2020). We find that dissimilarity measures can sometimes predict OOD performance using only the in-distribution representations, but we also identify a challenge set on which none of the measures does statistically better than chance. We hope this challenge set will help measure and spur progress in the future.
2 Problem Setup: Metrics and Models
Our goal is to quantify the similarity between two different groups of neurons (usually layers). We do this by comparing how their activations behave on the same dataset. Thus for a layer with $p_1$ neurons, we define $A \in \mathbb{R}^{n \times p_1}$, the matrix of activations of the $p_1$ neurons on $n$ data points, to be that layer's raw representation of the data. Similarly, let $B \in \mathbb{R}^{n \times p_2}$ be a matrix of the activations of $p_2$ neurons on the same $n$ data points. We center and normalize these representations before computing dissimilarity, per standard practice. Specifically, for a raw representation $A$ we first subtract the mean value from each column, then divide by the Frobenius norm, to produce the normalized representation $A^*$, used in all our dissimilarity computations. In this work we study dissimilarity measures $d(A^*, B^*)$ that allow for quantitative comparisons of representations both within and across different networks. We colloquially refer to values of $d$ as distances, although they do not necessarily satisfy the triangle inequality required of a proper metric.
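The centering and normalization step can be sketched in a few lines (a minimal sketch; the function name and shapes are ours, not from the paper's released code):

```python
import numpy as np

def normalize_representation(A):
    """Center and normalize a raw representation.

    A: (n, p) matrix of activations of p neurons on n data points.
    Subtracts each column's (neuron's) mean, then divides by the
    Frobenius norm, matching the preprocessing described above.
    """
    A = A - A.mean(axis=0, keepdims=True)  # center each neuron
    return A / np.linalg.norm(A)           # unit Frobenius norm
```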
We study five dissimilarity measures: centered kernel alignment (CKA), three measures derived from canonical correlation analysis (CCA), and a measure derived from the orthogonal Procrustes problem. As argued in Kornblith et al. (2019), similarity measures should be invariant to left orthogonal transformations to accommodate the symmetries of neural networks, and all five measures satisfy this requirement.
Centered kernel alignment (CKA) uses an inner product to quantify similarity between two representations. It is based on the idea that one can first choose a kernel, compute the kernel matrix for each representation, and then measure similarity as the alignment between these two kernel matrices. The measure of similarity thus depends on one's choice of kernel; in this work we consider Linear CKA:

$$\mathrm{CKA}(A, B) = \frac{\|A^\top B\|_F^2}{\|A^\top A\|_F \, \|B^\top B\|_F} \qquad (1)$$

as proposed in Kornblith et al. (2019). Other choices of kernel are also valid; we focus on Linear CKA here since Kornblith et al. (2019) report similar results from using either a linear or RBF kernel.
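Linear CKA has a direct closed form, which can be sketched as follows (a minimal sketch taking already-centered representations; the function name is ours):

```python
import numpy as np

def linear_cka(A, B):
    """Linear CKA similarity between centered A (n, p1) and B (n, p2)."""
    num = np.linalg.norm(A.T @ B, "fro") ** 2
    den = np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro")
    return num / den
```

Note that this similarity is invariant to orthogonal transformations of either representation's neurons, as required in Section 2.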
Canonical correlation analysis (CCA) finds orthogonal bases $(w_A^i)$ and $(w_B^i)$ for two matrices such that, after projection onto these bases, the projections are maximally correlated. For $1 \le i \le \min(p_1, p_2)$, the $i$-th canonical correlation coefficient $\rho_i$ is computed as follows:

$$\rho_i = \max_{w_A^i,\, w_B^i} \ \mathrm{corr}\big(A w_A^i,\ B w_B^i\big) \qquad (2)$$

$$\text{subject to}\quad A w_A^i \perp A w_A^j \ \text{ and } \ B w_B^i \perp B w_B^j \quad \text{for all } j < i. \qquad (3)$$

To transform the vector of correlation coefficients into a scalar measure, two options considered previously (Kornblith et al., 2019) are the mean correlation coefficient, $\bar{\rho}$, and the mean squared correlation coefficient, $\overline{\rho^2}$, defined as follows:

$$\bar{\rho} = \frac{\sum_i \rho_i}{\min(p_1, p_2)}, \qquad \overline{\rho^2} = \frac{\sum_i \rho_i^2}{\min(p_1, p_2)}. \qquad (4)$$
To improve the robustness of CCA, Morcos et al. (2018) propose projection-weighted CCA (PWCCA) as another scalar summary of CCA:

$$\mathrm{PWCCA}(A, B) = \frac{\sum_i \alpha_i \rho_i}{\sum_i \alpha_i}, \qquad \alpha_i = \sum_j |\langle h_i, a_j \rangle| \qquad (5)$$

where $a_j$ is the $j$-th column (neuron) of $A$, and $h_i = A w_A^i$ is the projection of $A$ onto the $i$-th canonical direction. We find that PWCCA performs far better than $\bar{\rho}$ and $\overline{\rho^2}$, so we focus on PWCCA in the main text, but include results on the other two measures in the appendix.
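A standard way to compute the canonical correlation coefficients is as the singular values of $Q_A^\top Q_B$, where $Q_A$ and $Q_B$ are orthonormal bases for the columns of $A$ and $B$. The sketch below uses this route, together with our reading of the PWCCA weights; the names and details are ours, not the authors' implementation:

```python
import numpy as np

def cca_coefficients(A, B):
    """Canonical correlations of centered A (n, p1) and B (n, p2).

    Sketch assuming full column rank: the coefficients are the singular
    values of Q_A^T Q_B for orthonormal column bases Q_A, Q_B, and the
    canonical directions of A are the columns of H = Q_A U.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    U, rho, _ = np.linalg.svd(Qa.T @ Qb, full_matrices=False)
    return rho, Qa @ U

def mean_cca(A, B):
    rho, _ = cca_coefficients(A, B)
    return rho.mean()

def pwcca(A, B):
    """Projection-weighted CCA similarity (our reading of Morcos et al., 2018)."""
    rho, H = cca_coefficients(A, B)
    alpha = np.abs(H.T @ A).sum(axis=1)  # weight: how much of A each direction explains
    return (alpha * rho).sum() / alpha.sum()
```

Because the coefficients depend only on the column spans of $A$ and $B$, they are invariant to any invertible linear transformation of either representation, a property we return to in Section 5.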
The orthogonal Procrustes problem consists of finding the rotation of $A$ that is closest to $B$ in Frobenius norm, i.e. solving the optimization problem:

$$\min_{R} \ \|A R - B\|_F^2 \quad \text{subject to} \quad R^\top R = I. \qquad (6)$$

The minimum is the squared orthogonal Procrustes distance between $A$ and $B$, and is equal to

$$\|A\|_F^2 + \|B\|_F^2 - 2\,\|A^\top B\|_* \qquad (7)$$

where $\|\cdot\|_*$ is the nuclear norm (Schönemann, 1966). Unlike the other metrics, the orthogonal Procrustes distance is not normalized between 0 and 1, although for normalized $A^*$ and $B^*$, it lies in $[0, 2]$.
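The closed form makes the Procrustes distance simple to compute; a minimal sketch (function name ours):

```python
import numpy as np

def procrustes_distance_sq(A, B):
    """Squared orthogonal Procrustes distance, via the closed form
    ||A||_F^2 + ||B||_F^2 - 2 ||A^T B||_* (Eq. 7)."""
    nuc = np.linalg.norm(A.T @ B, ord="nuc")  # nuclear norm: sum of singular values
    return np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2 - 2 * nuc
```

The closed form agrees with solving the minimization directly: for equal-width representations the optimal rotation is $R = UV^\top$, where $U\Sigma V^\top$ is the SVD of $A^\top B$.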
2.1 Models we study
We investigate representations computed by the BERT model family (Devlin et al., 2018) on sentences from the Multi-genre Natural Language Inference (MNLI) dataset (Williams et al., 2018). We study BERT models of two sizes: BERT base, with 12 hidden layers of 768 neurons, and BERT medium, with 8 hidden layers of 512 neurons. We use the same architectures as in the open-source BERT release (available at https://github.com/google-research/bert), but to generate diversity we study 3 variations of these models:
- BERT base models pretrained with different random seeds but not finetuned for particular tasks, released by Zhong et al. (2021) (available at https://github.com/ruiqi-zhong/acl2021-instance-level).
- BERT medium models that were initialized from pretrained models released by Zhong et al. (2021), which we further finetuned on MNLI with different finetuning seeds (100 models total).
- BERT base models that were initialized from the pretrained BERT model in Devlin et al. (2018) and finetuned on MNLI with different seeds, released by McCoy et al. (2020) (available at https://github.com/tommccoy1/hans/tree/master/berts_of_a_feather).
Further training details, as well as checks that our training protocol results in models with comparable performance to the original BERT release, can be found in Appendix A.
3 Warmup: Intuitive Tests for Sensitivity and Specificity
When designing dissimilarity measures, researchers usually consider invariants that these measures should not be sensitive to (Kornblith et al., 2019); for example, symmetries in neural networks imply that permuting the neurons in a fully connected layer does not change the representations learned. We take this one step further and frame dissimilarity measures as answering whether representations are essentially the same, or importantly different. We can then evaluate measures based on whether they respond to important changes (sensitivity) while ignoring changes that don’t matter (specificity).
Assessing sensitivity and specificity requires a ground truth: which representations are truly different? To answer this, we begin with the following two intuitions: 1) neural network representations trained on the same data but from different random initializations are similar, and 2) representations lose crucial information as principal components are deleted. These motivate the following intuitive tests of specificity and sensitivity: we expect a dissimilarity measure to 1) assign a small distance between architecturally identical neural networks that differ only in initialization seed, and 2) assign a large distance between a representation and the same representation after deleting important principal components (enough to affect accuracy). We will see that PWCCA fails the first test (specificity), while CKA fails the second (sensitivity).
3.1 Specificity against changes to random seed
Neural networks with the same architecture trained from different random initializations show many similarities, such as highly correlated predictions on in-distribution data points (McCoy et al., 2020). Thus it seems natural to expect a good similarity measure to assign small distances between architecturally corresponding layers of networks that are identical except for initialization seed.
To check this property, we take two BERT base models pretrained with different random seeds and, for every layer in the first model, compute its dissimilarity to every layer in both the first and second model. We do this for 5 separate pairs of models and average the results. To pass the intuitive specificity test, a dissimilarity measure should assign relatively small distances between a layer in the first network and its corresponding layer in the second network.
Figure 1 displays the average pairwise PWCCA, CKA, and Orthogonal Procrustes distances between layers of two networks differing only in random seed. According to PWCCA, these networks' representations are quite dissimilar; for instance, corresponding layers in the two networks are further apart from each other than from any other layer in the same network. PWCCA is thus not specific against random initialization, as initialization noise can outweigh even large changes in layer depth.
In contrast, CKA separates a layer's counterpart in a different network from nearby layers in the same network, showing better specificity to random initialization. Orthogonal Procrustes exhibits smaller but nontrivial specificity, distinguishing layers once they are a few layers apart.
3.2 Sensitivity to removing principal components
Dissimilarity measures should also be sensitive to deleting important principal components of a representation. (For a representation $A$, we define $A_{-k}$, the result of deleting the $k$ smallest principal components from $A$, as follows: we compute the singular value decomposition $A = U \Sigma V^\top$, construct $V_{-k}$ by dropping the $k$ right singular vectors of $V$ with lowest singular values, and finally take $A_{-k} = A V_{-k}$.) To quantify which components are important, we fix a layer of a pretrained BERT base model and measure how probing accuracy degrades as principal components are deleted, since probing accuracy is a common measure of the information captured in a representation (Belinkov et al., 2017). We probe linear classification performance on the Stanford Sentiment Treebank task (SST-2) (Socher et al., 2013), following the experimental protocol in Tamkin et al. (2020). Figure 2(b) shows how probing accuracy degrades with component deletion. Ideally, dissimilarity measures should be large by the time probing accuracy has decreased substantially.
To assess whether a dissimilarity measure is large, we need a baseline to compare to. For each measure, we define a dissimilarity score to be above the detectable threshold if it is larger than the dissimilarity score between networks with different random initializations. Figure 2 plots the dissimilarity induced by deleting principal components, as well as this baseline.
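The principal component deletion used throughout this section can be sketched via the SVD (a minimal sketch; names ours):

```python
import numpy as np

def delete_smallest_pcs(A, k):
    """Return A with its k smallest principal components deleted.

    Sketch of the construction described above: compute the SVD
    A = U S V^T, drop the k right singular vectors with the smallest
    singular values to obtain V_{-k}, and take A @ V_{-k}.
    """
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V_keep = Vt[: len(S) - k].T  # (p, r - k): top right singular vectors
    return A @ V_keep
```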
For the last layer of BERT, CKA requires 97% of a representation's principal components to be deleted for the dissimilarity to be detectable; after deleting these components, probing accuracy shown in Figure 2(b) drops significantly, from 80% to 63% (chance is 50%). CKA thus fails to detect large accuracy drops, and so fails our intuitive sensitivity test.
Other metrics perform better: Orthogonal Procrustes's detection threshold is 85% of the principal components, corresponding to an accuracy drop from 80% to 70%. PWCCA's threshold is 55% of principal components, corresponding to an accuracy drop from 80% to 75%.
PWCCA's failure of specificity and CKA's failure of sensitivity on these intuitive tests are worrying. However, before declaring definitive failure, we turn in the next section to making our assessments more rigorous.
4 Rigorously Evaluating Dissimilarity Metrics
In the previous section, we saw that CKA and PWCCA each failed intuitive tests, based on sensitivity to principal components and specificity to random initialization. However, these were based primarily on intuitive, qualitative desiderata. Is there some way for us to make these tests more rigorous and quantitative?
First consider the intuitive layer specificity test (Section 3.1), which revealed that random initialization affects PWCCA more than large changes in layer depth. To justify why this is undesirable, we can turn to probing accuracy, which is strongly affected by layer depth, and only weakly affected by random seed (Figure 2(a)). This suggests a path forward: we can ground the layer test in the concrete differences in functionality captured by the probe.
More generally, we want metrics to be sensitive to changes that affect functionality, while ignoring those that don't. This motivates the following general procedure, given a distance metric $d$ and a functionality $f$ (which assigns a real number to a given representation):
1. Collect a set $S$ of representations that differ along one or more axes of interest (e.g. layer depth, random seed).
2. Choose a reference representation $A_{\mathrm{ref}} \in S$. When $f$ is an accuracy metric, it is reasonable to choose $A_{\mathrm{ref}} = \arg\max_{A \in S} f(A)$. (Choosing the highest-accuracy model as the reference makes it more likely that as accuracy changes, models are on average becoming more dissimilar. A low-accuracy model may be on the "periphery" of model space, where it is dissimilar to models with high accuracy, but potentially even more dissimilar to other low-accuracy models that make different mistakes.)
3. For every representation $A \in S$, compute $d(A, A_{\mathrm{ref}})$ and $|f(A) - f(A_{\mathrm{ref}})|$.
4. Report the rank correlation between $d(A, A_{\mathrm{ref}})$ and $|f(A) - f(A_{\mathrm{ref}})|$ (measured by Kendall's $\tau$ or Spearman's $\rho$).
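The procedure above can be sketched end to end. The rank-correlation helpers below are deliberately minimal (no tie handling; in practice scipy.stats.spearmanr and kendalltau are the standard choices), and all names are ours:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation (minimal version; assumes no ties)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return np.corrcoef(rank(x), rank(y))[0, 1]

def kendall_tau(x, y):
    """Kendall's tau-a (minimal O(n^2) version, no tie correction)."""
    n = len(x)
    s = sum(np.sign((x[i] - x[j]) * (y[i] - y[j]))
            for i in range(n) for j in range(i + 1, n))
    return s / (n * (n - 1) / 2)

def benchmark(S, f_values, d):
    """Rank-correlate d(A, A_ref) with |f(A) - f(A_ref)| over a set S.

    S: list of representations; f_values: functionality f(A) for each A in S;
    d: dissimilarity function. The reference is the highest-accuracy element.
    """
    ref = int(np.argmax(f_values))
    dists = np.array([d(A, S[ref]) for A in S])
    fdiff = np.array([abs(f - f_values[ref]) for f in f_values])
    return spearman_rho(dists, fdiff), kendall_tau(dists, fdiff)
```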
The above procedure provides a quantitative measure of how well the distance metric $d$ responds to the functionality $f$. For instance, in the layer specificity test, since depth affects probing accuracy strongly while random seed affects it only weakly, a dissimilarity measure with high rank correlation will be strongly responsive to layer depth and weakly responsive to seed; thus rank correlation quantitatively formalizes the test from Section 3.1.
Figure 2: (2(a)) Changing the depth of the examined BERT base layer strongly affects probing accuracy on QNLI; the trend for each randomly initialized model is displayed semi-transparently, and the solid black line is the mean trend. (2(b)) Truncating principal components from pretrained BERT base significantly degrades probing accuracy on SST-2 (BERT layer 12 shown here). (2(c)) Finetuning BERT base with different seeds leads to variation in accuracies on the Lexical (non-entailed) subset of HANS (McCoy et al., 2020), shown via histogram. (2(d)) Pretraining and finetuning BERT medium with 10 different pretraining seeds and 10 different finetuning seeds per pretrained model leads to variation in accuracies on the Antonymy (yellow scatter points) and Numerical (blue scatter points) stress tests (Naik et al., 2018).

Correlation metrics also capture properties that our intuition might miss. For instance, Figure 2(a) shows that some variation in random seed actually does affect accuracy, and our procedure rewards metrics that pick up on this, while the intuitive sensitivity test would penalize them.
Our procedure requires choosing a collection of models $S$; the crucial feature of $S$ is that it contains models with diverse behavior according to $f$. Different sets $S$, combined with a functional difference $f$, can be thought of as miniature "benchmarks" that surface complementary perspectives on dissimilarity measures' responsiveness to that functional difference. In the rest of this section, we instantiate this quantitative benchmark for several choices of $S$ and $f$, starting with the layer and principal component tests from Section 3 and continuing on to several tests of OOD performance.
The overall results are summarized in Table 1. Note that for any single benchmark, we expect the correlation coefficients to be significantly lower than 1, since the metric must capture all important axes of variation while $f$ measures only one type of functionality. A good metric is one that has consistently high correlation across many different functional measures.
Benchmark 1: Layer depth.
To turn the layer test into a benchmark, we construct a set of representations $S$ by pretraining BERT base models with different initialization seeds and including each of the 12 BERT layers as a representation. We separately consider two functionalities $f$: probing accuracy on QNLI (Wang et al., 2018a) or SST-2 (Socher et al., 2013). To compute the rank correlation, we take the reference representation to be the depth-12 (final layer) representation with highest probing accuracy. We compute the Kendall's $\tau$ and Spearman's $\rho$ rank correlations between the dissimilarities and the probing accuracy differences, and report the results in Table 1.
We find that PWCCA has lower rank correlations compared to CKA and Procrustes for both probing tasks. This corroborates the intuitive specificity test (Section 3.1), suggesting that PWCCA registers too large of a dissimilarity across random initializations.
Table 1: Rank correlations (Spearman's ρ / Kendall's τ) between each dissimilarity measure and differences in functionality, for each benchmark.

Perturbation                      Functionality                  Procrustes       CKA              PWCCA
Layer depth                       QNLI probe                     0.862 / 0.670    0.876 / 0.685    0.763 / 0.564
Layer depth                       SST-2 probe                    0.890 / 0.707    0.905 / 0.732    0.829 / 0.637
PC deletion                       SST-2 probe                    0.860 / 0.677    0.751 / 0.564    0.870 / 0.690
Finetuning seed                   HANS Lexical (non-entailed)    – / –            – / –            – / –
Pretraining + finetuning seeds    Antonymy stress test           0.243 / 0.178    0.227 / 0.160    0.204 / 0.152
Pretraining + finetuning seeds    Numerical stress test          0.071 / 0.049    0.122 / 0.084    0.031 / 0.023
Total                             Average                        0.580 / 0.447    0.557 / 0.426    0.544 / 0.413
Benchmark 2: Principal component (PC) deletion.
We next quantify the PC deletion test from Section 3.2, by constructing a set of representations $S$ that vary in both random initialization and fraction of principal components deleted. We pretrain 10 BERT base models with different initializations, and for each pretrained model we obtain 14 different representations by deleting that representation's $k$ smallest principal components, for 14 values of $k$. Thus $S$ has 140 elements. The representations themselves are the layer $\ell$ activations, for $\ell \in \{8, \ldots, 12\}$ (earlier layers have near-chance accuracy on probing tasks, so we ignore them), so there are 5 different choices of $S$. We use SST-2 probing accuracy as the functionality of interest $f$, and select the reference representation as the element in $S$ with highest accuracy. Rank correlation results are consistent across the 5 choices of $\ell$ (Appendix C), so we report the average as a summary statistic in Table 1.
We find that PWCCA has the highest rank correlation between dissimilarity and probing accuracy, followed by Procrustes, and distantly followed by CKA. This corroborates the intuitive observations from Section 3.2 that CKA is not sensitive to principal component deletion.
4.1 Investigating variation in OOD performance across random seeds
So far our benchmarks have been based on probing accuracy, which only measures in-distribution behavior (the train and test sets of the probe are typically i.i.d.). In addition, representations were always pretrained but not finetuned. To add diversity to our benchmarks, we next consider the out-of-distribution performance of several collections of finetuned models.
Benchmark 3: Changing finetuning seeds.
McCoy et al. (2020) show that a single pretrained BERT base model finetuned on MNLI with different random initializations will produce models with similar in-distribution performance, but widely variable performance on out-of-distribution data (see Figure 2(c)). We thus create a benchmark out of McCoy et al.'s 100 released finetuned models, using OOD accuracy on the "Lexical Heuristic (Non-entailment)" subset of the HANS dataset (McCoy et al., 2019) as our functionality $f$. This functionality is associated with the entire model, rather than an individual layer (in contrast to the probing functionality), but we consider one layer at a time to measure whether dissimilarities between representations at that layer correlate with $f$. This also allows us to localize whether certain layers are more predictive of $f$.
We construct 12 different sets $S$ (one for each of the 12 layers of BERT base), taking the reference representation to be that of the highest-accuracy model according to $f$. As before, we report each dissimilarity measure's rank correlation with $f$ in Table 1, averaged over the 12 runs.
All three dissimilarity measures correlate with OOD accuracy, with Orthogonal Procrustes and PWCCA being more correlated than CKA. Since the representations in our benchmarks were computed on in-distribution MNLI data, this has the interesting implication that dissimilarity measures can detect OOD differences without access to OOD data. It also implies that random initialization leads to meaningful functional differences that are picked up by these measures, especially Procrustes and PWCCA. Contrast this with our intuitive specificity test in Section 3.1, where all sensitivity to random initialization was seen as a shortcoming. Our more quantitative benchmark here suggests that some of that sensitivity tracks true functionality.
To check that the differences in rank correlation for Procrustes, PWCCA, and CKA are statistically significant, we compute bootstrap estimates of their 95% confidence intervals. With 2000 bootstrapped samples, we find statistically significant differences between all pairs of measures for most choices of layer depth, so we conclude PWCCA > Orthogonal Procrustes > CKA (the full results are in Appendix D). We do not apply this procedure for the previous two benchmarks, because there the different models have correlated randomness, and so any $p$-value based on independence assumptions would be invalid.

Changing both pretraining and finetuning seeds: a challenge set.
We also construct benchmarks from a collection of 100 BERT medium models, trained with all combinations of 10 pretraining and 10 finetuning seeds. The models are finetuned on MNLI, and we consider two different functionalities of interest $f$: accuracy on the Antonymy stress test and on the Numerical stress test (Naik et al., 2018), which both show significant variation in accuracy across models (see Figure 2(d)). We obtain 8 different sets $S$ (one for each of the 8 layer depths in BERT medium), again taking the reference representation to be that of the highest-accuracy model according to $f$. Rank correlations for each dissimilarity measure are averaged over the 8 runs and reported in Table 1.
None of the dissimilarity measures shows a large rank correlation for either task, and for the Numerical task, at most layers, the associated $p$-values (assuming independence) are non-significant at the 0.05 level (see Appendix C for $p$-values as produced by scikit-learn; strictly speaking, these $p$-values are invalid because they assume independence, while the pretraining seed induces correlations, but correctly accounting for these correlations would tend to make the $p$-values larger, preserving our conclusion of non-significance). Thus we conclude that all measures fail to be sensitive to OOD accuracy in these settings. One reason for this could be that there is less variation in the OOD accuracies compared to the previous experiment (compare Figure 2(c) to 2(d)). Another reason could be that it is harder to correctly account for both pretraining and finetuning variation at the same time. Either way, we hope that future dissimilarity measures can improve upon these results, and we present this benchmark as a challenge task to motivate progress.
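The percentile-bootstrap confidence intervals used in Benchmark 3 can be sketched as follows (a sketch assuming i.i.d. models, which, as noted above, fails when models share pretraining seeds; names ours):

```python
import numpy as np

def bootstrap_ci(dists, f_diffs, corr_fn, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a correlation statistic.

    Resamples (distance, functionality-difference) pairs with replacement
    and returns the empirical (alpha/2, 1 - alpha/2) quantiles of corr_fn.
    Assumes the pairs are i.i.d. across models.
    """
    rng = np.random.default_rng(seed)
    dists = np.asarray(dists)
    f_diffs = np.asarray(f_diffs)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(dists), size=len(dists))
        stats.append(corr_fn(dists[idx], f_diffs[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

Two measures can then be compared by checking whether their intervals (or the interval of their paired difference) overlap.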
5 Discussion
In this work we proposed a quantitative measure for evaluating similarity metrics, based on the rank correlation with functional behavior. Using this, we generated tasks motivated by sensitivity to deleting important directions, specificity to random initialization, and sensitivity to outofdistribution performance. Popular existing metrics such as CKA and CCA often performed poorly on these tasks, sometimes in striking ways. Meanwhile, the classical Orthogonal Procrustes transform attained consistently good performance.
Given the success of Orthogonal Procrustes, it is worth reflecting on how it differs from the other metrics and why it might perform well. To do so, we consider a simplified case where $A$ and $B$ have the same singular vectors but different singular values. Thus without loss of generality $A = U \Sigma_A V^\top$ and $B = U \Sigma_B V^\top$, where $\Sigma_A$ and $\Sigma_B$ are both diagonal. In this case, the Orthogonal Procrustes distance reduces to $\|\Sigma_A - \Sigma_B\|_F^2 = \sum_i (\sigma_i^A - \sigma_i^B)^2$, the sum of the squared distances between the singular values. We will see that both CCA and CKA reduce to less reasonable formulae in this case.

Orthogonal Procrustes vs. CCA. All three metrics derived from CCA assign zero distance even when the (nonzero) singular values are arbitrarily different. This is because CCA correlation coefficients are invariant to all invertible linear transformations. This invariance property may help explain why CCA metrics generally find layers within the same network to be much more similar than networks trained with different randomness. Random initialization introduces noise, particularly in unimportant principal components, while representations within the same network more easily preserve these components, and CCA may place too much weight on their associated correlation coefficients.
Orthogonal Procrustes vs. CKA. In contrast to the squared distance computed by Orthogonal Procrustes, CKA reduces to a quartic function based on the dot product between the squared singular values of $A$ and $B$. As a consequence, CKA is dominated by the representations' largest singular values, leaving it insensitive to meaningful differences in smaller singular values, as illustrated in Figure 2. This lack of sensitivity to moderate-sized differences may help explain why CKA fails to track out-of-distribution error effectively.
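Both reductions can be checked numerically in the simplified shared-singular-vector case (a sketch with random matrices; all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6

# Construct A and B with shared singular vectors but different singular values.
U, _ = np.linalg.qr(rng.normal(size=(n, p)))
V, _ = np.linalg.qr(rng.normal(size=(p, p)))
sa = np.sort(rng.uniform(0.5, 2.0, p))[::-1]
sb = np.sort(rng.uniform(0.5, 2.0, p))[::-1]
A = U @ np.diag(sa) @ V.T
B = U @ np.diag(sb) @ V.T

# Procrustes distance reduces to the squared distance between singular values.
proc = np.linalg.norm(A) ** 2 + np.linalg.norm(B) ** 2 \
    - 2 * np.linalg.norm(A.T @ B, ord="nuc")
assert np.isclose(proc, ((sa - sb) ** 2).sum())

# CKA reduces to a normalized dot product between the *squared* singular values.
cka = np.linalg.norm(A.T @ B, "fro") ** 2 / (
    np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro"))
cka_sv = (sa**2 * sb**2).sum() / (np.linalg.norm(sa**2) * np.linalg.norm(sb**2))
assert np.isclose(cka, cka_sv)
```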
In addition to helping understand similarity measures, our benchmarks pinpoint directions for improvement. No method was sensitive to accuracy on the Numerical stress test in our challenge set, possibly due to a lower signal-to-noise ratio. Since Orthogonal Procrustes performed well on most of our tasks, it could be a promising foundation for a new measure, and recent work shows how to regularize Orthogonal Procrustes to handle high noise (Pumir et al., 2021). Perhaps similar techniques could be adapted here.

An alternative to our benchmarking approach is to directly define two representations' dissimilarity as their difference in a functional behavior of interest. Feng et al. (2020) take this approach, defining dissimilarity as the difference in accuracy on a handful of probing tasks. One drawback of this approach is that a small set of probes may not capture all the differences in representations, so it is useful to base dissimilarity measures on representations' intrinsic properties. Intrinsically defined dissimilarities also have the potential to highlight new functional behaviors, as we found that representations with similar in-distribution probing accuracy often have highly variable OOD accuracy.
A limitation of our work is that we only consider a handful of model variations and functional behaviors, and restricting our attention to these settings could overlook other important considerations. To address this, we envision a paradigm in which a rich tapestry of benchmarks is used to ground and validate neural network interpretations. Other axes of variation in models could include training on more or fewer examples, training on shuffled labels vs. real labels, training from specifically chosen initializations (Frankle and Carbin, 2018), and using different architectures. Other functional behaviors to examine could include modularity and meta-learning capabilities. Benchmarks could also be applied to other interpretability tools beyond dissimilarity. For example, sensitivity to deleting principal components could provide an additional sanity check for saliency maps and other visualization tools (Adebayo et al., 2018).
More broadly, many interpretability tools are designed as audits of models, although it is often unclear what characteristics of the models are consistently audited. We position this work as a counter-audit: by collecting models that differ in functional behavior, we can assess whether the interpretability tools (CKA, PWCCA, etc.) accurately reflect the behavioral differences. Many other types of counter-audits may be designed to assess other interpretability tools. For example, models that have backdoors built in to misclassify certain inputs provide counter-audits for interpretability tools that explain model predictions: these explanations should reflect any backdoors present (Li et al., 2020; Chen et al., 2017; Wang et al., 2019; Kurita et al., 2020). We are hopeful that more comprehensive checks on interpretability tools will provide deeper understanding of neural networks, and more reliable models.
Thanks to Ruiqi Zhong for helpful comments and assistance in finetuning models, and thanks to Daniel Rothchild and our anonymous reviewers for helpful discussion. FD is supported by an NSF Graduate Research Fellowship and the Open Philanthropy Project AI Fellows Program.
References
Adebayo et al. [2018] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. Advances in Neural Information Processing Systems, 31:9505–9515, 2018.
Alain and Bengio [2018] G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes, 2018.
Arora et al. [2017] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings, 2017.
Belinkov et al. [2017] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, and J. Glass. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, 2017.
Chen et al. [2017] X. Chen, C. Liu, B. Li, K. Lu, and D. Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
Conneau et al. [2018] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data, 2018.
D’Amour et al. [2020] A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
Devlin et al. [2018] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Feng et al. [2020] Y. Feng, R. Zhai, D. He, L. Wang, and B. Dong. Transferred discrepancy: Quantifying the difference between representations. arXiv preprint arXiv:2007.12446, 2020.
Frankle and Carbin [2018] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2018.
Kornblith et al. [2019] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529, 2019.
Kurita et al. [2020] K. Kurita, P. Michel, and G. Neubig. Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660, 2020.
Laakso and Cottrell [2000] A. Laakso and G. Cottrell. Content and cluster analysis: Assessing representational similarity in neural systems. Philosophical Psychology, 13(1):47–76, 2000.
Leavitt and Morcos [2020] M. L. Leavitt and A. Morcos. Towards falsifiable interpretability research. arXiv preprint arXiv:2010.12016, 2020.
Lenc and Vedaldi [2015] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence, 2015.
Li et al. [2020] S. Li, S. Ma, M. Xue, and B. Z. H. Zhao. Deep learning backdoors. arXiv preprint arXiv:2007.08273, 2020.
Li et al. [2015] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft. Convergent learning: Do different neural networks learn the same representations? In Feature Extraction: Modern Questions and Challenges, pages 196–212. PMLR, 2015.
Liang et al. [2019] R. Liang, T. Li, L. Li, J. Wang, and Q. Zhang. Knowledge consistency between neural networks and beyond. In International Conference on Learning Representations, 2019.
McCoy et al. [2019] R. T. McCoy, E. Pavlick, and T. Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019.
McCoy et al. [2020] R. T. McCoy, J. Min, and T. Linzen. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 217–227, 2020.
Morcos et al. [2018] A. Morcos, M. Raghu, and S. Bengio. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, pages 5727–5736, 2018.
Naik et al. [2018] A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, 2018.
Peters et al. [2018] M. E. Peters, M. Neumann, L. Zettlemoyer, and W.-t. Yih. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949, 2018.
Pumir et al. [2021] T. Pumir, A. Singer, and N. Boumal. The generalized orthogonal Procrustes problem in the high noise regime. Information and Inference: A Journal of the IMA, Jan 2021. ISSN 2049-8772. doi: 10.1093/imaiai/iaaa035. URL http://dx.doi.org/10.1093/imaiai/iaaa035.
Raghu et al. [2019] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2019.
Raghu et al. [2017] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
Schönemann [1966] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
Smith et al. [2017] S. L. Smith, D. H. P. Turban, S. Hamblin, and N. Y. Hammerla. Offline bilingual word vectors, orthogonal transformations and the inverted softmax, 2017.
Socher et al. [2013] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
Tamkin et al. [2020] A. Tamkin, T. Singh, D. Giovanardi, and N. Goodman. Investigating transferability in pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1393–1401, 2020.
Tenney et al. [2019] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJzSgnRcKX.
Turc et al. [2019] I. Turc, M.-W. Chang, K. Lee, and K. Toutanova. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962, 2019.
Wang et al. [2018a] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018a.
Wang et al. [2019] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE, 2019.
Wang et al. [2018b] L. Wang, L. Hu, J. Gu, Z. Hu, Y. Wu, K. He, and J. Hopcroft. Towards understanding learning representations: To what extent do different neural networks learn the same representation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 9584–9593. Curran Associates, Inc., 2018b. URL https://proceedings.neurips.cc/paper/2018/file/5fc34ed307aac159a30d81181c99847e-Paper.pdf.
Williams et al. [2018] A. Williams, N. Nangia, and S. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
Zhong et al. [2021] R. Zhong, D. Ghosh, D. Klein, and J. Steinhardt. Are larger pretrained language models uniformly better? Comparing performance at the instance level. arXiv preprint arXiv:2105.06020, 2021.
Appendix
Appendix A BERT finetuning details
We finetuned models from Zhong et al. [2021] and the original BERT models from Devlin et al. [2018] on three tasks: Quora Question Pairs (QQP; https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs), Multi-Genre Natural Language Inference (MNLI; Williams et al. [2018]), and the Stanford Sentiment Treebank (SST-2; Socher et al. [2013]), and show each model’s accuracy on these tasks in Table 2. Our models generally have comparable accuracy.
As in Turc et al. [2019], we finetune for 4 epochs on each dataset. For each task and model size, we tune hyperparameters in the following way: we first randomly split our new training set into 80% and 20%; then we finetune on the 80% split with all 9 combinations of batch size (16, 32, 64) and learning rate (1e-4, 5e-5, 3e-5), and choose the combination that leads to the best average accuracy on the remaining 20%. Finetuning these models for all three tasks requires around 500 hours.
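The hyperparameter selection loop described above can be sketched as follows. This is a minimal illustration rather than the actual training code: `fake_eval` is a hypothetical stand-in for a function that finetunes on the 80% split and returns accuracy on the held-out 20%.

```python
from itertools import product

# Grid from the text: all 9 combinations of batch size and learning rate.
BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [1e-4, 5e-5, 3e-5]

def select_hyperparameters(eval_fn):
    """Return the (batch_size, lr) pair with the best held-out accuracy.

    eval_fn(batch_size, lr) is assumed to finetune on the 80% split and
    return average accuracy on the remaining 20%.
    """
    grid = list(product(BATCH_SIZES, LEARNING_RATES))
    return max(grid, key=lambda cfg: eval_fn(*cfg))

# Toy stand-in so the sketch runs end to end; it simply peaks at (32, 5e-5).
def fake_eval(batch_size, lr):
    return -abs(batch_size - 32) - 1e4 * abs(lr - 5e-5)

best_batch_size, best_lr = select_hyperparameters(fake_eval)
```

In the real setup, `eval_fn` would be run once per grid point for each of the three tasks and model sizes.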
Appendix B Licenses
The source code for BERT models available at https://github.com/google-research/bert is licensed under the Apache License 2.0.
The model weights for the 100 BERT base models provided by McCoy et al. [2020] are licensed under the Creative Commons Attribution 4.0 International license, and their source code is licensed under the MIT license (https://github.com/tommccoy1/hans/blob/master/LICENSE.md).
Appendix C Layerwise results
Some of the results presented in Table 1 were averaged over multiple layers, since rankings between dissimilarity measures were consistent across different layers. However, rank correlation scores are higher across all measures for certain layers, so we include layer-by-layer results here for completeness. We also include scores for the two additional CCA-based measures here, and note that they are often similar to PWCCA and generally dominated by other measures. We expand each row of Table 1 into a subsection of its own. We also include p-values as reported by scikit-learn, although we note that because random seeds are shared among some representations, these p-values are all inflated, with the exception of those for the experiment perturbing only finetuning seed and assessing functionality through HANS (C.3). The invalid p-values may all be thought of as upper bounds on the significance of the rank correlation results.
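As a concrete illustration of the statistics reported in the tables below, the following sketch computes Spearman's ρ and Kendall's τ with their p-values on synthetic data. The data and variable names here are our own illustrative assumptions, and we use `scipy.stats` as one standard source of these statistics.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical setup: for 100 models, a dissimilarity-to-reference score and
# a functional-difference score that a good measure should track.
rng = np.random.default_rng(0)
functional_diff = rng.random(100)
# Noisy monotone relation between dissimilarity and functional difference.
dissimilarity = functional_diff + 0.1 * rng.standard_normal(100)

rho, rho_p = spearmanr(dissimilarity, functional_diff)
tau, tau_p = kendalltau(dissimilarity, functional_diff)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.1e})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.1e})")
```

These p-values assume independent samples, which is exactly the assumption violated by shared random seeds in the experiments, hence the inflation noted above.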
C.1 Perturbation: pretraining seed and layer depth
Tables 3 and 4 show the full results (including p-values and all 5 dissimilarity measures) using the QNLI probe as the functionality of interest, for Spearman's ρ and Kendall's τ, respectively. Tables 5 and 6 present results for the probing task SST-2 as the functionality of interest.
Layer  Procrustes  CKA  PWCCA  
12  0.862 (6.5E37)  0.876 (1.6E39)  0.763 (2.2E24)  0.849 (1.0E+00)  0.846 (1.0E+00) 
Layer  Procrustes  CKA  PWCCA  
12  0.670 (1.1E27)  0.685 (7.4E29)  0.564 (3.2E20)  0.652 (1.0E+00)  0.647 (1.0E+00) 
Layer  Procrustes  CKA  PWCCA  
12  0.890 (2.7E42)  0.905 (5.3E46)  0.829 (7.7E32)  0.857 (1.0E+00)  0.854 (1.0E+00) 
Layer  Procrustes  CKA  PWCCA  
12  0.707 (1.2E30)  0.732 (1.0E32)  0.637 (3.1E25)  0.662 (1.0E+00)  0.658 (1.0E+00) 
C.2 Perturbation: pretraining seed and principal component deletion
We find that for these experiments, results are consistent across the layers we analyze (the last 6 layers of BERT base). Tables 7 and 8 show results for Spearman's ρ and Kendall's τ, respectively.
Layer  Procrustes  CKA  PWCCA  
8  0.764 (2.4E36)  0.668 (3.2E25)  0.776 (3.4E38)  0.700 (1.9E28)  0.700 (1.8E28) 
9  0.813 (2.1E44)  0.706 (4.0E29)  0.825 (9.2E47)  0.728 (1.3E31)  0.728 (1.2E31) 
10  0.873 (2.1E58)  0.818 (2.7E45)  0.874 (1.1E58)  0.748 (3.2E34)  0.749 (2.7E34) 
11  0.918 (1.2E74)  0.797 (1.4E41)  0.922 (1.7E76)  0.781 (6.6E39)  0.781 (7.0E39) 
12  0.932 (1.1E81)  0.766 (1.1E36)  0.955 (4.2E97)  0.810 (6.1E44)  0.810 (6.1E44) 
Layer  Procrustes  CKA  PWCCA  
8  0.560 (1.8E29)  0.479 (4.4E22)  0.573 (1.1E30)  0.512 (6.8E25)  0.512 (6.6E25) 
9  0.602 (1.2E33)  0.509 (1.2E24)  0.618 (2.5E35)  0.542 (1.1E27)  0.543 (9.7E28) 
10  0.684 (5.6E43)  0.627 (2.1E36)  0.685 (5.3E43)  0.588 (2.9E32)  0.589 (2.5E32) 
11  0.751 (2.8E51)  0.616 (3.3E35)  0.756 (6.4E52)  0.648 (9.2E39)  0.648 (9.2E39) 
12  0.787 (3.4E56)  0.588 (2.9E32)  0.819 (1.2E60)  0.701 (4.7E45)  0.701 (4.9E45) 
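The principal component deletion perturbation used in this experiment can be sketched as below, treating a representation as an (examples × features) activation matrix. Centering before the decomposition is our assumption here, not a detail taken from the text.

```python
import numpy as np

def delete_top_components(X, k):
    """Project the top-k principal components out of an activation matrix.

    X: (n_examples, n_features) representation matrix. Returns the centered
    matrix with its k leading principal directions removed.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered matrix are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:k]                      # (k, n_features), orthonormal rows
    return Xc - Xc @ top.T @ top      # subtract the projection onto the top-k subspace

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))   # toy representation: 200 examples, 50 neurons
X_deleted = delete_top_components(X, 10)
```

After the deletion, the representation carries no variance along the removed directions, so a measure sensitive to this perturbation should report growing dissimilarity as k increases.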
C.3 Perturbation: finetuning seed, Functionality: HANS
Results for this experiment are similar across layers for Procrustes and all three CCA-based measures, with middle layers of BERT base having slightly higher rank correlation scores in general. For CKA, this effect is even more pronounced. Tables 9 and 10 show the results for Spearman's ρ and Kendall's τ, respectively.
Layer  Procrustes ()  CKA ()  PWCCA ()  ()  () 
1  0.425 (5.1E06)  0.361 (1.1E04)  0.405 (1.4E05)  0.388 (3.4E05)  0.389 (3.2E05) 
2  0.510 (3.1E08)  0.410 (1.2E05)  0.486 (1.5E07)  0.488 (1.3E07)  0.483 (1.8E07) 
3  0.531 (6.6E09)  0.427 (4.6E06)  0.538 (3.8E09)  0.533 (5.6E09)  0.532 (6.2E09) 
4  0.543 (2.6E09)  0.506 (3.9E08)  0.552 (1.4E09)  0.555 (1.0E09)  0.550 (1.5E09) 
5  0.563 (5.3E10)  0.512 (2.6E08)  0.570 (2.9E10)  0.582 (1.1E10)  0.580 (1.3E10) 
6  0.629 (1.2E12)  0.641 (3.6E13)  0.621 (2.8E12)  0.621 (2.7E12)  0.622 (2.5E12) 
7  0.647 (1.7E13)  0.658 (5.0E14)  0.647 (1.7E13)  0.653 (9.0E14)  0.650 (1.2E13) 
8  0.643 (2.7E13)  0.552 (1.3E09)  0.653 (9.5E14)  0.651 (1.1E13)  0.651 (1.2E13) 
9  0.589 (5.9E11)  0.419 (7.1E06)  0.641 (3.5E13)  0.662 (3.3E14)  0.660 (4.2E14) 
10  0.536 (4.6E09)  0.437 (2.7E06)  0.559 (7.3E10)  0.612 (6.6E12)  0.614 (5.4E12) 
11  0.532 (6.2E09)  0.426 (4.9E06)  0.565 (4.7E10)  0.619 (3.4E12)  0.614 (5.5E12) 
12  0.465 (5.3E07)  0.192 (2.8E02)  0.574 (2.1E10)  0.609 (9.2E12)  0.610 (7.9E12) 
Layer  Procrustes ()  CKA ()  PWCCA ()  ()  () 
1  0.295 (6.7E06)  0.269 (3.6E05)  0.277 (2.2E05)  0.265 (4.7E05)  0.268 (4.0E05) 
2  0.363 (4.6E08)  0.288 (1.1E05)  0.343 (2.1E07)  0.342 (2.3E07)  0.342 (2.4E07) 
3  0.372 (2.1E08)  0.290 (9.5E06)  0.378 (1.3E08)  0.375 (1.6E08)  0.375 (1.6E08) 
4  0.393 (3.4E09)  0.358 (6.6E08)  0.401 (1.7E09)  0.405 (1.2E09)  0.403 (1.4E09) 
5  0.410 (7.7E10)  0.367 (3.3E08)  0.417 (4.1E10)  0.428 (1.4E10)  0.424 (2.0E10) 
6  0.464 (4.2E12)  0.474 (1.5E12)  0.460 (6.3E12)  0.460 (5.8E12)  0.461 (5.6E12) 
7  0.483 (5.5E13)  0.488 (3.3E13)  0.481 (7.1E13)  0.486 (3.9E13)  0.483 (5.5E13) 
8  0.478 (9.2E13)  0.392 (3.7E09)  0.483 (5.7E13)  0.481 (6.5E13)  0.480 (7.7E13) 
9  0.432 (1.0E10)  0.293 (7.7E06)  0.475 (1.2E12)  0.496 (1.3E13)  0.494 (1.6E13) 
10  0.380 (1.0E08)  0.306 (3.4E06)  0.401 (1.7E09)  0.447 (2.3E11)  0.448 (2.1E11) 
11  0.376 (1.5E08)  0.292 (8.3E06)  0.411 (6.9E10)  0.448 (2.1E11)  0.445 (2.7E11) 
12  0.330 (5.7E07)  0.127 (3.1E02)  0.416 (4.4E10)  0.446 (2.5E11)  0.447 (2.2E11) 
C.4 Perturbation: pretraining seeds and finetuning seeds of BERT medium
Rank correlation scores are low across the board for this task, suggesting that it is difficult for all existing dissimilarity measures, regardless of the layer within a network. Results on the Antonymy stress test for Spearman's ρ and Kendall's τ are in Tables 11 and 12, respectively. Results on the Numerical stress test for Spearman's ρ and Kendall's τ are in Tables 13 and 14, respectively.
Layer  Procrustes  CKA  PWCCA  
1  0.252 (5.7E03)  0.241 (7.8E03)  0.168 (4.7E02)  0.305 (1.0E+00)  0.327 (1.0E+00) 
2  0.213 (1.7E02)  0.145 (7.5E02)  0.131 (9.7E02)  0.047 (6.8E01)  0.031 (6.2E01) 
3  0.260 (4.5E03)  0.262 (4.2E03)  0.208 (1.9E02)  0.137 (9.1E01)  0.111 (8.6E01) 
4  0.260 (4.5E03)  0.265 (3.8E03)  0.265 (3.8E03)  0.276 (1.0E+00)  0.254 (9.9E01) 
5  0.273 (3.0E03)  0.302 (1.1E03)  0.278 (2.5E03)  0.339 (1.0E+00)  0.310 (1.0E+00) 
6  0.330 (3.9E04)  0.280 (2.4E03)  0.346 (2.1E04)  0.313 (1.0E+00)  0.304 (1.0E+00) 
7  0.271 (3.2E03)  0.315 (7.1E04)  0.111 (1.4E01)  0.091 (8.2E01)  0.090 (8.1E01) 
8  0.084 (2.0E01)  0.004 (4.8E01)  0.123 (1.1E01)  0.204 (9.8E01)  0.198 (9.8E01) 
Layer  Procrustes  CKA  PWCCA  
1  0.199 (1.7E03)  0.171 (5.9E03)  0.126 (3.3E02)  0.244 (1.0E+00)  0.243 (1.0E+00) 
2  0.179 (4.3E03)  0.123 (3.5E02)  0.118 (4.2E02)  0.061 (8.1E01)  0.042 (7.3E01) 
3  0.185 (3.3E03)  0.186 (3.2E03)  0.139 (2.0E02)  0.110 (9.5E01)  0.096 (9.2E01) 
4  0.187 (3.0E03)  0.191 (2.6E03)  0.188 (2.9E03)  0.206 (1.0E+00)  0.193 (1.0E+00) 
5  0.192 (2.4E03)  0.194 (2.2E03)  0.202 (1.5E03)  0.267 (1.0E+00)  0.242 (1.0E+00) 
6  0.236 (2.7E04)  0.197 (1.9E03)  0.252 (1.1E04)  0.229 (1.0E+00)  0.221 (1.0E+00) 
7  0.189 (2.8E03)  0.217 (7.3E04)  0.091 (9.1E02)  0.081 (8.8E01)  0.082 (8.9E01) 
8  0.061 (1.9E01)  0.000 (5.0E01)  0.101 (6.9E02)  0.155 (9.9E01)  0.150 (9.9E01) 
Layer  Procrustes  CKA  PWCCA  
1  0.137 (8.7E02)  0.108 (1.4E01)  0.107 (1.4E01)  0.072 (7.6E01)  0.072 (7.6E01) 
2  0.012 (5.5E01)  0.060 (2.8E01)  0.062 (2.7E01)  0.004 (5.1E01)  0.001 (5.0E01) 
3  0.059 (7.2E01)  0.011 (4.6E01)  0.031 (6.2E01)  0.060 (2.8E01)  0.056 (2.9E01) 
4  0.041 (3.4E01)  0.052 (3.0E01)  0.026 (6.0E01)  0.101 (1.6E01)  0.084 (2.0E01) 
5  0.003 (4.9E01)  0.131 (9.7E02)  0.047 (6.8E01)  0.061 (2.7E01)  0.061 (2.7E01) 
6  0.092 (1.8E01)  0.260 (4.5E03)  0.029 (6.1E01)  0.064 (2.6E01)  0.056 (2.9E01) 
7  0.164 (5.2E02)  0.250 (6.1E03)  0.037 (3.6E01)  0.040 (6.5E01)  0.040 (6.5E01) 
8  0.202 (2.2E02)  0.105 (1.5E01)  0.175 (4.1E02)  0.134 (9.1E01)  0.143 (9.2E01) 
Layer  Procrustes  CKA  PWCCA  
1  0.103 (6.5E02)  0.083 (1.1E01)  0.074 (1.4E01)  0.050 (7.7E01)  0.048 (7.6E01) 
2  0.010 (5.6E01)  0.046 (2.5E01)  0.046 (2.5E01)  0.006 (5.3E01)  0.001 (5.0E01) 
3  0.041 (7.3E01)  0.014 (4.2E01)  0.018 (6.0E01)  0.047 (2.5E01)  0.047 (2.4E01) 
4  0.031 (3.2E01)  0.038 (2.9E01)  0.020 (6.2E01)  0.076 (1.3E01)  0.065 (1.7E01) 
5  0.005 (4.7E01)  0.086 (1.0E01)  0.031 (6.8E01)  0.042 (2.7E01)  0.042 (2.7E01) 
6  0.060 (1.9E01)  0.175 (5.1E03)  0.020 (6.2E01)  0.050 (2.3E01)  0.046 (2.5E01) 
7  0.112 (4.9E02)  0.168 (6.8E03)  0.030 (3.3E01)  0.019 (6.1E01)  0.024 (6.4E01) 
8  0.131 (2.7E02)  0.063 (1.8E01)  0.125 (3.3E02)  0.099 (9.3E01)  0.103 (9.4E01) 
Appendix D Bootstrap significance testing for changing finetuning seeds
To assess whether the differences between rank correlations are statistically significant in the experiments varying finetuning seed and comparing functional behavior on the OOD HANS dataset, we conduct bootstrap resampling. Concretely, for every pair of measures and every layer depth, we do the following:

1. Sample 100 models with replacement, and collect their representations at the specified layer depth.
2. Let the reference be the representation corresponding to the sampled model with maximum accuracy at that depth.
3. Compute the dissimilarities between the reference and the 100 sampled representations.
4. Compute the Kendall's τ and Spearman's ρ rank correlations for Orthogonal Procrustes, CKA, and PWCCA.
5. Record ρ(Procrustes) − ρ(CKA), ρ(PWCCA) − ρ(CKA), and ρ(PWCCA) − ρ(Procrustes), and the same pairwise differences for Kendall's τ.
6. Repeat the above 2000 times.
This gives us bootstrap distributions for the differences in rank correlations, from which we compute 95% confidence intervals. When a confidence interval does not overlap 0, we conclude that the difference in rank correlation is statistically significant. The figures below show the results for each layer. We see that in the deeper layers of the network (layers 8–12), PWCCA has a statistically significantly higher rank correlation than Orthogonal Procrustes, which in turn has a statistically significantly higher rank correlation than CKA. In earlier layers, results are sometimes statistically significant, but not always.
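The core of this procedure can be sketched as follows. This is a simplified version under our own assumptions: it fixes each model's dissimilarity and functional scores and only resamples models, whereas the full procedure above recomputes the reference and the dissimilarities within every resample.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_rho_difference(d_a, d_b, scores, n_boot=2000, seed=0):
    """95% bootstrap CI for the difference in Spearman rank correlation.

    d_a, d_b: dissimilarity-to-reference per model under measures A and B.
    scores: the functional score per model (e.g. accuracy on HANS).
    Returns (lo, hi): the 2.5th and 97.5th percentiles of rho_A - rho_B.
    """
    rng = np.random.default_rng(seed)
    n = len(scores)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample models with replacement
        rho_a, _ = spearmanr(d_a[idx], scores[idx])
        rho_b, _ = spearmanr(d_b[idx], scores[idx])
        diffs.append(rho_a - rho_b)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi
```

If the returned interval excludes 0, the difference between the two measures' rank correlations is judged significant, mirroring the criterion stated above.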