1 Introduction
As deep neural networks are increasingly deployed in real-world scenarios, robustness of their automatically learned feature representations has emerged as a key desideratum. Prior works have defined many measures of robustness corresponding to different types of synthetically generated or naturally occurring perturbations that can be applied to the inputs and their distributions (e.g., adversarial inputs (Biggio et al., 2013; Szegedy et al., 2013; Papernot et al., 2016) or distributional shifts (Geirhos et al., 2018; Hendrycks and Dietterich, 2019; Engstrom et al., 2019; Fawzi and Frossard, 2015; Taori et al., 2020)). The common property underlying the different robustness measures is that they all attempt to capture the extent to which the learned representations remain invariant (i.e., unchanged) under some defined set of perturbations.
In this paper, we propose a new way to study, understand, and characterize robustness of neural networks. Our key insight is that the set of input perturbations against which a neural network’s robustness is measured can itself be defined by another reference neural network. Specifically, given a reference neural network, we first obtain a set of input perturbations that are imperceptible to the reference network (i.e., find inputs with invariant reference representations), and then check the extent to which representations of other neural networks are invariant to these perturbations. Our proposal allows us to measure relative invariance of two neural network representations and estimate the degree to which the two neural networks share representational invariance. Intuitively, our proposal generalizes the often unstated, but implicit assumption behind all interesting sets of perturbations used in robustness studies today: they are perturbations that are imperceptible to a particular reference neural network, the human brain.
Comparing representational invariance of two neural networks is an important aspect of determining their representational (or perceptual) alignment. Assessing representational alignment is crucial for a future society with interacting agents controlled by neural networks (e.g., cars driven by different deep learning systems). Additionally, the ability to measure relative invariance, and therefore robustness, of deep neural network representations can offer insights into interesting questions such as: when updating a model, to what extent are invariances preserved (which may be crucial to regulators for safety assurance)? How does representational invariance vary with the choice of network architectures, loss functions, random weight initialization, and datasets used in the training process?
Our work is inspired by and builds upon previous studies investigating similarity between deep neural network representations (Raghu et al., 2017; Morcos et al., 2018; Kornblith et al., 2019a; Nguyen et al., 2020). However, we find that existing representational similarity literature focuses narrowly on comparing two representations of data samples drawn from a specific input distribution, ignoring representations of data samples outside of the distribution or changes in representation caused by input perturbations for which one of the two representations remains invariant. As such, existing measures of similarity between neural network representations offer no insight into their robustness and consequently, their alignment. Nevertheless, we retain the compelling (axiomatic) properties of existing similarity measures, by repurposing them to measure relative invariance. Specifically, for input perturbations imperceptible to the reference neural network, we quantify the invariance in representations of the other neural network using a popular similarity index called Centered Kernel Alignment (CKA) (Kornblith et al., 2019a).
To summarize, our key contributions are as follows:


We propose a measure of shared invariances between two representations that is based on the models which generated them. We show that our measure faithfully captures shared invariances between two models where existing measures of representation similarity (such as CKA) do not work adequately.

Our proposal repurposes existing representation similarity measures to measure shared invariance, thus preserving all the (desirable) axiomatic properties of these measures.

Using our measure we are able to derive novel insights about the impact of weight initialization, architecture, loss and training dataset on the shared invariances between networks. Our initial results show that our measure is a promising evaluation tool to better understand deep learning.

We find that shared invariance between models typically decreases in later layers; however, when trained using adversarial training, the same models end up with higher shared invariances even in later layers. We also see that models with residual connections tend to have higher shared invariances among themselves than with non-residual models.
2 Measuring Representational Robustness
Robustness is defined as “perform(ing) without failure under a wide range of conditions” (Merriam-Webster, 2022). Applying the definition in the context of learning models, we can disentangle two distinct requirements for a model to be considered robust: i) the model must produce correct outputs, i.e., have high accuracy, and ii) these outputs must be produced consistently for a diverse set of inputs, i.e., the outputs of the model must be invariant to irrelevant perturbations (changes) in the input. Extensive literature on robust learning (Szegedy et al., 2013; Papernot et al., 2016; Goodfellow et al., 2015) suggests that it is quite hard to train models that achieve high accuracy on standard benchmarking datasets and high invariance to irrelevant (adversarially generated or naturally occurring) perturbations simultaneously (Tsipras et al., 2018; Madry et al., 2019; Zhang et al., 2019). Reconciling correctness and invariance requirements remains a topic of active research.
Here, we investigate the robustness of neural network representations that are learned in the process of generating outputs. Specifically, we attempt to quantify the invariance of learned neural network representations to irrelevant perturbations in the inputs. Intuitively, a high representational invariance is a necessary (though not sufficient) condition for a neural network model to be robust.
2.1 A Relative Invariance Framework
In order to quantify the invariance of a neural network’s representations to irrelevant perturbations to inputs, we need to first define the set of such perturbations. Most existing works on robustness use humans as a reference model to define these perturbations. Thus, such works on robustness are inherently (and often implicitly) relative to a human. We generalize the reliance on humans as the reference model by allowing the reference model to be another neural network. We can then define the set of irrelevant input perturbations as those changes that do not cause any change in the reference model’s representation (i.e., perception) of the inputs. Finally, we quantify how invariant a given neural network’s representations are to these irrelevant input perturbations.
Our framework effectively quantifies invariance of one neural network’s representation relative to another reference neural network. Our use of a reference neural network model is inspired by how human perception is used as a reference to determine which adversarial perturbations (Szegedy et al., 2013) or image corruptions (Geirhos et al., 2018; Hendrycks and Dietterich, 2019) or image transformations (Engstrom et al., 2019; Fawzi and Frossard, 2015) do not alter the perception of inputs and are, hence, considered irrelevant perturbations.
2.2 Problem Setting
Given a reference neural network $m_1$ and $n$ samples ($X$) from a given data distribution ($\mathcal{D}$), our goal is to define a measure of how invariant a given target network $m_2$ is to perturbations of samples that are imperceptible to $m_1$, i.e., that do not change their representations according to $m_1$.
Invariant input perturbations for a reference model
In our framework, the invariances that are desirable are determined with respect to a reference model, $m_1$. To this end, we introduce the notion of Identically Represented Inputs (IRIs). For any given input data point $x$ and a given reference model $m_1$, the IRIs of $x$ are the set of all data points that are mapped to the same representation as $x$ by $m_1$. Formally,
$$\mathcal{S}_{m_1}(x) = \{\, x' : m_1(x') = m_1(x) \,\} \qquad (1)$$
In other words, $m_1$ is invariant to replacing $x$ with $x'$ and cannot perceive any difference between $x$ and any $x' \in \mathcal{S}_{m_1}(x)$. In practice, exact equality is hard to achieve, and thus we relax this formulation so that $x$ and $x'$ are almost indistinguishable for $m_1$, i.e., their representations under $m_1$ differ by at most $\delta$ in norm, for a small enough $\delta$,
$$\mathcal{S}_{m_1,\delta}(x) = \{\, x' : \lVert m_1(x') - m_1(x) \rVert \le \delta \,\} \qquad (2)$$
Going forward we use the relaxed formulation of IRIs and thus omit the subscript $\delta$. For each of the $n$ input points $x \in X$, if we pick an $x'$ from $\mathcal{S}_{m_1}(x)$, we get a corresponding batch of samples $X'$ such that $m_1(X') \approx m_1(X)$.
Measuring invariance of the target model on
Assuming we obtain and , our key idea is to capture the extent to which and share invariances, by measuring the degree to which representations assigned by to and are similar. Specifically, we want to quantify the degree of similarity between two sets of data representations and . To this end, we can make use of any of the existing representation similarity measures (), such as CKA (Kornblith et al., 2019a), and variants of CCA (Morcos et al., 2018; Raghu et al., 2017) as they’re designed specifically to measure similarity between two sets of representations. This yields a shared invariance measure, , which can now be formally defined as:
(3) 
Note that while the traditional use of , , does not measure shared invariance between and (we give concrete arguments for the same in Section 3.2), our proposed measure (Eq 3) shows how existing measures can be repurposed to measure shared invariance.
It’s important to note that our measure is a directional one since IRIs are defined relative to the reference model . Other than the reference and target models and , our measure takes given input points and a representation similarity measure () as inputs. We discuss the concrete instantiations of used in this work in Section 3.1. First, however, we describe how to construct given .
2.3 Generating IRIs
Operationalizing Eq 3 requires answering a key question: how to sample from an infinitely large set ? We argue that there are two key ways to do this: arbitrarily or adversarially. We can randomly choose a sample from , in which case we get arbitrary IRIs, or we can pick adversarially with respect to , such that the representations and are farthest apart, in which case we get adversarial IRIs. We also show in Table 1 that takeaways about shared invariance can vary greatly depending on the choice of arbitrary or adversarial IRIs.
We leverage the key insight that can map multiple different inputs to the same representation (since is a highly nonlinear deep neural network) which can be found using representation inversion (Mahendran and Vedaldi, 2014). For a given set of inputs typically drawn from the training distribution and a reference model , we generate that are all mapped to similar representations as by , . Note that and , are, by construction IRIs. This is achieved by performing the following optimization for every :
(4) 
Which can be approximated using gradient descent by repeatedly performing the following update (where is the step size):
(5) 
A key consideration in solving this optimization is that we must start with some initial value of , i.e., we must choose a seed from which to start the gradient descent. Different seeds can lead to different solutions of . We find that in practice randomly picked seeds give fairly stable estimates. [Footnote 1: Since we deal with images, we sample each pixel value as a random integer with uniform probabilities.]
Arbitrary IRI In order to simulate an arbitrary sample from , we use the following ,
Since this scheme only optimizes for similar representation of and on () and starts from a random initial value of , it simulates a random sample from the (possibly infinitely) large set .
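The inversion procedure described above (random seed, then gradient descent on a representation-matching loss) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the reference model is a hypothetical toy network `tanh(W x)`, the loss is an assumed squared Euclidean distance between representations, the seed is drawn uniformly from [0, 1) rather than as integer pixel values, and all function names are made up for this example.

```python
import numpy as np

def make_toy_model(in_dim, out_dim, seed):
    """Hypothetical stand-in for a reference network: m(x) = tanh(W x)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.5 / np.sqrt(in_dim), size=(out_dim, in_dim))

def represent(W, x):
    """Representation assigned to input x by the toy model."""
    return np.tanh(W @ x)

def inversion_loss(W, x, x_prime):
    """Assumed loss: squared distance between the representations of x and x'."""
    diff = represent(W, x_prime) - represent(W, x)
    return float(diff @ diff)

def inversion_grad(W, x, x_prime):
    """Gradient of inversion_loss with respect to x' (chain rule through tanh)."""
    r_prime = represent(W, x_prime)
    diff = r_prime - represent(W, x)
    return W.T @ (2.0 * diff * (1.0 - r_prime ** 2))

def find_arbitrary_iri(W, x, seed, step_size=0.3, n_steps=3000):
    """Start from a random seed input and descend until x' is represented like x."""
    rng = np.random.default_rng(seed)
    x_prime = rng.uniform(size=x.shape)  # random starting point (assumed range [0, 1))
    for _ in range(n_steps):
        x_prime = x_prime - step_size * inversion_grad(W, x, x_prime)
    return x_prime
```

Because the toy model maps a higher-dimensional input to a lower-dimensional representation, many inputs share (almost) the same representation, so different seeds generally end at different inputs that the model cannot tell apart from `x`.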
Adversarial IRIs We can alter the way we find () such that the resulting () are still, by definition IRIs but are optimized to generate very distinct outputs on , the model for which we’re measuring shared invariance. This can be achieved by using exactly the same procedure as in Eqs 4 and 5, except we now change to:
Solving for using ensures that the inputs are still similarly represented as on () and thus (X, X’) are IRIs. However this ensures that any measure of on such IRIs will be a worst case estimate. Such IRIs have been referred to as controversial stimuli (Golan et al., 2020) in existing literature.
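One way to picture the adversarial variant is as a two-term objective: keep the reference representations close while pushing the target representations apart. The sketch below is an assumption-laden illustration, since the exact loss is not reproduced here: both models are hypothetical toy networks of the form `tanh(W x)`, the distances are squared Euclidean, and the trade-off parameter `weight` is invented for this example.

```python
import numpy as np

def rep(W, x):
    """Toy representation m(x) = tanh(W x); a stand-in for a real network."""
    return np.tanh(W @ x)

def sq_dist_and_grad(W, x, x_prime):
    """Squared distance between representations of x and x', and its gradient w.r.t. x'."""
    r_prime = rep(W, x_prime)
    diff = r_prime - rep(W, x)
    value = float(diff @ diff)
    grad = W.T @ (2.0 * diff * (1.0 - r_prime ** 2))
    return value, grad

def adversarial_loss_and_grad(W_ref, W_tgt, x, x_prime, weight=0.1):
    """Assumed objective: small distance on the reference, large distance on the target."""
    ref_val, ref_grad = sq_dist_and_grad(W_ref, x, x_prime)
    tgt_val, tgt_grad = sq_dist_and_grad(W_tgt, x, x_prime)
    return ref_val - weight * tgt_val, ref_grad - weight * tgt_grad

def find_adversarial_iri(W_ref, W_tgt, x, x_init, step_size=0.05, n_steps=2000, weight=0.1):
    """Gradient descent on the assumed adversarial objective, starting from x_init."""
    x_prime = x_init.copy()
    for _ in range(n_steps):
        _, grad = adversarial_loss_and_grad(W_ref, W_tgt, x, x_prime, weight)
        x_prime = x_prime - step_size * grad
    return x_prime
```

In this sketch a larger `weight` favors disagreement on the target model at the cost of a looser match on the reference model, so whether the result still counts as an IRI has to be checked against the chosen tolerance.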
3 Measuring Shared Invariances








[Table 1: rows STIR, CKA, and Acc(, ); columns: two vanilla models (left), two adversarially trained models (middle), one vanilla and one adversarially trained model (right).]
Table 1: Here the two ResNet18s in each column are trained on CIFAR10 with different random initializations, holding every other hyperparameter constant. 1.) For two such models trained using the vanilla cross-entropy loss (left), interestingly, we find that STIR highlights a lack of shared invariance, whereas CKA overestimates this value; 2.) when both models are trained using adversarial training (Madry et al., 2019) (middle), STIR faithfully estimates high shared invariance; 3.) finally, STIR is able to show how having a directional measure can bring out the differences when comparing a model trained with vanilla loss and one trained with adversarial training (right), whereas CKA, not being directional, cannot derive these insights. All numbers are computed over 1000 random samples from the CIFAR10 training set and averaged over 5 runs.
3.1 STIR, an instantiation of
Using Eqs 4&5 we can generate different , by repeatedly sampling times from a given distribution (typically the training distribution of , the train or test set). Now, we define as follows:
(6) 
Here, we find using representation inversion as described in Section 2.3. We call this measure Similarity Through Inverted Representations — STIR.
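As a rough illustration of how such a score could be computed, the sketch below averages a similarity measure over several batches of IRIs, using linear CKA (the similarity measure adopted later in this section) in the form given by Kornblith et al. (2019a). The averaging over draws, the function names, and the calling convention (a target model passed as a callable, IRIs precomputed from the reference model) are assumptions made for this example, not the authors' code.

```python
import numpy as np

def linear_cka(A, B):
    """Linear CKA between two representation matrices (one row per sample)."""
    A = A - A.mean(axis=0, keepdims=True)  # center each feature
    B = B - B.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(B.T @ A, ord="fro") ** 2
    norm_a = np.linalg.norm(A.T @ A, ord="fro")
    norm_b = np.linalg.norm(B.T @ B, ord="fro")
    return float(cross / (norm_a * norm_b))

def stir(target_model, X, X_primes, similarity=linear_cka):
    """Average similarity of the target's representations on X and on each IRI batch.

    X_primes is a list of batches, each generated (elsewhere) so that the
    reference model represents it like X.
    """
    reps = target_model(X)
    scores = [similarity(reps, target_model(X_prime)) for X_prime in X_primes]
    return float(np.mean(scores))
```

Note that only the target model appears in the similarity computation; the reference model enters solely through how the batches in `X_primes` were generated, which is what makes the score directional.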
When all are chosen in an adversarial manner (using as described in Section 2.3), we can estimate the worst case STIR as:
(7) 
Both Equations 6&7 are parametrized by . For our purpose we use linear Centered Kernel Alignment (CKA) (Kornblith et al., 2019a) as the similarity measure, i.e., Linear CKA. CKA has been adopted by the community as the standard measure for representation similarity and has been used by many subsequent works to derive important insights about deep learning (Nguyen et al., 2020; Raghu et al., 2021; Neyshabur et al., 2020). CKA also has certain desirable axiomatic properties such as invariance to isotropic scaling and orthogonal transformations, which other methods such as SVCCA (Raghu et al., 2017) and PWCCA (Morcos et al., 2018) do not possess. Importantly, CKA is not invariant to every invertible linear transformation, which is not true of SVCCA or PWCCA. We refer to the paper (Kornblith et al., 2019a) for an extensive discussion on why these are desirable properties for any similarity measure. While CKA, as proposed, can work with any kernel function, results show that Linear CKA works just as well as RBF CKA, so for simplicity we use Linear CKA for all our experiments. Definitions of the similarity metrics listed above can be found in Appendix A.
3.2 Why Existing Representation Similarity () Measures Cannot Measure Shared Invariance
While at first glance a representation similarity measure (RSM) such as CKA (Kornblith et al., 2019a) or one of the others mentioned in section 3.1 might look appealing to measure shared invariance, these measures are in fact not suitable for this task, for several reasons.
First, current RSMs are not designed for capturing invariance. Their goal is to measure the degree of correlation between sets of points generated by two different models, that are transformed to be as aligned as possible under certain constraints. Measuring invariance, however, requires capturing the degree of difference between points generated by the same model, which should be the same.
Second, RSMs don’t interact with the model. All existing RSMs are evaluated in a two-step process, by first obtaining collections of representations from two models and then computing similarity based on those representations as , without using the models further. From a causal perspective, an invariance is an intervention on a model’s input that does not lead to a change in the model’s output. However, a central result in the causality literature is that the effect of interventions cannot be determined from observational data alone (Pearl, 2009). Therefore, any metric that aims to make meaningful statements about model invariance needs to interact with the model to make interventions. STIR does this via the process of representation inversion; existing RSMs, however, are purely observational and therefore cannot properly determine invariances.
Third, sharing invariance between two models is directional. If model is constant, it will share all of model ’s invariances, but not vice versa if is not constant itself. Existing RSMs are not directional and therefore cannot express these relationships.
3.3 STIR Faithfully Measures Shared Invariance
Training two models ( and ) with different random initializations (holding all other things like architecture, loss and other hyperparameters constant) leads to very “similar” representations on the penultimate layer, as measured by instantiations of such as CKA (Kornblith et al., 2019a). We consider two variants of this experiment, where we train two ResNet18 models (on CIFAR10) from different random initializations (keeping everything else the same) with 1.) the standard cross-entropy loss (, ), and 2.) with adversarial training (, ). [Footnote 2: Trained using threat model with ; see Madry et al. (2019) for more details.] Throughout the paper STIR is measured over CIFAR10 training samples with = Linear CKA. Thus, to simplify the notation, we use to mean .
STIR provides insights into shared invariance where CKA fails. Both (, ) and (, ) achieve a high similarity score (as measured by CKA, on the penultimate layer of both models). However, we find that such a similarity measurement would be an overestimation of shared invariances between and which are much lower when measured as and , as shown in Table 1. Two models trained using adversarial training (, ) should intuitively have more shared invariances since these models were explicitly trained to be invariant to perturbations. Indeed, we see that these two models have much higher values of and .
Sanity Checks for STIR For high values of STIR, intuitively we would expect the model we’re evaluating () to have similar representations on IRIs, . Since IRIs, by construction, have similar representation on the reference model (), we expect the predictions of on (pred()) to agree with predictions of on (pred()). Similarly, for low STIR values, we’d expect less agreement between pred() and pred(). We see that this relationship holds as lower STIR values for and also correspond to less agreement in their predictions on and higher STIR values for and correspond to higher agreement in their predictions (as shown in the right of each row of Table 1). To further corroborate that the estimate of shared invariance given by STIR is justified in being lower for (, ) than (, ), we generate controversial stimuli (Golan et al., 2020) for both pairs of models. Details of how to generate these can be found in Appendix C. We find that indeed it’s significantly easier to generate controversial stimuli for (, ) than for (, ) (see Appendix C for the results).
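The prediction-agreement check described above can be sketched in a few lines. This is a hedged illustration only: `target_logits_fn` is a hypothetical callable returning class scores for a batch, and agreement is assumed to be the fraction of inputs whose predicted class (argmax) is the same on the original samples and on their IRIs.

```python
import numpy as np

def prediction_agreement(target_logits_fn, X, X_prime):
    """Fraction of samples for which the target model predicts the same class
    on the original input and on its identically represented counterpart."""
    pred_x = np.argmax(target_logits_fn(X), axis=1)
    pred_x_prime = np.argmax(target_logits_fn(X_prime), axis=1)
    return float(np.mean(pred_x == pred_x_prime))
```

Under this reading, high STIR should go together with agreement close to 1, and low STIR with visibly lower agreement.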
Worst-case STIR shows that there are almost no shared invariances. When measuring shared invariance in the worst case (using , Eq 7), we see that even in the case of and the shared invariance drops close to . This is shown in the second row of Table 1. Thus, for the rest of the paper we focus on STIR measured using arbitrary IRIs, since the worst-case variant gives a very pessimistic estimate of shared invariance and thus is not useful in comparing models.
STIR brings out nuance through directionality. When comparing across training types, (, ), we find that directionality of STIR is able to show that invariances of are not well captured by (indicated by the low value of ). However, is much higher. We posit that models trained using AT have a “superior” set of invariances and do not possess the “bad” invariances that models trained using the vanilla loss exhibit. Thus, when evaluating IRIs from a vanilla model on a model trained using AT, one can expect lower shared invariance than in the other direction. STIR is able to capture these nuances in comparison between models because of its directionality. Measures of representation similarity () like CKA do not offer any such insights, since they’re not directional.
4 Using STIR to Analyze Model Updates
One motivation for a shared invariance measure like STIR, as mentioned in Section 1, is to monitor how different models “align” with each other. This can then be used to analyze whether updates to a model preserve the invariances learned before the update. We first show that STIR can capture relative differences between model invariances and then use STIR to analyze a simulated model update scenario where we incrementally add more training data.
4.1 STIR Captures Relative Robustness
We demonstrate that STIR can capture differences between models of varying degrees of robustness. We know that increasing adversarial robustness increases invariance to $\ell_p$-ball perturbations. Thus, by construction, if a model has higher adversarial robustness than another model , then invariances of should be “superior” to those of and hence we should see . To empirically test this, we construct three models with varying degrees of adversarial robustness: a model trained using the vanilla cross-entropy loss (), a model trained using adversarial training (AT) with the usual 10 iterations used to solve the inner maximization of AT (Madry et al., 2019) (), and a model trained using AT but with only 1 iteration in the inner maximization loop, which gives adversarial robustness somewhere in the middle (). Table 2 shows the comparison between these three models and confirms this expected ordering, with adversarial robustness measured here by robust accuracy.

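[Table 2: pairwise comparison between the AT it=10, AT it=1, and Vanilla models (rows and columns in that order). Each row header also lists the model's robust (Rob) and clean accuracy; diagonal entries are empty.]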

4.2 Updating Models With More Data
Models deployed in the real world are continuously updated with more data. In such cases, it may be crucial to understand how a model update at a given timestep shares invariance with the model at the previous timestep. To simulate such a scenario, we train a ResNet18 on CIFAR10 where at each timestep, we add training samples, i.e., at timesteps , we train the model on samples (we keep samples for a holdout validation set). At each timestep we train for 100 epochs. Figure 1 shows how (and ) changes as we progressively add more data. Here is the model at a given timestep and is the model on the previous timestep. For both AT and Vanilla loss we see and increase as we add more data. We also see that after a certain amount of data the STIR scores plateau, thus indicating that adding more data has diminishing returns for shared invariance.
5 Evaluating the Impact of Design Choices on Shared Invariance using STIR
A lot of research effort has been dedicated towards finding architectures, training schemes and datasets that produce more correct (i.e., accurate) models. However, the effect that these design choices have on the relative invariance of models is still not properly understood. In this section, we leverage STIR to investigate the effect that the various choices in the training pipeline have on shared invariances between models. All evaluations of STIR in this section are performed using the CIFAR10 dataset. See Appendix B for additional details.
5.1 Role of Different Random Initialization, Different Training Datasets
Random Initializations Kornblith et al. (2019b) find that the same architecture trained from two different random initializations should converge to highly similar representations, i.e., the same layer in both models has high similarity. We find that this is not necessarily the case for shared invariances. Fig. 2 shows results for two ResNet18 (different random initialization) and two VGG16 (different random initialization) models trained on CIFAR10 using the vanilla cross-entropy loss and adversarial training. We see that for models trained with vanilla loss (Fig. 2), later layers have lower shared invariances than initial layers, indicated by a negative slope of lines of best fit. A similar result showing high variance between two classifiers trained with vanilla loss that differ only in their random initialization has been observed for NLP models as well (Zafar et al., 2021), albeit for post-hoc explanations. This suggests a possibly interesting connection between shared invariance and post-hoc explanations which we leave for future work. However, with adversarial training, both initial and final layers converge to (almost) similarly high levels of shared invariances (indicated by flatter lines of fit in Fig. 2). Thus, we conclude that when training from different random initializations, training procedures that explicitly introduce invariance (such as adversarial training) make each layer of two differently initialized models converge to similar shared invariances.
Datasets For the same architecture (ResNet18) trained on different datasets (CIFAR10 and CIFAR100), we evaluate the shared invariances between each of the corresponding layers. We find that in general, initial layers tend to have higher shared invariances, as indicated by the negative slope of the line of best fit in Fig. 3. Interestingly, Kornblith et al. (2019a) made a similar observation when measuring similarity for models trained on different datasets. Additionally, we see that shared invariance increases substantially (for all layers) when these models are trained using adversarial training, similar to the random initialization case, even though the training here is performed on different datasets. We see similar trends for other architectures too (results in Appendix D).
5.2 Different Architectures, Penultimate Layer
We train ResNet18, ResNet34, VGG16 and VGG19 with two different random seeds and using both the vanilla loss and adversarial training (AT). We then compute the shared invariances across all these configurations of models (for the penultimate layer of each model) as shown in Fig. 4.
We find that in general, architectures with residual connections (ResNet18 and ResNet34) have high shared invariances amongst themselves, as indicated by high values of STIR amongst ResNets (for both vanilla and AT).
5.3 Different Adversarial Training Methods
As observed in Fig. 4, models trained with AT generally have higher values of STIR amongst them than models trained using the vanilla loss. This raises a natural question: does high STIR between models hold for any kind of training that makes models robust to $\ell_p$-ball perturbations?
To test this, we train a ResNet18 on CIFAR10 using 3 distinct training losses, all of which have been shown to be highly robust to $\ell_p$-ball perturbations: Adversarial Training (AT) (Madry et al., 2019), TRADES (Zhang et al., 2019) and MART (Wang et al., 2019). TRADES and MART contain an additional hyperparameter that is used to trade off between accuracy and adversarial robustness. We use for our experiments, and additional details can be found in Appendix B. All 3 of these ResNets achieve similar clean and robust accuracy.
Surprisingly, we find that models trained using these methods only have mild levels of shared invariance, as indicated by lines of best fit around (see Fig. 5). This is in contrast to the case of two ResNet18s trained using AT (see Fig. 2), which leads to very high shared invariance. This shows that the differences shown in Fig. 5 are not due to stochasticity in training but rather due to the training methods themselves. Thus, while all of these methods achieve the same goal of robustness in an $\ell_p$ ball, they do so in very different ways. We see similar trends for other architectures too (results in Appendix D).
In summary, we find that shared invariance between models tends to decrease with increasing layer depth and adversarial training significantly increases the degree of shared invariance. Models with architectures using residual connections exhibit a higher degree of shared invariance, whereas different methods of adversarial training do not necessarily lead to models with the same shared invariances.
6 Related Work
Comparing Representations. A number of papers have proposed methods for measuring the similarity of representations in deep neural networks (Laakso and Cottrell, 2000; Li et al., 2016; Wang et al., 2018; Raghu et al., 2017; Morcos et al., 2018; Kornblith et al., 2019a). Our invariance measure leverages CKA (Kornblith et al., 2019a), however, we discuss in Sections 3.2 and 3.3 why the existing metrics themselves are unsuitable for measuring shared invariances between models. Recent work has identified lack of consistency between existing similarity measures and proposed a measure that is consistent (Ding et al., 2021). However, their proposed measure also measures similarity and does not take into account the invariances. It can thus be used in conjunction with our proposed measure, but not replace it for measuring invariance. There has been work on measuring shared invariance between NN representations and (black box) humans (Feather et al., 2019; Nanda et al., 2021), however, we consider a different problem of shared invariances between two NNs.
Understanding Representations. A lot of work on scrutinizing representations has been geared towards improving the interpretability of neural networks (Mahendran and Vedaldi, 2014; Olah et al., 2017; Kim et al., 2018; Dosovitskiy and Brox, 2016a, b). We’re particularly inspired by representation inversion (Mahendran and Vedaldi, 2014; Dosovitskiy and Brox, 2016b) which is a key component of our proposed measure. However, while the goal of all of these works is to make a neural network more interpretable to humans, i.e., to enable qualitative judgements about the network’s behavior, our goal is instead to measure the degree of shared invariance between any two models, i.e., to make quantitative statements. Recent work has used model stitching to compare representations (Bansal et al., 2021), but it is focused on better understanding of learned representations, rather than measuring robustness. Higgins et al. (2018) define disentanglement in representation learning by leveraging the concept of invariance. Their work, however, only provides a definition of disentanglement and no measure for the degree of invariance.
Robustness. Our work takes inspiration from many works on adversarial (Szegedy et al., 2013; Madry et al., 2019; Zhang et al., 2018; Ilyas et al., 2019), natural (Hendrycks and Dietterich, 2019; Geirhos et al., 2018) and distributional (Taori et al., 2020; Recht et al., 2019; Koh et al., 2021) robustness. However, in all these works, the reference model is (implicitly) assumed to be a human and the goal is to make a neural network follow the invariances of a human. In our work we explicitly characterize the reference model, which can be another neural network, that allows us to unify all the different notions of robustness.
7 Conclusion
We proposed a directional measure of shared invariance between representations that takes into account the invariances of the model that generated these representations. We showed how our measure faithfully estimates shared invariances where existing representation similarity methods may fail. Furthermore, we showed how our measure can be used to derive interesting insights about deep learning. It will be interesting to explore this direction further in future work, revisiting earlier analysis based on previous representation similarity approaches (Nguyen et al., 2020; Raghu et al., 2021). Another interesting avenue to explore is how our measure can be used during training to encourage one model to follow similar invariances to another. This could be helpful to update a neural network in safety-critical applications by ensuring that the new network maintains the invariances of the original model.
Acknowledgements
AW acknowledges support from a Turing AI Fellowship under grant EP/V025279/1, The Alan Turing Institute, and the Leverhulme Trust via CFI. VN, TS, and KPG were supported in part by an ERC Advanced Grant “Foundations for Fair Social Computing” (no. 789373). VN and JPD were supported in part by NSF CAREER Award IIS1846237, NSF DISN Award #2039862, NSF Award CCF1852352, NIH R01 Award NLM01303901, NIST MSE Award #20126334, DARPA GARD #HR00112020007, DoD WHS Award #HQ003420F0035, ARPAE Award #4334192 and a Google Faculty Research Award. Finally, we would like to thank anonymous reviewers for their constructive feedback and suggestions for experiments in Section 4.
References
 Revisiting model stitching to compare neural representations. Advances in Neural Information Processing Systems 34, pp. 225–236. Cited by: §6.
 Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, H. Blockeel, K. Kersting, S. Nijssen, and F. Železný (Eds.), pp. 387–402. External Links: ISBN 9783642409943 Cited by: §1.
 Grounding representation similarity with statistical testing. arXiv preprint arXiv:2108.01661. Cited by: §6.
 Generating images with perceptual similarity metrics based on deep networks. Advances in neural information processing systems 29, pp. 658–666. Cited by: §6.

 Inverting visual representations with convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4829–4837. Cited by: §6.
 Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pp. 1802–1811. Cited by: §1, §2.1.
 Manitest: are classifiers really invariant? CoRR abs/1507.06535. External Links: Link, 1507.06535 Cited by: §1, §2.1.
 Metamers of neural networks reveal divergence from human perceptual systems. Advances in Neural Information Processing Systems 32. Cited by: §6.
 Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. External Links: Link Cited by: §1, §2.1, §6.
 Controversial stimuli: pitting neural networks against each other as models of human cognition. Proceedings of the National Academy of Sciences 117 (47), pp. 29330–29337. External Links: Document, ISSN 00278424, Link, https://www.pnas.org/content/117/47/29330.full.pdf Cited by: Appendix C, §2.3, §3.3.
 Explaining and harnessing adversarial examples. External Links: 1412.6572 Cited by: §2.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Appendix B.
 Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.1, §6.
 Towards a definition of disentangled representations. CoRR abs/1812.02230. External Links: Link, 1812.02230 Cited by: §6.
 Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, External Links: Link Cited by: §6.

 Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pp. 2668–2677. Cited by: §6.
 Wilds: a benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. Cited by: §6.
 Similarity of neural network representations revisited. In International Conference on Machine Learning, pp. 3519–3529. Cited by: Appendix A, Figure 7, §1, §2.2, §3.1, §3.2, §3.3, Figure 3, §5.1, §6.

 Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2661–2671. Cited by: §5.1.
 Learning multiple layers of features from tiny images. Cited by: Appendix B.

 Content and cluster analysis: assessing representational similarity in neural systems. Philosophical psychology 13 (1), pp. 47–76. Cited by: §6.
 Convergent learning: do different neural networks learn the same representations? External Links: 1511.07543 Cited by: §6.
 Towards deep learning models resistant to adversarial attacks. External Links: 1706.06083 Cited by: Appendix B, Table 3, §2, Table 1, Figure 5, §4.1, §5.3, §6, footnote 2.
 Understanding deep image representations by inverting them. External Links: 1412.0035 Cited by: §2.3, §6.
 Robust. In Merriam-Webster.com dictionary, External Links: Link Cited by: §2.
 Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 5732–5741. External Links: Link Cited by: Appendix A, §1, §2.2, §3.1, §6.
 Exploring alignment of representations with human perception. CoRR abs/2111.14726. External Links: Link, 2111.14726 Cited by: §6.

 What is being transferred in transfer learning? arXiv preprint arXiv:2008.11687. Cited by: §3.1.
 Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327. Cited by: §1, §3.1, §7.
 Feature visualization. Distill. Note: https://distill.pub/2017/feature-visualization External Links: Document Cited by: §6.
 The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pp. 372–387. Cited by: §1, §2.
 Causality. 2 edition, Cambridge University Press. External Links: ISBN 9780521895606 Cited by: §3.2.
 SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 6078–6087. External Links: ISBN 9781510860964 Cited by: Appendix A, §1, §2.2, §3.1, §6.

 Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems 34. Cited by: §3.1, §7.
 Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp. 5389–5400. Cited by: §6.
 Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Appendix B.
 Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §2.1, §2, §6.
 Measuring robustness to natural distribution shifts in image classification. In Advances in Neural Information Processing Systems, External Links: Link Cited by: §1, §6.

 Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152. Cited by: §2.
 Towards understanding learning representations: to what extent do different neural networks learn the same representation. arXiv preprint arXiv:1810.11750. Cited by: §6.
 Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, Cited by: Figure 5, §5.3.
 On the lack of robust interpretability of neural text classifiers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 3730–3740. External Links: Link, Document Cited by: §5.1.
 Theoretically principled tradeoff between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. Cited by: §2, Figure 5, §5.3.

 The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §6.
Appendix A Representation Similarity Measures
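A recent line of work has studied measures of the similarity of representations in deep neural networks. Given two models $m_1$ and $m_2$ and a set of $n$ input points $X$ to both, all of these measures quantify the degree of representational similarity between two sets of representations $Y = m_1(X) \in \mathbb{R}^{n \times d_1}$ and $Z = m_2(X) \in \mathbb{R}^{n \times d_2}$ as $S(Y, Z)$ and are defined as follows:
SVCCA (Raghu et al., 2017) is computed via the following steps:

Compute $Y', Z'$ by pruning $Y, Z$ using SVD to only retain the first $d_1'$ and $d_2'$ principal components that are necessary to retain 99% of the variance, respectively.

Perform Canonical Correlation Analysis on $Y'$ and $Z'$, yielding correlation coefficients $\rho_1, \dots, \rho_c$ with $c = \min(d_1', d_2')$.

Return $S_{\mathrm{SVCCA}}(Y, Z) = \frac{1}{c} \sum_{i=1}^{c} \rho_i$.
PWCCA (Morcos et al., 2018) is computed via the following steps:

Perform Canonical Correlation Analysis on $Y$ and $Z$, yielding correlation coefficients $\rho_1, \dots, \rho_c$ and CCA vectors $h_1, \dots, h_c$.

Compute weights $\alpha_i = \sum_j |\langle h_i, y_j \rangle|$, where the $y_j$’s are the columns of $Y$.

Return $S_{\mathrm{PWCCA}}(Y, Z) = \frac{\sum_{i=1}^{c} \alpha_i \rho_i}{\sum_{i=1}^{c} \alpha_i}$.
(Linear) CKA (Kornblith et al., 2019a) is computed via the following steps:

Compute similarity matrices $K = Y Y^\top$, $L = Z Z^\top$.

Compute normalized versions $K' = HKH$, $L' = HLH$ of the similarity matrices using the centering matrix $H = I_n - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$.

Return $S_{\mathrm{CKA}}(Y, Z) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}$, where $\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2} \mathrm{tr}(K' L')$.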
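The linear CKA steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code released with the paper: the function name `linear_cka` and the argument names `Y` and `Z` (representation matrices with one row per input point) are assumptions made here, and HSIC is computed in its standard trace form with the $1/(n-1)^2$ normalization.

```python
import numpy as np


def linear_cka(Y: np.ndarray, Z: np.ndarray) -> float:
    """Linear CKA between two representation matrices (one row per input).

    Y has shape (n, d1) and Z has shape (n, d2); both must be computed on
    the same n inputs. The names and shapes are assumptions of this sketch.
    """
    if Y.shape[0] != Z.shape[0]:
        raise ValueError("Y and Z must have the same number of rows (inputs)")
    n = Y.shape[0]

    # Step 1: similarity (Gram) matrices with a linear kernel.
    K = Y @ Y.T
    L = Z @ Z.T

    # Step 2: center both similarity matrices with H = I - (1/n) 11^T.
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    Lc = H @ L @ H

    # Step 3: HSIC(K, L) = tr(Kc Lc) / (n - 1)^2. For symmetric matrices the
    # trace of the product equals the sum of the elementwise product.
    def hsic(A: np.ndarray, B: np.ndarray) -> float:
        return float(np.sum(A * B)) / (n - 1) ** 2

    return hsic(Kc, Lc) / float(np.sqrt(hsic(Kc, Kc) * hsic(Lc, Lc)))
```

Because the score depends on `Y` and `Z` only through their centered Gram matrices, it is unchanged by orthogonal transformations and isotropic rescaling of either representation, which is why comparing a representation with a rotated and rescaled copy of itself yields a score of 1.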
Appendix B Model Architecture, Hardware, Training, and Other Details
We use ResNet18, ResNet34 (He et al., 2016), VGG16, and VGG19 (Simonyan and Zisserman, 2014) trained on CIFAR-10/100 (Krizhevsky et al., 2009) using the standard cross-entropy loss as well as adversarially robust training methods such as AT, TRADES, and MART. All of our experiments are performed on standard models and datasets that can fit on standard GPUs. We also attach our code to reproduce the numbers. For all purposes of adversarial training we use the threat model with (see Madry et al. (2019) for details). Additionally, TRADES and MART require another hyperparameter (that balances adversarial robustness and clean accuracy), which we set to for our experiments.
Appendix C STIR Faithfully Measures Shared Invariance
To confirm that STIR is correct in assigning different scores to two (otherwise identical) models trained from different initializations, we generate controversial stimuli (Golan et al., 2020) for all configurations in Table 3: (), (), and (). Such stimuli are generated by solving the following optimization for a training data point and any two models and :
(8) 
Here we assume and are penultimate layer representations of the respective models. This process generates for every point such that is low and is high. Since these are “perceived” very differently by the two models (with respect to the original ), they are called controversial stimuli. We use the empirical mean of the amount of perturbation needed () as a measure of the ease of generating controversial stimuli (higher values indicate that controversial stimuli were harder to generate). For a pair of models (, ) where it is easy to generate such , the shared invariance should be low. We see that this is indeed the case, as indicated by the numbers in Table 3.
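To make the idea of such an optimization concrete, the following toy sketch searches for a perturbed input whose representation stays close to the original under one model while moving far away under the other. It does not reproduce Eq. (8): the objective, the linear stand-in "models" `A` and `B`, the ℓ2 budget `eps`, and the function name `controversial_perturbation` are all assumptions chosen so that the gradient can be written by hand.

```python
import numpy as np


def controversial_perturbation(A, B, x, eps=1.0, lr=0.1, steps=200, seed=0):
    """Projected gradient descent on an assumed controversy objective.

    Toy setting (hypothetical): model 1 is x -> A @ x, model 2 is x -> B @ x.
    Assumed objective, not the paper's Eq. (8):
        minimize ||A x' - A x||^2 - ||B x' - B x||^2
        subject to ||x' - x||_2 <= eps.
    Returns the perturbed input x'.
    """
    rng = np.random.default_rng(seed)
    # d = x' - x; start from a small random offset so the gradient is nonzero.
    d = 1e-3 * rng.normal(size=x.shape)
    for _ in range(steps):
        # Gradient of ||A d||^2 - ||B d||^2 with respect to d.
        grad = 2.0 * (A.T @ (A @ d)) - 2.0 * (B.T @ (B @ d))
        d = d - lr * grad
        # Project back onto the l2 ball of radius eps.
        norm = np.linalg.norm(d)
        if norm > eps:
            d = d * (eps / norm)
    return x + d
```

In this toy setting the perturbation concentrates in directions that the first model ignores and the second model is sensitive to. The norm of `x' - x` plays the role of the amount of perturbation; here it is capped by `eps` rather than measured as the amount needed to reach a given level of disagreement.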








Table 3: STIR, CKA, and Acc for each of the three configurations.

Appendix D Experiments: Insights using STIR
D.1 Role of Different Random Initialization, Different Training Datasets
D.2 Different Architectures, Penultimate Layer
Fig 8 shows that the trends for STIR hold across different choices of seeds. In contrast, CKA assigns high values across the board and is thus not a faithful measure of shared invariance.
D.3 Different Adversarial Training Methods
Fig 9 shows results on VGG16. These values are much higher than those for ResNet18, indicating that these methods achieve robustness differently across architectures. For AT and MART, we see lower shared invariances even for VGG16.