Exploring Alignment of Representations with Human Perception

We argue that a valuable perspective on when a model learns good representations is that inputs that are mapped to similar representations by the model should be perceived similarly by humans. We use representation inversion to generate multiple inputs that map to the same model representation, then quantify the perceptual similarity of these inputs via human surveys. Our approach yields a measure of the extent to which a model is aligned with human perception. Using this measure of alignment, we evaluate models trained with various learning paradigms (supervised and self-supervised learning) and different training losses (standard and robust training). Our results suggest that the alignment of representations with human perception provides useful additional insights into the qualities of a model. For example, we find that alignment with human perception can be used as a measure of trust in a model's prediction on inputs where different models have conflicting outputs. We also find that various properties of a model like its architecture, training paradigm, training loss, and data augmentation play a significant role in learning representations that are aligned with human perception.








1 Introduction

Figure 1: [Examples of Generated Images] Examples from ImageNet that are perceived similarly by different models. Similarly perceived images are generated by starting from different seeds and solving Eq 1, similar to [1]. Some models retain features of the seed image (for example, reconstructions of both “typewriter” and “cat” starting from random noise for Robust VGG16 carry some of the noise into the final images), whereas other models, like ResNet50, produce much higher quality reconstructions.

Many suggest that deep learning has been successful due to the ability to learn “meaningful” representations [2, 3], and that it is helpful to understand what aspects of a learned representation contribute to success on a downstream task (e.g., image classification). Much prior work has focused on analyzing learned representations for better interpretability [4, 5, 6, 7, 8, 9, 10]. Recent works have proposed methods to compare the similarity of representations between two neural nets [11, 12, 13, 14]; however, these works assume white-box access to both networks. Here, we consider the similarity between the learned representation of a deep neural net and human perception. Measuring this type of similarity poses different challenges, including that we only have black-box access to human perception (we cannot measure internal representations in the human brain).

Assessing the alignment of learned representations with human perception is an important step in understanding and diagnosing critical issues in deep learning, such as lack of robustness [15, 16, 17, 18, 19, 20, 21, 22, 23]. Prior work has shown that adversarially robust models, i.e., models trained using adversarial training [24], learn representations that are well aligned with human perception [1, 25, 26]. While this is an exciting observation, these works do not formally define a measure that can quantify alignment with human perception. We use representation inversion [4] to find sets of images that a model ‘perceives’ similarly in latent space, and use human surveys to quantify the human alignment of a learned representation. Representation inversion involves reconstructing an image from just its representation and has been used in prior work (notably [4]) as an interpretability tool to understand what visual features are learned by a deep neural network. We leverage representation inversion to define a measure of alignment with human perception and conduct an in-depth study of vision models trained using various kinds of training losses (standard ERM and various adversarial training methods [24, 27, 28]) and different training paradigms (supervised and self-supervised learning).

We show that measuring alignment can help assess a model’s performance on controversial stimuli [29]. Controversial stimuli are inputs that are perceived differently by two models, i.e., one model perceives these inputs to be similar and the other perceives them to be different. Through empirical investigations, we show that more aligned models are more likely to be correct about the perception of controversial stimuli. This is discussed in depth in Section 3.1. Additionally, we argue that measuring alignment can be a simple but promising way of auditing the adversarial robustness of defence methods. We evaluate a variety of defence methods and, interestingly, find that the defence methods with lower alignment were later broken by stronger attacks (in [30]), whereas the methods with high alignment have been harder to attack.

We then conduct an in-depth study to understand what contributes to better alignment. We show that different architectures trained on CIFAR10 that achieve similar clean and robust test accuracies can have very different alignment with human perception. We also show that in certain cases, using data augmentation during training is key to learning aligned representations. Additionally, we find that when the same architectures are trained with a self-supervised contrastive loss (SimCLR [31]), the learned representations – while typically having lower accuracies than their fully supervised counterparts – have much better alignment. Prior work has suggested that robust training induces a human prior over learned representations [1], however, our results suggest that the story is more nuanced and that robust training is not the only component that leads to well-aligned representations.

(a) 2AFC

(b) Hard ImageNet Clustering Task

(c) Random ImageNet Clustering Task

Figure 2: [Survey Prompts for AMT workers] In the 2AFC (left) setting we ask the annotator to choose which of the two images is perceptually closer to the query image. For a well-aligned model, the query image (the inversion result) should be perceptually closer to the image that is assigned a similar representation by the model. In the clustering setting (center and right) we show 3 images from the dataset (target images) in the columns and, for each of these, we generate 2 resulting images that yield similar representations on the model; each of these is shown across the rows. The task is to match the resulting image in each row with the corresponding target image in the column. For a well-aligned model, images that are mapped to similar representations by the model should also be the ones matched by human annotators.

We highlight the following contributions:


  • We present a method to assess the alignment of learned representations with human perception at scale. We show that this is a reliable measure and corroborate our finding using human surveys. We also present an automated method that can serve as a proxy for human surveys to measure alignment at scale.

  • We find that alignment is indicative of a model’s correctness on controversial stimuli [29] – which are inputs that are perceived similarly by one model but differently by another one. In such cases, the model with higher alignment is more likely to correctly map controversial stimuli.

  • We show that models aligned with human perception have good robustness to adversarial attacks. Our results show that proposed defences which were later broken by stronger attacks indeed had poor alignment. Thus, measuring alignment could help to audit defences against adversarial attacks.

  • Using our method, we present an empirical study on representation alignment for various types of vision models. We draw insights about the role of data augmentation, model architectures, different training paradigms and loss functions in learning human aligned representations. Our findings show that measuring representation alignment can offer additional important insights about models that cannot be inferred from traditionally used measures such as clean and robust accuracy.

1.1 Related Work

Robust Models Several methods have been introduced to make deep learning models robust against adversarial attacks [32, 33, 34, 35, 36]. Adversarial training [24] was shown to be one of the most effective techniques to defend against adversarial attacks and since then, many variants of adversarial training have been proposed to improve its performance such as TRADES [27] and MART [28]. Interestingly, models that are robust to adversarial perturbations were also found to learn features that align with human perception [1] and possess perceptually aligned gradients [25, 26]. In our work we systematically study alignment and show that well aligned models are generally adversarially robust. We also show that many other factors (discussed in section 4), in conjunction with robust learning, lead to better alignment.

Representation Similarity An early study comparing neural representations via distances appeared in [37]. Methods to align features of different networks and compare them with respect to (groups of) neurons were explored in [11]. Recently, many works have efficiently calculated the similarity between the learned representations of two neural nets [13, 12, 14, 38]. These works assume complete access to both neural networks and are thus able to fully analyze both sets of representations to compute a similarity score. These methods cannot be applied to our case, since we want to measure the similarity between a neural network and the human brain, with only black-box access to the latter.

DNNs and Human Perception Similarity of representations of neural networks has been used to measure quality in the image space [39, 40]. Zhang et al. proposed a perceptual similarity metric based on activations of the initial layers of trained DNNs [41]. We leverage this perceptual similarity metric in our work to simulate human perception (details in Section 2). Recent work has measured the alignment of human and neural network perception by eliciting similarity judgements from humans and comparing them with the outputs of neural nets [42]. Our work differs in two key aspects: first, the kind of similarity we look for lies entirely in the pixels of the image, i.e., we look for perceptual similarity, whereas [42] look for contextual similarity (a cigarette and a beer bottle are marked similar in their work because both share the context of being age restricted; in our case these would be considered dissimilar since they have no perceptual similarity); second, we create similar-looking images in a way that takes into account all the layers of the neural network, while [42] only consider the final output. Further works have built on the foundational work of Roads and Love [42] to transform neural network representations such that they align with human similarity judgments [43, 44], or trained models specifically to predict human judgments [45]. However, these are orthogonal to our work.

Explainability/Interpretability Many prior works that evaluate learned representations do so with the goal of explaining the learned features of a DNN [4, 9, 8, 46, 47]. While our work is inspired by these prior works and uses similar tools, our goal is to understand alignment with human perception by generating inputs that are mapped to similar representations.

2 Measuring Human Perception Alignment

A necessary condition for a DNN’s perception to be aligned with human perception is that inputs that are mapped to similar representations by a DNN should also be perceived similarly by humans.

Testing for this condition is a two-step process: we first need to generate inputs that are mapped to similar representations by the DNN, and then we need to assess whether these inputs are also perceived similarly by humans. In theory we would like to find all inputs that are mapped to similar representations; however, given the highly non-linear nature of DNNs, this is very challenging. We show empirically that even a small number of randomly seeded samples is sufficient to reliably observe the differences.


Model Human 2AFC LPIPS 2AFC Human Clustering LPIPS Clustering Clean Acc. Robust Acc.

ResNet18 98.0
VGG16 11.5
InceptionV3 87.5
Densenet121 100.0



Model Human 2AFC LPIPS 2AFC Human Clustering Human Clustering Hard LPIPS Clustering Clean Acc. Robust Acc.


ResNet18 -
ResNet50 -
VGG16 -

Table 1: [CIFAR10 and ImageNet Survey Results] For different training losses on different model architectures, we show the results of surveys that measure alignment with human perception. Higher values for 2AFC and clustering indicate better alignment. A clustering value close to random assignment (about 1/3) indicates no alignment. While robust models are much better aligned, we find that alignment can vary quite a bit between different architectures – all of which achieve similar clean and robust accuracies. This indicates that alignment is an important additional measure for evaluating DNNs. All measures are reported on the test set.

2.1 Inverting Representations to Generate Images

We generate multiple inputs that lead to the same representation for a DNN using representation inversion [4]. Representation inversion starts from some image $x_0$ (seed) and reconstructs a given image $x_t$ (target) with respect to its representation $f(x_t)$, where $f$ is the trained DNN. The reconstruction is obtained by solving an optimization of the following form:

$$\hat{x} \;=\; \arg\min_{x} \; \big\| f(x) - f(x_t) \big\|_2 \qquad (1)$$

Starting the optimization from different values of the seed $x_0$ gives different reconstructions $\hat{x}$ (results). All of these reconstructions induce representations on the DNN that are very similar to that of the target, as measured using the $\ell_2$ norm. Thus, this process yields a set of images which are mapped to similar internal representations by the DNN. Some examples are shown in Fig 1.
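The inversion procedure can be sketched with a toy stand-in for the network: below, a random linear map plays the role of $f$ (an assumption purely for illustration; the paper inverts real trained DNNs with autodiff), but the mechanics are the same: gradient descent on the representation-matching loss from different seeds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": a random linear feature map (hypothetical; the paper
# inverts real trained DNNs). Mapping 64-d inputs to 32-d features means
# many distinct inputs share the same representation.
W = rng.normal(size=(32, 64))
f = lambda x: W @ x

def invert(target, seed, steps=10_000, lr=2e-3):
    """Gradient descent on ||f(x) - f(target)||^2, the objective of Eq 1."""
    r_t = f(target)
    x = seed.copy()
    for _ in range(steps):
        x -= lr * 2 * W.T @ (f(x) - r_t)   # analytic gradient for the linear f
    return x

x_t = rng.normal(size=64)                  # "target image"
results = [invert(x_t, rng.normal(size=64)) for _ in range(2)]

# Both reconstructions match the target's representation almost exactly...
assert all(np.linalg.norm(f(r) - f(x_t)) < 1e-4 for r in results)
# ...yet differ from each other in input space (they disagree on f's null
# space). This is exactly the set of images whose perceptual similarity
# the surveys then assess.
assert np.linalg.norm(results[0] - results[1]) > 1.0
```

With a real DNN, the analytic gradient is replaced by backpropagation through the network, but the seed-dependence of the result carries over unchanged.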

2.2 Measuring Human Perception Similarity

After obtaining a set of images with similar representations, we must check if humans also perceive these images similarly. The extent to which humans think this set of images is similar defines how aligned the representations learned by the DNN are with human perception. We deploy two types of surveys to elicit human similarity judgements. Prompts for these surveys are shown in Fig 2.

Clustering In this setting, we ask humans to match each resulting image to the most perceptually similar target image. A prompt for this type of task is shown in Fig 2(c) & 2(b). Once we get these responses, a quantitative measure of alignment can be calculated as the fraction of resulting images that were correctly matched to their respective targets. For ImageNet, we observed that a random draw of three images (Fig 2(c)) can often be easy to match based on how different the drawn images are. Thus, we additionally construct a “hard” version of this task by ensuring that the three images are very similar (as shown in Fig 2(b)). More details in Appendix A.

2AFC The two-alternative forced choice test (2AFC) is commonly used to assess perceptual similarity of images [41]. In this setting we show the annotator a reconstructed image (result) and ask them to match it to one of two images shown as options: the seed (the starting image of the optimization in Eq 1) and the original image (the target). The goal is to see whether humans also perceive the result to be similar to the target. See Fig 2(a) for an example of this type of survey.

Automating Similarity Judgements Conducting an AMT survey to evaluate each model for human alignment is time consuming and expensive, and thus does not scale. To automate the process of obtaining similarity judgements, we use LPIPS [41], a widely used perceptual similarity measure. It relies on the activations of the initial few layers of an ImageNet-pretrained model to compute a perceptual distance, i.e., a measure of how similar two images are. We use LPIPS with AlexNet since it was shown to give the most reliable similarity estimates (https://github.com/richzhang/PerceptualSimilarity). Using this, we can simulate a human completing the 2AFC and Clustering tasks by matching each query to the option with the highest LPIPS similarity. See Appendix A for complete details of the experimental setup.
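The automated judgement is just a nearest-neighbour decision under a perceptual distance. A minimal sketch of the 2AFC scoring follows; the paper plugs in LPIPS as the distance, while plain L2 is used here only so the sketch runs without the `lpips` package (the toy target/seed arrays are hypothetical stand-ins for images).

```python
import numpy as np

def two_afc(query, option_a, option_b, dist):
    """Return 0 if the query is closer to option_a, else 1."""
    return 0 if dist(query, option_a) <= dist(query, option_b) else 1

def alignment_2afc(triples, dist):
    """Fraction of (result, target, seed) triples where the inversion result
    is judged closer to the target than to the seed -- the automated 2AFC
    score. Swap `dist` for an LPIPS callable to reproduce the paper's setup."""
    correct = sum(two_afc(q, t, s, dist) == 0 for q, t, s in triples)
    return correct / len(triples)

l2 = lambda a, b: float(np.linalg.norm(a - b))

rng = np.random.default_rng(1)
t, s = rng.normal(size=16), rng.normal(size=16)        # toy "target" and "seed"
triples = [(t + 0.1 * rng.normal(size=16), t, s) for _ in range(20)]
print(alignment_2afc(triples, l2))                     # -> 1.0 (queries are near-copies of t)
```

The clustering score is computed analogously, by matching each result to the nearest of the three targets under the same distance.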

2.3 Role of Input Distribution

To assess how the proposed method varies based on the samples chosen as targets in Eq 1, we test using targets that are both “in-distribution”, i.e., taken from the test set of the dataset used during training, and “out-of-distribution” (OOD), i.e., images that look nothing like the training dataset. For OOD samples, to test an extreme case, we choose images sampled from two random Gaussians. Some examples of in-distribution and OOD targets, along with the corresponding results, are shown in Fig 8, Appendix A. We find that even though performance drops, humans are still able to faithfully perform both 2AFC and Clustering tasks, yielding the same ranking of models as with in-distribution targets. Results are shown in Table 3, Appendix A.

2.4 Evaluation

For each model, we randomly picked 100 images from the test set and chose two different types of seeds: an image with random pixel values, and a random image picked from the training set, as shown in Fig 1. For each of the 100 images, we reconstruct the image using the two types of seeds, giving a total of 200 query images. Each image was reconstructed using different randomly drawn seeds, thus ensuring that any evaluation done on these reconstructions does not depend on the seed image. These were used to set up the 2AFC and Clustering surveys. Each survey was completed by 3 AMT workers, who were paid based on a US per-hour compensation rate. Attention checks (with the same query image as one of the options) were placed to ensure the validity of our responses. Our study was approved by our institution’s ethics review board.

Table 1 shows the results for the surveys conducted with AMT workers. For a well-aligned model, the scores under 2AFC and Clustering should be close to 1, while for a non-aligned model, scores under 2AFC should be close to zero and scores under Clustering should be close to a random guess (about 1/3, since each result is matched to one of three targets).
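The chance level for the clustering task can be checked with a quick simulation, under the simplifying assumption that each result image is matched independently and uniformly at random to one of the three targets:

```python
import random

random.seed(0)
# Each random guess over three targets is correct with probability 1/3,
# so the expected chance-level clustering score is about 0.33.
trials = 100_000
hits = sum(random.randrange(3) == 0 for _ in range(trials))
chance = hits / trials
print(round(chance, 2))  # ≈ 0.33
```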

Reliability We see that scores under Human 2AFC and Human Clustering order the different models similarly. Despite starting from separate seeds, human annotators achieve similar accuracy when matching the regenerated image to the corresponding target image. We also see that even though accuracy drops for the “hard” version of the ImageNet task, the relative ordering of models remains the same. Additionally, the variances between different annotators (included in Table 1) are also very low. These observations indicate that alignment can be measured reliably by generating inputs that have similar internal representations, and does not depend on bias in selecting seeds, targets, or annotators.

Automatic Measurement Table 1 additionally shows the accuracy numbers if LPIPS was used to simulate humans. We find that these numbers faithfully reflect how humans assign perceptual similarity, since LPIPS accuracy scores also rank models the same way as human judgements do. Thus in the rest of the paper, we will use this as a way to simulate human similarity judgements, which can then be used to measure alignment.

Alignment as an Evaluation Measure We also see from Table 1 that models with similar clean and robust accuracy can have very different levels of alignment, most notably robust VGG16 on CIFAR10, which has similar accuracies as other models but has very low alignment.

3 Why Does Alignment Matter?

3.1 Trustworthiness of Model Predictions

Figure 3: [Controversial Stimuli Examples] An example from CIFAR10 (top) and ImageNet (bottom) along with corresponding controversial stimuli generated using Eq 2. Controversial stimuli have higher perceptual similarity when the model for which the stimuli should be similar is the more aligned model (ResNet18 for CIFAR10 and ResNet50 for ImageNet). Conversely, when the model for which the stimuli should be distinct is the more aligned model, the generated controversial stimuli have lower perceptual similarity (bottom row of images for both CIFAR10 and ImageNet).

Controversial stimuli refer to images for which different models produce distinct responses [29]. These kinds of images are particularly useful in determining which of two models (with comparable clean and robust accuracies over an input distribution) is more trustworthy, as only one of the conflicting responses can be accurate, i.e., the controversial stimuli will either be similar or distinct.

We can generate controversial stimuli for any pair of neural nets by playing a game between the two networks, where one network changes a given input such that its representation changes substantially from that of the original input, while the other network changes the input such that its representation remains similar. Playing such a game to convergence leads to inputs that are perceived similarly by the first network, but differently by the second. More concretely, for two neural networks $f$ (which should perceive the stimuli as similar) and $g$ (which should perceive them as different), starting from an input $x_0$, we aim to solve the following optimization:

$$\tilde{x} \;=\; \arg\min_{x} \; \big\| f(x) - f(x_0) \big\|_2 \;-\; \lambda \, \big\| g(x) - g(x_0) \big\|_2 \qquad (2)$$

Here $\lambda$ is a hyperparameter used for appropriate scaling to ensure stability (empirically, the same choice works well for CIFAR10/100 and ImageNet). When solving this optimization, we can choose different starting values of $x$ to obtain different results $\tilde{x}$. All of the $\tilde{x}$, along with $x_0$, are then “perceived” similarly by $f$ but differently by $g$, and are thus called controversial stimuli. It is important to note that if $f$ and $g$ perceive all inputs in exactly the same way, then controversial stimuli cannot exist.
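For intuition, the two-network game can be played exactly in a linear toy: for a linear "network", the inputs it perceives identically to $x_0$ form an affine subspace (a null space), and we can push the other network's representation apart while staying inside it. This is a sketch under loud assumptions (random linear maps standing in for DNNs; real models need the iterative optimization above rather than an exact projection):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical "networks" as random linear feature maps; the first
# should keep perceiving the stimulus like x0, the second should not.
A = rng.normal(size=(24, 64))              # features of the "similar" model
B = rng.normal(size=(24, 64))              # features of the "distinct" model

# Projector onto the null space of A: moving along these directions
# leaves A's representation exactly unchanged.
P = np.eye(64) - np.linalg.pinv(A) @ A

x0 = rng.normal(size=64)
d = P @ rng.normal(size=64)
for _ in range(200):
    d = P @ (d + 0.01 * (B.T @ (B @ d)))   # grow B's representation distance
    n = np.linalg.norm(d)
    if n > 5.0:                            # keep the stimulus a bounded change
        d *= 5.0 / n
x = x0 + d                                 # a controversial stimulus

assert np.linalg.norm(A @ x - A @ x0) < 1e-8   # first model: "same as x0"
assert np.linalg.norm(B @ x - B @ x0) > 1.0    # second model: "very different"
```

Because deep networks are non-linear, no such closed-form subspace exists for them, which is why the paper optimizes both distance terms jointly by gradient descent.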

When we can generate such controversial stimuli, one of the two models has to be (more) incorrect in perceiving them, as the set of images can either be visually similar or distinct (as judged by humans). If, based on human perception, the generated stimuli are visually similar, then they are perceived incorrectly by the model for which they should be distinct. Likewise, if humans judge the stimuli to be visually distinct, then they are perceived incorrectly by the model for which they should be similar. Since eliciting human judgements is expensive and time consuming, we use the distance between two images measured by LPIPS [41] as a measure of perceptual similarity. (As in the experiments of Section 2, we use AlexNet to measure LPIPS, using the public source code of the original paper.)


CIFAR10: VGG16 vs. ResNet18
Contr. Stimuli (Pairwise Dist.): stimuli similar for VGG16, distinct for ResNet18 | stimuli similar for ResNet18, distinct for VGG16

ImageNet: VGG16 vs. ResNet50
Contr. Stimuli (Pairwise Dist.): stimuli similar for VGG16, distinct for ResNet50 | stimuli similar for ResNet50, distinct for VGG16

Table 2: [Controversial Stimuli for pairs of CIFAR10 and ImageNet models] When the more aligned model (in bold) is set as the model for which the stimuli should be similar in Eq 2, we find that the controversial stimuli are more perceptually similar (indicated by lower pairwise distance, in bold) than when the less aligned model plays that role. The results indicate that the more aligned model (ResNet18) perceives the controversial stimuli more accurately than the less aligned model (VGG16). Here, model alignment is measured based on human judgements on the 2AFC task in Table 1.

We find that the model with higher alignment is more likely to perceive controversial stimuli correctly and hence is more trustworthy. This is shown in Table 2, where setting the model that should perceive the stimuli as similar (see Eq 2) to the more aligned model (ResNet18 for CIFAR10 and ResNet50 for ImageNet) leads to stimuli that are more perceptually similar – as indicated by lower LPIPS distance. Thus, our proposed measure of alignment can help identify which model is likely to go wrong on controversial stimuli. Fig 3 shows some examples (more in Appendix B) of controversial stimuli generated by playing two robust ImageNet models (ResNet50 and VGG16, taken from [48]) and two robust CIFAR10 models (ResNet18 and VGG16 [24]) against each other. These examples suggest that the more aligned models (the ResNets) are likely to perform better on controversial stimuli.

3.2 Robustness of Model Predictions

Figure 4: [Using Alignment as a Measure of Robustness] ResNet18 trained on CIFAR10 using Adversarial Training [24], TRADES [27], MART [28] and k-WTA [49]. TRADES and MART use a hyperparameter to balance the clean and robust accuracies. Note the strong correlation between measures of perception alignment and robust accuracy. Thus, perception alignment can be used as a proxy measure for adversarial robustness.

We posit that models whose representations are estimated to be better aligned with human perception tend to have good robust accuracy. By definition, adversarial attacks are imperceptible perturbations to the input that result in a misclassification [50]. If a model is well aligned with human perception, it is less likely to change its representations under imperceptible perturbations and, consequently, more likely to be robust to adversarial attacks.
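The definition above is easy to make concrete in a toy linear classifier (everything below is a hypothetical stand-in, not a model from the paper): an FGSM-style step perturbs every pixel by at most eps, which is visually negligible, yet the per-pixel changes accumulate through the weights and flip the prediction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear "classifier": predicts the positive class when w.x + b > 0.
# Dimensions mimic a flattened 28x28 image.
w = rng.choice([-1.0, 1.0], size=784) / 28.0
b = 0.1                                    # the clean input sits at margin b
x = np.zeros(784)
score = lambda v: w @ v + b

# FGSM-style step: an eps-bounded change to each pixel, aimed against the
# score. Per-pixel the change is tiny, but it adds up to eps * ||w||_1 = 0.28.
eps = 0.01
x_adv = x - eps * np.sign(w)

assert score(x) > 0 and score(x_adv) < 0   # the prediction flips...
assert np.abs(x_adv - x).max() <= eps      # ...under an imperceptibly small change
```

A representation aligned with human perception would, by the argument above, assign `x` and `x_adv` nearly identical representations, leaving no such imperceptible direction for the attack to exploit.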

In Table 1 we find that models with good alignment (as indicated by the 2AFC and Clustering accuracies of humans) are also the ones with higher robust accuracies – for both CIFAR10 and ImageNet. This suggests that good alignment may be a sufficient indicator of adversarial robustness. To further test this hypothesis, we train models with 4 different mechanisms proposed in prior work to ensure robustness: adversarial training [24], TRADES [27], MART [28], and k-WTA [49]. The first three methods use a modified loss function during training, whereas k-WTA introduces a new kind of activation, called k-Winners-Take-All. We took the k-WTA model with the highest reported robust accuracy (a sparse ResNet18). We find that the model trained using k-WTA has the least alignment, and note that k-WTA was recently broken using stronger attacks [30]. Fig 4 shows how methods such as adversarial training, TRADES and MART have both high alignment and high robust accuracy, whereas k-WTA has low alignment.

While we used k-WTA as an illustrative example, many proposed adversarial defences have been repeatedly broken [30, 51], a key reason being the lack of proper robustness evaluation [52]. These results indicate that measuring alignment can be a promising and simple method to help audit robustness of proposed defences.

4 What Contributes to Good Alignment

While prior work has shown that robust models tend to have “perceptually aligned gradients” [25] and that robust training “induces a human prior on the learned representations” [1, 26], we find that robust training – while an important ingredient – is not the only factor that leads to well aligned representations. We find that alignment also depends on other factors such as model architecture, data augmentation, and learning paradigm.

4.1 Loss Function

Figure 5: [Role of Loss Function in Alignment; CIFAR10] We see that robustly trained models are more aligned, similar to past findings. However, we see that different architectures and training datasets can lead to varying levels of alignment. For example, VGG16 has very low levels of alignment despite being trained using robust training losses. Results on other datasets in Appendix C.

Prior work has suggested that alignment could be a general property of robustly trained models [25, 1]. We test the alignment of DNNs trained using various loss functions – standard Empirical Risk Minimization (ERM), adversarial training, and its variants (TRADES [27], MART [28]). Fig 5 shows that the alignment of non-robust models (blue squares) is considerably worse than that of robust ones (triangles and circles). However, the effect is also influenced by the choice of model architecture, e.g., on CIFAR10, for all training methods, robust VGG16 has significantly lower alignment than other models.

4.2 Data Augmentation

(a) CIFAR10
(b) CIFAR100

(c) Imagenet

Figure 6: [Role of Data Augmentation in Representation Alignment] We see that there are cases where data augmentation makes a significant contribution to better aligned representations (ResNet18 for CIFAR10, CIFAR100 and ImageNet, and InceptionV3 for CIFAR10). Importantly, data augmentation never significantly hurts alignment. ImageNet models here are robustly trained (both with and without data augmentation) using [53].

Hand-crafted data augmentations (such as rotation, blur, etc.) are commonly used in deep learning pipelines. We wish to understand the role played by these augmentations in learning aligned representations. Intuitively, if robust training – which augments adversarial samples during training – generally leads to better aligned representations, then hand-crafted data augmentations should play some role too. Fig 6 shows how data augmentation can be crucial in learning aligned representations for some models (ResNet18 benefits greatly from data augmentation on all three datasets). While for all other experiments we used the robust ImageNet models provided by [48], for these experiments we needed to re-train ImageNet models – which is a very time-consuming task. Thus, we used a faster approximation of adversarial training [53]. Since the implementation provided by the authors only supports a single threat model, the models in Fig 6(c) were trained under that threat model.

4.3 Learning Paradigm

(a) CIFAR10
(b) CIFAR100
Figure 7: [Role of Learning Paradigm] We compare supervised models to their self-supervised counterparts and find that typically for non-robust models, self-supervised methods yield much better alignment. However, when trained to be robust, supervised models typically align much better than self-supervised models.

Since both adversarial and hand-crafted data augmentations help in alignment, we now wish to see how alignment of self-supervised models – which explicitly rely on data augmentations – compares with the alignment of fully supervised models. SimCLR [31] is a popular self-supervised training method that learns ‘meaningful’ representations by using a contrastive loss. Recent works have built on SimCLR to also include adversarial data augmentations [54, 55]. We train both the standard version of SimCLR and the one with adversarial augmentation on CIFAR10 and CIFAR100 and compare their alignment with the supervised counterparts. More training details are included in Appendix C.
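For reference, the contrastive objective SimCLR optimizes (NT-Xent) can be sketched in a few lines. This is a minimal NumPy sketch in which raw arrays stand in for an encoder's projected embeddings (an assumption for illustration; the real method trains a network to produce them):

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent contrastive loss over 2N embeddings, where rows 2k and 2k+1
    are two augmented views of example k. Each view is pulled toward its
    sibling and pushed away from all other views."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # never contrast a view with itself
    n = len(z)
    pos = np.arange(n) ^ 1                             # sibling index: 0<->1, 2<->3, ...
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(4)
views = np.repeat(rng.normal(size=(8, 32)), 2, axis=0)  # perfectly paired views
matched = views + 0.01 * rng.normal(size=(16, 32))      # near-identical view pairs
noise = rng.normal(size=(16, 32))                       # unrelated "views"
loss_matched, loss_random = nt_xent(matched), nt_xent(noise)
assert loss_matched < loss_random   # consistent view pairs score much better
```

The adversarially augmented variants cited above keep this objective but generate one of the two views adversarially rather than with hand-crafted transforms.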

Fig 7 shows the results comparing self-supervised and standard supervised learning. We indeed see that for standard learning (without adversarial data augmentation), self-supervised models have better alignment. We also see that for self-supervised models, adversarial data augmentation improves alignment. However, with adversarial augmentation, supervised learning outperforms self-supervised models.

To summarize, various learning parameters in addition to robust training affect representation alignment, and the nature of these effects is quite nuanced. We leave a more comprehensive study of the effects of these training parameters on alignment for future work.

5 Conclusion and Broader Impacts

We proposed a measure of human perceptual alignment for the learned representations of any deep neural network and showed how it can be reliably measured at scale. This measure of alignment can be a useful model-evaluation tool, as it gives insights beyond those offered by traditionally used metrics such as clean and robust accuracy. Through empirical evaluation, we showed that models with similar accuracies can have many controversial stimuli. In such cases, the model with higher alignment is more likely to perceive the controversial stimuli correctly, making it more trustworthy. Additionally, we demonstrated how measuring alignment could be a simple yet promising way to evaluate adversarial robustness. Finally, we conducted a study of the various factors that lead to better-aligned representations. We find that properties of a model such as its architecture, training loss, training paradigm, and data augmentation can all play significant roles in learning representations that are aligned with human perception.

Human perception is complex, nuanced, and discontinuous [56], which poses many challenges in measuring the alignment of DNNs with human perception [57]. In this work we take a small step towards defining and measuring this alignment. Our proposed method is a necessary but not sufficient condition for alignment with human perception, and thus must be used carefully and supplemented with other checks, including domain expertise. By presenting this method, we hope to enable better understanding and auditing of DNNs.


AW acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute, and the Leverhulme Trust via CFI. This work was supported by Wellcome Trust Investigator Award WT106931MA and Royal Society Wolfson Fellowship 183029 to BCL. VN, AM, CK, and KPG were supported in part by an ERC Advanced Grant “Foundations for Fair Social Computing” (no. 789373). VN and JPD were supported in part by NSF CAREER Award IIS-1846237, NSF D-ISN Award #2039862, NSF Award CCF-1852352, NIH R01 Award NLM013039-01, NIST MSE Award #20126334, DARPA GARD #HR00112020007, DoD WHS Award #HQ003420F0035, ARPA-E Award #4334192 and a Google Faculty Research Award. All authors would like to thank Nina Grgić-Hlača for help with setting up AMT surveys.


Appendix A Measuring Human Alignment via Representation Inversion

a.1 Measuring Human Perception Similarity

We recruited AMT workers to complete surveys measuring the alignment of learned representations. We recruited workers with a high task-completion rate who spoke English. To further ensure that workers understood the task, we added attention checks. For the 2AFC task, this meant making the query image identical to one of the two option images; in the clustering setting, this meant making one of the row images identical to one of the column images. All workers who took our survey passed the attention checks.

We estimated the completion time for each survey and paid each worker accordingly. We allotted ample time per survey so that workers were not forced to rush; most workers completed the task in less than 20 minutes. Our study was approved by the Ethical Review Board of our institute.

a.2 Constructing the “Hard” ImageNet Clustering Task

When constructing the clustering task for ImageNet, we noticed that choosing 3 images at random often yields images that are visually very different (e.g., very different colors, as shown in Fig 1(c)). To ensure that our results did not depend on the 3 chosen images, we additionally constructed a “hard” version of the same task where, instead of randomly sampling 3 images, we chose the 3 most similar samples from the ImageNet validation set. To obtain similar samples, we leveraged prior work that collected human similarity judgements on ImageNet [42]. Using the similarity matrix constructed by the authors, for each randomly sampled image we chose the 2 images most similar to it in the ImageNet validation set (we thank the authors for sharing their annotations with us). This gave us a harder version of the task where the 3 target images look very similar, as shown in Fig 1(b).
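
The selection step above can be sketched as follows. This is a hypothetical helper, assuming the human-judgement similarity matrix stores higher values for more-similar pairs:

```python
import numpy as np

def most_similar(sim, anchor, k=2):
    """Return indices of the k images most similar to `anchor`,
    according to a similarity matrix `sim` (higher = more similar),
    excluding the anchor itself."""
    scores = sim[anchor].astype(float)   # copy so we can mask in place
    scores[anchor] = -np.inf             # never pick the anchor itself
    return np.argsort(scores)[::-1][:k]

# Tiny illustrative matrix: image 0 is most similar to images 1 and 3.
sim = np.array([[1.0, 0.9, 0.2, 0.5],
                [0.9, 1.0, 0.3, 0.4],
                [0.2, 0.3, 1.0, 0.1],
                [0.5, 0.4, 0.1, 1.0]])
assert list(most_similar(sim, 0)) == [1, 3]
```

Applied to the similarity matrix of [42], each randomly sampled anchor plus its two nearest neighbours forms one “hard” clustering instance.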

a.3 Automating Human Surveys

Since conducting a survey for each model we wish to evaluate can be time-consuming and expensive, we used LPIPS [41] to simulate a human. LPIPS has been shown to be a good proxy for human perception. For 2AFC, this meant using LPIPS to measure the distance between the query image and the two option images, then matching the query to the one with the smaller LPIPS distance. Similarly, in clustering, for each row image we used LPIPS to measure its distance from each of the 3 column images and matched it to the closest one. Our results (Table 1 in the main paper) show that this can serve as a good proxy for human perceptual similarity. For all our experiments, we used the AlexNet model made available by the authors to measure LPIPS distance.
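
A minimal sketch of this automated 2AFC procedure is below. A plain L2 distance stands in for LPIPS so the sketch is self-contained; in practice one would plug in a perceptual distance from the `lpips` package (e.g. `lpips.LPIPS(net='alex')`). The function and variable names are illustrative assumptions.

```python
import numpy as np

def automated_2afc(query, option_a, option_b, distance):
    """Simulate a 2AFC response: match the query to whichever option
    is closer under the given perceptual distance."""
    return 'a' if distance(query, option_a) <= distance(query, option_b) else 'b'

# Placeholder distance; swap in an LPIPS callable for the real setup.
l2 = lambda u, v: np.linalg.norm(u - v)

rng = np.random.default_rng(0)
target = rng.random((8, 8))
near = target + 0.01 * rng.random((8, 8))   # perceptually close option
far = rng.random((8, 8))                    # unrelated option
assert automated_2afc(target, near, far, l2) == 'a'
```

The clustering variant is analogous: each row image is assigned to the column image with the smallest distance.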

a.4 Role of Input Distribution

Fig 8 shows some examples of inputs sampled from different Gaussians (the darker and lighter samples come from Gaussians with different parameters). When starting from different seeds (we show two different types of seeds here, similar to all other experiments for in-distribution targets), we see that reconstructed images can be qualitatively different for different models. We posit that completing the 2AFC and clustering tasks on inputs that look like noise to humans is qualitatively harder than doing so on in-distribution target samples.

Remarkably, we observe that humans are still able to bring out the differences between models, even when given the harder task of matching reconstructed noisy inputs. Table 3 shows the results of surveys conducted with noisy target samples. It is worth noting that human accuracy drops quite a bit relative to in-distribution targets, indicating that this is indeed a harder task.

a.5 Models

CIFAR10/100 We used VGG16, ResNet18, Densenet121, and InceptionV3 for experiments on CIFAR10 and CIFAR100. The “robust” versions of these models were trained using adversarial training [24] with an epsilon of 1, i.e., these models were trained to be robust to perturbations of magnitude less than 1.

ImageNet We used VGG16 (with batchnorm), ResNet18 and ResNet50 for ImageNet. The “robust” versions of these models were taken from [48], who trained these models using adversarial training with an epsilon of 3.


Model         In-Distribution   Noise 1   Noise 2
ResNet18            98.0        52.632    52.632
VGG16               11.5         0.000     0.000
InceptionV3         87.5        52.631    52.631
Densenet121        100.0        52.632    52.632


Table 3: [CIFAR10 and ImageNet In-Distribution vs. OOD Survey Results] We note that even on OOD samples that look like noise, humans can still bring out the relative differences between models; e.g., Densenet121 on CIFAR10 is still considered the best-aligned model on targets sampled from both kinds of noise. The reduced accuracy of humans on noise shows that this is indeed a harder task than with in-distribution target samples.

Appendix B Why Does Alignment Matter?

We show some more examples of controversial stimuli in Figs 9 and 10.

Figure 9: [Controversial Stimuli Examples] Example from CIFAR10 (top) and ImageNet (bottom) along with corresponding controversial stimuli generated using Eq 2.
Figure 10: [Controversial Stimuli Examples] Example from CIFAR10 (top) and ImageNet (bottom) along with corresponding controversial stimuli generated using Eq 2.

Appendix C What Contributes to Good Alignment

Fig 11 shows results for CIFAR100 and ImageNet for standard and robust training. Similar to previous works, we find that robust models are better aligned with human perception. Interestingly, the variance between architectures that we observed for CIFAR10 does not exist for CIFAR100 and ImageNet, i.e., regardless of architecture, robustly trained models are well aligned with human perception.

(a) CIFAR100

(b) ImageNet

Figure 11: [Role of Loss Function in Representation Alignment] We see that robustly trained models are more aligned, similar to past findings.