Learning Perceptually-Aligned Representations via Adversarial Robustness

by Logan Engstrom, et al.

Many applications of machine learning require models that are human-aligned, i.e., that make decisions based on human-meaningful information about the input. We identify the pervasive brittleness of deep networks' learned representations as a fundamental barrier to attaining this goal. We then re-cast robust optimization as a tool for enforcing human priors on the features learned by deep neural networks. The resulting robust feature representations turn out to be significantly more aligned with human perception. We leverage these representations to perform input interpolation, feature manipulation, and sensitivity mapping, without any post-processing or human intervention after model training. Our code and models for reproducing these results are available at https://git.io/robust-reps.






1 Introduction

A major appeal of deep neural networks is their ability to attain remarkably high accuracy on a variety of tasks [krizhevsky2012imagenet, he2015delving, collobert2008unified]. However, while accurate decisions are certainly desirable, we often expect models to make these decisions for what we as humans deem to be the right reasons. Specifically, we want models that make predictions based on high-level features that are actually meaningful to us. We expect such “human-aligned” models to be more amenable to inspection and intervention, and also to act more predictably in unseen or unexpected environments.

Deep neural networks are often thought of as linear models acting on learned feature representations (also known as embeddings); thus, the features comprising these representations determine whether the models capture meaningful high-level concepts. Ideally, we could specify precisely which features a model should capture (e.g., eyes, fur, etc.), thus ensuring that it is indeed “human-aligned.” However, there are no known methods for controlling features in this manner (and in fact, one of the primary advantages of deep learning is that it removes the need for manual feature engineering [bengio2009learning, vincent2010stacked, zhou2015object, larsen2016autoencoding, bau2017network]).

While we may not be able to dictate exactly which features models should use, the representations of human-aligned models should, at a minimum, satisfy a few concrete “common-sense” properties:

  • Meaningful components. Model representations should be comprised of components that capture individual high-level features of the input data.

  • Reflectiveness of “data geometry.” Beyond having components that are human-meaningful, feature representations should capture the intrinsic geometry of the input data as humans perceive it—distance in representation space should correspond directly to a semantic notion of distance in input space.

Crucially, we should be able to verify the above properties in representations without compromising model-faithfulness, i.e., without introducing additional priors or information at interaction-time via post-processing or regularization. After all, without model-faithfulness, it is difficult to disentangle the effects of the introduced priors from the information captured by the representation itself.

Now, given their role in performance on standard benchmarks, do current learned feature representations achieve these goals?

On one hand, state-of-the-art embeddings are suitable for use in transfer learning [girshick2014rich, donahue2014decaf] or as perceptual distance proxies [dosovitskiy2016generating, johnson2016perceptual, zhang2018unreasonable], which indicates that they do in fact encode some useful information about the input. On the other hand, the existence of adversarial examples [szegedy2014intriguing], together with the fact that they may correspond to flipping predictive features [ilyas2019adversarial], suggests that we may not yet have attained human-aligned embeddings. (We show this directly in Section 2.)

Similarly, despite significant work towards extracting the aforementioned “common-sense” properties from standard networks, progress has thus far come at the cost of model-faithfulness. For example, while methods for saliency map generation often highlight meaningful features in the input (thus making progress towards the first property above), recent work [adebayo2018sanity] has shown that as a result, these heatmaps are not actually fully representative of models’ decision mechanisms (thus sacrificing model-faithfulness). The human-meaningfulness of the saliency maps thus cannot really be attributed to the model itself, due to human priors introduced during visualization.

Figure 1: Sample images for each property of robust representations demonstrated in this work. We show that robust representations allow for direct feature visualization (left) and image inversion (top), as well as applications such as feature manipulation (right), decision inspection (middle), and interpolations between arbitrary pairs of inputs (bottom). All of these applications use only gradient descent on simple, unregularized, direct functions of the representations.

Our contributions.

In this work, we show that robust optimization can be viewed as a method for enforcing (user-specified) priors on the features that models should learn. We then find that the robust representations that result from training with simple priors already make significant progress towards our aforementioned goals for human-aligned representations. In fact, robust representations enable previously impossible modes of direct interaction with high-level input features (illustrated in Figure 1):

  • Simple feature visualization: Direct maximization of the coordinates of robust representations suffices to visualize human-meaningful features of the model.

  • Representation inversion: In contrast to standard representations, robust representations are approximately invertible—they provide a high-level embedding of the input on which images with similar robust representations are semantically similar.

  • Image-to-image interpolation: Straight paths between pairs of inputs in robust representation space correspond to interpolations between their features in image space. In fact, we can straightforwardly invert these paths to obtain natural-looking image interpolations.

  • Feature manipulation: Features can be added to images by directly optimizing over representation space (with first order methods).

  • Insight into specific model decisions: Without any regularization or attempts to enhance appearance, robust representations induce gradients that are more perceptually aligned than those of standard representations. This enables us to gain insight into model decisions via direct optimization over representation space.

Broadly, our results indicate that robust optimization is a promising avenue for better representation learning, and further highlight the importance of introducing human priors into the training of deep networks.

2 Limitations of Standard Representations

We follow the standard convention of defining the representation $R(x)$ induced by a deep network as the activations of the penultimate layer (i.e., a vector in $\mathbb{R}^k$); the prediction of a network on an input $x$ can thus be viewed as the output of a linear classifier on the representation $R(x)$.


As discussed in the previous section, state-of-the-art deep neural networks have matched or even surpassed human performance on a variety of tasks. Moreover, they already succeed in learning representations suitable for transfer to other tasks [girshick2014rich, donahue2014decaf], and for use as a proxy for perceptual distance on natural images [dosovitskiy2016generating, johnson2016perceptual, zhang2018unreasonable]. These successes might lead us to believe that we have already obtained the aforementioned human-aligned embeddings.

We find, however, that this might not be the case: representations learned by state-of-the-art models are not entirely human-aligned. Indeed, it is straightforward to construct pairs of images with nearly identical representations yet drastically different content (Figure 2). Concretely, we find that for any two arbitrarily chosen inputs $x_1$ and $x_2$ from a standard image dataset, we can find an $x_1'$ such that

$x_1'$ differs imperceptibly from $x_1$: $\|x_1' - x_1\|_2 \le \varepsilon$ (1)
yet $x_1'$ and $x_2$ have similar representations: $R(x_1') \approx R(x_2)$ (2)

for some small $\varepsilon$. Here, $x_1'$ and $x_2$ are images containing different content but mapping to similar representations. The existence of such image pairs (and similar phenomena observed by prior work [jacobsen2019excessive]) lays bare the misalignment between the notion of distance currently learned by deep networks and the notion of distance perceived by humans.

Trouble with standard representations may extend beyond difficulty manipulating them or misalignment with human perception—Ilyas et al. [ilyas2019adversarial] suggest that standard models make use of predictive, generalizing features in the data distribution that can be drastically affected by imperceptible perturbations. Therefore, the fact that models capture these features is in direct opposition to our goal of having representations comprised of only human-meaningful concepts. It also indicates that post-hoc interpretation methods may actually suppress useful features that the model uses to make its decisions.

Figure 2: Illustration of the shortcomings discussed in Section 2. We can easily find images (via optimization) that appear completely different yet map to similar representations.

3 Learning Better Representations

In the previous section, we showed that standard embeddings do not satisfy the requirements we set for representations in Section 1. How can we bring models closer to realizing these requirements?

A core issue here is models’ sensitivity to human-meaningless changes in input space. This sensitivity explains the need to resort to “model-unfaithful” methods—after all, without these methods there will exist a meaningless way to perturb the input and get the desired result. It also explains the lack of alignment with human perceptual geometry. Thus, having human-aligned representations requires training models that are a priori invariant to these meaningless changes in input.

In this paper, we propose the use of the robust optimization framework to enforce human priors on the features that can be learned by models. Instead of just training for maximum accuracy, robust optimization also requires that models be invariant to a pre-specified class of perturbations $\Delta$:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in \Delta} \mathcal{L}_{\theta}(x + \delta, y) \right] \qquad (3)$$

While there is no known way for $\Delta$ to capture all possible human-meaningless perturbations, a natural starting point is to enforce simple priors via perturbations that humans are clearly invariant to. In particular, for the remainder of this work, we consider models trained with one of the simplest priors on human vision: invariance to small $\ell_2$-bounded changes in input space. Given robust models trained to solve (3) with this perturbation set (a small $\ell_2$ ball), we now aim to determine whether their robust representations are actually more aligned with human perception.
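To make the inner maximization of the robust objective concrete, the following is a minimal sketch for a case where it has a closed form: a linear model with logistic loss and an $\ell_2$ perturbation budget. The linear model, the labels in {-1, +1}, and all function names are assumptions chosen for illustration; they are not the deep networks used in this work, where the inner maximum has no closed form and is approximated with projected gradient descent.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic_loss(margin):
    """Logistic loss as a function of the signed margin y * <w, x>."""
    return math.log1p(math.exp(-margin))

def standard_loss(w, x, y):
    return logistic_loss(y * dot(w, x))

def worst_case_delta(w, y, eps):
    """Perturbation of l2 norm eps that lowers the margin the most: -y * eps * w / ||w||_2."""
    w_norm = math.sqrt(dot(w, w))
    return [-y * eps * wi / w_norm for wi in w]

def robust_loss(w, x, y, eps):
    """Maximum of the loss over ||delta||_2 <= eps, in closed form (linear model only)."""
    w_norm = math.sqrt(dot(w, w))
    return logistic_loss(y * dot(w, x) - eps * w_norm)

# Toy values (hypothetical): the closed form matches the loss at the worst-case input.
w, x, y, eps = [3.0, 4.0], [1.0, 1.0], 1, 0.5
delta = worst_case_delta(w, y, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]
print(standard_loss(w, x, y), robust_loss(w, x, y, eps), standard_loss(w, x_adv, y))
```

In this toy setting the robust loss penalizes the norm of the weights, which is one way to see how the choice of perturbation set acts as a prior on what the model is allowed to rely on.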

4 Are Robust Representations More Perceptually Aligned?

In the previous section, we proposed robust optimization as a way to enforce human priors during model training, with the goal of mitigating the issues with standard representations identified in Section 2. In this section, we revisit the properties of human-aligned embeddings outlined in Section 1 and test whether these properties manifest themselves in robust representations (the feature representations of robust models, trained with (3)).

4.1 Robust representations are composed of meaningful components

We first show that robust representations are in fact composed of human-meaningful features. In particular, the coordinates of robust representations correspond to human-relevant features in the input data, and these human-relevant features emerge when the coordinate is maximized directly. Concretely, given a component $i$ of the representation vector, we use gradient descent to find an input $x'$ that maximally activates it, i.e., we solve:

$$x' = x_0 + \arg\max_{\delta} \; R(x_0 + \delta)_i \qquad (4)$$

for various starting points $x_0$ (see Appendix A.4 for details). We find that individual coordinates of robust representations indeed capture human-meaningful concepts (Figure 3). Moreover, these coordinates consistently represent the same concepts across different starting inputs ($x_0$ sampled randomly from input space or from the data distribution). These concepts also appear in the natural inputs (from the test set) that most strongly activate their corresponding coordinates (Figure 4).
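The coordinate maximization described above can be sketched on a toy model. The representation below, tanh applied to a hand-picked 2x3 matrix, is a hypothetical stand-in for a network's penultimate layer, and the normalized-gradient update with an $\ell_2$ budget is one common way to run such an optimization rather than a statement of the exact procedure used here.

```python
import math

# Hypothetical toy representation: R(x) = tanh(A x) with a hand-picked matrix A.
A = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]

def representation(x):
    return [math.tanh(sum(a * xi for a, xi in zip(row, x))) for row in A]

def coordinate_gradient(x, i):
    """Gradient of R(x)_i with respect to x for the toy model above."""
    pre_activation = sum(a * xi for a, xi in zip(A[i], x))
    scale = 1.0 - math.tanh(pre_activation) ** 2
    return [scale * a for a in A[i]]

def maximize_coordinate(x0, i, eps, step_size, steps):
    """Normalized gradient ascent on R(x0 + delta)_i, keeping ||delta||_2 <= eps."""
    delta = [0.0] * len(x0)
    for _ in range(steps):
        x = [a + d for a, d in zip(x0, delta)]
        grad = coordinate_gradient(x, i)
        g_norm = math.sqrt(sum(g * g for g in grad))
        if g_norm == 0.0:
            break
        delta = [d + step_size * g / g_norm for d, g in zip(delta, grad)]
        d_norm = math.sqrt(sum(d * d for d in delta))
        if d_norm > eps:  # project back onto the l2 ball
            delta = [d * eps / d_norm for d in delta]
    return [a + d for a, d in zip(x0, delta)]

x0 = [0.0, 0.0, 0.0]
x_max = maximize_coordinate(x0, i=0, eps=1.0, step_size=0.1, steps=50)
print(representation(x0)[0], representation(x_max)[0])
```

For a real network, the gradient would come from automatic differentiation instead of the hand-derived formula.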

Figure 3: Correspondence between image-level patterns and activations learned by a robust model on the restricted ImageNet dataset. Starting from randomly chosen seed inputs (noise/images), we use PGD to find inputs that (locally) maximally activate a given component of the representation vector (cf. Appendix A.4 for details). In the left column we have the original inputs (selected randomly), and in subsequent columns we visualize the result of the optimization (4) for different activations, with each row starting from the same (far left) input. Additional visualizations in Appendix B.1.

Figure 4: Maximizing inputs (found by solving (4) with $x_0$ being a gray image) and most activating images (from the test set) for two random activations of a robust model trained on the restricted ImageNet dataset. In addition to finding maximizing inputs (as in the bottom row of Figure 3), we plot the top three and bottom three images (labeled “most activated” and “least activated” respectively) from the validation set, sorted by the magnitude of the selected activation.

4.2 Robust representations induce a human-aligned geometry

In Section 2, we demonstrated a fundamental shortcoming of representations learned by standard classifiers: for any input $x$, we could find another image that looked entirely different but had nearly the same representation. This indicated a fundamental misalignment between the geometry of standard representations and our perception of the input data.

We now show that this alignment is greatly improved for robust representations. In particular, finding representations which are close together in embedding space necessitates finding two images which are semantically similar. To demonstrate this, we use projected gradient descent to construct inputs (from different starting points) that approximately minimize distance in representation space to a pre-selected image (details in Appendix A.5), which corresponds to solving:

$$x_1' = x_1 + \arg\min_{\delta} \; \| R(x_1 + \delta) - R(x_2) \|_2 \qquad (5)$$

for a target image $x_2$ and a starting image $x_1$.

In stark contrast to what we observe for standard models, we find that the resulting images are actually semantically similar to the image whose representation is being matched (Figure 5).
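The representation-matching optimization can be sketched as follows. The linear map used as the representation is a hypothetical toy, and the sketch uses plain gradient descent on the squared distance and omits the projection step for brevity; both choices are simplifications for illustration.

```python
# Hypothetical toy representation: a fixed linear map R(x) = A x.
A = [[2.0, 0.0],
     [0.0, 1.0]]

def representation(x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def squared_distance_gradient(x, target_rep):
    """Gradient of ||R(x) - target_rep||_2^2 with respect to x: 2 A^T (A x - target_rep)."""
    residual = [r - t for r, t in zip(representation(x), target_rep)]
    return [2.0 * sum(A[k][j] * residual[k] for k in range(len(A)))
            for j in range(len(x))]

def invert_representation(x_start, x_target, step_size, steps):
    """Gradient descent from x_start toward an input whose representation matches x_target's."""
    target_rep = representation(x_target)
    x = list(x_start)
    for _ in range(steps):
        grad = squared_distance_gradient(x, target_rep)
        x = [xi - step_size * g for xi, g in zip(x, grad)]
    return x

x_found = invert_representation([0.0, 0.0], [1.0, -1.0], step_size=0.1, steps=200)
print(x_found, representation(x_found))
```

Because the toy map is invertible, matching the representation recovers the target input exactly. A real representation has far fewer dimensions than the input, so whether the result resembles the target depends on what the representation encodes, which is the property examined in this section.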

Figure 5: Visualization of inputs that are mapped to similar representations by a robust model. Original: an example image from the ImageNet test set; Seed: randomly selected starting point for the optimization process; Result: images obtained by optimizing inputs (starting from the corresponding seed) to minimize $\ell_2$-distance to the representation of the original image.

Figure 6: Robust representations yield semantically meaningful embeddings. Original: random images from the test set (col. 1-5) and from outside of the training distribution (6-10); Result: images obtained from optimizing inputs (starting from Gaussian noise) to minimize $\ell_2$-distance to the representations of the corresponding image in the top row. (More examples appear in Appendix B.2.)

In fact, this “meaningful inversion” property holds true even for out-of-distribution inputs, demonstrating that robust representations capture general high-level features. In particular, we repeat the previous experiment using images from classes not present in the original dataset (Figure 6 right) and structured random patterns (Figures 14 and 15 of Appendix B.2): the reconstructed images consistently resemble the originals. This indicates that (a) the geometry of the representation space of robust models aligns well enough with human perception to prevent matching representations without also matching the high-level features of the image, and (b) we can actually recover a surprising amount of information about the original input from robust representations, even though they have much lower dimensionality than the input space and the model is not trained explicitly for reconstruction.

Relation to other inversion methods.

Methods for inverting deep representations typically either solve an optimization problem similar to (5) while imposing a “natural image” prior on the input [mahendran2015understanding, yosinski2015understanding, ulyanov2017deep] or train a separate network to perform the inversion [kingma2013autoencoding, dosovitskiy2016inverting, dosovitskiy2016generating]. As a result, these methods are not fully faithful to the model as they introduce additional priors into the process. While it is possible to construct models that are invertible by construction [dinh2014nice, dinh2017density, jacobsen2018irevnet, behrmann2018invertible], the representations learned are not necessarily robust and thus not human-aligned [jacobsen2018irevnet].

5 Additional Benefits of Robust Representations

In the previous sections, we used $\ell_2$-robust training to achieve representations that are composed of meaningful features, have a human-aligned geometry, and admit model-faithful interaction methods.

Here, we demonstrate several additional benefits enabled by robust embeddings and their increased human-alignment. We find that such embeddings provide us with natural methods for input interpolations (Section 5.1), feature manipulation (Section 5.2), and sensitivity evaluation (Section 5.3).

5.1 Meaningful latent space interpolations

We now leverage robust representations to produce natural interpolations between any two inputs. That is, given two images $x_1$ and $x_2$, we find the $\lambda$-interpolate between them as

$$x_\lambda = \arg\min_{x} \; \| (\lambda \cdot R(x_1) + (1 - \lambda) \cdot R(x_2)) - R(x) \|_2 \qquad (6)$$

where, for a given $\lambda$, we find $x_\lambda$ by solving (6) with projected gradient descent. Intuitively, this corresponds to linearly interpolating between the points in representation space and then finding a point in image space that has a similar representation. To construct a length-$(T+1)$ interpolation, we choose $\lambda \in \{0, \frac{1}{T}, \frac{2}{T}, \ldots, 1\}$. The resulting interpolations, shown in Figure 7, demonstrate that the $\lambda$-interpolates of robust representations correspond to a meaningful feature interpolation between images. (For standard models, constructing meaningful interpolations is impossible due to the brittleness identified in Section 2; see Appendix B.2.3 for details.)
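As a small illustration of the first half of this procedure, the sketch below computes the interpolated target representations for evenly spaced values of lambda. The function name and the convention that lambda weights the first image are assumptions made for the example; each target would then be handed to an inversion routine to obtain the corresponding image.

```python
def interpolation_targets(rep_1, rep_2, num_intervals):
    """Return lam * rep_1 + (1 - lam) * rep_2 for lam = 0, 1/T, ..., 1 with T = num_intervals."""
    targets = []
    for step in range(num_intervals + 1):
        lam = step / num_intervals
        targets.append([lam * a + (1.0 - lam) * b for a, b in zip(rep_1, rep_2)])
    return targets

# Toy two-dimensional representations (hypothetical values).
targets = interpolation_targets([1.0, 0.0], [0.0, 2.0], num_intervals=4)
for target in targets:
    print(target)
```

With four intervals this yields five targets, starting at the second representation (lambda = 0) and ending at the first (lambda = 1).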

Relation to other interpolation methods.

We emphasize that linearly interpolating in robust representation space works for any two images. This generality is in contrast to interpolations induced by GANs (e.g., [radford2016unsupervised, brock2019large]), which can only interpolate between images generated by the generator. (Reconstructions of out-of-range images tend to be decipherable but rather different from the originals [bau2019inverting].) It is worth noting that even for models with analytically invertible representations, interpolating in representation space does not yield semantic interpolations [jacobsen2018irevnet].

Figure 7: Image interpolation using robust representations compared to their image-space counterparts. The former appear perceptually plausible while the latter exhibit ghosting artifacts. For pairs of images from the Restricted ImageNet test set, we solve (6) for $\lambda$ varying between zero and one, i.e., we match linear interpolates in representation space. Additional interpolations appear in Appendix B.3.1 Figure 17. We demonstrate the ineffectiveness of interpolation with standard representations in Appendix B.3.2 Figure 18.

5.2 Robust representations allow for feature manipulation

Robust representations also yield a natural way to individually manipulate high-level input features. We know (from Section 4.1) that when optimizing over image space to maximize a specific coordinate of robust representations, a consistent human-meaningful feature dominates the image. It turns out that our control over high-level features can be straightforwardly applied to manipulation of images via feature addition.

Specifically, by solving the following clipped maximization objective, we can introduce individual high-level features while preserving images’ original content:

$$x' = x + \arg\max_{\delta} \; \min\left( R(x + \delta)_i, \; \max_{j} R(x)_j \right) \qquad (7)$$

Intuitively, we maximize the desired feature (the $i$-th coordinate) until it is the highest-magnitude feature in the robust representation. We visualize the result of this process for a variety of input images and activation coordinates in Figure 8, where stripes and red limbs are introduced seamlessly into images without any processing or regularization. (We repeat this process with many additional random images and random features in Appendix B.4.1.)

Further, the desired feature appears in the optimization iterates of (7) gradually (cf. Appendix B.4.2), which enables us to control the prominence of added features.
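The effect of clipping can be illustrated with a short sketch. It assumes one plausible reading of the objective, namely that the chosen coordinate is rewarded only up to the largest activation of the original representation; the function names and toy values are hypothetical.

```python
def clipped_objective(rep_perturbed, rep_original, i):
    """min(R(x + delta)_i, max_j R(x)_j): coordinate i is rewarded only up to the cap."""
    cap = max(rep_original)
    return min(rep_perturbed[i], cap)

def is_saturated(rep_perturbed, rep_original, i):
    """True once coordinate i has reached the cap, so increasing it further gains nothing."""
    return rep_perturbed[i] >= max(rep_original)

rep_original = [1.0, 3.0]
print(clipped_objective([0.2, 2.0], rep_original, i=0))  # below the cap: objective tracks coordinate 0
print(clipped_objective([5.0, 2.0], rep_original, i=0))  # above the cap: objective stays at the cap
```

Once the objective saturates, its gradient with respect to the chosen coordinate vanishes, so a first-order optimizer stops pushing the image further. Stopping at earlier iterates gives a weaker version of the added feature.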

Figure 8: Visualization of the results from maximizing a chosen (left) and a random (right) representation coordinate starting from random images. In each figure, the top row has the initial images, and the bottom row has a feature added.

Related work on semantic feature manipulation. The latent space of generative adversarial networks (GANs) [goodfellow2014generative] tends to allow for “semantic feature arithmetic” [radford2016unsupervised, larsen2016autoencoding] similar to that in word2vec embeddings [mikolov2013distributed]. In a similar vein, one can utilize an image-to-image translation framework to perform such manipulation (e.g., transforming horses to zebras), although this requires a task-specific dataset and model [zhu2017unpaired]. Somewhat orthogonally, it is possible to utilize the deep representations of standard models to perform semantic feature manipulations; however, such methods tend to either only perform well on datasets where the inputs are explicitly aligned [upchurch2017deep] or are restricted to a small set of manipulations [gatys2016image].

5.3 Robust representations provide insights into model behavior

The inner mechanisms of standard deep neural networks remain opaque to human observers. In response, significant work has been dedicated to developing interpretability techniques that attempt to attribute predictions of deep models to specific parts of the input in a human-meaningful manner [smilkov2017smoothgrad, sundararajan2017axiomatic, olah2017feature, olah2018building]. The development of these tools has uncovered a fundamental tension between truly accurate explanations and human-meaningful ones. Even state-of-the-art methods necessarily discard input information that models are extremely sensitive to. For instance, models are clearly sensitive to adversarial perturbations, but all successful interpretability techniques suppress these perturbations from visualizations due to their lack of human-meaningfulness.

Specifically, these techniques enforce human priors at explanation time, often through complex methodology [olah2018building]. In contrast, we show that robust representations yield explanations that are model-faithful yet visually meaningful. The methods we introduce are straightforward applications of model gradients, and do not enforce any external human priors at explanation time.

Component-level sensitivity analysis.

Perhaps the simplest interpretability method is to visualize the gradient of the model’s loss with respect to the input. The gradient conveys the sensitivity of the model to perturbations in each individual input component. However, since gradients of standard models are often uninterpretable (Appendix B.5), saliency maps are typically constructed via post-processing [sundararajan2017axiomatic, smilkov2017smoothgrad] or through a learned model [fong2017interpretable, dabkowski2017real]. The introduction of additional processing in these methods leads to explanations that are not faithful to the model, despite being visually appealing. In fact, explanations can often be model- and data-independent [nie2018theoretical, adebayo2018sanity], and are hence inherently unable to explain the predictions of the model. Additionally, saliency maps of standard models can be very brittle to small perturbations of the input [kindermans2017reliability, ghorbani2019interpretation].

In contrast, Tsipras et al. [tsipras2019robustness] demonstrate that for robust models, unmodified gradient-based saliency maps are more human-meaningful (cf. Appendix B.5).

Feature-level sensitivity analysis.

In addition to the component-level analysis performed by Tsipras et al. [tsipras2019robustness], we show that robust representations can also be leveraged for a feature-level understanding of model sensitivity.

To accomplish this, given any input image, we start by computing the largest-magnitude coordinates of the corresponding representation, weighted by the linear classifier in the last layer (mapping from representations to logits). More precisely, for an input $x$ with predicted class $C$ we find:

$$\arg\max_{i} \; W_{iC} \cdot R(x)_i \qquad (8)$$

where $W$ is the linear layer mapping from representations to logits. We then magnify these coordinates in the input by performing the maximization described in (4) (Section 4.1). The resulting images are similar to the originals, but with the highest-weight features accentuated. In addition to providing insight about correct classifications (Figure 9 right), these images can also suggest why certain images are misclassified (Figure 9 left). Performing a similar analysis for standard models requires incorporating non-trivial human priors in order to make the result human-meaningful (and hence losing model-faithfulness) [simonyan2013deep, yosinski2015understanding, olah2017feature, olah2018building].

For example, in Figure 9, accentuating the highest-weight features reveals a natural transformation from the ear of a monkey to the eye of a dog (top), from negative space to the face of a dog (middle), and from the heads of two different fish to the two eyes of a single frog, with a reed transforming into a mouth (bottom). These transformations hint at the most sensitive directions of robust models on misclassified inputs, thus providing insight into model decisions.
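The selection of highest-weight features can be sketched in a few lines. The layout of the weight matrix (rows indexed by representation coordinate, columns by class) and the toy numbers are assumptions for illustration only.

```python
def top_weighted_features(representation, weights, predicted_class, k=1):
    """Indices i with the largest weights[i][predicted_class] * representation[i], best first."""
    scores = [weights[i][predicted_class] * r for i, r in enumerate(representation)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Toy example: three representation coordinates and two classes (hypothetical values).
rep = [1.0, 2.0, 0.5]
W = [[0.1, 3.0],
     [0.2, -1.0],
     [4.0, 0.0]]
print(top_weighted_features(rep, W, predicted_class=0, k=2))
print(top_weighted_features(rep, W, predicted_class=1, k=2))
```

Note that the coordinate with the largest raw activation (index 1 here) is not necessarily the one that contributes most to the predicted class; the weighting by the final linear layer is what ties the selection to the model's decision.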

Figure 9: Accentuating the highest-weight features for correctly classified (right) and incorrectly classified (left) images. The highest-weight features are determined based on (8), and accentuated via the “feature addition” mechanism described in Section 5.2 (Equation (7)). The resulting images appear to show plausible ways in which the images could have been classified/misclassified.

6 Conclusion

We show that robustly trained models admit much more human-aligned input representations than those induced by standard models. We start by highlighting the brittleness of standard representations, and identify robust training as a potential solution—robustly trained models can be viewed as inducing a prior favoring features that are more human meaningful. We then show that for models with robust representations, large changes in representation space correspond to large changes in input space, and individual coordinates of representations map to human discernible concepts.

We finally introduce a number of additional benefits provided by robust representations (and not by standard ones). We can use robust representations to find natural interpolations between images, manipulate human-meaningful features in images, and better understand model behavior. Importantly, we enjoy these benefits without having to post-process the input or employ any complex regularization techniques. Our results open up a new perspective on building interpretable models, and in turn suggest new approaches to interpretability, image manipulation, and representation learning.

Appendix A Experimental Setup

A.1 Datasets

For our experimental analysis, we use the (restricted) ImageNet [russakovsky2015imagenet] dataset. Attaining robust models for the complete ImageNet dataset is known to be a challenging problem, due both to the hardness of the learning problem itself and to its computational cost. We thus restrict our focus to a subset of the dataset which we denote as restricted ImageNet. To this end, we group together semantically similar classes from ImageNet into 9 super-classes shown in Table 1. We train and evaluate only on examples corresponding to these classes.

Class        Corresponding ImageNet Classes
“Dog”        151 to 268
“Cat”        281 to 285
“Frog”       30 to 32
“Turtle”     33 to 37
“Bird”       80 to 100
“Primate”    365 to 382
“Fish”       389 to 397
“Crab”       118 to 121
“Insect”     300 to 319

Table 1: Classes used in the Restricted ImageNet model. The class ranges are inclusive.
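The grouping in Table 1 can be expressed as a small lookup. The ranges below are copied from the table; the function and variable names are hypothetical and are not taken from the released code.

```python
# Inclusive ImageNet class-index ranges for each super-class, copied from Table 1.
RESTRICTED_IMAGENET_RANGES = {
    "Dog": (151, 268),
    "Cat": (281, 285),
    "Frog": (30, 32),
    "Turtle": (33, 37),
    "Bird": (80, 100),
    "Primate": (365, 382),
    "Fish": (389, 397),
    "Crab": (118, 121),
    "Insect": (300, 319),
}

def to_super_class(imagenet_index):
    """Return the super-class name for an ImageNet class index, or None if the class is unused."""
    for name, (low, high) in RESTRICTED_IMAGENET_RANGES.items():
        if low <= imagenet_index <= high:
            return name
    return None

def num_original_classes():
    """Number of ImageNet classes covered by the nine super-classes."""
    return sum(high - low + 1 for low, high in RESTRICTED_IMAGENET_RANGES.values())

print(to_super_class(200), to_super_class(31), to_super_class(0))
print(num_original_classes())
```

Examples whose label maps to None would be dropped from both the training and evaluation sets.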

A.2 Models

We use the standard ResNet-50 architecture [he2016deep] for our adversarially trained classifiers on all datasets. Every model is trained with data augmentation, momentum, and weight decay. Other hyperparameters are provided in Tables 2 and 3.

Dataset               Epochs   LR    Batch Size   LR Schedule
restricted ImageNet   110      0.1   128          Drop by 10 at epochs

Table 2: Standard hyperparameters for the models trained in the main paper.

A.3 Adversarial training

To obtain robust classifiers, we employ the adversarial training methodology proposed in [madry2018towards]. Specifically, we train against a projected gradient descent (PGD) adversary, starting from a random initial perturbation of the training data. We consider adversarial perturbations in $\ell_2$ norm. Unless otherwise specified, we use the values of $\varepsilon$ provided in Table 3 to train/evaluate our models.

Dataset               $\varepsilon$   # steps   Step size
restricted ImageNet   3.5             7         0.1

Table 3: Hyperparameters used for adversarial training.
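For readers unfamiliar with PGD under an $\ell_2$ constraint, the following is a minimal sketch of a single update. Stepping along the normalized gradient is one common formulation and is an assumption here, as are the function names; the gradient itself would come from the model's loss.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def project_l2_ball(delta, eps):
    """Scale delta back onto the l2 ball of radius eps if it lies outside it."""
    norm = l2_norm(delta)
    if norm <= eps:
        return list(delta)
    return [c * eps / norm for c in delta]

def pgd_l2_step(delta, grad, step_size, eps):
    """One ascent step along the l2-normalized gradient, followed by projection."""
    g_norm = l2_norm(grad)
    if g_norm == 0.0:
        return list(delta)
    stepped = [d + step_size * g / g_norm for d, g in zip(delta, grad)]
    return project_l2_ball(stepped, eps)

print(project_l2_ball([3.0, 4.0], eps=2.5))
print(pgd_l2_step([0.0, 0.0], grad=[0.0, 2.0], step_size=0.1, eps=3.5))
```

The adversary repeats this update for the number of steps listed in the table, recomputing the gradient at each new perturbed input.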

A.4 Finding representation-feature correspondence

Dataset               $\varepsilon$   # steps   Step size
restricted ImageNet   1000            200       1

A.5 Inverting representations and interpolations

Dataset               $\varepsilon$   # steps   Step size
restricted ImageNet   1000            10000     1

A.6 Insights into model behavior

Dataset               $\varepsilon$   # steps   Step size
restricted ImageNet   40              60        1

Appendix B Omitted Figures

B.1 Do learned representations capture meaningful features?

B.1.1 Features learned by a random subset of robust representations

Figure 10: Correspondence between image-level features and representations learned by a robust model on the restricted ImageNet dataset. Starting from randomly chosen seed inputs (noise/images), we use a constrained optimization process to identify input features that maximally activate a given component of the representation vector (cf. Appendix A.4 for details). Specifically, (left column): inputs to the optimization process, and (subsequent columns): features that activate randomly chosen representation components, along with the predicted class of the feature.

B.1.2 Features learned by select robust representations

Figure 11: Correspondence between image-level features and representations learned by a robust model on the restricted ImageNet dataset. Starting from randomly chosen seed inputs (noise/images), we use a constrained optimization process to identify input features that maximally activate a given component of the representation vector (cf. Appendix A.4 for details). Specifically, (left column): inputs to the optimization process, and (subsequent columns): features that activate select representation components, along with the predicted class of the feature.

B.1.3 Standard representations are misaligned with meaningful features

Figure 12: Correspondence between image-level features and representations learned by a standard model on the restricted ImageNet dataset. Starting from randomly chosen seed inputs (noise/images), we use a constrained optimization process to identify input features that maximally activate a given component of the representation vector (cf. Appendix A.4 for details). Specifically, (left column): inputs to the optimization process, and (subsequent columns): features that activate randomly chosen representation components, along with the predicted class of the feature. Note that in comparison to the robust model (cf. Figure 11), the input features that correspond to specific representations are not perceptually meaningful to humans, and in fact barely look different from the seed used during optimization (left column).

B.2 Invertibility of robust representations

B.2.1 Reconstruction of test set images

Figure 13: Robust representations yield semantically meaningful inverses: Original: randomly chosen test set images from the restricted ImageNet dataset; Inverse: images obtained by inverting the representation of the corresponding image in the top row by solving the optimization problem (5) starting from: (a) different test images and (b) Gaussian noise.

Figure 14: Robust representations yield semantically meaningful inverses: Original: randomly chosen test set images from the CIFAR-10 dataset; Inverse: images obtained by inverting the representation of the corresponding image in the top row by solving the optimization problem (5) starting from: (a) different test images and (b) Gaussian noise. Note that this model was trained on the restricted ImageNet dataset, from which many CIFAR-10 classes (e.g., truck) are missing.

B.2.2 Reconstruction of out-of-distribution inputs

(a) Random kaleidoscope patterns.
(b) Samples from other ImageNet classes outside what the model is trained on.
Figure 15: Robust representations yield semantically meaningful inverses: (Original): randomly chosen out-of-distribution inputs; (Inverse): images obtained by inverting the representation of the corresponding image in the top row by solving the optimization problem (5) starting from Gaussian noise.

B.2.3 Invertibility of standard representations

Figure 16: Standard representations do not yield semantically meaningful inverses: (Original): randomly chosen test set images from the restricted ImageNet dataset; (Inverse): images obtained by inverting the representation of the corresponding image in the top row by solving the optimization problem (5) starting from Gaussian noise.

B.3 Image interpolations

B.3.1 Robust models yield semantically meaningful interpolations

Figure 17: Additional image interpolation using robust representations. To find the interpolation in input space, we construct images that map to linear interpolations of the endpoints in robust representation space. Concretely, for randomly selected pairs from the restricted ImageNet test set, we use (5) to find images whose representations match the linear interpolates in representation space (6).

B.3.2 Standard models yield superimposed-like interpolations

Figure 18: Image interpolation using standard representations. To find the interpolation in input space, we construct images that map to linear interpolations of the endpoints in standard representation space. Concretely, for randomly selected pairs from the restricted ImageNet test set, we use (5) to find images whose representations match the linear interpolates in representation space (6). Image-space interpolations from the standard model appear to be significantly less meaningful than their robust counterparts. They are visibly similar to linear interpolation directly in the input space, which is in fact used to seed the optimization process.

B.4 Exploration via adding features

B.4.1 Additional examples of feature addition

Figure 19: Visualization of the results of adding various neurons, labelled on the left, to randomly chosen test images. The rows alternate between the original test images and those same images with an additional feature arising from maximizing the corresponding neuron.

B.4.2 Progressive addition of a neuron feature

Figure 20: Visualization of the results from progressively adding a neuron feature. The top row consists of the original images, and each subsequent row shows the result after further iterations of gradient descent.

B.5 Gradients of robust vs standard models (reproduced from [tsipras2019robustness])

Figure 21: Figure from [tsipras2019robustness] demonstrating that the gradients of robust models are more perceptually meaningful than those of standard models.