Code for "Learning Perceptually-Aligned Representations via Adversarial Robustness"
Many applications of machine learning require models that are human-aligned, i.e., that make decisions based on human-meaningful information about the input. We identify the pervasive brittleness of deep networks' learned representations as a fundamental barrier to attaining this goal. We then re-cast robust optimization as a tool for enforcing human priors on the features learned by deep neural networks. The resulting robust feature representations turn out to be significantly more aligned with human perception. We leverage these representations to perform input interpolation, feature manipulation, and sensitivity mapping, without any post-processing or human intervention after model training. Our code and models for reproducing these results are available at https://git.io/robust-reps.
A major appeal of deep neural networks is their ability to attain remarkably high accuracy on a variety of tasks [krizhevsky2012imagenet, he2015delving, collobert2008unified]. However, while accurate decisions are certainly desirable, we often expect models to make these decisions for what we as humans deem to be the right reasons. Specifically, we want models that make predictions based on high-level features that are actually meaningful to us. We expect such “human-aligned” models to be more amenable to inspection and intervention, and also to act more predictably in unseen or unexpected environments.
Deep neural networks are often thought of as linear models acting on learned feature representations (also known as embeddings); thus, the features comprising these representations determine whether the models capture meaningful high-level concepts. Ideally, we could specify precisely which features a model should capture (e.g., eyes, fur, etc.), thus ensuring that it is indeed “human-aligned.” However, there are no known methods for controlling features in this manner (and in fact, one of the primary advantages of deep learning is that it removes the need for manual feature engineering [bengio2009learning, vincent2010stacked, zhou2015object, larsen2016autoencoding, bau2017network]).
While we may not be able to dictate exactly which features models should use, the representations of human-aligned models should, at a minimum, satisfy a few concrete “common-sense” properties:
Meaningful components. Model representations should be comprised of components that capture individual high-level features of the input data.
Reflectiveness of “data geometry.” Beyond having components that are human-meaningful, feature representations should capture the intrinsic geometry of the input data as humans perceive it—distance in representation space should correspond directly to a semantic notion of distance in input space.
Crucially, we should be able to verify the above properties in representations without compromising model-faithfulness, i.e., without introducing additional priors or information at interaction-time via post-processing or regularization. After all, without model-faithfulness, it is difficult to disentangle the effects of the introduced priors from the information captured by the representation itself.
Now, given their role in performance on standard benchmarks, do current learned feature representations achieve these goals?
On one hand, state-of-the-art embeddings are suitable for use in transfer learning [girshick2014rich, donahue2014decaf] or as perceptual distance proxies [dosovitskiy2016generating, johnson2016perceptual, zhang2018unreasonable], which indicates that they do in fact encode some useful information about the input. On the other hand, the existence of adversarial examples [szegedy2014intriguing], and the fact that they may correspond to flipping predictive features [ilyas2019adversarial], suggests that we may not yet have attained human-aligned embeddings. (We show this directly in Section 2.)
Similarly, despite significant work towards extracting the aforementioned “common-sense” properties from standard networks, progress has thus far come at the cost of model-faithfulness. For example, while methods for saliency map generation often highlight meaningful features in the input (thus making progress towards the first property above), recent work [adebayo2018sanity] has shown that as a result, these heatmaps are not actually fully representative of models’ decision mechanisms (thus sacrificing model-faithfulness). The human-meaningfulness of the saliency maps thus cannot really be attributed to the model itself, due to human priors introduced during visualization.
In this work, we show that robust optimization can be viewed as a method for enforcing (user-specified) priors on the features that models should learn. We then find that the robust representations that result from training with simple priors already make significant progress towards our aforementioned goals for human-aligned representations. In fact, robust representations enable previously impossible modes of direct interaction with high-level input features (illustrated in Figure 1):
Simple feature visualization: Direct maximization of the coordinates of robust representations suffices to visualize human-meaningful features of the model.
Representation inversion: In contrast to standard representations, robust representations are approximately invertible—they provide a high-level embedding of the input on which images with similar robust representations are semantically similar.
Image-to-image interpolation: Straight paths between pairs of inputs in robust representation space correspond to interpolations between their features in image space. In fact, we can straightforwardly invert these paths to obtain natural-looking image interpolations.
Feature manipulation: Features can be added to images by directly optimizing over representation space (with first order methods).
Insight into specific model decisions: Without any regularization or attempts to enhance appearance, robust representations induce gradients that are more perceptually aligned than those of standard representations. This enables us to gain insight into model decisions via direct optimization over representation space.
Broadly, our results indicate that robust optimization is a promising avenue for better representation learning, and further highlight the importance of introducing human priors into the training of deep networks.
We follow the standard convention of defining the representation $R(x)$ induced by a deep network as the activations of the penultimate layer (i.e., a vector in $\mathbb{R}^k$); the prediction of a network on an input $x$ can thus be viewed as the output of a linear classifier on the representation $R(x)$.
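As a minimal illustration of this convention, the NumPy sketch below builds a toy network whose prediction is a linear classifier applied to a penultimate-layer representation. The weights `A`, `W`, and `b`, the ReLU feature map, and the dimensions are hypothetical stand-ins for a trained deep network, not the models used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_classes = 8, 4, 3            # toy input dim, representation dim, classes

A = rng.normal(size=(k, d))          # hypothetical feature-extractor weights
W = rng.normal(size=(n_classes, k))  # final linear layer (representation -> logits)
b = np.zeros(n_classes)

def representation(x):
    """Penultimate-layer activations: a vector in R^k."""
    return np.maximum(A @ x, 0.0)

def logits(x):
    """Network output: a linear classifier applied to the representation."""
    return W @ representation(x) + b

x = rng.normal(size=d)
R = representation(x)
pred = int(np.argmax(logits(x)))
```

Everything that follows in the text operates on the vector returned by `representation`, rather than on the logits.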
As discussed in the previous section, state-of-the-art deep neural networks have matched or even surpassed human performance on a variety of tasks. Moreover, they already succeed in learning representations suitable for transfer to other tasks [girshick2014rich, donahue2014decaf], and for use as a proxy for perceptual distance on natural images [dosovitskiy2016generating, johnson2016perceptual, zhang2018unreasonable]. These successes might lead us to believe that we have already obtained the aforementioned human-aligned embeddings.
We find, however, that this might not be the case: representations learned by state-of-the-art models are not entirely human-aligned. Indeed, it is straightforward to construct pairs of images with nearly identical representations yet drastically different content (Figure 2). Concretely, we find that for any two arbitrarily chosen inputs $x_1$ and $x_2$ from a standard image dataset, we can find an $x_1'$ such that (1) $x_1'$ differs imperceptibly from $x_1$, yet (2) $x_1'$ and $x_2$ have similar representations. Here, $x_1'$ and $x_2$ are images containing different content but mapping to similar representations. The existence of such image pairs (and similar phenomena observed by prior work [jacobsen2019excessive]) lays bare the misalignment between the notion of distance currently learned by deep networks and the notion of distance perceived by humans.
Trouble with standard representations may extend beyond difficulty manipulating them or misalignment with human perception—Ilyas et al. [ilyas2019adversarial] suggest that standard models make use of predictive, generalizing features in the data distribution that can be drastically affected by imperceptible perturbations. Therefore, the fact that models capture these features is in direct opposition to our goal of having representations comprised of only human-meaningful concepts. It also indicates that post-hoc interpretation methods may actually suppress useful features that the model uses to make its decisions.
In the last section we showed that standard embeddings do not satisfy the requirements we set for representations in Section 1. How can we bring models closer to realizing these requirements?
A core issue here is models’ sensitivity to human-meaningless changes in input space. This sensitivity explains the need to resort to “model-unfaithful” methods—after all, without these methods there will exist a meaningless way to perturb the input and get the desired result. It also explains the lack of alignment with human perceptual geometry. Thus, having human-aligned representations requires training models that are a priori invariant to these meaningless changes in input.
In this paper, we propose the use of the robust optimization framework to enforce human priors on the features that can be learned by models. Instead of just training for maximum accuracy, robust optimization also requires that models be invariant to a pre-specified class of perturbations $\Delta$:
$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\delta \in \Delta} \mathcal{L}_{\theta}(x+\delta, y) \right] \tag{3}$$
While there is no known way for $\Delta$ to capture all possible human-meaningless perturbations, a natural starting point is to enforce simple priors via perturbations that humans are clearly invariant to. In particular, for the remainder of this work, we consider models trained with one of the simplest priors on human vision: invariance to small $\ell_2$ changes in input space. Given robust models trained to solve (3) with this perturbation set (a small $\ell_2$ ball), we now aim to determine whether their robust representations are actually more aligned with human perception.
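To make the min-max structure of robust optimization concrete, the sketch below runs an $\ell_2$ projected gradient ascent on the input of a toy logistic-regression model (the inner maximization), followed by one gradient step on the weights at the perturbed input (the outer minimization). The linear model, the radius `eps`, the step sizes, and the zero initial perturbation are illustrative assumptions; the classifiers in this work are deep networks trained as described in the appendix.

```python
import numpy as np

def loss(w, x, y):
    """Logistic loss of a linear model; the label y is -1 or +1."""
    return np.log1p(np.exp(-y * (w @ x)))

def grad_x(w, x, y):
    """Gradient of the loss with respect to the input x."""
    return -y * w / (1.0 + np.exp(y * (w @ x)))

def grad_w(w, x, y):
    """Gradient of the loss with respect to the weights w."""
    return -y * x / (1.0 + np.exp(y * (w @ x)))

def project_l2(delta, eps):
    """Project delta onto the l2 ball of radius eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def pgd_l2(w, x, y, eps, step_size, n_steps):
    """Inner maximization: approximately find the worst-case delta in the ball."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_x(w, x + delta, y)
        g = g / (np.linalg.norm(g) + 1e-12)   # normalized ascent direction
        delta = project_l2(delta + step_size * g, eps)
    return delta

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.4, 1.0])
y = 1
eps = 0.5

delta = pgd_l2(w, x, y, eps, step_size=0.1, n_steps=20)
# Outer minimization: one gradient step on the loss at the perturbed input.
w_new = w - 0.1 * grad_w(w, x + delta, y)
```

For a linear model the worst-case perturbation has the closed form $-\varepsilon\, y\, w / \|w\|_2$, which gives a simple check that the inner loop behaves as intended.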
In the previous section, we proposed robust optimization as a way to enforce human priors during model training, with the goal of mitigating the issues with standard representations identified in Section 2. In this section, we revisit the properties of human-aligned embeddings outlined in Section 1 and test whether these properties manifest themselves in robust representations (the feature representations of robust models, trained with (3)).
We first show that robust representations are in fact composed of human-meaningful features. In particular, the coordinates of robust representations correspond to human-relevant features in the input data, and these human-relevant features emerge when the coordinate is maximized directly. Concretely, given a component $i$ of the representation vector, we use gradient descent to find an input that maximally activates it, i.e., we solve:
$$\max_{\delta} \; R(x_0 + \delta)_i \tag{4}$$
for various starting points $x_0$ (see Appendix A.4 for details). We find that individual coordinates of robust representations indeed capture human-meaningful concepts (Figure 3). Moreover, these coordinates consistently represent the same concepts across different starting inputs ($x_0$ sampled randomly from input space or from the data distribution). These concepts also appear in the natural inputs (from the test set) that most strongly activate their corresponding coordinates (Figure 4).
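The coordinate-maximization procedure can be sketched as plain gradient ascent on a single representation coordinate. The toy representation `tanh(A @ x)`, the random matrix `A`, the step size, and the number of steps are assumptions made for illustration; any constraint on the perturbation size (see Appendix A.4) is omitted here.

```python
import numpy as np

def representation(A, x):
    """Toy representation: elementwise tanh of a linear map."""
    return np.tanh(A @ x)

def maximize_coordinate(A, x0, i, step_size=0.1, n_steps=20):
    """Gradient ascent on the i-th coordinate of the representation, from x0."""
    x = x0.copy()
    for _ in range(n_steps):
        # d/dx tanh(a_i . x) = (1 - tanh(a_i . x)^2) * a_i
        g = (1.0 - np.tanh(A[i] @ x) ** 2) * A[i]
        x = x + step_size * g / (np.linalg.norm(g) + 1e-12)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))
i = 2
starts = [0.1 * rng.normal(size=8) for _ in range(3)]   # several starting points
results = [maximize_coordinate(A, x0, i) for x0 in starts]
```

Running the ascent from several starting points mirrors the check in the text that a coordinate represents the same concept regardless of where the optimization begins.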
In Section 2, we demonstrated a fundamental shortcoming of representations learned by standard classifiers: for any input $x$, we could find another image that looked entirely different but had nearly the same representation. This indicated a fundamental misalignment between the geometry of standard representations and our perception of the input data.
We now show that this alignment is greatly improved for robust representations. In particular, finding representations which are close together in embedding space necessitates finding two images which are semantically similar. To demonstrate this, we use projected gradient descent to construct inputs (from different starting points) that approximately minimize distance in representation space to a pre-selected image (details in Appendix A.5), which corresponds to solving:
$$\min_{\delta} \; \left\| R(x_1 + \delta) - R(x_2) \right\|_2 \tag{5}$$
for a target image $x_2$ and a starting image $x_1$.
In stark contrast to what we observe for standard models, we find that the resulting images are actually semantically similar to the image whose representation is being matched (Figure 5).
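A minimal sketch of this representation-matching procedure is given below. It uses projected gradient descent onto the pixel box $[0,1]^d$ with a backtracking step size; the box constraint, the backtracking rule, and the toy representation `tanh(A @ x)` are our assumptions for illustration, and the toy model says nothing about perceptual similarity.

```python
import numpy as np

def representation(A, x):
    return np.tanh(A @ x)

def rep_distance(A, x, target_rep):
    """l2 distance between R(x) and a target representation."""
    return np.linalg.norm(representation(A, x) - target_rep)

def match_representation(A, x_start, target_rep, n_steps=200, step=0.1):
    """Projected gradient descent on ||R(x) - target_rep||, with x kept in [0, 1]."""
    x = x_start.copy()
    for _ in range(n_steps):
        R = representation(A, x)
        # Gradient of 0.5 * ||R(x) - target_rep||^2 with respect to x.
        g = A.T @ ((R - target_rep) * (1.0 - R ** 2))
        cur = rep_distance(A, x, target_rep)
        t = step
        cand = np.clip(x - t * g, 0.0, 1.0)
        # Backtracking: halve the step until the distance strictly decreases.
        while t > 1e-8 and rep_distance(A, cand, target_rep) >= cur:
            t *= 0.5
            cand = np.clip(x - t * g, 0.0, 1.0)
        if t <= 1e-8:
            break                      # no improving step found
        x = cand
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))
x_start = rng.uniform(size=8)          # starting image
x_target = rng.uniform(size=8)         # image whose representation is matched
target_rep = representation(A, x_target)

d_before = rep_distance(A, x_start, target_rep)
x_inv = match_representation(A, x_start, target_rep)
d_after = rep_distance(A, x_inv, target_rep)
```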
In fact, this “meaningful inversion” property holds true even for out-of-distribution inputs, demonstrating that robust representations capture general high-level features. In particular, we repeat the previous experiment using images from classes not present in the original dataset (Figure 6 right) and structured random patterns (Figures 14 and 15 of Appendix B.2): the reconstructed images consistently resemble the originals. This indicates that (a) the geometry of the representation space of robust models aligns well enough with human perception to prevent matching representations without also matching the high-level features of the image, and (b) we can actually recover a surprising amount of information about the original input from robust representations, even though they have much lower dimensionality than the input space and were not trained for reconstruction explicitly.
Methods for inverting deep representations typically either solve an optimization problem similar to (5) while imposing a “natural image” prior on the input [mahendran2015understanding, yosinski2015understanding, ulyanov2017deep] or train a separate network to perform the inversion [kingma2013autoencoding, dosovitskiy2016inverting, dosovitskiy2016generating]. As a result, these methods are not fully faithful to the model, as they introduce additional priors into the process. While it is possible to construct models that are invertible by construction [dinh2014nice, dinh2017density, jacobsen2018irevnet, behrmann2018invertible], the representations learned are not necessarily robust and thus not human-aligned [jacobsen2018irevnet].
In the previous sections, we used $\ell_2$-robust training to achieve representations that are composed of meaningful features, have a human-aligned geometry, and admit model-faithful interaction methods.
Here, we demonstrate several additional benefits enabled by robust embeddings and their increased human-alignment. We find that such embeddings provide us with natural methods for input interpolations (Section 5.1), feature manipulation (Section 5.2), and sensitivity evaluation (Section 5.3).
We now leverage robust representations to produce natural interpolations between any two inputs. That is, given two images $x_1$ and $x_2$, we find the $\lambda$-interpolate between them as
$$x_\lambda = \arg\min_{x} \; \left\| \left( \lambda R(x_1) + (1-\lambda) R(x_2) \right) - R(x) \right\|_2 \tag{6}$$
where, for a given $\lambda$, we find $x_\lambda$ by solving (6) with projected gradient descent. Intuitively, this corresponds to linearly interpolating between the points in representation space and then finding a point in image space that has a similar representation. To construct a length-$(T+1)$ interpolation, we choose $\lambda \in \{0, \tfrac{1}{T}, \tfrac{2}{T}, \ldots, 1\}$. The resulting interpolations, shown in Figure 7, demonstrate that the $\lambda$-interpolates of robust representations correspond to a meaningful feature interpolation between images. (For standard models, constructing meaningful interpolations is impossible due to the brittleness identified in Section 2; see Appendix B.2.3 for details.)
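The interpolation procedure can be sketched as follows, assuming the target representation for a given $\lambda$ is $\lambda R(x_1) + (1-\lambda) R(x_2)$ and that each solve starts from $x_1$ (the starting point is our assumption). The toy representation `tanh(A @ x)`, the pixel box $[0,1]^d$, and the backtracking step rule are likewise illustrative choices.

```python
import numpy as np

def representation(A, x):
    return np.tanh(A @ x)

def rep_distance(A, x, target_rep):
    return np.linalg.norm(representation(A, x) - target_rep)

def match_representation(A, x_start, target_rep, n_steps=200, step=0.1):
    """Projected gradient descent on ||R(x) - target_rep||, with x kept in [0, 1]."""
    x = x_start.copy()
    for _ in range(n_steps):
        R = representation(A, x)
        g = A.T @ ((R - target_rep) * (1.0 - R ** 2))
        cur = rep_distance(A, x, target_rep)
        t = step
        cand = np.clip(x - t * g, 0.0, 1.0)
        while t > 1e-8 and rep_distance(A, cand, target_rep) >= cur:
            t *= 0.5                   # backtrack until the distance decreases
            cand = np.clip(x - t * g, 0.0, 1.0)
        if t <= 1e-8:
            break
        x = cand
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 8))
x1, x2 = rng.uniform(size=8), rng.uniform(size=8)
R1, R2 = representation(A, x1), representation(A, x2)

T = 4
lambdas = np.linspace(0.0, 1.0, T + 1)       # 0, 1/T, ..., 1
interpolates = []
for lam in lambdas:
    target = lam * R1 + (1.0 - lam) * R2     # straight line in representation space
    interpolates.append(match_representation(A, x1, target))
```

With this parameterization, $\lambda = 1$ targets $R(x_1)$ and leaves the starting image unchanged, while $\lambda = 0$ targets $R(x_2)$.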
We emphasize that linearly interpolating in robust representation space works for any two images. This generality is in contrast to interpolations induced by GANs (e.g., [radford2016unsupervised, brock2019large]), which can only interpolate between images generated by the generator. (Reconstructions of out-of-range images tend to be decipherable but rather different from the originals [bau2019inverting].) It is worth noting that even for models with analytically invertible representations, interpolating in representation space does not yield semantic interpolations [jacobsen2018irevnet].
Robust representations also yield a natural way to individually manipulate high-level input features. We know (from Section 4.1) that when optimizing over image space to maximize a specific coordinate of robust representations, a consistent human-meaningful feature dominates the image. It turns out that our control over high-level features can be straightforwardly applied to manipulation of images via feature addition.
Specifically, by solving the following clipped maximization objective, we can introduce individual high-level features while preserving images’ original content:
Intuitively, we maximize the desired feature (the $i$-th coordinate) until it is the highest-magnitude feature in the robust representation. We visualize the result of this process for a variety of input images and activation coordinates in Figure 8, where stripes and red limbs are introduced seamlessly into images without any processing or regularization. (We repeat this process with many additional random images and random features in Appendix B.4.1.)
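One way to read this intuition is as gradient ascent on the chosen coordinate with a stopping rule: stop as soon as that coordinate is the largest-magnitude entry of the representation. The sketch below implements that reading on a toy representation `tanh(A @ x)`; the stopping rule, the small hand-picked matrix `A`, and the step size are assumptions, and the exact clipped objective used in this work may differ.

```python
import numpy as np

def representation(A, x):
    return np.tanh(A @ x)

def add_feature(A, x0, i, step_size=0.1, max_steps=100):
    """Increase coordinate i until it is the highest-magnitude feature."""
    x = x0.copy()
    for _ in range(max_steps):
        R = representation(A, x)
        others = np.delete(np.abs(R), i)
        if R[i] >= others.max():
            break                       # "clipping": stop once feature i dominates
        g = (1.0 - R[i] ** 2) * A[i]    # gradient of R(x)_i with respect to x
        x = x + step_size * g / (np.linalg.norm(g) + 1e-12)
    return x

# Hand-picked toy weights so that each feature depends on a few input entries.
A = np.array([[1.0, 0.2, 0.0, 0.0],
              [0.0, 1.0, 0.2, 0.0],
              [0.0, 0.0, 1.0, 0.2]])
x0 = np.array([0.1, 0.9, 0.5, 0.3])

x_new = add_feature(A, x0, i=0)
```

Because the ascent direction only involves the input entries that feature 0 depends on, the remaining entries of `x0` are left untouched, a toy analogue of preserving the original content.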
Related work on semantic feature manipulation. The latent space of generative adversarial networks (GANs) [goodfellow2014generative] tends to allow for “semantic feature arithmetic” [radford2016unsupervised, larsen2016autoencoding] similar to that in word2vec embeddings [mikolov2013distributed]. In a similar vein, one can utilize an image-to-image translation framework to perform such manipulation (e.g., transforming horses to zebras), although this requires a task-specific dataset and model [zhu2017unpaired]. Somewhat orthogonally, it is possible to utilize the deep representations of standard models to perform semantic feature manipulations; however, such methods tend to either only perform well on datasets where the inputs are explicitly aligned [upchurch2017deep] or are restricted to a small set of manipulations [gatys2016image].
The inner mechanisms of standard deep neural networks remain opaque to human observers. In response, significant work has been dedicated to developing interpretability techniques that attempt to attribute predictions of deep models to specific parts of the input in a human-meaningful manner [smilkov2017smoothgrad, sundararajan2017axiomatic, olah2017feature, olah2018building]. The development of these tools has uncovered a fundamental tension between truly accurate explanations and human-meaningful ones. Even state-of-the-art methods necessarily discard input information that models are extremely sensitive to. For instance, models are clearly sensitive to adversarial perturbations, but all successful interpretability techniques suppress these perturbations from visualizations due to their lack of human-meaningfulness.
Specifically, these techniques enforce human priors at explanation time, often through complex methodology [olah2018building]. In contrast, we show that robust representations yield explanations that are model-faithful yet visually meaningful. The methods we introduce are straightforward applications of model gradients, and do not enforce any external human priors at explanation time.
Perhaps the simplest interpretability method is to visualize the gradient of the model’s loss with respect to the input. The gradient conveys the sensitivity of the model to perturbations in each individual input component. However, since gradients of standard models are often uninterpretable (Appendix B.5), saliency maps are typically constructed via post-processing [sundararajan2017axiomatic, smilkov2017smoothgrad] or through a learned model [fong2017interpretable, dabkowski2017real]. The introduction of additional processing in these methods leads to explanations that are not faithful to the model, despite being visually appealing. In fact, explanations can often be model and data independent [nie2018theoretical, adebayo2018sanity] hence being inherently unable to explain the predictions of the model. Additionally, saliency maps of standard models can be very brittle to small perturbations of the input [kindermans2017reliability, ghorbani2019interpretation].
In contrast, Tsipras et al. [tsipras2019robustness] demonstrate that for robust models, unmodified gradient-based saliency maps are more human-meaningful (cf. Appendix B.5).
In addition to the component-level analysis performed by Tsipras et al. [tsipras2019robustness], we show that robust representations can also be leveraged for a feature-level understanding of model sensitivity.
To accomplish this, given any input image, we start by computing the largest-magnitude coordinates of the corresponding representation, weighted by the linear classifier in the last layer (mapping from representations to logits). More precisely, for an input $x$ with predicted class $y$ we find:
where $W$ is the linear layer mapping from representations to logits. We then magnify these coordinates in the input by performing the maximization described in (8) (Section 4.1). The resulting images are similar to the originals, but with the highest-weight features accentuated. In addition to providing insight about correct classifications (Figure 9 right), these images can also suggest why certain images are misclassified (Figure 9 left). Performing a similar analysis for standard models requires incorporating non-trivial human priors in order to make the result human-meaningful (and hence losing model-faithfulness) [simonyan2013deep, yosinski2015understanding, olah2017feature, olah2018building].
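The coordinate-selection step can be sketched as follows, assuming each coordinate is scored by the magnitude of its contribution to the predicted-class logit, i.e. the product of the predicted class's weight and the coordinate's activation. This scoring rule, the function name `top_weighted_coordinates`, and the toy numbers are our assumptions; the text only specifies that coordinates are weighted by the final linear layer.

```python
import numpy as np

def top_weighted_coordinates(W, R, k):
    """Return the predicted class and the k coordinates contributing most to its logit."""
    y = int(np.argmax(W @ R))            # predicted class from the linear layer
    contrib = W[y] * R                   # per-coordinate contribution to logit y
    top = np.argsort(-np.abs(contrib))[:k]
    return y, [int(j) for j in top]

# Toy representation and final-layer weights (2 classes, 4 features).
R = np.array([0.5, -2.0, 0.1, 1.0])
W = np.array([[1.0,  0.1, 3.0, -1.0],
              [0.2, -1.5, 0.5,  2.0]])

y, top = top_weighted_coordinates(W, R, k=2)
```

In this toy example the class-1 logit is dominated by coordinates 1 and 3, so those are the features that would then be accentuated in the input.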
For example, in Figure 9, accentuating the highest-weight features reveals a natural transformation from the ear of a monkey to the eye of a dog (top), from negative space to the face of a dog (middle), and from the heads of two different fish to the two eyes of a single frog, with a reed transforming into a mouth (bottom). These transformations hint at the most sensitive directions of robust models on misclassified inputs, thus providing insight into model decisions.
We show that robustly trained models admit much more human-aligned input representations than those induced by standard models. We start by highlighting the brittleness of standard representations, and identify robust training as a potential solution—robustly trained models can be viewed as inducing a prior favoring features that are more human meaningful. We then show that for models with robust representations, large changes in representation space correspond to large changes in input space, and individual coordinates of representations map to human discernible concepts.
We finally introduce a number of additional benefits provided by robust representations (and not by standard ones). We can use robust representations to find natural interpolations between images, manipulate human-meaningful features in images, and better understand model behavior. Importantly, we enjoy these benefits without having to post-process the input or employ any complex regularization techniques. Our results open up a new perspective on building interpretable models, and in turn suggest new approaches to interpretability, image manipulation, and representation learning.
For our experimental analysis, we use the (restricted) ImageNet [russakovsky2015imagenet] dataset. Attaining robust models for the complete ImageNet dataset is known to be a challenging problem, due both to the hardness of the learning problem itself and to the computational complexity. We thus restrict our focus to a subset of the dataset which we denote as restricted ImageNet. To this end, we group together semantically similar classes from ImageNet into 9 super-classes shown in Table 1. We train and evaluate only on examples corresponding to these classes.
| Class | Corresponding ImageNet Classes |
|---|---|
| “Dog” | 151 to 268 |
| “Cat” | 281 to 285 |
| “Frog” | 30 to 32 |
| “Turtle” | 33 to 37 |
| “Bird” | 80 to 100 |
| “Primate” | 365 to 382 |
| “Fish” | 389 to 397 |
| “Crab” | 118 to 121 |
| “Insect” | 300 to 319 |
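The grouping in Table 1 can be expressed as a small lookup from ImageNet class index to super-class. The sketch below assumes the ranges in the table are inclusive on both ends; the names `RESTRICTED_IMAGENET_RANGES` and `superclass_of` are hypothetical and chosen for illustration.

```python
# Super-class -> (first, last) ImageNet class index, assumed inclusive.
RESTRICTED_IMAGENET_RANGES = {
    "dog": (151, 268),
    "cat": (281, 285),
    "frog": (30, 32),
    "turtle": (33, 37),
    "bird": (80, 100),
    "primate": (365, 382),
    "fish": (389, 397),
    "crab": (118, 121),
    "insect": (300, 319),
}

def superclass_of(imagenet_idx):
    """Return the super-class name for an ImageNet index, or None if excluded."""
    for name, (lo, hi) in RESTRICTED_IMAGENET_RANGES.items():
        if lo <= imagenet_idx <= hi:
            return name
    return None

# Number of original ImageNet classes kept under the inclusive reading.
n_kept = sum(hi - lo + 1 for lo, hi in RESTRICTED_IMAGENET_RANGES.values())
```

Examples whose index maps to `None` are the ones dropped from training and evaluation.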
We use the standard ResNet-50 architecture [he2016deep] for our adversarially trained classifiers on all datasets. Every model is trained with data augmentation, momentum, and weight decay. Other hyperparameters are provided in Tables 2 and 3.
| Dataset | Epochs | LR | Batch Size | LR Schedule |
|---|---|---|---|---|
| restricted ImageNet | 110 | 0.1 | 128 | Drop by 10 at epochs |
To obtain robust classifiers, we employ the adversarial training methodology proposed in [madry2018towards]. Specifically, we train against a projected gradient descent (PGD) adversary, starting from a random initial perturbation of the training data. We consider adversarial perturbations in $\ell_2$ norm. Unless otherwise specified, we use the values of $\varepsilon$ provided in Table 3 to train/evaluate our models.
|Dataset||# steps||Step size|