Notebooks for reproducing the paper "Computer Vision with a Single (Robust) Classifier"
We show that the basic classification framework alone can be used to tackle some of the most challenging computer vision tasks. In contrast to other state-of-the-art approaches, the toolkit we develop is rather minimal: it uses a single, off-the-shelf classifier for all these tasks. The crux of our approach is that we train this classifier to be adversarially robust. It turns out that adversarial robustness is precisely what we need to directly manipulate salient features of the input. Overall, our findings demonstrate the utility of robustness in the broader machine learning context. Code and models for our experiments can be found at https://git.io/robust-apps.
Deep learning has revolutionized the way we tackle computer vision problems. This revolution started with progress on image classification [krizhevsky2012imagenet, he2015delving, he2016deep], which then triggered the expansion of the deep learning paradigm to encompass more sophisticated tasks such as image generation [karras2018progressive, brock2019large] and image-to-image translation [isola2017image, zhu2017unpaired]. Much of this expansion was predicated on developing complex, task-specific techniques, often rooted in the generative adversarial network (GAN) framework [goodfellow2014generative]. However, is there a simpler toolkit for solving these tasks?
In this work, we demonstrate that basic classification tools alone suffice to tackle various computer vision tasks. These tasks include (cf. Figure 1): generation (Section 3.1), inpainting (Section 3.2), image-to-image translation (Section 3.3), super-resolution (Section 3.4), and interactive image manipulation (Section 3.5).
Our entire toolkit is based on a single classifier (per dataset) and involves performing a simple input manipulation: maximizing predicted class scores with gradient descent. Our approach is thus general purpose and simple to implement and train, while also requiring minimal tuning. To highlight the potential of the core methodology itself, we intentionally employ a generic classification setup (ResNet-50 [he2016deep] with default hyperparameters) without any additional optimizations (e.g., domain-specific priors or regularizers). Moreover, to emphasize the consistency of our approach, throughout this work we demonstrate performance on randomly selected examples from the test set.
The key ingredient of our method is adversarially robust classifiers. Previously, Tsipras et al. [tsipras2019robustness] observed that maximizing the loss of robust models over the input leads to realistic instances of other classes. Here we are able to fully leverage this connection to build a versatile computer vision toolkit. Our findings thus establish robust classifiers as a powerful primitive for semantic image manipulation, despite them being trained solely to perform image classification.
Recently, Tsipras et al. [tsipras2019robustness] observed that optimizing an image to cause a misclassification in an (adversarially) robust classifier introduces salient characteristics of the incorrect class. This property is unique to robust classifiers: standard models (trained with empirical risk minimization (ERM)) are inherently brittle, and their predictions are sensitive even to imperceptible changes in the input [szegedy2014intriguing].
Adversarially robust classifiers are trained using the robust optimization objective [wald1945statistical, madry2018towards], where instead of minimizing the expected loss $\mathcal{L}$ over the data

$$\mathbb{E}_{(x,y)\sim D}\left[\mathcal{L}(x,y)\right], \tag{1}$$

we minimize the worst-case loss over a specific perturbation set $\Delta$

$$\mathbb{E}_{(x,y)\sim D}\left[\max_{\delta\in\Delta}\mathcal{L}(x+\delta,y)\right]. \tag{2}$$

Typically, the set $\Delta$ captures imperceptible changes (e.g., small $\ell_2$ perturbations), and given such a $\Delta$, the problem in (2) can be solved using adversarial training [goodfellow2015explaining, madry2018towards].
From one perspective, we can view robust optimization as encoding priors into the model, preventing it from relying on imperceptible features of the input [engstrom2019learning]. Indeed, the findings of Tsipras et al. [tsipras2019robustness] are aligned with this viewpoint—by encouraging the model to be invariant to small perturbations, robust training ensures that changes in the model’s predictions correspond to salient input changes.
In fact, it turns out that this phenomenon also emerges when we maximize the probability of a specific class (targeted attacks) for a robust model; see Figure 2 for an illustration. This indicates that robust models exhibit more human-aligned gradients, and, more importantly, that we can precisely control features in the input just by performing gradient descent on the model output. Previously, performing such manipulations has only been possible with more complex and task-specific techniques [radford2016unsupervised, isola2017image, zhu2017unpaired]. In the rest of this work, we demonstrate that this property of robust models is sufficient to attain good performance on a diverse set of computer vision tasks.
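To make the targeted manipulation concrete, the following is a minimal sketch of targeted projected gradient descent under an l2 constraint. It is not the authors' implementation: the linear softmax model (`W`, `b`) is a toy stand-in for a robust ResNet-50, and the function names, step size, and radius `eps` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(W, b, x, y):
    """Cross-entropy loss of label y at input x, and its gradient w.r.t. x."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return -np.log(p[y]), W.T @ (p - onehot)

def project_l2(x, center, eps):
    """Project x onto the l2 ball of radius eps around center."""
    delta = x - center
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    return center + delta

def targeted_pgd(W, b, x0, target, eps, step_size, steps):
    """Minimize the loss of `target` while staying within eps of x0."""
    x = x0.copy()
    for _ in range(steps):
        _, g = loss_and_grad(W, b, x, target)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        x = x - step_size * g / g_norm   # normalized gradient step
        x = project_l2(x, x0, eps)       # enforce the l2 constraint
    return x

# Toy two-class model (hypothetical weights, chosen only for illustration).
W = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]])
b = np.zeros(2)
x0 = np.array([1.0, 0.0, 1.0, 0.0])     # initially classified as class 0
x_new = targeted_pgd(W, b, x0, target=1, eps=1.0, step_size=0.1, steps=50)
loss_before, _ = loss_and_grad(W, b, x0, 1)
loss_after, _ = loss_and_grad(W, b, x_new, 1)
```

With a real robust classifier the gradient would come from automatic differentiation rather than the closed form used here; the step-and-project loop would have the same structure.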
Deep learning-based methods have recently made significant progress on image synthesis and manipulation tasks, typically either by training specifically-crafted models in the GAN framework [goodfellow2014generative, iizuka2017globally, zhu2017unpaired, yu2018generative, brock2019large], or using priors obtained from deep generative models [ulyanov2017deep, yeh2017semantic]. We discuss additional related work in the following subsections as necessary.
In this section, we outline our methods and results for obtaining competitive performance on these tasks using only robust (feed-forward) classifiers. Our approach is remarkably simple: all the applications are performed using gradient ascent on class scores derived from the same robustly trained classifier. In particular, it does not involve fine-grained tuning (see Appendix A.4), highlighting the potential of robust classifiers as a versatile primitive for sophisticated vision tasks.
Synthesizing realistic samples for natural data domains (such as images) has been a long-standing challenge in computer vision. Given a set of example inputs, we would like to learn a model that can produce novel perceptually-plausible inputs. The development of deep learning-based methods such as autoregressive models [hochreiter1997long, graves2013generating, van2016pixel], auto-encoders [vincent2010stacked, kingma2013autoencoding] and flow-based models [dinh2014nice, rezende2015variational, dinh2017density, kingma2018glow] has led to significant progress in this domain. More recently, advancements in generative adversarial networks (GANs) [goodfellow2014generative] have made it possible to generate high-quality images for challenging datasets [zhang2018self, karras2018progressive, brock2019large]. Many of these methods, however, can be tricky to train and properly tune. They are also fairly computationally intensive, and often require fine-grained performance optimizations.
In contrast, we demonstrate that robust classifiers, without any special training or auxiliary networks, can be a powerful tool for synthesizing realistic natural images. At a high level, our generation procedure is based on maximizing the class score of the desired class using a robust model. The purpose of this maximization is to add relevant and semantically meaningful features of that class to a given input image. As this process is deterministic, generating a diverse set of samples requires a random seed as the starting point of the maximization process.
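Formally, to generate a sample of class $y$, we sample a seed $x_0$ and minimize the loss $\mathcal{L}$ of label $y$

$$x = \arg\min_{\|x' - x_0\|_2 \le \varepsilon} \mathcal{L}(x', y), \qquad x_0 \sim G_y,$$

for some class-conditional seed distribution $G_y$, using projected gradient descent (PGD) (experimental details can be found in Appendix A). Ideally, samples from $G_y$ should be diverse and statistically similar to the data distribution. Here, we use a simple (but already sufficient) choice for $G_y$: a multivariate normal distribution fit to the empirical class-conditional distribution

$$G_y := \mathcal{N}(\mu_y, \Sigma_y), \quad \text{where } \mu_y = \mathbb{E}_{x\sim D_y}[x], \quad \Sigma_y = \mathbb{E}_{x\sim D_y}\left[(x-\mu_y)(x-\mu_y)^\top\right],$$

and $D_y$ is the distribution of natural inputs conditioned on the label $y$. We visualize example seeds from these multivariate Gaussians in Figure 17.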
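The seed distribution described above (a multivariate normal fit to the images of one class) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the random array stands in for real class images, and clipping the seed to the [0, 1] pixel range is an added assumption.

```python
import numpy as np

def fit_class_gaussian(images):
    """Fit a mean and covariance to the flattened images of one class."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    mu = flat.mean(axis=0)
    centered = flat - mu
    sigma = centered.T @ centered / len(flat)   # empirical covariance
    return mu, sigma

def sample_seed(mu, sigma, shape, rng):
    """Draw one seed from N(mu, sigma) and reshape it into an image."""
    x0 = rng.multivariate_normal(mu, sigma)
    return np.clip(x0, 0.0, 1.0).reshape(shape)  # clipping is an assumption

rng = np.random.default_rng(0)
class_images = rng.uniform(size=(200, 4, 4, 3))  # stand-in for one class
mu, sigma = fit_class_gaussian(class_images)
seed = sample_seed(mu, sigma, (4, 4, 3), rng)
```

The seed would then serve as the starting point of the class-score maximization; different draws give different samples even though the optimization itself is deterministic.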
This approach enables us to perform conditional image synthesis given any target class. Samples (at resolution 224×224) produced by our method are shown in Figure 3 (also see Appendix B). The resulting images are diverse and realistic, despite the fact that they are generated using targeted PGD on off-the-shelf robust models without any additional optimizations. (Footnote 3: Interestingly, the robust model used to generate these high-quality ImageNet samples is far from perfectly accurate, yet has a sufficiently rich representation to synthesize semantic features for all of its classes.)
It is worth noting that there is significant room for improvement in designing the seed distribution. One way to synthesize better samples would be to use a richer distribution, for instance, mixtures of Gaussians per class to better capture multiple data modes. Also, in contrast to many existing approaches, we are not limited to a single seed distribution, and we could even utilize other methods (such as procedural generation) to customize seeds with specific structure or color, and then maximize class scores to produce realistic samples (e.g., see Section 3.5).
Inception Score (IS) [salimans2016improved] is a popular metric for evaluating the quality of generated image data. Table 1 presents the IS of samples generated using a robust classifier.
|Dataset||Train Data||BigGAN [brock2019large]||WGAN-GP [gulrajani2017improved]||Our approach|
|CIFAR-10||11.2 ± 0.2||9.22||8.4 ± 0.1||7.5 ± 0.1|
|ImageNet||331.9 ± 4.9||233.1 ± 1||11.6||259.0 ± 4|
We find that our approach improves over state-of-the-art (BigGAN [brock2019large]) in terms of Inception Score on the ImageNet dataset, yet, at the same time, the Fréchet Inception Distance (FID) [heusel2017gans] is worse (36.0 versus 7.4). These results can be explained by the fact that, on one hand, our samples are essentially adversarial examples (which are known to transfer across models [szegedy2014intriguing]) and thus are likely to induce highly confident predictions that IS is designed to pick up. On the other hand, GANs are explicitly trained to produce samples that are indistinguishable from true data with respect to a discriminator, and hence are likely to have a better (lower) FID.
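For readers unfamiliar with the metric, the Inception Score is the exponential of the average KL divergence between each sample's predicted label distribution and the marginal label distribution. The sketch below computes it from a matrix of predicted class probabilities. It is a simplified illustration: the function name and the `eps` smoothing constant are assumptions, and evaluation code in practice (such as the BigGAN-PyTorch code cited in the appendix) obtains the probabilities from an Inception network and may average over several splits.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp( mean_x KL( p(y|x) || p(y) ) ) for an (n_samples, n_classes) array."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)          # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Uninformative predictions: every sample gets the uniform distribution.
uniform_score = inception_score(np.full((8, 4), 0.25))
# Confident and class-balanced predictions: one sample per class, one-hot.
confident_score = inception_score(np.eye(4))
```

Confident, class-balanced predictions push the score toward the number of classes, which is consistent with the explanation above of why samples that induce highly confident predictions score well on IS.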
Image inpainting is the task of recovering images with large corrupted regions [efros1999texture, bertalmio2000image, hays2007scene]. Given an image $x$, corrupted in a region corresponding to a binary mask $m \in \{0,1\}^d$, the goal of inpainting is to recover the missing pixels in a manner that is perceptually plausible with respect to the rest of the image. We find that simple feed-forward classifiers, when robustly trained, can be a powerful tool for such image reconstruction tasks.
From our perspective, the goal is to use robust models to restore missing features of the image. To this end, we will optimize the image to maximize the score of the underlying true class, while also forcing it to be consistent with the original in the uncorrupted regions. Concretely, given a robust classifier trained on uncorrupted data, and a corrupted image $x$ with label $y$, we solve

$$x_I = \arg\min_{x'} \; \mathcal{L}(x', y) + \lambda \left\| (x - x') \odot (1 - m) \right\|_2, \tag{3}$$

where $\mathcal{L}$ is the cross-entropy loss, $\odot$ denotes element-wise multiplication, and $\lambda$ is an appropriately chosen constant. Note that while we require knowing the underlying label $y$ for the input, it can typically be accurately predicted by the classifier itself given the corrupted image.
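The consistency term in this objective only penalizes changes outside the corrupted region, so the optimizer is free to repaint the masked pixels. The short sketch below illustrates this behaviour. The function name, the toy 4×4 image, and the value of the weight `lam` are illustrative assumptions, with the mask equal to 1 on corrupted pixels.

```python
import numpy as np

def consistency_penalty(x, x_prime, mask, lam):
    """lam * || (x - x') * (1 - mask) ||_2, with mask = 1 on corrupted pixels."""
    return lam * np.linalg.norm((x - x_prime) * (1.0 - mask))

x = np.zeros((4, 4))                 # corrupted input (toy values)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                 # the corrupted 2x2 region

edit_inside = x.copy()
edit_inside[1:3, 1:3] = 0.7          # change only the corrupted pixels
edit_outside = x.copy()
edit_outside[0, 0] = 0.7             # change one uncorrupted pixel

penalty_inside = consistency_penalty(x, edit_inside, mask, lam=10.0)
penalty_outside = consistency_penalty(x, edit_outside, mask, lam=10.0)
```

Edits confined to the masked region incur no penalty, while any change to an uncorrupted pixel is charged in proportion to `lam`.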
In Figure 4, we show sample reconstructions obtained by optimizing (3) using PGD (cf. Appendix A for details). We can observe that these reconstructions look remarkably similar to the uncorrupted images in terms of semantic content. Interestingly, even when this approach fails (reconstructions differ from the original), the resulting images do tend to be perceptually plausible to a human, as shown in Appendix Figure 12.
As discussed in Section 2, robust models provide a mechanism for transforming inputs between classes. In computer vision literature, this would be an instance of image-to-image translation, where the goal is to translate an image from a source to a target domain in a semantic manner [hertzmann2001image].
In this section, we demonstrate that robust classifiers give rise to a new methodology for performing such image-to-image translations. The key is to (robustly) train a classifier to distinguish between the source and target domain. Conceptually, such a classifier will extract salient characteristics of each domain in order to make accurate predictions. We can then translate an input from the source domain by directly maximizing the predicted score of the target domain.
In Figure 5, we provide sample translations produced by our approach using robust models, each trained only on the source and target domains for the Horse ↔ Zebra, Apple ↔ Orange, and Summer ↔ Winter datasets [zhu2017unpaired] respectively. (For completeness, we present in Appendix B Figure 10 results corresponding to using a classifier trained on the complete ImageNet dataset.) In general, we find that this procedure yields meaningful translations by directly modifying characteristics of the image that are strongly tied to the corresponding domain (e.g., color, texture, stripes).
Note that, in order to manipulate such features, the model must have learned them in the first place. For example, we want models to distinguish between horses and zebras based on salient features such as stripes. For overly simple tasks, models might extract little salient information (e.g., by relying on backgrounds instead of objects), in which case our approach would not lead to meaningful translations. (Footnote 5: In fact, we encountered such an issue with ℓ∞-robust classifiers for horses and zebras (Figure 11). Note that generative approaches also face similar issues, where the background is transformed instead of the objects [zhu2017unpaired].) Nevertheless, this is not a fundamental barrier and can be addressed by training on richer, more challenging datasets. From this perspective, scaling to larger datasets (which can be difficult for state-of-the-art methods such as GANs) is actually easy and advantageous for our approach.
Datasets for translation tasks often comprise source-target domain pairs [isola2017image]. For such datasets, the task can be straightforwardly cast into a supervised learning framework. In contrast, our method operates in the unpaired setting, where samples from the source and target domain are provided without an explicit pairing [zhu2017unpaired]. This is due to the fact that our method only requires a classifier capable of distinguishing between the source and target domains.
Super-resolution refers to the task of recovering high-resolution images given their low-resolution versions [dabov2007video, burger2012image]. While this goal is underspecified, our aim is to produce a high-resolution image that is consistent with the input and plausible to a human.
In order to adapt our framework to this problem, we cast super-resolution as the task of accentuating the salient features of low-resolution images. This can be achieved by maximizing the score predicted by a robust classifier (trained on the original high-resolution dataset) for the underlying class. At the same time, to ensure that the structure and high-level content is preserved, we penalize large deviations from the original low-resolution image. Formally, given a robust classifier and a low-resolution image $x_L$ belonging to class $y$, we use PGD to solve

$$\hat{x}_H = \arg\min_{\|x' - \uparrow(x_L)\|_2 < \varepsilon} \mathcal{L}(x', y), \tag{4}$$

where $\uparrow(\cdot)$ denotes the up-sampling operation based on nearest neighbors, and $\varepsilon$ is a small constant.
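The nearest-neighbor up-sampling used as the starting point of this optimization can be written in a few lines. The sketch below is an illustration with a hypothetical function name; it assumes an integer scaling factor and an H×W×C array layout. Going from a 32×32 CIFAR-10 image to 224×224 corresponds to a factor of 7.

```python
import numpy as np

def upsample_nearest(image, factor):
    """Nearest-neighbor up-sampling of an H x W x C image by an integer factor."""
    rows = np.repeat(image, factor, axis=0)   # repeat each row `factor` times
    return np.repeat(rows, factor, axis=1)    # then repeat each column

rng = np.random.default_rng(0)
low_res = rng.uniform(size=(32, 32, 3))       # stand-in for a CIFAR-10 image
start = upsample_nearest(low_res, 7)          # 32 * 7 = 224
```

Each low-resolution pixel becomes a constant 7×7 block, which is why the starting point (and, to a degree, the result) looks pixelated.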
[Figure 6: Super-resolution results. Panels include upsampling using bicubic interpolation and (bottom) super-resolution using robust models. We obtain semantically meaningful reconstructions that are especially sharp in regions that contain class-relevant information.]
We use this approach to upsample random CIFAR-10 images to full ImageNet size (224×224); cf. Figure 6(a). For comparison, we also show upsampled images obtained from bicubic interpolation. In Figure 6(b), we visualize the results for super-resolution on random -fold down-sampled images from the Restricted ImageNet dataset. Since in the latter case we have access to ground truth high-resolution images (actual dataset samples), we can compute the Peak Signal-to-Noise Ratio (PSNR) of the reconstructions. Over the Restricted ImageNet test set, our approach yields a PSNR of (% CI [, ]) compared to (% CI [, ]) from bicubic interpolation. In general, our approach produces high-resolution samples that are substantially sharper, particularly in regions of the image that contain salient class information.
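For reference, PSNR compares a reconstruction to its ground truth through the mean squared error, expressed in decibels relative to the maximum pixel value. A minimal sketch follows; the function name and the assumption that pixels lie in [0, 1] (so `max_value=1.0`) are illustrative choices, not taken from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    mse = np.mean((np.asarray(reference) - np.asarray(reconstruction)) ** 2)
    if mse == 0.0:
        return float("inf")                   # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((8, 8, 3))
reconstruction = np.full((8, 8, 3), 0.1)      # uniform error of 0.1 per pixel
value = psnr(reference, reconstruction)       # MSE = 0.01, i.e. about 20 dB
```

Higher values indicate a reconstruction closer to the ground truth in the pixel-wise sense, which does not necessarily track perceived sharpness.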
Note that the pixelation of the resulting images can be attributed to using a very crude upsampling of the original, low-resolution image as a starting point for our optimization. Combining this method with a more sophisticated initialization scheme (e.g., bicubic interpolation) is likely to yield better overall results.
Recent work has explored building deep learning–based interactive tools for image synthesis and manipulation. For example, GANs have been used to transform simple sketches [chen2018sketchygan, park2019semantic] into realistic images. In fact, recent work has pushed this one step further by building a tool that allows object-level composition of scenes using GANs [bau2019gan]. In this section, we show how our framework can be used to enable similar artistic applications.
By performing PGD to maximize the probability of a chosen target class, we can use robust models to convert hand-drawn sketches to natural images. The resulting images (Figure 7) appear realistic and contain fine-grained characteristics of the corresponding class.
Generative model–based paint applications often allow the user to control more fine-grained features, as opposed to just the overall class. We now show that we can perform similar feature manipulation through a minor modification to our basic primitive of class score maximization. Our methodology is based on an observation of Engstrom et al. [engstrom2019learning], wherein manipulating individual activations within representations of a robust model actually results in consistent and meaningful changes to high-level image features (e.g., adding stripes to objects). (Footnote 8: We refer to the pre-final layer of a network as the representation layer. Then, the network prediction can simply be viewed as the output of a linear classifier on the representation.) We can thus build a tool to paint specific features onto images by maximizing individual activations directly, instead of just the class scores.
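Concretely, given an image $x$, if we want to add a single feature corresponding to component $f$ of the representation vector $R(x)$ in the region corresponding to a binary mask $m$, we simply apply PGD to solve

$$x_I = \arg\max_{x'} \; R(x')_f - \lambda_P \left\| (x - x') \odot (1 - m) \right\|_2, \tag{5}$$

where $\lambda_P$ weights the penalty on changes outside the masked region.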
In Figure 8, we demonstrate progressive addition of features at various levels of granularity (e.g., grass or sky) to selected regions of the input image. We can observe that such direct maximization of individual activations gives rise to a versatile paint tool.
In this work, we leverage the basic classification framework to perform a wide range of computer vision tasks. In particular, we find that the features learned by a basic classifier are sufficient for all these tasks, provided this classifier is adversarially robust. We then show how this insight gives rise to a versatile computer vision toolkit that is simple, reliable, and straightforward to extend to other large-scale datasets. This is in stark contrast to state-of-the-art approaches [goodfellow2014generative, karras2018progressive, brock2019large] which typically rely on architectural, algorithmic, and task-specific optimizations to succeed at scale [salimans2016improved, daskalakis2018training, miyato2018spectral]. In fact, unlike these approaches, our methods actually benefit from scaling to more complex datasets—whenever the underlying classification task is rich and challenging, the classifier is likely to learn more fine-grained features.
We also note that throughout this work, we choose to employ the most minimal version of our toolkit. In particular, we refrain from using extensive tuning or task-specific optimizations. This is intended to demonstrate the potential of our core framework itself, rather than to exactly match/outperform the state of the art. We fully expect that better training methods, improved notions of robustness, and domain knowledge will yield even better results.
More broadly, our findings suggest that adversarial robustness might be a property that is desirable beyond security and reliability contexts. Robustness may, in fact, offer a path towards building a more human-aligned machine learning toolkit.
For our experimental analysis, we use the CIFAR-10 [krizhevsky2009learning] and ImageNet [russakovsky2015imagenet] datasets. Since obtaining a robust classifier for the full ImageNet dataset is known to be a challenging and computationally expensive problem, we also conduct experiments on a “restricted” version of the ImageNet dataset with 9 super-classes shown in Table 2. For image translation we use the Horse ↔ Zebra, Apple ↔ Orange, and Summer ↔ Winter datasets [zhu2017unpaired].
|Class||Corresponding ImageNet Classes|
|“Dog”||151 to 268|
|“Cat”||281 to 285|
|“Frog”||30 to 32|
|“Turtle”||33 to 37|
|“Bird”||80 to 100|
|“Primate”||365 to 382|
|“Fish”||389 to 397|
|“Crab”||118 to 121|
|“Insect”||300 to 319|
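The class grouping in the table above can be expressed as a small lookup. This sketch assumes the listed index ranges are inclusive on both ends and refer to the standard 0-indexed ImageNet labels; the dictionary and function names are hypothetical.

```python
# Super-class name -> (first, last) ImageNet class index, assumed inclusive.
RESTRICTED_IMAGENET_RANGES = {
    "Dog": (151, 268),
    "Cat": (281, 285),
    "Frog": (30, 32),
    "Turtle": (33, 37),
    "Bird": (80, 100),
    "Primate": (365, 382),
    "Fish": (389, 397),
    "Crab": (118, 121),
    "Insect": (300, 319),
}

def superclass_of(imagenet_index):
    """Return the restricted-ImageNet super-class, or None if excluded."""
    for name, (first, last) in RESTRICTED_IMAGENET_RANGES.items():
        if first <= imagenet_index <= last:
            return name
    return None

kept = [i for i in range(1000) if superclass_of(i) is not None]
```

Images whose label maps to `None` would simply be dropped when building the restricted dataset.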
We use the standard ResNet-50 architecture [he2016deep] for our adversarially trained classifiers on all datasets. Every model is trained with data augmentation, momentum of and weight decay of . Other hyperparameters are provided in Tables 3 and 4.
|Dataset||Epochs||LR||Batch Size||LR Schedule|
|CIFAR-10||350||0.01||256||Drop by 10 at epochs|
|restricted ImageNet||110||0.1||128||Drop by 10 at epochs|
|ImageNet||110||0.1||256||Drop by 10 at epochs|
|Horse ↔ Zebra||350||0.01||64||Drop by 10 at epochs|
|Apple ↔ Orange||350||0.01||64||Drop by 10 at epochs|
|Summer ↔ Winter||350||0.01||64||Drop by 10 at epochs|
In all our experiments, we train robust classifiers by employing the adversarial training methodology [madry2018towards] with an ℓ2 perturbation set. The hyperparameters used for robust training of each of our models are provided in Table 4.
|Dataset||# steps||Step size|
Note that we did not perform any hyperparameter tuning for the hyperparameters in Table 3 because of computational constraints. We use the relatively standard benchmark of ε = 0.5 for CIFAR-10; the remaining values of ε were chosen roughly by scaling this up by the appropriate constant (i.e., proportional to sqrt(d), where d is the input dimension). We note that the networks are not critically sensitive to these values of ε (e.g., a CIFAR-10 model trained with a different value of ε gives almost the exact same results). Due to restrictions on compute we did not grid search over ε, but finding a more direct manner in which to set ε (e.g., via a desired adversarial accuracy) is an interesting future direction.
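As a worked example of the sqrt(d) scaling rule, the sketch below scales the CIFAR-10 value of 0.5 to a 224×224×3 input. It only illustrates the rule of thumb: the text says the values were chosen "roughly" this way, so the number computed here should not be read as the value actually used for any model. The function name is hypothetical.

```python
import math

def scale_epsilon(base_eps, base_shape, target_shape):
    """Scale an l2 radius in proportion to the square root of the input dimension."""
    d_base = math.prod(base_shape)
    d_target = math.prod(target_shape)
    return base_eps * math.sqrt(d_target / d_base)

# CIFAR-10 inputs are 32x32x3; ImageNet-sized inputs are 224x224x3.
eps_224 = scale_epsilon(0.5, (32, 32, 3), (224, 224, 3))
```

The dimension ratio is (224/32)² = 49, so the rule multiplies the radius by 7.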
|Dataset||# steps||Step size|
|Dataset||# steps||Step size|
In order to compute the class-conditional Gaussians for high-resolution images (224×224×3), we downsample the images by a factor of 4 and upsample the resulting seed images with nearest-neighbor interpolation.
|Dataset||# steps||Step size|
Inception score is computed based on 50k class-balanced samples from each dataset using code provided in https://github.com/ajbrock/BigGAN-PyTorch.
To create a corrupted image, we select a patch of a given size at a random location in the image. We reset all pixel values in the patch to be the average pixel value over the entire image (per channel).
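The corruption procedure just described can be sketched as follows. The function name and the choice to also return a binary mask (1 on corrupted pixels) are illustrative assumptions; the patch location is drawn uniformly among positions where the patch fits entirely inside the image.

```python
import numpy as np

def corrupt_with_mean_patch(image, patch_size, rng):
    """Replace a random square patch with the per-channel mean of the image."""
    height, width, _ = image.shape
    top = rng.integers(0, height - patch_size + 1)    # upper bound is exclusive
    left = rng.integers(0, width - patch_size + 1)
    channel_mean = image.mean(axis=(0, 1))            # one value per channel
    corrupted = image.copy()
    corrupted[top:top + patch_size, left:left + patch_size, :] = channel_mean
    mask = np.zeros((height, width, 1))
    mask[top:top + patch_size, left:left + patch_size, 0] = 1.0
    return corrupted, mask

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))                 # stand-in for a test image
corrupted, mask = corrupt_with_mean_patch(image, 8, rng)
```

The returned mask is the kind of binary mask that the inpainting objective uses to leave uncorrupted pixels (approximately) unchanged.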
|Dataset||patch size||# steps||Step size|
|Dataset||factor||# steps||Step size|
|Horse ↔ Zebra||Apple ↔ Orange|