# Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?

For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, researchers have demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding suggests that perceptually-aligned gradients may be a general property of robust classifiers, rather than a specific property of adversarially-trained neural networks. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.


## 1 Introduction

Classifiers are called adversarially robust if they achieve high accuracy even on adversarially-perturbed inputs [szegedy2014intriguing, biggio2013evasion]. Two effective techniques for constructing robust classifiers are adversarial training and randomized smoothing. In adversarial training, a neural network is optimized via a min-max objective to achieve high accuracy on adversarially-perturbed training examples [szegedy2014intriguing, kurakin2017adversarial, madry2017towards]. In randomized smoothing, a neural network is smoothed by convolution with Gaussian noise [lecuyer2018certified, li2018second, cohen2019certified, salman2019provably]. Recently, [tsipras2018robustness, santurkar2019image] demonstrated that adversarially-trained networks exhibit perceptually-aligned gradients: iteratively updating an image by gradient ascent so as to maximize the score assigned to a target class will render an image that perceptually resembles the target class.

In this paper, we show that smoothed neural networks also exhibit perceptually-aligned gradients. This finding suggests that perceptually-aligned gradients may be a more general property of robust classifiers, and not only a curious consequence of adversarial training. Since the root cause behind the apparent relationship between adversarial robustness and perceptual alignment remains unclear, we hope that our findings will spur foundational research aimed at explaining this connection.

Let f be a neural network image classifier that maps from images in ℝ^d to scores for k classes. Naively, one might hope that by starting with any image x0 and taking gradient steps so as to maximize the score of a target class t, we would produce an altered image that better resembled (perceptually) the targeted class. However, as shown in Figure 1, when f is a vanilla-trained neural network, this is not the case; iteratively following the gradient of class t's score appears perceptually as a noising of the image. In the nascent literature on the explainability of deep learning, this problem has been addressed by adding explicit regularizers to the optimization problem [olah2017feature, nguyen2014deep, mahendran2015understanding, oygard2015visualizing]. However, [santurkar2019image] showed that for adversarially-trained neural networks, these explicit regularizers aren't needed: merely following the gradient of a target class t will render images that visually resemble class t.

### Randomized smoothing

Across many studies, adversarially-trained neural networks have proven empirically successful at resisting adversarial attacks within the threat model in which they were trained [athalye2018obfuscated, brendel2019accurate]. Unfortunately, when the networks are large and expressive, no known algorithms are able to provably certify this robustness [salman2019convex], leaving open the possibility that they will be vulnerable to better adversarial attacks developed in the future.

For this reason, a distinct approach to robustness called randomized smoothing has recently gained traction in the literature [lecuyer2018certified, li2018second, cohen2019certified, salman2019provably]. In the ℓ2-robust version of randomized smoothing, the robust classifier is a smoothed neural network of the form:

 ^fσ(x)=Eε∼N(0,σ2I)[f(x+ε)] (1)

where f is a neural network (ending in a softmax) called the base network. In other words, ^fσ(x), the smoothed network's predicted scores at x, is the weighted average of f within the neighborhood around x, where points are weighted according to an isotropic Gaussian centered at x with variance σ². A disadvantage of randomized smoothing is that the smoothed network ^fσ cannot be evaluated exactly, due to the expectation in (1), and instead must be approximated via Monte Carlo sampling. However, by computing ^fσ(x) one can obtain a guarantee that ^fσ's prediction is constant within an ℓ2 ball around x; in contrast, it is not currently possible to obtain such certificates for large (unsmoothed) neural network classifiers. See Appendix B for more background on randomized smoothing.
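The Monte Carlo approximation of (1) can be sketched in a few lines. The sketch below uses a toy linear-softmax stand-in for the base network; the shapes, weight matrix `W`, and sample count are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def base_network(x, W):
    """Toy stand-in for the base network f: linear logits followed by softmax."""
    return softmax(x @ W)

def smoothed_scores(x, W, sigma, n_samples, rng):
    """Monte Carlo estimate of f_sigma(x) = E[f(x + eps)], eps ~ N(0, sigma^2 I)."""
    eps = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    return base_network(x[None, :] + eps, W).mean(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))   # 8 input dimensions, 3 classes (toy sizes)
x = rng.normal(size=8)
p = smoothed_scores(x, W, sigma=0.5, n_samples=1000, rng=rng)
```

Because each noisy forward pass returns a probability vector, their average `p` is itself a probability vector, which is the smoothed score estimate.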

How to best train the base network f to maximize the certified accuracy of the smoothed network ^fσ remains an open question in the literature. In [lecuyer2018certified, cohen2019certified], the base network was trained with Gaussian data augmentation. However, [carmon2019unlabeled, li2018second] showed that using stability training [zheng2016improving] instead resulted in substantially higher certified accuracy, and [salman2019provably] showed that adversarially training the smoothed network also outperformed Gaussian data augmentation. Our main experiments use a base network trained with Gaussian data augmentation. In Appendix C we compare against the network from [salman2019provably].

## 2 Experiments

In this paper, we show that smoothed neural networks exhibit perceptually-aligned gradients. By design, our experiments mirror those conducted in [santurkar2019image]. To begin, we synthesize large-ϵ targeted adversarial examples for a smoothed (σ = 0.5) ResNet-50 trained on ImageNet [he2016deep, imagenetcvpr09]. Given some source image x0, we used projected gradient descent (PGD) to find an image within ℓ2 distance ϵ of x0 that the smoothed network classifies confidently as target class t. Specifically, decomposing the base network f as softmax ∘ logits, we solve the problem:

 x∗=argmaxx:∥x−x0∥2≤ϵEε∼N(0,σ2I)[logits(x+ε)t]. (2)

We find that optimizing (2) yields visually more compelling results than minimizing the cross-entropy loss of ^fσ. See Appendix C for a comparison between (2) and the cross-entropy approach.

The gradient of the objective (2) cannot be computed exactly, due to the expectation over ε, so we instead used an unbiased estimator obtained by sampling N noise vectors ε1, …, εN ∼ N(0, σ²I) and computing the average gradient (1/N)∑i ∇x logits(x+εi)t.
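The PGD loop with this Monte Carlo gradient can be sketched as follows, using a toy tanh-linear "logits" network whose gradient we can write by hand; the architecture, step size, sample counts, and ball radius here are illustrative assumptions:

```python
import numpy as np

def logits(x, W):
    """Toy stand-in for the logits network (tanh features, linear readout)."""
    return np.tanh(x) @ W

def grad_logit_t(x, W, t):
    """Exact gradient of logits(x)[t] w.r.t. x for the toy network."""
    return (1.0 - np.tanh(x) ** 2) * W[:, t]

def project_l2(x, x0, radius):
    """Project x onto the L2 ball of the given radius around x0."""
    delta = x - x0
    norm = np.linalg.norm(delta)
    return x if norm <= radius else x0 + delta * (radius / norm)

def pgd_smoothed(x0, W, t, sigma, radius, step, n_steps, n_noise, rng):
    """PGD ascent on a Monte Carlo estimate of E[logits(x + eps)[t]], as in (2)."""
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.normal(0.0, sigma, size=(n_noise, x0.shape[0]))
        g = np.mean([grad_logit_t(x + n, W, t) for n in noise], axis=0)
        x = project_l2(x + step * g, x0, radius)
    return x

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
x0 = rng.normal(size=6)
x_adv = pgd_smoothed(x0, W, t=2, sigma=0.5, radius=1.0,
                     step=0.1, n_steps=50, n_noise=20, rng=rng)
```

Each iteration draws fresh noise, averages the per-sample gradients, takes an ascent step, and projects back onto the ℓ2 ball around the source image.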

Figure 1 depicts large-ϵ targeted adversarial examples for a vanilla-trained neural network, an adversarially trained network [madry2017towards], and a smoothed network. Observe that the adversarial examples for the vanilla network do not take on coherent features of the target class, while the adversarial examples for both robust networks do. Figure 2 shows large-ϵ targeted adversarial examples synthesized for the smoothed network for a variety of different target classes.

Next, as in [santurkar2019image], we use the smoothed network to class-conditionally synthesize images. To generate an image from class y, we sample a seed image from a multivariate Gaussian fit to images from class y, and then we iteratively take gradient steps to maximize the score of class y using objective (2). Figure 3 shows two images synthesized in this way from each of seven ImageNet classes. The synthesized images appear visually similar to instances of the target class, though they often lack global coherence; for example, the synthesized solar dish includes multiple overlapping solar dishes.

### Noise Level σ

Smoothed neural networks have a hyperparameter σ which controls a robustness/accuracy tradeoff: when σ is high, the smoothed network is more robust, but less accurate [lecuyer2018certified, cohen2019certified]. We investigated the effect of σ on the perceptual quality of generated images. Figure 4 shows large-ϵ adversarial examples crafted for smoothed networks with varying σ. Observe that when σ is large, PGD tends to paint a single instance of the target class; when σ is small, PGD tends to add spatially scattered features.

### Other concerns

In Appendix C, we study the effects of the following factors on the perceptual quality of the generated images: the number of Monte Carlo noise samples N, the loss function used for PGD, and whether the base network f is trained using Gaussian data augmentation [lecuyer2018certified, cohen2019certified] or SmoothAdv [salman2019provably].

## Appendix B Randomized Smoothing

Randomized smoothing is relatively new to the literature, and few comprehensive references exist. Therefore, in this appendix, we review some basic aspects of the technique.

### Preliminaries

Randomized smoothing refers to a class of adversarial defenses in which the robust classifier g, which maps from an input in ℝ^d to a class in [k], is defined as:

 g(x)=argmaxy∈[k]ET[f(T(x))]y.

Here, f is a neural network "base classifier" which maps from an input in ℝ^d to a vector of class scores in Δ^k, the probability simplex of non-negative k-vectors that sum to 1. T is a randomization operation which randomly corrupts inputs in ℝ^d to other inputs in ℝ^d, i.e. for any x, T(x) is a random variable.

Intuitively, the score which the smoothed classifier g assigns to class y for the input x is defined to be the expected score that the base classifier f assigns to class y for the random input T(x).

The requirement that f returns outputs in the probability simplex can be satisfied in either of two ways. In the "soft smoothing" formulation (presented in the main paper), f is a neural network which ends in a softmax. In the "hard smoothing" formulation, f returns the indicator vector for a particular class, i.e. a length-k vector with one 1 and the rest zeros, without exposing the intermediate class scores. In the hard smoothing formulation, since the expectation of an indicator function is a probability, the smoothed classifier g can be interpreted as returning the most probable prediction by the base classifier f over the random variable T(x). Note that no papers have yet studied soft smoothing as a certified defense, though [salman2019provably] approximated a hard smoothing classifier with the corresponding soft classifier in order to attack it.

When the base classifier is a neural network, the smoothed classifier cannot be evaluated exactly, since it is not possible to exactly compute the expectation of a neural network's prediction over a random input. However, by repeatedly sampling the random vector f(T(x)), one can obtain upper and lower bounds on the expected value of each entry of that vector, which hold with high probability over the sampling procedure. In the hard smoothing case, since each entry of f(T(x)) is a Bernoulli random variable, one can use standard Bernoulli confidence intervals like the Clopper-Pearson, as in [lecuyer2018certified, cohen2019certified]. In the soft smoothing case, since each entry of f(T(x)) is bounded in [0, 1], one can use Hoeffding-style concentration inequalities to derive high-probability confidence intervals for the entries of ET[f(T(x))].
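For the soft smoothing case, a Hoeffding-style interval takes only a few lines. The sketch below bounds the mean of [0, 1]-valued score samples; the sample size, score distribution, and confidence level are arbitrary choices for illustration:

```python
import math
import numpy as np

def hoeffding_interval(samples, alpha=0.001):
    """Two-sided Hoeffding confidence interval for the mean of [0, 1]-bounded
    i.i.d. samples; it contains the true mean with probability >= 1 - alpha."""
    n = len(samples)
    half_width = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    mean = float(np.mean(samples))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)

rng = np.random.default_rng(0)
scores = rng.uniform(0.6, 0.9, size=10_000)  # stand-in for f(T(x))_y samples
lo, hi = hoeffding_interval(scores)
```

The interval width shrinks at rate O(1/sqrt(n)), so tighter certificates require quadratically more samples.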

### Gaussian smoothing

When T is an additive Gaussian corruption,

 T(x)=x+ε,ε∼N(0,σ2I),

the robust classifier g is given by:

 g(x)=argmaxj∈[k]^fσ(x)where^fσ(x)=Eε∼N(0,σ2I)[f(x+ε)]. (3)

Gaussian-smoothed classifiers are certifiably robust under the ℓ2 norm: for any input x, if we know ^fσ(x), we can certify that g's prediction will remain constant within an ℓ2 ball around x:

###### Theorem 1 (Extension to “soft smoothing” of Theorem 1 from [cohen2019certified]; see also Appendix A in [salman2019provably]).

Let f : ℝ^d → Δ^k be any function, and define ^fσ and g as in (3). For some input x, let y1 and y2 be the indices of the largest and second-largest entries of ^fσ(x). Then g(x+δ) = y1 for any δ with

 ∥δ∥2≤(σ/2)(Φ−1(^fσ(x)y1)−Φ−1(^fσ(x)y2)).
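As a sketch, the certified radius in Theorem 1 can be computed from two score estimates using only the standard-normal quantile function, here via Python's stdlib `statistics.NormalDist`; the example score values are hypothetical:

```python
from statistics import NormalDist

def certified_radius(p1, p2, sigma):
    """L2 radius (sigma/2) * (Phi^-1(p1) - Phi^-1(p2)) within which the
    smoothed classifier's top prediction is guaranteed not to change.
    p1, p2: largest and second-largest entries of the smoothed scores."""
    inv_cdf = NormalDist().inv_cdf
    return 0.5 * sigma * (inv_cdf(p1) - inv_cdf(p2))
```

For example, with p1 = 0.8, p2 = 0.1, and σ = 0.5 the radius is positive; when p1 = p2 it collapses to zero, and doubling σ doubles the radius.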

Theorem 1 is easy to prove using the following mathematical fact:

###### Lemma 2 (Lemma 2 from [salman2019provably], Lemma 1 from [levine2019certifiably]).

Let h : ℝ^d → [0, 1] be any function, and define its Gaussian convolution as ^hσ(x) = Eε∼N(0,σ2I)[h(x+ε)]. Then, for any input x and any perturbation δ,

 Φ(Φ−1(^hσ(x))−∥δ∥2/σ)≤^hσ(x+δ)≤Φ(Φ−1(^hσ(x))+∥δ∥2/σ).

Intuitively, Lemma 2 says that ^hσ(x+δ) cannot be too much larger or too much smaller than ^hσ(x). If this has the feel of a Lipschitz guarantee, there is good reason: Lemma 2 is equivalent to the statement that the function x ↦ Φ−1(^hσ(x)) is (1/σ)-Lipschitz.
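Lemma 2 can be checked numerically in one dimension: for the step function h(x) = 1{x > 0}, the Gaussian convolution has the closed form ^hσ(x) = Φ(x/σ), so both sides of the inequality can be evaluated exactly (the values of σ, x, and δ below are arbitrary test values):

```python
from statistics import NormalDist

nd = NormalDist()
sigma = 0.7

def h_sigma(x):
    """Gaussian convolution of the 1-D step function h(x) = 1{x > 0}:
    h_sigma(x) = Phi(x / sigma) in closed form."""
    return nd.cdf(x / sigma)

x, delta = 0.3, 0.5
# Lemma 2's bounds on h_sigma(x + delta) in terms of h_sigma(x):
lower = nd.cdf(nd.inv_cdf(h_sigma(x)) - abs(delta) / sigma)
upper = nd.cdf(nd.inv_cdf(h_sigma(x)) + abs(delta) / sigma)
```

For this particular h, the bounds are tight: `lower` and `upper` equal h_sigma(x − |δ|) and h_sigma(x + |δ|) exactly, which is why [cohen2019certified] call the resulting certificate tightest possible in the hard smoothing case.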

Theorem 1 is a direct consequence of Lemma 2:

###### Proof of Theorem 1.

Since the outputs of f live in the probability simplex, for each class j the function x ↦ f(x)j has output bounded in [0, 1], and hence can be viewed as a function h to which Lemma 2 applies.

Therefore, from applying Lemma 2 to x ↦ f(x)y1, we know that:

 ^fσ(x+δ)y1≥Φ(Φ−1(^fσ(x)y1)−∥δ∥2/σ)

and, for any j ≠ y1, from applying Lemma 2 to x ↦ f(x)j, we know that:

 Φ(Φ−1(^fσ(x)j)+∥δ∥2/σ)≥^fσ(x+δ)j.

Combining these two results, it follows that a sufficient condition for ^fσ(x+δ)y1 ≥ ^fσ(x+δ)j is:

 Φ(Φ−1(^fσ(x)y1)−∥δ∥2/σ)≥Φ(Φ−1(^fσ(x)j)+∥δ∥2/σ),

or equivalently,

 ∥δ∥2≤(σ/2)(Φ−1(^fσ(x)y1)−Φ−1(^fσ(x)j)).

Hence, we can conclude that g(x+δ) = y1 so long as

 ∥δ∥2≤minj≠y1{(σ/2)(Φ−1(^fσ(x)y1)−Φ−1(^fσ(x)j))}=(σ/2)(Φ−1(^fσ(x)y1)−Φ−1(^fσ(x)y2)). ∎

In the adversarial robustness literature, Lemma 2 was originally proved in the special case of hard smoothing by [cohen2019certified], though the result had actually appeared earlier in [li1998some]. Lemma 2 was proved in the general case in [salman2019provably, levine2019certifiably].

### Training

Given a dataset, a base classifier architecture, and a smoothing level σ, it is currently an active research question how best to train the base classifier so that the smoothed classifier will attain high certified or empirical robust accuracies. The original randomized smoothing paper [lecuyer2018certified] proposed training with Gaussian data augmentation and the standard cross-entropy loss. However, [salman2019provably] and [li2018second, carmon2019unlabeled] showed that alternative training schemes yield substantial gains in certified accuracy. In particular, [salman2019provably] proposed adversarially training the smoothed classifier, and [li2018second, carmon2019unlabeled] proposed training via stability training [zheng2016improving].

### Related work

Gaussian smoothing was first proposed as a certified adversarial defense by [lecuyer2018certified] under the name "PixelDP," though similar techniques had been proposed earlier as heuristic defenses in [cao2017mitigating, liu2018towards]. Subsequently, [li2018second] proved a stronger robustness guarantee, and finally [cohen2019certified] derived the tightest possible robustness guarantee in the "hard smoothing" case, which was extended to the "soft smoothing" case by [levine2019certifiably, salman2019provably].

Concurrently, [zhang2019discretization] proved a robustness guarantee for Gaussian smoothing in the ℓ∞ norm; however, since Gaussian smoothing specifically confers ℓ2 (not ℓ∞) robustness [cohen2019certified], the certified accuracy numbers reported in [zhang2019discretization] were weak.

[pinot2019theoretical] gave theoretical and empirical arguments for an adversarial defense similar to randomized smoothing, but did not position their method as a certified defense.

[lee2019stratified] have extended randomized smoothing beyond Gaussian noise / the ℓ2 norm by proposing a randomization scheme which allows for certified robustness in the ℓ0 norm.

## Appendix C Details on Generating Images

This appendix details the procedure used to generate the images that appeared in this paper.

As in [santurkar2019image], to generate an image near the starting image x0 that is classified by a smoothed neural network ^fσ as some target class t, we use projected steepest descent to solve the optimization problem:

 x∗=argminx:∥x−x0∥2≤ϵL(^fσ,x,t) (4)

where L is a loss function measuring the extent to which ^fσ classifies x as class t.

Two choices need to be made: which loss function to use, and how to compute its gradient.

### Loss functions for adversarially-trained networks

We first review two loss functions for generating images using adversarially-trained neural networks. Our loss functions for smoothed neural networks (presented below) are inspired by these.

The first is the cross-entropy loss. If f is an (adversarially trained) neural network classifier that ends in a softmax layer (so that its output lies on the probability simplex Δ^k), the cross-entropy loss is defined as:

 LCE(f,x,t):=−logf(x)t.

The second is the "target class max" (TCM) loss. If we write f as softmax ∘ logits, where logits is f minus the final softmax layer, then the TCM loss is defined as:

 LTCM(f,x,t):=−logits(x)t.

In other words, minimizing LTCM will maximize the score that logits assigns to class t.

Since f is just a neural network, computing the gradients of these loss functions can easily be done using automatic differentiation. (The situation is more complicated for smoothed neural networks.)

We note that [santurkar2019image] used LTCM in their experiments.

### Loss functions for smoothed networks

Our loss functions for smoothed neural networks are inspired by those described above for adversarially trained networks. If ^fσ is a smoothed neural network of the form (1), with f a neural network that ends in a softmax layer, then the cross-entropy loss is defined as:

 LCE(^fσ,x,t):=−log^fσ(x)t=−logEε∼N(0,σ2I)[f(x+ε)t]. (5)

If we decompose f as softmax ∘ logits, where logits is f minus the softmax layer, then the TCM loss is defined as:

 LTCM(^fσ,x,t):=−Eε∼N(0,σ2I)[logits(x+ε)t]. (6)

In other words, minimizing LTCM will maximize the expected logit of class t for the random input x + ε.

To solve problem (4) using PGD, we need to be able to compute the gradient of the objective w.r.t. x. However, for smoothed neural networks, it is not possible to exactly compute the gradient of either LCE or LTCM. We therefore must resort to gradient estimates obtained using Monte Carlo sampling.

For LTCM, we use the following unbiased gradient estimator:

 ∇xLTCM(^fσ,x,t)≈−(1/N)∑Ni=1∇xlogits(x+εi)t,εi∼N(0,σ2I)

This estimator is unbiased since

 Eε1,…,εN∼N(0,σ2I)[−(1/N)∑Ni=1∇xlogits(x+εi)t]=Eε∼N(0,σ2I)[−∇xlogits(x+ε)t]=∇xEε∼N(0,σ2I)[−logits(x+ε)t].

For LCE, we are unaware of any unbiased gradient estimator, so, following [salman2019provably], we use the following biased "plug-in" gradient estimator:

 ∇xLCE(^fσ,x,t)≈∇x[−log((1/N)∑Ni=1f(x+εi)t)],εi∼N(0,σ2I)
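The plug-in estimator can be sketched with a toy one-class score f(·)_t = sigmoid(w · x), whose per-sample gradients we can write by hand; the model, weight vector `w`, and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def plug_in_ce_grad(x, w, sigma, n_samples, rng):
    """Biased 'plug-in' estimate of grad_x of -log E[f(x + eps)_t], with the
    target-class score f(.)_t modeled as sigmoid(w . x)."""
    eps = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    s = sigmoid((x[None, :] + eps) @ w)            # f(x + eps_i)_t per sample
    grads = (s * (1.0 - s))[:, None] * w[None, :]  # grad_x f(x + eps_i)_t
    return -grads.mean(axis=0) / s.mean()          # grad of -log of the MC mean

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
g = plug_in_ce_grad(x, w, sigma=0.5, n_samples=1000, rng=rng)
```

The bias comes from taking the log of the Monte Carlo mean rather than the mean of the logs' gradients; by contrast, the TCM estimator above is unbiased because the gradient and expectation commute for a plain average.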

### Experimental comparison between loss functions

Figure 13 shows large-ϵ adversarial examples crafted for a smoothed neural network using both LCE and LTCM. The adversarial examples crafted using LTCM seem to better perceptually resemble the target class. Therefore, in this work we primarily use LTCM.

### Experimental comparison between training procedures

For most of the figures in this paper, we used a base classifier from [cohen2019certified] trained using Gaussian data augmentation. However, in Figures 15-17, we compare large-ϵ adversarial examples for this base classifier to those synthesized for a base classifier trained using the SmoothAdv procedure from [salman2019provably], which was shown in that paper to attain much better certified accuracies than the network from [cohen2019certified]. We find that there does not seem to be a large difference in the perceptual quality of the generated images. Therefore, throughout this paper we used the network from [cohen2019certified], since we wanted to emphasize that perceptually-aligned gradients arise even with robust classifiers that do not involve adversarial training of any kind.

### Experimental study of number of Monte Carlo samples

One important question is how many Monte Carlo samples N are needed when computing the gradient of LCE or LTCM. In Figure 14 we show large-ϵ adversarial examples synthesized using varying numbers of Monte Carlo samples N. There does not seem to be a large difference between the settings of N, though images synthesized using the smallest N do appear a bit less developed than the others (e.g. the terrier has fewer ears than when N is large). In this work, we primarily used N = 20.

### Hyperparameters

The following table shows the hyperparameter settings for all of the figures in this paper.

| Figure | σ | number of PGD steps | ϵ | PGD step size | N |
|---|---|---|---|---|---|
| 1, 12 | 0.5 | 300 | 40.0 | 2.8 (vanilla), 0.7 | 20 |
| 2 | 0.5 | 300 | 40.0 | 0.7 | 20 |
| 3 | 0.5 | 300 | 40.0 | 0.7 | 20 |
| 4, 9-11 | vary | 300 | 40.0 | 2.8 (σ = 0), 0.7 | 20 |
| 15-17 | 0.5, 1.0 | 300 | 40.0 | 0.7 | 20 |
| 13 | 0.5 | 300 | 40.0 | 2.0 (CE), 0.7 | 20 |
| 14 | 0.5 | 300 | 40.0 | 0.7 | vary |

Note that Figures 1 and 12 only use step size 2.8 in the vanilla column, Figure 13 only uses step size 2.0 in the cross-entropy loss column, and Figures 4 and 9-11 only use step size 2.8 for σ = 0.