Do Perceptually Aligned Gradients Imply Adversarial Robustness?

by Roy Ganz, et al.

In the past decade, deep learning-based networks have achieved unprecedented success in numerous tasks, including image classification. Despite this remarkable achievement, recent studies have demonstrated that such networks are easily fooled by small malicious perturbations, also known as adversarial examples. This security weakness led to extensive research aimed at obtaining robust models. Beyond the clear robustness benefits of such models, it was also observed that their gradients with respect to the input align with human perception. Several works have identified Perceptually Aligned Gradients (PAG) as a byproduct of robust training, but none have considered it as a standalone phenomenon nor studied its own implications. In this work, we focus on this trait and test whether Perceptually Aligned Gradients imply Robustness. To this end, we develop a novel objective to directly promote PAG in training classifiers and examine whether models with such gradients are more robust to adversarial attacks. Extensive experiments on CIFAR-10 and STL validate that such models have improved robust performance, exposing the surprising bidirectional connection between PAG and robustness.





1 Introduction

AlexNet (Krizhevsky et al., 2012), one of the first Deep Neural Networks (DNNs), significantly surpassed all the classic computer vision methods in the ImageNet (Deng et al., 2009) classification challenge (Russakovsky et al., 2015). Since then, the amount of interest and resources invested in the deep learning (DL) field has skyrocketed. Nowadays, such models attain superhuman performance in classification (He et al., 2015; Dosovitskiy et al., 2020). However, although neural networks are allegedly inspired by the human brain, unlike the human visual system, they are known to be highly sensitive to minor corruptions (Hosseini et al., 2017; Dodge and Karam, 2017; Geirhos et al., 2017; Temel et al., 2017, 2018; Temel and AlRegib, 2018) and to small malicious perturbations, known as adversarial attacks (Szegedy et al., 2014; Athalye et al., 2017; Biggio et al., 2013; Carlini and Wagner, 2017a; Goodfellow et al., 2015; Kurakin et al., 2017; Nguyen et al., 2014). With the introduction of DNNs into real-world applications affecting human lives, these issues raise significant safety concerns and have therefore drawn substantial research attention.

The bulk of the works in the field of robustness to adversarial attacks can be divided into two types: on the one hand, works that propose robustification methods (Goodfellow et al., 2015; Madry et al., 2018; Zhang et al., 2019; Wang et al., 2020), and on the other, works that construct stronger and more challenging adversarial attacks (Goodfellow et al., 2015; Madry et al., 2018; Carlini and Wagner, 2017b; Tramer et al., 2020; Croce and Hein, 2020). While there are numerous techniques for obtaining adversarially robust models (Lecuyer et al., 2018; Li et al., 2018; Cohen et al., 2019; Salman et al., 2019), the most effective one is Adversarial Training (AT) (Madry et al., 2018). AT proposes a simple yet highly beneficial training scheme – train the network to classify adversarial examples correctly. We provide an overview of adversarial examples and adversarial training in Appendix A.

While exploring the properties of adversarially trained models, Tsipras et al. (2019) exposed a fascinating characteristic of these models that does not exist in standard ones – Perceptually Aligned Gradients (PAG). Generally, they discovered that such models are more aligned with human perception than standard ones, in the sense that the loss gradients w.r.t. the input are meaningful and visually understood by humans. As a result, modifying an image to maximize a conditional probability of some class, estimated by a model with PAG, yields class-related semantic visual features, as can be seen in Figure 1. This important discovery has led to a sequence of works that uncovered conditions in which PAG occurs. Aggarwal et al. (2020) revealed that PAG also exists in adversarially trained models with small threat models, while Kaur et al. (2019) observed PAG in robust models trained without adversarial training. While it has been established that robust models lead to PAG, more research is required to better understand this intriguing property.

In this work, while aiming to shed some light on the PAG phenomenon, we pose the following reversed question – Do Perceptually Aligned Gradients Imply Robustness? This is an interesting question, as it tests the similarity between neural networks and human vision. Humans are capable of identifying the class-related semantic features and thus, can describe the modifications that need to be done to an image to change their predictions. That, in turn, makes the human visual system “robust”, as it is not affected by changes unrelated to the semantic features. With this insight, we hypothesize that since similar capabilities exist in classifiers with perceptually aligned gradients, they would be inherently more robust.

To methodologically test this question, we need to train networks that obtain perceptually aligned gradients without inheriting robust characteristics from robust models. However, PAG is known to be a byproduct of robust training, and there are currently no ways to promote this property directly and in isolation. Thus, to explore our research question, we develop a novel PAG-inducing general objective that penalizes the input-gradients of the classifier without any form of robust training. We first verify that our optimization goal indeed yields such gradients as well as sufficiently high accuracy on clean images, then test the robustness of the obtained models and compare them to models trained using standard training (“vanilla”). Our experiments strongly suggest that models with PAG are inherently more robust than their vanilla counterparts, revealing that directly promoting such a trait can imply robustness to adversarial attacks.

2 Perceptually Aligned Gradients

Perceptually aligned gradients (PAG) (Engstrom et al., 2019b; Etmann et al., 2019; Ross and Doshi-Velez, 2018b; Tsipras et al., 2019) is a phenomenon according to which classifier input-gradients are semantically aligned with human perception. This means, inter alia, that modifying an image to maximize a specific class probability should yield visual features that humans associate with the target class. Tsipras et al. (2019) discovered that PAG occurs in adversarially trained classifiers, but not in “vanilla” models. The prevailing hypothesis is that the existence of PAG only in adversarially robust classifiers and not in regular ones indicates that features learned by such models are more aligned with human vision. PAG is a qualitative trait, and currently, no quantitative metrics for assessing it exist. Moreover, there is an infinite number of equally good gradients aligned with human perception, i.e., there are countless perceptually meaningful directions in which one can modify an image to look more like a certain target class. Thus, in this work, similar to (Tsipras et al., 2019), we gauge PAG qualitatively by examining the visual modifications done while maximizing the conditional probability of some class, estimated by the tested classifier. In other words, we examine the effects of a large-ε targeted adversarial attack and claim that a model has PAG if such a process yields class-related semantic modifications, as demonstrated in Figure 1.
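Concretely, the probe described above amounts to gradient ascent on log p(ŷ | x). The sketch below runs it on a toy linear-softmax model, whose input-gradient is analytic, so no autodiff framework is needed; all names and settings here are our own illustrative choices, not the paper's.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pag_probe(x, target, W, steps=100, lr=0.1):
    """Gradient-ascent PAG probe: push input x toward class `target`.

    For a toy linear-softmax model p = softmax(W @ x), the gradient of
    log p(target | x) w.r.t. x is W[target] - p @ W, so this loop is a
    large-epsilon targeted attack in miniature. With a real network, the
    analytic gradient would be replaced by autodiff.
    """
    for _ in range(steps):
        p = softmax(W @ x)
        x = x + lr * (W[target] - p @ W)  # ascend d log p(target|x) / dx
    return x

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(10, 64))      # toy 10-class linear model
x0 = rng.normal(size=64)                 # toy flattened "image"
x1 = pag_probe(x0, target=3, W=W)        # probe output for target class 3
```

With a model that upholds PAG, the trajectory from x0 to x1 accumulates class-related semantic features; with a vanilla model, it accumulates noise-like perturbations.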

Figure 1: Visual demonstration of large-ε adversarial examples on “vanilla” and robust ResNet-18 models trained on CIFAR-10, as a method to determine whether a model obtains PAG.

In recent years, PAG has drawn a lot of research attention which can be divided into two main types – an applicative study and a theoretical one. The applicative study aims to harness this phenomenon for various computer vision problems, such as image generation and translation (Santurkar et al., 2019), the improvement of state-of-the-art results in image generation (Ganz and Elad, 2021), and explainability (Elliott et al., 2021).

As for the theoretical study, several works aimed to better understand the conditions under which PAG occurs. The authors of (Kaur et al., 2019) examined if PAG is an artifact of the adversarial training algorithm or a general property of robust classifiers. Additionally, it has been shown that PAG exists in adversarially robust models with a low max-perturbation bound (Aggarwal et al., 2020). To conclude, previous works discovered that training robust models leads to PAG. In this work, we explore the opposite question – Do perceptually aligned gradients imply robustness?

3 Do PAG Imply Robustness?

As mentioned in Section 2, previous works have validated that robust training implies perceptually aligned gradients. More specifically, these works observed that performing targeted PGD attacks on robust models yields visual modifications aligned with human perception. In contrast, in this work, we aim to delve into the opposite direction and test whether training a classifier to have perceptually aligned gradients improves its adversarial robustness.

To this end, we propose to encourage the input-gradients of a classifier to uphold PAG. Due to the nature of our research question, we need to isolate PAG from robust training and verify whether the former implies the latter. This raises a challenging question – PAG is known to be a byproduct of robust training. How can one develop a training procedure that encourages PAG without explicitly performing robust training of some sort? Note that a framework that attains PAG via robust training cannot answer our question, as that would involve circular reasoning.

We answer this question by proposing a novel training objective consisting of two elements: the classic cross-entropy loss on the model outputs and an auxiliary loss on the model’s input-gradients. We note that the input-gradients of the classifier, ∇ₓ f_θ(x)_ŷ, where f_θ(x)_ŷ is the ŷ-th entry of the logits vector f_θ(x), can be trained, since they are differentiable w.r.t. the classifier parameters θ. Thus, given labeled images (x, y) from a dataset D, and assuming we have access to ground-truth perceptually aligned gradients g(x, ŷ), we could pose the following loss function:

    L(θ) = E_(x,y)∼D [ L_CE(f_θ(x), y) + (λ/C) · Σ_{ŷ=1}^{C} L_cos(∇ₓ f_θ(x)_ŷ, g(x, ŷ)) ],        (1)

where L_CE is the cross-entropy loss defined in Equation 4, λ is a tunable regularization hyperparameter, C is the number of classes in the dataset, and L_cos is the cosine similarity loss defined as follows:

    L_cos(u, v) = 1 − ⟨u, v⟩ / (‖u‖₂ · ‖v‖₂ + ε),

where ε is a small positive value so as to avoid division by zero. Note that this loss considers the direction of the model’s input-gradients without any requirement on their magnitude. This bodes well with the general goal of these gradients being aligned with human perception.
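To make the objective concrete, the sketch below evaluates the combined loss for a single sample on a toy linear classifier, whose input-gradients are exactly the rows of its weight matrix, so no autodiff is needed. The model, shapes, and stand-in ground-truth gradients are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine_loss(u, v, eps=1e-8):
    # L_cos(u, v) = 1 - <u, v> / (||u|| ||v|| + eps): penalizes direction
    # only, leaving the gradient magnitude unconstrained.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def pag_objective(W, b, x, y, g, lam=1.0):
    """Cross-entropy plus the PAG-promoting cosine penalty (cf. Eq. 1).

    For a linear classifier f(x) = W @ x + b, the input-gradient of the
    y_hat-th logit is exactly W[y_hat], so no autodiff is needed here.
    g[y_hat] is the "ground truth" aligned gradient for class y_hat.
    """
    C = W.shape[0]
    p = softmax(W @ x + b)
    ce = -np.log(p[y])                                   # cross-entropy term
    pag = sum(cosine_loss(W[c], g[c]) for c in range(C)) / C
    return ce + lam * pag

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
x, y = rng.normal(size=5), 1
g = rng.normal(size=(3, 5))          # stand-in ground-truth gradients
loss = pag_objective(W, b, x, y, g)
```

For a deep network, W[c] would be replaced by the autodiff input-gradient of the c-th logit, computed with a second-order-capable backward pass so the penalty remains differentiable in θ.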

We emphasize that, in contrast to robust training methods such as (Madry et al., 2018; Cohen et al., 2019), our scheme does not feed the model with any perturbed images and only trains on examples originating from the training set. Moreover, while other works (Ross and Doshi-Velez, 2017; Jakubovitz and Giryes, 2018) suggest that penalizing the input-gradients’ norm yields robustness, we do not utilize this fact since we encourage gradient alignment rather than having a small norm. Thus, our method is capable of promoting PAG without utilizing robust training.

After training a model to minimize the objective in Equation 1, we aim to examine whether promoting PAG in a classifier increases adversarial robustness. First, to verify that the resulting model indeed upholds PAG, we perform targeted PGD on test set images and qualitatively assess the validity of the resulting visual modifications. Afterwards, we test the adversarial robustness of said model and compare it with vanilla baselines. If it demonstrates favorable robust accuracy, we will have provided an affirmative answer to the titular research question of this work.

However, one major obstacle remains in the way of training this objective: so far, we have assumed the existence of “ground-truth” model input-gradients, an assumption that does not hold in practice. While we hypothesize that these gradients should point in the general direction of the target class images, there is no clear way of obtaining point-wise realizations of them. In the following section, we present practical and simple methods for obtaining approximations for these gradients, which we then use for training PAG-promoting classifiers.

4 How are “Ground Truth” PAGs Obtained?

In order to train a classifier to minimize the objective in Equation 1, a “ground truth” perceptually aligned gradient g(x, ŷ) needs to be provided for each training image x and for each target class ŷ. Since such a true gradient is infeasible to obtain, we instead explore two general pragmatic approaches for approximating these PAGs.

4.1 Robust Input-Gradient Distillation

A possible realization of the “ground truth” perceptually aligned gradients may rely on the fact that adversarially trained models uphold PAG (Tsipras et al., 2019). One can use the input-gradients of such trained robust models and train a classifier to mimic them, according to Equation 1. More precisely, according to this realization, one can set g(x, ŷ) = ∇ₓ f_θᵣ(x)_ŷ, where f_θᵣ is an adversarially robust classifier. This way, we distill the PAG property of a robust classifier into our model. Similar ideas were explored in the context of robust knowledge distillation in (Chan et al., 2020; Shao et al., 2021; Sarkar et al., 2021). We discuss their connections to and differences from our work in Appendix B.

At first glance, using such gradients in our objective seems different from adversarial training – while the latter relies on training the model to classify perturbed images correctly, our approach trains solely on clean images from the dataset, as in vanilla training. However, a more careful inspection reveals that despite the clear benefits of this approach, it may leak the desired robustness properties into the trained model. This, in turn, makes testing the robustness of this model a form of circular reasoning, a logical fallacy. To better understand this crucial point, one can view this training approach as performing function approximation of the robust classifier f_θᵣ via a first-order Taylor series. More specifically, at the zeroth order, we train the classifier to correctly classify training samples – an approximation of the outputs of the robust classifier. At the first order, we encourage the model to have gradients similar to those of the robust one. Thus, overall, since this technique pushes the trained model to mimic the behavior of the robust classifier, it cannot be utilized as a valid way of answering our titular question, and we use it only as an empirical performance upper bound. Therefore, to ensure satisfactory conditions for assessing whether PAG implies robustness, we explore alternative sources of ground-truth gradients that do not stem, explicitly or implicitly, from adversarially trained models.

4.2 Target Class Representatives

As explained above, we aim to explore “ground truth” gradients that promote PAG without relying on robust models. To this end, we adopt the following simple premise: the gradient should point towards the general direction of images of the target class ŷ. Therefore, given a representative r_ŷ of the target class, we set the gradient to point away from the current image and towards the representative, i.e., g(x, ŷ) = r_ŷ − x. This general heuristic, visualized in Figure 2, can be manifested in various ways, of which we consider the following:

One Image: Each representative should be chosen to reflect the visual features of its respective class. The simplest such choice that comes to mind is to pick r_ŷ as an arbitrary training set image with label ŷ, and use it as a global destination of ŷ-targeted gradients. This one image approach satisfies the abstract requirements and provides simplicity, but it introduces a strong bias towards the arbitrarily chosen representative image, without considering the target class as a whole.

Class Mean: In order to reduce the bias towards a single image, we can set r_ŷ to be the mean of all the training images with label ŷ. This mean can be multiplied by a constant in order to obtain an image-like norm. However, the class mean approach suffers from a clear limitation: a class’ image distribution can be highly multimodal, possibly reducing its mean to a non-informative image.

Nearest Neighbor: As a possibly better trade-off, we examine a nearest neighbor (NN) approach – for each image x and each target class ŷ, we set the class representative (now dependent on the image) to be the image’s NN amongst a limited set of samples from class ŷ, using the ℓ2 distance in pixel space. More formally, we define

    r_ŷ(x) = argmin_{x′ ∈ S_ŷ} ‖x − x′‖₂,

where S_ŷ is the set of sample images with class ŷ. In practice, we take S_ŷ to be a small number of i.i.d. training set images with label ŷ.
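The three representative choices above can be sketched in a few lines; the variable names and toy data are our own, and this is an illustrative sketch rather than the paper's code. In each case, the resulting ground-truth gradient is formed as g(x, ŷ) = r_ŷ − x.

```python
import numpy as np

def one_image(images, labels, target, rng):
    # One Image: an arbitrary training image of the target class serves as
    # a global destination for all target-class gradients.
    idx = rng.choice(np.flatnonzero(labels == target))
    return images[idx]

def class_mean(images, labels, target, scale=1.0):
    # Class Mean: the mean of all target-class images (optionally rescaled
    # to an image-like norm); may be uninformative for multimodal classes.
    return scale * images[labels == target].mean(axis=0)

def nearest_neighbor(x, images, labels, target, rng, k=8):
    # Nearest Neighbor: r depends on x itself -- the closest (in L2 pixel
    # distance) member of a small i.i.d. sample S of target-class images.
    pool = images[labels == target]
    S = pool[rng.choice(len(pool), size=min(k, len(pool)), replace=False)]
    return S[np.argmin(np.linalg.norm(S - x, axis=1))]

rng = np.random.default_rng(0)
images = rng.normal(size=(100, 32))      # flattened toy "images"
labels = np.tile(np.arange(10), 10)      # ten classes, ten images each
x = images[0]
g = nearest_neighbor(x, images, labels, target=3, rng=rng) - x  # g(x, y_hat)
```

The same three helpers would be precomputed once per dataset (or per epoch, for the NN variant) and plugged into Equation 1 as the ground-truth gradients.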

We test the aforementioned class representative approaches empirically in Section 5, owing to their relative simplicity and alignment with the abstract requirements on the gradients. However, we recognize that these approaches may oversimplify the desired behavior and may not be ideal. We therefore encourage the pursuit of more advanced options in future work. For example, the above two options (class mean and nearest neighbor) could be merged and extended by a preliminary clustering of each class subset into several sub-clusters, and an assignment of the NN amongst the obtained cluster means. Nevertheless, we choose not to explore this option further, as it is more complex, losing much of the appeal and simplicity of the options described above.

Figure 2: An illustration of the proposed creation of perceptually meaningful gradients for the training process as given in Eq. 1.

5 Experimental Results

In this section we empirically assess whether promoting PAG during classifier training improves its adversarial robustness at test time. We experiment using both synthetic and real datasets and present our findings in Section 5.1 and Section 5.2, respectively.

5.1 A Toy Dataset

To illustrate and better understand the proposed approach and its effects, we experiment with a synthetic 2-dimensional dataset and compare our nearest neighbor approach with the vanilla training scheme that minimizes the cross-entropy loss. We construct a dataset consisting of 6,000 samples from two classes, where each class contains exactly 3,000 examples, and the manifold assumption holds – the data resides on a lower-dimensional manifold (Ruderman, 1994). We train a classifier (a two-layer fully-connected network) twice: once with our nearest neighbor method, and once without it. We then examine the obtained accuracies and visualize the decision boundaries. While both methods reach perfect accuracy over the test set, the obtained decision boundaries differ substantially, as can be seen in Figure 5. The baseline training method, shown in Figure 5(a), yields dimpled manifolds as decision boundaries, as hypothesized in (Shamir et al., 2021). According to their hypothesis, since the decision boundary of a DNN is very close to the data manifold, adversarial examples exist very close to the image manifold, exposing the model to malicious perturbations.

In contrast, as can be seen in Figure 5(b), the margin between the data samples and the decision boundary obtained using our approach is significantly larger than the baseline’s. This observation helps explain the following robustness result: our model achieves substantially higher accuracy under a simple adversarial PGD attack, whereas the baseline model’s accuracy collapses. We provide additional details regarding this experiment in the supplementary material. Note that the notion of “perceptually aligned” gradients admits a very clear meaning in the context of our 2-dimensional experiment – faithfulness to the known data manifold. In the baseline approach, the gradients used in adversarial attacks point towards close areas of the opposite class, orthogonal to the data manifold. Thus, such gradients deviate from the data behavior and are not aligned with its distribution. In contrast, in our approach, due to the absence of close orthogonal misclassified areas, adversarial gradients tend to be more aligned with the data manifold, making them perceptually meaningful.

(a) Vanilla Training Scheme
(b) Our Training Scheme
Figure 5: Visualization of the decision boundary on a synthetic two-class dataset – the points are the test samples, and the background color represents the predicted class. Figures (a) and (b) present the decision boundaries of the vanilla training method and ours, respectively. Our approach increases the margin between the instances and the decision boundary, yielding improved robustness.
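For intuition, a synthetic dataset of this kind, together with nearest-neighbor target gradients, can be generated as follows. The specific manifolds (two noisy sine curves) are our own stand-in; the paper's exact construction is described in its supplementary material.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000  # samples per class

# Two classes living on (noisy) 1-D manifolds embedded in the 2-D plane,
# so the manifold assumption holds. This is an illustrative stand-in for
# the paper's synthetic data, not its exact construction.
t0, t1 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
class0 = np.stack([t0, np.sin(t0) + 1.0], axis=1) + 0.05 * rng.normal(size=(n, 2))
class1 = np.stack([t1, np.sin(t1) - 1.0], axis=1) + 0.05 * rng.normal(size=(n, 2))
X = np.concatenate([class0, class1])
y = np.concatenate([np.zeros(n, int), np.ones(n, int)])

def nn_gradient(x, c, k=16):
    # Nearest-neighbor "ground truth" gradient for sample x toward class c:
    # points from x to its closest member of a small class-c sample.
    S = X[y == c][rng.choice(n, size=k, replace=False)]
    r = S[np.argmin(np.linalg.norm(S - x, axis=1))]
    return r - x

g = nn_gradient(X[0], c=1)  # target gradient pushing X[0] toward class 1
```

Training on such target gradients encourages the classifier's input-gradients to point across the gap between the two manifolds, which is consistent with the enlarged margin observed in Figure 5(b).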

5.2 Experimenting with Real Datasets

With the encouraging findings presented in Section 5.1, we now turn to conduct thorough experiments to verify whether promoting PAG can indeed lead to improved adversarial robustness on real datasets – CIFAR-10 (Krizhevsky et al., 2014) and STL (Coates et al., 2011). In order to make a well-founded empirical claim, we explore the several sources of “ground truth” PAGs proposed in Section 4.1 and Section 4.2.

To verify whether promoting PAG implies robustness, we first validate that such a phenomenon (i.e., PAG) occurs when using our method, and then we test the performance of such models under attack. We start by training a classifier and qualitatively examining whether our approach leads to perceptually aligned gradients. More specifically, we probe whether modifying an image to maximize a certain class probability, estimated by the model, leads to a meaningful semantic change. We then assess the performance of our method using two main metrics – clean accuracy and adversarial robustness, under both the ℓ2 and the ℓ∞ threat models. More specifically, we use a 20-step PGD as our adversarial attack; the attack radii and additional implementation details are provided in the supplementary material. If our method attains improved robust accuracy compared to standard training, we provide evidence that promoting PAG can improve robustness.
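For reference, the robustness evaluation is a standard k-step projected gradient ascent; the sketch below implements an ℓ∞ PGD loop against a toy binary logistic-regression model whose loss gradient is analytic, so it runs without an autodiff framework. The ε, step size, and model are placeholder choices, not the paper's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(x, y, w, b, eps=0.1, steps=20, alpha=0.02):
    """k-step untargeted L_inf PGD against a binary logistic regression.

    For loss L = -log p(y|x) with logit z = w.x + b, the input-gradient is
    (p - y) * w, so we can ascend it analytically. With a real network,
    the gradient step would come from autodiff instead.
    """
    x_adv = x.copy()
    for _ in range(steps):
        grad = (sigmoid(w @ x_adv + b) - y) * w    # dL/dx
        x_adv = x_adv + alpha * np.sign(grad)      # gradient-sign ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # project into the eps-ball
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.normal(size=10), 0.0
x, y = rng.normal(size=10), 1
x_adv = pgd_linf(x, y, w, b)
```

An ℓ2 variant would replace the sign step by a gradient-norm-scaled step and the per-coordinate clip by a projection onto the ℓ2 ball of radius ε.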

In all the conducted experiments, we train a ResNet-18 (He et al., 2015) classifier to minimize Equation 1. To validate whether promoting PAG indeed implies adversarial robustness, we compare it with “vanilla” training using the same hyperparameters (additional implementation details are listed in the supplementary material). We train the model using the two main ground-truth gradient sources listed in Section 4 – target class representatives and Robust Input-Gradient Distillation (RIGD). While the methods of the first source – One Image (OI), Class Mean (CM), and Nearest Neighbor (NN) – are valid options for investigating the research question, RIGD serves only as an empirical upper bound, due to the logical fallacy detailed in Section 4.1.

Results: As explained above, we determine that a method promotes PAG if the pixel-space modifications done while maximizing the conditional probability induced by a classifier align with human perception. We show in Figure 8 that while vanilla models do not exhibit semantically meaningful changes, our approach does. In addition, the modifications obtained by our method are similar to those of adversarially trained models, visualized in Figure 1, and both contain rich class-related information. Note that in our method, we only train the model to have point-wise gradients aligned with some “ground truth” ones. Surprisingly, however, it is able to guide the iterative maximization process towards semantically meaningful modifications, although it was never trained on these intermediate points. In other words, although trained to align its gradients with the ground-truth ones only at the data points, the model generalizes to have meaningful gradients far beyond them.

We proceed by quantitatively evaluating the performance on clean and adversarial versions of the test set, as mentioned above, and show our results in Tables 1 and 2. All the tested PAG-inducing techniques improve the adversarial robustness substantially, while maintaining competitive clean accuracy. While the vanilla baseline is utterly vulnerable to adversarial examples, introducing a PAG-inducing auxiliary objective robustifies it without performing robust training. This surprising finding strongly suggests that promoting PAG can improve the classifier’s robustness. As the results indicate, our method performs better in the ℓ2 case than in the ℓ∞ one. We hypothesize that this stems from the Euclidean nature of the cosine similarity loss used to penalize the model gradients. We emphasize that while stronger robustification methods rely on training to correctly classify (adversarially) perturbed images, our scheme achieves significant robustness solely by promoting PAG.

6 Conclusion and Future Work

While previous works demonstrate that adversarially robust models uphold the Perceptually Aligned Gradients property, in this work we investigate the reverse question – Do Perceptually Aligned Gradients Imply Adversarial Robustness? We believe that answering this question sheds additional light on the connection between robust models and PAG. To empirically show that inducing PAG improves classifier robustness, we develop a novel generic optimization loss that promotes PAG without relying on robust models or adversarial training, and test several manifestations of it. Our findings suggest that all the tested PAG-promoting methods improve adversarial robustness compared to a vanilla model, while still maintaining adequate clean accuracy. Nevertheless, the obtained robustness still falls behind that of state-of-the-art models, which possibly stems from the oversimplified realizations of the “ground-truth” PAGs. Therefore, we believe that improving these realizations would be a key factor for the continued success of the proposed training objective in future work.

Table 1: Accuracy scores (clean “No Attack” and under attack) on the CIFAR-10 dataset using the ResNet-18 architecture, for the One Image, Class Mean, and Nearest Neighbor methods.

Table 2: Accuracy scores (clean “No Attack” and under attack) on the STL dataset using the ResNet-18 architecture, for the One Image, Class Mean, and Nearest Neighbor methods.

Figure 8: Perceptually Aligned Gradients phenomenon demonstrated by models trained with vanilla training (top), our method (middle), and our gradient distillation baseline (bottom), all using ResNet-18: (a) CIFAR-10 visual results; (b) STL visual results.

7 Acknowledgements

This research was partially supported by the Israel Science Foundation (ISF) under Grant 335/18, the Israeli Council for Higher Education – Planning & Budgeting Committee, and the Stephen A. Kreynes Fellowship.


  • G. Aggarwal, A. Sinha, N. Kumari, and M. K. Singh (2020) On the benefits of models with perceptually-aligned gradients. ArXiv abs/2005.01499. Cited by: §1, §2.
  • M. Andriushchenko and N. Flammarion (2020) Understanding and improving fast adversarial training. In NeurIPS, Cited by: §A.2.
  • A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2017) Synthesizing robust adversarial examples. arXiv. Cited by: §A.1, §1.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. Lecture Notes in Computer Science, pp. 387–402. External Links: ISBN 9783642387098, ISSN 1611-3349 Cited by: §A.1, §1.
  • N. Carlini and D. Wagner (2017a) Adversarial examples are not easily detected: bypassing ten detection methods. arXiv. Cited by: §A.1, §1.
  • N. Carlini and D. Wagner (2017b) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), Vol. , pp. 39–57. Cited by: §A.1, §1.
  • A. Chan, Y. Tay, and Y. Ong (2020) What it thinks is important is important: robustness transfers through input gradients. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    pp. 332–341. Cited by: Appendix B, §4.1.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In

    Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics

    , G. Gordon, D. Dunson, and M. Dudík (Eds.),
    Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 215–223. Cited by: §5.2.
  • J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. arXiv. Cited by: §1.
  • J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1310–1320. Cited by: §3.
  • F. Croce and M. Hein (2020) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • S. Dodge and L. Karam (2017) A study and comparison of human and deep learning recognition performance under visual distortions. arXiv. Cited by: §1.
  • Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 9185–9193. Cited by: §A.1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv. Cited by: §1.
  • A. Elliott, S. Law, and C. Russell (2021) Explaining classifiers using adversarial perturbations on the perceptual ball. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • L. Engstrom, A. Ilyas, H. Salman, S. Santurkar, and D. Tsipras (2019a) Robustness (python library). External Links: Link Cited by: §C.2.
  • L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, and A. Madry (2019b) Adversarial robustness as a prior for learned representations. arXiv: Machine Learning. Cited by: §2.
  • C. Etmann, S. Lunz, P. Maass, and C. Schoenlieb (2019) On the connection between adversarial robustness and saliency map interpretability. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1823–1832. Cited by: §2.
  • C. Etmann, S. Lunz, P. Maass, and C. Schoenlieb (2019) On the connection between adversarial robustness and saliency map interpretability. In International Conference on Machine Learning, pp. 1823–1832. Cited by: Appendix B.
  • C. Finlay and A. M. Oberman (2021) Scaleable input gradient regularization for adversarial robustness. Machine Learning with Applications 3, pp. 100017. Cited by: Appendix B.
  • R. Ganz and M. Elad (2021) BIGRoC: boosting image generation via a robust classifier. CoRR abs/2108.03702. External Links: 2108.03702 Cited by: §2.
  • R. Geirhos, D. H. J. Janssen, H. H. Schütt, J. Rauber, M. Bethge, and F. A. Wichmann (2017) Comparing deep neural networks against humans: object recognition when the signal gets weaker. arXiv. Cited by: §1.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, ICLR, Cited by: §A.1, §A.1, §A.2, §1, §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv. Cited by: §1, §5.2.
  • H. Hosseini, B. Xiao, and R. Poovendran (2017) Google’s cloud vision api is not robust to noise. arXiv. Cited by: §1.
  • L. Huang, C. Zhang, and H. Zhang (2020) Self-adaptive training: beyond empirical risk minimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §A.2.
  • D. Jakubovitz and R. Giryes (2018) Improving dnn robustness to adversarial attacks using jacobian regularization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 514–529. Cited by: Appendix B, §3.
  • S. Kaur, J. Cohen, and Z. Lipton (2019) Are perceptually-aligned gradients a general property of robust classifiers? Cited by: §1, §2.
  • A. Krizhevsky, V. Nair, and G. Hinton (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html. Cited by: §5.2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, Red Hook, NY, USA, pp. 1097–1105. Cited by: §1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. Cited by: §A.1, §1.
  • M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2018) Certified robustness to adversarial examples with differential privacy. arXiv. Cited by: §1.
  • B. Li, C. Chen, W. Wang, and L. Carin (2018) Certified adversarial robustness with additive noise. arXiv. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §A.1, §A.2, §A.2, §1, §3.
  • A. M. Nguyen, J. Yosinski, and J. Clune (2014) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. CoRR abs/1412.1897. External Links: 1412.1897 Cited by: §A.1, §1.
  • T. Pang, X. Yang, Y. Dong, T. Xu, J. Zhu, and H. Su (2020) Boosting adversarial training with hypersphere embedding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §A.2.
  • C. Qin, J. Martens, S. Gowal, D. Krishnan, K. Dvijotham, A. Fawzi, S. De, R. Stanforth, and P. Kohli (2019) Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 13824–13833. Cited by: §A.2.
  • A. Ross and F. Doshi-Velez (2018a) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: Appendix B.
  • A. S. Ross and F. Doshi-Velez (2017) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv. Cited by: §3.
  • A. S. Ross and F. Doshi-Velez (2018b) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 1660–1669. Cited by: §2.
  • D. L. Ruderman (1994) The statistics of natural images. Network: Computation in Neural Systems 5 (4), pp. 517–548. Cited by: §5.1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §1.
  • H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv. Cited by: §1.
  • S. Santurkar, D. Tsipras, B. Tran, A. Ilyas, L. Engstrom, and A. Mądry (2019) Image synthesis with a single (robust) classifier. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Cited by: §2.
  • A. Sarkar, A. Sarkar, S. Gali, and V. N Balasubramanian (2021) Get fooled for the right reason: improving adversarial robustness through a teacher-guided curriculum learning approach. In Advances in Neural Information Processing Systems, Vol. 34, pp. 12836–12848. Cited by: Appendix B, §4.1.
  • A. Shamir, O. Melamed, and O. BenShmuel (2021) The dimpled manifold model of adversarial examples in machine learning. arXiv. Cited by: §5.1.
  • R. Shao, J. Yi, P. Chen, and C. Hsieh (2021) How and when adversarial robustness transfers in knowledge distillation?. arXiv preprint arXiv:2110.12072. Cited by: Appendix B, §4.1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, ICLR, Cited by: §A.1, §1.
  • D. Temel and G. AlRegib (2018) Traffic signs in the wild: highlights from the IEEE video and image processing cup 2017 student competition [SP competitions]. IEEE Signal Processing Magazine 35 (2), pp. 154–161. Cited by: §1.
  • D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib (2017) CURE-tsr: challenging unreal and real environments for traffic sign recognition. Cited by: §1.
  • D. Temel, J. Lee, and G. AlRegib (2018) CURE-or: challenging unreal and real environments for object recognition. Cited by: §1.
  • F. Tramer, N. Carlini, W. Brendel, and A. Madry (2020) On adaptive attacks to adversarial example defenses. arXiv. Cited by: §1.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations, ICLR, Cited by: §1, §2, §4.1.
  • Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2020) Improving adversarial robustness requires revisiting misclassified examples. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §A.2, §1.
  • C. Xie, Y. Wu, L. van der Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 501–509. Cited by: §A.2.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 7472–7482. Cited by: §A.2, §1.

Appendix A Background

A.1 Adversarial Examples

We consider a deep learning-based classifier $f_\theta: \mathbb{R}^d \rightarrow \mathbb{R}^C$, where $d$ is the data dimension and $C$ is the number of classes. Adversarial examples are instances designed by an adversary in order to cause a false prediction by $f_\theta$ (Athalye et al., 2017; Biggio et al., 2013; Carlini and Wagner, 2017a; Goodfellow et al., 2015; Kurakin et al., 2017; Nguyen et al., 2014; Szegedy et al., 2014). Szegedy et al. (2014) discovered the existence of such samples and showed that it is possible to cause misclassification of an image with an imperceptible perturbation, obtained by maximizing the network’s prediction error. Such samples are crafted by applying modifications drawn from a threat model $\Delta$ to real natural images. Hypothetically, the “ideal” threat model should include all possible label-preserving perturbations, i.e., all modifications to an image that would not change a human observer’s prediction. Unfortunately, it is impossible to rigorously define such a $\Delta$, and thus simple relaxations of it are used, the most common of which are the $\ell_2$- and $\ell_\infty$-balls: $\Delta_{p,\epsilon} = \{\delta : \|\delta\|_p \leq \epsilon\}$ for $p \in \{2, \infty\}$.

More formally, given an input sample $x$, its ground-truth label $y$, and a threat model $\Delta$, a valid adversarial example $\hat{x} = x + \delta$ with $\delta \in \Delta$ satisfies $f_\theta(\hat{x}) \neq y$, where $f_\theta(\hat{x})$ is the prediction of the classifier on $\hat{x}$. The procedure of obtaining such examples is referred to as an adversarial attack. Such attacks can be either untargeted or targeted. Untargeted attacks generate $\hat{x}$ so as to minimize $p_\theta(y \mid \hat{x})$, namely, cause a misclassification without a specific target class. In contrast, targeted attacks aim to craft $\hat{x}$ in a way that maximizes $p_\theta(\hat{y} \mid \hat{x})$, that is to say, fool the classifier into predicting a target class $\hat{y}$.

While there are various techniques for generating adversarial examples (Goodfellow et al., 2015; Carlini and Wagner, 2017b; Dong et al., 2018), in this work we focus on the Projected Gradient Descent (PGD) method (Madry et al., 2018). PGD is an iterative procedure for obtaining adversarial examples that operates as described in Algorithm 1. The operation $\Pi_\Delta$ stands for a projection operator onto $\Delta$, and $\ell$ is the classification loss, usually defined as the cross-entropy:

$$\ell_{CE}\left(f_\theta(x), y\right) = -\log \frac{e^{z_y}}{\sum_{c} e^{z_c}},$$

where $z_y$ and $z_c$ are the classifier’s logits for classes $y$ and $c$, respectively.

Input: classifier $f_\theta$, input $x$, ground-truth class $y$, target class $\hat{y}$, threat model parameter $\epsilon$, step size $\alpha$, number of iterations $T$
$x_0 \leftarrow x$
for $t$ from $0$ to $T-1$ do
       if $\hat{y}$ is not None then
              $x_{t+1} \leftarrow \Pi_{\Delta}\left(x_t - \alpha \nabla_{x_t}\ell\left(f_\theta(x_t), \hat{y}\right)\right)$
       else
              $x_{t+1} \leftarrow \Pi_{\Delta}\left(x_t + \alpha \nabla_{x_t}\ell\left(f_\theta(x_t), y\right)\right)$
       end if
end for
return $x_T$
Algorithm 1 Projected Gradient Descent
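The loop in Algorithm 1 can be made concrete with a minimal sketch. The snippet below is an illustration of untargeted $\ell_\infty$ PGD for a linear softmax classifier, whose input-gradient has a closed form ($\nabla_x \ell_{CE} = W^\top(\mathrm{softmax}(Wx+b) - \mathrm{onehot}(y))$); it is not the paper's implementation, and all names are ours.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def logits(W, b, x):
    # z_c = <W_c, x> + b_c for a linear classifier
    return [sum(wc[i] * x[i] for i in range(len(x))) + bc
            for wc, bc in zip(W, b)]

def input_grad(W, b, x, y):
    # d CE / d x = W^T (softmax(z) - onehot(y))
    p = softmax(logits(W, b, x))
    p[y] -= 1.0
    return [sum(W[c][i] * p[c] for c in range(len(b)))
            for i in range(len(x))]

def pgd_linf(W, b, x, y, eps, alpha, steps):
    """Untargeted l_inf PGD: ascend the loss with sign steps,
    then project back onto the eps-ball around the clean input x."""
    x_adv = list(x)
    for _ in range(steps):
        g = input_grad(W, b, x_adv, y)
        x_adv = [v + alpha * ((g_i > 0) - (g_i < 0))   # sign step
                 for v, g_i in zip(x_adv, g)]
        x_adv = [min(max(v, x_i - eps), x_i + eps)     # projection onto Delta
                 for v, x_i in zip(x_adv, x)]
    return x_adv

def predict(W, b, x):
    z = logits(W, b, x)
    return max(range(len(z)), key=z.__getitem__)
```

For example, with $W = [[1, 0], [-1, 0]]$ and $b = [0, 0]$, the clean point $x = [0.1, 0]$ is predicted as class 0, while its PGD perturbation with $\epsilon = 0.3$ crosses the decision boundary.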

A.2 Adversarial Training

Adversarial training (AT) (Madry et al., 2018) is a learning procedure that aims to obtain adversarially robust classifiers. A classifier is considered adversarially robust if applying small adversarial perturbations to its input does not change its label prediction (Goodfellow et al., 2015). Such classifiers can be obtained by solving the following optimization problem:

$$\min_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in \Delta} \, \ell\left(f_\theta(x + \delta), y\right) \right]$$

Intuitively, the above optimization trains the classifier to accurately predict the class labels of the hardest perturbed images allowed by the threat model $\Delta$. Ideally, $\ell$ is the 0-1 loss, i.e., $\ell_{0\text{-}1}(f_\theta(x), y) = \mathbb{1}\left[f_\theta(x) \neq y\right]$, where $\mathbb{1}$ is the indicator function. Nevertheless, since the 0-1 loss is not differentiable, the cross-entropy loss, defined in Equation 4, is used as a surrogate. In practice, solving this min-max optimization problem is challenging, and there are several ways to obtain an approximate solution. The simplest yet effective method approximates the solution of the inner maximization via adversarial attacks, such as PGD (Madry et al., 2018). According to this strategy, the above optimization is performed iteratively: first fixing the classifier’s parameters $\theta$ and optimizing the perturbation $\delta$ for each example via PGD, and then fixing $\delta$ and updating $\theta$. Repeating these steps results in a robust classifier. Since its introduction by Madry et al. (2018), various improvements to adversarial training have been proposed (Andriushchenko and Flammarion, 2020; Huang et al., 2020; Pang et al., 2020; Qin et al., 2019; Xie et al., 2019; Zhang et al., 2019; Wang et al., 2020), yet in this work we focus mainly on the basic AT scheme (Madry et al., 2018) for its simplicity and generality.
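As an illustration of this alternating min-max scheme, the following self-contained sketch adversarially trains a tiny linear softmax classifier on 2-D points, with a hand-derived $\ell_\infty$ PGD inner step. This is a toy illustration of the recipe above, not the paper's training code; all function names and hyperparameters here are ours.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def grads(W, b, x, y):
    """Cross-entropy gradients of a linear classifier z = Wx + b
    w.r.t. the parameters (dW, db) and the input (dx)."""
    z = [sum(wc[i] * x[i] for i in range(len(x))) + bc
         for wc, bc in zip(W, b)]
    d = softmax(z)
    d[y] -= 1.0                          # softmax(z) - onehot(y)
    dW = [[d[c] * x[i] for i in range(len(x))] for c in range(len(b))]
    dx = [sum(W[c][i] * d[c] for c in range(len(b))) for i in range(len(x))]
    return dW, d, dx

def pgd(W, b, x, y, eps, alpha, steps):
    # inner maximization: l_inf PGD on the input
    x_adv = list(x)
    for _ in range(steps):
        _, _, dx = grads(W, b, x_adv, y)
        x_adv = [v + alpha * ((g > 0) - (g < 0)) for v, g in zip(x_adv, dx)]
        x_adv = [min(max(v, xi - eps), xi + eps) for v, xi in zip(x_adv, x)]
    return x_adv

def adversarial_train(data, num_classes, eps, alpha, steps, lr, epochs):
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd(W, b, x, y, eps, alpha, steps)  # attack current model
            dW, db, _ = grads(W, b, x_adv, y)           # outer minimization step
            W = [[W[c][i] - lr * dW[c][i] for i in range(dim)]
                 for c in range(num_classes)]
            b = [b[c] - lr * db[c] for c in range(num_classes)]
    return W, b
```

On two well-separated Gaussian blobs, the classifier produced this way keeps its predictions under the same PGD attack it was trained against, which is exactly what the min-max objective asks for.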

Appendix B Related Work

In Section 4.1 we demonstrate how robust input-gradient distillation can promote adversarial robustness in the trained model. This phenomenon has also been observed and harnessed in the context of Robust Knowledge Distillation by several recent papers (Chan et al., 2020; Shao et al., 2021; Sarkar et al., 2021). In these works, a student classifier is trained to have gradients similar to those of a robust teacher model, where similarity is defined either as cosine similarity or as the inability of a discriminator network to distinguish between the gradients of the two models. These works differ from ours both in their loss functions and in their specific utilization of the teacher model, but essentially, they all demonstrate how distilling knowledge from a robust teacher can induce adversarial robustness in the student. While successful in their respective tasks, these methods are not suitable for assessing whether perceptually aligned gradients inherently promote robustness, as they implicitly rely on prior adversarial training, as explained in Section 4.1.

In addition, recent works have explored properties of input-gradients that improve adversarial robustness. Jakubovitz and Giryes (2018) demonstrate that regularizing the Frobenius norm of a classifier’s Jacobian to be small improves robustness. Such a method is equivalent to regularizing the norm of each input-gradient to be small, similar to (Ross and Doshi-Velez, 2018a; Finlay and Oberman, 2021). The work of Etmann et al. (2019) follows suit and considers the cosine similarity between a classifier’s input-gradient w.r.t. the ground-truth class and the input image itself, observing a positive correlation between this similarity and the classifier’s adversarial robustness. Nevertheless, none of these works promotes or exhibits perceptually aligned gradients. In contrast, our work tests the relation between the alignment of gradients with human perception and adversarial robustness, and presents several PAG-promoting methods that induce improved robustness.
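The alignment score considered by Etmann et al. can be sketched directly. The snippet below is a generic illustration (the vectors passed in are stand-ins; in practice `input_grad` would be a classifier's input-gradient flattened to a vector, which is not computed here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two flattened vectors, e.g. a classifier's
    input-gradient w.r.t. the ground-truth class and the input image."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment(input_grad, image):
    # Etmann et al. observe that this score correlates positively with
    # adversarial robustness: the larger it is, the more the gradient
    # "points toward" the image content rather than noise.
    return cosine_similarity(input_grad, image)
```

A gradient parallel to the image scores 1, and a noise-like gradient orthogonal to it scores 0, so the score directly quantifies how image-like the gradient is.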

Appendix C Implementation Details

C.1 Toy Dataset


Data: We experiment with our approach on a 2-dimensional synthetic dataset to demonstrate its effects. As explained in the corresponding section in the paper, we construct a 2-class dataset whose data points reside on a straight line. Each class contains three modes, and each mode contains 1000 samples drawn from a Gaussian distribution centered at the corresponding mode center.
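For illustration, this construction can be sketched as follows; the mode centers and spread used below are hypothetical placeholders, not the paper's exact values.

```python
import random

def make_toy_dataset(centers_by_class, n_per_mode=1000, sigma=0.05, seed=0):
    """2-class dataset on a straight line: each class is a mixture of
    Gaussian modes; samples are embedded as 2-D points (t, 0)."""
    rng = random.Random(seed)
    data = []
    for label, centers in enumerate(centers_by_class):
        for mu in centers:
            for _ in range(n_per_mode):
                data.append(((rng.gauss(mu, sigma), 0.0), label))
    return data

# hypothetical, interleaved mode centers (three per class); sigma above
# is likewise an assumed spread
example_centers = [[-1.0, 0.0, 1.0], [-0.5, 0.5, 1.5]]
```

Calling `make_toy_dataset(example_centers)` yields 3000 labeled points per class, all lying on the line $y = 0$.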

Architecture and Training: We use a 2-layer fully-connected network with ReLU non-linearity. We train it twice – once with standard cross-entropy training and once with our proposed method with the NN realization – using the Adam optimizer and the same number of epochs, batch size, learning rate, and seed for both training processes.

Computational Resources: We use a single GPU via the Google Colab service.

Evaluation: As detailed in the paper, we test the performance of the models using clean and adversarial evaluation. For the clean one, we draw 600 test samples from the same distribution as the train set and measure the accuracy. As for the adversarial one, we use an $\ell_2$-based 10-step PGD attack. Note that $\epsilon$ is chosen such that, in our setting, the allowed threat model is too small to actually change a sample of one class into the other, making it a valid threat model.

C.2 Real Datasets

Data: For our real-dataset experiments, we use CIFAR-10 and STL, which contain images of size $32 \times 32$ and $96 \times 96$, respectively. For each realization, before the training procedure, we construct a dataset by computing targeted gradients for each training sample (50,000 for CIFAR-10 and 5,000 for STL) for reproducibility and consistency purposes. While our target-class-representative methods are model-free, RIGD requires a robust classifier from which to distill its targeted gradients. For CIFAR-10, we use publicly available ResNet-50 models (Engstrom et al., 2019a), adversarially trained against $\ell_2$ and $\ell_\infty$ attacks. For STL, due to the lack of pretrained models, we adversarially train two ResNet-18 classifiers ourselves, using the $\ell_2$ and $\ell_\infty$ threat models.

Training: For both datasets, we train a ResNet-18 using SGD with momentum and weight decay. In addition, we use the standard augmentations for these datasets – random cropping with padding and random horizontal flipping. The batch size is set per dataset. We present in Table 3 the best choices of the coefficient of our PAG-promoting auxiliary loss term for all the tested datasets and methods.

Method    Value
One Image
Class Mean
Nearest Neighbor
Table 3: Values of the PAG-loss coefficient hyperparameter.

Computational Resources: We use a single NVIDIA A40 GPU for each experiment.

Evaluation: We use the TRADES code repository for adversarial attacks and extend it to contain an $\ell_2$ attack, in addition to the existing $\ell_\infty$ one. For assessing the adversarial robustness, we use multi-step PGD with random initialization.

To validate that our training method is stable and consistent, we run the CIFAR-10 experiments with the One Image method three times with different seeds and report the results in Table 4 below. As the results indicate, our approach consistently leads to improved robustness.

Table 4: Error-bar evaluation on CIFAR-10 using One Image (clean and attacked settings).

Appendix D Ablation Study

In this section, we test the effect of different values of the coefficient of the PAG-promoting auxiliary loss on the CIFAR-10 dataset, using the Class Mean method. This coefficient is a crucial hyperparameter, as it trades off clean accuracy against the PAG property, and thus changes the attained levels of PAG, robust accuracy, and clean accuracy. We summarize the results of this ablation in Figure 9. As can be seen, increasing the coefficient generally leads to more robust models whose gradients are better aligned with human perception, albeit at the cost of clean accuracy. We hypothesize that more sophisticated realizations of “ground-truth” PAG gradients could mitigate this tradeoff between accuracy and PAG.

Figure 9: Quantitative and qualitative results for different values of the PAG-loss coefficient on CIFAR-10 using Class Mean.