The Jacobian-based Saliency Map Attack (JSMA) is a family of adversarial attack methods [2, 3] for fooling classification models, such as deep neural networks for image classification tasks. By saturating a few pixels in a given image to their maximum or minimum values, JSMA can cause the model to misclassify the resulting adversarial image as a specified erroneous target class. We propose two variants of JSMA, one which removes the requirement to specify a target class, and another that additionally does not need to specify whether to only increase or decrease pixel intensities. Our experiments highlight the competitive speed and quality of these variants when applied to datasets of hand-written digits and natural scenes.
II Jacobian-based Saliency Map Attack (JSMA)
Saliency maps were originally conceived for visualizing the prediction process of classification models [4]. The map rates each input feature $x_i$ (e.g. each pixel; we use the notation $x_i$ to denote the $i$-th element of the vector $x$) on how influential it is for causing the model to predict a particular class $c$, where $F_c(x)$ denotes the model's predicted confidence in class $c$.
One formulation of the saliency map, targeting class $t$, is given as:

$$S_+(x, t)[i] = \begin{cases} 0 & \text{if } \frac{\partial F_t(x)}{\partial x_i} < 0 \text{ or } \sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} > 0 \\ \frac{\partial F_t(x)}{\partial x_i} \cdot \left| \sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} \right| & \text{otherwise.} \end{cases}$$

$S_+(x, t)[i]$ measures how much $x_i$ positively correlates with the target class $t$, while also negatively correlating with all other classes $j \neq t$. If either condition is violated, the feature's saliency is reset to zero.
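The zeroing conditions and the product form of this saliency measure can be sketched in a few lines of numpy. The function and toy Jacobian below are illustrative, not the authors' implementation; the Jacobian is assumed to be given as a matrix $J[j, i] = \partial F_j / \partial x_i$:

```python
import numpy as np

def saliency_plus(J, t):
    """S_+[i] for target class t, given the Jacobian J[j, i] = dF_j / dx_i.
    A feature is salient only when the target's gradient is non-negative
    AND the summed gradient over all other classes is non-positive;
    otherwise its saliency is reset to zero."""
    target = J[t]                     # dF_t / dx_i for every feature i
    others = J.sum(axis=0) - J[t]     # sum_{j != t} dF_j / dx_i
    return np.where((target < 0) | (others > 0),
                    0.0, target * np.abs(others))

# Toy Jacobian: feature 0 helps class 0 and hurts class 1; feature 1 reversed.
J = np.array([[ 1.0, -1.0],
              [-0.5,  2.0]])
print(saliency_plus(J, t=0))  # only feature 0 is salient: 1.0 * |-0.5| = 0.5
```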
An attacker can exploit this saliency map by targeting an adversarial class $t$ that does not match the true class label $y$ of a given sample $x$. By increasing a few high-saliency pixels according to $S_+(x, t)$, the modified image will have an increased prediction confidence for the adversarial class $t$, and thus might be misclassified.
Alternatively, one can attack by decreasing feature values based on another saliency map $S_-(x, t)$, which differs from $S_+$ only by the inversion of the zero-saliency inequalities, i.e. $S_-(x, t)[i] = 0$ if $\frac{\partial F_t(x)}{\partial x_i} > 0$ or $\sum_{j \neq t} \frac{\partial F_j(x)}{\partial x_i} < 0$.
In practice, both saliency measures $S_+$ and $S_-$ are overly strict when applied to individual input features, because the summed gradient contribution across all non-targeted classes often triggers the zero-saliency criterion. Thus, the Jacobian-based Saliency Map Attack (JSMA) [1] alters these saliency measures to search over pairs of pixels instead.
Given a unit-normalized input $x \in [0, 1]^n$, JSMA initializes the search domain $\Gamma$ over all input indices. Then, as illustrated in Fig. 1, it finds the most salient pixel pair, perturbs both values (by $+\theta$ if using $S_+$, or $-\theta$ if using $S_-$), and then removes saturated feature indices from $\Gamma$. This process is repeated until either the model misclassifies the perturbed input $x'$ as the target class $t$ or a maximum number of iterations is reached.
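The iterative pixel-pair search above can be sketched as follows. This is a minimal single-direction (JSMA+) sketch under stated assumptions: the Jacobian is approximated by finite differences, the model is a toy linear-softmax classifier, and all names (`num_jacobian`, `jsma_plus`) are illustrative rather than the authors' implementation:

```python
import numpy as np
from itertools import combinations

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def num_jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian J[j, i] = dF_j / dx_i (for illustration;
    a real implementation would use the model's analytic gradients)."""
    J = np.empty((f(x).size, x.size))
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

def jsma_plus(f, x, t, theta=1.0, max_iter=50):
    """Targeted JSMA+ sketch: repeatedly saturate the most salient pixel
    pair until the input is classified as the target class t."""
    x = x.copy()
    domain = set(range(x.size))          # Gamma: indices still perturbable
    for _ in range(max_iter):
        if f(x).argmax() == t:
            return x                     # success: classified as target
        J = num_jacobian(f, x)
        alpha = J[t]                     # dF_t / dx_i
        beta = J.sum(axis=0) - J[t]      # summed non-target gradients
        best, best_score = None, 0.0
        for p, q in combinations(sorted(domain), 2):
            a, b = alpha[p] + alpha[q], beta[p] + beta[q]
            if a > 0 and b < 0 and -a * b > best_score:
                best, best_score = (p, q), -a * b
        if best is None:
            break                        # no salient pair remains
        for i in best:
            x[i] = min(1.0, x[i] + theta)
            if x[i] >= 1.0:
                domain.discard(i)        # drop saturated features
    return x

# Toy 2-class linear model: features 0 and 1 vote for class 0.
W = np.array([[5.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 5.0, 5.0]])
f = lambda x: softmax(W @ x)
x0 = np.array([0.0, 0.0, 0.5, 0.5])       # initially classified as class 1
x_adv = jsma_plus(f, x0, t=0)
print(f(x0).argmax(), f(x_adv).argmax())  # 1 0
```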
Carlini and Wagner [5] proposed an alternative that amplifies the logit $Z_t(x)$ rather than the softmax probability $F_t(x)$. We denote these variants as JSMA+Z, JSMA-Z, JSMA+F, and JSMA-F, based on the choice of increasing ($+\theta$) or decreasing ($-\theta$) feature values, and of using $Z$ versus $F$.
The original authors advocated saturating perturbations ($|\theta| = 1$) to find adversaries with the fewest feature changes (i.e. minimal $\ell_0$ norm). In some domains such as hand-written digits, we anecdotally note that adversaries found with smaller $|\theta|$ had smaller perturbed $\ell_2$ distances and were more perceptually similar, and thus less likely to be detected by humans.
Furthermore, we suggest that the maximum per-feature perturbation ($\ell_\infty$ norm) can optionally be clipped to within an $\epsilon$-neighborhood of the original input to further limit perceptual differences, similar to the BIM attack [6]:

$$x'_i \leftarrow \min\left(1,\; x_i + \epsilon,\; \max\left(0,\; x_i - \epsilon,\; x'_i\right)\right)$$
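This BIM-style projection is a one-liner in numpy; the sketch below assumes unit-normalized pixel values and the helper name `clip_eps` is ours:

```python
import numpy as np

def clip_eps(x_adv, x_orig, eps):
    """Project each perturbed feature into [x_i - eps, x_i + eps],
    then into the valid [0, 1] pixel range (BIM-style clipping)."""
    return np.clip(np.clip(x_adv, x_orig - eps, x_orig + eps), 0.0, 1.0)

x = np.array([0.2, 0.9])
x_adv = np.array([0.9, 0.2])
print(clip_eps(x_adv, x, eps=0.3))  # [0.5 0.6]
```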
III Non-Targeted JSMA (NT-JSMA)
All JSMA variants above must be given a specific target class $t$. This choice affects the speed and quality of the attack, since misclassification as certain classes is harder to attain than others, such as trying to modify certain hand-written digits to look like anything other than their true class.
We propose a non-targeted attack formulation that removes this target-class dependency by having the algorithm decrease the model's prediction confidence of the true class label $y$, instead of increasing the prediction confidence of an adversarial target $t$. As depicted in Fig. 2, the NT-JSMA procedure is realized by swapping the saliency measure employed, i.e. following $S_-(x, y)$ when increasing feature values (NT-JSMA+F / NT-JSMA+Z), or $S_+(x, y)$ when decreasing feature values (NT-JSMA-F / NT-JSMA-Z). This variant also naturally relaxes the success criterion: an adversarial example $x'$ only needs to not be classified as the true class, i.e. $\arg\max_c F_c(x') \neq y$.
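The swap amounts to applying the decreasing-direction map $S_-$ to the true class $y$: a feature is salient for a pixel-increasing non-targeted attack when increasing it lowers $F_y$ while raising the other classes. A minimal sketch (the function name and toy Jacobian are ours):

```python
import numpy as np

def nt_saliency(J, y):
    """Non-targeted saliency for increasing features: rate features whose
    INCREASE lowers confidence in the true class y (S_- applied to y),
    given the Jacobian J[j, i] = dF_j / dx_i."""
    true_grad = J[y]                   # dF_y / dx_i
    others = J.sum(axis=0) - J[y]      # sum_{j != y} dF_j / dx_i
    return np.where((true_grad > 0) | (others < 0),
                    0.0, np.abs(true_grad) * others)

# Toy Jacobian: increasing feature 0 hurts class 0 and helps class 1.
J = np.array([[-0.8, 0.4],
              [ 0.6, 0.1]])
print(nt_saliency(J, y=0))  # feature 0 salient (0.8 * 0.6); feature 1 zeroed
```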
IV Maximal JSMA (M-JSMA)
In addition to alleviating the need to specify a target class $t$, we can further remove the need to specify whether to only increase or decrease feature values. The resulting Maximal Jacobian-based Saliency Map Attack (M-JSMA) combines targeted and non-targeted strategies and considers both the increase and decrease of feature values.
As shown in Algorithm 1 (which contrasts with the original JSMA+F: blue text denotes additions for M-JSMA and red text denotes omitted parts of JSMA+F), at each iteration the most salient pixel pair is chosen over every possible class $c$, whether adversarial or not. Also, instead of enforcing the zero-saliency conditions of $S_+$ or $S_-$, we simply identify the most salient pair according to either map, and decide on the perturbation direction accordingly. An additional history vector is added to prevent oscillatory perturbations. As for NT-JSMA, M-JSMA terminates when the predicted class for $x'$ no longer matches the true class.
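The joint selection over classes, maps, and directions can be sketched as below. This is a single-feature simplification (the full algorithm scores pixel pairs) with a toy 3-class Jacobian; all names here are illustrative:

```python
import numpy as np

def s_plus(J, c):
    """S_+ for class c, given J[j, i] = dF_j / dx_i."""
    tgt, oth = J[c], J.sum(axis=0) - J[c]
    return np.where((tgt < 0) | (oth > 0), 0.0, tgt * np.abs(oth))

def s_minus(J, c):
    """S_- for class c: inverted zeroing conditions."""
    tgt, oth = J[c], J.sum(axis=0) - J[c]
    return np.where((tgt > 0) | (oth < 0), 0.0, np.abs(tgt) * oth)

def m_jsma_step(J):
    """Pick the single most salient (feature, direction) over every class
    and both saliency maps; direction is +1 when chosen via S_+ (increase)
    and -1 when chosen via S_- (decrease)."""
    best = (0.0, None, 0)                  # (score, feature, direction)
    for c in range(J.shape[0]):
        for S, sign in ((s_plus(J, c), +1), (s_minus(J, c), -1)):
            i = int(S.argmax())
            if S[i] > best[0]:
                best = (float(S[i]), i, sign)
    return best[1], best[2]

# Toy 3-class Jacobian over 2 features.
J = np.array([[ 1.0, -1.0],
              [-0.5,  2.0],
              [ 0.2,  0.3]])
print(m_jsma_step(J))  # (1, -1): decrease feature 1, most salient via S_-
```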
V Experiments

We trained classifiers following the baseline MNIST architecture depicted in Fig. 3. Our baseline models attain competitive test-set accuracies on MNIST [8], Fashion-MNIST [9], and CIFAR10 [10].
We applied the various JSMA variants to all correctly-classified test-set instances, using saturating perturbations ($|\theta| = 1$). Each model+dataset attack run is evaluated on its success rate (%), the average $\ell_0$ distance (which also reflects convergence speed when $|\theta| = 1$), the average $\ell_2$ perceptual distance, and the average softmax entropy (H) reflecting misclassification uncertainty. To compare best-case performance when evaluating targeted attacks, for each sample we focus on the single target class that results in misclassification in the fewest iterations. Samples of adversarial examples are shown in Fig. 4.
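Assuming the metrics are computed per sample as described (the helper name is ours), the evaluation reduces to:

```python
import numpy as np

def attack_stats(x, x_adv, probs):
    """Per-sample metrics: l0 = number of changed features, l2 = Euclidean
    perturbation size, H = softmax entropy of the adversarial prediction."""
    l0 = int(np.count_nonzero(x_adv - x))
    l2 = float(np.linalg.norm(x_adv - x))
    H = float(-np.sum(probs * np.log(probs + 1e-12)))  # small eps for log(0)
    return l0, l2, H

x = np.array([0.0, 0.5, 1.0])
x_adv = np.array([1.0, 0.5, 1.0])
print(attack_stats(x, x_adv, probs=np.array([0.5, 0.5])))  # (1, 1.0, ~0.693)
```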
V-A JSMA vs. NT-JSMA vs. M-JSMA
Looking at the average $\ell_0$ statistics, reflecting both perceptual similarity and convergence speed, in Table I, we observe that across all 3 datasets it is consistently faster to find adversaries by increasing pixel intensities than by decreasing them. We also note that M-JSMA_F found adversaries with a similar number of pixel changes (and thus in a similar number of iterations) compared to JSMA+F. On the other hand, the non-targeted variants consistently took one or two more iterations than their targeted counterparts.
Considering next the perceptual similarities as measured by the average $\ell_2$ statistics, our results showed strong preferences for the pixel-decreasing variants, JSMA-F and NT-JSMA-F. This can be attributed to the fact that most images from the 3 datasets have dark backgrounds. Also, although adversaries found by M-JSMA_F had the smallest $\ell_2$ distances for MNIST, results for the other datasets did not reflect similar benefits. Furthermore, we note again that NT-JSMA had slightly worse $\ell_2$ statistics compared to JSMA.
Finally, analyzing the uncertainties of adversarial predictions as reflected by the average entropy statistics, we see that the original targeted JSMA formulations consistently found adversaries with lower-uncertainty predictions, especially compared to the non-targeted variants. Nevertheless, adversaries found by Maximal JSMA still showed competitive entropy values on average.
Based on the results above, we conclude that the flexibility of not specifying a target class in NT-JSMA resulted in only minor inefficiencies in terms of both convergence time and quality of adversaries. On the other hand, as M-JSMA considered all possible class targets, both the $S_+$ and $S_-$ saliency metrics, and both perturbation directions, it inherited both the performance benefits and the flexibility of all the other variants.
V-B Effects of Smaller Feature Perturbations
By perturbing features at smaller increments ($|\theta| < 1$) and bounding perturbations to an $\epsilon$-neighborhood, Table II shows that all variants found adversaries with smaller perceptual differences (i.e. smaller $\ell_2$), albeit requiring more search time (i.e. larger $\ell_0$). Also, M-JSMA resulted in stellar convergence speeds and quality of adversaries compared to the other variants across datasets. Thus, we conclude that regardless of whether adversaries with fewer feature changes ($\ell_0$) or smaller Euclidean distances ($\ell_2$) are desirable in a given application domain, M-JSMA performed favorably over the other variants.
V-C Performance under Defensive Distillation
Defensive distillation [7] is an adversarial defense method that re-trains a model using ground-truth labels that are no longer one-hot encoded, but are instead the softmax probabilities resulting from the original model's logits divided by a temperature constant $T$. At inference time, $T$ is reset back to 1. As a result, the gradients of the model approach 0 as $T$ increases, which is a form of gradient masking [11, 12, 13].
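The mechanism is easy to see with a temperature-scaled softmax: a model trained against high-temperature soft targets learns logits that, once evaluated at $T = 1$, are effectively scaled up by $T$, saturating the softmax and flattening its gradients. A minimal sketch (helper name ours):

```python
import numpy as np

def softmax_T(z, T):
    """Temperature-scaled softmax used to produce soft training labels."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([10.0, 5.0, 1.0])
print(softmax_T(logits, T=1))    # sharply peaked: ~[0.993, 0.007, 0.000]
print(softmax_T(logits, T=100))  # nearly uniform: ~[0.349, 0.332, 0.319]
```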
To test the effects of this adversarial defense strategy, we distilled our baseline models at $T = 1$ (plain, non-defensive distillation) and at a high temperature. The resulting classifiers at $T = 1$ on MNIST, F-MNIST, and CIFAR10 had test-set accuracies of 99.39%, 91.92%, and 83.19%, respectively, while the accuracies for the defensively distilled models at the higher temperature were 99.44%, 91.81%, and 83.55%, respectively.
Table III presents four sets of results that contrast softmax-layer ($F$) attacks versus logit-layer ($Z$) attacks, at both the plainly-distilled and defensively-distilled temperatures. We begin by noting that the first block of results closely resembles the statistics from Table I. This suggests that non-defensive distillation has minimal effect on adversarial attacks like JSMA.
Looking at the second block-row next, we observe that adversaries found by attacking the logit layers ($Z$) consistently suffered from poorer perceptual similarity, as reflected by larger average $\ell_0$ and $\ell_2$ statistics.
Moving on, we see that all the $F$-based attack variants failed to break the defensively distilled models, which is consistent with reports in [7] and [14]. Although a modification of JSMA has been proposed [14] that can circumvent defensive distillation by dividing the logits of the distilled model by a temperature constant, we do not assume knowledge of the defense strategy used by the target model. However, in contrast to the findings of [7], all $Z$-based attack attempts were able to fool these distilled models. Although the resulting statistics do not point to a single dominant variant, NT-JSMA-Z and M-JSMA_Z both found adversaries with similarly small perceptual differences in comparably few iterations, while the targeted JSMA-Z consistently trailed behind.
VI Conclusion

We introduced Non-Targeted JSMA and Maximal JSMA as more flexible variants of the Jacobian-based Saliency Map Attack [1], for finding adversarial examples both quickly and with limited perceptual differences. Most notably, M-JSMA subsumes the need to specify both the target class and the perturbation direction. We empirically showed that M-JSMA consistently found high-quality adversaries across a variety of image datasets. With this work, we hope to raise awareness of the ease of generating adversarial examples, and to develop a better understanding of attacks so as to defend against them.
References

-  N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in IEEE European Symposium on Security and Privacy (EuroS&P’16), 2016, pp. 372–387.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations (ICLR’14), 2014. [Online]. Available: http://arxiv.org/abs/1312.6199
-  I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations (ICLR’15), 2015. [Online]. Available: http://arxiv.org/abs/1412.6572
-  K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” CoRR, vol. abs/1312.6034, 2013.
-  N. Carlini and D. A. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy (SP’17), 2017, pp. 39–57.
-  A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” CoRR, vol. abs/1607.02533, 2016. [Online]. Available: http://arxiv.org/abs/1607.02533
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE Symposium on Security and Privacy (SP), vol. 00, May 2016, pp. 582–597. [Online]. Available: doi.ieeecomputersociety.org/10.1109/SP.2016.41
-  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
-  H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” CoRR, vol. abs/1708.07747, 2017. [Online]. Available: http://arxiv.org/abs/1708.07747
-  A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep. Vol. 1 No. 4, 2009.
-  N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in ACM Asia Conference on Computer and Communications Security (ASIACCS’17). New York, NY, USA: ACM, 2017, pp. 506–519. [Online]. Available: http://doi.acm.org/10.1145/3052973.3053009
-  N. Papernot, P. D. McDaniel, A. Sinha, and M. P. Wellman, “Towards the science of security and privacy in machine learning,” CoRR, vol. abs/1611.03814, 2016. [Online]. Available: http://arxiv.org/abs/1611.03814
-  F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” in International Conference on Learning Representations (ICLR’18), 2018. [Online]. Available: https://openreview.net/forum?id=rkZvSe-RZ
-  N. Carlini and D. A. Wagner, “Defensive distillation is not robust to adversarial examples,” CoRR, vol. abs/1607.04311, 2016. [Online]. Available: http://arxiv.org/abs/1607.04311