Maximal Jacobian-based Saliency Map Attack

08/23/2018 · Rey Wiyatno, et al.

The Jacobian-based Saliency Map Attack is a family of adversarial attack methods for fooling classification models, such as deep neural networks for image classification tasks. By saturating a few pixels in a given image to their maximum or minimum values, JSMA can cause the model to misclassify the resulting adversarial image as a specified erroneous target class. We propose two variants of JSMA, one which removes the requirement to specify a target class, and another that additionally does not need to specify whether to only increase or decrease pixel intensities. Our experiments highlight the competitive speeds and qualities of these variants when applied to datasets of hand-written digits and natural scenes.


I Introduction

The Jacobian-based Saliency Map Attack [1] is a family of adversarial attack methods [2, 3] for fooling classification models, such as deep neural networks for image classification tasks. By saturating a few pixels in a given image to their maximum or minimum values, JSMA can cause the model to misclassify the resulting adversarial image as a specified erroneous target class. We propose two variants of JSMA, one which removes the requirement to specify a target class, and another that additionally does not need to specify whether to only increase or decrease pixel intensities. Our experiments highlight the competitive speeds and qualities of these variants when applied to datasets of hand-written digits and natural scenes.

II Jacobian-based Saliency Map Attack (JSMA)

Saliency maps were originally conceived for visualizing the prediction process of classification models [4]. The map rates each input feature $x_i$ (e.g. each pixel)¹ on how influential it is for causing the model to predict a particular class $t$, where $F(x)$ is the softmax probabilities vector predicted by the victim model.

¹ We use the notation $x_i$ to denote the $i$-th element of the vector $x$.

One formulation of the saliency map is given as:

$$ S^+(x_i, t) = \begin{cases} 0 & \text{if } \dfrac{\partial F_t(x)}{\partial x_i} < 0 \ \text{ or } \ \sum_{j \neq t} \dfrac{\partial F_j(x)}{\partial x_i} > 0 \\[2ex] \dfrac{\partial F_t(x)}{\partial x_i} \cdot \left| \sum_{j \neq t} \dfrac{\partial F_j(x)}{\partial x_i} \right| & \text{otherwise} \end{cases} $$

$S^+(x_i, t)$ measures how much $x_i$ positively correlates with the target class $t$, while also negatively correlating with all other classes $j \neq t$. If either condition is violated, the saliency is reset to zero.

An attacker can exploit this saliency map by targeting an adversarial class $t$ that does not match the true class label $y$ of a given sample $x$. By increasing a few high-saliency pixels according to $S^+$, the modified image will have an increased prediction confidence for the adversarial class $t$, and thus might be misclassified.

Alternatively, one can attack by decreasing feature values based on another saliency map $S^-$, which differs from $S^+$ only by the inversion of the two inequalities, i.e. $S^-(x_i, t) = 0$ if $\partial F_t(x)/\partial x_i > 0$ or $\sum_{j \neq t} \partial F_j(x)/\partial x_i < 0$.
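Both maps follow directly from the Jacobian of the model outputs. Below is a minimal NumPy sketch, assuming a (C × N) Jacobian of the softmax outputs with respect to the N input features; how the Jacobian is obtained (e.g. via autograd) is left to the caller, and all names are illustrative:

  import numpy as np

  def saliency_maps(jac, t):
      # jac[j, i] = dF_j(x)/dx_i; returns per-feature S+ and S- for class t
      target_grad = jac[t]                      # dF_t/dx_i
      other_grad = jac.sum(axis=0) - jac[t]     # sum over j != t of dF_j/dx_i

      # S+: nonzero only where increasing x_i raises F_t and lowers the rest
      s_plus = np.where((target_grad < 0) | (other_grad > 0),
                        0.0, target_grad * np.abs(other_grad))
      # S-: the same criterion with both inequalities inverted
      s_minus = np.where((target_grad > 0) | (other_grad < 0),
                         0.0, np.abs(target_grad) * other_grad)
      return s_plus, s_minus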

In practice, both saliency measures $S^+$ and $S^-$ are overly strict when applied to individual input features, because the summed gradient contribution across all non-targeted classes often violates the sign conditions and zeroes out the saliency. Thus, the Jacobian-based Saliency Map Attack (JSMA) [1] alters these saliency measures to search over pairs of pixels instead.

Given a unit-normalized input $x \in [0, 1]^N$, JSMA initializes the search domain $\Gamma$ over all input indices. Then, as illustrated in Fig. 1, it finds the most salient pixel pair $(p, q) \in \Gamma$, perturbs both values (by $+\theta$ if using $S^+$, or by $-\theta$ if using $S^-$), and then removes saturated feature indices from $\Gamma$. This process is repeated until either the model misclassifies the perturbed input $x'$ as the target class $t$, or a maximum number of iterations is reached.

Fig. 1: Illustration of JSMA+F algorithm.
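To make the search concrete, here is a compact sketch of this loop, reusing the hypothetical saliency_maps helper above; predict and jacobian are assumed illustrative callables, images are treated as flat vectors, and pairs are scored by their summed per-feature saliency (a simplification of the joint pair criterion in [1]):

  import numpy as np
  from itertools import combinations

  def jsma_plus(x, t, predict, jacobian, theta=1.0, max_iters=100):
      # Targeted JSMA+ sketch: perturb the most salient pixel pair until the
      # model predicts the target class t, or the iteration budget runs out.
      x_adv = x.copy()
      gamma = set(range(x_adv.size))          # search domain of modifiable indices
      for _ in range(max_iters):
          if predict(x_adv) == t or len(gamma) < 2:
              break
          s_plus, _ = saliency_maps(jacobian(x_adv), t)
          # score every candidate pair by summed per-feature saliency
          p, q = max(combinations(sorted(gamma), 2),
                     key=lambda pq: s_plus[pq[0]] + s_plus[pq[1]])
          x_adv[[p, q]] = np.clip(x_adv[[p, q]] + theta, 0.0, 1.0)
          gamma -= {i for i in (p, q) if x_adv[i] >= 1.0}   # drop saturated pixels
      return x_adv

The exhaustive pair search is O(N²) per iteration; a practical implementation might prune $\Gamma$ aggressively or restrict candidates to the top-k most salient features.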

Carlini and Wagner [5] proposed an alteration that uses the logit $Z_t(x)$ rather than the softmax probability $F_t(x)$. We denote these variants as JSMA+Z, JSMA-Z, JSMA+F, and JSMA-F, based on the choices of increasing ($+$) or decreasing ($-$) feature values, and of using $Z$ versus $F$.

The original authors advocated saturating perturbations ($\theta = 1$) to find adversaries $x'$ with the fewest feature changes (i.e. minimal $L_0$ norm). In some domains such as hand-written digits, we anecdotally note that adversaries found with $\theta < 1$ had smaller perturbed $L_2$ distances and were more perceptually similar to the originals, and thus less likely to be detected by humans.

Furthermore, we suggest that the maximum per-feature perturbation ($L_\infty$ norm) can optionally be clipped to within an $\epsilon$-neighborhood of the original input to further limit perceptual differences, similar to the BIM attack [6]:

$$ x'_i \leftarrow \min\left(\max\left(x'_i,\; x_i - \epsilon\right),\; x_i + \epsilon\right) $$
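In NumPy this projection is a one-liner, combined here with the unit pixel-range constraint (a sketch):

  import numpy as np

  def clip_eps(x_adv, x, eps):
      # project x_adv into the eps-ball around x, then into the valid pixel range
      return np.clip(np.clip(x_adv, x - eps, x + eps), 0.0, 1.0)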

III Non-Targeted JSMA (NT-JSMA)

All JSMA variants above must be given a specific target class $t$. This choice affects the speed and quality of the attack, since misclassification as certain target classes is harder to attain than as others, e.g. some hand-written digits are much harder to disguise as particular target digits [1].

We propose a non-targeted attack formulation that removes this target-class dependency by having the algorithm decrease the model's prediction confidence of the true class label $y$, instead of increasing the prediction confidence of an adversarial target $t$. As depicted in Fig. 2, the NT-JSMA procedure is realized by swapping the saliency measure employed, i.e. following $S^-$ (with respect to $y$) when increasing feature values (NT-JSMA+F / NT-JSMA+Z), or $S^+$ when decreasing feature values (NT-JSMA-F / NT-JSMA-Z). This variant also naturally relaxes the success criterion, such that an adversarial example $x'$ only needs to not be classified as the true class $y$, i.e. $\arg\max_j F_j(x') \neq y$.
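Relative to the targeted loop, only the saliency measure and the stopping test change; a sketch of one selection step under the same assumptions as before (single-feature simplification, hypothetical saliency_maps helper):

  import numpy as np

  def nt_jsma_step(x_adv, y, jacobian):
      # NT-JSMA+ selection: use the *inverted* map S- with respect to the true
      # class y, so that increasing the chosen feature suppresses F_y.
      _, s_minus = saliency_maps(jacobian(x_adv), y)
      return int(np.argmax(s_minus))        # most salient feature to increase

  # The stopping test is likewise relaxed: iterate while predict(x_adv) == y,
  # i.e. stop as soon as the prediction is anything other than the true class.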

Fig. 2: Illustration of NT-JSMA-F algorithm, with highlighted differences from JSMA+F.

IV Maximal JSMA (M-JSMA)

  Input: input $x$, true class $y$, classifier $F$, search domain $\Gamma$, perturbation step $\theta$, max. perturbation bound $\epsilon$, max. iterations $T_{\max}$
  Initialize: $x' \leftarrow x$, $\Gamma \leftarrow \{1, \dots, N\}$, $i \leftarrow 0$, history $H \leftarrow \mathbf{0}$
  while $\arg\max_j F_j(x') = y$ and $i < T_{\max}$ and $|\Gamma| \geq 2$ do
      for every pixel pair $(p, q) \in \Gamma$ and every class $c$ do
          $s^+ \leftarrow S^+(x'_p, c) + S^+(x'_q, c)$
          $s^- \leftarrow S^-(x'_p, c) + S^-(x'_q, c)$
          if $\max(s^+, s^-)$ exceeds the best saliency found so far then
              record the pair $(p, q)$ and direction $\delta \leftarrow +\theta$ if $s^+ \geq s^-$, else $-\theta$
          end if
      end for
      perturb $x'_p$ and $x'_q$ of the recorded pair by $\delta$
      clip $x'_p, x'_q$ to $[0, 1]$ and to within $\epsilon$ of $x_p, x_q$
      append the perturbation to the history $H$; disallow choices that oscillate per $H$
      remove saturated feature indices from $\Gamma$
      $i \leftarrow i + 1$
  end while
  Return: $x'$
Algorithm 1 Maximal Jacobian-based Saliency Map Attack (M-JSMA_F)

In addition to alleviating the need to specify a target class $t$, we can further alleviate the need to specify whether to only increase or decrease feature values. The resulting Maximal Jacobian-based Saliency Map Attack (M-JSMA) combines the targeted and non-targeted strategies, and considers both the increase and decrease of feature values. As shown in Algorithm 1², at each iteration the most salient pixel pair is chosen over every possible class $c$, whether adversarial or not. Also, instead of enforcing the low-saliency conditions of $S^+$ or $S^-$, we simply identify the most salient pair according to either map, and decide on the perturbation direction accordingly. An additional history vector $H$ is added to prevent oscillatory perturbations. As with NT-JSMA, M-JSMA terminates when the predicted class for $x'$ no longer matches the true class $y$.

² Algorithm 1 contrasts with the original JSMA+F [1]: in the paper's typeset version, blue text denotes additions for M-JSMA and red text denotes omitted parts of JSMA+F.
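The maximal selection step can be sketched as follows (single-feature simplification; the pair search, $\epsilon$-clipping, and history bookkeeping of Algorithm 1 are elided, and saliency_maps is the hypothetical helper from before):

  import numpy as np

  def mjsma_select(jac, num_classes):
      # Scan ALL classes and BOTH saliency maps; return the single most salient
      # feature together with the perturbation direction of the winning map.
      best_val, best_idx, best_dir = 0.0, None, 0
      for c in range(num_classes):          # adversarial and true classes alike
          s_plus, s_minus = saliency_maps(jac, c)
          for s, direction in ((s_plus, +1), (s_minus, -1)):
              i = int(np.argmax(s))
              if s[i] > best_val:
                  best_val, best_idx, best_dir = s[i], i, direction
      return best_idx, best_dir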

V Evaluation

We trained classifiers following the baseline MNIST architecture introduced in [7], which is depicted in Fig. 3. The test-set accuracies of our baseline models for MNIST [8], Fashion-MNIST [9], and CIFAR10 [10] are , , and , respectively.

Fig. 3: Architecture of the baseline MNIST classifier model [7].

We applied the various JSMA variants to all correctly-classified test-set instances, using saturating perturbations. Each model+dataset attack run is evaluated on its success rate (%), the average $L_0$ distance (which also reflects convergence speed when $\theta = 1$), the average $L_2$ perceptual distance, and the average softmax entropy (H) reflecting misclassification uncertainty. To compare best-case performance when evaluating targeted attacks on each sample, we focus on the single target class that results in misclassification in the fewest iterations. Samples of adversarial examples are shown in Fig. 4.

Fig. 4: Samples of adversarial examples found by various JSMA variants, along with evaluation metrics: $L_0$ for convergence speed, $L_2$ for perceptual similarity, and softmax entropy H for the prediction uncertainty of the adversaries.
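For reference, these per-example metrics are straightforward to compute; a sketch in NumPy, where probs denotes the softmax vector of the adversarial prediction (names illustrative):

  import numpy as np

  def attack_metrics(x, x_adv, probs, eps=1e-12):
      l0 = int(np.count_nonzero(x_adv - x))             # changed features
      l2 = float(np.linalg.norm(x_adv - x))             # Euclidean distance
      h = float(-np.sum(probs * np.log(probs + eps)))   # softmax entropy H
      return l0, l2, h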

V-A JSMA vs. NT-JSMA vs. M-JSMA

Attack       |         MNIST           |        F-MNIST          |        CIFAR10
             |   %    L0    L2    H    |   %    L0    L2    H    |   %    L0    L2    H
JSMA+F       |  100   16.9  3.26  0.58 |  99.9  17.0  3.17  0.98 |  100   15.0  2.24  1.10
JSMA-F       |  100   18.5  3.30  0.62 |  99.9  31.1  2.89  0.90 |  100   16.6  1.59  1.05
NT-JSMA+F    |  100   17.6  3.35  0.64 |  100   18.8  3.27  1.03 |  99.9  17.5  2.36  1.16
NT-JSMA-F    |  100   19.7  3.44  0.70 |  99.9  33.2  2.99  0.98 |  99.9  19.6  1.68  1.12
M-JSMA_F     |  100   14.9  3.04  0.62 |  99.9  18.7  3.42  1.02 |  99.9  17.4  2.16  1.12
TABLE I: Comparison of variants under saturating perturbations ($\theta = 1$).

Looking at the average $L_0$ statistics in Table I, which reflect both perturbation sparsity and convergence speed, we observe that across all 3 datasets it is consistently faster to find adversaries by increasing pixel intensities rather than decreasing them. We also note that M-JSMA_F found adversaries with a similar number of pixel changes (and thus in a similar number of iterations) compared to JSMA+F. On the other hand, the non-targeted variants consistently took one or two more iterations than their targeted counterparts.

Considering next the perceptual similarities as measured by average $L_2$ statistics, our results showed strong preferences for the pixel-decreasing variants, JSMA-F and NT-JSMA-F. This can be attributed to the fact that most images from the 3 datasets have dark backgrounds. Also, although adversaries found by M-JSMA_F had the smallest perceptual distances ($L_2$) for MNIST, results for the other datasets did not reflect similar benefits. Furthermore, we note again that NT-JSMA had slightly worse $L_2$ statistics compared to JSMA.

Finally, analyzing the uncertainties of adversarial predictions as reflected by average entropy (H) statistics, we see that the original targeted JSMA formulations consistently found adversaries with lower-uncertainty predictions, especially compared to the non-targeted variants. Nevertheless, adversaries found by Maximal JSMA still showed competitive H values on average.

Based on the results above, we conclude that the flexibility of not specifying a target class in NT-JSMA resulted in only minor added inefficiencies in terms of both convergence time and quality of adversaries. On the other hand, as M-JSMA considered all possible class targets, both saliency measures, and both perturbation directions, it inherited both the performance benefits and the flexibility of all other variants.

V-B Effects of Smaller Feature Perturbations

Attack       |         MNIST           |        F-MNIST          |        CIFAR10
             |   %    L0    L2    H    |   %    L0    L2    H    |   %    L0    L2    H
JSMA+F       |  100   41.6  1.95  0.79 |  99.9  30.4  0.61  0.94 |  100   23.6  0.60  1.02
JSMA-F       |  80.3  44.7  2.24  0.81 |  99.2  48.5  1.16  0.89 |  100   22.8  0.58  1.01
NT-JSMA+F    |  99.9  36.5  1.93  0.86 |  99.5  32.5  0.64  1.01 |  99.3  26.3  0.63  1.12
NT-JSMA-F    |  54.2  34.6  1.98  0.89 |  94.5  48.6  1.15  0.95 |  96.2  24.9  0.59  1.11
M-JSMA_F     |  98.2  31.5  1.71  0.85 |  99.5  30.3  0.60  0.98 |  98.4  23.3  0.54  1.08
TABLE II: Comparison of variants with a reduced perturbation step $\theta$ and perturbation bound $\epsilon$.

By perturbing features at smaller increments $\theta$ and bounding perturbations to an $\epsilon$-neighborhood, Table II shows that all variants found adversaries with smaller perceptual differences (i.e. smaller $L_2$), albeit requiring more search time (i.e. larger $L_0$). Also, M-JSMA resulted in excellent convergence speeds and quality of adversaries compared to the other variants across datasets. Thus, we conclude that regardless of whether adversaries with fewer feature changes ($L_0$) or smaller Euclidean distances ($L_2$) are desirable for a given application domain, M-JSMA performed favorably over the other variants.

V-C Performance under Defensive Distillation

Defensive distillation [7] is an adversarial defense method that re-trains a model using ground-truth labels that are no longer one-hot-encoded, but are instead the softmax probabilities resulting from the original model's logits divided by a temperature constant $T$. At inference time, $T$ is reset back to 1. As a result, the gradients of the model approach 0 as $T$ increases, which is a form of gradient masking [11, 12, 13].
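The training targets used by distillation are just a temperature-scaled softmax of the original model's logits; a minimal sketch:

  import numpy as np

  def soft_labels(logits, T=100.0):
      # temperature-scaled softmax used as distillation training targets
      z = logits / T
      z = z - z.max(axis=-1, keepdims=True)   # numerical stabilization
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)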

To test the effects of this adversarial defense strategy, we distilled our baseline models at $T = 1$ (plain, non-defensive distillation) and $T = 100$. The resulting classifiers at $T = 1$ on MNIST, F-MNIST, and CIFAR10 had test-set accuracies of 99.39%, 91.92%, and 83.19%, respectively, while the accuracies of the distilled models at $T = 100$ were 99.44%, 91.81%, and 83.55%, respectively.

Attack       T    |         MNIST           |        F-MNIST          |        CIFAR10
                  |   %    L0    L2    H    |   %    L0    L2    H    |   %    L0    L2    H
JSMA+F       1    |  100   17.9  3.23  0.79 |  100   17.5  3.16  1.09 |  100   15.5  2.16  1.28
JSMA-F       1    |  100   16.9  3.18  0.84 |  100   32.3  2.96  1.00 |  100   19.3  1.84  1.24
NT-JSMA+F    1    |  100   18.8  3.31  0.84 |  100   18.9  3.25  1.13 |  100   18.6  2.27  1.34
NT-JSMA-F    1    |  100   17.8  3.28  0.93 |  100   34.2  3.06  1.06 |  100   23.5  1.95  1.31
M-JSMA_F     1    |  100   14.1  2.93  0.82 |  100   18.8  3.38  1.12 |  100   19.1  2.27  1.31
JSMA+Z       1    |  100   52.0  5.19  0.60 |  100   45.1  5.10  0.95 |  100   42.2  3.24  1.36
JSMA-Z       1    |  99.1  44.5  4.79  0.66 |  95.4  88.5  5.07  0.89 |  100   54.3  2.93  1.38
NT-JSMA+Z    1    |  100   20.0  3.47  1.14 |  99.9  19.6  3.41  1.29 |  99.9  22.6  2.50  1.61
NT-JSMA-Z    1    |  100   19.0  3.42  1.20 |  99.8  39.1  3.29  1.29 |  100   28.6  2.18  1.61
M-JSMA_Z     1    |  100   15.3  3.06  1.01 |  99.9  22.0  3.69  1.25 |  100   25.7  2.61  1.41
JSMA+F       100  |  0.1   2.6   0.89  0.00 |  2.6   3.6   0.95  0.03 |  4.5   2.9   0.95  0.03
JSMA-F       100  |  0.1   2.1   0.97  0.00 |  2.6   4.0   0.91  0.04 |  5.0   3.0   0.65  0.03
NT-JSMA+F    100  |  0.1   2.4   0.85  0.00 |  2.7   3.6   0.95  0.03 |  4.6   2.9   0.96  0.03
NT-JSMA-F    100  |  0.1   2.1   1.01  0.00 |  2.8   4.0   0.93  0.03 |  5.0   3.0   0.65  0.03
M-JSMA_F     100  |  0.1   2.0   0.97  0.00 |  2.6   3.3   1.10  0.02 |  4.8   2.8   0.79  0.03
JSMA+Z       100  |  100   37.1  4.43  0.01 |  100   32.5  4.40  0.02 |  100   46.8  3.48  0.04
JSMA-Z       100  |  98.7  42.4  4.55  0.01 |  95.6  58.0  3.86  0.04 |  100   59.8  3.21  0.05
NT-JSMA+Z    100  |  100   16.2  3.33  0.02 |  100   21.5  3.58  0.03 |  100   23.6  2.56  0.04
NT-JSMA-Z    100  |  100   19.8  3.52  0.02 |  100   39.8  3.26  0.05 |  100   27.3  2.13  0.04
M-JSMA_Z     100  |  100   14.7  3.09  0.02 |  99.6  28.0  3.92  0.03 |  100   27.1  2.70  0.04
TABLE III: Comparison of variants (same $\theta$ and $\epsilon$ as Table I) on models distilled at $T = 1$ and $T = 100$.

Table III presents four blocks of results that contrast softmax-layer (F) attacks versus logit-layer (Z) attacks, each at the plainly-distilled ($T = 1$) and defensively-distilled ($T = 100$) temperatures. We begin by noting that the first block of results closely resembles the statistics from Table I. This suggests that non-defensive distillation has minimal effect on adversarial attacks like JSMA.

Looking at the second block-row next, we observe that adversaries found by attacking the logit layer (Z variants at $T = 1$) consistently suffered from poorer perceptual similarities, as reflected by larger average $L_0$ and $L_2$ statistics.

Moving on, we see that all of the F-based attack variants failed to break the defensively distilled models ($T = 100$), which is consistent with the reports of [5] and [7]. Although a modification of JSMA proposed by [14] can circumvent defensive distillation by dividing the logits of the distilled model by a temperature constant, we do not assume knowledge of the defense strategy used by the target model. However, in contrast to the findings of [5], all of the Z-based attack attempts were able to fool these distilled models. Although the resulting statistics do not point to a single dominant variant, NT-JSMA±Z and M-JSMA_Z both found adversaries with similarly small perceptual differences in comparably few iterations, while the targeted JSMA±Z trailed behind consistently.

VI Conclusions

We introduced Non-Targeted JSMA (NT-JSMA) and Maximal JSMA (M-JSMA) as more flexible variants of the Jacobian-based Saliency Map Attack [1] for finding adversarial examples both quickly and with limited perceptual differences. Most notably, M-JSMA removes the need to specify both the target class and the perturbation direction. We empirically showed that M-JSMA consistently found high-quality adversaries across a variety of image datasets. With this work, we hope to raise awareness of the ease of generating adversarial examples, and to develop a better understanding of attacks so as to defend against them.

References