The Efficacy of SHIELD under Different Threat Models

02/01/2019 ∙ by Cory Cornelius, et al. ∙ Intel

We study the efficacy of SHIELD in the face of alternative threat models. We find that SHIELD's robustness decreases by 65% against an adaptive adversary (one who knows JPEG compression is being used as a pre-processing step but not necessarily the compression level) in the gray-box threat model (the adversary is aware of the model architecture but not necessarily the weights of that model). However, these adversarial examples are, so far, unable to force a targeted prediction. We also find that the robustness of the JPEG-trained models used in SHIELD decreases by 67% (accuracy drops from 57%) against the same adaptive adversary in the gray-box threat model. The addition of SLQ pre-processing to these JPEG-trained models is also not a robust defense (accuracy drops to 0.1%) against an adaptive adversary in the gray-box threat model, and an adversary can create adversarial perturbations that force a chosen prediction. We find that neither JPEG-trained models with SLQ pre-processing nor SHIELD are robust against an adaptive adversary in the white-box threat model (accuracy is 0.1%), and the adversary can control the predicted output of their adversarial images. Finally, ensemble-based attacks transfer better (29.8% targeted accuracy) than non-ensemble-based attacks (1.4%).


1 Introduction

Adversarial examples are inputs that adversaries find or craft to fool a machine learning model [2]. Even the most accurate image classification models that rely upon state-of-the-art deep neural networks are potentially vulnerable to these adversarial examples [3, 4, 5]. Thus, it is an important research problem to develop defenses against these adversarial examples, especially when these models are deployed in safety- or security-critical systems.

One such recently proposed defense, Secure Heterogeneous Image Ensemble with Localized Denoising (SHIELD) [1], uses a combination of techniques to defend against adversarial examples in image classification systems. SHIELD trains several models with JPEG-compressed images at randomly chosen compression levels. These JPEG-trained models are treated as a majority-vote ensemble to yield the final classification result for a given input image. At inference time, SHIELD also applies Stochastic Local Quantization (SLQ), a randomized form of JPEG compression, as a pre-processing step. In the face of a static adversary that only knows the model architecture (but not its weights nor any defensive measures), SHIELD reports only a small decrease in accuracy against adversarial examples. However, we know that most defenses fail against adaptive adversaries and/or under different threat models. Thus, we evaluate this defense against an adaptive adversary and under a different threat model to better understand the limits of SHIELD.

2 Method

When deploying a defense to mitigate adversarial examples, one must understand the limits of that defense in terms of the strength of the attacker and the threat models it was tested for. We usually quantify a defense's capability as a well-defined threat model along with a security curve that plots the accuracy of the model in the face of an adversary that can manipulate the image at different strengths under that threat model [6]. We measure strength as the difference between a test image and its corresponding adversarial image generated via an attack method. Such difference images are high-dimensional, so we often summarize the difference by computing a metric like L∞ (the largest absolute difference) or L2 (the Euclidean distance). At the limit of the strength (i.e., the ability to set pixels to any value), the attacker can simply use those examples that are already incorrectly classified or just replace the image with another image of their desired target. It turns out, however, that many models exhibit weaknesses at much smaller strengths, where it is difficult for humans to visually distinguish any adversarial perturbation. The study of these security curves is of interest to defenders because of the asymmetry between adversaries and defenders: adversaries need only find one particular weakness, while defenders ought to mitigate all possible weaknesses.
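To make these strength measures concrete, the following sketch computes the L∞ and L2 distances between an image and its adversarial counterpart. It is a minimal illustration assuming images are numpy arrays with pixel values in [0, 255]; the function name is ours and not part of any released code.

```python
import numpy as np

def perturbation_strength(original, adversarial):
    """Summarize an adversarial perturbation by its L-infinity and L2 distances.

    Both inputs are H x W x C arrays with pixel values in [0, 255].
    """
    delta = adversarial.astype(np.float64) - original.astype(np.float64)
    l_inf = np.max(np.abs(delta))       # largest absolute per-pixel change
    l_2 = np.sqrt(np.sum(delta ** 2))   # Euclidean length of the flattened perturbation
    return l_inf, l_2
```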

Figure 1: Example adversarial images generated against SHIELD. The first column shows the original images and their corresponding predictions. The second column shows the adversarial images and their corresponding predictions. The final image is the difference between the original and adversarial image, which we call the perturbation. Note that both perturbations have an L∞ distance of 16 but different L2 distances. The perturbation in the alp photo is visually noticeable, while the perturbation in the koala photo is not. Despite being visually noticeable, the alp perturbation has a lower L2 than the koala perturbation. Ideally, a chosen distance metric would be well ordered with respect to visual distinguishability. Images are best viewed in color.

In this work we only examine strength in terms of L∞ because it is easy to understand: what is the maximum pixel-value deviation an attacker can apply across all image channels? Figure 1 shows two different adversarial images with their corresponding perturbations. Both perturbations have an L∞ of 16 yet exhibit different L2 distances. The L2 distance of the koala perturbation is larger than the L2 distance of the alps perturbation, yet the alps perturbation is easily discernible in the adversarial image. Finding better metrics to summarize adversarial perturbations that take into account human perception remains an open problem. Because SHIELD was evaluated against ImageNet, we chose the same attacker strength that other state-of-the-art defenses on ImageNet use: an L∞ of 16 out of 255.

3 White-box Robustness

In the white-box threat model, the attacker is allowed access to all parts of the proposed defense. For SHIELD, this includes the model architecture, the weights, and any JPEG pre-processing (i.e., SLQ). Such access enables an attacker to compute changes to a desired image such that those changes, the perturbation, fool the defended model. This computation, backpropagation, is exactly the same method one would use to train such a model, except that rather than changing the weights of the model, the attacker changes only the input image. A variety of these attacks exist. For the white-box scenario, we rely upon the projected gradient descent (PGD) attack [4] as implemented in Cleverhans [7]. For details on this attack, we refer readers to the Cleverhans technical report [7].
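For readers unfamiliar with PGD, the sketch below shows the basic iteration: take a signed-gradient step on the loss, then project back into the L∞ ball of radius epsilon around the original image. This is a minimal, framework-agnostic version rather than the Cleverhans implementation; `loss_gradient` is a hypothetical callable returning the gradient of the classification loss with respect to the input.

```python
import numpy as np

def pgd_attack(x, loss_gradient, epsilon=16.0, step_size=2.0, steps=20):
    """Minimal untargeted L-infinity PGD sketch on an image in [0, 255].

    loss_gradient -- hypothetical callable: gradient of the loss w.r.t. the input image.
    """
    x_adv = x.astype(np.float64).copy()
    for _ in range(steps):
        grad = loss_gradient(x_adv)
        x_adv = x_adv + step_size * np.sign(grad)          # ascend the classification loss
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)   # project into the L-infinity ball
        x_adv = np.clip(x_adv, 0.0, 255.0)                 # remain a valid image
    return x_adv
```

For a targeted attack, one instead descends the loss of the chosen target class (here, the least-likely class).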

Backpropagation requires that all operations in the model be differentiable. By using JPEG compression, which is non-differentiable, SHIELD forces attackers to create differentiable approximations of JPEG compression. A recent paper, JPEG-resistant Adversarial Images [8], describes how to do this, and we apply the authors' differentiable JPEG approximation when attacking SHIELD. Similarly, because the majority voting of the ensemble of models in SHIELD is also non-differentiable, we approximate the majority vote by averaging the logits output by each model before applying the softmax function. One can also apply alternative ensemble approximations [9], but we found averaging to be effective.

A further difficulty in naively applying PGD to SHIELD is the stochasticity of the SLQ pre-processing step. Because SLQ introduces randomness into the input presented to the model, the model effectively sees a different, albeit similar, input every time for the same image. SHIELD relies upon its ensemble of JPEG-trained models to ensure that the same input image, even after SLQ pre-processing, still maps to the same prediction. The attacker, however, typically has no control over this randomness. To circumvent this issue, we compress the image at several different compression levels (as well as no compression) and average over these compression levels to find an adversarial perturbation, as sketched below. We refer readers to the JPEG-resistant Adversarial Images paper [8] for more details on this technique for making adversarial images robust against different JPEG compression levels.
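The sketch below combines the two approximations described above: the majority vote is replaced by an average over each member's logits, and the SLQ randomness is handled by averaging over a fixed set of JPEG quality levels. `differentiable_jpeg` stands in for the approximation of [8], each `member` is a callable returning pre-softmax logits, and the listed quality levels are illustrative rather than the ones used by SHIELD. It is written with numpy for readability; in practice the same computation would be expressed in an autodiff framework so gradients can flow back to the input.

```python
import numpy as np

QUALITY_LEVELS = [20, 40, 60, 80, None]  # illustrative; None means no compression

def surrogate_logits(x, members, differentiable_jpeg):
    """Differentiable surrogate for SHIELD's prediction:
    average the members' logits over several JPEG compression levels."""
    logits = []
    for quality in QUALITY_LEVELS:
        x_q = x if quality is None else differentiable_jpeg(x, quality=quality)
        for member in members:
            logits.append(member(x_q))   # pre-softmax logits of one JPEG-trained model
    return np.mean(logits, axis=0)       # the attack backpropagates through this average
```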

We sampled 1,000 of the ImageNet validation images and used PGD to generate adversarial images. Very few of these adversarial images are correctly classified by SHIELD; the remaining images were incorrectly classified as something other than the ground-truth label. Furthermore, an attacker is also able to force SHIELD to choose a prediction of the attacker's choosing. Here we chose the prediction target to be the least-likely class that SHIELD predicts on the original image. Such a target should be difficult for an attacker to hit, yet our results show that nearly all of those adversarial images were successfully predicted as the least-likely label. We found that similar results hold for JPEG-trained model defenses and JPEG/SLQ pre-processing defenses (see the Appendix for more information). In the white-box threat model, neither SHIELD, JPEG/SLQ pre-processing, nor JPEG-trained models are robust to adversarial examples of small L∞ distance, and an attacker can control the predicted target of those adversarial examples.

4 Gray-box Robustness

We also examined a gray-box threat model different from the one SHIELD was analyzed under. The gray-box threat model that SHIELD was previously examined under assumed that the adversary had knowledge of the model weights and architecture but was ignorant of the defense. We examined an alternative gray-box threat model: the attacker is aware of the defense and the model architecture, but oblivious to the weights of the model. We believe such a threat model is more realistic, since many applications employ existing architectures but with weights learned for their specific task. Furthermore, it is more plausible that the weights of a model can remain confidential, for example through a trusted execution environment or deployment as a cloud API. Finally, we believe our assumption that the attacker is aware that a JPEG-based defense is in use is more reasonable. Like model architectures, the set of known defenses is small enough that an attacker can exhaustively try them until they are successful. In fact, our gray-box threat model does not assume the attacker knows the exact JPEG compression levels used. Even so, the number of compression levels that yield meaningfully different image distributions is small. After all, the whole purpose of JPEG compression is to preserve the semantic information in the image!

Rather than employing gradient-estimation techniques that rely upon repeated queries to the model, we employed a weaker form of attack based upon transferability. It turns out that even different model architectures trained on the same dataset tend to be susceptible to the same adversarial examples. This phenomenon, called black-box transferability, is well studied [10]. We studied SHIELD in the face of such transfer attacks. To do this, we first generated adversarial examples against an off-the-shelf pre-trained model using an adaptive attack procedure similar to the one described for the white-box attacks above. However, rather than using PGD, we used the Fast Gradient Method (FGM) because examples generated using FGM tend to transfer better. We then fed these adversarial examples into the SHIELD model to determine whether they fool SHIELD.
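A minimal sketch of this transfer evaluation, under the same hedges as before: `substitute_grad(x, y)` is a hypothetical callable returning the gradient of the substitute model's loss, and `defended_predict(x)` returns the defended model's predicted label.

```python
import numpy as np

def fgm_transfer_rate(images, labels, substitute_grad, defended_predict, epsilon=16.0):
    """Craft one-step FGM examples on the substitute model and count how many fool the defense."""
    fooled = 0
    for x, y in zip(images, labels):
        x_adv = np.clip(x + epsilon * np.sign(substitute_grad(x, y)), 0.0, 255.0)
        if defended_predict(x_adv) != y:   # untargeted success: any label other than ground truth
            fooled += 1
    return fooled / len(images)
```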

We created our gray-box substitute model by taking an off-the-shelf model and adding SLQ pre-processing. We generated adversarial images using FGM from those same 1,000 ImageNet validation images. Of those 1,000 adversarial images, 781 successfully transferred from our gray-box substitute to SHIELD, while, on average, 812 of those images also fooled just the JPEG-trained models. However, in all of these tests, we found that the adversary was unable to control the prediction target. That is, the adversarial images we generated were classified neither as the ground-truth label nor as the targeted label, but as some other label. Whether the attacker's ability to control the predicted output is of concern is application dependent. In authentication usages, for example, the attacker might impersonate a specific user with targeted attacks or simply any authorized user with untargeted attacks. In this gray-box threat model, SHIELD, JPEG/SLQ pre-processing, and JPEG-trained models all show a significant decrease in robustness to adversarial examples; however, an attacker cannot control the prediction like they could in the white-box threat model.

Because of the low targeted black-box transferability, we used an ensemble-based approach to generate transferable targeted examples, as described in Liu et al. [10]. To do this, we computed adversarial examples using 3 of the 4 JPEG-trained models. One can think of this as the adversary having trained their own JPEG-trained models and then using them to attack a separate, unknown JPEG-trained model. Our hypothesis is that JPEG-trained models tend to share errors, so this ensemble-based approach should find errors that are shared between models. This attack technique improves targeted transferability by roughly 2000%, from 1.4% targeted accuracy to 29.8% targeted accuracy on average against the held-out model. We also observed that adversarial examples generated at higher JPEG compression levels do not transfer well to models trained with lower JPEG compression levels. This intuitively makes sense, since lower compression levels significantly change the image while higher compression levels tend to leave the image as is. This result demonstrates that attacks only get better.
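A sketch of this ensemble-based targeted attack, in the spirit of Liu et al. [10]: iteratively descend the target-class loss averaged over the three held-in JPEG-trained models, then evaluate the result against the held-out model. `target_loss_grad(member, x, target)` is a hypothetical callable returning the gradient of that member's loss for the target class.

```python
import numpy as np

def ensemble_targeted_attack(x, target, held_in_members, target_loss_grad,
                             epsilon=16.0, step_size=2.0, steps=20):
    """Targeted transfer attack using an ensemble of held-in JPEG-trained models."""
    x_adv = x.astype(np.float64).copy()
    for _ in range(steps):
        grads = [target_loss_grad(member, x_adv, target) for member in held_in_members]
        x_adv = x_adv - step_size * np.sign(np.mean(grads, axis=0))  # move toward the target class
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)             # stay within the L-infinity ball
        x_adv = np.clip(x_adv, 0.0, 255.0)
    return x_adv
```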

A full black-box transferability analysis of SHIELD requires one to train their own substitute models. We leave this analysis, which would use newly JPEG-trained models for ensemble-based black-box transferability, as future work. Similarly, we believe our FGM attacks do not transfer well because of the randomness in the SLQ pre-processing. Averaging over this randomness (as iterative attacks like BIM and PGD effectively do) may overcome this issue and produce even more transferable adversarial examples, despite FGM being known to produce more transferable examples [10]. We leave this for future work.

5 Conclusion

As with any empirical security analysis, our results represent upper bounds on the robustness of SHIELD and JPEG/SLQ pre-processing as defenses. Attacks only get stronger. We believe that an adversary could better target gray-box attacks using the technique that recently won both the non-targeted and targeted adversarial attack competitions at the 2017 Competition on Adversarial Attacks and Defenses [11], or adversaries could use query-based black-box methods to increase their targeted attack success [12]. Moreover, while understanding the limits of a model with respect to some attack strength is instructive, ultimately adversaries will find more clever ways to fool a model [13, 14].

Our results show that SHIELD is not robust in the white-box threat model (0.1% accuracy), while in the gray-box threat model SHIELD fares worse than Ensemble Adversarial Training [15]. While the training time of these state-of-the-art defenses is longer than the time it takes to train a SHIELD-based defense, their inference time is faster because they do not require an ensemble of models nor any special pre-processing steps. Finally, although our results demonstrate weaknesses in SHIELD at a single attack strength, the purpose of our work was to understand the efficacy of SHIELD against adaptive attack techniques and under different threat models. We hope that future defenses will employ similar adaptive attack techniques to demonstrate their robustness at a variety of attack strengths and in the face of different, perhaps more realistic, threat models.

Acknowledgments

We thank Nilaksh Das for helping us run SHIELD and for providing checkpoints for the JPEG-trained models, and Richard Shin for sharing his differentiable JPEG implementation.

References

Appendix A Raw Data

We ran several experiments to better understand SHIELD. Below, we present the raw data from those experiments. In each table, we attacked the model (or ensemble of models A, B, C in the case of JPEG-A-B-C) specified in the first column using the attack specified in the upper-left cell. Our attack presumes knowledge of JPEG compression pre-processing (but not of any specific compression level). We then evaluated our attack against all other models (an off-the-shelf trained model, models trained at different JPEG compression levels, and the SHIELD model). Each defensive model employs Stochastic Local Quantization as a pre-processing step. This style of pre-processing typically adds 1-2% more robustness compared to no pre-processing.

The diagonal cells of each table represent “white-box” attacks, since the adversarial images were generated using the same model that was used for defensive purposes. The off-diagonal cells represent “gray-box” attacks, where the adversary assumes JPEG pre-processing is occurring in the defense.

We ran three different targeted attacks (Fast Gradient Method, Basic Iterative Method, and Projected Gradient Descent) against these models by targeting the least-likely label that the model predicts. For more detail about these attacks, we recommend reading the Cleverhans technical report [7]. Finally, for each attack we report the accuracy of the defense with respect to the ground-truth label (where higher accuracy indicates a more robust model) and with respect to the targeted label (where higher accuracy indicates that an adversary can choose a target). We report the Top-1 and Top-5 (in parentheses) accuracy for each attacked-model/defending-model combination. Top-5 accuracy is not reported when the defending model is SHIELD because of SHIELD's use of majority-vote ensembling. We have color coded each cell of the table to correspond to the threat model: white cells are white-box attacks, while gray cells represent the gray-box threat model described above.
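For clarity, the two metrics reported in the tables can be computed with the same Top-k routine; only the reference labels change. The sketch below assumes `logits` is an N x C array of the defending model's outputs and `labels` is either the ground-truth labels (robustness) or the attacker's target labels (targeted success).

```python
import numpy as np

def topk_accuracy(logits, labels, k=5):
    """Fraction of examples whose reference label is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]                   # indices of the k largest logits
    hits = [label in row for row, label in zip(topk, labels)]
    return float(np.mean(hits))

# Top-1 against the ground truth measures robustness; Top-1 against the least-likely
# target measures whether the adversary can force a chosen prediction.
```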

Defending Model
Attacked Model ResNet-v2-50 JPEG-20 JPEG-40 JPEG-60 JPEG-80 SHIELD
ResNet-v2-50 15.0% (35.9%) 16.5% (32.8%) 19.9% (39.0%) 21.3% (42.0%) 17.8% (36.7%) 21.9% (—)
JPEG-20 29.0% (53.5%) 2.00% (5.40%) 11.5% (27.0%) 12.3% (26.9%) 9.70% (24.2%) 11.9% (—)
JPEG-40 27.4% (50.2%) 9.40% (21.9%) 2.20% (7.30%) 12.1% (26.1%) 10.1% (22.9%) 10.3% (—)
JPEG-60 27.7% (50.1%) 9.70% (23.0%) 11.2% (24.9%) 2.10% (7.90%) 10.1% (24.1%) 11.3% (—)
JPEG-80 28.6% (52.2%) 10.3% (22.9%) 12.0% (27.4%) 12.8% (29.6%) 2.50% (7.70%) 12.8% (—)
SHIELD 18.8% (38.1%) 2.70% (6.50%) 3.20% (8.50%) 3.30% (10.4%) 3.00% (7.10%) 3.50% (—)
JPEG-20-40-60 19.7% (39.9%) 2.30% (6.40%) 3.00% (8.50%) 3.20% (10.3%) 6.80% (14.9%) 4.90% (—)
JPEG-20-40-80 20.9% (41.3%) 2.50% (6.80%) 3.00% (9.30%) 7.30% (19.5%) 3.00% (7.60%) 4.90% (—)
JPEG-20-60-80 20.3% (42.4%) 2.60% (6.30%) 5.90% (18.3%) 3.90% (10.5%) 2.80% (7.80%) 4.40% (—)
JPEG-40-60-80 20.7% (40.4%) 6.20% (14.5%) 3.00% (9.50%) 3.40% (10.7%) 3.00% (8.40%) 4.70% (—)
No Attack 69.7% (90.2%) 52.9% (76.3%) 58.0% (82.5%) 57.5% (80.6%) 58.6% (82.8%) 62.8% (—)
Table 1: Top-1 (Top-5) accuracy of the defending model against targeted least-likely Fast Gradient Method (FGM) adversarial examples computed on the attacked model. A higher percentage indicates a more robust model.
Defending Model
Attacked Model ResNet-v2-50 JPEG-20 JPEG-40 JPEG-60 JPEG-80 SHIELD
ResNet-v2-50 0.10% (5.20%) 32.2% (56.6%) 35.3% (60.5%) 32.1% (56.4%) 33.7% (59.7%) 40.1% (—)
JPEG-20 49.0% (72.5%) 0.00% (0.00%) 30.8% (53.6%) 26.9% (49.2%) 27.7% (52.0%) 29.8% (—)
JPEG-40 46.7% (72.3%) 25.7% (46.4%) 0.10% (0.20%) 24.8% (44.4%) 27.3% (49.8%) 27.8% (—)
JPEG-60 48.1% (72.8%) 25.9% (50.6%) 31.1% (53.8%) 0.10% (0.40%) 26.6% (50.3%) 30.5% (—)
JPEG-80 52.0% (75.2%) 29.2% (52.6%) 33.0% (56.3%) 29.1% (52.8%) 0.10% (1.20%) 32.2% (—)
SHIELD 19.0% (38.4%) 0.20% (2.20%) 0.20% (1.00%) 0.10% (0.80%) 0.40% (1.00%) 0.20% (—)
JPEG-20-40-60 24.4% (47.1%) 0.30% (1.60%) 0.10% (0.70%) 0.20% (1.10%) 8.00% (19.2%) 0.30% (—)
JPEG-20-40-80 26.4% (49.3%) 0.10% (1.20%) 0.20% (0.90%) 9.20% (21.7%) 0.30% (1.30%) 0.30% (—)
JPEG-20-60-80 27.7% (49.3%) 0.20% (1.50%) 13.3% (26.9%) 0.00% (0.40%) 0.20% (1.20%) 0.30% (—)
JPEG-40-60-80 26.2% (48.7%) 11.6% (27.2%) 0.30% (1.30%) 0.10% (0.70%) 0.30% (1.50%) 0.40% (—)
No Attack 69.7% (90.2%) 52.9% (76.3%) 58.0% (82.5%) 57.5% (80.6%) 58.6% (82.8%) 62.8% (—)
Table 2: Top-1 (Top-5) accuracy of the defending model against targeted least-likely Basic Iterative Method (BIM) adversarial examples computed on the attacked model. A higher percentage indicates a more robust model.
Defending Model
Attacked Model ResNet-v2-50 JPEG-20 JPEG-40 JPEG-60 JPEG-80 SHIELD
ResNet-v2-50 0.10% (5.90%) 30.1% (54.7%) 34.3% (57.8%) 31.1% (56.3%) 34.4% (58.6%) 39.4% (—)
JPEG-20 48.7% (74.0%) 0.00% (0.00%) 30.1% (52.5%) 25.5% (49.5%) 25.7% (49.2%) 29.2% (—)
JPEG-40 46.6% (72.8%) 23.1% (43.2%) 0.00% (0.30%) 24.7% (46.1%) 25.7% (47.0%) 26.3% (—)
JPEG-60 48.9% (72.1%) 23.1% (45.5%) 28.1% (51.4%) 0.00% (0.20%) 26.7% (48.1%) 27.1% (—)
JPEG-80 51.8% (74.3%) 25.1% (47.3%) 30.7% (53.0%) 27.5% (50.7%) 0.10% (0.70%) 29.0% (—)
SHIELD 17.3% (36.4%) 0.10% (0.60%) 0.10% (0.40%) 0.00% (0.50%) 0.30% (0.80%) 0.20% (—)
JPEG-20-40-60 24.4% (45.4%) 0.00% (0.60%) 0.00% (0.30%) 0.00% (0.50%) 6.70% (17.7%) 0.10% (—)
JPEG-20-40-80 26.0% (47.7%) 0.00% (0.30%) 0.10% (0.60%) 9.10% (20.7%) 0.10% (0.80%) 0.10% (—)
JPEG-20-60-80 27.0% (50.0%) 0.10% (0.40%) 11.3% (24.7%) 0.00% (0.20%) 0.20% (0.80%) 0.10% (—)
JPEG-40-60-80 24.5% (49.2%) 9.70% (24.3%) 0.00% (0.70%) 0.00% (0.30%) 0.30% (0.70%) 0.20% (—)
No Attack 69.7% (90.2%) 52.9% (76.3%) 58.0% (82.5%) 57.5% (80.6%) 58.6% (82.8%) 62.8% (—)
Table 3: Top-1 (Top-5) accuracy of the defending model against targeted least-likely Projected Gradient Descent (PGD) adversarial examples computed on the attacked model. A higher percentage indicates a more robust model.

Tables 1, 2, and 3 show the robustness of each model against the various attacks. We measure robustness as the accuracy of the defense model with respect to the ground-truth label of the image from which the attack generated its adversarial example. Although each attack targeted a specific (least-likely) label, robustness alone fails to capture whether the attacker was successful at forcing the desired prediction for an adversarial example.

Defending Model
Attacked Model ResNet-v2-50 JPEG-20 JPEG-40 JPEG-60 JPEG-80 SHIELD
ResNet-v2-50 0.20% (0.40%) 0.10% (0.30%) 0.00% (0.20%) 0.00% (0.20%) 0.00% (0.10%) 0.00% (—)
JPEG-20 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (—)
JPEG-40 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (—)
JPEG-60 0.00% (0.00%) 0.00% (0.20%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.10%) 0.00% (—)
JPEG-80 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (—)
SHIELD 0.00% (0.10%) 0.00% (0.00%) 0.00% (0.10%) 0.00% (0.10%) 0.00% (0.10%) 0.00% (—)
JPEG-20-40-60 0.10% (0.20%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.10%) 0.00% (0.00%) 0.00% (—)
JPEG-20-40-80 0.00% (0.20%) 0.00% (0.00%) 0.00% (0.10%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (—)
JPEG-20-60-80 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.10%) 0.00% (—)
JPEG-40-60-80 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.00%) 0.00% (0.10%) 0.00% (—)
Table 4: Top-1 (Top-5) targeted attack success against the defending model using least-likely Fast Gradient Method (FGM) adversarial examples computed on the attacked model. A higher percentage indicates that an adversary can choose a target.
Defending Model
Attacked Model ResNet-v2-50 JPEG-20 JPEG-40 JPEG-60 JPEG-80 SHIELD
ResNet-v2-50 99.2% (99.7%) 0.40% (1.90%) 1.20% (3.60%) 1.10% (3.10%) 0.70% (4.40%) 0.80% (—)
JPEG-20 0.50% (2.60%) 98.4% (99.3%) 1.40% (3.40%) 2.70% (5.50%) 2.00% (4.70%) 23.2% (—)
JPEG-40 0.90% (4.80%) 0.40% (1.10%) 99.3% (99.6%) 1.90% (5.30%) 2.50% (6.90%) 28.5% (—)
JPEG-60 0.90% (4.80%) 1.20% (1.60%) 1.20% (2.60%) 98.8% (99.6%) 2.50% (7.90%) 24.4% (—)
JPEG-80 0.50% (1.70%) 0.00% (0.50%) 0.60% (1.70%) 0.60% (2.90%) 98.8% (99.2%) 25.2% (—)
SHIELD 38.7% (64.5%) 93.7% (96.3%) 95.6% (97.5%) 96.8% (98.5%) 97.4% (98.6%) 97.2% (—)
JPEG-20-40-60 25.6% (48.5%) 94.8% (97.8%) 96.4% (98.7%) 97.8% (99.0%) 37.2% (56.1%) 97.3% (—)
JPEG-20-40-80 23.4% (45.6%) 94.6% (97.0%) 96.5% (98.2%) 32.4% (52.3%) 97.7% (98.8%) 97.0% (—)
JPEG-20-60-80 21.1% (44.2%) 94.6% (97.0%) 23.5% (42.0%) 96.3% (98.7%) 97.8% (98.7%) 97.6% (—)
JPEG-40-60-80 24.0% (46.7%) 8.10% (22.4%) 96.8% (98.2%) 97.1% (98.8%) 97.9% (98.8%) 98.0% (—)
Table 5: Top-1 (Top-5) targeted attack success against the defending model using least-likely Basic Iterative Method (BIM) adversarial examples computed on the attacked model. A higher percentage indicates that an adversary can choose a target.
Defending Model
Attacked Model ResNet-v2-50 JPEG-20 JPEG-40 JPEG-60 JPEG-80 SHIELD
ResNet-v2-50 99.9% (100.%) 0.40% (1.60%) 1.10% (2.90%) 0.10% (2.50%) 1.00% (4.50%) 1.00% (—)
JPEG-20 1.00% (2.10%) 99.8% (99.9%) 1.10% (3.60%) 2.30% (5.80%) 2.40% (6.20%) 24.0% (—)
JPEG-40 0.80% (4.50%) 0.00% (0.50%) 99.8% (99.8%) 2.40% (6.70%) 2.80% (7.40%) 30.6% (—)
JPEG-60 0.90% (4.80%) 0.60% (1.60%) 1.00% (2.50%) 99.9% (100.%) 2.60% (7.20%) 26.3% (—)
JPEG-80 0.50% (1.60%) 0.00% (0.40%) 0.60% (2.00%) 0.70% (2.30%) 99.6% (99.9%) 27.6% (—)
SHIELD 45.5% (69.7%) 98.3% (99.2%) 98.9% (99.5%) 99.3% (99.7%) 99.1% (99.4%) 99.4% (—)
JPEG-20-40-60 30.1% (53.8%) 98.8% (99.1%) 99.4% (99.6%) 99.7% (99.9%) 41.9% (61.0%) 99.4% (—)
JPEG-20-40-80 26.8% (51.9%) 98.4% (99.0%) 99.1% (99.4%) 38.9% (61.1%) 99.0% (99.2%) 99.0% (—)
JPEG-20-60-80 28.3% (51.2%) 98.9% (99.3%) 29.6% (48.9%) 99.4% (99.6%) 99.3% (99.4%) 99.4% (—)
JPEG-40-60-80 29.4% (54.3%) 8.70% (21.7%) 99.1% (99.5%) 99.4% (99.7%) 99.2% (99.3%) 99.3% (—)
Table 6: Top-1 (Top-5) targeted attack success against the defending model using least-likely Projected Gradient Descent (PGD) adversarial examples computed on the attacked model. A higher percentage indicates that an adversary can choose a target.

Tables 4, 5, and 6 show the targeted attack success against each model for the various attacks. We measure targeted attack success as the accuracy of the defense model with respect to the attacker's target label, i.e., the label for which the attack generated its adversarial example. Each attack targeted a specific (least-likely) label, so higher values in these tables indicate that an attacker can control the predicted output of adversarial examples.

Appendix B SHIELD Implementation Issues

In the original SHIELD implementation, we discovered an issue in the experimental evaluation code that affects the results reported in the published paper [1]. During training, central crops are enabled to better learn discriminative features for the object of interest. However, during evaluation it is customary to turn off this central cropping. The implementation of SHIELD does not turn off this cropping during evaluation, nor did the attacks in the SHIELD paper [1] take this cropping into account when generating perturbations.
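To make the issue concrete: a central crop at evaluation time discards the border pixels of the (possibly perturbed) image before it reaches the model, so a perturbation computed on the full image is partially thrown away. The sketch below shows such a crop in numpy; the 0.875 fraction is only illustrative of typical ImageNet preprocessing and is not taken from the SHIELD code.

```python
import numpy as np

def central_crop(image, fraction=0.875):
    """Keep only the central `fraction` of the image along both spatial axes."""
    h, w = image.shape[:2]
    crop_h, crop_w = int(h * fraction), int(w * fraction)
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]
```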

We discovered this issue after noticing that our adversarial images were not working against the provided implementation of SHIELD. We found that our images were adversarial against our own re-implementation of SHIELD using the same SLQ parameters and model parameters, but not against the publicly released implementation (https://github.com/poloclub/jpeg-defense at commit 1576429cf199c38065b941a48b0fcd7747901457). When we used the implementation of SHIELD provided to us with the same SLQ parameters and model checkpoints, those same adversarial images failed to remain adversarial. After some investigation, we found the aforementioned central-cropping-at-evaluation-time issue, disabled the central cropping, and found that our adversarial images remained adversarial.

As such, central cropping is now a feature of SHIELD, in the sense that the evaluation reported in the published paper [1] is SLQ with central cropping. We reported this issue to the authors of SHIELD, who acknowledged it. Private correspondence with the authors also confirms our own experiments, which reveal that much of the reported robustness of SHIELD comes from this central cropping and not from SLQ nor from the JPEG-retrained models. Given that central cropping is deterministic, we do not believe central cropping is a viable defense, nor do we believe it is viable in combination with SLQ.