Adversarial examples are inputs that adversaries find or craft to fool a machine learning model. Even the most accurate image classification models that rely upon state-of-the-art deep neural networks are potentially vulnerable to these adversarial examples [3, 4, 5]. Thus, developing defenses against adversarial examples is an important research problem, especially when these models are deployed in safety- or security-critical systems.
One such recently proposed defense, Secure Heterogeneous Image Ensemble with Localized Denoising (SHIELD), uses a combination of techniques to defend against adversarial examples in image classification systems. SHIELD trains several models with JPEG-compressed images at randomly chosen compression levels. These JPEG-trained models are treated as a majority-vote ensemble to yield the final classification result for a given input image. At inference time, SHIELD also applies Stochastic Local Quantization (SLQ), a randomized form of JPEG compression, as a pre-processing step. In the face of a static adversary that only knows the model architecture (but not its weights nor any defensive measures), SHIELD reports only a small decrease in accuracy against adversarial examples. However, we know that most defenses fail against adaptive adversaries and/or under different threat models. Thus, we evaluate this defense against an adaptive adversary and under a different threat model to better understand the limits of SHIELD.
When deploying a defense to mitigate adversarial examples, one must understand the limits of that defense in terms of the strength of the attacker and the threat models it was tested under. We usually quantify a defense’s capability as a well-defined threat model along with a security curve that plots the accuracy of the model in the face of an adversary that can manipulate the image at different strengths for that threat model. We measure strength as the difference between a test image and its corresponding adversarial image generated via an attack method. Such difference images are high-dimensional, so we often summarize this difference by computing a metric like the ℓ∞ norm (largest absolute difference) or the ℓ2 norm (Euclidean distance). At the limit of the strength (i.e., the ability to give pixels any value), the attacker can simply use those examples that are already incorrectly classified or just replace the image with another image of their desired target. It turns out, however, that many models exhibit weaknesses at much smaller strengths, where it is difficult for humans to visually distinguish any adversarial perturbation. Note that the study of these security curves is of interest to defenders due to the asymmetry between adversaries and defenders: adversaries need only find one particular weakness, while defenders ought to mitigate all possible weaknesses.
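As a concrete illustration (ours, not from the paper), the two summary metrics can be computed directly from the difference image:

```python
import numpy as np

def linf_distance(x, x_adv):
    """Largest absolute per-pixel difference across all channels."""
    return float(np.max(np.abs(x_adv - x)))

def l2_distance(x, x_adv):
    """Euclidean distance between the flattened images."""
    return float(np.linalg.norm((x_adv - x).ravel()))

# Toy example: a 2x2 grayscale "image" and a perturbed copy.
x = np.array([[10.0, 20.0], [30.0, 40.0]])
x_adv = x + np.array([[2.0, -2.0], [2.0, -2.0]])  # every pixel moved by 2

print(linf_distance(x, x_adv))  # 2.0
print(l2_distance(x, x_adv))    # sqrt(4 * 2^2) = 4.0
```

Note how the two metrics diverge: moving every pixel by the same small amount keeps the ℓ∞ distance small while the ℓ2 distance grows with the number of perturbed pixels.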
In this work we only examine strength in terms of ℓ∞ because it is easy to understand: what is the maximum pixel-value deviation an attacker can apply across all image channels? Figure 1 shows two different adversarial images with their corresponding perturbations. Both perturbations have the same ℓ∞ distance, yet exhibit different ℓ2 distances. The ℓ2 distance of the koala perturbation is larger than the ℓ2 distance of the alps perturbation, yet the alps perturbation is easily discernible in the adversarial image. Finding better metrics to summarize adversarial perturbations that take into account human perception remains an open problem. Because SHIELD was evaluated against ImageNet, we chose the same attacker strength that other state-of-the-art defenses on ImageNet use: an ℓ∞ bound measured out of a maximum pixel value of 255.
3 White-box Robustness
In the white-box threat model, the attacker is allowed access to all parts of the proposed defense. For SHIELD, this includes the model architecture, weights, and any JPEG pre-processing (i.e., SLQ). Such access enables an attacker to compute changes to a desired image such that those changes, the perturbation, would fool the defended model. This computation, backpropagation, is exactly the same method one would use to learn such a model except that, rather than changing the weights of the model, the attacker seeks to change only the input image. A variety of these types of attacks exist. For the white-box scenario, we rely upon the projected gradient descent (PGD) method of attack as implemented in Cleverhans. For details on this attack, we refer readers to the Cleverhans technical report.
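As a rough sketch of the attack loop (our toy illustration with a made-up gradient function, not the Cleverhans implementation), PGD repeatedly steps along the sign of the input gradient and projects back into the allowed ℓ∞ ball:

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, step_size, n_steps):
    """Projected gradient descent under an l-infinity constraint.

    grad_fn(x_adv) returns the gradient of the attacker's loss
    w.r.t. the input; we ascend it and project back into the
    eps-ball around the original image x after every step.
    """
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step_size * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 255.0)        # stay a valid image
    return x_adv

# Toy loss whose gradient always pushes pixels upward.
x = np.full((2, 2), 100.0)
x_adv = pgd_linf(x, grad_fn=lambda z: 255.0 - z, eps=8.0, step_size=2.0, n_steps=10)
print(np.max(np.abs(x_adv - x)))  # 8.0: the perturbation never exceeds eps
```

In a real attack, `grad_fn` would be the gradient of the classification loss through the (approximated) defense, computed by the deep learning framework.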
Backpropagation requires that all operations in the model be differentiable. By using JPEG compression, which is non-differentiable, SHIELD forces attackers to create differentiable approximations of JPEG compression. A recent paper, JPEG-resistant Adversarial Images, describes how to do this, and we apply the authors’ differentiable JPEG approximation when attacking SHIELD. Similarly, because the majority voting of the ensemble of models in SHIELD is also non-differentiable, we approximate the majority vote by averaging the logits output by each model before applying the softmax function. One could apply alternative ensemble approximations, but we found averaging to be effective.
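The logit-averaging surrogate can be sketched as follows; the logits and the four-class setup are hypothetical, and a real attack would operate on the models' tensors inside the framework's computation graph rather than on numpy arrays:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def ensemble_probs(logits_per_model):
    """Differentiable surrogate for majority voting: average the
    logits of every ensemble member, then apply one softmax."""
    avg_logits = np.mean(np.stack(logits_per_model), axis=0)
    return softmax(avg_logits)

# Three hypothetical 4-class models that agree on class 2.
logits = [np.array([0.1, 0.2, 3.0, 0.5]),
          np.array([0.3, 0.1, 2.5, 0.2]),
          np.array([0.0, 0.4, 2.8, 0.1])]
print(np.argmax(ensemble_probs(logits)))  # 2
```

Unlike a hard majority vote, this averaged quantity is smooth in the inputs, so gradients can flow back through it to the image.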
A further difficulty in naively applying PGD to SHIELD is the stochasticity of the SLQ pre-processing step. Because SLQ introduces randomness in the input presented to the model, the model effectively sees a different, albeit similar, input every time for the same image. SHIELD relies upon its ensemble of JPEG-trained models to ensure that the same input image, even after SLQ pre-processing, still maps to the same prediction. The attacker, however, typically has no control over this randomness. To circumvent this issue, we compress the image at many different compression levels (including no compression) and average over these levels to find an adversarial perturbation. We refer readers to the JPEG-resistant Adversarial Images paper for more details on this technique for making adversarial images robust to different JPEG compression levels.
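A minimal sketch of this averaging idea follows. The `quantize` function here is a crude stand-in for JPEG at a given quality level (the real attack uses the differentiable JPEG approximation), and `grad_fn` is a hypothetical loss gradient:

```python
import numpy as np

def quantize(x, q):
    """Crude stand-in for JPEG at quality level q: lower quality
    means a coarser quantization grid. A real attack would use a
    differentiable JPEG approximation here instead."""
    step = 256.0 / q
    return np.round(x / step) * step

def avg_grad_over_levels(x, grad_fn, levels):
    """Average the attack gradient over several compression levels so
    the perturbation survives whichever level SLQ happens to pick."""
    grads = [grad_fn(quantize(x, q)) for q in levels]
    grads.append(grad_fn(x))  # also include "no compression"
    return np.mean(grads, axis=0)

x = np.full((2, 2), 100.0)
g = avg_grad_over_levels(x, grad_fn=lambda z: 255.0 - z, levels=[8, 16, 32, 64])
print(g.shape)  # (2, 2)
```

The averaged gradient `g` then drives the usual PGD update, so a single perturbation is optimized against many compression levels at once.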
We sampled 1000 of the ImageNet validation images and used PGD to generate adversarial images. Only a small fraction of these adversarial images are correctly classified by SHIELD; the remainder were classified as something other than the ground-truth label. Furthermore, an attacker is also able to force SHIELD to choose a prediction of the attacker's choosing. Here we chose the prediction target to be the least-likely class that SHIELD predicts on the original image. Such a target should be difficult for an attacker to hit, yet our results show that nearly all of those adversarial images were successfully predicted as the least-likely label. We found similar results hold for JPEG-trained model defenses and JPEG/SLQ pre-processing defenses (see the Appendix for more information). In the white-box threat model, neither SHIELD, JPEG/SLQ pre-processing, nor JPEG-trained models are robust to adversarial examples of small ℓ∞ distance, and an attacker can control the predicted target of those adversarial examples.
4 Gray-box Robustness
We also examined a gray-box threat model different from the one that SHIELD was originally analyzed under, which assumed that the adversary had knowledge of the model weights and architecture but was ignorant of the defense. We examined an alternative gray-box threat model: the attacker is aware of the defense and the model architecture, but oblivious to the weights of the model. We believe such a threat model is more realistic since many applications employ existing architectures but with weights learned for their specific task. Furthermore, it is more realistic that the weights of a model can remain confidential by using a trusted execution environment or deployment as a cloud API. Finally, we believe our assumption that the attacker is aware that a JPEG-based defense is in use is more reasonable. Like model architectures, the set of known defenses is small enough that an attacker can exhaustively try them until they are successful. In fact, our gray-box threat model does not assume the attacker knows the exact JPEG compression levels used. Even so, the number of compression levels that yield meaningfully different image distributions is small. After all, the whole purpose of JPEG compression is to preserve the semantic information in the image!
Rather than employing gradient-estimation techniques that rely upon repeated queries to the model, we employed a weaker form of attack based upon transferability. It turns out that even different model architectures trained on the same dataset tend to be susceptible to the same adversarial examples. This phenomenon, called black-box transferability, is well studied. We studied SHIELD in the face of such transfer attacks. To do this, we first generated adversarial examples against an off-the-shelf pre-trained model using an adaptive attack procedure similar to the white-box attacks described above. However, rather than using PGD, we used the Fast Gradient Method (FGM) because examples generated using FGM tend to transfer better. We then fed these adversarial examples into SHIELD to determine whether they fool it.
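FGM is the single-step special case of the iterative attacks above; a toy sketch with a hypothetical gradient function:

```python
import numpy as np

def fgm_linf(x, grad_fn, eps):
    """Fast Gradient Method: one signed-gradient step of size eps,
    clipped back to the valid pixel range."""
    x_adv = x + eps * np.sign(grad_fn(x))
    return np.clip(x_adv, 0.0, 255.0)

# Toy gradient that pushes every pixel downward.
x = np.full((2, 2), 100.0)
x_adv = fgm_linf(x, grad_fn=lambda z: z - 255.0, eps=8.0)
print(np.max(np.abs(x_adv - x)))  # 8.0: a single step of exactly eps
```

Because it takes only one step, FGM underfits the source model's decision boundary, which is one common explanation for why its examples tend to transfer between models better than heavily iterated ones.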
We created our gray-box substitute model by taking an off-the-shelf model and adding SLQ pre-processing. We generated adversarial images using FGM from those same 1000 ImageNet validation images. Of those 1000 adversarial images, 781 successfully transferred from our gray-box substitute to SHIELD, while, on average, 812 of those images also fooled the JPEG-trained models alone. However, in all of these tests, we found that the adversary was unable to control the prediction target. That is, the adversarial images we generated classified neither to the ground-truth label nor to the targeted label, but to some other label. Whether the attacker's ability to control the predicted output is of concern is application dependent. In authentication usages, for example, the attacker might impersonate a specific user with targeted attacks or simply any authorized user with untargeted attacks. In this gray-box threat model, SHIELD, JPEG/SLQ pre-processing, and JPEG-trained models all show a significant decrease in robustness to adversarial examples; however, an attacker cannot control the prediction as they could in the white-box threat model.
Because of the low targeted black-box transferability, we used an ensemble-based approach to generate transferable targeted examples as described in Liu et al. To do this, we computed adversarial examples using 3 of the 4 JPEG-trained models. One can think of this as the adversary having trained their own JPEG-trained models and then using them to attack a separate, unknown JPEG-trained model. Our theory is that JPEG-trained models tend to share errors, so this ensemble-based approach should find the errors that are shared between models. This attack technique improves targeted transferability by roughly 2000%, from 1.4% targeted accuracy to 29.8% targeted accuracy on average against the held-out model. We also notice that adversarial examples generated at higher JPEG compression levels do not transfer well to models trained with smaller JPEG compression levels. This intuitively makes sense since lower compression levels significantly change the image while higher compression levels tend to leave the image as is. This result demonstrates that attacks only get better.
A full black-box transferability analysis of SHIELD requires one to train one's own substitute models. We leave this analysis, which would use newly JPEG-trained models for ensemble-based black-box transferability, as future work. Similarly, we believe our FGM attacks do not transfer as well as they could because of the randomness in the SLQ pre-processing. Averaging over this randomness (as BIM and PGD can, because they are iterative) may overcome this issue and produce even more transferable adversarial examples, even though FGM is generally known to produce more transferable examples. We leave this for future work.
As with any empirical security analysis, our results represent upper bounds on the robustness of SHIELD and JPEG/SLQ pre-processing as defenses. Attacks only get stronger. We believe that an adversary can better target gray-box attacks using a technique that recently won both the non-targeted and targeted adversarial attack competitions at the 2017 Competition on Adversarial Attacks and Defenses, or adversaries will use query-based black-box methods to increase their targeted attack success. Moreover, while understanding the limits of a model with respect to some perturbation bound is instructive, ultimately adversaries will find more clever ways to fool a model [13, 14].
Our results show that SHIELD is not robust in the white-box threat model, while in the gray-box threat model SHIELD fares worse than Ensemble Adversarial Training. While the training time of such state-of-the-art defenses is longer than the time it takes to train a SHIELD-based defense, their inference time is faster because they do not require an ensemble of models nor any special pre-processing steps. Finally, although our results demonstrate weaknesses in SHIELD at a single attack strength, the purpose of our work was to understand the efficacy of SHIELD against adaptive attack techniques and under different threat models. We hope that future defenses will employ similar adaptive attack techniques to demonstrate their robustness at a variety of attack strengths and in the face of different, perhaps more realistic, threat models.
We thank Nilaksh Das for helping us run SHIELD and for providing checkpoints for the JPEG-trained models, and Richard Shin for sharing his differentiable JPEG implementation.
-  N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, S. Li, L. Chen, M. E. Kounavis, and D. H. Chau, “Shield: Fast, practical defense and vaccination for deep learning using jpeg compression,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), Aug. 2018. http://doi.acm.org/10.1145/3219819.3219910.
-  I. J. Goodfellow, “Defense against the dark arts: An overview of adversarial example security research and future research directions,” CoRR, vol. abs/1806.04169, 2018. https://arxiv.org/abs/1806.04169.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014. https://openreview.net/forum?id=kklr_MTHMRQjG.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), Apr. 2018. https://openreview.net/forum?id=rJzIBfZAb.
-  N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec), Nov. 2017. http://doi.acm.org/10.1145/3128572.3140444.
-  B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,” CoRR, vol. abs/1712.03141, 2017. https://arxiv.org/abs/1712.03141.
-  N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y.-L. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long, “Technical report on the cleverhans v2.1.0 adversarial examples library,” arXiv preprint arXiv:1610.00768, 2018. http://arxiv.org/abs/1610.00768.
-  R. Shin and D. Song, “Jpeg-resistant adversarial images,” in Proceedings of the Machine Learning and Computer Security Workshop, Dec. 2017. https://machine-learning-and-security.github.io/papers/mlsec17_paper_54.pdf.
-  W. He, J. Wei, X. Chen, N. Carlini, and D. Song, “Adversarial example defense: Ensembles of weak defenses are not strong,” in Proceedings of the 11th USENIX Workshop on Offensive Technologies (WOOT), (Vancouver, BC), USENIX Association, 2017. https://www.usenix.org/conference/woot17/workshop-program/presentation/he.
-  Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” in Proceedings of the 5th International Conference on Learning Representations (ICLR), Apr. 2017. https://openreview.net/forum?id=Sys6GJqxl.
-  Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. https://doi.org/10.1109/CVPR.2018.00957.
-  A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in Proceedings of the 35th International Conference on Machine Learning (ICML), July 2018. http://proceedings.mlr.press/v80/ilyas18a.html.
-  Y. Sharma and P.-Y. Chen, “Attacking the madry defense model with L1-based adversarial examples,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), Apr. 2018. https://openreview.net/forum?id=Sy8WeUJPf.
-  T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,” in Proceedings of the Machine Learning and Computer Security Workshop, Dec. 2017. https://machine-learning-and-security.github.io/papers/mlsec17_paper_27.pdf.
-  F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), Apr. 2018. https://openreview.net/forum?id=rkZvSe-RZ.
Appendix A Raw Data
We ran several experiments to better understand SHIELD. Below, we present the raw data from those experiments. In each table, we attacked the model (or ensemble of models A, B, C in the case of JPEG-A-B-C) specified in the first column using the attack specified in the upper-left cell. Our attack presumes knowledge of JPEG compression pre-processing (but not of any specific compression level). We then evaluated our attack against all other models (an off-the-shelf trained model, models trained at different JPEG compression levels, and the SHIELD model). Each defensive model employs Stochastic Local Quantization as a pre-processing step. This style of pre-processing typically adds 1-2% more robustness as compared to no pre-processing.
The diagonal cells of each table represent “white-box” attacks since the adversarial images were generated using the same model that was used for defensive purposes. The off-diagonal cells represent “gray-box” attacks where the adversary assumes JPEG pre-processing is occurring in the defense.
We ran three different targeted attacks (Fast Gradient Method, Basic Iterative Method, and Projected Gradient Descent) against these models by targeting the least-likely label that the model predicts. For more detail about these attacks, we recommend reading the Cleverhans technical report. Finally, for each attack we report the accuracy of the defense with respect to the ground-truth label (where higher accuracy indicates a more robust model) and with respect to the targeted label (where higher accuracy indicates that an adversary can choose a target). We report the Top-1 and Top-5 (in parentheses) accuracy for each attacked-model/defensive-model combination. Top-5 accuracy is not reported when the defensive model is SHIELD due to SHIELD's use of majority-vote ensembling. We have color coded each cell of the table to correspond to the threat model: white cells are white-box attacks, while gray cells represent the gray-box threat model described above.
|FGM||ResNet-v2-50||JPEG-20||JPEG-40||JPEG-60||JPEG-80||SHIELD|
|ResNet-v2-50||15.0% (35.9%)||16.5% (32.8%)||19.9% (39.0%)||21.3% (42.0%)||17.8% (36.7%)||21.9% (—)|
|JPEG-20||29.0% (53.5%)||2.00% (5.40%)||11.5% (27.0%)||12.3% (26.9%)||9.70% (24.2%)||11.9% (—)|
|JPEG-40||27.4% (50.2%)||9.40% (21.9%)||2.20% (7.30%)||12.1% (26.1%)||10.1% (22.9%)||10.3% (—)|
|JPEG-60||27.7% (50.1%)||9.70% (23.0%)||11.2% (24.9%)||2.10% (7.90%)||10.1% (24.1%)||11.3% (—)|
|JPEG-80||28.6% (52.2%)||10.3% (22.9%)||12.0% (27.4%)||12.8% (29.6%)||2.50% (7.70%)||12.8% (—)|
|SHIELD||18.8% (38.1%)||2.70% (6.50%)||3.20% (8.50%)||3.30% (10.4%)||3.00% (7.10%)||3.50% (—)|
|JPEG-20-40-60||19.7% (39.9%)||2.30% (6.40%)||3.00% (8.50%)||3.20% (10.3%)||6.80% (14.9%)||4.90% (—)|
|JPEG-20-40-80||20.9% (41.3%)||2.50% (6.80%)||3.00% (9.30%)||7.30% (19.5%)||3.00% (7.60%)||4.90% (—)|
|JPEG-20-60-80||20.3% (42.4%)||2.60% (6.30%)||5.90% (18.3%)||3.90% (10.5%)||2.80% (7.80%)||4.40% (—)|
|JPEG-40-60-80||20.7% (40.4%)||6.20% (14.5%)||3.00% (9.50%)||3.40% (10.7%)||3.00% (8.40%)||4.70% (—)|
|No Attack||69.7% (90.2%)||52.9% (76.3%)||58.0% (82.5%)||57.5% (80.6%)||58.6% (82.8%)||62.8% (—)|
|BIM||ResNet-v2-50||JPEG-20||JPEG-40||JPEG-60||JPEG-80||SHIELD|
|ResNet-v2-50||0.10% (5.20%)||32.2% (56.6%)||35.3% (60.5%)||32.1% (56.4%)||33.7% (59.7%)||40.1% (—)|
|JPEG-20||49.0% (72.5%)||0.00% (0.00%)||30.8% (53.6%)||26.9% (49.2%)||27.7% (52.0%)||29.8% (—)|
|JPEG-40||46.7% (72.3%)||25.7% (46.4%)||0.10% (0.20%)||24.8% (44.4%)||27.3% (49.8%)||27.8% (—)|
|JPEG-60||48.1% (72.8%)||25.9% (50.6%)||31.1% (53.8%)||0.10% (0.40%)||26.6% (50.3%)||30.5% (—)|
|JPEG-80||52.0% (75.2%)||29.2% (52.6%)||33.0% (56.3%)||29.1% (52.8%)||0.10% (1.20%)||32.2% (—)|
|SHIELD||19.0% (38.4%)||0.20% (2.20%)||0.20% (1.00%)||0.10% (0.80%)||0.40% (1.00%)||0.20% (—)|
|JPEG-20-40-60||24.4% (47.1%)||0.30% (1.60%)||0.10% (0.70%)||0.20% (1.10%)||8.00% (19.2%)||0.30% (—)|
|JPEG-20-40-80||26.4% (49.3%)||0.10% (1.20%)||0.20% (0.90%)||9.20% (21.7%)||0.30% (1.30%)||0.30% (—)|
|JPEG-20-60-80||27.7% (49.3%)||0.20% (1.50%)||13.3% (26.9%)||0.00% (0.40%)||0.20% (1.20%)||0.30% (—)|
|JPEG-40-60-80||26.2% (48.7%)||11.6% (27.2%)||0.30% (1.30%)||0.10% (0.70%)||0.30% (1.50%)||0.40% (—)|
|No Attack||69.7% (90.2%)||52.9% (76.3%)||58.0% (82.5%)||57.5% (80.6%)||58.6% (82.8%)||62.8% (—)|
|PGD||ResNet-v2-50||JPEG-20||JPEG-40||JPEG-60||JPEG-80||SHIELD|
|ResNet-v2-50||0.10% (5.90%)||30.1% (54.7%)||34.3% (57.8%)||31.1% (56.3%)||34.4% (58.6%)||39.4% (—)|
|JPEG-20||48.7% (74.0%)||0.00% (0.00%)||30.1% (52.5%)||25.5% (49.5%)||25.7% (49.2%)||29.2% (—)|
|JPEG-40||46.6% (72.8%)||23.1% (43.2%)||0.00% (0.30%)||24.7% (46.1%)||25.7% (47.0%)||26.3% (—)|
|JPEG-60||48.9% (72.1%)||23.1% (45.5%)||28.1% (51.4%)||0.00% (0.20%)||26.7% (48.1%)||27.1% (—)|
|JPEG-80||51.8% (74.3%)||25.1% (47.3%)||30.7% (53.0%)||27.5% (50.7%)||0.10% (0.70%)||29.0% (—)|
|SHIELD||17.3% (36.4%)||0.10% (0.60%)||0.10% (0.40%)||0.00% (0.50%)||0.30% (0.80%)||0.20% (—)|
|JPEG-20-40-60||24.4% (45.4%)||0.00% (0.60%)||0.00% (0.30%)||0.00% (0.50%)||6.70% (17.7%)||0.10% (—)|
|JPEG-20-40-80||26.0% (47.7%)||0.00% (0.30%)||0.10% (0.60%)||9.10% (20.7%)||0.10% (0.80%)||0.10% (—)|
|JPEG-20-60-80||27.0% (50.0%)||0.10% (0.40%)||11.3% (24.7%)||0.00% (0.20%)||0.20% (0.80%)||0.10% (—)|
|JPEG-40-60-80||24.5% (49.2%)||9.70% (24.3%)||0.00% (0.70%)||0.00% (0.30%)||0.30% (0.70%)||0.20% (—)|
|No Attack||69.7% (90.2%)||52.9% (76.3%)||58.0% (82.5%)||57.5% (80.6%)||58.6% (82.8%)||62.8% (—)|
Tables 1, 2, and 3 show the robustness of each model against various attacks. We measure robustness as the accuracy of the defense model with respect to the ground-truth label that the attack generated its adversarial example from. Although each attack targeted a specific (least-likely) label, robustness fails to capture whether the attacker was successful at controlling the desired prediction of an adversarial example.
|FGM||ResNet-v2-50||JPEG-20||JPEG-40||JPEG-60||JPEG-80||SHIELD|
|ResNet-v2-50||0.20% (0.40%)||0.10% (0.30%)||0.00% (0.20%)||0.00% (0.20%)||0.00% (0.10%)||0.00% (—)|
|JPEG-20||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (—)|
|JPEG-40||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (—)|
|JPEG-60||0.00% (0.00%)||0.00% (0.20%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.10%)||0.00% (—)|
|JPEG-80||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (—)|
|SHIELD||0.00% (0.10%)||0.00% (0.00%)||0.00% (0.10%)||0.00% (0.10%)||0.00% (0.10%)||0.00% (—)|
|JPEG-20-40-60||0.10% (0.20%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.10%)||0.00% (0.00%)||0.00% (—)|
|JPEG-20-40-80||0.00% (0.20%)||0.00% (0.00%)||0.00% (0.10%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (—)|
|JPEG-20-60-80||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.10%)||0.00% (—)|
|JPEG-40-60-80||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.00%)||0.00% (0.10%)||0.00% (—)|
|BIM||ResNet-v2-50||JPEG-20||JPEG-40||JPEG-60||JPEG-80||SHIELD|
|ResNet-v2-50||99.2% (99.7%)||0.40% (1.90%)||1.20% (3.60%)||1.10% (3.10%)||0.70% (4.40%)||0.80% (—)|
|JPEG-20||0.50% (2.60%)||98.4% (99.3%)||1.40% (3.40%)||2.70% (5.50%)||2.00% (4.70%)||23.2% (—)|
|JPEG-40||0.90% (4.80%)||0.40% (1.10%)||99.3% (99.6%)||1.90% (5.30%)||2.50% (6.90%)||28.5% (—)|
|JPEG-60||0.90% (4.80%)||1.20% (1.60%)||1.20% (2.60%)||98.8% (99.6%)||2.50% (7.90%)||24.4% (—)|
|JPEG-80||0.50% (1.70%)||0.00% (0.50%)||0.60% (1.70%)||0.60% (2.90%)||98.8% (99.2%)||25.2% (—)|
|SHIELD||38.7% (64.5%)||93.7% (96.3%)||95.6% (97.5%)||96.8% (98.5%)||97.4% (98.6%)||97.2% (—)|
|JPEG-20-40-60||25.6% (48.5%)||94.8% (97.8%)||96.4% (98.7%)||97.8% (99.0%)||37.2% (56.1%)||97.3% (—)|
|JPEG-20-40-80||23.4% (45.6%)||94.6% (97.0%)||96.5% (98.2%)||32.4% (52.3%)||97.7% (98.8%)||97.0% (—)|
|JPEG-20-60-80||21.1% (44.2%)||94.6% (97.0%)||23.5% (42.0%)||96.3% (98.7%)||97.8% (98.7%)||97.6% (—)|
|JPEG-40-60-80||24.0% (46.7%)||8.10% (22.4%)||96.8% (98.2%)||97.1% (98.8%)||97.9% (98.8%)||98.0% (—)|
|PGD||ResNet-v2-50||JPEG-20||JPEG-40||JPEG-60||JPEG-80||SHIELD|
|ResNet-v2-50||99.9% (100.%)||0.40% (1.60%)||1.10% (2.90%)||0.10% (2.50%)||1.00% (4.50%)||1.00% (—)|
|JPEG-20||1.00% (2.10%)||99.8% (99.9%)||1.10% (3.60%)||2.30% (5.80%)||2.40% (6.20%)||24.0% (—)|
|JPEG-40||0.80% (4.50%)||0.00% (0.50%)||99.8% (99.8%)||2.40% (6.70%)||2.80% (7.40%)||30.6% (—)|
|JPEG-60||0.90% (4.80%)||0.60% (1.60%)||1.00% (2.50%)||99.9% (100.%)||2.60% (7.20%)||26.3% (—)|
|JPEG-80||0.50% (1.60%)||0.00% (0.40%)||0.60% (2.00%)||0.70% (2.30%)||99.6% (99.9%)||27.6% (—)|
|SHIELD||45.5% (69.7%)||98.3% (99.2%)||98.9% (99.5%)||99.3% (99.7%)||99.1% (99.4%)||99.4% (—)|
|JPEG-20-40-60||30.1% (53.8%)||98.8% (99.1%)||99.4% (99.6%)||99.7% (99.9%)||41.9% (61.0%)||99.4% (—)|
|JPEG-20-40-80||26.8% (51.9%)||98.4% (99.0%)||99.1% (99.4%)||38.9% (61.1%)||99.0% (99.2%)||99.0% (—)|
|JPEG-20-60-80||28.3% (51.2%)||98.9% (99.3%)||29.6% (48.9%)||99.4% (99.6%)||99.3% (99.4%)||99.4% (—)|
|JPEG-40-60-80||29.4% (54.3%)||8.70% (21.7%)||99.1% (99.5%)||99.4% (99.7%)||99.2% (99.3%)||99.3% (—)|
Tables 4, 5, and 6 show the targeted attack success against each model for various attacks. We measure targeted attack success as the accuracy of the defense model with respect to the attacker’s target label that the attack generated its adversarial example for. Each attack targeted a specific (least-likely) label, so higher values in these tables indicate that an attacker can control the predicted output of adversarial examples.
Appendix B Shield Implementation Issues
In the original SHIELD implementation, we discovered an issue in the experimental evaluation code that affects the results reported in the published paper. During training, central crops are enabled to better learn discriminative features for the object of interest. However, during evaluation it is customary to turn off this central cropping. The implementation of SHIELD does not turn off this cropping during evaluation, nor did the attacks in the SHIELD paper take this cropping into account when generating perturbations.
We discovered this issue after noticing that our adversarial images were not working against the provided implementation of SHIELD. We found that our images were adversarial against our own implementation of SHIELD using the same SLQ parameters and model parameters, but not against the publicly released implementation (https://github.com/poloclub/jpeg-defense at commit 1576429cf199c38065b941a48b0fcd7747901457). When we used the implementation of SHIELD provided to us with the same SLQ parameters and model checkpoints, those same adversarial images failed to remain adversarial. After some investigation, we found the aforementioned central-cropping-at-evaluation-time issue, disabled the central cropping, and found that our adversarial images remained adversarial.
As such, central cropping is now a feature of SHIELD in the sense that the evaluation portion of the published paper measures SLQ with central cropping. We reported this issue to the authors of SHIELD, who acknowledged it. Private correspondence with the authors also confirms our own experiments, which reveal that much of the reported robustness of SHIELD comes from this central cropping and not from SLQ nor the JPEG-retrained models. Given that central cropping is deterministic, we do not believe central cropping is a viable defense, nor do we believe it is viable in combination with SLQ.