Convolutional Neural Networks (CNNs) are widely used for image classification Krizhevsky et al. (2012); Simonyan and Zisserman (2015) and object detection. Despite their widespread use, CNNs have been shown to be vulnerable to adversarial examples Goodfellow et al. (2014). Adversarial examples are clean images to which malicious noise has been added. This noise is small enough that humans can still visually recognize the images, but CNNs misclassify them.
Adversarial examples can be created through white-box or black-box attacks Carlini et al. (2019a), depending on the assumed adversarial model. White-box attacks (e.g., FGSM Goodfellow et al. (2014), BIM Kurakin et al. (2017), MIM Dong et al. (2018), PGD Madry et al. (2018), CW Carlini and Wagner (2017b) and EAD Chen et al. (2018)) create adversarial examples using the derivatives of the model for given inputs; black-box attacks do not use this information. In black-box attacks the attacker may know the training dataset Szegedy et al. (2014); Papernot et al. (2016); Athalye et al. (2018); Liu et al. (2017) or may query the (target) model Papernot et al. (2017); Chen et al. (2017); Chen and Jordan (2019a). The latter case is known as an adaptive attack.
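As a concrete instance of the white-box attacks listed above, FGSM moves every input pixel one step in the sign of the loss gradient with respect to the input. The sketch below is a minimal numpy version; the classifier, its loss, and the gradient `grad` are assumed to be computed elsewhere, and `eps` bounds the perturbation magnitude so the image remains recognizable.

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """One-step FGSM: move each pixel by eps in the direction that
    increases the loss, then clip back to the valid pixel range."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)
```

Iterative attacks such as BIM and PGD apply this step repeatedly with a smaller step size, re-projecting onto the allowed perturbation ball after each step.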
To defend against these attacks many defenses have been proposed Pang et al. (2019); Verma and Swami (2019); Nguyen et al. (2019); Jia et al. (2019); Raff et al. (2019); Xiao et al. (2020); Kou et al. (2020); Roth et al. (2019); Liu et al. (2019) and the field is rapidly expanding. For example, in 2019 alone multiple attack and defense papers were released every month (https://nicholas.carlini.com/writing/2019/all-adversarial-example-papers.html has a curated list of adversarial machine learning papers submitted to arXiv by month in 2019). As studying every defense and attack is not feasible, we focus our analyses in two specific ways.
In terms of attacks, we investigate black-box adversaries which utilize adaptive attacks Papernot et al. (2017). A natural question is: why do we focus on adaptive black-box attacks? We do so for the following three reasons:
1) Gradient masking makes it possible for a defense to give a false sense of security Athalye et al. (2018). This means that demonstrating robustness to white-box attacks does not always imply robustness to black-box attacks. Hence, there is a need to extensively test both white and black-box attacks. The majority of defense papers Pang et al. (2019); Verma and Swami (2019); Jia et al. (2019); Raff et al. (2019); Xiao et al. (2020); Kou et al. (2020); Roth et al. (2019); Liu et al. (2019) try to demonstrate robustness (either theoretically or experimentally) to white-box attacks. The same level of experimentation is often not done for adaptive black-box attacks. This brings us to our second point.
2) State-of-the-art white-box attacks on published defenses have been extensively studied in the literature Tramer et al. (2020); Athalye et al. (2018); Carlini and Wagner (2017b). As previously mentioned, the level of attention given to adaptive black-box attacks in defense papers is significantly less. By focusing on adaptive black-box attacks we seek to complete the security picture. This full security picture means current defenses we study have not only white-box attacks but also adaptive black-box results. Future defenses can use our framework and publicly available source code to aid in their own adaptive analysis. This completed security spectrum brings us to our third point.
3) By completing the security picture (with adaptive black-box attacks) we allow the readers to compare defense results. This comparison can be done because the same adversarial model, dataset and attack is used for each defense. This is in stark contrast to adaptive white-box attacks which may require different adversarial models and different security assumptions for each attack. As noted in Carlini et al. (2019b) it is improper to compare the robustness of two defenses under different adversarial models.
Having explained our focus for attack, we next explain why we chose the following 9 defenses to investigate:
1) Each defense is unique in the following aspect: No two defenses use the exact same set of underlying methods to try and achieve security. We illustrate this point in Table 1. In Section 3 we go into specifics about why each individual defense is chosen. As a whole, this diverse group of defenses allows us to evaluate many different competing approaches to security.
2) Most of the defenses have been published at NeurIPS, ICML, ICLR or CVPR, indicating that the machine learning community and reviewers found these approaches worthy of examination and further study.
Having explained the scope of our analyses and the reasoning behind it, the rest of the paper is organized as follows: In Section 2 we explain our adversarial model. We also explain the types of attacks possible under the given model. In Section 3 we discuss the categorization of defenses and give a brief summary of each defense (and why it was selected). We also explain how we evaluate the security of the defenses. Experimental results are given in Section 4. Also included in this section is our analysis of the performance of each defense. We end with concluding remarks in Section 5. Extensive experimental details for reproducing our results are given in the supplemental material, and our complete GitHub code will be released on publication.
2 Black-box attacks
The general setup in adversarial machine learning for both white-box and black-box attacks is as follows Yuan et al. (2017): a trained classifier $f$ correctly identifies sample $x$ with class label $y$, i.e. $f(x) = y$. The goal of the adversary is to modify $x$ by some amount $\eta$ such that $f(x + \eta)$ produces class label $y'$. In the case of untargeted attacks, the attack is considered successful as long as $y' \neq y$. In the case of targeted attacks, the attack is only successful if $y' = t$ and $t \neq y$, where $t$ is a target class label specified by the adversary. For both untargeted and targeted attacks, the magnitude of $\eta$ is typically limited Goodfellow et al. (2014) so that humans can still visually recognize the image.
In both white-box and black-box attacks, the goal of the adversary is to create examples that fool the classifier. The differences between white-box and black-box attacks stems from the information about the classifier that the adversary is given. In white-box attacks the architecture of the classifier, and trained weights are assumed to be known. In black-box attacks, the architecture and trained weights are assumed to be secret Carlini et al. (2019a).
A natural question is, how is it possible that a defense that claims to be secure to white-box attacks is then vulnerable to black-box attacks? In essence this can occur because the knowledge needed to conduct a successful white-box attack is not the same as the knowledge needed for black-box attacks. For example, if the defense conceals gradient information, white-box attacks will fail but black-box attacks may succeed. In standard security literature it is common to assume a defense that is secure against white-box attacks is also secure against black-box attacks. In the field of adversarial machine learning, this concept is simply not applicable. As a result, both types of attacks must be examined. As much of the literature is already devoted to white-box analysis, this gives us clear motivation to complete the picture with an analysis of black-box attacks.
2.1 Black-box attack variations
We briefly introduce widely used black-box attacks. In essence black-box attacks can be categorized as follows:
Target model based black-box attacks. The attacker adaptively queries the defense to create adversarial examples. In these attacks, the adversary does not build any synthetic model to generate adversarial examples. Target model based attacks can further be divided into two categories: score based black-box attacks and decision based black-box attacks.
A. Score based black-box attacks. In the literature these attacks are also called zeroth order optimization based black-box attacks Chen et al. (2017). The adversary adaptively queries the defense to approximate the gradient for a given input using a derivative-free optimization approach. This approximated gradient allows the adversary to work directly with the classifier of the defense. Another attack in this line is SimBA (Simple Black-box Attack) Guo et al. (2019), which also requires the score vector to mount the attack.
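The gradient approximation described above can be sketched with coordinate-wise finite differences, in the spirit of ZOO Chen et al. (2017). This is an illustrative simplification (the actual attacks use far more query-efficient estimators); `f` stands for the score returned by querying the defense.

```python
import numpy as np

def zoo_gradient(f, x, h=1e-4):
    """Estimate d f / d x_i by central finite differences using only
    score queries f(x) -- no access to the model's internals."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad
```

Note that each coordinate costs two queries, which is why query efficiency dominates the literature on these attacks.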
B. Decision based black-box attacks Ilyas et al. (2018); Chen and Jordan (2019a). The main idea of this attack is to find the boundaries between the class regions. This is accomplished using a binary search methodology and gradient approximation for points located on the boundaries. These types of attacks are also called boundary attacks. As mentioned in Chen and Jordan (2019a), boundary attacks are not as efficient as pure black-box attacks (which we define in the next subsection) because the attacker requires a large number of queries Ilyas et al. (2018); Cheng et al. (2019); Chen and Jordan (2019b); Cheng et al. (2020) to successfully create adversarial examples.
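The binary search step at the heart of boundary attacks can be sketched as follows. This is a toy illustration, not any specific published attack: `classify` returns only a hard label (the decision based setting), and we bisect between a clean point and a known adversarial point to land just on the adversarial side of the decision boundary.

```python
import numpy as np

def boundary_bisect(classify, x_clean, x_adv, steps=30):
    """Binary search between a clean and an adversarial point for the
    decision boundary; classify(x) returns a hard class label and is
    the only access to the model (no scores, no gradients)."""
    y_clean = classify(x_clean)
    lo, hi = x_clean, x_adv          # hi always stays adversarial
    for _ in range(steps):
        mid = (lo + hi) / 2
        if classify(mid) == y_clean:
            lo = mid
        else:
            hi = mid
    return hi                        # adversarial point near the boundary
```

Real boundary attacks then estimate a gradient direction at this boundary point and walk along the boundary toward the clean image, which is what consumes so many queries.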
Synthetic model based black-box attacks. This type of attack is also known as a transfer-based black-box attack or a first order optimization attack. In this attack the adversary trains a synthetic model based on the information available about the defense. After training, the adversary creates adversarial examples by applying white-box attacks (e.g., FGSM Goodfellow et al. (2014), BIM Kurakin et al. (2017), MIM Dong et al. (2018), PGD Madry et al. (2018), CW Carlini and Wagner (2017b) and EAD Chen et al. (2018)) to the synthetic model. The adversarial examples created using the synthetic model are then submitted to the defense. Since the adversarial examples are created from the derivatives of the synthetic model, these attacks are first order attacks. They can further be divided into three subcategories:
A. Pure black-box attack Szegedy et al. (2014); Papernot et al. (2016); Liu et al. (2017): The adversary has access to the original training dataset and uses it to train the synthetic model. No queries are made to the defense; the attack is non-adaptive and relies purely on the transferability of adversarial examples.
B. Oracle based black-box attack Papernot et al. (2017): The attacker does not have access to the original training dataset, but may generate a synthetic dataset similar to the training data. The adversary adaptively generates synthetic data and queries the defense to obtain class labels for this data. The synthetic dataset, labeled in this way, is then used to train the synthetic model.
C. Mixed black-box attack Nguyen et al. (2019): In this paper, we use an enhanced synthetic model based attack known as the mixed black-box attack. This attack is a simple combination of the pure black-box attack and the oracle based black-box attack. The adversary has access to some or all of the original training dataset and can also adaptively generate a synthetic dataset. Lastly, the adversary can label both the synthetic and the original training data by querying the defense. This type of attack is more powerful than either the pure or the oracle attack, as it uses a stronger adversarial model.
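The data-gathering step of the mixed attack can be sketched as below. This is a schematic, not the paper's implementation: `query_defense` labels an image by querying the defended model, `synth_fn` is whatever generator the adversary uses for synthetic images, and the key point is that all labels come from the defense rather than from ground truth.

```python
import numpy as np

def mixed_blackbox_data(query_defense, x_train, synth_fn, n_synth):
    """Assemble the substitute model's training set for the mixed
    attack: real training images plus synthetic ones, both labeled by
    querying the defended model (not by the ground-truth labels)."""
    x_synth = synth_fn(n_synth)
    x_all = np.concatenate([x_train, x_synth])
    y_all = np.array([query_defense(x) for x in x_all])
    return x_all, y_all
```

The substitute model is then trained on `(x_all, y_all)` and attacked with any white-box method, and the resulting examples are transferred to the defense.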
2.2 Our black-box attack scope
We focus on synthetic model based black-box attacks, specifically the pure black-box attacks and mixed black-box attacks categories. Why do we refine our scope in this way? First of all we pick the pure black-box attack because this attack has no adaptation and no knowledge of the defense. In essence it is weaker than the mixed black-box attack. It may seem counterintuitive to start with a weak black-box attack. However, by using a relatively weak attack we can see the security of the defense under idealized circumstances. This represents a kind of best case defense scenario.
The second type of attack we focus on is the mixed black-box attack. This is the strongest synthetic model type of black-box attack in terms of the powers given to the adversary. We use it as our adaptive attack instead of target model black-box attacks for the following reasons. We do not focus on target model score based black-box attacks because they can be neutralized by gradient masking Carlini et al. (2019b); it has also been noted that a decision based black-box attack represents a more practical adversarial model Chen et al. (2019). This leaves target model decision based black-box attacks. However, it has been claimed that decision based black-box attacks may perform poorly on randomized models Carlini et al. (2019b). In addition, as described in Ilyas et al. (2018); Cheng et al. (2019); Chen and Jordan (2019b); Cheng et al. (2020); Guo et al. (2019); Tu et al. (2019); Chen et al. (2017), a large number of queries must be made to create a single successful adversarial example. It would therefore be very expensive in terms of queries, computation and time to create many successful adversarial examples, as compared to synthetic model based black-box attacks.
We chose the pure and mixed black-box attacks because they do not suffer from the limitations of the score and decision based attacks. Overall, the pure and mixed black-box attacks can be used as an efficient and nearly universally applicable security test.
3 Defense summaries and metrics
We investigate 9 defenses, Barrage of Random Transforms (BaRT) Raff et al. (2019), End-to-End Image Compression Models (ComDefend) Jia et al. (2019), The Odds are Odd (Odds) Roth et al. (2019), Feature Distillation (FD) Liu et al. (2019), Buffer Zones (BUZz) Nguyen et al. (2019), Ensemble Diversity (ADP) Pang et al. (2019), Distribution Classifier (DistC) Kou et al. (2020), Error Correcting Output Codes (ECOC) Verma and Swami (2019) and K-Winner-Take-All (k-WTA) Xiao et al. (2020).
In Table 1 we decompose these defenses into the underlying methods they use to try and achieve security. This is by no means the only way these defenses can be categorized and the definitions here are not absolute. We merely provide this hierarchy to provide a basic overview and show common defense themes. Although the definitions themselves are somewhat self explanatory, we give more precise defense method definitions in the supplemental material.
3.1 Barrage of random transforms
Barrage of Random Transforms (BaRT) Raff et al. (2019) is a defense based on applying image transformations before classification. The defense works by randomly selecting a set of transformations and a random order in which the image transformations are applied. In addition, the parameters for each transformation are also randomly selected at run time to further enhance the entropy of the defense. Broadly speaking, there are 10 different image transformation groups: JPEG compression, image swirling, noise injection, Fourier transform perturbations, zooming, color space changes, histogram equalization, grayscale transformations and denoising operations.
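The random pipeline described above can be sketched as follows. This is an illustrative skeleton only: `transforms` stands in for BaRT's transformation groups, and each transform is assumed to take the image plus a random generator from which it draws its own parameters.

```python
import numpy as np

def bart_preprocess(x, transforms, rng):
    """BaRT-style preprocessing sketch: pick a random subset of
    transforms, shuffle their order, and apply them with randomly
    drawn parameters before the image reaches the classifier."""
    k = rng.integers(1, len(transforms) + 1)       # how many to apply
    chosen = rng.choice(len(transforms), size=k, replace=False)
    rng.shuffle(chosen)                            # random order
    for i in chosen:
        x = transforms[i](x, rng)
    return x
```

Because the subset, order, and parameters change on every call, the effective preprocessing function is different for every query, which is the source of the defense's entropy.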
Why we selected it: In Carlini et al. (2019b) it is claimed that gradient free attacks (i.e. black-box attacks) most commonly fail due to randomization. Therefore BaRT is a natural candidate to test for black-box security. Also, in the original paper BaRT is only tested on ImageNet. We wanted to see if this defense could be expanded to work for other datasets.
3.2 End-to-end image compression models
ComDefend Jia et al. (2019) is a defense where image compression/reconstruction is done using convolutional autoencoders before classification. ComDefend consists of two modules: a compression convolutional neural network (ComCNN) and a reconstruction convolutional neural network (RecCNN). The compression network transforms the input image into a compact representation by compressing the original 24 bit pixels into compact 12 bit representations. Gaussian noise is then added to the compact representation. Decompression is then done using the reconstruction network and the final output is fed to the classifier. In this defense, retraining of the classifier on reconstructed input data is not required.
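The inference path can be sketched as below. This is a schematic, assuming `com_cnn` and `rec_cnn` stand in for the trained ComCNN and RecCNN; the binarization threshold and noise handling here are illustrative simplifications of the paper's quantization.

```python
import numpy as np

def comdefend_forward(x, com_cnn, rec_cnn, sigma=0.5, rng=None):
    """ComDefend-style pipeline sketch: compress to a compact code,
    inject Gaussian noise, quantize the code to bits, then reconstruct
    the image that is finally passed to the unmodified classifier."""
    rng = rng or np.random.default_rng()
    code = com_cnn(x)                             # compact representation
    noisy = code - sigma * rng.standard_normal(code.shape)
    binary = (noisy > 0.5).astype(np.float32)     # quantize to 0/1 bits
    return rec_cnn(binary)                        # reconstructed image
```

The classifier itself is untouched; only its input is laundered through the compress/noise/reconstruct path, which is why no retraining is needed.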
Why we selected it: Other autoencoder defenses have fared poorly Carlini and Wagner (2017c). It is worth studying new autoencoder defenses to see if they work, or if they face the same vulnerabilities as older defense designs. Since this paper Jia et al. (2019) does not study black-box adversaries, our analysis also provides new insight on this defense.
3.3 The odds are odd
The Odds are Odd (Odds) Roth et al. (2019) is a defense based on a statistical test. This test is motivated by the following observation: the behaviors of benign and adversarial examples differ at the logits layer (i.e. the input to the softmax layer). The test works as follows: for a given input image, multiple copies are created and random noise is added to each copy. This creates multiple random noisy images. The defense calculates the logits of each noisy image and uses them as the input for the statistical test.
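The data collection for the test can be sketched as follows. This is an illustrative reduction: `logits_fn` stands in for the network up to the logits layer, and the returned statistic (the mean shift of the logits under noise) is a simplified stand-in for the paper's actual test statistic.

```python
import numpy as np

def odds_statistic(logits_fn, x, noise_scale, n_copies, rng):
    """Collect the logits of many noisy copies of x; the defense's
    statistical test runs on these noise-perturbed logits, which
    behave differently for benign and adversarial inputs."""
    noisy_logits = [
        logits_fn(x + noise_scale * rng.standard_normal(x.shape))
        for _ in range(n_copies)
    ]
    # simplified statistic: per-class mean shift of the logits under noise
    return np.mean(noisy_logits, axis=0) - logits_fn(x)
```

For a benign input this shift is expected to stay small, whereas adversarial inputs sit in regions where small noise moves the logits substantially.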
Why we selected it: In Tramer et al. (2020) they mention that Odds is based on the common misconception that building a test for certain adversarial examples will then work for all adversarial examples. However, in the black-box setting this still brings up an interesting question: If the attacker is unaware of the type of test, can they still adaptively query the defense and come up with samples that circumvent the test?
3.4 Feature distillation
Feature Distillation (FD) implements a unique JPEG compression/decompression technique to defend against adversarial examples. Standard JPEG compression/decompression preserves low frequency components. However, it is claimed in Liu et al. (2019) that CNNs learn features which are based on high frequency components. Therefore, the authors propose a compression technique where a smaller quantization step is used for CNN accuracy sensitive frequencies and a larger quantization step is used for the remaining frequencies. The goal of this technique is two-fold. First, by maintaining high frequency components the defense aims to preserve clean accuracy. Second, by reducing the other frequencies the defense tries to eliminate the noise that makes samples adversarial. Note this defense does have some parameters which need to be selected through experimentation. For the sake of brevity, we provide the experiments for selecting these parameters in the supplemental material.
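The frequency-dependent quantization can be sketched as below. This is a toy illustration of the idea, not the paper's exact quantization tables: `sensitive_mask` marks the DCT frequencies treated as accuracy-sensitive, and the step sizes are placeholder values.

```python
import numpy as np

def fd_quantize(dct_block, sensitive_mask, q_small=10, q_large=50):
    """Frequency-dependent quantization sketch: accuracy-sensitive DCT
    frequencies get a small quantization step (preserved); the rest get
    a large one, crushing most adversarial noise in those bands."""
    q = np.where(sensitive_mask, q_small, q_large)
    return np.round(dct_block / q) * q
```

Quantize/dequantize round-trips like this are lossy by design: coefficients in the coarsely quantized bands snap to a sparse grid, discarding small adversarial perturbations along with them.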
Why we selected it: A common defense theme is the utilization of multiple image transformations like in the case of BaRT, BUZz and DistC. However, this requires a cost in the form of network retraining or clean accuracy. If a defense could use only one type of transformation (as done in FD) it may be possible to significantly reduce those costs. To the best of our knowledge, so far no single image transformation has accomplished that which makes the investigation of FD interesting.
3.5 Buffer zones
Buffer Zones (BUZz) employs a combination of techniques to try and achieve security. The defense is based on unanimous majority voting using multiple classifiers. Each classifier applies a different fixed secret transformation to its input. If the classifiers are unable to agree on a class label the defense marks the input as adversarial. The authors also note that a large drop in clean accuracy is incurred due to the number of defense techniques employed.
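The unanimous voting rule can be sketched as follows. This is a schematic of the decision logic only: `transforms` are the per-classifier secret fixed transformations and `classifiers` the retrained networks, both assumed to be supplied elsewhere.

```python
def buzz_predict(x, transforms, classifiers, adversarial_label=-1):
    """BUZz-style unanimous voting sketch: each classifier sees its own
    secretly transformed copy of x; any disagreement among the class
    labels flags the input as adversarial."""
    votes = [clf(t(x)) for t, clf in zip(transforms, classifiers)]
    return votes[0] if len(set(votes)) == 1 else adversarial_label
```

The drop in clean accuracy the authors report follows directly from this rule: a clean image is rejected whenever even one member of the ensemble disagrees.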
Why we selected it: We selected this defense to study because it specifically claims to deal with the exact adversarial model (adaptive black-box) that we work with. However, in their paper they only use a single strength adversary (i.e. one that uses the entire training dataset). We test across multiple strength adversaries (see Section 4) to see how well their defense holds up.
3.6 Improving adversarial robustness via promoting ensemble diversity
Constructing ensembles of enhanced networks is one defense strategy to improve the adversarial robustness of classifiers. However, in an ensemble model, the lack of interaction among individual members may cause them to return similar predictions. This defense proposes a new notion of ensemble diversity by promoting the diversity among the predictions returned by members of an ensemble model using an adaptive diversity promoting (ADP) regularizer, which works with a logarithm of ensemble diversity term and an ensemble entropy term Pang et al. (2019). The ADP regularizer helps non-maximal predictions of each ensemble member to be mutually orthogonal, while the maximal prediction is still consistent with the correct label. This defense employs a different training procedure where the ADP regularizer is used as the penalty term and the ensemble network is trained interactively.
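The two quantities the ADP regularizer promotes can be sketched as below. This is a simplified illustration of Pang et al. (2019), not their exact loss: the coefficients `alpha` and `beta`, the epsilon terms, and the use of a plain Gram determinant for the diversity term are assumptions made for readability.

```python
import numpy as np

def adp_regularizer(preds, y_true, alpha=2.0, beta=0.5):
    """Simplified ADP sketch: ensemble entropy of the averaged
    prediction, plus the log ensemble diversity computed from each
    member's renormalized non-maximal (non-true-class) predictions."""
    mean_pred = preds.mean(axis=0)                       # (classes,)
    entropy = -np.sum(mean_pred * np.log(mean_pred + 1e-20))
    # rows: each member's prediction with the true class removed
    nonmax = np.delete(preds, y_true, axis=1)
    nonmax = nonmax / (np.linalg.norm(nonmax, axis=1, keepdims=True) + 1e-20)
    diversity = np.linalg.det(nonmax @ nonmax.T)         # spanned volume
    return alpha * entropy + beta * np.log(diversity + 1e-20)
```

The determinant is maximal when the members' non-maximal prediction vectors are mutually orthogonal, which is exactly the behavior the defense tries to induce while keeping every member's top prediction on the correct label.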
Why we selected it: It has been shown that adversarial samples can have high transferability Papernot et al. (2017). Synthetic model black-box attacks have a basic underlying assumption: adversarial samples that fool the synthetic model will also fool the defense. ADP trains networks to specifically enhance diversity which could mitigate the transferability phenomena. If the adversarial transferability between networks is indeed really mitigated, then black-box attacks should not be effective.
3.7 Enhancing transformation-based defenses against adversarial attacks with a distribution classifier
The basic idea of this defense is that if the input is adversarial, basing the predicted class on the softmax output may yield a wrong result. Instead in this defense the input is randomly transformed multiple times, to create many different inputs. Each transformed input yields a softmax output from the classifier. Prediction is then done on the distribution of the softmax outputs Kou et al. (2020). To classify the softmax distributions, a separate distributional classifier is trained.
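The prediction path can be sketched as follows. This is a schematic only: `transform` is the random transformation (drawing its parameters from `rng`), `clf_softmax` the base classifier's softmax output, and `dist_clf` the separately trained distribution classifier.

```python
import numpy as np

def distc_predict(x, transform, clf_softmax, dist_clf, n_samples=50, rng=None):
    """DistC-style sketch: randomly transform x many times, collect the
    softmax output of each transformed copy, and let a separate
    distribution classifier predict from that whole collection rather
    than from a single softmax vector."""
    rng = rng or np.random.default_rng()
    softmax_samples = np.stack(
        [clf_softmax(transform(x, rng)) for _ in range(n_samples)]
    )                                    # (n_samples, classes)
    return dist_clf(softmax_samples)
```

The point of the design is that an adversarial input may fool any single transformed forward pass, but the shape of its softmax distribution across many random transformations is harder to fake.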
Why we selected it: In Kou et al. (2020) the defense is tested with target model black-box attacks but it does not test any synthetic model based black-box attacks. This defense is built on Xie et al. (2018) which was initially a promising randomization defense Xie et al. (2018) that was broken in Athalye et al. (2018). Whether the combination of a new classification scheme and randomization can achieve synthetic model based black-box security is an open question.
3.8 Error correcting output codes
The Error Correcting Output Codes (ECOC) Verma and Swami (2019) defense uses ideas from coding theory and changes the output representation of a network to codewords. There are three main ideas in the defense. First, a special sigmoid decoding activation function is used instead of the softmax function. This function allocates a non-trivial volume in logit space to uncertainty, shrinking the attack surface available to an attacker trying to craft adversarial examples. Second, a larger Hamming distance between the codewords is used to increase the distance between two high-probability regions for a class in logit space. This forces the adversary to use larger perturbations in order to succeed. Third, the correlation between outputs is reduced by training an ensemble model.
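The codeword decoding step can be sketched as below. This is a minimal illustration, assuming a fixed codeword matrix supplied by the defense designer; the soft Hamming (L1) distance used here is one reasonable decoding choice, not necessarily the paper's exact rule.

```python
import numpy as np

def ecoc_decode(logits, codewords):
    """ECOC-style decoding sketch: squash the network's output bits
    through element-wise sigmoids (not a softmax), then pick the class
    whose codeword is nearest in soft Hamming distance."""
    bits = 1.0 / (1.0 + np.exp(-logits))           # per-bit sigmoid
    dists = np.abs(codewords - bits).sum(axis=1)   # soft Hamming distance
    return int(np.argmin(dists))
```

With codewords separated by a large Hamming distance, an attacker must flip many output bits simultaneously to move the input into a different class's decoding region.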
Why we selected it: Much like ADP, this method relies on an ensemble of models. However unlike ADP, this defense is based on coding theory and the original paper does not consider a black-box adversary. Therefore, exploring the black-box security of this defense is of interest.
3.9 K-winner-take-all

In k-Winner-Take-All (k-WTA) Xiao et al. (2020) a special activation function is used that is discontinuous. This activation function mitigates white-box attacks through gradient masking. The authors claim this architecture change is nearly free.
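The activation itself is simple to state: keep the k largest activations in a layer and zero the rest. A minimal numpy sketch:

```python
import numpy as np

def kwta(a, k):
    """k-Winner-Take-All activation: keep the k largest activations,
    zero the rest. Which units survive changes abruptly as the input
    varies, making the function discontinuous -- the source of the
    gradient masking."""
    out = np.zeros_like(a)
    top = np.argsort(a)[-k:]     # indices of the k largest entries
    out[top] = a[top]
    return out
```

Unlike ReLU, an infinitesimal change in the input can swap which units are among the top k, so the layer's Jacobian gives a misleading picture of its local behavior.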
Why we selected it: The authors of the defense claim that k-WTA performs better under synthetic model black-box attacks than networks that use ReLU activation functions. If this claim is true, this would be the first defense in which gradient masking could mitigate both white-box and black-box attacks. In Tramer et al. (2020) the vulnerability of this defense to white-box attacks is already shown. Therefore, black-box security may be an issue for this defense as well.
3.10 Defense metric
In this paper our goal is to demonstrate what kind of gain in security can be achieved by using each defense against a black-box adversary. Our aim is not to claim any defense is broken. To measure the improvement in security, we use a simple metric: Defense accuracy improvement.
Defense accuracy improvement is the increase, in percentage points, in correctly recognized adversarial examples gained by implementing the defense as compared to having no defense. We compute this value by first conducting a specific black-box attack on a vanilla network (no defense). This gives us a vanilla defense accuracy score $A_{van}$, the percent of adversarial examples the vanilla network correctly identifies. We then run the same attack on a given defense and obtain a defense accuracy score $A_{def}$. The defense accuracy improvement for the defense is $A_{def} - A_{van}$. Note that how large an improvement counts as meaningful depends on $A_{van}$: if the vanilla network already classifies most adversarial examples correctly, we only require the improvement not to be negative, whereas if the vanilla accuracy is low, a defense must provide a substantial positive improvement to be considered effective. To make these comparisons as precise as possible, almost every defense is built with the same architecture and models. Exceptions occur in some cases, which we fully explain in the supplemental material.
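The metric computation above can be sketched directly. This is a plain illustration: `predict` stands in for a model's (or defended model's) label prediction, and the same attack outputs are fed to both models.

```python
import numpy as np

def defense_accuracy(predict, x_adv, y_true):
    """Fraction of adversarial examples a model still classifies
    correctly (bookkeeping for inputs flagged as adversarial by a
    detector is omitted in this sketch)."""
    preds = np.array([predict(x) for x in x_adv])
    return float((preds == y_true).mean())

def defense_accuracy_improvement(predict_defense, predict_vanilla, x_adv, y_true):
    """Defense accuracy improvement: run the same attack outputs
    through both models and take the difference in accuracy."""
    return (defense_accuracy(predict_defense, x_adv, y_true)
            - defense_accuracy(predict_vanilla, x_adv, y_true))
```

Crucially, both scores come from the *same* set of adversarial examples, so the difference isolates the defense's contribution rather than variation in the attack.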
4 Experimental results
In this section we conduct experiments to test the black-box security of the 9 defenses. We measure the results using the defense accuracy improvement metric (see Section 3.10). We test each defense under a pure black-box adversary and five mixed black-box adversaries of different strengths. The strength of a mixed black-box adversary is determined by how much of the original training dataset it is given access to. For every adversary, once the synthetic model is trained we have 6 different ways (FGSM Goodfellow et al. (2014), BIM Kurakin et al. (2017), MIM Dong et al. (2018), PGD Madry et al. (2018), CW Carlini and Wagner (2017b) and EAD Chen et al. (2018)) to generate adversarial examples.
Considering the range of our experiments (9 defenses, 6 adversarial models, 6 ways to generate adversarial samples and 2 datasets), it is infeasible to report all the results here. A comprehensive breakdown of the implementation details and full results for every defense, dataset and attack is given in the supplemental material. Here we present the most pertinent results in Figures 1 and 2 and give the principal takeaways in the following subsection.
4.1 Principal results
1. Marginal or negligible improvements over no defense: For CIFAR-10 with the mixed black-box adversary, we can clearly see that 7 out of 9 defenses give only marginal increases in defense accuracy for any attack. This is shown in Figure 1. BUZz and the Odds defense are the only ones to break this trend for CIFAR-10. Likewise, for Fashion-MNIST, again 7 out of 9 defenses give only marginal improvements (see Figure 2). BUZz and BaRT are the exceptions for this dataset.
2. Security is not free (yet): No defense we experimented with that offers significant defense accuracy improvements comes for free. BUZz-8 pays for its improvement with a considerable drop in clean accuracy on CIFAR-10, and BaRT-6 likewise drops the clean accuracy for Fashion-MNIST. As defenses improve, we expect this trade-off between clean accuracy and security to become less costly. However, our experiments show we have not reached this point with the current defenses.
3. Common defense mechanisms: It is difficult to decisively prove any one defense mechanism guarantees security. However, among the defenses that provide more than marginal improvements (Odds, BUZz and BaRT) we do see common defense trends. Both Odds and BUZz use adversarial detection. This indirectly deprives the adaptive black-box adversary of training data. When an input sample is marked as adversarial, the black-box attacker cannot use it to train the synthetic model. This is because the synthetic model has no adversarial class label. It is worth noting that in the supplemental material we also argue why a synthetic model should not be trained to output an adversarial class label.
Along similar lines, both BaRT and BUZz offer significant defense accuracy improvements for Fashion-MNIST. Both employ image transformations so jarring that the classifier must be retrained on transformed data. The experiments show that increasing the magnitude (number) of the transformations only increases security up to a certain point though. For example, BaRT-8 does not perform better than BaRT defenses that use less image transformations (see BaRT-6 and BaRT-4 in Figure 2).
5 Conclusion

In this paper we investigate and rigorously experiment with adaptive black-box attacks on recent defenses. Our paper's results span nine defenses, two adversarial models, six different attacks and two datasets. We show that most defenses (7 out of 9 for each dataset) offer only a marginal improvement in defense accuracy against an adaptive black-box adversary. Our analysis also yields new insight into which defense mechanisms may be useful against black-box adversaries in the future. Overall, we complete the security picture for currently proposed defenses with our experiments, and give future defense designers insight and direction with our analysis.
Broader impact statement
Machine learning is becoming more and more prevalent in everyday life. From self-driving cars to facial recognition systems, machine learning may be used in a variety of applications where the decisions are significant. Adversarial machine learning, on the other hand, poses a clear threat to these systems. If these machine learning systems fail on adversarial images, this could lead to harsh consequences. Therefore, developing defenses and safeguarding these systems from adversarial machine learning is paramount. Current defenses mainly focus on a white-box adversary, one that has complete system knowledge. However, it has been demonstrated that white-box security does not necessarily imply black-box security. Our paper investigates current defenses with an adaptive black-box adversary. We do this to make the security picture about these defenses more complete. Our analysis gives future defense designers a better picture of what black-box adversaries may be capable of. Our paper also identifies promising defense mechanisms that are present in currently proposed defenses.
- Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In ICML 2018, pp. 274–283.
- On Evaluating Adversarial Robustness. arXiv preprint arXiv:1902.06705.
- MagNet and "Efficient Defenses Against Adversarial Attacks" Are Not Robust to Adversarial Examples. arXiv preprint arXiv:1711.08478.
- Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In AISec@CCS.
- HopSkipJumpAttack: A Query-Efficient Decision-Based Attack.
- Boundary Attack++: Query-Efficient Decision-Based Adversarial Attack. arXiv preprint arXiv:1904.02144.
- EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. In Thirty-Second AAAI Conference on Artificial Intelligence.
- ZOO: Zeroth Order Optimization Based Black-box Attacks to Deep Neural Networks Without Training Substitute Models. In AISec.
- Query-Efficient Hard-Label Black-Box Attack: An Optimization-Based Approach. In International Conference on Learning Representations (ICLR).
- Sign-OPT: A Query-Efficient Hard-Label Adversarial Attack.
- Boosting Adversarial Attacks with Momentum. In CVPR, pp. 9185–9193.
- Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572.
- Simple Black-Box Adversarial Attacks. arXiv preprint arXiv:1905.07121.
- Identity Mappings in Deep Residual Networks. Lecture Notes in Computer Science, pp. 630–645.
- Black-Box Adversarial Attacks with Limited Queries and Information. In International Conference on Machine Learning, pp. 2137–2146.
- ComDefend: An Efficient Image Compression Model to Defend Adversarial Examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6084–6092.
- Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
- Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier. In International Conference on Learning Representations.
- Distribution Regression Network. arXiv preprint arXiv:1804.04775.
- ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, pp. 1097–1105.
- Adversarial Examples in the Physical World. In International Conference on Learning Representations (ICLR) Workshop.
- Delving into Transferable Adversarial Examples and Black-Box Attacks. In ICLR (Poster).
- Feature Distillation: DNN-Oriented JPEG Compression Against Adversarial Examples. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 860–868.
- Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations (ICLR).
- BUZz: Buffer Zones for Defending Adversarial Examples in Image Classification.
- Improving Adversarial Robustness via Promoting Ensemble Diversity. In International Conference on Machine Learning, pp. 4970–4979.
- Practical Black-Box Attacks Against Machine Learning. In ACM AsiaCCS 2017, pp. 506–519.
- Transferability in Machine Learning: From Phenomena to Black-Box Attacks Using Adversarial Samples. arXiv preprint.
- Barrage of Random Transforms for Adversarially Robust Defense. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6528–6537.
- The Odds Are Odd: A Statistical Test for Detecting Adversarial Examples. In International Conference on Machine Learning, pp. 5498–5507.
- Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
- Intriguing Properties of Neural Networks. In ICLR.
- On Adaptive Attacks to Adversarial Example Defenses. arXiv preprint arXiv:2002.08347.
- AutoZOOM: Autoencoder-Based Zeroth Order Optimization Method for Attacking Black-Box Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 742–749.
- Error Correcting Output Codes Improve Probability Estimation and Adversarial Robustness of Deep Neural Networks. In Advances in Neural Information Processing Systems, pp. 8643–8653.
- Enhancing Adversarial Defense by k-Winners-Take-All. In International Conference on Learning Representations.
- Mitigating Adversarial Effects Through Randomization. In ICLR (Poster).
- Adversarial Examples: Attacks and Defenses for Deep Learning. arXiv preprint arXiv:1712.07107.
Appendix A Supplementary experimental results and analyses
A.1 Pure black-box and variable strength mixed black-box attacks
In Figures 3 and 4 we show results for the pure black-box attack on CIFAR-10 and Fashion-MNIST. Just as for the mixed black-box attack, we see similar trends in terms of which defenses provide the highest security gains. For CIFAR-10, the defenses that give at least greater defense accuracy than the vanilla defense include BUZz and Odds. For Fashion-MNIST, the only defense that gives this significant improvement is BUZz.
Both the mixed black-box and pure black-box attacks have access to the entire original training dataset. The difference between them lies in the fact that the mixed black-box attack can generate synthetic data and label the training data by querying the defense. Since both attacks are similar in terms of how much data they start with, a question arises: how effective is the attack if the attacker does not have access to the full training dataset? In the following subsections we seek to answer that question by considering each defense under a variable strength adversary in the mixed black-box setting. Specifically, we test adversaries that can query the defense but only have , , or of the original training dataset.
To simplify things with the variable strength mixed black-box adversary, we only consider the untargeted MIM attack for generating adversarial examples. We use the MIM attack because it is the best performing attack on the vanilla (no defense) network for both datasets. Therefore, this attack represents the place where greatest improvements in security can be made. For those who are interested, we do report all the defense accuracies for all six types of attacks for the variable strength mixed black-box adversaries in the tables at the end of this supplementary material.
A.2 Barrage of random transforms analysis
The mixed black-box attack with variable strength for the BaRT defenses is shown in Figure 5. There are several interesting observations to be made about this defense. First, for CIFAR-10 the maximum-transformation defense (BaRT-10) actually performs worse than the vanilla defense in most cases. BaRT-1, BaRT-4 and BaRT-7 perform approximately the same as the vanilla defense. These statements hold except for the strength mixed black-box adversary, where all BaRT defenses show a or greater improvement over the vanilla defense.
Whereas the performance of BaRT is rather varied for CIFAR-10, for Fashion-MNIST this is not the case. All BaRT defenses show improvement for the MIM attack for adversaries of strength or greater.
When examining the results of BaRT on CIFAR-10 and Fashion-MNIST we see a clear discrepancy in performance. One possible explanation is as follows: the image transformations in a defense must be selected in a way that does not greatly impact the original clean accuracy of the classifier. In the case of BaRT-10 (the maximum number of transformations) for CIFAR-10, it performs much worse than the vanilla case. However, BaRT-8 for Fashion-MNIST (again the maximum number of transformations) performs much better than the vanilla case. If we look at the clean accuracy of BaRT-10, it is approximately on CIFAR-10. This is a drop of more than as compared to the vanilla clean accuracy. For BaRT-8 the clean accuracy is approximately on Fashion-MNIST. This is a drop of about . Here we do not use precise numbers when describing the clean accuracy because, as a randomized defense, the clean accuracy may drop or rise a few percentage points every time the test set is evaluated.
From the above results we can draw the following conclusion: a defense that employs random image transformations cannot be applied naively to every dataset. The set of image transformations must be selected per dataset, in such a manner that the clean accuracy is not drastically impacted. In this sense, while random image transformations may be a promising defense direction, they may need to be designed on a per-dataset basis.
A.3 End-to-end image compression models analysis
The mixed black-box attack with variable strength for ComDefend is shown in Figure 6. For CIFAR-10 we see the defense performs very close to the vanilla network (and sometimes slightly worse). On the other hand, for Fashion-MNIST the defense does offer a modest average defense accuracy improvement of across all mixed black-box adversarial models.
In terms of understanding the performance of ComDefend, it is important to note the following: in general it has been shown that more complex architectures (e.g. deeper networks) can better resist transfer-based adversarial attacks Liu et al. In essence, an autoencoder/decoder setup can be viewed as additional layers in the CNN, and hence a more complex model. Although this concept was shown for ImageNet Liu et al., it may be a phenomenon that occurs in other datasets as well.
This more complex model can partially explain why ComDefend slightly outperforms the vanilla defense in most cases. In short, a slightly more complex model is slightly more difficult to learn and attack. Of course this begs the question: if a more complex model yields more security, why does the model complexity even have to come from an autoencoder/decoder? Why not use ResNet164 or ResNet1001?
These are valid questions which are beyond the scope of our current work. While ComDefend itself does not yield significant (greater than ) improvements in security, it does bring up an interesting question: Under a black-box adversarial model, to what extent can increasing model complexity also increase defense accuracy? We leave this as an open question for possible future work.
A.4 The odds are odd analysis
In Figure 7 the mixed black-box attack with different strengths is shown for the Odds defense. For CIFAR-10 the Odds has an average improvement of across all adversarial models. However, for Fashion-MNIST the average improvement over the vanilla model is only . As previously stated, this defense relies on the underlying assumption that creating a test for one set of adversarial examples will then generalize to all adversarial examples.
When the test used in the Odds does provide security improvements (as in the case for CIFAR-10), it does highlight one important point. If the defense can mark some samples as adversarial, it is possible to deprive the mixed black-box adversary of data to train the synthetic model. This in turn weakens the overall effectiveness of the mixed black-box attack. We stress however that this occurs only when the test is accurate and does not greatly hurt the clean prediction accuracy of the classifier.
A.5 Feature distillation analysis
Figure 8 shows the mixed black-box attack with a variable strength adversary for the feature distillation defense. In general, feature distillation performs worse than the vanilla network for all Fashion-MNIST adversaries. It performs worse than or roughly the same as the vanilla network for all CIFAR-10 adversaries, except for the case where it shows a marginal improvement of .
In the original feature distillation paper the authors claim that they test a black-box attack. However, our understanding of their black-box attack experiment is that the synthetic model used in their experiment was not trained in an adaptive way. To be specific, the adversary they use does not have query access to the defense. Hence, this may explain why when an adaptive adversary is considered, the feature distillation defense performs roughly the same as the vanilla network.
As we stated in the main paper, it seems unlikely that a single image transformation would be capable of providing significant defense accuracy improvements. Thus far, the experiments on feature distillation support that claim for the JPEG compression/decompression transformation. The study of this image transformation and the defense is still very useful: the idea of JPEG compression/decompression combined with other image transformations may still provide a viable defense, similar to what is done in BaRT.
A.6 Buffer zones analysis
The results for the buffer zone defense in regards to the mixed black-box variable strength adversary are given in Figure 9. For all adversaries and all datasets we see an improvement over the vanilla model. This improvement is quite small for the adversary for the CIFAR-10 dataset at only a increase in defense accuracy for BUZz-2. However, the increases are quite large for stronger adversaries. For example, the difference between the BUZz-8 and vanilla model for the Fashion-MNIST full strength adversary is .
As we stated in the main paper, BUZz is one of the defenses that does provide more than marginal improvements in defense accuracy. This improvement comes at a cost in clean accuracy, however. To illustrate: BUZz-8 has a drop of and in clean testing accuracy for CIFAR-10 and Fashion-MNIST, respectively. An ideal defense is one in which the clean accuracy is not greatly impacted. In this regard BUZz still leaves much room for improvement. The overall idea presented in BUZz, combining adversarial detection and image transformations, does give some indication of where future black-box security may lie, if these methods can be modified to better preserve clean accuracy.
A.7 Improving adversarial robustness via promoting ensemble diversity analysis
The ADP defense and its performance under various strength mixed black-box adversaries is shown in Figure 10. For CIFAR-10 the defense does slightly worse than the vanilla model. For Fashion-MNIST the defense does almost the same as the vanilla model.
It has also been shown before in Nguyen et al.  that using multiple vanilla networks does not yield significant security improvements against a black-box adversary. The mixed black-box attacks presented here support these claims when it comes to the ADP defense.
At this time we do not have an adequate explanation as to why the ADP defense performs worse on CIFAR-10, given that its clean accuracy is actually slightly higher than the vanilla model's. We would expect slightly higher clean accuracy to result in slightly higher defense accuracy, but this is not the case. Overall, we do not see significant improvements in defense accuracy when implementing ADP against mixed black-box adversaries of varying strengths for CIFAR-10 and Fashion-MNIST.
A.8 Enhancing transformation-based defenses against adversarial attacks with a distribution classifier analysis
The distribution classifier defense results for mixed black-box adversaries of varying strength are shown in Figure 11. This defense does not perform significantly better than the vanilla model for either CIFAR-10 or Fashion-MNIST. This defense employs randomized image transformations, just like BaRT. However, unlike BaRT, there is no clear improvement in defense accuracy. We can attribute this to two main reasons. First, the number of transformations in BaRT is significantly larger (i.e. 10 different image transformation groups for CIFAR-10, 8 different image transformation groups for Fashion-MNIST); in the distribution classifier defense, only resizing and zero padding transformations are used. Second, BaRT requires retraining the entire classifier to accommodate the transformations. This means all parts of the network, from the convolutional layers to the feed-forward classifier, are modified (retrained). The distribution classifier defense only retrains the final classifier after the softmax output, so the feature extraction layers (convolutional layers) of the vanilla model and the distribution classifier are virtually unchanged. If two networks have the same convolutional layers with the same weights, it is not surprising that the experiments show they have similar defense accuracies.
A.9 Error correcting output codes analysis
In Figure 12 we show the ECOC defense against mixed black-box adversaries of varied strength. For CIFAR-10, ECOC performs worse than the vanilla defense in all cases except for the strength adversary. For Fashion-MNIST, the ECOC defense performs only slightly better than the vanilla model: on average, ECOC performs greater in terms of defense accuracy when considering all the different strength mixed black-box adversaries for Fashion-MNIST. In general, however, we do not see significant improvements (greater than increases) in defense accuracy when implementing ECOC.
A.10 k-winner-take-all analysis
The results for the mixed black-box variable strength adversary for the k-WTA defense are given in Figure 13. We can see that the k-WTA defense performs approximately the same or slightly worse than the vanilla model in almost all cases.
The slightly worse performance on CIFAR-10 can be attributed to the fact that the clean accuracy of the k-WTA ResNet56 is slightly lower than the clean accuracy of the vanilla model. We go into detailed explanations about the lower accuracy in Section C.11.
Regardless of the slight clean accuracy discrepancies, we see that this defense does not offer any significant improvements over the vanilla defense. From a black-box attacker perspective this makes sense. Replacing an activation function in the network while still making it have almost identical performance on clean images should not yield security. The only exception to this would be if the architecture change fundamentally alters the way the image is processed in the CNN. In the case of k-WTA, the experiments support the hypothesis that this is not the case.
Appendix B Background on adversarial examples attacks
B.1 General setup of adversarial examples in image classification
The adversary is given a trained image classifier $f$ (e.g., a CNN classifier) which outputs a class label $c(x)$ for a given input image $x$. Let $f(x)$ be a $K$-dimensional vector of confidence scores of classifier $f$. The class label is computed as
$$c(x) = \arg\max_{i \in \{1, \dots, K\}} f_i(x).$$
The adversary adds a perturbation $\eta$ to the original input $x$ to create an adversarial example $x' = x + \eta$ with class label $c(x') \neq c(x)$. As desired, $\eta$ should be small, to make the adversarial noise barely recognizable by humans. The classifier may be fooled into producing any desired class label $t$. An untargeted attack only requires $c(x') \neq c(x)$, while in a targeted attack the adversary specifies an adversarial label $t$ a priori and requires $c(x') = t$.
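The decision rule and the two attack-success conditions just described can be sketched in a few lines. The toy linear scorer and the concrete score values below are our own illustration, not the paper's model:

```python
import numpy as np

def class_label(scores):
    # c(x) = argmax_i f_i(x): the predicted class is the index of the
    # highest confidence score.
    return int(np.argmax(scores))

def untargeted_success(f, x, x_adv):
    # An untargeted attack succeeds if the predicted label changes at all.
    return class_label(f(x_adv)) != class_label(f(x))

def targeted_success(f, x_adv, t):
    # A targeted attack succeeds only if the adversarial label equals t.
    return class_label(f(x_adv)) == t

# Toy "classifier": a fixed linear score function over 3 classes.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
f = lambda x: W @ x

x = np.array([1.0, 0.2])       # predicted class 0
x_adv = np.array([0.1, 0.9])   # predicted class 1
```
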
B.2 White-box attacks
As mentioned in the introduction, we restrict ourselves to synthetic-model-based black-box attacks. Once a synthetic model is trained, any white-box attack can be run on the synthetic model to create an adversarial example. This type of attack exploits the transferability between classifiers Papernot et al. to create successful adversarial examples. We briefly introduce the following white-box attacks commonly used in the literature.
Fast Gradient Sign Method (FGSM) Goodfellow et al. This is one of the most widely used white-box attacks. The adversarial examples are generated as
$$x' = x + \epsilon \cdot \operatorname{sign}(\nabla_x L(x, y)),$$
where $L$ is a loss function (e.g., cross-entropy) of the model and $\epsilon$ bounds the perturbation. In this paper, $\epsilon$ is set to and for CIFAR-10 and Fashion-MNIST, respectively.
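The FGSM update can be sketched numerically. The binary logistic model, its weights, and the $\epsilon$ value below are our own toy assumptions, not the paper's settings; the point is only the signed-gradient step:

```python
import numpy as np

def fgsm(x, grad_x, eps):
    # Fast Gradient Sign Method: x' = x + eps * sign(grad_x L(x, y)).
    return x + eps * np.sign(grad_x)

# Toy setup: binary logistic loss L(x) = log(1 + exp(-y * w.x)),
# whose input gradient is -y * w * sigmoid(-y * w.x).
w = np.array([0.5, -1.0, 2.0])
y = 1.0
x = np.array([0.1, 0.2, 0.3])
margin = y * w.dot(x)
grad = -y * w / (1.0 + np.exp(margin))

x_adv = fgsm(x, grad, eps=0.05)
```

Each coordinate of `x_adv` moves exactly `eps` in the direction that increases the loss (here, decreases the margin).
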
Basic Iterative Method (BIM) Kurakin et al. The basic idea of the attack is similar to the FGSM attack. However, this is an iterative attack, i.e., the adversarial examples are created by applying the attack over many iterations. Typically,
$$x'_0 = x, \qquad x'_{j+1} = \operatorname{clip}_{x,\epsilon}\!\big(x'_j + \alpha \cdot \operatorname{sign}(\nabla_x L(x'_j, y))\big),$$
where $\alpha$ is the per-step size, $T$ is the number of iterations, and $\operatorname{clip}_{x,\epsilon}$ is a clipping operation that keeps $x'_{j+1}$ within the $\epsilon$-ball around $x$. In this paper, the parameters are set separately for CIFAR-10 and Fashion-MNIST, respectively.
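A minimal sketch of the iterate-and-clip loop (the constant toy gradient and the parameter values are our own illustration):

```python
import numpy as np

def bim(x, grad_fn, eps, alpha, T):
    # Basic Iterative Method: repeat small FGSM steps, clipping back into
    # the eps-ball around the original x after every step.
    x_adv = x.copy()
    for _ in range(T):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay in the eps-ball
    return x_adv

# Toy gradient: push every coordinate upward.
grad_fn = lambda z: np.ones_like(z)
x = np.zeros(4)
x_adv = bim(x, grad_fn, eps=0.1, alpha=0.03, T=10)
```

With 10 steps of size 0.03, the iterates would reach 0.3 without clipping; the clip caps the final perturbation at `eps = 0.1`.
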
Momentum Iterative Method (MIM) Dong et al. This is similar to BIM, but it uses a momentum trick to build the gradient $g$, i.e.,
$$g_{j+1} = \mu \cdot g_j + \frac{\nabla_x L(x'_j, y)}{\|\nabla_x L(x'_j, y)\|_1}, \qquad x'_{j+1} = \operatorname{clip}_{x,\epsilon}\!\big(x'_j + \alpha \cdot \operatorname{sign}(g_{j+1})\big).$$
In this paper it uses the same setup as described above for BIM for CIFAR-10 and Fashion-MNIST. Since it uses a momentum trick, it has a decay factor $\mu$ in the algorithm; we use the same value of $\mu$ for both CIFAR-10 and Fashion-MNIST.
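The momentum accumulation can be sketched on its own (the gradient values and $\mu$ below are toy choices of ours):

```python
import numpy as np

def mim_step(g_prev, grad, mu):
    # MIM momentum accumulation: g_{j+1} = mu * g_j + grad / ||grad||_1.
    # Normalizing by the L1 norm keeps each contribution on the same scale.
    return mu * g_prev + grad / np.sum(np.abs(grad))

g = np.zeros(3)
grad = np.array([2.0, -1.0, 1.0])  # ||grad||_1 = 4
g = mim_step(g, grad, mu=1.0)
g = mim_step(g, grad, mu=1.0)      # accumulated over two identical steps
```
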
Projected Gradient Descent (PGD) Madry et al. This is similar to BIM, with the clipping operation replaced by a projection operation. In this paper it has the same setup for the number of iterations and the noise bound as described for BIM for CIFAR-10 and Fashion-MNIST. The initial noise is randomly generated in a ball of a given radius centered at $x$ (see Madry et al. for more information). The radius of the ball for generating the initial noise is set identically for both CIFAR-10 and Fashion-MNIST.
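The random start plus projection can be sketched as follows (using the $L_\infty$ ball, where the projection reduces to a clip; the toy gradient and parameters are our own assumptions):

```python
import numpy as np

def pgd(x, grad_fn, eps, alpha, T, rng):
    # PGD: start from a random point in the eps-ball around x, then take
    # signed gradient steps, projecting back onto the ball each iteration.
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start
    for _ in range(T):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)      # L_inf projection
    return x_adv

rng = np.random.default_rng(0)
x = np.zeros(5)
x_adv = pgd(x, lambda z: np.ones_like(z), eps=0.1, alpha=0.05, T=5, rng=rng)
```
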
Carlini and Wagner attack (CW) Carlini and Wagner [2017b]. Let $Z(x)$ be the score vector of input $x$ as described above, and let $\kappa$ control the confidence on the adversarial examples. Define
$$g(x') = \max\Big(\max_{i \neq t} Z(x')_i - Z(x')_t,\ -\kappa\Big).$$
The adversary builds the following objective function for finding the adversarial noise:
$$\min_{x'} \ \|x' - x\|_2^2 + c \cdot g(x'),$$
where $c$ is a constant chosen by a modified binary search. Moreover, this is an iterative attack, and the number of iterations in our paper is set to 1000 for both CIFAR-10 and Fashion-MNIST.
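To make the margin term concrete, here is a small numeric sketch of the CW-style margin $g$ and objective; the score vector and constants are toy values of our own choosing, and only the objective evaluation (not its minimization) is shown:

```python
import numpy as np

def cw_margin(scores, t, kappa):
    # g(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa): reaches its
    # floor -kappa once class t beats every other class by at least kappa.
    others = np.delete(scores, t)
    return max(others.max() - scores[t], -kappa)

def cw_objective(x, x_adv, scores, t, c, kappa):
    # ||x' - x||_2^2 + c * g(x'); the real attack minimizes this over x'.
    return np.sum((x_adv - x) ** 2) + c * cw_margin(scores, t, kappa)

scores = np.array([2.0, 5.0, 1.0])   # class 1 already beats the rest by 3
obj = cw_objective(np.zeros(3), np.array([0.1, 0.0, 0.0]), scores, t=1,
                   c=1.0, kappa=5.0)
```
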
Elastic Net Attack (EAD) Chen et al. This is a variant of the CW attack with the following objective function:
$$\min_{x'} \ c \cdot g(x') + \|x' - x\|_2^2 + \beta \|x' - x\|_1.$$
We set $\beta$ and the number of iterations identically for both CIFAR-10 and Fashion-MNIST, and the constant $c$ is found by using binary search.
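A numeric sketch of the elastic-net objective; the perturbation, constants, and the fixed stand-in for the margin term are toy values of ours. The extra $\beta$-weighted $L_1$ term is what distinguishes EAD from CW and encourages sparse perturbations:

```python
import numpy as np

def ead_objective(x, x_adv, margin, c, beta):
    # EAD objective: c * g(x') + ||x' - x||_2^2 + beta * ||x' - x||_1.
    # `margin` stands in for the CW-style g(x') evaluated at x_adv.
    delta = x_adv - x
    return c * margin + np.sum(delta ** 2) + beta * np.sum(np.abs(delta))

x = np.zeros(3)
x_adv = np.array([0.1, 0.0, -0.2])
val = ead_objective(x, x_adv, margin=-1.0, c=0.5, beta=0.01)
```
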
B.3 Black-box settings
We describe the detailed setup of our black-box attacks in this paper. We note that we strictly follow the setup of the black-box attacks described in Nguyen et al. Indeed, this setting was carefully chosen by the authors to allow them to properly analyze the security of many defenses under the notions of pure black-box and mixed black-box attacks. To make this section self-contained, we re-introduce the setup used in Nguyen et al., which we also use in this paper.
Algorithm 1 describes the oracle-based black-box attack from Papernot et al. The oracle represents black-box access to the target model and only returns the final class label for a query (and not the score vector). Initially, the adversary is given (a part of) the training data set, i.e., he knows the correct labels for some subset of the training samples. If the oracle is never queried, Algorithm 1 becomes the pure black-box attack Carlini and Wagner [2017a], Athalye et al., Liu et al., i.e., an algorithm which does not need any oracle access to the target model.
Let $F$ and $\theta$ be an a-priori synthetic architecture and the parameters of the synthetic network, respectively. $\theta$ is trained using Algorithm 1, i.e., the image-label pairs in the labeled dataset are used to train $\theta$ using a training method (e.g., Adam Kingma and Ba). The data augmentation trick (i.e., the Jacobian trick) is used to increase the number of samples in the training dataset, as described in line 17. Algorithm 1 runs a fixed number of iterations before outputting the final trained parameters $\theta$.
Tables 2, 3, and 4 from Nguyen et al.  describe the setup of our experiments in this paper. Table 2 presents the setup of the optimization algorithm used for training in Algorithm 1. The architecture of the synthetic model is described in Table 4 and the main parameters for Algorithm 1 for CIFAR-10 and Fashion-MNIST are presented in Table 3.
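The data side of the query-and-augment loop behind Algorithm 1 can be sketched as follows. The threshold oracle and the uniform-noise augmentation are stand-ins of our own: the actual attack queries the defended classifier and uses the Jacobian-based augmentation to generate new synthetic points:

```python
import numpy as np

def mixed_black_box_data(oracle, x0, rounds, perturb, rng):
    # Each round: label the current pool by querying the defense (the
    # oracle), record the labeled pairs, then generate new synthetic
    # points near the labeled ones. Uniform noise stands in here for the
    # Jacobian-based augmentation used by the real attack.
    pool = x0
    labeled = []
    for _ in range(rounds):
        labels = np.array([oracle(x) for x in pool])  # query the defense
        labeled.append((pool, labels))
        pool = pool + rng.uniform(-perturb, perturb, size=pool.shape)
    return labeled

# Toy oracle: a threshold rule standing in for the defended classifier.
oracle = lambda x: int(x.sum() > 0)
rng = np.random.default_rng(1)
data = mixed_black_box_data(oracle, rng.normal(size=(8, 2)), 3, 0.1, rng)
```

The accumulated `data` would then be used to train the synthetic model with an optimizer such as Adam.
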
|Layer Type|Fashion-MNIST and CIFAR-10|
|Convolution + ReLU|3 x 3 x 64|
|Convolution + ReLU|3 x 3 x 64|
|Convolution + ReLU|3 x 3 x 128|
|Convolution + ReLU|3 x 3 x 128|
|Fully Connected + ReLU| |
|Fully Connected + ReLU| |
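The convolutional part of the table can be sanity-checked with a short parameter count. This is our own calculation, not the paper's code; the fully connected widths are not listed above, so only the conv layers are counted:

```python
def conv_params(k, c_in, c_out):
    # Parameters in a k x k convolution: (k*k*c_in + 1) * c_out,
    # including one bias per output channel.
    return (k * k * c_in + 1) * c_out

def conv_stack_params(c_in):
    # Conv stack of the synthetic model: 3x3 kernels with the channel
    # counts from the table. c_in is 3 for CIFAR-10 and 1 for the
    # grayscale Fashion-MNIST.
    total = 0
    for c_out in (64, 64, 128, 128):
        total += conv_params(3, c_in, c_out)
        c_in = c_out
    return total
```
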
B.4 Note on training the synthetic model
In the main paper we mention that training the synthetic model is done with data labeled by the defense. However, we do not use data which has a null class label (such as the adversarial label given in BUZz and the Odds defense). We ignore this type of data because it would require modifying the untargeted attack in an unnecessary way. The untargeted attack tries to find a malicious (wrong) label. If the synthetic network outputs null labels, it is possible for the untargeted attack to produce an adversarial sample that has the null label; in essence, the attack would fail under those circumstances. To prevent this, the objective function of every untargeted attack would need to be modified such that the untargeted attack produces a malicious label that is not the null label. To avoid needlessly complicating the attacks, we simply do not use null-labeled data. It is an open question whether using null-labeled data to train the synthetic network, together with the specialized untargeted attack we describe, would actually yield performance gains.
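A minimal sketch of this filtering step; the concrete null-label encoding is our own stand-in for the defense's adversarial label:

```python
NULL_LABEL = -1  # stand-in for the defense's "adversarial" class

def filter_null_labeled(samples, labels):
    # Drop query results the defense marked as adversarial (null label),
    # so the synthetic model is trained only on ordinary class labels.
    return [(s, y) for s, y in zip(samples, labels) if y != NULL_LABEL]

pairs = filter_null_labeled(["a", "b", "c"], [0, -1, 2])
```
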
Appendix C Defense implementations
In the first subsection we describe the defense terms used in Table 1. In the following subsections we describe the implementation details for the vanilla networks and all 9 defenses studied in this paper.
C.1 Defense mechanism definitions
In this subsection we give definitions for the underlying defense mechanisms shown in Table 1. These definitions are not absolute. We provide them here as one possible way to categorize the defenses based on shared traits.
Multiple models - The defense uses multiple classifiers for prediction. The classifiers' outputs may be combined through averaging (i.e. ADP), majority voting (BUZz) or other methods (ECOC).
Fixed input transformation - A non-randomized transformation is applied to the input before classification. Examples of this include image denoising using an autoencoder (ComDefend), JPEG compression (FD) or resizing and adding (BUZz).
Random input transformation - A random transformation is applied to the input before classification. For example, both BaRT and DistC randomly select from multiple different image transformations to apply at run time.
Adversarial detection - The defense outputs a null label if the sample is considered to be adversarially manipulated. Both BUZz and Odds employ adversarial detection mechanisms.
Architecture change - A change in the architecture made solely for the purposes of security. For example, k-WTA uses different activation functions in the convolutional layers of a CNN, and ECOC uses a different activation function on the output of the network.
Network retraining - The network is retrained to accommodate the implemented defense. For example, BaRT and BUZz require network retraining to achieve acceptable clean accuracy, due to the significant transformations both apply to the input. Different architectures also mandate network retraining, as in the case of ECOC, DistC and k-WTA. Note that network retraining is different from adversarial retraining: adversarial retraining is a fundamentally different technique in the sense that it can be combined with almost every defense we study. Our interest, however, is not to make each defense as strong as possible, but to understand how much each defense improves security on its own. Adding in techniques beyond what the original defense focuses on essentially adds confounding variables, and it then becomes even more difficult to determine from where security may arise. As a result, we limit the scope of our defenses to only consider retraining when required, and do not consider adversarial retraining.
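The first two combination mechanisms above can be sketched as follows; the null-label value and the strict-majority rule are simplified illustrations of ours, not the defenses' exact implementations:

```python
from collections import Counter
import numpy as np

def average_ensemble(score_vectors):
    # Combine ensemble members by averaging their score vectors
    # (ADP-style), then take the argmax of the mean.
    return int(np.argmax(np.mean(score_vectors, axis=0)))

def majority_vote(labels, null_label=-1):
    # Majority voting with adversarial detection: output the majority
    # label, or the null label when no label wins a strict majority
    # (a simplified sketch of a BUZz-style combination).
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else null_label
```
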
C.2 Vanilla model
CIFAR-10: We train a ResNet56 He et al. for 200 epochs with ADAM. We accomplish this using Keras (https://github.com/keras-team/keras) and the ResNet56 version 2 implementation (https://keras.io/examples/cifar10_resnet/). In terms of the dataset, we use 50,000 samples for training and 10,000 samples for testing. All images are normalized in the range [0,1] with a shift of -0.5 so that they are in the range [-0.5, 0.5]. We also use the built-in data augmentation technique provided by Keras during training. With this setup our vanilla network achieves a testing accuracy of .
Fashion-MNIST: We train a VGG16 network Simonyan and Zisserman for 100 epochs using ADAM. We use 60,000 samples for training and 10,000 samples for testing. All images are normalized in the range [0,1] with a shift of -0.5 so that they are in the range [-0.5, 0.5]. For this dataset we do not use any augmentation techniques. However, our VGG16 network has a built-in resizing layer that transforms the images from 28x28 to 32x32; we found this process slightly boosts the clean accuracy of the network. On testing data we achieve an accuracy of .
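The preprocessing just described can be sketched as below; the nearest-neighbor resize is our own stand-in for the network's internal resizing layer (which may use a different interpolation):

```python
import numpy as np

def preprocess(img_uint8):
    # Normalize pixel values to [0, 1], then shift by -0.5 into
    # [-0.5, 0.5], matching the setup described above.
    return img_uint8.astype(np.float64) / 255.0 - 0.5

def resize_nearest(img, h, w):
    # Nearest-neighbor resize (e.g. 28x28 -> 32x32). Illustrative only;
    # the actual defense resizes inside the network.
    rows = np.arange(h) * img.shape[0] // h
    cols = np.arange(w) * img.shape[1] // w
    return img[np.ix_(rows, cols)]

x = preprocess(np.array([[0, 255], [128, 64]], dtype=np.uint8))
big = resize_nearest(np.ones((28, 28)), 32, 32)
```
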
C.3 Barrage of random transforms
The authors of BaRT Raff et al.  do not provide source code for their defense. We contacted the authors and followed their recommendations as closely as possible to re-implement their defense. However, some implementation changes had to be made. For the sake of the reproducibility of our results, we enumerate the changes made here.
Image transformations: In the appendix of BaRT, the authors provide code snippets configured to work with the sci-kit image package version 14.0.0. However, due to compatibility issues, the closest version we could use with our other existing packages was sci-kit image 14.4.0. Due to the different sci-kit version, two parts of the defense had to be modified. First, the original denoising wavelet transformation code in the BaRT appendix had invalid syntax for version 14.4.0, so we had to modify it and run it with different, less random parameters.
The second defense change we made was due to error handling. In extremely rare cases, certain sequences of image transformations return images with NaN values. When we contacted the authors, they acknowledged that their code failed when using newer versions of sci-kit. As a result, in sci-kit 14.4.0, when we encounter this error we randomly pick a new sequence of random transformations for the image. We experimentally verified that this has a negligible impact on the entropy of the defense. For example, in CIFAR-10 for the 5-transformation defense, we encounter this error 47 times when running all 50,000 training samples. That means roughly only 0.094 of the possible transformation sequences cannot be used in sci-kit 14.4.0.
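The retry logic can be sketched as follows; the transformation sampler here is a toy stand-in for BaRT's random sequence generator:

```python
import numpy as np

def apply_with_retry(img, sample_sequence, max_tries=10):
    # Apply a randomly sampled transformation sequence; if the result
    # contains NaN values (the rare failure mode described above), draw
    # a fresh random sequence and try again.
    for _ in range(max_tries):
        seq = sample_sequence()
        out = seq(img)
        if not np.any(np.isnan(out)):
            return out
    raise RuntimeError("no NaN-free transformation sequence found")

# Toy sampler: the first draw produces NaNs, later draws are benign.
draws = iter([lambda im: im * np.nan, lambda im: im * 0.5])
out = apply_with_retry(np.ones((2, 2)), lambda: next(draws))
```
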
It is worth noting one other change we made to the Fashion-MNIST version of this defense. The original BaRT defense was only implemented for ImageNet, a three-color-channel (RGB) dataset, whereas Fashion-MNIST is a single-color-channel (grayscale) dataset. As a result, two transformation groups are not usable for the Fashion-MNIST BaRT defense (the color space change group and the grayscale transformation group).
Training BaRT: In Raff et al. the authors start with a ResNet model pre-trained on ImageNet and further train it on transformed data for 50 epochs using ADAM. The transformed data is created by transforming samples in the training set. Each sample is transformed times, where is randomly chosen from distribution . Since the authors did not experiment with CIFAR-10 and Fashion-MNIST, we tried two approaches to maximize the accuracy of the BaRT defense. First, we followed the authors' approach and started with a ResNet56 pre-trained for 200 epochs on CIFAR-10 with data augmentation. We then further trained this model on transformed data for 50 epochs using ADAM. For CIFAR-10 we were able to achieve an accuracy of on the training dataset and a testing accuracy of . Likewise, we tried the same approach for training the defense on the Fashion-MNIST dataset. We started with a VGG16 model that had already been trained on the standard Fashion-MNIST dataset for 100 epochs using ADAM. We then generated the transformed data and trained on it for an additional 50 epochs using ADAM, achieving a training accuracy and a testing accuracy. Due to the relatively low testing accuracy on the two datasets, we tried a second way to train the defense.
In our second approach we tried training the defense on the randomized data using untrained models. For CIFAR-10 we trained ResNet56 from scratch with the transformed data and data augmentation provided by Keras for 200 epochs. We found the second approach yielded a higher testing accuracy of . Likewise for Fashion-MNIST we trained a VGG16 network from scratch on the transformed data and obtained a testing accuracy of . Due to the better performance on both datasets, we built the defense using models trained using the second approach.
C.4 Improving adversarial robustness via promoting ensemble diversity
The original source code for the ADP defense Pang et al.  on the MNIST and CIFAR-10 datasets was provided on the authors' Github page (https://github.com/P2333/Adaptive-Diversity-Promoting). We used the same ADP training code the authors provided, but trained on our own architectures. For CIFAR-10 we used the ResNet56 model mentioned in C.2 and for Fashion-MNIST we used the VGG16 model mentioned in C.2. We used K = 3 networks for the ensemble model. We followed the original paper for the selection of the hyperparameters, which are 2 and 0.5 for the adaptive diversity promoting (ADP) regularizer. To train the model for CIFAR-10, we used the 50,000 training images for 200 epochs with a batch size of 64. We trained the network using the ADAM optimizer with Keras data augmentation. For Fashion-MNIST we trained the model for 100 epochs with a batch size of 64 on the 60,000 training images. For this dataset we again used ADAM as the optimizer but did not use any data augmentation.
We constructed a wrapper for the ADP defense where the inputs are predicted by the ensemble model and the accuracy is evaluated. For CIFAR-10, we used 10,000 clean test images and obtained an accuracy of 94.3%. We observed no drop in clean accuracy with the ensemble model, but rather a slight increase from , the original accuracy of the vanilla model. For Fashion-MNIST, we tested the model with 10,000 clean test images and obtained an accuracy of . Again, for this dataset we observed no drop in accuracy after training with the ADP method.
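The inference step of our wrapper amounts to averaging the softmax outputs of the K ensemble members and taking the argmax. A minimal sketch with toy probabilities (the K = 3 members here are stand-ins for the trained networks):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the softmax outputs of K ensemble members (ADP-style inference)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

# Toy example: K = 3 members, batch of 2 images, 10 classes.
rng = np.random.default_rng(1)
probs = [rng.dirichlet(np.ones(10), size=2) for _ in range(3)]
avg = ensemble_predict(probs)
labels = avg.argmax(axis=1)   # final class predictions
```

Because each member's output is a valid probability vector, the average is one as well, so accuracy can be evaluated directly on `labels`.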
C.5 Error correcting output codes
We employed the “TanhEns32” method from the ECOC defense, which uses 32 output codes and the hyperbolic tangent as the sigmoid function with an ensemble model. We chose this model because it yields better accuracy on both clean and adversarial images for CIFAR-10 and MNIST than the other ECOC models the authors tested, as reported in the original paper.
For CIFAR-10, we used the original training code provided by the authors. Unlike for the other defenses, we did not use a ResNet network for this defense because the models used in the ensemble predict individual bits of the error code. As a result, these models are much less complex than ResNet56 (fewer trainable parameters). Due to the lower complexity of each individual model in the ensemble, we used the default CNN structure the authors provided instead of our own; we did this to avoid over-parameterization of the ensemble. We used 4 individual networks for the ensemble model and trained the network with 50,000 clean images for 400 epochs with a batch size of 200. We used data augmentation (with Keras) and batch normalization during training.
We used the original MNIST training code to train on Fashion-MNIST by simply changing the dataset. Similarly, to avoid over-parameterization, we again used the authors' lower-complexity CNNs instead of our VGG16 architecture. We trained the ensemble model with 4 networks for 150 epochs with a batch size of 200. We did not use data augmentation for this dataset.
For our implementation we constructed our own wrapper class where the input images are predicted and evaluated using the TanhEns32 model. We tested the defense with 10,000 clean testing images for both CIFAR-10 and Fashion-MNIST, and obtained and accuracy, respectively.
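At inference time, an ECOC-style model maps the networks' tanh outputs back to a class by correlating them with the codebook. The sketch below illustrates this decoding step; the random ±1 codebook is a hypothetical stand-in for the actual TanhEns32 codewords from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical +/-1 codebook: 10 classes x 32 output bits.
codebook = rng.choice([-1.0, 1.0], size=(10, 32))

def ecoc_decode(tanh_outputs, codebook):
    """Assign the class whose codeword best correlates with the tanh outputs."""
    scores = tanh_outputs @ codebook.T   # (batch, num_classes) correlations
    return scores.argmax(axis=1)

# A saturated tanh version of class 7's codeword decodes back to class 7.
outputs = np.tanh(3.0 * codebook[7])[None, :]
pred = ecoc_decode(outputs, codebook)
```

Correlation decoding tolerates some flipped bits, which is the source of the defense's robustness: an adversary must corrupt many output bits at once to move the prediction to another codeword.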
C.6 Distribution classifier
For the distribution classifier defense Kou et al. , we used random resize and pad (RRP) Xie et al.  and a DRN Kou et al.  as the distribution classifier. The authors did not provide public code for their complete working defense. However, the DRN implementation by the same author was previously released on Github (https://github.com/koukl/drn). We also contacted the authors, followed their recommendations for the training parameters, and used the DRN implementation they sent to us as a blueprint.
In order to implement RRP, we followed the resize ranges the paper suggested, specifically those for the IFGSM attack. We therefore chose the resize range as [19, 25] for CIFAR-10 and [22, 28] for Fashion-MNIST and used these parameters in all of our experiments.
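The RRP transformation itself is straightforward to sketch: resize the image to a random size in the chosen range, then zero-pad it at a random offset back to the network's input size. The sketch below assumes a 32×32 input and uses a nearest-neighbour resize as a stand-in for the resize operator in Xie et al.

```python
import numpy as np

rng = np.random.default_rng(3)

def nn_resize(img, size):
    """Nearest-neighbour square resize (stand-in for the resize in Xie et al.)."""
    h, w = img.shape[:2]
    rows = (np.arange(size) * h / size).astype(int)
    cols = (np.arange(size) * w / size).astype(int)
    return img[rows][:, cols]

def random_resize_pad(img, lo=19, hi=25, out=32, rng=rng):
    """Resize to a random size in [lo, hi], then zero-pad at a random offset."""
    s = int(rng.integers(lo, hi + 1))
    small = nn_resize(img, s)
    top = int(rng.integers(0, out - s + 1))
    left = int(rng.integers(0, out - s + 1))
    padded = np.zeros((out, out) + img.shape[2:], dtype=img.dtype)
    padded[top:top + s, left:left + s] = small
    return padded

x = rng.random((32, 32, 3)).astype(np.float32)   # toy CIFAR-10-sized image
y = random_resize_pad(x)
```

Each call produces a differently resized and shifted copy of the input, which is what makes gradient-based attacks on the raw classifier less reliable.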
As for the distribution classifier, the DRN consists of fully connected layers in which each node encodes a distribution. We use one hidden layer of 10 nodes. The final layer has 10 nodes (one per class), with two bins representing the logit output for each class. In this type of network the outputs of the layers are 2D. For the final classification we convert from 2D to 1D by taking the output from the hidden layer and simply discarding the second bin each time. The distribution classifier then performs the final classification and outputs the class label.
We followed the parameters the paper suggested to prepare the training data. First, we collected 1,000 correctly classified clean training images for Fashion-MNIST and 10,000 correctly classified clean images for CIFAR-10. Therefore, with no transformation, the accuracy of the networks is 100%. For Fashion-MNIST, we used N = 100 transformation samples and for CIFAR-10, we used N = 50 samples, as suggested in the original paper. After collecting the N samples from the RRP, we fed them into our main classifier network and collected the softmax probabilities for each class. Finally, for each class, we approximated the marginal distributions using kernel density estimation with a Gaussian kernel (kernel width = 0.05). We used 100 discretization bins to discretize the distribution. For each image, we obtain 100 distribution samples per class. For further details of this distribution, we refer the readers to Kou et al. .
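The kernel density estimation step above can be sketched as follows: for one class, the N softmax probabilities collected over the RRP transformations are smoothed with a Gaussian kernel of width 0.05 and discretized into 100 bins. The toy samples below are hypothetical.

```python
import numpy as np

def marginal_kde(samples, n_bins=100, width=0.05):
    """Discretized Gaussian-kernel density estimate of softmax probabilities.

    samples: (N,) softmax probabilities for one class, collected over the
    N random-resize-and-pad transformations of a single image.
    Returns n_bins values that sum to 1 (one distribution sample per class).
    """
    centers = (np.arange(n_bins) + 0.5) / n_bins           # bin centers in (0, 1)
    diff = centers[None, :] - samples[:, None]             # (N, n_bins)
    dens = np.exp(-0.5 * (diff / width) ** 2).sum(axis=0)  # sum of kernels
    return dens / dens.sum()                               # normalize over bins

rng = np.random.default_rng(4)
probs = np.clip(rng.normal(0.8, 0.05, size=50), 0.0, 1.0)  # toy softmax samples
dist = marginal_kde(probs)
```

Stacking one such 100-bin vector per class gives the 10×100 input that the distribution classifier consumes.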
We trained the model with the previously collected distributions of the 1,000 correctly classified Fashion-MNIST images for 10 epochs, as the authors suggested. For CIFAR-10, we trained the model with the distributions collected from the 10,000 correctly classified images for 50 epochs. For both datasets, we used a learning rate of 0.1 and a batch size of 16. The cost function is the cross entropy loss on the logits, and the distribution classifier is optimized using backpropagation with ADAM.
Testing: We first tested the RRP defense alone with 10,000 clean test images for both CIFAR-10 and Fashion-MNIST to see the drop in clean accuracy. We observed that this defense resulted in approximately 71% accuracy for CIFAR-10 and 82% for Fashion-MNIST. Compared to the clean accuracies we obtain without the defense (93.56% for Fashion-MNIST and 92.78% for CIFAR-10), we observe clear drops in accuracy after random resizing and padding.
We then tested the full implementation with RRP and DRN. In order to compare our results with the paper, we collected 5,000 correctly classified clean images for both datasets and collected distributions after transforming the images using RRP (N = 50 for Fashion-MNIST and N = 100 for CIFAR-10), as we did for training. We observed a clean test accuracy of 87.48% for CIFAR-10 and 97.76% for Fashion-MNIST, which is consistent with the results reported in the original paper. Naturally, if we test on all of the clean testing data (10,000 images), we obtain lower accuracy (approximately 83% for CIFAR-10 and 92% for Fashion-MNIST), since there is also some drop in accuracy caused by the CNN itself. On the other hand, the drop in clean accuracy is smaller than with the basic RRP implementation.
C.7 Feature distillation
Background: The human visual system (HVS) is more sensitive to the low frequency parts of an image and less sensitive to the high frequency parts. Standard JPEG compression is based on this understanding, so the standard JPEG quantization table compresses the less perceptually sensitive parts of the image (i.e., the high frequency components) more than the others. In order to defend against adversarial images, a higher compression rate is needed. However, since CNNs work differently than the HVS, both the testing accuracy and the defense accuracy suffer if a higher compression rate is used across all frequencies. In the Feature Distillation defense, as mentioned in Section 3, a crafted quantization technique is used as a solution to this problem. A large quantization step can reduce adversarial perturbations but also causes more classification errors, so a proper selection of the quantization step is needed. In the crafted quantization technique, the frequency components are separated into an Accuracy Sensitive (AS) band and a Malicious Defense (MD) band. A higher quantization step is applied to the MD band to mitigate adversarial perturbations, while a lower quantization step is used for the AS band to preserve clean accuracy. For more details of this technique, we refer the readers to Liu et al. .
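The crafted quantization idea can be sketched on a single 8×8 DCT block: one quantization step is applied to the low-frequency (AS) coefficients and a larger one to the high-frequency (MD) coefficients. The band split by the coefficient index sum `u + v` and the two step values below are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

D = dct_matrix()

def crafted_quantize(block, qs_as=20, qs_md=40, radius=8):
    """Quantize an 8x8 block: smaller step on the accuracy-sensitive
    (low-frequency) band, larger step on the malicious-defense band."""
    coeffs = D @ block @ D.T                     # forward 2D DCT
    u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
    q = np.where(u + v < radius, qs_as, qs_md)   # assumed band split
    deq = np.round(coeffs / q) * q               # quantize then dequantize
    return D.T @ deq @ D                         # inverse 2D DCT

rng = np.random.default_rng(5)
block = rng.random((8, 8)) * 255                 # toy pixel block
recon = crafted_quantize(block)
```

Because the transform is orthonormal, the per-coefficient rounding error (at most half a quantization step) bounds the reconstruction error, which is exactly the lever the defense tunes against adversarial noise.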
Implementation: The implementation of the defense can be found on the author's Github page (https://github.com/zihaoliu123). However, this defense was only implemented and tested on the ImageNet dataset by the authors. In order to fairly compare our results with the other defenses, we implemented and tested this defense on the CIFAR-10 and Fashion-MNIST datasets.
This defense uses two different methods: a one-pass process and a two-pass process. The one-pass process uses the proposed quantization/dequantization only in the decompression of the image. The two-pass process, on the other hand, uses the proposed quantization/dequantization in compression as well, followed by the one-pass process. In our experiments, we use the two-pass method as it has better defense accuracy than the one-pass process Liu et al. .
In the original paper, experiments were performed in order to find a proper selection of the quantization steps for the AS and MD bands, and the authors fixed the two steps based on those experiments. However, their experiments were performed on ImageNet, where the images are much larger than CIFAR-10 and Fashion-MNIST images. Therefore, we performed our own experiments in order to properly select the two quantization steps for the Fashion-MNIST and CIFAR-10 datasets. For each dataset we start with the vanilla classifier (see C.2). For each vanilla CNN we first do a one-pass and then generate 500 adversarial samples using untargeted FGSM, with dataset-specific attack strengths for CIFAR-10 and Fashion-MNIST. Here we use FGSM to do the hyperparameter selection for the defense because this is how the authors designed the original defense for ImageNet.
After generating the adversarial examples for each QS combination, we do a grid search over the two quantization steps. Specifically, we test 100 defense combinations by varying each quantization step from 10 to 100. For every possible combination we measure the accuracy on the clean test set and on the adversarial examples. The results of these experiments are shown in Figure 16.
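The grid search amounts to a simple nested loop. In the sketch below the two accuracy functions are toy stand-ins (in our experiments they ran the actual classifier on the compressed clean test set and FGSM examples), and the min-of-the-two selection criterion is just one simple way to score the clean/defense trade-off.

```python
import numpy as np

# Hypothetical stand-ins for the real evaluation of a (QS1, QS2) setting.
def clean_accuracy(qs1, qs2):
    return 1.0 - 0.003 * qs1 - 0.001 * qs2      # toy: softer quantization helps

def defense_accuracy(qs1, qs2):
    return 0.2 + 0.002 * qs1 + 0.004 * qs2      # toy: harder quantization helps

best, best_score = None, -1.0
for qs1 in range(10, 101, 10):                  # 10 values for the AS-band step
    for qs2 in range(10, 101, 10):              # 10 values for the MD-band step
        score = min(clean_accuracy(qs1, qs2), defense_accuracy(qs1, qs2))
        if score > best_score:                  # keep best clean/defense trade-off
            best, best_score = (qs1, qs2), score
```

With 10 values per step this evaluates exactly 100 combinations, matching the scale of the search described above.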
In Figure 16, for the CIFAR-10 dataset there is an intersection where the green dots and red dots overlap. This region represents a defense with both higher clean accuracy and higher defense accuracy (the idealized case). There are multiple combinations of the two quantization steps that give a decent trade-off; here we arbitrarily select one from among these better combinations, which gives a clean score of and a defense accuracy of .
In Figure 16, for the Fashion-MNIST dataset there is no region in which both the clean accuracy and the defense accuracy are high. This may show a limitation in the use of feature distillation as a defense for some datasets, as here no ideal trade-off exists. We pick the combination that gives a clean score of and a defense accuracy of ; we picked these values because this combination gave the highest defense accuracy out of all possible hyperparameter choices.
C.8 End-to-end image compression models
The original source code for the defense on Fashion-MNIST and ImageNet was provided by the authors of ComDefend Jia et al.  on their Github page (https://github.com/jiaxiaojunQAQ/Comdefend). In addition, they included their trained compression and reconstruction models for Fashion-MNIST and CIFAR-10 separately.
Since this defense is a pre-processing module, it does not require modifications to the classifier network Jia et al. . Therefore, in order to perform the classification, we used our own models as described in Section C.2 and we combined them with this pre-processing module.
According to the authors of ComDefend, ComCNN and RecCNN were trained on 50,000 clean (unperturbed) images from the CIFAR-10 dataset for 30 epochs using a batch size of 50. In order to use their pre-trained models, we had to install the canton package v0.1.22 for Python. However, we had incompatibility issues between canton and the other Python packages installed on our system. Therefore, instead of installing this package directly, we downloaded the source code of the canton library from its Github page and added it to our defense code separately. We constructed a wrapper for ComDefend where the type of dataset (Fashion-MNIST or CIFAR-10) is indicated as input so that the corresponding classifier (ResNet56 or VGG16) can be used. We tested the defense with the testing data of CIFAR-10 and Fashion-MNIST and were able to achieve an accuracy of and , respectively.
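Structurally, our wrapper just composes a pre-processing module with a classifier. A minimal sketch, where the two callables are hypothetical stand-ins (in our setup they were ComDefend's ComCNN+RecCNN pair and the ResNet56/VGG16 classifier):

```python
import numpy as np

class PreprocessingDefense:
    """Runs a compression/reconstruction module before the classifier."""

    def __init__(self, preprocess, classify):
        self.preprocess = preprocess   # e.g. ComCNN + RecCNN
        self.classify = classify       # e.g. ResNet56 or VGG16

    def predict(self, batch):
        return self.classify(self.preprocess(batch))

# Toy stand-ins: identity "compression" and a fixed linear classifier.
rng = np.random.default_rng(6)
W = rng.normal(size=(32 * 32 * 3, 10))
defense = PreprocessingDefense(
    preprocess=lambda x: np.clip(x, 0.0, 1.0),
    classify=lambda x: (x.reshape(len(x), -1) @ W).argmax(axis=1),
)
labels = defense.predict(rng.random((4, 32, 32, 3)))
```

Because the classifier is untouched, swapping datasets only means swapping the two callables, which is how our dataset-selection input works.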
C.9 The odds are odd
Mathematical background: Here we give a detailed description of the defense, which is based on a statistical test derived from the logits layer. For a given image $x$, we denote by $f(x) = (f_1(x), \dots, f_K(x))$ the logits layer (i.e., the input to the softmax layer) of a classifier, where $f_y(x) = \langle w_y, \phi(x) \rangle$ and $w_y$ is the weight vector for class $y$. The class label is determined by $\hat{y} = \arg\max_y f_y(x)$. We define the pair-wise log-odds between classes $y$ and $z$ as
$$f_{y,z}(x) = f_z(x) - f_y(x), \qquad z \neq y.$$
We denote by $g_{y,z}(x, \eta) = f_{y,z}(x + \eta)$ the noise-perturbed log-odds, where the noise $\eta$ is sampled from a distribution $\mathcal{D}$. Moreover, we define the following standardized statistic for a pair $(y, z)$:
$$\bar{g}_{y,z}(x, \eta) = \frac{g_{y,z}(x, \eta) - \mu_{y,z}}{\sigma_{y,z}},$$
where $\mu_{y,z}$ and $\sigma_{y,z}$ are the mean and standard deviation of $g_{y,z}$ over the data.
For the original (clean) training data set, we compute $\mu_{y,z}$ and $\sigma_{y,z}$ for all pairs $(y, z)$. We apply the untargeted white-box attack (PGD Madry et al. ) to generate the adversarial dataset. After that, we compute the corresponding statistics using the adversarial dataset. We denote by $\tau_{y,z}$ the threshold used to control the false positive rate (FPR); it is computed from these statistics. The distributions of the clean data and of the adversarial data are represented by the two sets of statistics, respectively. These distributions are supposed to be well separated, and $\tau_{y,z}$ is used to control the FPR.
For a given image $x$, the statistical test is done as follows. First, we calculate the expected perturbed log-odds $\bar{g}_{\hat{y},z}(x) = \mathbb{E}_{\eta}\left[\bar{g}_{\hat{y},z}(x, \eta)\right]$, where $\hat{y}$ is the predicted class label of image $x$ given by the vanilla classifier. The test determines the image with label $\hat{y}$ to be adversarial (malicious) if
$$\max_{z \neq \hat{y}} \left\{ \bar{g}_{\hat{y},z}(x) - \tau_{\hat{y},z} \right\} \geq 0.$$
Otherwise, the input is considered benign. In case the test recognizes the image as malicious, the “corrected” class label is defined as
$$\bar{y} = \arg\max_{z \neq \hat{y}} \left\{ \bar{g}_{\hat{y},z}(x) - \tau_{\hat{y},z} \right\}.$$
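The test above can be sketched numerically. The block below uses a toy linear classifier and placeholder statistics ($\mu = 0$, $\sigma = 1$, a deliberately generous threshold); in the real defense these come from the clean and adversarial datasets.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy linear classifier: logits f(x) = W x, K = 4 classes, d = 16 features.
K, d = 4, 16
W = rng.normal(size=(K, d))

def perturbed_logodds(x, y, n_noise=256, sigma=0.1):
    """Expected noise-perturbed log-odds f_{y,z}(x + eta) for all z."""
    noise = rng.normal(0.0, sigma, size=(n_noise, d))
    logits = (x + noise) @ W.T                        # (n_noise, K)
    return (logits - logits[:, [y]]).mean(axis=0)     # f_z - f_y, averaged

def odds_test(x, mu, sd, tau):
    """Flag x as adversarial if any standardized log-odds exceeds its threshold."""
    y = int(np.argmax(W @ x))
    g = (perturbed_logodds(x, y) - mu[y]) / sd[y]     # standardized scores
    scores = g - tau[y]
    scores[y] = -np.inf                               # ignore the z = y entry
    if scores.max() >= 0:
        return True, int(scores.argmax())             # adversarial, corrected label
    return False, y                                   # benign, keep predicted label

# With generous thresholds, a random clean point should pass as benign.
mu, sd = np.zeros((K, K)), np.ones((K, K))
tau = np.full((K, K), 10.0)
flagged, label = odds_test(rng.normal(size=d), mu, sd, tau)
```

Note that the 256 noise draws per image here mirror the 256 evaluations per image used in our implementation below.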
Implementation details: The original source code for the Odds defense Roth et al.  on CIFAR-10 and ImageNet was provided by the authors (https://github.com/yk/icml19_public). We use their code as a guideline for our own defense implementation. We develop the defense for the CIFAR-10 and Fashion-MNIST datasets. For each dataset, we apply the untargeted 10-iteration PGD attack on the vanilla classifier that will be used in the defense; note this is a white-box attack. The PGD attack parameters are chosen separately for CIFAR-10 and Fashion-MNIST. By applying the white-box PGD attack we create the adversarial datasets for the defense. We choose these attack parameters because they yield adversarial examples with small noise. In Roth et al. , the authors assume that adversarial examples are created by adding small noise; hence, they are not robust to the addition of white noise. A given image is first normalized. Then, for each pixel, we generate noise from the chosen distribution and add it to the pixel.
For CIFAR-10, we create 50,000 adversarial examples. For Fashion-MNIST, we create 60,000 adversarial examples. We calculate the defense statistics and thresholds for each dataset at the chosen FPR values, as described in the mathematical background. Each image is evaluated 256 times to compute the expected perturbed log-odds. Table 5 shows the prediction accuracy of the defense on the clean (non-adversarial) datasets for CIFAR-10 and Fashion-MNIST. To compute the clean prediction accuracy, we use 1,000 samples from the test datasets of CIFAR-10 and Fashion-MNIST.
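One simple way to derive a threshold for a target FPR, which we sketch here under the assumption that the threshold is set as a quantile of the clean-data scores, is:

```python
import numpy as np

def fpr_threshold(clean_scores, fpr=0.01):
    """Pick tau so that roughly `fpr` of the clean scores exceed it."""
    return np.quantile(clean_scores, 1.0 - fpr)

rng = np.random.default_rng(8)
scores = rng.normal(size=10_000)        # toy clean-data score distribution
tau = fpr_threshold(scores, fpr=0.01)
rate = (scores > tau).mean()            # empirical FPR on the same data
```

By construction, about 1% of clean inputs exceed `tau` and would be falsely flagged, which is the trade-off the FPR choice controls.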
From Table 5, we can see that while the distributions of clean and adversarial examples are quite well separated for Fashion-MNIST, the distributions of clean and adversarial examples for CIFAR-10 are very close to each other.
For CIFAR-10, the distributions of clean data and adversarial data are close to each other, which makes it difficult to distinguish clean from adversarial examples. It seems that the Odds defense may not be directly applicable to the vanilla model for CIFAR-10. To make the defense effective for CIFAR-10, one may want to adversarially retrain the network first to push these two distributions further apart. In this paper, however, we do not use adversarial retraining (see Section C.1 for the reasoning behind this).
For Fashion-MNIST we chose the lowest FPR we experimented with, since it still gives an acceptable (but rather low) clean accuracy. For CIFAR-10 we likewise chose the lowest FPR that still has a reasonable clean accuracy.
C.10 Buffer zones
For the buffer zone defense Nguyen et al.  we could not find any publicly available source code. We contacted the authors and they provided us with their training code for the two-network defense (BUZz-2) and the eight-network defense (BUZz-8).
CIFAR-10 implementation: For each model in the BUZz ensemble we used a ResNet56 network. We trained each ResNet in the defense for 200 epochs using the ADAM optimizer. We used the standard training dataset and Keras data augmentation. On BUZz-2 we were able to achieve a testing accuracy of . On BUZz-8 we achieved a testing accuracy of .
Fashion-MNIST implementation: For each model in the BUZz ensemble we used a VGG16 network. We trained each network for 100 epochs using the ADAM optimizer. We achieved testing accuracies of and on BUZz-2 and BUZz-8, respectively.
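The decision rule of a buffer-zone-style ensemble can be sketched as a unanimous-voting wrapper (the fixed image transformations BUZz applies before each network are omitted here for brevity):

```python
import numpy as np

def buzz_predict(label_votes, num_classes=10):
    """Unanimity check in the spirit of BUZz: output the common label only if
    every network in the ensemble agrees; otherwise flag the input as
    adversarial via an extra class index."""
    votes = np.asarray(label_votes)
    if np.all(votes == votes[0]):
        return int(votes[0])
    return num_classes                 # disagreement -> 'adversarial' label

agree = buzz_predict([3, 3])                       # BUZz-2 style, both agree
flag = buzz_predict([3, 5, 3, 3, 3, 3, 3, 3])      # BUZz-8 style, one dissent
```

Adding networks (BUZz-2 versus BUZz-8) makes unanimity harder to achieve, which trades clean accuracy for a larger buffer against adversarial inputs.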
For the k-WTA defense, the authors provide Github code for their new activation function (https://github.com/a554b554/kWTA-Activation) using PyTorch. We use their Github code and convert the architectures and weights of our vanilla Tensorflow/Keras models in C.2 to PyTorch. We then change the ReLU activation functions in the CNNs to k-WTA activation functions.
We tried two approaches to maximize the accuracy of this defense. First, we trained our converted PyTorch architectures from scratch with k-WTA activation functions. We found the performance to be slightly lower than what we achieved in Tensorflow/Keras; for CIFAR-10 the main difference arises because we cannot use the exact same Keras data augmentation technique in PyTorch. To compensate, we used a different approach: we transferred not only the architectures but also the trained weights from Tensorflow/Keras to PyTorch, and after replacing the activation functions, we fine-tuned the models for 20 epochs with the ADAM optimizer. Using this approach we came closest to the vanilla model performance with this defense. For CIFAR-10 we were able to achieve a clean testing accuracy of and for Fashion-MNIST we were able to achieve a clean testing accuracy of .
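For reference, the k-WTA activation itself is simple to state: keep the k largest activations per sample and zero the rest. A minimal numpy sketch (the sparsity ratio here is an illustrative value, not the one from the paper):

```python
import numpy as np

def kwta(x, ratio=0.1):
    """k-Winners-Take-All: keep the top-k activations in each row, zero the rest."""
    k = max(1, int(ratio * x.shape[1]))
    # k-th largest value per row acts as the cut-off threshold
    thresh = np.partition(x, -k, axis=1)[:, -k][:, None]
    return np.where(x >= thresh, x, 0.0)

x = np.array([[0.1, 0.5, -0.2, 0.9, 0.3]])
y = kwta(x, ratio=0.4)   # k = 2: only 0.9 and 0.5 survive
```

Unlike ReLU, this hard top-k selection is discontinuous in the input, which is precisely what makes gradient-based attacks on the network less effective.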