1 Introduction
We nowadays see an increasing adoption of deep learning techniques in production systems, some of them even safety-relevant (as exemplified in [1]). Hence, it is not surprising that the discovery of adversarial spaces for neural networks [2] has sparked a lot of interest. A growing community has quickly focused on different ways to reach those spaces [3, 4, 5], understand their properties [1, 6, 7], and protect vulnerable models from their malicious nature [8, 9, 10].
Most prevalent methods exploiting adversarial spaces [3, 4, 5, 10] use gradients as their main starting point to search for perturbations surrounding a clean sample. Due to the intractability of the transformations modeled by neural networks and the limited amount of change that is allowed for perturbations to be considered adversarial, gradients from a classifier expose just enough information about which parts of the input correlate highly with the label the model associates with it. With this in mind, defenses against adversarial attacks have been devised upon those same gradients in two fundamental ways: 1) They complement traditional optimization schemes with a twofold objective that minimizes the overall prediction cost while maximizing the perturbation space around clean images that classifiers can withstand [10, 11, 12]. 2) Gradients are blocked or obfuscated in such a way that attacking algorithms can no longer use them to find effective adversarial perturbations [13, 14, 15]. Type 1 methods enjoy mathematical rigor and hence provide formal guarantees with respect to the kind of perturbations they are robust to. However, note that this can also be disadvantageous, since networks become attack-dependent: any other strategy for finding perturbations could circumvent such a defense mechanism [16]. While effective for small-scale problems such as MNIST and CIFAR, we found no empirical evidence that these methods scale to larger problems such as ImageNet. It has even been shown that defenses against adversarial attacks tested on small datasets do not scale well when applied to bigger problems [6].
Currently, large-scale state-of-the-art defenses rely on the second use of gradients: suppression [17] and blockage [14, 18]. As defined by Athalye et al. [16], gradient obfuscation is the result of instabilities from vanishing or exploding gradients, or of the use of stochastic or non-differentiable preprocessing steps. All these alternatives can be modeled as lossy identities where the original signal contained in the input is preserved while the adversarial perturbation is destroyed. The usefulness of this principle fits well with findings from a recent study showing that image classifiers only use a small fraction of the entire signal within the original input; therefore, a portion of its information can indeed be dropped without affecting performance [19].
In this paper, we propose an alternative defense that affects the information contained in gradients by reforming their class-related signal into a structural one. Intuitively, we learn an identity function that encodes structure and decodes only the structural parts of the input necessary for classification, dropping everything else. To this end, an autoencoder (AE) is trained to approximate the identity function that preserves only the part of the signal that is useful for a target classifier. The structural information is preserved by training both encoder and decoder unsupervised, and fine-tuning only the decoder with gradients coming from an existing classifier. Because the resulting function only looks at structure, its gradients are devoid of any class-related information, thereby invalidating the fundamental assumptions about gradients that attackers rely on. We call this defense a Structure-to-Signal Network (S2SNet).
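The role of the defense as a signal-preserving proxy can be illustrated with a toy numpy sketch. Here, a hypothetical linear classifier and an idealized defense modeled as a projection onto the signal subspace the classifier uses stand in for the real networks; the actual S2SNet is a deep AE, described in Section 3.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier" f(x) = argmax(W @ x). The only signal it can
# possibly use lies in the row space of W.
W = rng.normal(size=(10, 64))

def f(x):
    return int(np.argmax(W @ x))

# Idealized defense h: an identity on the signal used by f (orthogonal
# projection onto the row space of W) that drops everything else.
P = W.T @ np.linalg.pinv(W.T)

def h(x):
    return P @ x

x = rng.normal(size=64)

# Zero compromise on clean inputs: predictions through h agree with f.
assert f(h(x)) == f(x)

# h is a lossy identity: part of the input signal is discarded.
assert np.linalg.norm(h(x)) <= np.linalg.norm(x)

# A perturbation orthogonal to the used signal is removed entirely.
rho = x - P @ x
assert np.allclose(h(x + 5 * rho), h(x))
```

The projection removes any perturbation component the classifier cannot see, mirroring the intuition that the defense preserves only the signal the classifier actually uses.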
1.0.1 Formal Definitions
Let $f$ be an image classifier, and $\hat{x}$ the portion of the signal in the original input $x$ that is effectively being used by $f$. In other words, $f(\hat{x}) = f(x)$ subject to $I(\hat{x}) \leq I(x)$, where $I$ is a measure of information, e.g., the normalized mutual information [20]. Let $P_{x,p}$ be the space of all adversarial and non-adversarial perturbations of $x$ under a given norm $p$. We also define the subspace $A^f_{x,p} \subseteq P_{x,p}$ as the set of adversarial perturbations that are reachable from the input gradients $\nabla_x J(f(x), y)$, where $J$ represents the cost function and $y$ is the ground truth used for training. Note that said spaces depend on $x$ and $p$, but not on the class $y$ or the cost function $J$. Hence, unless noted otherwise, we use the simplified notation $A^f$ from now on. Adversarial perturbations are elements $\rho \in A^f$ such that $f(x + \rho) \neq f(x)$.¹ As attacks use gradients to compute adversarial perturbations, and such gradients can only correspond to parts of the signal that are used by the model, it follows that $A^f$ is determined by $\hat{x}$. An S2SNet can hence be defined as a function $h$ such that $f(h(x)) = f(x)$. This way, $h$ defends $f$ by serving as a proxy for incoming, potentially malicious inputs, as well as exposing gradients via the function composition $f \circ h$. The relation between $\hat{x}$ and $x$ concerning information implies that $h$ produces lossy reconstructions of the input space. We train $h$ in a way that its gradients lie in a different space than those leading attackers to $A^f$ (Figure 1). In other words, we train $h$ such that $|A^{f \circ h} \cap A^f|$ is minimized. This, in turn, will cause the intersection to be smaller, resulting in perturbations $\rho \in A^{f \circ h}$ that are non-adversarial for $f$ (i.e., $f(x + \rho) = f(x)$). Further details of the architecture and training of $h$ and $f \circ h$ are discussed in Section 3.1.
¹ Strictly speaking, adversarial perturbations can be reached through other domains that do not depend on gradients (as shown by Nguyen et al. [21]), but so far, all instances of strong adversarial attack methods base their entire strategy on the information of gradients.
The architecture of this defense offers important advantages when dealing with adversarial attacks:

Zero compromise for safe use-cases: there is no drop in performance when the defense is deployed but the network is not under attack. As the transformation done by the S2SNet preserves all of the signal required by the classifier, using either the classifier alone or the defended composition has no impact on predictions when clean images are used.

Removing the S2SNet is a defense strategy: a novel graybox defense (i.e., when the attacker knows the classifier but not the defense mechanism) works by exposing the gradients of an S2SNet to the attacker and then removing it, so that adversarial attacks (based on gradients from the S2SNet) are applied to the original classifier instead.

Attack agnostic: S2SNets rely on the same information used by adversarial attacks but not on the attacks themselves. Therefore, S2SNets do not require any assumptions with respect to the specific way an attack works.

Posthoc implementation: our defense uses gradients from a trained network, and can be used to defend models that are already in production. Likewise, no special considerations need to be made when training a new classifier from scratch.

Compatibility with other defenses: due to the compositional nature of this approach, any additional defense strategies that work with the original classifier can be implemented on top of the ensemble of S2SNet and classifier.
We test S2SNets on two high-performing image classifiers (ResNet 50 [22] and Inception v3 [23]) against three attack methods (Fast Gradient Sign Method (FGSM) [3], Basic Iterative Method (BIM) [24], and Carlini-Wagner (CW) [5]) on the large-scale ImageNet [25] dataset. Experiments are conducted on both classifiers under whitebox and graybox conditions. An evaluation of the effectiveness of S2SNets with respect to regular AEs (e.g., as proposed in [26]) is presented in Section 3.1, empirically showing that S2SNets are a better approximation of the signal used by a classifier. Furthermore, an evaluation of the gradient spaces of the defended and undefended classifiers is conducted, showing that their intersection does indeed approach the empty set.
The main contributions of this paper are threefold: First, we propose a novel way to interpret adversarial perturbations, namely in terms of the effective part of the input signal that classifiers use. Second, we introduce a robust and flexible defense against large-scale adversarial attacks based on S2SNets. And third, we provide a comprehensive baseline evaluation of adversarial attacks for several state-of-the-art models on a large dataset.
2 Related Work
The fast-growing interest in the phenomenon of adversarial attacks has gained momentum since its discovery [2] and has had three main areas of focus. The first area seeks new and more effective ways of reaching adversarial spaces. In [3], a first comprehensive analysis of the extent of adversarial spaces was presented, proposing a fast method to compute perturbations based on the sign of gradients. An iterative version of this method was later introduced [24] and shown to work significantly better, even when applied to images that were physically printed and digitized again. A prominent exception to gradient-based attacks succeeded using evolutionary algorithms [21]. Nevertheless, this has not become a widespread practical method, mostly due to its computational cost. Papernot et al. [1] showed how effective adversarial attacks can be, even with very few assumptions about the attacked model. Finally, there are methods that go beyond a greedy iteration over the gradient space and perform different kinds of optimization that maximize misclassification while minimizing the norm of the perturbation [4, 5].
The second area focuses on understanding the properties of adversarial perturbations. The work of Goodfellow et al. [3] already pointed at the linear nature of neural networks as the main enabler of adversarial attacks. This stood in opposition to what was initially theorized, namely that non-linearities were the main vulnerability. Not only was it possible to perturb natural-looking images to be classified as something completely different, but it was also possible to get models issuing high-confidence predictions for noise images or highly artificial patterns [2, 21]. The transferability of adversarial perturbations was shown by crafting attacks on one network and using them to fool a second classifier [27]. However, transferable attacks are limited to the simpler methods, as iterative ones tend to exploit particularities of each model and hence lose power when used on different architectures [6]. As it turns out, not only is adversarial noise transferable between models, but it is also possible to apply a single universal adversarial perturbation to all samples in a dataset and achieve high misclassification rates [7]. Such perturbations can even be applied to physical objects to bias a model towards a specific class [28].
The third and arguably the most popular area of research has focused on how networks can be protected against such attacks. Strategies include changing the optimization objective to account for possible adversarial spaces [10, 11], detection [9], dataset augmentation that includes adversarial examples [3, 12], suppressing perturbations [8, 15, 17, 26, 29], or obfuscating the gradients to prevent attackers from estimating an effective perturbation [13, 14, 18, 30, 31].
In this work, we build on the idea of using AEs as a compressed representation of the input [15, 26], but tailored towards a specific characteristic of adversarial perturbations [17], using the notion of useful input signal (i.e., the signal effectively used by a classifier) [19]. Furthermore, we explore the nature of adversarial perturbations and their relationship with network capacity in terms of used signal.
3 Methods
This section explains in detail the architecture of an S2SNet and its particular signal-preserving training scheme, followed by an empirical evaluation of the gradients it provides. We start by testing the robustness of S2SNets in a whitebox setting, and compare it to a simple baseline using regular AEs. To recreate realistic attack conditions, we further test the ensemble network against a simulated reparametrization technique similar to [16], aimed at circumventing the defense, and explore further strategies to cope with this attack. Next, we examine the performance of S2SNets in a graybox scenario. Finally, we evaluate the transferability of single-step attacks between different models, and relate their overlap in terms of used input signal to the effectiveness of adversarial perturbations.
3.1 StructuretoSignal Networks
S2SNets start out as plain AEs that are trained on the large-scale YFCC100m dataset [32]. Only a single pass is required, as its millions of images are more than sufficient to train the underlying SegNet architecture [33] to convergence.² This pretrained SegNet AE is able to reproduce the input signals required by a diverse set of classifiers such that their top-1 accuracy stays within a few percentage points of the original classification performance [19].
² We choose a large architecture as an upper bound to the ideal identity function, because it was proven capable of encoding the semantics of the deep image classifiers tested in this work.
To model the effective input signal used by a trained classifier, Palacio et al. propose to further fine-tune the decoder of the AE using gradients from the classifier itself. This allows the AE to learn a way to decode the input that retains the signal required by the classifier. This fine-tuned variant, the actual S2SNet, reconstructs images such that the original top-1 accuracy of the classifier is preserved, while the amount of information in the reconstructed image (measured as normalized mutual information) decreases with respect to the original sample.
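The second stage, in which only the decoder receives gradients from the classifier, can be sketched with a toy numpy example. The linear encoder, decoder, and classifier, as well as the dimensions and learning rate, are illustrative assumptions, not the SegNet/ResNet setup of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: frozen encoder E, frozen classifier W, and a
# trainable decoder D. Only D is updated, mirroring the two-stage scheme.
E = 0.1 * rng.normal(size=(16, 64))   # encoder (stage 1, frozen here)
W = 0.1 * rng.normal(size=(10, 64))   # classifier under defense (frozen)
D = 0.1 * rng.normal(size=(64, 16))   # decoder (fine-tuned in stage 2)

x = rng.normal(size=64)
y = 3                                  # assumed ground-truth class

def ce_loss(D):
    z = W @ (D @ (E @ x))              # logits of the composed network
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y]), p

loss_before, _ = ce_loss(D)
c = E @ x                              # fixed code produced by the encoder
for _ in range(300):                   # plain gradient descent on D only
    _, p = ce_loss(D)
    grad_logits = p.copy()
    grad_logits[y] -= 1.0              # d(cross-entropy)/d(logits)
    D -= 0.05 * np.outer(W.T @ grad_logits, c)   # chain rule down to D
loss_after, _ = ce_loss(D)

# The decoder adapts to the classifier while the encoder stays class-agnostic.
assert loss_after < loss_before
```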
Note that, since the encoder of an S2SNet was trained unsupervised, any intermediate representations produced by the encoder are entirely class-agnostic. This means that, during backpropagation through an S2SNet, the gradients that can be read at the shallowest layer correspond only to information about structure. Intuitively, gradients from the S2SNet point to parts of the image that can be changed to influence the reconstruction error. In the following, we measure the extent to which gradients shift when images are forwarded through these networks. Furthermore, we explore and quantify their emerging resilience to adversarial attacks in Section 4.
3.2 Properties of Gradient Distributions
To verify that the Structure-to-Signal training scheme produces large shifts in the distribution of gradients, we forward images through a classifier, a pretrained SegNet, and the fine-tuned counterpart, and compare their gradients. We use the magnitude of the gradients instead of raw values to stress the differences of their spatial distribution. A large change in the position where gradients originally occur within the image is a good indicator that the information conveyed by gradients has changed.
Concretely, we quantify the structural similarity [34] (SSIM; a locally normalized mean square error measured in a sliding window) of the gradients obtained from the same image when passed through a ResNet 50 alone, an S2SNet fine-tuned to defend the same ResNet 50, a plain SegNet AE coupled with ResNet 50, the fine-tuned S2SNet without the classifier, and the plain SegNet AE alone. While the first three models require gradients to be computed with respect to a class label, the last two are produced by measuring the reconstruction error. Also, note that for the fine-tuned S2SNet the true reconstruction target cannot be directly obtained, as it is implicitly defined by the classifier it was fine-tuned on. In this case, reconstruction gradients are computed by comparing its output to the original input.
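The comparison measure can be sketched in numpy. For brevity, this toy uses a single global SSIM window instead of the sliding window used in the paper, and random arrays as hypothetical stand-ins for gradient-magnitude maps:

```python
import numpy as np

def ssim(a, b, data_range=1.0):
    """Global SSIM (the paper applies it in a sliding window; a single
    global window is used here for brevity)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

rng = np.random.default_rng(2)

# Stand-ins for gradient-magnitude maps: identical maps score 1, while
# spatially unrelated maps score near 0, which is how a large shift in
# where gradients occur shows up in Table 1.
g_cls = rng.random((64, 64))   # e.g., |gradients| from a classifier
g_ae = rng.random((64, 64))    # e.g., |gradients| from an AE

assert abs(ssim(g_cls, g_cls) - 1.0) < 1e-9
assert ssim(g_cls, g_ae) < 0.5
```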
Table 1 reports the mean SSIM of the gradients over ImageNet’s validation set for all combinations of network pairs. Note that the dissimilarity between the fine-tuned S2SNet and the plain SegNet AE indicates that the Structure-to-Signal training scheme has indeed changed the reconstruction process, i.e., the identity function being computed based on the input. Similarly, comparing the SSIM values of the two ensembles (classifier plus fine-tuned S2SNet versus classifier plus plain AE) reveals a difference in the information contained in their gradients. Most importantly, the similarity between the gradients coming from the classifier alone and either of the two ensembles is considerably smaller than for any combination involving AEs exclusively (Table 1). This disparity indicates how much the position of gradients changes (and hence the information contained within) when they are passed through an S2SNet.
Further evidence for the class-agnostic nature of gradients propagated through AEs can be found by comparing the SSIM of gradient magnitudes when different target labels are used to compute the gradients. Let $x$ be some input image and $y$ its true label. We then randomly select a different label $\bar{y} \neq y$ and compare the gradient magnitudes obtained with $y$ and $\bar{y}$ for all $x$ in ImageNet’s validation set. The measured mean SSIM values show that the influence of the label is smaller when gradients are propagated through the AE, but also emphasize how dissimilar the gradients of just ResNet 50 are compared to any AE (Table 1).
Figure 2 visualizes this phenomenon. Here, the gradient magnitudes observed for the AEs alone (plain and fine-tuned) predominantly highlight edges as a source of error. This is expected, since it is more difficult to accurately reproduce the high frequencies required by sharp edges than the lower frequencies of blobs. Magnitudes extracted through the AE-classifier ensembles based on the classification error show similar structures. Some coincidental overlap between ResNet 50 and the AE variants is unavoidable, since edges are also important for classification [35]. However, the classifier on its own differs considerably from all other patterns. Overall, SSIM is at least twice as high between AE variants as between the classifier and any of the AE configurations.
4 Experiments
This section presents experiments quantifying the robustness of S2SNets when used as a defense against adversarial attacks (Figure 3). The experimental setup closely follows the conditions from Guo et al. [14] in order to facilitate comparability:

Dataset: We use ImageNet to test S2SNets in a challenging, largescale scenario. Classifiers are trained on its training set and evaluations are carried out on the full validation set (50000 images).

Image classifier under attack: we use ResNet 50 and Inception v3, both pretrained on ImageNet, as target models. The classifiers have been trained under clean conditions, i.e., no special considerations with respect to adversarial attacks were made during training.

Defense: we train an S2SNet for each of the classifiers under attack, following the scheme described in Section 3.1, yielding one dedicated defense for ResNet 50 and one for Inception v3.

Perturbation Magnitude: we use the normalized $L_2$ norm $\|x - x'\|_2 / \|x\|_2$ between a clean sample $x$ and its adversary $x'$, as defined by Guo et al. [14]. Epsilon values for each attack are listed below.

Defense Strength Metric: vulnerability to adversarial attacks is measured in terms of the number of newly misclassified samples. More precisely, for any given attack to classifier $f$, we calculate $|\{x \in T : f(x') \neq f(x)\}| \,/\, |T|$, where $T$ is the set of true positives and $x'$ is the adversarial example generated by the attack based on $x$.

Attack Methods: protected models are tested against a single-step method, an iterative variant, and an optimization-based alternative. Note that, to replicate realistic threat conditions, all resulting adversarial samples are cast to the discrete RGB range [0, 255].

Fast Gradient Sign Method (FGSM) [3]: a simple yet effective attack that also works when transferred to different models. Epsilon values used for this method are .

Basic Iterative Method (BIM) [24]: an iterative version of FGSM that is more effective as an attack but transfers less well. Epsilon values used for BIM are . The number of iterations is fixed at .

Carlini-Wagner (CW) [5]: an optimization-based method that has proven effective even against hardened models. Note that this attack issues perturbations that lie in a continuous domain. Epsilon values used for CW are ; the number of iterations is fixed at 100, and .


Attack Conditions: we test the proposed defenses under two conditions.

Whitebox setting: the attacker has knowledge about the classifier and the defense. This includes read access to the predictions of the classifier, intermediate activations, and backpropagated gradients. The attacker is forced to forward valid images through the defended network, i.e., through the composition of the corresponding S2SNet and classifier.

Graybox setting: the attacker has access to a classification model but is unaware of its defense strategy.
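The protocol above (an FGSM-style attack, the normalized $L_2$ dissimilarity, and the defense-strength metric) can be sketched end-to-end with a toy linear classifier. The model and data are hypothetical stand-ins; the real experiments use ResNet 50 and Inception v3 on ImageNet:

```python
import numpy as np

rng = np.random.default_rng(3)

W = rng.normal(size=(10, 32))          # hypothetical linear classifier
X = rng.normal(size=(200, 32))         # toy "validation" samples
Y = np.argmax(X @ W.T, axis=1)         # labels this classifier gets right

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fgsm(x, y, eps):
    p = softmax(W @ x)
    p[y] -= 1.0                        # d(cross-entropy)/d(logits)
    grad_x = W.T @ p                   # input gradient of the loss
    return x + eps * np.sign(grad_x)

# By construction, every sample here is a true positive.
eps = 0.1
X_adv = np.stack([fgsm(x, y, eps) for x, y in zip(X, Y)])

# Defense-strength metric: fraction of newly misclassified true positives.
fooling_rate = np.mean(np.argmax(X_adv @ W.T, axis=1) != Y)
assert 0.0 <= fooling_rate <= 1.0

# Normalized L2 dissimilarity between clean and adversarial samples.
dissim = np.mean(np.linalg.norm(X_adv - X, axis=1)
                 / np.linalg.norm(X, axis=1))
assert dissim > 0.0
```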

4.1 WhiteBox
For this setting, input images flow first through the S2SNet before reaching the original classifier. Similarly, gradients are read from the shallowest layer of the S2SNet. For comparison, attacks are also run on the unprotected versions of ResNet 50 and Inception v3. Results are summarized in Figure 4.
Under these conditions, plain models are only able to moderately resist FGSM attacks, and fail completely against the more capable BIM and CW attacks. In contrast, S2SNets provide high levels of protection. CW, being an optimization attack and generally the most capable in our comparison, is expected to be more effective at breaking through the S2SNet defense. However, it does so by introducing large perturbations even for small values of $\epsilon$. In fact, none of the attacking configurations was able to entirely fool any of our defended classifiers, even for the highest levels of $\epsilon$ in our tests. Note that these whitebox results are, to our surprise, already comparable with some state-of-the-art graybox defenses [14, 18] (i.e., known classifier but unknown defense). Despite the less favorable conditions of a whitebox defense, S2SNets already match alternative state-of-the-art protections that were tested under the more permissive assumptions allowed in graybox settings.
For completeness, we also evaluate the defense using a pretrained SegNet coupled with ResNet 50 instead of an S2SNet (shown in Figure 4 as a light solid green line). This is most similar to the defense proposed by Meng et al. [26]. We can confirm that an AE alone does not suffice to guard the classifier against adversarial attacks, and its performance is consistently lower than that of S2SNets. For a visual analysis of the attacks and their reconstructions by S2SNets, we refer the reader to the supplementary material.
4.1.1 Bypassing S2SNets through Reparametrization
In order to push the limits of S2SNets, we now simulate a more hostile scenario where an attacker tries to actively circumvent the defense mechanism. Based on the work of Athalye et al. [16], a reparametrization of the input space can be implemented for defenses that operate via function composition. This works by defining the original input as a differentiable function $g$ of a hidden state $z$, such that $x = g(z)$ and the defense leaves $g(z)$ unchanged; attacks can then be computed with respect to $z$. Note that the main motivation behind reparametrization is to alleviate instabilities of gradients caused by defenses that rely on said instability. Although S2SNets are very deep architectures, the resilience of this defense lies in the directed change that has been induced in the information contained in gradients. In addition, there is no trivial way to come up with a suitable $g$ that does not end up inducing the same transformation in the gradients as done by S2SNets.
Despite all these concerns, we assume for this experiment that such a $g$ can be found, and that attacking it does indeed circumvent S2SNets altogether. We simulate the potential strength of this attack by using gradients of the original (unprotected) classifier and applying them to the input directly. The resulting adversarial sample is then passed through the hardened classifier. We refer to this hypothetical scenario as WhiteBox. Results are shown in Figure 5.
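This simulation can be sketched with a toy linear classifier and an idealized projection defense (both hypothetical stand-ins): the loss gradient of the unprotected model lies entirely in the signal subspace such a defense preserves, so a gradient-direction perturbation passes through the defense untouched.

```python
import numpy as np

rng = np.random.default_rng(4)

W = rng.normal(size=(10, 64))          # unprotected linear classifier
P = W.T @ np.linalg.pinv(W.T)          # idealized defense: keeps only
                                       # the signal subspace used by W

x = rng.normal(size=64)
y = int(np.argmax(W @ x))

# Attack computed on the UNPROTECTED model (the simulated bypass).
z = W @ x
p = np.exp(z - z.max()); p /= p.sum()
p[y] -= 1.0
grad_x = W.T @ p                       # lies in the row space of W

# The raw gradient is preserved exactly by the projection defense.
assert np.allclose(P @ grad_x, grad_x)

# Hence a gradient-direction perturbation survives the defense, which is
# why hardened models revert to unprotected behavior in this scenario.
rho = 0.1 * grad_x / np.linalg.norm(grad_x)
assert np.allclose(P @ (x + rho) - P @ x, rho)
```

Note that FGSM's sign step generally leaves this subspace; the sketch uses the raw gradient direction to make the preservation exact.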
Under these conditions, we observe that hardened models revert back to the behavior shown by their corresponding unprotected versions for FGSM and BIM. In general, this outcome is expected, and confirms once more that S2SNets are trained to preserve the information that is useful to the classifier. Gradients collected directly from a vulnerable model have, by definition, only information that is useful for classification; hence, perturbations based on those gradients will be preserved by S2SNets. Interestingly, while CW was most successful in the previous whitebox experiment, its highly optimized perturbations are less effective here. We believe that CW “overfits” more strongly on the adversarial signal of the original model than what S2SNets find useful to preserve.
At this point, we can further exploit the benefits that S2SNets enjoy as a defense by adding another layer of protection and showing how it affects the effectiveness of an attack. As shown by Palacio et al. [19], AEs following a Structure-to-Signal training scheme exhibit strong resilience to random noise, as opposed to traditionally trained AEs. We demonstrate that it is straightforward to add a layer of protection based on random noise. This stochastic strategy is applied after the adversarial image has been computed but before it passes through the S2SNet defense. We experiment with three sources of noise:

Gaussian Noise: $x' = x + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$.

Uniform Noise: $x' = x + \eta$, where $\eta \sim \mathcal{U}(-\sigma, \sigma)$.

Sign Noise: $x' = x + \sigma s$, where $s$ is drawn uniformly from $\{-1, +1\}$.
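A minimal sketch of the three noise sources follows; the parameterization by a single strength value sigma is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_noise(x, sigma):
    return x + rng.normal(0.0, sigma, size=x.shape)

def uniform_noise(x, sigma):
    return x + rng.uniform(-sigma, sigma, size=x.shape)

def sign_noise(x, sigma):
    # Random +/- sigma at every position.
    return x + sigma * rng.choice([-1.0, 1.0], size=x.shape)

x = rng.random((8, 8))
sigma = 0.1

for noisy in (gaussian_noise(x, sigma), uniform_noise(x, sigma),
              sign_noise(x, sigma)):
    assert noisy.shape == x.shape

# Sign noise shifts every pixel by exactly +/- sigma.
assert np.allclose(np.abs(sign_noise(x, sigma) - x), sigma)
```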
Figure 6 shows the results of a BIM attack on ResNet 50 with varying noise levels under the WhiteBox conditions of Section 4.1.1. Overall, as the noise strength increases, the resilience to adversarial attacks improves. The initial degradation in the absence of an adversarial attack depends on the amount of noise, which can be tuned as a trade-off between maximum accuracy and adversarial robustness. We observe that Gaussian and sign noise perform almost identically, while uniform noise offers the least effective protection.
4.2 GrayBox
In contrast to whitebox scenarios, graybox attacks assume limited access to information about the attack target. More specifically, the conditions for a graybox attack specify that the attacker has knowledge about the network but not about the defense strategy.
Given the compositional nature of S2SNets, a novel way to defend a classification network under graybox conditions consists of giving the attacker access to the gradients of the S2SNet. Once an attack is ready, the S2SNet is removed and the perturbed image is processed by the classifier alone. In other words, for a classifier $f$ and corresponding defense $h$, the attack is crafted based on the combined network $f \circ h$ but forwarded to just the classifier $f$. We refer to this type of defense as GrayBox.
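The removal defense can be sketched procedurally with a toy linear classifier and a fixed linear smoothing operator as a hypothetical "structural" stand-in for the S2SNet. The attack is crafted through the composition, but the perturbed image is evaluated on the classifier alone:

```python
import numpy as np

rng = np.random.default_rng(6)

W = rng.normal(size=(10, 64))              # classifier
# Stand-in "structural" defense: a fixed circular smoothing operator.
B = np.eye(64)
for i in range(64):
    B[i, (i - 1) % 64] = B[i, (i + 1) % 64] = 0.5
B /= B.sum(axis=1, keepdims=True)

x = rng.normal(size=64)
y = int(np.argmax(W @ x))

# The attacker only sees the composition: logits are W @ B @ x, so the
# input gradient of the loss is routed through B's transpose.
z = W @ (B @ x)
p = np.exp(z - z.max()); p /= p.sum()
p[y] -= 1.0
rho = 0.1 * np.sign(B.T @ (W.T @ p))       # FGSM step against the composition

# Defense policy: remove the smoothing step, evaluate on the classifier.
pred_clean = int(np.argmax(W @ x))
pred_attacked = int(np.argmax(W @ (x + rho)))  # what Figure 7 measures

assert np.max(np.abs(rho)) <= 0.1 + 1e-12  # perturbation budget respected
assert pred_clean == y
```

Whether `pred_attacked` still equals the clean prediction is exactly the robustness that the GrayBox experiments quantify.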
We run the attacks with the same experimental settings described in Section 4.1, but enforce a defense policy following the GrayBox conditions. Results are presented in Figure 7. Overall, we see that this scenario is consistently robust to all tested adversarial attacks. Perturbations that do fool the classifier are clearly visible and can no longer be considered adversarial. With BIM-based attacks, classifiers gain back roughly half as much accuracy as in the whitebox case. Again, CW is more comparable to FGSM than to BIM, i.e., mostly ineffective, consistent with the observations made in the WhiteBox case.
By combining the results of GrayBox and WhiteBox, we can reason about the relationship between the space of adversarial perturbations reachable from the classifier’s gradients, $A^f$ (Section 1), and its counterpart for the defended ensemble, $A^{f \circ h}$. First, the WhiteBox experiment tells us that attacks crafted for $f$ are valid attacks for $f \circ h$. Following from the definition of these spaces, it holds that $A^f \subseteq A^{f \circ h}$. Furthermore, looking at the GrayBox experiments, taking elements from the gradients of $f \circ h$ to create an attack produces elements in $A^{f \circ h}$. As these do not attack $f$, we conclude that those same elements do not lie in $A^f$. We finally conclude that, with the tested attacks, $A^f$ is not reachable from the gradients of $f \circ h$.
4.3 Classifier Relationships in Adversarial Space
Given the high robustness of S2SNets in the GrayBox scenario, we do not expect further insights from traditional blackbox attacks. Instead, we explore the relationships between the signals that different image classifiers use, as reported in [19]. Intuitively, if the signal used by a classifier $f_1$ also encompasses the signal that another classifier $f_2$ uses, then an adversarial attack crafted for $f_1$ should also fool $f_2$. Note that this relation is directional and may not hold in the opposite direction.
To test this, we construct adversarial samples using FGSM for four reference classifiers: AlexNet [36], VGG 16 [37], ResNet 50, and Inception v3. For each classifier, we run blackbox attacks with perturbations computed on the other three models. The resulting accuracies (in terms of relative drop) are shown in Figure 8.
As expected, adversarial examples from more compatible architectures result in higher fooling ratios (lower accuracy). Overall, adversarial samples created for AlexNet are the least compatible among the four, mainly due to its comparably lower accuracy and the limited amount of input signal it uses. On the other hand, ResNet 50 produces the most compatible perturbations, being the model whose attacks were most effective when applied to other architectures.
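The cross-model protocol can be sketched with toy linear classifiers whose used signal overlaps via a shared feature block (entirely hypothetical stand-ins for the four CNNs):

```python
import numpy as np

rng = np.random.default_rng(7)
shared = rng.normal(size=(6, 32))          # signal shared by all models

# Three toy classifiers mixing shared and private signal.
models = [np.vstack([shared, rng.normal(size=(4, 32))]) for _ in range(3)]

X = rng.normal(size=(100, 32))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fgsm(W, x, y, eps=0.2):
    p = softmax(W @ x)
    p[y] -= 1.0
    return x + eps * np.sign(W.T @ p)

# fooling[i, j]: fraction of source-model-i attacks that fool model j,
# measured on model j's own true positives.
fooling = np.zeros((3, 3))
for i, Wi in enumerate(models):
    for j, Wj in enumerate(models):
        yj = np.argmax(X @ Wj.T, axis=1)   # true positives of model j
        yi = np.argmax(X @ Wi.T, axis=1)   # labels used to craft attacks
        Xa = np.stack([fgsm(Wi, x, y) for x, y in zip(X, yi)])
        fooling[i, j] = np.mean(np.argmax(Xa @ Wj.T, axis=1) != yj)

assert fooling.shape == (3, 3)
assert np.all((fooling >= 0.0) & (fooling <= 1.0))
```

Rows of this matrix correspond to the source model of the perturbation and columns to the attacked model, analogous to Figure 8.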
Furthermore, we can also analyze how easy a network is to fool, based on the following criteria:

Fastest drop in accuracy per classifier.

Mean accuracy over adversarial samples coming from other classifiers at the same $\epsilon$.
With either criterion, results clearly show that VGG 16 is the easiest to fool, coming in first place, followed by ResNet 50, AlexNet, and finally Inception v3. This relationship in terms of useful signal aligns with results in [19]. Furthermore, similar experiments in [7] based on universal perturbations indicate that a selection of similar architectures (CaffeNet, VGG 19, ResNet 101, and GoogLeNet) maintains the same relationship.
5 Conclusions & Future Work
We have proposed S2SNets as a new method to defend neural networks against adversarial examples. We model the defense strategy as a transformation of the domain used by attackers, namely the gradients coming from the attacked classifier. Instead of focusing on gradient obfuscation via non-differentiable methods or other instabilities, we purposely induce a transformation on the gradients that strips them of any semantic information. S2SNets work by masking the classifier via function composition. That way, the information of the inputs is preserved for classification, but gradients point to structural changes for reconstructing the original sample. Such a defense is made possible by a novel two-stage Structure-to-Signal training scheme for deep AEs. In the first stage, the AE is trained unsupervised in the traditional way. In the second, only the decoder is fine-tuned with gradients from the model that is to be defended.
We evaluate the proposed defense under whitebox and graybox settings using a large-scale dataset, against three different attack methods, on two high-performing deep image classifiers. A baseline comparison shows that the two-stage training scheme performs better than using regular AEs. Most interestingly, we show that resilience to adversarial noise under whitebox conditions is comparable to state-of-the-art performance under more favorable graybox settings. Furthermore, we show how the properties of S2SNets can be exploited to add further defense mechanisms that maintain robustness even under the harshest, albeit currently hypothetical, conditions where the protection of S2SNets is circumvented. A graybox scenario was also tested where the defense consists of the removal of the S2SNet, showing high levels of robustness against all attacks. Finally, a comparison of the resilience of four well-known deep CNNs is presented, providing further evidence that an order relation exists between these classifiers in terms of the amount of signal they use; this time, in terms of the effectiveness of adversarial noise.
S2SNets are only one way in which such a transformation of the gradient space can be realized. We would like to explore other ways in which this transformation can occur, and whether the intersection between the gradient spaces yielding successful adversarial perturbations can be made effectively empty. The signal-preserving nature of S2SNets makes this defense a potential mechanism to explore and understand the nature of attacks. Comparing the classification consistency of a clean sample before and after being passed through an S2SNet has potential implications for the detection of adversarial attacks by learning abnormal distribution fluctuations.
Acknowledgments
This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program. We thank all members of the Deep Learning Competence Center at the DFKI for their comments and support.
References
 [1] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint (2016)
 [2] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
 [3] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)

 [4] Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Number EPFL-CONF-218057 (2016)
 [5] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: Security and Privacy (SP), 2017 IEEE Symposium on, IEEE (2017) 39–57
 [6] Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial machine learning at scale. In: International Conference on Learning Representations. (2017)
 [7] Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. arXiv preprint (2017)
 [8] Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: Security and Privacy (SP), 2016 IEEE Symposium on, IEEE (2016) 582–597
 [9] Feinman, R., Curtin, R.R., Shintre, S., Gardner, A.B.: Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410 (2017)
 [10] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations. (2018)
 [11] Sinha, A., Namkoong, H., Duchi, J.: Certifiable distributional robustness with principled adversarial training. In: International Conference on Learning Representations. (2018)
 [12] Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations (2018)
 [13] Buckman, J., Roy, A., Raffel, C., Goodfellow, I.: Thermometer encoding: One hot way to resist adversarial examples. In: International Conference on Learning Representations. (2018)
 [14] Guo, C., Rana, M., Cisse, M., van der Maaten, L.: Countering adversarial images using input transformations. International Conference on Learning Representations (2018) accepted as poster.
 [15] Song, Y., Kim, T., Nowozin, S., Ermon, S., Kushman, N.: PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In: International Conference on Learning Representations. (2018)
 [16] Athalye, A., Carlini, N., Wagner, D.: Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420 (2018)
 [17] Liao, F., Liang, M., Dong, Y., Pang, T., Zhu, J., Hu, X.: Defense against adversarial attacks using high-level representation guided denoiser. arXiv preprint arXiv:1712.02976 (2017)
 [18] Xie, C., Wang, J., Zhang, Z., Ren, Z., Yuille, A.: Mitigating adversarial effects through randomization. In: International Conference on Learning Representations. (2018)
 [19] Palacio, S., Folz, J., Hees, J., Raue, F., Borth, D., Dengel, A.: What do deep networks like to see? In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2018)
 [20] Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research 3(Dec) (2002) 583–617

 [21] Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems. (2016) 3387–3395
 [22] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
 [23] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. CoRR abs/1512.00567 (2015)
 [24] Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. CoRR abs/1607.02533 (2016)
 [25] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
 [26] Meng, D., Chen, H.: MagNet: a two-pronged defense against adversarial examples. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM (2017) 135–147
 [27] Liu, Y., Chen, X., Liu, C., Song, D.: Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 (2016)
 [28] Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
 [29] Dhillon, G.S., Azizzadenesheli, K., Bernstein, J.D., Kossaifi, J., Khanna, A., Lipton, Z.C., Anandkumar, A.: Stochastic activation pruning for robust adversarial defense. In: International Conference on Learning Representations. (2018)
 [30] Wang, Q., Guo, W., Zhang, K., Ororbia II, A.G., Xing, X., Giles, C.L., Liu, X.: Learning adversary-resistant deep neural networks. arXiv preprint arXiv:1612.01401 (2016)
 [31] Samangouei, P., Kabkab, M., Chellappa, R.: Defense-GAN: Protecting classifiers against adversarial attacks using generative models. International Conference on Learning Representations (2018) accepted as poster.
 [32] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.: The new data and new challenges in multimedia research. CoRR abs/1503.01817 (2015)
 [33] Badrinarayanan, V., Handa, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293 (2015)
 [34] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4) (April 2004) 600–612
 [35] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European conference on computer vision, Springer (2014) 818–833
 [36] Krizhevsky, A.: One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014)
 [37] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
Appendix 0.A On the Relationship between the Two Perturbation Spaces: Formal Proof
Experiments of Section 3.2 show that the domain of attacks is shifted such that their range no longer lies in the vulnerable space. We conduct a formal analysis of the relationship between the two perturbation spaces and show that one is a subset of the other. A few simplifications are required for the proof, which will be accounted for at the end of this section. First, we use the results of BIM with ε = 0.05 tested on ResNet-50 as a reference. The values for the baseline (no defense), WhiteBox, and GrayBox settings are 0.0, 0.0, and 0.672041 respectively. Computing the relative drop in performance (i.e., normalizing by 0.75004, the accuracy under no attack) yields 0.0, 0.0, and 0.8960. To simplify the handling of fuzzy sets, we assign the membership function for each perturbation set to be either 0 or 1, depending on whether the fooling ratio is below random chance or above 0.8 respectively. With this in mind, we can define four axioms that come from the domain of the adversarial attack and the aforementioned experiments with simplified membership functions. The proof is a simple proof by contradiction with case analysis on the first disjunction.
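The normalization and the crisp membership assignment above amount to the following arithmetic; the helper names and the ImageNet random-chance value of 1/1000 are our own assumptions.

```python
def normalized_accuracy(acc_under_attack, clean_acc=0.75004):
    """Normalize top-1 accuracy under attack by the accuracy under no attack."""
    return acc_under_attack / clean_acc

def crisp_membership(fooling_ratio, random_chance=1 / 1000, high=0.8):
    """Collapse the fuzzy set membership to {0, 1}: 0 below random chance,
    1 above 0.8. Intermediate values are left undefined, as in the text."""
    if fooling_ratio < random_chance:
        return 0
    if fooling_ratio > high:
        return 1
    raise ValueError("membership undefined between random chance and 0.8")
```

For the reference experiment, `normalized_accuracy(0.672041)` reproduces the 0.8960 figure quoted above.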
Proof
[The displayed derivation was not recovered. Its steps were justified, in order, as: the four axioms (definition of the domain of the attack, and the Baseline, WhiteBox, and GrayBox experiments); assumption of the negation of the claim and its Skolemization; universal instantiation of axioms 1, 3, and 4; an assumption on the first case of the disjunction, leading to a contradiction and its discharge (Q.E.A.); disjunctive syllogism forcing the remaining case, again leading to a contradiction and the discharge of the outer assumption (Q.E.A.); and finally distribution, material implication, and the definition of the subset relation, which conclude the claim.]
Going back to the simplified membership function, it follows that different reference experiments (network, attack, and ε) will naturally yield different results. This is especially true if we take the raw accuracy as the membership function instead of the simplified one. However, one can argue that the subset relationship is, in general terms, valid, since most experiments under the simplified membership function yield the same axioms. This is why we use an approximate subset relationship to denote this result.
A similar analysis can be done graphically, as shown in Figure 9 (middle). The first case is covered by the GrayBox experiments, which show that the vulnerable space can only be reached 10.4% of the time; as we now know, this actually corresponds to the other case due to the inclusion. The remaining cases are covered by the WhiteBox experiments. Here, using BIM on ResNet-50 with ε = 0.05 as reference yields a membership of 0.333; that leaves the final case with perturbations in the remaining 0.667.
Likewise, for cases where the domain is the composed model, all perturbations fall into a space which we know lies within the vulnerable one; hence, they all fall into the corresponding case, eliminating samples falling into the remaining ones.
Appendix 0.B Perturbed Images and their Reconstructions
The following figures (10–16) show attack attempts on ResNet-50 with correctly classified images from our random shuffle of the ImageNet validation set. Column 1 shows the clean image and its perturbed variants. Columns 2–7 show combinations of attack method (FGSM, BIM, CW) and gradient source. Below are the perturbations (row 2), reconstructions (row 3), and the perturbation remaining in the reconstructions (row 4). Perturbation images are subject to histogram equalization for increased visibility. These examples further illustrate the structural nature of attacks through S2SNets.
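Histogram equalization for making small perturbations visible can be sketched as follows; this is a minimal grayscale NumPy sketch, and the paper's exact visualization pipeline may differ.

```python
import numpy as np

def equalize(perturbation, bins=256):
    """Histogram-equalize a perturbation image so that tiny adversarial
    changes become visible. Output values lie in [0, 1]."""
    flat = perturbation.ravel()
    hist, edges = np.histogram(flat, bins=bins)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize CDF to [0, 1]
    # Map each pixel through the CDF of the bin it falls into.
    idx = np.clip(np.digitize(flat, edges[1:-1]), 0, bins - 1)
    return cdf[idx].reshape(perturbation.shape)
```

Because the mapping is the empirical CDF, near-zero perturbation values that would otherwise render as uniform black are spread over the full intensity range.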
Appendix 0.C Raw Experiment Data
This section contains the raw values for all experiments we conducted. A subset was used to create the plots in the paper, but we also provide results we ended up not using in our argumentation. In the following tables, we report the normalized dissimilarity (Equation 1) between a reference image and another image. We also provide the usual infinity norm divided by ε (Equation 2). Settings for attacks besides ε are defined in Section 4.
[Equations 1 and 2, the displayed formulas for the normalized dissimilarity and the ε-scaled infinity norm, were not recovered.]
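The infinity-norm measure can be sketched as follows, assuming the standard definition; the exact form of Equation 2 may differ.

```python
import numpy as np

def linf_ratio(x, x_adv, eps):
    """Largest absolute per-pixel change between a reference image and
    another image, divided by the attack budget eps (our reading of the
    described measure). A value of 1.0 means the budget is fully used."""
    return np.abs(x_adv - x).max() / eps
```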
Tables 2, 3, and 4 contain WhiteBox results for the respective attack methods; they cover the experiments described in Section 4.1. Tables 5, 6, and 7 contain results for attacks where the network that supplied the gradients (source network) differs from the network that is attacked (target network); they cover the WhiteBox and GrayBox experiments described in Sections 4.1 and 4.2. Finally, Table 8 contains results on BIM attacks mitigated by different types and strengths of noise.
[Tables 2–4 list, for each network, the top-1 accuracy (%) on clean inputs and under adversarial input; Tables 5–7 list the same per source/target network pair. The numeric entries were not recovered.]