Adversarial Defense based on Structure-to-Signal Autoencoders

03/21/2018 ∙ by Joachim Folz, et al.

Adversarial attack methods have demonstrated the fragility of deep neural networks. Their imperceptible perturbations are frequently able to fool classifiers into potentially dangerous misclassifications. We propose a novel way to interpret adversarial perturbations in terms of the effective input signal that classifiers actually use. Based on this, we apply specially trained autoencoders, referred to as S2SNets, as a defense mechanism. They follow a two-stage training scheme: first unsupervised training, followed by fine-tuning of the decoder using gradients from an existing classifier. S2SNets induce a shift in the distribution of gradients propagated through them, stripping them of class-dependent signal. We analyze their robustness in several white-box and gray-box scenarios on the large ImageNet dataset. In white-box attack scenarios, our approach reaches resilience comparable to that of other state-of-the-art defenses in gray-box scenarios. We further analyze the relationships of AlexNet, VGG 16, ResNet 50 and Inception v3 in adversarial space, and find that VGG 16 is the easiest to fool, while perturbations from ResNet 50 are the most transferable.




1 Introduction

We nowadays see an increasing adoption of deep learning techniques in production systems, partially even safety-relevant ones (as exemplified in [1]). Hence, it is not surprising that the discovery of adversarial spaces for neural networks [2] has sparked a lot of interest. A growing community has quickly focused on different ways to reach those spaces [3, 4, 5], understand their properties [1, 6, 7], and protect vulnerable models from their malicious nature [8, 9, 10].

Most prevalent methods exploiting adversarial spaces [3, 4, 5, 10] use gradients as their main starting point to search for perturbations surrounding a clean sample. Due to the intractability of transformations modeled by neural networks and the limited amount of change that is allowed for perturbations to be considered adversarial, gradients from a classifier expose just enough information about which parts of the input correlate highly with the label that a model associates to it. With this in mind, defenses against adversarial attacks have been devised upon those same gradients in two fundamental ways: 1) They complement traditional optimization schemes with a two-fold objective that minimizes the overall prediction cost while maximizing the perturbation space around clean images that classifiers can withstand [10, 11, 12]. 2) Gradients are blocked or obfuscated in such a way that attacking algorithms can no longer use them to find effective adversarial perturbations [13, 14, 15]. Type 1 methods enjoy mathematical rigor and hence, provide formal guarantees with respect to the kind of perturbations they are robust to. However, note that this can also be disadvantageous since networks become attack-dependent. Any other strategy for finding perturbations could circumvent such a defense mechanism [16]. While effective for small-scale problems such as MNIST and CIFAR, we found no empirical evidence that these methods scale to larger problems such as ImageNet. It has even been shown that defenses for adversarial attacks tested on small datasets do not scale well when applied to bigger problems [6].

Currently, large-scale state-of-the-art defenses rely on the second use of gradients: suppression [17] and blockage [14, 18]. As defined by Athalye et al. [16], gradient obfuscation is the result of instabilities from vanishing or exploding gradients, and the use of stochastic or non-differentiable preprocessing steps. All these alternatives can be modeled as lossy identities where the original signal contained in the input is preserved, while the adversarial perturbation is destroyed. This principle fits well with findings from a recent study showing that image classifiers only use a small fraction of the entire signal within the original input. Therefore, a portion of its information can indeed be dropped without affecting performance [19].

In this paper, we propose an alternative defense that affects the information contained in gradients by reforming its class-related signal into a structural one. Intuitively, we learn an identity function that encodes structure and decodes only the structural parts of the input necessary for classification, dropping everything else. To this end, an autoencoder (AE) is trained to approximate the identity function that preserves only the part of the signal that is useful for a target classifier. The structural information is preserved by training both encoder and decoder unsupervised, and fine-tuning only the decoder with gradients coming from an existing classifier. By using a function that only looks at structure, gradients are devoid of any class-related information, therefore invalidating the fundamental assumptions about gradients that attackers rely on. We call this defense a Structure-To-Signal Network (S2SNet).

1.0.1 Formal Definitions

Let be an image classifier, and the portion of the signal in the original input that is effectively being used by . In other words, subject to , where is a measure of information e.g., the normalized mutual information [20]. Let be the space of all adversarial and non-adversarial perturbations under a given norm . We also define the sub-space as the set of adversarial perturbations that are reachable from the input gradients , where represents the cost function, and is the ground truth used for training. Note that said spaces depend on and , but not on the class or the cost function . Hence, unless noted otherwise, we use the simplified notation from now on. Adversarial perturbations are elements such that . (Strictly speaking, adversarial perturbations can be reached through other domains that do not depend on gradients, as shown by Nguyen et al. [21], but so far all strong adversarial attack methods base their entire strategy on the information in gradients.) As attacks use gradients to compute adversarial perturbations, and such gradients can only correspond to parts of the signal that are used by the model, it follows that . An S2SNet can hence be defined as a function such that . This way, defends by serving as a proxy for incoming, potentially malicious inputs, as well as exposing gradients via the function composition . The relation between and concerning information implies that produces lossy reconstructions of the input space. We train in a way that gradients lie in a different space than those leading attackers to (Figure 1). In other words, we train such that is minimized. This in turn, will cause the intersection to be smaller, resulting in perturbations that are non-adversarial (i.e., ). Further details of the architecture and training of and are discussed in Section 3.1.

Figure 1: Overview of gradient and perturbation spaces, and their relation to S2SNets and adversarial attacks.

The architecture of this defense offers important advantages when dealing with adversarial attacks:

  • Zero compromise for safe use-cases: there is no drop in performance when the defense is deployed but the network is not under attack. As the transformation done by S2SNet preserves all required signal , using either or has no impact on the classifier when clean images are used.

  • Removing S2SNet is a defense strategy: a novel gray-box defense (i.e., when the attacker knows the classifier but not the defense mechanism) works by giving away gradient information from an S2SNet to the attacker and then removing it, applying adversarial attacks (based on gradients from S2SNet) to the original classifier instead.

  • Attack agnostic: S2SNets rely on the same information used by adversarial attacks but not on the attacks themselves. Therefore, S2SNets do not require any assumptions with respect to the specific way an attack works.

  • Post-hoc implementation: our defense uses gradients from a trained network, and can be used to defend models that are already in production. Likewise, no special considerations need to be made when training a new classifier from scratch.

  • Compatibility with other defenses: due to the compositional nature of this approach, any additional defense strategies that work with the original classifier can be implemented for the ensemble .

We test S2SNets on two high-performing image classifiers (ResNet50 [22] and Inception-v3 [23]) against three attack methods (Fast Gradient-Sign Method (FGSM) [3], Basic Iterative Method (BIM) [24] and Carlini-Wagner (CW) [5]) on the large scale ImageNet [25] dataset. Experiments are conducted on both classifiers under white-box and gray-box conditions. An evaluation of the effectiveness of S2SNets with respect to regular AEs (e.g., as proposed in [26]) is presented in Section 3.1, showing empirically that S2SNets are a better approximation of the signal that is used by a classifier. Furthermore, an evaluation of the gradient space and is conducted, showing that their intersection does indeed approach the empty set.

The main contributions of this paper are threefold: First, we propose a novel way to interpret adversarial perturbations, namely in terms of the effective input signal that classifiers use . Second, we introduce a robust and flexible defense against large-scale adversarial attacks based on S2SNets. And third, we provide a comprehensive baseline evaluation of adversarial attacks for several state-of-the-art models on a large dataset.

2 Related Work

The fast-growing interest in adversarial attacks has gained momentum since their discovery [2] and has had three main areas of focus. The first area seeks new and more effective ways of reaching adversarial spaces. Goodfellow et al. [3] presented a first comprehensive analysis of the extent of adversarial spaces and proposed a fast method to compute perturbations based on the sign of gradients. An iterative version of this method was later introduced [24] and shown to work significantly better, even when applied to images that were physically printed and digitized again. A prominent exception to attacks based on gradients succeeded using evolutionary algorithms [21]. Nevertheless, this has not become a practical, widespread method, mostly due to how costly it is to compute. Papernot et al. [1] showed how effective adversarial attacks could get, even with very few assumptions about the attacked model. Finally, there are methods that go beyond a greedy iteration over the gradient space and perform different kinds of optimization that maximize misclassification while minimizing the norm of the perturbation [4, 5].

The second area focuses on understanding the properties of adversarial perturbations. The work of Goodfellow et al. [3] already pointed to the linear nature of neural networks as the main enabler of adversarial attacks. This contradicted what was initially theorized, namely that non-linearities were the main vulnerability. Not only was it possible to perturb natural looking images to look like something completely different, but it was also possible to get models issuing predictions with high confidence using either noise images, or highly artificial patterns [2, 21]. The transferability of adversarial perturbations was shown to be possible by crafting attacks on one network and using them to fool a second classifier [27]. However, transferable attacks are limited to the simpler methods, as iterative ones tend to exploit particularities of each model and hence lose power when used on different architectures [6]. As it turns out, not only is adversarial noise transferable between models, but it is also possible to transfer a single universal adversarial perturbation to all samples in a dataset to achieve high misclassification rates [7]. Said individual perturbations can even be applied to physical objects and bias a model towards a specific class [28].

The third and arguably the most popular area of research has focused on how networks can be protected against such attacks. Strategies include changing the optimization objective to account for possible adversarial spaces [10, 11], detection [9], dataset augmentation that includes adversarial examples [3, 12], suppressing perturbations [8, 15, 17, 26, 29] or obfuscating the gradients to prevent attackers from estimating an effective perturbation [13, 14, 18, 30, 31].

In this work, we build on the idea of using AEs as a compressed representation of the input [15, 26], but tailored towards a specific characteristic of adversarial perturbations [17], using the notion of useful input signal (i.e., the one effectively used by a classifier) [19]. Furthermore, we explore the nature of adversarial perturbations and its relationship with network capacity, in terms of used signal.

3 Methods

This section explains in detail the architecture of an S2SNet and its particular signal-preserving training scheme, followed by an empirical evaluation of the gradients it provides. We start by testing the robustness of S2SNets in a white-box setting, and compare it to a simple baseline using regular AEs. To recreate realistic attack conditions, we further test the ensemble network, simulating a re-parametrization technique similar to [16], aimed at circumventing the defense, and explore further strategies to cope with this attack. Next, we examine the performance of S2SNets in a gray-box scenario. Finally, we provide an evaluation of the transferability of single step attacks for different models, and correlate their overlap in terms of input signal from the perspective of adversarial perturbations.

3.1 Structure-to-Signal Networks

S2SNets start out as plain AEs that are trained on the large-scale YFCC100m dataset [32]. Only a single pass is required, as the million images are more than sufficient to train the underlying SegNet architecture [33] to convergence. (We choose a large architecture as an upper bound to the ideal identity function , because it was proven capable of encoding the semantics of the deep image classifiers tested in this work.) This network, referred to as , is able to reproduce input signals required by a diverse set of classifiers such that their top-1 accuracy is within percentage points of the original classification performance [19].

To model the effective input signal used by a trained classifier , Palacio et al. [19] propose to further fine-tune the decoder of using gradients from itself. This allows to learn a way to decode the input that retains the signal required by . This fine-tuned variant, called , reconstructs images such that the original top-1 accuracy of is preserved, while the amount of information in the reconstructed image (measured as normalized mutual information) decreases with respect to the original sample.
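The two-stage scheme described above can be illustrated with a deliberately tiny stand-in. The following is only a sketch under strong assumptions: a linear autoencoder and a fixed logistic-regression "classifier" replace SegNet and the ImageNet models, and all names (`E`, `D`, `w_clf`, the learning rates and iteration counts) are hypothetical choices made for this illustration, not values from the paper. It shows the one property that matters here: stage 2 updates only the decoder, using gradients that come from the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 256 samples with 8 features; the label depends on feature 0 only.
X = rng.normal(size=(256, 8))
y = (X[:, 0] > 0).astype(float)

# Hypothetical stand-in for the fixed, pre-trained classifier (logistic regression).
w_clf = np.zeros(8)
w_clf[0] = 4.0

def classifier_prob(inputs):
    return 1.0 / (1.0 + np.exp(-(inputs @ w_clf)))

# Linear autoencoder: encoder E (8 -> 4) and decoder D (4 -> 8).
E = rng.normal(scale=0.1, size=(8, 4))
D = rng.normal(scale=0.1, size=(4, 8))

# Stage 1: unsupervised training of encoder AND decoder on reconstruction error.
for _ in range(500):
    Z = X @ E
    R = Z @ D
    err = (R - X) / len(X)           # gradient of the squared error w.r.t. R
    grad_D = Z.T @ err
    grad_E = X.T @ (err @ D.T)
    D -= 0.1 * grad_D
    E -= 0.1 * grad_E

E_after_stage1 = E.copy()
D_after_stage1 = D.copy()

# Stage 2: the encoder stays frozen; only the decoder is fine-tuned with
# gradients of the classifier's cross-entropy loss on the reconstruction.
for _ in range(200):
    Z = X @ E
    R = Z @ D
    p = classifier_prob(R)
    grad_R = np.outer(p - y, w_clf) / len(X)   # d(cross-entropy) / dR
    D -= 0.1 * (Z.T @ grad_R)
```

Because the encoder never receives a class-dependent gradient, its intermediate representation remains a product of the unsupervised stage alone, which is the property the following paragraphs rely on.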

Note that, since the encoder of was trained unsupervised, any intermediate representations produced by are entirely class-agnostic. This means that, during backpropagation through an S2SNet, gradients that can be read at the shallowest layer correspond only to information about structure. Intuitively, gradients from point to parts of the image that can be changed to influence the reconstruction error.

In the following, we measure the extent to which gradients shift when images are forwarded through these networks. Furthermore we explore and quantify their emerging resilience to adversarial attacks in Section 4.

3.2 Properties of Gradient Distributions

To verify that the Structure-to-Signal training scheme produces large shifts in the distribution of gradients, we forward images through a classifier, a pre-trained SegNet, and through the fine-tuned counterpart to compare their gradients. We use the magnitude of the gradients instead of raw values to stress the differences of their spatial distribution. A large change in the position where gradients originally occur within the image is a good indicator that the information conveyed by gradients has changed.

Table 1: Pairwise mean SSIM of input gradient magnitudes for ResNet 50 () on the ImageNet validation set, with and without being passed through or . SSIM values of w.r.t. any AE variant show the least similarity.

Concretely, we quantify the structural similarity [34] (SSIM; a locally normalized mean square error measured in a sliding window) of gradients obtained for the same image when passed through a ResNet50 (), an S2SNet fine-tuned to defend the same ResNet50 (), a plain SegNet AE coupled with ResNet50 (), the fine-tuned S2SNet without the classifier (), and the plain SegNet AE alone (). While the first three models require gradients to be computed with respect to a class label, the last two are produced by measuring the reconstruction error. Also, note that for the true reconstruction cannot be directly obtained, as it is indirectly defined by the classifier it was fine-tuned on. In this case, reconstruction gradients are computed by comparing its output to the original input.
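As a rough illustration of this comparison, the sketch below computes a simplified SSIM between two gradient-magnitude maps. It is an assumption-laden simplification: the statistic is evaluated once over the whole map rather than in a sliding window as in [34], the peak normalization in `normalized_magnitude` is a choice made for this example, and the random arrays merely stand in for real input gradients.

```python
import numpy as np

def normalized_magnitude(grad):
    """Absolute gradient values scaled to [0, 1] (normalization is an assumption)."""
    mag = np.abs(grad)
    peak = mag.max()
    return mag / peak if peak > 0 else mag

def global_ssim(a, b, data_range=1.0):
    """SSIM evaluated over a single window covering the whole map."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

# Placeholder "gradients"; in practice these would come from backpropagation.
rng = np.random.default_rng(0)
grad_classifier = normalized_magnitude(rng.normal(size=(32, 32)))
grad_defended = normalized_magnitude(rng.normal(size=(32, 32)))

same = global_ssim(grad_classifier, grad_classifier)
different = global_ssim(grad_classifier, grad_defended)
```

Identical maps yield a value of 1, and unrelated maps yield a lower value; a windowed implementation would additionally capture where in the image the magnitudes differ.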

Table 1 reports the mean SSIM of the gradients over ImageNet’s validation set for all combinations of network pairs. Note that the dissimilarity between and indicates that the Structure-to-Signal training scheme has indeed changed the reconstruction process i.e., the identity function being computed based on the input. Similarly, comparing the SSIM values of and reveals a difference in the information contained by their gradients. Most importantly, the similarity between the gradients coming from classifier and either of the ensembles and is considerably smaller ( and ) than any combination involving AEs exclusively (). This disparity indicates how much the position of gradients change (and hence the information contained within), when passed through an S2SNet.

Further evidence for the class-agnostic nature of gradients propagated through AEs can be found when comparing the SSIM of gradient magnitudes when different target labels are used to compute gradients. Let be some input image and its true label. We then randomly select a different label and compute

for all in ImageNet’s validation set, and . We measured mean SSIM values of for , for , and for . This means that the influence of the label is smaller when gradients are propagated through the AE, but also emphasizes how dissimilar gradients of just ResNet 50 are, compared to any AE at just to SSIM (Table 1).

Figure 2: Gradient magnitudes for ResNet 50 () given a single input , propagated through or . SSIM values are in comparison to

Figure 2 visualizes this phenomenon. Here, gradient magnitudes observed for just AEs (, ) predominantly highlight edges as a source of error. This is expected, since it is more difficult to accurately reproduce the high frequencies required by sharp edges, compared to the lower frequencies of blobs. Extracting magnitudes based on classification from AEs shows similar structures (, ). Some coincidental overlap between ResNet 50 () and AE variants is unavoidable, since edges are also important for classification [35]. However, the classifier on its own differs considerably from all other patterns. Overall, SSIM is at least twice as high between AE variants as between the classifier and any of the AE configurations.

4 Experiments

Figure 3: Overview of analyzed attack/defense scenarios.

This section presents experiments quantifying the robustness of S2SNets when used as a defense against adversarial attacks (Figure 3). The experimental setup closely follows the conditions from Guo et al. [14] in order to facilitate comparability:

  • Dataset: We use ImageNet to test S2SNets in a challenging, large-scale scenario. Classifiers are trained on its training set and evaluations are carried out on the full validation set (50000 images).

  • Image classifier under attack: we use ResNet 50 () and Inception v3 (), both pre-trained on ImageNet as target models. The classifiers have been trained under clean conditions i.e., no special considerations with respect to adversarial attacks were made during training.

  • Defense: we train an S2SNet for each of the classifiers under attack, following the scheme described in Section 3.1. These defenses are denoted as and for and respectively.

  • Perturbation Magnitude: we use the normalized norm () between a clean sample and its adversary , as defined by Guo et al. [14]. Epsilon values for each attack are listed below.

  • Defense Strength Metric: vulnerability to adversarial attacks is measured in terms of the number of newly misclassified samples. More precisely, for any given attack to classifier , we calculate , where is the set of true positives and is the adversarial example generated by the attack, based on .

  • Attack Methods: protected models are tested against a single step method, an iterative variant, and an optimization-based alternative. Note that to replicate realistic threat conditions, all resulting adversarial samples are cast to the discrete RGB range [0, 255].

    • Fast Gradient Sign Method (FGSM) [3]: a simple, yet effective attack that works also when transferred to different models. Epsilon values used for this method are .

    • Basic Iterative Method (BIM) [24]: an iterative version of FGSM that shows more attack effectiveness but less transferable properties. Epsilon values used for BIM are . The number of iterations is fixed at .

    • Carlini-Wagner (CW) [5]: an optimization-based method that has proven to be effective even against hardened models. Note that this attack issues perturbations that lie in a continuous domain. Epsilon values used for CW are ; the number of iterations is fixed at 100, and .

  • Attack Conditions: we test the proposed defenses under two conditions.

    • White-box setting: the attacker has knowledge about the classifier and the defense. This includes reading access to both the predictions of the classifier, intermediate activations and backpropagated gradients. The attacker is forced to forward valid images through the defended network i.e., through the composed classifier or .

    • Gray-box setting: the attacker has access to a classification model but is unaware of its defense strategy.
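The perturbation magnitude and the defense strength metric from the setup above can be sketched as follows. Two assumptions apply: the normalized norm is taken to be the per-sample L2 distance divided by the L2 norm of the clean sample, averaged over samples, which is our reading of Guo et al. [14]; and since the exact expression of the strength metric is not legible in the text, the function below simply reports the accuracy on adversarial examples restricted to the true positives. Function and variable names are hypothetical.

```python
import numpy as np

def normalized_l2(clean, adversarial):
    """Mean over samples of ||x' - x||_2 / ||x||_2 (assumed definition)."""
    clean = np.asarray(clean, dtype=float).reshape(len(clean), -1)
    adversarial = np.asarray(adversarial, dtype=float).reshape(len(adversarial), -1)
    distance = np.linalg.norm(adversarial - clean, axis=1)
    return float(np.mean(distance / np.linalg.norm(clean, axis=1)))

def accuracy_on_true_positives(labels, clean_pred, adv_pred):
    """Fraction of originally correct samples that stay correct under attack."""
    labels, clean_pred, adv_pred = map(np.asarray, (labels, clean_pred, adv_pred))
    true_positives = clean_pred == labels
    if not true_positives.any():
        raise ValueError("no correctly classified clean samples")
    return float(np.mean(adv_pred[true_positives] == labels[true_positives]))

# Small hand-made example (values are illustrative only).
clean = np.array([[3.0, 4.0], [6.0, 8.0]])
adversarial = clean + np.array([[0.3, 0.4], [0.0, 0.0]])
magnitude = normalized_l2(clean, adversarial)

kept = accuracy_on_true_positives(
    labels=[0, 1, 2, 3], clean_pred=[0, 1, 2, 0], adv_pred=[0, 9, 2, 3]
)
```

Restricting the metric to true positives ensures that samples the classifier already got wrong on clean input are not counted as successes of the attack.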

4.1 White-Box

For this setting, input images flow first through the S2SNet before reaching the original classifier. Similarly, gradients are read from the shallowest layer of the S2SNet. For comparison, attacks are also run on the unprotected versions of ResNet 50 and Inception v3. Results are summarized in Figure 4.
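To make the attack loop concrete, here is a minimal sketch of FGSM and BIM followed by the cast to the discrete RGB range. It rests on several assumptions: a logistic model on flattened pixels (`w`, `b`) stands in for the network whose input gradients are read, the budget `eps` is an L-infinity bound in pixel units rather than the normalized L2 budget used in the experiments, and the step size and iteration count are arbitrary. In the white-box setting described above, `input_gradient` would instead be backpropagated through the S2SNet and the classifier together.

```python
import numpy as np

def input_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. the input of a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def quantize(x):
    """Cast a perturbed image to the discrete RGB range [0, 255]."""
    return np.clip(np.rint(x), 0, 255).astype(np.uint8)

def fgsm(x, y, w, b, eps):
    """Single step along the sign of the input gradient."""
    return quantize(x + eps * np.sign(input_gradient(x, y, w, b)))

def bim(x, y, w, b, eps, step, iters):
    """Iterated sign steps, projected back into the eps-ball after each step."""
    adv = x.astype(float).copy()
    for _ in range(iters):
        adv = adv + step * np.sign(input_gradient(adv, y, w, b))
        adv = np.clip(adv, x - eps, x + eps)
        adv = np.clip(adv, 0.0, 255.0)
    return quantize(adv)

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=16).astype(float)   # toy "image" of 16 pixels
w = rng.normal(scale=0.01, size=16)               # hypothetical model weights
adv_fgsm = fgsm(x, 1.0, w, 0.0, eps=4.0)
adv_bim = bim(x, 1.0, w, 0.0, eps=4.0, step=1.0, iters=10)
```

Quantizing at the end matters: a perturbation that only exists at sub-integer precision disappears once the image is stored as 8-bit RGB.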

Figure 4: White-box attacks on ResNet 50 () and Inception v3 (), with (solid) and without (dashed) S2SNet as a defense.

Under these conditions, plain models are only able to moderately resist FGSM attacks, and fail completely for the more capable BIM and CW attacks. In contrast, S2SNets provide high levels of protection. CW, being an optimization attack and generally the most capable in our comparison, is expected to be more effective at breaking through the S2SNet defense. However, it does so by introducing large perturbations even for small values of . In fact, none of the attacking configurations was able to entirely fool any of our defended classifiers for even the highest levels of in our tests. Note that these white-box results are, to our surprise, already comparable with some state-of-the-art gray-box defenses [14, 18] (i.e., known classifier but unknown defense). Despite the less favorable conditions for a white-box defense, S2SNets already match alternative state-of-the-art protections that were tested under the more permissive assumptions allowed in gray-box settings.

For completeness, we also evaluate the defense using a pre-trained SegNet() instead of an S2SNet for ResNet 50 (shown in Figure 4 with light solid green). This is most similar to the defense proposed by Meng et al. [26]. We can confirm that having just an AE does not suffice to guard the classifier against adversarial attacks and its performance is consistently lower than S2SNets. For a visual analysis of the attacks and their reconstructions by S2SNets, we refer the reader to the supplementary material.

4.1.1 Bypassing S2SNets through Reparametrization

In order to push the limits of S2SNets, we now simulate a more hostile scenario where an attacker tries to actively circumvent the defense mechanism. Based on the work of Athalye et al. [16], a reparametrization of the input space can be implemented for defenses that operate under the composition of functions . This works by defining the original input as a function of a hidden state such that . Note that the main motivation behind reparametrization is to alleviate instabilities of gradients caused by defenses that rely on said instability. Although S2SNets are very deep architectures, the resilience of this defense lies in the directed change that has been induced within the information contained in gradients. In addition, there is no trivial way to come up with a suitable that does not end up inducing the same transformation in the gradients, as done by S2SNets.

Despite all these concerns, we assume for this experiment that such can be found, and that attacking does indeed circumvent S2SNets altogether. We simulate the potential strength of this attack by using gradients of the original (unprotected) classifier, and applying them to the input directly. The resulting adversarial attack is then passed through the hardened classifier. We refer to this hypothetical scenario as White-Box. Results are shown in Figure 5.

Figure 5: White-Box attacks on ResNet 50 () and Inception v3 () classifiers.

Under these conditions, we observe that hardened models revert to the behavior shown by their corresponding unprotected versions for FGSM and BIM. In general, this condition is expected, and confirms once more that S2SNets are being trained to preserve the information that is useful to the classifier. Gradients collected directly from a vulnerable model have, by definition, only information that is useful for classification and hence, perturbations based on those gradients will be preserved by S2SNets. Interestingly, while CW is most successful in the previous white-box experiment, its highly optimized perturbations are less effective here. We believe that CW is “overfitting” more strongly on the adversarial signal of the original model than what S2SNets find useful to preserve.

Figure 6: White-Box attacks on with BIM, mitigated with additive sign, random and Gaussian noise of varying strength .

At this point, we can further exploit the benefits that S2SNets enjoy as a defense, by adding another layer of protection and showing how it affects the effectiveness of an attack. As shown by Palacio et al. [19], AEs following a structure-to-signal training scheme exhibit strong resilience to random noise, as opposed to traditionally trained AEs. We demonstrate that it is straightforward to add a layer of protection based on random noise. Said stochastic strategy is added after the adversarial image has been computed but before it passes through an S2SNet defense. We experiment with three sources of noise:

  • Gaussian Noise: , where .

  • Uniform Noise: , where .

  • Sign Noise: , where
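The three noise sources can be sketched as below. The exact formulas are not legible in the list above, so the parameterization is an assumption: Gaussian noise with standard deviation `strength`, uniform noise in `[-strength, strength]`, and sign noise that adds either `+strength` or `-strength` to every pixel. The function name `add_noise` and the clipping to `[0, 255]` are likewise choices made for this illustration.

```python
import numpy as np

def add_noise(image, kind, strength, rng):
    """Add one of three assumed noise types to an image with values in [0, 255]."""
    if kind == "gaussian":
        noise = rng.normal(0.0, strength, size=image.shape)
    elif kind == "uniform":
        noise = rng.uniform(-strength, strength, size=image.shape)
    elif kind == "sign":
        noise = strength * rng.choice([-1.0, 1.0], size=image.shape)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return np.clip(image + noise, 0.0, 255.0)

rng = np.random.default_rng(0)
image = np.full((4, 4), 128.0)   # placeholder for an adversarial image
noisy_gaussian = add_noise(image, "gaussian", 8.0, rng)
noisy_uniform = add_noise(image, "uniform", 8.0, rng)
noisy_sign = add_noise(image, "sign", 8.0, rng)
```

In the pipeline described above, such a function would be applied to the adversarial image before it is forwarded through the S2SNet and the classifier.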

Figure 6 shows the results of a BIM attack for ResNet 50, with noise levels under White-Box conditions. Overall, as noise strength increases, the resilience to adversarial attacks improves. The initial degradation under zero adversarial attacks is dependent on the amount of noise which can be tuned as the trade-off between maximum accuracy and adversarial robustness. We observe that both Gaussian and sign noise perform almost identically, while uniform noise offers the least effective protection.

4.2 Gray-Box

In contrast to white-box scenarios, gray-box attacks assume that there is limited access to information about the attack target. More specifically, the conditions for gray-box define that the attacker has knowledge about the network but not about the defense strategy.

Given the compositional nature of S2SNets, a novel way to defend a classification network under gray-box conditions consists in giving access to the gradients of S2SNets. Once an attack is ready, the S2SNet is removed and the perturbed image is processed by the classifier only. In other words, for a classifier and corresponding defense , the attack is crafted based on the combined network but forwarded to just the classifier . We refer to this type of defense as Gray-Box.
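The scenarios considered so far differ only in which network supplies the information used to craft the attack and which network finally receives the adversarial input. The sketch below encodes that bookkeeping; the scalar functions `classifier` and `defense`, the scenario labels, and the `craft` callback are hypothetical placeholders, not the models or attacks used in the experiments.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[float], float]

def classifier(x: float) -> float:
    """Placeholder for the original classifier."""
    return 2.0 * x

def defense(x: float) -> float:
    """Placeholder for the S2SNet placed in front of the classifier."""
    return 0.5 * x

def defended(x: float) -> float:
    """Composition: the input passes through the defense, then the classifier."""
    return classifier(defense(x))

# scenario -> (network the attack is crafted on, network that is evaluated)
SCENARIOS: Dict[str, Tuple[Model, Model]] = {
    "white-box": (defended, defended),
    "reparametrization (Section 4.1.1)": (classifier, defended),
    "gray-box with defense removed": (defended, classifier),
}

def run(scenario: str, x: float, craft: Callable[[Model, float], float]) -> float:
    source, target = SCENARIOS[scenario]
    adversarial = craft(source, x)   # the attack only ever sees `source`
    return target(adversarial)       # the outcome is measured on `target`

# Arbitrary stand-in for an attack, used only to show the data flow.
toy_craft = lambda model, x: x + model(x)
outputs = {name: run(name, 1.0, toy_craft) for name in SCENARIOS}
```

Seen this way, the gray-box defense of this section is the mirror image of the reparametrization scenario: the roles of the composed network and the bare classifier are swapped.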

Figure 7: Gray-Box attacks on ResNet 50 () and Inception v3 ().

We run the attacks with the same experimental settings described in Section 4.1, but enforce a defense policy following Gray-Box conditions. Results are presented in Figure 7. Overall, we see that this scenario is consistently robust to any adversarial attack. Fooling perturbations are clearly visible and cannot be considered adversarial anymore. With BIM-based attacks, classifiers gain back roughly half as much accuracy as in the white-box case. Again, CW is more comparable to FGSM than BIM, i.e., mostly ineffective, and consistent with observations made in the White-Box case.

By combining the results of Gray-Box and White-Box, we can reason about the relationship of and . First, the White-Box experiment tells us that attacks crafted for are valid attacks for . Following from the definition of these spaces, it holds that . Furthermore, looking at the Gray-Box experiments, taking elements from to create an attack, produces elements in . As these do not attack , we conclude that those same elements do not lie in . We finally conclude that with the tested attacks cannot be reachable from .

4.3 Classifier Relationships in Adversarial Space

Given the high robustness of S2SNets in the Gray-Box scenario, we do not expect further insights from traditional black-box attacks. Instead, we explore the relationships between the signal that different image classifiers use, as reported in [19]. Intuitively, if a signal used by a classifier also encompasses the signal that another classifier uses, then an adversarial attack crafted for should also fool . Note that the relation is directional, and it may not hold in the opposite way.

To test this, we construct adversarial samples using FGSM for four reference classifiers: AlexNet [36], VGG 16 [37], ResNet 50, and Inception v3. For each classifier we run black-box attacks with perturbations computed on the other three models. The resulting accuracies (in terms of their relative drop in accuracy) are shown in Figure 8.

Figure 8: Relative accuracy for black-box attacks on AlexNet (), VGG 16 (), ResNet 50 (), and Inception v3 ().

As expected, adversarial examples from more compatible architectures result in higher fooling ratios (lower accuracy). Overall, adversarial samples created for AlexNet are the least compatible among the four, mainly due to its comparably lower accuracy and limited amount of input signal used. On the other hand, ResNet 50 produces the most compatible perturbations, being the model generating attacks that were most effective when tried on other architectures.

Furthermore, we can also analyze how easy a network is to fool, based on the following criteria:

  • Fastest drop in accuracy per classifier.

  • Mean accuracy of adversarial samples coming from other classifiers that have the same .

With either criterion, results clearly show that VGG 16 is the easiest to fool, coming in first place, followed by ResNet 50, AlexNet, and finally Inception v3. This relationship in terms of useful signal aligns with results in [19]. Furthermore, similar experiments in [7] based on universal perturbations indicate that a selection of similar architectures (CaffeNet, VGG 19, ResNet 101 and GoogLeNet) maintain the same relationship.
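The second criterion, and the related notion of transferability, amount to row and column means of a source-by-target accuracy matrix. The sketch below shows that computation. The model names `A`, `B`, `C` and all numbers are made-up placeholders chosen for illustration; they are not results from this work.

```python
from statistics import mean

# accuracy[source][target]: accuracy of `target` on adversarial samples
# crafted on `source`. Placeholder values, NOT measurements from the paper.
accuracy = {
    "A": {"B": 0.6, "C": 0.8},
    "B": {"A": 0.5, "C": 0.7},
    "C": {"A": 0.9, "B": 0.4},
}

def easiest_to_fool(acc):
    """Targets sorted by mean accuracy under attacks from all other models."""
    targets = {t for row in acc.values() for t in row}
    score = {t: mean(row[t] for row in acc.values() if t in row) for t in targets}
    return sorted(score, key=score.get)

def most_transferable(acc):
    """Sources sorted by the mean accuracy their attacks leave on other models."""
    score = {s: mean(row.values()) for s, row in acc.items()}
    return sorted(score, key=score.get)

fool_ranking = easiest_to_fool(accuracy)
transfer_ranking = most_transferable(accuracy)
```

A low column mean marks a model that is easy to fool, while a low row mean marks a model whose perturbations transfer well; the two rankings need not coincide.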

5 Conclusions & Future Work

We have proposed S2SNets as a new method to defend neural networks against adversarial examples. We model the defense strategy as a transformation of the domain used by attackers, namely the gradients coming from the attacked classifier. Instead of focusing on gradient obfuscation via non-differentiable methods or other instabilities, we purposely induce a transformation on the gradients that strips them of any semantic information. S2SNets work by masking the classifier via function composition. That way, the information in the inputs is preserved for classification, but gradients point to structural changes for reconstructing the original sample. Such a defense is made possible by a novel two-stage Structure-to-Signal training scheme for deep AEs. In the first stage, the AE is trained unsupervised in the traditional way. In the second stage, only the decoder is fine-tuned with gradients from the model that is to be defended.
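The two-stage scheme can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: a linear autoencoder (`enc`, `dec`) replaces the deep AE, a frozen toy logistic model (`w_clf`, `b_clf`) replaces the defended classifier, the data is synthetic, and the second-stage loss is taken to be the classifier's cross-entropy on the reconstruction, with the classifier's own clean predictions as labels. Learning rates and step counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 256, 4, 2
x = rng.normal(size=(n, d))

# Frozen toy classifier to defend (hypothetical weights) and its clean predictions.
w_clf, b_clf = np.array([1.0, -1.0, 0.5, 0.0]), 0.0
y = (x @ w_clf + b_clf > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recon_loss(enc, dec):
    """Mean squared reconstruction error of the linear autoencoder."""
    return np.mean(np.sum((x @ enc @ dec - x) ** 2, axis=1))

def clf_loss(enc, dec):
    """Cross-entropy of the frozen classifier applied to reconstructions."""
    p = np.clip(sigmoid((x @ enc @ dec) @ w_clf + b_clf), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

enc = 0.1 * rng.normal(size=(d, k))
dec = 0.1 * rng.normal(size=(k, d))
recon_before = recon_loss(enc, dec)

# Stage 1: unsupervised training of encoder and decoder on reconstruction error.
for _ in range(300):
    h = x @ enc
    err = 2.0 * (h @ dec - x) / n  # d(recon_loss) / d(reconstruction)
    grad_dec = h.T @ err
    grad_enc = x.T @ (err @ dec.T)
    enc -= 0.05 * grad_enc
    dec -= 0.05 * grad_dec
recon_after = recon_loss(enc, dec)

# Stage 2: the encoder is frozen; only the decoder follows classifier gradients.
enc_frozen = enc.copy()
clf_before = clf_loss(enc, dec)
h = x @ enc  # constant during stage 2, since the encoder no longer changes
for _ in range(100):
    p = sigmoid((h @ dec) @ w_clf + b_clf)
    d_recon = (p - y)[:, None] * w_clf[None, :] / n  # d(clf_loss) / d(reconstruction)
    dec -= 0.1 * (h.T @ d_recon)
clf_after = clf_loss(enc, dec)

print(f"stage 1 reconstruction loss: {recon_before:.3f} -> {recon_after:.3f}")
print(f"stage 2 classifier loss:     {clf_before:.3f} -> {clf_after:.3f}")
```

At inference time the defended model is the composition of the classifier with the autoencoder, so an attacker differentiating through the whole pipeline obtains gradients that pass through the fine-tuned decoder and the unchanged encoder.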

We evaluate the proposed defense under white-box and gray-box settings using a large-scale dataset, against three different attack methods, on two high-performing deep image classifiers. A baseline comparison shows that the two-stage training scheme performs better than using regular AEs. Most interestingly, we show that resilience to adversarial noise under white-box conditions is comparable to that of state-of-the-art defenses under more favorable gray-box settings. Furthermore, we show how the properties of S2SNets can be exploited to add more defense mechanisms that maintain robustness even under the harshest, albeit currently hypothetical, conditions, where the protection of S2SNets is circumvented. A gray-box scenario was also tested where the defense consists of the removal of S2SNets, showing high levels of robustness for all attacks. Finally, a comparison between the resilience of four well-known deep CNNs is presented, providing further evidence that an order relation exists between these classifiers in terms of the amount of signal they use; this time, the evidence comes from the effectiveness of adversarial noise.

S2SNets are only one way in which the transformation can occur in gradient space. We would like to explore other ways in which such a transformation can occur, and whether the intersection with gradients that yield successful adversarial perturbations can be made effectively empty. The signal-preserving nature of S2SNets makes this defense a potential mechanism to explore and understand the nature of attacks. Comparing the classification consistency of a clean sample before and after it is passed through an S2SNet has potential implications for the detection of adversarial attacks by learning abnormal distribution fluctuations.


This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program. We thank all members of the Deep Learning Competence Center at the DFKI for their comments and support.


Appendix 0.A On the Relationship between Perturbation Spaces: Formal Proof

Experiments of Section 3.2 show that the domain of attacks is being shifted such that their range does not lie in the vulnerable space. We conduct a formal analysis of the relationship between the two perturbation spaces and show that one is a subset of the other. A few simplifications are required for the proof; they are accounted for at the end of this section. First, we use the results of BIM at a fixed attack strength, tested on ResNet 50, as a reference. Values for the baseline (no defense), White-Box, and Gray-Box are 0.0, 0.0, and 0.672041, respectively. Normalizing by 0.75004, the accuracy under no attack, yields relative values of 0.0, 0.0, and 0.8960. To simplify the handling of fuzzy sets, we set the membership function of each perturbation set to 0 if the fooling ratio is below random chance and to 1 if it is above 0.8. With this in mind, we can define four axioms that come from the domain of the adversarial attack and from the aforementioned experiments with simplified membership functions. The proof is a simple proof by contradiction with case analysis on the first disjunction.

  1. (Def. domain of )

  2. (Baseline Exp.)

  3. (White-Box)

  4. (Gray-Box)


  5. (Assumption, )

  6. (Skolemization )

  7. (U.I. , 1)

  8. (U.I. , 3)

  9. (U.I. , 4)


  10. (Assumption, 7)

  11. ( 10, 9)

  12. (, 11, 6)

  13. Contradiction! ↯


  14. (Q.E.A. 10)

  15. (D.Syllogism, 14, 7)

  16. ( 15, 8)

  17. (, 6, 16)

  18. Contradiction! ↯


  19. (Q.E.A., 5)

  20. (, 19)

  21. (Distr. , 20)

  22. (MI ., 21)

  23. (Def. , 22)

Going back to the simplified membership function, it follows that different reference experiments (network, attack, and attack strength) will naturally yield different results. This is especially true if we take the raw accuracy as the membership function instead of the simplified one. However, one can argue that the subset relationship is valid in general terms, since most experiments under the simplified membership function yield the same axioms. This is why we refer to this result as an approximate subset relationship.

[Diagram labels: Gradient Space, Perturbation Space, Adversarial Space, S2SNets, Classifier, Structural, Semantic, Vulnerable, Adversarial Attack]
Figure 9: Analysis of the relationship between gradient space and perturbation space. By doing case analysis (middle) and a proof by contradiction, we infer that the perturbation space approximates

A similar analysis can be done graphically, as shown in Figure 9 (middle). Case is covered by Gray-Box experiments, which show that can only be reached 10.4% of the time; as we now know, this actually corresponds to case due to the inclusion . This leaves cases which are covered by White-Box experiments. Here, using BIM on ResNet 50 and = 0.05 as reference yields that with a membership . That leaves case with perturbations in the remaining 0.667 in .

Likewise, for cases where the domain is , we see that they all fall into which we know lies in hence, they all fall into case , eliminating samples falling into the remaining ones .
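The 10.4% figure quoted above can be recovered from the reference values given at the start of this appendix. The short sketch below redoes that arithmetic; the only assumption is that the fooling ratio is the complement of the relative accuracy, which is how the numbers in the text appear to fit together.

```python
clean_accuracy = 0.75004  # ResNet 50 accuracy under no attack (from the text)
accuracy = {"baseline": 0.0, "white_box": 0.0, "gray_box": 0.672041}

# Normalize each accuracy by the clean accuracy.
relative = {name: acc / clean_accuracy for name, acc in accuracy.items()}

# Assumption: fooling ratio = 1 - relative accuracy.
fooling_ratio = {name: 1.0 - rel for name, rel in relative.items()}

for name in accuracy:
    print(f"{name}: relative accuracy {relative[name]:.4f}, "
          f"fooling ratio {fooling_ratio[name]:.3f}")
```

Under this reading, the Gray-Box relative accuracy of 0.8960 corresponds to a fooling ratio of about 0.104, while the baseline and White-Box values correspond to a fooling ratio of 1.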

Appendix 0.B Perturbed Images and their Reconstructions

The following figures (10-16) show attack attempts on ResNet 50 with correctly classified images from our random shuffle of the ImageNet validation set. Column 1 shows the clean image. Columns 2-7 show its perturbed variants for each combination of attack method (FGSM, BIM, CW) and gradient source. Below are perturbations (row 2), reconstructions (row 3), and remaining perturbation in reconstructions (row 4). Perturbation images are subject to histogram equalization for increased visibility. These examples further illustrate the structural nature of attacks through S2SNets.

Figure 10: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.
Figure 11: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.
Figure 12: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.
Figure 13: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.
Figure 14: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.
Figure 15: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.
Figure 16: Example of different attacks on ResNet 50 with and without S2SNet, along with perturbations and reconstructions.

Appendix 0.C Raw Experiment Data

This section contains raw values for all experiments we conducted. A subset was used to create the plots seen in the paper, but we also provide results that we ended up not using in our argumentation. The following tables report the normalized dissimilarity (Equation 1) between a reference image and another image. We also provide a normalized variant of the usual infinity norm (Equation 2). Attack settings other than the attack strength are defined in Section 4.


Tables 2, 3, and 4 contain White-Box results for the respective attack method. They cover experiments described in Section 4.1. Tables 5, 6, and 7 contain results for attacks where the network that supplied the gradients on which the attack is based (source network) is different from the network that is attacked (target network). They cover White-Box and Gray-Box experiments described in Sections 4.1 and 4.2. Finally, Table 8 contains results on BIM attacks mitigated by different types and strengths of noise.

Network top-1 acc. (%) …(adversarial)
Table 2: Accuracy and distance statistics for FGSM attacks.
Network top-1 acc. (%) …(adversarial)
Table 3: Accuracy and distance statistics for BIM attacks.
Network top-1 acc. (%) …(adversarial)
Table 4: Accuracy and distance statistics for CW attacks.
Source Network Target Network top-1 acc. (%) …(adversarial)