Threat Model-Agnostic Adversarial Defense using Diffusion Models

by   Tsachi Blau, et al.

Deep Neural Networks (DNNs) are highly sensitive to imperceptible malicious perturbations, known as adversarial attacks. Following the discovery of this vulnerability in real-world imaging and vision applications, the associated safety concerns have attracted vast research attention, and many defense techniques have been developed. Most of these defense methods rely on adversarial training (AT) – training the classification network on images perturbed according to a specific threat model, which defines the magnitude of the allowed modification. Although AT leads to promising results, training on a specific threat model fails to generalize to other types of perturbations. A different approach utilizes a preprocessing step to remove the adversarial perturbation from the attacked image. In this work, we follow the latter path and aim to develop a technique that leads to robust classifiers across various realizations of threat models. To this end, we harness recent advances in stochastic generative modeling, and the means to leverage these for sampling from conditional distributions. Our defense relies on an addition of i.i.d. Gaussian noise to the attacked image, followed by a pretrained diffusion process – an architecture that performs a stochastic iterative process over a denoising network, yielding a high perceptual quality denoised outcome. The obtained robustness with this stochastic preprocessing step is validated through extensive experiments on the CIFAR-10 dataset, showing that our method outperforms the leading defense methods under various threat models.




1 Introduction

Deep neural network (DNN) image-classifiers are highly sensitive to malicious perturbations in which the input image is slightly modified so as to change the classification prediction to a wrong class. Amazingly, such attacks can be effective even with imperceptible changes to the input images. These perturbations are known as adversarial attacks  Goodfellow et al. (2014); Kurakin et al. (2016); Szegedy et al. (2013). With the introduction of these DNN classifiers to real-world applications, such as autonomous driving, this vulnerability has attracted vast research attention, leading to the development of many attacks and robustification techniques.

Amongst the many types of adversarial attacks, the most common ones are norm-bounded to some radius $\epsilon$, where the norm and the radius define a threat model. The attack is posed as an optimization task in which one seeks the most effective deviation $\delta$ to the input image, in terms of modifying the classification output, while constraining this deviation to satisfy $\|\delta\|_p \le \epsilon$. One way to robustify a network against such attacks is by training it to correctly classify attacked examples from a specific threat model  Madry et al. (2017); Zhang et al. (2019); Gowal et al. (2020). These methods, known as Adversarial Training (AT), lead to state-of-the-art performance when trained and tested on the same threat model. However, a well-known limitation of such methods is their poor generalization to unseen attacks, which is discussed at length in  Hendrycks et al. (2021); Bai et al. (2021) as one of the unsolved problems of adversarial defense.

A different type of robustification technique proposes a preprocessing step before feeding the image into the classifier  Song et al. (2017); Samangouei et al. (2018); Yang et al. (2019); Grathwohl et al. (2019); Du and Mordatch (2019); Hill et al. (2020); Yoon et al. (2021). Since an adversarial example can be seen as a summation of an image $x$ and an adversarial perturbation $\delta$, using such a procedure to remove or even attenuate this second term is reasonable. The authors of  Song et al. (2017); Samangouei et al. (2018); Grathwohl et al. (2019); Du and Mordatch (2019); Hill et al. (2020); Yoon et al. (2021) use a generative model in the preprocessing phase in various ways. They either use the pretrained classifier directly or re-train a classifier on the generative model’s outputs. In general, these kinds of methods are very appealing since they are capable of robustifying any publicly available non-robust classifier and do not require a computationally expensive specialized adversarial training. Furthermore, such methods are oblivious to the threat model being used.

In this work we introduce a novel and highly effective preprocessing robustification method for image classifiers. We choose a preprocessing-based approach based on a generative model since we aim to remove or weaken the adversarial perturbation while effectively projecting the attacked image onto the learned image manifold, where the classifier’s accuracy is likely to be high. While a generative model is typically used to sample from $p(x)$, the probability of images in general, our approach initializes this process with $q(x_{t^*} | x)$ at an appropriate diffusion step $t^*$, where $x_{t^*}$ is the noisy attacked image. This process effectively denoises the attacked image while targeting perfect perceptual quality  Kawar et al. (2021b); Ohayon et al. (2021). More specifically, we use a diffusion model – an iterative process that uses a pretrained MMSE (Minimum Mean Squared Error) denoiser and Langevin dynamics. The latter involves an injection of Gaussian noise, which helps to robustify our sampler against attacks, even ones that are aware of our defense strategy. Our method relies on a preprocessing model and a classification one, where both are trained independently on clean images. Hence, our architecture is inherently threat model-agnostic, achieving robustness to unseen attacks. In our experiments we propose a way to evaluate threat model-agnostic robustness via two measurements. The first is the average accuracy over a wide range of attacks, and the second is the average over the attacks unseen during training. We consider four norm-bounded threat models, detailed in Section 4. In summary, our main contributions are:


  • A novel stochastic diffusion-based preprocessing robustification is proposed, aiming to be a threat model-agnostic adversarial defense.

  • The effectiveness of the proposed defense strategy is demonstrated in extensive experiments, showing state-of-the-art results.

Figure 1: Our method’s flow. In the “Adversarial Attack” block, an attacker calculates the “Additive perturbation” and adds it to the “Original image” in order to create the “Attacked image”. As a preparation for the diffusion process, in the “Add Noise” block, we add i.i.d. Gaussian noise to the attacked image according to Equation 3. We proceed by feeding it into the “Diffusion” block, consisting of diffusion steps that include a denoising operation and an addition of Gaussian noise. This effectively samples a new image from the diffusion model initialized by $x_{t^*}$, the noisy attacked image (see more in Section  2.2). Lastly, we feed the obtained preprocessed image to a classifier.

2 Background

2.1 Adversarial Robustness

Since the discovery of the phenomenon of adversarial examples in neural networks  Goodfellow et al. (2014); Kurakin et al. (2016); Szegedy et al. (2013), classifiers’ robustness has been extensively studied. Numerous works have focused on new methods for constructing adversarial examples and/or defending against them. In the following we review the fundamental adversarial attack and defense methods, as background to our work.

Let us start with how adversarial attacks are created. Given an image $x$ and a classifier $f$, an adversarial attack is a small norm-bounded perturbation $\delta$, added to the input image $x$, that leads to its misclassification. There exist several mainstream settings for crafting adversarial examples, differing in their assumptions regarding the defense method’s characteristics and the access to the model and its gradients. We describe below such key attack configurations.

White-Box Attacks are applied when the attacker has full access to the entire system architecture (including both the classifier and the defense mechanism), which is assumed to be differentiable. This is a rich and widely used group of attacks that contains some of the most common ones, such as the Fast Gradient Sign Method (FGSM)  Goodfellow et al. (2014), Projected Gradient Descent (PGD)  Madry et al. (2017) and CW  Carlini and Wagner (2017). While there exist numerous white-box attack strategies, PGD is the cornerstone of their most modern embodiments. It is an iterative gradient-based algorithm that increases the classifier’s loss in each step by perturbing the input data. We describe PGD in Algorithm  1 below.

Input: classifier $f$, input $x$, target label $y$, norm radius $\epsilon$, step size $\alpha$, number of steps $T$

1:procedure PGD($f, x, y, \epsilon, \alpha, T$)
2:     $x_{adv} \leftarrow x$
3:     for $t$ in $1, \dots, T$ do
4:         $x_{adv} \leftarrow x_{adv} + \alpha \cdot \mathrm{sign}\left(\nabla_{x_{adv}} \mathcal{L}(f(x_{adv}), y)\right)$
5:         $x_{adv} \leftarrow \Pi_{\epsilon}\left(x_{adv}\right)$
6:     end for
7:     return $x_{adv}$
8:end procedure
Algorithm 1 $\ell_\infty$-based Projected Gradient Descent

The operator $\Pi_{\epsilon}$ is a projection onto the norm ball of radius $\epsilon$. In the $\ell_\infty$ case, $\Pi_{\epsilon}$ is just the clamp operation into $[x - \epsilon, x + \epsilon]$.
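As a concrete illustration, the $\ell_\infty$ variant of PGD above can be sketched in a few lines of NumPy. The linear model with a closed-form loss gradient used below is a toy assumption, standing in for back-propagation through a real classifier:

```python
import numpy as np

def pgd_linf(grad_fn, x, eps, alpha, steps):
    """L-infinity PGD: ascend the loss via gradient signs, then clamp back
    into the eps-ball around the original input x (the projection step)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # gradient-sign ascent
        x_adv = np.clip(x_adv, x - eps, x + eps)         # projection = clamp for l_inf
    return x_adv

# Toy example: "loss" = w . x, so the loss gradient w.r.t. x is just w.
w = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
adv = pgd_linf(lambda z: w, x, eps=0.1, alpha=0.05, steps=5)
```

In a real attack, `grad_fn` would return the gradient of the classification loss with respect to the input, obtained by a backward pass through the network.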

Since white-box attacks rely on assumptions that do not always hold, they cannot be used in every setup. One example is a defense method that relies on non-differentiable preprocessing: since white-box attacks are gradient-based, they are likely to fail in this case. Another example is stochastic preprocessing, which poses a challenging configuration for white-box attacks, since the ideal crafted attack might no longer be optimal at inference time due to the randomness. In order to better adjust gradient-based adversarial attacks to such scenarios, alternative approaches were developed, as we describe hereafter.

Grey-Box Attack is used when the attacker has access to the classifier $f$ but not to the preprocessing model $g$ defending it. This approach is limited due to the fact that the attack in such a case is constructed upon $f$ while being evaluated on $f \circ g$. As a consequence, the malicious perturbation created is necessarily sub-optimal and thus less effective.

Backward Pass Differentiable Approximation (BPDA) Attack  Athalye et al. (2018) is an attack method for cases in which the preprocessing function $g$ is non-differentiable or impractical to differentiate, implying that $f \circ g$ is not differentiable as well. In many cases we can invoke the assumption that $g(x) \approx x$, reflecting the fact that preprocessing methods do not perform significant modifications to the input images, but rather try to remove the already small malicious perturbations. In order to attack such an architecture, BPDA uses the forward pass of the preprocessing and approximates its derivative with the identity, producing $\nabla_x f(g(x)) \approx \nabla_z f(z)\big|_{z=g(x)}$. With this in place, the attacker can perform white-box attacks without completely disregarding the preprocessing step.
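The identity-backward trick can be mimicked in a couple of lines. The quantizing `preprocess` and the analytic classifier gradient below are hypothetical stand-ins for a real (non-differentiable) defense and network:

```python
import numpy as np

def bpda_gradient(classifier_grad, preprocess, x):
    """BPDA: run the (possibly non-differentiable) preprocessing g in the
    forward pass, but back-propagate as if g were the identity (dg/dx ~ I)."""
    z = preprocess(x)            # forward pass through g
    return classifier_grad(z)    # gradient of f at g(x); identity backward for g

# Toy setup: g quantizes the input (non-differentiable), and the classifier's
# loss gradient is the constant vector w (a linear-model assumption).
preprocess = lambda x: np.round(x * 4) / 4
w = np.array([2.0, -1.0])
grad = bpda_gradient(lambda z: w, preprocess, np.array([0.3, 0.6]))
```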

Expectation-Over-Transformation (EOT) Attack  Athalye et al. (2018) is used when the preprocessing step $g$ is stochastic. Attacking such a method is harder for gradient-based methods, since the crafted deviation vector $\delta$ might not remain optimal during inference due to the randomness. EOT calculates the attack’s gradients as $\nabla_x \mathbb{E}\left[\mathcal{L}(f(g(x)), y)\right] = \mathbb{E}\left[\nabla_x \mathcal{L}(f(g(x)), y)\right]$, differentiating through both the classifier and the preprocessing under an expectation. In practice, EOT empirically approximates the expectation with a fixed number of samples drawn from $g(x)$.
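A minimal sketch of the EOT estimator, where the gradient through the stochastic preprocessing-plus-classifier is replaced by an assumed noisy oracle around a true gradient `w`:

```python
import numpy as np

rng = np.random.default_rng(0)

def eot_gradient(grad_fn, x, n_samples):
    """EOT: approximate grad E[L(f(g(x)))] by averaging per-sample gradients
    over n_samples random realizations of the stochastic preprocessing."""
    return sum(grad_fn(x) for _ in range(n_samples)) / n_samples

# Toy stochastic gradient oracle: the true gradient w plus zero-mean noise,
# mimicking the randomness that g injects into each backward pass.
w = np.array([1.0, 3.0])
noisy_grad = lambda x: w + rng.normal(scale=0.1, size=2)
g = eot_gradient(noisy_grad, np.zeros(2), n_samples=1000)
```

Averaging drives the estimate toward the true expected gradient, which is why EOT is effective against randomized defenses.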

We move now to discuss adversarial defense approaches. In the past few years, numerous such methods were proposed to improve the robustness of classifiers to adversarial attacks. While there are many types of robustification algorithms, we focus below on two such families.

Adversarial Training (AT) Defense proposes to utilize adversarial examples during the training process of the classifier. More specifically, the idea is to train the model to classify such examples correctly. Several recent works  Madry et al. (2017); Zhang et al. (2019); Gowal et al. (2020) follow this line of reasoning, leading to the current state-of-the-art in robustifying classifiers.

Preprocessing is a substantially different type of robustification method that, as its name suggests, relies on an operation applied to the classifier’s input beforehand. Since adversarial examples contain small imperceptible perturbations, using preprocessing steps to “clean” them is intuitive. Many works rely on various generative models for such preprocessing  Song et al. (2017); Samangouei et al. (2018); Du and Mordatch (2019); Hill et al. (2020); Yoon et al. (2021). More specifically, these models are used to project the attacked image onto a valid clean one in its vicinity, with the hope that the processed image is more likely to be classified correctly.

2.2 Diffusion Models

Diffusion models  Sohl-Dickstein et al. (2015); Ho et al. (2020); Song and Ermon (2019) are Markov Chain Monte Carlo (MCMC)-based generative techniques, which operate on a chain of images $x_T, x_{T-1}, \dots, x_0$, all of the same size as the given image $x_0$. These methods are based on two closely related processes. The first is the forward process of gradually adding Gaussian noise to the data according to a decaying variance schedule parametrized by $\{\alpha_t\}_{t=0}^{T}$, with $\alpha_0 = 1$. The following defines this chain of steps for $t = 1, \dots, T$, where $x_0$ is the given clean image:

$$q(x_t | x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\, x_{t-1},\ \left(1 - \tfrac{\alpha_t}{\alpha_{t-1}}\right) I\right). \quad (1)$$


Posed differently, the forward process can be described as a simple weighting between the image and a Gaussian noise vector,

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, z, \quad z \sim \mathcal{N}(0, I), \quad (2)$$

so we can express $q(x_t | x_0)$ as

$$q(x_t | x_0) = \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1-\alpha_t) I\right). \quad (3)$$

When $\alpha_T$ is close to zero, $x_T$ is close to pure standard Gaussian noise, independent of $x_0$. Thus, we can set $x_T \sim \mathcal{N}(0, I)$ as the initialization for the backward process, which is explained next.
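The forward process lets us jump from a clean image directly to any diffusion depth in closed form. A short sketch with a linear-beta noise schedule (the DDPM-style choice; the exact schedule values here are an assumption, not taken from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, alpha_bar_t):
    """Jump straight to diffusion step t: x_t = sqrt(a) x_0 + sqrt(1-a) z."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * z

# Linear-beta schedule; the cumulative product decays from ~1 toward 0,
# so deep steps are dominated by noise.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = np.ones((8, 8))
xt = forward_noise(x0, alpha_bar[100])
```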

The second and more intricate process is the backward direction, which gradually removes the noise from the image. Intuitively, this stage denoises the image by peeling off layers of noise gradually. A key ingredient in this process is a pretrained noise estimator neural network, $\epsilon_\theta(x_t, t)$. This denoiser serves as an approximation of the score function  Kadkhodaie and Simoncelli (2020), bringing the knowledge of the image statistics into the sampling procedure. The noise estimator is conditioned on the time $t$, trying to estimate the noise in the latent variable $x_t$. Sampling, or generating an image, is performed by iteratively applying the following update rule for $t = T, \dots, 1$:


where the first term is a denoising stage – an estimation of , while the second term stands for an attenuated version of the estimated additive noise in . is a stochastic addition, where

is a hyperparameter controlling the stochasticity of the process, and


The sampling process posed in Equation (4) tends to be very slow, requiring $T$ passes through the denoising network. Methods for speeding up this process are discussed in  Nichol and Dhariwal (2021); Song et al. (2020); Kawar et al. (2022). There are various use-cases for diffusion models beyond image synthesis. The ones relevant to our work are discussed in  Meng et al. (2021); Kawar et al. (2021b, a, 2022), where inverse problems are considered. Following  Meng et al. (2021), instead of sampling from the ideal image distribution $p(x)$, the diffusion process we implement is initialized with $q(x_{t^*} | x)$, where $x$ is the given noisy image and $t^*$ is an intermediate time step (more on the choice of $t^*$ is given below). Thus, the outcome can be considered a stochastic, high perceptual quality denoising of $x$.
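A single reverse update of Equation (4) can be written directly from the formula. The sanity check below assumes a perfect noise estimate and $\eta = 0$ (deterministic DDIM), in which case one step reproduces exactly the forward-process image at the previous time step:

```python
import numpy as np

def ddim_step(xt, eps_pred, a_bar_t, a_bar_prev, sigma_t, z):
    """One reverse update of Eq. (4): estimate x_0, re-inject the attenuated
    predicted noise, and add the stochastic term sigma_t * z."""
    x0_hat = (xt - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return (np.sqrt(a_bar_prev) * x0_hat
            + np.sqrt(1.0 - a_bar_prev - sigma_t ** 2) * eps_pred
            + sigma_t * z)

# Sanity check with an oracle noise estimate and sigma_t = 0.
rng = np.random.default_rng(1)
x0, e = rng.standard_normal(4), rng.standard_normal(4)
a_t, a_prev = 0.5, 0.8
xt = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * e        # forward to step t
x_prev = ddim_step(xt, e, a_t, a_prev, sigma_t=0.0, z=np.zeros(4))
```

In the actual sampler, `eps_pred` comes from the pretrained network $\epsilon_\theta(x_t, t)$ and `sigma_t` follows the $\eta$-controlled schedule above.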

3 Our Method

Figure 2: Our method incorporates a diffusion model and a classifier. In every diffusion step, we add Gaussian noise multiplied by the corresponding $\sigma_t$, which is controlled by a user-chosen hyperparameter $\eta$. The variables $x_{t^*}, \dots, x_0$ constitute the MCMC, and the last step’s output of the diffusion model, $x_0$, is the final output, to be sent to the classifier.

In this section we present our adversarial defense method, depicted in Figure 1. We start by adding noise to the attacked image, and then proceed by preprocessing the obtained image using a generative diffusion model, effectively projecting it onto the learned image manifold. The outcome of this diffusion is fed into a vanilla classifier, which is trained on the same image distribution that the diffusion model attempts to sample from. Thus, our framework is comprised of two main components – a denoiser that drives the diffusion model and a classifier.

Intuitively, we would like to sample images that are semantically close to an input image by starting the diffusion process from some intermediate time step $t^*$ ($0 < t^* < T$) rather than the beginning ($t = T$). Recall that $x_T$ stands for pure Gaussian noise, whereas $x_{t^*}$ would be the noisy image we embark from. To this end, we modify the input image to fit the diffusion model at this time step by applying Equation  3 – simply multiplying by $\sqrt{\alpha_{t^*}}$ and adding an appropriate Gaussian noise, resulting in $x_{t^*}$. We feed this processed image into the diffusion model at time step $t^*$ and complete the diffusion process, running with $t = t^*, t^*-1, \dots, 1$, and outputting $x_0$. Such a partial diffusion is similar to the image editing process presented in  Meng et al. (2021), and close in spirit to the posterior sampler discussed in  Kadkhodaie and Simoncelli (2020). We provide a comprehensive description of our method in Algorithm 2.

Input: image $x$, maximum depth $t^*$, diffusion model denoiser $\epsilon_\theta$,
      variance schedule $\{\alpha_t\}$, stochasticity hyperparameters $\{\sigma_t\}$

1:procedure Sampling($x, t^*$)
2:     $z \sim \mathcal{N}(0, I)$
3:     $x_{t^*} \leftarrow \sqrt{\alpha_{t^*}}\, x + \sqrt{1-\alpha_{t^*}}\, z$
4:     for $t$ in $t^*, t^*-1, \dots, 1$ do
5:         $z_t \sim \mathcal{N}(0, I)$
6:         $\hat{x}_0 \leftarrow \left(x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)\right) / \sqrt{\alpha_t}$
7:         $x_{t-1} \leftarrow \sqrt{\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t z_t$
8:     end for
9:     return $x_0$
10:end procedure
Algorithm 2 Our Preprocessing Defense Method
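The defense can be condensed into the following NumPy sketch: lift the attacked image to an intermediate diffusion depth, then run the stochastic reverse process back to $t = 0$. The short schedule, the zero “denoiser”, and the chosen depth are placeholders for the pretrained model and hyperparameters, not the paper’s actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_defense(x_attacked, denoiser, alpha_bar, t_star, sigmas):
    """Sketch of the defense: noise the attacked image up to depth t_star,
    then iterate the reverse diffusion updates down to t = 0."""
    x = (np.sqrt(alpha_bar[t_star]) * x_attacked
         + np.sqrt(1.0 - alpha_bar[t_star]) * rng.standard_normal(x_attacked.shape))
    for t in range(t_star, 0, -1):
        eps = denoiser(x, t)  # pretrained noise estimate (stand-in below)
        x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        a_prev = alpha_bar[t - 1]
        x = (np.sqrt(a_prev) * x0_hat
             + np.sqrt(max(1 - a_prev - sigmas[t] ** 2, 0.0)) * eps
             + sigmas[t] * rng.standard_normal(x.shape))
    return x  # this purified image is what gets fed to the vanilla classifier

betas = np.linspace(1e-4, 0.02, 50)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])
dummy_denoiser = lambda x, t: np.zeros_like(x)  # stand-in for the trained network
out = diffusion_defense(np.ones((4, 4)), dummy_denoiser, alpha_bar,
                        t_star=10, sigmas=np.zeros(51))
```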

An important hyperparameter for the success of our method is the initial diffusion depth $t^*$, since different values of it yield significant changes in the output $x_0$. To better understand the importance of a careful choice of $t^*$, we intuitively analyze its effect. On the one hand, when starting from $t^* = T$, we sample a random image from the generative diffusion model, which obviously eliminates the adversarial perturbation. However, as the resulting image is independent of the input, this will necessarily change class-related semantics of the image, which in turn would lead to misclassification. On the other hand, choosing $t^* = 0$ returns the input image unchanged, which does not remove the perturbation, hence probably leading to misclassification as well. In other words, we need to choose a $t^*$ that balances the trade-off between cleaning the adversarial noise and keeping the semantic properties of the input image. Choosing $t^*$ such that it successfully balances these properties is crucial to the success of our adversarial defense algorithm.

We utilize the above-described sampling algorithm with one goal in mind – sampling an image that is not contaminated with an adversarial attack while keeping it semantically similar to the original input image. We believe that our algorithm is suited for this task because the Gaussian noise injections are much larger than the adversarial perturbation. Hence, the noise overshadows the adversarial attack, reducing its effect. This leads to a sampling process that meets both of our demands: removal of the contamination while remaining semantically close to the input.

As mentioned previously, our method is comprised of a diffusion model denoiser and a classifier, both trained on clean images. This framework is very useful from a practical point of view, since we can utilize publicly available pretrained models for a completely different task than the one they were trained on – adversarial defense. The fact that these models were trained without adversarial attacks in mind gives our method a significant advantage – it is inherently threat model-agnostic. This essentially sidesteps the well-known challenge of generalizing to unseen attacks Hendrycks et al. (2021); Bai et al. (2021), according to which classifiers trained on a specific adversarial threat model are vulnerable to attacks under a different threat regime.

A method close in spirit to ours is the Adaptive Denoising Purification (ADP) Yoon et al. (2021), which uses a score-based model as an adversarial defense. Despite this similarity, there are some fundamental differences that we would like to highlight. ADP suggests a score-based gradient ascent algorithm as a preprocessing step for robustifying a pretrained classifier. More specifically, they add Gaussian noise to the input image only at the beginning, and then apply a deterministic gradient ascent process with an adaptive step size. In contrast, we propose a stochastic diffusion-based preprocessing step, in which we inject noise into every diffusion iteration. This effectively samples from the learned image distribution, initialized with a noisy version of the input image. The increased stochasticity is a key property of our method that enables us to wipe out the malicious attack while effectively projecting the attacked image onto the learned image manifold, achieving robustness to unseen attacks.

4 Experiments

We proceed by empirically demonstrating the improved performance attained by our proposed adversarial defense method. First, we provide supporting evidence for our method when applied to a synthetic dataset. Next, we compare our method with another preprocessing method  Yoon et al. (2021) under grey-box, BPDA+EOT, and white-box attacks. Finally, we compare our method to various state-of-the-art (SoTA) methods on white-box attacks. Additional experiments are reported in the supplementary material.

Throughout our experiments, we use the pretrained diffusion model from  Song et al. (2020) and a vanilla classifier, both trained on clean images from the CIFAR-10  Krizhevsky (2009) train set (50,000 examples). More specifically, we set the diffusion model’s maximal depth $t^*$ and a sub-sequence of the time steps as discussed in Section 4.3. In addition, we use a WideResNet-28-10  Zagoruyko and Komodakis (2016) architecture as our classifier and evaluate the performance on the CIFAR-10 test set (10,000 examples).

4.1 Synthetic Dataset Experiments

We create a synthetic 2D dataset (see Figure 3) and investigate the effect of a diffusion process on the decision boundaries of the classification. The dataset consists of two classes – red and blue points – drawn from two mixtures of Gaussians, each consisting of several concentrated groups. We train a fully connected neural network model to classify this data. As for the diffusion preprocess, we use the analytic score function of the known distribution, following the work of  Song and Ermon (2019).
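For intuition, a comparable 2D two-class Gaussian-mixture set can be generated as follows; the blob layout and counts are our own assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_two_class_mixture(n_per_blob, blobs_per_class, spread=0.05):
    """Toy stand-in for the paper's 2D set: two classes, each a mixture of
    small Gaussian blobs, with blob centers interleaved on the unit circle."""
    pts, labels = [], []
    for cls in range(2):
        for k in range(blobs_per_class):
            angle = 2 * np.pi * (2 * k + cls) / (2 * blobs_per_class)
            center = np.array([np.cos(angle), np.sin(angle)])
            pts.append(center + spread * rng.standard_normal((n_per_blob, 2)))
            labels.append(np.full(n_per_blob, cls))
    return np.vstack(pts), np.concatenate(labels)

X, y = make_two_class_mixture(n_per_blob=50, blobs_per_class=4)
```

Because the mixture density is known analytically, its score function can be computed in closed form, which is what makes the diffusion preprocess exact in this synthetic setting.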

After training the classifier, we calculate its decision rule and present it in Figure 2(a), where the background colors represent the predicted label. As can be seen, the classifier achieves perfect performance: all the red points are located in the red zone, and all the blue ones are surrounded by a blue background. Nevertheless, the classifier’s decision boundaries are very close to the data, which is a well-known phenomenon of vanilla classifiers Shamir et al. (2021). This illustrates why small perturbations to the data, such as adversarial attacks, can change the classification decision from correct to wrong.

When applying our preprocessing scheme, our method leads to a larger margin between the data points and the decision boundaries, as can be seen in Figure 2(b). These results are encouraging because in the adversarial attack regime, every data point is allowed to be perturbed within a norm ball around it. When the decision boundaries are far enough from the data points, a norm-bounded attack would necessarily fail.

(a) Original classifier
(b) Our method
Figure 3: Decision boundary comparison between a vanilla classifier with and without our method on a 2D synthetic dataset.

4.2 CIFAR-10 Experiments

First, we compare our method to ADP Yoon et al. (2021), a leading preprocessing method, using the following attacks: grey-box, BPDA+EOT, and white-box, where the EOT expectation is approximated over repeated samples. As can be seen in Table  1, our method outperforms ADP under all three attacks, most notably under BPDA+EOT and white-box. We should note that the ADP results are lower than those presented in Yoon et al. (2021); this was also observed in Croce et al. (2022).

Defense                   Attack      Base Classifier        Preprocessed
                                      Clean    Adversarial   Clean    Adversarial
ADP Yoon et al. (2021)    grey-box    95.60    00.00         86.39    80.49
Ours                      grey-box    95.60    00.00         86.28    82.33
ADP Yoon et al. (2021)    BPDA+EOT    95.60    00.00         86.39    44.79
Ours                      BPDA+EOT    95.60    00.00         86.28    77.65
ADP Yoon et al. (2021)    white-box   95.60    00.00         86.39    31.42
Ours                      white-box   95.60    00.00         86.28    63.40
Table 1: CIFAR-10 robust accuracies of preprocessing methods under the following attacks: grey-box, BPDA+EOT, and white-box PGD, all using the same threat model.

Next, we compare our method to baseline state-of-the-art (SoTA) methods under PGD attacks using four different threat models – more details are given in the supplementary material. To assess the generalization ability to unseen attacks, we average the results in two ways: (i) Average of All: the accuracy averaged over all the attacks; and (ii) Average of Unseen Attacks: the accuracy averaged over the attacks not seen at training time (if applicable). While the first is a simple average that also considers the performance on the attack used at training time, the second showcases the generalization capabilities to unseen attacks. Note that because our method is not trained on any threat model, (i) and (ii) coincide for it. As can be seen in Table 2, adversarial training methods excel on the specific threat model they were trained on. However, they generalize poorly, as discussed in Bai et al. (2021); Hendrycks et al. (2021), while our method achieves SoTA performance in both of the examined metrics.

TTM Attack AwT AoA Architecture

AT Madry et al. (2017) rn-50

Trades Zhang et al. (2019) wrn-34-10

Gowal et al.  Gowal et al. (2020) wrn-28-10
PAT -  Laidlaw et al. (2020) rn-50
Ours 39.70 39.70 wrn-28-10
Table 2: CIFAR-10 robust accuracies under white-box + EOT attacks. For every compared method, we state the threat model used in training in the Trained Threat Model (TTM) column. The next four columns correspond to the four threat models used for evaluation, followed by the two averages that we use for evaluation: Average without Training (AwT) and Average of All (AoA). The last column states the classifier architecture used.

4.3 Diffusion Depth and Sampling

Figure 4: The obtained robust accuracy under white-box attacks as a function of the maximal depth $t^*$ of the diffusion model. There are two curves, both attacked using the same threat model; one shows the robust accuracy under a white-box attack, and the other under white-box + EOT.

When deploying the proposed diffusion defense, two critical parameters should be discussed – the choice of $t^*$ (referred to as depth) and the time-step skips to use. In this subsection we discuss the effect of both.

We start by showing the influence of the depth of the diffusion model on the robust accuracy. As we change the maximal depth $t^*$ of the diffusion model, we depict the robust accuracy obtained by our method, and present it in Figure  4. As discussed in Section  3, the diffusion depth controls the trade-off between clearing the attack perturbation and sampling an image that is semantically similar to the input image. We track the diffusion model’s behavior as we increase its first step. When setting $t^*$ to a shallow diffusion step, we effectively sample images that are close to the input image, and since the image is contaminated by a malicious attack, the classification accuracy is low. As we increase the depth we reach a sweet spot in which we clean the malicious perturbation while keeping a small perceptual distance to the input, which leads to the highest accuracy. When the depth is too big, we clear the attack but lose perceptual similarity to the input, and the accuracy approaches chance level (10% on CIFAR-10), meaning that we effectively sample random images.

We now move to explore the influence of the skips to the time-steps in the diffusion process. Attacking our preprocessing method necessarily consumes a lot of time and memory, making it hard to break, as indeed claimed in  Hill et al. (2020). This is due to the fact that an attack process requires keeping a computational graph of all the time steps of the diffusion process for computing derivatives. In contrast, our defense mechanism is lighter, as no derivatives are required, and only forward passes through the denoiser are performed.

When evaluating our defense method under the strongest known attack, white-box + EOT, we must lighten our protection further by reducing the number of diffusion steps. This is done by using only a tenth of the DDIM diffusion steps  Song et al. (2020). For uniformity of our experiments, we use this sub-sequence of steps for all attacks.

We should note that if the proposed preprocessing diffusion is applied in full (no subsampling), both the attack and defense runtime and memory consumption increase by a factor of 10. Such an approach would not worsen the robust accuracy, and may even improve it, as can be seen in the supplementary material. These effects lead to one clear conclusion – when using our defense in practice, we can increase the number of diffusion sampling steps, burdening the attacker while preserving the robust accuracy.

5 Related Work

The goal of preprocessing methods is to clean adversarial attacks from input images, leading to correct predictions by deep neural network classifiers. Preliminary work on preprocessing defense methods includes rescaling  Xie et al. (2017), thermometer encoding  Buckman et al. (2018), feature squeezing  Xu et al. (2017), GAN-based reconstruction  Samangouei et al. (2018), ensembles of transformations  Raff et al. (2019), addition of Gaussian noise  Cohen et al. (2019) and mask-and-reconstruction  Yang et al. (2019). It was shown by  Athalye et al. (2018); Tramer et al. (2020) that such preprocessing, even if it includes stochasticity and non-differentiability, can be broken when evaluated properly by adjusting the projected gradient descent attack using the backward-pass differentiable approximation and expectation-over-transformation algorithms. A new group of preprocessing works has recently emerged, utilizing Energy-Based Models (EBMs) for the task of cleaning adversarial perturbations from images. The intuition is that generative models are capable of sampling images from the image manifold, hopefully projecting attacked images that have deviated from the image manifold back onto it. To this end, several EBM preprocessing methods were developed: purification by PixelCNN  Song et al. (2017), restoring corrupt images with an EBM  Du and Mordatch (2019) and a density-aware classifier  Grathwohl et al. (2019). The most recent methods include long-run Langevin sampling  Hill et al. (2020) and a gradient-ascent score-based model  Yoon et al. (2021). In contrast to many of these methods that require retraining the classifier, ours does not: the diffusion model and the classifier are both pretrained on clean images.

Defenses against unseen attacks: Recently, attention to defending against unseen attacks has grown. Previous methods based on Adversarial Training (AT) do not generalize well to unseen attacks, as shown in  Hendrycks et al. (2021); Bai et al. (2021). To this end, a new metric for evaluating robustness to unseen attacks was suggested in  Kang et al. (2019). Moreover, the authors of  Laidlaw et al. (2020) suggested perceptual adversarial training, which takes perceptual similarity into account, leading to a new method that generalizes to unseen attacks.

6 Conclusion

This work presents a novel preprocessing defense mechanism against adversarial attacks, based on a generative diffusion model. Since this generative model is pretrained on clean images only, it can generalize to unseen attacks. We evaluate our method across different attacks and demonstrate its superior performance. Our method can be used to defend against any attack and does not require retraining the vanilla classifier.


  • [1] A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274–283. Cited by: §2.1, §2.1, §5.
  • [2] T. Bai, J. Luo, J. Zhao, B. Wen, and Q. Wang (2021) Recent advances in adversarial training for adversarial robustness. arXiv preprint arXiv:2102.01356. Cited by: §1, §3, §4.2, §5.
  • [3] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow (2018) Thermometer encoding: one hot way to resist adversarial examples. In International Conference on Learning Representations, Cited by: §5.
  • [4] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: §2.1.
  • [5] J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: §5.
  • [6] F. Croce, S. Gowal, T. Brunner, E. Shelhamer, M. Hein, and T. Cemgil (2022) Evaluating the adversarial robustness of adaptive test-time defenses. arXiv preprint arXiv:2202.13711. Cited by: §4.2.
  • [7] Y. Du and I. Mordatch (2019) Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems 32. Cited by: §1, §2.1, §5.
  • [8] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, B. Tran, and A. Madry (2019) Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945. Cited by: Attack Structure.
  • [9] R. Ganz and M. Elad (2021) BIGRoC: boosting image generation via a robust classifier. arXiv. External Links: Document, Link Cited by: Attack Structure.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §2.1, §2.1.
  • [11] S. Gowal, C. Qin, J. Uesato, T. Mann, and P. Kohli (2020) Uncovering the limits of adversarial training against norm-bounded adversarial examples. arXiv preprint arXiv:2010.03593. Cited by: §1, §2.1, Table 2, Figure 6.
  • [12] W. Grathwohl, K. Wang, J. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky (2019) Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263. Cited by: §1, §5.
  • [13] D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt (2021) Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916. Cited by: §1, §3, §4.2, §5.
  • [14] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations. Cited by: Robustness to CIFAR-10-C Perturbations.
  • [15] M. Hill, J. Mitchell, and S. Zhu (2020) Stochastic security: adversarial defense using long-run dynamics of energy-based models. arXiv preprint arXiv:2005.13525. Cited by: §1, §2.1, §4.3, §5.
  • [16] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851. Cited by: §2.2.
  • [17] Z. Kadkhodaie and E. P. Simoncelli (2020) Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint arXiv:2007.13640. Cited by: §2.2, §3.
  • [18] D. Kang, Y. Sun, D. Hendrycks, T. Brown, and J. Steinhardt (2019) Testing robustness against unforeseen adversaries. arXiv preprint arXiv:1908.08016. Cited by: §5.
  • [19] B. Kawar, M. Elad, S. Ermon, and J. Song (2022) Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793. Cited by: §2.2.
  • [20] B. Kawar, G. Vaksman, and M. Elad (2021) Snips: solving noisy inverse problems stochastically. Advances in Neural Information Processing Systems 34. Cited by: §2.2.
  • [21] B. Kawar, G. Vaksman, and M. Elad (2021) Stochastic image denoising by sampling from the posterior distribution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1866–1875. Cited by: §1, §2.2.
  • [22] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §4, Robustness to CIFAR-10-C Perturbations.
  • [23] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1, §2.1.
  • [24] C. Laidlaw, S. Singla, and S. Feizi (2020) Perceptual adversarial robustness: defense against unseen threat models. arXiv preprint arXiv:2006.12655. Cited by: Table 2, §5, Figure 6.
  • [25] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §2.1, §2.1, Table 2, Figure 6.
  • [26] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon (2021) SDEdit: guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, Cited by: §2.2, §3.
  • [27] A. Q. Nichol and P. Dhariwal (2021) Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. Cited by: §2.2.
  • [28] G. Ohayon, T. Adrai, G. Vaksman, M. Elad, and P. Milanfar (2021) High perceptual quality image denoising with a posterior sampling cgan. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1805–1813. Cited by: §1.
  • [29] E. Raff, J. Sylvester, S. Forsyth, and M. McLean (2019) Barrage of random transforms for adversarially robust defense. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6528–6537. Cited by: §5.
  • [30] P. Samangouei, M. Kabkab, and R. Chellappa (2018) Defense-gan: protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605. Cited by: §1, §2.1, §5.
  • [31] S. Santurkar, A. Ilyas, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Image synthesis with a single (robust) classifier. Advances in Neural Information Processing Systems 32. Cited by: Attack Structure.
  • [32] A. Shamir, O. Melamed, and O. BenShmuel (2021) The dimpled manifold model of adversarial examples in machine learning. arXiv preprint arXiv:2106.10151. Cited by: §4.1, Attack Structure.
  • [33] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Cited by: §2.2.
  • [34] J. Song, C. Meng, and S. Ermon (2020) Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502. Cited by: §2.2, §4.3, §4.
  • [35] Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32. Cited by: §2.2, §4.1.
  • [36] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2017) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §1, §2.1, §5.
  • [37] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §2.1.
  • [38] F. Tramer, N. Carlini, W. Brendel, and A. Madry (2020) On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems 33, pp. 1633–1645. Cited by: §5.
  • [39] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152. Cited by: Attack Structure.
  • [40] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille (2017) Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991. Cited by: §5.
  • [41] W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §5.
  • [42] Y. Yang, G. Zhang, D. Katabi, and Z. Xu (2019) Me-net: towards effective adversarial robustness with matrix estimation. arXiv preprint arXiv:1905.11971. Cited by: §1, §5.
  • [43] J. Yoon, S. J. Hwang, and J. Lee (2021) Adversarial purification with score-based generative models. In International Conference on Machine Learning, pp. 12062–12072. Cited by: §1, §2.1, §3, §4.2, Table 1, §4, §5.
  • [44] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.
  • [45] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp. 7472–7482. Cited by: §1, §2.1, Table 2, Figure 6.

Attack Structure

Working with adversarial perturbations of images has the advantage of enabling the analysis of the attack, better understanding it, and gaining intuition about it. When an attack changes the classification prediction for an image, one might expect the perceptual structure of the image to change accordingly, just as would be required in order to change a human's prediction. However, this is not always the case when fooling a deep-neural-network classifier.

A geometrical explanation for this phenomenon is given in Shamir et al. (2021), showing that trained vanilla classifiers tend to produce decision boundaries that are nearly parallel to the data manifold. As such, fooling the network amounts to a very small step orthogonal to this manifold, thus having no “visual meaning”. In contrast, robust classifiers behave differently, exhibiting Perceptual Aligned Gradients (PAG) Tsipras et al. (2018); Santurkar et al. (2019); Engstrom et al. (2019); Ganz and Elad (2021).

White-box attacks of the form we consider in this work are based on computing the gradients of the attacked classifier. Therefore, when a classifier exhibits a PAG property in its gradients, this would imply a highly desired robustness behavior. Armed with this insight, we consider the following question: Given a system comprising of both the vanilla classifier and our diffusion-based defense mechanism, does this overall system have PAG?

We answer the above question and present empirical evidence of this phenomenon in Figure 5. In the first row we show several original images from CIFAR-10. In the second row we present a white-box attack on a vanilla classifier, an attack lacking perceptual meaning. In the third row we present a white-box + EOT attack on our method, exhibiting PAG: the obtained gradients concentrate on the object, aiming to modify its appearance. When attacking the defended classifier, the attacker uses white-box + EOT, an attack crafted for stochastic defenses, in which every attack step takes the expectation over multiple realizations of the defense.
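The EOT-adjusted white-box attack described above can be sketched as a single PGD step that averages gradients over several stochastic realizations of the defense. This is a minimal illustrative sketch, not the paper's code; `classifier` and `defense` are placeholders for the vanilla network and the stochastic purification stage.

```python
import torch
import torch.nn.functional as F

def pgd_eot_step(x0, x_adv, y, classifier, defense, eps, alpha, n_eot=8):
    """One ell_inf PGD step with Expectation-over-Transformation (EOT):
    the gradient is averaged over n_eot random realizations of the
    stochastic defense before taking the signed ascent step."""
    grad = torch.zeros_like(x_adv)
    for _ in range(n_eot):
        x = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(classifier(defense(x)), y)
        grad += torch.autograd.grad(loss, x)[0]
    # Signed gradient ascent on the averaged gradient.
    x_next = x_adv + alpha * (grad / n_eot).sign()
    # Project back onto the eps-ball around the clean image x0.
    x_next = torch.min(torch.max(x_next, x0 - eps), x0 + eps)
    return x_next.clamp(0.0, 1.0)
```

Averaging over realizations is what makes the attack meaningful against a stochastic defense: a single-sample gradient is dominated by the randomness of one defense realization.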

Figure 5: The attack structure of the white-box + EOT attack, norm, radius . First row: five CIFAR-10 images. Second row: the attack under a white-box setting, where the attacked classifier is a vanilla one. Third row: the attack on our method, where we preprocess the image before feeding it into the vanilla classifier.

Robustness to CIFAR-10-C Perturbations

In most of our discussion we focused on robustness to norm-bounded attacks. We now turn to robust classification under attacks that are based on augmentations. These refer to modifications of the image in various ways, such as motion blur, zoom blur, snow, JPEG compression, contrast variation, etc. CIFAR-10-C Hendrycks and Dietterich (2019) is such a dataset of corrupted images, created by applying numerous augmentations to the CIFAR-10 Krizhevsky (2009) dataset. CIFAR-10-C is commonly used for evaluating robustness under this broad class of attacks.

As our method is inherently attack agnostic, it is natural to evaluate it on this class of attacks. We compare our method against other leading techniques, achieving state-of-the-art results. This experiment requires adjusting the maximal depth parameter of the diffusion model. With an appropriate setting of this parameter, we outperform the other methods, as depicted in Figure 6.
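Evaluating on CIFAR-10-C amounts to running the same purify-then-classify pipeline over each corruption and measuring accuracy. The loop below is a generic sketch under assumed names (`classifier`, `purify`, and a `(x, y)` data loader), not the paper's evaluation harness.

```python
import torch

@torch.no_grad()
def corruption_accuracy(classifier, purify, loader):
    """Accuracy of the purify -> classify pipeline on a corrupted
    dataset, e.g., one CIFAR-10-C corruption at a fixed severity."""
    correct = total = 0
    for x, y in loader:
        pred = classifier(purify(x)).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```

Averaging this quantity over all corruption types and severities yields the single CIFAR-10-C robustness number plotted in Figure 6.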

Figure 6: Robustness accuracy under CIFAR-10-C as a function of the diffusion model maximal depth . We compare our method with the results reported in Gowal et al. (2020); Zhang et al. (2019); Madry et al. (2017); Laidlaw et al. (2020).

Computational Resources

Our proposed defense method relies on an application of a diffusion model as a preprocessing stage for purifying adversarial perturbations. To perform a gradient-based attack, one needs to backpropagate the gradients through both the classifier and the diffusion model. This process is very expensive, both in memory and in computation, since the attacker needs to keep the entire computational graph in memory and backpropagate from the classifier through all of the diffusion time steps.

When evaluating our defense method under our most challenging attack, white-box + EOT, we must further lighten our approach by reducing the number of diffusion steps, using only a fraction of the full sampling schedule. This reduction decreases the computational requirements and enables us to perform such an attack using 8 NVIDIA A4000 GPUs. As shown in Table 3, the robust accuracy of our method is only slightly reduced, while the computational cost improves significantly and state-of-the-art performance is maintained.
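Reducing the number of diffusion steps can be done by evenly striding the timestep schedule, in the spirit of DDIM-style accelerated sampling Song et al. (2020). The helper below is a hypothetical sketch of such a subsampling, not the paper's exact schedule.

```python
def subsample_timesteps(T, keep_fraction):
    """Evenly subsample a diffusion timestep schedule of length T,
    keeping roughly keep_fraction of the steps. A shorter schedule
    also shortens the computational graph an attacker must
    backpropagate through."""
    stride = max(1, round(1.0 / keep_fraction))
    steps = list(range(0, T, stride))
    return steps[::-1]  # the reverse process runs from deep t down to 0
```

For example, keeping one tenth of a 1000-step schedule leaves 100 reverse steps, cutting both the sampling time and the attacker's backpropagation memory by roughly the same factor.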

Attack                 AwT     AoA
Ours - full sampling   54.77   54.77

Table 3: CIFAR-10 robust accuracies under white-box + EOT attacks. We present two samplings of the diffusion model time steps: the first uses a reduced schedule, while the second applies full sampling, and we compare their performance. In the "Attack" columns we present the accuracy under different threat models. The last two columns are the two averages used for evaluation: Average without Training (AwT) and Average of All (AoA). The evaluation was performed on the first test images of CIFAR-10.