On the Robustness of the CVPR 2018 White-Box Adversarial Example Defenses

by Anish Athalye, et al.

Neural networks are known to be vulnerable to adversarial examples. In this note, we evaluate the two white-box defenses that appeared at CVPR 2018 and find they are ineffective: when applying existing techniques, we can reduce the accuracy of the defended models to 0%.




1 Introduction

Training neural networks so they will be robust to adversarial examples (Szegedy et al., 2013) is a major challenge. Two defenses that appeared at CVPR 2018 attempt to address this problem: “Deflecting Adversarial Attacks with Pixel Deflection” (Prakash et al., 2018) and “Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser” (Liao et al., 2018).

In this note, we show these two defenses are not effective in the white-box threat model. We construct adversarial examples that reduce the classifier accuracy to 0% on the ImageNet dataset (Deng et al., 2009) when bounded by a small ℓ∞ perturbation, a stricter bound than considered in the original papers. Our attacks can also construct targeted adversarial examples with a high success rate.

Our methods are a direct application of existing techniques.

2 Background

We assume familiarity with neural networks, adversarial examples (Szegedy et al., 2013), generating strong adversarial attacks (Madry et al., 2018), and computing adversarial examples for neural networks with non-differentiable layers (Athalye et al., 2018). We briefly review the key details and notation.

Adversarial examples (Szegedy et al., 2013) are instances x' that are very close to an instance x with respect to some distance metric (ℓ∞ distance, in this paper), but where the classification of x' is not the same as the classification of x. Targeted adversarial examples are instances x' whose label is equal to a given target label t.
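The two definitions above can be written compactly. This is only a restatement under assumed notation: c(·) for the classifier's predicted label and ε for the perturbation bound are symbols introduced here for illustration, and the distance metric is taken to be the ℓ∞ norm.

```latex
% Untargeted: x' stays close to x but is classified differently.
\[
\|x' - x\|_\infty \le \epsilon
\quad\text{and}\quad
c(x') \ne c(x)
\]
% Targeted: x' stays close to x and is classified as the chosen target t.
\[
\|x' - x\|_\infty \le \epsilon
\quad\text{and}\quad
c(x') = t
\]
```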

We examine two defenses: Pixel Deflection and High-level Representation Guided Denoiser. We are grateful to the authors of these defenses for releasing their source code and pre-trained models.

Figure 1: Original images from the ImageNet validation set (row 1). Targeted adversarial examples (with randomly chosen targets) for Pixel Deflection (row 2) and High-level representation Guided Denoiser (row 3), each with a small ℓ∞ perturbation.

Pixel Deflection (Prakash et al., 2018) proposes a non-differentiable preprocessing of inputs. A number of pixels (a tunable hyperparameter) are randomly replaced with nearby pixels. The resulting image is often noisy, so a denoising operation is applied to restore accuracy.
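To make the replacement step concrete, here is a minimal sketch of randomly replacing pixels with nearby pixels. It is not the authors' implementation: the function name, the `num_deflections` and `window` parameters, the grayscale list-of-lists image, and the uniform sampling are all assumptions made for illustration, and the denoising step is omitted.

```python
import random


def pixel_deflection(image, num_deflections, window, rng):
    """Return a copy of `image` with randomly chosen pixels replaced by nearby pixels.

    `image` is a 2D list of grayscale values. `num_deflections` and `window`
    are hypothetical names for the tunable hyperparameters.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for _ in range(num_deflections):
        # Pick the pixel to overwrite uniformly at random.
        r = rng.randrange(height)
        c = rng.randrange(width)
        # Pick a source pixel inside a square window around it, clamped to the image.
        sr = min(max(r + rng.randint(-window, window), 0), height - 1)
        sc = min(max(c + rng.randint(-window, window), 0), width - 1)
        out[r][c] = out[sr][sc]
    return out


# Toy 8x8 image with distinct values so replacements are easy to spot.
image = [[10 * r + c for c in range(8)] for r in range(8)]
deflected = pixel_deflection(image, num_deflections=20, window=2, rng=random.Random(0))
```

Because each output pixel is copied from some other pixel, the operation has no useful gradient with respect to the input, which is why a plain gradient-based attack cannot be applied to it directly.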

High-level representation Guided Denoiser (HGD) (Liao et al., 2018) proposes denoising inputs using a trained neural network before passing them to a standard classifier. This denoiser is a differentiable, non-randomized neural network. This defense has also been evaluated by Uesato et al. (2018) and found to be ineffective.

2.1 Methods

We evaluate these defenses under the white-box threat model. We generate adversarial examples with Projected Gradient Descent (PGD) (Madry et al., 2018), maximizing the cross-entropy loss while keeping the ℓ∞ distortion within a small fixed bound.
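The following is a minimal sketch of untargeted PGD under an ℓ∞ bound, using a toy linear softmax classifier in place of a real network. The weights, input, and the `epsilon`, `step_size`, and `steps` values are illustrative assumptions and are not the settings used in our experiments.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def predict(weights, x):
    z = logits(weights, x)
    return z.index(max(z))


def loss_gradient(weights, x, label):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(logits(weights, x))
    # d(loss)/d(logit_k) = p_k - 1[k == label]; chain rule through the linear layer.
    dz = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(dz[k] * weights[k][i] for k in range(len(weights)))
            for i in range(len(x))]


def pgd_untargeted(weights, x, label, epsilon, step_size, steps):
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(weights, x_adv, label)
        for i, g in enumerate(grad):
            # Ascent step on the loss using the sign of the gradient.
            stepped = x_adv[i] + step_size * ((g > 0) - (g < 0))
            # Project into the l_inf ball around x, then into the valid range [0, 1].
            stepped = min(max(stepped, x[i] - epsilon), x[i] + epsilon)
            x_adv[i] = min(max(stepped, 0.0), 1.0)
    return x_adv


weights = [[1.0, -1.0], [-1.0, 1.0]]  # toy two-class linear model
x = [0.55, 0.45]                      # classified as class 0
x_adv = pgd_untargeted(weights, x, label=0, epsilon=0.1, step_size=0.02, steps=10)
```

A targeted variant would instead descend on the cross-entropy loss computed against the target label, keeping the same projection step.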

What is the right threat model to evaluate against? Many papers only claim white-box security against an attacker who is completely unaware the defense is being applied. HGD, for example, says “the white-box attacks defined in this paper should be called oblivious attacks according to Carlini and Wagner’s definition” (Liao et al., 2018).

Unfortunately, security against oblivious attacks is not useful. We only defined this threat model in our prior work (Carlini & Wagner, 2017) to study the case of an extremely weak attacker, to show that some defenses are not even robust under this model. Furthermore, many previously published schemes already achieve security against oblivious attacks. In practice, any serious attacker would certainly consider the possibility that a defense is in place and try to circumvent it, if there is a reasonable way to do so.

Thus, security against oblivious attacks is far from sufficient to be interesting or useful in practice. Even the black-box threat model allows the attacker to be aware that the defense is being applied, and treats only the exact parameters of the defense as private data. Also, in our experience, schemes that are insecure against white-box attacks tend to be insecure against black-box attacks as well (Carlini & Wagner, 2017). Accordingly, in this note, we evaluate schemes against white-box attacks.

3 Methodology

3.1 Pixel Deflection

We now show that Pixel Deflection is not robust. We analyze the defense as implemented by the authors (https://github.com/iamaaditya/pixel-deflection). Our evaluation code is publicly available (https://github.com/carlini/pixel-deflection).

We apply BPDA (Athalye et al., 2018) to Pixel Deflection to handle its non-differentiable replacement operation. Our attack reduces the accuracy of the defended classifier to 0%.
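As a rough illustration of the BPDA idea, the sketch below runs a non-differentiable preprocessing step on the forward pass and treats it as the identity on the backward pass. Everything here is a toy stand-in: `quantize` is an assumed placeholder for the defense's preprocessing, the classifier is a two-feature logistic model, and the attack parameters are arbitrary. It is not our evaluation code.

```python
import math


def quantize(x, levels=8):
    """Non-differentiable preprocessing: its true gradient is zero almost everywhere."""
    return [round(v * (levels - 1)) / (levels - 1) for v in x]


def score(w, z):
    """Differentiable classifier: probability of class 1 under a logistic model."""
    return 1.0 / (1.0 + math.exp(-sum(wi * zi for wi, zi in zip(w, z))))


def predict(w, x):
    return int(score(w, quantize(x)) > 0.5)


def bpda_gradient(w, x, label):
    """Approximate d(loss)/dx for the cross-entropy loss of score(w, quantize(x))."""
    z = quantize(x)  # forward pass: run the real preprocessing
    p = score(w, z)
    # Exact gradient of the loss with respect to the preprocessed input z.
    grad_z = [(p - label) * wi for wi in w]
    # Backward pass: treat quantize as the identity, so d(loss)/dx ~= d(loss)/dz.
    return grad_z


def bpda_attack(w, x, label, epsilon, step_size, steps):
    x_adv = list(x)
    for _ in range(steps):
        grad = bpda_gradient(w, x_adv, label)
        for i, g in enumerate(grad):
            stepped = x_adv[i] + step_size * ((g > 0) - (g < 0))
            # Project into the l_inf ball around x, then into the valid range [0, 1].
            stepped = min(max(stepped, x[i] - epsilon), x[i] + epsilon)
            x_adv[i] = min(max(stepped, 0.0), 1.0)
    return x_adv


w = [4.0, -4.0]
x = [0.6, 0.4]  # classified as class 1 after quantization
x_adv = bpda_attack(w, x, label=1, epsilon=0.15, step_size=0.03, steps=10)
```

The identity approximation is reasonable whenever the preprocessing leaves the input roughly unchanged, which is the design goal of any preprocessing defense that must preserve accuracy on clean images.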

The attack also succeeds in the targeted setting. Because the defense is randomized, we count a targeted attack as successful only if the image is classified as the adversarial target label in a required number of repeated evaluations.
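The success criterion for a randomized defense can be sketched as a repeated-trial check. The function name and the `trials` and `required` parameters are hypothetical, and the values used below are placeholders rather than the thresholds from our evaluation.

```python
from itertools import count


def targeted_success(classify, x, target, trials, required):
    """True if `classify(x)` returns `target` in at least `required` of `trials` runs."""
    hits = sum(1 for _ in range(trials) if classify(x) == target)
    return hits >= required


# Toy stand-in for a randomized defended classifier: it returns the target
# label 3 on two out of every three calls and label 0 otherwise.
_calls = count()


def flaky_classifier(x):
    return 3 if next(_calls) % 3 else 0


strict = targeted_success(flaky_classifier, None, target=3, trials=6, required=6)
lenient = targeted_success(flaky_classifier, None, target=3, trials=6, required=4)
```

Requiring agreement across repeated runs guards against counting an adversarial example that only reaches the target label on a lucky draw of the defense's randomness.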

3.2 High-Level Representation Guided Denoiser

Next, we show that using a High-level representation Guided Denoiser is not robust in the white-box threat model. We analyze the defense as implemented by the authors (https://github.com/lfz/Guided-Denoise). Our evaluation code is publicly available (https://github.com/anishathalye/Guided-Denoise).

We apply PGD (Madry et al., 2018) end-to-end with no modification. It reduces the accuracy of the defended classifier to 0% and also succeeds at generating targeted adversarial examples.

4 Conclusion

As this note demonstrates, Pixel Deflection and High-level representation Guided Denoiser (HGD) are not robust to adversarial examples.


We are grateful to Aleksander Madry and David Wagner for comments on an early draft of this paper.

We thank Aaditya Prakash and Fangzhou Liao for discussing their defenses with us, and we thank the authors of both papers for releasing source code and pre-trained models.