MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples

11/22/2017 ∙ by Nicholas Carlini, et al.


Abstract

MagNet and “Efficient Defenses…” were recently proposed as defenses against adversarial examples. We find that we can construct adversarial examples that defeat these defenses with only a slight increase in distortion.

1 Introduction

It is an open question how to train neural networks so they will be robust to adversarial examples [11]. Recently, three defenses have been proposed to make neural networks robust to adversarial examples:

  • MagNet [8] was proposed as an approach to make neural networks robust against adversarial examples through two complementary approaches: adversarial examples near the data manifold are reformed to lie on the data manifold, where they are classified correctly, whereas adversarial examples far away from the data manifold are detected and rejected before classification. MagNet does not argue robustness in the white-box setting; rather, the authors argue that MagNet is robust in the grey-box setting, where the adversary is aware the defense is in place and knows the parameters of the base classifier, but not the parameters of the defense.

  • An efficient defense [12] was proposed to make neural networks more robust against adversarial examples by performing Gaussian data augmentation during training and using the BReLU activation function. The authors do not claim perfect security, but claim this makes attacks visually detectable.

  • Adversarial Perturbation Elimination GAN (APE-GAN) [10] is similar to MagNet, except that adversarial examples are projected onto the data manifold using a Generative Adversarial Network (GAN) [2]. We did not set out to bypass this defense, but found it to be very similar to MagNet and so we analyze it too.

In this short paper, we demonstrate that these three defenses are not effective on the MNIST [5] and CIFAR-10 [4] datasets. We show that we are able to bypass MagNet with at least 99% success (Table 1), and that the latter two defenses fall to an existing attack, in every case with only a slight increase in distortion.

We defeat MagNet by making use of the transferability [11] property of adversarial examples: the adversary trains their own copy of the defense, constructs adversarial examples on their model, and supplies these adversarial examples to the defender. It turns out that these examples will also fool the defender’s model.

We defeat “Efficient Defenses Against Adversarial Attacks” and APE-GAN by showing that an existing attack defeats them without modification. The resulting adversarial examples are no more visually detectable than those on an undefended network.

2 Background

We assume familiarity with neural networks [5], adversarial examples [11], transferability [6], generating strong adversarial examples [1], and MagNet [8]. We briefly review the key details and notation.


Notation. Let $F(\cdot)$ be a neural network used for classification, outputting a probability distribution. Call the second-to-last layer (the layer before the softmax layer) $Z(\cdot)$, so that $F(x) = \mathrm{softmax}(Z(x))$. Each output $F(x)_i$ corresponds to the predicted probability that the object $x$ is labelled as class $i$. Let $C(x) = \arg\max_i F(x)_i$ correspond to the classification of $F$ on $x$. In this paper we are concerned with neural networks used to classify images (on MNIST and CIFAR-10).


Adversarial examples [11] are instances $x'$ that are very close to a normal instance $x$ with respect to some distance metric ($L_2$ distance, in this paper), but where $C(x') = t$ for any target $t$ chosen by the adversary.

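We generate adversarial examples with Carlini and Wagner’s attack algorithm [1]. Specifically, we solve

$$\text{minimize } \|x' - x\|_2^2 + c \cdot \ell(x')$$

where the loss function $\ell$ is defined as

$$\ell(x') = \max\big(\max\{Z(x')_i : i \ne t\} - Z(x')_t,\ -\kappa\big)$$

and the constant $c$ is chosen via binary search. The parameter $\kappa$ controls the confidence of the adversarial example.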
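As a minimal sketch of this objective (not the authors' code), the snippet below evaluates a targeted hinge loss of the usual form from [1], max(max over i ≠ t of Z_i − Z_t, −κ), on a plain list of pre-softmax outputs, and adds it to the squared $L_2$ distortion. The names `cw_loss` and `cw_objective`, the confidence argument `kappa`, and the toy inputs are illustrative assumptions. A real attack would minimize this objective by gradient descent and pick `c` by binary search.

```python
def cw_loss(logits, target, kappa=0.0):
    """Targeted hinge loss on pre-softmax outputs.

    Positive while some other class still beats the target class, and
    floored at -kappa once the target wins by a margin of at least kappa.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def cw_objective(x_adv, x, logits, target, c, kappa=0.0):
    """Squared L2 distortion plus c times the loss.

    `logits` is assumed to be the classifier's pre-softmax output on x_adv.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return dist_sq + c * cw_loss(logits, target, kappa)


# Toy usage: class 1 currently beats the target class 0 by 2.0.
loss = cw_loss([1.0, 3.0, 0.5], target=0)
total = cw_objective([1.0, 0.0], [0.0, 0.0], [1.0, 3.0, 0.5], target=0, c=2.0)
```

A larger `c` weights misclassification more heavily relative to distortion, which is why it is searched over rather than fixed.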

Perhaps the most surprising property of adversarial examples is transferability [11]: given two different models that perform the same task, trained on different datasets, adversarial examples constructed to fool one model often fool the other as well. This effect can be amplified [6] by constructing the adversarial examples to fool multiple local models before applying them to the target model.


MagNet [8] was recently proposed as a defense to adversarial examples under the grey-box threat model: here, the adversary is aware that MagNet is being used to secure a given classifier and knows the parameters of the classifier, but is not aware of the parameters of the defensive models. MagNet has two defense components:

  1. The detector examines inputs and decides if they are adversarial or benign. There can be multiple detectors, in which case if any report “adversarial” the input is rejected.

  2. The reformer takes inputs classified as benign by the detector(s) and reforms them to remove any remaining minor adversarial nature to the instances. The defender trains multiple reformers and selects one at random, to prevent the adversary from learning its parameters.

The authors argue these defenses complement each other: the detector should be able to detect any adversarial examples that have large distortion and are far away from the true data manifold. The reformer takes instances that are close to the data manifold and maps them directly onto the manifold to remove any minor adversarial perturbations that may not have been detected.

The specific instantiations of the detector and reformer suggested by the authors are based on auto-encoders. Given an auto-encoder $AE(\cdot)$, to detect if an input $x$ is adversarial, compare the mean squared error between $x$ and $AE(x)$ with a threshold $\tau$ chosen to minimize the false positive rate. On CIFAR, the authors suggest one further detector: compute the Jensen-Shannon divergence between the classifier's softmax outputs at temperature $T$ on $x$ and on $AE(x)$ (two values of $T$ are used), and report “adversarial” if the divergence is greater than a different threshold $\tau'$, also chosen to minimize the false positive rate.
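The divergence-based detector can be sketched as follows. This is an illustrative reading of the description, not MagNet's implementation: the function names, the use of plain lists for pre-softmax outputs, and the temperature and threshold values in the usage lines are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q); terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def jsd_detector(logits_x, logits_recon, temperature, threshold):
    """Report True ("adversarial") when the classifier's tempered outputs on
    the input and on its reconstruction diverge by more than the threshold."""
    p = softmax(logits_x, temperature)
    q = softmax(logits_recon, temperature)
    return jsd(p, q) > threshold


# Hypothetical values: identical outputs pass, strongly differing ones do not.
passes = jsd_detector([2.0, 0.0], [2.0, 0.0], temperature=10.0, threshold=0.01)
flagged = jsd_detector([10.0, 0.0], [0.0, 10.0], temperature=1.0, threshold=0.1)
```

With natural logarithms the divergence is bounded above by ln 2, so any useful threshold lies below that value.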

Similarly, to reform an adversarial example $x$, run it through the auto-encoder and let $x \leftarrow AE(x)$: since the auto-encoder is trained to reconstruct typical instances, this operation projects the input back to the data manifold.
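Putting the two components together, the detect-then-reform pipeline described above might look like the sketch below. It is a toy illustration under stated assumptions: auto-encoders and the classifier are passed in as plain Python callables, each detector is an (auto-encoder, threshold) pair, and `toy_autoencoder` and `toy_classifier` are hypothetical stand-ins for trained models.

```python
import random


def mse(x, y):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def detect_then_reform(x, detectors, reformers, classify, rng=random):
    """Return the predicted label, or None if any detector rejects x.

    detectors: list of (autoencoder, threshold) pairs; x is rejected when
               any reconstruction error exceeds its threshold.
    reformers: list of auto-encoders; one is picked at random per input.
    """
    for autoencoder, threshold in detectors:
        if mse(x, autoencoder(x)) > threshold:
            return None
    reformer = rng.choice(reformers)
    return classify(reformer(x))


def toy_autoencoder(x):
    """Hypothetical 'auto-encoder' that clips values into [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in x]


def toy_classifier(x):
    """Hypothetical two-class classifier: bright inputs are class 1."""
    return int(sum(x) > len(x) / 2)


rng = random.Random(0)
detectors = [(toy_autoencoder, 0.1)]
reformers = [toy_autoencoder]
accepted = detect_then_reform([0.2, 0.9], detectors, reformers, toy_classifier, rng)
rejected = detect_then_reform([3.0, 0.5], detectors, reformers, toy_classifier, rng)
```

The second input lies far from the clipped range, so its reconstruction error exceeds the threshold and it is rejected before classification.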


Efficient Defenses Against Adversarial Attacks [12] works by making two modifications to standard neural networks. First, the authors propose use of the Bounded ReLU activation function, defined as

$$\mathrm{BReLU}(x) = \min(\max(x, 0),\ u)$$

for a fixed upper bound $u$, instead of standard ReLU [9], which is unbounded above. Second, instead of training on the standard training data $(x, y)$, they train on $(x + \mathcal{N}(0, \sigma^2), y)$, where the noise is chosen fresh for each training instance. The standard deviation $\sigma$ is set separately for MNIST and for CIFAR. The authors claim that, despite training only on noise, the defense is successful against the attack of [1].
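Both modifications are simple to express in code. The sketch below is illustrative only: the default upper bound of 1.0 and the noise level in the usage lines are assumptions, not values taken from [12], and `augment` operates on plain nested lists where a real training loop would use tensors.

```python
import random


def brelu(x, upper=1.0):
    """Bounded ReLU: zero below 0, identity in between, capped at `upper`.

    The default bound of 1.0 is an assumption made for this example.
    """
    return min(max(x, 0.0), upper)


def augment(batch, sigma, rng):
    """Add fresh zero-mean Gaussian noise to every feature of every instance.

    Calling this once per training step draws new noise each time, so the
    network never sees the same perturbed copy of an instance twice.
    """
    return [[v + rng.gauss(0.0, sigma) for v in x] for x in batch]


rng = random.Random(0)
batch = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
noisy_first = augment(batch, sigma=0.1, rng=rng)   # hypothetical sigma
noisy_second = augment(batch, sigma=0.1, rng=rng)  # differs from the first draw
```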


APE-GAN [10] works by constructing a pre-processing network $G(\cdot)$ trained to project both normal instances and adversarial examples back to the data manifold, as done in MagNet. The network $G(\cdot)$ is trained with a GAN instead of an auto-encoder. Note that unlike a standard GAN, which takes as input a noise vector and must produce an output image, the generator in APE-GAN takes in an adversarial example and must make it appear non-adversarial. During training, the authors train on adversarial examples generated with the Fast Gradient Sign algorithm [3]; despite this, the authors claim robustness on a wide range of attacks (including [1]).


Defense Models. We take the MagNet implementation from the authors’ open-source code (https://github.com/Trevillie/MagNet) and train our own models. Since the provided code does not include an implementation of the CIFAR defense and classifier, we implement it as described in the paper.

We were unable to obtain source code for “Efficient Defenses…”. We therefore re-implement the proposed defense based on the description in the paper. We take the APE-GAN implementation from the authors’ open-source code (https://github.com/shenqixiaojiang/APE-GAN).

We use the C&W attack given by the authors. We also release our code (https://github.com/carlini/MagNet, https://github.com/carlini/breaking_efficient_defenses, https://github.com/carlini/APE-GAN).

3 Breaking MagNet

We demonstrate MagNet is not robust by generating transferable adversarial examples with Carlini and Wagner’s attack. To do this, we (as the adversary) train our own local auto-encoders to use as detectors and reformers (different than the 16 the defender will use). To generate adversarial examples, we modify the optimization formulation to the following, where the reformers are denoted $R_j$ and the detectors $D_j$:

$$\text{minimize } \|x' - x\|_2^2 + c \cdot \ell(x') + d \cdot \ell_d(x')$$

where the classifier loss function $\ell$ is defined as

$$\ell(x') = \sum_j \max\big(\max\{Z(R_j(x'))_i : i \ne t\} - Z(R_j(x'))_t,\ -\kappa\big)$$

and the detector loss function $\ell_d$ is defined as

$$\ell_d(x') = \sum_j \max\big(D_j(x') - \tau_j,\ 0\big)$$

Here, $c$ and $d$ are chosen via binary search simultaneously. The threshold $\tau_j$ is chosen such that a fixed fraction of validation images are detected as adversarial. (The true defender sets their remote detectors to reject a smaller fraction of instances [8]: ours is more strict to help the attack construct adversarial examples that fool the remote detectors more often.) We set the confidence $\kappa$ so that the adversarial examples are more likely to transfer.
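One plausible reading of this modified objective, written as a sketch and not as the released attack code, sums a targeted hinge loss over the adversary's local reformers and a hinge penalty over the local detectors. Everything named here is an assumption for illustration: `logits_fn` stands for the classifier's pre-softmax output, each detector is a (score function, threshold) pair, and the toy models in the usage lines are hypothetical.

```python
def targeted_hinge(logits, target, kappa=0.0):
    """max(max_{i != t} Z_i - Z_t, -kappa) on a list of pre-softmax outputs."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def grey_box_objective(x_adv, x, target, logits_fn, reformers, detectors,
                       c, d, kappa=0.0):
    """Squared L2 distortion + c * classifier loss + d * detector loss.

    reformers: local auto-encoders applied before classification.
    detectors: (score_fn, threshold) pairs; a detector contributes only
               while its score on x_adv exceeds its threshold.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    classifier_loss = sum(
        targeted_hinge(logits_fn(reform(x_adv)), target, kappa)
        for reform in reformers
    )
    detector_loss = sum(
        max(score(x_adv) - tau, 0.0) for score, tau in detectors
    )
    return dist_sq + c * classifier_loss + d * detector_loss


# Toy usage: identity reformers, inputs reused as logits, one detector.
identity = lambda v: list(v)
detectors = [(lambda v: sum(abs(a) for a in v), 0.5)]
value = grey_box_objective([1.0, 0.0], [0.0, 0.0], target=1,
                           logits_fn=identity,
                           reformers=[identity, identity],
                           detectors=detectors, c=1.0, d=2.0)
```

Summing over several local models is what pushes the optimizer toward examples that fool all of them at once, which is the property that makes transfer to the defender's unseen models more likely.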

Dataset   Model       Success   Distortion ($L_2$)
MNIST     Unsecured   100%      1.64
          MagNet      99%       2.25
CIFAR     Unsecured   100%      0.30
          MagNet      100%      0.45

Table 1: The success rate of our attack on MagNet. The last column shows the mean distance to the nearest targeted adversarial example, across the first 1000 test instances, with the target chosen uniformly at random from the incorrect classes.

We attack by performing iterations of gradient descent. We did not perform a hyperparameter search (e.g., over the number of local auto-encoders or the other attack constants); improved search over these parameters would yield lower-distortion adversarial examples.

Figures 1 and 2 contain images of targeted adversarial examples on the secured network, and Table 1 reports the mean distortion required across the first 1000 instances of the test set with targets chosen uniformly at random among the incorrect classes.

Figure 1: MagNet targeted adversarial examples for each source/target pair of images on MNIST. We achieve a 99% grey-box success rate (one attack failed to transfer).
Figure 2: MagNet targeted adversarial examples for each source/target pair of images on CIFAR. We achieve a 100% grey-box success rate.

4 Breaking “Efficient Defenses…”

We demonstrate this defense is not robust by generating adversarial examples with Carlini and Wagner’s attack. We do nothing more than apply the attack to the defended network.

Figure 3 contains images of adversarial examples on the secured network, and Table 2 reports the mean distortion required across the first 1000 instances of the test set with targets chosen at random among the incorrect classes.

Dataset   Model                    Distortion ($L_2$)
MNIST     Unsecured                2.04
          BReLU                    2.14
          Gaussian Noise           2.66
          Gaussian Noise + BReLU   2.58
CIFAR     Unsecured                0.56
          BReLU                    0.58
          Gaussian Noise           0.66
          Gaussian Noise + BReLU   0.67

Table 2: Neither adding Gaussian data augmentation during training nor using the BReLU activation significantly increases robustness to adversarial examples on the MNIST or CIFAR-10 datasets; success rate is always 100%.

On MNIST, the full defense increases the mean distance to the nearest adversarial example from 2.04 to 2.58, and on CIFAR from 0.56 to 0.67 (Table 2). This is in contrast with other forms of retraining, such as adversarial retraining [7], which increase distortion by a significantly larger amount. Interestingly, we find that BReLU provides some increase in distortion when trained without Gaussian augmentation, but does not help when trained with it.
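The relative increases implied by Table 2 can be computed directly. The short calculation below is a worked example that uses only the Table 2 values for the unsecured network and for the full defense; the resulting percentages are derived here and are not quoted from the text.

```python
# Mean L2 distortion from Table 2: (unsecured, Gaussian Noise + BReLU).
table2 = {
    "MNIST": (2.04, 2.58),
    "CIFAR": (0.56, 0.67),
}


def relative_increase(baseline, defended):
    """Fractional increase of the defended distortion over the baseline."""
    return (defended - baseline) / baseline


increases = {
    dataset: relative_increase(baseline, defended)
    for dataset, (baseline, defended) in table2.items()
}

for dataset, value in increases.items():
    print(f"{dataset}: {100 * value:.1f}% increase in mean distortion")
```

By this calculation the full defense raises the required distortion by roughly a quarter on MNIST and a fifth on CIFAR.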

5 Breaking APE-GAN

We demonstrate APE-GAN is not robust by generating adversarial examples with Carlini and Wagner’s attack. We do nothing more than apply the attack to the defended network. That is, we change the loss function to account for the fact that the manifold projection is done before classification. Specifically, we let

$$\ell(x') = \max\big(\max\{Z(G(x'))_i : i \ne t\} - Z(G(x'))_t,\ -\kappa\big)$$

and solve the same minimization formulation.
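The only change relative to attacking an undefended classifier is that the loss is evaluated on the generator's output. A minimal sketch, assuming the generator and the classifier's pre-softmax function are available as callables (`toy_generator` and `toy_logits` are hypothetical stand-ins, not APE-GAN models):

```python
def targeted_hinge(logits, target, kappa=0.0):
    """max(max_{i != t} Z_i - Z_t, -kappa) on a list of pre-softmax outputs."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def projected_loss(x_adv, target, generator, logits_fn, kappa=0.0):
    """Loss on the composed pipeline: classify generator(x_adv), not x_adv."""
    return targeted_hinge(logits_fn(generator(x_adv)), target, kappa)


def toy_generator(x):
    """Hypothetical pre-processing network: halves every value."""
    return [0.5 * v for v in x]


def toy_logits(x):
    """Hypothetical two-class pre-softmax outputs."""
    return [x[0] - x[1], x[1] - x[0]]


not_yet_fooled = projected_loss([4.0, 0.0], 1, toy_generator, toy_logits)
already_fooled = projected_loss([4.0, 0.0], 0, toy_generator, toy_logits, kappa=1.0)
```

Because the attack differentiates through the generator as well as the classifier, the generator offers no obstacle beyond being one more layer in the model under attack.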

Figure 4 contains images of adversarial examples on APE-GAN, and Table 3 reports the mean distortion required across the first 1000 instances of the test set with targets chosen at random among the incorrect classes.


Investigating APE-GAN’s Failure. Why are we able to fool APE-GAN? We compare (a) the mean distance between the original inputs and the adversarial examples, and (b) the mean distance between the original inputs and the recovered adversarial examples. We find that the recovered adversarial examples are less similar to the originals than the adversarial examples are: the mean distortion between the recovered instances and the original instances is larger than the mean distortion between the adversarial examples and the original instances.

This indicates that our adversarial examples fool the generator into giving reconstructions that are even less similar to the original than the adversarial example is. This effect can be observed in Figure 4: faint lines introduced by the attack become more pronounced after reconstruction.
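The comparison of (a) and (b) amounts to two mean $L_2$ distances. The sketch below shows the computation on toy data; `toy_amplifying_generator` is a hypothetical stand-in that exaggerates its input, chosen only so the example reproduces the qualitative effect, and the numbers carry no relation to the paper's measurements.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_l2(xs, ys):
    """Mean pairwise L2 distance between two aligned lists of vectors."""
    return sum(l2_distance(x, y) for x, y in zip(xs, ys)) / len(xs)


def compare_distortions(originals, adversarials, generator):
    """Return (a) original-to-adversarial and (b) original-to-recovered means."""
    recovered = [generator(x) for x in adversarials]
    return mean_l2(originals, adversarials), mean_l2(originals, recovered)


def toy_amplifying_generator(x):
    """Hypothetical generator that doubles whatever perturbation it sees."""
    return [2.0 * v for v in x]


dist_adv, dist_recovered = compare_distortions(
    [[0.0, 0.0]], [[3.0, 4.0]], toy_amplifying_generator
)
```

A defense that projected inputs back toward the originals would give (b) smaller than (a); observing the opposite is the signature of the failure described above.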

Dataset   Model       Success   Distortion ($L_2$)
MNIST     Unsecured             2.04
          APE-GAN               2.17
CIFAR     Unsecured             0.43
          APE-GAN               0.72

Table 3: APE-GAN does not significantly increase robustness to adversarial examples on the MNIST or CIFAR-10 datasets.

6 Conclusion

As this short paper demonstrates, MagNet is not robust to transferable adversarial examples, and combining Gaussian data augmentation and BReLU activations does not significantly increase the robustness of a neural network against strong iterative attacks. Surprisingly, we found that while all three defenses take different approaches to increasing the robustness against adversarial examples, they all give approximately the same increase in robustness.

We recommend that researchers who propose defenses attempt adaptive white-box attacks against their schemes before claiming robustness. Or, if arguing security in the grey-box threat model, we recommend researchers generate adversarial examples targeting the specific defense by using a copy of that defense as the source model. The fact that the adversary is not aware of the exact parameters does not mean the best that can be done is to transfer from an unsecured model: as we have shown here, transferring from a local copy of the model improves the attack success rate.

Figure 3: Attacks on “Efficient Defenses…” on MNIST and CIFAR-10: (a) original reference image; (b) adversarial example on the defense with only BReLU; (c) adversarial example on the complete defense with Gaussian noise and BReLU.
Figure 4: Attacks on APE-GAN on MNIST and CIFAR-10: (a) original reference image; (b) adversarial example on APE-GAN; (c) reconstructed adversarial example.

References