MagNet and “Efficient Defenses…” were recently proposed as defenses to adversarial examples. We find that we can construct adversarial examples that defeat these defenses with only a slight increase in distortion.
It is an open question how to train neural networks so they will be robust to adversarial examples. Recently, three defenses have been proposed to make neural networks robust to adversarial examples:
MagNet was proposed as an approach to make neural networks robust against adversarial examples through two complementary mechanisms: adversarial examples near the data manifold are reformed to lie on the data manifold (so that they are classified correctly), whereas adversarial examples far away from the data manifold are detected and rejected before classification. MagNet does not argue robustness in the white-box setting; rather, the authors argue that MagNet is robust in the grey-box setting, where the adversary is aware the defense is in place and knows the parameters of the base classifier, but not the parameters of the defense.
“Efficient Defenses Against Adversarial Attacks” proposes two defenses applied in combination: replacing the standard ReLU activation with a bounded variant, and training on instances augmented with Gaussian noise.
Adversarial Perturbation Elimination GAN (APE-GAN) is similar to MagNet, except that adversarial examples are projected onto the data manifold using a Generative Adversarial Network (GAN). We did not set out to bypass this defense, but found it to be very similar to MagNet and so we analyze it too.
In this short paper, we demonstrate that these three defenses are not effective on the MNIST and CIFAR-10 datasets. We show that we are able to bypass MagNet with greater than 99% success, and the latter two with 100% success, with only a slight increase in distortion.
We defeat MagNet by making use of the transferability  property of adversarial examples: the adversary trains their own copy of the defense, constructs adversarial examples on their model, and supplies these adversarial examples to the defender. It turns out that these examples will also fool the defender’s model.
We defeat “Efficient Defenses Against Adversarial Attacks” and APE-GAN by showing that existing attacks defeat them with 100% success without modification. The resulting adversarial examples are no more visually detectable than those on an undefended network.
We assume familiarity with neural networks, adversarial examples, transferability, strong attacks for generating adversarial examples, and MagNet. We briefly review the key details and notation.
Notation. Let $F$ denote a neural network with logits $Z$, so that $F(x) = \mathrm{softmax}(Z(x))$. Each output $F(x)_i$ corresponds to the predicted probability that the object $x$ is labelled as class $i$. Let $C(x) = \arg\max_i F(x)_i$ correspond to the classification of $x$ on $F$. In this paper we are concerned with neural networks used to classify images (on MNIST and CIFAR-10).
Adversarial examples are instances $x'$ that are very close to a normal instance $x$ with respect to some distance metric ($\ell_2$ distance, in this paper), but where $C(x') = t \neq C(x)$ for any target $t$ chosen by the adversary.
We generate adversarial examples with Carlini and Wagner’s attack algorithm. Specifically, we solve
$$\text{minimize}\quad \|x' - x\|_2^2 + c \cdot \ell(x')$$
where the loss function $\ell$ is defined as
$$\ell(x') = \max\big(\max\{Z(x')_i : i \neq t\} - Z(x')_t,\; -\kappa\big)$$
and the constant $c$ is chosen via binary search.
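As a concrete illustration, here is a minimal NumPy sketch of the C&W margin loss on the logits $Z(x')$ (all names are ours, not from the original implementation):

```python
import numpy as np

def cw_loss(logits, target, kappa=0.0):
    """C&W margin loss: positive while class `target` does not yet have
    the highest logit; once it does, the loss bottoms out at -kappa,
    which controls the confidence of the adversarial example."""
    logits = np.asarray(logits, dtype=float)
    other = np.max(np.delete(logits, target))  # best non-target logit
    return max(other - logits[target], -kappa)
```

Minimizing this loss by gradient descent on $x'$ pushes the target class toward the top of the logit ordering while the $\ell_2$ term keeps $x'$ close to $x$.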
Perhaps the most surprising property of adversarial examples is transferability: given two different models that perform the same task, trained on different datasets, adversarial examples constructed to fool one model often fool the other as well. This effect can be amplified by constructing the adversarial examples to fool multiple local models before applying them to the target model.
MagNet  was recently proposed as a defense to adversarial examples under the grey box threat model: here, the adversary is aware that MagNet is being used to secure a given classifier, as well as the parameters of the classifier, but is not aware of the parameters of the defensive models. MagNet has two defense components:
The detector examines inputs and decides if they are adversarial or benign. There can be multiple detectors, in which case if any report “adversarial” the input is rejected.
The reformer takes inputs classified as benign by the detector(s) and reforms them to remove any remaining minor adversarial nature to the instances. The defender trains multiple reformers and selects one at random, to prevent the adversary from learning its parameters.
The authors argue these defenses complement each other: the detector should be able to detect any adversarial examples that have large distortion and are far away from the true data manifold. The reformer takes instances which are close to the data manifold and maps them directly onto the manifold, removing any minor adversarial perturbations that may not have been detected.
The specific instantiations of the detector and reformer suggested by the authors are based on auto-encoders. Given an auto-encoder $AE$, to detect if an input $x$ is adversarial, compare the mean squared reconstruction error $\|AE(x) - x\|_2^2$ with a threshold $\tau$ chosen to minimize the false positive rate. On CIFAR, the authors suggest one further detector: compute the Jensen–Shannon divergence between the softmax of the classifier’s logits at temperature $T$ (set to $T=10$ and $T=40$) on $x$ and on $AE(x)$, and report “adversarial” if the divergence is greater than a different threshold, also chosen to minimize the false positive rate.
Similarly, to reform an input $x$, run it through the auto-encoder and let $x \leftarrow AE(x)$: since the auto-encoder is trained to reconstruct typical instances, this operation projects $x$ back onto the data manifold.
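The detector/reformer pair can be sketched in a few lines of NumPy (our own naming; `autoencoder` is any callable standing in for $AE$):

```python
import numpy as np

def detect(autoencoder, x, tau):
    """MagNet-style detector (sketch): flag x as adversarial when the
    mean squared reconstruction error exceeds the threshold tau."""
    return np.mean((autoencoder(x) - x) ** 2) > tau

def reform(autoencoder, x):
    """MagNet-style reformer (sketch): push x back toward the data
    manifold by replacing it with its reconstruction."""
    return autoencoder(x)
```

Benign inputs reconstruct well and pass through nearly unchanged; inputs far from the manifold reconstruct poorly and trip the threshold.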
Efficient Defenses Against Adversarial Attacks 
works by making two modifications to standard neural networks. First, the authors propose use of the Bounded ReLU activation function, defined as $\mathrm{BReLU}(x) = \min(\max(x, 0), t)$ for a fixed cutoff $t$, instead of the standard ReLU, which is unbounded above. Second, instead of training on the standard training data $x$ they train on $x + \mathcal{N}(0, \sigma^2)$, where the noise is chosen fresh for each training instance ($\sigma = 0.3$ on MNIST; $\sigma = 0.05$ on CIFAR). The authors claim that despite training only on random noise (and not adversarial examples), the defense is effective against strong attacks.
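Both modifications are simple to state in NumPy (a sketch with our own naming; `sigma` is the dataset-dependent noise level):

```python
import numpy as np

def brelu(x, t=1.0):
    """Bounded ReLU: zero below 0, linear up to the cutoff t, flat above."""
    return np.clip(x, 0.0, t)

def gaussian_augment(x, sigma, rng):
    """Training-time augmentation: add fresh Gaussian noise to each
    training instance before it is fed to the network."""
    return x + rng.normal(0.0, sigma, size=np.shape(x))
```

The intent is that bounding the activations limits how much a small input perturbation can be amplified, while the noise augmentation teaches the network to classify slightly perturbed inputs correctly.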
APE-GAN works by constructing a pre-processing network trained to project both normal instances and adversarial examples back to the data manifold, as done in MagNet. The network is trained with a GAN instead of an auto-encoder. Note that unlike a standard GAN, whose generator takes as input a noise vector and must produce an output image, the generator in APE-GAN takes in an adversarial example and must make it appear non-adversarial. During training, the authors train on adversarial examples generated with the Fast Gradient Sign algorithm; despite this, the authors claim robustness against a wide range of attacks (including Carlini and Wagner’s attack).
Defense Models. We take the MagNet implementation from the authors’ open-source code (https://github.com/Trevillie/MagNet) and train our own models. Since the provided code does not include an implementation of the CIFAR defense and classifier, we implement it as described in the paper.
We were unable to obtain source code for “Efficient Defenses…”. We therefore re-implement the proposed defense based on the description in the paper. We take the APE-GAN implementation from the authors’ open-source code (https://github.com/shenqixiaojiang/APE-GAN).
We use the C&W attack implementation released by its authors. We also release our code (https://github.com/carlini/MagNet, https://github.com/carlini/breaking_efficient_defenses, https://github.com/carlini/APE-GAN).
3 Breaking MagNet
We demonstrate MagNet is not robust by generating transferable adversarial examples with Carlini and Wagner’s attack. To do this, we (as the adversary) train local auto-encoders to use as detectors and reformers (different than the 16 the defender will use). To generate adversarial examples, we modify the optimization formulation to the following, where the reformers are denoted $R_j$ and the detectors $D_i$:
$$\text{minimize}\quad \|x' - x\|_2^2 + c \cdot \sum_j \ell_j(x') + d \cdot \sum_i g_i(x')$$
where the classifier loss function $\ell_j$ is defined as
$$\ell_j(x') = \max\big(\max\{Z(R_j(x'))_i : i \neq t\} - Z(R_j(x'))_t,\; -\kappa\big)$$
and the detector loss function $g_i$ is defined as
$$g_i(x') = \max\big(\|D_i(x') - x'\|_2^2 - \tau,\; 0\big).$$
Here, $c$ and $d$ are chosen via binary search simultaneously. The threshold $\tau$ is chosen such that a fixed fraction of validation images are detected as adversarial. (The true defender sets their remote $\tau$ to reject a smaller fraction of instances: ours is more strict, to help the attack construct adversarial examples that fool the remote detectors more often.) We set the confidence $\kappa$ to a large value so that the adversarial examples are more likely to transfer.
We attack by performing many iterations of gradient descent with a fixed learning rate. We did not perform hyperparameter search (e.g., over the number of local auto-encoders, $c$, and $d$); improved search over these parameters would yield lower-distortion adversarial examples.
Figures 1 and 2 contain images of targeted adversarial examples on the secured network, and Table 1 the mean distortion required across the first 1000 instances of the test set, with targets chosen uniformly at random among the incorrect classes.
4 Breaking “Efficient Defenses…”
We demonstrate this defense is not robust by generating adversarial examples with Carlini and Wagner’s attack. We do nothing more than apply the attack to the defended network.
Figure 3 contains images of adversarial examples on the secured network, and Table 2 the mean distortion required across the first 1000 instances of the test set with targets chosen at random among the incorrect classes.
[Table 2: mean distortion required against each defense configuration; the full defense (Gaussian Noise + BReLU) requires mean distortion 2.58 on MNIST and 0.67 on CIFAR-10.]
On MNIST, the full defense increases the mean distance to the nearest adversarial example only slightly, and similarly on CIFAR. This is in contrast with other forms of retraining, such as adversarial retraining, which increase distortion by a significantly larger amount. Interestingly, we find that BReLU provides some increase in distortion when trained without Gaussian augmentation, but when trained with it, does not help.
5 Breaking APE-GAN
We demonstrate APE-GAN is not robust by generating adversarial examples with Carlini and Wagner’s attack. We do nothing more than apply the attack to the defended network; that is, we only change the loss function to account for the fact that the manifold projection is done before classification. Specifically, letting $G$ denote APE-GAN’s generator, we let
$$\ell(x') = \max\big(\max\{Z(G(x'))_i : i \neq t\} - Z(G(x'))_t,\; -\kappa\big)$$
and solve the same minimization formulation.
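Concretely, the only change to the attack is to evaluate the margin loss after the projection step. A minimal NumPy sketch (our own naming; `generator` stands in for $G$ and `logits_fn` for $Z$):

```python
import numpy as np

def apegan_attack_loss(x_adv, generator, logits_fn, target, kappa=0.0):
    """C&W margin loss evaluated on the classifier's logits *after*
    APE-GAN's generator, so gradient descent accounts for the
    manifold projection (sketch)."""
    z = np.asarray(logits_fn(generator(x_adv)), dtype=float)
    other = np.max(np.delete(z, target))
    return max(other - z[target], -kappa)
```

Because the generator is differentiable, gradients flow through it just as through any other layer of the network.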
Figure 4 contains images of adversarial examples on APE-GAN, and Table 3 the mean distortion required across the first 1000 instances of the test set with targets chosen at random among the incorrect classes.
Investigating APE-GAN’s Failure. Why are we able to fool APE-GAN? We compare (a) the mean distance between the original inputs and the adversarial examples, and (b) the mean distance between the original inputs and the recovered adversarial examples. We find that the recovered adversarial examples are less similar to the originals than the adversarial examples themselves: the mean distortion between the recovered instances and the original instances exceeds the mean distortion between the adversarial examples and the original instances.
This indicates that our adversarial examples fool the generator into producing reconstructions that are even less similar to the original than the adversarial example itself. This effect can be observed in Figure 4: faint lines introduced by the attack become more pronounced after reconstruction.
As this short paper demonstrates, MagNet is not robust to transferable adversarial examples, and combining Gaussian data augmentation and BReLU activations does not significantly increase the robustness of a neural network against strong iterative attacks. Surprisingly, we found that while all three defenses take different approaches to increasing the robustness against adversarial examples, they all give approximately the same, small increase in robustness.
We recommend that researchers who propose defenses attempt adaptive white-box attacks against their schemes before claiming robustness. Or, if arguing security in the grey-box threat model, we recommend researchers generate adversarial examples targeting the specific defense by using a copy of that defense as the source model. Just because the adversary is not aware of the exact parameters does not mean the best that can be done is to transfer from an unsecured model: as we have shown here, transferring from a local copy of the defense improves the attack success rate.
- Carlini and Wagner [2017] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy, 2017.
- Goodfellow et al. [2014a] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014a.
- Goodfellow et al. [2014b] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
- Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Liu et al. [2016] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
- Madry et al. [2017] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Meng and Chen [2017] D. Meng and H. Chen. MagNet: a two-pronged defense against adversarial examples. In ACM Conference on Computer and Communications Security (CCS), 2017. arXiv preprint arXiv:1705.09064.
- Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
- Shen et al. [2017] S. Shen, G. Jin, K. Gao, and Y. Zhang. APE-GAN: Adversarial Perturbation Elimination with GAN. arXiv preprint arXiv:1707.05474, 2017.
- Szegedy et al. [2013] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ICLR, 2013.
- Zantedeschi et al. [2017] V. Zantedeschi, M.-I. Nicolae, and A. Rawat. Efficient defenses against adversarial attacks. arXiv preprint arXiv:1707.06728, 2017.