Adversarial Detection and Correction by Matching Prediction Distributions

02/21/2020 · by Giovanni Vacanti, et al.

We present a novel adversarial detection and correction method for machine learning classifiers. The detector consists of an autoencoder trained with a custom loss function based on the Kullback-Leibler divergence between the classifier predictions on the original and reconstructed instances. The method is unsupervised, easy to train and does not require any knowledge about the underlying attack. The detector almost completely neutralises powerful attacks like Carlini-Wagner or SLIDE on MNIST and Fashion-MNIST, and remains very effective on CIFAR-10 when the attack is granted full access to the classification model but not the defence. We show that our method is still able to detect the adversarial examples in the case of a white-box attack where the attacker has full knowledge of both the model and the defence and investigate the robustness of the attack. The method is very flexible and can also be used to detect common data corruptions and perturbations which negatively impact the model performance. We illustrate this capability on the CIFAR-10-C dataset.


1 Introduction

Adversarial examples (Szegedy et al., 2013) are instances which are carefully crafted by applying small perturbations to the original data with the goal of tricking the machine learning classifier into changing the predicted class. As a result, the classification model makes erroneous predictions, which poses severe security issues for the deployment of machine learning systems in the real world. Achieving an acceptable level of security against adversarial attacks is a milestone that must be reached in order to trust and act on the predictions of safety-critical machine learning systems at scale. The issue affects many emerging technologies in which machine learning models play a prominent role. For example, an attacker could craft adversarial images to induce an autonomous driving system to interpret a STOP sign as a RIGHT OF WAY sign and compromise the safety of the autonomous vehicle.

Given the crucial importance of the subject, a number of proposals claiming to provide valid defences against adversarial attacks have been put forth in recent years (see Section 2 for more details). Some of these proposals have obtained promising results on basic benchmark datasets for grey-box attacks, i.e. attacks where the attacker has full knowledge of the model but not of the defence system. However, even in the grey-box scenario most of these approaches fail to generalise to more complex datasets or are impractical to deploy. Moreover, effective and practical defences against white-box attacks, i.e. attacks where the attacker has full knowledge of both the model and the defence mechanism, are still out of reach.

We argue that autoencoders trained with loss functions based on a distance metric between the input data and the instances reconstructed by the autoencoder network are flawed for the task of adversarial detection since they do not take the goal of the attacker into account. The attack applies near-imperceptible perturbations to the input which change the class predicted by the classifier. Since the impact of the attack is most prominent in the model output space, this is where the defence should focus during training. We propose a novel method for adversarial detection and correction based on an autoencoder network with a model-dependent loss function designed to match the prediction probability distributions of the original and reconstructed instances. The output of the autoencoder can be seen as a symmetric example since it is crafted to mimic the prediction distribution produced by the classifier on the original instance and no longer contains the adversarial artefacts. We also define an adversarial score based on the mismatch between the prediction distributions of the classifier on an instance and its symmetric counterpart. This score is highly effective for detecting both grey-box and white-box attacks. The defence mechanism is trained in an unsupervised fashion and does not require any knowledge about the underlying attack. Because the method is designed to extract knowledge from the classifier's output probability distribution, it bears some resemblance to defensive distillation (Papernot et al., 2016).

Besides detecting malicious adversarial attacks, the adversarial score also proves to be an effective measure for more common data corruptions and perturbations which degrade the machine learning model’s performance. Our method is in principle applicable to any machine learning classifier vulnerable to adversarial attacks, regardless of the modality of the data.

In the following, Section 2 gives a brief summary of current developments in the field of adversarial defence. In Section 3 we describe our method in more detail while Section 4 discusses the results of our experiments. We validate our method against a variety of state-of-the-art grey-box and white-box attacks on the MNIST (LeCun and Cortes, 2010), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky, 2009) datasets. We also evaluate our method as a data drift detector on the CIFAR-10-C dataset (Hendrycks and Dietterich, 2019).

2 Related Work

2.1 Adversarial attacks

Since the discovery of adversarial examples (Szegedy et al., 2013), a variety of methods have been proposed to generate such instances via adversarial attacks. The aim of the attack is to craft an instance $x_{adv}$ that changes the class predicted by the machine learning classifier without noticeably altering the original instance $x$. In other words, the attack tries to find the smallest perturbation $\delta$ such that the model predicts different classes for $x$ and $x_{adv} = x + \delta$.

If the attack is targeted, the classifier prediction on $x_{adv}$ is restricted to a predetermined class $t$. For an untargeted attack, any class apart from the one predicted on $x$ is sufficient for the attack to be successful. From here on we only consider untargeted attacks since they are less constrained and easier to carry out successfully. Formally, the attack tries to solve the following optimisation problem:

$$\min_{\delta}\; \|\delta\|_{p} \quad \text{s.t.} \quad \arg\max_{i} M(x+\delta)_{i} \neq \arg\max_{i} M(x)_{i}, \tag{1}$$

where $\|\cdot\|_{p}$ is the $\ell_{p}$ norm and $M(\cdot)$ represents the prediction probability vector of the classifier.

We validate our adversarial defence on three different attacks: Carlini-Wagner (Carlini and Wagner, 2016), SLIDE (Tramèr and Boneh, 2019) and the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015a). The Carlini-Wagner (C&W) and SLIDE attacks are very powerful and able to reduce the accuracy of machine learning classifiers on popular datasets such as CIFAR-10 to nearly 0% while keeping the adversarial instances visually indistinguishable from the original ones. FGSM on the other hand is a fast but less powerful attack, resulting in more obvious adversarial instances. Although many other attack methods like EAD (Chen et al., 2018), DeepFool (Moosavi-Dezfooli et al., 2016) or Iterative FGSM (Kurakin et al., 2017) exist, the scope covered by C&W, SLIDE and FGSM is sufficient to validate our defence.

Carlini-Wagner

We use the $\ell_{2}$ version of the C&W attack. The $\ell_{2}$-C&W attack approximates the minimisation problem given in Equation 1 as

$$\min_{\delta}\; \|\delta\|_{2}^{2} + c \cdot L(x+\delta), \tag{2}$$

where $L$ is a custom loss designed to be negative if and only if the class predicted by the model for $x+\delta$ is equal to the target class, $c$ is a constant balancing the two terms and $\|\cdot\|_{2}$ denotes the $\ell_{2}$ norm.

SLIDE

The SLIDE attack is an iterative attack based on the $\ell_{1}$ norm. Given the loss function $L$ of the classifier, at each iteration the gradients $g = \nabla_{x} L$ with respect to the input are calculated. The unit vector $e$ determines the direction of the perturbation and its components are updated according to

$$e_{i} = \operatorname{sign}(g_{i}) \cdot \mathbb{1}\big[\, |g_{i}| \geq P_{q}(|g|) \,\big], \tag{3}$$

where $P_{q}(|g|)$ represents the $q$-th percentile of the gradients' components. The perturbation $\delta$ is then updated as $\delta \leftarrow \delta + \alpha\, e$ where $\alpha$ is the step size of the attack.
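As a rough NumPy sketch of the update in Equation 3, assuming the gradients of the classifier loss have already been computed; the function and argument names are illustrative and the $\ell_{1}$ projection step is omitted:

```python
import numpy as np

def slide_step(delta, grads, q=80, alpha=0.05):
    # Keep only the sign of gradient components at or above the q-th
    # percentile of their absolute values; all other components are zeroed.
    threshold = np.percentile(np.abs(grads).ravel(), q)
    e = np.where(np.abs(grads) >= threshold, np.sign(grads), 0.0)
    # Step the perturbation in the sparse sign direction.
    return delta + alpha * e
```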

FGSM

The Fast Gradient Sign Method crafts a perturbation in the direction of the gradients of the model's loss function with respect to the input according to

$$x_{adv} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_{x} L(x, y)\big), \tag{4}$$

where $\epsilon$ is a small parameter fixing the size of the perturbation and $L$ is the loss function of the classifier.
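For illustration, a minimal TensorFlow sketch of the FGSM update in Equation 4 follows; the classifier is assumed to be a tf.keras model returning class probabilities, and the [0, 1] clipping range is an assumption about the input scaling.

```python
import tensorflow as tf

def fgsm_attack(model, x, y, eps=0.1):
    # Compute the loss gradient w.r.t. the input and step in its sign direction.
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # assumed [0, 1] input range
```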

2.2 Adversarial Defences

Different defence mechanisms have been developed to deal with the negative impact of adversarial attacks on the classifier’s performance. Adversarial training augments the training data with adversarial instances to increase the model robustness (Szegedy et al., 2013; Goodfellow et al., 2015b). Adversarial training tailored to a specific attack type can however leave the model vulnerable to other perturbation types (Tramèr and Boneh, 2019).

A second approach attempts to remove the adversarial artefacts from the example and feed the purified instance to the machine learning classifier. Defense-GAN (Samangouei et al., 2018) uses a Wasserstein GAN (Arjovsky et al., 2017) which is trained on the original data. The difference between the output $G(z)$ of the GAN's (Goodfellow et al., 2014) generator and the adversarial instance is minimised with respect to the latent variable $z$, and the generated instance is then fed to the classifier. Defense-GAN comes with a few drawbacks. GAN training can be notoriously unstable and suffer from issues like mode collapse, which would reduce the effectiveness of the defence. The method also needs to apply gradient descent optimisation steps with random restarts at inference time, making it computationally expensive. MagNet (Meng and Chen, 2017) uses one or more autoencoder-based detectors and reformers to respectively flag adversarial instances and transform the input data before feeding it to the classifier. The autoencoders are trained with the mean squared error (MSE) reconstruction loss, which is suboptimal for adversarial detection as it focuses on reconstructing the input without taking the decision boundaries into account. Other defences using autoencoders either require knowledge about the attacks (Li et al., 2019; Li and Ji, 2018) or class labels (Hwang et al., 2019). PixelDefend (Song et al., 2018) uses a generative model to purify the adversarial instance.

A third approach, defensive distillation (Papernot et al., 2016), utilises model distillation (Hinton et al., 2015) as a defence mechanism. A second classification model is trained by minimising the cross entropy between the class probabilities of the original classifier and the predictions of the distilled model. Defensive distillation reduces the impact of gradients used in crafting adversarial instances and increases the number of features that need to be changed. Although our method uses an autoencoder to mitigate the adversarial attack, it can also relate to defensive distillation since we optimise for the K-L divergence between the model predictions on the original and reconstructed instances.

3 Method

3.1 Threat Model

The threat model describes the capabilities and knowledge of the attacker. These assumptions are crucial to evaluate the defence mechanism and allow like-for-like comparisons.

The attack is only allowed to change the input features $x$ by a small perturbation $\delta$ such that the class predicted on $x_{adv} = x + \delta$ is different from the class predicted on $x$. Since we assume that the attack is untargeted, the prediction on $x_{adv}$ is not restricted to a predefined class $t$. The attack is, however, not allowed to modify the weights or architecture of the machine learning model. The most important part of the threat model is the attacker's knowledge about both the model and the defence. We consider three main categories:

Black-box attacks

This includes all attack types where the attacker only has access to the input and output of the classifier under attack. The output can either be the predicted class or the output probability distribution over all the classes.

Grey-box attacks

The attacker has full knowledge about the classification model but not the defence mechanism. This includes information about the model’s architecture, weights, loss function and gradients.

White-box attacks

On top of full knowledge about the classifier, the attacker also has complete access to the internals of the defence mechanism. This includes the logic of the defence and its loss function, as well as the model weights and gradients for a differentiable defence system. Security against white-box attacks implies security against the less powerful grey-box and black-box attacks.

We validate the strength of our proposed defence mechanism for a variety of grey-box and white-box attacks on the MNIST, Fashion-MNIST and CIFAR-10 datasets.

3.2 Defence Mechanism


Figure 1: Adversarial detection and correction examples on CIFAR-10 after a C&W attack. The first two columns show respectively the original and adversarial instances with the class predictions and adversarial scores. The last column visualises the reconstructed instance of the adversarial image by the autoencoder with the corrected prediction.

Our novel approach is based on an autoencoder network. An autoencoder consists of an encoder which maps vectors from the input space $\mathbb{R}^{D}$ to vectors in a latent space $\mathbb{R}^{d}$ with $d < D$, and a decoder which maps the latent vectors back to $\mathbb{R}^{D}$. The encoder and decoder are jointly trained to approximate an input transformation which is defined by the optimisation objective, or loss function.

There have been multiple attempts to employ conventionally trained autoencoders for adversarial example detection (Meng and Chen, 2017; Hwang et al., 2019; Li et al., 2019). Usually, autoencoders are trained to find a transformation that reconstructs the input instance as accurately as possible, with loss functions suited to capture the similarity between $x$ and $x' = \mathrm{AE}(x)$, such as the mean squared reconstruction error $\|x - x'\|_{2}^{2}$. However, these types of loss functions suffer from a fundamental flaw for the task of adversarial detection and correction. In essence, the attack tries to introduce a minimal perturbation $\delta$ in the input space while maximising its impact in the model output space, ensuring $\arg\max M(x_{adv}) \neq \arg\max M(x)$. If the autoencoder is trained with a reconstruction error loss, $\mathrm{AE}(x_{adv})$ will lie very close to $x_{adv}$ and will be sensitive to the same adversarial perturbation crafted around $x$. There is no guarantee that the transformation $\mathrm{AE}$ is able to remove the adversarial perturbation from the input, since the autoencoder's objective is only to reconstruct $x_{adv}$ as faithfully as possible in the input space.

The novelty of our proposal lies in the use of a model-dependent loss function based on a distance metric in the output space of the model to train the autoencoder network. Given a model $M$, we optimise the weights $\theta$ of an autoencoder $\mathrm{AE}_{\theta}$ using the following objective function:

$$\min_{\theta}\; D_{KL}\big(M(x) \,\|\, M(\mathrm{AE}_{\theta}(x))\big), \tag{5}$$

where $D_{KL}$ denotes the K-L divergence and $M(\cdot)$ represents the prediction probability vector of the classifier. Training of the autoencoder is unsupervised since we only need access to the model prediction probabilities and the normal training instances. The classifier weights are frozen during training. Note that the fundamental difference between our approach and other defence systems based on autoencoders lies in the fact that the minimisation objective captures similarities between instances in the output space of the model rather than in the input feature space.
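A minimal TensorFlow sketch of the objective in Equation 5, assuming `classifier` and `autoencoder` are tf.keras models returning softmax probabilities and reconstructions respectively:

```python
import tensorflow as tf

kld = tf.keras.losses.KLDivergence()

def adversarial_ae_loss(classifier, autoencoder, x):
    # D_KL(M(x) || M(AE(x))): match the classifier's prediction distribution
    # on the original instance and on its reconstruction.
    p_orig = classifier(x)                 # M(x), treated as the target
    p_recon = classifier(autoencoder(x))   # M(AE(x))
    return kld(p_orig, p_recon)
```

In a training step, gradients of this loss would be taken with respect to the autoencoder's variables only, keeping the classifier frozen.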

Without the presence of a reconstruction loss term like $\|x - \mathrm{AE}(x)\|_{2}^{2}$, the autoencoder simply tries to make sure that the prediction probabilities $M(x)$ and $M(\mathrm{AE}(x))$ match, without caring about the proximity of $\mathrm{AE}(x)$ to $x$. As a result, $\mathrm{AE}(x)$ is allowed to live in different areas of the input feature space than $x$, with different decision boundary shapes with respect to the model $M$. The carefully crafted adversarial perturbation, which is effective around $x$, does not transfer to the new location of $\mathrm{AE}(x_{adv})$ in the feature space, and the attack is therefore neutralised. This effect is visualised in Figure 1. The adversarial instance $x_{adv}$ is close to the original image $x$ in the pixel space, but the reconstruction $\mathrm{AE}(x_{adv})$ lives in a different region of the input space than $x$ and looks like noise at first glance. $\mathrm{AE}(x_{adv})$ does however not contain the adversarial artefacts anymore, and the model prediction $M(\mathrm{AE}(x_{adv}))$ returns the corrected class.

The adversarial instances can also be detected via the adversarial score $S(x)$:

$$S(x) = d\big(M(x),\, M(\mathrm{AE}(x))\big), \tag{6}$$

where $d$ is again a distance metric like the K-L divergence. $S(x)$ assumes high values for adversarial examples given the difference between the prediction distributions on the adversarial and reconstructed instances, making it a very effective measure for adversarial detection.


Figure 2: An input instance $x$ is transformed into $x' = \mathrm{AE}(x)$ by the autoencoder. The K-L divergence between the output distributions $M(x)$ and $M(x')$ is calculated and used as the loss function for training and as the adversarial signal at inference time. Based on an appropriate threshold $\theta$, the adversarial score is used to flag adversarial instances. If an instance is flagged, the correct prediction is retrieved through the transformed instance $x'$. Note that the model's weights are frozen during training.

Both detection and correction mechanisms can be combined in a simple yet effective adversarial defence system, illustrated in Figure 2. First the adversarial score $S(x)$ is computed. If the score exceeds a predefined threshold $\theta$, the instance is flagged as an adversarial example. Similar to MagNet, the threshold can be set using only normal data by limiting the false positive rate to a small fraction. The adversarial instance is then fed to the autoencoder, which computes the transformation $\mathrm{AE}(x)$, and the classifier finally makes a prediction on $\mathrm{AE}(x)$. If the adversarial score is below the threshold $\theta$, the model makes a prediction on the original instance $x$.
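A sketch of this detect-and-correct logic is given below; the per-instance K-L score is computed directly, the threshold is assumed to have been set on normal data, and the function name is illustrative rather than the authors' API.

```python
import tensorflow as tf

def detect_and_correct(classifier, autoencoder, x, threshold, eps=1e-12):
    x_recon = autoencoder(x)
    p_x, p_recon = classifier(x), classifier(x_recon)
    # Per-instance adversarial score S(x) = KL(M(x) || M(AE(x))).
    score = tf.reduce_sum(p_x * tf.math.log((p_x + eps) / (p_recon + eps)), axis=-1)
    flagged = score > threshold
    # Predict on the reconstruction when flagged, on the original otherwise.
    probs = tf.where(flagged[:, None], p_recon, p_x)
    return tf.argmax(probs, axis=-1), score, flagged
```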

The method is also well suited for drift detection, i.e. for the detection of corrupted or perturbed instances that are not necessarily adversarial by nature but degrade the model performance.

3.3 Method Extensions

The performance of the correction mechanism can be improved by extending the training methodology to one of the hidden layers of the classifier. We extract a flattened feature map $h_{l}(x)$ from the hidden layer $l$, feed it into a linear layer with weights $W$ and apply the softmax function:

$$M_{hl}(x) = \operatorname{softmax}\big(W \operatorname{flatten}(h_{l}(x))\big). \tag{7}$$

The autoencoder is then trained by optimising

$$\min_{\theta}\; D_{KL}\big(M(x) \,\|\, M(\mathrm{AE}_{\theta}(x))\big) + D_{KL}\big(M_{hl}(x) \,\|\, M_{hl}(\mathrm{AE}_{\theta}(x))\big). \tag{8}$$

During training of $\mathrm{AE}_{\theta}$, the K-L divergence between the model predictions on $x$ and on $\mathrm{AE}_{\theta}(x)$ is minimised. If the entropy of the model's softmax output is high, it becomes harder for the autoencoder to learn clear decision boundaries. In this case, it can be beneficial to sharpen the model's prediction probabilities through temperature scaling:

$$\tilde{M}(x)_{i} = \frac{M(x)_{i}^{1/T}}{\sum_{j} M(x)_{j}^{1/T}}. \tag{9}$$

The loss to minimise becomes $D_{KL}\big(\tilde{M}(x) \,\|\, M(\mathrm{AE}_{\theta}(x))\big)$. The temperature $T$ itself can be tuned on a validation set.
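A short sketch of the sharpening step in Equation 9, assuming the classifier outputs probabilities and a temperature $T < 1$ tuned on a validation set:

```python
import tensorflow as tf

def sharpen(probs, temperature=0.5):
    # Raise probabilities to the power 1/T and renormalise; T < 1 sharpens
    # the distribution used as the K-L divergence target during training.
    scaled = tf.pow(probs, 1.0 / temperature)
    return scaled / tf.reduce_sum(scaled, axis=-1, keepdims=True)
```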

4 Experiments

4.1 Experimental Setup

The adversarial attack experiments are conducted on the MNIST, Fashion-MNIST and CIFAR-10 datasets. The autoencoder architecture is similar across the datasets and consists of 3 convolutional layers in both the encoder and decoder. The MNIST and Fashion-MNIST classifiers are also similar and reach test set accuracies of respectively 99.28% and 93.62%. For CIFAR-10, we train both a stronger ResNet-56 (He et al., 2016) model up to 93.15% accuracy and a weaker model which achieves 80.24% accuracy on the test set. More details about the exact architecture and training procedure of the different models can be found in the appendix.

The defence mechanism is tested against Carlini-Wagner (C&W), SLIDE and FGSM attacks with varying perturbation strength $\epsilon$. The attack hyperparameters and examples of adversarial instances for each attack can be found in the appendix. The attacks are generated using the open source Foolbox library (Rauber et al., 2017).

We consider two settings under which the attacks take place: grey-box and white-box. Grey-box attacks have full knowledge of the classification model but not of the adversarial defence. White-box attacks on the other hand assume full knowledge of both the model and the defence and can propagate gradients through both. As a result, white-box attacks try to fool the combined pipeline of autoencoder and classifier, $M(\mathrm{AE}(\cdot))$.
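For clarity, a brief sketch of the two attack targets: grey-box attacks are generated against the classifier alone, while white-box attacks can differentiate through the composed defence pipeline; the function and variable names are illustrative.

```python
import tensorflow as tf

def whitebox_target(autoencoder: tf.keras.Model, classifier: tf.keras.Model) -> tf.keras.Model:
    # Grey-box attacks only see `classifier`; white-box attacks can attack
    # the fully differentiable composition M(AE(x)) returned here.
    return tf.keras.Sequential([autoencoder, classifier])
```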

4.2 Grey-Box Attacks


Figure 3: The images reconstructed by the adversarial autoencoder (bottom row) correct classifier mistakes on MNIST.

Table 1: Test set accuracy for the MNIST classifier on the original and adversarial instances (no attack, C&W, SLIDE and FGSM at three perturbation strengths), with and without the defence. The defence columns correspond to autoencoders trained with the MSE and the K-L divergence loss functions respectively.

Table 2: Test set accuracy for the Fashion-MNIST classifier on the original and adversarial instances (no attack, C&W, SLIDE and FGSM at three perturbation strengths), with and without the defence. The defence columns correspond to autoencoders trained with the MSE and the K-L divergence loss functions respectively.

Mitigating adversarial attacks on classification tasks consists of two steps: detection and correction. Table 1 to Table 4 highlight the consistently strong performance of the correction mechanism across the different datasets for various attack types. On MNIST, strong attacks like C&W and SLIDE, which reduce the model accuracy to almost 0%, are neutralised by the defence, nearly recovering the original accuracy of 99.28%. It is also remarkable that when we evaluate the accuracy of the predictions $M(\mathrm{AE}(x))$ on the original test set, it surpasses the accuracy of the predictions $M(x)$. Figure 3 shows a few examples of instances that are corrected in this way, together with their reconstructions $\mathrm{AE}(x)$. The corrected instances are outliers in the pixel space for which the autoencoder manages to capture the decision boundary. The correction accuracy drops slightly for FGSM attacks when the perturbation strength $\epsilon$ is increased. Higher values of $\epsilon$ lead to noisier adversarial instances which are easy to spot with the naked eye and result in higher adversarial scores. The results on Fashion-MNIST follow the same narrative: the adversarial correction mechanism largely restores the model accuracy after powerful attacks which drastically reduce the accuracy of the undefended model.


Table 3: CIFAR-10 test set accuracy for a simple CNN classifier on the original and adversarial instances (no attack, C&W, SLIDE and FGSM at three perturbation strengths), with and without the defence. The defence columns correspond to autoencoders trained with the MSE and the K-L divergence loss functions respectively; the strongest variant additionally includes temperature scaling and extends the methodology to one of the hidden layers.

Table 4: CIFAR-10 test set accuracy for a ResNet-56 classifier on the original and adversarial instances (no attack, C&W, SLIDE and FGSM at three perturbation strengths), with and without the defence. The defence columns correspond to autoencoders trained with the MSE and the K-L divergence loss functions respectively; the strongest variant additionally includes temperature scaling and extends the methodology to one of the hidden layers.

Figure 4: Entropy of the classifier predictions on the test set for MNIST, Fashion-MNIST and CIFAR-10.

The classification accuracy uplift on MNIST and Fashion-MNIST from training the autoencoder with the K-L divergence loss instead of the mean squared error between $x$ and $\mathrm{AE}(x)$ is limited for adversarial instances generated by C&W or SLIDE. Table 3 and Table 4 however show that the autoencoder defence mechanism trained with the K-L divergence outperforms the MSE equivalent by respectively over 15% and almost 65% on CIFAR-10 using the simple classifier and the ResNet-56 model. The performance difference is even more pronounced when we simplify the autoencoder architecture. An autoencoder with only one hidden dense layer with ReLU activation (Hahnloser and Seung, 2000) in the encoder and one hidden dense layer before the output layer in the decoder is still able to detect and correct adversarial attacks on the CIFAR-10 ResNet-56 model when trained with the K-L divergence loss. The correction accuracy for C&W and SLIDE remains high, compared to around 10%, or similar to random predictions, if the same autoencoder is trained with the MSE loss. The exact architecture can be found in the appendix.

The ROC curves for the adversarial scores and the corresponding AUC values in Figure 5 highlight the effectiveness of the method on the different datasets. The AUC values for the strong C&W and SLIDE grey-box attacks are very high for both MNIST and Fashion-MNIST, and remain robust for the ResNet-56 classifier on CIFAR-10 for both C&W and SLIDE. As expected, increasing $\epsilon$ for the FGSM attacks results in slightly higher AUC values.

Figure 5: ROC curves and AUC values for adversarial instance detection on MNIST, Fashion-MNIST and CIFAR-10 for C&W, SLIDE and FGSM grey-box attacks. The curves and values are computed on the combined original and attacked test sets.
Figure 6: ROC curves and AUC values for adversarial instance detection on MNIST, Fashion-MNIST and CIFAR-10 for C&W, SLIDE and FGSM white-box attacks. The curves and values are computed on the combined original and attacked test sets.

Figure 4 shows that the classifier's prediction entropy on CIFAR-10 is higher than on MNIST or Fashion-MNIST. As a result, we apply temperature scaling on $M(x)$ during training of the autoencoders for the CIFAR-10 models, tuning the temperature $T$ on the strong attacks. Table 3 and Table 4 show that temperature scaling further improves the classifier accuracy on the adversarial instances generated by strong attacks like C&W and SLIDE for the CIFAR-10 classifiers. Decreasing the temperature too much, however, leads to increasing overconfidence in potentially incorrect predictions, since the true labels are not used during training of the autoencoder.

Table 3 and Table 4 also illustrate that including the hidden layer divergence of Equation 8 for the CIFAR-10 classifiers leads to a further accuracy improvement on the C&W and SLIDE attacks compared to our basic defence mechanism. The feature maps are extracted after the max-pooling layer for the simple CIFAR-10 model and before the activation function in the last residual block of the ResNet-56. As shown in the appendix, the improvement is robust with respect to the choice of extracted hidden layer.

4.3 White-Box Attacks

In the case of white-box attacks, C&W and SLIDE are able to reduce the accuracy of the corrected predictions $M(\mathrm{AE}(x_{adv}))$ on all datasets to almost 0%. This does not mean that the attack goes unnoticed or cannot be corrected in practice.

Moreover, while the white-box attack manages to bypass the correction mechanism, it has a limited impact on the accuracy of the classifier itself. For the ResNet-56 classifier on CIFAR-10, the prediction accuracy of $M(x)$ equals 93.15% on the original test set, and the accuracy of the model predictions $M(x_{adv})$ on the adversarial instances drops only modestly after the C&W and SLIDE attacks.

Importantly, the adversarial attacks are not very transferable. Assume that the autoencoder model under attack has been trained for a certain number of epochs. Swapping it for the same model trained for fewer epochs drastically reduces the effectiveness of the attack and brings the classification accuracy of $M(\mathrm{AE}(x_{adv}))$ back up from almost 0% for both C&W and SLIDE. In practice this means that we can cheaply boost the strength of the defence by ensembling different checkpoints saved during the training of the autoencoder. When the adversarial score is above the threshold value, the prediction on $x$ is found by applying a weighted majority vote over a combination of the ensemble and the classifier:

$$\hat{y}(x) = \arg\max_{c} \Big( \sum_{k} w_{k}\, \mathbb{1}\big[\arg\max M(\mathrm{AE}_{k}(x)) = c\big] + w_{M}\, \mathbb{1}\big[\arg\max M(x) = c\big] \Big), \tag{10}$$

with

$$\sum_{k} w_{k} + w_{M} = 1, \tag{11}$$

where $\mathrm{AE}_{k}$ are the autoencoders in the ensemble and $w_{k}$, $w_{M}$ are the voting weights.

By combining autoencoders with different architectures we can further improve the diversification and performance of the ensemble. In order to guarantee the success of the white-box attack, it needs to find a perturbation that fools the weighted majority of the diverse defences and the classifier. MagNet (Meng and Chen, 2017) also uses a collection of autoencoders, but trained with the usual MSE loss.
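An illustrative sketch of such a weighted majority vote, in the spirit of Equations 10 and 11, is given below; the uniform default weights and the function name are assumptions, not the authors' exact weighting scheme.

```python
import numpy as np

def ensemble_predict(classifier, autoencoders, x, weights=None):
    # Each autoencoder checkpoint votes via the classifier's prediction on
    # its reconstruction; the classifier on the raw input also gets a vote.
    voters = [np.asarray(classifier(ae(x))) for ae in autoencoders]
    voters.append(np.asarray(classifier(x)))
    if weights is None:
        weights = [1.0] * len(voters)  # assumed uniform weights
    n, n_classes = voters[0].shape
    votes = np.zeros((n, n_classes))
    for w, probs in zip(weights, voters):
        votes[np.arange(n), np.argmax(probs, axis=-1)] += w
    return np.argmax(votes, axis=-1)
```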

The detector is still effective at flagging the adversarial instances generated by the white-box attacks. This is evidenced in Figure 6 by the consistently robust AUC values for MNIST on all attacks. The AUC values for Fashion-MNIST also remain high for C&W, SLIDE and FGSM. On CIFAR-10, the AUCs for C&W, SLIDE and FGSM at the different values of $\epsilon$ likewise remain robust.

4.4 Data Drift Detection


Figure 7:

Mean adversarial scores with standard deviations (lhs) and ResNet-56 accuracies (rhs) for increasing data corruption severity levels on CIFAR-10-C. Level 0 corresponds to the original CIFAR-10 test set. Harmful scores are scores from instances which have been flipped from the correct to an incorrect prediction because of the corruption. Not harmful means that the prediction was unchanged after the corruption.

It is important that safety-critical applications do not suffer from common data corruptions and perturbations. Detection of subtle input changes which reduce the model accuracy is therefore crucial. Rabanser et al. (2019) discuss several methods to identify distribution shift and highlight the importance of quantifying the harmfulness of the shift. The adversarial detector proves to be very flexible and can be used to measure the harmfulness of the data drift for the classifier. We evaluate the detector on the CIFAR-10-C dataset (Hendrycks and Dietterich, 2019). The instances in CIFAR-10-C have been corrupted and perturbed by various types of noise, blur, brightness changes etc. at different levels of severity, leading to a gradual decline in model performance.

Figure 7 visualises the adversarial scores at different levels of corruption severity for the ResNet-56 classifier on CIFAR-10-C compared to CIFAR-10. The average scores for the instances whose predicted class was flipped from correct to incorrect by the data corruption are several times higher than the scores for the instances whose class predictions were not affected by the perturbations. The average score of the negatively affected instances declines slightly with increasing severity because the changes to the instances become stronger across the board and the gap between the two groups is not as large. For each level of corruption severity, a two-sided Kolmogorov-Smirnov two-sample test (Smirnov, 1939) rejects the null hypothesis that the scores of the negatively affected instances are drawn from the same distribution as those of the samples unaffected by the data corruption. As a result, the drift detector provides a robust measure for the harmfulness of the distribution shift.
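As a rough sketch of the harmfulness check described above, the two score populations can be compared with SciPy's two-sample Kolmogorov-Smirnov test; the score arrays and the significance level are assumed inputs, with the scores computed as in Equation 6.

```python
from scipy.stats import ks_2samp

def drift_is_harmful(scores_harmful, scores_harmless, alpha=0.05):
    # Compare the adversarial-score distributions of corrupted instances whose
    # prediction flipped (harmful) vs. those whose prediction held (harmless).
    statistic, p_value = ks_2samp(scores_harmful, scores_harmless)
    return p_value < alpha, statistic, p_value
```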

5 Conclusions

We introduced a novel method for adversarial detection and correction based on an autoencoder trained with a custom loss function. The loss function aims to match the classifier's prediction probability distributions on the original data and on the instances reconstructed by the autoencoder. We validate our approach against a variety of grey-box and white-box attacks on the MNIST, Fashion-MNIST and CIFAR-10 datasets. The defence mechanism is very effective at detecting and correcting strong grey-box attacks like Carlini-Wagner or SLIDE and remains effective for white-box attack detection. Interestingly, the white-box attacks are not very transferable between different autoencoders or even between the defence mechanism and the standalone classification model. This is a promising area for future research and opens opportunities to build robust adversarial defence systems. The method is also successful in detecting common data corruptions and perturbations which harm the classifier's performance. We illustrate the effectiveness of the method on the CIFAR-10-C dataset. To facilitate the practical use of the adversarial detection and correction system we provide an open source library with our implementation of the method (Van Looveren et al., 2020).

Acknowledgements

The authors would like to thank Janis Klaise and Alexandru Coca for fruitful discussions on adversarial detection, and Seldon Technologies Ltd. for providing the time and computational resources to complete the project.

References

  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • N. Carlini and D. Wagner (2016) Towards evaluating the robustness of neural networks. arXiv:1608.04644.
  • P. Chen, Y. Sharma, H. Zhang, J. Yi, and C. Hsieh (2018) EAD: elastic-net attacks to deep neural networks via adversarial examples. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 10–17.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015a) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR 2015).
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015b) Explaining and harnessing adversarial examples. arXiv:1412.6572v3.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
  • R. H. R. Hahnloser and H. S. Seung (2000) Permitted and forbidden sets in symmetric threshold-linear networks. In Advances in Neural Information Processing Systems 13, pp. 217–223.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • D. Hendrycks and T. G. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations (ICLR 2019).
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv:1503.02531.
  • U. Hwang, J. Park, H. Jang, S. Yoon, and N. I. Cho (2019) PuVAE: a variational autoencoder to purify adversarial examples. IEEE Access 7, pp. 126582–126593.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. In 5th International Conference on Learning Representations (ICLR 2017), Workshop Track.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
  • H. Li, Q. Xiao, S. Tian, and J. Tian (2019) Purifying adversarial perturbation with adversarially trained auto-encoders. arXiv preprint arXiv:1905.10729.
  • X. Li and S. Ji (2018) Defense-VAE: a fast and accurate defense against adversarial attacks. arXiv:1812.06570.
  • D. Meng and H. Chen (2017) MagNet: a two-pronged defense against adversarial examples. arXiv:1705.09064.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582.
  • P. Samangouei, M. Kabkab, and R. Chellappa (2018) Defense-GAN: protecting classifiers against adversarial attacks using generative models. arXiv:1805.06605v2.
  • N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (SP 2016), pp. 582–597.
  • S. Rabanser, S. Günnemann, and Z. Lipton (2019) Failing loudly: an empirical study of methods for detecting dataset shift. In Advances in Neural Information Processing Systems 32, pp. 1394–1406.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
  • N. V. Smirnov (1939) Estimate of deviation between empirical distribution functions in two independent samples. Bulletin Moscow University 2 (2), pp. 3–16.
  • Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2018) PixelDefend: leveraging generative models to understand and defend against adversarial examples. In 6th International Conference on Learning Representations (ICLR 2018).
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv:1312.6199.
  • F. Tramèr and D. Boneh (2019) Adversarial training and robustness for multiple perturbations. arXiv:1904.13000.
  • A. Van Looveren, G. Vacanti, J. Klaise, and A. Coca (2020) Alibi Detect: algorithms for outlier and adversarial instance detection, concept drift and metrics. Open source library.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.

Appendix A Models

All the models are trained on an NVIDIA GeForce RTX 2080 GPU.

A.1 MNIST

The classification model consists of 2 convolutional layers with respectively 64 and 32 filters and ReLU activations. Each convolutional layer is followed by a max-pooling layer and dropout with fraction 30%. The output of the second pooling layer is flattened and fed into a fully connected layer of size 256 with ReLU activation and 50% dropout. This dense layer is followed by a softmax output layer over the 10 classes. The model is trained with an Adam optimizer for 20 epochs with batch size 128 and learning rate 0.001 on rescaled MNIST images and reaches a test accuracy of 99.28%.

The autoencoder for MNIST has 3 convolutional layers in the encoder with respectively 64, 128 and 512 filters with stride 2, ReLU activations and zero padding. The output of the last convolutional layer in the encoder is flattened and fed into a linear layer which outputs a 10-dimensional latent vector. The decoder takes this latent vector and feeds it into a linear layer with ReLU activation and output size 1568. This output is reshaped and passed through 3 transposed convolution layers with 64, 32 and 1 filters and zero padding. The first 2 layers have stride 2 while the last layer has a stride of 1. The autoencoder is trained with the different custom loss terms for 50 epochs using an Adam optimizer with batch size 128 and learning rate 0.001.
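A Keras sketch of this autoencoder is given below; the filter counts, strides and latent dimension follow the description above, while the kernel size of 4 and the reshape to (7, 7, 32) are assumptions made to reproduce the stated layer sizes.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mnist_autoencoder(latent_dim=10):
    encoder = tf.keras.Sequential([
        layers.InputLayer(input_shape=(28, 28, 1)),
        layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2D(128, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2D(512, 4, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(latent_dim),
    ])
    decoder = tf.keras.Sequential([
        layers.InputLayer(input_shape=(latent_dim,)),
        layers.Dense(7 * 7 * 32, activation="relu"),  # output size 1568
        layers.Reshape((7, 7, 32)),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(1, 4, strides=1, padding="same"),
    ])
    return tf.keras.Sequential([encoder, decoder])
```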

A.2 Fashion-MNIST

The classification model is very similar to the MNIST classifier. It consists of 2 blocks of convolutional layers. Each block has 2 convolutional layers with ReLU activations and zero padding, followed by a max-pooling layer and dropout with fraction 30%. The convolutions in the first and second block have respectively 64 and 32 filters. The output of the second block is flattened and fed into a fully connected layer of size 256 with ReLU activation and 50% dropout. This dense layer is followed by a softmax output layer over the 10 classes. The model is trained with an Adam optimizer for 40 epochs with batch size 128 on rescaled Fashion-MNIST images and reaches a test accuracy of 93.62%.

The autoencoder architecture and training procedure is exactly the same as the one used for the MNIST dataset.

A.3 CIFAR-10

We train 2 different classification models on CIFAR-10: a simple network with a test set accuracy of 80.24% and a ResNet-56 (implementation from https://github.com/tensorflow/models) with an accuracy of 93.15%. The simple classifier has the same architecture as the Fashion-MNIST model and is trained for 300 epochs. The ResNet-56 is trained with an SGD optimizer with momentum 0.9 for 300 epochs with batch size 128. The initial learning rate is 0.01, which is decreased by a factor of 10 after 91, 136 and 182 epochs. The CIFAR-10 images are standardised on an image-by-image basis for each model.

The autoencoder for CIFAR-10 has 3 convolutional layers in the encoder with respectively 32, 64 and 256 filters with stride 2, ReLU activations, zero padding and regularisation. The output of the last convolution layer in the encoder is flattened and fed into a linear layer which outputs a 40-dimensional latent vector. The decoder takes this latent vector, feeds it into a linear layer with ReLU activation and output size of 2048. This output is reshaped and passed through 3 transposed convolution layers with 256, 64 and 3 filters with stride 2, zero padding and regularisation. The autoencoder is trained with the different custom loss terms for 50 epochs using an Adam optimizer with batch size 128 and learning rate 0.001.

The adversarial detection mechanism is also tested with an autoencoder where the input is flattened, fed into a dense layer with ReLU activation, regularisation and output size 512 before being projected by a linear layer on the 40-dimensional latent space. The decoder consists of one hidden dense layer with ReLU activation, regularisation and output size 512 and a linear output layer which projects the data back to the feature space after reshaping. Again, the autoencoder is trained with the different custom loss terms for 50 epochs using an Adam optimizer with batch size 128 and learning rate 0.001.

Appendix B Attacks

B.1 Carlini-Wagner (C&W)

On the MNIST dataset, the initial constant $c$ is set to 100. 7 binary search steps are applied to update $c$ to a more suitable value. The maximum number of iterations for the attack for each value of $c$ is 200, and the learning rate of the Adam optimizer used during the iterations equals 0.1. Except for the number of binary search steps, which is increased to 9, the hyperparameters for Fashion-MNIST are the same as for MNIST. For CIFAR-10, the initial constant is set to 1 with 9 binary search steps to find the optimal value. The learning rate and maximum number of iterations are decreased to respectively 0.01 and 100. Figure 8, Figure 9 and Figure 10 illustrate a number of examples of the C&W attack on each dataset. The first row shows the original instance with the correct model prediction and adversarial score. The second row illustrates the adversarial example with the adversarial score and the incorrect prediction. The last row shows the reconstruction by the adversarial detector with the corrected model prediction.


Figure 8: C&W attack on MNIST. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Figure 9: C&W attack on Fashion-MNIST. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Figure 10: C&W attack on CIFAR-10 using the ResNet-56 model. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

B.2 SLIDE

The hyperparameters of the attack remain unchanged across the different datasets. The percentile $q$ is equal to 80, the $\epsilon$-bound is set at 0.1, the step size $\alpha$ equals 0.05 and the number of steps equals 10. Figure 11, Figure 12 and Figure 13 show a number of examples of the SLIDE attack on each dataset. Similar to C&W, the rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.


Figure 11: SLIDE attack on MNIST. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Figure 12: SLIDE attack on Fashion-MNIST. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Figure 13: SLIDE attack on CIFAR-10 using the ResNet-56 model. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

B.3 Fast Gradient Sign Method (FGSM)

$\epsilon$ values of 0.1, 0.2 and 0.3 are used for the FGSM attack on each dataset. The attacks run for 1000 iterations. Figure 14, Figure 15 and Figure 16 show a number of examples of the FGSM attack on each dataset. Similar to C&W, the rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.


Figure 14: FGSM attack with $\epsilon = 0.2$ on MNIST. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Figure 15: FGSM attack with $\epsilon = 0.1$ on Fashion-MNIST. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Figure 16: FGSM attack with $\epsilon = 0.1$ on CIFAR-10 using the ResNet-56 model. The rows illustrate respectively the original, adversarial and reconstructed instance with their model predictions and adversarial scores.

Appendix C Hidden Layer K-L Divergence

Table 5 and Table 6 show that the results are robust with respect to the choice of hidden layer from which the feature map is extracted before it is fed into a linear layer and the softmax function is applied.


Table 5: CIFAR-10 test set accuracy for a simple CNN classifier on the original and adversarial instances (no attack, C&W and SLIDE), with and without the defence. The defence is trained with the K-L divergence loss function; the hidden-layer variants extend the methodology to one of the hidden layers. HL1, HL2, HL3 and HL4 refer to respectively the first max-pooling layer, the third convolutional layer, the dropout layer after the second convolution block and the output of the flattening layer in the CNN model. HL1 is projected on a 50-dimensional vector, HL2 and HL3 on 40-dimensional vectors and HL4 on 10 dimensions.

Table 6: CIFAR-10 test set accuracy for a ResNet-56 classifier on the original and adversarial instances (no attack, C&W and SLIDE), with and without the defence. The defence is trained with the K-L divergence loss function; the hidden-layer variants extend the methodology to one of the hidden layers. HL1, HL2, HL3 and HL4 refer to respectively hidden layers 140, 160, 180 and 200 in the ResNet-56 model. HL1 to HL4 are all projected on 20-dimensional vectors.